10 simple data errors that can ruin an investigation

Mistakes with numbers can cascade through an investigative story and damage audience trust, because many other erroneous figures, trend claims, and conclusions can flow from the initial error.

At the recent NICAR23 conference in Nashville, Tennessee — the annual data journalism summit organized by Investigative Reporters & Editors (IRE) — GIJN asked several speakers to suggest common data blunders or oversights that have threatened or ruined past investigations.

“Every journalist will make a mistake — it’s all about being smart about making sure you never make that mistake again, and about being transparent with your audience,” says Aarushi Sahejpal, data editor at the Investigative Reporting Workshop at American University. “But you can certainly minimize the chance of mistakes.”

In a summary echoed by other experts, Sahejpal says avoiding mistakes generally involves asking yourself three questions: Do you actually have the full dataset? Have you spoken to the person behind the data to know what it really means? And what does the data not tell you?

Still, mistakes do happen, and here are 10 common causes, according to data journalism experts.

1. Forgetting the threat of blank rows in spreadsheets.

According to data journalism trainer Samantha Sunne, currently a local reporting fellow at ProPublica, one common and devastating mistake is to wrongly assume that you’ve selected or highlighted an entire data column in your Google Sheet. The problem, she says, is that the selection stops at the first blank row, and failing to spot the excluded data below it has caused some reporters to reach the wrong conclusions in their investigations.

“Oftentimes, you’ll get blank rows in your data — maybe that’s where the page break was, or there was no data for that item — and you might easily not notice them if you don’t scroll down,” explains Sunne. “If you aren’t careful to truly select all, it can completely destroy your analysis.”

Her solution? After you’ve clicked on any data column, hit Control A (or Command A) once — and then hit Control A (or Command A) again, to capture the data below any blank row as well.
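The same risk can be checked outside the spreadsheet. The following is a minimal Python sketch, not Sunne's method, that compares what a selection stopping at the first blank row would capture with what the file actually contains. The town names, figures, and function names are hypothetical and exist only for illustration.

```python
import csv
import io

# Hypothetical export with a blank row in the middle (illustration only).
RAW = """town,deaths
Ashton,12
Brook,7

Carver,30
Dale,4
"""


def rows_until_blank(text):
    """Mimic a selection that stops at the first blank row."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if not any(cell.strip() for cell in row):
            break  # everything below this point is silently excluded
        rows.append(row)
    return rows


def all_data_rows(text):
    """Keep every non-blank row, including rows below a blank one."""
    return [
        row
        for row in csv.reader(io.StringIO(text))
        if any(cell.strip() for cell in row)
    ]


partial = rows_until_blank(RAW)[1:]  # [1:] drops the header row
full = all_data_rows(RAW)[1:]

partial_total = sum(int(row[1]) for row in partial)
full_total = sum(int(row[1]) for row in full)

print(len(partial), partial_total)
print(len(full), full_total)
```

If the two row counts differ, part of the data sits below a blank row and any total built from the shorter selection is incomplete.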

2. Failing to check whether government nomenclature or coding has changed.

Janet Roberts, data journalism editor at Reuters, says government and municipal agencies often change their codes for functions, sometimes while you are still collecting their data. Before publication, she says, it is crucial to check that every record in your dataset refers to the same thing.

“In St. Paul (Minnesota), we were doing an investigation into slumlords, and we got the building code violation data, and we were going to find the landlord with the most of a certain kind of offense,” Roberts recalls. “We did all our crunching, but it turned out that, at some point, the buildings department had changed the codes, so maybe an ‘02’ used to mean rat infestation, but it now means you didn’t sweep your sidewalk, or whatever. Luckily, we found this out, albeit very deep into the process, because, had we not found it out, the entire story would have been wrong.”

She adds: “The potential error here is failing to understand the data — failing to talk to the people who keep the data. Ask how the data evolves over time.”
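One way to guard against a changed code is to attach an effective date to every meaning, so that a record is always decoded according to the rules in force when it was created. The sketch below assumes a hypothetical codebook: the two meanings echo Roberts’ example, but the dates and the function name are invented for illustration.

```python
from datetime import date

# Hypothetical codebook: each code maps to (effective_from, meaning) pairs,
# listed oldest first. The dates are invented for illustration.
CODEBOOK = {
    "02": [
        (date(2000, 1, 1), "rat infestation"),
        (date(2010, 1, 1), "unswept sidewalk"),
    ],
}


def meaning_on(code, when):
    """Return what a code meant on a given date."""
    current = None
    for effective_from, meaning in CODEBOOK[code]:
        if effective_from <= when:
            current = meaning  # later entries override earlier ones
    if current is None:
        raise KeyError(f"code {code!r} had no meaning on {when}")
    return current


print(meaning_on("02", date(2005, 6, 1)))
print(meaning_on("02", date(2015, 6, 1)))
```

Only the people who keep the data can supply the real change dates, which is why the conversation Roberts recommends has to come first.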

3. Confusing percentages with percentage points.

This simple mistake is a perennial problem, and it can end up misleading audiences. “If something jumps from 20 to 30%, that’s actually a 50% increase, not a 10% increase. That can be tricky, and important to pay attention to,” explains Sunne. Data experts stress that percent change is measured relative to the starting value, while percentage point change is the simple difference between two percentages: the same jump is a rise of 10 percentage points. To avoid confusion, it’s better to describe a 100% increase by saying something “doubled.”

“A lot of people don’t understand the difference between percentage points and percentages,” Sahejpal says. “Same with ‘per capita’: using rates and per capita in the same sentence often doesn’t make sense, because per capita is per person.”
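The arithmetic behind Sunne’s example can be written out in a few lines of Python. This is a small sketch with hypothetical function names, using the 20% and 30% figures from the quote.

```python
def percentage_point_change(old_pct, new_pct):
    """Simple difference between two percentages, in percentage points."""
    return new_pct - old_pct


def percent_change(old_pct, new_pct):
    """Change relative to the starting value, expressed in percent."""
    if old_pct == 0:
        raise ValueError("percent change is undefined for a starting value of 0")
    return (new_pct - old_pct) / old_pct * 100


old, new = 20.0, 30.0
points = percentage_point_change(old, new)
relative = percent_change(old, new)

print(points)    # 10.0 percentage points
print(relative)  # 50.0 percent
```

Both numbers describe the same jump, so a story should say which one it is reporting.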

4. Accepting round numbers without double-checking.

Large round numbers, or round numbers of data rows, like 7,000 or 2,000, can often signal a limit on a records search or a data transfer, rather than a true total, according to Roberts.

“We had data that suggested that only 5,000 companies had filed their required reports on something, and we thought: ‘Exactly 5,000?’” Roberts recalls. “That seemed unusual, and also a low number. What the reporter hadn’t noticed was that the website limited search results to 5,000 records, and the true results turned out to be about three times that.”

“If you have a dataset of perfectly 1,000 or 10,000 rows, I would bet money something is off,” Sahejpal says. “And I can’t tell you the number of students I’ve had who downloaded a file, and didn’t realize they’d downloaded a filtered version. Another mistake is if you don’t check that the range of your dataset is equal to the reported range on a government website.”
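Both warnings can be turned into quick sanity checks before any analysis begins. The sketch below is one possible heuristic, not a rule from the speakers: the function names, the 1,000-row step, and the example years are assumptions chosen for illustration.

```python
def looks_capped(row_count, step=1000):
    """Flag row counts that are exact multiples of `step`, such as 5,000.

    A flagged count is only a prompt to check for a search or export limit.
    """
    return row_count >= step and row_count % step == 0


def missing_years(years_in_data, first_year, last_year):
    """List the years in the reported range that never appear in the data."""
    expected = set(range(first_year, last_year + 1))
    return sorted(expected - set(years_in_data))


print(looks_capped(5000))
print(looks_capped(15312))
print(missing_years([2018, 2019, 2021], 2018, 2022))
```

A flagged count or a non-empty list of missing years does not prove an error. It is a reason to go back to the source and ask whether the download was capped or filtered.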

5. Forgetting that number formats are different in different countries.

“$1,753.00 in the US is written as ‘$1.753,00’ in Latin America — the commas and periods and apostrophes are in different places — but spreadsheets don’t account for the different punctuation,” says Emilia Diaz-Struck, Latin American coordinator at the International Consortium of Investigative Journalists (ICIJ). “It’s also possible to make really basic conceptual mistakes if you don’t think about the origin of the numbers.”
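A short Python sketch shows how large the error can be when the separators are read with the wrong convention. The function name and its parameters are hypothetical, and the approach assumes the reporter already knows which convention the source uses.

```python
from decimal import Decimal


def parse_amount(text, decimal_sep, thousands_sep):
    """Parse a currency string using explicitly stated separators."""
    cleaned = text.strip().lstrip("$")
    cleaned = cleaned.replace(thousands_sep, "")  # drop digit grouping
    cleaned = cleaned.replace(decimal_sep, ".")   # normalize the decimal mark
    return Decimal(cleaned)


# The same figure written in the two conventions from the quote.
us_style = parse_amount("$1,753.00", decimal_sep=".", thousands_sep=",")
latam_style = parse_amount("$1.753,00", decimal_sep=",", thousands_sep=".")

# Reading the Latin American string with US separators shrinks it a thousandfold.
misread = parse_amount("$1.753,00", decimal_sep=".", thousands_sep=",")

print(us_style, latam_style, misread)
```

Stating the separators explicitly, rather than letting software guess, keeps the country of origin visible in the analysis.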

6. Ignoring your gut when the data “just seems off.”

Even after the numbers have been checked in the spreadsheet, and double-checked with a human data source, experienced journalists sometimes find those figures jarring, or at odds with their knowledge of the topic. Dianna Hunt, senior editor at ICT (formerly Indian Country Today), says reporters should respect this feeling, and seek out alternate or historical data, or academic researchers, to check those numbers independently, or at least check if they’re in the “ballpark” for that topic. For instance, that feeling could indicate major errors by the original government data gatherers, or even just a decimal point typo at the input stage.

“You need to pay attention to your gut instinct when something seems wrong — that has certainly paid off in several investigations I’ve worked on,” says Hunt.

7. Failing to speak to the human behind the dataset.

“Before you use the data, you need to reach out to the source, and understand what every column means,” says Sahejpal. “Look, maybe you’re downloading from a website that has a perfect methodology set out — but I’d bet that a lot of the data you’re looking at is not clear in terms of what it actually means, and doesn’t mean. People in data journalism often don’t explain this, but, in fact, all of us talk to people way more than you think — we don’t just stare at computer screens.”

He adds: “Finding a way to reach the people inputting the data is a lot easier than figuring out what to do with their dataset.”

8. Assuming the dataset tells the whole story.

Having obtained a relevant dataset, Sahejpal suggests that reporters immediately compile — and prominently post — the set of relevant questions the dataset does not answer.

“The number one thing I do to avoid mistakes as an editor is to list what the data doesn’t tell you,” he says. “What we call the ‘limitations section’ on your dataset is your strongest ally, because if you know what it doesn’t tell you, you know to not say what you should not say, and what further questions to ask.”

Sahejpal adds: “If you have a dataset on, say, parking ticket violations in Washington, DC, you make a list of the regions and variables you do not have that could impact your analysis, and, right off the bat, you have a full picture of what you need. Then you get on the phone with the person in charge of the data and confirm what you do have.”

9. Using the wrong scale on graphs or charts.

Graphs published by media outlets — or even supplied to journalists — sometimes begin with an arbitrary number on the axes, like “1,500,” instead of zero, which can confuse audiences, or simply be wrong. “Be critical of the visualizations that you do put out,” says Sahejpal. “Make sure to check both the X and the Y axis, the variables compared, and the scale, to ensure accuracy. In any data visualization, it’s important to see if the scale starts wrong, or if the change increments don’t make sense. I see that kind of error all the time.”
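The effect of an axis that starts at an arbitrary number can be quantified. In the sketch below, the function name and the two values being compared are hypothetical; the 1,500 starting point is the example figure mentioned above.

```python
def bar_height_ratio(smaller, larger, axis_start=0.0):
    """Ratio of the drawn bar heights when the axis begins at `axis_start`."""
    if smaller <= axis_start or larger <= axis_start:
        raise ValueError("both values must sit above the start of the axis")
    return (larger - axis_start) / (smaller - axis_start)


# Hypothetical values: 1,600 one year and 1,700 the next.
true_ratio = bar_height_ratio(1600, 1700)
drawn_ratio = bar_height_ratio(1600, 1700, axis_start=1500)

print(true_ratio)   # 1.0625, an increase of about 6%
print(drawn_ratio)  # 2.0, the second bar is drawn twice as tall
```

With these numbers, a rise of roughly 6% appears on the chart as a doubling.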

10. Forgetting to tie columns together when sorting in Google Sheets.

Sorted data often provides easy angles, by arranging rows to show, for instance, worst-to-best: perhaps the highest death rates for some cause per town, at the top of a column, and the better-performing towns below.

Sorting in Google Sheets is surprisingly straightforward, and is even helped by pop-up suggestions from the program, but it requires a step-by-step sequence on the Sheet.

Tisha Thompson, a data reporter at ESPN, says reporters can play around with many of the functions, but warns that the one step they simply cannot forget is to click “the top left square” when sorting in Google Sheets: the blank box that selects both the column and row axes. This box ties a sorted column to the whole dataset. Forgetting this square, she says, can not only mangle your numbers, but can do so without you noticing the error prior to publication.

“Not paying attention to the top left corner is the easiest mistake you will make, and it can end careers,” warns Thompson. “You want to always keep your data tied to other lines and rows, so you need to highlight the whole kit-and-caboodle. Don’t ever sort only a single column; always use the top left hand corner — it should be like tying your shoes.”
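The failure Thompson describes is easy to reproduce in code. This Python sketch uses hypothetical towns and rates to contrast sorting one column on its own with sorting whole rows.

```python
# Hypothetical data: each town is paired with its own rate.
towns = ["Ashton", "Brook", "Carver"]
rates = [4.1, 9.8, 2.3]

# Wrong: sorting the rate column alone detaches it from the town names.
broken = list(zip(towns, sorted(rates, reverse=True)))

# Right: build whole rows first, then sort the rows by rate.
rows = list(zip(towns, rates))
correct = sorted(rows, key=lambda row: row[1], reverse=True)

print(broken)
print(correct)
```

In the broken version, Ashton is shown with Brook’s rate, and nothing in the output signals that the rows no longer match.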

This article first appeared on GIJN and is republished here, with permission, under a Creative Commons license.