"Social science done on deadline": Research chat with ASU's Steve Doig on data journalism

Steve Doig is the Knight Chair in Journalism at the Walter Cronkite School of Journalism and Mass Communication at Arizona State University, specializing in computer-assisted reporting. Prior to arriving at the Cronkite School, he spent 19 years at the Miami Herald, where his team won the 1993 Pulitzer Prize for Public Service Journalism for their investigation into how building codes and lax zoning and inspection contributed to the damage caused by Hurricane Andrew.

As he wrote in a 2008 article for Nieman Reports, Doig sees a convergence between journalism and social science, and he has worked to make the profession more research-oriented. Doig is a member of Investigative Reporters and Editors (IRE), and consults for other news organizations, including California Watch. In 2012, he was a data consultant for News21, a pioneering media project based in journalism schools as part of the Carnegie-Knight Initiative on the Future of Journalism Education. For the News21 project, Alex Remington, research assistant for Journalist’s Resource and a Harvard Kennedy School graduate student, collaborated with Doig to create a profile of American voters from census data. The overall project received national attention and was published in part in outlets such as the Washington Post.

Remington interviews Doig here, following their work together in summer 2012:

_____

Alex Remington: How would you define data journalism?

Steve Doig: Data journalism is really just another way of gathering information. It’s the equivalent of interviewing sources and looking at documents, except with data journalism you are essentially interviewing the data to let it tell you its secrets, basically. The power of data journalism is it lets you do stories that would either be difficult or otherwise impossible to do.

Remington: What are the limits of data in journalism?

Doig: The limits aren’t a matter of size. People these days are doing stories with databases that total millions, even hundreds of millions, of records. [Given] computer power these days, it really is no problem with the size of the data. The limitations of data journalism basically are with the quality of the data itself. The problems that we run into [with] data journalism stories typically involve making sure that the data is in fact accurate. Does it need to be cleaned up, standardized, those kinds of things. The actual analysis often can be very quick, a matter of making pivot tables, or doing a statistical analysis with something like SPSS.

Remington: Once the data is relatively clean, are there limits still to what can be done with a typical dataset?

Doig: The data will tell you all the so-called W’s and H’s except why. It’ll tell you what’s in it, when, where, all those kind of things, but it never really tells you why something has occurred. You have to look at patterns in the data, and then go back to doing traditional reporting, finding people who were affected [and] experts on the subject to explain their interpretations of the patterns in the data. So I guess that would be the closest thing to a limit.

One other area that’s essentially a limit is that data can [only] answer questions that are within the scope of the data. An example would be: In campaign finance you can readily do things like finding out which interest groups are getting close to which candidates, and how much money, the calendar patterns of giving, and those kinds of things. But you can’t get it to answer, at least readily, whether there’s a gender gap in giving for one candidate or another, because there is no gender variable in the data. You can estimate it by doing guesses based on gender pattern of first names, perhaps, but you can’t readily answer that kind of question.

Remington: Are there any common misuses of data?

Doig: On occasion there are problems when reporters do things like finding a correlation and thinking that it means causation, for instance. They will see a seeming relationship between two variables and then write the story as though this variable has caused the other variable. One thing statisticians learn is correlation does not imply causation. So I would say that’s a matter of enough statistical understanding to know that you shouldn’t automatically make that assumption that the correlation that you see is actually an effect relationship.

Other errors of that type: Simple math errors show up. Reporters are famously math-clumsy, I guess I would say, and to be a good data journalist you have to be at least intuitively comfortable with doing, I guess I would call it, basic math. You have to be able to calculate percentage change, for instance, and get an answer and look at it and say, “Does this answer make sense?” Too often, reporters just put numbers in a computer, numbers come out, and then they blindly accept the numbers without looking to see if the answer they’re getting makes sense.
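
Editor's note: The percentage-change check Doig mentions can be sketched in a few lines of Python. This is only an illustration; the function name and the burglary figures are made up for the example.

```python
def percent_change(old, new):
    """Return the percentage change going from old to new."""
    if old == 0:
        # Dividing by a starting value of zero has no meaningful answer.
        raise ValueError("percentage change is undefined when the starting value is 0")
    return (new - old) * 100.0 / old

# Hypothetical figures: burglaries rise from 200 to 250, then fall back to 200.
up = percent_change(200, 250)
down = percent_change(250, 200)
print(f"up {up:.1f}%, down {down:.1f}%")  # up 25.0%, down -20.0%
```

The two results differ because each change is measured against its own starting value, which is the kind of thing a "does this answer make sense?" check should catch.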

Remington: Do you feel that there’s a problem with a lack of data literacy among readers? A lack of data literacy that might affect how broadly a story could be read or understood?

Doig: Just as there’s a range of number literacy with reporters, there certainly is with readers. A considerable portion of the population isn’t very sophisticated about numbers, and so it’s too easy to lead that portion of the population astray by writing a story where you’ve misinterpreted the numbers. On the other extreme, you do have a portion of your readership that in fact is pretty good with numbers, and when a reporter has made a simple mistake with the data — perhaps a correlation-causation thing or something like that — that portion of the readership looks at it and says: “This reporter is an idiot. I can’t trust what he or she is saying.” Even a simple math error is enough to hurt credibility in the entire story.

Remington: What can a writer or a data journalist do to help a reader who may not be data literate to understand the point that’s being made?

Doig: One good thing is to try and put numbers in context with things that most readers would understand. So rather than saying, “The oil spill was a million gallons” — How big is that? We don’t know — this could be perhaps translated into number of backyard swimming pools filled by that amount of oil. Basically, turning numbers into some kind of context that’s familiar.

Another example is lottery statistics — how hard it is to win the lottery. Just saying “1 chance in 13 million” is meaningless. But if you do something like picture stretching all 13 million lottery tickets from Miami to North Carolina, whatever the length would be, and then having to drive along it and at some point or other make the decision to slam on the brakes, get out of the car, go over and pick up one ticket. That’s how hard it is to win. Also, putting whatever pattern you’re finding into context with either other places or the same place in the past. In other words, looking at change over time and comparison to other places is a good way of, again, putting a number — which, standing by itself, you don’t know if that number is big or small, bad or good — but, by being able to compare it to other things, it helps.

Remington: What tools are necessary for a journalist to make effective use of data?

Doig: The first is a reasonable comfort with basic math. Add, subtract, multiply, divide, percentages, calculating rates, things like accident rates, crime rates, things like that. Being able to calculate inflation, consumer price index changes, those kinds of things are basic. A reporter who isn’t comfortable doing that would be dangerous turned loose on database journalism.
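
Editor's note: As a rough sketch of the rate and inflation arithmetic described above, the following Python functions compute a rate per 100,000 residents and restate an old dollar amount in current dollars. The function names, the crime count, the population and the index values are all hypothetical; real consumer price index figures would have to come from an official source.

```python
def rate_per(count, population, per=100_000):
    """Events per `per` residents, so places of different size can be compared."""
    return count * per / population

def adjust_for_inflation(amount, cpi_then, cpi_now):
    """Restate an old dollar amount in current dollars using two index values."""
    return amount * cpi_now / cpi_then

# Hypothetical city: 450 reported crimes among 150,000 residents.
print(rate_per(450, 150_000))                      # 300.0 per 100,000

# Hypothetical index values: 100.0 then, 150.0 now.
print(adjust_for_inflation(20_000, 100.0, 150.0))  # 30000.0
```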

Beyond that, the most basic tool is a spreadsheet, typically Microsoft Excel. I would say Excel will let you do 80% or more of what you’re ever likely to need with database journalism. It can handle up to just over a million records and thousands of variables, which is something you actually rarely run into; usually it’s no more than dozens of variables. You need to be able to efficiently get Excel to do things like sorting and filtering, to write functions that will create new variables based on the variables that are there, and then finally to use pivot tables to get counts and averages and things like that out of categorical data. That, I would say, is a basic ground-level tool that all reporters should get comfortable with.
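
Editor's note: For readers who have not used a pivot table, the idea is to group records by a category and summarize each group. The Python sketch below does the same thing without a spreadsheet; the `pivot` helper and the contribution records are invented for illustration.

```python
from collections import defaultdict

def pivot(rows, key, value):
    """Group rows by `key` and report the count and average of `value` per group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return {
        name: {"count": len(vals), "average": sum(vals) / len(vals)}
        for name, vals in groups.items()
    }

# Hypothetical campaign-contribution records.
rows = [
    {"industry": "real estate", "amount": 500},
    {"industry": "real estate", "amount": 1500},
    {"industry": "health", "amount": 250},
    {"industry": "health", "amount": 750},
    {"industry": "health", "amount": 500},
]

result = pivot(rows, "industry", "amount")
for name, summary in sorted(result.items()):
    print(name, summary)
```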

Going beyond that, sometimes you run into datasets that involve more than a million records, for instance, or two tables that you need to bring together by a common variable. In that case you would go to a database manager, something like Microsoft Access or MySQL.
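
Editor's note: Bringing two tables together by a common variable is what database managers call a join. The sketch below uses SQLite, which ships with Python, in place of the Access or MySQL that Doig names; the table names, column names and records are hypothetical.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE candidates (candidate_id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE contributions (candidate_id INTEGER, amount REAL)")

conn.executemany("INSERT INTO candidates VALUES (?, ?)",
                 [(1, "Candidate A"), (2, "Candidate B")])
conn.executemany("INSERT INTO contributions VALUES (?, ?)",
                 [(1, 100.0), (1, 300.0), (2, 50.0)])

# Join on the shared candidate_id column, then summarize per candidate.
joined = conn.execute("""
    SELECT c.name, COUNT(*) AS n_gifts, SUM(k.amount) AS total
    FROM contributions AS k
    JOIN candidates AS c ON c.candidate_id = k.candidate_id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()

print(joined)  # [('Candidate A', 2, 400.0), ('Candidate B', 1, 50.0)]
```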

A good high-powered text editor is very useful: something that lets you open up a file, look at it and see what kind of delimiters are being used, and whether there are odd characteristics of the file that might be causing problems with importing the data. In Mac world, TextWrangler is a good example. Mapping tools are wonderfully useful because almost every dataset we have and almost every story we work on has “where” as an important part of it; there’s a variable that’s listing cities, addresses, latitudes and longitudes, or anything like that. Looking at the geographical pattern of the data often can be very revealing. Not only for the story but as a way to help reporters find where the story is, go out to that neighborhood or that particular place and see why that pattern has existed.
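
Editor's note: The delimiter check that Doig does by eye in a text editor can be roughly approximated in code. The Python sketch below picks the candidate character that appears the same number of times on every line of a sample; `guess_delimiter` and the sample text are invented for illustration, and the heuristic ignores quoted fields, so it is no substitute for looking at the file.

```python
def guess_delimiter(sample, candidates=(",", "\t", "|", ";")):
    """Return the candidate found a consistent, nonzero number of times per line."""
    lines = [line for line in sample.splitlines() if line.strip()]
    best, best_count = None, 0
    for delim in candidates:
        counts = {line.count(delim) for line in lines}
        if len(counts) == 1:          # same count on every line
            n = counts.pop()
            if n > best_count:
                best, best_count = delim, n
    return best                        # None if nothing is consistent

# Hypothetical tab-delimited sample whose data rows also contain commas.
sample = "name\tcity\tamount\nAnn Lee\tMiami, FL\t100\nBob Ray\tTampa, FL\t250\n"
print(repr(guess_delimiter(sample)))   # '\t'
```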

There are other kinds of tools that do specialized kind of database journalism tasks. There is a growing number of reporters who’ve become comfortable with writing programs in languages like Python or Perl, which allow you to do, for instance, what’s called “data scraping,” where you go to a website that has lots and lots of information that can’t be readily downloaded. But you can efficiently write a program that will go there and pull off all that data and turn it into something that you can work with in Excel, for instance.
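
Editor's note: To give a sense of what a scraping program does, the Python sketch below pulls the rows out of an HTML table using only the standard library. Fetching the page from a live website is left out; the `TableScraper` class and the HTML snippet are made up for the example, and real pages are usually messier.

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collect the text of every table cell, one list per table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row being read, or None
        self._cell = None   # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

# Hypothetical page content standing in for a downloaded web page.
html = ("<table><tr><th>Name</th><th>Amount</th></tr>"
        "<tr><td>Ann Lee</td><td>100</td></tr></table>")

scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['Name', 'Amount'], ['Ann Lee', '100']]
```

From there the rows could be written out with Python's `csv` module and opened in a spreadsheet.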

An important sidelight of data journalism is the presentation of the data afterwards. [The data is] not only useful for you as a reporter in finding the story, but ultimately you want to somehow or other make it or its patterns available to your readers and viewers. So tools that let you create interactive data visualizations are very popular now, things as simple as Google Fusion Tables. Young people who are trained in computer science are being drawn into journalism because they see the opportunities of applying what they’ve learned to do some of these often very entertaining and informative data visualizations.

Remington: What’s the distinction between a journalist with some social science and computer science training and a social scientist or a computer scientist?

Doig: There’s not a real difference, other than the audience that each is writing for. Somebody who sees [him or herself] as an academic social scientist very often is doing work, they’ll do research but it will be research that is printed in academic journals within the specialty of that social scientist. I think it was Donald Graham or somebody who once talked about journalism as being history written on deadline. [Editor’s note: It was Donald Graham’s father, Philip Graham, who described journalism as “the first, rough draft of history,” though he may not have been the first.]

I sort of see data journalism — particularly applying these principles of social science, the so-called precision-journalism approach — as social science done on deadline. We’re using the tools that social scientists have used for years, whether it’s for survey research or statistical analysis or things like that, we’re applying those tools to journalism problems and using it to help us tell stories with more authority.

Remington: What’s your favorite data-driven story that you’ve worked on?

Doig: The one that I think had lots of impact, certainly, was the data analysis I was heavily involved in with the Miami Herald after Hurricane Andrew hit in 1992. We on the investigations team wanted to see if there was a way, basically, to answer the question of whether the scale of the disaster was an act of God or our own stupidity. So we used a variety of data-journalism tools looking at patterns of the damage, things like campaign contributions, and a whole variety of other aspects. And we were able to conclude, basically, that the disaster really was one of our own making. We had weakened building codes over the 25 years or so [since] a hurricane had hit Miami, and the pattern in the data was, the newer the house, the more likely to be destroyed. That was the purest smoking gun I ever got out of a database analysis.

But I’ve had the opportunity to work on a whole variety of interesting ones. Most recently, a project of California Watch [looking] at a pattern of basically suspicious billing by a particular hospital chain in California. By analyzing tens of millions of publicly available hospital records, I was able to help California Watch know that in fact this chain was doing what is called “upcoding,” turning normal conditions into more exotic ones that pay higher from Medicaid.

Remington: What are some of the best current examples of data journalism in practice that you’ve seen?

Doig: Back when I started in the very early 1980s, there were a handful of reporters like myself who were beginning to learn and apply these kinds of things. Today, newsrooms of all sizes have at least one person who’s doing higher-level stuff, and often a range of other reporters are at least comfortable with something like Excel. Rather than single out particular stories, I would point people to the Investigative Reporters and Editors site. IRE is sort of the leading organization that spends much of its time training reporters how to use techniques like this [and] you can readily find examples of good database journalism there.
