# Statistical terms used in research studies; a primer for journalists

#### Tags: training

When assessing academic studies, reporters are often confronted by pages not only full of numbers, but also loaded with concepts such as “selection bias,” “p-value” and “statistical inference.”

Statistics courses are available at most universities, of course, but are often viewed by journalism students as something to be taken, passed and quickly forgotten. However, for working reporters it is imperative to do more than just read study abstracts; understanding the methods and concepts that underpin academic studies is essential to being able to judge the merits of a particular piece of research.

Most studies attempt to establish a correlation between two variables — for example, how community education levels might be “associated with” (a phrase often used by academics) crime rates. But detecting such a relationship is only a first step; the ultimate goal is to determine causation: that one of the two variables drives the other. There is a time-honored phrase to keep in mind: “Correlation is not causation.” (This can be usefully amended to “correlation is not necessarily causation,” as the nature of the relationship needs to be determined.)

Another key distinction to keep in mind is that studies can either explore observed data (descriptive statistics) or use observed data to predict what is true of areas beyond the data (inferential statistics). The statement “From 2000 to 2005, 70% of the land cleared in the Amazon and recorded in Brazilian government data was transformed into pasture” is a descriptive statistic; “Receiving your college degree increases your lifetime earnings by 50%” is an inferential statistic.
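
To make the distinction concrete, here is a minimal Python sketch built on a hypothetical poll. The respondent counts are invented for illustration, and the function names are our own. The first figure simply describes the sample; the second uses the sample to estimate a range for the wider population, using the textbook normal-approximation formula for a 95% margin of error, which assumes a simple random sample.

```python
import math

def sample_proportion(successes: int, sample_size: int) -> float:
    """Descriptive: the share observed in the sample itself."""
    return successes / sample_size

def margin_of_error_95(proportion: float, sample_size: int) -> float:
    """Inferential: approximate 95% margin of error for a proportion,
    assuming a simple random sample (normal approximation)."""
    z = 1.96  # z-score that corresponds to 95% confidence
    return z * math.sqrt(proportion * (1 - proportion) / sample_size)

# Hypothetical poll (assumed numbers): 470 of 1,000 respondents back a measure.
p = sample_proportion(470, 1000)
moe = margin_of_error_95(p, 1000)

print(f"Descriptive: {p:.0%} of those polled back the measure")
print(f"Inferential: population support is likely between "
      f"{p - moe:.1%} and {p + moe:.1%}")
```

Under these assumptions the margin of error comes out at roughly 3 percentage points. Quadrupling the hypothetical sample to 4,000 respondents would cut it roughly in half, because the margin shrinks with the square root of the sample size.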

Here are some other basic statistical concepts that journalism students and working journalists should be familiar with:

• A sample is a portion of an entire population. Inferential statistics seek to make predictions about a population based on the results observed in a sample of that population.
• There are two primary types of population samples: random and stratified. For a random sample, study subjects are chosen completely by chance. For a stratified sample, the population is first divided into subgroups (by gender, age or ethnicity, for example) and subjects are then drawn from each subgroup so that the sample reflects the makeup of the population at large.
• Attempting to extend the results of a sample to a population is called generalization. This can be done only when the sample is truly representative of the entire population.
• Generalizing results from a sample to the population must take into account sampling variation. Even if the sample is selected completely at random, different samples drawn from the same population will give somewhat different results, so findings based on a sample should include a margin of error. For example, a poll of likely voters might report the margin of error in percentage points: “47% of those polled said they would vote for the measure, with a margin of error of 3 percentage points.” Thus, if the actual percentage voting for the measure were as low as 44% or as high as 50%, the result would be consistent with the poll.
• The greater the sample size, the more representative it tends to be of a population as a whole, provided the sample is drawn at random. As the sample grows, the margin of error shrinks for a given confidence level.
• Most studies explore the relationship between two variables — for example, that prenatal exposure to pesticides is associated with lower birthweight. This is called the alternative hypothesis. Well-designed studies seek to disprove the null hypothesis — in this case, that prenatal pesticide exposure is not associated with lower birthweight.
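• Significance tests assess how compatible the study’s results are with the null hypothesis. The p-value is the probability of seeing results at least as extreme as those observed if the null hypothesis were true; the smaller the p-value, the stronger the evidence against the null hypothesis. A p-value of 0.05 means that results this extreme would turn up about 5% of the time by chance alone if there were no real effect. It does not mean there is a 95% probability that the null hypothesis is false.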
• Another threat to a sample’s validity is bias. Bias comes in many forms, but the most common is selection bias, which arises from how subjects are chosen. For example, if subjects self-select into a sample group, the results are no longer externally valid, as the type of person who wants to be in a study is not necessarily similar to the population about which we are seeking to draw inferences.
• When two variables move together, they are said to be correlated. Positive correlation means that as one variable rises or falls, the other does as well — caloric intake and weight, for example. Negative correlation indicates that two variables move in opposite directions — say, vehicle speed and travel time. So if a scholar writes “Income is negatively correlated with poverty rates,” what he or she means is that as income rises, poverty rates fall.
• Elasticity, a term frequently used in economics studies, measures how strongly one variable responds to a change in another, with both changes expressed as percentages. For example, if the price of vegetables rises 10% and consumers respond by cutting back purchases by 10%, the price elasticity of demand is 1.0 (by convention, the negative sign is usually dropped): the increase in price equals the drop in consumption. But if purchases fall by 15%, the elasticity is 1.5, and consumers are said to be “price sensitive” for that item. If consumption were to fall only 5%, the elasticity would be 0.5 and consumers would be “price insensitive”: a rise in price of a certain amount doesn’t reduce purchases to the same degree.
• Regression analysis is a way to determine whether there is a relationship between two variables and how strong that relationship may be. At its most basic, this involves plotting data points on an X/Y axis (in our example, community education levels and crime rates), looking at how the graph’s dots are distributed and fitting a trend line that shows how one variable changes, on average, as the other changes. Again, correlation isn’t necessarily causation: a trend line by itself describes an association, not a causal effect.
• Standard deviation provides insight into how much variation there is within a group of values. It measures how far, on average, the values typically fall from the group’s mean (average).
• Be careful to distinguish the following terms as you interpret results: average, mean and median. The first two terms are generally used synonymously, and refer to the average value of a group of numbers. Add up all the figures, divide by the number of values, and that’s the average or mean. A median, on the other hand, is the central value, and can be useful if there’s an extremely high or low value in a collection of values, such as Bill Gates’s salary in a list of people’s incomes. (For more information, read “Math for Journalists.”)
• In descriptive statistics, quantiles can be used to divide data into equal-sized subsets. For example, dividing a list of individuals sorted by height into two parts, the tallest and the shortest, results in two quantiles, with the median height value as the dividing line. Quartiles separate a data set into four equal-sized groups, deciles into 10 groups, and so forth. Individual items can be described as being “in the upper decile,” meaning that they fall in the group with the largest values and are higher than 90% of those in the data set.
• Causation is when a change in one variable produces a change in another. For example, air temperature and sunlight are correlated (when the sun is up, temperatures rise), but causation flows in only one direction: sunlight warms the air, while warmer air does not make the sun rise. This is also known as cause and effect.
• When causation has been established, the factor that drives change (in the above example, sunlight) is the independent variable. The variable that is driven is the dependent variable.
• While causation is sometimes easy to prove, it is often difficult to establish because of confounding variables (other factors, sometimes unmeasured, that affect both of the variables being studied). Studies require well-designed and well-executed experiments to ensure that the results are reliable.
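
Several of the measures in the list above can be worked out in a few lines of Python. The sketch below is illustrative only: the income figures, speeds and travel times are invented, and the helper names `pearson_r` and `price_elasticity` are our own. It shows how one extreme value pulls the mean away from the median, how a negative correlation shows up as a negative coefficient, and how the elasticity figures in the vegetable example are calculated.

```python
import statistics

# Hypothetical annual incomes; the last value is an extreme outlier.
incomes = [32_000, 41_000, 45_000, 52_000, 60_000, 5_000_000]

mean_income = statistics.mean(incomes)      # pulled upward by the outlier
median_income = statistics.median(incomes)  # central value, barely affected
spread = statistics.pstdev(incomes)         # standard deviation of the group

def pearson_r(xs, ys):
    """Pearson correlation coefficient: +1 is a perfect positive linear
    relationship, -1 a perfect negative one, 0 no linear relationship."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical 60-mile trip: the higher the speed, the shorter the travel time.
speeds_mph = [30, 40, 50, 60]
travel_minutes = [120, 90, 72, 60]
r = pearson_r(speeds_mph, travel_minutes)  # negative: the two move in opposite directions

def price_elasticity(pct_change_quantity, pct_change_price):
    """Size of the price elasticity of demand, with the sign dropped."""
    return abs(pct_change_quantity / pct_change_price)

print(f"mean income:   {mean_income:,.0f}")
print(f"median income: {median_income:,.0f}")
print(f"std deviation: {spread:,.0f}")
print(f"speed vs. travel time, r = {r:.2f}")
print("elasticity, 15% drop after 10% price rise:", price_elasticity(-15, 10))
```

With these made-up incomes, the mean lands far above what any of the first five people earn, while the median stays in the middle of the pack, which is why the median is often the more informative figure when a data set contains outliers.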

There are a number of free online statistics tutorials available, including one from Stat Trek and another from Experiment Resources. Stat Trek also offers a glossary that provides definitions of common statistical terms. Another useful resource is “Harnessing the Power of Statistics,” a chapter in The New Precision Journalism.

Note that understanding statistical terms isn’t a license to freely salt your stories with them. Always work to present studies’ key findings in clear, jargon-free language. You’ll be doing a service not only for your readers, but also for the researchers.

roberto smeraldi Sep 5, 2011 10:15

Very interesting and very useful post. However, I would like to suggest using a different example when it comes to explaining the difference between “descriptive” (in relation to “inferential”) statistics. In fact, both examples used are of an “inferential nature”: the difference is that the earlier one is inferential about the past and the latter is inferential about the future. In order to be a rigorously descriptive one, the earlier example should say something like the following: “From 2000 to 2005, 70% of the land cleared in the Amazon was occupied by cattle-ranching” (or “was transformed into pasture”, etc.). By focusing on the possible PURPOSE of those who converted the forest, the conclusion becomes inferential. Actually, the purpose might have been another one, i.e. speculating on land, obtaining access to public subsidies, (failed) attempts to establish crops, etc. So, the fact that deforestation resulted in pasture might be partially “on purpose” and partially not. Sorry for the rather technical comment, but the example is a typical one: I often have to explain to fellow journalists that “resulting use of land over a certain period” is not necessarily “THE cause” for land conversion. And the example helps to show how you can reach inferential conclusions about the past and not just about the future. Right?

Roberto Smeraldi (director, Amigos da Terra – Amazônia Brasileira), Brazil

John Wihbey Sep 5, 2011 11:11

Roberto – Thanks for your comments and feedback. I see your point and have amended the example as you suggest for the sake of clarity. That was very nice of you to take the time to write in. We are always looking for good new studies; if you have any to suggest relating to Brazil, please let me know at John_Wihbey@hks.harvard.edu. Much gratitude, John

[...] the Journalist’s Resource blog (from Harvard) recently published statistical terms used in research studies, a kind of primer for journalists. Anyone who needs [...]

Things to read « Kenny Smith | A few words … Sep 6, 2011 15:55

[...] Statistics for journalists. Great primer, there. Ten rules for visual storytelling, from Professor Mindy McAdams at Florida. It starts with this: “I want to know more. I always want to know who, when, and where. Always! For me this is part of authentication, which is part of what makes it journalism and not interpretive art. A photo without a caption is not journalism.” [...]

[...] yesterday’s list of the statistical terms that journalists should know. The original post was published by the Journalist’s Resource blog and recommended by José Roberto de Toledo. In [...]

Reliable Sources in an Age of Too Much Information - NYTimes.com Sep 28, 2011 16:04

[...] site is full of useful background and tools for writers, including a journalist’s primer on statistical terms used in research studies and a short reflection on online research by Alexis Madrigal of The [...]

[...] ID laws [are] associated with a 2.2% point decline in turnout, and photo ID laws are correlated with a 1.6% point decline.” In [...]

Dr. Denny Wilkins Oct 25, 2011 11:12

Excellent, succinct post. It will be required reading for my students. Many thanks.

[...] Lemann: I think there are three things. One is some kind of basic statistical literacy. A lot of academic literature has at least some statistics in it. The course I teach in the fall, [...]

Jenni May 11, 2012 21:11

All good, but I thought the section on null hypothesis/p values needed simpler clarification.

Leighton W. Klein May 18, 2012 14:11

Done! Thanks for the feedback.

[...] It means every j-school instructor emphasizing sophisticated web search skills, some knowledge of statistical concepts, a familiarity with research databases, and fostering the strong desire for a non-anecdotal, [...]

Do you have what it takes to become a news anchor? | Gemma Mar 12, 2013 19:36

[...] Klein, L. (2011, September 9). Statistical terms used in research studies; a primer for journalists. Journalist’s Resource. Retrieved November 4th, 2012, from http://journalistsresource.org/reference/research/statistics-for-journalists/ [...]