What is Big Data? Research roundup, reading list

“Big Data” is an emerging, catch-all term that is defined in many ways by many different groups. For the media, it’s a phenomenon to watch, describe and report on, but it also has deep implications for how the information business itself may evolve, carrying with it both the strong possibility of further disruption and creative opportunities. Beyond the hype around the possibilities for e-commerce and industry, Big Data raises a huge number of ethical and legal questions.

In terms of specific implications for journalism, Jennifer LaFleur of ProPublica and David Donald of the Center for Public Integrity argue that analytic rigor needs to be brought to bear to expose the limitations of datasets. “Journalism by numbers does not mean ceding human process to the bots,” writes Emily Bell in the Columbia Journalism Review, adding, “There must be transparency and a set of editorial standards underpinning the data collection.” And for more on the promises and perils for the media, see this post at Nieman Journalism Lab.

As researchers danah boyd and Kate Crawford state in a 2012 article in Information, Communication & Society, Big Data is best described as a phenomenon playing out in several dimensions: It is about “maximizing computation power and algorithmic accuracy to gather, analyze, link, and compare large data sets”; it is also about “drawing on large data sets to identify patterns in order to make economic, social, technical, and legal claims.” Behind all of this, the researchers note with skepticism, is the “widespread belief that large data sets offer a higher form of intelligence and knowledge that can generate insights that were previously impossible, with the aura of truth, objectivity, and accuracy.” Also see Crawford’s article for the Harvard Business Review, “The Hidden Biases of Big Data,” where she discusses examples where data failed to tell the full story, and her talk on “algorithmic illusions.”

Finally, in terms of the need for further rigor — and a healthy dose of skepticism — it is worth noting the research of David Lazer and Ryan Kennedy of Northeastern/Harvard, Gary King of Harvard and Alessandro Vespignani of Northeastern, which shows how hype around the ability to spot flu trends through Internet-generated data was not supported by the evidence. Their 2014 paper “The Parable of Google Flu: Traps in Big Data Analysis” finds that “research on whether search or social media can predict x has become commonplace … and is often put in sharp contrast with traditional methods and hypotheses. Although these studies have shown the value of these data, we are far from a place where they can supplant more traditional methods or theories.”

Data can be text and numbers but can also include maps and images. An array of machines, from underwater sensors and pet collars to mobile phones and traffic signals, can capture reams of data waiting to be sliced, diced and analyzed. In recent years, technological advances have expanded the types of Big Data that can be harnessed and stored, and who has access to these data. Large datasets are no longer the exclusive province of numbers experts in academia and the hard sciences: public resources such as Data.gov allow anyone with an Internet connection to download them, on topics ranging from recent earthquake activity and local unemployment statistics to the U.S. Census Bureau’s voluminous data on everything from business activity to rates of depression by census tract.

The growth of Big Data is both celebrated and condemned. Proponents see it as enabling new businesses and promoting transparency in markets and government. Detractors fear that this transparency will extend into the personal realm, as was the case when Target’s data crunchers correctly determined from shopping patterns that a teenager was pregnant before she had disclosed her condition to family members — or the store. For social science researchers, huge data sets are the key to drawing out subtle patterns with important implications: The 2012 Nature study analyzing Facebook’s powerful impact on voter turnout — based on a dataset of more than 61 million people — is a prime example.

At any rate, many expect there to be a new “race” for data of all kinds, as corporations, governments, individuals and various groups compete for ownership and economic advantage through data mining and other techniques.

For a brief, general overview of the topic, good places to start are articles in The Economist, “Data, Data Everywhere”; the New York Times, “The Age of Big Data”; and the Wall Street Journal, “Meet the New Boss: Big Data.” For a look at how the field of data journalism is expanding, see this overview at Journalist’s Resource, and read this interview with expert data journalist Steve Doig, of ASU’s Cronkite School, who discusses “social science done on deadline.” Finally, read what one of the world’s leading academic data experts, Harvard’s Gary King, has to say about connection points with journalism.

Below are studies and articles that bring a research perspective to questions around Big Data:

_____

“Critical Questions for Big Data”
boyd, danah; Crawford, Kate. Information, Communication & Society, 2012, 15:5, 662-679. doi: http://dx.doi.org/10.1080/1369118X.2012.678878.

Abstract: “The era of Big Data has begun. Computer scientists, physicists, economists, mathematicians, political scientists, bio-informaticists, sociologists and many others are clamoring for access to the massive quantities of information produced by and about people, things and their interactions. Diverse groups argue about the potential benefits and costs of analyzing information from Twitter, Google, Verizon, 23andMe, Facebook, Wikipedia and every space where large groups of people leave digital traces and deposit data. Significant questions emerge. Will large-scale analysis of DNA help cure diseases? Or will it usher in a new wave of medical inequality? Will data analytics help make people’s access to information more efficient and effective? Or will it be used to track protesters in the streets of major cities? Will it transform how we study human communication and culture, or narrow the palette of research options and alter what ‘research’ means? Some or all of the above? This essay offers six provocations that we hope can spark conversations about the issues of Big Data. Given the rise of Big Data as both a phenomenon and a methodological persuasion, we believe that it is time to start critically interrogating this phenomenon, its assumptions and its biases.”

“The Future of Big Data”
Anderson, Janna; Rainie, Lee. Pew Internet & American Life Project, July 2012.

Abstract: “While enthusiasts see great potential for using Big Data, privacy advocates are worried as more and more data is collected about people — both as they knowingly disclose such things as their postings through social media and as they unknowingly share digital details about themselves as they march through life. Not only do the advocates worry about profiling, they also worry that those who crunch Big Data with algorithms might draw the wrong conclusions about who someone is, how she might behave in the future, and how to apply the correlations that will emerge in the data analysis. There are also plenty of technical problems. Much of the data being generated now is ‘unstructured,’” and as a consequence, “getting it into shape for analysis is no tiny task.”

“Trending: the Promises and the Challenges of Big Social Data”
Manovich, Lev. Debates in the Digital Humanities, July 2011, University of Minnesota Press.

Findings: “Is it true that ‘surface is the new depth’ — in a sense that the quantities of ‘deep’ data that in the past was obtainable about a few can now be automatically obtained about many? Theoretically, the answer is yes, as long as we keep in mind that the two kinds of deep data have different content. Practically, there are a number of obstacles before this can become a reality. I tried to describe a few of these obstacles, but there are also others I did not analyze. However, with what we already can use today (social media companies APIs, Infochimps.com data marketplace and data commons, free archives such as Project Gutenberg, Internet Archive, etc.), the possibilities are endless — if you know some programming and data analytics, and also are open to asking new types of questions about human beings, their social life and their cultural expressions and experiences.”

“Big Data: The Next Frontier for Innovation, Competition, and Productivity”
Manyika, James; Chui, Michael; Brown, Brad; Bughin, Jacques; Dobbs, Richard; Roxburgh, Charles; Hung-Byers, Angela. McKinsey Global Institute, May 2011.

Findings: “There are five broad ways in which using big data can create value. First, big data can unlock significant value by making information transparent and usable at much higher frequency. Second, as organizations create and store more transactional data in digital form, they can collect more accurate and detailed performance information on everything from product inventories to sick days, and therefore expose variability and boost performance. Leading companies are using data collection and analysis to conduct controlled experiments to make better management decisions; others are using data for basic low-frequency forecasting to high-frequency nowcasting to adjust their business levers just in time. Third, big data allows ever-narrower segmentation of customers and therefore much more precisely tailored products or services. Fourth, sophisticated analytics can substantially improve decision-making. Finally, big data can be used to improve the development of the next generation of products and services. For instance, manufacturers are using data obtained from sensors embedded in products to create innovative after-sales service offerings such as proactive maintenance (preventive measures that take place before a failure occurs or is even noticed).”

“‘Big Data’ Versus ‘Big Brother’: On the Appropriate Use of Large-scale Data Collections in Pediatrics”
Currie, Janet. Pediatrics, Vol. 131, Supplement 2, S127-S132. doi: 10.1542/peds.2013-0252c.

Abstract: “Discussions of ‘big data’ in medicine often revolve around gene sequencing and biosamples. It is perhaps less recognized that administrative data in the form of vital records, hospital discharge abstracts, insurance claims, and other routinely collected data also offer the potential for using information from hundreds of thousands, if not millions, of people to answer important questions. However, the increasing ease with which such data may be used and reused has increased concerns about privacy and informed consent. Addressing these concerns without creating insurmountable barriers to the use of such data for research is essential if we are to avoid a ‘missed opportunity’ in pediatrics research.”

“Digital Fuel of the 21st Century: Innovation through Open Data and the Network Effect”
Kundra, Vivek. Shorenstein Center on Media, Politics and Public Policy, January 2012, Discussion Paper Series No. D-70, Harvard Kennedy School.

Findings: “If … data isn’t sliced, diced and cubed to separate signal from noise, it can be useless. But, when made available to the public and combined with the network effect — defined by Reed’s Law, which asserts that the utility of large networks, particularly social networks, can scale exponentially with the size of the network — society has the potential to drive massive social, political and economic change…. Channeling the power of this open data and the network effect can help: (1) Fight government corruption, improve accountability and enhance government service; (2) change the default setting of government to open, transparent and participatory; (3) create new models of journalism to separate signal from noise to provide meaningful insights; (4) launch multi-billion dollar businesses based on public sector data.”

“The Danger of Big Data: Social Media as Computational Social Science”
Oboler, Andre; Welsh, Kristopher; Cruz, Lito. First Monday, July 2012, Vol. 17, No. 7.

Findings: “The capacity to collect and analyze data sets on a vast scale provides leverage to reveal patterns of individual and group behaviour (Lazer, et al., 2009). The revelation of these patterns can be a concern when they are made available to business and government. It is, however, precisely business and government who today control the vast quantities of data used for computational social science analysis. The potential damage from inappropriate disclosure of information is sometimes obvious. However, the potential damage of multiple individually benign pieces of information being combined to infer, or a large dataset being analysed to reveal, sensitive information (or information which may later be considered sensitive) is much harder to foresee. A lack of transparency in the way data is analysed and aggregated, combined with a difficulty in predicting which pieces of information may later prove damaging, means that many individuals have little perception of potential adverse effects of the expansion in computational social science.”

“The Machine that Would Predict the Future”
Weinberger, David. Scientific American, December 2011, Vol. 305, 52-57. doi: 10.1038/scientificamerican1211-52

Findings: “Agent-based modeling works only in a very narrow set of circumstances, according to Gary King, director of the Institute for Quantitative Social Science at Harvard. In the case of a highway or the hajj, everyone is heading in the same direction, with a shared desire to get where they are going as quickly and safely as possible. Helbing’s FuturICT system, in contrast, aims to model systems in which people are acting for the widest variety of reasons (from selfish to altruistic); where their incentives may vary widely (getting rich, getting married, staying out of the papers); where contingencies may erupt (the death of a world leader, the arrival of UFOs); where there are complex feedback loops (an expert’s financial model brings her to bet against an industry, which then panics the market); and where there are inputs, outputs and feedback loops from related models…. Scientists raise a number of interrelated challenges that such a comprehensive system would have to overcome. To begin with, we don’t have a good theory of social behavior from which to start. King explains that when we have a solid idea of how things work — in physical systems, for example — we can build a model that successfully predicts outcomes. But whatever theories of social behavior we do have fall far short of the laws of physics in predictive power.”

“Validation: What Big Data Reveal About Survey Misreporting and the Real Electorate”
Ansolabehere, Stephen; Hersh, Eitan. Political Analysis, Summer 2012, Vol. 20, No. 3, 1-23. doi: 10.1093/pan/mps023

Abstract: “Social scientists rely on surveys to explain political behavior. From consistent overreporting of voter turnout, it is evident that responses on survey items may be unreliable and lead scholars to incorrectly estimate the correlates of participation. Leveraging developments in technology and improvements in public records, we conduct the first-ever 50-state vote validation. We parse overreporting due to response bias from overreporting due to inaccurate respondents. We find that nonvoters who are politically engaged and equipped with politically relevant resources consistently misreport that they voted. This finding cannot be explained by faulty registration records, which we measure with new indicators of election administration quality. Respondents are found to misreport only on survey items associated with socially desirable outcomes, which we find by validating items beyond voting, like race and party. We show that studies of representation and participation based on survey reports dramatically misestimate the differences between voters and nonvoters.”

“The Rise of Big Data: What Does it Mean For Education, Technology, and Media Research?”
Eynon, Rebecca. Learning, Media and Technology, 2013. doi: 10.1080/17439884.2013.771783

Findings: “The final set of challenges are those around issues of inequality, and how Big Data may both reinforce and perhaps even exacerbate existing social and educational inequalities in a number of ways. First are issues around the question of whose data traces will be analysed using Big Data, and in simple terms it is likely that only those who are better off will be represented in such research, as these are the people who will be online more. For example, a lot of work in Big Data focuses on Twitter, the blogosphere, and search engine queries. All of these activities are not undertaken equally by the whole population. Second are issues around the question of which researchers have access to these data sets, which are often owned by commercial companies (boyd and Crawford 2012). Similarly, access and use of open data is unlikely to be equally available to everyone due to existing structural inequalities.”

“Big Data Privacy Issues in Public Social Media”
Smith, Matthew; Szongott, Christian; Henne, Benjamin; von Voigt, Gabriele. 2012 IEEE International Conference on Digital Ecosystems Technologies (DEST), 18-20 June 2012, 1-6.

Abstract: “Big Data is a new label given to a diverse field of data intensive informatics in which the datasets are so large that they become hard to work with effectively. The term has been mainly used in two contexts, firstly as a technological challenge when dealing with data-intensive domains such as high energy physics, astronomy or internet search, and secondly as a sociological problem when data about us is collected and mined by companies such as Facebook, Google, mobile phone companies, retail chains and governments. In this paper we look at this second issue from a new perspective, namely how can the user gain awareness of the personally relevant part of Big Data that is publicly available in the social web. The amount of user-generated media uploaded to the web is expanding rapidly and it is beyond the capabilities of any human to sift through it all to see which media impacts our privacy. Based on an analysis of social media in Flickr, Locr, Facebook and Google+, we discuss privacy implications and the potential of the emerging trend of geo-tagged social media. We then present a concept with which users can stay informed about which parts of the social Big Data deluge is relevant to them.”

“The Computational Turn: Thinking About the Digital Humanities”
Berry, David M. Culture Machine, 2011, Vol. 12.

Findings: “In cutting up the world [into data chunks], information about the world necessarily has to be discarded in order to store a representation within the computer. In other words, a computer requires that everything is transformed from the continuous flow of our everyday reality into a grid of numbers that can be stored as a representation of reality which can then be manipulated using algorithms. These subtractive methods of understanding reality (episteme) produce new knowledges and methods for the control of reality (techne). They do so through a digital mediation, which the digital humanities are starting to take seriously as their problematic.”

“Forgetting Footprints, Shunning Shadows: A Critical Analysis of the ‘Right to Be Forgotten’ in Big Data Practice”
Koops, Bert-Jaap. SCRIPTed, December 2011. Vol. 8, No. 3, 229-256.

Abstract: “The so-called ‘right to be forgotten’ has been put firmly on the agenda, both of academia and of policy. Although the idea is intuitive and appealing, the legal form and practical implications of a right to be forgotten have hardly been analyzed so far. This contribution aims to critically assess what a right to be forgotten could or should entail in practice. It outlines the current socio-technical context as one of Big Data, in which massive data collections are created and mined for many purposes. Big Data involves not only individuals’ digital footprints (data they themselves leave behind) but, perhaps more importantly, also individuals’ data shadows (information about them generated by others). And contrary to physical footprints and shadows, their digital counterparts are not ephemeral but persistent. This presents particular challenges for the right to be forgotten, which are discussed in the form of three key questions. Against whom can the right be invoked? When and why can the right be invoked? And how can the right be effected?”

“Understanding the Mechanics of Online Collective Action Using ‘Big Data’”
Hale, Scott A.; Margetts, Helen Zerlina. Working paper, March 2012.

Abstract: “Now that so much of collective action takes place online, Web-generated data can further understanding of the mechanics of Internet-based mobilization. This ‘big data’ offers social science researchers the potential for new forms of analysis, using real-time transactional data based on entire populations, rather than sample-based surveys of what people think they did or might do. This paper uses a ‘big data’ approach to track the growth of over 8,000 petitions to the UK Government on the No. 10 Downing Street website for two years, analyzing the rate of growth per day and testing the hypothesis that the distribution of daily change will be leptokurtic (rather than normal) as previous research on agenda setting would suggest. This hypothesis is confirmed, suggesting that Internet-based mobilization is characterized by tipping points (or punctuated equilibria) and explaining some of the volatility in online collective action. We find also that most successful petitions grow quickly and that the number of signatures a petition receives on its first day is the most significant factor explaining the overall number of signatures a petition receives during its lifetime. These findings could have implications for the strategies of those initiating petitions and the design of web sites with the aim of maximizing citizen engagement with policy issues.”
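The leptokurtic test described in the abstract above can be illustrated with a small sketch. This is not the authors' code or data: the function name `excess_kurtosis` and both lists of daily changes are hypothetical, chosen only to show that a series of mostly tiny changes punctuated by a few large jumps has positive excess kurtosis, while an evenly spread series does not.

```python
from statistics import fmean


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3.

    A normal distribution scores 0; a positive score means a sharper peak
    and heavier tails than normal, i.e. a leptokurtic distribution.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    mean = fmean(values)
    m2 = fmean([(v - mean) ** 2 for v in values])  # second central moment
    m4 = fmean([(v - mean) ** 4 for v in values])  # fourth central moment
    if m2 == 0:
        raise ValueError("all observations are identical")
    return m4 / m2 ** 2 - 3


# Hypothetical daily changes in a petition's signature count:
# mostly no movement, with two large swings.
spiky_changes = [0, 0, 0, 0, 0, 0, 0, 0, 10, -10]

# Hypothetical evenly spread daily changes, for contrast.
flat_changes = [-2, -1, 0, 1, 2]

if __name__ == "__main__":
    print("spiky series:", excess_kurtosis(spiky_changes))
    print("flat series:", excess_kurtosis(flat_changes))
```

A positive result for the spiky series is the pattern the abstract associates with tipping points. A real analysis would need far more observations and a formal significance test, which this sketch leaves out.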

“Usage Data in Web Search: Benefits and Limitations”
Baeza-Yates, Ricardo; Maarek, Yoelle. Scientific and Statistical Database Management, 24th International Conference, “Lecture Notes in Computer Science,” 2012. Vol. 7338/2012, 495-506. doi: 10.1007/978-3-642-31235-9_33.

Findings: “In this paper we have shown the importance of usage data in current Web search engines and how the wisdom of crowds was a driving principle in leveraging usage data. However, in spite of its benefits, usage data cannot be fully leveraged due to the conflicting demands of big data, personalization and privacy. We have explored here these conflicting factors and proposed a solution based on applying the wisdom of crowds in a different manner. We propose to build ‘ad hoc’ crowds around common tasks and needs, where a user will be aggregated to different users in various configurations depending on the considered task or need. We believe that more and more ‘ad hoc crowds’ will be considered, in particular in other Web services and inside large Intranets where this idea can impact revenue or/and productivity.”

“The End of Theory: The Data Deluge Makes the Scientific Method Obsolete”
Anderson, Chris. Wired, June 2008.

Findings: “Scientists are trained to recognize that correlation is not causation, that no conclusions should be drawn simply on the basis of correlation between X and Y (it could just be a coincidence). Instead, you must understand the underlying mechanisms that connect the two. Once you have a model, you can connect the data sets with confidence. Data without a model is just noise. But faced with massive data, this approach to science — hypothesize, model, test — is becoming obsolete…. Petabytes allow us to say: ‘Correlation is enough.’ We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot.”
