Making federal data more useful and accessible to fuel media and democracy

From modifying website interfaces to ensuring that metadata and contact information are clear and uniform, federal statistical agencies can better help news media and the public. A report for the Federal Committee on Statistical Methodology, 2014.

By John Wihbey, Assistant Director for Journalist’s Resource

Summary

As part of our participation in the December 2014 meeting of the Federal Committee on Statistical Methodology, Journalist’s Resource at Harvard’s Shorenstein Center authored recommendations to help the principal federal statistical agencies communicate better with media and, by extension, interested citizens. A variety of ideas were generated through website analysis, testing and conversations with journalists and experts. Agencies could substantially increase their potential audience by designing more for the “broad middle” of Web users, who may not be familiar with the federal statistics landscape. Proposed ideas are as follows:

1) Media-communications recommendations: Hold regular workshops with journalists of all kinds, particularly non-specialists; when journalists need help with data, provide access to expert government officials; and regularly offer media organizations the chance to articulate their needs for original data collection.

2) Data content- and presentation-related recommendations: Rethink what agencies collect with a more “citizen-centric” approach; find data that speak to the technology revolution and related changes; stay relevant and current on the Web, repurposing materials as news trends emerge and emphasizing shorter “quick hits” from large datasets and reports; and feature salient data points in large reports and design visualizations for news sites.

3) Technical and Website recommendations: Consider a more standardized Web user interface (UI) across agencies, along with responsive design; build an intuitive, app-like version of each site for more general users; design for better site search and search engine optimization; be clear about the quality of data; strengthen user information and metadata; offer many data formats and consider use cases and accessibility; and create more Application Programming Interfaces (APIs) that are tailored to the needs of news and information companies.

Background

From unemployment levels and energy usage to housing patterns and rates of violent crime, the 13 federal principal statistical agencies of the United States produce vitally important public data. Without this information, the functioning of civil society, politics and commercial enterprise would be diminished, as citizens, policymakers, businesses and media of all kinds rely on it every day to understand the world, inform their communities and make crucial decisions.

Each year the federal government spends about $3.7 billion on the data collection, processing and dissemination performed by its principal statistical agencies. This includes well-known government institutions such as the Census Bureau and the Bureau of Labor Statistics, in addition to agencies that examine agriculture, transportation and myriad other areas of American life.

In addition to the principal statistical agencies, more than 80 other federal agencies participate in this data collection work.

The expenditure for these principal agencies represents a tiny fraction of the federal budget. The data help guide vast spending programs and allocation, as well as regulatory decisions. Many of America’s roughly 90,000 different local government bodies draw on these insights. However, although a decennial Census is mandated by the U.S. Constitution, some of the funding for future data collection and statistical work is uncertain, and many of the agencies face flat or shrinking budgets. Some of the data collected, such as that through the Census Bureau’s American Community Survey, has been under political attack for some time for its perceived intrusiveness. And given the increase in data collection by the private sector, particularly the rise of so-called digital “big data,” it can appear that government is increasingly less relevant in the data collection business. This impression persists despite the fact that no private firm could provide the comprehensive, reliable, consistent and confidential data that government guarantees.

A July 2014 Department of Commerce report, “Fostering Innovation, Creating Jobs, Driving Better Decisions: The Value of Government Data,” estimated the value of government data to the U.S. economy at a minimum of $24 billion, with an upper bound of $221 billion, while an October 2013 report from the McKinsey Global Institute estimated that the United States could ultimately “unlock” $1.1 trillion in economic potential annually through further open data initiatives.

Data.gov showcases some of the tangible ways in which the economy benefits from current open data efforts and the businesses that are built on them, including firms such as Zillow, Kayak, BrightScope and the Climate Corporation. Former U.S. Chief Information Officer Vivek Kundra has called open data “digital fuel for the 21st century” economy.

Beyond commercial enterprises, the academic world and other research groups — NGOs and nonprofits, think tanks, advocacy groups and commercial research firms — rely on this data, from nearly all universities and their scholars to The Conference Board, Nielsen Media, The Brookings Institution and the Urban Institute.

Improving connections with news media

Media businesses such as Thomson Reuters and Bloomberg rely on government data to create value-added information products. And on any given day, a wide variety of news stories that reach millions of citizens are informed by government statistics. There is also heightened interest across newsrooms and journalism schools in data journalism. News outlets such as ProPublica and the New York Times are demonstrating new possibilities for what can be done with data, while organizations such as the Sunlight Foundation and the Investigative Reporters and Editors’ NICAR program and database are building capacity and knowledge infrastructure.

Of course, many journalists consider themselves first and foremost watchdogs of government, and some regularly request additional data through Freedom of Information requests. When it comes to “accountability” or investigative journalism, reporters want both public-sector performance data (indicating how well government is operating) and regulatory data related to the private sector (revealing how citizens are being treated in the marketplace). News media members also benefit from the steady output of descriptive statistics from the non-partisan research agencies, which can provide vital knowledge and context for audiences.

As the open data movement has grown, much of the effort has been driven by those who are already data literate and familiar with government’s offerings. It has also focused on encouraging the government to expand the sheer volume of datasets released, but the results have sometimes lacked a strong sense of the needs and priorities of the general public.

Much more could be done to broaden the audience for existing government data, particularly among the many thousands of news reporters and editors who have neither a specific subject-matter expertise nor a familiarity with the complex landscape of federal data and statistics. As a group, journalists have worked to build alternative platforms for accessing the available data, building sites such as Census Reporter because of a general recognition that media have different needs than subject-matter experts or data specialists, and want a simplified and more intuitive interface. Both government and media want to see the data used effectively. There are many areas where the government’s professional researchers and journalists would find common ground, overcoming some traditional tensions and feelings of mistrust between media and government.

General-assignment reporters, editors and producers — of which there are an increasing number in the news industry, as specialty beats are cut back — also stand as a proxy for a wider set of interested citizens, a “broad middle” ranging from state and local-government workers to small-business owners to local activists and students. Often, these communities have relatively limited knowledge and familiarity with government data. They are not entrepreneurs creating new products on live data streams, academic researchers or specialized investigative reporters. They are the millions of people who are just trying to make sense of the world quickly and who might look to the government statistical agencies to help them.

The December 2014 meeting of the Federal Committee on Statistical Methodology, sponsored by the Council of Professional Associations on Federal Statistics (COPAFS), in Washington, D.C., is an occasion to reflect on potential reforms, changes and innovations that might help to better serve the media and broader generalist communities.

One of the items that emerged from that Committee meeting is that federal agencies face a standing challenge with news media: gaining proper attribution and credit. While news media are under no legal obligation to fully credit the agencies and authors of government datasets and reports, agencies rely in part on perceived “relevance” to justify their continued work and funding to the public and policymakers. In any case, the simple practice of using the agency’s name and providing a hyperlink when data is drawn upon in news stories seems sensible, and fits squarely within the journalistic norm of giving proper credit and transparently sourcing others’ work.

The following are recommendations directed to the 13 principal federal statistical agencies, though they are applicable across government. The recommendations are based on conversations with working journalists, media researchers, research librarians and former government officials.

Data content and presentation recommendations

1) Rethink some of what you collect. Being “citizen-centric” has become a popular idea in the open data and open-government movements, and it is good to keep this squarely in mind, as citizens demand increasingly that government be more responsive and user-friendly online. This requires rethinking how content serves users and remains relevant to their lives, and how it can empower and protect. Of course, the principal federal agencies have a venerable tradition of collecting certain kinds of core data. No one would want Web traffic numbers to drive all statistical agency decisions — it is not a popularity contest, after all. But one key metric to focus on is reuse. If datasets are not being reused — and media mentions and use count heavily here, in addition to academic research, civic hacking and commercial enterprise — it may be time to rethink collection of that dataset. News media are a good index of public value. If reporters can’t imagine the data being communicated to the public — and no one is doing so — revisit whether its collection is a good use of resources.

2) Find data that speak to the technology revolution. Societal change takes place at the margin, and news media — and the broader public — are intensely interested in changes that may indicate larger future patterns. Yet it remains striking how little of the data collected by the federal statistical agencies now speaks to the widely disruptive changes related to the Internet revolution, affecting all areas, from education and energy to law enforcement and agriculture. One model to consider is the Pew Internet and American Life Project. In order to demonstrate relevance, each year agencies might contemplate a new dataset related to these changes.

3) Stay relevant and current on the Web. By definition, the “broad middle” of Web users have a wide range of interests and online abilities. Help them by providing both quick hits (statistics, graphs) and comprehensive materials (reports, datasets), with interconnections. Both should be continuously updated, with something new every day, and ideally touch on current events. Think about reports, charts and data that can be resurfaced and re-promoted to media and the public as news topics and events arise. Again, the research agenda and outreach strategy of the Pew Research Center furnish an instructive model. This means participating in social media and other third-party platforms. It may mean helping knowledge efforts such as Wikipedia. Much of the way the media and the public learn about issues is through searching sites such as Wikipedia, so it may be worth the effort to ensure data is current and citations are correct. Government researchers could help convene outside groups with common interests, such as those in NGOs, libraries and academia, to advance these public knowledge goals.

4) Feature salient data points and visuals of moderate complexity. Reports should have key findings right up top, similar to an executive summary. This could be the only page visitors read, so make it worth their while, and also make the case why the report as a whole is important. Visuals are important, both on pages and in reports: A good graphic that can be repurposed is often worth a thousand-page PDF for news media. Frequently, the graphics produced by professional researchers are too complex for news purposes, so try to limit variables and datapoints represented for visualizations designed for public consumption. At the same time, provide direct links to underlying data/reports. Finally, don’t assume an expert audience; make language as clear and direct as possible, and avoid obscure acronyms and terminology. If you must use specialist jargon, explain it first, then move forward.

Technical and Website recommendations

1) Consider a more standardized Web interface across agencies — and responsive design. The 13 statistical agencies’ websites vary in their design quality and the degree to which they are intuitive for non-expert users. At first glance, many home pages appear cluttered and reflect a user interface (UI) pattern that is characteristic of a previous era in Web design, where putting more information was considered better. We are now in an era of “layered functionality.” Creating a more uniform UI on data across agencies has several advantages: First, a crucial aspect of designing Web software involves extensive user testing, and this could be done more efficiently and at greater scale with the cooperation of multiple agencies. Second, for reporters and interested members of the public, the “learning curve” for accessing and using data would diminish, as navigating one site would provide a knowledge base for how to navigate others. Many newer city-based open data efforts, for example, appear to have adopted a more uniform UI. At the very least, the federal agencies could commit to UI best practices and open source design. The Energy Information Administration (EIA) website has many commendable features worth noting and emulating. Finally, trends in Web usage show that the American public is migrating in massive numbers to mobile devices — smartphones of increasingly varying sizes and tablets — and any redesigns should be adaptive, or “responsive”: The site should automatically conform to the device accessing the Web page and data.

2) Build an intuitive, app-like version of each site for general users. In each agency area, there are some core questions that the public are typically interested in. What’s the national crime rate? How many organic farms are there and where are they? Are incomes going up or down? Some of the federal websites point to answers to basic questions, but many do not. Consider designing a version of the site for non-experts and unsophisticated persons — the proverbial “average intelligent high school student.” Design the user interface like a mobile app, with limited choices and places to click at each layer, and large, visually appealing icons; use warm colors and fonts, and consider “flat” design principles. At the same time, the key statistics should link directly to the deeper data sources, so that if someone wants to learn more, she can.

3) Design for better site search and search engine optimization. Many datasets are functionally invisible to the public because they are not very searchable. Be sure to put plenty of keywords and narrative detail on landing pages so that search engines can crawl and index those pages. Agencies’ internal site search is also frequently problematic. Allow for advanced search interfaces to enable phrase searches, proximity searches and more. (USASpending.gov offers a potential model.) It might serve government agencies well to partner with major search engine companies such as Google, Bing and Yahoo to draw on their expertise. There should also be a simple, direct and intuitive way for users to sort and filter extensive lists of resources. When listing reports and data, columns should indicate release date, title, significance (popularity/downloads/citations) and type (PDF, CSV, etc.); filters should be easily accessible and work instantly. Content should also be accessible in more than one way. In other words, no single path (perhaps unknowable to the user) should be required to find it. Instead, ensure that users can get at content in multiple ways, through comprehensive interconnections. Think about your use cases: A journalist wanting key statistics; a researcher wanting a comprehensive report; a member of the public or a nonprofit wanting crime data. They all should be able to get what they need quickly and easily, without digging, repeated searches and guesswork.
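The sort-and-filter listing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the Resource fields, the sample catalog entries and the filter_and_sort helper are all hypothetical, and download counts stand in for "significance."

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Resource:
    title: str
    released: date
    downloads: int   # hypothetical stand-in for "significance"
    file_type: str   # e.g. "PDF" or "CSV"


# Hypothetical catalog entries, invented for illustration only.
CATALOG = [
    Resource("Annual crime report", date(2014, 9, 1), 1200, "PDF"),
    Resource("County crime table", date(2014, 11, 15), 5400, "CSV"),
    Resource("Farm census summary", date(2013, 5, 2), 800, "CSV"),
]


def filter_and_sort(resources, file_type=None, keyword=None,
                    sort_by="released", descending=True):
    """Return resources matching the optional filters, ordered by one column."""
    matches = [
        r for r in resources
        if (file_type is None or r.file_type == file_type.upper())
        and (keyword is None or keyword.lower() in r.title.lower())
    ]
    return sorted(matches, key=lambda r: getattr(r, sort_by), reverse=descending)


# Two different paths to overlapping content: by format, or by topic.
newest_csv = filter_and_sort(CATALOG, file_type="csv")
most_used_crime = filter_and_sort(CATALOG, keyword="crime", sort_by="downloads")

for r in newest_csv:
    print(r.released, r.title, r.file_type)
```

The same small function serves the journalist looking for the most-used crime data and the researcher browsing the newest CSV releases, which is the point of offering more than one route to the same content.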

4) Be clear about the quality of data. Strengthen user information and metadata. All datasets and statistical presentations should be accompanied by easily accessible metadata that clearly lay out strengths and weaknesses of the data collection and analysis. The IRS does a nice job explaining methodologies, for example. It is now a requirement that all government datasets fulfill a “common core” for metadata (title, description, URL, date, point of contact, etc.), but this is a floor, not a ceiling. Emails and points of contact are essential for media, particularly as they work on deadline and as they have questions about related available data (and try to confirm that something related doesn’t exist, the time-intensive problem of “proving a negative”). Provide context and information on data quality to help journalists make caveats and inform readers about uncertainty. Include strong documentation: source notes; what every column means; where it comes from; and limitations of the data. The Bureau of Justice Statistics does a good job of this in many cases. Many reporters and researchers want to know if datasets have a state or local dimension, so clearly indicate the levels to which consumers can drill down and localize. Finally, the broad research community uses notebook and citation tools such as Zotero, RefWorks and EndNote; optimizing websites for compatibility with these tools would reduce friction and allow metadata to flow more easily into people’s digital research notebooks.
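To make the "floor, not a ceiling" idea concrete, here is a minimal Python sketch of a metadata check. The field names are illustrative stand-ins chosen for this example, not the official "common core" schema, and the sample record is invented.

```python
# Field names are illustrative stand-ins, not an official schema.
REQUIRED = ("title", "description", "url", "date", "contact_email")
RECOMMENDED = ("methodology", "limitations", "geographic_levels",
               "column_definitions")


def missing_fields(record, fields):
    """List the fields that are absent or empty in a metadata record."""
    return [name for name in fields if not record.get(name)]


def review_metadata(record):
    """Report gaps against the required floor and the recommended extras."""
    return {
        "missing_required": missing_fields(record, REQUIRED),
        "missing_recommended": missing_fields(record, RECOMMENDED),
    }


# Hypothetical record, invented for illustration.
record = {
    "title": "County crime table",
    "description": "Reported offenses by county.",
    "url": "https://example.gov/data/county-crime.csv",
    "date": "2014-11-15",
    "contact_email": "",  # empty: a reporter on deadline has no one to ask
    "geographic_levels": ["state", "county"],
    "column_definitions": {"fips": "County FIPS code",
                           "offenses": "Reported offenses"},
}

report = review_metadata(record)
print(report)
```

A check like this could run before publication, so that a dataset without a point of contact or a note on limitations is flagged before a journalist finds the gap.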

5) Think hard about data formats and accessibility. Be wary of PDF files. They are not ideal for Web uptake and consumption. Be especially wary of giant PDF files, which bury key data and insights and require time-consuming downloads for users. Reports and data should always have HTML landing pages with summaries; don’t just feature a link to a PDF or XLS document. Agencies may need to accept the fact that the basic unit of journalistic analysis in many cases is not the professional statistics package but is frequently Excel. Just as a basic example, why not package the monthly employment numbers as a big workbook? It doesn’t have to be in a proprietary format either (.XLS or .XLSX). Why not ODF (the OpenDocument Format, an open standard)? CSV files are increasingly standard, which is a good thing. While a CSV file can only hold a single data sheet, it is accessible to all, no matter what program is used. Any agency releasing geo-spatial information should be releasing it in truly open formats. It should be “clear and liquid”; a good test is whether it is easily imported into and rendered in Google Maps applications.
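Because a CSV file holds only one table, a multi-sheet workbook would need to be published as one CSV per sheet. The Python sketch below shows one way that might look, using only the standard library. The workbook contents are made-up placeholder values rather than real employment statistics, and the helper names are hypothetical.

```python
import csv
import io

# Hypothetical workbook: sheet name -> rows. Values are made-up placeholders,
# not real employment statistics.
WORKBOOK = {
    "Employment": [["month", "employed"], ["2014-01", 100], ["2014-02", 101]],
    "By Sector": [["sector", "employed"], ["retail", 40], ["health", 61]],
}


def sheet_to_csv(rows):
    """Serialize one sheet (a list of rows) to CSV text."""
    buffer = io.StringIO(newline="")
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


def workbook_to_csv_files(workbook):
    """Map each sheet to its own CSV text, since one CSV holds one table."""
    return {
        name.lower().replace(" ", "_") + ".csv": sheet_to_csv(rows)
        for name, rows in workbook.items()
    }


files = workbook_to_csv_files(WORKBOOK)
for filename, text in files.items():
    print(filename)
    print(text)
```

Offering these plain-text files alongside the workbook costs little and means the data opens in any spreadsheet, statistics package or script.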

6) The importance of dates — and time. Many government datasets are collected in one year and then reports about them are issued in subsequent years. Be very, very clear about when data was collected versus when it was published. Further, on some sites, it can be surprisingly labor-intensive to track trends over time. In the Census Bureau’s American FactFinder, for instance, so much is organized by year, yet there’s no ready way to identify a statistic that you’re interested in and just say: Let me see this for every year you have.
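The "let me see this for every year you have" request amounts to pulling one statistic out of many per-year tables. The Python sketch below illustrates the idea with hypothetical tables and placeholder values; it does not describe how American FactFinder stores or serves its data. The label helper shows one way to keep the collection year and the publication date visibly separate.

```python
from datetime import date

# Hypothetical per-year tables keyed by collection year; values are
# placeholders, not real statistics.
TABLES_BY_YEAR = {
    2012: {"median_income": 102, "population": 11},
    2011: {"median_income": 100, "population": 10},
    2013: {"population": 12},  # this statistic was not released for 2013
}


def series_for(statistic, tables_by_year):
    """Collect one statistic across every year that has it, in year order."""
    return [
        (year, table[statistic])
        for year, table in sorted(tables_by_year.items())
        if statistic in table
    ]


def label(statistic, collected_year, published):
    """Spell out when data was collected versus when it was published."""
    return (f"{statistic} (data collected {collected_year}; "
            f"published {published.isoformat()})")


income_series = series_for("median_income", TABLES_BY_YEAR)
print(income_series)
print(label("median_income", 2012, date(2014, 3, 1)))
```

Years in which the statistic was not released are simply absent from the series, which is more honest than filling the gap with a guess.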

7) Create more Application Programming Interfaces (APIs). Many federal agencies are already doing this, and they are to be commended. But they should go faster and more should join the effort. The more rapidly this “data infrastructure” is built, the more the data will be used and the more opportunities there will be for information-based businesses. Users’ needs should be considered first; work with developers and entrepreneurs to figure out what data is actually used as opposed to creating APIs blindly and filling them with all manner of data streams.
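As a rough illustration of a small, purpose-built API, the Python sketch below serves a single named data series as JSON through the standard WSGI interface. Everything in it is hypothetical: the series name, the placeholder values and the "series" query parameter are assumptions made for this example, not any agency's actual API.

```python
import json
from urllib.parse import parse_qs

# Hypothetical series with placeholder values, invented for illustration.
SERIES = {
    "example_rate": [{"year": 2012, "value": 1.0},
                     {"year": 2013, "value": 2.0}],
}


def app(environ, start_response):
    """WSGI application: return one series as JSON, or a helpful 404."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    series_id = query.get("series", [""])[0]
    if series_id in SERIES:
        status = "200 OK"
        payload = {"series": series_id, "observations": SERIES[series_id]}
    else:
        status = "404 Not Found"
        payload = {"error": "unknown series", "available": sorted(SERIES)}
    body = json.dumps(payload).encode("utf-8")
    start_response(status, [("Content-Type", "application/json"),
                            ("Content-Length", str(len(body)))])
    return [body]


def call(query_string):
    """Invoke the app directly, without a network socket, and decode the reply."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"QUERY_STRING": query_string}, start_response))
    return captured["status"], json.loads(body)


print(call("series=example_rate"))
print(call("series=missing"))
```

To try it over HTTP, the app could be passed to make_server from Python's wsgiref.simple_server module. The error response lists the available series, a small courtesy that spares developers the guesswork the report warns against.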

8) Make Data.gov the single portal for government statistics and data. “Portals” in general are considered an older Web idea, but they still frame the way many people behave and operate. There is obviously a difference between raw data and synthesized statistical information, but for the public, such distinctions remain obscure. Ask anyone on the street the difference among FedStats, Data.gov, the American FactFinder and the now-discontinued Statistical Abstract of the United States and you will draw blank looks. The organizational reality, however, is that Data.gov does not have enough staff to make any unified portal a reality. Agencies should consider lending technical support to Data.gov to help build this resource and make it a portal to all data and value-added statistical tools and presentations. (As an aside, the St. Louis Federal Reserve’s FRED research platform, which aggregates data across agencies and providers, is widely used by financial journalists. Consider helping other open third-party platforms and portals, which serve as allies in communications, to surface important data. Journalists also like FRED because it has an Excel macro that helps directly update the data in spreadsheets.)

9) Linked data and the future. It is worth noting that there is a broad aspiration among Web thought leaders, originating with World Wide Web founder Tim Berners-Lee, to create a world of “linked data,” where all data on the Web is structured in such a way that it has relational qualities. Disparate but related datasets can then be put together in a smart way through algorithms. Agencies should each have a working group that is planning to help achieve this vision. (See the DBpedia project for an example of this type of thinking.)

Media and communications recommendations

1) Hold regular workshops with journalists of all kinds. Even small, brief online workshops can be helpful and keep agencies and journalists up to speed with one another’s thinking and needs. Agencies should be asking groups of journalists for feedback and for tips for improvement. (The Bureau of Labor Statistics, for example, just did this in summer 2014.) Government researchers should have several reporters or editors with whom they regularly correspond, even if it’s just to stay in touch and keep them apprised of new work and datasets.

2) When journalists need help with data, provide access to expert officials. It has become a common complaint among media members that public information officers who know less about the data are most likely to interface with news media. (Some agencies, however, such as the Bureau of Labor Statistics, state that they have an “open” policy whereby journalists can ask questions of a wide variety of experts.) When journalists are being given embargoed data, make sure to offer substantial expert support so they understand the data. Don’t create embargo rules so onerous that journalists can’t do pre-reporting to help them understand the societal and human impact of the numbers. Provide raw data files as well, and not just reports that the public affairs office wants to push. Journalists have substantive questions; connect them with staff who have knowledge. One further note that applies to government agencies broadly: Just because there is “personally identifiable information” does not mean that it is necessarily “personal” information. Agencies should not always hide individual records as a matter of policy. When reporters can speak with the people who are represented by the records, you get a nuanced view. Statistical generalizations bore readers and news audiences; journalists need particular stories.

3) Offer media organizations the chance to articulate needs for original data collection. The Department of Commerce states in its July 2014 report that “statistical agencies frequently consult with private-sector entities in the research and development of new data products.” Agencies should be consulting, too, with media organizations. (It is worth noting that they, too, are part of the “private sector” and there are about 80,000 “news analysts, reporters and correspondents” in the U.S. labor force, according to the 2006-10 American Community Survey.) Ask reporters, “What would you like to see collected and how could that help you?” And then deliver on providing a new kind of dataset.

Conclusion: Sharing web data, tracking trends

To help facilitate greater visibility, reuse and relevance, the federal statistical agencies would profit by pooling data on media reuse and press mentions of their data products, as well as broad Web analytics. It is unclear how much this is being done currently. Comparisons of patterns across the 13 principal agencies and their websites may yield interesting and actionable insights. There may be certain kinds of content and presentation styles that could be replicated across agencies; and new, important Web and market trends will emerge. If agencies want to remain relevant to the public at large in an increasingly crowded and noisy space online, they need to make every effort to continually track user trends and respond to opportunities, innovations and changes in patterns of information consumption.

Special thanks to the following for conversations and/or suggestions toward the preparation of this report: Sarah Cohen, editor of computer-assisted reporting, and Jack Begg, news research supervisor, of the New York Times; data journalist Evan Horowitz of the Boston Globe; Greg Ip, U.S. economics editor at The Economist; Alex Howard of TechRepublic and ePluribusUnum, and a fellow at the Tow Center for Digital Journalism at Columbia Journalism School; Mark Horvit, executive director of Investigative Reporters and Editors; at Harvard’s Shorenstein Center on Media, Politics and Public Policy, former U.S. deputy CTO Nick Sinai, Journalist’s Resource research editor Leighton Walter Kille, and executive director Nancy Palmer; at the Harvard Kennedy School Library, research and knowledge service librarian Valerie Weis, research and instruction librarians Keely Wilczek and Kristen Koob, and manager of research, instruction and knowledge services Heather McMullen; Michael Levi, Associate Commissioner for Publications and Special Studies at the Bureau of Labor Statistics; and Katherine R. Smith, executive director of the Council of Professional Associations on Federal Statistics.

_______

John Wihbey is Assistant Director for Journalist’s Resource at the Shorenstein Center on Media, Politics and Public Policy, at Harvard Kennedy School. He can be reached at 617-496-9068. Twitter: @JournoResource and @wihbey.