As the U.S. government continues to remove data and make radical changes to its websites, reporters are encountering health data that’s incomplete, altered or missing entirely.
At the Association of Health Care Journalists’ annual conference in Los Angeles on May 30, a panel of data-savvy journalists and data scientists shared strategies to uncover, safeguard and verify vital data.
“In my entire career, I feel like the trajectory has been towards greater transparency and availability of data, especially public health data, and I really worry that we’re now entering an era of data scarcity,” said J. Emory Parker, data editor for STAT News, during his presentation. “But I don’t want to be entirely pessimistic. I think there is a lot of work we can do, and we can try.”
The panel, titled “Finding data when people are hiding it from you,” was moderated by MedPage Today’s Washington editor Joyce Frieden. The speakers included Parker, Adam Rhodes, training director for the journalism organization Investigative Reporters & Editors (IRE), Sydney Lupkin, pharmaceutical correspondent for NPR, and David Radley, director of data and analytics at the Center for Evidence-Based Policy and senior scientist at the Commonwealth Fund.
We have summarized their advice in eight tips.
1. Archive important data before it disappears
When the Centers for Disease Control and Prevention’s various datasets were at risk of being taken down, Parker of STAT News sprang into action by backing up data from data.cdc.gov using a script and the site’s Application Programming Interface. An API allows users to write code to get data from a website.
“And specifically, what was really useful about that is that there’s a way to ask this website to generate just a complete list of every single dataset that is available,” Parker said.
He downloaded about 150 gigabytes of data and created a system to track changes to the available datasets in real time, starting Jan. 28, 2025.
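For readers who want to try something similar, here is a minimal Python sketch of the idea, not Parker’s actual code. It assumes the site publishes a machine-readable catalog at /data.json in which each entry under "dataset" carries an "identifier" and a "title"; that endpoint, those field names and all function names are assumptions to check against the live site before relying on them.

```python
import json
import urllib.request

# Assumed catalog endpoint; confirm it exists before relying on it.
CATALOG_URL = "https://data.cdc.gov/data.json"


def fetch_catalog(url=CATALOG_URL):
    """Download the catalog and return its list of dataset entries (network call)."""
    with urllib.request.urlopen(url, timeout=60) as resp:
        return json.load(resp).get("dataset", [])


def index_by_id(entries):
    """Map each dataset identifier to its title."""
    return {entry["identifier"]: entry.get("title", "") for entry in entries}


def diff_catalogs(old_entries, new_entries):
    """Return (removed, added) as sorted lists of (identifier, title) pairs."""
    old, new = index_by_id(old_entries), index_by_id(new_entries)
    removed = sorted((i, old[i]) for i in old.keys() - new.keys())
    added = sorted((i, new[i]) for i in new.keys() - old.keys())
    return removed, added
```

Saving the output of fetch_catalog() to a dated file each day, then running diff_catalogs() on two of those files, would show which datasets disappeared or appeared between the two dates.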
The news organization’s CDC data archives are publicly available.
Parker also used the Wayback Machine to find earlier snapshots of the dataset lists on data.cdc.gov and gauge the rate of data removal before the second Trump administration began. The Wayback Machine is a digital archive of the Internet maintained by the Internet Archive, a nonprofit organization.
He was able to compare the lists posted between November 2024 and January 2025, finding that the net drop in the number of datasets was about 20.
In comparison, between Jan. 28 and Feb. 7, 2025, 153 datasets were removed, Parker found in his analysis.
The Internet Archive “is going to be the single biggest resource we have going forward,” Parker said.
The Wayback Machine also has a Save Page Now tool and similar browser extensions, which you can use to save a page to the archive.
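Both lookups and saves can also be scripted. The Python sketch below is one possible approach, assuming the Internet Archive’s public availability endpoint at archive.org/wayback/available and the web.archive.org/save/ address behave as they are commonly documented; the function names are hypothetical, and the response format should be checked against the Internet Archive’s own documentation.

```python
import json
import urllib.parse
import urllib.request

AVAILABILITY_API = "https://archive.org/wayback/available"
SAVE_PAGE_NOW = "https://web.archive.org/save/"


def availability_url(page_url, timestamp=None):
    """Build a query for the snapshot closest to timestamp (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_API + "?" + urllib.parse.urlencode(params)


def closest_snapshot(response):
    """Pull (archived_url, timestamp) out of a parsed response, or None."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    return closest["url"], closest["timestamp"]


def lookup(page_url, timestamp=None):
    """Ask the Wayback Machine for the nearest snapshot (network call)."""
    url = availability_url(page_url, timestamp)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return closest_snapshot(json.load(resp))


def save_url(page_url):
    """Address that asks Save Page Now to capture page_url when visited."""
    return SAVE_PAGE_NOW + page_url
```

Calling lookup("data.cdc.gov", "20250128") would request the snapshot nearest to Jan. 28, 2025, if one exists.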
2. Look beyond federal data sources
Find “the organizations that deal most specifically with the population that you’re interested in talking to,” Rhodes of IRE said. “Institutions and organizations that are more on the ground in those communities or serving those institutions, I think, will probably be the biggest bet going forward.”
For instance, the Trevor Project is a great source of data on LGBTQ+ mental health, Rhodes said.
The Journalist’s Resource has a list of non-government health data alternatives and archives.
Also, search for experts who may already be collecting the data you need.
For instance, Rhodes saved time when reporting on a story about access to HIV treatment in Puerto Rico by finding a researcher at Columbia University who had already been studying the topic for years.
“I didn’t have to interview hundreds of people,” they said. “I just talked to that one researcher.”
Collaboration among journalists and organizations can also help fill some of the data gaps.
Parker cited the COVID Tracking Project as a model of collaboration among journalists and other volunteers. The organization, launched in 2020 at The Atlantic, included hundreds of volunteers who compiled, published and interpreted COVID-19 data from state, county and city health agencies at a time when the CDC was falling short of providing this critical information.
“I think to the extent that we’re going to be facing really difficult data challenges over the next few years, I think we’re probably going to be needing to think about other ways that we can sort of replicate that kind of mass collaboration between not just ourselves within newsrooms, but between newsrooms and between organizations,” Parker said.
3. Verify all the data you find
“When you find something online, the first thing you should think of is, ‘How can I verify this?’” Rhodes said.
Find the primary source to verify the data you find online or in news stories. Find the researcher responsible for compiling a dataset in a document or study, Rhodes advised.
Call an expert, or two or three, especially when you’re not sure about the validity of the data, Lupkin of NPR advised.
4. Don’t let perfection be the enemy of good
Sometimes you won’t be able to get a comprehensive database, but you can access parts of it. And that part can “give you a window in, even if it isn’t a complete picture,” Lupkin said.
Be sure to describe the data in your story and explain its limitations.
When Lupkin learned that the drug list price database was proprietary and not available to her, she sought out other companies that pay to access those datasets. One of those companies was GoodRx. The company provided Lupkin with a set of data that didn’t violate its user agreement with the owners of the original datasets.
Ask yourself the following questions when looking for data, Lupkin advised:
- If it’s not the original company, who else might have access to the data?
- Who else might collect it?
- Who else does something similar?
- Are there any government, academic or think tank reports that suggest this data exists?
- Who else would benefit from or need this data?
- If it’s federal data, is it possible that it could also be held in a state database?
While the Centers for Medicare and Medicaid Services has hospital inspection reports, so do states. So instead of filing a records request via the Freedom of Information Act with CMS, which might have taken months to fulfill, Lupkin filed one with a state and got what she needed in a week.
“Sometimes knowing that there’s a redundancy of the thing that you want can help you get around the problem” of getting the data, she said.
5. Use advanced search techniques
Boolean operators (words like AND, OR, NOT), filetype filters, and site-specific searches on Google’s Advanced Search can help you find documents that may not show up with a simple Google search.
For instance, if you’re looking for a database of injuries from a government website, you can type “injuries site:gov filetype:csv” in Google’s search box, Rhodes said. (A CSV, or Comma-Separated Values file, is a simple, plain text file format to store data that can be easily imported into a spreadsheet program.)
Online search expert Henk van Ess has several tips on optimizing Google searches.
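If you run many such searches, a few lines of Python can assemble the query strings for you. This is only an illustrative sketch: the helper names are hypothetical, and it simply builds the text and URL you would otherwise type by hand.

```python
import urllib.parse


def build_query(terms, site=None, filetype=None, exclude=()):
    """Combine search terms with site:, filetype: and minus-word operators."""
    parts = [terms]
    if site:
        parts.append(f"site:{site}")
    if filetype:
        parts.append(f"filetype:{filetype}")
    # A leading minus sign asks the search engine to leave a word out.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


def search_url(query):
    """Turn a query string into a Google search URL."""
    return "https://www.google.com/search?" + urllib.parse.urlencode({"q": query})


query = build_query("injuries", site="gov", filetype="csv")
print(query)              # injuries site:gov filetype:csv
print(search_url(query))
```

Swapping the filetype argument for "xlsx" or "pdf" repeats the same search for spreadsheets or reports.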
Lupkin and Rhodes also shared the following tools:
To track changes to a website
- Fluxguard (formerly Versionista)
- ChangeDetection
- FollowThatPage
To search for academic research
Tools for archiving and organizing content
- Webrecorder.io: records the content of a web page
- Archive.is: web page archiver
- Chrono download manager: download all files on a page
- Zotero: a free organizing tool
- Multy: creates a shareable list of links
Other free tools
- Tabula: extracts tables from PDFs
- Pinpoint: transcribes audio and video, explores and analyzes large collections of documents
- Beautiful Soup 4: scrapes websites
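To give a sense of what scraping looks like in practice, here is a rough Python sketch that lists links to downloadable data files on a page. It uses Python’s built-in html.parser rather than Beautiful Soup so that it runs without installing anything; the class name, function name and list of file extensions are illustrative assumptions, and a real project would likely need more error handling.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# File types treated as "data" here; adjust to taste.
DATA_EXTENSIONS = (".csv", ".xlsx", ".xls", ".json", ".pdf", ".zip")


class DataLinkParser(HTMLParser):
    """Collect absolute URLs of links that point to data files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(DATA_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))


def find_data_links(html, base_url):
    """Return every data-file link found in the page's HTML."""
    parser = DataLinkParser(base_url)
    parser.feed(html)
    return parser.links
```

Passing a page’s HTML and its address to find_data_links() returns a list of file URLs that could then be downloaded and archived.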
6. Pregame your FOIA research
A FOIA request is an official way to seek access to government records under the Freedom of Information Act, which took effect in 1967. But not every request for government data needs to start with formal paperwork.
“My advice is always to see what you can get [by saying], ‘Hey, I’m really interested in this. I know you keep it. Can I get it?’” Rhodes said. “A well-meaning email goes a long way, especially in the climate that we’re in right now.”
The dataset you want might have already been requested and paid for, which means you won’t need to start your request from scratch. The requests are usually captured on the agencies’ FOIA logs, which are lists of FOIA requests filed with government agencies.
For instance, you can find the FOIA logs for the Food and Drug Administration and the Department of Homeland Security.
FOIA logs aren’t always posted online, even though in the past it was standard practice, Rhodes wrote in an email after the panel.
So if you can’t find the FOIA logs on a federal website, they’ve probably been taken down from public view, but you can still file a FOIA for them, Rhodes wrote. You can do the same for state and local government agencies if their FOIA logs are not posted online.
Another great resource for FOIA requests is MuckRock, a nonprofit, collaborative news site and a “repository of hundreds of thousands of pages of original government materials, information on how to file requests and tools to make the requesting process easier,” according to its website.
7. Leverage SEC filings to uncover business data
When data is unavailable due to corporate secrecy, check public companies’ earnings reports and filings on the U.S. Securities and Exchange Commission website. SEC filings are financial statements and other regulatory documents that public companies are required to submit to the SEC.
The filings sometimes reveal contract disputes, pricing strategies or blame-shifting.
“If something is going wrong, and it reaches the level that somebody has to tell their investors about [it], you might find more information [in the SEC filings],” Lupkin said.
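Filings can also be pulled programmatically. The Python sketch below assumes the SEC’s EDGAR system offers a JSON endpoint at data.sec.gov/submissions/ keyed by a company’s 10-digit Central Index Key (CIK), with recent filings stored as parallel lists named "form", "filingDate" and "accessionNumber". Those details and the function names are assumptions to confirm against the SEC’s developer documentation, which also asks automated users to identify themselves in a User-Agent header.

```python
import json
import urllib.request

# Assumed endpoint pattern; the CIK is zero-padded to 10 digits.
SUBMISSIONS_URL = "https://data.sec.gov/submissions/CIK{cik:010d}.json"


def submissions_url(cik):
    """Build the filing-history URL for a company's CIK number."""
    return SUBMISSIONS_URL.format(cik=int(cik))


def fetch_submissions(cik, user_agent):
    """Download a company's filing history (network call)."""
    request = urllib.request.Request(
        submissions_url(cik), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=30) as resp:
        return json.load(resp)


def recent_filings(submissions, forms=("8-K", "10-K", "10-Q")):
    """Return (form, filing_date, accession_number) for the chosen form types."""
    recent = submissions["filings"]["recent"]
    rows = zip(recent["form"], recent["filingDate"], recent["accessionNumber"])
    return [row for row in rows if row[0] in forms]
```

Form 8-K is the filing companies use to report major events to investors, so filtering for it is one way to spot the kind of disclosures Lupkin describes.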
8. Use administrative data when possible
Administrative data — the digital record of various types of services that are provided by government agencies and public or private organizations — can be an imperfect but useful fallback as other data disappears, said Radley of the Center for Evidence-Based Policy.
One example is data from health insurance claims, Radley wrote in an email following the panel.
“Analysts (like me) can ask for that data to do research on the kinds of health care services that get provided,” Radley wrote.
There are other types of administrative data.
One of Radley’s projects — the Oregon Child Integrated Dataset — uses administrative data from the state’s juvenile justice system, showing when kids have contact with the system, including dates, locations, type of offense and disciplinary action, Radley wrote.
Administrative datasets aren’t easy to access because they’re always owned by an agency or company. Many times, the data includes private individual information or proprietary information like drug prices.
“Generally, to get access, you need to identify the data owner, make a request, demonstrate you have the ability to keep the data secure, etc.,” Radley wrote. “All of these things are specified in a legally binding Data Use Agreement.”
“Still, the reason why I think [administrative] data will be even more important is that it will continue to be generated, even when other federal data collection (e.g., survey) is being scaled back,” he wrote. “Fully leveraging this type of data will require new and closer collaboration between researchers and journalists.”