Expert Commentary

Researchers rush to preserve federal health databases before they disappear from government websites

Journalists have long relied on federal health data for their reporting, but several government websites have been taken down in the past week. We include several tips that journalists can use to help researchers preserve the data.

(A screen capture of what users see when attempting to visit a now-defunct White House web page announcing the establishment of the first-ever Office of Gun Violence Prevention under President Joe Biden.)

A group of researchers and students at the Harvard T.H. Chan School of Public Health is gathered today for a data preservation marathon, scraping and downloading data related to health equity from U.S. government agency websites before they disappear. Their goal is to make the downloaded data publicly available through repositories such as the Harvard Dataverse.

The new Trump administration has at least temporarily halted most communications from the Department of Health and Human Services and has begun taking down government websites, from the Spanish-language version of the White House website to the Office of Gun Violence Prevention to many pages that mention diversity, equity and inclusion (DEI) initiatives. The CDC’s Youth Risk Behavior Survey site, which monitors health behaviors of high-school students, including sexual behavior, mental health and tobacco use, is no longer available.

Health researchers worry that more of their trusted federal health databases could disappear in the coming hours and days. It’s not clear whether the changes are permanent or the websites will once again become available.

“In my lifetime, in the United States I don’t know of another situation where researchers have been this concerned about losing access to data that they’ve had access to their whole career,” says Jonathan Gilmour, a data scientist at the Chan School who is researching human health impacts of climate change. “It’s dire.”

Federal health data and publications help researchers with their studies and shed light on the state of health in the U.S. and potential public health threats. Journalists rely on the resources for their reporting.

“It’s really important to understand that we can’t have a full picture of what’s going on in the United States and around the world if we stop making data available,” Gilmour says. “I don’t want to start thinking about or listing the risks if we don’t have this data, because it imperils our way of life.”

For the first time in its 70-year history, the Morbidity and Mortality Weekly Report, the official journal of the Centers for Disease Control and Prevention, was not published last week as part of a communication pause among federal health agencies. One of the studies slated to appear in the publication was about the risk of bird flu infection among veterinarians who treat cattle, reported Amy Maxmen in KFF Health News on January 30. The MMWR, which historically has been published on Thursdays, was not published this week, either.

Data preservation efforts

The ad hoc group that organized Friday’s data marathon at the Chan School calls itself “The Preserving Public Health Data Collective,” and it’s part of a growing effort among researchers and academic institutions across the U.S. to save federal health websites and databases.

Researchers are using several methods, including downloading datasets, scraping websites and archiving pages with the Wayback Machine, an initiative of the Internet Archive, a nonprofit digital library of internet sites. The Wayback Machine enables users to see how websites looked in the past.
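
For readers who want to try the archiving step themselves, here is a minimal sketch in Python. It is not the method used by the researchers quoted in this article. It assumes the Internet Archive’s public “Save Page Now” address (web.archive.org/save/) and its availability lookup (archive.org/wayback/available). The function names and the example.gov address are hypothetical placeholders.

```python
import json
from urllib.parse import urlencode

# Public Internet Archive endpoints (assumed here; check the Archive's
# own documentation before relying on them).
WAYBACK_SAVE = "https://web.archive.org/save/"
WAYBACK_AVAILABLE = "https://archive.org/wayback/available"


def save_url(page_url: str) -> str:
    """Build the address that asks the Wayback Machine to capture a page."""
    return WAYBACK_SAVE + page_url


def availability_query(page_url: str, timestamp: str = "") -> str:
    """Build a lookup for the snapshot closest to `timestamp` (YYYYMMDD)."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return WAYBACK_AVAILABLE + "?" + urlencode(params)


def closest_snapshot(response_text: str):
    """Return the archived URL from a lookup response, or None if absent."""
    data = json.loads(response_text)
    closest = data.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


if __name__ == "__main__":
    page = "https://example.gov/data"  # hypothetical page to preserve
    print(save_url(page))
    print(availability_query(page, "20250119"))
```

Opening the first printed address in a browser requests a new capture of the page. Fetching the second returns a short JSON reply that closest_snapshot can read to find the archived copy nearest the given date.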

The changes to government websites are happening faster than researchers can keep up with.

“There’s no way of knowing how much has disappeared so far,” says Gilmour, who’s part of the Climate Change and Health Research Coordinating Center, and is working with dozens of researchers across the country to preserve health and climate data. “So many resources and tools that were environmental-justice related have disappeared in just a week.”

But researchers have been successful in preserving some data.

The Climate and Economic Justice Screening Tool, which helped identify and invest in at-risk communities, was taken down on January 22.

Anticipating that action, Gilmour and a volunteer coalition called the Public Environment Data Project made a copy of the federal website before it disappeared, uploaded it to their own website and made it available to the public.

NIH’s Ending Structural Racism website was taken down on January 22, but an archived version of the site from January 19 is available via the Wayback Machine. The website for Women’s Health Equity & Inclusion was also taken down, and an archived version of it is available as well.

Government websites do change as administrations change, but legacy datasets have mostly remained available from one administration to the next. In 2017, the Trump administration deleted almost all mentions of climate change soon after inauguration, The New York Times reported. But researchers worry that this time it’s different.

“I’m sort of dumbfounded at the pace of and scale of change we’re seeing,” Gilmour says.

For the past five presidential terms, a collaborative effort called the End of Term Web Archive has preserved what’s on United States government websites at the end of a presidential administration, using the Wayback Machine. The project is spearheaded by the Common Crawl Foundation, Environmental Data and Governance Initiative, Internet Archive, Stanford University Libraries, and University of North Texas Libraries.

“Efforts like the End of Term Archive proved especially important under the last Trump administration when science-based information about environmental issues was censored from federal websites,” according to a December 2024 post by Darya Minovi, a senior analyst at the Union of Concerned Scientists, a national science advocacy organization based in Cambridge, Mass.

A February 2021 report by the Environmental Data and Governance Initiative found that the first Trump administration changed or removed information on federal websites about water pollution and climate change about 1,400 times between 2016 and 2020. A February 2021 research article in the journal PLOS One explains several visual techniques to analyze these changes.

“Websites have become the primary means by which the U.S. federal government communicates about its operations and presents information for public consumption,” the authors write in the abstract of the study. “In formulating ways to visualize and assess the alteration of websites, our study lays important groundwork for both systematically tracking changes and holding officials more accountable for their web practices. Our techniques enable researchers and watchdog groups alike to operate at the scale necessary to understand the breadth of impact an administration can have on the online face of government.”

EDGI was formed in 2016 to document changes to federal environmental data. The organization is continuing to archive datasets, including information on EPA’s justice programs and the race and ethnicity components of the U.S. Census.

“When access to public digital information — including historical information — is reduced, the ability to effectively contribute to democracy and to make informed decisions is curtailed,” write the authors of a 2021 PLOS One article.

On Friday, as the Trump administration continued purging federal health agency websites, the Association of Health Care Journalists sent a letter to the agency, protesting the removal of the data.