How different are preprints from their published versions? 2 studies explore

Two new papers, published on Feb. 1 in PLOS Biology, add to the growing body of research that’s attempting to measure how much research papers change between the time authors post them on preprint servers and the time they’re peer reviewed and published in an academic journal.

Both studies find that most COVID-19 research papers don’t drastically change, but one of the studies also shows that about 17% of COVID-19 preprints do have major changes in their conclusions by the time they’re published, a reminder for journalists to be careful and critical when covering scientific studies.

One study, led by Liam Brierley, an epidemiologist and statistician at the University of Liverpool, manually compares 184 life science preprints with their published versions. It finds that for most preprints, only minor changes were made to the conclusions in the abstracts of their published versions. But it also finds that 17% of COVID-19 preprints had major changes in the conclusions of their abstracts when published. That’s compared with 7% of studies that were not about COVID-19.

The other study, led by David Nicholson, a doctoral candidate at the University of Pennsylvania’s Perelman School of Medicine, uses machine learning and text analysis to explore the relationships between nearly 18,000 life science preprints and their published versions. It shows that most differences between the two versions were due to changes in typesetting and the mention of supplementary materials or additional files.

Neither study explores the percentage of preprints that are never published or are retracted.

Preprints and peer review explained

Preprints are research papers that are posted by authors to a server before the formal peer review process and publication in an academic journal.

Many life science and biomedical studies, including those related to the pandemic, are posted to the health sciences server medRxiv (pronounced med-archive) and the biological sciences server bioRxiv (pronounced bio-archive). arXiv is another open-access server for papers on physics, math, computer science and economics. Overall, there are more than 60 preprint servers.

Before the COVID-19 pandemic, preprints were mostly used and discussed by scientists. But since early 2020, the number of preprints posted on servers has grown exponentially. Preprint studies have been discussed on social media and covered by traditional media, and they have influenced public opinion and policy.

Peer review is a process that research papers, including preprints, go through in order to be published in an academic journal. The journals’ editors take advice from experts, also called referees, who assess the study. The articles are typically published only after the authors have addressed referees’ concerns and the journal editors are satisfied, according to an explainer on medRxiv.

The peer review process usually takes months, and sometimes more than a year. During the pandemic, many researchers needed to communicate their findings quickly. So they turned to preprint servers, where they can upload their papers and reach a wide audience.

Many preprints are eventually peer reviewed and published. According to some estimates, about two-thirds to three-quarters of biomedical preprints are eventually published in academic journals. But with the rapid growth of preprints, there are more discussions around the peer-review process and its role. Researchers are also exploring questions about the reliability of preprints because their conclusions might change after the peer-review process.

There’s no simple answer to the ongoing discussion, but the bottom line for journalists stays the same: take a pause and scrutinize any study you plan on reporting.

“Before covering a preprint, or any unreviewed or preliminary research, ask yourself: ‘Do the benefits for my audience outweigh the potential risks?’” advised Alice Fleerackers, a researcher at Simon Fraser University’s Scholarly Communications Lab who has been studying the preprint landscape, in an email interview. She was not involved in the studies discussed in this piece.

And remember that “the peer review is not a silver bullet quality control mechanism,” Fleerackers added. “Journalists should be careful and critical when covering any scientific research, peer reviewed or not.”

What the two studies show

Tracking Changes Between Preprint Posting and Journal Publication During a Pandemic
Liam Brierley; et al. PLOS Biology, February 2022.

The study’s aim: Researchers wanted to find out whether preprints withstand the scrutiny of the peer review process and whether their conclusions change by the time they’re published.

How they did it: The team identified 105 COVID-19 preprints that were posted on bioRxiv and medRxiv servers between January and April 2020, as well as 105 non-COVID-19 preprints posted between September 2019 and April 2020, and that were eventually published in a peer-reviewed journal. After excluding several studies for various reasons such as lacking an abstract in the published version, they narrowed down the total to 184 preprint-published study pairs. They then used a computer program and the Microsoft Word Track Changes feature to compare the text of the abstract in the preprint and corresponding published version.

Researchers didn’t analyze the entire text of the article and used the abstracts instead. Abstracts are considered the first port of call for most readers. They often contain the summary of the key findings and main conclusions of the study and are freely accessible, even for journals that have paywalls.
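To make the idea of comparing two abstracts concrete, here is a minimal sketch of a word-level text comparison. It uses Python’s standard difflib module as a stand-in; it is not the program the researchers used, and the two sample abstracts are invented for illustration. Note that a similarity score only measures how much wording changed. Judging whether a conclusion changed still takes a human reader.

```python
import difflib


def abstract_changes(preprint: str, published: str):
    """Return a word-level similarity ratio and a list of changed spans."""
    a, b = preprint.split(), published.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record the kind of edit plus the old and new wording.
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return matcher.ratio(), changes


# Hypothetical abstracts, invented for illustration only.
preprint_abstract = "The drug reduced mortality in 40 patients."
published_abstract = "The drug may reduce mortality in 120 patients."

ratio, changes = abstract_changes(preprint_abstract, published_abstract)
print(f"similarity: {ratio:.2f}")
for tag, old, new in changes:
    print(f"{tag}: '{old}' -> '{new}'")
```

In this made-up pair, the flagged edits are a softened verb and a larger sample size, the kind of change the study describes as more conservative language or increased sample sizes.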

What they found: Overall, the study shows that preprints were most often published with only minor changes to the conclusions in their abstracts. This suggests that the publication process has a minimal but beneficial effect on preprints by increasing sample sizes or statistics or by making author language more conservative, the authors write. The study also shows:

Overall, most abstracts are comparable between the preprinted and published article, but COVID-19 articles underwent greater textual changes in their abstracts compared with non-COVID-19 articles. Specifically, 17.2% of COVID-19 abstracts had major changes in their conclusions compared with 7.2% of non-COVID-19 abstracts.
More than 85% of preprints didn’t have any changes in authorship when published. However, COVID-19 preprints were almost three times as likely to have additional authors when published, compared with non-COVID-19 preprints (17.2% versus 6.2%).
On average there was no difference in the total number of figures and tables when comparing preprints with their published version. The authors also find that in more than two-thirds of published studies, the content of the figures didn’t change. But in 23%, there was significant content added or removed.
The team also investigated whether public discourse about preprints, such as discussion on Twitter, was related to changes to the abstract or figures in the published version. Overall, they didn’t find a strong correlation between the number of comments or tweets and the amount of change in the published version.
Also of note, they report COVID-19 preprints don’t always share their data publicly and many authors provide data only upon request. Also, many published articles had faulty hyperlinks to the supporting information. “The biggest surprise was how difficult it was to access supplemental data in published papers — many of the links on journal websites were broken or looped back to the main paper,” wrote Jonathon Alexis Coates, one of the study’s seven co-authors, in an email interview.

Findings apply to more recent preprints. Coates, a post-doctoral researcher at Queen Mary University of London, started a podcast in 2021 called Preprints in Motion, where he discusses preprints with the authors. “Through this, and my observations of using preprints, it definitely appears that our data holds up and that there would probably not be significant differences if we analyzed pairs from 2021,” Coates said in his email. “More, scientifically, we included a control data set of non-COVID preprints that were posted and published during the same time period (or as close as we could get). This data showed a similar pattern to the COVID preprints, suggesting that the results are applicable beyond pandemic-related work.”

The bigger picture: The pandemic has had some impact on the scientific community’s view of preprints, Coates wrote in his email. “Preprints are much more accepted and scientists within the biosciences have a greater awareness and understanding of preprints generally,” he wrote. “Many had positive experiences posting pandemic-related preprints and have, anecdotally stated they will preprint again in the future. I have also noticed that more scientists appear to be actively thinking about the publication process and how it needs to change which I think is a big positive.” He added that the study is not a direct comment on the peer review process.

What other experts say about the study: Fleerackers said it was “shocking” to see that 17.2% of COVID-19 preprints underwent major changes in their conclusions. “This is a scary finding, considering how much preprints have been used in pandemic reporting and policy decisions,” she wrote in her email. “This has major implications for journalists who rely on these preprints in their reporting, and for audiences who try to make health decisions based on this unreviewed evidence. It’s also important for researchers who cite and build on these results in their research.”

Advice to journalists: “For journalists covering preprints, I would consider focusing on big picture findings rather than specific statistics, contextualizing any results within a larger body of evidence, and emphasizing that these findings could change in future — as is the case with all science,” wrote Fleerackers. “[Do] your homework as a journalist: read the methods and limitations critically and seek opinions from independent researchers, particularly on those parts of the manuscript that you don’t have the expertise to vet yourself. This is true for all research, but the results of this study suggest it may be even more important when covering preprints, particularly those about COVID-19.”

Study limitations: Because researchers didn’t compare the entire content of the studies, it’s not clear whether changes in the abstract reflect changes throughout the manuscript. Researchers also add:

They looked at a small sample of studies that were published in academic journals shortly after they were posted on preprint servers, so the study excludes preprints that may have been published more slowly.
Because of its short time frame, the study doesn’t use unpublished preprints as a control. This comparison would provide stronger and more direct findings on the role of journal peer review and the reliability of preprints, the authors note.
In addition, the study doesn’t measure how many of the changes were introduced by the peer-review process, since it is difficult to determine when the preprint was posted relative to submission to the journal. Some preprints are revised and posted on servers several times before they are published.
Researchers also note that it is difficult to objectively compare two versions of a research paper. For example, abstracts that had many changes in the text, such as rearrangements of words, didn’t have a meaningful change in their conclusion.

Conflicts of Interest: The researchers report that one of the authors is the executive director of ASAPbio, a non-profit organization promoting the productive use of preprints in the life sciences. Another is a bioRxiv affiliate, part of a volunteer group of scientists that screen preprints deposited on the bioRxiv server.

Examining Linguistic Shifts Between Preprints and Publications
David Nicholson; et al. PLOS Biology, February 2022.

The study’s goal: The team wanted to compare the text of preprints in bioRxiv and their corresponding published studies to examine how peer review changes the documents. They used computer programs to analyze and compare the texts. Researchers also used the programs to identify similar papers and journals. It’s important to note that the study doesn’t investigate similarities in results and conclusions.

How they did it: On Jan. 31, 2020, they downloaded a snapshot of PubMed Central, an open access digital archive of full-text, peer-reviewed biomedical and life science research that is part of the U.S. National Library of Medicine. They also downloaded a snapshot of the content of bioRxiv on Feb. 3, 2020, and a snapshot of the New York Times Article Archives on July 7, 2020, to serve as a representative sample of general English text. They then identified bioRxiv preprints that had a published article, linking 17,952 preprints with a published version in PubMed.

What they found: Over 77% of bioRxiv preprints had a corresponding published version. “This suggests that most work from groups participating in the preprint ecosystem is now available in final form for literature mining and other applications,” Nicholson and co-authors write. They also find:

A subset of preprints posted in the first four months of the pandemic were published faster than the broader set of bioRxiv preprints, showing that peer review was accelerated.
The most common change between a preprint and published version was typesetting and the mention of supporting information or additional materials.
Preprints that had more versions posted on the servers and more changes in text took longer to publish. Every additional preprint version was associated with an increase of 51 days before a preprint was published. This suggests that authors may be updating their preprints in response to peer reviews or other external feedback.
The team also created the Preprint Similarity Search website, which sifts through 1.7 million PubMed Central open access documents and lets users find the 10 papers and journals that are most similar to the textual content of a bioRxiv or medRxiv preprint.
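For readers curious how a text-similarity search can work in principle, here is a minimal sketch that ranks documents by how closely their wording matches a query text. This piece doesn’t describe how the Preprint Similarity Search ranks documents, so the sketch swaps in a simple technique, bag-of-words cosine similarity, purely as an assumption. The function names and the three-document corpus are hypothetical.

```python
import math
from collections import Counter


def vectorize(text: str) -> Counter:
    """Turn text into lowercased word counts (a bag of words)."""
    return Counter(text.lower().split())


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def most_similar(query: str, corpus: dict, k: int = 10) -> list:
    """Return the k (document id, score) pairs closest to the query."""
    q = vectorize(query)
    scored = [(doc_id, cosine(q, vectorize(text))) for doc_id, text in corpus.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical mini-corpus, invented for illustration only.
corpus = {
    "paper_a": "spike protein binding in coronavirus infection",
    "paper_b": "deep learning for protein structure prediction",
    "paper_c": "soil bacteria and nitrogen cycles",
}
results = most_similar("coronavirus spike protein structure", corpus, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

Word counts are the simplest possible representation. A tool searching millions of documents would likely need something more sophisticated, but the ranking step, scoring every document against the query and keeping the top matches, has the same shape.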

What other experts say about the study: “I am surprised that more than three-quarters (77%) of the preprints they analyzed are now available in a peer reviewed journal,” wrote Fleerackers in an email. “This is higher than what was found in previous studies, and is likely an underestimate of the true number of preprints that are eventually published. For journalists, this is a really important takeaway as it answers another key question that could influence their decision to cover preprints: ‘How often do preprints actually get published in peer reviewed journals?’”

Fleerackers added that the Nicholson study adds more context to the study by the Brierley team because it looks at a much larger number of preprints and their corresponding published versions.

Advice to journalists: “Treat preprints with a grain of salt, but treat peer-reviewed publications that way too,” said Casey Greene, one of Nicholson’s five co-authors, who is a professor in the Department of Biochemistry and Molecular Genetics and the founding director of the Center for Health AI at the University of Colorado School of Medicine. “Ultimately, both are simply steps along our path to better understanding the world around us.”

Conflicts of Interest: Researchers report that one author receives a salary from Elsevier, a Netherlands-based publishing company specializing in scientific, technical and medical content.

What other studies show

In “Comparing published scientific journal articles to their pre-print versions,” published in the International Journal on Digital Libraries in February 2018, researchers compared the text of the title, abstract and body of preprints posted on the arXiv and bioRxiv servers with their published versions. They ended up with 12,202 preprints posted between 2003 and 2015 on arXiv and 2,516 posted between 2013 and 2016 on bioRxiv that had a final published version. Their analysis shows that “the text contents of the scientific papers generally changed very little from their pre-print to final published versions.”

In “Cross-sectional study of preprints and final journal publications from COVID-19 studies: discrepancies in results reporting and spin in interpretation,” published in BMJ Open in July 2021, researchers compare preprints and final journal publications for 67 COVID-19 studies and find that one-third had no discrepancies in results. About a quarter had at least one outcome that was included in the journal publication but not in the preprint.

In 12% of the studies, at least one outcome was reported in the preprint only.

They also evaluated the studies for spin, which refers to specific reporting practices that distort the interpretation of results so that results are viewed more favorably.

“The COVID-19 preprints and their subsequent journal publications were largely similar in reporting of study characteristics, outcomes and spin,” the authors write. “All COVID-19 studies published as preprints and journal publications should be critically evaluated for discrepancies and spin.”

Meanwhile, in “COVID-19 randomized controlled trials in medRxiv and PubMed,” a small study published in the European Journal of Internal Medicine in November 2020, researchers compare the full text of 13 preprint studies posted on medRxiv with 16 published studies in PubMed, all of which were about COVID-19 and were randomized controlled trials. The preprint studies were not related to the published studies.

Their analysis shows an increased rate of spin — positive terms used in the title or abstract section — in preprints compared with published studies. “Readers should pay attention to the overstatements in preprints of [randomized controlled trials],” the authors write.
