Does science face a reproducibility crisis as a result of AI?

December 11, 2023

Scientists fear that the improper application of artificial intelligence is producing a flood of faulty or pointless research.

During the COVID-19 pandemic in late 2020, diagnostic kits for the viral infection were scarce in some countries. It therefore seemed reasonable to try to diagnose infection using a widely available medical technique: chest X-rays. A team in India reported that artificial intelligence (AI) could do this successfully, distinguishing infected from non-infected people by applying machine learning to a batch of X-ray images.

The study, one of many to explore the idea, has received more than 900 citations. But in September of the following year, computer scientists Lior Shamir and Sanchari Dhar at Kansas State University in Manhattan, Kansas, took a closer look2. They trained a machine-learning system on the same images, but used only blank background sections that contained no visible body parts. Their AI could still pick out COVID-19 cases at rates well above chance.

The problem seems to be that the backgrounds of the medical images in the data set varied over time. A machine-learning system can pick up on those artefacts and succeed at the diagnostic task without learning any clinically relevant features, which makes it medically useless.
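As a rough illustration of this kind of control experiment (a sketch, not the authors' actual code), one can crop a corner of each image that contains no anatomy, train an off-the-shelf classifier on those patches alone, and check whether it still beats chance. The folder names and patch coordinates below are hypothetical; accuracy well above chance on blank patches would suggest that artefacts in the data set, not anything clinical, are giving away the labels.

```python
# Sketch of a background-only control test (hypothetical file layout and patch size).
import numpy as np
from pathlib import Path
from PIL import Image
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def background_patch(path, box=(0, 0, 32, 32)):
    """Crop a corner patch that shows no body parts and flatten it to a vector."""
    img = Image.open(path).convert("L").resize((256, 256))
    return np.asarray(img.crop(box), dtype=np.float32).ravel()

X, y = [], []
for label, folder in enumerate(["xrays/negative", "xrays/positive"]):  # hypothetical folders
    for path in Path(folder).glob("*.png"):
        X.append(background_patch(path))
        y.append(label)

scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                         np.array(X), np.array(y), cv=5)
print(f"Accuracy from blank backgrounds alone: {scores.mean():.2f}")
# A score well above 0.5 (for balanced classes) means the data set itself
# betrays the label -- the model need not look at the lungs at all.
```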

Shamir and Dhar found several other cases, from cell types to face recognition, in which supposedly successful AI image classification produced similar results from meaningless or blank parts of the images: the algorithms beat chance at recognizing faces without faces and cells without cells. Some of these papers have been cited hundreds of times.

“These instances can seem entertaining, but in biomedicine, incorrect classification could mean the difference between life and death,” Shamir says. “The issue is far more widespread than most of my colleagues would like to think.” A separate review in 2021 examined 62 studies that used machine learning to diagnose COVID-19 from chest X-rays or computed tomography scans; it concluded that none of the AI models was clinically useful, because of methodological flaws or biases in the image data sets3.

What Shamir and Dhar found are just a few of the ways in which machine learning can lead to false claims in research. Earlier this year, Sayash Kapoor and Arvind Narayanan, computer scientists at Princeton University in New Jersey, reported that data leakage (insufficient separation between the data used to train an AI system and the data used to test it) has caused reproducibility problems in 17 fields they examined, affecting hundreds of papers. They argue that the irresponsible use of AI is creating a reproducibility crisis.
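To make the idea of data leakage concrete, here is a minimal synthetic sketch (not Kapoor and Narayanan's analysis) of one common form of it: fitting a preprocessing step, in this case feature selection, on the full data set before cross-validation. On pure noise, the leaky pipeline looks impressively accurate, while the correctly nested pipeline hovers around chance.

```python
# Synthetic demonstration of data leakage via feature selection (illustrative only).
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2000))   # noise features with no real signal
y = rng.integers(0, 2, size=200)   # random labels: true accuracy should be ~0.5

# Leaky: the "best" features are chosen using every label, including the test folds.
X_leaky = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

# Correct: feature selection is refitted inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5)

print(f"with leakage:    {leaky.mean():.2f}")  # inflated, typically well above 0.5
print(f"without leakage: {clean.mean():.2f}")  # close to chance, as it should be
```

The exact numbers are beside the point; the pattern is that any step that sees the test data before evaluation can quietly inflate reported performance.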

Machine learning (ML) and other forms of AI are powerful statistical tools that have advanced almost every area of science by spotting patterns in data that are often invisible to human researchers. At the same time, some researchers worry that the improper use of AI software is driving a deluge of papers whose claims are wrong, of no practical use, or impossible to verify.

Overcoming the issue

In August this year, Kapoor, Narayanan and others put forward a set of guidelines for reporting AI-based science9. Their checklist of 32 questions covers topics such as data quality, modelling details and the risks of data leakage; they describe it as providing “a cross-disciplinary bar for reporting standards in ML-based science”. Checklists have also been developed for specific fields, such as the life sciences10 and chemistry11.

Many argue that research papers that use AI should disclose their data and methods in full. A 2019 study by data scientist Edward Raff of the Virginia-based analytics company Booz Allen Hamilton found that only 63.5% of 255 publications using AI techniques could be reproduced exactly as written. Computer scientist Joelle Pineau of McGill University in Montreal, Canada, who is also vice-president of AI research at Meta, and others later reported that reproducibility rises to 85% when the original authors actively help by providing data and code12. With that in mind, Pineau and her colleagues proposed a protocol for papers that use AI techniques, which requires the source code to be submitted with the manuscript and, much like Kapoor and Narayanan’s proposals, assessed against a standard ML reproducibility checklist13.

Some computer-science conferences already require code to be made public before a peer-reviewed proceedings paper can be published, but this is not yet the case across the board. It is a mixed bag, although the major conferences take it more seriously, according to Vanschoren.

A contributing factor could be that there are not enough data available to test the models properly. “Researchers may publish low-quality results that show great performance if there aren’t enough public data sets to evaluate their models correctly,” says Joseph Cohen, a scientist at Amazon AWS Health AI who also directs the US-based non-profit Institute for Reproducible Research. “This is a serious problem for medical research.”

Jain is particularly worried that generative AI could be deliberately exploited as a quick and simple way to produce images that look authentically scientific. “It’s difficult to ignore the worry that there might be more integrity problems in science,” he says.

Changes in culture

Some researchers believe that the only effective fix is to change cultural norms around how data are reported and presented. Haibe-Kains is not optimistic that such a transformation will be easy to engineer. He and his colleagues criticized a high-profile 2020 paper by a team that included Google Health researchers, which explored the use of ML to detect breast cancer in mammograms14. “The lack of adequately documented methods and computer code underlying the study effectively undermines its scientific value,” Haibe-Kains and his co-authors concluded15.

The study’s authors, however, responded publicly that they could not release all of the data because some of it came from a US institution whose privacy constraints prevented sharing. They added that they “strove to keep the paper accessible to a clinical and general scientific audience while documenting all relevant machine learning methods”16.

More broadly, Varoquaux and computer scientist Veronika Cheplygina of the IT University of Copenhagen have argued that current publishing incentives, in particular the pressure to produce attention-grabbing headlines, work against the reliability of AI-based findings17. Haibe-Kains adds that journal editors often fail to challenge authors sufficiently, and that authors do not always “play the game in good faith” when it comes to data-transparency rules.

The problem, Haibe-Kains argues, is not so much editors who ignore transparency rules as editors and reviewers who may be “poorly educated on the real versus fictitious obstacles for sharing data, code, and so on, so they tend to be content with very shallow, unreasonable justifications [for not sharing such information]”. Authors, meanwhile, may simply not know what is needed to make their work consistent and reproducible. “If you don’t fully understand what you are doing, it’s hard to be completely transparent,” Bennett says.

Opinions on whether peer review is adequate for journal papers involving AI were split in this year’s Nature survey of more than 1,600 researchers. Among scientists who used AI in their work, 25% felt that reviews were adequate, 25% felt they were not, and roughly half said they did not know (see ‘Quality of AI review in research publications’ and Nature 621, 672–675; 2023).
