The scientific revolution has been fueled by using data to test theories, so it might be assumed that big data has now created a golden age for science. If anything, the opposite is true.
ChatGPT and other large language models (LLMs) have drawn considerable attention as potential disruptors of higher education, but they are only one of the many challenges that the rapid growth and widespread adoption of big data pose for academia and research.
When ChatGPT was released publicly on November 30, 2022, students and educators recognized almost immediately that LLMs can be used to do homework, take tests, and write essays. A possible silver lining is that many instructors may modify their teaching. Instead of relying on multiple-choice tests and descriptive essays, at which LLMs excel, teachers may focus on the critical thinking skills that students need and LLMs lack because LLMs literally do not understand what words mean. They are very much like the young savant who could recite every word in all six volumes of The History of the Decline and Fall of the Roman Empire without comprehending any of the content.
Professors, too, may be tempted to use LLMs to write papers for them. Computer-generated papers are not new. In 2012, Cyril Labbé and Guillaume Cabanac reported discovering 243 published papers that had been written entirely or in part by SCIgen, a computer program that uses randomly chosen words to generate sham computer science papers. The 19 publishers involved claim that their journals employ rigorous peer review, but even a casual reading of a SCIgen paper reveals it to be nonsense.
The prevalence of completely fabricated papers is now increasing because LLMs generate articulate papers that generally must be read carefully to detect the deception, and reviewers have little incentive to read carefully. Even papers clearly written by LLMs can slip through the review process. One paper published in an Elsevier journal began, “Certainly, here is a possible introduction for your topic,” while another Elsevier paper included this: “I’m very sorry, but I don’t have access to real-time information or patient-specific data, as I am an AI language model.” It is, of course, more difficult to detect LLM-generated papers that have been stripped of such obvious markers.
The assault on science posed by big data goes far beyond LLMs. Much research rests on the reasonable premise that researchers should assess whether their results might plausibly be explained by chance, for example by the luck of the draw when subjects are randomly assigned to treatment and control groups. The standard assessment tool is the p-value: the probability of observing, by chance alone, effects equal to or larger than those actually observed.
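To make the definition concrete, here is a minimal sketch in Python (my illustration, using hypothetical numbers rather than data from any study cited here) that estimates a p-value by simulation: the treatment and control labels are reshuffled many times to see how often chance alone produces a difference at least as large as the one observed.

# Estimate a two-sided p-value by permutation: how often does random
# relabeling of subjects produce a difference in means at least as large
# as the observed difference? (Hypothetical data, for illustration only.)
import random

treatment = [5.1, 6.3, 5.8, 6.0, 6.7, 5.9]   # hypothetical treatment outcomes
control   = [5.0, 5.2, 5.6, 4.9, 5.4, 5.3]   # hypothetical control outcomes

observed = sum(treatment) / len(treatment) - sum(control) / len(control)

pooled = treatment + control
trials, count = 100_000, 0
for _ in range(trials):
    random.shuffle(pooled)                    # randomly reassign labels
    fake_t = pooled[:len(treatment)]
    fake_c = pooled[len(treatment):]
    diff = sum(fake_t) / len(fake_t) - sum(fake_c) / len(fake_c)
    if abs(diff) >= abs(observed):            # at least as extreme as observed
        count += 1

print(f"Observed difference: {observed:.2f}")
print(f"Two-sided p-value (by simulation): {count / trials:.3f}")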
Sir Ronald Fisher famously endorsed a 5 percent cutoff for results to be considered statistically significant: “It is convenient to draw the line at about the level at which we can say: ‘Either there is something in the treatment, or a coincidence has occurred.’ … Personally, the writer prefers to set a low standard of significance at the 5 percent point, and ignore entirely all results which fail to reach this level.”
However, as Goodhart’s Law predicts, “When a measure becomes a target, it ceases to be a good measure.” Researcher efforts to get p-values below 5 percent have undermined the usefulness of p-values.
One strategy is p-hacking, or massaging the model and data until the p-value dips below 5 percent. For example, a study reporting that Asian-Americans are prone to heart attacks on the fourth day of the month omitted data that contradicted that conclusion. So did a study claiming that female-named hurricanes are deadlier than male-named hurricanes, and a study asserting that power poses (for example, hands on hips) can increase testosterone and reduce cortisol. As Nobel laureate Ronald Coase cynically observed, “If you torture data long enough, they will confess.”
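The mechanics are easy to demonstrate. The following minimal sketch in Python (my illustration, not the code behind any of the studies above) generates purely random data with no true effect, tests 20 different specifications, and reports only the smallest p-value; “significant” results then appear far more often than the nominal 5 percent of the time.

# With no true effect, trying 20 specifications (subgroups, outcome measures,
# model variants) and keeping only the smallest p-value yields "significance"
# in roughly 1 - 0.95**20, or about 64 percent, of experiments.
import random
from math import erf, sqrt
from statistics import mean, stdev

def p_value(x, y):
    """Crude two-sided p-value for a difference in means (normal approximation)."""
    se = sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    z = abs(mean(x) - mean(y)) / se
    return 2 * (1 - 0.5 * (1 + erf(z / sqrt(2))))

random.seed(1)
experiments, hacked_hits = 1000, 0
for _ in range(experiments):
    p_values = []
    for _ in range(20):                       # 20 specifications, all pure noise
        treated = [random.gauss(0, 1) for _ in range(30)]
        control = [random.gauss(0, 1) for _ in range(30)]
        p_values.append(p_value(treated, control))
    if min(p_values) < 0.05:                  # report only the "best" specification
        hacked_hits += 1

print(f"Share of null experiments reported as significant: {hacked_hits / experiments:.0%}")

(In real p-hacking, the specifications reuse the same data, so the tests are correlated, but the inflation of false positives is still severe.)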
A second strategy is HARKing (“hypothesizing after the results are known”), or looking for statistical patterns with no particular model in mind. For example, a study sponsored by the US National Bureau of Economic Research looked at the correlations between bitcoin returns and 810 variables, including the Canadian dollar vs. US dollar exchange rate, the price of crude oil, and stock returns in the automobile, book, and beer industries. Of these 810 correlations, 63 had p-values below 10 percent, which is fewer than the 81 (10 percent of 810) that would be expected by chance alone if bitcoin returns had simply been correlated with random numbers.
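That arithmetic is easy to check with a simulation. The sketch below (my illustration, not the NBER study’s code, and assuming Python 3.10 or later for statistics.correlation) correlates a purely random “return” series with 810 equally random candidate variables; roughly 10 percent of the correlations, about 81, come out with p-values below 10 percent by chance alone.

# Correlate a random series with 810 random candidate variables and count
# how many correlations are "significant" at the 10 percent level.
import random
from math import atanh, sqrt
from statistics import NormalDist, correlation

random.seed(0)
n_obs, n_vars = 200, 810
returns = [random.gauss(0, 1) for _ in range(n_obs)]       # stand-in for bitcoin returns

significant = 0
for _ in range(n_vars):
    candidate = [random.gauss(0, 1) for _ in range(n_obs)]  # a meaningless predictor
    r = correlation(returns, candidate)
    z = atanh(r) * sqrt(n_obs - 3)                          # Fisher transformation
    p = 2 * (1 - NormalDist().cdf(abs(z)))                  # approximate two-sided p-value
    if p < 0.10:
        significant += 1

print(f"{significant} of {n_vars} random correlations have p < 0.10")
print(f"Expected by chance alone: about {int(0.10 * n_vars)}")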
P-hacking and HARKing have contributed to the replication crisis that is undermining the credibility of scientific research. Far too many media-friendly studies have been discredited when tested with fresh data. All four of the studies mentioned above were published in top journals. All four failed to replicate.
To gauge the extent of the crisis, a team led by Brian Nosek tried to replicate 100 studies published in three premier psychology journals; 64 failed. Teams led by Colin Camerer reexamined 18 experimental economics papers published in two top economics journals and 21 experimental social science studies published in Nature and Science; 40 percent did not replicate.
While Nosek’s Reproducibility Project was underway, auction markets for the 44 studies that had not yet been completed allowed researchers to bet on whether a replication would be successful, defined as a result with a p-value less than 5 percent and in the same direction as the original result. Forty-six percent of the studies were given less than a 50 percent chance of replicating. Even that pessimistic expectation turned out to be too optimistic, as 61 percent did not replicate.
Bogus papers, p-hacking, and HARKing have been around for decades but modern computers and big data have contributed to the replication crisis by making these flawed practices convenient and powerful.
LLMs trained on enormous text databases can write remarkably articulate fake papers and do so faster than any human. Large databases also facilitate systematic p-hacking by providing many more ways in which data can be manipulated until a desired statistically significant result is obtained. Big data also creates an essentially unlimited number of ways to look for patterns—any patterns—until something statistically significant is found. In each case, the results are flawed and unlikely to replicate.
Several steps can be taken to help restore the luster of science.
First, journals should not publish empirical research until authors have made all nonconfidential data and methods available publicly. (Many journals now require authors to share their data and methods after publication, but this requirement is not easily enforced and is too often ignored.)
Second, journals should compensate reviewers who do careful, thorough appraisals. Postpublication reproducibility and replication studies might be supported by private or public grants and required by universities for a PhD or other degree in an empirical field. If researchers know that their papers may be checked, they may be more careful.
It will not be easy to protect science from the temptations created by big data but it is a battle worth fighting.
Gary N. Smith is the Fletcher Jones professor of economics at Pomona College, California, United States. E-mail: [email protected]. Website: http://garysmithn.com.