If John Carlisle had a cat flap, scientific fraudsters might rest easier at night. Carlisle routinely rises at 4.30 a.m. to let out Wizard, the family pet. Then, unable to sleep, he reaches for his laptop and starts typing up data from published papers on clinical trials. Before his wife's alarm clock sounds 90 minutes later, he has usually managed to fill a spreadsheet with the ages, weights and heights of hundreds of people – some of whom, he suspects, never actually existed.
By day, Carlisle is an anaesthetist working for England's National Health Service in the seaside town of Torquay. But in his spare time, he roots around the scientific record for suspect data in clinical research. Over the past decade, his sleuthing has included trials used to investigate a wide range of health issues, from the benefits of specific diets to guidelines for hospital treatment. It has led to hundreds of papers being retracted and corrected, because of both misconduct and mistakes. And it has helped to end the careers of some large-scale fakers: of the six scientists worldwide with the most retractions, three were brought down using variants of Carlisle's data analyses.
"His technique has been shown to be incredibly useful," says Paul Myles, director of anaesthesia and perioperative medicine at the Alfred hospital in Melbourne, Australia, who has worked with Carlisle to examine research papers containing dodgy statistics. "He's used it to demonstrate some major examples of fraud."
Carlisle's statistical sideline is not popular with everyone. Critics argue that it has sometimes led to the questioning of papers that aren't obviously flawed, resulting in unjustified suspicion.
But Carlisle believes that he is helping to protect patients, which is why he spends his spare time poring over others' studies. "I do it because my curiosity motivates me to do so," he says, not because of an overwhelming zeal to uncover wrongdoing: "It's important not to become a crusader against misconduct."
Together with the work of other researchers who doggedly check academic papers, his efforts suggest that the gatekeepers of science – journals and institutions – could be doing much more to spot mistakes. In medical trials, the kind that Carlisle focuses on, that can be a matter of life and death.
Anaesthetists behaving badly
Torquay looks like any other traditional provincial English town, with pretty floral displays on the roundabouts and just enough pastel-coloured cottages to catch the eye. Carlisle has lived in the area for 18 years and works at the town鈥檚 general hospital. In an empty operating theatre, after a patient has just been stitched up and wheeled away, he explains how he began to look for faked data in medical research.
More than ten years ago, Carlisle and other anaesthesiologists began chattering about results published by a Japanese researcher, Yoshitaka Fujii. In a series of randomized controlled trials (RCTs), Fujii, who then worked at Toho University in Tokyo, claimed to have examined the impact of various medicines on preventing vomiting and nausea in patients after surgery. But the data looked too clean to be true. Carlisle, one among many concerned, decided to check the figures, using statistical tests to pick up unlikely patterns in the data. He showed in 2012 that, in many cases, the likelihood of the patterns having arisen by chance was "infinitesimally small"1. Prompted in part by this analysis, journal editors asked Fujii's present and former universities to investigate; Fujii was fired from Toho University in 2012 and had 183 of his papers retracted, an all-time record. Four years later, Carlisle co-published an analysis of results from another Japanese anaesthesiologist, Yuhji Saitoh – a frequent co-author of Fujii's – and demonstrated that his data were extremely suspicious, too2. Saitoh currently has 53 retractions.
Statistical errors
Other researchers soon cited Carlisle鈥檚 work in their own analyses, which used variants of his approach. In 2016, researchers in New Zealand and the United Kingdom, for example, reported problems in papers by Yoshihiro Sato, a bone researcher at a hospital in southern Japan3. That ultimately led to 27 retractions, and 66 Sato-authored papers have been retracted in total.
Anaesthesia had been rocked by several fraud scandals before Fujii and Saitoh's cases – including that of German anaesthetist Joachim Boldt, who has had more than 90 papers retracted. But Carlisle began to wonder whether only his own field was at fault. So he picked eight leading journals and, working in his spare moments, checked through thousands of randomized trials they had published.
In 2017, he published an analysis in the journal Anaesthesia stating that he had found suspect data in 90 of more than 5,000 trials published over 16 years4. At least ten of these papers have since been retracted and six corrected, including a high-profile study published in The New England Journal of Medicine (NEJM) on the health benefits of the Mediterranean diet. In that case, however, there was no suggestion of fraud: the authors had made a mistake in how they randomized participants. After the authors removed erroneous data, the paper was republished with similar conclusions5.
Carlisle has kept going. This year, he warned about dozens of anaesthesia studies by an Italian surgeon, Mario Schietroma at the University of L'Aquila in central Italy, saying that they were not a reliable basis for clinical practice6. Myles, who worked on the report with Carlisle, had raised the alarm last year after spotting suspicious similarities in the raw data for control and patient groups in five of Schietroma's papers.
The challenges to Schietroma's claims have had an impact in hospitals around the globe. The World Health Organization (WHO) cited Schietroma's work when, in 2016, it issued a recommendation that anaesthetists should routinely boost the oxygen levels they deliver to patients during and after surgery, to help reduce infection. That was a controversial call: anaesthetists know that in some procedures, too much oxygen can be associated with an increased risk of complications – and the recommendations would have meant hospitals in poorer countries spending more of their budgets on expensive bottled oxygen, Myles says.
The five papers Myles warned about were quickly retracted, and the WHO revised its recommendation from 'strong' to 'conditional', meaning that clinicians have more freedom to make different choices for various patients. Schietroma says his calculations were assessed by an independent statistician and through peer review, and that he purposely selected similar groups of patients, so it's not surprising if the data closely match. He also says he lost raw data and documents related to the trials when L'Aquila was struck by an earthquake in 2009. A spokesperson for the university says it has left enquiries to "the competent investigating bodies", but did not explain which bodies those were or whether any investigations were under way.
Spotting unnatural data
The essence of Carlisle's approach is nothing new, he says: it's simply that real-life data have natural patterns that artificial data struggle to replicate. Such phenomena were spotted in the 1880s, were popularized by the US electrical engineer and physicist Frank Benford in 1938, and have since been used by many statistical checkers. Political scientists, for example, have long used a similar approach to analyse survey data – a technique they call Stouffer's method after sociologist Samuel Stouffer, who popularized it in the 1950s.
In the case of RCTs, Carlisle looks at the baseline measurements that describe the characteristics of the groups of volunteers in the trial, typically the control group and the intervention group. These include height, weight and relevant physiological characteristics – usually described in the first table of a paper.
In a genuine RCT, volunteers are randomly allocated to the control or (one or more) intervention groups. As a result, the mean and the standard deviation for each characteristic should be about the same across the groups – but not too similar. Values that match almost perfectly would be suspicious.
Carlisle first constructs a P value for each pairing: a statistical measurement of how likely the reported baseline data points are if one assumes that volunteers were, in fact, randomly allocated to each group. He then pools all these P values to get a sense of how random the measurements are overall. A combined P value that looks too high suggests that the data are suspiciously well-balanced; too low and it could show that the patients have been randomized incorrectly.
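A minimal sketch of such a check, in Python, might look like the fragment below. It assumes the trial's first table reports a mean, standard deviation and group size for each baseline variable, derives each per-variable P value from a two-sample t-test on those summary statistics, and pools them with Stouffer's method; the numbers are invented, and the whole thing is an illustration of the general idea rather than Carlisle's published procedure.

```python
# Illustrative, simplified baseline check on summary statistics (invented data).
from scipy import stats

# Hypothetical "Table 1": (mean, sd, n) for control and intervention groups.
baseline = {
    "age (years)": ((54.2, 9.8, 120), (54.1, 9.9, 118)),
    "weight (kg)": ((81.5, 12.3, 120), (81.4, 12.5, 118)),
    "height (cm)": ((171.0, 8.9, 120), (171.1, 8.8, 118)),
}

p_values = []
for variable, ((m1, sd1, n1), (m2, sd2, n2)) in baseline.items():
    # P value for the between-group difference, computed from the reported summaries.
    _, p = stats.ttest_ind_from_stats(m1, sd1, n1, m2, sd2, n2, equal_var=False)
    p_values.append(p)
    print(f"{variable}: p = {p:.3f}")

# Pool the per-variable P values (Stouffer's method). Under genuine randomization,
# and assuming independent variables, the pooled value should sit at neither extreme.
_, pooled = stats.combine_pvalues(p_values, method="stouffer")
print(f"pooled p = {pooled:.3f}")

if pooled > 0.99:
    print("Flag: baseline groups look suspiciously well balanced")
elif pooled < 0.01:
    print("Flag: groups differ more than proper randomization should allow")
```

Run on these invented, near-identical groups, the pooled P value is pushed towards 1 – the "too well balanced" end of the scale that prompts a closer look.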
The method isn't foolproof. The statistical checks demand that the variables in the table are truly independent – whereas in reality, they often aren't. (Height and weight are linked, for example.) In practice, this means that some papers that are flagged up as incorrect actually aren't – and for that reason, some statisticians have criticized Carlisle's work.
But Carlisle says that applying his method is a good first step, and one that can highlight studies that might deserve a closer look, such as requesting the individual patient data behind the paper.
"It can put up a red flag. Or an amber flag, or five or ten red flags to say this is highly unlikely to be real data," says Myles.
Mistakes versus miscreants
Carlisle says that he is careful not to attribute any cause to the possible problems he identifies. In 2017, however, when Carlisle's analysis of 5,000 trials appeared in Anaesthesia – of which he is an editor – an accompanying editorial by anaesthetists John Loadsman and Tim McCulloch at the University of Sydney in Australia took a more provocative line7.
It talked of "dishonest authors" and "miscreants" and suggested that "more authors of already published RCTs will eventually be getting their tap on the shoulder". It also said: "A strong argument could be made that every journal in the world now needs to apply Carlisle's method to all the RCTs they've ever published."
This provoked a strongly worded response from editors at one journal, Anesthesiology, which had published 12 of the papers Carlisle highlighted as problematic. "The Carlisle article is ethically questionable and a disservice to the authors of the previously published articles 'called out' therein," wrote the journal's editor-in-chief, Evan Kharasch, an anaesthesiologist at Duke University in Durham, North Carolina8. His editorial, co-written with anaesthesiologist Timothy Houle at Massachusetts General Hospital in Boston, who is the statistical consultant for Anesthesiology, highlighted problems such as the fact that the method could flag up false positives. "A valid method to detect fabrication and falsification (akin to plagiarism-checking software) would be welcome. The Carlisle method is not such," they wrote in a correspondence to Anaesthesia9.
In May, Anesthesiology did correct one of the papers Carlisle had highlighted, noting that it had reported "systematically incorrect" P values in two tables, and that the authors had lost the original data and couldn't recalculate the values. Kharasch, however, says he stands by his view in the editorial. Carlisle says Loadsman and McCulloch's editorial was "reasonable" and that the criticisms of his work don't undermine its value. "I'm comfortable thinking the effort worthwhile whilst others might not," he says.
The data checkers
Carlisle's isn't the only method to emerge in the past few years for double-checking published data.
Michèle Nuijten, who studies analytical methods at Tilburg University in the Netherlands, has developed what she calls a "spellcheck for statistics" that can scan journal articles to check whether the statistics described are internally consistent. Called statcheck, it verifies, for example, that data reported in the results section agree with the calculated P values. It has been used to flag errors, usually numerical typos, in journal articles going back decades.
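The core consistency check is easy to illustrate. The fragment below is a hedged Python sketch with invented numbers, not statcheck itself (which parses results reported in APA format): it recomputes the P value implied by a reported test statistic and its degrees of freedom, and compares that with the P value the authors state.

```python
# Sketch of a statcheck-style internal-consistency check (invented example,
# not statcheck itself): does the reported P value match the reported t and df?
from scipy import stats

# Hypothetical reported result: "t(28) = 2.20, p = .02"
reported_t, df, reported_p = 2.20, 28, 0.02

# Two-sided P value implied by the reported t statistic and degrees of freedom.
recomputed_p = 2 * stats.t.sf(abs(reported_t), df)

if abs(recomputed_p - reported_p) > 0.005:   # crude allowance for rounding
    print(f"Inconsistent: reported p = {reported_p}, recomputed p = {recomputed_p:.3f}")
else:
    print("Reported statistics are internally consistent")
```

A real checker also has to extract these statistics from the article text and treat rounding of the test statistic itself more carefully; the sketch shows only the recomputation step.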
And Nick Brown, a graduate student in psychology at the University of Groningen, also in the Netherlands, and James Heathers, who studies scientific methods at Northeastern University in Boston, Massachusetts, have used a program called GRIM to double-check the calculation of statistical means, as another way to flag suspect data.
Neither technique would work on papers that describe RCTs, such as the studies Carlisle has assessed. Statcheck runs on the strict data-presentation format used by the American Psychological Association. GRIM works only when data are integers, such as the discrete numbers generated in psychology questionnaires, where a value is scored from 1 to 5.
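A GRIM-style check is simple enough to sketch in a few lines. The function below is an illustration rather than Brown and Heathers' own code; the sample size and means are invented, and it assumes each participant contributes a single integer score. It tests whether a reported mean is mathematically possible for a given number of participants.

```python
import math

def grim_consistent(reported_mean: float, n: int, decimals: int = 2) -> bool:
    """Can a mean of n integer scores round to reported_mean at this precision?"""
    target = round(reported_mean, decimals)
    # The underlying sum must be a whole number close to reported_mean * n;
    # check the two candidate integers either side.
    for total in (math.floor(reported_mean * n), math.ceil(reported_mean * n)):
        if round(total / n, decimals) == target:
            return True
    return False

# Invented examples for 21 participants scoring items from 1 to 5:
print(grim_consistent(3.48, 21))  # True  – 73 / 21 = 3.476..., which rounds to 3.48
print(grim_consistent(3.50, 21))  # False – no sum of 21 integer scores gives 3.50
```

A reported mean that fails this test, like the second example, cannot have come from whole-number responses, which points to a typo or something worse.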
There is growing interest in these kinds of checks, says John Ioannidis at Stanford University in California, who studies scientific methods and advocates for the better use of statistics to improve reproducibility in science. "They are wonderful tools and very ingenious." But he cautions against jumping to conclusions over the reason for the problems found. "It's a completely different landscape if we're talking about fraud versus if we're talking about some typo," he says.
Brown, Nuijten and Carlisle all agree that their tools can only highlight problems that need to be investigated. "I really don't want to associate statcheck with fraud," says Nuijten. The true value of such tools, Ioannidis says, will be to screen papers for problematic data before they are published – and so prevent fraud or mistakes reaching the literature in the first place.
Carlisle says an increasing number of journal editors have contacted him about using his technique in this way. Currently, most of this effort is done unofficially on an ad hoc basis, and only when editors are already suspicious.
At least two journals have taken things further and now use the statistical checks as part of the publication process for all papers. Carlisle's own journal, Anaesthesia, uses it routinely, as do editors at the NEJM. "We are looking to prevent a rare, but potentially impactful, negative event," a spokesperson for the NEJM says. "It is worth the extra time and expense."
Carlisle says he is very impressed that a journal with the status of the NEJM has introduced these checks, which he knows at first hand are laborious, time-consuming and not universally popular. But automation would be needed to introduce them on the scale required to check even a fraction of the roughly two million papers published across the world each year, he says. He thinks it could be done. Statcheck works in this way, and is being used routinely by several psychology journals to screen submissions, Nuijten says. And text-mining techniques have allowed researchers to assess, for instance, the P values in thousands of papers as a way to investigate P-hacking – in which data are tweaked to produce significant P values.
One problem, several researchers in the field say, is that funders, journals and many in the scientific community give a relatively low priority to such checks. "It is not a very rewarding type of work to do," Nuijten says. "It's you trying to find flaws in other people's work, and that is not something that will make you very popular."
Even finding that a study is fraudulent does not always end the matter. In 2012, researchers in South Korea submitted to Anesthesia & Analgesia a report of a trial that looked at how facial muscle tone could indicate the best time to insert breathing tubes into the throat. Asked, unofficially, to take a look, Carlisle found discrepancies between patient and summary data, and the paper was rejected.
Remarkably, it was then submitted to Carlisle's own journal with different patient data – but Carlisle recognized the paper. It was rejected again, and editors on both journals contacted the authors and their institutions with their concerns. To Carlisle's astonishment, a few months later the paper – unchanged from the last version – was published in the European Journal of Anaesthesiology. After Carlisle shared the paper's dubious history with the journal editor, it was retracted in 2017 because of "irregularities in their data, including misrepresentation of results"10.
After seeing so many cases of fraud, alongside typos and mistakes, Carlisle has developed his own theory of what drives some researchers to make up their data. "They think that random chance on this occasion got in the way of the truth, of how they know the Universe really works," he says. "So they change the result to what they think it should have been."
As Carlisle has shown, it takes a determined data checker to spot the deception.
Nature 571, 462-464 (2019)