Can Science’s Reproducibility Crisis Be Reproduced?

March 03, 2016

Gary Miller, FilmMagic, Getty Images
A 2011 study of whether happier people are more patient used footage of a Broadway performance by Robin Williams (pictured in 2009). But Mr. Williams’s suicide, in 2014, might well have affected researchers’ ability to replicate the original study, as people are now likely to respond differently to his brand of humor.

More than a year after he committed suicide, is Robin Williams still funny?

The answer, both before his tragic death and after, probably is a matter of personal preference. It’s also now a key to assessing how seriously to take the much-feared crisis of reproducibility in scientific research.

Broad fears over reproducibility were stoked by a 2005 article in PLOS Medicine by John P.A. Ioannidis, a professor of health research and policy at Stanford University, contending that most published research findings are false. Last year a team of hundreds of researchers raised further alarm. After working over three years to faithfully repeat 100 studies that had been published in psychology journals, the team reported that it could not replicate most of the original results.

Now, two new studies, published on Thursday in Science magazine, are pushing back. One, a Harvard-led critique of the project that repeated 100 psychology studies, suggests that the ambitious effort overlooked some critical factors. The other, an attempt to repeat 18 studies in leading economics journals, found that 61 percent of them replicated successfully.

"Our results were pretty encouraging," said the lead author of the economics study, Colin F. Camerer, a professor of behavioral economics at the California Institute of Technology.

Together, the two papers this week should help calm the widespread worries about the reliability of science fanned by Mr. Ioannidis, said the lead author of the psychology critique, Daniel T. Gilbert, a professor of psychology at Harvard.

"It’s very easy to come to the wrong conclusion when you try to replicate other people’s research," Mr. Gilbert said.

Inconclusive Results

On that much, Mr. Ioannidis and other advocates of replication studies say they agree, with Robin Williams and his brand of comedy serving as a case in point. Among the 18 economics studies that Mr. Camerer’s team attempted to replicate was a 2011 report by a pair of Santa Clara University researchers that explored whether happier people are more patient.

For the experiment, conducted among 69 students at Santa Clara, the half chosen for the "happy" group were shown a nine-minute clip of Mr. Williams during a 2002 Broadway performance. For Mr. Camerer’s replication attempt, he tested the clip on 131 students at the University of Oxford’s Nuffield College, where he happened to be working at the time.

Mr. Camerer acknowledged the obvious risks in that attempt at replication. For one thing, British and American senses of humor differ. For another, Mr. Williams committed suicide in August 2014, before the replication study was conducted, perhaps coloring test subjects’ emotional response to his work. Such attempts are considered "near replications" because the conditions of the original study cannot be repeated, Mr. Camerer said.

That raises the question, however, of whether any attempt to replicate an experiment involving human subjects warrants a conclusion about the strength of the initial finding.

The coordinator of the 100 replicated psychology studies, Brian A. Nosek, a professor of psychology at the University of Virginia, said he was well aware that replication has limits that are only beginning to be understood.

In November 2011, Mr. Nosek assembled a team of some 200 researchers, known as the Open Science Collaboration, who spent three years seeing if they could replicate 100 studies published in leading journals. Of those original studies, 97 percent reported a statistically significant finding. In the replication attempts, only 36 percent produced significant results.

Beyond those results, however, Mr. Nosek said he had been driven to learn more about how to really measure reproducibility and how to improve it. The project involved extensive collaboration with the original authors to get their input on how best to conduct the replication attempts. For a follow-up effort, Mr. Nosek said, his team is looking more closely at 11 studies whose original authors did not endorse the replication design, to see if their suggestions might improve the rate of reproducibility.

Despite that commitment to scientifically exploring reproducibility, the findings of the Open Science Collaboration left some bruises. Mr. Gilbert called the project an unwarranted slap at the field of psychology, and said he had set out to find what might have been wrong about the replication attempts. "We were as chagrined as anybody to get this news" from the Open Science Collaboration, he said.

‘Don’t Trust the Headlines’

For their paper this week, Mr. Gilbert and his co-authors pored over the data made available by Mr. Nosek’s team and suggested a variety of flaws in how the Open Science Collaboration had chosen studies to review, selected participants for the replication attempts, and defined statistically significant effects.

According to Mr. Gilbert, the Open Science Collaboration’s replications used Italians rather than Americans to repeat a study of racial attitudes, and used students not enrolled in college to repeat an exploration in which college students had been asked to imagine being called on by a professor.

The lesson, Mr. Gilbert said, is: "Don’t trust the headlines when you see that somebody replicated a study. You have to look carefully to see what they really did."

Mr. Nosek said he agreed that replications can be very difficult. The lesson, however, is to keep trying to make replications even better, he said, and to design studies from the beginning so that replication attempts are easier.

The attacks by the Harvard group seem especially disingenuous, Mr. Ioannidis said, and motivated more by a desire to defend the field of psychology than to explore ways of improving reproducibility.

At least a couple of the Open Science Collaboration’s decisions — trying to reproduce only papers in leading journals, and excluding papers whose authors were reluctant to cooperate — more likely served to underestimate the problem of reproducibility rather than to exaggerate it, Mr. Ioannidis said. "The arguments that the Harvard team is raising are not really very serious," he said.

And as Mr. Camerer’s group is showing, even success is hard to define. Its attempt to affirm that happier people are more patient was scored a failure, possibly because of changes in perceptions of Mr. Williams’s comedic talents, Mr. Camerer admitted. But overall, Mr. Camerer was encouraged to see that 61 percent of the original studies had been effectively replicated.

"So there’s a little bit of room for improvement, but it’s not a disaster," Mr. Camerer said. That message is important as a counterpoint to some of the "dramatic claims" by Mr. Ioannidis that most research findings are false, he added.

Mr. Nosek was less impressed. "I would certainly hope that we could do better than 60 percent," he said. Either way, he said, this week’s papers reflect growing attention to the problem and to the pursuit of better ways to define and promote reproducibility. "It’s a victory for openness," he said.

Paul Basken covers university research and its intersection with government policy. He can be found on Twitter @pbasken, or reached by email at paul.basken@chronicle.com.