Peer-Review Research Fails Reproducibility Test

September 11, 2015

On August 28, 2015, the journal Science published the results of a massive reproducibility study in which researchers attempted to replicate 100 experimental and correlational studies that had previously been published in three peer-reviewed psychology journals.

The results were not good. After re-running the previously peer-reviewed and published studies, the researchers found that only 36% of the replications produced significant results, whereas 97% of the original, peer-reviewed published studies had reported significant results.

The researchers used high-powered designs and original materials when available to test the reproducibility of these studies. And they found that the average (mean) effect size of the replication effects (M_r = 0.197, SD = 0.257) had dropped by almost exactly 50% from the mean effect of the original studies (M_r = 0.403, SD = 0.188).
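
For readers who want to check the "almost exactly 50%" figure, here is a minimal sketch of the arithmetic. It uses only the two mean effect sizes quoted above; the variable names are ours, not the study's.

```python
# Mean effect sizes as quoted above from the reproducibility study.
original_mean_r = 0.403      # mean effect size of the original studies
replication_mean_r = 0.197   # mean effect size of the replications

# Relative drop: how much of the original effect disappeared on replication.
relative_drop = (original_mean_r - replication_mean_r) / original_mean_r

print(f"Relative drop in mean effect size: {relative_drop:.1%}")
# prints "Relative drop in mean effect size: 51.1%"
```

In other words, the replications recovered a little less than half of the originally reported effect.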

This is a shocking result. Even more stunning, the lead author of the reproducibility study had submitted one of his own studies for testing and it failed.

How Trustworthy Are Peer Review Studies?

The implications of this outcome are not insignificant.

Each year orthopedic peer review journals publish thousands of studies which, we ALL assume, are used to guide patient treatment. Given this new information, should all peer review studies be treated with even more skepticism than they are currently afforded until they pass a reproducibility test? You can be sure this will not go unnoticed by payers.

Here are the reproducibility study results in graphic form. This is sobering data.

The pace of clinical study publishing has been rising for decades. Industry funding is driving much of that growth, and ever more clinical research is flowing out of Asia, for example. Together they are sharply increasing the number of peer review studies in print.

This reproducibility study puts all this published research in a new light (the difference between 97% and 36% is huge) and raises many questions, including this one: why not encourage more reproducibility studies?

Blame the “Impact Factor”

Peer review journals are measured by something known as the “impact factor”.

Pity the poor researcher who doesn’t adapt their work to the “impact factor.” They are not published and probably served you a Starbucks Frappuccino this morning.

The Impact Factor measures the average number of citations to recent articles published in a journal. It’s frequently used as a proxy for the relative importance of a journal within its field, with journals with higher impact factors deemed to be more important than those with lower ones.
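
As a rough illustration of that average, here is a minimal sketch. It assumes the common two-year convention (citations received in one year to articles published in the previous two years, divided by the number of citable articles from those two years). The function name and the counts are hypothetical, invented purely to show the calculation.

```python
def two_year_impact_factor(citations: int, citable_items: int) -> float:
    """Citations received this year to articles from the prior two years,
    divided by the number of citable articles published in those two years.
    The two-year window is an assumption, not something stated in the article."""
    if citable_items <= 0:
        raise ValueError("citable_items must be positive")
    return citations / citable_items


# Hypothetical counts, invented for illustration only.
example = two_year_impact_factor(citations=1056, citable_items=200)
print(f"Impact factor: {example:.2f}")
```

A journal with these made-up counts would score 5.28, which shows why a handful of heavily cited papers can move the number so much.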

Here are the impact factors for the top 20 orthopedic journals.

Rank	Abbreviated Journal Title	Impact Factor
1	J BONE JOINT SURG AM	5.28
2	AM J SPORT MED	4.362
3	OSTEOARTHR CARTILAGE	4.165
4	J PHYSIOTHER	3.708
5	J BONE JOINT SURG BR	3.309
6	ARTHROSCOPY	3.206
7	KNEE SURG SPORT TR A	3.053
8	J ORTHOP SPORT PHYS	3.011
9	J ORTHOP RES	2.986
10	ACTA ORTHOP	2.771
11	CLIN ORTHOP RELAT R	2.765
12	GAIT POSTURE	2.752
13	J ARTHROPLASTY	2.666
14	J AM ACAD ORTHOP SUR	2.527
15	PHYS THER	2.526
16	SPINE J	2.426
17	SPINE	2.297
18	J SHOULDER ELB SURG	2.289
19	CLIN J SPORT MED	2.268
20	J SPINAL DISORD TECH	2.202

Source: Impactfactorsweekly.com

To maximize their journal’s impact factor, journal editors understand that they need to publish innovative research. In fact there are many self-help articles for academicians to help them improve their “impact” in the peer-review publishing world. Here are four common suggestions:

Think of a sexy title. “Academics who wish to improve the citation rate of their journal articles should ensure that title names are informative and memorable.” – Maximizing the Impacts of Your Research, LSE Public Policy Group
Be a networking machine. “Improving professional communication, such as through multi-author blogs, will help academics disseminate their research more broadly.” – LSE Public Policy Group
Issue a press release and perhaps even call The New York Times. (source: The Spine Journal)
Finally, add to the “dynamic knowledge inventory, a constantly developing stock of knowledge”. – LSE Public Policy Group

And if a sad, clueless researcher decides to submit a paper which merely confirms another person’s research? Then that poor sap is not impactful, not relevant.

Except that they are. Actually, they are critically relevant.

Blame the “Impact Factor.” Reproducibility studies don’t impact the impact factor.

Why Weren’t These Studies Reproducible?

The studies in this reproducibility test were psychology studies and therefore, to a great extent, dependent on subjective measures. There were, therefore, many potentially confounding variables. Of course, the same can be said for orthopedic studies which rely on such subjective measures as the Visual Analog Scale or the Western Ontario & McMaster pain score.

But there are other reasons.

Both the original study and the confirming study may be wrong. There may be an unknown variable acting independently of the measured variable.
The original study result may NOT have been a false positive. The confirming study results may have been a false negative. There may well have been unanticipated factors in the sample, setting or procedure in the confirming study which altered the observed effect magnitudes.
Publication and reporting bias. The original studies, especially the low-powered ones, were exposed to both of these biases. The replication studies largely avoided them because each replication was preregistered with a pre-analysis plan.

It’s Biology, Stupid

With apologies to James Carville and his catchphrase “It’s the economy, stupid,” these studies are about biology, which is complex. And the smart researchers aren’t looking for correlational data; they’re looking for causal data. They are seeking to unlock the underlying mechanisms of pain or healing or degeneration. It’s biology, stupid.

So, as the authors of the reproducibility study said in their concluding comments:

“The observed variation in replication and original results may reduce certainty about the statistical inferences from the original studies but also provides an opportunity for theoretical innovation to explain differing outcomes, and then new research to test those hypothesized explanations. The correlational evidence, for example, suggests that procedures that are more challenging to execute may result in less reproducible results, and that more surprising original effects may be less reproducible than less surprising original effects.”

Another way of thinking about this is to remember all of the important “failures” which led to advancing the knowledge of biology. When outcomes are unexpected, digging into the failure sometimes reveals an even more valuable insight. Here are two famous examples of failures which opened the door to major biologic advances.

Penicillin. Wrong petri dish. Penicillin was discovered by Alexander Fleming in 1928 after a fortuitous accident (so Dr. Fleming would relate in later years): he had mistakenly left a petri dish open and it became contaminated by mold, and in one portion of the dish the bacteria seemed to be dying. Dr. Fleming was a famously poor communicator and orator, so his discovery was ignored. He could not even recruit a chemist to help him extract and stabilize the new compound. He continued to persevere, even publishing a widely dismissed paper entitled “A Medium for the Isolation of Pfeiffer’s Bacillus.” Had other researchers paid closer attention, penicillin for medicinal use would likely have sparked great interest, and its development might have been sped up by almost a decade.
Heart Pacemaker. Wrong part. In 1956 Wilson Greatbatch, a medical engineer in Buffalo, New York, was trying to build an oscillator to record heart sounds at the University of Buffalo when he reached into a box, pulled out a resistor of the wrong size and plugged it into the circuit. Once it was installed, the circuit began to give off a rhythmic electrical impulse, and he recognized the rhythmic lub-dub as the sound of the human heart. The beat, according to his 2011 obituary in The New York Times, reminded him of chats he had had with other scientists about whether an electrical stimulation could make up for a breakdown in the heart’s natural beats. Before then, pacemakers were hulking machines the size of TVs. He spent two years refining his device and was awarded a patent for the world’s first implantable pacemaker. His first pacemaker was implanted in a 77-year-old patient who lived 18 months with the device. Now, more than half a million of the devices are implanted every year.

Reproducibility Matters But So Does Failure

Reproducibility is one of the defining features of science. But so is failure, and the bridge between the two is scientific inquiry and the embrace of the unexpected result. True scientists are curious and comfortable with complexity.

In medicine today there is a powerful trend toward using correlational studies and mega-data to make medical decisions. Within that context, the fact that most of these studies did not reproduce well is problematic. But that’s not the right conclusion.

Within the context of scientific inquiry these results provide, instead, the valuable opportunity to dig deeper. Perhaps also, these results can push peer review journals and their editors to publish or even encourage confirmatory studies.

Impact factor be damned.

Source: Wikimedia Commons and Chris
