COI: I was a co-author of the RP:P paper (but not of the response to Gilbert et al).
The first question GKPW address in their commentary is: “So how many of their [the RP:P] replication studies should we expect to have failed by chance alone?”
They estimate this using Many Labs data and conclude that 65.5% can be expected to replicate, so the answer is 100% - 65.5%, or 34.5%.
This 65.5% is an important number, because it underlies the claim in their ‘oh screw it who cares about our reputation’ press release that: “the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%”.
In the article, they compare another meaningless estimate of the number of successful replications in the RP:P with the 65.5 number and conclude: “Remarkably, the CIs of these estimates actually overlap the 65.5% replication rate that one would expect if every one of the original studies had reported a true effect.”
So how did GKPW arrive at this 65.5 number that rejects the idea that there is a reproducibility crisis in psychology? Science might not have peer reviewed this commentary (I don’t know for sure, but given the quality of the commentary, the fact that two of the authors are editors at Science, and my experience writing commentaries, which are often only glanced over by editors, I’m 95% confident), but they did require the authors to share the code. I’ve added some annotations to the crucial file (see code at the bottom of this post), and you can get all the original files here. So, let's peer-review this claim ourselves.
GKPW calculate confidence intervals around all effect sizes. They then take each of the 16 studies in the Many Labs project. For each study, there are 36 replications. They take the effect size of a single replication at a time, and count how many of the remaining replications have a confidence interval whose lower limit is larger than that effect size, or whose upper limit is smaller than it. Thus, they count how many times the confidence intervals from the other replications do not contain the effect size from the single replication.
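To make the counting step concrete, here is a minimal Python sketch of the procedure as I describe it above. It is not GKPW's actual code: the function name `pairwise_capture_rate` and the three toy replications are made up for illustration.

```python
def pairwise_capture_rate(estimates, ci_lower, ci_upper):
    """Share of ordered pairs (i, j), i != j, in which the CI of
    replication j contains the effect size of replication i."""
    n = len(estimates)
    misses = 0
    for i in range(n):          # the single replication whose estimate we take
        for j in range(n):      # the remaining replications and their CIs
            if i == j:
                continue
            # A miss: the CI lies entirely above or entirely below the estimate.
            if ci_lower[j] > estimates[i] or ci_upper[j] < estimates[i]:
                misses += 1
    total_pairs = n * (n - 1)
    return 1 - misses / total_pairs


# Toy data (hypothetical): three replications of one effect.
toy_estimates = [0.10, 0.20, 0.60]
toy_lower = [-0.10, 0.15, 0.30]
toy_upper = [0.30, 0.25, 0.90]
toy_rate = pairwise_capture_rate(toy_estimates, toy_lower, toy_upper)
```

In the toy data only one of the six ordered pairs is a capture (the wide first interval contains 0.20), so `toy_rate` is 1/6. Note that the narrow second interval captures nothing, which already hints at the sample size problem discussed next.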
As I explained in my previous blog post, they are calculating a capture percentage. The authors ‘acknowledge’ their incorrect definition of what a confidence interval is:
“@StuartBuck1 @a_strezh Fair enough, but we're just employing the same metric they used, regardless of lack of precision in our language...” (Stephen Pettigrew, @rink_stats, March 3, 2016)
They also suggest they are just using the same measure we used in the RP:P paper. This is true, except that we didn’t suggest, anywhere in the RP:P paper, that there is a certain percentage that is ‘expected based on statistical theory’, as GKPW state. However, not hindered by any statistical knowledge, GKPW write in the supplementary material [TRIGGER WARNING]:
“OSC2015 does not provide a similar baseline for the CI replication test from Table 1, column 10, although based on statistical theory we know that 95% of replication estimates should fall within the 95% CI of the original results.”
Reading that statement physically hurts.
The capture percentage indicates that a single 95% confidence interval will, in the long run, contain 83.4% of future estimates. To extend my previous blog post: this number rests on some assumptions. It only holds if the sample sizes are equal (another assumption is unbiased CIs in the original studies, which is also problematic here, but not even necessary to discuss). If the replication study is larger, the capture percentage is higher, and if the replication study is smaller, the capture percentage is lower. Richard Morey made a graph that plots capture percentages as a function of the difference between the sample size in the original and replication study.
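For readers who want to check the 83.4% themselves, here is a small sketch under simplifying assumptions: normally distributed means, a known standard deviation (so z-based rather than t-based intervals), and no bias in the original estimate. The difference between two independent means then has variance proportional to 1/n_original + 1/n_replication, which gives a closed-form capture probability. The function name `capture_probability` is my own.

```python
from math import sqrt
from statistics import NormalDist


def capture_probability(n_original, n_replication, level=0.95):
    """Probability that a replication mean falls inside the original CI,
    assuming normal means, known SD, and an unbiased original estimate."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(0.5 + level / 2)   # about 1.96 for a 95% CI
    ratio = n_original / n_replication
    # |mean_rep - mean_orig| < z * SE_orig, where the difference has
    # standard deviation SE_orig * sqrt(1 + n_original / n_replication).
    return 2 * std_normal.cdf(z / sqrt(1 + ratio)) - 1


equal_sizes = capture_probability(50, 50)          # about 0.834
larger_replication = capture_probability(50, 200)  # higher than 0.834
smaller_replication = capture_probability(200, 50) # lower than 0.834
```

With equal sample sizes this reproduces the 83.4%; making the replication larger pushes the number up toward 95%, and making it smaller pushes it down.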
The Many Labs data does not consist of 36 replications per study, each with exactly the same sample size. Instead, sample sizes varied from 79 to 1329.
Look at the graphs below. Because the variability is much larger in the small sample (n = 79, top) than in the big sample (n = 1329, bottom), it's more likely that the mean of the bottom study will fall within the 95% CI of the top study than that the mean of the top study will fall within the 95% CI of the bottom study. In an extreme case (n = 2 vs n = 100000), the mean of the n = 100000 study will almost always fall within the 95% CI of the n = 2 study, but the mean of the n = 2 study will rarely fall within the CI of the n = 100000 study. Only about half of the pairwise comparisons then succeed, yielding a lower long-run limit of 50% for the capture percentage as calculated by GKPW.
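The asymmetry is easy to simulate. The sketch below is my own illustration, not part of GKPW's analysis: it assumes a true mean of 0, a known standard deviation of 1, and draws the two sample means directly from their sampling distributions. The function name `simulated_capture` and the seed are arbitrary.

```python
import random
from math import sqrt


def simulated_capture(n_ci, n_mean, trials=20000, seed=1):
    """Fraction of trials in which the mean of a study with n_mean
    observations falls inside the 95% CI of a study with n_ci observations.
    Assumes true mean 0 and known SD 1 (z-based interval)."""
    rng = random.Random(seed)
    half_width = 1.96 / sqrt(n_ci)
    hits = 0
    for _ in range(trials):
        mean_of_ci_study = rng.gauss(0, 1 / sqrt(n_ci))
        mean_of_other_study = rng.gauss(0, 1 / sqrt(n_mean))
        if abs(mean_of_other_study - mean_of_ci_study) < half_width:
            hits += 1
    return hits / trials


# CI from the small study (n = 79), mean from the big study (n = 1329).
small_ci_captures_big = simulated_capture(79, 1329)
# CI from the big study (n = 1329), mean from the small study (n = 79).
big_ci_captures_small = simulated_capture(1329, 79)
```

Under these assumptions the first rate comes out well above 90% and the second well below 50%, so a pairwise capture percentage mostly tells you how unequal the sample sizes are.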
Calculating a capture percentage across the Many Labs studies does not give an idea of what we can expect in the RP:P, if we allow some variation between studies due to 'infidelities'. The number you get says a lot about differences in sample sizes in the Many Labs study, but this can't be generalized to the RP:P. The 65.5 is a completely meaningless number with respect to what can be expected in the RP:P.
The conclusion GKPW draw based on this meaningless number, namely that “If every one of the 100 original studies that OSC attempted to replicate had described a true effect, then more than 34 of their replication studies should have failed by chance alone”, is really just complete nonsense. The statement in their press release that “the reproducibility of psychological science is quite high and, in fact, statistically indistinguishable from 100%”, based on this number, is equally meaningless.
The authors could have attempted to calculate the capture percentage for the RP:P based on the true differences in sample sizes between the original and replication studies (where 70 studies had a larger sample size, 10 the same sample size, and 20 a smaller sample size). But even this would not give us the expected capture percentage under the assumption that all original effects are true and only 'infidelities' in the replications matter. In addition to variation in sample sizes between original and replication studies, the capture percentage is substantially influenced by publication bias in the original studies. If we take this into account, the most probable capture percentages should be even lower. Had GKPW taken this bias into account, they would not have had to commit the world's first case of CI-hacking by only looking at the subset of 'endorsed' protocols to make the point that the 95% CI around the observed success rate for endorsed studies includes the meaningless 65.5 number.
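To give a feel for how publication bias pulls the capture percentage down, here is a rough simulation sketch. Everything in it is an assumption chosen for illustration, not an estimate for the RP:P: a true standardized effect of 0.2, n = 30 in both studies, a known SD of 1, and a literature in which only statistically significant originals get published. The function name `capture_rates` is my own.

```python
import random
from math import sqrt


def capture_rates(true_effect=0.2, n=30, trials=100000, seed=1):
    """Return (capture rate over all originals,
               capture rate over significant originals only).
    Assumes known SD 1, equal n, and z-based 95% intervals."""
    rng = random.Random(seed)
    standard_error = 1 / sqrt(n)
    half_width = 1.96 * standard_error
    all_hits = 0
    significant_count = 0
    significant_hits = 0
    for _ in range(trials):
        original = rng.gauss(true_effect, standard_error)
        replication = rng.gauss(true_effect, standard_error)
        captured = abs(replication - original) < half_width
        if captured:
            all_hits += 1
        # "Published" here means the original CI excludes zero.
        if abs(original) > half_width:
            significant_count += 1
            if captured:
                significant_hits += 1
    return all_hits / trials, significant_hits / significant_count


unbiased_rate, published_rate = capture_rates()
```

Without selection the rate sits near the familiar 83.4%. Once only significant originals are kept, the surviving estimates are inflated and their intervals capture the replication mean clearly less often, even though every effect in the simulation is true.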
In Uri Simonsohn’s recent blog post he writes: “the Gilbert et al. commentary opens with what appears to be an incorrectly calculated probability. One could straw-man argue against the commentary by focusing on that calculation”. I hope to have convinced the readers that focusing on this incorrectly calculated probability is not a straw man. It completely invalidates a third of their commentary, the main point they open with, and arguably the only thing that was novel about the commentary. (The other two points about power and differences between original studies and replications were discussed in the original report [even though the differences between studies could not be discussed in detail due to word limitations; however, the commentary doesn’t adequately discuss this issue either]).
The use of the confidence interval interpretation of replicability in the OSC article was probably a mistake, too much based on the 'New Statistics' hype two years ago. The number is basically impossible to interpret, there is no reliable benchmark to compare it against, and it doesn't really answer any meaningful question.
But the number is very easy to misinterpret. We see this clearly in the commentary by Gilbert, King, Pettigrew and Wilson.
To conclude: How many replication studies should we expect to have failed by chance alone? My answer is 42 (and the real answer is: We can't know). Should Science follow Psychological Science's recent decision to use statistical advisors? Yes.
P.S. Marcel van Assen points out in the comments that the correct definition, and code, for the CI overlap measure were readily available in the supplement. See here, or the screenshot below: