Recently, people have wondered why researchers seem to have a special interest in replicating studies that demonstrated unexpected or surprising results. In this blog post, I will explain why, statistically speaking, this makes sense.
When we evaluate the likelihood that findings reflect real effects, we need to take the prior likelihood that the null-hypothesis is true into account. Null-hypothesis significance testing ignores this issue, because p-values give us the probability of observing the data (D), or more extreme data, assuming H0 is true, or Pr(D|H0). If we want to know the probability the null-hypothesis is true, given the data, or Pr(H0|D), we need Bayesian statistics. I generally like p-values, so I will not try to convince you to use Bayesian statistics (although it’s probably smart to educate yourself a little on the topic), but I will explain how you can use calibrated p-values to get a feel for the probability that H0 or H1 is true, given some data (see Sellke, Bayarri, & Berger, 2001). This nicely shows how p-values can be related to Bayes Factors (see also Good, 1992; Berger, 2003).
Everything I will talk about can be applied with the help of the nomogram below (taken from Held, 2010). On the left, we have the prior probability that H0 is true. For now, let’s assume the null hypothesis and the alternative hypothesis are equally likely (so the probability of H0 is 50%, and the probability of H1 is 50%). The middle line gives the observed p-value in a statistical test. It goes up to p = .37, because the calibration is only defined for p-values below 1/e (approximately .37). The right scale is the posterior probability of the null-hypothesis, from almost 0 (it is practically impossible H0 is true) to 50% probability that H0 is true (where 50% means that after we have performed a study, H0 and H1 are still equally likely to be true). By drawing straight lines between two of the scales, you can read off the corresponding value on the third scale. For example, assuming you think H0 and H1 are equally likely to be true before you begin (a prior probability of H0 of 50%), and you observe a p-value of .37, a straight line will bring us to a posterior probability for H0 of 50%, which means the likelihood that H0 or H1 is true has not changed, even though we have collected data.
If we observe a p = .049, which is a statistically significant difference with an alpha level of .05, the posterior likelihood that H0 is true is still a rather high 29%. The likelihood of the alternative hypothesis (H1) is 100% - 29% = 71%. This gives a Bayes Factor (with equal prior probabilities, the probability of H0, given the data, divided by the probability of H1, given the data, or Pr(H0|D)/Pr(H1|D)) of 0.40, or 2.5 to 1 odds against H0. Bayesians do not consider this strong enough support against H0 (instead, it should be at least 3 to 1 odds against H0). This might be a good moment to add that these calculations are a best case scenario. The prior distribution under H1 is chosen in a way to give the strongest possible evidence against H0, so the real evidence against H0 is the value that follows from the nomogram, or weaker. Also, now you’ve seen how easy it is to use the nomogram, I hope showing the Sellke et al., 2001 formula these calculations are based on won’t scare you away:
Pr(H0|D) = 1 / (1 + [(1 - π0) / π0] × 1 / [-e × p × ln(p)]), for p < 1/e, where π0 is the prior probability that H0 is true and -e × p × ln(p) is the lower bound on the Bayes Factor for H0.
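If you would rather compute than draw lines, the calibration can be sketched in a few lines of Python. This assumes the Sellke et al. (2001) bound on the Bayes Factor for H0, -e × p × ln(p), which only holds for p < 1/e; the function names are hypothetical and chosen for illustration, and this is not the code behind the nomogram.

```python
import math


def bayes_factor_bound(p):
    """Lower bound on the Bayes Factor for H0 versus H1: -e * p * ln(p)."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound only holds for 0 < p < 1/e (about .37)")
    return -math.e * p * math.log(p)


def posterior_h0(p, prior_h0=0.5):
    """Smallest posterior probability of H0 consistent with p and the prior."""
    if not 0 < prior_h0 < 1:
        raise ValueError("prior_h0 must lie strictly between 0 and 1")
    prior_odds = prior_h0 / (1 - prior_h0)
    posterior_odds = prior_odds * bayes_factor_bound(p)
    return posterior_odds / (1 + posterior_odds)


# The p = .049 example with equal prior probabilities for H0 and H1.
print(f"Bayes Factor bound: {bayes_factor_bound(0.049):.2f}")
print(f"Posterior Pr(H0|D): {posterior_h0(0.049):.2f}")
```

Because the bound is the best case for H1, the posterior probability of H0 printed here is a minimum: the real value is this or higher.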
What if, a-priori, it seems the hypothesized alternative hypothesis is at least somewhat unlikely? This is a subjective judgment, and difficult to quantify, but you often see researchers themselves describe a result as ‘surprising’ or ‘unexpected’. Take a moment to think how likely the H0 should be for a finding to be ‘surprising’ and ‘unexpected’. Let’s see what happens if you think the a-priori probability of H0 is 75% (or 3 to 1 odds for H0). Observing a p = .04 would in that instance lead to, at best, a 51% probability H0 is true, and only a 49% probability H1 is true. That means that even though the observed data are unlikely, assuming H0 is true (or Pr(D|H0)), it is still more likely that H0 is true (Pr(H0|D)) than that H1 is true (Pr(H1|D)). I've made a spreadsheet you can use to perform these calculations (without any guarantees), in case you want to try out some different values of the prior probability and the observed p-value.
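The same kind of calculation can be sketched in Python, working in odds: prior odds for H0 times the Bayes Factor bound gives the posterior odds. The sketch assumes the Sellke et al. (2001) bound -e × p × ln(p) for p < 1/e; the function name and the list of priors are hypothetical choices for illustration, not a copy of the spreadsheet.

```python
import math


def posterior_h0(p, prior_h0):
    """Minimum posterior probability of H0, via prior odds times the bound."""
    if not 0 < p < 1 / math.e:
        raise ValueError("the bound only holds for 0 < p < 1/e (about .37)")
    prior_odds = prior_h0 / (1 - prior_h0)       # e.g. 0.75 -> 3 to 1 for H0
    bayes_factor = -math.e * p * math.log(p)     # best case for H1
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1 + posterior_odds)


# How the same p = .04 reads under increasingly sceptical priors.
for prior in (0.50, 0.75, 0.90):
    post = posterior_h0(0.04, prior)
    print(f"prior Pr(H0) = {prior:.2f} -> posterior Pr(H0|D) >= {post:.2f}")
```

With a prior of 75% the posterior for H0 stays just above one half, which is the 51% versus 49% split described above.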
With a prior probability of 50%, a p = .04 would give a posterior probability of 26%. To have the same posterior probability of 26%, with a prior probability for H0 of 75%, the p-value would need to be p = .009. In other words, as the a-priori likelihood of H1 decreases, we need lower p-values to achieve a comparable posterior probability that H0 is true. This is why Lakens & Evers (2014, p. 284) stress that “When designing studies that examine an a priori unlikely hypothesis, power is even more important: Studies need large sample sizes, and significant findings should be followed by close replications.” To have a decent chance of observing a low enough p-value, you need to have a lot of statistical power. When reviewing studies that use the words 'unexpected' and 'surprising', be sure to check whether, given the a-priori likelihood of H0 (however subjective this assessment is), the p-values lead to a decent posterior probability that H1 is true. If we did this consistently and fairly, there would be a lot less complaining about effects that are 'sexy but unreliable'.
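To find the p-value that matches a given posterior probability, you can invert the calibration numerically. The sketch below assumes the Sellke et al. (2001) bound -e × p × ln(p), which increases with p on the interval from 0 to 1/e, so a simple bisection search works; `p_needed` is a hypothetical helper name, and the search limits and tolerance are arbitrary choices.

```python
import math


def posterior_h0(p, prior_h0):
    """Minimum posterior probability of H0 for a p-value below 1/e."""
    odds = prior_h0 / (1 - prior_h0) * (-math.e * p * math.log(p))
    return odds / (1 + odds)


def p_needed(target_posterior_h0, prior_h0, tol=1e-10):
    """Largest p-value whose minimum posterior for H0 is at most the target."""
    lo, hi = 1e-12, 1 / math.e - 1e-12
    if posterior_h0(hi, prior_h0) <= target_posterior_h0:
        return hi  # every p-value in the valid range already meets the target
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if posterior_h0(mid, prior_h0) > target_posterior_h0:
            hi = mid  # posterior too high, so a smaller p-value is required
        else:
            lo = mid
    return lo


# Posterior for p = .04 with a 50% prior, then the p-value that gives the
# same posterior when the prior probability of H0 is 75%.
target = posterior_h0(0.04, 0.50)
print(f"target posterior Pr(H0|D): {target:.2f}")
print(f"p-value needed with a 75% prior: {p_needed(target, 0.75):.3f}")
```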
As a consequence of this statistical reality, given two studies with equal sample sizes that yielded results with identical p-values, researchers who choose to replicate the more ‘unexpected and surprising’ finding are doing our science a favor. After all, that is the study where H0 still has the highest posterior likelihood, and is thus the finding where the likelihood that H1 is true is still relatively low. Replicating the more uncertain result leads to the greatest increase in posterior likelihoods. You can disagree about which finding is subjectively judged to be a-priori less likely, but the choice to replicate a-priori less likely results (all else being equal) makes sense.
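A rough numerical illustration of this point, under strong assumptions: the replication is independent and happens to yield the same p-value as the original study, the Sellke et al. (2001) bound -e × p × ln(p) is treated as if it were the actual Bayes Factor, and the posterior after the original study serves as the prior for the replication. The function names and the two priors (50% and 75%) are hypothetical choices for illustration.

```python
import math


def update_h0(prior_h0, p):
    """Update Pr(H0) with one study, using the bound as the Bayes Factor."""
    odds = prior_h0 / (1 - prior_h0) * (-math.e * p * math.log(p))
    return odds / (1 + odds)


def gain_in_h1(prior_h0, p_original, p_replication):
    """Increase in Pr(H1|D) that the replication adds to the original study."""
    after_original = update_h0(prior_h0, p_original)
    after_replication = update_h0(after_original, p_replication)
    return after_original - after_replication  # drop in Pr(H0) = gain in Pr(H1)


# Two findings with identical p-values but different a-priori plausibility.
for label, prior in (("expected finding", 0.50), ("surprising finding", 0.75)):
    gain = gain_in_h1(prior, p_original=0.04, p_replication=0.04)
    print(f"{label}: replication raises Pr(H1|D) by {gain:.2f}")
```

Under these assumptions the replication of the surprising finding moves the posterior probability of H1 further than the replication of the expected one, which is the sense in which replicating the more uncertain result is more informative.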