David Colquhoun (2014) recently wrote “If you use p = 0.05 to suggest that you have made a discovery, you will be wrong at least 30% of the time.” At the same time, you might have learned that if you set your alpha at 5%, the Type 1 error rate (or false positive rate) will not be higher than 5%. How are these two statements related?
First of all, the statement by David Colquhoun is obviously incorrect – peer reviewers nowadays are not what they never were – but we can correct his sentence by changing ‘will’ into ‘might, under specific circumstances, very well be’. After all, if you only examined true effects, you could never be wrong when you suggested, based on p = 0.05, that you had made a discovery.
The probability that a single study is indicative of a true effect depends on the percentage of studies you run where there is an effect (H1 is true) versus where there is no effect (H0 is true), the statistical power, and the alpha level. The false discovery rate is the percentage of positive results that are false positives (not the percentage of all studies that are false positives). If you perform 200 tests with 80% power, and 50% (i.e., 100) of the tests examine a true effect, you’ll find 80 true positives (0.8*100), but in the 50% of the tests that do not examine a true effect (the other 100), you’ll find 5 false positives (0.05*100). For the 85 positive results (80 + 5), the false discovery rate is 5/85 = 0.0588, or approximately 6% (see the Figure below, from Lakens & Evers, 2014, for a visualization).
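The arithmetic above can be checked with a few lines of code. This is just a sketch of the worked example from the text (200 tests, 80% power, alpha = 0.05, H1 true in half the tests); none of these numbers are special, they are the illustrative values used here.

```python
# Numbers from the example in the text.
n_tests = 200
power = 0.80
alpha = 0.05
prior = 0.50  # proportion of tests where H1 is true

true_positives = power * (prior * n_tests)         # 0.8 * 100 = 80 true positives
false_positives = alpha * ((1 - prior) * n_tests)  # 0.05 * 100 = 5 false positives
false_discovery_rate = false_positives / (true_positives + false_positives)
print(round(false_discovery_rate, 4))  # 0.0588
```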
At the same time, the alpha of 5% guarantees that not more than 5% of all your studies will be Type 1 errors. This is also true in the Figure above. Of 200 studies, at most 0.05*200 = 10 will be false positives. This happens only when H0 is true for all 200 studies. In our situation, only 5 studies (2.5% of all studies) are Type 1 errors, which is indeed less than 5% of all the studies we’ve performed.
So what’s the problem? The problem is that you should not try to translate your Type 1 error rate into the evidential value of a single study. If you want to make a statement about a single p < 0.05 study representing a true effect, there is no way to quantify this without knowing the power in the studies where H1 is true, and the percentage of studies where H1 is true. P-values and evidential value are not completely unrelated, in the long run, but a single study won’t tell you a lot – especially when you investigate counterintuitive findings that are unlikely to be true.
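To see how much the evidential value of a single significant result depends on the prior probability of H1 and on power, we can wrap the same arithmetic in a function. The prior values below (0.5 for a plausible effect, 0.1 for a counterintuitive one) are hypothetical choices for illustration, not estimates from any real literature.

```python
def false_discovery_rate(prior, power, alpha=0.05):
    """Expected proportion of significant results that are false positives."""
    true_pos = power * prior          # rate of true positives among all tests
    false_pos = alpha * (1 - prior)   # rate of false positives among all tests
    return false_pos / (true_pos + false_pos)

# A well-powered test of a plausible hypothesis:
print(round(false_discovery_rate(prior=0.5, power=0.8), 3))  # 0.059
# A well-powered test of a counterintuitive hypothesis:
print(round(false_discovery_rate(prior=0.1, power=0.8), 3))  # 0.36
```

With the same alpha and the same power, a significant result for an unlikely hypothesis is false roughly a third of the time, which is the scenario behind Colquhoun’s 30% figure.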
So what should you do? The solution is to never say you’ve made a discovery based on a single p-value. This will not just make statisticians, but also philosophers of science, very happy. And instead of making a fool out of yourself perhaps as often as 30% of the time, you won't make a fool out of yourself at all.
A statistically significant difference might be ‘in line with’ predictions from a theory. After all, your theory predicts data patterns, and the p-value tells you the probability of observing data (or more extreme data), assuming the null hypothesis is true. ‘In line with’ is a nice way to talk about your results. It is not a quantifiable statement about your hypothesis (that would be silly, based on a p-value!), but it is a fair statement about your data.
P-values are important tools because they allow you to control error rates. Not the false discovery rate, but the false positive rate. If you do 200 studies in your life, and you control your error rates, you won't say that there is an effect, when there is no effect, more than 10 times (on average). That’s pretty sweet. Obviously, there are also Type 2 errors to take into account, which is why you should design high-powered studies, but that’s a different story.
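The long-run guarantee can be illustrated with a small simulation. Under H0 the p-value is uniformly distributed, so a study comes out significant with probability alpha; the worst case of 200 studies where H0 is always true is assumed here for illustration.

```python
import random

random.seed(1)  # reproducible sketch

n_studies, alpha, n_reps = 200, 0.05, 2000

# For each repetition, count how many of 200 null studies come out
# 'significant' (under H0, p < alpha happens with probability alpha).
avg_false_pos = sum(
    sum(random.random() < alpha for _ in range(n_studies))
    for _ in range(n_reps)
) / n_reps
print(avg_false_pos)  # close to 10, i.e. 0.05 * 200
```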
Some people recommend lowering p-value thresholds to as low as 0.001 before you announce a ‘discovery’ (I've already explained why we should ignore this), and others say we should get rid of p-values altogether. But I think we should get rid of ‘discovery’, and use p-values to control our error rates.
It’s difficult to know, for any single dataset, whether a significant effect is indicative of a true hypothesis. With Bayesian statistics, you can convince everyone who has the same priors. Or, you can collect such a huge amount of data that you can convince almost everyone (irrespective of their priors). But perhaps we should not try to get too much out of single studies. It might just be the case that, as long as we share all our results, a bunch of close replications extended with pre-registered novel predictions of a pattern in the data will be more useful for cumulative science than quantifying the likelihood that a single study provides support for a hypothesis. And if you agree we need multiple studies, you'd better control your Type 1 errors in the long run.