A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Friday, July 25, 2014

Why Psychologists Should Ignore Recommendations to Use α < .001



Recently, some statisticians have argued we should lower the widely used p < .05 threshold. David Colquhoun got me thinking about this by posting a manuscript here, but Valen Johnson’s paper in PNAS is probably better known. Both suggest a p < .001 threshold would lower the false discovery rate: the probability that an effect you conclude is true, based on a significant result, is actually false. The false discovery rate is 1 minus the positive predictive value (Ioannidis, 2005; see this earlier post for details).
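To make this concrete, here is a minimal sketch in Python of how the false discovery rate follows from the alpha level, the power, and the prior odds that a tested effect is true. The helper function and the example numbers are my own illustration of the Ioannidis-style calculation, not figures from Colquhoun or Johnson:

```python
# Minimal sketch: false discovery rate (FDR) as 1 minus the positive predictive value,
# following the Ioannidis (2005) formulation PPV = power * R / (power * R + alpha),
# where R is the prior odds that a tested effect is true.

def false_discovery_rate(alpha, power, prior_odds):
    """Proportion of significant findings that are false positives."""
    ppv = (power * prior_odds) / (power * prior_odds + alpha)
    return 1 - ppv

# Illustrative scenario: 1 in 4 tested effects is true (prior odds = 1/3), 80% power.
for alpha in (0.05, 0.001):
    print(alpha, round(false_discovery_rate(alpha, power=0.80, prior_odds=1/3), 3))
# alpha = .05 -> FDR ~ .16; alpha = .001 -> FDR ~ .004 in this scenario.
```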

Using p < .001 works to reduce the false discovery rate in much the same way as lowering the maximum speed to 10 kilometers an hour works to prevent lethal traffic accidents (assuming people adhere to speed limits). With such a threshold, it is extremely unlikely bad things will happen. It has a strong prevention focus, but forgoes a careful cost/benefit analysis of implementing such a threshold. (I’ll leave it up to you to ponder the consequences in the case of driving: in The Netherlands there were 570 traffic deaths in 2013 [not all of which would have been prevented by lowering the speed limit], and we apparently find this an acceptable price to pay for the benefit of being able to drive faster than 10 kilometers an hour.)

The cost of lowering the threshold for considering a difference as support for a hypothesis (see how hard I’m trying not to say ‘significant’?) is clear: we need larger samples to achieve the same level of power as we would with a p < .05 threshold. Colquhoun doesn’t discuss the consequence of having to increase sample sizes. Interestingly, he mentions power only when explaining why it is commonly set to 80%: “Clearly it would be better to have 99% [power] but that would often mean using an unfeasibly large sample size.” For an independent two-sided t-test on an effect expected to be of size d = 0.5 with α = .05, you need 64 participants in each cell for .80 power. For .99 power with α = .05, you need 148 participants in each cell. For .80 power with α = .001, you need 140 participants in each cell. So Colquhoun states that .99 power often requires ‘unfeasibly large sample sizes’, only to recommend p < .001, which often requires equally large sample sizes.
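These numbers are easy to check with standard power software; here is a minimal sketch of the calculation in Python (my own illustration, assuming statsmodels is installed):

```python
# Per-cell sample sizes for an independent two-sided t-test with d = 0.5,
# for the three scenarios discussed above (sketch; requires statsmodels).
from math import ceil
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
scenarios = [
    {"alpha": 0.05, "power": 0.80},   # ~64 per cell
    {"alpha": 0.05, "power": 0.99},   # ~148 per cell
    {"alpha": 0.001, "power": 0.80},  # ~140 per cell
]
for s in scenarios:
    n = analysis.solve_power(effect_size=0.5, alpha=s["alpha"], power=s["power"],
                             alternative="two-sided")
    print(f"alpha = {s['alpha']}, power = {s['power']}: {ceil(n)} participants per cell")
```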

Johnson discusses the required increase in sample sizes when lowering the threshold to p < .001: “To achieve 80% power in detecting a standardized effect size of 0.3 on a normal mean, for instance, decreasing the threshold for significance from 0.05 to 0.005 requires an increase in sample size from 69 to 130 in experimental designs. To obtain a highly significant result, the sample size of a design must be increased from 112 to 172.”

Perhaps that doesn’t sound so bad, but let me give you some more examples in the table below. As you see, decreasing the threshold from p < .05 to p < .001 requires approximately doubling the sample size.


            d = .3       d = .3       d = .5       d = .5       d = .8       d = .8
            power .80    power .90    power .80    power .90    power .80    power .90
α = .05     176          235          64           86           26           34
α = .001    383          468          140          170          57           69
Ratio       2.18         1.99         2.19         1.98         2.19         2.03

(Sample sizes are participants per cell for an independent two-sided t-test.)
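The table can be reproduced with the same kind of power calculation; here is a sketch (again using statsmodels and rounding up to whole participants, so values may differ by a participant or so from other power software):

```python
# Reproduce the table above: per-cell sample sizes for an independent two-sided
# t-test at alpha = .05 and alpha = .001, and their ratio (sketch; requires statsmodels).
from math import ceil
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.3, 0.5, 0.8):
    for power in (0.80, 0.90):
        n_05 = ceil(analysis.solve_power(effect_size=d, alpha=0.05, power=power))
        n_001 = ceil(analysis.solve_power(effect_size=d, alpha=0.001, power=power))
        print(f"d = {d}, power = {power}: n = {n_05} vs. {n_001}, ratio = {n_001 / n_05:.2f}")
```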

Now that we have surveyed the benefit (a lower false discovery rate) and the cost (doubling the sample size for independent t-tests; the costs are much higher when you examine interactions), let’s consider some alternatives and other considerations.

The first thing I want to note is how silent these statisticians are about problems associated with Type 2 errors. Why is it really bad to say something is true when it isn’t, but perfectly fine to have 80% power, which means a 20% chance of concluding there is nothing when there actually is something? Fiedler, Kutzner, and Krueger (2012) discuss this oversight in the wider discussion of false positives in psychology, although I prefer the discussion of this issue by Cohen (1988) himself. Cohen realized we would be using his minimum 80% power recommendation as a number outside of its context. So let me remind you that his reason for recommending 80% power was that he preferred a 1-to-4 balance between Type 1 and Type 2 errors. If you have a 5% false positive rate and a 20% Type 2 error rate (because of 80% power), this basically means you consider a Type 1 error four times more serious than a Type 2 error. By recommending p < .001 with 80% power, Colquhoun and Johnson are saying a Type 1 error is 200 times as bad as a Type 2 error. Cohen would not agree with this at all.
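As a quick back-of-the-envelope check of that claim (my own arithmetic, not a calculation from either paper), the implied weighting is simply the ratio of the Type 2 to the Type 1 error rate:

```python
# Implied relative seriousness of a Type 1 versus a Type 2 error: beta / alpha,
# with beta = 1 - power = .20 (back-of-the-envelope check).
beta = 0.20
print(beta / 0.05)   # 4.0   -> Cohen's 1-to-4 balance at alpha = .05 and 80% power
print(beta / 0.001)  # 200.0 -> alpha = .001 at 80% power weights a Type 1 error 200 times as heavily
```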

The second thing I want to note is that because you need to double the sample size when using α = .001, you might just as well perform two studies with p < .05. If you find a statistically significant difference from zero in both studies, the false discovery rate goes down substantially (for calculations, see my previous blog post). Doing two studies with p < .05 instead of one study with p < .001 has many benefits. For example, it allows you to generalize over samples and stimuli. This means you are giving the taxpayer more value for their money.
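To see why, here is a rough sketch (my own illustrative numbers, reusing the Ioannidis-style calculation from above, and treating two independent significant studies as a combined test in which both alpha and power multiply):

```python
# Rough comparison: FDR after one study at alpha = .05, after two independent studies
# that are both significant at alpha = .05, and after one study at alpha = .001.
# Illustrative assumptions: prior odds = 1/3 that the effect is true, 80% power per study.

def false_discovery_rate(alpha, power, prior_odds=1/3):
    ppv = (power * prior_odds) / (power * prior_odds + alpha)
    return 1 - ppv

print(false_discovery_rate(alpha=0.05, power=0.80))        # ~.16  (one study, p < .05)
print(false_discovery_rate(alpha=0.05**2, power=0.80**2))  # ~.01  (two studies, both p < .05)
print(false_discovery_rate(alpha=0.001, power=0.80))       # ~.004 (one study, p < .001)
```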

Finally, I don’t think single conclusive studies are the most feasible and efficient way to do science, at least in psychology. This model might work in medicine, where you sometimes really want to be sure a treatment is beneficial. But especially for more exploratory research (currently the default in psychology), an approach where you simply report everything and perform a meta-analysis over all studies is a much more feasible route to scientific progress. Otherwise, what should we do with studies that yield p = .02? Or even p = .08? I assume everyone agrees publication bias is a problem, and if we consider only studies with p < .001 worthy of publication, publication bias is likely to get much worse.

I think it is often smart to at least slightly lower the alpha level (say, to α = .025), because in principle I agree with the main problem Colquhoun and Johnson try to address: high p-values (those just below .05) are not very strong support for your hypothesis (see also Lakens & Evers, 2014). It’s just important that the solution to this problem is realistic instead of overly simplistic. In general, I don’t think fixed alpha levels are a good idea. Instead, you should select and pre-register the alpha level as a function of the sample size you can collect, the prior probability that the effect is true, and the balance between Type 1 and Type 2 errors you want to achieve (more on that in a future blog post). These types of discussions remind me that statistics is an applied science. If you want to make good recommendations, you need sufficient experience with a specific field of research, because every field has its own specific challenges.