It’s important to communicate any remaining uncertainty in the results of an experiment. P-values have been criticized for inviting dichotomous ‘real vs. not real’ judgments, and recently we’ve seen recommendations to stop reporting p-values and report 95% confidence intervals instead (e.g., Cumming, 2014). It’s important to realize that we are switching from an arguably limited and often misunderstood metric (the p-value) to another limited and often misunderstood metric (the confidence interval). One thing that strikes me as a benefit of continuing to report p-values and 95% CI side by side is that people have a gut feeling for p-values they don’t yet have for 95% CI. Their gut feeling is not statistically accurate, but it has, in general, been sufficient to lead to scientific progress. Confidence intervals provide more information, but in all fairness, they consist of three numbers (the confidence level and the two boundary values), so it’s no surprise the p-value can’t compete with that. What if we use p-values to communicate uncertainty?
Let’s say I have made up data for two imaginary independent groups of 50 people, with M1 = 6.25, SD1 = 2.23, and M2 = 7.24, SD2 = 2.46. If I report this in line with recent recommendations, I’d say something like: the means were different, t(98) = 2.09, Cohen’s d = 0.42, 95% CI [0.02, 0.81]. This means that in the long run, 95% of the confidence intervals calculated for close replications (with the same sample size!) will contain the true effect size. To me, this all sounds pretty good. There’s a difference, there are two boundary values that do not include zero, and I can say something about what will happen in 95% of future studies I run.
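The numbers above can be reproduced from the summary statistics alone. Here is a minimal Python sketch (using scipy, which is not what the post itself used) that computes the t-statistic, Cohen’s d, and an approximate 95% CI around d; note that exact CIs around d use the noncentral t distribution, so the simpler normal approximation below only comes close to the reported boundaries:

```python
import math
from scipy import stats

# Made-up summary statistics from the post: two independent groups, n = 50 each
n1 = n2 = 50
m1, sd1 = 6.25, 2.23
m2, sd2 = 7.24, 2.46

# Pooled standard deviation and Student's t for independent groups
sd_pooled = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
t = (m2 - m1) / (sd_pooled * math.sqrt(1 / n1 + 1 / n2))
df = n1 + n2 - 2
p = 2 * stats.t.sf(abs(t), df)  # two-sided p-value

# Cohen's d and a normal-approximation 95% CI around it
# (an approximation; exact intervals use the noncentral t distribution)
d = (m2 - m1) / sd_pooled
se_d = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
ci = (d - 1.96 * se_d, d + 1.96 * se_d)

print(f"t({df}) = {t:.2f}, p = {p:.3f}, "
      f"Cohen's d = {d:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

Small rounding differences aside, this lands on the values reported in the text.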
I could also say: the means were different, t(98) = 2.09, p = .039, Cohen’s d = 0.42. In approximately 83.4% of close replication studies we can expect a Cohen’s d between 0.02 and 0.81, and thus a p-value (given our sample size of 50 per condition) between .0001 and .92. To me, this sounds pretty bad. It tells me that only 83.4% (or roughly five out of six) of studies will observe an effect that, with a sample size of 50 in each condition, will provide a p-value that falls somewhere between .0001 and .92. Furthermore, I know that relatively high p-values are at best very weak support for the alternative hypothesis (e.g., Lakens & Evers, 2014), and seeing that p-value is useful to gauge how confident I should be in the observed effect (i.e., not very). If we want to communicate uncertainty, I have to say this second formulation is doing a much better job, for me.
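The p-value range quoted above follows directly from the CI boundaries: an effect of a given size, observed with 50 participants per condition, implies a specific t-value and hence a specific p-value. A small Python sketch (scipy-based, my own illustration rather than anything from the post) makes the mapping explicit:

```python
import math
from scipy import stats

n1 = n2 = 50  # sample size per condition, as in the example

def p_for_d(d, n1=n1, n2=n2):
    """Two-sided p-value for an observed Cohen's d in an independent t-test."""
    t = d / math.sqrt(1 / n1 + 1 / n2)  # equals d * sqrt(n / 2) for equal n
    return 2 * stats.t.sf(abs(t), n1 + n2 - 2)

# p-values at the boundaries of the 95% CI around d, [0.02, 0.81]
p_low_d = p_for_d(0.02)   # roughly .92
p_high_d = p_for_d(0.81)  # roughly .0001
print(f"d = 0.02 -> p = {p_low_d:.2f}")
print(f"d = 0.81 -> p = {p_high_d:.4f}")
```

So a replication landing at the lower CI boundary would look like strong evidence for the null, and one at the upper boundary like overwhelming evidence against it, which is exactly the uncertainty the text describes.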
If we had observed the same effect size (d = 0.42) but with 1000 participants in each of the two groups, we would be considerably more certain there was a real difference. We could say: t(1998) = 9.36, Cohen’s d = 0.42, 95% CI [0.33, 0.51]. To be honest, I’m not feeling this huge increase in certainty. I feel I know the effect with more precision, and I might feel a little more certain, but this increase in certainty is difficult to quantify. I could also say: t(1998) = 9.36, p < .001, Cohen’s d = 0.42. In approximately 83.4% of close replication studies we can expect a p < .001 (actually, any Cohen’s d higher than 0.14 would yield a p < .001 with 2000 participants).
Note that if you report a 99% CI (which would not be weird if you have 2000 participants), you could report Cohen’s d = 0.42, 99% CI [0.30, 0.54]. You could also say that on average, in 93.1% of close replications, the p-value will be smaller than .001. To me, this really drives home a point. With 99% CIs the boundaries around the observed effect size become slightly wider, but I don’t have a good feeling for how much a 99% CI [0.30, 0.54] is an indication of more certainty than a 95% CI [0.33, 0.51]. However, I do understand that 93.1% of replications yielding a p < .001 is much better than 83.4% of replications yielding a p < .001. (See Cumming & Maillardet, 2006, for the explanation of why 83.4% and 93.1% of studies will yield a Cohen’s d that falls within the 95% or 99% CI of a single study.)
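The 83.4% and 93.1% capture percentages can be computed from the logic in Cumming & Maillardet (2006): both the original estimate and a replication’s estimate carry sampling error, so their difference is sqrt(2) times as variable as either estimate alone. Here is a short Python sketch of that calculation (a normal approximation of my own construction, not code from the paper):

```python
import math
from scipy import stats

def capture_percentage(conf_level):
    """Expected % of close-replication effect sizes falling inside the
    original study's confidence interval.

    The replication estimate and the CI's center each have the same sampling
    error, so their difference has a standard deviation sqrt(2) times larger
    than a single estimate (normal approximation, Cumming & Maillardet, 2006).
    """
    z_crit = stats.norm.ppf(1 - (1 - conf_level) / 2)
    return 100 * (2 * stats.norm.cdf(z_crit / math.sqrt(2)) - 1)

cap95 = capture_percentage(0.95)  # ~83.4
cap99 = capture_percentage(0.99)  # ~93.1
print(f"A 95% CI captures ~{cap95:.1f}% of replication effect sizes")
print(f"A 99% CI captures ~{cap99:.1f}% of replication effect sizes")
```

Note that a 95% CI captures noticeably fewer than 95% of replication outcomes: the often-repeated intuition that "95% of replications will land inside my CI" is exactly the misunderstanding this arithmetic corrects.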
Given time, I might feel a specific 95% CI will be capable of driving home the point in exactly the same manner, but for now, it doesn’t. That’s why I think it’s a good move to keep reporting p-values alongside 95% CI. If you teach the meaning of 95% CI to students who already have a feel for p-values, you might actually want to go back and forth between p-values and 95% CI to improve their understanding of what a 95% CI means. People who are enthusiastic about 95% CI might shiver at this suggestion, but I sincerely wonder whether I will ever feel the difference between a 99% CI [0.30, 0.54] and a 95% CI [0.33, 0.51]. I’d also be more than happy to hear how I can accurately gauge relative differences in the remaining uncertainty communicated by confidence intervals without having to rely on something as horribly subjective and erroneous as my gut feeling.
P.S. I only now realize that ESCI by Geoff Cumming only allows you to calculate a 95% CI around Cohen’s d if you have fewer than 100 subjects in the two conditions of an independent t-test, which is quite a limitation. I’ve calculated the 95% CI around d using these excellent instructions and SPSS files by Karl Wuensch.