The journal Basic and Applied
Social Psychology banned the p-value
in 2015, after Trafimow (2014) had explained in an editorial a year earlier that
inferential statistics were no longer required. In the 2014 editorial, Trafimow
writes: “The null hypothesis
significance procedure has been shown to be logically invalid and to provide
little information about the actual likelihood of either the null or experimental
hypothesis (see Trafimow, 2003; Trafimow & Rice, 2009)”. The goal of
this blog post is to explain why the arguments put forward in Trafimow &
Rice (2009) are incorrect. Their simulations illustrate how meaningless questions provide meaningless answers, but they do not reveal a problem with p-values. Editors can do with their journal as they like - even ban p-values. But if
the simulations upon which such a ban is based are meaningless, the ban itself
becomes meaningless.
To calculate the probability that the
null-hypothesis is true, given some data we have collected, we need to use Bayes’
formula. Cohen (1994) shows how the posterior probability of the
null-hypothesis, given a statistically
significant result (the data), can be calculated based on a formula that is
a poor man’s Bayesian updating function. Instead of creating distributions
around parameters, his approach simply uses the p-value of a test (which is related to the observed data), the
power of the study, and the prior probability the null-hypothesis is true, to
calculate the posterior probability H0 is true, given the observed data. Before
we look at the formula, some definitions:
P(H0)
is the prior probability (P) the null hypothesis (H0)
is true.
P(H1) is the prior probability (P) the alternative hypothesis (H1) is true. Since I’ll
be considering only a single alternative hypothesis here, either the null
hypothesis or the alternative hypothesis is true, and thus P(H1) = 1 - P(H0). We
will use 1 - P(H0) in the formula below.
P(D|H0) is the probability (P) of the data (D), or more extreme data, given that the null hypothesis
(H0) is true. In Cohen’s approach, this is the p-value of a study.
P(D|-H0) is the probability of the data (a significant result), given that H0 is
not true, or when the alternative
hypothesis is true. This is the
statistical power of a study.
P(H0|D) is the probability of the null-hypothesis, given the data. This is our
posterior belief in the null-hypothesis, after the data has been collected.
According to Cohen (1994), it’s what we really want to know. People often
mistake the p-value for the
probability that the null-hypothesis is true.
If we ignore the prior probability
for a moment, the formula in Cohen (1994) is simply:
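Written out in the notation defined above (this is a reconstruction from those definitions rather than a quotation from Cohen, with ¬H0 standing for the "-H0" used in the text), the prior-free version would be:

$$
P(H_0 \mid D) = \frac{P(D \mid H_0)}{P(D \mid H_0) + P(D \mid \neg H_0)}
$$

In words: the p-value divided by the sum of the p-value and the power.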
More formally, and including the
prior probabilities, the formula is:
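Assembled from the definitions above (an assumption-based reconstruction, with ¬H0 standing for the "-H0" used in the text, and 1 - P(H0) substituted for P(H1)), Bayes' formula with the priors included would read:

$$
P(H_0 \mid D) = \frac{P(D \mid H_0)\, P(H_0)}{P(D \mid H_0)\, P(H_0) + P(D \mid \neg H_0)\,\bigl(1 - P(H_0)\bigr)}
$$

When P(H0) = 0.5 the prior terms cancel, and only the p-value and the power remain.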
In the numerator, we calculate the
probability that we observed a significant p-value
when the null hypothesis is true, and divide it by the total probability of
finding a significant p-value when either
the null-hypothesis is true or the alternative hypothesis is true. The formula
shows that the lower the p-value in
the numerator, and the higher the power, the lower the probability of the
null-hypothesis, given the significant result you have observed. Both the p-value and the power depend on
the sample size, and the formula gives an indication why larger
sample sizes mean more informative studies.
How are p and P(H0|D) related?
Trafimow and Rice (2009) used the
same formula mentioned in Cohen (1994) to calculate P(H0|D) and to examine whether p-values drawn from a uniform
distribution between 0 and 1 were linearly correlated with P(H0|D). In their
simulations, the value for the power of the study is also drawn from a uniform
distribution, as is the prior P(H0). Thus, all three variables in the formula
are randomly drawn from a uniform distribution. Trafimow & Rice (2009) provide
an example (I stick to the more commonly used D for the data, where they use F):
“For example, the first
data set contained the random values .540 [p(F|H0)], .712 [p(H0)],
and .185 [p(F|–H0)]. These values were used in Bayes’ formula to derive p(H0|F)
= .880.”
The first part of the R code below
reproduces this example. The remainder of the R script reproduces the
simulations reported by Trafimow and Rice (2009). In the simulation, Trafimow
and Rice (2009) draw p-values from a
uniform distribution. I simulate data when the true effect size is 0, which
also implies p-values are uniformly
distributed.
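The script itself is not reproduced here. As a rough stand-in, the following is a minimal Python sketch (not the original R code) of the two steps just described: the worked example from Trafimow and Rice (2009), and a simulation in which the p-value, the prior, and the power are all independent uniform draws. The function names, the seed, and the number of draws are my own assumptions, and the p-values are drawn directly from a uniform distribution rather than obtained from simulated tests.

```python
import random


def posterior_h0(p_value, prior_h0, power):
    """P(H0|D) from Bayes' formula, using the p-value as P(D|H0) and power as P(D|-H0)."""
    numerator = p_value * prior_h0
    return numerator / (numerator + power * (1.0 - prior_h0))


def pearson_r(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Worked example quoted from Trafimow and Rice (2009).
example = posterior_h0(p_value=0.540, prior_h0=0.712, power=0.185)
print(round(example, 3))  # 0.878, close to the .880 they report

# Simulation: all three inputs are independent uniform draws.
rng = random.Random(2009)  # arbitrary seed, only for reproducibility
n_sims = 100_000           # assumed number of draws
p_values, posteriors = [], []
for _ in range(n_sims):
    p = rng.random()
    prior = rng.random()
    power = rng.random()
    if p * prior + power * (1.0 - prior) == 0.0:
        continue  # skip the (practically impossible) degenerate draw
    p_values.append(p)
    posteriors.append(posterior_h0(p, prior, power))

r = pearson_r(p_values, posteriors)
print(round(r, 2))
```

The printed correlation should land in the neighbourhood of the values discussed in the text; the exact figure depends on the seed and the number of draws.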
The correlation between p-values and the probability the
null-hypothesis is true, given the data (P(H0|D)), is 0.37. This is a bit lower
than the correlation of 0.396 reported by Trafimow and Rice (2009). The most
likely reason for this is that they used Excel, which has a
faulty random number generator that should not be used for simulations.
Although Trafimow and Rice (2009) say that “The
present authors’ main goal was to test the size of the alleged correlation. To
date, no other researchers have done so”, we already find in Krueger (2001):
“Second, P(D|H0) and P(H0|D) are correlated (r = .38)”
which was also based on a simulation in Excel (Krueger, personal
communication). So, it seems Krueger was the first to examine this correlation,
and the estimate of r = 0.37 is most likely correct. Figure 1 presents a plot
of the simulated data.
It is difficult to
discern the pattern in Figure 1. Based on the low correlation
of 0.37, Trafimow & Rice (2009, p. 266) remark that this result “fails to provide
a compelling justification for computing p values”, and it “does not
constitute a compelling justification for their routine use in social science
research”. But they are wrong.
They also note that the correlation accounts for only 16% of the variance (0.396² ≈ 0.16),
which is what you get when calculating a linear correlation coefficient
for values that are nonlinearly related, as we will see below. The only
conclusion we can draw based on this simulation is that the authors asked a
meaningless question (calculating a linear correlation coefficient), which they
tried to answer with a simulation in which it is impossible to see the pattern they
are actually interested in.
Using a fixed P(H0) and power
The low correlation is
not due to the ‘poorness’ (Trafimow &
Rice, 2009, p. 264) of the relation between
the p-value and P(H0|D), which is, as
I will show below, perfectly predictable, but to their choice to draw
random values for P(D|-H0) and P(H0). If we fix these values (to any value you deem reasonable) we
can see the p-value and P(H0|D) are directly related. In Figure 2, the prior probability of H0 is fixed to 0.1, 0.5,
or 0.9, and the power (P(D|-H0)) is also fixed to 0.1, 0.5, or 0.9. These plots show that the p-value and P(H0|D) are directly related: for a given prior and power, P(H0|D) is a smooth, strictly increasing function of the p-value.
Lower p-values always mean P(H0|D) is lower,
compared to higher p-values. It’s
important to remember that significant p-values
(left of the vertical red line) don’t necessarily mean that
H0 is less likely than
H1 (see the bottom-left plot, where P(H0|D) is
larger than 0.5 after a significant effect of p = 0.049). The horizontal red lines indicate the prior probability
that the null hypothesis is true. We see that high p-values make H0 more likely than it was a priori (but
sometimes the change in probability is rather unimpressive), and low p-values make it less
likely.
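The pattern described for the fixed-value plots can be checked numerically. The sketch below is a minimal Python illustration (the grid of p-values and all names are my own assumptions, and no plotting is attempted): it computes P(H0|D) for the nine combinations of prior and power, confirms that every curve rises with the p-value, and evaluates p = 0.049 for a high prior combined with low power.

```python
def posterior_h0(p_value, prior_h0, power):
    """P(H0|D) from Bayes' formula, using the p-value as P(D|H0) and power as P(D|-H0)."""
    numerator = p_value * prior_h0
    return numerator / (numerator + power * (1.0 - prior_h0))


levels = (0.1, 0.5, 0.9)                     # fixed values used for prior and power
p_grid = [i / 1000 for i in range(1, 1000)]  # p = 0.001, 0.002, ..., 0.999

# One curve of P(H0|D) against p for each (prior, power) combination.
curves = {
    (prior, power): [posterior_h0(p, prior, power) for p in p_grid]
    for prior in levels
    for power in levels
}

# Every curve is strictly increasing: lower p always means lower P(H0|D).
all_increasing = all(
    all(a < b for a, b in zip(curve, curve[1:]))
    for curve in curves.values()
)
print(all_increasing)

# A significant p-value does not guarantee that H0 is less likely than H1.
high_prior_low_power = posterior_h0(0.049, prior_h0=0.9, power=0.1)
print(round(high_prior_low_power, 3))  # 0.815, still above 0.5
```

One property follows directly from the algebra of this formula: the posterior equals the prior exactly when the p-value equals the power, so p-values above the power raise P(H0|D) above the prior and p-values below it lower P(H0|D).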
The problem Trafimow
and Rice have identified is not that p-values
are meaningless, but that large simulations where random values are chosen for
the prior probability and the power do not clearly reveal a relationship
between p-values and P(H0|D), and
that quantifying the relation between two variables with an inappropriate linear
term does not explain a lot of variation. Figure 1 consists of many single
points from randomly chosen curves like those shown in Figure 2.
No need to ban p-values
This poor man’s Bayesian updating
function lacks some of the finesse a real Bayesian updating procedure has. For
example, it treats a p = 0.049 as an outcome that has a 5% probability of being
observed when there is a true effect. This dichotomous thinking keeps things
simple, but it’s also incorrect, because in a high-powered experiment, a p =
0.049 will rarely be observed (p-values
< 0.01 are massively more likely). More sensitive updating functions,
such as true Bayesian statistics, will allow you to evaluate outcomes more
precisely. Obviously, any weakness in the poor man’s Bayesian updating formula
also applies to its use to criticize the relation between p-values and the posterior probability the null-hypothesis is true,
as Trafimow and Rice (2009) have done (also see the P.S.).
Significant p-values generally make the null-hypothesis less likely, as long as
the alpha level is chosen sensibly. When the sample size is very large, and the
statistical power is very high, choosing an alpha level of 0.05 can lead to
situations where a p-value just below
0.05 (such as p = 0.049) is actually more likely to be observed when the null-hypothesis is
true, than when the alternative hypothesis is true (Lakens & Evers, 2014).
Researchers who have compared true Bayesian statistics with p-values acknowledge they will often lead to similar conclusions, but recommend decreasing the alpha level as a function of the sample size (e.g., Cameron
& Trivedi, 2005, p. 279; Good, 1982; Zellner, 1971, p. 304). Some
recommendations have been put forward, but these have not yet been evaluated
extensively. For now, researchers are simply advised to use their own
judgment when setting their alpha level for analyses where sample sizes are
large, or statistical power is very high. Alternatively, researchers might opt
to never draw conclusions about the evidence for or against the
null-hypothesis, and simply aim to control their error rates in lines of
research that test theoretical predictions, following a Neyman-Pearson
perspective on statistical inferences.
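The point about very high power can be made concrete with a small calculation. The Python sketch below is an illustration under assumptions of my own, not an analysis from the cited papers: it assumes a two-sided z-test whose test statistic is normal with unit variance, and asks how likely a p-value between 0.04 and 0.05 is under the null and under alternatives with 80% and 99% power. The function names are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def prob_p_between(p_low, p_high, delta):
    """P(p_low < p < p_high) for a two-sided z-test with statistic ~ N(delta, 1)."""
    z_near = std_normal.inv_cdf(1 - p_high / 2)  # |z| that gives p = p_high
    z_far = std_normal.inv_cdf(1 - p_low / 2)    # |z| that gives p = p_low
    stat = NormalDist(mu=delta, sigma=1.0)
    upper_tail = stat.cdf(z_far) - stat.cdf(z_near)
    lower_tail = stat.cdf(-z_near) - stat.cdf(-z_far)
    return upper_tail + lower_tail


def delta_for_power(power, alpha=0.05):
    """Approximate shift of the test statistic for a given power (ignores the far tail)."""
    return std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)


under_h0 = prob_p_between(0.04, 0.05, delta=0.0)
under_h1_80 = prob_p_between(0.04, 0.05, delta=delta_for_power(0.80))
under_h1_99 = prob_p_between(0.04, 0.05, delta=delta_for_power(0.99))

print(round(under_h0, 4), round(under_h1_80, 4), round(under_h1_99, 4))
```

Under these assumptions, a p-value in that interval has a probability of 0.01 under the null, roughly 0.027 with 80% power, and roughly 0.003 with 99% power. With very high power, a result just below 0.05 is therefore more probable under the null than under the alternative, even though it is "significant".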
If you really want to make statements
about the probability the null-hypothesis is true, given the data, p-values are not the tool of choice
(Bayesian statistics is). But p-values
are related to evidence (Good, 1992), and in exploratory research where priors
are uncertain and the power of the test is unknown, p-values might be something to fall back on. There is
absolutely no reason to dismiss or even ban them because of the ‘poorness’ of the
relation between p-values and the
posterior probability that the null-hypothesis is true. What is needed is a
better understanding of the relationship between p-values and the probability the null-hypothesis is true, achieved by educating
all researchers how to correctly interpret p-values.
P.S.
This might be a good moment to note that
Trafimow and Rice (2009) calculate posterior probabilities of p-values assuming the alternative
hypothesis is true, but simulate studies with uniform p-values, meaning that the null-hypothesis is true. This is
somewhat peculiar. Hagen (1997) explains the
correct updating formula under
the assumption that the null-hypothesis is true. He proposes to use exact p-values, when the null hypothesis is
true, but I disagree. When the null-hypothesis is true, every p-value is equally likely, and thus I
would calculate P(H0|D) using:
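The equation is not reproduced here. A plausible reconstruction, assuming the data D is a non-significant result, so that P(D|H0) = 1 - α (with α the alpha level) and P(D|-H0) = β = 1 - power, would be:

$$
P(H_0 \mid D) = \frac{(1 - \alpha)\, P(H_0)}{(1 - \alpha)\, P(H_0) + \beta\,\bigl(1 - P(H_0)\bigr)}
$$

The symbols α and β do not appear elsewhere in the text; they are introduced here only to write the non-significant case compactly.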
Assuming H0 and H1 are a-priori
equally likely, the formula simplifies to:
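Assuming the data is a non-significant result, with α the alpha level and β = 1 - power (my notation, and a reconstruction rather than the original equation), setting P(H0) = 0.5 makes the priors cancel and leaves:

$$
P(H_0 \mid D) = \frac{1 - \alpha}{(1 - \alpha) + \beta}
$$

Because β shrinks as power grows, the fraction moves towards 1 for high-powered studies.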
This formula shows how the
null-hypothesis can become more likely when a non-significant result is observed, contrary to the popular belief that non-significant findings don't tell you anything about the likelihood that the null-hypothesis is true. It does so not as a function of the p-value you observe (after all, p-values are
uniformly distributed when the null-hypothesis is true, so every p-value
is equally uninformative), but through Bayes’ formula. The higher the power of
a study, the more likely the null-hypothesis becomes after a non-significant
result.
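A small numerical illustration of this last claim, as a Python sketch. It rests on an assumption about the form of the updating formula: that a non-significant result has probability 1 - alpha when the null-hypothesis is true and 1 - power when the alternative is true. The function name and the default values are hypothetical.

```python
def posterior_h0_nonsignificant(power, alpha=0.05, prior_h0=0.5):
    """P(H0 | non-significant result), assuming P(D|H0) = 1 - alpha and P(D|-H0) = 1 - power."""
    beta = 1.0 - power
    numerator = (1.0 - alpha) * prior_h0
    return numerator / (numerator + beta * (1.0 - prior_h0))


# Equal priors, alpha = 0.05, three levels of power.
posteriors = {power: posterior_h0_nonsignificant(power) for power in (0.1, 0.5, 0.9)}
for power, value in posteriors.items():
    print(power, round(value, 3))
```

With equal priors and an alpha of 0.05, this gives about 0.51 for 10% power, 0.66 for 50% power, and 0.90 for 90% power: a non-significant result from a low-powered study barely moves the prior, while one from a high-powered study moves it a lot.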
References
Cameron, A. C., & Trivedi, P. K. (2005). Microeconometrics: Methods and Applications. New York: Cambridge University Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Good, I. J. (1982). Standardized tail-area probabilities. Journal of Statistical Computation and Simulation, 16, 65-66.
Good, I. J. (1992). The Bayes/non-Bayes compromise: A brief review. Journal of the American Statistical Association, 87, 597-606.
Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.
Lakens, D., & Evers, E. (2014). Sailing from the seas of chaos into the corridor of stability: Practical recommendations to increase the informational value of studies. Perspectives on Psychological Science, 9, 278-292. DOI: 10.1177/1745691614528520.
Lew, M. J. (2013). To P or not to P: on the evidential nature of P-values and their place in scientific inference. arXiv:1311.0081.
Trafimow, D. (2014). Editorial. Basic and Applied Social Psychology, 36, 1–2.
Trafimow, D., & Marks, M. (2015). Editorial. Basic and Applied Social Psychology, 37, 1–2.
Trafimow, D., & Rice, S. (2009). A test of the null hypothesis significance testing procedure correlation argument. The Journal of General Psychology, 136, 261-270.
Zellner, A. (1971). An introduction to Bayesian inference in econometrics. New York: John Wiley.