The null-hypothesis assumes that the difference between the means of the two populations is exactly zero. However, the means of samples drawn from these two populations vary from sample to sample (and the less data you have, the greater the variance). If the null is true, the difference between the two sample means will get closer and closer to zero as the sample size approaches infinity. This is a core assumption in Frequentist approaches to statistics. It’s therefore not important that the observed difference in your sample isn’t exactly zero, as long as the difference in the population is zero.
Some researchers, such as Cohen (1990), have expressed doubt that the difference in the population is ever exactly zero. As Cohen says:
The null hypothesis, taken literally (and that's the only way you can take it in formal hypothesis testing), is always false in the real world. It can only be true in the bowels of a computer processor running a Monte Carlo study (and even then a stray electron may make it false). If it is false, even to a tiny degree, it must be the case that a large enough sample will produce a significant result and lead to its rejection. So if the null is always false, what’s the big deal about rejecting it? (p. 1308).
One ‘big deal’ about rejecting it is that to detect a tiny difference (e.g., a Cohen’s d of 0.001) you need a sample of at least 31 million participants to have a decent chance of observing a statistically significant difference in a t-test. With such sample sizes, almost all statistics we use (e.g., checks for normality) break down and start to return meaningless results.
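To see where a number like 31 million comes from, here is a rough sketch of the sample size calculation. It rests on assumptions that are mine and not stated above: a two-sided test with an alpha of .05, 80% power as the ‘decent chance’, two equally sized groups, and a normal approximation instead of the exact t-distribution (which makes no practical difference at this scale). The function name is made up for this illustration.

```python
import math
from statistics import NormalDist


def total_n_for_two_group_test(d, alpha=0.05, power=0.80):
    """Approximate total N (two equal groups) needed to detect effect size d.

    Assumptions (not from the original post): two-sided test,
    normal approximation to the t-test, equal group sizes.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the desired power
    n_per_group = 2 * ((z_alpha + z_power) / d) ** 2
    return 2 * math.ceil(n_per_group)


if __name__ == "__main__":
    for d in (0.5, 0.1, 0.001):
        print(f"d = {d}: total N of about {total_n_for_two_group_test(d):,}")
```

Under these assumptions, a d of 0.001 requires a bit more than 31 million participants in total. Asking for more power, or a stricter alpha, pushes the number up further.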
Another ‘big deal’ is that we don’t know whether the observed difference will remain equally large as the sample size increases (as should happen when it is an accurately measured true effect), or whether it will become smaller and smaller as more measurements are added, without ever becoming statistically significant (as should happen when there is actually no effect). Hagen (1997) explains this latter situation in his article ‘In Praise of the Null Hypothesis Statistical Test’ to prevent people from mistakenly assuming that every observed difference will become significant if you simply add participants. He writes:
‘Thus, although it may appear that larger and larger Ns are chasing smaller and smaller differences, when the null is true, the variance of the test statistic, which is doing the chasing, is a function of the variance of the differences it is chasing. Thus, the "chaser" never gets any closer to the "chasee."’
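Hagen’s point can be checked with a small simulation. The sketch below is my own illustration, not Hagen’s: it draws two groups from the same normal population (so the null is true), and reports the average absolute difference between the sample means and the average absolute test statistic for increasing sample sizes. The function names, the sample sizes, the number of repetitions, and the use of a large-sample z-like statistic are all assumptions made for this example.

```python
import math
import random


def mean_and_variance(values):
    """Sample mean and unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, variance


def simulate_null(n, reps=200, seed=1):
    """Average |mean difference| and average |test statistic| when the null is true."""
    rng = random.Random(seed)
    total_abs_diff = 0.0
    total_abs_stat = 0.0
    for _ in range(reps):
        # Both groups come from the same population: there is no true effect.
        a = [rng.gauss(0, 1) for _ in range(n)]
        b = [rng.gauss(0, 1) for _ in range(n)]
        mean_a, var_a = mean_and_variance(a)
        mean_b, var_b = mean_and_variance(b)
        diff = mean_a - mean_b
        standard_error = math.sqrt(var_a / n + var_b / n)
        total_abs_diff += abs(diff)
        total_abs_stat += abs(diff / standard_error)
    return total_abs_diff / reps, total_abs_stat / reps


if __name__ == "__main__":
    for n in (50, 500, 5000):
        abs_diff, abs_stat = simulate_null(n)
        print(f"n per group = {n}: mean |difference| = {abs_diff:.4f}, "
              f"mean |test statistic| = {abs_stat:.2f}")
```

The observed difference shrinks as the sample grows, but the standard error shrinks at the same rate, so the test statistic stays where it was. The chaser never catches the chasee.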
What’s a ‘real’ effect?
The more important question is whether it is true that there are always real differences in the real world, and what the ‘real world’ is. Let’s consider the population of people in the real world. While you read this sentence, some individuals in this population have died, and some were born. For most questions in psychology, the population is surprisingly similar to an eternally running Monte Carlo simulation. Even if you could measure all people in the world in a millisecond, and the test-retest correlation was perfect, the answer you would get now would be different from the answer you would get in an hour. Frequentists (the people who use NHST) are not specifically interested in the exact value now, or in one hour, or next Thursday, but in the average value in the ‘long’ run. The value in the real world today might never be zero, but it’s never anything, because it’s continuously changing. If we want to make generalizable statements about the world, I think the fact that the null-hypothesis is never precisely true at any specific moment is not a problem. I’ll ignore more complex questions for now, such as how we can establish whether effects vary over time.
When perfect randomization to conditions is possible, and the null-hypothesis is true, every p-value is equally likely. There is a great blog post by Jim Grange explaining, using simulations in R, that p-values are uniformly distributed if the null is true. Take the script from his blog, and change the sample size (e.g., to 100000 in each group), or change the variances, and as long as the means of the two groups remain identical, p-values will be uniformly distributed. Although it is theoretically possible that differences are randomly fluctuating around zero in the long term, some researchers have argued this is often not true. Especially in correlational research, or in any situation where participants are not randomly assigned to conditions, this is a real problem.
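If you don’t have R at hand, the uniform distribution of p-values under the null can also be checked with a few lines of Python. To be clear, this is not Jim Grange’s script: it is a minimal sketch using only the standard library, and it swaps the t-test for a large-sample approximation (an unequal-variance test statistic compared against the normal distribution). The function names, group size, standard deviations, and number of simulated studies are all choices made for this illustration.

```python
import math
import random


def two_sided_p(a, b):
    """Two-sided p-value for a difference in means (large-sample normal approximation)."""
    n_a, n_b = len(a), len(b)
    mean_a, mean_b = sum(a) / n_a, sum(b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (n_b - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / n_a + var_b / n_b)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def simulate_p_values(n=100, reps=1000, sd_a=1.0, sd_b=1.0, seed=2):
    """Simulate studies in which both groups have the same mean (the null is true)."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(reps):
        a = [rng.gauss(0, sd_a) for _ in range(n)]
        b = [rng.gauss(0, sd_b) for _ in range(n)]
        p_values.append(two_sided_p(a, b))
    return p_values


def decile_counts(p_values):
    """Count p-values in ten equally wide bins from 0 to 1."""
    counts = [0] * 10
    for p in p_values:
        counts[min(int(p * 10), 9)] += 1
    return counts


if __name__ == "__main__":
    p_values = simulate_p_values()
    for i, count in enumerate(decile_counts(p_values)):
        print(f"{i / 10:.1f} to {(i + 1) / 10:.1f}: {count}")
    share = sum(p < 0.05 for p in p_values) / len(p_values)
    print(f"share of p < .05: {share:.3f}")
```

Each bin should contain roughly a tenth of the simulated studies, and about 5% of the p-values should fall below .05. Changing n, or setting sd_a and sd_b to different values, should not change that picture as long as the two means stay identical.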
Meehl talks about how in psychology every individual-difference variable (e.g., trait, status, demographic) correlates with every other variable, which means the null is practically never true. In these situations, it’s not that testing against the null-hypothesis is meaningless, but it’s not informative. If everything correlates with everything else, you need to create good models, and test those. A simple null-hypothesis significance test will not get you very far. I agree.
Random Assignment vs. Crud
To illustrate when NHST can be used as a source of information in large samples, and when NHST is not informative in large samples, I’ll analyze data from a large dataset with 6344 participants from the Many Labs project. I’ve analyzed 10 dependent variables to see whether they were influenced by A) Gender, and B) Assignment to the high or low anchoring condition in the first study. Gender is a measured individual difference variable, and not a manipulated variable, and might thus be affected by what Meehl calls the crud factor. Here, I want to illustrate that the crud factor A) probably often plays a role for individual difference variables, but perhaps not always, and B) probably never plays a role when analyzing differences between groups that individuals were randomly assigned to.
You can download the CleanedData.sav Many Labs Data here, and my analysis syntax here. I perform 8 t-tests and 2 Chi-square tests on 10 dependent variables, while the factor is either gender, or the random assignment to the high or low condition for the first question in the anchoring paradigm. You can download the output here. When we analyze the 10 dependent variables as a function of the anchoring condition, none of the differences are statistically significant (even though there are more than 6000 participants). You can play around with the script, repeating the analysis for the conditions related to the other three anchoring questions (remember to correct for multiple comparisons if you perform many tests), and see how randomization does a pretty good job at returning non-significant results even in very large samples. If the null is always false, it is remarkably difficult to reject. Obviously, when we analyze the answer people gave on the first anchoring question, we find a huge effect of the high vs. low anchoring condition they were randomly assigned to. Here, NHST works. There is probably something going on. If the anchoring effect were a completely novel phenomenon, this would be an important first finding, to be followed by replications and extensions, and finally model building and testing.
The results change dramatically if we use Gender as a factor. There are Gender effects on dependent variables related to quote attribution, system justification, the gambler’s fallacy, imagined contact, the explicit evaluation of arts and math, and the norm of reciprocity. There are no significant differences in political identification (as conservative or liberal), on the response scale manipulation, or on gain vs. loss framing (even though p = .025, with 5500 participants such a high p-value is stronger support for the null-hypothesis than for the alternative hypothesis). It’s surprising that the null-hypothesis (gender does not influence the responses participants give) is rejected for seven out of ten effects. Personally (perhaps because I’ve got very little expertise in gender effects) I was actually extremely surprised, even though the effects are small (with Cohen’s d values of around 0.09). This, ironically, shows that NHST works - I've learned gender effects are much more widespread than I'd have thought before I wrote this blog post.
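To get a feeling for why effects as small as d = 0.09 end up statistically significant in a sample this large, here is a back-of-the-envelope calculation. It is not part of my analysis syntax, and it makes assumptions that do not hold exactly in the Many Labs data: it splits the 6344 participants into two equal groups of 3172 (the real gender split is not equal, and not every participant answered every question), and it uses a normal approximation to the t-test. The function name is invented for this sketch.

```python
import math
from statistics import NormalDist


def z_and_p_for_d(d, n_per_group):
    """Approximate test statistic and two-sided p-value for a standardized
    mean difference d, with two equally sized groups.

    Assumptions (for illustration only): equal group sizes and a
    normal approximation to the t-test.
    """
    standard_error = math.sqrt(2 / n_per_group)  # standard error of d under these assumptions
    z = d / standard_error
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


if __name__ == "__main__":
    # Hypothetical equal split of the 6344 participants.
    z, p = z_and_p_for_d(0.09, 3172)
    print(f"d = 0.09 with 3172 per group: z = {z:.2f}, p = {p:.5f}")
    # The same effect size in a more typical sample of 50 per group.
    z, p = z_and_p_for_d(0.09, 50)
    print(f"d = 0.09 with 50 per group: z = {z:.2f}, p = {p:.3f}")
```

Under these assumptions, a d of 0.09 gives a test statistic of about 3.6 with thousands of participants per group, which is comfortably significant, while the same effect size with 50 participants per group would go unnoticed.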
It also shows we have learned very little, because NHST when examining gender differences does not really tell us anything about WHY gender influences all these different dependent variables. We need better models to really know what’s going on. For the studies where there was no significant effect (such as political orientation), it is risky to conclude gender is irrelevant – perhaps there are moderators, and gender and political identification are related.
We can reject the hypothesis that the null is always false. Generalizing statements about how the null-hypothesis is always false, and thus how null-hypothesis significance testing is a meaningless endeavor, are only partially accurate. The null hypothesis is always false, when it is false, but it’s true when it’s true. It's difficult to know whether a non-significant difference reflects a Type 2 error (there is an effect, but it will only become significant if the statistical power is increased, for example by collecting more data), or whether it actually means the null is true. Null-hypothesis significance testing cannot be used to answer these questions. NHST can only reject the null-hypothesis, and when observed differences are not statistically significant, the outcome of a significance test necessarily remains inconclusive. But assuming the null-hypothesis is true in exploratory research, at least in experiments where random assignment to conditions is possible, is a useful statistical tool.