Nuijten et al. (2015) created statcheck, a free R package that you can set to work on a PDF or HTML file, or a folder of files, to check the reported t-tests, F-tests, correlations, and some other tests. Just as you run a spellchecker, you will want to run statcheck when working as an editor, reviewer, author, supervisor, or teacher on any empirical article that contains t-tests, F-tests, correlations, or chi-square tests.
Here’s how it works. First, you need to install open source software that will allow R to convert PDF files to text. The steps are a bit long and tricky, but I made a step-by-step summary which should help you to get this to work.
Then, to check a single article, run the following R code (changing the path to the PDF you want to check):
# install and load statcheck
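A minimal sketch of such a call, assuming the PDF-to-text software from the previous step is installed. The file path is a hypothetical placeholder, and the `file.exists()` guard is only there so the snippet does nothing when the placeholder has not been replaced:

```r
# Install statcheck from CRAN if it is not available yet, then load it.
if (!requireNamespace("statcheck", quietly = TRUE)) {
  install.packages("statcheck")
}
library(statcheck)

# Hypothetical placeholder path: replace it with the PDF you want to check.
pdf_path <- "path/to/article.pdf"

# checkPDF() converts the PDF to text, extracts the reported tests,
# and re-computes the p-values.
if (file.exists(pdf_path)) {
  results <- checkPDF(pdf_path)
  print(results)
}

# For a whole folder of PDF files, checkPDFdir("path/to/folder") can be used.
```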
You will get output where you can see a column for the reported p-values, the re-computed p-values, a summary of each test, and then a column called Error which will say FALSE if there is no error, and TRUE if there is an error. I analyzed a recent paper my PhD student Chao Zhang wrote, and I was happy to see the way we worked on this article (Chao doing the analyses, me double-checking them) prevented us from making errors. I also looked at earlier papers, and I regrettably did make a few rounding errors and copy-paste errors in my publications. Even though none of them changed the conclusions (indicated by the column ‘DecisionError’), using Statcheck would have easily prevented these errors. Statcheck can make some errors itself, so be sure to check whether each test is identified correctly, especially when it flags something as an error.
Some Errors We Make
Nuijten and colleagues applied Statcheck to a huge number of articles, and report in a new paper how often people make errors when reporting statistical tests. When reading the paper, I immediately saw how useful Statcheck was. But I also felt some annoyance that there was no clear analysis of the things we did wrong. I felt someone told me I was doing things wrong, without telling me what it was I did wrong. But then a wise man said I should not blame the authors for not writing the paper I would have written. This is especially true given that Nuijten et al. have shared all their data, and their beautiful and reproducible analysis script.
So I took a look at what we did wrong (R script), and below I will give a recommendation on how to fix a large majority of the problems.
Of the 258105 tests, there were 24961 errors, of which 3581 were decision errors (changing the conclusion of p > 0.05 to p < 0.05 or vice versa), but most of them are caused by the same few types of errors. First, people make copy-paste errors. Second, people reported p = 0.000 1279 times, when they should have reported p < 0.001. Three other errors are worth looking into in some more detail.
Incorrect use of < instead of =
By far the largest number of errors is the use of < instead of =. For example, F(1, 68) = 4.88, p < .03 is incorrect, because the p-value is actually 0.0305, which is not < 0.03. It happens thousands and thousands of times. Indeed, if we look at the difference between the reported and re-computed p-values for all the errors, we see the difference in p-values is mostly tiny (smaller than 0.01). The misuse of < is the main reason for these errors. When you read the byline ‘One in eight articles contain data-reporting mistakes that affect their conclusions’ you might not think the solution is simply to replace ‘<’ by ‘=’. I believe it largely is (but this deserves a closer look).
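The F-test example can be re-computed by hand in base R. This is only a sketch of the kind of check involved, not Statcheck's own code, and the variable names are chosen here for illustration:

```r
# Reported result: F(1, 68) = 4.88, p < .03
f_value <- 4.88
df1 <- 1
df2 <- 68

# The upper-tail probability of the F distribution is the p-value.
p_recomputed <- pf(f_value, df1, df2, lower.tail = FALSE)
print(round(p_recomputed, 4))

# The claim "p < .03" is only consistent if the re-computed value is below .03.
claim_holds <- p_recomputed < 0.03
print(claim_holds)
```

If the package is installed, passing the same string to `statcheck("F(1, 68) = 4.88, p < .03")` should flag the result in the same way.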
Use of one-sided tests
Using one-sided tests, without saying so (or at least without Statcheck recognizing the words ‘one-sided’, ‘one-tailed’, or ‘directional’ in the text) is another source of errors. The frequency of one-tailed tests (as I assume, without pre-registration of the analysis plan) is rather high. One-tailed tests are fine, and perhaps even more in line with your prediction than a two-tailed test, but I’d feel more comfortable if people pre-register one-sided predictions if they have them, and report them if they are performed. Statcheck is great for finding non-disclosed one-tailed tests.
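To see why an undisclosed one-sided test shows up as an error, compare the two p-values for the same test statistic. The t-value and degrees of freedom below are hypothetical, chosen only so that the two versions fall on opposite sides of 0.05:

```r
# Hypothetical result, for illustration only: t(28) = 1.80
t_value <- 1.80
df <- 28

# Two-sided p-value: probability in both tails.
p_two_sided <- 2 * pt(abs(t_value), df, lower.tail = FALSE)

# One-sided p-value, assuming the effect is in the predicted (positive) direction.
p_one_sided <- pt(t_value, df, lower.tail = FALSE)

print(c(two_sided = p_two_sided, one_sided = p_one_sided))
```

A checker that assumes a two-sided test will re-compute the larger value, so a reported one-sided p < .05 looks like a decision error unless the text says the test was one-sided.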
Incorrect Rounding and Reporting
963 times, people round a p-value between 0.05 and 0.06 to p < 0.05. This is clearly wrong (but remember people make the same rounding error far removed from the magical p = 0.05 threshold as well, so this is just the incorrect use of < instead of = as noted above). 241 times, researchers report a p >= 0.055 as p < 0.05, and 128 times, people round a p-value between 0.055 and 0.06 to p = 0.05 (really using the = sign). This is just pathetic. When you hear ‘1.4% of p-values are grossly inconsistent’, this is the kind of behavior you think about. It makes up approximately 10% of the 3581 decision errors, and even though it is just 0.14% of all reported p-values, I think it is depressingly high. Statcheck can help reduce these errors.
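The percentages above can be checked with a few lines of arithmetic, assuming that the ‘approximately 10%’ refers to the 241 and 128 cases taken together:

```r
n_tests <- 258105            # all tests that were checked
n_decision_errors <- 3581    # errors that change the conclusion
n_gross_rounding <- 241 + 128  # assumption: both kinds of gross rounding combined

# Gross rounding as a share of decision errors and of all tests (in percent).
share_of_decision_errors <- 100 * n_gross_rounding / n_decision_errors
share_of_all_tests <- 100 * n_gross_rounding / n_tests

# Decision errors as a share of all tests (in percent).
decision_error_rate <- 100 * n_decision_errors / n_tests

print(round(c(share_of_decision_errors, share_of_all_tests, decision_error_rate), 2))
```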
Altogether, the 3581 decision errors are made up mostly of incorrect rounding, the use of one-sided tests without explicitly stating this through the words ‘one-tailed’, ‘one-sided’ or ‘directional’, the use of < instead of =, and the approximately 350 (give or take a hundred) false positives (note there might also be false negatives, which would increase the number of errors).
These errors are visible in the plot below. On the left of the graph, we see differences of -1, where Statcheck often computes a p-value of 1 because it misunderstands the test. The large bar in the center is mainly due to the use of < instead of =, and the slightly larger slope on the left of this large bar is due to the use of one-sided tests and incorrect rounding.
My main goal in looking at the data in detail was to be able to provide practical recommendations to prevent the specific errors we make (even though Nuijten et al. suggest co-authors double-check their analyses and share all data). The recommendation is surprisingly straightforward, and fits nicely with the theme of this blog on how 20% of the effort will fix 80% of the problems:
Report exact p-values, rounded to three decimals (e.g., p = 0.016), or use p < 0.001. Mention the use of one-tailed tests. Double-check all numbers (for example by using Statcheck!).
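The reporting rule is easy to automate. Below is a minimal sketch of a helper that applies it; `format_p` is a hypothetical name introduced here, not a function from Statcheck:

```r
# Format p-values following the recommendation: exact value rounded to
# three decimals, or "p < 0.001" for anything smaller than 0.001.
format_p <- function(p) {
  ifelse(p < 0.001, "p < 0.001", sprintf("p = %.3f", p))
}

# Hypothetical p-values, for illustration only.
print(format_p(c(0.0162, 0.00004, 0.0513)))
```

Generating the reported values directly from the analysis output in this way also removes the copy-paste step where many errors creep in.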
I'd like to thank Michele Nuijten for her help in correcting some of my assumptions and analyses, and for feedback on an earlier draft of this blog post.