The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Thursday, March 8, 2018

Prediction and Validity of Theories

What is the goal of data collection? This is a simple question, and as researchers we collect data all the time. But the answer to this question is not straightforward. It depends on the question that you are asking of your data. There are different questions you can ask from your data, and therefore, you can have different goals when collecting data. Here, I want to focus on collecting data to test scientific theories. I will be quoting a lot from De Groot’s book Methodology (1969), especially Chapter 3. If you haven’t read it, you should – I think it is the best book about doing good science that has ever been written.

When you want to test theories, the theory needs to make a prediction, and you need to have a procedure that can evaluate verification criteria. As De Groot writes: “A theory must afford at least a number of opportunities for testing. That is to say, the relations stated in the model must permit the deduction of hypotheses which can be empirically tested. This means that these hypotheses must in turn allow the deduction of verifiable predictions, the fulfillment or non-fulfillment of which will provide relevant information for judging the validity or acceptability of the hypotheses” (§ 3.1.4).

This last sentence is interesting – we collect data, to test the ‘validity’ of a theory. We are trying to see how well our theory works when we want to predict what unobserved data looks like (whether these are collected in the future, or in the past, as De Groot remarks). As De Groot writes: “Stated otherwise, the function of the prediction in the scientific enterprise is to provide relevant information with respect to the validity of the hypothesis from which it has been derived.” (§ 3.4.1).

To make a prediction that can be true or false, we need to forbid certain states of the world and allow others. As De Groot writes: “Thus, in the case of statistical predictions, where it is sought to prove the existence of a causal factor from its effect, the interval of positive outcomes is defined by the limits outside which the null hypothesis is to be rejected. It is common practice that such limits are fixed by selecting in advance a conventional level of significance: e.g., 5 %, 1 %, or .1 % risk of error in rejecting the assumption that the null hypothesis holds in the universe under consideration. Though naturally a judicious choice will be made, it remains nonetheless arbitrary. At all events, once it has been made, there has been created an interval of positive outcome, and thus a verification criterion. Any outcome falling within it stamps the prediction as ‘proven true’.” (§ 3.4.2). Note that if you prefer, you can predict an effect size with some accuracy, calculate a Bayesian highest density interval that excludes some value, or a Bayes factor that is larger than some cut-off – as long as your prediction can be either confirmed or not confirmed.

Note that the prediction gets a ‘proven true’ stamp – the theory does not. In this testing procedure, there is no direct approach from the ‘proven true’ stamp to a ‘true theory’ conclusion. Indeed, the latter conclusion is not possible in science. We are mainly indexing the ‘track record’ of a theory, as Meehl (1990) argues: “The main way a theory gets money in the bank is by predicting facts that, absent the theory, would be antecedently improbable.” Often (e.g., in non-experimental settings) rejecting a null hypothesis with large sample sizes is not considered a very improbable event, but that is another issue (see also the definition of a severe test by Mayo (1996, 178): a passing result is a severe test of hypothesis H just to the extent that it is very improbable for such a passing result to occur, were H false).

Regardless of how risky the prediction was, when we then collect data and test the hypothesis, we either confirm our prediction or we do not. In frequentist statistics, we add the outcome of this prediction to the ‘track record’ of our theory, but we can not draw conclusions based on any single study. As Fisher (1926, 504) writes: “if one in twenty does not seem high enough odds, we may, if we prefer it, draw the line at one in fifty (the 2 per cent point), or one in a hundred (the 1 per cent point). Personally, the writer prefers to set a low standard of significance at the 5 per cent point, and ignore entirely all results which fail to reach this level. A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance” (italics added).

The study needs to be ‘properly designed’ to ‘rarely’ fail to give this level of significance – which, despite Fisher’s dislike for Neyman-Pearson statistics, I can read in no other way than as an instruction to run well-powered studies for whatever happens to be your smallest effect size of interest. In other words: when testing the validity of theories through predictions, where you keep a ‘track record’ of predictions, you need to control your error rates to efficiently distinguish hits from misses. Design well-powered studies, and do not fool yourself by inflating the probability of observing a false positive.

I think that when it comes to testing theories, assessing their validity through prediction is extremely (and for me, perhaps the most) important. We don’t want to fool ourselves when we test the validity of our theories. An example of ‘fooling yourself’ is provided by the studies on pre-cognition by Daryl Bem (2011). A result I like to use in workshops comes from Study 1, where people pressed a left or right button to predict whether a picture was hidden behind a left or right curtain.

If we take this study as it is (without pre-registration), it is clear there are 5 tests (for erotic, neutral, negative, positive, and ‘romantic but non-erotic’ pictures). A Bonferroni correction would lead us to use an alpha level of 0.01 (0.05/5 tests), and the reported p-value (0.01, or more precisely 0.013) would not be enough to support our prediction, given this pre-specified alpha level. Note that Bem (Bem, Utts, and Johnson, 2011) explicitly says this test was predicted. However, I see absolutely no reason to believe Bem without a pre-registration document for the study.

Bayesian statistics do not provide a solution when analyzing this pre-cognition experiment. As Gelman and Loken (2013) write about this study (I just realized this ‘Garden of Forking paths’ paper is unpublished, but has 150 citations!): “we can still take this as an observation that 53.1% of these guesses were correct, and if we combine this with a flat prior distribution (that is, the assumption that the true average probability of a correct guess under these conditions is equally likely to be anywhere between 0 and 1) or, more generally, a locally-flat prior distribution, we get a posterior probability of over 99% that the true probability is higher than 0.5; this is one interpretation of the one-sided p-value of 0.01.” The use of Bayes factors that quantify model evidence provides no solution. Where Wagenmakers and colleagues argue based on ‘default’ Bayesian t-tests that the null-hypothesis is supported, Bem, Utts, and Johnson (2011) correctly point out this criticism is flawed, because the default Bayesian t-tests use completely unrealistic priors for pre-cognition research (and most other studies published in psychology, for that matter).

It is interesting that the best solution Gelman and Loken come up with is that “perhaps researchers can perform half as many original experiments in each paper and just pair each new experiment with a preregistered replication”. What matters is not just the data, but the procedure used to collect the data. The procedure needs to be able to demonstrate a strong predictive validity, which is why pre-registration is such a great solution to many problems science faces. Pre-registered studies are the best way we have to show you can actually predict something – which gets your theory money in the bank.

If people ask me if I care about evidence, I typically say: ‘mwah’. For me, evidence is not a primary goal of doing research. Evidence is a consequence of demonstrating that my theories have high validity as I test predictions. Evidence is important to end up with, and it can be useful to quantify model evidence through likelihoods or Bayes factors, if you have good models. But if I am able to show that I can confirm predictions in a line of pre-registered studies, either by showing my p-value is smaller than an alpha level, a Bayesian highest density interval excludes some value, a Bayes factor is larger than some cut-off, or by showing the effect size is close enough to some predicted value, I will always end up with strong evidence for the presence of some effect. As De Groot (1969) writes: “If one knows something to be true, one is in a position to predict; where prediction is impossible, there is no knowledge.”


Bem, D. J. (2011). Feeling the future: experimental evidence for anomalous retroactive influences on cognition and affect. Journal of Personality and Social Psychology, 100(3), 407–425.
Bem, D. J., Utts, J., & Johnson, W. O. (2011). Must psychologists change the way they analyze their data? Journal of Personality and Social Psychology, 101(4), 716–719.
Gelman, A., & Loken, E. (2013). The garden of forking paths: Why multiple comparisons can be a problem, even when there is no “fishing expedition” or “p-hacking” and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.
De Groot, A. D. (1969). Methodology. The Hague: Mouton & Co.
Mayo, D. G. (1996). Error and the growth of experimental knowledge. University of Chicago Press.
Wagenmakers, E.-J., Wetzels, R., Borsboom, D., & van der Maas, H. L. J. (2011). Why psychologists must change the way they analyze their data: the case of psi: comment on Bem (2011). Journal of Personality and Social Psychology, 100(3), 426–432.

Thursday, January 18, 2018

The Costs and Benefits of Replications

This blog post is based on a pre-print by Coles, Tiokhin, Scheel, Isager, and Lakens “The Costs and Benefits of Replications”, submitted to Behavioral and Brain Sciences as a commentary on “Making Replication Mainstream”.

In a summary of recent discussions about the role of direct replications in psychological science, Zwaan, Etz, Lucas, and Donnellan (2017; henceforth ZELD) argue that replications should be more mainstream. The debate about the importance of replication research is essentially driven by disagreements about the value of replication studies, in a world where we need to carefully think about the best way to allocate limited resources when pursuing scientific knowledge. The real question, we believe, is when replication studies are worthwhile to perform.

Goldin-Meadow (2016) stated that "it’s just too costly or unwieldy to generate hypotheses on one sample and test them on another when, for example, we’re conducting a large field study or testing hard-to-find participants". A similar comment is made by Tackett and McShane (2018) in their comment on ZELD: “Specifically, large-scale replications are typically only possible when data collection is fast and not particularly costly, and thus they are, practically speaking, constrained to certain domains of psychology (e.g., cognitive and social).”

Such statements imply a cost-benefit analysis. But these scholars do not quantify their costs and benefits. They hide their subjective expected utility (what is a large-scale replication study worth to me) behind absolute statements, as they write “is” and “are” but really mean “it is my subjective belief that”. Their statements are empty, scientifically speaking, because they are not quantifiable. What is “costly”? We can not have a discussion about such an important topic if researchers do not specify their assumptions in quantifiable terms.

Some studies may be deemed valuable enough to justify even quite substantial investments to guarantee that a replication study is performed. For instance, because it is unlikely that anyone will build a second Large Hadron Collider to replicate the studies at CERN, there are two detectors (ATLAS and CMS) so that independent teams can replicate each other’s work. That is, not only do these researchers consider it important to have a very low (5 sigma) alpha level when they analyze data, they also believe it is worthwhile to let two teams independently do the same thing. As a physicist remarks: “Replication is, in the end, the most important part of error control. Scientists are human, they make mistakes, they are deluded, and they cheat. It is only through attempted replication that errors, delusions, and outright fraud can be caught.” Thus, high cost is not by itself a conclusive argument against replication. Instead, one must make the case that the benefits do not justify the costs. Again, I ask: what is “costly”?

Decision theory is a formal framework that allows researchers to decide when replication studies are worthwhile. It requires researchers to specify their assumptions in quantifiable terms. For example, the expected utility of a direct replication (compared to a conceptual replication) depends on the probability that a specific theory or effect is true. If you believe that many published findings are false, then directly replicating prior work may be a cost-efficient way to prevent researchers from building on unreliable findings. If you believe that psychological theories usually make accurate predictions, then conceptual extensions may lead to more efficient knowledge gains than direct replications. Instead of wasting time arguing about whether direct replications are important or whether conceptual replications are important, do the freaking math. Tell us at which probability of H0 being true you think direct replications become an efficient enough way to weed out false positives from the literature. Show us, by pre-registering all your main analyses, that you are building on strong theories that allow you to make correct predictions with a 92% success rate, and that you therefore do not feel direct replications are the more efficient way to gain knowledge in your area.
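
To make this kind of calculation concrete, here is a back-of-the-envelope sketch in R. This is not the formal model from our pre-print; the assumed power of 0.8 and alpha of 0.05 are just illustrative numbers. It shows how the positive predictive value of significant findings depends on the prior probability that the effects we study are real:

# positive predictive value of a significant finding, for an assumed power and alpha,
# as a function of the prior probability that studied effects are real
ppv <- function(prior, power = 0.8, alpha = 0.05) {
  (power * prior) / (power * prior + alpha * (1 - prior))
}
round(ppv(prior = c(0.1, 0.3, 0.5, 0.7, 0.9)), 2)
# 0.64 0.87 0.94 0.97 0.99: the lower the prior, the more a direct replication
# of a significant finding is worth before anyone builds on it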

I am happy to see that our ideas about the importance of using decision theory to determine when replications are important enough to perform were independently replicated in this commentary on ZELD by Hardwicke, Tessler, Peloquin, and Frank. We have collaboratively been working on a manuscript to specify the Replication Value of replication studies for several years, and with the recent funding I received, I’m happy that we can finally dedicate the time to complete this work. I look forward to scientists explicitly thinking about the utility of the research they perform. This is an important question, and I can’t wait for our field to start discussing how we can quantify the utility of the research we perform. This will not be easy. But even if you never think explicitly about how to spend your resources, you are making these choices implicitly all the time, and this question is too important to give up on without even trying. In our pre-print, we illustrate how all concerns raised against replication studies basically boil down to a discussion about their costs and benefits, and how formalizing these costs and benefits would improve the way researchers discuss this topic.

Tuesday, December 5, 2017

Understanding common misconceptions about p-values

A p-value is the probability of the observed, or more extreme, data, under the assumption that the null-hypothesis is true. The goal of this blog post is to understand what this means, and perhaps more importantly, what this doesn’t mean. People often misunderstand p-values, but with a little help and some dedicated effort, we should be able to explain these misconceptions. Below is my attempt, but if you prefer a more verbal explanation, I can recommend Greenland et al. (2016).

First, we need to know what ‘the assumption that the null-hypothesis is true’ looks like. Although the null-hypothesis can be specified as any value, here we will assume the null-hypothesis is a difference of 0. When this model is visualized in textbooks, or in power-analysis software such as G*Power, you often see a graph like the one below, with t-values on the horizontal axis, and a critical t-value somewhere around 1.96. For a mean difference, the p-value is calculated based on the t-distribution (which is like a normal distribution, and the larger the sample size, the more similar the two become). I will distinguish the null hypothesis (the mean difference in the population is exactly 0) from the null-model (a model of the data we should expect when we draw a sample when the null-hypothesis is true) in this post.

I’ve recently realized that things become a lot clearer if you just plot these distributions as mean differences, because you will more often think about means than about t-values. So below, you can see a null-model, assuming a standard deviation of 1, for a t-test comparing mean differences (because the SD = 1, you can also interpret the mean differences as a Cohen’s d effect size).

The first thing to notice is that we expect that the mean of the null-model is 0: The distribution is centered on 0. But even if the mean in the population is 0, that does not imply every sample will give a mean of exactly zero. There is variation around the mean, as a function of the true standard deviation, and the sample size. One reason why I prefer to plot the null-model in raw scores instead of t-values is that you can see how the null-model changes, when the sample size increases.

When we collect 5000 instead of 50 observations, we see the null-model is still centered on 0 – but in our null-model we now expect most values to fall very close to 0. Due to the larger sample size, the mean differences we expect to observe in our sample fall much closer to 0 than under the null-model for only 50 observations.
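
If you want to draw these null-models yourself, a rough sketch in R is given below. I am assuming, purely for illustration, an independent-groups comparison with n observations per group and a true standard deviation of 1 (so the standard error of the mean difference is sqrt(2/n)), and I use a normal approximation to the t-distribution.

# sampling distribution of the mean difference under the null, for two sample sizes
se_50   <- sqrt(2 / 50)      # 50 observations per group
se_5000 <- sqrt(2 / 5000)    # 5000 observations per group
x <- seq(-1, 1, length.out = 1000)
plot(x, dnorm(x, mean = 0, sd = se_50), type = "l",
     ylim = c(0, dnorm(0, mean = 0, sd = se_5000)),
     xlab = "mean difference", ylab = "density")
lines(x, dnorm(x, mean = 0, sd = se_5000), lty = 2)   # a much narrower null-model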

Both graphs have areas that are colored red. These areas represent 2.5% of the values in the left tail of the distribution, and 2.5% of the values in the right tail of the distribution. Together, they make up 5% of the most extreme mean differences we would expect to observe, given our number of observations, when the true mean difference is exactly 0 – representing the use of an alpha level of 5%. The vertical axis shows the density of the curves.

Let’s assume that in the figure visualizing the null model for N = 50 (two figures up) we observe a mean difference of 0.5 in our data. This observation falls in the red area in the right tail of the distribution. This means that the observed mean difference is surprising, if we assume that the true mean difference is 0. If the true mean difference is 0, we should not expect such an extreme mean difference very often. If we calculate a p-value for this observation, we get the probability of observing a value more extreme (in either tail, when we do a two-tailed test) than 0.5.

Take a look at the figure that shows the null-model when we have collected 5000 observations (one figure up), and imagine we would again observe a mean difference of 0.5. It should be clear that this same difference is even more surprising than it was when we collected 50 observations.
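
Under the same assumptions as in the sketch above (independent groups, SD = 1), you can calculate where the red areas start, and how surprising a mean difference of 0.5 is for each sample size:

se_50 <- sqrt(2 / 50)
qt(c(.025, .975), df = 98) * se_50        # the red areas start at roughly -0.40 and 0.40
2 * pt(-abs(0.5 / se_50), df = 98)        # p ~ .014: a difference of 0.5 falls in the red area
se_5000 <- sqrt(2 / 5000)
2 * pt(-abs(0.5 / se_5000), df = 9998)    # vanishingly small: far more surprising with 5000 observations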

We are now almost ready to address common misconceptions about p-values, but before we can do this, we need to introduce a model of the data when the null is not true. When the mean difference is not exactly 0, the alternative hypothesis is true – but what does an alternative model look like?

When we do a study, we rarely already know what the true mean difference is (if we already knew, why would we do the study?). But let’s assume there is an all-knowing entity. Following Paul Meehl, we will call this all-knowing entity Omniscient Jones. Before we collect our sample of 50 observations, Omniscient Jones already knows that the true mean difference in the population is 0.5. Again, we should expect some variation around this true mean difference in our small sample. The figure below again shows the expected data pattern when the null-hypothesis is true (now indicated by a grey line) and it shows an alternative model, assuming a true mean difference of 0.5 exists in the population (indicated by a black line).

But Omniscient Jones could have said the true difference was much larger. Let’s assume we do another study, but now before we collect our 50 observations, Omniscient Jones tells us that the true mean difference is 1.5. The null model does not change, but the alternative model now moves over to the right. 
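
These figures can be sketched with the same approach as before (normal approximations for the sampling distribution of the mean difference, 50 observations per group, SD = 1 – my illustrative assumptions, not a prescription):

se <- sqrt(2 / 50)
x  <- seq(-1, 2.5, length.out = 1000)
plot(x, dnorm(x, mean = 0, sd = se), type = "l", col = "grey",
     xlab = "mean difference", ylab = "density")   # null-model
lines(x, dnorm(x, mean = 0.5, sd = se))            # alternative model: true difference of 0.5
lines(x, dnorm(x, mean = 1.5, sd = se), lty = 2)   # alternative model: true difference of 1.5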

Now, we are finally ready to address some common misconceptions about p-values. Before we look at misconceptions in some detail, I want to remind you of one fact that is easy to remember, and will enable you to recognize many misconceptions about p-values: p-values are a statement about the probability of data, not a statement about the probability of a theory. Whenever you see p-values interpreted as a probability of a theory or a hypothesis, you know something is not right. Now let’s take a look at why this is not right.

1) Why a non-significant p-value does not mean that the null-hypothesis is true.

Let’s take a concrete example that will illustrate why a non-significant result does not mean that the null-hypothesis is true. In the figure below, Omniscient Jones tells us the true mean difference is again 0.5. We have observed a mean difference of 0.35. This value does not fall within the red area (and hence, the p-value is not smaller than our alpha level, or p > .05). Nevertheless, we see that observing a mean difference of 0.35 is much more likely under the alternative model, than under the null-model. 

All the p-value tells us is that this value is not extremely surprising, if we assume the null-hypothesis is true. A non-significant p-value does not mean the null-hypothesis is true. It might be true, but it is also possible that the data we have observed is more likely when the alternative hypothesis is true than when the null-hypothesis is true (as in the figure above).
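
You can check this with the same toy numbers (50 observations per group, SD = 1, and a true difference of 0.5 under the alternative model – again my illustrative assumptions): an observed difference of 0.35 is not significant, but it is several times more likely under the alternative model than under the null-model.

se <- sqrt(2 / 50)
2 * pt(-abs(0.35 / se), df = 98)      # p ~ .08: not significant
dnorm(0.35, mean = 0.5, sd = se) /
  dnorm(0.35, mean = 0, sd = se)      # ~3.5: more likely under the alternative than under the null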

2) Why a significant p-value does not mean that the null-hypothesis is false.

Imagine we generate a series of numbers in R using the following command:

rnorm(n = 50, mean = 0, sd = 1)

This command generates 50 random observations from a distribution with a mean of 0 and a standard deviation of 1. We run this command once, and we observe a mean difference of 0.5. We can perform a one-sample t-test against 0, and this test tells us, with a p < .05, that the data we have observed is surprisingly extreme, assuming the random number generator in R functions as it should.
Should we decide to reject the null-hypothesis that the random number generator in R works? That would be a bold move indeed! We know that the probability of observing surprising data, assuming the null hypothesis is true, has a maximum of 5% when our alpha is 0.05. What we can conclude, based on our data, is that we have observed an extreme outcome, that should be considered surprising. But such an outcome is not impossible when the null-hypothesis is true. And in this case, we really don’t even have an alternative hypothesis that can explain the data (beyond perhaps evil hackers taking over the website where you downloaded R).
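
You can try this yourself (the exact numbers will of course differ from run to run):

x <- rnorm(n = 50, mean = 0, sd = 1)   # one sample from a perfectly fine random number generator
t.test(x, mu = 0)                      # once in a while this gives p < .05, purely by chance
# across many such samples, only about 5% give a 'surprising' result
mean(replicate(10000, t.test(rnorm(50), mu = 0)$p.value < 0.05))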

This misconception can be expressed in many forms. For example, one version states that the p-value is the probability that the data were generated by chance. Note that this is just a sneaky way to say: The p-value is the probability that the null hypothesis is true, and we observed an extreme p-value just due to random variation. As we explained above, we can observe extreme data when we are basically 100% certain that the null-hypothesis is true (the random number generator in R works as it should), and seeing extreme data once should not make you think the probability that the random number generator in R is working is less than 5%, or in other words, that the probability that the random number generator in R is broken is now more than 95%.

Remember: P-values are a statement about the probability of data, not a statement about the probability of a theory or a hypothesis.

3) Why a significant p-value does not mean that a practically important effect has been discovered.

If we plot the null-model for a very large sample size (N = 100000), we see that even very small mean differences (here, a mean difference of 0.01) will be considered ‘surprising’. With such a large sample size, all mean differences we observe should fall very close to 0, and even a difference of 0.01 is already considered surprising, because we measure the mean difference with substantial accuracy when we collect this much data.

Note that nothing about the definition of a p-value changes: It still correctly indicates that, if the null-hypothesis is true, we have observed data that should be considered surprising. However, just because data is surprising does not mean we need to care about it. It is mainly the verbal label ‘significant’ that causes confusion here – it is perhaps less confusing to think of a ‘significant’ effect as a ‘surprising’ effect (as long as the null-model is realistic – which is not automatically true).

This example illustrates why you should always report and interpret effect sizes together with hypothesis tests. This is also why it is useful to complement a hypothesis test with an equivalence test, so that you can conclude that the observed difference is surprisingly large if there is no true difference, but also surprisingly close to zero if any effect we consider meaningful exists (and thus, that the effect is statistically equivalent to zero).
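
To make this concrete with the same kind of numbers (independent groups, SD = 1): with 100000 observations per group an observed difference of 0.01 is already ‘surprising’, and an equivalence test, here written out by hand as two one-sided tests against bounds of -0.1 and 0.1 (a hypothetical smallest effect size of interest I picked for illustration), shows the difference is at the same time surprisingly close to zero.

n  <- 100000
se <- sqrt(2 / n)
2 * pt(-abs(0.01 / se), df = 2 * n - 2)                        # p ~ .025: 'significant'
pt((0.01 - (-0.1)) / se, df = 2 * n - 2, lower.tail = FALSE)   # one-sided test against the lower bound
pt((0.01 - 0.1) / se, df = 2 * n - 2)                          # one-sided test against the upper bound
# both one-sided p-values are tiny, so we can also reject effects as large as +/- 0.1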

4) If you have observed a significant finding, the probability that you have made a Type 1 error (a false positive) is not 5%.

Assume we collect 20 observations, and Omniscient Jones tells us the null-hypothesis is true. This means we are sampling from the following distribution:

If this is our reality, it means that 100% of the time that we observe a significant result, it is a false positive. Thus, 100% of our significant results are Type 1 errors. What the Type 1 error rate controls is that, across all studies we perform when the null is true, not more than 5% of our observed mean differences will fall in the red tail areas. But when a mean difference has fallen in the tail areas, it is always a Type 1 error. After observing a significant result, you can not say it has a 5% probability of being a false positive. But before you collect data, you can say you will not conclude there is an effect, when there is no effect, more than 5% of the time, in the long run.
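
A quick simulation makes both points concrete. Here I assume, for illustration, two groups of 20 observations each, with the null-hypothesis true by construction:

set.seed(1)
p <- replicate(10000, t.test(rnorm(20), rnorm(20))$p.value)
mean(p < .05)   # ~0.05: in the long run, about 5% of these studies give a significant result
# but because the null is true in every simulated study,
# 100% of those significant results are Type 1 errors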

5) One minus the p-value is not the probability of observing another significant result when the experiment is replicated.

It is impossible to calculate the probability that an effect will replicate based on the p-value, and as a consequence, the p-value can not inform us about the p-value we will observe in future studies. When we have observed a p-value of 0.05, it is not 95% certain the finding will replicate. Only when we make additional assumptions (e.g., the assumption that the alternative hypothesis is true, and that the effect size observed in the original study is exactly correct) can we model the p-value distribution for future studies.
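
For example, if we are willing to assume that the true effect size is exactly the effect size observed in the original study, base R’s power.t.test gives the probability of another significant result. Using the earlier toy numbers (an observed difference of 0.5, 50 observations per group, SD = 1, p ~ .014):

power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)$power
# ~0.70: even under this generous assumption, the probability of a significant
# replication is about 70%, not 1 - .014 = .986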

It might be useful to visualize the one very specific situation when the p-value does provide the probability that future studies will provide a significant p-value (even though in practice, we will never know if we are in this very specific situation). In the figure below we have a null-model and an alternative model for 150 observations. The observed mean difference falls exactly on the threshold for the significance level. This means the p-value is 0.05. In this specific situation, it is also 95% probable that we will observe a significant result in a replication study, assuming there is a true effect as specified by the alternative model. If this alternative model is true, 95% (1-p) of the observed means will fall on the right side of the observed mean in the original study (we have a statistical power of 95%), and only 5% of the observed means will fall in the blue area (which contains the Type 2 errors).

This very specific situation is almost never your reality. It is not true when any other alternative hypothesis is correct. And it is not true when the null-hypothesis is true. In short, except for one very specific situation – the alternative hypothesis is true and has one very specific effect size, and you can never know you are in this situation – the p-value does not give the probability that a future study will once again yield a significant result.


Probabilities are confusing, and the interpretation of a p-value is not intuitive. Grammar is also confusing, and not intuitive. But whereas we practice grammar in our education again and again and again until we get it, we don’t practice the interpretation of p-values again and again and again until we get it. Some repetition is probably needed. Explanations of what p-values mean are often verbal, and if there are figures, they use t-value distributions we are unfamiliar with. Instead of complaining that researchers don’t understand what p-values mean, I think we should try to explain common misconceptions multiple times, in multiple ways.

Daniel Lakens, 2017

Saturday, November 11, 2017

The Statisticians' Fallacy

If I ever make a follow-up to my current MOOC, I will call it ‘Improving Your Statistical Questions’. The more I learn about how people use statistics, the more I believe the main problem is not how people interpret the numbers they get from statistical tests. The real issue is which statistical questions researchers ask from their data.

Our statistics education turns a blind eye to training people how to ask a good question. After a brief explanation of what a mean is, and a pit-stop at the normal distribution, we jump through as many tests as we can fit in the number of weeks we are teaching. We are training students to perform tests, but not to ask questions.

There are many reasons for this lack of attention to training people to ask a good question. But here I want to focus on one reason, which I’ve dubbed the Statisticians' Fallacy: statisticians telling you ‘what you really want to know’, instead of explaining how to ask one specific kind of question from your data.

Let me provide some examples of the Statisticians' Fallacy. In the next quotes, pay attention to the use of the word ‘want’. Cohen (1994) in his ‘The earth is round (p < .05)’ writes:

Colquhoun (2017) writes:

Or we can look at Cumming (2013):

Or Bayarri, Benjamin, Berger, and Sellke (2016):

Now, you might have noticed that these four statements by statisticians of ‘what we want’ are all different. One says 'we want' to know the posterior probability that our hypothesis is true, another says 'we want' to know the false positive report probability, yet another says 'we want' effect sizes and their confidence intervals, and yet another says 'we want' the strength of evidence in the data.

Now you might want to know all these things, you might want to know some of these things, and you might want to know yet other things. I have no clue what you want to know (and after teaching thousands of researchers the last 5 years, I’m pretty sure often you don't really have a clue what you want either - you've never been trained to thoroughly ask this question). But what I think I know is that statisticians don’t know what you want to know. They might think some questions are interesting enough to ask. They might argue that certain questions follow logically from a specific philosophy of science. But the idea that there is always a single thing ‘we want’ is not true. If it was, statisticians would not have been criticizing what other statisticians say ‘we want’ for the last 50 years. Telling people 'what you want to know' instead of teaching people to ask themselves what they want to know will just get us another two decades of mindless statistics.

I am not writing this to stop statisticians from criticizing each other (I like to focus on easier goals in my life, such as world peace). But after reading many statements like the ones I’ve cited above, I have distilled my main take-home message into a bathroom tile:

There are many, often complementary, questions you can ask from your data, or when performing lines of research. Now I am not going to tell you what you want. But what I want is that we stop teaching researchers there is only a single thing they want to know. There is no room for the Statisticians' Fallacy in our education. I do not think it is useful to tell researchers what they want to know. But I think it’s a good idea to teach them about all the possible questions they can ask.

Further Reading:
Thanks to Carol Nickerson who, after reading this blog, pointed me to David Hand's Deconstructing Statistical Questions, which is an excellent article on the same topic - highly recommended.

Monday, October 16, 2017

Science-Wise False Discovery Rate Does Not Explain the Prevalence of Bad Science

This article explores the statistical concept of science-wise false discovery rate (SWFDR). Some authors use SWFDR and its complement, positive predictive value, to argue that most (or, at least, many) published scientific results must be wrong unless most hypotheses are a priori true. I disagree. While SWFDR is valid statistically, the real cause of bad science is “Publish or Perish”.


Is science broken? A lot of people seem to think so, including some esteemed statisticians. One line of reasoning uses the concepts of false discovery rate and its complement, positive predictive value, to argue that most (or, at least, many) published scientific results must be wrong unless most hypotheses are a priori true.

The false discovery rate (FDR) is the probability that a significant p-value indicates a false positive, or equivalently, the proportion of significant p-values that correspond to results without a real effect. The complement, positive predictive value (\(PPV=1-FDR\)) is the probability that a significant p-value indicates a true positive, or equivalently, the proportion of significant p-values that correspond to results with real effects.

I became interested in this topic after reading Felix Schönbrodt’s blog post, “What’s the probability that a significant p-value indicates a true effect?” and playing with his ShinyApp. Schönbrodt’s post led me to David Colquhoun’s paper, “An investigation of the false discovery rate and the misinterpretation of p-values” and blog posts by Daniel Lakens, “How can p = 0.05 lead to wrong conclusions 30% of the time with a 5% Type 1 error rate?” and Will Gervais, “Power Consequences”.

The term science-wise false discovery rate (SWFDR) is from Leah Jager and Jeffrey Leek’s paper, “An estimate of the science-wise false discovery rate and application to the top medical literature”. Earlier work includes Sholom Wacholder et al.’s 2004 paper “Assessing the Probability That a Positive Report is False: An Approach for Molecular Epidemiology Studies” and John Ioannidis’s 2005 paper, “Why most published research findings are false”.


Being a programmer and not a statistician, I decided to write some R code to explore this topic on simulated data.

The program simulates a large number of problem instances representing published results, some of which are true and some false. The instances are very simple: I generate two groups of random numbers and use the t-test to assess the difference between their means. One group (the control group or simply group0) comes from a standard normal distribution with \(mean=0\). The other group (the treatment group or simply group1) is a little more involved:

  • for true instances, I take numbers from a normal distribution with \(mean=d\) (\(d>0\)) and standard deviation 1;
  • for false instances, I use the same distribution as group0.

The parameter d is the effect size, aka Cohen’s d.

I use the t-test to compare the means of the groups and produce a p-value assessing whether both groups come from the same distribution.

The program does this thousands of times (drawing different random numbers each time, of course), collects the resulting p-values, and computes the FDR. The program repeats the procedure for a range of assumptions to determine the conditions under which most positive results are wrong.

For true instances, we expect the difference in means to be approximately d and for false ones to be approximately 0, but due to the vagaries of random sampling, this may not be so. If the actual difference in means is far from the expected value, the t-test may get it wrong, declaring a false instance to be positive and a true one to be negative. The goal is to see how often we get the wrong answer across a range of assumptions.


To reduce confusion, I will be obsessively consistent in my terminology.

  • An instance is a single run of the simulation procedure.
  • The terms positive and negative refer to the results of the t-test. A positive instance is one for which the t-test reports a significant p-value; a negative instance is the opposite. Obviously the distinction between positive and negative depends on the chosen significance level.
  • true and false refer to the correct answers. A true instance is one where the treatment group (group1) is drawn from a distribution with \(mean=d\) (\(d>0\)). A false instance is the opposite: an instance where group1 is drawn from a distribution with \(mean=0\).
  • empirical refers to results calculated from the simulated data, as opposed to theoretical which means results calculated using standard formulas.

The simulation parameters, and their defaults, are:

  • prop.true: fraction of cases where there is a real effect (default seq(.1,.9,by=.2))
  • m: number of iterations (default 1e4)
  • n: sample size (default 16)
  • d: standardized effect size, aka Cohen’s d (default c(.25,.50,.75,1,2))
  • pwr: power; if set, the program adjusts d to achieve this power (default NA)
  • sig.level: significance level for power calculations when pwr is set (default 0.05)
  • pval.plot: p-values for which we plot results (default c(.001,.01,.03,.05,.1))
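
A minimal sketch of this kind of simulation, for a single combination of parameters, might look as follows (a simplified stand-in for illustration, not the actual program):

# one parameter combination: m instances, n observations per group, effect size d,
# and a fraction prop.true of instances with a real effect
swfdr_sim <- function(m = 1e4, n = 16, d = 0.5, prop.true = 0.5, sig.level = 0.05) {
  true <- runif(m) < prop.true                    # which instances have a real effect
  pval <- sapply(true, function(is.true) {
    group0 <- rnorm(n, mean = 0)                  # control group
    group1 <- rnorm(n, mean = if (is.true) d else 0)
    t.test(group0, group1)$p.value
  })
  positive <- pval < sig.level
  sum(positive & !true) / sum(positive)           # empirical FDR among positive instances
}
swfdr_sim(prop.true = 0.1, d = 0.25)              # low prior, small effect: most positives are false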


The simulation procedure with default parameters produces four graphs similar to the ones below.