The 20% Statistician

A blog on statistics, methods, and open science. Understanding 20% of statistics will improve 80% of your inferences.

Tuesday, March 14, 2017

Equivalence testing in jamovi

One of the challenges of trying to get people to improve their statistical inferences is access to good software. After 32 years, SPSS still does not give a Cohen’s d effect size when researchers perform a t-test. I’m a big fan of R nowadays, but I still remember when I thought R looked so complex I was convinced I was not smart enough to learn how to use it. And therefore, I’ve always tried to make statistics accessible to a larger, non-R-using audience. I know there is a need for this – my paper on effect sizes from 2013 will reach 600 citations this week, and the spreadsheet that comes with the article is a big part of its success.

So when I wrote an article about the benefits of equivalence testing for psychologists, I also made a spreadsheet. But really, what we want is easy to use software that combines all the ways in which you can improve your inferences. And in recent years, we see some great SPSS alternatives that try to do just that, such as PSPP, JASP, and more recently, jamovi.

Jamovi is made by developers who used to work on JASP, and you’ll see JASP and jamovi look and feel very similar. I’d recommend downloading and installing both these excellent free software packages. Where JASP aims to provide Bayesian statistical methods in an accessible and user-friendly way (and you can do all sorts of Bayesian analyses in JASP), the core aim of jamovi is to make software that is ‘“community driven”, where anyone can develop and publish analyses, and make them available to a wide audience’. This means that if I develop statistical analyses, such as equivalence tests, I can make these available through jamovi for anyone who wants to use these tests. I think that’s really cool, and I’m super excited my equivalence testing package TOSTER is now available as a jamovi module.

You can download the latest version of jamovi here. Install and open the software. Then, install the TOSTER module. Click the + module button:

Install the TOSTER module:

And you should see a new menu option in the task bar, called TOSTER:

To play around with some real data, let’s download the data from Study 7 of Yap et al (in press) from the Open Science Framework. This study examines the effect of weather (good vs. bad days) on mood and life satisfaction. Like any researcher who takes science seriously, Yap, Wortman, Anusic, Baker, Scherer, Donnellan, and Lucas made their data available with the publication. After downloading the data, we need to replace the missing values, indicated with NA, with nothing (“”) in a text editor (CTRL+H, find and replace), and then we can read the data into jamovi. If you want to follow along, you can also directly download the jamovi file here.

Then, we can just click the TOSTER menu, select a TOST independent samples t-test, select ‘condition’ as condition, and analyze for example the ‘lifeSat2’ variable, or life satisfaction. Then we need to select an equivalence bound. For this DV we have data from approximately 117 people on good days, and 167 people on bad days. We need 136 participants in each condition to have 90% power to reject effects of d = 0.4 or larger, so let’s select d = 0.4 as an equivalence bound. I’m not saying smaller effects are not practically relevant – they might very well be. But if the authors were interested in smaller effects, they would have collected more data. So I’m assuming here the authors thought an effect of d = 0.4 would be small enough to make them reconsider the original effect by Schwarz & Clore (1983), which was quite a bit larger with a d = 1.38.
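The 136-per-condition number comes from a power analysis for the equivalence test (TOSTER has functions for this). As a rough cross-check, here is a sketch in Python using a normal approximation (the exact calculation uses the noncentral t distribution, so treat this as an approximation, not TOSTER's output):

```python
import math
from statistics import NormalDist

def tost_n_per_group(d_bound, alpha=0.05, power=0.90):
    """Approximate n per group needed to reject effects beyond +/- d_bound
    with an independent-samples TOST, assuming the true effect is 0.
    Normal approximation to the noncentral t calculation."""
    z_a = NormalDist().inv_cdf(1 - alpha)            # one-sided test level
    z_b = NormalDist().inv_cdf(1 - (1 - power) / 2)  # both bounds must be rejected
    return math.ceil(2 * (z_a + z_b) ** 2 / d_bound ** 2)

print(tost_n_per_group(0.4))  # -> 136, the number used above
```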

In the screenshot above you see the analysis and the results. By default, TOSTER uses Welch’s t-test, which is preferable over Student’s t-test (as we explain in this recent article), but if you want to reproduce the results in the original article, you can check the ‘Assume equal variances’ checkbox. To conclude equivalence in a two-sided test, we need to be able to reject both equivalence bounds, and with p-values of 0.002 and < 0.001, we do. Thus, we can reject an effect larger than d = 0.4 or smaller than d = -0.4, and given these equivalence bounds, conclude the effect is too small to be considered support for the presence of an effect that is large enough, for our current purposes, to matter.
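The two one-sided p-values come from testing the observed difference against each equivalence bound. The logic can be sketched with a large-sample normal approximation of Welch's t-test (the group sizes here are in the hundreds, so t and z are close); the means and standard deviations below are made up for illustration, not the values from Yap et al.:

```python
import math
from statistics import NormalDist

def tost_welch_approx(m1, sd1, n1, m2, sd2, n2, low, high):
    """Two one-sided tests of the raw mean difference against the
    equivalence bounds (low, high). Large-sample normal approximation
    of Welch's t-test; rough for small samples."""
    se = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)  # Welch standard error
    diff = m1 - m2
    p_lower = 1 - NormalDist().cdf((diff - low) / se)  # H0: diff <= low
    p_upper = NormalDist().cdf((diff - high) / se)     # H0: diff >= high
    return max(p_lower, p_upper)  # equivalence if this is < alpha

# Hypothetical summary statistics; bounds of d = 0.4 times a pooled sd of 1.25
p = tost_welch_approx(5.25, 1.2, 117, 5.19, 1.3, 167, -0.5, 0.5)
print(p < 0.05)  # True: both bounds rejected, so we declare equivalence
```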

Jamovi runs on R, and it’s a great way to start to explore R itself, because you can easily reproduce the analysis we just did in R. To use equivalence tests with R, we can download the original datafile (R will have no problems with NA as missing values), and read it into R.  Then, in the top right corner of jamovi, click the … options window, and check the box ‘syntax mode’.

You’ll see the output window change to the input and output style of R. You can simply right-click the syntax at the top, choose Syntax > Copy, go to R, and paste the syntax:

Running this code gives you exactly the same results as jamovi.

I collaborated a lot with Jonathon Love on getting the TOSTER package ready for jamovi. The team is incredibly helpful, so if you have a nice statistics package that you want to make available to a huge ‘not-yet-R-using’ community, I would totally recommend checking out the developers hub and getting started! We are seeing all sorts of cool power-analyses Shiny apps, meta-analysis spreadsheets, and meta-science tools like p-checker that now live on websites all over the internet, but that could all find a good home in jamovi. If you already have the R code, all you need to do is make it available as a module!

Friday, March 10, 2017

No, the p-values are not to blame: Part 53

In the latest exuberant celebration of how Bayes Factors will save science, Ravenzwaaij and Ioannidis write: “our study offers through simulations yet another demonstration of the unfortunate effect of p-values on statistical inferences.” Uh oh – what have these evil p-values been up to this time?

Because the Food and Drug Administration thinks two significant studies are a good threshold before they'll allow you to put stuff in your mouth, in a simple simulation, Ravenzwaaij and Ioannidis look at what Bayes factors have to say when researchers find exactly two p < 0.05.

If you find two significant effects in two studies, and there is a true effect of d = 0.5, the data is super-duper convincing. The blue bars below indicate Bayes Factors > 20, the tiny green parts are BF > 3 but < 20 (still fine).

Even when you study a small effect with d = 0.2, after observing two significant results in two studies, everything is hunky-dory.

So p-values work like a charm, and there is no problem. THE END.

What's that you say? This simple message does not fit your agenda? And it's unlikely to get published? Oh dear! Let's see what we can do!

Let's define 'support for the null-hypothesis' as a BF < 1. After all, just as a 49.999% rate of heads in a coin flip is support for a coin biased towards tails, any BF < 1 is stronger support for the null than for the alternative. Yes, normally researchers consider 1/3 < BF < 3 as 'inconclusive', but let's ignore that for now.

The problem is we don't even have BF < 1 in our simulations so far. So let's think of something else. Let's introduce our good old friend lack of power!

Now we simulate a bunch of studies, until we find exactly 2 significant results. Let's say we do 20 studies where the true effect is d = 0.2, and only find an effect in 2 studies. We have 15% power (because we do a tiny study examining a tiny effect). This also means that the effect size estimates in the 18 other studies have to be small enough not to be significant. Then, we calculate Bayes Factors "for the combined data from the total number of trials conducted." Now what do we find?
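That selection step can be sketched as follows (a simplification of their simulation, not their actual script: variance is assumed known, so a z-test stands in for the t-test, and the 20-per-group sample size is a made-up value):

```python
import math
import random
from statistics import NormalDist

def simulate_selected_batch(n=20, d=0.2, k_studies=20, k_sig=2, seed=1):
    """Draw batches of k_studies two-group studies (n per group, true effect
    d, sd = 1) until exactly k_sig of them reach p < .05, then pool."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    se = math.sqrt(2 / n)  # standard error of the mean difference
    for _ in range(10_000):  # the 2-of-20 condition is hit often enough
        diffs = [rng.gauss(d, se) for _ in range(k_studies)]
        sig = [abs(x) / se > z_crit for x in diffs]
        if sum(sig) == k_sig:
            return diffs, sig, sum(diffs) / k_studies
    raise RuntimeError("condition never met")

diffs, sig, pooled = simulate_selected_batch()
print(sum(sig), round(pooled, 2))
```

The 18 non-significant estimates are each constrained to be small, so the pooled estimate tends to sit below the true d, which is what pulls a Bayes factor computed on the combined data toward the null.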

Look! Black stuff! That's bad. The 'statistical evidence actually favors the null hypothesis', at least based on a BF < 1 cut-off. If we include the possibility of 'inconclusive evidence' (applying the widely used 1/3 < BF < 3 thresholds), we see that actually, when you find only 2 out of 20 significant studies when you have 15% power, the overall data is sometimes inconclusive (but not support for H0).

That's not surprising. When we have 20 people per cell, and d = 0.2, when we combine all the data to calculate the Bayes factor (so we have N = 400 per cell) the data is inconclusive sometimes. After all, we only have 88% power! That's not bad, but the data you collect will sometimes still be inconclusive!

Let's see if we can make it even worse, by introducing our other friend, publication bias. They show another example of when p-values lead to bad inferences, namely when there is no effect, we do 20 studies, and find 2 significant results (which are Type 1 errors).

Wowzerds, what a darkness! Aren't you surprised? No, I didn't think so.

To conclude: Inconclusive results happen. With small samples and small effects, there is huge variability in the data. This is not only true for p-values, but it is just as true of Bayes Factors (see my post on the Dance of the Bayes Factors here).

I can understand the authors might be disappointed by the lack of enthusiasm of the FDA (which cares greatly about controlling error rates, given that they deal with life and death) to embrace Bayes Factors. But the problems the authors simulate are not going to be fixed by replacing p-values by Bayes Factors. It's not that "Use of p-values may lead to paradoxical and spurious decision-making regarding the use of new medications." Publication bias and lack of power lead to spurious decision making - regardless of the statistic you throw at the data.

I'm gonna bet that a little less Bayesian propaganda, a little less p-value bashing for no good reason, and a little more acknowledgement of the universal problems of publication bias and too small sample sizes for any statistical inference we try to make, is what will really improve science in the long run.

P.S. The authors shared their simulation script with the publication, which was extremely helpful in understanding what they actually did, and which allowed me to make the figure above which includes an 'inconclusive' category (in which I also used a slightly more realistic prior when you expect small effects – I don't think it matters but I'm too impatient to redo the simulation with the same prior and only different cut-offs).

Monday, March 6, 2017

How p-values solve 50% of the problems with p-values

Greenland and colleagues (Greenland et al., 2016) published a list with 25 common misinterpretations of statistical concepts such as power, confidence intervals, and, in points 1-10, p-values. Here I’ll explain how 50% of these problems are resolved by using equivalence tests in addition to null-hypothesis significance tests.

First, let’s look through the 5 points we will resolve:

4. A nonsignificant test result (P > 0.05) means that the test hypothesis [NOTE DL: This is typically the null-hypothesis] is true or should be accepted.
5. A large P value is evidence in favor of the test hypothesis.
6. A null-hypothesis P value greater than 0.05 means that no effect was observed, or that absence of an effect was shown or demonstrated.
7. Statistical significance indicates a scientifically or substantively important relation has been detected.
8. Lack of statistical significance indicates that the effect size is small.

With an equivalence test, we specify which effect is scientifically or substantially important. We call this the ‘smallest effect size of interest’. If we find an effect that is surprisingly small, assuming any effect we would care about exists, we can conclude practical equivalence (and we’d not be wrong more than 5% of the time).

Let’s take a look at the figure below (adapted from Lakens, 2017). A mean difference of Cohen’s d = 0.5 (either positive or negative) is specified as the smallest effect size of interest. Data is collected, and one of four possible outcomes is observed.

Equivalence tests fix point 4 above, because a non-significant result no longer automatically means we can accept the null hypothesis. We can accept the null if we find the pattern indicated by A: the p-value from NHST is > 0.05, and the p-value for the equivalence test is < 0.05. However, if the p-value for the equivalence test is also > 0.05, the outcome matches pattern D, and we cannot reject either hypothesis, and thus remain undecided. Equivalence tests similarly fix point 5: a large p-value is not evidence in favor of the test hypothesis. Ignoring that p-values don’t have an evidential interpretation to begin with, we can only conclude equivalence under pattern A, not under pattern D. Point 6 is just more of the same (empirical scientists inflate error rates, statisticians inflate lists criticizing p-values): we can only conclude there is absence of an effect under pattern A, but not under pattern D.

Point 7 is solved because by using equivalence tests, we can also observe pattern C: an effect is statistically significant, but also smaller than anything we care about, or practically equivalent to zero. Only under pattern B can we conclude the effect is significant, and that the possibility that the effect is large enough to matter cannot be rejected. Similarly, point 8 is solved because we can only conclude an effect is non-significant and small under pattern A, but not under pattern D. When there is no significant difference (P > 0.05), but also no statistical equivalence (P > 0.05), it is still possible there is an effect large enough to be interesting.
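The four patterns can be summarized in a small decision helper (a sketch; the labels follow the A–D patterns described above, with alpha = 0.05):

```python
def interpret(p_nhst, p_tost, alpha=0.05):
    """Combine a two-sided NHST p-value with a TOST (equivalence) p-value
    into one of the four patterns A-D."""
    different = p_nhst <= alpha   # reject 'the effect is zero'
    equivalent = p_tost <= alpha  # reject 'the effect is outside the bounds'
    if equivalent and not different:
        return "A: equivalent, not different: accept the null for practical purposes"
    if different and not equivalent:
        return "B: different, not equivalent: possibly an effect that matters"
    if different and equivalent:
        return "C: different but equivalent: significant, yet too small to matter"
    return "D: neither test significant: undecided"

print(interpret(0.20, 0.01))  # starts with "A"
```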

We see that p-values (from equivalence tests) solve 50% of the misinterpretations of p-values (from NHST). They also allow you to publish ‘null results’ as statistically significant findings, because a significant equivalence test rejects the presence of a meaningful effect. They are just as easy to calculate and report as a t-test. Read my practical primer if you want to get started.

P.S. In case you are wondering about the other 50% of the misinterpretations: These are all solved by remembering one simple thing. P-values tell you nothing about the probability a hypothesis is true (neither about the alternative hypothesis, nor about the null hypothesis). This fixes points 1, 2, 3, 9, and 10 below by Greenland and colleagues. So if we use equivalence tests together with null-hypothesis significance tests, and remember p-values do not tell us anything about the probability a hypothesis is true, you’re good.

1. The P value is the probability that the test hypothesis is true; for example, if a test of the null hypothesis gave P = 0.01, the null hypothesis has only a 1 % chance of being true; if instead it gave P = 0.40, the null hypothesis has a 40 % chance of being true.
2. The P value for the null hypothesis is the probability that chance alone produced the observed association; for example, if the P value for the null hypothesis is 0.08, there is an 8 % probability that chance alone produced the association.
3. A significant test result (P ≤ 0.05) means that the test hypothesis is false or should be rejected.
9. The P value is the chance of our data occurring if the test hypothesis is true; for example, P = 0.05 means that the observed association would occur only 5 % of the time under the test hypothesis.
10. If you reject the test hypothesis because P ≤ 0.05, the chance you are in error (the chance your ‘significant finding’ is a false positive) is 5 %.


Greenland, S., Senn, S. J., Rothman, K. J., Carlin, J. B., Poole, C., Goodman, S. N., & Altman, D. G. (2016). Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European Journal of Epidemiology, 31(4), 337–350.
Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science.

Sunday, February 12, 2017

ROPE and Equivalence Testing: Practically Equivalent?

In a previous post, I compared equivalence tests to Bayes factors, and pointed out several benefits of equivalence tests. But a much more logical comparison, and one I did not give enough attention to so far, is the ROPE procedure using Bayesian estimation. I’d like to thank John Kruschke for feedback on a draft of this blog post. Check out his own recent blog comparing ROPE to Bayes factors here.

When we perform a study, we would like to conclude there is an effect, when there is an effect. But it is just as important to be able to conclude there is no effect, when there is no effect. I’ve recently published a paper that makes Frequentist equivalence tests (using the two-one-sided tests, or TOST, approach) as easy as possible (Lakens, 2017). Equivalence tests allow you to reject the presence of any effect you care about. In Bayesian estimation, one way to argue for the absence of a meaningful effect is the Region of Practical Equivalence (ROPE) procedure (Kruschke, 2014, chapter 12), which is “somewhat analogous to frequentist equivalence testing” (Kruschke & Liddell, 2017).

In the ROPE procedure, a 95% Highest Density Interval (HDI) is calculated based on a posterior distribution (which is calculated based on a prior and the data). Kruschke suggests that: “if the 95 % HDI falls entirely inside the ROPE then we decide to accept the ROPE’d value for practical purposes”. Note that the same HDI can also be used to reject the null hypothesis, where in Frequentist statistics, even though the confidence interval plays a similar role, you would still perform both a traditional t-test and the TOST procedure.

The only real difference with equivalence testing is that instead of a confidence interval, a Bayesian Highest Density Interval is used. If the prior used by Kruschke were perfectly uniform, ROPE and equivalence testing would be identical, barring philosophical differences in how the numbers should be interpreted. The BEST package by default uses a ‘broad’ prior, and therefore the 95% CI and 95% HDI are not exactly the same, but they are very close for single comparisons. When multiple comparisons are made (for example when using sequential analyses, Lakens, 2014), the CI needs to be adjusted to maintain the desired error rate, but in Bayesian statistics, error rates are not directly controlled (they are limited due to ‘shrinkage’, but can be inflated beyond 5%, and often considerably so).
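The flat-prior case is easy to verify in the simplest setting, a single mean with known sd: a uniform prior makes the posterior proportional to the likelihood, so the 95% HDI and 95% CI coincide exactly (a sketch of that special case, not of BEST's broad-prior model; the sample summary is made up):

```python
from statistics import NormalDist

xbar, sd, n = 0.15, 1.0, 50  # made-up sample summary, known sd
se = sd / n ** 0.5
z = NormalDist().inv_cdf(0.975)

ci = (xbar - z * se, xbar + z * se)        # Frequentist 95% CI
posterior = NormalDist(mu=xbar, sigma=se)  # flat-prior posterior for the mean
hdi = (posterior.inv_cdf(0.025), posterior.inv_cdf(0.975))  # symmetric, so HDI

print(max(abs(a - b) for a, b in zip(ci, hdi)) < 1e-9)  # True
```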

In the code below, I generate random normally distributed data (with means of 0 and an sd of 1) and perform the ROPE procedure and the TOST. The 95% HDI is from -0.10 to 0.42, and the 95% CI is from -0.11 to 0.41, with mean differences of 0.17 and 0.15, respectively.

Indeed, if you will forgive me the pun, you might say these two approaches are practically equivalent. But there are some subtle differences between ROPE and TOST.

95% HDI vs 90% CI

Kruschke (2014, Chapter 5) writes: “How should we define “reasonably credible”? One way is by saying that any points within the 95% HDI are reasonably credible.” There is no strong justification for the use of a 95% HDI over a 96% or 93% HDI, except that it mirrors the familiar use of a 95% CI in Frequentist statistics. In Frequentist statistics, the 95% confidence interval is directly related to the 5% alpha level that is commonly deemed acceptable for a maximum Type 1 error rate (even though this alpha level is in itself a convention without strong justification).

But here’s the catch: the TOST equivalence testing procedure does not use a 95% CI, but a 90% CI. The reason is that two one-sided tests are performed. Each of these tests has a 5% error rate. You might intuitively think that doing two tests with a 5% error rate will increase the overall Type 1 error rate, but in this case, that’s not true. You could easily replace the two tests with just one test, testing the observed effect against the equivalence bound (upper or lower) closest to it. If this test is statistically significant, so is the other – and thus, there is no alpha inflation in this specific case. That’s why the TOST procedure uses a 90% CI to have a 5% error rate, while the same researcher would use a 95% CI in a traditional two-sided t-test to examine whether the observed effect is statistically different from 0, while maintaining a 5% error rate (see also Senn, 2007, section 22.2.4).
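The 'one test implies the other' argument means the two decision rules, 'both one-sided 5% tests reject' and 'the 90% CI falls inside the bounds', always agree. A quick numerical check (normal approximation; the bounds and standard error are made up):

```python
from statistics import NormalDist

z95 = NormalDist().inv_cdf(0.95)  # critical value for a one-sided 5% test

def tost_rejects(diff, se, low, high, alpha=0.05):
    """Both one-sided tests against the bounds reject at level alpha."""
    p_lower = 1 - NormalDist().cdf((diff - low) / se)
    p_upper = NormalDist().cdf((diff - high) / se)
    return max(p_lower, p_upper) < alpha

def ci90_inside(diff, se, low, high):
    """The 90% confidence interval lies entirely within the bounds."""
    return low < diff - z95 * se and diff + z95 * se < high

se, low, high = 0.1, -0.5, 0.5
agree = all(
    tost_rejects(d / 100, se, low, high) == ci90_inside(d / 100, se, low, high)
    for d in range(-80, 81)  # observed differences from -0.8 to 0.8
)
print(agree)  # True
```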

This nicely illustrates the difference between estimation (where you just want to have a certain level of accuracy, such as 95%), and Frequentist hypothesis testing, where you want to distinguish between signal and noise, and not be wrong more than 5% of the time when you declare there is a signal. ROPE keeps the accuracy the same across tests, Frequentist approaches keep the error rate constant. From a Frequentist perspective, ROPE is more conservative than TOST, like the use of alpha = 0.025 is more conservative than the use of alpha = 0.05.

Power analysis

For an equivalence test, power analysis can be performed with closed-form functions, and the calculations take just a fraction of a second. I find that useful, for example in my role on our ethics board, where we evaluate proposals that have to justify their sample size, and we often check power calculations. Kruschke has an excellent R package (BEST) that can do power analyses for the ROPE procedure. This is great work – but the simulations take a while (a little over an hour for 1000 simulations).

Because the BESTpower function relies on simulations, you need to specify the sample size, and it will calculate the power. That’s actually the reverse of what you typically want in a power analysis (you want to input the desired power, and see which sample size you need). This means you will most likely need to run multiple simulations in BESTpower before you have determined the sample size that will yield good power. Furthermore, the software requires you to specify the expected means and standard deviations, instead of simply an expected effect size. Whereas in Frequentist power analysis the hypothesized effect size is a point value (e.g., d = 0.4), Bayesian power analysis models the alternative as a distribution, acknowledging there is uncertainty.

In the end, however, the result of a power analysis for ROPE and for TOST is actually remarkably similar. Using the code below to perform the power analysis for ROPE, we see that 100 participants in each group give us approximately 88.4% power (with 2000 simulations, this estimate is still a bit uncertain) to get a 95% HDI that falls within our ROPE of -0.5 to 0.5, assuming standard deviations of 1.

We can use the powerTOSTtwo.raw function in the TOSTER package (using an alpha of 0.025 instead of 0.05, to mirror the 95% HDI) to calculate the sample size we would need to achieve 88.4% power for an independent t-test (using equivalence bounds of -0.5 and 0.5, and standard deviations of 1):


The outcome is 100 as well. So if you use a broad prior, it seems you can save yourself some time by using the power analysis for equivalence tests, without severe consequences.
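That agreement can be sanity-checked with a normal approximation of the TOST power calculation (a sketch, not TOSTER's exact noncentral-t computation):

```python
import math
from statistics import NormalDist

def tost_power(n, d_bound, alpha=0.025):
    """Approximate TOST power with a true effect of 0, symmetric bounds of
    +/- d_bound, n per group, and sd = 1 (normal approximation)."""
    z_a = NormalDist().inv_cdf(1 - alpha)
    return max(0.0, 2 * NormalDist().cdf(d_bound * math.sqrt(n / 2) - z_a) - 1)

print(round(tost_power(100, 0.5), 3))  # close to the 88.4% from the simulation

# smallest n per group that reaches that power
n = next(n for n in range(2, 1000) if tost_power(n, 0.5) >= 0.884)
print(n)  # 100
```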

Use of prior information

The biggest benefit of ROPE over TOST is that it allows you to incorporate prior information in your data analysis. If you have reliable prior information, ROPE can use this information, which is especially useful if you don’t have a lot of data. If you use priors, it is typically advised to check the robustness of the posterior against reasonable changes in the prior (Kruschke, 2013).


Using the ROPE procedure or the TOST procedure will most likely lead to very similar inferences. For all practical purposes, the differences are small. It’s quite a lot easier to perform a power analysis for TOST, and by default, TOST has greater statistical power because it uses a 90% CI. But power analysis is possible for ROPE (which is a rare pleasure to see for Bayesian analyses), and you could choose to use a 90% HDI, or any other value that matches your goals. TOST will be easier and more familiar because it is just a twist on the classic t-test, but ROPE might be a great way to dip your toes in Bayesian waters and explore the many more things you can do with Bayesian posterior distributions.


Kruschke, J. (2013). Bayesian estimation supersedes the t test. Journal of Experimental Psychology: General, 142(2), 573–603.
Kruschke, J. (2014). Doing Bayesian Data Analysis, Second Edition: A Tutorial with R, JAGS, and Stan (2 edition). Boston: Academic Press.
Kruschke, J., & Liddell, T. M. (2017). The Bayesian New Statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective. Psychonomic Bulletin & Review.
Lakens, D. (2014). Performing high-powered studies efficiently with sequential analyses: Sequential analyses. European Journal of Social Psychology, 44(7), 701–710.
Lakens, D. (2017). Equivalence tests: A practical primer for t-tests, correlations, and meta-analyses. Social Psychological and Personality Science.
Senn, S. (2007). Statistical issues in drug development (2nd ed.). Chichester, England; Hoboken, NJ: John Wiley & Sons.