The 20% Statistician: October 2014

Yesterday Mike McCullough posted an interesting question on Twitter. He had collected some data, observed a p = 0.026 for his hypothesis, but he wasn't happy. Being aware that relatively high p-values do not always provide strong support for H1, he wanted to set a new goal and collect more data. With sequential analyses it's no problem to look at the data (at least when you plan the looks ahead of time) and collect additional observations (because you control the false positive rate), so the question was: which goal should you have?

Mike suggested a p-value of 0.01, which (especially with increasing sample sizes) is a good target. But others quickly suggested forgetting about those damned p-values altogether and planning for accuracy instead. Planning for accuracy is simple: you decide upon the width of the confidence interval you'd like, and determine the sample size you need.
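To make the "planning for accuracy" step concrete, here is a minimal sketch. It assumes a confidence interval for a single mean, a known SD, and a normal approximation (no t correction); the function name and defaults are mine, purely for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_for_ci_width(width, sd=1.0, confidence=0.95):
    """Smallest n for which the normal-approximation CI of a mean
    is at most `width` wide (full width, not half-width)."""
    # Two-sided critical value, e.g. about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Full width = 2 * z * sd / sqrt(n); solve for n and round up
    return ceil((2 * z * sd / width) ** 2)


# Required sample sizes for a few candidate widths (SD assumed to be 1)
for target in (0.40, 0.30, 0.25):
    print(target, n_for_ci_width(target))
```

The only input besides the desired width is the SD, which in practice you have to guess at just as you would guess at an effect size.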

I don't really understand why people are pretending like these two choices are any different. They always boil down to the same thing: your sample size. A larger sample size will give you more power, and thus a better chance of observing a p-value of 0.01, or 0.001. A larger sample size will also reduce the width of your confidence interval.

So the only difference is which calculation you base your sample size on. You either decide upon an effect size you expect, or a width of the confidence interval you desire, and calculate the sample size. One criticism against power analysis is that you often don't know the effect size (e.g., Maxwell, Kelley, & Rausch, 2008). But with sequential analyses (e.g., Lakens, 2014) you can simply collect some data, and calculate conditional power based on the observed effect size for the remaining sample.
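As a rough illustration of conditional power, the sketch below uses the common "current trend" approximation: it takes the z statistic at the interim look and the fraction of the planned sample collected so far, and asks how likely the final z is to cross the critical value if the observed effect holds up. The function name is hypothetical, and `z_crit` is a placeholder: in a real sequential design you would plug in the critical value your alpha-spending plan assigns to the final look, not a plain 1.96.

```python
from math import sqrt
from statistics import NormalDist


def conditional_power(z_interim, info_fraction, z_crit=1.96):
    """Approximate probability that the final z exceeds z_crit,
    assuming the effect observed so far continues unchanged."""
    if not 0 < info_fraction < 1:
        raise ValueError("info_fraction must be strictly between 0 and 1")
    # Expected final z if the current trend continues
    expected_final_z = z_interim / sqrt(info_fraction)
    # The data still to be collected contribute variance 1 - info_fraction
    remaining_sd = sqrt(1 - info_fraction)
    return 1 - NormalDist().cdf((z_crit - expected_final_z) / remaining_sd)


# Illustrative values only: z = 1.5 halfway through data collection
print(round(conditional_power(1.5, 0.5), 3))
```

If the conditional power at the planned sample size is too low for your taste, you can repeat the calculation for a larger final sample (which lowers `info_fraction`) until it is acceptable.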

I think a bigger problem is that people have no clue whatsoever when determining an appropriate width for a confidence interval. I've argued before that people have a much better feel for p-values than confidence intervals.

In the graph below, you see 60 one-sided t-tests, all examining a true effect with a mean difference of 0.3 (dark vertical line) with a SD of 1. The bottom 20 are based on a sample size of 118, the middle 20 on a sample size of 167, and the top 20 on a sample size of 238. This gives you 90% power at an alpha of 0.05, 0.01, and 0.001, respectively. Not surprisingly, as power increases, fewer confidence intervals include 0 (i.e., more of them are significant). The larger the sample size, the further the confidence intervals stay away from 0.
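The sample sizes above can be roughly reproduced with a textbook normal-approximation power formula. This is a sketch under assumptions of my own choosing, not the calculation behind the original figure: I assume a one-sample test with two-sided alpha, since that is the setup whose approximate sample sizes land within a few observations of the numbers quoted (an exact t-based calculation gives slightly larger values).

```python
from math import ceil
from statistics import NormalDist


def n_for_power(effect_size, alpha, power=0.90):
    """Approximate n for a one-sample test with two-sided alpha,
    using z critical values instead of the exact t distribution."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = nd.inv_cdf(power)          # quantile for the desired power
    return ceil(((z_alpha + z_power) / effect_size) ** 2)


# Mean difference of 0.3 with SD 1, so the standardized effect is 0.3
for alpha in (0.05, 0.01, 0.001):
    print(alpha, n_for_power(0.3, alpha))
```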

Take a look at the width of the confidence intervals. Can you see the differences? Do you feel the difference in aiming for a width of the confidence interval of 0.40, 0.30, or 0.25 (more or less the width in the three groups from bottom to top)? If not, but you do have a feel for the difference between aiming for p = 0.01 or p = 0.001, then go for the conditional power analysis. I would.
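If you want to check how wide the intervals in the three groups should be on average, the calculation is short. The sketch assumes 95% intervals around a single mean, SD = 1, and a normal approximation; the helper name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(n, sd=1.0, confidence=0.95):
    """Full width of the normal-approximation CI for a mean."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return 2 * z * sd / sqrt(n)


# The three sample sizes used in the figure, bottom to top
for n in (118, 167, 238):
    print(n, round(ci_width(n), 2))
```

Under these assumptions the widths come out at roughly 0.36, 0.30, and 0.25, so doubling the sample size buys you about a tenth of a standard deviation in width.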

R script that produced the figure above:

Chris Fraley and Simine Vazire (F&V) published a very interesting paper in PlosOne where they propose to evaluate journals based on the sample size and statistical power of the studies they publish. As the authors reason: “All else being equal, we believe that journals that publish empirical studies based on highly powered designs should be regarded as more prestigious and credible scientific outlets than those that do not.” What they find is “the journals that have the highest impact also tend to publish studies that have smaller samples”. How can this be? Do we simply not care about the informational value of studies, or even prefer to cite smaller studies?

What is ‘impact’?

‘Impact Factor’ should join ‘significance’ in the graveyard of misleading concepts in science. For an excellent blog post about some of the problems with the impact factor, go here. We intuitively feel high impact factor journals (see how I am not using ‘high impact journals’, just as I prefer ‘a statistical difference’ over ‘statistical significance’?) should publish high quality research, but citation rates are extremely skewed. For example, the paper by Simmons, Nelson, & Simonsohn (2011) illustrating problems with small samples and false positives was cited more than 200 times within the first 2 years, and has greatly contributed to the impact factor of Psychological Science (it’s ok if you find that ironic).

The relation between the median sample size (used by F&V) and impact factor is one approach to examine whether number of citations and sample size are related, but we should probably be especially interested in the small number of studies in high impact factor journals that contribute most to the impact factor. At least some of these are probably not even empirical papers (and please don’t start citing Cumming, 2014, in Psychological Science, whenever you want to refer to “The New Statistics” – it just shows you were too lazy to read the book; you should cite Cumming, 2012). Even so, F&V would probably note that there are simply too many articles with tiny sample sizes getting too many citations, and I’d agree.

There are several reasons for this, but all of them are caused by you and me, because we are the ones doing the citing. We don’t always (or we often don’t?) cite articles because of their quality (again, see this blog). Let me add one. As we discuss (Koole & Lakens, 2012), psychological science has a strong narrative tradition. We like to present our research as a story, instead of as a bunch of dry facts. This culture has many consequences (such as an underappreciation of telling the same story twice by publishing replications, and a tendency to only tell the post-hoc final edited version of the story, and not the one you initially had in mind [see Bem, 1987]) but it also means we highly reward the first person to come up with a story – even though their data wasn’t particularly strong.

F&V’s main point, I think, is not that we should have expected sample size and impact factors to be correlated, but more normative: We should want impact factors and sample size to be related. Their argument for a cultural shift towards a greater appreciation of sample size as an indicator of the quality of a study is important, and makes sense, rationally. Although I don’t think people will easily give up their narrative tradition, the new generation of reviewers with highly improved statistical knowledge is no longer convinced by an excellent story arc, but wants to see empirical support for your theoretical rationale. When you write ‘We know that X leads to more Y (Someone, Sometime), and therefore predict…’ you can still reference someone who happened to have published about the topic slightly earlier than someone else. I’m not asking you to give up your culture. But if that first study had a sample size of 20 per condition without examining an effect that should clearly be huge (d>1), know that you are expected to add a reference to a study that provides convincing empirical support for the narrative (showing the same basic idea in a larger sample), or reviewers will not be convinced.

Is N all important?

Fraley and Vazire (2014) only code sample size, not the type of design, or the number of conditions. At the same time, we know Psychological Science likes elegant designs (which might very well mean simple comparisons between two conditions, and not a 2x2x3 design examining the impact of some moderator). This might explain why sample sizes are smaller in Psychological Science. This also has a consequence for the power calculations by F&V, which are similarly based on the assumption that journals do not differ in the type of designs they publish. But if the Journal of Research in Personality (a journal which F&V show has larger samples, but a lower impact factor) publishes a lot more correlational or between-subject studies than Psychological Science, that could matter quite a bit.

This doesn’t mean Psychological Science is off the hook. Table 4 in F&V illustrates that the median sample size only gives sufficient power to observe large effects, and it is unrealistic to assume that all studies published in Psychological Science examine large effects. This is not very surprising (we immediately realize why the paper by F&V was not published in Psychological Science, wink). However, low sample sizes are especially problematic for journals like Psychological Science, whose editors say “We hope to publish manuscripts that are innovative and ground-breaking and that address issues likely to interest a wide range of scientists in the field.” There are different types of innovative, but one is where everyone (researchers themselves and readers) considers a finding ‘surprising’ or ‘counterintuitive’. If a journal publishes findings that are a-priori unlikely (so less than 50% probable, however subjective this might be), collecting a large sample becomes even more important if you’d like H1 to have a high posterior probability in a Bayesian sense. F&V present good arguments to have large samples using Frequentist assumptions – which similarly become more important when examining a-priori unlikely hypotheses.
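A back-of-the-envelope version of the Bayesian point: given a prior probability that H1 is true, the power of the study, and the alpha level, you can compute the probability that H1 is true after observing a significant result. The sketch below treats "significant" as the only information you condition on, which is a simplification, and the prior and power values in the example are made up for illustration.

```python
def posterior_h1(prior, power, alpha=0.05):
    """Probability that H1 is true given a significant result,
    treating 'p < alpha' as the only evidence observed."""
    true_positives = power * prior          # H1 true and detected
    false_positives = alpha * (1 - prior)   # H0 true but significant anyway
    return true_positives / (true_positives + false_positives)


# A 'surprising' hypothesis with a hypothetical prior of 25%,
# tested in a low-powered and in a high-powered study
for power in (0.35, 0.90):
    print(power, round(posterior_h1(0.25, power), 2))
```

The lower the prior, the more the posterior depends on power, which is why counterintuitive findings from small samples deserve the most scepticism.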

The solution is to run larger samples (not necessarily by running experiments with 200 people as Simine Vazire suggests on her blog, but for example by using sequential analyses) to increase power, and to perform close replications (which reduce Type 1 errors).

A good start

The N-pact factor might be a good starting point for people to use when deciding what to cite. Remember that the sample size is just a proxy for power (small studies can have high power, if there is good reason to believe the effect size is very large) and power is only one dimension you can use to evaluate studies (you can also look at the a-priori likelihood, the effect size, etc.). Nevertheless, research tells us that reviewers only moderately agree on the quality of a scientific article (and people are often biased in their quality judgments based on the impact factor of the journal a paper was published in), so it seems that at least for now, asking people to use sample size as a proxy of the informational value of studies is a good start. In a few years, we should hope the impact factor and N-pact factor have become at least somewhat positively correlated – preferably because high impact journals start to publish more studies with large sample sizes, and because people start to reward individuals who took the effort to collect larger samples, and thereby contribute studies with a higher informational value to the scientific literature, by citing their work more.
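To see why sample size is only a proxy for power, compare a study with 20 participants per condition under a very large and a medium effect. The sketch uses a normal approximation for a two-group comparison with two-sided alpha (it ignores the negligible chance of a significant result in the wrong direction, and slightly overstates power at such small samples); the function name is mine.

```python
from math import sqrt
from statistics import NormalDist


def power_two_groups(d, n_per_group, alpha=0.05):
    """Approximate power of a two-group comparison for a
    standardized mean difference d, using z instead of t."""
    nd = NormalDist()
    z_alpha = nd.inv_cdf(1 - alpha / 2)
    # Expected z statistic for equal group sizes
    expected_z = d * sqrt(n_per_group / 2)
    return nd.cdf(expected_z - z_alpha)


# Same small study, very different power depending on the true effect
for d in (1.0, 0.5):
    print(d, round(power_two_groups(d, 20), 2))
```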

Postscript

In my hometown, there are two art fairs. The traditional one sells hugely overpriced pieces of art by established artists who are ‘hot’ as determined by the majority of art collectors. The other one, the Raw Art Fair, showcases the work of artists that don’t yet have a lot of impact. Many never will, but for me, the raw art fair is always more memorable, because it makes you think about what you are seeing, and forces you to judge the quality based on your own criteria. For exactly the same reason I prefer to read papers on SSRN, PlosOne, or Frontiers.

Thursday, October 30, 2014

Sample Size Planning: P-values or Precision?

Friday, October 10, 2014

Why Do We Cite Small N Studies?