Thursday, 12 July 2018

One big study or two small studies? Insights from simulations

At a recent conference, someone posed a question that had been intriguing me for a while: suppose you have limited resources, with the potential to test N participants. Would it be better to do two studies, each with N/2 participants, or one big study with all N?

I've been on the periphery of conversations about this topic, but never really delved into it, so I gave a rather lame answer. I remembered hearing that statisticians would recommend the one big study option, but my intuition was that I'd trust a result that replicated more than one which was a one-off, even if the latter was from a bigger sample. Well, I've done the simulations and it's clear that my intuition is badly flawed.

Here's what I did. I adapted a script that is described in my recent slides that give hands-on instructions for beginners on how to simulate data, The script, Simulation_2_vs_1_study_b.R, which can be found here, generates data for a simple two-group comparison using a t-test. In this version, on each run of the simulation, you get output for one study where all subjects are divided into two groups of size N, and for two smaller studies each with half the number of subjects. I ran it with various settings to vary both the sample size and the effect size (Cohen's d). I included the case where there is no real difference between groups (d = 0), so I could estimate the false positive rate as well as the power to detect a true effect.

I used a one-tailed t-test, as I had pre-specified that group B had the higher mean when d > 0. I used a traditional approach with p-value cutoffs for statistical significance (and yes, I can hear many readers tut-tutting, but this is useful for this demonstration….) to see how often I got a result that met each of three different criteria:
  • a) Single study, p < .05 
  • b) Split sample, p < .05 replicated in both studies 
  • c) Single study, p < .005

Figure 1 summarises the results.
Figure 1

The figure is pretty busy but worth taking a while to unpack. Power is just the proportion of runs of the simulation where the significance criterion was met. It's conventional to adopt a power cutoff of .8 when deciding on how big a sample to use in a study. Sample size is colour coded, and refers to the number of subjects per group for the single study. So for the split replication, each group has half this number of subjects. The continuous line shows the proportion of results where p < .05 for the single study, the dotted line has results from the split replication, and the dashed line has results from the single study with more stringent significance criterion, p < .005 .

It's clear that for all sample sizes and all effect sizes, the one single sample is much better powered than the split replication.

But I then realised what had been bugging me and why my intuition was different. Look at the bottom left of the figure, where the x-axis is zero: the continuous lines (i.e., big sample, p < .05) all cross the y-axis at .05. This is inevitable: by definition, if you set p < .05, there's a one in 20 chance that you'll get a significant result when there's really no group difference in the population, regardless of the sample size. In contrast, the dotted lines cross the y-axis close to zero, reflecting the fact that when the null hypothesis is true, the chance of two samples both giving p < .05 in a replication study is one in 400 (.05^2 = .0025). So I had been thinking more like a Bayesian: given a significant result, how likely was it to have been come from a population with a true effect rather than a null effect? This is a very different thing from what a simple p-value tells you*.

Initially, I thought I was onto something. If we just stick with p < .05, then it could be argued that from a Bayesian perspective, the split replication approach is preferable. Although you are less likely to see a significant effect with this approach, when you do, you can be far more confident it is a real effect. In formal terms, the likelihood ratio for a true vs null hypothesis, given p < .05, will be much higher for the replication.

My joy at having my insight confirmed was, however, short-lived. I realised that this benefit of the replication approach could be exceeded with the single big sample simply by reducing the p-value so that the odds of a false positive are minimal. That's why Figure 1 also shows the scenario for one big sample with p < .005: a threshold that has recently proposed as a general recommendation for claims of new discoveries (Benjamin et al, 2018)**.

None of this will surprise expert statisticians: Figure 1 just reflects basic facts about statistical power that were popularised by Jacob Cohen in 1977. But I'm glad to have my intuitions now more aligned with reality, and I'd encourage others to try simulation as a great way to get more insights into statistical methods.

Here is the conclusions I've drawn from the simulation:
  • First, even when the two groups come from populations with different means, it's unlikely that you'll get a clear result from a single small study unless the effect size is at least moderate; and the odds of finding a replicated significant effect are substantially lower than this.  None of the dotted lines achieves 80% power for a replication if effect size is less than .3 - and many effects in psychology are no bigger than that. 
  • Second, from a statistical perspective, testing an a priori hypothesis in a larger sample with a lower p-value is more efficient than subdividing the sample and replicating the study using a less stringent p-value.
I'm not a stats expert, and I'm aware that there's been considerable debate out there about p-values - especially regarding the recommendations of Benjamin et al (2018). I have previously sat on the fence as I've not felt confident about the pros and cons. But on the basis of this simulation, I'm warming to the idea of p < .005. I'd welcome comments and corrections.

*In his paper The reproducibility of research and the misinterpretation of p-values. Royal Society Open Science, 4(171085). doi:10.1098/rsos.171085 David Colquhoun (2017) discusses these issues and notes that we also need to consider the prior likelihood of the null hypothesis being true: something that is unknowable and can only be estimated on the basis of past experience and intuition.
**The proposal for adopting p < .005 as a more stringent statistical threshold for new discoveries can be found here: Benjamin, D. J., Berger, J. O., Johannesson, M., Nosek, B. A., Wagenmakers, E. J., Berk, R., . . . Johnson, V. E. (2018). Redefine statistical significance. Nature Human Behaviour, 2(1), 6-10. doi:10.1038/s41562-017-0189-z

Postscript, 15th July 2018

This blogpost has generated a lot of discussion, mostly on Twitter. One point that particularly interested me was a comment that I hadn’t done a fair comparison between the one-study and two-study situation, because the plot showed a one-off two group study with an alpha at .005, versus a replication study (half sample size in each group) with alpha at .05. For a fair comparison, it was argued, I should equate the probabilities between the two situations, i.e. the alpha for the one-off study should be .05 squared = .0025.

So I took a look at the fair comparison: Figure 2 shows the situation when comparing one study with alpha set to .0025 vs a split replication with alpha of .05. The intuition of many people on Twitter was that these should be identical, but they aren’t. Why not? We have the same information in the two samples. (In fact, I modified the script so that this was literally true and the same sample was tested singly and again split into two – previously I’d just resampled to get the smaller samples. This makes no difference – the single sample with more extreme alpha still gives higher power).

Figure 2: Power for one-off study with alpha .0025 (dashed lines) vs. split replication with p < .05
To look at it another way, in one version of the simulation there were 1600 simulated experiments with a true effect (including all the simulated sample sizes and effect sizes). Of these 581 were identified as ‘significant’ both by the one-off study with p < .0025 and they were also replicated in two small studies with p < .05. Only 5 were identified by the split replication alone, but 134 were identified by the one-off study alone.

I think I worked out why this is the case, though I’d appreciate having a proper statistical opinion. It seems to have to do with accuracy of estimating the standard deviation. If you have a split sample and you estimate the mean from each half (A and B), then the average of mean A and mean B will be the same as for the big sample of AB combined. But when it comes to estimating the standard deviation – which is a key statistic when computing group differences – the estimate is more accurate and precise with the large sample. This is because the standard deviation is computed by measuring the difference of each value from its own sample mean. Means for A and B will fluctuate due to sampling error, and this will make the estimated SDs less reliable. You can estimate the pooled standard deviation for two samples by taking the square root of the average of the variances. However, that value is less precise than the SD from the single large sample. I haven’t done a large number of runs, but a quick check suggests that whereas both the one-off study and the split replication give pooled estimates of the SD at around the true value of 1.0, the standard deviation of the standard deviation (we are getting very meta here!) is around .01 for the one-off study but .14 for the split replication. Again, I’m reporting results from across all the simulated trials, including the full range of sample sizes and effect sizes.

Figure 3: Distribution of estimates of pooled SD; The range is narrower for the one-off study (pink) than for the split replication studies (blue). Purple shows area of overlap of distributions

This has been an intriguing puzzle to investigate, but in the original post, I hadn’t really been intending to do this kind of comparison - my interest was rather in making the more elementary point which is that there's a very low probability of achieving a replication when sample size and effect size are both relatively small.

Returning to that issue, another commentator said that they’d have far more confidence in five small studies all showing the same effect than in one giant study. This is exactly the view I would have taken before I looked into this with simulations; but I now realise this idea has a serious flaw, which is that you’re very unlikely to get those five replications, even if you are reasonably well powered, because – the tldr; message implicit in this post – when we’re talking about replications, we have to multiply the probabilities, and they rapidly get very low. So, if you look at the figure, suppose you have a moderate effect size, around .5, then you need a sample of 48 per group to get 80% power. But if you repeat the study five times, then the chance of getting a positive result in all five cases is .8^5, which is .33. So most of the time you’d get a mixture of null and positive results. Even if you doubled the sample size to increase power to around .95, the chance of all five studies coming out positive is still only .95^5 (82%).

Finally, another suggestion from Twitter is that a meta-analysis of several studies should give the same result as a single big sample. I’m afraid I have no expertise in meta-analysis, so I don’t know how well it handles the issue of more variable SD estimates in small samples, but I’d be interested to hear more from any readers who are up to speed with this.


  1. One advantage of running two studies - leaving power calculations aside - is that you get the opportunity to use real data from the first study to learn all the things that were wrong with your a-priori predictions or analysis plan.

    A point that I think is sometimes missed in calls for pre-registration is something I would summarise with the quote that "research is what I'm doing when I don't know what I'm doing". Pre-registration may have little value for studies with novel dependent measures, or for which the data holds surprises. In my experience of studies like these, sticking to the pre-registered analysis is a mistake.

    I think a better approach is to work with the data in an exploratory fashion and then pre-register the right analysis and predictions for your second, replication study.

    1. I guess the other alternative would be to do some form of leave-half-out analysis.

      e.g in the context of ERPs:
      - test N participants;
      - determine based on randomly selected N/2 the latency where the greatest effect is;
      - determine the effect size for the remaining N/2 at that latency;
      - repeat 1000x with different random N/2 subsamples;
      - average the effect sizes across the 1000 runs.

      My intuition is that this gives a more accurate picture of the true effect size. But it would probably only make sense when there are few researcher degrees of freedom.

    2. Uh - not sure why I'm anonymous when I'm supposedly signed in. Jon Brock here ^^

    3. Thanks Matt. I think you could also argue for other advantages of 2 studies, e.g. done by different groups so establish robustness of result against lab-specific effects. But the power issue is really serious: if you are not powered to detect the effect of interest, then you're in trouble. And most of the time we aren't. Another option is to consider other ways of improving power by minimising measurement error, and hence increasing effect size. But, I repeat, power is key.

  2. @ Matt Davis
    I am certainly no statistician but with limited N-sizes that we often have in human psychology a serious problem with the two study approach is that it magnifies the chances of both false positive and false negative results if the data is at all noisy.

    Given sufficient sample sizes and relatively clean measurements your approach has a lot of appeal but the curse of the N-size haunts us.

    I specified "human psychology" above most researchers working with animals do not, at least in principle, have to worry about limited recruitment pools.

  3. Once you introduce heterogenity of effect sizes, then one big study is highly problematic.

  4. @ Unknown (aka Jon Bock)
    Check in mirror that you are not wearing an iron mask.

    How does one geheterogenity of effect sizes in a single study (assuming one measurement)?

    As I said, I am no statistician.

    1. So this is similar to how lots of machine learning approaches work.

      You randomly divide the sample into two - you use the first half to determine how to analyse the data (eg what the epoch of interest is) and then, having fixed those analytical degrees of freedom, you determine the effect size for the remaining half of the participants.

      If you repeat that exercise a second time with a different random division of the participants, you'll end up with a slightly different effect size.

      So the best thing to do is repeat that exercise many times (say 1000) and then determine the average effect size.

    2. Ah, obvious once someone points it out. Thanks.

  5. Blogger has refused to interact with David Colquhoun, so I am posting this comment on his behalf!

    "Well actually in my 2017 paper to which you kindly refer, what I do is to suggest ways of circumventing the inconvenient fact that we rarely have a valid prior probability. More details in my 2018 paper: and in my CEBM talk: …."