## Friday, 7 June 2013

### Interpreting unexpected significant results

Here's a question for researchers who use analysis of variance (ANOVA). Suppose I set up a study to see if one group (e.g. men) differs from another (women) on brain response to auditory stimuli (e.g. standard sounds vs deviant sounds – a classic mismatch negativity paradigm). I measure the brain response at frontal and central electrodes located on two sides of the head. The nerds among my readers will see that I have here a four-way ANOVA, with one between-subjects factor (sex) and three within-subjects factors (stimulus, hemisphere, electrode location). My hypothesis is that women have bigger mismatch effects than men, so I predict an interaction between sex and stimulus, but the only result significant at p < .05 is a three-way interaction between sex, stimulus and electrode location. What should I do?

a) Describe this as my main effect of interest, revising my hypothesis to argue for a site-specific sex effect
b) Describe the result as an exploratory finding in need of replication
c) Ignore the result as it was not predicted and is likely to be a false positive

I'd love to do a survey to see how people respond to these choices; my guess is many would opt for a) and few would opt for c). Yet in this situation, the likelihood of the result being a false positive is very high – much higher than many people realise.
Many people assume that if an ANOVA output is significant at the .05 level, there's only a one in twenty chance of it being a spurious chance effect. We have been taught that we do ANOVA rather than numerous t-tests because ANOVA adjusts for multiple comparisons. But this interpretation is quite wrong. ANOVA adjusts for the number of levels within a factor, so, for instance, the probability of finding a significant effect of group is the same regardless of how many groups you have. ANOVA makes no adjustment to p-values for the number of factors and interactions in your design. The more of these you have, the greater the chance of turning up a "significant" result.
So what is the probability of finding something significant at .05 in a design like this?
For the four-way ANOVA example above, we have 15 terms (four main effects, six 2-way interactions, four 3-way interactions and one 4-way interaction) and the probability of finding no significant effect is .95^15 = .46. It follows that the probability of finding something significant is .54.
And for a three-way ANOVA there are seven terms (three main effects, three 2-way interactions and one 3-way interaction), and p (something significant) = .30.
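
The arithmetic above can be checked by listing the terms explicitly. The sketch below is in Python purely for illustration; the function names and factor labels are my own assumptions, and the probability calculation assumes the tests are independent, which is only approximately true for a real ANOVA.

```python
from itertools import combinations


def anova_terms(factors):
    """Return every main effect and interaction for a full factorial design."""
    terms = []
    for order in range(1, len(factors) + 1):
        terms.extend(combinations(factors, order))
    return terms


def p_any_significant(n_terms, alpha=0.05):
    """Chance that at least one of n_terms independent null tests gives p < alpha."""
    return 1 - (1 - alpha) ** n_terms


# Factor labels are illustrative, matching the example design in the text.
four_way = anova_terms(["sex", "stimulus", "hemisphere", "site"])
three_way = anova_terms(["sex", "hemisphere", "site"])

for label, terms in [("4-way", four_way), ("3-way", three_way)]:
    print(f"{label}: {len(terms)} terms, "
          f"P(at least one p < .05) = {p_any_significant(len(terms)):.2f}")
```

This prints 15 terms with a probability of 0.54 for the 4-way design, and 7 terms with 0.30 for the 3-way design.
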
So, basically, if you do a four-way ANOVA, and you don't care what results come out, provided something is significant, you have a slightly greater than 50% chance of being satisfied. This might seem like an implausible example: after all who uses ANOVA like this? Well, unfortunately, this example corresponds rather closely to what often happens in electrophysiological research using event-related potentials (ERPs). In this field, the interest is often in comparing a clinical and a control group, and so some results are more interesting than others: the main effect of group, and the seven interactions with group are the principal focus of attention. But hypotheses about exactly what will be found are seldom clear-cut: excitement is generated by any p-value associated with a group term that falls below .05. There's a one in three chance that one of the terms involving group will have a p-value this low. This means that the potential for 'false positive psychology' in this field is enormous (Simmons et al., 2011).
A corollary of this is that researchers can modify the likelihood of finding a "significant" result by selecting one ANOVA design rather than another. Suppose I'm interested in comparing brain responses to standard and deviant sounds. One way of doing this is to compute the difference between ERPs to the two auditory stimuli and use this difference score as the dependent variable: this reduces my ANOVA from a 4-way to a 3-way design, and gives fewer opportunities for spurious findings. So you will get a different risk of a false positive, depending on how you analyse the data.

Another feature of ERP research is that there is flexibility in how electrodes are handled in an ANOVA design: since there is symmetry in electrode placement, it is not uncommon to treat hemisphere as one factor, and electrode site as another. The alternative is just to treat electrode as a repeated measure. This is not a neutral choice: the chance of spurious findings is greater if one adopts the first approach, simply because it adds a factor to the analysis, plus all the interactions with that factor.
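
To make the effect of these design choices concrete, here is a rough sketch (in Python, for illustration only) that counts the terms in each candidate design, both overall and restricted to those involving the group factor. The design labels and factor names are hypothetical, and the probabilities again assume independent tests at alpha = .05.

```python
from itertools import combinations


def count_terms(factors, must_include=None):
    """Count effects in a full factorial ANOVA, optionally only those involving one factor."""
    count = 0
    for order in range(1, len(factors) + 1):
        for term in combinations(factors, order):
            if must_include is None or must_include in term:
                count += 1
    return count


def p_any_significant(n_terms, alpha=0.05):
    """Chance of at least one p < alpha among n_terms independent null tests."""
    return 1 - (1 - alpha) ** n_terms


# Hypothetical labels for four ways of analysing the same data.
designs = {
    "stimulus + hemisphere + site": ["sex", "stimulus", "hemisphere", "site"],
    "difference score + hemisphere + site": ["sex", "hemisphere", "site"],
    "stimulus + single electrode factor": ["sex", "stimulus", "electrode"],
    "difference score + single electrode factor": ["sex", "electrode"],
}

summary = {}
for label, factors in designs.items():
    n_all = count_terms(factors)
    n_sex = count_terms(factors, must_include="sex")
    summary[label] = (n_all, n_sex)
    print(f"{label}: {n_all} terms (any p < .05: {p_any_significant(n_all):.2f}); "
          f"{n_sex} involve sex (any p < .05: {p_any_significant(n_sex):.2f})")
```

Under these assumptions the full 4-way design has 15 terms, 8 of them involving sex (about .54 and .34), whereas the simplest design has 3 terms, 2 involving sex (about .14 and .10).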

I stumbled across these insights into ANOVA when I was simulating data using a design adopted in a recent PLOS One paper that I'd commented on. I was initially interested in looking at the impact of adopting an unbalanced design in ANOVA: this study had a group factor with sample sizes of 20, 12 and 12. Unbalanced designs are known to be problematic for repeated measures ANOVA and I initially thought this might be the reason why simulated random numbers were giving such a lot of "significant" p-values. However, when I modified the simulation to use equal sample sizes across groups, the analysis continued to generate far more low p-values than I had anticipated, and I eventually twigged that this was because this is what you get if you use 4-way ANOVA. For any one main effect or interaction, the probability of p < .05 was one in twenty: but the probability that at least one term in the analysis would give p < .05 was closer to 50%.
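
A stripped-down version of this kind of simulation can be sketched in a few lines. What follows is not the script I used: it is a simplified illustration in Python that assumes a fully between-subjects 2 x 2 x 2 x 2 design, a known error variance of 1, and z-tests on the 15 orthogonal contrasts in place of F-tests. The cell size and number of datasets are arbitrary choices.

```python
import math
import random
from itertools import combinations, product

random.seed(1)

N_FACTORS = 4        # four two-level factors, as in the example design
N_PER_CELL = 10      # assumed cell size
N_DATASETS = 2000    # number of simulated null datasets
ALPHA = 0.05

cells = list(product([-1, 1], repeat=N_FACTORS))          # 16 cells
terms = [t for order in range(1, N_FACTORS + 1)
         for t in combinations(range(N_FACTORS), order)]  # 15 effects
# Contrast weight of each cell for each effect: product of the +/-1 codes.
weights = [[math.prod(cell[f] for f in term) for cell in cells] for term in terms]
n_total = len(cells) * N_PER_CELL

hits_per_term = [0] * len(terms)
datasets_with_hit = 0
for _ in range(N_DATASETS):
    # Pure noise: every cell is drawn from the same N(0, 1) population.
    cell_sums = [sum(random.gauss(0, 1) for _ in range(N_PER_CELL)) for _ in cells]
    any_hit = False
    for i, w in enumerate(weights):
        # With known variance 1, the weighted sum over n_total values is N(0, n_total).
        z = sum(wi * s for wi, s in zip(w, cell_sums)) / math.sqrt(n_total)
        p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value for a z-test
        if p < ALPHA:
            hits_per_term[i] += 1
            any_hit = True
    datasets_with_hit += any_hit

per_term_rate = sum(hits_per_term) / (len(terms) * N_DATASETS)
familywise_rate = datasets_with_hit / N_DATASETS
print(f"average per-term rate of p < .05: {per_term_rate:.3f}")
print(f"datasets with at least one p < .05: {familywise_rate:.3f}")
```

With these settings the per-term rate should sit close to .05 while the proportion of datasets with at least one "significant" term should be roughly one half, which is the pattern described above.
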
The analytic approach adopted in the PLOS One paper is pretty standard in the field of ERP. Indeed, I have seen papers where 5-way or even 6-way repeated measures ANOVA is used. When you do an ANOVA and it spews out the results, it's tempting to home in on the results that achieve the magical significance level of .05 and then formulate some kind of explanation for the findings. Alas, this is an approach that has left the field swamped by spurious results.
There have been various critiques of analytic methods in ERP, but I haven't yet found any that have focussed on this point. Kilner (2013) has noted the bias that arises when electrodes or windows are selected for analysis post hoc, on the basis that they give big effects. Others have noted problems with using electrode as a repeated measure, given that ERPs at different electrodes are often highly correlated. More generally, statisticians are urging psychologists to move away from using ANOVA to adopt multi-level modelling, which makes different assumptions and can cope, for instance, with unbalanced designs. However, we're not going to fix the problem of "false positive ERP" by adopting a different form of analysis. The problem is not just with the statistics, but with the use of statistics for what are, in effect, unconstrained exploratory analyses. Researchers in this field urgently need educating in the perils of post hoc interpretation of p-values and the importance of a priori specification of predictions.
I've argued before that the best way to teach people about statistics is to get them to generate their own random data sets. In the past, this was difficult, but these days it can be achieved using the free statistical software R. There's no better way of persuading someone to be less impressed by p < .05 than to show them just how readily a random dataset can generate "significant" findings. Those who want to explore this approach may find my blog on twin analysis in R useful for getting started (you don't need to get into the twin bits!).
The field of ERP is particularly at risk of spurious findings because of the way in which ANOVA is often used, but the problem of false positives is not restricted to this area, nor indeed to psychology. The mindset of researchers needs to change radically, with a recognition that our statistical methods only allow us to distinguish signal from noise in the data if we understand the nature of chance.
Education about probability is one way forward. Another is to change how we do science to make a clear distinction between planned and exploratory analyses. This post was stimulated by a letter that appeared in the Guardian this week on which I was a signatory. The authors argued that we should encourage a system of pre-registration of research, to avoid the kind of post hoc interpretation of findings that is so widespread yet so damaging to science.

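References

Simmons, Joseph P., Nelson, Leif D., & Simonsohn, Uri (2011). False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-1366. DOI: 10.1037/e636412012-001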

Bishop, Dorothy V M (2014): Interpreting unexpected significant findings. figshare.
http://dx.doi.org/10.6084/m9.figshare.1030406

PS. 2nd July 2013
There's remarkably little coverage of this issue in statistics texts, but Mark Baxter pointed me to a 1996 manual for SYSTAT that does explain it clearly. See: http://www.slideshare.net/deevybishop/multiway-anova-and-spurious-results-syt
The authors noted "Some authors devote entire chapters to fine distinctions between multiple comparison procedures and then illustrate them within a multi-factorial design not corrected for the experiment-wise error rate."
They recommend doing a Q-Q plot to see if the distribution of p-values is different from expectation, and using Bonferroni correction to guard against type I error.
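
As a rough illustration of those two recommendations (in Python, and not taken from the SYSTAT manual), the sketch below computes the Bonferroni threshold for 15 terms and tabulates sorted p-values against the quantiles expected if all 15 were null, which is the information a Q-Q plot displays. The "observed" p-values are random stand-ins generated for the example, not real results, and the helper names are my own.

```python
import random


def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the familywise error rate at or below alpha."""
    return alpha / n_tests


def familywise_rate(per_test_alpha, n_tests):
    """Chance of at least one false positive among independent null tests."""
    return 1 - (1 - per_test_alpha) ** n_tests


def expected_uniform_quantiles(n):
    """Expected order statistics of n independent uniform p-values: i / (n + 1)."""
    return [i / (n + 1) for i in range(1, n + 1)]


n_terms = 15
threshold = bonferroni_threshold(0.05, n_terms)
corrected_rate = familywise_rate(threshold, n_terms)
print(f"Bonferroni threshold for {n_terms} terms: p < {threshold:.4f}")
print(f"familywise rate at that threshold: {corrected_rate:.3f}")

# Stand-in for the 15 p-values from one ANOVA on null data (random, not real results).
random.seed(7)
observed = sorted(random.random() for _ in range(n_terms))
expected = expected_uniform_quantiles(n_terms)
for obs, exp in zip(observed, expected):
    print(f"observed {obs:.3f}  expected under the null {exp:.3f}")
```

If the observed values track the expected ones, there is little sign of anything beyond chance; a real effect would show up as small p-values falling well below the expected line.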

They also note that the different outputs from an ANOVA are not independent if they are based on the same mean squares denominator, a point that is discussed here:
Hurlburt, R. T., & Spiegel, D. K. (1976). Dependence of F Ratios Sharing a Common Denominator Mean Square. The American Statistician, 30(2), 74-78. doi: 10.2307/2683798
These authors conclude (p. 76):
"It is important to realize that the appearance of two significant F ratios sharing the same denominator should decrease one's confidence in rejecting either of the null hypotheses. Under the null hypothesis, significance can be attained either by the numerator mean square being "unusually" large, or by the denominator mean square being "unusually" small. When the denominator is small, all F ratios sharing that denominator are more likely to be significant. Thus when two F ratios with a common denominator mean square are both significant, one should realize that both significances may be the result of unusually small error mean squares. This is especially true when the numerator degrees of freedom are not small compared to the denominator degrees of freedom."

1. I see two issues.
1. Experimental design. In the standard frequentist view, the purpose of experimental design is to extend the magnitude of the target effect along a single dimension and to nullify the magnitude of other effects along this dimension. So if no main effect is found but the 3-way interaction turned significant, one would say that experimental manipulation failed and we get your option c). Actually, I don't think it is meaningful (with the possible exception of applied research) to try to slice reality into orthogonal effects, nor do I think it is possible to do so with a system as complex as the human brain. Instead of adjusting our experiments so that they fit the assumptions of statistical models (balanced design, no missing values, homoskedasticity...) we should focus on designing ecologically valid experiments, then go back to the statistician and ask for tools that allow us to model the data we obtained. With such an approach two issues arise.

2.1 Are there statistical tools that solve problems highlighted in your blog? Yes there are. Bayesian hierarchical modeling (Gelman, 2005) can be used to correct the effect size (instead of significance level) in GLMs. As a consequence no further correction of significance level is needed and the number of comparisons or additional comparisons (such as between hemispheres) is no issue. In principle, nothing stops you from hierarchically pooling effects across levels of Anova, though I don't think this is meaningful. Bayesian models can be extended to capture correlations between electrodes. In fact, Bayesian analysis doesn't shy away from including prior knowledge into the analysis. If you have relevant knowledge about the properties of electrophysical signals in the brain - how they propagate and interact then please let your model reflect this knowledge. If one uses hierarchical pooling one may fit a model with more unknown parameters than the actual number of data points.

2.2 How to choose the proper model? There is no THE Anova (even though there may be THE Anova button in SPSS). There are always choices that have to be made by the researcher. Consequently, the researcher should justify the choice of the model in the publication. This can be done with more rigor than just by a hazy reference to some hypothesis mentioned in the introduction (which always opens the way for HARKing). One option is model checking: for example, Bayesians do posterior predictive checks, in which they generate data from the model and look at whether the simulated data capture the main patterns in the data. Another option is to compare several models by statistical means. The comparison can weigh both the fit and the simplicity of models. Both these options highlight the shift from fitting the experiment/data to the model to fitting the model to the data, which I alluded to in 1.

With the powerful and smart software we have today I often find the only limit is how much time one wants to spend with model design.

1. The reference on hierarchical anova is
Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of Statistics, 33(1), 1-53.

John Kruschke also does provide some insightful demonstrations over at his DBDA blog:
http://doingbayesiandataanalysis.blogspot.de/2012/11/shrinkage-in-multi-level-hierarchical.html

2. Thanks matus!
I am thinking of writing a more formal paper on this topic, incorporating some of my simulations, and these references and comments will be very helpful in refining my ideas.

3. Just in case you are not familiar with this, there are already studies, even in the psychological literature, that look at the results of Anova simulations and make the points you raised above. For instance, Maxwell (2004) ran Anova simulations to see how sample size influences the probability of obtaining any significant result within and across levels of analysis.

Also, Maxwell is more concerned with beta error than with false positives, and I think he is right. You may also want to focus on false negatives. If, say, someone reports a 5-way interaction, that's at least 2^5 = 32 cells. If we expect at least 10 data points per cell, then we need a total sample of 320 data points. I would be really interested to see the power of the mentioned 5-way interaction. It is probably too low to reliably confirm the interaction.

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147-163.

4. matus already brought up this point, and more eloquently at that, but I'd just like to chime in: of course multilevel models will not solve the false-positive problem in one fell swoop, but they will alleviate the problem of multiple comparisons in complex models a little, through partial pooling of estimates. Also, nice and clear post that I will show to people in need. Thanks!

5. Your point 1. When we apply statistical design properly we tailor the design to the "research question" and the "ecological context". We don't constrain the ecology to fit the design. This may not be your experience but there are a lot of people doing bad statistics. You should not blame statistics because it is done badly. Your suggestion that researchers should do what you think is right and approaching the statistician afterwards is a formula for disaster. I have seen it happen so many times.

2. Not being very familiar with this stuff, I admit I zoned out about half way through. But I think this core sentence is hugely important:

If you do a four-way ANOVA, and you don't care what results come out, provided something is significant, you have a slightly greater than 50% chance of being satisfied.

I'd like to suggest that you post a much, much shorter article that contains the barest minimum setup necessary to reach this conclusion. I think it would have a powerful salutary effect on a lot of people who will struggle with this full-length version.

1. For me, this article was packed with useful information, and I'm happy to have all the extras so that I can do my own exploring and research. Thank you, Professor Bishop!

2. Thanks Mike.
I did realise this post would be suitable only for a rather rarefied readership familiar with ANOVA. I usually try to make my posts accessible to as broad a readership as possible, but in this case I was rather defeated by the subject matter.
see http://quoteinvestigator.com/2012/04/28/shorter-letter/

3. Oh, sorry, I didn't at all mean to say that this post should have been shorter! Just that if you were also to post a greatly truncated version, that one might reach a much wider audience. (And perhaps draw more people to read the full-length one, too.)

4. No worries - I realised that. I was just explaining why I'm unlikely to come up with the goods - it would be very hard to do and life (and time) is short!

3. If p<0.05, then for four way ANOVA with 15 terms shouldn't the significance be tested at 0.05^15, rather than 0.05?

4. I enjoyed reading the blog. Thanks for posting it.
I think a 4-way ANOVA model suffers from this criticism and there might be some better statistical tools to address this issue. However, I think a better way to address this issue might be to modify research methodology. I remember my professor saying, “never run a study using only one set of dependent variables”. For example, the DPOAE input/output function and the DPgram measure almost the same attributes of cochlear physiology. If I got an unexpected significant 3-way interaction for both of these two test protocols, then the probability of getting both tests significant would be (0.54)*(0.54) = 0.29. I would be more confident to report the 3-way interaction (at the same frequency region) if both measurements are telling me the same story. I would be more cautious (I may choose option b) in interpreting “statistical significance” if only one measurement is showing me a significant 3-way interaction.

1. If (1) DPOAE Input/output function and (2) DPgram measure almost similar attributes then they will be highly correlated and so your calculation is almost certainly wrong. The probability of getting both tests significant will be much bigger.

5. I was wondering whether the significance tests were independent. After all, if two main effects go "significantly" in opposite directions (p<.05, by chance), you're more likely to get an interaction as well. If the tests are not independent, then it may affect how we would like to correct the p-values. In particular, it may not be valid to compute the False Alarm rate as 1-.95^7=.30 (say for 7 p-values obtained out of a 3 way ANOVA).

I ran a simulation (R-code below if you want to play with it) by sampling 10000 sets of Gaussian data and assigning arbitrary levels of 3 binary factors to it. I computed the FA rate as the proportion of datasets in which at least one of the 7 p-values given by the ANOVA was below .05. I obtained an FA rate of 29.6%, which is pretty close to 30%.

I probably worried for no reason then, but it was worth checking (and I don't have a proof, just a simulation).

(R-code below).

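    # Simulate a 3-way between-subjects ANOVA on pure noise and count how often
    # at least one of the 7 terms reaches p < .05.
    nF <- 3              # number of binary factors
    N <- 30              # observations per cell
    Iteration <- 10000   # number of simulated datasets
    nObs <- (2^nF) * N   # total observations per dataset

    # Factor codes are the same in every iteration: a balanced 2 x 2 x 2 layout
    F1 <- gl(2, 1, nObs)
    F2 <- gl(2, 2, nObs)
    F3 <- gl(2, 4, nObs)

    LP <- numeric(Iteration)   # smallest p-value from each dataset

    for (x in seq_len(Iteration)) {
      D <- data.frame(y = rnorm(nObs), F1, F2, F3)   # outcome is random noise
      fit <- summary(aov(y ~ F1 * F2 * F3, data = D))
      LP[x] <- min(fit[[1]][["Pr(>F)"]], na.rm = TRUE)
    }

    mean(LP < .05)   # observed false alarm rate; ==> 0.3
    1 - .95^7        # expected rate if the 7 tests were independent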

1. Don't worry in this case - each of the tests in the ANOVA is statistically independent (orthogonal) of the other.

6. thanks for all the comments.
My own experience - and that of Anon - confirms how useful it is to check things out by simulations.
This shows there IS some dependency between the different terms of ANOVA output: those that have the same error term (i.e. where the same MSerror is used in deriving the F ratio) are weakly correlated in my simulations - presumably because if the MSerror is, by chance, unusually high or low, then it affects all F-ratios derived from it. So for instance, the term for task x group is slightly more likely to be significant if the term for task is significant.
Some people on Twitter have asked for the R code I used to generate simulations: I hope to post that soon when I have had a chance to double check it.

7. The R script I used is now available here:
http://www.slideshare.net/deevybishop/erpsimulatescript

8. I don't think it's right to say that it's a problem of multiple comparisons. ANOVA is exact within the Fisher framework. The problem is that that approach doesn't tell you how often you'll make a fool of yourself by claiming that there is a real effect when there isn't. If you are daft enough to use P=0.05 as a cutoff there is at least a 30% chance of being wrong.