
a) Describe this as my main effect of interest, revising my hypothesis to argue for a site-specific sex effect

b) Describe the result as an exploratory finding in need of replication

c) Ignore the result as it was not predicted and is likely to be a false positive

I'd love to do a survey to see how people respond to these choices; my guess is many would opt for a) and few would opt for c). Yet in this situation, the likelihood of the result being a false positive is very high – much higher than many people realise.


Many people assume that if an ANOVA output is significant at the .05 level, there's only a one in twenty chance of it being a spurious chance effect. We have been taught that we do ANOVA rather than numerous t-tests because ANOVA adjusts for multiple comparisons. But this interpretation is quite wrong. ANOVA adjusts for the number of **levels** within a factor, so, for instance, the probability of finding a significant effect of group is the same regardless of how many groups you have. ANOVA makes **no** adjustment to p-values for the number of factors and interactions in your design. The more of these you have, the greater the chance of turning up a "significant" result.
So, for the example given above, the probability of finding **something** significant at .05 is as follows:
For the four-way ANOVA example above, we have 15 terms (four main effects, six 2-way interactions, four 3-way interactions and one 4-way interaction), and the probability of finding no significant effect is .95^15 = .46. It follows that the probability of finding **something** significant is .54.
And for a three-way ANOVA
there are seven terms (three main effects, three 2-way interactions and one
3-way interaction), and p (something significant) = .30.
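These family-wise probabilities are easy to verify. The sketch below is mine, not from the original post (the function names are my own); the term counts are just sums of binomial coefficients:

```python
from math import comb

def n_terms(k):
    # number of main effects and interactions in a k-way ANOVA:
    # sum of C(k, j) for j = 1..k, which equals 2^k - 1
    return sum(comb(k, j) for j in range(1, k + 1))

def p_any_significant(k, alpha=0.05):
    # chance that at least one term reaches p < alpha,
    # treating the tests as independent under the null
    return 1 - (1 - alpha) ** n_terms(k)

print(n_terms(4), round(p_any_significant(4), 2))  # 15 terms, probability ~ .54
print(n_terms(3), round(p_any_significant(3), 2))  # 7 terms, probability ~ .30
```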

So, basically, if you do a four-way ANOVA, and you don't care what result comes out, provided something is significant, you have a slightly greater than 50% chance of being satisfied. This might seem like an implausible example: after all, who uses ANOVA like this? Well, unfortunately, this example corresponds rather closely to what often happens in electrophysiological research using event-related potentials (ERPs). In this field, the interest is often in comparing a clinical and a control group, and so some results are more interesting than others: the main effect of group and the seven interactions with group are the principal focus of attention. But hypotheses about exactly what will be found are seldom clear-cut: excitement is generated by any p-value associated with a group term that falls below .05. There's a one in three chance that one of the eight terms involving group will have a p-value this low. This means that the potential for 'false positive psychology' in this field is enormous (Simmons et al., 2011).

A corollary of this is that researchers can modify the likelihood of finding a "significant" result by selecting one ANOVA design rather than another. Suppose I'm interested in comparing brain responses to standard and deviant sounds. One way of doing this is to compute the difference between ERPs to the two auditory stimuli and use this difference score as the dependent variable: this reduces my ANOVA from a 4-way to a 3-way design, and gives fewer opportunities for spurious findings. So you will get a different risk of a false positive, depending on how you analyse the data.

Another feature of ERP research is that there is flexibility in how electrodes are handled in an ANOVA design: since there is symmetry in electrode placement, it is not uncommon to treat hemisphere as one factor and electrode site as another. The alternative is just to treat electrode as a repeated measure. This is not a neutral choice: the chance of spurious findings is greater if one adopts the first approach, simply because it adds a factor to the analysis, plus all the interactions with that factor.

I stumbled across these
insights into ANOVA when I was simulating data using a design adopted in a
recent PLOS One paper that I'd commented on. I was initially interested in looking at the
impact of adopting an unbalanced design in ANOVA: this study had a group factor
with sample sizes of 20, 12 and 12. Unbalanced designs are known to be problematic for repeated measures ANOVA and I initially thought this might be
the reason why simulated random numbers were giving such a lot of
"significant" p-values. However, when I modified the simulation to
use equal sample sizes across groups, the analysis continued to generate far
more low p-values than I had anticipated, and I eventually twigged that this
was because this is what you get if you use 4-way ANOVA. For any one main
effect or interaction, the probability of p < .05 was one in twenty: but the
probability that at least one term in the analysis would give p < .05 was closer
to 50%.
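This is straightforward to reproduce by Monte Carlo simulation. My own simulations used R; the sketch below is an independent Python version, under the simplifying assumption of a balanced between-subjects 2x2x2x2 design with pure noise data (for two-level factors, every ANOVA term is a 1-df orthogonal contrast, which keeps the code short):

```python
import numpy as np
from itertools import combinations
from scipy import stats

rng = np.random.default_rng(1)

def any_significant(n_per_cell=10, alpha=0.05):
    """One simulated 2x2x2x2 between-subjects ANOVA on pure noise.

    Returns True if at least one of the 15 terms has p < alpha.
    With +/-1 coding each term is a 1-df orthogonal contrast, so
    SS(term) = (contrast . y)^2 / n.
    """
    # +/-1 codes for the 4 factors in each of the 16 cells
    codes = np.array([[1 if (cell >> f) & 1 else -1 for f in range(4)]
                      for cell in range(16)])
    X = np.repeat(codes, n_per_cell, axis=0)   # one row per observation
    y = rng.standard_normal(len(X))
    # residual (within-cell) mean square from the saturated model
    by_cell = y.reshape(16, n_per_cell)
    ss_err = ((by_cell - by_cell.mean(axis=1, keepdims=True)) ** 2).sum()
    df_err = len(y) - 16
    ms_err = ss_err / df_err
    # all 15 terms: 4 main effects, 6 + 4 + 1 interactions
    terms = [t for k in range(1, 5) for t in combinations(range(4), k)]
    pvals = []
    for t in terms:
        contrast = X[:, list(t)].prod(axis=1)
        F = ((contrast @ y) ** 2 / len(y)) / ms_err
        pvals.append(stats.f.sf(F, 1, df_err))
    return min(pvals) < alpha

hits = sum(any_significant() for _ in range(1000))
print(hits / 1000)   # lands near the predicted .54
```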

The analytic approach
adopted in the PLOS One paper is pretty standard in the field of ERP. Indeed, I have
seen papers where 5-way or even 6-way
repeated measures ANOVA is used. When
you do an ANOVA and it spews out the results, it's tempting to home in on the
results that achieve the magical significance level of .05 and then formulate
some kind of explanation for the findings. Alas, this is an approach that has
left the field swamped by spurious results.

There have been various
critiques of analytic methods in ERP, but I haven't yet found any that have
focussed on this point. Kilner (2013) has noted the bias that arises when
electrodes or windows are selected for analysis post hoc, on the basis that
they give big effects. Others have noted problems with using electrode as a repeated measure, given that ERPs at different electrodes are often highly
correlated. More generally,
statisticians are urging psychologists to move away from using ANOVA to adopt multi-level modelling, which makes different assumptions and can cope, for
instance, with unbalanced designs. However, we're not going to fix the problem
of "false positive ERP" by adopting a different form of analysis. The
problem is not just with the statistics, but with the use of statistics for what
are, in effect, unconstrained exploratory analyses. Researchers in this field urgently need
educating in the perils of post hoc interpretation of p-values and the
importance of a priori specification of predictions.

I've argued before that the best way to teach people about statistics is to get them to generate their own random data sets. In the past, this was difficult, but these days it can be achieved using the free statistical software R. There's no better way of persuading someone to be less impressed by p < .05 than to show them just how readily a random dataset can generate "significant" findings. Those who want to explore this approach may find my blog on twin analysis in R useful for getting started (you don't need to get into the twin bits!).

The field of ERP is
particularly at risk of spurious findings because of the way in which ANOVA is
often used, but the problem of false positives is not restricted to this area,
nor indeed to psychology. The mindset of researchers needs to change radically,
with a recognition that our statistical methods only allow us to distinguish
signal from noise in the data if we understand the nature of chance.

Education about probability is one way forward. Another is to change how we do science to make a clear distinction between planned and exploratory analyses. This post was stimulated by a letter that appeared in the Guardian this week, of which I was a signatory. The authors argued that we should encourage a system of pre-registration of research, to avoid the kind of post hoc interpretation of findings that is so widespread yet so damaging to science.

__Reference__

This article (Figshare version) can be cited as:

Bishop, Dorothy V M (2014): Interpreting unexpected significant findings. figshare.

http://dx.doi.org/10.6084/m9.figshare.1030406

**PS. 2nd July 2013**

There's remarkably little coverage of this issue in statistics texts, but Mark Baxter pointed me to a 1996 manual for SYSTAT that does explain it clearly. See: http://www.slideshare.net/deevybishop/multiway-anova-and-spurious-results-syt

The authors noted "Some authors devote entire chapters to fine distinctions between multiple comparison procedures and then illustrate them within a multi-factorial design not corrected for the experiment-wise error rate."

They recommend doing a Q-Q plot to see if the distribution of p-values is different from expectation, and using Bonferroni correction to guard against type I error.

They also note that the different outputs from an ANOVA are not independent if they are based on the same mean squares denominator, a point that is discussed here:

Hurlburt, R. T., & Spiegel, D. K. (1976). Dependence of F Ratios Sharing a Common Denominator Mean Square. The American Statistician, 30(2), 74-78. doi: 10.2307/2683798

These authors conclude (p 76)

*It is important to realize that the appearance of two significant F ratios sharing the same denominator should decrease one's confidence in rejecting either of the null hypotheses. Under the null hypothesis, significance can be attained either by the numerator mean square being "unusually" large, or by the denominator mean square being "unusually" small. When the denominator is small, all F ratios sharing that denominator are more likely to be significant. Thus when two F ratios with a common denominator mean square are both significant, one should realize that both significances may be the result of unusually small error mean squares. This is especially true when the numerator degrees of freedom are not small compared to the denominator degrees of freedom.*
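The dependence Hurlburt and Spiegel describe is easy to demonstrate by simulation. This is a rough sketch (my own, with arbitrary degrees of freedom): two F ratios with independent 1-df numerators but a shared denominator mean square are positively dependent, so one being significant inflates the conditional probability of the other being significant too.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sim, df_err, alpha = 200_000, 20, 0.05

# Two F ratios with independent 1-df numerator mean squares but a
# COMMON denominator mean square, as happens within one ANOVA table.
num1 = rng.chisquare(1, n_sim)
num2 = rng.chisquare(1, n_sim)
ms_err = rng.chisquare(df_err, n_sim) / df_err        # shared error term
sig1 = stats.f.sf(num1 / ms_err, 1, df_err) < alpha
sig2 = stats.f.sf(num2 / ms_err, 1, df_err) < alpha

# If the two tests were independent, P(sig2 | sig1) would equal P(sig2);
# the shared (occasionally too-small) denominator makes it larger.
print(round(sig2.mean(), 3), round(sig2[sig1].mean(), 3))
```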

I see two issues.

1. Experimental design. In the standard frequentist view, the purpose of experimental design is to extend the magnitude of the target effect along a single dimension and to nullify the magnitude of other effects along this dimension. So if no main effect is found but the 3-way interaction turned out significant, one would say that the experimental manipulation failed and we get your option c). Actually, I don't think it is meaningful (with the possible exception of applied research) to try to slice reality into orthogonal effects, nor do I think it is possible to do so with such a complex system as the human brain surely is. Instead of adjusting our experiments so that they fit the assumptions of statistical models (balanced design, no missing values, homoskedasticity...), we should focus on the design of ecologically valid experiments, then go back to the statistician and ask for tools that allow us to model the data we obtained. With such an approach, two issues arise.

2.1 Are there statistical tools that solve the problems highlighted in your blog? Yes there are. Bayesian hierarchical modeling (Gelman, 2005) can be used to correct the effect size (instead of the significance level) in GLMs. As a consequence, no further correction of the significance level is needed, and the number of comparisons or additional comparisons (such as between hemispheres) is no issue. In principle, nothing stops you from hierarchically pooling effects across levels of ANOVA, though I don't think this is meaningful. Bayesian models can be extended to capture correlations between electrodes. In fact, Bayesian analysis doesn't shy away from including prior knowledge in the analysis. If you have relevant knowledge about the properties of electrophysiological signals in the brain - how they propagate and interact - then please let your model reflect this knowledge. If one uses hierarchical pooling, one may fit a model with more unknown parameters than the actual number of data points.

2.2 How to choose the proper model? There is no THE ANOVA (even though there may be THE ANOVA button in SPSS). There are always choices that have to be made by the researcher. Consequently, the researcher should justify the choice of model in the publication. This can be done with more rigor than just a hazy reference to some hypothesis mentioned in the introduction (which always opens the way for HARKing). One option is model checking - for example, Bayesians do posterior predictive checks: they generate data from the model and look at whether the simulated data capture the main patterns in the real data. Another option is to compare several models by statistical means; the comparison can weigh both the fit and the simplicity of the models. Both these options highlight the shift from fitting the experiment/data to the model to fitting the model to the data, which I alluded to in 1.

With the powerful and smart software we have today I often find the only limit is how much time one wants to spend with model design.

The reference on hierarchical anova is

Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of Statistics, 33(1), 1-53.

John Kruschke also provides some insightful demonstrations over at his DBDA blog:

http://doingbayesiandataanalysis.blogspot.de/2012/11/shrinkage-in-multi-level-hierarchical.html

Thanks matus!

I am thinking of writing a more formal paper on this topic, incorporating some of my simulations, and these references and comments will be very helpful in refining my ideas.

Just in case you are not familiar with this, there already are studies - even in the psychological literature - looking at the results of ANOVA simulations and making the points you raised above. For instance, Maxwell (2004) did ANOVA simulations to see how sample size influences the probability of obtaining any significant result within and across levels of analysis.

Also, Maxwell is more concerned with beta error instead of false positives, and I think he is right. You may also want to focus on false negatives. If, say, someone reports a 5-way interaction, that's 2^5 = 32 cells at least. If we expect at least 10 data points per cell, then we are asking for a total sample of 320 data points. I would be really interested to see the power of the mentioned 5-way interaction. It is probably too low to reliably confirm the interaction.

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147-163.

matus already brought up this point, and more eloquently at that, but I'd just like to chime in: of course multilevel models will not solve the false-positive problem at one fell swoop, but they will alleviate the problem of multiple comparisons in complex models a little, through partial pooling of estimates. Also, a nice and clear post that I will show to people in need. Thanks!

Your point 1: When we apply statistical design properly, we tailor the design to the "research question" and the "ecological context". We don't constrain the ecology to fit the design. This may not be your experience, but there are a lot of people doing bad statistics. You should not blame statistics because it is done badly. Your suggestion that researchers should do what they think is right and approach the statistician afterwards is a formula for disaster. I have seen it happen so many times.

DeleteNot being very familiar with this stuff, I admit I zoned out about half way through. But I think this core sentence is hugely important:

"If you do a four-way ANOVA, and you don't care what result comes out, provided something is significant, you have a slightly greater than 50% chance of being satisfied."

I'd like to suggest that you post a much, much shorter article that contains the barest minimum setup necessary to reach this conclusion. I think it would have a powerful salutary effect on a lot of people who will struggle with this full-length version.

Hope this is helpful.

For me, this article was packed with useful information, and I'm happy to have all the extras so that I can do my own exploring and research. Thank you, Professor Bishop!

Thanks Mike.

I did realise this post would be suitable only for a rather rarefied readership familiar with ANOVA. I usually try to make my posts accessible to as broad a readership as possible, but in this case I was rather defeated by the subject matter.

see http://quoteinvestigator.com/2012/04/28/shorter-letter/

Oh, sorry, I didn't at all mean to say that this post should have been shorter! Just that if you were also to post a greatly truncated version, that one might reach a much wider audience. (And perhaps draw more people to read the full-length one, too.)

No worries - I realised that. I was just explaining why I'm unlikely to come up with the goods - it would be very hard to do and life (and time) is short!

If p < 0.05, then for a four-way ANOVA with 15 terms, shouldn't the significance be tested at 0.05^15, rather than 0.05?

I enjoyed reading the blog. Thanks for posting it.

I think a 4-way ANOVA model suffers from this criticism, and there might be some better statistical tools to address this issue. However, I think a better way to address it might be to modify the research methodology. I remember my professor saying "never run a study using only one set of dependent variables". For example, the DPOAE input/output function and the DPgram measure almost similar attributes of cochlear physiology. If I got an unexpected significant 3-way interaction for both of these two test protocols, then the probability of getting both tests significant would be (0.54)*(0.54) = 0.29. I would be more confident to report the 3-way interaction (at the same frequency region) if both measurements are telling me the same story. I would be more cautious (I may choose option b) in interpreting "statistical significance" if only one measurement is showing me a significant 3-way interaction.

If (1) DPOAE Input/output function and (2) DPgram measure almost similar attributes then they will be highly correlated and so your calculation is almost certainly wrong. The probability of getting both tests significant will be much bigger.

DeleteI was wondering whether the significance tests were independent. After all, if two main effects go "significantly" in opposite directions (p<.05, by chance), you're more likely to get an interaction as well. If the tests are not independent, then it may affect how we would like to correct the p-values. In particular, it may not be valid to compute the False Alarm rate as 1-.95^7=.30 (say for 7 p-values obtained out of a 3 way ANOVA).

I ran a simulation (R code below if you want to play with it) by sampling 10000 sets of Gaussian data and assigning arbitrary levels of 3 binary factors to them. I computed the FA rate as the proportion of times at least one of the 7 p-values that the ANOVA would give was below .05. I obtained 29.6% FA, which is pretty close to 30%.

I probably worried for no reason then, but it was worth checking (and I don't have a proof, just a simulation).

(R code below.)

```r
nF <- 3                      # number of two-level factors
N <- 30                      # replicates per cell
Iteration <- 10000
LP <- c()
for (x in 1:Iteration) {
  a <- rnorm((2^nF) * N)                 # pure noise data
  F1 <- gl(2, 1, (2^nF) * N)            # factor codes per observation
  F2 <- gl(2, 2, (2^nF) * N)
  F3 <- gl(2, 4, (2^nF) * N)
  D <- data.frame(a, F1, F2, F3)
  fit <- summary(aov(a ~ F1 * F2 * F3, data = D))
  LP <- c(LP, min(fit[[1]][["Pr(>F)"]], na.rm = TRUE))
}
sum(LP < .05) / Iteration    # ==> approximately 0.3
1 - .95^7                    # theoretical value
```

Don't worry in this case - each of the tests in the ANOVA is statistically independent (orthogonal) of the others.

Thanks for all the comments.

My own experience - and that of Anon - confirms how useful it is to check things out by simulations.

This shows there IS some dependency between the different terms of ANOVA output: those that have the same error term (i.e. where the same MSerror is used in deriving the F ratio) are weakly correlated in my simulations - presumably because if the MSerror is, by chance, unusually high or low, then it affects all F-ratios derived from it. So for instance, the term for task x group is slightly more likely to be significant if the term for task is significant.

Some people on Twitter have asked for the R code I used to generate simulations: I hope to post that soon when I have had a chance to double check it.

The R script I used is now available here:

http://www.slideshare.net/deevybishop/erpsimulatescript

I don't think it's right to say that it's a problem of multiple comparisons. ANOVA is exact within the Fisher framework. The problem is that that approach doesn't tell you how often you'll make a fool of yourself by claiming that there is a real effect when there isn't. If you are daft enough to use P=0.05 as a cutoff there is at least a 30% chance of being wrong.

I'm also working on a paper about this.

Neither Gelman's approach nor Kruschke's approach (both based on parameter estimation, assuming that the null hypothesis is *false*) allows testing the same thing as an ANOVA supposedly tests (is the null hypothesis *plausible*). One could argue that parameter estimation is a better approach in some cases, but one cannot argue that it solves the problem described, if you really care about nulls. What you really want is option d), "move away from significance testing toward a model comparison approach." Stop asking whether effect X is significant, and start comparing the relative success of various explanations for the data (e.g., http://bayesfactorpcl.r-forge.r-project.org/#fixed)


ReplyDeleteHello! Your blog is exciting, I really enjoy reading your informative articles. Thanks for posting and developing so important and significant topics, so that people interested in this sphere can start their own research.
