Comments on BishopBlog: Interpreting unexpected significant results

Hello! Your blog is exciting, I really enjoy readi...

2016-09-22T10:10:26.144+01:00

Hello! Your blog is exciting, and I really enjoy reading your informative articles. Thanks for posting and developing such important and significant topics, so that people interested in this sphere can start their own research.

Thanks For Your valuable posting,this is for wonde...

2015-11-14T10:18:26.126+00:00

Thanks for your valuable post and for sharing it. I would like to see more information from you. I work at an ERP company in Dubai.

Neither Gelman's approach nor Kruschke's a...

2015-03-18T09:01:17.899+00:00

Neither Gelman's approach nor Kruschke's approach (both based on parameter estimation, assuming that the null hypothesis is *false*) allows testing the same thing as an ANOVA supposedly tests (is the null hypothesis *plausible*). One could argue that parameter estimation is a better approach in some cases, but one cannot argue that it solves the problem described, if you really care about nulls. What you really want is option D, "move away from significance testing toward a model comparison approach." Stop asking whether effect X is significant, and start comparing the relative success of various explanations for the data (e.g., http://bayesfactorpcl.r-forge.r-project.org/#fixed).

I don't think it's right to say that it...

2014-02-14T12:21:11.768+00:00

I don't think it's right to say that it's a problem of multiple comparisons. ANOVA is exact within the Fisher framework. The problem is that that approach doesn't tell you how often you'll make a fool of yourself by claiming that there is a real effect when there isn't. If you are daft enough to use P=0.05 as a cutoff there is at least a 30% chance of being wrong.
I'm also working on a paper about this.

Your point 1. When we apply statistical design pr...

2013-07-13T10:17:47.646+01:00

Your point 1. When we apply statistical design properly we tailor the design to the "research question" and the "ecological context". We don't constrain the ecology to fit the design. This may not be your experience, but there are a lot of people doing bad statistics. You should not blame statistics because it is done badly. Your suggestion that researchers should do what they think is right and approach the statistician afterwards is a formula for disaster. I have seen it happen so many times.

If (1) DPOAE Input/output function and (2) DPgram ...

2013-07-09T08:37:15.602+01:00

If (1) DPOAE input/output function and (2) DPgram measure nearly the same attributes, then they will be highly correlated, so your calculation is almost certainly wrong. The probability of both tests being significant will be much bigger.
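
A minimal sketch of this point, under simplifying assumptions that are mine and not the commenter's: two outcome measures with an assumed correlation `rho`, each tested with a single two-group t-test at .05 (rather than a full ANOVA), and no true group difference. The names `rho`, `n` and `both_sig` are hypothetical. It compares the simulated rate of both tests coming out significant with the product that independence would predict.

```r
set.seed(42)
n_sim <- 5000   # number of simulated experiments
n     <- 20     # participants per group (assumed)
rho   <- 0.8    # assumed correlation between the two measures
group <- gl(2, n)

both_sig <- replicate(n_sim, {
  y1 <- rnorm(2 * n)                                  # measure 1, pure noise
  y2 <- rho * y1 + sqrt(1 - rho^2) * rnorm(2 * n)     # measure 2, correlated with measure 1
  p1 <- t.test(y1 ~ group)$p.value
  p2 <- t.test(y2 ~ group)$p.value
  p1 < 0.05 && p2 < 0.05
})

mean(both_sig)   # simulated rate of both tests being significant by chance
0.05^2           # rate expected if the two tests were independent
```

With positively correlated measures the first number should come out well above the second, which is why multiplying the two probabilities understates the chance of a double false positive.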

The R script I used is now available here: http://...

2013-06-18T14:46:19.155+01:00

The R script I used is now available here:
http://www.slideshare.net/deevybishop/erpsimulatescript

thanks for all the comments. My own experience - a...

2013-06-18T12:06:07.382+01:00

Thanks for all the comments.
My own experience - and that of Anon - confirms how useful it is to check things out by simulations.
This shows there IS some dependency between the different terms of ANOVA output: those that have the same error term (i.e. where the same MSerror is used in deriving the F ratio) are weakly correlated in my simulations - presumably because if the MSerror is, by chance, unusually high or low, then it affects all F-ratios derived from it. So for instance, the term for task x group is slightly more likely to be significant if the term for task is significant.
Some people on Twitter have asked for the R code I used to generate simulations: I hope to post that soon when I have had a chance to double check it.
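
Until that script is posted, here is a small independent sketch of the dependency described above. It is not the author's code. It assumes a 2 x 2 between-subjects design with pure-noise data and deliberately tiny cells (`n_cell`, a hypothetical name) so that the shared MSerror is noisy, then looks at how the F-ratios for the three terms correlate across simulated datasets.

```r
set.seed(123)
n_sim  <- 3000   # number of simulated datasets
n_cell <- 2      # observations per cell; small on purpose so MSerror varies a lot

A <- gl(2, n_cell, 4 * n_cell)       # factor A: levels alternate in blocks of n_cell
B <- gl(2, 2 * n_cell, 4 * n_cell)   # factor B: levels alternate in blocks of 2 * n_cell

Fvals <- t(replicate(n_sim, {
  y   <- rnorm(4 * n_cell)                  # all null hypotheses are true
  tab <- summary(aov(y ~ A * B))[[1]]
  tab[["F value"]][1:3]                     # F for A, B and A:B (all share one MSerror)
}))
colnames(Fvals) <- c("A", "B", "A:B")

# Rank correlations between the F-ratios of the three terms
cor(Fvals, method = "spearman")
```

The off-diagonal correlations should be small but positive, and they shrink as the error degrees of freedom grow, which fits the observation that the dependency is weak.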

Don't worry in this case - each of the tests i...

2013-06-14T11:43:33.470+01:00

Don't worry in this case - each of the tests in the ANOVA is statistically independent of (orthogonal to) the others.

I was wondering whether the significance tests wer...

2013-06-11T19:18:01.467+01:00

I was wondering whether the significance tests were independent. After all, if two main effects go "significantly" in opposite directions (p<.05, by chance), you're more likely to get an interaction as well. If the tests are not independent, then it may affect how we would like to correct the p-values. In particular, it may not be valid to compute the False Alarm rate as 1-.95^7=.30 (say for 7 p-values obtained out of a 3-way ANOVA).

I ran a simulation (R-code below if you want to play with it) by sampling 10000 sets of Gaussian data and assigning arbitrary levels of 3 binary factors to it. I computed the FA rate as the proportion of sets in which at least one of the 7 p-values given by the ANOVA was below .05. I obtained 29.6% of FA, which is pretty close to 30%.

I probably worried for no reason then, but it was worth checking (and I don't have a proof, just a simulation).

(R-code below).

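nF <- 3                  # number of binary factors
N <- 30                  # observations per cell
Iteration <- 10000       # number of simulated datasets
n <- (2^nF) * N          # total observations per dataset

LP <- numeric(Iteration) # smallest p-value from each simulated ANOVA

for (x in 1:Iteration) {
  y <- rnorm(n)          # pure noise: every null hypothesis is true
  F1 <- gl(2, 1, n)      # the three factors are fully crossed and balanced
  F2 <- gl(2, 2, n)
  F3 <- gl(2, 4, n)

  D <- data.frame(y, F1, F2, F3)

  fit <- summary(aov(y ~ F1 * F2 * F3, data = D))

  # keep the smallest of the 7 p-values (the residual row is NA)
  LP[x] <- min(fit[[1]][["Pr(>F)"]], na.rm = TRUE)
}

sum(LP < .05) / Iteration # ==> 0.3
1 - .95^7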

I enjoyed reading the blog. Thanks for posting it....

2013-06-10T21:28:07.648+01:00

I enjoyed reading the blog. Thanks for posting it.
I think a 4-way ANOVA model suffers from this criticism and there might be some better statistical tools to address this issue. However, I think a better way to address this issue might be to modify research methodology. I remember my professor saying “never run a study using only one set of dependent variables”. For example, the DPOAE input/output function and the DPgram measure almost the same attributes of cochlear physiology. If I got an unexpected significant 3-way interaction for both of these test protocols, then the probability of getting both tests significant would be (0.54)*(0.54) = 0.29. I would be more confident reporting the 3-way interaction (at the same frequency region) if both measurements are telling me the same story. I would be more cautious (I may choose option b) about interpreting “statistical significance” if only one measurement is showing me a significant 3-way interaction.

If p<0.05, then for four way ANOVA with 15 term...

2013-06-10T19:02:23.755+01:00

If p<0.05, then for a four-way ANOVA with 15 terms, shouldn't the significance be tested at 0.05^15, rather than 0.05?
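
For reference, a short sketch of the arithmetic behind this question, assuming the 15 tests are independent and all nulls are true. The Bonferroni and Šidák thresholds are standard corrections that I have added for comparison; they are not proposed in the comment itself.

```r
alpha <- 0.05
m     <- 15    # number of terms in a four-way ANOVA: 2^4 - 1

familywise <- 1 - (1 - alpha)^m        # chance that at least one of m tests gives p < alpha
bonferroni <- alpha / m                # per-test threshold that keeps the familywise rate <= alpha
sidak      <- 1 - (1 - alpha)^(1 / m)  # slightly less strict threshold, exact under independence
all_fifteen <- alpha^m                 # chance that ALL m tests give p < alpha

c(familywise = familywise, bonferroni = bonferroni,
  sidak = sidak, all_fifteen = all_fifteen)
```

The quantity 0.05^15 is the probability that every one of the 15 terms is significant by chance, so using it as a threshold would be far stricter than is needed to hold the familywise error rate at 5%.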

matus already brought up this point, and more eloq...

2013-06-09T10:47:53.025+01:00

matus already brought up this point, and more eloquently at that, but I'd just like to chime in: of course multilevel models will not solve the false-positive problem at one fell swoop, but they will alleviate the problem of multiple comparisons in complex models a little, through partial pooling of estimates. Also, nice and clear post that I will show to people in need. Thanks!
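
A rough illustration of what partial pooling does to a set of estimates. This is a crude empirical-Bayes shrinkage in base R, used here as a stand-in for a full multilevel model, and everything in it (`J`, `n`, unit residual variance, all true effects equal to zero) is an assumption made for the sketch.

```r
set.seed(2)
J   <- 15       # number of effects being estimated (hypothetical)
n   <- 20       # observations behind each estimate (hypothetical)
se2 <- 1 / n    # sampling variance of each estimate, assuming unit residual variance

raw <- rnorm(J, mean = 0, sd = sqrt(se2))   # raw estimates when every true effect is zero

# Method-of-moments estimate of the between-effect variance (floored at zero)
tau2   <- max(0, var(raw) - se2)
weight <- tau2 / (tau2 + se2)               # 0 = complete pooling, 1 = no pooling

# Partially pooled estimates: each raw estimate is pulled toward the overall mean
pooled <- mean(raw) + weight * (raw - mean(raw))

round(cbind(raw = raw, pooled = pooled), 3)
```

When the spread of the raw estimates is no larger than sampling noise alone would produce, the weight is close to zero and the extreme estimates are pulled strongly toward the mean, which is how pooling tempers chance findings.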

No worries - I realised that. I was just explainin...

2013-06-08T17:36:43.750+01:00

No worries - I realised that. I was just explaining why I'm unlikely to come up with the goods - it would be very hard to do and life (and time) is short!

Oh, sorry, I didn't at all mean to say that th...

2013-06-08T16:34:55.617+01:00

Oh, sorry, I didn't at all mean to say that this post should have been shorter! Just that if you were also to post a greatly truncated version, that one might reach a much wider audience. (And perhaps draw more people to read the full-length one, too.)

Thanks Mike. I did realise this post would be suit...

2013-06-08T14:33:20.456+01:00

Thanks Mike.
I did realise this post would be suitable only for a rather rarefied readership familiar with ANOVA. I usually try to make my posts accessible to as broad a readership as possible, but in this case I was rather defeated by the subject matter.
see http://quoteinvestigator.com/2012/04/28/shorter-letter/

For me, this article was packed with useful inform...

2013-06-07T14:22:16.654+01:00

For me, this article was packed with useful information, and I'm happy to have all the extras so that I can do my own exploring and research. Thank you, Professor Bishop!

Just in case you are not familiar with this, there...

2013-06-07T13:19:33.565+01:00

Just in case you are not familiar with this, there already are studies - even in the psychological literature - looking at results of ANOVA simulations and making the points you raised above. For instance, Maxwell (2004) did ANOVA simulations to see how sample size influences the probability of obtaining any significant result within and across levels of analysis.

Also, Maxwell is more concerned with beta error than with false positives, and I think he is right. You may also want to focus on false negatives. If, say, someone does report a 5-way interaction, that's at least 2^5=32 cells. If we expect at least 10 data points per cell, then we ask for a total sample of 320 data points. I would be really interested to see the power of the mentioned 5-way interaction. It is probably too low to reliably confirm the interaction.

Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9(2), 147-163.
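
The power question can be explored by simulation. The sketch below is not taken from Maxwell's paper. It assumes five binary between-subjects factors coded -1/+1, 10 observations per cell (320 in total), unit residual variance, and a hypothetical coefficient `beta` on the 5-way interaction contrast; `power_5way` is a name made up for this example.

```r
set.seed(1)
k      <- 5     # number of binary factors
n_cell <- 10    # observations per cell

design <- expand.grid(rep(list(c(-1, 1)), k))   # the 2^5 = 32 cells, coded -1/+1
names(design) <- paste0("F", 1:k)
design <- design[rep(seq_len(nrow(design)), each = n_cell), ]   # 320 rows
contrast5 <- apply(design, 1, prod)             # the 5-way interaction contrast

# Proportion of simulated datasets in which the 5-way term reaches p < .05
power_5way <- function(beta, n_sim = 500) {
  hits <- replicate(n_sim, {
    y   <- beta * contrast5 + rnorm(nrow(design))
    fit <- lm(y ~ F1 * F2 * F3 * F4 * F5, data = design)
    coef(summary(fit))["F1:F2:F3:F4:F5", "Pr(>|t|)"] < 0.05
  })
  mean(hits)
}

power_5way(0.1)   # estimated power for an assumed interaction coefficient of 0.1
```

Trying a few values of `beta` gives a feel for how large the interaction has to be before 320 data points detect it reliably. Note that an effect confined to a single cell translates into a much smaller coefficient on this contrast than the cell difference itself.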

Not being very familiar with this stuff, I admit I...

2013-06-07T12:09:11.286+01:00

Not being very familiar with this stuff, I admit I zoned out about half way through. But I think this core sentence is hugely important:

If you do a four-way ANOVA, and you don't care what results comes out, provided something is significant, you have a slightly greater than 50% chance of being satisfied.

I'd like to suggest that you post a much, much shorter article that contains the barest minimum setup necessary to reach this conclusion. I think it would have a powerful salutary effect on a lot of people who will struggle with this full-length version.

Hope this is helpful.

Thanks matus! I am thinking of writing a more form...

2013-06-07T10:52:35.649+01:00

Thanks matus!
I am thinking of writing a more formal paper on this topic, incorporating some of my simulations, and these references and comments will be very helpful in refining my ideas.

The reference on hierarchical anova is Gelman, A....

2013-06-07T10:44:48.389+01:00

The reference on hierarchical anova is
Gelman, A. (2005). Analysis of variance—why it is more important than ever. The Annals of Statistics, 33(1), 1-53.

John Kruschke also does provide some insightful demonstrations over at his DBDA blog:
http://doingbayesiandataanalysis.blogspot.de/2012/11/shrinkage-in-multi-level-hierarchical.html

I see two issues. 1.experimental design. In the s...

2013-06-07T10:40:21.782+01:00

I see two issues.
1. Experimental design. In the standard frequentist view, the purpose of experimental design is to extend the magnitude of the target effect along a single dimension and to nullify the magnitude of other effects along this dimension. So if no main effect is found but the 3-way interaction turned out significant, one would say that the experimental manipulation failed and we get your option c). Actually, I don't think it is meaningful (with the possible exception of applied research) to try to slice reality into orthogonal effects, nor do I think it is possible to do so with a system as complex as the human brain. Instead of adjusting our experiments so that they fit the assumptions of statistical models (balanced design, no missing values, homoskedasticity...) we should focus on designing ecologically valid experiments, then go back to the statistician and ask for tools that allow us to model the data we obtained. With such an approach two issues arise.

2.1 Are there statistical tools that solve problems highlighted in your blog? Yes there are. Bayesian hierarchical modeling (Gelman, 2005) can be used to correct the effect size (instead of the significance level) in GLMs. As a consequence, no further correction of the significance level is needed and the number of comparisons or additional comparisons (such as between hemispheres) is no issue. In principle, nothing stops you from hierarchically pooling effects across levels of Anova, though I don't think this is meaningful. Bayesian models can be extended to capture correlations between electrodes. In fact, Bayesian analysis doesn't shy away from including prior knowledge in the analysis. If you have relevant knowledge about the properties of electrophysiological signals in the brain - how they propagate and interact - then please let your model reflect this knowledge. If one uses hierarchical pooling one may fit a model with more unknown parameters than the actual number of data points.

2.2 How to choose the proper model? There is no THE Anova (even though there may be THE Anova button in SPSS). There are always choices that have to be made by the researcher. Consequently, the researcher should justify the choice of the model in the publication. This can be done with more rigor than just by a hazy reference to some hypothesis mentioned in the introduction (which always opens the way for HARKing). One option is model checking - for example, Bayesians do posterior predictive checks - they generate data from the model and look at whether the simulated data capture the main patterns in the data. Another option is to compare several models by statistical means. The comparison can weigh both the fit and the simplicity of the models. Both these options highlight the shift from fitting the experiment/data to the model to fitting the model to the data, which I alluded to in 1.

With the powerful and smart software we have today I often find the only limit is how much time one wants to spend with model design.
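
As one concrete reading of "compare several models by statistical means", here is a minimal sketch that uses AIC and BIC, two criteria that trade fit against the number of parameters. The choice of criteria, the simulated data, and the assumed truth (a single main effect of `A`) are mine and only illustrate the idea; a Bayesian comparison would follow the same logic with different machinery.

```r
set.seed(7)
n <- 200
d <- data.frame(A = gl(2, 1, n), B = gl(2, 2, n), C = gl(2, 4, n))
d$y <- 0.5 * (d$A == "2") + rnorm(n)   # assumed truth: only a main effect of A

# Candidate explanations for the data, from simplest to most complex
models <- list(
  null   = lm(y ~ 1, data = d),
  main_A = lm(y ~ A, data = d),
  mains  = lm(y ~ A + B + C, data = d),
  full   = lm(y ~ A * B * C, data = d)
)

comparison <- data.frame(
  n_params = sapply(models, function(m) length(coef(m))),
  AIC      = sapply(models, AIC),
  BIC      = sapply(models, BIC)
)

comparison[order(comparison$BIC), ]    # lower values indicate a better fit/simplicity trade-off
```

The full factorial model always fits at least as well as the simpler ones, so the interesting question is whether its extra terms earn their keep once complexity is penalised.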