BishopBlog: How Analysis of Variance Works

Monday, 20 November 2017

How Analysis of Variance Works

Intuitive explanations of statistical concepts for novices #1

Lots of people use Analysis of Variance (Anova) without really understanding how it works, so I thought I'd have a go at explaining the basics in an intuitive fashion.

Consider three experiments, A, B and C, each of which tests the impact of an intervention on an outcome measure. The three experiments each have 20 people in a control group and 20 in an intervention group. Figure 1 shows the individual scores on the outcome measure for the two groups as blobs, and the mean score for each group as a dotted black line.

Figure 1: Simulated data from 3 intervention studies

In terms of the average scores of the control and intervention groups, the three experiments look very similar, with the intervention group about 0.4 to 0.5 points higher than the control group. But we can't interpret this difference without having an idea of how variable the scores are within the two groups.

For experiment A, there is considerable variation within each group, which swamps the average difference between the groups. In contrast, for experiment C, the scores within each group are tightly packed. Experiment B is somewhere in between.

If you enter these data into a one-way Anova, with group as a between-subjects factor, you get an F-ratio, which can then be evaluated in terms of a p-value: the probability of obtaining a result at least this extreme if there is really no impact of the intervention. The F-ratios are very different for A, B, and C, even though the group mean differences are similar. In terms of the conventional .05 level of significance, the result from experiment A is not significant, experiment C is significant at the .001 level, and experiment B shows a trend (p = .051).
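To make this concrete, here is a minimal sketch in R. It is not the script used for Figure 1: the scores, the three within-group spreads (`sds`) and the helper `run_anova` are all made up for illustration. To keep the mean difference at exactly 0.5 in every experiment, it uses a fixed set of evenly spaced normal quantiles rather than random draws, and simply stretches or squeezes them.

```r
# Hypothetical data, NOT the data behind Figure 1.
n <- 20                                  # people per group
base <- qnorm(ppoints(n))                # fixed, roughly normal scores centred on 0
sds <- c(A = 2, B = 1, C = 0.5)          # assumed within-group spread per experiment

# Run a one-way Anova for one experiment and return its F-ratio and p-value
run_anova <- function(sd) {
  dat <- data.frame(
    group = factor(rep(c("Control", "Intervention"), each = n)),
    score = c(base * sd, base * sd + 0.5)   # intervention mean is 0.5 higher
  )
  tab <- summary(aov(score ~ group, data = dat))[[1]]
  c(F = tab[["F value"]][1], p = tab[["Pr(>F)"]][1])
}

results <- sapply(sds, run_anova)        # 2 x 3 matrix: rows F and p, columns A, B, C
print(round(results, 4))
```

Because only the spread changes between the three columns, any difference in F and p comes from the within-group variation alone.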

So how is the F-ratio computed? It is the ratio between two numbers: the variance of the group means, scaled by group size, and the average variance within each group. When we have only two groups, as here, the first number reflects how far the two group means are from the overall mean. This is the Between Groups term, which is the variance of the two means multiplied by the number in each group (20). It will be similar for A, B and C, because the means for the two groups are similar and the numbers in each group are the same.

But the Within Groups term will differ substantially for A, B, and C, because it is computed as the average of the variances of the two groups. The F-ratio is obtained by dividing the Between Groups term by the Within Groups term. If the Within Groups term is big, F is small, and vice versa.
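The recipe can be checked by hand in a few lines of R. This is a sketch under stated assumptions: the two groups are simulated with an arbitrary seed, mean difference and spread, and the variable names are my own. It relies on the groups being the same size, as they are in all three experiments here.

```r
# Simulated scores for one experiment (seed, means and SD are arbitrary assumptions)
set.seed(2017)
n <- 20
control <- rnorm(n, mean = 0, sd = 1)
intervention <- rnorm(n, mean = 0.5, sd = 1)

# Between Groups term: variance of the two group means, times the number per group
between <- var(c(mean(control), mean(intervention))) * n

# Within Groups term: average of the two group variances
within <- mean(c(var(control), var(intervention)))

f_by_hand <- between / within

# p-value from the F distribution: 1 df between groups, 2 * (n - 1) df within
p_by_hand <- pf(f_by_hand, df1 = 1, df2 = 2 * (n - 1), lower.tail = FALSE)

# Cross-check against R's built-in one-way Anova
dat <- data.frame(
  group = factor(rep(c("Control", "Intervention"), each = n)),
  score = c(control, intervention)
)
tab <- summary(aov(score ~ group, data = dat))[[1]]
f_aov <- tab[["F value"]][1]
p_aov <- tab[["Pr(>F)"]][1]

print(c(F_by_hand = f_by_hand, F_aov = f_aov, p_by_hand = p_by_hand, p_aov = p_aov))
```

The hand-computed values should agree with those from `aov` up to rounding error. With unequal group sizes the simple "average variance" shortcut no longer applies, and the two terms have to be weighted by group size.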

The R script used to generate Figure 1 can be found here: https://github.com/oscci/intervention/blob/master/Rftest.R

PS. 20/11/2017. Thanks to Jan Vanhove for providing code to show means rather than medians in Fig 1.

9 comments:

Anonymous, 21 November 2017 at 00:55
You might like
https://youtu.be/lZESjuuDguE
jrkrideau, 24 November 2017 at 16:04
A minor criticism of the code. There are no spaces.

# This is much easier to read
bigsummary <- data.frame(matrix(NA, nrow = nsims, ncol = 10))
# than this.
bigsummary<-data.frame(matrix(NA,nrow=nsims,ncol=10))

And a question about vectorization versus loops.
King Edward VII was probably still on the throne when I last did anything in ANOVA, so I may be well off, but do you really need all those loops? The normal way in R is to create the data.frame and then do manipulations.

If I understood the code, slightly, I think you can create the data.frame as follows:

# Three groups of 40, 20 Intervention, 20 Control / group

dat1 <- data.frame(grp = rep(c("A", "B", "C"), each = 40),
                   cond = rep(rep(c("Intervention", "Control"), each = 20), 3),
                   rr = rnorm(120))

And another way to get the three panel plot which I think is simpler and easier is:

# Plot using ggplot2

library(ggplot2)
p <- ggplot(dat1, aes(cond, rr, colour = grp)) + geom_point()

p # look at basic plot

p + facet_grid(. ~ grp)

# The use of "p" is not needed but lets one see the intermediate step(s)

# And for fun

p + facet_grid(. ~ grp)
jrkrideau, 24 November 2017 at 20:29
Cancel that last data.frame. It works well for many things, particularly the ggplot2 plots, but probably not for an ANOVA. The one below is the same thing but with a different layout.

dat2 <- data.frame(grp = rep(c("A", "B", "C"), each = 20), treat = rnorm(60), plac = rnorm(60))

Note that "grp" is a factor. I think that is what is wanted but, as I can hardly spell ANOVA, I am not sure.

Anonymous, 30 November 2017 at 07:19
I found this post and the follow-up post very helpful, thank you. Please could you clarify whether the 20 individuals in the control group in Exp A are the same 20 individuals as in Exp B? I.e. are there 40 individuals across the three experiments, or 120?