Monday 20 November 2017

How Analysis of Variance Works

Intuitive explanations of statistical concepts for novices #1

Lots of people use Analysis of Variance (Anova) without really understanding how it works, so I thought I'd have a go at explaining the basics in an intuitive fashion.

Consider three experiments, A, B and C, each of which compares the impact of an intervention on an outcome measure. The three experiments each have 20 people in a control group and 20 in an intervention group. Figure 1 shows the individual scores on an outcome measure for the two groups as blobs, and the mean score for each group as a dotted black line.

Figure 1: Simulated data from 3 intervention studies

In terms of average scores of control and intervention groups, the three experiments look very similar, with the intervention group about .4 to .5 points higher than the control group. But we can't interpret this difference without having an idea of how variable scores are in the two groups.

For experiment A, there is considerable variation within each group, which swamps the average difference between the groups. In contrast, for experiment C, the scores within each group are tightly packed. Experiment B is somewhere in between.

If you enter these data into a one-way Anova, with group as a between-subjects factor, you get an F-ratio, which can then be evaluated in terms of a p-value: the probability of obtaining a result at least this extreme if there is really no impact of the intervention. As you will see, the F-ratios are very different for A, B, and C, even though the group mean differences are much the same. In terms of the conventional .05 level of significance, the result from experiment A is not significant, experiment C is significant at the .001 level, and experiment B shows a trend (p = .051).
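For anyone who wants to try this, here is a minimal sketch of a one-way Anova in R. The data are freshly simulated for illustration (they are not the data in Figure 1), and the seed, the variable names and the true difference of 0.5 are all my own assumptions.

```r
# Hypothetical data for one experiment: 20 controls and 20 in the intervention group
set.seed(1)  # arbitrary seed, only so the example is reproducible
n <- 20
dat <- data.frame(
  group = factor(rep(c("Control", "Intervention"), each = n)),
  score = c(rnorm(n, mean = 0, sd = 1), rnorm(n, mean = 0.5, sd = 1))
)

# One-way Anova with group as the between-subjects factor
fit <- aov(score ~ group, data = dat)
tab <- summary(fit)[[1]]  # the Anova table as a data frame

f_ratio <- tab[["F value"]][1]  # F-ratio for the group effect
p_value <- tab[["Pr(>F)"]][1]   # p-value attached to that F-ratio

print(tab)
```

The p-value is the upper tail of the F distribution beyond the observed F-ratio, with 1 and 38 degrees of freedom in this design, so `pf(f_ratio, 1, 38, lower.tail = FALSE)` should give the same number.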

So how is the F-ratio computed? It involves computing a number that reflects the ratio between the variance of the group means and the average variance within each group. When we have only two groups, as here, the first value reflects how far the two group means are from the overall mean. This is the Between Groups term, which is the variance of the two means multiplied by the number in each group (20). That will be similar for A, B and C, because the means for the two groups are similar and the numbers in each group are the same.

But the Within Groups term will differ substantially for A, B, and C, because it is computed as the average of the variances within the two groups. The F-ratio is obtained by dividing the Between Groups term by the Within Groups term. If the Within Groups term is big, F is small, and vice versa.
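The arithmetic described above can be checked by hand. This is a sketch on made-up data (again, not the Figure 1 data; the seed and the names `ms_between` and `ms_within` are my own), and it relies on the two groups being the same size, as they are here.

```r
# Hypothetical scores for two equal-sized groups
set.seed(2)  # arbitrary seed for reproducibility
n <- 20
control      <- rnorm(n, mean = 0,   sd = 1)
intervention <- rnorm(n, mean = 0.5, sd = 1)

# Between Groups term: variance of the two group means, times the number per group
ms_between <- n * var(c(mean(control), mean(intervention)))

# Within Groups term: average of the two within-group variances
ms_within <- mean(c(var(control), var(intervention)))

# F-ratio: Between Groups term divided by Within Groups term
f_by_hand <- ms_between / ms_within

# The same F-ratio from R's built-in Anova, for comparison
dat <- data.frame(
  group = factor(rep(c("Control", "Intervention"), each = n)),
  score = c(control, intervention)
)
f_from_aov <- summary(aov(score ~ group, data = dat))[[1]][["F value"]][1]

print(c(by_hand = f_by_hand, aov = f_from_aov))
```

The two values should agree apart from rounding error. With unequal group sizes the simple average of the variances would no longer be the right Within Groups term, so this shortcut only applies to balanced designs.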

The R script used to generate Figure 1 can be found here:

PS. 20/11/2017. Thanks to Jan Vanhove for providing code to show means rather than medians in Fig 1. 


  1. You might like

    1. Just to say, this comment almost got deleted as it looks like spam. In fact it is a video explainer for doing two-way Anova in Excel. I personally would not recommend that, but some may find it helpful.

    2. I have come to the conclusion that using a spreadsheet for anything more complicated than a shopping list is a poor idea.

      It's a heck of a lot easier and much safer to use a dedicated statistical package such as R or SAS. I have heard that Python is quite good but have never used it.

      This report is 10 years old but I don't think much has changed t

  2. A minor criticism of the code. There are no spaces.

    # This is much easier to read
    bigsummary <- data.frame(matrix(NA, nrow = nsims, ncol = 10))
    # than this.

    And a question about vectorization versus loops.
    King Edward VII was probably still on the throne when I last did anything in ANOVA, so I may be well off, but do you really need all those loops? The normal way in R is to create the data.frame and then do manipulations.

    If I understood the code, slightly, I think you can create the data.frame as follows:

    # Three groups of 40, 20 Intervention, 20 Control / group

    dat1 <- data.frame(grp = rep(c("A", "B", "C"), each = 40),
    cond = rep(rep(c("Intervention", "Control"), each = 20), 3),
    rr = rnorm(120))

    And another way to get the three panel plot which I think is simpler and easier is:

    # Plot using ggplot2

    library(ggplot2)
    p <- ggplot(dat1, aes(cond, rr, colour = grp)) + geom_point()

    p # look at basic plot

    p + facet_grid(. ~ grp)

    # The use of "p" is not needed but lets one see the intermediate step(s)

    # And for fun

    p + facet_grid(. ~ grp)

  3. Cancel that last data.frame. It works well for many things particularly the ggplot2 plots but probably not an ANOVA. The one below is the same thing but with a different layout.

    dat2 <- data.frame(grp = factor(rep(c("A", "B", "C"), each = 20)), treat = rnorm(60), plac = rnorm(60))

    Note that "grp" is a factor. I think that is what is wanted but, as I can hardly spell ANOVA I am not sure.

    1. Thanks. I usually do give a health warning about my scripting, to explain that it is not to be regarded as an example of well-written code. Though sometimes clunky code can be easier for novices to follow.

    2. Very true. My code is usually very clunky but it helps me remember what the blazes I was thinking about last week.

      I was just protesting loops. They are very un-R and usually make code harder to read and tend to consume programmer time. I think in operation they are just as efficient as vectorization.

      On the other hand, I hate those who write lovely compact code that takes only 5 hours of agony to figure out.

  4. I found this post and the follow-up post very helpful, thank you. Please could you clarify whether the 20 individuals in the control group in Exp A are the same 20 individuals as in Exp B? i.e. are there 40 individuals across the three experiments, or 120?

    1. Good question. There are, of course, no real individuals! The figure shows simulated data. The way it was simulated was to simulate one set of values for 40 people, and then manipulate the effect size to simulate each experiment based on the same numbers: so you could say it is as if we had the same 40 people - though if you actually did have the same 40 people, you'd still expect some variation from one experiment to another. So the simulation is not realistic in that respect.
      I did it this way to make it easier to see how the changes in variance affected the results. It would be easy to change this so that you simulated a fresh dataset for each experiment (and in fact I initially did it that way) - but with small samples like these, the data then hop around quite a lot and it is harder to focus on the specific effects that I wanted to demonstrate.
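A rough sketch of this kind of simulation, which is not the original script: one set of 40 values is generated and then rescaled to give three experiments with the same mean difference but different spread. The standard deviations, the mean difference of 0.5 and the step that centres each group at zero are all assumptions made to keep the illustration clean.

```r
set.seed(3)  # arbitrary seed for reproducibility
n <- 20
group <- factor(rep(c("Control", "Intervention"), each = n))

# One set of base values for 40 simulated people, centred within each group
# (centring is a simplification so the mean difference is identical every time)
base <- rnorm(2 * n)
base <- base - ave(base, group)

mean_diff <- 0.5                   # hypothetical intervention effect
sds <- c(A = 2, B = 0.8, C = 0.4)  # hypothetical within-group spread per experiment

# Rescale the same base values for each experiment and run a one-way Anova
f_ratios <- sapply(sds, function(s) {
  score <- base * s + ifelse(group == "Intervention", mean_diff, 0)
  summary(aov(score ~ group))[[1]][["F value"]][1]
})

print(round(f_ratios, 2))
```

Because the same base values are reused, the Between Groups term is identical across A, B and C, and only the Within Groups term changes, so the F-ratio grows as the spread shrinks.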