BishopBlog: 2017

Tuesday, 26 December 2017

Using simulations to understand p-values

Intuitive explanations of statistical concepts for novices #4

The p-value is widely used but widely misunderstood. I'll demonstrate this in the context of intervention studies. The key question is how confident can we be that an apparently beneficial effect of treatment reflects a change due to the intervention, rather than arising just through the play of chance. The p-value gives one way of deciding that. There are other approaches, including those based on Bayesian statistics, which are preferred by many statisticians. But I will focus here on the traditional null hypothesis significance testing (NHST) approach, which dominates statistical reporting in many areas of science, and which uses p-values.

As illustrated in my previous blogpost, where our measures include random noise, the distorting effects of chance mean that we can never be certain whether or not a particular pattern of data reflects a real difference between groups. However, we can compute the probability that the data came from a sample where there was no effect of intervention.

There are two ways to do this. One way is by simulation. If you repeatedly run the kind of simulation described in my previous blogpost, specifying no mean difference between groups, each time taking a new sample, for each result you can compute a standardized effect size. Cohen's d is the mean difference between groups expressed in standard deviation units, which can be computed by subtracting the group A mean from the group B mean, and dividing by the pooled standard deviation (i.e. the square root of the average of the variances for the two groups). You then see how often the simulated data give an effect size at least as large as the one observed in your experiment.

Histograms of effecct sizes obtained by repeatedly sampling from population where there is no difference between groups*

Figure 1 shows the distribution of effect sizes for two different studies: the first has 10 participants per group, and the second has 80 per group. For each study, 10,000 simulations were run; on each run, a fresh sample was taken from the population, and the standardized effect size, d, computed for that run. The peak of each distribution is at zero: we expect this, as we are simulating the case of no real difference between groups – the null hypothesis. But note that, though the shape of the distribution is the same for both studies, the scale on the x-axis covers a broader range for the study with 10 per group than the study with 80 per group. This relates to the phenomenon shown in Figure 5 of the previous blogpost, whereby estimates of group means jump around much more when there is a small sample.

The dotted red lines show the cutoff points that identify the top 5%, 1% and 0.1% of the effect sizes. Suppose we ran a study with 10 people and it gave a standardized effect size of 0.3. We can see from the figure that a value in this range is fairly common when there is no real effect: around 25% of the simulations gave an effect size of at least 0.3. However, if our study had 80 people per group, then the simulation tells us this is an improbable result to get if there really is no effect of intervention: only 2.7% of simulations yield an effect size as big as this.

The p-value is the probability of obtaining a result at least as extreme as the one that is observed, if there really is no difference between groups. So for the study with N = 80, p = .027. Conventionally, a level of p < .05 has been regarded as 'statistically significant', but this is entirely arbitrary. There is an inevitable trade-off between false positives (type I errors) and false negatives (type II errors). If it is very important to avoid false positives, and you do not mind sometimes missing a true effect, then a stringent p-value is desirable. If, however, you do not want to miss any finding of potential interest, even if it turns out to be a false positive, then you could adopt a more lenient criterion.

The comparison between the two sample sizes in Figure 1 should make it clear that statistical significance is not the same thing as practical significance. Statistical significance simply tells us how improbable a given result would be if there was no true effect. The larger the sample size, the smaller the effect size that would be detected at a threshold such as p < .05. Small samples are generally a bad thing, because they only allow us to reliably detect very large effects. But very large samples have the opposite problem: they allow us to detect as 'significant' effect that are so small as to be trivial. The key point that the researcher who is conducting an intervention study should start by considering how big an effect would be of practical interest, given the cost of implementing the intervention. For instance, you may decide that staff training and time spent on a vocabulary intervention would only be justified if it boosted children's vocabulary by at least 10 words. If you knew how variable children scores were on the outcome measure, the sample size could then be determined so that the study has a good chance of detecting that effect while minimising false positives. I will say more about how to do that in a future post.

I've demonstrated p-values using simulations in the hope that this will give some insight into how they are derived and what they mean. In practice, we would not normally derive p-values this way, as there are much simpler ways to do this, using statistical formulae. Provided that data are fairly normally distributed, we can use statistical approaches such as ANOVA, t-tests and linear regression to compute probabilities of observed results (see this blogpost). Simulations can, however, be useful in two situations. First, if you don't really understand how a statistic works, you can try running an analysis with simulated data. You can either simulate the null hypothesis by creating data from two groups that do not differ, or you can add a real effect of a given size to one group. Because you know exactly what effect size was used to create the simulated data, you can get a sense of whether particular statistics are sensitive to detect real effects, and how these might vary with sample size.

The second use of simulations is for situations where the assumptions of statistical tests are not met – for instance, if data are not normally distributed, or if you are using a complex design that incorporates multiple interacting variables. If you can simulate a population of data that has the properties of your real data, you can then repeatedly sample from this and compute the probability of obtaining your observed result to get a direct estimate of a p-value, just as was done above.

The key point to grasp about a p-value is that it tells you how likely your observed evidence is, if the null hypothesis is true. The most widely used p-value is .05: if the p-value in your study is less than .05, then the chance of your observed data arising when the intervention had no effect is 1 in 20. You may decide on that basis that it's worth implementing the intervention, or at least investing in the costs of doing further research on it.

The most common mistake is to think that the p-value tells you how likely the null hypothesis is given the evidence. But that is something else. The probability of A (observed data) given B (null hypothesis) not the same as the probability of B (null hypothesis) given A (observed data). As I have argued in another blogpost, the probability that if you are a man you are a criminal is not high, but if you are a criminal, the probability that you are a man is much higher. This may seem fiendishly complicated, but a concrete example can help.

Suppose Bridget Jones has discovered three weight loss pills: if taken for a month, pill A is totally ineffective placebo, pill B leads to a modest weight loss of 2 lbs, and pill C leads to an average loss of 7 lb. We do studies with three groups of 20 people; in each group, half are given A, B or C and the remainder are untreated controls. We discover that after a month, one of the treated groups has an average weight loss of 3 lb, whereas their control group has lost no weight at all. We don't know which pill this group received. If we run a statistical test, we find the p-value is .45. This means we cannot reject the null hypothesis of no effect – which is what we'd expect if this group had been given the placebo pill, A. But the result is also compatible with the participants having received pills B or C. This is demonstrate in Figure 2 which shows the probability density function for each scenario - in effect, the outline of the histogram. The red dotted line corresponds to our obtained result, and it is clear it is highly probable regardless of which pill was used. In short, this result doesn't tell us how likely the null hypothesis is – only that the null hypothesis is compatible with the evidence that we have.

Probability density function for weight loss pills A, B and C, with red line showing observed result

Many statisticians and researchers have argued we should stop using p-values, or at least adopt more stringent levels of p. My view is that p-values can play a useful role in contexts such as the one I have simulated here, where you want to decide whether an intervention is worth adopting, provided you understand what they tell you. It is crucial to appreciate how dependent a p-value is on sample size, and to recognise that the information it provides is limited to telling you whether an observed difference could just be due to chance. In a later post I'll go on to discuss the most serious negative consequence of misunderstanding of p-values: the generation of false positive findings by the use of p-hacking.

*The R script to generate Figures 1 and 2 can be found here.

Thursday, 21 December 2017

Using simulations to understand the importance of sample size

Intuitive explanations of statistical concepts for novices #3

I'll be focusing here on the kinds of stats needed if you conduct an intervention study. Suppose we measured the number of words children could define on a 20-word vocabulary task. Words were selected so that at the start of training, none of the children knew any of them. At the end of 3 months of training, every child in the vocabulary training group (B) knew four words, whereas those in a control group (A) knew three words. If we had 10 children per group, the plot of final scores would look like Figure 1 panel 1.

Figure 1. Fictional data to demonstrate concept of random error (noise)

In practice, intervention data never look like this. There is always unexplained variation in intervention outcomes, and real results look more like panel 2 or panel 3. That is, in each group, some children learn more than average and some less than average. Such fluctuations can reflect numerous sources of uncontrolled variation: for instance, random error will be large if we use unreliable measures, or there may be individual differences in responsiveness to intervention in the people included in the study, as well as things that can fluctuate from day to day or even moment to moment, such as people's mood, health, tiredness and so on.

The task for the researcher is to detect a signal – the effect of intervention – from noise – the random fluctuations. It is important to carefully select our measures and our participants to minimise noise, but we will never eliminate it entirely.

There are two key concepts behind all the statistics we do: (a) data will contain random noise, and (b) when we do a study we are sampling from a larger population. We can make these ideas more concrete through simulation.

The first step is to generate a large quantity of random numbers. Random numbers can be easily generated using the free software package R: if you have this installed, you can follow this demo by typing in the commands shown in italic at your console. R has a command, rnorm, that generates normally distributed random numbers. For instance:

rnorm(10,0,1)

will generate 10 z-scores, i.e. random numbers with mean of 0 and standard deviation of 1.You get new random numbers each time you submit the command, (unless you explicitly set something known as the random number seed to be the same each time). Now let's use R to generate 100,000 random numbers, and plot the output in a histogram. Figure 2 can be generated with the commands:

myz = rnorm(100000,0,1)
hist(myz)

Figure 2: Distribution of z-scores simulated with rnorm

This shows that numbers close to zero are most common, and the further we get from zero in either direction, the lower the frequency of the number. The bell-shaped curve is a normal distribution, which we get because we opted to generate random numbers following a normal distribution using rnorm. (Other distributions of random number are also possible; you can see some options here).

So you might be wondering what we do with this list of numbers. Well, we can simulate experimental data based on this population of numbers by specifying two things:
1. The sample size
2. The effect size – i.e., Cohen's d, the mean difference between groups in standard deviation (SD) units.

Suppose we want two groups, A and B, each with a sample size of 10, where group B has scores that are on average 1 SD larger than group A. First we select 20 values at random from myz:

mydata = sample(myz, 20)

Next we create a variable corresponding to group, which is created by just making a variable, mygroup, that combines ten repeats of 'A' with ten repeats of 'B'.

mygroup = c(rep('A', 10), rep('B', 10))

Next we add the effect size, 1, to the last 10 numbers, i.e. those for group B

mydata[11:20] = mydata[11:20] + 1

Now we can plot the individual points clustered by group. First install and activate the beeswarm package to make a nice plot format:

install.packages('beeswarm')
library(beeswarm)

Then you can make the plot with the command:

beeswarm(mydata ~ mygroup)

The resulting plot will look something like one of the graphs in Figure 3. It won't be exactly the same as any of them because your random sample will be different from the ones we have generated. In fact, this is one point of this exercise: to show you how numbers will vary from one occasion to another when you sample from a population.

If you just repeatedly run these lines, you will see how things vary just by chance:

mydata = sample(myz, 20)
mydata[11:20] = mydata[11:20] + 1
beeswarm(mydata ~ mygroup)

Figure 3: Nine runs of simulated data from 2 groups: A comes from population with mean score of 0 and B from population with mean score of 1

Note how in Figure 3, the difference between groups A and B is far more marked in runs 7 and 9 than in runs 4 and 6, even though each dataset was generated by the same script. This is what is meant by the 'play of chance' affecting experimental data.

Now let's look at Figure 4, which gives output from another nine runs of a simulation. This time, some runs were set so that there was a true effect of intervention (by adding .6 to values for group B) and some were set with no difference between groups. Can you tell which simulations were based on a real effect?

Figure 4: Some of these runs were generated with effect size of .6, others had no difference between A and B

The answer is that runs 1, 2, 4, 8 and 9 came from runs where there was a real effect of .6 (which, by the standard of most intervention studies is a large effect). You may have identified some of these runs correctly, but you may also to have falsely selected run 3 as showing an effect. This would be a false positive, where we wrongly conclude there is an intervention effect when the apparent superiority of the intervention group is just down to chance. This type of error is known as a type I error. Run 2 looks like a false negative – we are likely to conclude there is no effect of intervention, when in fact there was one. This is a type II error. One way to remember this distinction is that a type I error is when you think you've won (1) but you haven't.

The importance of sample size
Figures 3 and 4 demonstrate that, when inspecting data from intervention trials, you can't just rely on the data in front of your eyes. Sometimes, they will suggest a real effect when the data are really random (type I error) and sometimes they will fail to reveal a difference when the intervention is really effective (type II error). These anomalies arise because data incorporates random noise which can generate spurious effects or mask real effects. This masking is particularly problematic when samples are small.

Figure 5 shows two sets of data: the top panel and the bottom panel were derived by the same simulation, the only difference being the sample size: 10 per group in the top panels, and 80 per group in the bottom panels. In both cases, the simulation specified that group B scores were drawn from a population that had higher scores than group A, with an effect size of 0.6. The bold line shows the group average. The figure shows that the larger the sample, the closer the results from the sample will agree with the population from which it was drawn.

Figure 5: Five runs of simulation where true effect size = .6

When samples are small, estimates of the means will jump around much more than when samples are large. Note, in particular, that with the small sample size, on the third run, the mean difference between A and B is overestimated by about 50%, whereas in the fourth run, the mean for B is very close to that for A.

In the population from which these samples are taken the mean difference between A and B is 0.6, but if we just take a sample from this population, by chance we may select atypical cases, and these will have a much larger impact on the observed mean when the sample is small.

In my next post, I will show how we can build on these basic simulations to get an intuitive understanding of p-values.

P.S. Scripts for generating the figures in this post can be found here.

Monday, 27 November 2017

Reproducibility and phonics: necessary but not sufficient

Over a hotel breakfast at an unfeasibly early hour (I'm a clock mutant) I saw two things on Twitter that appeared totally unrelated but which captured my interest for similar reasons.

The two topics were the phonics wars and the reproducibility crisis. For those of you who don't work on children's reading, the idea of phonics wars may seem weid. But sadly, there we have it: those in charge of the education of young minds locked in battle over how to teach children to read. Andrew Old (@oldandrewuk), an exasperated teacher, sounded off this week about 'phonics denialists', who are vehemently opposed to phonics instrution, despite a mountain of evidence indicating this is an important aspect of teaching children to read. He analysed three particular arguments used to defend an anti-phonics stance. I won't summarise the whole piece, as you can read what Andrew says in his blogpost. Rather, I just want to note one of the points that struck a chord with me. It's the argument that 'There's more to phonics than just decoding'. As Andrew points out, those who say this want to imply that those who teach phonics don't want to do anything else.

'In this fantasy, phonics denialists are the only people saving children from 8 hours a day, sat in rows, being drilled in learning letter combinations from a chalkboard while being banned from seeing a book or an illustration.'

This is nonsense: see, for instance, this interview with my colleague Kate Nation, who explains how phonics knowledge is necessary but not sufficient for competent reading.

So what has this got to do with reproducibility in science? Well, another of my favourite colleagues, Dick Passingham, started a little discussion on Twitter - in response to a tweet about a Radiolab piece on replication. Dick is someone I enjoy listening to because he is a fount of intelligence and common sense, but on this occasion, what he said made me a tad irritated:

This has elements of the 'more to phonics than just decoding' style of argument. Of course scientists need to know more than how to make their research reproducible. They need to be able to explore, to develop new theories and to see how to interpret the unexpected. But it really isn't an either/or. Just as phonics is necessary but not sufficient for learning to read, so are reproducible practices necessary but not sufficient for doing good science. Just as phonics denialists depicts phonics advocates as turning children into bored zombies who hate books, those trying to fix reproducibility problems are portrayed as wanting to suppress creative geniuses and turn the process of doing research into a tedious and mechanical exercise. The winds of change that are blowing through psychology won't stop researchers being creative, but they will force them to test their ideas more rigorously before going public.

For those, like Dick, who was trained to do rigorous science from the outset, the focus on reproducibiity may seem like a distraction from the important stuff. But the incentive structure has changed dramatically in recent decades with the rewards favouring the over-hyped sensational result over the careful, thoughful science that he favours. The result is an enormous amount of waste - of resources, of time and careers. So I'm not going to stop 'obsessing about the reproducibility crisis.' As I replied rather sourly to Dick:

Friday, 24 November 2017

ANOVA, t-tests and regression: different ways of showing the same thing

Intuitive explanations of statistical concepts for novices #2

In my last post, I gave a brief explainer of what the term 'Analysis of variance' actually means – essentially you are comparing how much variation in a measure is associated with a group effect and how much with within-group variation.

The use of t-tests and ANOVA by psychologists is something of a historical artefact. These methods have been taught to generations of researchers in their basic statistics training, and they do the business for many basic experimental designs. Many statisticians, however, prefer variants of regression analysis. The point of this post is to explain that, if you are just comparing two groups, all three methods – ANOVA, t-test and linear regression – are equivalent. None of this is new but it is often confusing to beginners.

Anyone learning basic statistics probably started out with the t-test. This is a simple way of comparing the means of two groups, and, just like ANOVA, it looks at how big that mean difference is relative to the variation within the groups. You can't conclude anything by knowing that group A has a mean score of 40 and group B has a mean score of 44. You need to know how much overlap there is in the scores of people in the two groups, and that is related to how variable they are. If scores in group A range from to 38 to 42 and those in group B range from 43 to 45 we have a massive difference with no overlap between groups – and we don't really need to do any statistics! But if group A ranges from 20 to 60 and group B ranges from 25 to 65, then a 2-point difference in means is not going to excite us. The t-test gives a statistic that reflects how big the mean difference is relative to the within-group variation. What many people don't realise is that the t-test is computationally equivalent to the ANOVA. If you square the value of t from a t-test, you get the F-ratio*.

Figure 1: Simulated data from experiments A, B, and C. Mean differences for two intervention groups are the same in all three experiments, but within-group variance differs

Now let's look at regression. Consider Figure 1. This is similar to the figure from my last post, showing three experiments with similar mean differences between groups, but very different within-group variance. These could be, for instance, scores out of 80 on a vocabulary test. Regression analysis focuses on the slope of the line between the two means, shown in black, which is referred to as b. If you've learned about regression, you'll probably have been taught about it in the context of two continuous variables, X and Y, where the slope b, tells you how much change there is in Y for every unit change in X. But if we have just two groups, b is equivalent to the difference in means.

So, how can it be that regression is equivalent to ANOVA, if the slopes are the same for A, B and C? The answer is that, just as illustrated above, we can't interpret b unless we know about the variation within each group. Typically, when you run a regression analysis, the output includes a t-value that is derived by dividing b by a measure known as the standard error, which is an index of the variation within groups.

An alternative way to show how it works is to transform data from the three experiments to be on the same scale, in a way that takes into account the within-group variation. We achieve this by transforming the data into z-scores. All three experiments now have the same overall mean (0) and standard deviation (1). Figure 2 shows the transformed data – and you see that after the data have been rescaled in this way, the y-axis now ranges from -3 to +3, and the slope is considerably larger for Experiment C than Experiment A. The slope for z- transformed data is known as beta, or the standardized regression coefficient.

Figure 2: Same data as from Figure 1, converted to z-scores

The goal of this blogpost is to give an intuitive understanding of the relationship between ANOVA, t-tests and regression, so I am avoiding algebra as far as possible. The key point is when you are comparing two groups, t and F are different ways of representing the ratio between variation between groups and variation within groups, and t can be converted into F by simply squaring the value. You can derive t from linear regression by dividing the b or beta by its standard error - and this is automatically done by most stats programmes. If you are nerdy enough to want to use algebra to transform beta into F, or to see how Figures 1 and 2 were created, see the script Rftest_with_t_and_b.r here.

How do you choose which statistics to do? For a simple two-group comparison it really doesn't matter and you may prefer to use the method that is most likely to be familiar to your readers. The t-test has the advantage of being well-known – and most stats packages also allow you to make an adjustment to the t-value which is useful if the variances in your two groups are different. The main advantage of ANOVA is that it works when you have more than two groups. Regression is even more flexible, and can be extended in numerous ways, which is why it is often preferred.

Further explanations can be found here:

*It might not be exactly the same if your software does an adjustment for unequal variances between groups, but it should be close. It is identical if no correction is done.

Monday, 20 November 2017

How Analysis of Variance Works

Intuitive explanations of statistical concepts for novices #1

Lots of people use Analysis of Variance (Anova) without really understanding how it works, so I thought I'd have a go at explaining the basics in an intuitive fashion.

Consider three experiments, A, B and C, each of which compares the impact of an intervention on an outcome measure. The three experiments each have 20 people in a control group and 20 in an intervention group. Figure 1 shows the individual scores on an outcome measure for the two groups as blobs, and the mean score for each group as a dotted black line.

Figure 1: Simulated data from 3 intervention studies

In terms of average scores of control and intervention groups, the three groups look very similar, with the intervention group about .4 to .5 points higher than the control group. But we can't interpret this difference without having an idea of how variable scores are in the two groups.

For experiment A, there is considerable variation within each group, that swamps the average difference between the groups. In contrast, for experiment C, the scores within each group are tightly packed. Group B is somewhere in between.

If you enter these data into a one-way Anova, with group as a between-subjects factor, you get out a F-ratio, which can then be evaluated in terms of a p-value which gives the probability of obtaining such an extreme result if there is really no impact of the intervention. As you will see, the F-ratios are very different for A, B, and C, even though the group mean differences are the same. And in terms of the conventional .05 level of significance, the result from experiment A is not significant, experiment C is significant at the .001 level, and experiment B shows a trend (p = .051).

So how is the F-ratio computed? It just involves computing a number that reflects the ratio between the variance of the means of the groups, and the average variance within each group. When we just have two groups, as here, the first value just reflects how far away the two group means are from the overall mean. This is the Between Groups term, which is just the Variance of the two means multiplied by the number in each group (20). That will be similar for A, B and C, because the means for the two groups are similar and the numbers in each group are the same.

But the Within Groups term will differ substantially for A, B, and C, because it is computed as the average variance for the two groups. The F-ratio is obtained by just dividing the between groups term by the within groups term. If the within groups term is big, F is small, and vice versa.

The R script used to generate Figure 1 can be found here: https://github.com/oscci/intervention/blob/master/Rftest.R

PS. 20/11/2017. Thanks to Jan Vanhove for providing code to show means rather than medians in Fig 1.

Friday, 3 November 2017

Prisons, developmental language disorder, and base rates

There's been some interesting discussion on Twitter about the high rate of developmental language disorder (DLD) in the prison population. Some studies give an estimate as high as 50 percent (Anderson et al, 2016), and this has prompted calls for speech-language therapy services to be involved in the working with offenders. Work by Pam Snow and others has documented the difficulties of navigating the justice system if your understanding and ability to express yourself are limited.

This is important work, but I have worried from time to time about the potential for misunderstanding. In particular, if you are a parent of a child with DLD, should you be alarmed at the prospect that your offspring will be incarcerated? So I wanted to give a brief explainer that offers some reassurance.

The simplest way to explain it is to think about gender. I've been delving into the latest national statistics for this post, and found that the UK prison population this year contained 82,314 men, but a mere 4,013 women. That's a staggering difference, but we don't conclude that because most criminals are men, therefore most men are criminals. This is because we have to take into account base rates: the proportion of the general population who are in prison. Another set of government statistics estimates the UK population as around 64.6 million, about half of whom are male, and 81% are adults. So a relatively small proportion of the adult population is in prison, and the numbers of non-criminal men vastly outnumber the number of criminal men.

I did similar sums for DLD, using data from Norbury et al (2016) to estimate a population prevalence of 7% in adult males, and plugging in that relatively high figure of 50% of prisoners with DLD. The figures look like this.

Numbers (in thousands) assuming 7% prevalence of DLD and 50% DLD in prisoners*

As you can see, according to this scenario, the probability of going to prison is much greater for those with DLD than for those without DLD (2.24% DLD vs 0.17% without DLD), but the absolute probability is still very low – 98% of those with DLD will not be incarcerated.

The so-called base rate fallacy is a common error in logical reasoning. It seems natural to conclude that if A is associated with B, then B must be associated with A. Statistically, that is true, but if A is extremely rare, then the likelihood of B given A can be considerably less than the likelihood of A given B.

So I don't think therefore that we need to seek explanations for the apparent inconsistency that's being flagged up on Twitter between rates of incarceration in studies of those with DLD, vs rates of DLD in those who are incarcerated. It could just be the consequence of the low base rate of incarceration.

References
Anderson et al (2016) Language impairments among youth offenders: A systematic review. Children and Youth Services Review, 65, 195-203.

Norbury, C. F., et al. (2016). The impact of nonverbal ability on prevalence and clinical presentation of language disorder: evidence from a population study. Journal of Child Psychology and Psychiatry, 57, 1247-1257.

*An R script for generating this figure can be found here.

Postscript - 4th November 2017
The Twitter discussion has continued and drawn attention to further sources of information on rates of language and related problems in prison populations. Happy to add these here if people can send sources:

Talbot, J. (2008). No One Knows: Report and Final Recommendations. Report by Prison Reform Trust.

House of Commons Justice Committee (2016) The Treatment of Young Adults in the Criminal Justice System. Report HC 169.

Tuesday, 17 October 2017

Citing the research literature: the distorting lens of memory


Corticogenesis: younger neurons migrate past older ones using radial glia as a scaffolding. Figure from https://en.wikipedia.org/wiki/Neural_development#/media/File:Corticogenesis_in_a_wild-type_mouse.png

"Billy was a likable twelve-year old boy whose major areas of difficulty were described by his parents as follows: 1) marked difficulty in reading and retaining what he read; 2) some trouble with arithmetic; 3) extreme slowness in completing homework with writing and spelling of poor quality; 4) slowness in learning to tell time (learned only during the past year); 5) lapses of attention with staring into space; 6) "dizzy spells" with "blackouts"; 7) recurring left frontal headaches always centering around and behind the left eye; 8) occasional enuresis until recently; 9) disinterest in work; 10) sudden inappropriate temper outbursts which were often violent; 11) enjoyment of irritating people; and 12) tendency to cry readily." Drake (1968), p . 488

Poor Billy would have been long forgotten, were it not for the fact that he died suddenly shortly after he had undergone extensive assessment for his specific learning difficulties. An autopsy found that death was due to a brain haemorrhage caused by an angioma in the cerebellum, but the neuropathologist also remarked on some unusual features elsewhere in his brain:

"In the cerebral hemispheres, anomalies were noted in the convolutional pattern of the parietal lobe bilaterally. The cortical pattern was disrupted by penetrating deep gyri that appeared disconnected. Related areas of the corpus callosum appeared thin (Figure 2). Microscopic examination revealed the cause of the hemorrrage to be a cerebellar angioma of the type known as capillary telangiectases (Figure 3). The cerebral cortex was more massive than normal, the lamination tended to be columnar, the nerve cells were spindle-shaped, and there were numerous ectopic neurons in the white matter that were not collected into distinct heterotopias (Figure 4)." p. 496*

I had tracked down this article in the course of writing a paper with colleagues on the neuronal migration account of dyslexia – a topic I have blogged about previously The 'ectopic neurons' referred to by Drake are essentially misplaced neurons that, because of disruptions of very early development, have failed to migrate to their usual location in the brain.

I realised that my hazy memory of this paper was quite different from the reality: I had thought the location of the ectopic neurons was consistent with those reported in later post mortem studies by Galaburda and colleagues. In fact, Drake says nothing about their location, other than that it is in white matter – which contrasts with the later reports.

This made me curious to see how this work had been reported by others. This was not a comprehensive exercise: I did this by identifying from Web of Science all papers that cited Drake's article, and then checking what they said about the results if I could locate an online version of the article easily. Here's what I found:

Out of a total of 45 papers, 18 were excluded: they were behind a paywall or not readily traceable online, or (1 case) did not mention neuroanatomical findings A further 10 papers included the Drake study in a bunch of references referring to neuroanatomical abnormalities in dyslexia, without singling out any specific results. Thus they were not inaccurate, but just vague.

The remaining 17 could be divided up as follows:

Seven papers gave a broadly accurate account of the neuroanatomical findings. The most detailed accurate account was by Galaburda et al (1985) who noted:

"Drake published neuropathological findings in a well-documented case of developmental dyslexia. He described a thinned corpus callosum particularly involving the parietal connections, abnormal cortical folding in the parietal regions, and, on microscopical examination, excessive numbers of neurons in the subcortical white matter. The illustrations provided did not show the parietal lobe, and the portion of the corpus callosum that could be seen appeared normal. No mention was made as to whether the anomalies were asymmetrically distributed."p. 227.

Four (three of them from the same research group) cited Drake as though there were two patients, rather than one, and focussed only on the the corpus callosum, without mentioning ectopias.

Six gave an inaccurate account of the findings. The commonest error was to be specific about the location of the ectopias, which (as is clear from the Galaburda quote above), was not apparent in the text or figures of the original paper. Five of these articles located the ectopias in the left parietal lobe, one more generally in the parietal lobe, and one in the cerebellum (where the patient's stroke had been).

So, if we discount those available articles that just gave a rather general reference to Drake's study, over half of the remainder got some information wrong – and the bias was in the direction of making this early study consistent with later research.

The paper is hard to get hold of**, and when you do track it down, it is rather long-winded. It is largely concerned with the psychological evaluation of the patient, including aspects, such as Oedipal conflicts, that seem fanciful to modern eyes, and the organisation of material is not easy to follow. Perhaps it is not so surprising that people make errors when reporting the findings. But if nothing else, this exercise reminded me of the need to check sources when you cite them. It is all too easy to think you know what is in a paper – or to rely on someone else's summary. In fact, these days I am often dismayed to discover I have a false memory of what is in my own old papers, let alone those by other people. But once in the literature, errors can propagate, and we need to be vigilant to prevent a gradual process of distortion over time. It is all too easy to hurriedly read a secondary source or an abstract: we (and I include myself here) need to slow down.

References
Drake, W. E. (1968). Clinical and pathological findings in a child with a developmental learning disability Journal of Learning Disabilities, 1(9), 486-502.
Galaburda, A. M., Sherman, G. F., Rosen, G. D., Aboitiz, F., & Geschwind, N. (1985). Developmental dyslexia: four consecutive cases with cortical anomalies. Annals of Neurology, 18, 222-233.

* I assume the figures are copyrighted so am not reproducing them here
**I thank Michelle Dawson for pointing out that the article can be downloaded from this site: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.949.4021&rep=rep1&type=pdf

Sunday, 1 October 2017

Pre-registration or replication: the need for new standards in neurogenetic studies

This morning I did a very mean thing. I saw an author announce to the world on Twitter that they had just published this paper, and I tweeted a critical comment. This does not make me happy, as I know just how proud and pleased one feels when a research project at last makes it into print, and to immediately pounce on it seems unkind. Furthermore, the flaws in the paper are not all that unusual: they characterise a large swathe of literature. And the amount of work that has gone into the paper is clearly humongous, with detailed analysis of white matter structural integrity that probably represents many months of effort. But that, in a sense, is the problem. We just keep on and on doing marvellously complex neuroimaging in contexts where the published studies are likely to contain unreliable results.

Why am I so sure that this is unreliable? Well, yesterday saw the publication of a review that I had led on, which was highly relevant to the topic of the paper – genetic variants affecting brain and behaviour. In our review we closely scrutinised 30 papers on this topic that had been published in top neuroscience journals. The field of genetics was badly burnt a couple of decades ago when it was discovered that study after study reported results that failed to replicate. These days, it's not possible to publish a genetic association in a genetics journal unless you show that the finding holds up in a replication sample. However, neuroscience hasn't caught up and seems largely unaware of why this is a problem.

The focus of this latest paper was on a genetic variant known as the COMT Val158Met SNP. People can have one of three versions of this genotype: Val/Val, Val/Met and Met/Met, but it's not uncommon for researchers to just distinguish people with Val/Val from Met carriers (Val/Met and Met/Met). This COMT polymorphism is one of the most-studied genetic variants in relation to human cognition, with claims of associations with all kinds of things: intelligence, memory, executive functions, emotion, response to anti-depressants, to name just a few. Few of these, however, have replicated, and there is reason to be dubious about the robustness of findings (Barnett, Scoriels & Munafo, 2008)

In this latest COMT paper – and many, many other papers in neurogenetics – the sample size is simply inadequate. There were 19 participants (12 males and 7 females) with the COMT Val/Val version of the variant, compared with 63 (27 males and 36 females) who had either Met/Met or Val/Met genotype. The authors reported that significant effects of genotype on corpus callosum structure were found in males only. As we noted in our review, effects of common genetic variants are typically very small. In this context, an effect size (standardized difference between means of two genotypes, Cohen's d) of .2 would be really large. Yet this study has power of .08 to detect such an effect in males – that is if there really is a difference of -0.2 SDs between the two genotypes, and you repeatedly ran studies with this sample size, then you'd fail to see the effect in 92% of studies. To look at it another way, the true effect size would need to be enormous (around 1 SD difference between groups) to have an 80% chance of being detectable, given the sample size.

When confronted with this kind of argument, people often say that maybe there really are big effect sizes. After all, the researchers were measuring characteristics of the brain, which are nearer to the gene than the behavioural measures that are often used. Unfortunately, there is another much more likely explanation for the result, which is that it is a false positive arising from a flexible analytic pipeline.

The problem is that both neuroscience and genetics are a natural environment for analytic flexibility. Put the two together, and you need to be very very careful to control for spurious false positive results. In the papers we evaluated for our review, there were numerous sources of flexibility: often researchers adopted multiple comparisons corrections for some of these, but typically not for all. In the COMT/callosum paper, the authors addressed the multiple comparisons issue using permutation testing. However, one cannot tell from a published paper how many subgroupings/genetic variants/phenotypes/analysis pathways etc were tried but not reported. If, as in mainstream genetics, the authors had included a direct replication of this result, that would be far more convincing. Perhaps the best way for the field to proceed would be by adopting pre-registration as standard. Pre-registration means you commit yourself to a specific hypothesis and analytic plan in advance; hypotheses can then be meaningfully tested using standard statistical methods. If you don’t pre-register and there are many potential ways of looking at the data, it is very easy to fool yourself into finding something that looks 'significant'.

I am sufficiently confident that this finding will not replicate that I hereby undertake to award a prize of £1000 to anyone who does a publicly preregistered replication of the El-Hage et al paper and reproduces their finding of a statistically significant male-specific effect of COMT Val158Met polymorphism on the same aspects of corpus callosum structure.

I emphasise that, though the new COMT/callosum paper is the impetus for this blogpost, I do not intend this as a specific criticism of the authors of that paper. The research approach they adopted is pretty much standard in the field, and the literature is full of small studies that aren't pre-registered and don't include a replication sample. I don't think most researchers are being deliberately misleading, but I do think we need a change of practices if we are to amass a research literature that can be built upon. Either pre-registration or replication should be conditions of publication.

PS. 3rd October 2017
An anonymous commentator (below) drew my attention to a highly relevant preprint in Bioarxiv by Jahanshad and colleagues from the ENIGMA-DTI consortium, entitled 'Do Candidate Genes Affect the Brain's White Matter Microstructure? Large-Scale Evaluation of 6,165 Diffusion MRI Scans'. They included COMT as one of the candidate genes, although they did not look at gender-specific effects. The Abstract makes for sobering reading: 'Regardless of the approach, the previously reported candidate SNPs did not show significant associations with white matter microstructure in this largest genetic study of DTI to date; the negative findings are likely not due to insufficient power.'

In addition, Kevin Mitchell (@WiringTheBrain) on Twitter alerted me to a blogpost from 2015 in which he made very similar points about neuroimaging biomarkers. Let's hope that funders and mainstream journals start to get the message.

Tuesday, 25 July 2017

Breaking the ice with buxom grapefruits: Pratiques de publication and predatory publishing

Guest blogpost by

Ryan McKay, Department of Psychology, Royal Holloway University of London

and

Max Coltheart, Department of Cognitive Science, Macquarie University

These days it is common for academics to receive invitations from unfamiliar sources to attend conferences, submit papers, or join editorial boards. We began an attack against this practice by not ignoring such invitations – by, instead, replying to them with messages selected from the output of the wonderful Random Surrealism Generator. It generates syntactically correct but surreal sentences such as “Is that a tarantula in your bicycle clip, or are you just gold-trimmed?” (a hint of Mae West there?). This sometimes had the desired effect of generating a bemused response from the inviter; but we decided more was needed.

So we used the surrealism generator to craft an absurdist critique of “impaired” publication practices (the title of the piece says as much, albeit obliquely). The first few sentences seem relevant to the paper’s title but the piece then deteriorates rapidly into a sequence of surreal sentences (we threw in some gratuitous French and Latin for good measure) so that no one who read the paper could possibly believe that it was serious (our piece also quotes itself liberally); and we submitted the paper to a number of journals. Specifically, we submitted the paper to every journal that contacted either of us in the period 21 June 2017 to 1 July 2017 inviting us to submit a paper. There were 10 such invitations. We accepted all of them, and submitted the paper, making minor changes to the title of the paper and the first couple of sentences to generate the impression that the paper was somehow relevant to the interests of the journal; but the bulk of the paper was always the same sequence of surreal sentences.

While we were engaged in this exercise, the blogger Neuroskeptic was doing something similar: we describe that work below. Both of us were of course following the honourable tradition of submissions as these by -->Peter Vamplew and Christoph Bartnek (More generally, there is a fine tradition of hoax articles intended as critiques of certain academic fields, e.g., postmodernism or theology).

What happened then?

All ten journals responded by informing us that our ms had been sent out for review. We did not hear anything further from four of them. A fifth, the SM Journal of Psychiatry and Mental Health, eventually responded “The ms was plagiarized so please make some changes to the content”. We did not respond to this request, nor to a subsequent request for resubmission.

The Scientific Journal of Neurology & Neurosurgery responded by telling us that our paper had been peer-reviewed; the reviewer praised our “scientific methodology” but chided us about our poor English (specifically, they said “English should be rewritten, it is necessary a correction of typing errors (spaces)”). We ignored this advice and resubmitted. However, the journal then noticed the similarity with the article we had submitted to the International Journal of Brain Disorders and Therapy (see below for this), so ceased production of our article.

The paper was accepted by Psychiatry and Mental Disorders: “accepted for publication by our reviewers without any changes”, we were told.

The paper was accepted by Mental Health and Addiction Research, but at that point we were told that a publication fee was due. We protested on the ground that when we had been invited to submit there had been no mention of a fee, and we said that unless a full fee waiver was granted we would take our work to a more appreciative journal. In response, we were granted a full fee waiver, and our paper was published in the on-line journal.

The SM Journal of Disease Markers also accepted the paper, and sent us proofs, which we corrected and returned. At that point, we were told that an article processing fee of US$920 was due. We protested in the same way, asking for a full fee waiver. In response, they offered a reduced fee of $520. We did not respond, so this paper, although accepted, has not been published.

The tenth journal, the International Journal of Brain Disorders and Therapy, sent us one reviewer comment. The reviewer had entered into the spirit of the hoax by providing a review which was itself surrealistic. We incorporated this reviewer’s comment about Scottish Lithium Flying saucers and resubmitted, and the paper was accepted. The journal then noticed irregularities in some (but surprisingly not all) of the references. We replaced these problematic references with citations of recent and classic hoaxes (e.g., Kline & Saunders’ 1959 piece on “psychochemical symbolism”; Lindsay & Boyle’s recent piece on the “Conceptual Penis”), along with a citation of Pennycook et al’s article “On the reception and detection of pseudo-profound bullshit”. The paper was then published in the on-line journal. Later this journal asked us for a testimonial about the review process, which we supplied: "The process of publishing this article was much smoother than we anticipated".

In sum: all ten journals to which we submitted the paper sent it out for review, even though any editor had only to read to the end of the first paragraph to come across this:
“Of course, neither cognitive neuropsychiatry nor cognitive neuropsychology is remotely informative when it comes to breaking the ice with buxom grapefruits. When pondering three-in-a-bed romps with broken mules, therefore, one must refrain, at all costs, from driving a manic-depressive lemon-squeezer through ham (Baumard & Brugger, 2016).”

Of these ten journals, two tentatively accepted the paper and four fully accepted it for publication. Two of these journals have already published it.

The blogger Neuroskeptic did this a little differently (see http://blogs.discovermagazine.com/neuroskeptic/2017/07/22/predatory-journals-star-wars-sting/#.WXbIstP5hTF ). A hoax paper entitled “Mitochondria: Structure, Function and Clinical Relevance” was prepared. It did not contain any nonsensical sentences, as our paper did, but its topic was the fictional cellular entities “midi-chlorians” (which feature in Star Wars). The paper was submitted to nine journals. Four accepted it. One of these charged a fee, which the author declined to pay; the other three charged no fee, and so the paper has been published in all three of these papers, the International Journal of Molecular Biology: Open Access (MedCrave), the Austin Journal of Pharmacology and Therapeutics (Austin) and American Research Journal of Biosciences (ARJ). In order to know that this paper was nonsense, one would need some knowledge of cell biology. But our paper is blatantly nonsensical to any reader; and yet it boasted an acceptance rate very similar to that of Neuroskeptic’s paper.

What can be learned from our exercise? Several things:

(a) It is clear that with these journals there is no process by which a submission is initially read by an editor to decide whether the paper should be sent out for review, because our paper could not possibly have survived any such inspection.

(b) But nor should our paper have survived any serious review process, since any reviewer reading the paper would have pointed out its nonsensical content. Only twice did a journal send us feedback from a reviewer, one which said we should discuss Lithium Flying Saucers, and one which seemed suspect to us because its criticism of our English was expressed in such poor English.

(c) In contrast to this apparent lack of human intervention in the article-handling process, there was some software intervention: some of these journals appear routinely to apply plagiarism-detection software to submitted articles

(d) What’s in this for the journals? We assumed that they exist solely to make money by charging authors. We presume that, just as they attempt to build apparently legitimate editorial boards (see here), these journals will sometimes waive their fees so as to get some legitimate-seeming articles on their books, the better to entice others to submit.

Sunday, 2 July 2017

The STEP Physical Literacy programme: have we been here before?

One day in 2003, I turned on BBC Radio 4 and found myself listening to an interview on the Today Programme with Wynford Dore, the founder of an educational programme that claimed to produce dramatic improvements in children's reading and attentional skills. The impetus for the programme was a press release of a study published in the journal Dyslexia, reporting results from a trial of the programme with primary school-children. The interview seemed more like an advertisement than a serious analysis, but the consequent publicity led many parents to sign up for the programme, both in the UK and in other countries, notably Australia.

The programme involved children doing two 10-minute sessions per day of exercises designed to improve balance and eye-hand co-ordination. These were personalised to the child, so that the specific exercises would be determined by level of progress in particular skills. The logic behind the approach was that these exercises trained the cerebellum, a part of the brain concerned with automatizing skills. For instance, when you first play the piano or drive a car, it is slow and effortful, but after practice you can do it automatically without thinking about it. The idea was that cerebellar training would lead to a general cerebellar boost, helping other tasks, such as reading, to become more automatic.

Various experts who were on the editorial board of Dyslexia were unhappy with the quality of the research and asked for the paper to be retracted. When no action was taken, a number of them resigned. In 2007, I published a detailed critique of the study, which by that time had been complemented by a follow-up – which had prompted further editorial resignations.

Meanwhile, Wynford Dore, who had considerable business acumen, continued to promote the Dore Programme, writing a popular book describing its origins, and signing up celebrities to endorse it. Among these were rugby legends Kenny Logan and Scott Quinnell. In addition, Dore was in conversations with the Welsh Assembly about the possibility of rolling the programme out in Welsh schools. He had also persuaded Conservative MP Christopher Chope that the Dore programme was enormously effective but was being suppressed by government.

Various bloggers were interested in the amazing uptake of the Dore Programme, and in 2008, Ben Goldacre wrote a trenchant piece on his Bad Science blog, noting among other things that Kenny Logan was paid for some of his promotional work. The nail in the coffin of the Dore Programme was an Australian documentary in the Four Corners series, which included interviews with Dore, some of his customers, and scientists who had been involved both in the evaluation and the criticisms. The Dore business, which had been run as a franchise, collapsed, leaving many people out of pocket: parents who had paid up-front for a long-term intervention course, and staff at Dore centres, who found themselves out of a job.

The Dore programme did not die completely, however. Scott Quinnell continued to market a scaled-down version of the programme through his company Dynevor, but was taken to task by the Advertising Standards Authority for making unsubstantiated claims. Things then went rather quiet for a while.

This year, however, I have been contacted by concerned teachers who have told me about a new programme, STEP Physical Literacy, which is being promoted for use in schools, and which bears some striking similarities to Dore. Here are some quotes from the STEP website:

Pupils undertake 2 ten minute exercise sessions at the start and end of each school day. The exercises focus on the core skills of balance, eye-tracking and coordination.
STEP is a series of personalised physical exercises that stimulate the cerebellum to function more efficiently.
The STEP focus is on the development of physical capabilities that should be automatic such as standing still, riding a bike or following words on a page.

In addition, STEP Physical Literacy is being heavily promoted by Kenny Logan, who features several times on the News section of the website.

As with Dore, STEP has been promoted to politicians, who argue it should be introduced into schools. In this case, the Christopher Chope role is fulfilled by Liz Smith MSP, who appears to be sincerely convinced that Scotland's literacy problems can be overcome by having children take two 10 minute sessions out of lessons to do physical exercises.

On Twitter, Ben Goldacre noted that the directors of Dynevor CIC, overlap substantially with directors of Step2Progress, who own STEP. The registered address is the same for the two companies.

When asked about Dore, those involved with STEP deny any links. After I tweeted about this, I was emailed by Lucinda Roberts Holmes, Managing Director of STEP, to reassure me that STEP is not a rebranding of Dore, and to suggest we meet so she could "talk through the various pilots and studies that have gone on both in the UK and the US as well as future research RCTs planned with Florida State University and the University of Edinburgh." I love evidence, but I find it best to sit down with data rather than have a conversation, so I replied explaining that and saying I'd be glad to take a look at any written reports. So far nothing has materialised. I should add that I have not been able to find any studies on STEP published in the peer-reviewed literature, and the account of the pilot study and case studies on the STEP website does not given me confidence that these would be publishable in a reputable journal.

In short, the evidence to date does not justify introducing this intervention into schools: there's no methodologically adequate study showing effectiveness, and it carries both financial costs and opportunity costs to children. It's a shame that the field of education is so far behind medicine in its attitude to evidence, and that we have politicians who will consider promoting educational interventions on the basis of persuasive marketing. I suggest Liz Smith talks to the Education Endowment Foundation, who will be able to put her in touch with experts who can offer an objective evaluation of STEP Physical Literacy.

8th July 2017: Postscript. A response from STEP
I have had a request from Lucinda Roberts-Holmes, Managing Director of Step2Progress, to remove this blogpost on the grounds that it contains defamatory and inaccurate information. I asked for more information on specific aspects of the post that were problematic and obtained a very long response, which I reproduce in full below. Readers are invited to form their own interpretation of the facts, based on the response from STEP (in italics) and my comments on the points raised.

Preamble: To be clear your blog in its current form includes a number of statements which are factually incorrect. In particular, the suggestion that STEP is simply a reincarnation of the Dore programme is not true as I have already explained to you (see my email of 29 June). The fact that you chose to ignore that assurance and instead publish the blog is very concerning to us. The suggestion, also, that I had chosen not to reply to your email ("so far nothing has materialised") is, I am afraid, disingenuous particularly in circumstances where you did not even set a deadline in your email and you waited only 72 hours to post your blog. Had you, of course, waited to receive a response to your email, we would have explained the correct position to you. Similarly, had you carried out an objective comparison of the two programmes you would have noted the many differences between STEP and Dore and, more significantly, identified the fact that STEP makes absolutely none of the assertions about cures for Dyslexia and other learning difficulties or any other of the hypotheses that Wynford Dore concocted. They are not the same programme evidenced not least by the fact that STEP states its programme is not a SEN learning intervention.

Comment: a) I did not state in the blog that STEP is 'simply a reincarnation of the Dore Programme'. I said it bears some striking similarities to Dore.

b) I did not ignore Lucinda's reassurance that STEP is not a rebranding of Dore. On the contrary, I stated in the blogpost that I had received that reassurance from her.

c) I did not suggest that Lucinda had chosen not to reply to my email: I simply observed that I had not so far received a response. As my blogpost points out, I had made it clear in my initial email that I did not want her to 'explain the correct position' to me. I had specifically requested written reports documenting the evidence for effectiveness of STEP.

1. Despite what Ben Goldacre may believe, Kenny Logan (KL) was not paid by the Dore programme for "promotional work". He was, in fact, a paying customer of the programme who went from being unable to read at the start of the programme to being literate by the end of it. KL was happy to share his experience publicly and was very clear with Dore that he would not be paid to do this. Whilst it is true that in 2006, he was contracted and paid by Wynford Dore for his professional input into a sports programme that he was seeking to develop that is an entirely different matter. The suggestion that KL was only promoting the Dore programme for his own financial benefit is clearly defamatory of him (and indeed of us).

I asked Ben Goldacre about this. The claim about Logan's payment for promotional work was made in a Comment is Free article in the Guardian. Ben told me it all went through a legal review at the Guardian to ensure everything was robust, and no complaints were received from Kenny Logan at the time. If the claim is untrue, then Kenny Logan needs to take this up with the Guardian. It's unclear to me why Kenny Logan promoting Dore would be defamatory of STEP, given that STEP claims to have no association with Dore.

2. The fact that KL previously promoted the Dore programme also does not support the allegation that the STEP programme is the same as the Dore programme. They are very different programmes and we are a very different organisation to Dore. Incorrectly stating that KL was paid for the promotion of Dore and trying to draw an inference that therefore he is paid to promote STEP (which he is not) is also misleading.

Comment: I made no claims that Kenny Logan is paid to promote STEP. He is a shareholder in STEP2Progress, which is a different matter.

3. Dynevor was never "Scott Quinnell's Company". Dynevor was primarily owned by Tim Griffiths and was the organisation that purchased the intellectual property rights in Dore after it went bankrupt. Tim Griffiths had no prior connection to Wynford Dore or the Dore programme but did have an interest in the link between exercise and ability to learn. As many thousands of people had been left in a difficult position when Dore collapsed into administration having purchased a programme they could not continue the directors at Dynevor agreed to commit the funding necessary to allow those who wanted to continue the programme the opportunity to do so. Scott Quinnell had a shareholding of less than 1% in Dynevor. STEP has absolutely no association with Scott Quinnell.

Comment: The role of Scott Quinnell in Dynevor is not central to my description of Dore, but this account of his role seems disingenuous. According to Companies House, Quinnell was appointed as one of two Directors of Dynevor C.I.C in 2009, and his interest in the company in 2011 was 2.6% of the shareholding, at a time when Wynford Dore had a shareholding of 4.3%.

I have not claimed that Scott Quinnell has any relationship with STEP. My account of his dealings was to provide a brief history of the problems with Dore for readers unfamiliar with the background.

4. You refer to the claims Ben Goldacre has made on Twitter that the directors of Dynevor CIC "overlap substantially" with the directors of STEP. In fact, of the 8 Directors of Dynevor only 2 hold directorships at STEP. In any event that misses the point which is that none of the directors of STEP had any association with the Dore Programme prior to the purchase of intellectual property rights in 2009.

Comment: According to Companies House, the one 'active person with significant control' in Dynevor CIC is Timothy Griffiths, and the 'one active person with significant control' in STEP2Progress is Conor Davey. If I have understood this correctly, this is based on shareholdings. Timothy Griffiths is one of four Directors of STEP2Progress, and Conor Davey is the Chairman of Dynevor CIC. Dynevor CIC and STEP2Progress have the same postal address.

It wasn't quite clear if Lucinda was saying that Dynevor CIC is now disassociated from Dore, but if that is the case, it would be wise to update the company's LinkedIn Profile, which states that the company 'provides the Dore Programme to individual clients and schools around the UK and licences the rights to provide the Dore Programme in a number of overseas countries'.

5. It is not correct to state that STEP denies any links to the Dore programme. There is, of course, a link, as there is also to the work of Dr Frank Belgau and his studies into balametrics. There is also a link to other movement programmes such as Better Movers and Thinkers and Move to Learn. What we have said is that the STEP programme is not the Dore programme and we stand by this. You may seek to draw similarities between them as I could between apples and pears.

Comment: Nowhere in my blogpost did I state that STEP denies any links to the Dore programme.

Re Belgau: I have just done a search on Web of Science that returned no articles for either author = Belgau or topic = balametrics.

6. May I also ask how you can state that "the evidence to date does not justify introducing this intervention in to schools" when you have refused so far to meet with me or even seen the evidence or read the full Pilot Study? Have you asked any teachers or head teachers who have experience of delivering the STEP Programme whether they would recommend to their peers the use of the programme in their schools?

Comment: There is a fundamental misunderstanding here about how scientists evaluate evidence. If you want to find out whether an intervention is effective, the worst thing you can do is to talk to people who are convinced that it is. There are people who believe passionately in all sorts of things: the healing powers of crystals, the harms of vaccines, the benefits of homeopathy, or the evils of phonics instruction. They will, understandably, try to convince you that they are right, but they will not be objective. The way to get an accurate picture of what works is not by asking people what they think about it, but by doing well-controlled studies that compare the intervention with a control condition in terms of children's outcomes. It is for this reason that I have been asking for any hard evidence that STEP2Progress has from properly conducted studies or information about future-planned studies, which I am told are in the pipeline. I would love to read the full Pilot Study, but am having difficulty accessing it (see below).

7. You say in your blog "It is a shame that... We have politicians who will consider promoting educational interventions on the basis of persuasive marketing" Presumably this is a reference to Liz Smith MSP (LS) who you refer to separately in the blog? For your information, LS has read the full research report of the 2015/2016 Pilot Study as well as the other case studies. In light of that information, she has indicated the she is impressed with the STEP programme and that the Scottish Government should consider piloting it and looking more widely at the impact of physical literacy on academic attainment. At the point she expressed this view there had not been any marketing of the STEP programme in Scotland so I do not understand the evidence to support the statement you make in the blog.

Comment: In this regard Liz Smith has the advantage. Although Lucinda has now sent me three emails since my blogpost appeared, in none of them did she send me the reports I had initially requested. In my latest email I asked to see the 'full research report' that Liz Smith had access to. I got this reply from Lucinda:

Dear Dorothy,

Thank you for your email. With the greatest respect, I think the first step should be for you to correct or remove your blog and apologise for the inaccuracies I have outlined below. Alongside that I repeat my offer to come and talk you through the STEP programme and the studies that have been carried out so far. As I say, we are not the same programme as the Dore programme and it is wrong to allege otherwise.

Kind regards Lucinda

Nevertheless, with her penultimate email, Lucinda attached a helpful Excel spreadsheet documenting differences between Dore and STEP, as follows:

Difference 1. The Dore Programme was a paper book of 100 exercises followed sequentially. Dore's assertions that they were personalised were untrue. STEP software contains over 350 exercises delivered through an adaptive learning software platform that is individualised to the child based on previous performance. The Programme also contains 10 minutes of 1-1 time with each pupil twice per day (nurture) and involves pupils overcoming a series of physical challenges (resilience) in a non class-competitive environment (success cycle) which displays their commitment levels (engagement) and is overseen by committed members of staff who also work with them in the classroom (mentoring and translational trust building).

Comment: The question of interest is where do these exercises come from? How were they developed? Usually for an adaptive learning process, one needs to do prior research to establish difficulty levels of items for children of different ages. I raised this issue with the original Dore programme: there is no published evidence of the kind of foundational work you'd normally expect for an educational programme. Readers will no doubt be intereted to hear that STEP has more exercises than Dore and delivers these in a specific, personalised sequence, but what is missing is a clear rationale explaining how and why specific exercises were developed. It would also be of interest to know how many of Dore's original 100 exercises are incorporated in STEP.

Difference 2. Dore was an exercise programme completed by adults and children at home supervised by untrained parents. STEP is delivered in schools and overseen by teaching staff trained through industry leader Professor Geraint Jones' teacher training programme. This also includes training on how to assess pupil performance.

Comment. If the intervention is effective, then standardized administration by teachers is a good thing. If it is not effective, then teachers should not be spending time and money being trained. Everything hinges on evidence for effectiveness (see below).

Difference 3. Dore asserted that the programme was a cure for dyslexia and and other learning difficulties. It further claimed to know the cause of these learning difficulties. STEP makes absolutely no assertions about Dyslexia, ADHD or other learning difficulties and absolutely no assertions about the medical cause for these.

Comment. I am sure that there are many people who will be glad to have the clarification that STEP is not designed to treat children with specific learning difficulties or dyslexia, as there appears to be some misunderstanding of this. This may in part be the consequence of Kenny Logan's involvement in promoting STEP. Consider, for instance, this piece in the Daily Mail, which first describes how Kenny's dyslexia was remediated by the Dore programme, and then moves to talk of his worries over his son Reuben, who was having difficulties in school:

"The answer was already staring him in the face, however, and within months, Kenny decided to try putting Reuben through a similar 'brain-training' technique to the one that transformed his own life just 14 years ago. Reuben, it transpired, had mild dyspraxia - a condition affecting mental and physical co-ordination - and the outcome for him has been so successful that Kenny is currently trying to persuade education chiefs to implement the technique in the country's worst-performing state schools, to raise attainment levels."

Another reason for confusion may be because the STEP home page lists the British Dyslexia Association as a partner and has features in the News section of its website on Dyslexia Awareness Month , on unidentified dyslexia, and a case study describing use of STEP with dyslexic children in Mississippi.

The transcript of the debate in the Scottish Parliament (scroll down to the section on Motion debated: That the Parliament is impressed by the STEP physical literacy programme) shows that many of the Scottish MPs who took part in the debate with Liz Smith were under the impression that STEP was a treatment for specific learning disabilities such as dyslexia and ADHD, as evident from these quotes:

Daniel Johnson: 'It is vital that we understand that there is a direct link between physical understanding, learning, knowledge and ability and educational ability. Overall - and specifically - there would be key benefits for people who have conditions such as ADHD and dyslexia... There is a growing body of evidence about the link between spatial awareness and physical ability and dyslexia. Likewise, the improvements on focus and concentration that exercises such as those that are outlined in the STEP programme can have for people with ADHD are clear. Improvements in those areas are linked not only to training the mind to concentrate, but to the impacts on brain chemistry.'

Elaine Smith: With regard to STEP, we have already heard that it is a programme of exercises performed twice a day for 10 minutes and focuses in particular on balance, eye tracking and co-ordination with the aim of making physical activity part of children's everyday learning. Improving physical literacy is particularly advantageous for children and young people who can find it difficult to concentrate, such as those with dyslexia and autism... STEP also has the backing of the British Dyslexia Association, which supported the findings of the pilot study.

Shirley-Anne Somerville: We are aware that the STEP programme has been promoted for children who have dyslexia.

Difference 4. Dore claimed that completing the exercises would repair a damaged or underdeveloped cerebellum. It is known that repetitive physical exercises stimulate the cerebellum but STEP makes no assertions of science that any physiological changes take place. STEP involves using repetitive physical exercises to embed actions and make them automatic.

Comment: It is good to see that some of the more florid claims of Dore are avoided by STEP, but the fact remains that the underlying theory is similar, namely that cerebellar training will improve skills beyond motor skills. The idea that training motor skills will produce effects that generalise to other aspects of development is is dubious because the cerebellum is a complex organ subserving a range of functions and controlled studies typically find that training effects are task-specific. I discussed these issues in relation to the Dore programme here.

Specific statements about the cerebellum on the STEP website are:

'After going on national television to tell his heart-breaking story about facing up to the frustrations of overcoming a childhood stumbling block bigger than Mount Everest, Kenny (Logan) is determined to highlight the positive effects of using cerebellum specific teaching and learning programmes in primary school settings.'

And on this page of the website we hear: 'In the last century, academics experimenting with balametrics, dance and movement, established that specifically stimulating the cerebellum through exercise improves skill automation. The STEP Programme is built upon this foundation.'

Difference 5. Dore was a "medical" treatment that required participants to regularly visit treatment centres for "medical" evaluations to determine whether their learning difficulty was being cured. STEP is a primary school physical literacy programme delivered by teaching assistants or other teaching staff. It is to date shown to be most impactful on the lower quartile of the classroom in terms of academic improvement.

This is a rather odd interpretation of the Dore programme, which perhaps is signalled by the use of quotes around 'medical'. I never had the impression it was medical ╨ it was not prescribed or administered by doctors. It is true that Dore did establish centres for assessment and this proved to be a major reason for its commercial failure: there were substantial costs in premises, staffing and equipment. But there was no necessity to run the intervention that way: some people at the time of the collapse suggested it would be feasible to offer the exercises over the internet at much lower cost.

The second point, re the greatest benefits for the lower quartile of the classroom, is on the one hand of potential interest, but on the other hand raises the concern that the benefits could be a classic case of the regression to the mean. This is one of many ways in which scores can improve on an outcome measure for spurious reasons - which is why you need proper randomised controlled trials. Improvements are largely uninterpretable without these because increases in scores can arise because of practice, maturation, regression to the mean or placebo effects.

Difference 6. Dore determined "progress" and "cure" via a series of physical assessments. STEP empirically measures the academic progress of pupils with baseline data and presents reports against actual physical skills developed inviting schools to draw their own conclusions in the context of their school setting.

Comment. Agree that Dore's method of measuring progress and cure was a major problem, because a child could improve on the measures of balance and eye-hand co-ordination and be deemed 'cured' even though their reading had not improved at all. But the account of STEP sounds too vague to evaluate - and the evidence on their website from the pilot study is so underspecified as to be uninterpretable. It is not clear what the measures were, and which children were involved in which measures. I would like to see the full report to have a clearer idea of the methods and results.

Difference 7. Dore claimed that the exercises were developed and delivered in a formulaic manner that was a trade secret. STEP focuses on determining whether a pupils core physical capabilities in balance, eye tracking and coordination. There is no secret formula or claims of one. The genesis of STEP is in balametrics as well as other movement programmes such as Better Movers and Thinkers https://www.ncbi.nlm.nih.gov/pubmed/27247688 and Move to Learn https://www.movetolearn.com.au/research/

Comment. In STEP, how are the scores on core physical capabilities standardized for age and sex? This refers back to my earlier comment about the development work needed to underpin an effective programme. The impression is that people in this field borrow ideas from previous programmes but there is no serious science behind this.

Difference 8. The Dore Programme cost over £2000 per person and was paid for individually. STEP costs £365 per year per child and is completed over 2 years. It is largely paid for through schools that have the discretion to ask parents to fund the programme if it is an additional intervention being offered. STEP also commits a significant number of places to schools free of charge. The fee includes year round school support

Comment. Good to have the differences in charging methods clarified.

Difference 9. Dore published research based around a single school with hypotheses relating to the cerebellum and dyslexia that could not be substantiated. It used dyslexic tendencies as a measure of improvement and selection. STEP as an organisation is wholly open to independent research and evaluation. Its initial pilot study was designed and led by the IAPS Education Committee and conducted by Innovation Bubble, led by Dr Simon Moore, University of Middlesex and Chartered Psychologist. It was held across 17 schools. Further pilot studies have taken place carried out by education districts in Mississippi and ESCCO as well as independent case studies. These have always been presented openly and in the context they were compiled. STEP believes it has sufficient evidence to warrant a large scale evaluation of the Programme.

Comment. In the context of intervention evaluation, quantity of research does not equate with quality. Here is Wikipedia's definition of a pilot study: 'A small scale preliminary study conducted in order to evaluate feasibility, time, cost, adverse events, and effect size (statistical variability) in an attempt to predict an appropriate sample size and improve upon the study design prior to performance of a full-scale research project.' I agree that a large-scale evaluation of the Programme is warranted. It's a bit odd to say the results have been presented openly while at the same time refusing to send me reports unless I take down my blogpost.

It is clear that the MSPs in the debate in the Scottish Parliament were all, without exception, convinced that we already had evidence for the effectiveness of STEP. If they based these impressions on the information on the STEP website (as suggested by Liz Smith's initial statement), then this is worrying, as this came from the pilot study, where the methods were not clearly described, and the description of the results is unclear and looks incomplete, or from uncontrolled case studies.

Here are some of the statements from MSPs:

Liz Smith: As members know, the programme has been used successfully in both England and the United States, and it has been empirically evidenced to reduce the attainment gap in primary school pupils. Pupils who have completed STEP have shown significant improvements academically, behaviourally, physically and socially. A United Kingdom pilot last year compared more than 100 below-attainment primary school pupils who were on the STEP programme to a group of pupils at the same attainment level who were not. The improved learning outcomes that the study showed are extremely impressive: 86 per cent of pupils on the programme moved to on or above target in reading, compared with 56 per cent of the non-STEP group; 70 per cent of STEP pupils met their target for maths, compared with 30 per cent of the non-STEP group; and 75 per cent and 62 per cent of STEP pupils were on or above target for English comprehension and spelling respectively, compared with 43 per cent and 30 per cent of the non-STEP group.
In Mississippi, in the USA, more than 1,000 pupils have completed the programme over the past three years, and it is no coincidence that that state has seen significant improvement in fourth grade - which is the equivalent of P6 - reading and maths, which has resulted in the state being awarded a commendation for educational innovation.

Brian Whittle: The STEP programme is tried and tested, with measured physical, emotional and academic outcomes, especially in the lower percentiles.

Daniel Johnson: Perhaps most impressive is the STEP programme's achievements on academic improvement╤it has led to improved English for 76 per cent of participants, and to improved maths, reading and spelling for 70 per cent of participants. The benefits that physical literacy can bring to academic attainment are clear.

Oliver Mundell: the STEP programme has been shown to work and is popular with both the teachers and the pupils who have benefited from it in England and the USA.

Conclusion This has been a very long postscript, but it seems important to be clear about what the objections to STEP are. I have not claimed that STEP is exactly the same as Dore. My sense of déjà vu arises because of the similarities, in the people involved, in the use of cerebellar exercises involving balance and eye-hand coordination delivered in short sessions, and in the successful promotion of the programme to politicians and schools in the absence of adequate peer-reviewed evidence. Given that the basic theory does not have strong scientific plausibility, this latter point that is the source of greatest concern. We can agree that we all want children to succeed in school and any method that can help them achieve this is to be welcomed. There is also, however, a need for better education of our politicians, so that they are equipped to evaluate evidence properly. They have a responsibility to ensure we do the best for our children, but this requires a critical mindset.