Showing posts with label p-values. Show all posts

Tuesday, 26 December 2017

Using simulations to understand p-values

Intuitive explanations of statistical concepts for novices #4

The p-value is widely used but widely misunderstood. I'll demonstrate this in the context of intervention studies. The key question is how confident can we be that an apparently beneficial effect of treatment reflects a change due to the intervention, rather than arising just through the play of chance. The p-value gives one way of deciding that. There are other approaches, including those based on Bayesian statistics, which are preferred by many statisticians. But I will focus here on the traditional null hypothesis significance testing (NHST) approach, which dominates statistical reporting in many areas of science, and which uses p-values.

As illustrated in my previous blogpost, where our measures include random noise, the distorting effects of chance mean that we can never be certain whether or not a particular pattern of data reflects a real difference between groups. However, we can compute how probable the observed data would be if there were in fact no effect of the intervention.

There are two ways to do this. One way is by simulation. If you repeatedly run the kind of simulation described in my previous blogpost, specifying no mean difference between groups, each time taking a new sample, for each result you can compute a standardized effect size. Cohen's d is the mean difference between groups expressed in standard deviation units, which can be computed by subtracting the group A mean from the group B mean, and dividing by the pooled standard deviation (i.e. the square root of the average of the variances for the two groups). You then see how often the simulated data give an effect size at least as large as the one observed in your experiment.
Histograms of effect sizes obtained by repeatedly sampling from a population where there is no difference between groups*
Figure 1 shows the distribution of effect sizes for two different studies: the first has 10 participants per group, and the second has 80 per group. For each study, 10,000 simulations were run; on each run, a fresh sample was taken from the population, and the standardized effect size, d, computed for that run. The peak of each distribution is at zero: we expect this, as we are simulating the case of no real difference between groups – the null hypothesis. But note that, though the shape of the distribution is the same for both studies, the scale on the x-axis covers a broader range for the study with 10 per group than for the study with 80 per group. This relates to the phenomenon shown in Figure 5 of the previous blogpost, whereby estimates of group means jump around much more when there is a small sample.

The dotted red lines show the cutoff points that identify the top 5%, 1% and 0.1% of the effect sizes. Suppose we ran a study with 10 people per group and it gave a standardized effect size of 0.3. We can see from the figure that a value in this range is fairly common when there is no real effect: around 25% of the simulations gave an effect size of at least 0.3. However, if our study had 80 people per group, then the simulation tells us this is an improbable result to get if there really is no effect of intervention: only 2.7% of simulations yield an effect size as big as this.
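The simulation behind these numbers can be sketched in code. The post's figures come from an R script (linked in the footnote below); this is an illustrative Python equivalent, not that script, using the same sample sizes and 10,000 runs:

```python
import numpy as np

rng = np.random.default_rng(1)

def cohens_d(a, b):
    """Standardized mean difference: (mean(b) - mean(a)) / pooled SD."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (b.mean() - a.mean()) / pooled_sd

def null_effect_sizes(n_per_group, n_sims=10_000):
    """Effect sizes when both groups are drawn from the same population."""
    return np.array([
        cohens_d(rng.normal(0, 1, n_per_group), rng.normal(0, 1, n_per_group))
        for _ in range(n_sims)
    ])

d_null_10 = null_effect_sizes(10)
d_null_80 = null_effect_sizes(80)

# Simulation-based p-value: how often does pure chance produce an
# effect size at least as large as the observed one (here d = 0.3)?
print((d_null_10 >= 0.3).mean())  # around .25 with 10 per group
print((d_null_80 >= 0.3).mean())  # around .03 with 80 per group
```

With a different seed the exact proportions wobble by a fraction of a percent, but the contrast between the two sample sizes is stable.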

The p-value is the probability of obtaining a result at least as extreme as the one that is observed, if there really is no difference between groups. So for the study with N = 80, p = .027. Conventionally, a level of p < .05 has been regarded as 'statistically significant', but this is entirely arbitrary. There is an inevitable trade-off between false positives (type I errors) and false negatives (type II errors). If it is very important to avoid false positives, and you do not mind sometimes missing a true effect, then a stringent p-value is desirable. If, however, you do not want to miss any finding of potential interest, even if it turns out to be a false positive, then you could adopt a more lenient criterion.

The comparison between the two sample sizes in Figure 1 should make it clear that statistical significance is not the same thing as practical significance. Statistical significance simply tells us how improbable a given result would be if there were no true effect. The larger the sample size, the smaller the effect size that can be detected at a threshold such as p < .05. Small samples are generally a bad thing, because they only allow us to reliably detect very large effects. But very large samples have the opposite problem: they allow us to detect as 'significant' effects that are so small as to be trivial. The key point is that a researcher conducting an intervention study should start by considering how big an effect would be of practical interest, given the cost of implementing the intervention. For instance, you may decide that staff training and time spent on a vocabulary intervention would only be justified if it boosted children's vocabulary by at least 10 words. If you knew how variable children's scores were on the outcome measure, the sample size could then be determined so that the study has a good chance of detecting that effect while minimising false positives. I will say more about how to do that in a future post.
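That sample-size calculation can be roughed out with the standard normal-approximation power formula. This is only a sketch: the 10-word target comes from the example above, but the assumed SD of 20 words is invented purely for illustration.

```python
import math
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate participants per group for a two-sided, two-sample
    comparison, via the normal approximation: n = 2 * ((z_a + z_b) / d)^2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_power = z.inv_cdf(power)          # about 0.84 for 80% power
    return math.ceil(2 * ((z_alpha + z_power) / d) ** 2)

# Hypothetical numbers: a gain of 10 words matters, and vocabulary
# scores are assumed to have an SD of 20 words, so d = 10/20 = 0.5
print(n_per_group(10 / 20))  # 63 per group
```

The normal approximation runs a participant or so low compared with an exact t-based calculation (which gives 64 per group here), but it is fine as a first rough estimate.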

I've demonstrated p-values using simulations in the hope that this will give some insight into how they are derived and what they mean. In practice, we would not normally derive p-values this way, as there are much simpler ways to do this, using statistical formulae. Provided that data are fairly normally distributed, we can use statistical approaches such as ANOVA, t-tests and linear regression to compute probabilities of observed results (see this blogpost). Simulations can, however, be useful in two situations. First, if you don't really understand how a statistic works, you can try running an analysis with simulated data. You can either simulate the null hypothesis by creating data from two groups that do not differ, or you can add a real effect of a given size to one group. Because you know exactly what effect size was used to create the simulated data, you can get a sense of whether particular statistics are sensitive to detect real effects, and how these might vary with sample size.
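A minimal sketch of that first use, assuming scipy is available: simulate two groups with a known true effect, run an off-the-shelf t-test, and see how often the effect is detected at p < .05. The true effect of d = 0.5 and the sample sizes are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

def detection_rate(true_d, n_per_group, n_sims=2000, alpha=0.05):
    """Proportion of simulated studies in which a t-test gives p < alpha."""
    hits = 0
    for _ in range(n_sims):
        control = rng.normal(0, 1, n_per_group)
        treated = rng.normal(true_d, 1, n_per_group)  # known true effect
        if ttest_ind(control, treated).pvalue < alpha:
            hits += 1
    return hits / n_sims

print(detection_rate(0.0, 20))  # no true effect: close to alpha
print(detection_rate(0.5, 20))  # true effect, small sample: often missed
print(detection_rate(0.5, 80))  # same effect, larger sample: usually found
```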

The second use of simulations is for situations where the assumptions of statistical tests are not met – for instance, if data are not normally distributed, or if you are using a complex design that incorporates multiple interacting variables. If you can simulate a population of data that has the properties of your real data, you can then repeatedly sample from this and compute the probability of obtaining your observed result to get a direct estimate of a p-value, just as was done above.
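As an illustration of that second use, here is a minimal permutation test: the group labels are repeatedly shuffled to build the null distribution directly from the data, with no normality assumption. The skewed example data are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

def permutation_p(control, treated, n_perm=10_000):
    """Two-sided permutation p-value for a difference in means: shuffle
    the group labels and count differences at least as extreme."""
    observed = abs(treated.mean() - control.mean())
    pooled = np.concatenate([control, treated])
    n = len(control)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(pooled[n:].mean() - pooled[:n].mean()) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one avoids a p of exactly zero

# Skewed (non-normal) outcome data, e.g. something like reaction times
control = rng.exponential(1.0, 30)
treated = rng.exponential(1.5, 30)
print(permutation_p(control, treated))
```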

The key point to grasp about a p-value is that it tells you how likely your observed evidence is, if the null hypothesis is true. The most widely used cutoff is .05: if the p-value in your study is less than .05, then the chance of your observed data arising when the intervention had no effect is less than 1 in 20. You may decide on that basis that it's worth implementing the intervention, or at least investing in the costs of doing further research on it.

The most common mistake is to think that the p-value tells you how likely the null hypothesis is, given the evidence. But that is something else. The probability of A (observed data) given B (null hypothesis) is not the same as the probability of B (null hypothesis) given A (observed data). As I have argued in another blogpost, the probability that you are a criminal, given that you are a man, is not high, but the probability that you are a man, given that you are a criminal, is much higher. This may seem fiendishly complicated, but a concrete example can help.
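A toy calculation makes the asymmetry concrete. All of the numbers below are invented for illustration:

```python
# Invented illustrative numbers: in a population of 1,000,000 people,
# half are men, 10,000 are convicted criminals, and 9,000 of those
# criminals are men.
men = 500_000
criminals = 10_000
male_criminals = 9_000

p_criminal_given_man = male_criminals / men        # P(A given B)
p_man_given_criminal = male_criminals / criminals  # P(B given A)

print(p_criminal_given_man)  # 0.018 - being a man barely predicts crime
print(p_man_given_criminal)  # 0.9   - yet most criminals are men
```

The two conditional probabilities differ by a factor of fifty, even though both are computed from the same three counts.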

Suppose Bridget Jones has discovered three weight loss pills: if taken for a month, pill A is a totally ineffective placebo, pill B leads to a modest weight loss of 2 lb, and pill C leads to an average loss of 7 lb. We do studies with three groups of 20 people; in each group, half are given A, B or C and the remainder are untreated controls. We discover that after a month, one of the treated groups has an average weight loss of 3 lb, whereas their control group has lost no weight at all. We don't know which pill this group received. If we run a statistical test, we find the p-value is .45. This means we cannot reject the null hypothesis of no effect – which is what we'd expect if this group had been given the placebo pill, A. But the result is also compatible with the participants having received pill B or pill C. This is demonstrated in Figure 2, which shows the probability density function for each scenario – in effect, the outline of the histogram. The red dotted line corresponds to our obtained result, and it is clear that this result is quite probable regardless of which pill was used. In short, this result doesn't tell us how likely the null hypothesis is – only that the null hypothesis is compatible with the evidence that we have.
Probability density function for weight loss pills A, B and C, with red line showing observed result

Many statisticians and researchers have argued we should stop using p-values, or at least adopt more stringent levels of p. My view is that p-values can play a useful role in contexts such as the one I have simulated here, where you want to decide whether an intervention is worth adopting, provided you understand what they tell you. It is crucial to appreciate how dependent a p-value is on sample size, and to recognise that the information it provides is limited to telling you whether an observed difference could just be due to chance. In a later post I'll go on to discuss the most serious negative consequence of misunderstanding of p-values: the generation of false positive findings by the use of p-hacking.

*The R script to generate Figures 1 and 2 can be found here.

Tuesday, 26 January 2016

The Amazing Significo: why researchers need to understand poker

©www.savagechickens.com
Suppose I tell you that I know of a magician, The Amazing Significo, with extraordinary powers. He can undertake to deal you a five-card poker hand which has three cards with the same number.

You open a fresh pack of cards, shuffle the pack and watch him carefully. The Amazing Significo deals you five cards and you find that you do indeed have three of a kind.

According to Wikipedia, the chance of this happening when dealing from an unbiased deck of cards is around 2 per cent - so you are likely to be impressed. You may go public to endorse The Amazing Significo's claim to have supernatural abilities.

But then I tell you that The Amazing Significo has actually dealt five cards to 49 other people that morning, and you are the first one to get three of a kind. Your excitement immediately evaporates: in the context of all the hands he dealt, your result is unsurprising.

Let's take it a step further and suppose that The Amazing Significo was less precise: he just promised to give you a good poker hand without specifying the kind of cards you would  get. You regard your hand as evidence of his powers, but you would have been equally happy with two pairs, a flush, or a full house. The probability of getting any one of those good hands goes up to 7 per cent, so in his sample of 50 people, we'd expect three or four to be very happy with his performance.
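These probabilities can be checked by direct counting, using the standard combinatorial formulas for five-card hands; the 'good hands' tallied below are the ones named above:

```python
from math import comb

total = comb(52, 5)  # 2,598,960 possible five-card hands

# Standard counting formulas for each hand type
three_of_a_kind = comb(13, 1) * comb(4, 3) * comb(12, 2) * 4 * 4
two_pairs = comb(13, 2) * comb(4, 2) ** 2 * comb(11, 1) * 4
flush = comb(4, 1) * (comb(13, 5) - 10)  # excluding straight flushes
full_house = comb(13, 1) * comb(4, 3) * comb(12, 1) * comb(4, 2)

p_trips = three_of_a_kind / total
p_good = (three_of_a_kind + two_pairs + flush + full_house) / total

print(round(p_trips, 3))      # 0.021: the 'around 2 per cent' figure
print(round(p_good, 3))       # 0.072: roughly 7 per cent
print(round(50 * p_good, 1))  # 3.6 expected happy customers out of 50
```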

So context is everything. If The Amazing Significo had dealt a hand to just one person and got a three-of-a-kind hand, that would indeed be amazing. If he had dealt hands to 50 people, and predicted in advance which of them would get a good hand, that would also be amazing. But if he dealt hands to 50 people and just claimed that one or two of them would get a good hand without prespecifying which ones it would be - well, he'd be rightly booed off the stage.

When researchers work with probabilities, they tend to see p-values as measures of the size and importance of a finding. However, as The Amazing Significo demonstrates, p-values can only be interpreted in the context of a whole experiment: unless you know about all the comparisons that have been made (corresponding to all the people who were dealt a hand) they are highly misleading.

In recent years, there has been growing interest in the phenomenon of p-hacking - selecting experimental data after doing the statistics to ensure a p-value below the conventional cutoff of .05. It is recognised as one reason for poor reproducibility of scientific findings, and it can take many forms.

I've become interested in one kind of p-hacking, use of what we term 'ghost variables' - variables that are included in a study but not reported unless they give a significant result. In a recent paper (preprint available here), Paul Thompson and I simulated the situation when a researcher has a set of dependent variables, but reports only those with p-values below .05. This would be like The Amazing Significo making a film of his performances in which he cut out all the cases where he dealt a poor hand**. It is easy to get impressive results if you are selective about what you tell people. If you have two groups of people who are equivalent to one another, and you compare them on just one variable, then the chance that you will get a spurious 'significant' difference (p < .05)  is 1 in 20. But with eight variables, the chance of a false positive 'significant' difference on any one variable is 1-.95^8, i.e. 1 in 3. (If variables are correlated these figures change: see our paper for more details).
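A quick simulation confirms the arithmetic, assuming scipy is available. The group size of 20 is arbitrary, and the variables here are uncorrelated - the simplest case considered in the paper:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Analytic: chance of at least one spurious p < .05 among k independent tests
print(1 - 0.95 ** 1)  # one variable: 1 in 20
print(1 - 0.95 ** 8)  # eight variables: about 1 in 3

# Simulation: two equivalent groups measured on 8 uncorrelated variables;
# count the runs in which at least one variable comes out 'significant'
n_sims, n, k = 2000, 20, 8
hits = 0
for _ in range(n_sims):
    group_a = rng.normal(0, 1, (n, k))
    group_b = rng.normal(0, 1, (n, k))
    pvals = ttest_ind(group_a, group_b).pvalue  # one p-value per variable
    if (pvals < 0.05).any():
        hits += 1
print(hits / n_sims)  # close to 1 - .95**8
```

Selective reporting amounts to showing only the runs counted in `hits` - which is why the published record looks so much more impressive than the underlying data.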

Quite simply p-values are only interpretable if you have the full context: if you pull out the 'significant' variables and pretend you did not test the others, you will be fooling yourself - and other people - by mistaking chance fluctuations for genuine effects. As we showed with our simulations, it can be extremely difficult to detect this kind of p-hacking, even using statistical methods such as p-curve analysis, which were designed for this purpose. This is why it is so important to either specify statistical tests in advance (akin to predicting which people will get three of a kind), or else adjust p-values for the number of comparisons in exploratory studies*.

Unfortunately, there are many trained scientists who just don't understand this. They see a 'significant' p-value in a set of data and think it has to be meaningful. Anyone who suggests that they need to correct p-values to take into account the number of statistical tests - be they correlations in a correlation matrix, coefficients in a regression equation, or factors and interactions in Analysis of Variance - is seen as a pedantic killjoy (see also Cramer et al, 2015). The p-value is seen as a property of the variable it is attached to, and the idea that it might change completely if the experiment were repeated is hard for them to grasp.

This mass delusion can even extend to journal editors, as was illustrated recently by the COMPare project, the brainchild of Ben Goldacre and colleagues. This involves checking whether the variables reported in medical studies correspond to the ones that the researchers had specified before the study was done and informing journal editors when this was not the case. There's a great account of the project by Tom Chivers in this Buzzfeed article, which I'll let you read for yourself. The bottom line is that the editors of the Annals of Internal Medicine appear to be people who would be unduly impressed by The Amazing Significo because they don't understand what Geoff Cumming has called 'the dance of the p-values'.



*I am ignoring Bayesian approaches here, which no doubt will annoy the Bayesians


**PS 27th Jan 2016: Marcus Munafo has drawn my attention to a film by Derren Brown called 'The System' which pretty much did exactly this! http://www.secrets-explained.com/derren-brown/the-system

Friday, 20 April 2012

Getting genetic effect sizes in perspective


My research focuses on neurodevelopmental disorders - specific language impairment, dyslexia, and autism in particular. For all of these there is evidence of genetic influence. But the research papers reporting relevant results are often incomprehensible to people who aren’t geneticists (and sometimes to those who are).  This leaves us ignorant of what has really been found, and subject to serious misunderstandings.
Just as preamble, evidence for genetic influences on behaviour comes in two kinds. The first approach, sometimes referred to as genetic epidemiology or behaviour genetics, allows us to infer how far genes are involved in causing individual differences by studying similarities between people who have different kinds of genetic relationship. The mainstay of this field is the twin study. The logic of twin studies is pretty simple, but the methods currently used to analyse twin data are complex. The twin method is far from perfect, but it has proved useful in helping us identify which conditions are worth investigating using the second approach, molecular genetics.
Molecular genetics involves finding segments of DNA that are correlated with a behavioural (or other phenotypic) measure. It involves laboratory work analysing biological samples of people who’ve been assessed on relevant measures. So if we’re interested in, say, dyslexia, we can either look for DNA variants that predict a person’s reading ability - a quantitative approach - or we can look for DNA variants that are more common in people who have dyslexia. There’s a range of methods that can be used, depending on whether the data come from families - in which case the relationship between individuals can be taken into account - or whether we just have a sample of unrelated people who vary on the behaviour of interest, in this case reading ability.
The really big problem comes from a tendency in molecular genetics to focus just on p-values when reporting findings. This is understandable: the field of molecular genetics has been plagued by chance findings. This is because there are vast amounts of DNA that can be analysed, and if you look at enough things, then the odd result will pop up as showing a group difference just by chance. (See this blogpost for further explanation). The p-value indicates whether an association between a DNA variant and a behavioural measure is a solid finding that is likely to replicate in another sample.
But a p-value depends on two things: (a) the strength of association between DNA and behaviour (effect size) and (b) the sample size. Psychologists, many of whom are interested in genetic variants linked to behaviour, are mostly used to working with samples that number in the tens rather than hundreds or thousands. It’s easy, therefore, to fall into the trap of assuming that a very low p-value means we have a large effect size, because that’s usually the case in the kind of studies we’re used to. Misunderstanding can arise if effect sizes are not reported in a paper.
Suppose we have a genetic locus with two alleles, a and A, and a geneticist contrasts people with an aa genotype vs those with aA or AA (who are grouped together). We read a molecular genetics paper that reports an association between these genotypes and reading ability with a p-value of .001. Furthermore, we see there are other studies in the literature reporting similar associations, so this seems a robust finding. You could be forgiven for concluding that the geneticists have found a "dyslexia gene", or at least a strong association with a DNA variant that will be useful in screening and diagnosis. And, if you are a psychologist, you might be tempted to do further studies contrasting people with aa vs aA/AA genotypes on behavioural or neurobiological measures that are relevant for reading ability.
However, this enthusiasm is likely to evaporate if you consider effect sizes. There is a nice little package in R, compute.es, that allows you to compute effect size easily if you know a p-value and a sample size. The table below shows:
  •  effect sizes (Cohen’s d, which gives mean difference between groups in z-score units)
  •  average for each group for reading scores scaled so the mean for the aA/AA group is 100 with SD of 15
Results are shown for various sample sizes with equal numbers of aa vs aA/AA and either p =.001 or p = .0001. (See reference manual for the R function for relevant formulae, which are also applicable in cases of unequal sample size). For those unfamiliar with this area, a child would not normally be flagged up as having reading problems unless a score on a test scaled this way was 85 or less (i.e., 1 SD below the mean).

Table 1: Effect sizes (Cohen's d) and group means derived from p-value and sample size (N)

When you have the kind of sample size that experimental or clinical psychologists often work with, with 25 participants per group, a p of .001 is indicative of a big effect, with a mean difference between groups of almost one SD. However, you may be surprised at how small the effect size is when you have a large sample. If you have a sample of 3000 or so, then a difference of just 1-2 points (or .08 SD) will give you p < .001. Most molecular genetic studies have large sample sizes. Geneticists in this area have learned that they have to have large samples, because they are looking for small effects!
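The calculation that compute.es performs can be sketched by inverting the t statistic: for a two-sample comparison with equal groups of size n, d = t × sqrt(2/n). A Python equivalent, assuming scipy is available:

```python
from math import sqrt
from scipy.stats import t

def d_from_p(p_two_sided, n_per_group):
    """Cohen's d implied by a two-sided p-value from a two-sample t-test
    with equal group sizes: d = t * sqrt(2 / n)."""
    df = 2 * n_per_group - 2
    t_value = t.ppf(1 - p_two_sided / 2, df)
    return t_value * sqrt(2 / n_per_group)

print(d_from_p(0.001, 25))    # about 0.99: a large effect
print(d_from_p(0.001, 3000))  # about 0.085: trivially small
```

The same p-value of .001 thus corresponds to an effect more than ten times larger with 25 per group than with 3000 per group.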
It would be quite wrong to suggest that only large effect sizes are interesting. Small but replicable effects can be of great importance in helping us understand causes of disorders, because if we find relevant genes we can study their mode of action (Scerri & Schulte-Korne, 2010). But, as far as the illustrative data in Table 1 are concerned, few psychologists would regard the reading deficit associated at p of .001 or .0001 with genotype aa as of clinical significance, once the sample size exceeds 1000 per group.
Genotype aa may be a risk factor for dyslexia, but only in conjunction with other risks. On its own it doesn’t cause dyslexia.  And the notion, propagated by some commercial genetics testing companies, that you could use a single DNA variant with this magnitude of effect to predict a person’s risk of dyslexia, is highly misleading.


Further reading
Flint, J., Greenspan, R. J., & Kendler, K. S. (2010). How Genes Influence Behavior: Oxford University Press.
Scerri, T., & Schulte-Körne, G. (2010). Genetics of developmental dyslexia. European Child & Adolescent Psychiatry, 19(3), 179-197. DOI: 10.1007/s00787-009-0081-0

If you are interested in analysing twin data, you can see my blog on Twin Methods in OpenMx, which illustrates a structural equation modelling approach in R with simulated data.

Update 21/4/12: Thanks to Tom Scerri for pointing out my original wording talked of "two versions of an allele", which has now been corrected to "a genetic locus with two alleles"; as Tom noted: an allele is an allele, you can't have two versions of it.
Tom also noted that in the table, I had taken the aA/AA genotype as the reference group for standardisation to mean 100, SD 15. A more realistic simulation would take the whole population with all three genotypes as the reference group, in which case the effect size would result from the aA/AA group having a mean above 100, while the aa group would have mean below 100. This would entail that, relative to the grand population average, the averages for aa would be higher than shown here, so that the number with clinically significant deficits will be even smaller.
I hope in future to illustrate these points by computing effect sizes for published molecular genetic studies reporting links with cognitive phenotypes.