Showing posts with label statistics. Show all posts

Tuesday, 26 January 2016

The Amazing Significo: why researchers need to understand poker

©www.savagechickens.com
Suppose I tell you that I know of a magician, The Amazing Significo, with extraordinary powers. He can undertake to deal you a five-card poker hand which has three cards with the same number.

You open a fresh pack of cards, shuffle the pack and watch him carefully. The Amazing Significo deals you five cards and you find that you do indeed have three of a kind.

According to Wikipedia, the probability of this happening by chance when dealing from an unbiased deck of cards is around 2 per cent - so you are likely to be impressed. You may go public to endorse The Amazing Significo's claim to have supernatural abilities.

But then I tell you that The Amazing Significo has actually dealt five cards to 49 other people that morning, and you are the first one to get three of a kind. Your excitement immediately evaporates: in the context of all the hands he dealt, your result is unsurprising.

Let's take it a step further and suppose that The Amazing Significo was less precise: he just promised to give you a good poker hand without specifying the kind of cards you would  get. You regard your hand as evidence of his powers, but you would have been equally happy with two pairs, a flush, or a full house. The probability of getting any one of those good hands goes up to 7 per cent, so in his sample of 50 people, we'd expect three or four to be very happy with his performance.
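The figures in this example can be checked with a little counting. The sketch below is not from the original post: it uses the standard combinatorial counts for five-card hands, and the helper name `p_at_least_one` is made up for illustration. It assumes each of the 50 deals is independent and comes from a freshly shuffled deck.

```python
from math import comb

TOTAL_HANDS = comb(52, 5)  # number of distinct five-card hands

# Standard counts for each category of hand.
three_of_a_kind = 13 * comb(4, 3) * comb(12, 2) * 4 ** 2
two_pair = comb(13, 2) * comb(4, 2) ** 2 * 44
full_house = 13 * comb(4, 3) * 12 * comb(4, 2)
flush = 4 * (comb(13, 5) - 10)  # excludes the 10 straight flushes per suit

p_three = three_of_a_kind / TOTAL_HANDS
p_good = (three_of_a_kind + two_pair + full_house + flush) / TOTAL_HANDS


def p_at_least_one(p, n):
    """Chance that at least one of n independent deals produces the hand."""
    return 1 - (1 - p) ** n


n_people = 50
print(f"P(three of a kind) in one deal: {p_three:.3f}")
print(f"P(at least one three of a kind in {n_people} deals): "
      f"{p_at_least_one(p_three, n_people):.2f}")
print(f"P(any 'good' hand) in one deal: {p_good:.3f}")
print(f"Expected number of happy people out of {n_people}: "
      f"{n_people * p_good:.1f}")
```

With 50 deals, at least one three of a kind is more likely than not, which is why the result stops being impressive once you know about the other 49 people.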

So context is everything. If The Amazing Significo had dealt a hand to just one person and got a three-of-a-kind hand, that would indeed be amazing. If he had dealt hands to 50 people, and predicted in advance which of them would get a good hand, that would also be amazing. But if he dealt hands to 50 people and just claimed that one or two of them would get a good hand without prespecifying which ones it would be - well, he'd be rightly booed off the stage.

When researchers work with probabilities, they tend to see p-values as measures of the size and importance of a finding. However, as The Amazing Significo demonstrates, p-values can only be interpreted in the context of a whole experiment: unless you know about all the comparisons that have been made (corresponding to all the people who were dealt a hand) they are highly misleading.

In recent years, there has been growing interest in the phenomenon of p-hacking - selecting experimental data after doing the statistics to ensure a p-value below the conventional cutoff of .05. It is recognised as one reason for poor reproducibility of scientific findings, and it can take many forms.

I've become interested in one kind of p-hacking, use of what we term 'ghost variables' - variables that are included in a study but not reported unless they give a significant result. In a recent paper (preprint available here), Paul Thompson and I simulated the situation when a researcher has a set of dependent variables, but reports only those with p-values below .05. This would be like The Amazing Significo making a film of his performances in which he cut out all the cases where he dealt a poor hand**. It is easy to get impressive results if you are selective about what you tell people. If you have two groups of people who are equivalent to one another, and you compare them on just one variable, then the chance that you will get a spurious 'significant' difference (p < .05)  is 1 in 20. But with eight variables, the chance of a false positive 'significant' difference on any one variable is 1-.95^8, i.e. 1 in 3. (If variables are correlated these figures change: see our paper for more details).
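The 1 in 3 figure is easy to verify. The sketch below is an illustration rather than the simulation from the paper: it assumes the variables are independent and that there is no true effect, in which case each p-value is uniformly distributed between 0 and 1. The function names are invented for this example.

```python
import random


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one p < alpha among n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def simulate_ghost_variables(n_tests, n_studies=20000, alpha=0.05, seed=1):
    """Fraction of simulated null studies with at least one 'significant' variable.

    Under the null hypothesis a p-value is uniform on [0, 1], so each
    variable's p-value is drawn directly as a uniform random number.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        p_values = [rng.random() for _ in range(n_tests)]
        if min(p_values) < alpha:
            hits += 1
    return hits / n_studies


for k in (1, 2, 4, 8):
    print(k, round(familywise_error_rate(k), 3),
          round(simulate_ghost_variables(k), 3))
```

The analytic and simulated columns should agree closely. With correlated variables the false positive rate would be lower than this, as noted above.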

Quite simply p-values are only interpretable if you have the full context: if you pull out the 'significant' variables and pretend you did not test the others, you will be fooling yourself - and other people - by mistaking chance fluctuations for genuine effects. As we showed with our simulations, it can be extremely difficult to detect this kind of p-hacking, even using statistical methods such as p-curve analysis, which were designed for this purpose. This is why it is so important to either specify statistical tests in advance (akin to predicting which people will get three of a kind), or else adjust p-values for the number of comparisons in exploratory studies*.
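For readers who have not adjusted p-values before, here is a minimal sketch of two widely used corrections, Bonferroni and Holm. These are offered as common examples, not as the method used in our paper, and the p-values in the example are made up.

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm's step-down adjustment, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        # Adjusted values must not decrease as the raw p-values increase.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values from eight dependent variables in one study.
raw = [0.04, 0.20, 0.50, 0.01, 0.70, 0.30, 0.90, 0.60]
print([round(p, 2) for p in bonferroni(raw)])
print([round(p, 2) for p in holm(raw)])
```

In this made-up example two variables have raw p-values below .05, but neither survives adjustment for the eight tests that were actually run.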

Unfortunately, there are many trained scientists who just don't understand this. They see a 'significant' p-value in a set of data and think it has to be meaningful. Anyone who suggests that they need to correct p-values to take into account the number of statistical tests - be they correlations in a correlation matrix, coefficients in a regression equation, or factors and interactions in Analysis of Variance - is seen as a pedantic killjoy (see also Cramer et al, 2015). The p-value is seen as a property of the variable it is attached to, and the idea that it might change completely if the experiment were repeated is hard for them to grasp.

This mass delusion can even extend to journal editors, as was illustrated recently by the COMPare project, the brainchild of Ben Goldacre and colleagues. This involves checking whether the variables reported in medical studies correspond to the ones that the researchers had specified before the study was done and informing journal editors when this was not the case. There's a great account of the project by Tom Chivers in this Buzzfeed article, which I'll let you read for yourself. The bottom line is that the editors of the Annals of Internal Medicine appear to be people who would be unduly impressed by The Amazing Significo because they don't understand what Geoff Cumming has called 'the dance of the p-values'.



*I am ignoring Bayesian approaches here, which no doubt will annoy the Bayesians


**PS. 27th Jan 2016. Marcus Munafo has drawn my attention to a film by Derren Brown called 'the System' which pretty much did exactly this! http://www.secrets-explained.com/derren-brown/the-system

Monday, 21 July 2014

Percentages, quasi statistics and bad arguments


© www.CartoonStock.com

Percentages have been much in the news lately. First, we have a PLOS One paper by John Ioannidis and his colleagues which noted that less than one per cent of all publishing scientists in the period from 1996 to 2011 published something in each and every year of this 16-year period.

Then there was a trailer for a wonderfully silly forthcoming film, Lucy, in which Scarlett Johansson suffers from a drug overdose that leads her to learn Chinese in an hour and develop an uncanny ability to make men fall over by merely pouting at them. Morgan Freeman plays a top neuroscientist who explains that whereas the rest of us use a mere ten per cent of our brain capacity, Johansson's character has access to a full hundred per cent of her brain.

And today I've just read an opinion piece in Prospect Magazine by the usually clear-thinking philosopher, A. C. Grayling, which states: Neuropsychology tells us that more than ninety per cent of mental computation happens below the level of awareness.

Examples like these can be used to demonstrate just how careful you need to be when interpreting percentages. There are two issues. For a start, a percentage is uninterpretable unless you know the origin of the denominator (i.e., the total number of cases that the percentage is based on).  I'm sure the paper by Ioannidis and colleagues is competently conducted, but the result seems far less surprising when you realise that the 'less than one per cent' figure was obtained using a denominator based on all authors mentioned on all papers during the target period. As Ioannidis et al noted, this will include a miscellaneous bunch of people, including those who are unsuccessful at gaining research funding or in getting papers published, those taking career breaks, people who are trainees or research assistants, those working in disciplines where it is normal to publish infrequently, and those who fit in  research activity around clinical responsibilities. Presumably it also includes those who have died, retired, or left the field in the study period.

So if you are someone who publishes regularly, and are feeling smug at your rarity value, you might want to rethink. In fact, given the heterogeneity of the group on whom the denominator is based, I'm not sure what conclusions to draw from this paper. Ioannidis et al noted that those who publish frequently also get cited more frequently – even after taking into account number of publications – and concluded that the stability and continuity of the publishing scientific workforce may have important implications for the efficiency of science. But what one should actually do with this information is unclear. The authors suggest that one option is to give more opportunities to younger scientists so that they can join the elite group who publish regularly. However, I suspect that's not how the study will be interpreted: instead, we'll have university administrators adding 'continuity of publishing record' to their set of metrics for recruiting new staff, leading to even more extreme pressure to publish quickly, rather than taking time to reflect on results. A dismal thought indeed.

The other two examples that I cited are worse still. It's not that they have a misleading denominator: as far as one can tell, they don't have a denominator at all.  In effect, they are quasi-statistics. Since the publication of the Lucy trailer, neuroscientists have stepped up to argue that of course we use much more than ten per cent of our brains, and to note that the origin of this mythical statistic is hard to locate (see, for instance here and here). I'd argue there's an even bigger problem – the statement can't be evaluated as accurate or inaccurate without defining what scale is being adopted to quantify 'brain use'. Does it refer to cells, neural networks, white matter, grey matter, or brain regions? Are we only 'using' these if there is measurable activity? And is that activity measured by neural oscillations, synaptic firing, a haemodynamic response or something else?

In a similar vein, in the absence of any supporting reference for the Grayling quote, it remains opaque to me how you'd measure 'mental computation' and then subdivide it into the conscious and the unconscious. Sure, he's right that our brains carry out many computations of which we have no explicit awareness. Language is a classic case – I assume most readers would have no difficulty turning a sentence like You wanted to eat the apples that she gave you into a negative form (You didn't want to eat the apples that she gave you) or a question (Did you want to eat the apples that she gave you?) but unless you are a linguist, you will have difficulty explaining how you did this. I don't take issue with Grayling's main point, but I am surprised that an expert philosopher should introduce a precise number into the argument, when it can readily be shown to be meaningless.

The main point here is that we are readily impressed by numbers. A percentage seems to imply that there is a body of evidence on which a statement is based. But we need to treat percentages with suspicion; unless we can identify the numerator and denominator from which they are calculated, they are likely to just be fancy ways of trying to persuade us into giving more weight to an argument than it deserves.

Saturday, 5 October 2013

Good and bad news on the phonics screen



Teaching children to read is a remarkably fraught topic. Last year the UK Government introduced a screening check to assess children’s ability to use phonics – i.e., to decode letters into sounds. Judging from the reaction in some quarters they might as well have announced they were going to teach 6-year-olds calculus. The test, we were told, would confuse and upset children and not tell teachers anything they did not already know. Some people implied that there was an agenda to teach children to read solely using meaningless materials. This, of course, is not the case. Nonwords are used in assessment precisely because you need to find out if the child has the skills to attack an unfamiliar word by working out the sounds. Phonics has been ignored or rejected for many years by those who assumed that if you taught phonics the child would be doomed to an educational approach that involved boring drills in meaningless materials. This is not the case: for instance, Kevin Wheldall argues that children need to combine teaching of phonics with training in vocabulary and comprehension, and storybook reading with real texts should be a key component of reading instruction.
There is evidence for the effectiveness of phonics training from controlled trials, and I therefore regard it as a positive move that the government has endorsed the use of phonics in schools. However, they continue to meet resistance from many teachers, for a whole range of reasons. Some just don’t like phonics. Some don’t like testing children, especially when the outcome is a pass/fail classification. Many fear that the government will use results of a screening test to create league tables of schools, or to identify bad teachers. Others question the whole point of screening: this recent piece from the BBC website quotes Christine Blower, the head of the National Union of Teachers, as saying: “Children develop at different levels, the slow reader at five can easily be the good reader by the age of 11.” To anyone familiar with the literature on predictors of children’s reading, this shows startling levels of complacency and ignorance. We have known for years that you can predict with good accuracy which children are likely to be poor readers at 11 years from their reading ability at 6 (Butler et al, 1985).
When the results from last year's phonics screen came out I blogged about them, because they looked disturbingly dodgy, with a spike in the frequency distribution at the pass mark of 32. On Twitter, @SusanGodsland has pointed me to a report on the 2012 data where this spike was discussed. This noted that the spike in the distribution was not seen in a pilot study where the pass mark had not been known in advance. The spike was played down in this report, and attributed to “teachers accounting for potential misclassification in the check results, and using their teacher judgment to determine if children are indeed working at the expected standard.” It was further argued that the impact of the spike was small, and would lead to only around 4% misclassification.
However, a more detailed research report on the results was rather less mealy-mouthed about the spike and noted “the national distribution of scores suggests that pupils on the borderline may have been marked up to meet the expected standard.” The authors of that report did the best they could with the data and carried out two analyses to try to correct for the spike. In the first, they deleted points in the distribution where the linear pattern of increase in scores was disrupted, and instead interpolated the line. They concluded that this gave 54% rather than 58% of children passing the screen. The second approach, which they described as more statistically robust, was to take all the factors that they had measured that predicted scores on the phonics screen, ignoring cases with scores close to the spike, and then use these to predict the percentage passing the screen in the whole population. When this method was used, only 46% of children were estimated to have passed the screen when the spike was corrected for.
Well, this year’s results have just been published. The good news is that there is an impressive increase in percentage of children passing from 2012 to 2013, up from 58% to 69%. This suggests that the emphasis on phonics is encouraging teachers to teach children about how letters and sounds go together.
But any positive reaction to this news is tinged with a sense of disappointment that once again we have a most peculiar distribution with a spike at the pass mark. 
 
Proportions of children with different scores on phonics screen in 2012 and 2013. Dotted lines show interpolated values.

I applied the same correction as had been used for the 2012 data, i.e. interpolating the curve over the dodgy area. This suggested that the proportion of cases passing the screen was overestimated by about 6% for both 2012 and 2013. (The precise figure will depend on the exact way the interpolation is done). 
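To make the idea of the correction concrete, here is a rough sketch using a made-up toy distribution, not the real phonics screen data. It assumes a straight-line interpolation between the points either side of the disturbed region, and it assumes the pass rate is then recalculated from the renormalised distribution; the published analyses may have handled these details differently. The function names and the choice of which scores to interpolate over are illustrative only.

```python
def interpolate_over(counts, first, last):
    """Replace counts[first..last] with a straight line between the neighbours."""
    corrected = list(counts)
    left, right = first - 1, last + 1
    span = right - left
    for score in range(first, last + 1):
        weight = (score - left) / span
        corrected[score] = counts[left] + weight * (counts[right] - counts[left])
    return corrected


def pass_rate(counts, pass_mark):
    """Proportion of cases at or above the pass mark (renormalised by the total)."""
    return sum(counts[pass_mark:]) / sum(counts)


PASS_MARK = 32

# Toy data for scores 0..40: a steady linear rise in frequency...
counts = [score + 1 for score in range(41)]
# ...with some children 'nudged' from 30 and 31 up to the pass mark.
counts[30] -= 15
counts[31] -= 15
counts[32] += 30

raw = pass_rate(counts, PASS_MARK)
corrected = pass_rate(interpolate_over(counts, 30, 32), PASS_MARK)
print(f"raw pass rate: {raw:.3f}, after interpolation: {corrected:.3f}")
```

In this toy case the interpolation recovers the underlying linear pattern and the estimated pass rate drops by a few percentage points. With real data the answer depends on where you decide the disturbed region starts and ends.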
Of course I recognise that any pass mark is arbitrary, and children’s performance may fluctuate and not always represent their true ability. The children who scored just below the pass mark may indeed not warrant extra help with reading, and one can see how a teacher may be tempted to nudge a score upward if that is their judgement. Nevertheless, teachers who do this are making it difficult to rely on the screen data and to detect whether there are any improvements year on year. And it undermines their professional status if they cannot be trusted to administer a simple reading test objectively.
It has been announced that the pass mark for the phonics screen won’t be disclosed in advance in 2014, which should reduce the tendency to nudge scores up. However, if the pass mark differs from previous years, then the tests won’t be comparable, so it seems likely that teachers will be able to guess it will remain at 32. Perhaps one solution would be to ask the teacher to make a rating of whether or not the test result agrees with their judgement of the child’s ability. If they have an opportunity to give their professional opinion, they may be less tempted to tweak test results. I await with interest the results from 2014!

Reference
Butler, Susan R., Marsh, Herbert W., Sheppard, Marlene J., & Sheppard, John L. (1985). Seven-year longitudinal study of the early prediction of reading achievement. Journal of Educational Psychology, 77, 349-361. DOI: 10.1037//0022-0663.77.3.349

Friday, 26 July 2013

Why we need pre-registration


There has been a chorus of disapproval this week at the suggestion that researchers should 'pre-register' their studies with journals and spell out in advance the methods and analyses that they plan to do. Those who wish to follow the debate should look at this critique by Sophie Scott, with associated comments, and the responses to it collated here by Pete Etchells. They should also read the explanation of the pre-registration proposals and FAQ  by Chris Chambers - something that many participants in the debate appear not to have done.

Quite simply, pre-registration is designed to tackle two problems in scientific publishing:
  • Bias against publication of null results
  • A failure to distinguish hypothesis-generating (exploratory) from hypothesis-testing analyses
Either of these alone is bad for science: the combined effect of both of them is catastrophic, and has led to a situation where research is failing to do its job in terms of providing credible answers to scientific questions.

Null results

Let's start with the bias against null results. Much has been written about this, including by me. But the heavy guns in the argument have been wielded by Ben Goldacre, who has pointed out that, in the clinical trials field, if we only see the positive findings, then we get a completely distorted view of what works, and as a result, people may die. In my field of psychology, the stakes are not normally as high, but the fact remains that there can be massive distortion in our perception of evidence.

Pre-registration would fix this by guaranteeing publication of a paper regardless of how the results turn out. In fact, there is another, less bureaucratic, way the null result problem could be fixed, and that would be by having reviewers decide on a paper's publishability solely on the basis of the introduction and methods. But that would not fix the second problem.

Blurring the boundaries between exploratory and hypothesis-testing analyses

A big problem is that nearly all data analysis is presented as if it is hypothesis-testing when in fact much of it is exploratory.

In an exploratory analysis, you take a dataset and look at it flexibly to see what's there. Like many scientists, I love exploratory analyses, because you don't know what you will find, and it can be important and exciting. I suspect it is also something that you get better at as you get more experienced, and more able to see the possibilities in the numbers. But my love of exploratory analyses is coupled with a nervousness. With an exploratory analysis, whatever you find, you can never be sure it wasn't just a chance result. Perhaps I was lucky in having this brought home to me early in my career, when I had an alphabetically ordered list of stroke patients I was planning to study, and I happened to notice that those with names in the first half of the alphabet  had left hemisphere lesions and those with names in the second half had right hemisphere lesions. I even did a chi square test and found it was highly significant. Clearly this was nonsense, and just one of those spurious things that can turn up by chance.

These days it is easy to see how often meaningless 'significant' results occur by running analyses on simulated data - see this blogpost for instance. In my view, all statistics classes should include such exercises.
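As a flavour of the kind of exercise I mean, here is a minimal sketch (not the code from the linked blogpost). It repeatedly draws two groups from the same population, so any 'significant' difference is spurious by construction. To keep it dependency-free it uses a large-sample z approximation rather than a proper t-test, which is an assumption made for simplicity; the function names and sample sizes are arbitrary.

```python
import random
from statistics import NormalDist, mean, variance


def two_group_p_value(a, b):
    """Two-sided p-value from a large-sample z approximation."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_runs=2000, n_per_group=50, alpha=0.05, seed=42):
    """Fraction of runs with p < alpha when both groups share one population."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_runs):
        a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        if two_group_p_value(a, b) < alpha:
            hits += 1
    return hits / n_runs


print(f"Proportion of 'significant' null comparisons: {false_positive_rate():.3f}")
```

The proportion should come out close to .05: roughly one run in twenty gives a 'finding' even though there is nothing to find.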

So you've done your exploratory analysis, got an exciting finding, but are nervous as to whether it is real. What do you do? The answer is you need a confirmatory study. In the field of genetics, failure to realise this led to several years of stasis, cogently described by Flint et al (2010). Genetics really highlights the problem, because of the huge numbers of possible analyses that can be conducted. What was quickly learned was that most exciting effects don't replicate. The bar has accordingly been set much higher, and most genetics journals won't consider publishing a genetic association unless replication has been demonstrated (Munafo & Flint, 2011). This is tough, but it has meant that we can now place confidence in genetics results. (It also has had a positive side-effect of encouraging more collaboration between research groups). Unfortunately, those outside the field of genetics are unaware of these developments, and we are seeing increasing numbers of genetic association studies being published in the neuroscience literature, with tiny samples and no replication.

The important point to grasp is that the meaning of a p-value is completely different if it emerges when testing an a priori prediction, compared with when it is found in the course of conducting numerous analyses of a dataset. Here, for instance, are outputs from 15 runs of a 4-way Anova on random data, as described here:
Each row shows p-value for outputs (main effects then interactions) for one run of 4-way Anova on new set of random data. For a slightly more legible version see here

If I approached a dataset specifically testing the hypothesis that there would be an interaction between group and task, then the chance of a p-value of .05 or less would be 1 in 20  (as can be confirmed by repeating the simulation thousands of times - in a small number of runs it's less easy to see). But if I just looked for significant findings, it's not hard to find something on most of these runs. An exploratory analysis is not without value, but its value is in generating hypotheses that can then be tested in an a priori design.
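To see why something turns up on most runs, it helps to count the terms in a full factorial 4-way Anova. The sketch below uses hypothetical factor names and treats the tests as independent, which is only an approximation because the F-tests share an error term.

```python
from itertools import combinations

factors = ["A", "B", "C", "D"]  # hypothetical factor names

# Every main effect and every interaction is a non-empty subset of the factors.
effects = [
    combo
    for size in range(1, len(factors) + 1)
    for combo in combinations(factors, size)
]
n_effects = len(effects)

# Approximate chance that at least one term reaches p < .05 on random data.
p_any = 1 - 0.95 ** n_effects

print(f"{n_effects} terms; P(at least one p < .05) is roughly {p_any:.2f}")
```

So on random data a little over half of all runs would be expected to show at least one 'significant' main effect or interaction, compared with 1 in 20 for a single prespecified term.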

So replication is needed to deal with the uncertainties around exploratory analysis. How does pre-registration fit in the picture? Quite simply, it makes explicit the distinction between hypothesis-generating (exploratory) and hypothesis-testing research, which is currently completely blurred. As in the example above, if you tell me in advance what hypothesis you are testing, then I can place confidence in the uncorrected statistical probabilities associated with the predicted effects.  If you haven't predicted anything in advance, then I can't.

This doesn't mean that the results from exploratory analyses are necessarily uninteresting, untrue, or unpublishable, but it does mean we should interpret them as what they are: hypothesis-generating rather than hypothesis-testing.

I'm not surprised at the outcry against pre-registration. This is mega. It would require most of us to change our behaviour radically. It would turn on its head the criteria used to evaluate findings: well-conducted replication studies, currently often unpublishable,  would be seen as important, regardless of their results. On the other hand, it would no longer be possible to report exploratory analyses as if they are hypothesis-testing. In my view, unless we do this we will continue to waste time and precious research funding chasing illusory truths.

References

Flint, J., Greenspan, R. J., & Kendler, K. S. (2010). How Genes Influence Behavior. Oxford University Press.

Munafo, M., & Flint, J. (2011). Dissecting the genetic architecture of human personality. Trends in Cognitive Sciences, 15 (9), 395-400. DOI: 10.1016/j.tics.2011.07.007