
You open a fresh pack of cards, shuffle the pack and watch him carefully. The Amazing Significo deals you five cards and you find that you do indeed have three of a kind.

According to Wikipedia, the probability of this happening by chance when dealing from an unbiased deck of cards is around 2 per cent - so you are likely to be impressed. You may go public to endorse The Amazing Significo's claim to have supernatural abilities.

But then I tell you that The Amazing Significo has actually dealt five cards to 49 other people that morning, and you are the first one to get three of a kind. Your excitement immediately evaporates: in the context of all the hands he dealt, your result is unsurprising.

Let's take it a step further and suppose that The Amazing Significo was less precise: he just promised to give you a good poker hand without specifying the kind of cards you would get. You regard your hand as evidence of his powers, but you would have been equally happy with two pairs, a flush, or a full house. The probability of getting any one of those good hands goes up to 7 per cent, so in his sample of 50 people, we'd expect three or four to be very happy with his performance.
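The figures quoted above can be checked by counting hands. The following is a minimal sketch, assuming a standard 52-card deck and taking "good hand" to mean exactly the four categories named in the paragraph (three of a kind, two pairs, a flush, a full house); the variable names are illustrative only.

```python
from math import comb

TOTAL_HANDS = comb(52, 5)  # number of distinct five-card hands

# Number of hands in each category (standard counting arguments).
three_of_a_kind = 13 * comb(4, 3) * comb(12, 2) * 4 ** 2
two_pair = comb(13, 2) * comb(4, 2) ** 2 * 11 * 4
flush = 4 * comb(13, 5) - 40  # excludes the 40 straight flushes
full_house = 13 * comb(4, 3) * 12 * comb(4, 2)

p_three = three_of_a_kind / TOTAL_HANDS
p_good = (three_of_a_kind + two_pair + flush + full_house) / TOTAL_HANDS

N_PEOPLE = 50
# Chance that at least one of 50 people is dealt three of a kind.
p_at_least_one = 1 - (1 - p_three) ** N_PEOPLE
# Expected number of people, out of 50, holding one of the "good" hands.
expected_happy = N_PEOPLE * p_good

print(f"P(three of a kind)           = {p_three:.4f}")
print(f"P(any 'good' hand)           = {p_good:.4f}")
print(f"P(at least one trips in 50)  = {p_at_least_one:.3f}")
print(f"Expected 'good' hands in 50  = {expected_happy:.2f}")
```

Under these assumptions the single-hand probabilities come out close to 2 and 7 per cent, while the chance that somebody among 50 people gets three of a kind is well over one half.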

So context is everything. If The Amazing Significo had dealt a hand to just one person and got a three-of-a-kind hand, that would indeed be amazing. If he had dealt hands to 50 people, and predicted in advance *which of them* would get a good hand, that would also be amazing. But if he dealt hands to 50 people and just claimed that one or two of them would get a good hand without prespecifying which ones it would be - well, he'd be rightly booed off the stage.

When researchers work with probabilities, they tend to see p-values as measures of the size and importance of a finding. However, as The Amazing Significo demonstrates, p-values can only be interpreted in the context of a whole experiment: unless you know about all the comparisons that have been made (corresponding to all the people who were dealt a hand) they are highly misleading.

In recent years, there has been growing interest in the phenomenon of p-hacking - selecting experimental data after doing the statistics to ensure a p-value below the conventional cutoff of .05. It is recognised as one reason for poor reproducibility of scientific findings, and it can take many forms.

I've become interested in one kind of p-hacking: the use of what we term 'ghost variables' - variables that are included in a study but not reported unless they give a significant result. In a recent paper (preprint available here), Paul Thompson and I simulated the situation when a researcher has a set of dependent variables, but reports only those with p-values below .05. This would be like The Amazing Significo making a film of his performances in which he cut out all the cases where he dealt a poor hand**. It is easy to get impressive results if you are selective about what you tell people. If you have two groups of people who are equivalent to one another, and you compare them on just one variable, then the chance that you will get a spurious 'significant' difference (p < .05) is 1 in 20. But with eight variables, the chance of a false positive 'significant' difference on at least one variable is 1-.95^8, i.e. about 1 in 3. (If variables are correlated these figures change: see our paper for more details.)
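The 1-.95^8 arithmetic can be reproduced in a few lines. This is only a sketch and not the simulation from the paper: it assumes the variables are independent and that p-values under the null hypothesis are uniformly distributed, and the function names are made up for illustration.

```python
import random


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one p < alpha among n independent null tests."""
    return 1 - (1 - alpha) ** n_tests


def simulate_ghost_variables(n_tests, alpha=0.05, n_studies=20000, seed=1):
    """Share of simulated null studies with at least one 'significant' variable.

    Assumption: with no true effect, each p-value is an independent draw
    from a uniform distribution on [0, 1].
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_studies):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_studies


for k in (1, 2, 4, 8):
    print(k, round(familywise_error_rate(k), 3),
          round(simulate_ghost_variables(k), 3))
```

With eight variables both the formula and the simulation give a false positive rate of roughly one in three.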

Quite simply, p-values are only interpretable if you have the full context: if you pull out the 'significant' variables and pretend you did not test the others, you will be fooling yourself - and other people - by mistaking chance fluctuations for genuine effects. As we showed with our simulations, it can be extremely difficult to detect this kind of p-hacking, even using statistical methods such as p-curve analysis, which were designed for this purpose. This is why it is so important either to specify statistical tests in advance (akin to predicting which people will get three of a kind), or else to adjust p-values for the number of comparisons in exploratory studies*.
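The post does not say which adjustment to use. As one possible illustration, here is a sketch of two common corrections, Bonferroni and Holm, applied to a set of eight p-values that are invented purely for the example.

```python
def bonferroni(p_values):
    """Bonferroni adjustment: multiply each p-value by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjustment (never more conservative than Bonferroni)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p is multiplied by m, the next by m - 1, and so on;
        # the running maximum keeps the adjusted values in the same order.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted


# Hypothetical p-values for eight dependent variables (made up for this sketch).
raw = [0.04, 0.20, 0.65, 0.003, 0.48, 0.11, 0.90, 0.31]
bonf = bonferroni(raw)
holm_adj = holm(raw)

for p, b, h in zip(raw, bonf, holm_adj):
    print(f"raw={p:.3f}  bonferroni={b:.3f}  holm={h:.3f}")
```

In this made-up example two variables look 'significant' before correction, but only the one with p = .003 stays below .05 after either adjustment.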

Unfortunately, there are many trained scientists who just don't understand this. They see a 'significant' p-value in a set of data and think it has to be meaningful. Anyone who suggests that they need to correct p-values to take into account the number of statistical tests - be they correlations in a correlation matrix, coefficients in a regression equation, or factors and interactions in Analysis of Variance - is seen as a pedantic killjoy (see also Cramer et al, 2015). The p-value is seen as a property of the variable it is attached to, and the idea that it might change completely if the experiment were repeated is hard for them to grasp.
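To see how much a p-value can change when the same experiment is repeated, one can simulate many replications of a single two-group comparison. The sketch below is illustrative only: the effect size (0.5 standard deviations), the group size (32) and the use of a z-test with known standard deviation are all assumptions chosen to keep the code short, not values taken from the post.

```python
import math
import random


def two_sided_p_from_z(z):
    """Two-sided p-value for a standard normal test statistic."""
    return math.erfc(abs(z) / math.sqrt(2))


def replicate_p_values(effect_size=0.5, n_per_group=32,
                       n_replications=1000, seed=2016):
    """Repeat the same two-group experiment and collect the p-values.

    Assumptions: normally distributed scores with known SD of 1 in both
    groups, so a simple z-test on the difference in means is appropriate.
    """
    rng = random.Random(seed)
    standard_error = math.sqrt(2 / n_per_group)
    p_values = []
    for _ in range(n_replications):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        diff = sum(treated) / n_per_group - sum(control) / n_per_group
        p_values.append(two_sided_p_from_z(diff / standard_error))
    return p_values


p_values = replicate_p_values()
share_significant = sum(p < 0.05 for p in p_values) / len(p_values)

print("first ten p-values:", [round(p, 3) for p in p_values[:10]])
print("share with p < .05:", round(share_significant, 2))
print("smallest and largest p:", min(p_values), max(p_values))
```

Even though the true effect is identical in every replication, the p-values range from tiny to clearly 'non-significant', which is why a single p-value cannot be treated as a fixed property of a variable.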

This mass delusion can even extend to journal editors, as was illustrated recently by the COMPare project, the brainchild of Ben Goldacre and colleagues. This involves checking whether the variables reported in medical studies correspond to the ones that the researchers had specified before the study was done and informing journal editors when this was not the case. There's a great account of the project by Tom Chivers in this Buzzfeed article, which I'll let you read for yourself. The bottom line is that the editors of the Annals of Internal Medicine appear to be people who would be unduly impressed by The Amazing Significo because they don't understand what Geoff Cumming has called 'the dance of the p-values'.

*I am ignoring Bayesian approaches here, which no doubt will annoy the Bayesians.

**PS. 27th Jan 2016. Marcus Munafo has drawn my attention to a film by Derren Brown called 'The System' which pretty much did exactly this! http://www.secrets-explained.com/derren-brown/the-system

Really nicely explained. We'd get rid of a lot of these problems if, as a profession, we just switched from 2 sigma to 5 sigma like they do in physics...

Not really a Bayesian but I did notice it .)

I was reading Ben Goldacre's COMPare info the other day and was quite impressed.

It is amazing what some good digging can turn up. The problems in doing so are a) it is usually as exciting as doing new research and b) it can take a lot of time and resources.

@Anonymous

No, switching to 5 sigma just means that we need a bigger effect, but the problems of multiple tests without corrections and p-hacking remain.

It's the actual practices not specific criteria that are the problem.

Thanks to you and @anonymous for comments. On the basis of simulations, though, I share the view of Anonymous that shifting to require p < .001 instead of p < .05 would pretty much fix the problem in psychology. For typical experimental parameters in psychology, there would just be too few low p-values to p-hack. See also David Colquhoun's article here: http://rsos.royalsocietypublishing.org/content/1/3/140216

I understand the point and agree that, in practical terms, it would solve much of the current problem in psychology and, perhaps, in a number of other areas. Moving to 5 sigma or, as Colquhoun seems to suggest, 3 sigma would likely be a vast improvement. (Though, of course, Bayes is, really, the way to go :)).

There certainly is nothing sacred about .05. A significance level selected for agricultural research in the early 1900s may not be totally appropriate anymore.

However, I still hold (stubbornly) to my position, which I did not explain clearly, that it does not provide a cure for the actual behaviour of p-hacking. That is why I disagreed with Anonymous. There probably is no complete cure for p-hacking, but better training in graduate school might have some effect: some papers I read give the impression that the researcher is simply not aware of the issue rather than deliberately trying to game the system.

I enjoyed David Colquhoun's article. It seems one of the best, most straightforward explanations of the problem that I have seen. The accompanying R script is interesting.

Great post and thank you for writing!

We are trying to encourage more people to preregister their analyses with the Preregistration Challenge: 1,000 researchers will win $1,000 for publishing the results of their preregistered work: https://cos.io/prereg

I have not studied statistics but had heard of this prohibition once and now you have explained it; thank you!

One question: suppose a researcher notices, among multiple variables, one that is unexpectedly significant. Then, they do a new study, sound in other ways, to investigate that one variable. Would that be reliable?

Absolutely. That is how it is supposed to work - initial exploratory study, generate hypothesis, test hypothesis with new data.

Relevant: http://xkcd.com/882/

I think that much of the problem is based on the artificial bisection of effects into significant vs. non-significant while ignoring effect sizes much of the time. The rules of the game are that you're allowed to claim that 'X does Y' if the corresponding p value is <0.05. If 'X does Y' pertains to something funky that the general audience can get excited about, you might be able to publish it in the 'glamour' journals and the media are likely to pick it up. Great for the career!

However, p<0.05 (or any other alpha) DOES NOT mean that X does Y. It means that X does Y sometimes, but at other times it may do the opposite or nothing at all. How consistently X does Y is determined by the effect size, not by the p-value!

In my ideal world, there would be an automatic editing process that all papers in all journals would be subjected to that replaces all statements of 'X does Y' (especially in the title) with a more accurate assessment based on the consistency of the effect in the data. Many papers that have attracted interest and raised controversy would then have titles like 'X does Y sometimes, although it has the opposite or no effect at all in many other cases and we don't really know why'. Papers with p-values close to 5% for the main conclusion would likely be entitled 'Limited evidence that we did not just measure random numbers in our attempt to examine whether X does Y'. That way, overselling weak effects to 'glamour' journals, the media and the public would become an entirely new challenge...

Lowering alpha thresholds would exert higher pressure to study bigger, more reliable, effects, and I think that is where the real benefit is. Even with alpha at 1%, sustaining a career or research field based on spurious effects would become much harder (the current '1/20+inflation by p hacking' chance of finding unexpected and exciting results may be too high to deter this).