
Sunday, 29 May 2016

Ten serendipitous findings in psychology

The Thatcher Illusion (see below)
I'm a great fan of pre-registration of studies. It is, to my mind, the most effective safeguard against p-hacking and publication bias, the twin scourges that have led to the literature being awash with false positive findings. When combined with a more formal process, as in Registered Reports, it also allows researchers to benefit from reviewer expertise before they do the study, and to take control of the publication timeline.

But one salient objection to pre-registration comes up time and time again: if we pre-register our studies it will destroy the creative side of doing science, and turn it instead into a dull, robotic, cheerless process. We will have to anticipate what we might find, and close our eyes to what the data tell us.

Now this is both silly and untrue. For a start, nothing stops anyone from doing fairly unstructured exploration, which may be the only sensible approach when entering a completely new area. The main thing in that case is just to be clear that this is what it is, and not to start applying statistical tests to the findings. If a finding has emerged from observing the data, testing it with p-values is statistically illiterate.

Nor is there any prohibition on reporting unexpected findings that emerge in the course of a study. Suppose you do a study with a pre-registered hypothesis and analysis plan, which you adhere to. Meanwhile, a most exciting, unanticipated phenomenon is observed in your experiment. If you are going down the kind of Registered Reports pathway used in Cortex, you report the planned experiment, and then describe the novel finding in a separate section. Hypothesis-testing and exploration are clearly delineated and no p-values are used for the latter.

In fact, with any new exciting observation, any reputable scientist would take steps to check its repeatability, to explore the conditions under which it emerges, and to attempt to develop a theory that can account for it. In effect, all that has happened is that the 'data have spoken' and suggested a new hypothesis, which could potentially be registered and evaluated in the usual way.

But would there be instances of important findings that would have been lost to history if we had started using pre-registration years ago? Because I wanted examples of serendipitous findings to test this point, I asked Twitter, and lo, Twitter delivered some cracking examples. All of these predate the notion of pre-registration by many years, but note that, in all cases, having made the initial unexpected observation – either from unstructured exploratory research, or in the course of investigating something else – the researchers went on to shore up the findings with further, hypothesis-driven experiments. What they did not do was simply report the initial observation, embellished with statistics, and then move on, as if the presence of a low p-value guaranteed the truth of the result.

Here are ten phenomena well-known to psychologists that show how the combination of chance and the prepared mind can lead to important discoveries*. Where I could find one, I cite a primary source, but readers should feel free to contribute further background information.

1. Classical conditioning, Pavlov, 1902. 
The conventional account of Pavlov's discovery goes like this: he was a physiologist interested in processes of digestion and was studying the tendency of dogs to salivate when presented with food. He noted that over time, the dogs would salivate when the lab assistant entered the room, even before the food was presented, thus discovering the 'conditioned response': a response that is learned by association. A recent account is here. I was not able to find any confirmation of the serendipitous event in either Pavlov's Nobel speech, or in his Royal Society obituary, so it would be interesting to know if this is described anywhere in his own writings or those of his contemporaries.

One thing that I did (serendipitously) discover from the latter source was this intriguing detail, which makes it clear that Pavlov would never have had any truck with p-values, even if they had been in use in 1902: "He never employed mathematics even in its elementary form. He frequently said that mathematics is all very well but it confuses clear thinking almost to the same extent as statistics."

Suggested by @speech_woman @smomara1 @AglobeAgog 

2. Psychotropic drugs, 1950s 
Chance appears to have played an important role in the discovery of many psychotropic drugs in the early days of psychopharmacology. For instance, iproniazid, a forerunner of the monoamine oxidase inhibitor antidepressants, was initially used to treat tuberculosis, when it was noticed that it had an unanticipated beneficial effect on mood. Even more striking is Hofmann's first-hand account of discovering the psychotropic effects of LSD, which he had developed as a potential circulatory stimulant. After experiencing strange sensations during a laboratory session, Hofmann returned to test the substances he had been working with, including LSD. "Even the first minimum dose of one quarter of a milligram induced a state of intoxication with very severe psychic disturbances, and this persisted for about 12 hours…. This first planned experiment with LSD was a particularly terrifying experience because at the time, I had no means of knowing if I should ever return to everyday reality and be restored to a normal state of consciousness. It was only when I became aware of the gradual reinstatement of the old familiar world of reality that I was able to enjoy this greatly enhanced visionary experience".

Suggested by @ollirobinson @kealyj @neuroraf 

3. Orientation-sensitive receptive fields in visual cortex, 1959 
In his Nobel speech, David Hubel recounts how he and Torsten Wiesel were trying to plot receptive fields of visual cortex neurons using dots of light projected onto a screen, with only scant success, when they observed a cell that gave a massive response as a slide was inserted, creating a faint but sharp shadow on the retina. As he memorably put it, "over the audiomonitor, the cell went off like a machine gun". This initial observation led to a rich vein of research, but, again to quote from Hubel "It took us months to convince ourselves that we weren’t at the mercy of some optical artefact".

 Suggested by: @jpeelle @Anth_McGregor @J_Greenwood @theExtendedLuke @nikuss @sophiescott, @robustgar 

4. Right ear advantage in dichotic listening, 1961 
Doreen Kimura reported that when groups of digits were played to the two ears simultaneously, more were reported back from the right than the left ear (review here). This method was subsequently used for assessing cerebral lateralisation in neuropsychological patients, and a theory was developed that linked the right ear advantage to cerebral dominance for language. I have not been able to access a published account of the early work, but I recall being told during a visit to the Montreal Neurological Institute that it had taken time for the right ear advantage to be recognised as a real phenomenon and not a consequence of unbalanced headphones. The method of dichotic listening dated back to Broadbent or earlier, but it had originally been used to assess selective attention rather than cerebral lateralisation.

5. Phonological similarity effect in STM, 1964 
Conrad and Hull (1964) described what they termed 'acoustic confusions' when people were recalling short sequences of visually-presented letters, i.e. errors tended to involve letters that rhymed with the target letter, such as P, D, or G. In preparation for an article celebrating Conrad's 100th birthday, I recently listened to a recording of him describing this early work, and explaining that when such errors were observed with auditory presentation, it was assumed they were due to mishearings. Only after further experiments did it become clear that the phenomenon arose in the course of phonological recoding in short-term memory.

6. Hippocampal place cells, 1971 
In his 2014 Nobel lecture, John O'Keefe describes a nice example of unconstrained exploratory research: "… we decided to record from electrodes … as the animal performed simple memory tasks and otherwise went about its daily business. I have to say that at this stage we were very catholic in our approach and expectations and were prepared to see that the cells fire to all types of situations and all types of memories. What we found instead was unexpected and very exciting. Over the course of several months of watching the animals behave while simultaneously listening to and monitoring hippocampal cell activity it became clear that there were two types of cells, the first similar to the one I had originally seen which had as its major correlate some non-specific higher-order aspect of movements, and the second a much more silent type which only sprang into activity at irregular intervals and whose correlate was much more difficult to identify. Looking back at the notes from this period it is clear that there were hints that the animal’s location was important but it was only on a particular day when we were recording from a very clear well isolated cell with a clear correlate that it dawned on me that these cells weren’t particularly interested in what the animal was doing or why it was doing it but rather they were interested in where it was in the environment at the time. The cells were coding for the animal’s location!" Needless to say, once the hypothesis of place cells had been formulated, O'Keefe and colleagues went on to test and develop it in a series of rigorous experiments.

7. McGurk effect, 1976 
In a famous paper, McGurk and MacDonald reported a dramatic illusion: when watching a talking head, in which repeated utterances of the syllable [ba] are dubbed on to lip movements for [ga], normal adults report hearing [da]. Those who recommended this example to me mentioned that the mismatching of lips and voices arose through a dubbing error, and there was even the idea that a technician was disciplined for mixing up the tapes, but I've not found a source for that story. I noted with interest that the Nature paper reporting the findings does not contain a single p-value.
 
Suggested by: @criener @neuroconscience @DrMattDavis 

8. Thatcher illusion, 1980 
Peter Thompson kindly sent me an account of his discovery of the Thatcher Illusion (downloadable from here, p. 921). His goal had been to illustrate how spatial frequency information is used in vision, entailing that viewing the same image close up and at a distance will give very different percepts if low spatial frequencies are manipulated. He decided to illustrate this with pictures of Margaret Thatcher, one of which he doctored to invert the eyes and mouth, creating an impressively hideous image. He went to get sellotape to fix the material in place, but noticed that when he returned, approaching the table from the other side, the doctored images were no longer hideous when inverted. Had he had sellotape to hand, we might never have discovered this wonderful illusion.

Suggested by @J_Greenwood 

9. Repetition blindness, 1987 
Repetition blindness, described here by Nancy Kanwisher, is the phenomenon whereby people have difficulty detecting repeated words that are presented using rapid serial visual presentation (RSVP) - even when the two occurrences are nonconsecutive and differ in case. I could not find a clear account of the history of the discovery, but it seems that researchers investigating a different problem thought that some stimuli were failing to appear, and then realised these were the repeated ones.

Suggested by @PaulEDux 

10. Mirror neurons, 1992 
Giacomo Rizzolatti and colleagues were recording from cells in the macaque premotor cortex that responded when the animal reached for food, or bit a peanut. To their surprise, they noticed that, when testing the animals, the same cell that responded when the monkey picked up a peanut also responded when the experimenter did so (see here for summary). Ultimately, they dubbed these cells 'mirror neurons' because they responded both to the animal's own actions and when the animal observed another performing a similar action. The story that mirror neurons were first identified when they started responding during a coffee break as Rizzolatti picked up his espresso appears to be apocryphal.

Suggested by: @brain_apps @neuroraf @ArranReader @seriousstats @jameskilner @RRocheNeuro 

 *I picked ones that I deemed the clearest and best-known examples. Many thanks to all the people who suggested others.

Friday, 26 July 2013

Why we need pre-registration


There has been a chorus of disapproval this week at the suggestion that researchers should 'pre-register' their studies with journals and spell out in advance the methods and analyses that they plan to do. Those who wish to follow the debate should look at this critique by Sophie Scott, with associated comments, and the responses to it collated here by Pete Etchells. They should also read the explanation of the pre-registration proposals and FAQ by Chris Chambers, something that many participants in the debate appear not to have done.

Quite simply, pre-registration is designed to tackle two problems in scientific publishing:
  • Bias against publication of null results
  • A failure to distinguish hypothesis-generating (exploratory) from hypothesis-testing analyses
Either of these alone is bad for science: the combined effect of both of them is catastrophic, and has led to a situation where research is failing to do its job in terms of providing credible answers to scientific questions.

Null results

Let's start with the bias against null results. Much has been written about this, including by me. But the heavy guns in the argument have been wielded by Ben Goldacre, who has pointed out that, in the clinical trials field, if we only see the positive findings, then we get a completely distorted view of what works, and as a result, people may die. In my field of psychology, the stakes are not normally as high, but the fact remains that there can be massive distortion in our perception of evidence.

Pre-registration would fix this by guaranteeing publication of a paper regardless of how the results turn out. In fact, there is another, less bureaucratic, way the null result problem could be fixed, and that would be by having reviewers decide on a paper's publishability solely on the basis of the introduction and methods. But that would not fix the second problem.

Blurring the boundaries between exploratory and hypothesis-testing analyses

A big problem is that nearly all data analysis is presented as if it is hypothesis-testing when in fact much of it is exploratory.

In an exploratory analysis, you take a dataset and look at it flexibly to see what's there. Like many scientists, I love exploratory analyses, because you don't know what you will find, and it can be important and exciting. I suspect it is also something that you get better at as you get more experienced, and more able to see the possibilities in the numbers. But my love of exploratory analyses is coupled with a nervousness. With an exploratory analysis, whatever you find, you can never be sure it wasn't just a chance result. Perhaps I was lucky in having this brought home to me early in my career, when I had an alphabetically ordered list of stroke patients I was planning to study, and I happened to notice that those with names in the first half of the alphabet had left hemisphere lesions and those with names in the second half had right hemisphere lesions. I even did a chi-square test and found it was highly significant. Clearly this was nonsense, and just one of those spurious things that can turn up by chance.

These days it is easy to see how often meaningless 'significant' results occur by running analyses on simulated data; see this blogpost, for instance. In my view, all statistics classes should include such exercises, along the lines of the sketch below.
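Here is a minimal sketch of such an exercise in Python, echoing the stroke-patient anecdote above. The details (sample size, the alphabet-half versus lesion-side framing) are my own assumptions rather than anything taken from the linked blogpost: we generate data in which no association exists at all, test for one, and repeat.

# A classroom-style simulation: no true association exists, yet an
# uncorrected chi-square test comes out 'significant' about 5% of the time.
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(42)
n_sims, n_patients = 10_000, 40            # hypothetical choices
false_alarms = 0

for _ in range(n_sims):
    half = rng.integers(0, 2, n_patients)  # first vs second half of alphabet
    side = rng.integers(0, 2, n_patients)  # left vs right hemisphere lesion
    table = np.zeros((2, 2), dtype=int)    # build the 2x2 contingency table
    np.add.at(table, (half, side), 1)
    chi2, p, dof, expected = chi2_contingency(table, correction=False)
    if p < .05:
        false_alarms += 1

print(f"'Significant' associations in pure noise: {false_alarms / n_sims:.1%}")

That 5% is the agreed price of a single pre-specified test; the trouble starts when many such looks at the data are taken and only the 'significant' ones get reported.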

So you've done your exploratory analysis, got an exciting finding, but are nervous as to whether it is real. What do you do? The answer is you need a confirmatory study. In the field of genetics, failure to realise this led to several years of stasis, cogently described by Flint et al (2010). Genetics really highlights the problem, because of the huge numbers of possible analyses that can be conducted. What was quickly learned was that most exciting effects don't replicate. The bar has accordingly been set much higher, and most genetics journals won't consider publishing a genetic association unless replication has been demonstrated (Munafò & Flint, 2011). This is tough, but it has meant that we can now place confidence in genetics results. (It has also had the positive side-effect of encouraging more collaboration between research groups.) Unfortunately, those outside the field of genetics are often unaware of these developments, and we are seeing increasing numbers of genetic association studies being published in the neuroscience literature, with tiny samples and no replication.

The important point to grasp is that the meaning of a p-value is completely different if it emerges when testing an a priori prediction, compared with when it is found in the course of conducting numerous analyses of a dataset. Here, for instance, are outputs from 15 runs of a 4-way ANOVA on random data, as described here:
Each row shows the p-values for the outputs (main effects then interactions) from one run of a 4-way ANOVA on a new set of random data. For a slightly more legible version see here.

If I approached a dataset specifically testing the hypothesis that there would be an interaction between group and task, then the chance of a p-value of .05 or less would be 1 in 20 (as can be confirmed by repeating the simulation thousands of times; with only a small number of runs it is less easy to see). But if I just looked for significant findings, it's not hard to find something on most of these runs. An exploratory analysis is not without value, but its value is in generating hypotheses that can then be tested in an a priori design.
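The same point can be checked directly with a simulation along these lines. This is a minimal sketch, and the design details (a 2x2x2x2 layout with four observations per cell) are my own assumptions rather than the settings of the original script; any fully crossed 4-way design gives the same 15 tests per run.

# Run a 4-way ANOVA on pure noise repeatedly and count how often at least
# one of the 15 effects (4 main effects, 11 interactions) has p < .05.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2013)
n_runs, n_per_cell = 200, 4               # assumed settings
cells = [(a, b, c, d) for a in "xy" for b in "xy"
         for c in "xy" for d in "xy"]     # 16 cells of a 2x2x2x2 design
hits = 0

for _ in range(n_runs):
    df = pd.DataFrame(cells * n_per_cell, columns=["f1", "f2", "f3", "f4"])
    df["y"] = rng.normal(size=len(df))    # the outcome is random noise
    fit = smf.ols("y ~ f1 * f2 * f3 * f4", data=df).fit()
    pvals = anova_lm(fit)["PR(>F)"].drop("Residual")  # 15 p-values per run
    if (pvals < .05).any():
        hits += 1

print(f"Runs with at least one 'significant' effect: {hits / n_runs:.0%}")

If the 15 tests were independent, the expected rate would be 1 - 0.95^15, roughly 54%; the shared error term makes them somewhat correlated, but around half of the runs still yield at least one 'significant' effect.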

So replication is needed to deal with the uncertainties around exploratory analysis. How does pre-registration fit into the picture? Quite simply, it makes explicit the distinction between hypothesis-generating (exploratory) and hypothesis-testing research, which is currently completely blurred. As in the example above, if you tell me in advance what hypothesis you are testing, then I can place confidence in the uncorrected statistical probabilities associated with the predicted effects. If you haven't predicted anything in advance, then I can't.

This doesn't mean that the results from exploratory analyses are necessarily uninteresting, untrue, or unpublishable, but it does mean we should interpret them as what they are: hypothesis-generating rather than hypothesis-testing.

I'm not surprised at the outcry against pre-registration. This is mega. It would require most of us to change our behaviour radically. It would turn on its head the criteria used to evaluate findings: well-conducted replication studies, currently often unpublishable, would be seen as important, regardless of their results. On the other hand, it would no longer be possible to report exploratory analyses as if they were hypothesis-testing. In my view, unless we do this we will continue to waste time and precious research funding chasing illusory truths.

References

Flint, J., Greenspan, R. J., & Kendler, K. S. (2010). How Genes Influence Behavior. Oxford: Oxford University Press.

Munafò, M. R., & Flint, J. (2011). Dissecting the genetic architecture of human personality. Trends in Cognitive Sciences, 15(9), 395-400. doi:10.1016/j.tics.2011.07.007