BishopBlog: May 2017

Sunday, 28 May 2017

Which neuroimaging measures are useful for individual differences research?

The tl;dr version

A neuroimaging measure is potentially useful for individual differences research if variation between people is substantially greater than variation within the same person tested on different occasions. This means that we need to know about the reliability of our measures, before launching into studies of individual differences.
High reliability is not sufficient to ensure a good measure, but it is necessary.
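
One common way to put a number on "variation between people versus variation within the same person" is an intraclass correlation (ICC). The sketch below is illustrative only: it uses the one-way random-effects form, ICC(1,1), and the two small data sets are made up for the example rather than taken from any study.

```python
from statistics import mean

def icc_oneway(scores):
    """One-way random-effects ICC(1,1).

    `scores` is a list of per-person lists, each holding the same number
    (k >= 2) of repeated measurements for one person.
    """
    n = len(scores)
    k = len(scores[0])
    grand = mean(x for person in scores for x in person)
    person_means = [mean(person) for person in scores]
    # Between-person mean square: spread of person means around the grand mean.
    ms_between = k * sum((m - grand) ** 2 for m in person_means) / (n - 1)
    # Within-person mean square: spread of repeats around each person's own mean.
    ms_within = sum(
        (x - m) ** 2 for person, m in zip(scores, person_means) for x in person
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical data: four people, each measured on two occasions.
stable = [[10, 11], [20, 19], [30, 31], [40, 39]]  # people differ, repeats agree
noisy = [[10, 30], [20, 12], [30, 18], [40, 25]]   # repeats disagree within a person

print("stable measure ICC:", round(icc_oneway(stable), 3))
print("noisy measure ICC:", round(icc_oneway(noisy), 3))
```

A value near 1 means most of the variation is between people; a value near (or below) 0 means the repeat measurements on one person differ about as much as measurements on different people.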

Individual differences research

Psychologists have used behavioural measures to study individual differences - in cognition and personality - for many years. The goal is complementary to psychological research that looks for universal principles that guide human behaviour: e.g. factors affecting learning or emotional reactions. Individual differences research also often focuses on underlying causes, looking for associations with genetic, experiential and/or neurobiological differences that could lead to individual differences.

Some basic psychometrics

Suppose I set up a study to assess individual differences in children’s vocabulary. I decide to look at three measures.

Measure A involves asking children to define a predetermined set of words, ordered in difficulty, and scoring their responses by standard criteria.
Measure B involves showing the child pictured objects that have to be named.
Measure C involves recording the child talking with another child and measuring how many different words they use.

For each of these measures, we’d expect to see a distribution of scores, so we could potentially rank order children on their vocabulary ability. But are the three measures equally good indicators of individual differences?

We can see immediately one problem with Test B: the distribution of scores is bunched tightly, so it doesn’t capture individual variation very well. Test C, which has the greatest spread of scores, might seem the most suitable for detecting individual variation. But spread of scores, while important, is not the only test attribute to consider. We also need to consider whether the measure assesses a stable individual difference, or whether it is influenced by random or systematic factors that are not part of what we want to measure.

There is a huge literature addressing this issue, starting with Francis Galton in the 19th century, with major statistical advances in the 1950s and 1960s (see review by Wasserman & Bracken, 2003). The classical view treats test scores as a compound, with a ‘true score’ part, plus an ‘error’ part. We want a measure that minimises the impact of random or systematic error.
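
In symbols, the classical decomposition can be written as below. The notation (X for the observed score, T for the true score, E for error) is my own shorthand, and the variance formula assumes the error is purely random and uncorrelated with the true score; systematic error would need a separate term.

```latex
X = T + E, \qquad
\sigma_X^2 = \sigma_T^2 + \sigma_E^2, \qquad
\text{reliability} = \frac{\sigma_T^2}{\sigma_X^2}
                   = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```

Reliability is then the proportion of observed score variance that is due to true score variance: it approaches 1 as the error variance shrinks and 0 as error dominates.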

If there is a big influence of random error, then the test score is likely to change from one occasion to the next. Suppose we measure the same children on two occasions a month apart on three new tests (D, E and F), and then plot scores on time 1 vs time 2. (To simplify this example, we assume that all three tests have the same normal distribution of scores - the same as for test A in Figure 1 - and that there is an average gain of 10 points from time 1 to time 2.)


Figure 2

We can see that Test F is not very reliable: although there is a significant association between the scores on two test occasions, individual children can show remarkable changes from time to time. If our goal is to measure a reasonably stable attribute of the person, then Test F is clearly not suitable.
Just because a test is reliable, it does not mean it is valid. But if it is not reliable, then it won’t be valid. This is illustrated by a nice figure from https://explorable.com/research-methodology.
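
The kind of test-retest comparison described above can be sketched in a few lines of simulation. This is not the code used to make the figures: the score scale (mean 100, SD 15), the sample size and the three reliability values are all assumptions chosen for illustration. Each simulated test has the same distribution of observed scores and the same 10-point average gain; only the share of random error differs.

```python
import random
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def simulate_retest(n, reliability, gain=10.0, mean_score=100.0, sd=15.0, seed=0):
    """Return (time1, time2) scores for n children on a test with the given reliability."""
    rng = random.Random(seed)
    w_true, w_err = sqrt(reliability), sqrt(1 - reliability)
    time1, time2 = [], []
    for _ in range(n):
        true_z = rng.gauss(0, 1)  # stable part, identical on both occasions
        # Fresh random error is drawn on each occasion.
        time1.append(mean_score + sd * (w_true * true_z + w_err * rng.gauss(0, 1)))
        time2.append(mean_score + gain + sd * (w_true * true_z + w_err * rng.gauss(0, 1)))
    return time1, time2

# Assumed reliabilities for three hypothetical tests.
for reliability in (0.9, 0.6, 0.3):
    t1, t2 = simulate_retest(2000, reliability)
    print(f"assumed reliability {reliability}: "
          f"test-retest r = {pearson(t1, t2):.2f}")
```

With a large sample, the test-retest correlation should land close to the reliability that was built in, which is the sense in which the scatter of points around the line in a time 1 vs time 2 plot reflects measurement error.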

What about change scores?

Sometimes we explicitly want to measure change: for instance, we may be more interested in how quickly a child learns vocabulary, rather than how much they know at some specific point in time. Surely, then, we don’t want a stable measure, as it would not identify the change? Wouldn’t test F be better than D or E for this purpose?

Unfortunately, the logic here is flawed. It’s certainly possible that people may vary in how much they change from time to time, but if our interest is in change, then what we want is a reliable measure of change. There has been considerable debate in the psychological literature as to how best to establish the reliability of a change measure, but the key point is that you can find substantial change in test scores that is meaningless, and that the likelihood of it being meaningless is substantial if the underlying measure is unreliable. The data in Figure 2 were simulated by assuming that all children changed by the same amount from Time 1 to Time 2, but that tests varied in how much random error was incorporated in the test score. If you want to interpret a change score as meaningful, then the onus is on you to convince others that you are not just measuring random error.
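
To make the point concrete, here is a rough sketch of that kind of simulation (again with assumed values: SD of 15, a true gain of exactly 10 points for every child, and two made-up reliability levels). Because every child's true change is identical, any spread in the observed change scores is pure measurement error.

```python
import random
from math import sqrt
from statistics import mean, stdev

def simulate_change(n, reliability, gain=10.0, sd=15.0, seed=1):
    """Observed change scores when every child's true change equals `gain`."""
    rng = random.Random(seed)
    error_sd = sd * sqrt(1 - reliability)
    changes = []
    for _ in range(n):
        # The stable true score cancels out of the difference,
        # leaving the constant gain plus the two error terms.
        e1 = rng.gauss(0, error_sd)
        e2 = rng.gauss(0, error_sd)
        changes.append(gain + e2 - e1)
    return changes

for reliability in (0.9, 0.3):
    changes = simulate_change(2000, reliability)
    declined = sum(c < 0 for c in changes) / len(changes)
    print(f"assumed reliability {reliability}: mean change {mean(changes):.1f}, "
          f"SD of change {stdev(changes):.1f}, "
          f"share apparently declining {declined:.0%}")
```

The less reliable test produces a much wider range of change scores, including many apparent declines, even though no child in the simulation differs from any other in how much they really changed.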

What does this have to do with neuroimaging?

My concern with the neuroimaging literature is that measures from functional or structural imaging are often used to measure individual differences, but it is rare to find any mention of the reliability of those measures. In most cases, we simply don’t have any data on repeated testing using the same measures - or if we do, the sample size is too small, or too selected, to give a meaningful estimate of reliability. Such data as we have don’t inspire confidence that brain measurements achieve the high level of reliability that is aimed for in psychometric tests. This does not mean that these measures are not useful, but it does make them unsuited for the study of individual differences.
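
To see why a small sample gives little information about reliability, it helps to look at the confidence interval around a test-retest correlation. The sketch below uses the standard Fisher z approximation; the observed correlation of 0.7 and the two sample sizes are hypothetical values picked for illustration.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist

def retest_ci(r, n, level=0.95):
    """Approximate confidence interval for a correlation r from n people (n > 3),
    using the Fisher z transformation."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    z = atanh(r)            # transform r to the z scale
    se = 1 / sqrt(n - 3)    # standard error on the z scale
    return tanh(z - z_crit * se), tanh(z + z_crit * se)

# Hypothetical: the same observed test-retest r from a small and a larger sample.
for n in (10, 100):
    low, high = retest_ci(0.7, n)
    print(f"n = {n}: 95% CI for r = 0.7 runs from {low:.2f} to {high:.2f}")
```

With only a handful of participants, the interval is wide enough to be compatible with both poor and excellent reliability, so the estimate tells us very little.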

I hesitated about blogging on this topic, because nothing I am saying here is new: the importance of reliability has been established in the literature on measurement theory since 1950. Yet, when different subject areas evolve independently, it seems that methodological practices that are seen as crucial in one discipline can be overlooked in another that is rediscovering the same issues but with different metrics.

There are signs that things are changing, and we are seeing a welcome trend for neuroscientists to start taking reliability seriously. I started thinking about blogging on this topic just a couple of weeks ago after seeing some high-profile papers that exemplified the problems in this area, but in that period, there have also been some nice studies that are starting to provide information on reliability of neuroscience measures. This might seem like relatively dull science to many, but to my mind it is a key step towards incorporating neuroscience in the study of individual differences. As I commented on Twitter recently, my view is that anyone who wants to use a neuroimaging measure as an endophenotype should first be required to establish that it has adequate reliability for that purpose.

Monday, 1 May 2017

Reproducible practices are the future for early career researchers

This post was prompted by an interesting exchange on Twitter with Brent Roberts (@BrentWRoberts) yesterday. Brent had recently posted a piece about the difficulty of bringing about change to improve reproducibility in psychology, and this had led to some discussion about what could be done to move things forward. Matt Motyl (@mattmotyl) tweeted:

I had one colleague tell me that sharing data/scripts is "too high a bar" and that I am wrong for insisting all students who work w me do it

And Brent agreed:

We were recently told that teaching our students to pre-register, do power analysis, and replicate was "undermining" careers.

Now, as a co-author of a manifesto for reproducible science, this kind of thing makes me pretty cross, and so I weighed in, demanding to know who was issuing such rubbish advice. Brent patiently explained that most of his colleagues take this view and are skeptics, agnostics or just naïve about the need to tackle reproducibility. I said that was just shafting the next generation, but Brent replied:

Not as long as the incentive structure remains the same. In these conditions they are helping their students.

So things have got to the point where I need more than 140 characters to make my case. I should stress that I recognise that Brent is one of the good guys, who is trying to make a difference. But I think he is way too pessimistic about the rate of progress, and far from 'helping' their students, the people who resist change are badly damaging them. So here are my reasons.

1. The incentive structure really is changing. The main drivers are funders, who are alarmed that they might be spending their precious funds on results that are not solid. In the UK, funders (Wellcome Trust and Research Councils) were behind a high profile symposium on Reproducibility, and subsequently have issued statements on the topic and started working to change policies and to ensure their panel members are aware of the issues. One council, the BBSRC, funded an Advanced Workshop on Reproducible Methods this April. In the US, NIH has been at the forefront of initiatives to improve reproducibility. In Germany, Open Science is high on the agenda.

2. Some institutions are coming on board. They react more slowly than funders, but where funders lead, they will follow. Some nice examples of institution-wide initiatives toward open, reproducible science come from the Montreal Neurological Institute and the Cambridge MRC Cognition and Brain Sciences Unit. In my own department, Experimental Psychology at the University of Oxford, our Head of Department has encouraged me to hold a one-day workshop on reproducibility later this year, saying she wants our department to be at the forefront of improving psychological science.

3. Some of the best arguments for working reproducibly have been made by Florian Markowetz. You can read about them on this blog, see him give a very entertaining talk on the topic here, or read the published paper here. So there is no escape. I won't repeat his arguments here, as he makes them better than I could, but his basic point is that you don't need to do reproducible research for ideological reasons: there are many selfish arguments for adopting this approach – in the long run it makes your life very much easier.

4. One point Florian doesn't cover is pre-registration of studies. The idea of a 'registered report', where your paper is evaluated, and potentially accepted for publication, on the basis of introduction and methods was introduced with the goal of improving science by removing publication bias, p-hacking and HARKing (hypothesising after results are known). You can read about it in these slides by Chris Chambers. But when I tried this with a graduate student, Hannah Hobson, I realised there were other huge benefits. Many people worry that pre-registration slows you down. It does at the planning stage, but you more than compensate for that by the time saved once you have completed the study. Plus you get reviewer comments at a point in the research process when they are actually useful – i.e. before you have embarked on data collection. See this blogpost for my personal experience of this.

5. Another advantage of registered reports is that publication does not depend on getting a positive result. This starts to look very appealing to the hapless early career researcher who keeps running experiments that don't 'work'. Some people imagine that this means the literature will become full of boring registered reports with null findings that nobody is interested in. But because that would be a danger, journals that offer registered reports impose a high bar on papers they accept – basically, the usual requirement is that the study is powered at 90%, so that we can be reasonably confident that a negative result is really a null finding, and not just a type II error. But if you are willing to put in the work to do a well-powered study, and the protocol passes scrutiny of reviewers, you are virtually guaranteed a publication.

6. If you don't have time or inclination to go the whole hog with a registered report, there are still advantages to pre-registering a study, i.e. depositing a detailed, time-stamped protocol in a public archive. You still get the benefits of establishing priority of an idea, as well as avoiding publication bias, p-hacking, etc. And you can even benefit financially: the Open Science Framework is running a pre-registration challenge – they are giving $1000 to the first 1000 entrants who succeed in publishing a pre-registered study in a peer-reviewed journal.

7. The final advantage of adopting reproducible and open science practices is that it is good for science. Florian Markowetz does not dwell long on the argument that it is 'the right thing to do', because he can see that it has as much appeal as being told to give up drinking and stop eating Dunkin Donuts for the sake of your health. He wants to dispel the idea that those who embrace reproducibility are some kind of altruistic idealists who are prepared to sacrifice their careers to improve science. Given arguments 1-6, he is quite right. You don't need to be idealistic to be motivated to adopt reproducible practices. But it is nice when one's selfish ambitions can be aligned with the good of the field. Indeed, I'd go further: I've long suspected that this may relate to the growing rates of mental health problems among graduate students and postdocs. Many people who go into science start out with high ideals, but are made to feel they have to choose between doing things properly vs. succeeding by cutting corners, over-hyping findings, or telling fairy tales in grant proposals. The reproducibility agenda provides a way of continuing to do science without feeling bad about yourself.
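
For a sense of what the 90% power requirement in point 5 means in practice, here is a back-of-envelope sample size calculation. It is only a sketch: it assumes a simple two-group comparison of means with a two-sided test at alpha = 0.05, uses the normal approximation (so an exact t-test calculation would ask for slightly more participants), and the effect size of d = 0.5 is an arbitrary example rather than a recommendation.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect_size, power=0.90, alpha=0.05):
    """Approximate participants needed per group to detect a standardised
    mean difference `effect_size` in a two-sided, two-group comparison."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

# Hypothetical effect size of d = 0.5 at two power levels.
for power in (0.80, 0.90):
    print(f"power {power:.0%}: about {n_per_group(0.5, power=power)} per group")
```

Moving from 80% to 90% power raises the required sample noticeably, which is why the bar for a registered report is a high one.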

Brent and Matt are right that we have a problem with the current generation of established academic psychologists, who are either hostile to or unaware of the reproducibility agenda. When I give talks on this topic, I get instant recognition of the issues by early career researchers in the audience, whereas older people can be less receptive. But what we are seeing here is 'survivor bias'. Those who are in jobs managed to succeed by sticking to the status quo, and so see no need for change. But the need for change is all too apparent to the early career researcher who has wasted two years of their life trying to build on a finding that turns out to be a type I error from an underpowered, p-hacked study. My advice to the latter is don't let yourself be scared by dire warnings of the perils of working reproducibly. Times really are changing and if you take heed now, you will be ahead of the curve.
