Saturday, 17 August 2013

Changing children's brains

Portraits of Serafino & Francesco Falzacappa; Pier Leone Ghezzi [1674 - 1755]
Source: J. Paul Getty Museum


Our children are being exposed to an experience that alters their brains in ways we do not fully understand. There is now strong evidence that patterns of brain connectivity are different in individuals who have been exposed compared to those who have not [1]. Influential figures have expressed concern that people’s memories will be restricted by this experience, which removes the need for them to memorise material and allows them to look things up instead [2]. And indeed, there is clear evidence of changes in cognitive processing, as predicted [3]. Furthermore, instead of learning in a social context, our children are increasingly being encouraged to engage in solitary activities that deprive them of the benefits of interacting with other people. And, rather than embracing traditional influences, they are exposed to alien ideas from other cultures [4,5].

Does this sound familiar? Are you thinking computer games, iPads, smartphones? If so, then consider: we are changing children’s brains, altering their memories, and influencing their ideas by exposing them to books.

References
1. Dehaene, Stanislas, Pegado, Felipe, Braga, Lucia W., Ventura, Paulo, Filho, Gilberto Nunes, Jobert, Antoinette, Dehaene-Lambertz, Ghislaine, Kolinsky, Régine, Morais, José, & Cohen, Laurent (2010). How learning to read changes the cortical networks for vision and language. Science, 330, 1359-1364. DOI: 10.1126/science.1194140
2. http://outofthejungle.blogspot.co.uk/2007/11/socrates-objections-to-writing.html
3. Ong, W. J. (1982). Orality and Literacy. London and New York: Routledge.
4. http://yalebooks.wordpress.com/2012/06/11/a-history-of-women-readers-belinda-jack-discusses-the-relationship-between-gender-and-literacy/
5. Nafisi, A. (2003). Reading Lolita in Tehran. New York: Random House.



Sunday, 11 August 2013

The arcuate fasciculus and word learning: a critique

The arcuate fasciculus is a white matter tract linking areas in the temporal lobe involved in interpreting speech with areas in the frontal lobe that control motor movements. Its role in language was established years ago when it was proposed that conduction aphasia, characterised by poor repetition despite good understanding and fluent spontaneous speech, was a disconnection syndrome resulting from lesions of the arcuate fasciculus.

Compared with apes and monkeys, humans have much stronger structural connections between temporal and frontal regions of the brain, suggesting that evolution of the arcuate fasciculus played a key role in language evolution.

Study of white matter tracts in the brain has advanced rapidly since the advent of diffusion tensor imaging (DTI). DTI makes it possible to measure parameters such as fractional anisotropy and radial diffusivity, indirect measures of myelination and/or axonal density within white matter.

Use of DTI has revealed an intriguing aspect of the arcuate fasciculus: it shows wide individual variation. In most people, the left arcuate fasciculus is larger than the right, but in some a more bilateral pattern is seen, and in others, a right arcuate fasciculus may not be visible on DTI. This immediately raises the question of whether this individual variation corresponds to functional differences in language ability. Two studies considered whether the degree of lateralisation of the arcuate fasciculus was related to language level, but they obtained conflicting results. Lebel and Beaulieu (2009) found that laterality of the arcuate fasciculus, measured on diffusion tensor imaging, was modestly correlated (r = 0.32) with receptive vocabulary in 68 children, with the highest scores for those with strong left lateralisation. However, a study of adults found no relation between left lateralisation of the arcuate fasciculus and vocabulary; instead, higher verbal memory was found to be associated with weak lateralisation.

A couple of weeks ago, López-Barroso et al published a paper in the Proceedings of the National Academy of Sciences claiming that structural and functional  measures of the left arcuate fasciculus predicted word learning ability. The authors started with 27 young adults who had brain scans that yielded measures of structural and functional connectivity between temporal and frontal language areas of the brain. Twenty of these individuals also did a learning task while in the scanner. They heard a rapid sequence of novel words, each consisting of three syllables, and were asked to concentrate on them, as they would be asked to recognise them later. After this learning phase, they were presented with the same nonwords mixed in with other nonwords made from the same syllables in a different order, and were asked to make a left or right keypress to indicate if each item was familiar or not. Their responses were transformed into a measure called d-prime, which indicates how well the person discriminates between familiar and unfamiliar items.
Figure 1A from López-Barroso et al, showing the learning task 

From previous research, one might have expected to see an association between nonword learning and lateralisation of the arcuate fasciculus. This was not found, but accuracy in the nonword learning task was significantly correlated with structural and functional measures of strength of connectivity in the left hemisphere. The authors’ conclusion is given in the title of the paper: “Word learning is mediated by the left arcuate fasciculus”.

Given what we know about the arcuate fasciculus, this is a plausible finding, but how robust is the evidence? I think there are at least three problems with this study, which lead me to be cautious about accepting its claims.

First, there is the perennial problem of multiple comparisons. The authors considered three different DTI measures (number of streamlines, fractional anisotropy and radial diffusivity) for left and right sides of four tracts (arcuate long, arcuate anterior, arcuate posterior, and inferior fronto-occipital fasciculus). They used, however, a Bonferroni correction appropriate for 8 correlations (p = .0062) rather than for 24 correlations (p = .002). None of the reported correlations is significant if the appropriate correction is used.
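The arithmetic is easy to check in R; the 8 and 24 below are my reading of where those counts come from, not code from the paper:

    alpha <- 0.05
    # correction the authors appear to have applied: 2 hemispheres x 4 tracts = 8 tests
    alpha / 8    # 0.00625
    # correction for the full set: 3 DTI measures x 2 hemispheres x 4 tracts = 24 tests
    alpha / 24   # 0.00208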

Second, the authors emphasised that the correlation between word learning and radial diffusivity was significant only for the direct arcuate tract in the left hemisphere. This, however, confuses a difference in significance levels with a significant difference between effects: as Nieuwenhuis et al (2011) remarked: "when making a comparison between two effects, researchers should report the statistical significance of their difference rather than the difference between their significance levels". Table 1 shows the correlations of radial diffusivity with nonword learning for different regions, with 95% confidence intervals added, and it is clear that these intervals overlap substantially. In other words, the correlations do not differ significantly from one another. See here for further discussion of these issues.
Table 1: Correlations (r) between nonword learning and radial diffusivity in different pathways, with 95% confidence intervals
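For readers who want to recompute intervals like these, a 95% confidence interval for a correlation can be obtained via Fisher's z transformation. The sketch below is generic R code, not the authors' analysis; r = .42 and N = 19 are used purely as illustrative values (they recur in the next paragraph):

    # 95% confidence interval for a correlation, via Fisher's z transformation
    r_ci <- function(r, n, conf = 0.95) {
      z    <- atanh(r)                    # Fisher z transform of r
      se   <- 1 / sqrt(n - 3)             # standard error of z
      crit <- qnorm(1 - (1 - conf) / 2)   # 1.96 for a 95% interval
      tanh(z + c(-1, 1) * crit * se)      # back-transform the limits to the r scale
    }

    r_ci(0.42, 19)   # roughly -0.04 to 0.73: a wide interval with only 19 participants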
In this study, the problem is compounded by the fact that different subsets of individuals are included in the correlations for different brain regions. It is not unusual to have to exclude participants from DTI studies because of measurement difficulties, but it does mean that when comparing one brain region with another one is not comparing like with like. And since statistical significance depends on sample size, variation in N from brain region to brain region further complicates interpretation. This is evident from Figures 2 and 3 of the López-Barroso et al paper: in both cases the absolute value of the correlation is .42, yet for radial diffusivity of the right posterior segment it is dismissed as nonsignificant (with N = 19), whereas for the fMRI analysis it is heralded as significant (with N = 25).
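The dependence of significance on sample size is easy to verify. Assuming a standard two-tailed test of a Pearson correlation against zero, the same r = .42 falls on opposite sides of p = .05 with these two sample sizes:

    # two-tailed p-value for a Pearson correlation of r with n participants
    p_for_r <- function(r, n) {
      t <- r * sqrt(n - 2) / sqrt(1 - r^2)
      2 * pt(-abs(t), df = n - 2)
    }

    p_for_r(0.42, 19)   # ~0.07: 'nonsignificant' with N = 19
    p_for_r(0.42, 25)   # ~0.04: 'significant' with N = 25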

To establish what the results would look like if the same subset of participants was used in all analyses, I requested the raw data for radial diffusivity from the first author, who kindly provided it. There were just 13 participants with DTI data for all brain regions: if the analysis was restricted to them, then just one of the correlations with word learning was significant by the authors' criterion, that with the right posterior arcuate fasciculus (r = .73, p = .005). This analysis does not prove that this pathway is important: rather, it emphasises that a similar pattern of associations is seen in all pathways, and that the study is underpowered to detect reliable associations, particularly if the interest is in selective associations with one pathway and not another.

Perhaps of greatest concern, though, is the measure of ‘word learning’. For a start, this was not word learning in the usual sense, as the participants were not required to associate speech sounds with meanings. Instead, they had to recognise familiar strings of meaningless sounds. There is a serious oddity about the results. Measures of d-prime usually range from zero (no ability to discriminate familiar from unfamiliar items, i.e. chance performance) to 2 or 3 (highly significant ability to discriminate familiar from unfamiliar items). But in this study, five of the twenty participants obtained negative values of d-prime. A negative value means performance is below chance: i.e., the person was more likely to treat the unfamiliar items as familiar, and vice versa. This is frankly weird, and makes one wonder whether some participants simply got confused about which key corresponded to which response. The authors give a different explanation: “Negative values indicate discrimination is achieved but individuals segmented incorrectly, classifying nonwords as words of the artificial language.” I find this unconvincing, as it would only make sense if the distractor items were made by taking sequences from the original input that crossed word boundaries: this does not seem to have been the case. But even if it were the explanation, does it make sense to treat those who discriminate the nonwords, but segment them wrongly, as doing worse on word learning than those who don’t discriminate the nonwords at all?
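For readers unfamiliar with the measure: d-prime is the difference between the z-transformed hit rate and false alarm rate, so it goes negative as soon as false alarms outnumber hits. The rates in this sketch are invented for illustration, not taken from the paper:

    # d-prime from hit rate and false-alarm rate
    dprime <- function(hit_rate, fa_rate) qnorm(hit_rate) - qnorm(fa_rate)

    dprime(0.8, 0.3)   #  1.37: discriminates familiar from unfamiliar items
    dprime(0.5, 0.5)   #  0.00: chance performance
    dprime(0.3, 0.8)   # -1.37: below chance, as when responses are systematically reversed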

Does this matter? I re-ran the correlations excluding the four participants with a d-prime value below -0.42 (which, as far as I can work out, corresponds to below-chance performance). The correlations no longer reached conventional levels of statistical significance, and the largest value was now for a right-sided pathway. This is pretty meaningless, however, because the sample size, already small, becomes so tiny that one cannot do an adequately powered test of the association. The best one can say is that ‘further data are needed’.

I hope the authors will look further at this issue, as the role of the arcuate fasciculus in language  learning is fascinating and potentially important. One possibility would be to look at the associations between vocabulary level and analogous connectivity measures in the sample of 50 adults reported by Catani et al (2007), where the same DTI methods were used.

After I had drafted this critique, I Googled to see if anyone else had blogged about this study. I didn’t find blogs, but I did find extensive media coverage. I was astonished to see that, in discussing implications of this study, one of the authors, Marco Catani, a respected expert in tractography, appeared to be channeling Susan Greenfield. He was quoted as claiming that children’s vocabularies will be restricted by their use of iPads. The newspapers have picked up on these quotes, coming out with headlines such as: “Experts say too much time is spent learning via tablets and computers. Children's vocabulary could be stunted because they listen to teachers and parents less.”  For further sensationalist and misleading accounts, see here and here.

Just to be clear, this was a study looking at structural and functional brain connectivity in relation to a task that involved extracting syllabic patterns from auditory input. It did not feature children, vocabulary learning or iPads.

It really does a disservice to families of children with language learning problems to come out with scaremongering claims about modern technology on the basis of no hard evidence. And, for the record, auditory input is not the only way to learn new words: reading provides an  increasingly important route for vocabulary learning as children grow older.



Reference
López-Barroso, D., Catani, M., Ripollés, P., Dell'Acqua, F., Rodríguez-Fornells, A., & de Diego-Balaguer, R. (2013). Word learning is mediated by the left arcuate fasciculus. Proceedings of the National Academy of Sciences of the United States of America, 110(32), 13168-73. PMID: 23884655

Friday, 26 July 2013

Why we need pre-registration


There has been a chorus of disapproval this week at the suggestion that researchers should 'pre-register' their studies with journals and spell out in advance the methods and analyses that they plan to do. Those who wish to follow the debate should look at this critique by Sophie Scott, with associated comments, and the responses to it collated here by Pete Etchells. They should also read the explanation of the pre-registration proposals and FAQ  by Chris Chambers - something that many participants in the debate appear not to have done.

Quite simply, pre-registration is designed to tackle two problems in scientific publishing:
  • Bias against publication of null results
  • A failure to distinguish hypothesis-generating (exploratory) from hypothesis-testing analyses
Either of these alone is bad for science: the combined effect of both of them is catastrophic, and has led to a situation where research is failing to do its job in terms of providing credible answers to scientific questions.

Null results

Let's start with the bias against null results. Much has been written about this, including by me. But the heavy guns in the argument have been wielded by Ben Goldacre, who has pointed out that, in the clinical trials field, if we only see the positive findings, then we get a completely distorted view of what works, and as a result, people may die. In my field of psychology, the stakes are not normally as high, but the fact remains that there can be massive distortion in our perception of evidence.

Pre-registration would fix this by guaranteeing publication of a paper regardless of how the results turn out. In fact, there is another, less bureaucratic, way the null result problem could be fixed, and that would be by having reviewers decide on a paper's publishability solely on the basis of the introduction and methods. But that would not fix the second problem.

Blurring the boundaries between exploratory and hypothesis-testing analyses

A big problem is that nearly all data analysis is presented as if it is hypothesis-testing when in fact much of it is exploratory.

In an exploratory analysis, you take a dataset and look at it flexibly to see what's there. Like many scientists, I love exploratory analyses, because you don't know what you will find, and it can be important and exciting. I suspect it is also something that you get better at as you get more experienced, and more able to see the possibilities in the numbers. But my love of exploratory analyses is coupled with a nervousness. With an exploratory analysis, whatever you find, you can never be sure it wasn't just a chance result. Perhaps I was lucky in having this brought home to me early in my career, when I had an alphabetically ordered list of stroke patients I was planning to study, and I happened to notice that those with names in the first half of the alphabet  had left hemisphere lesions and those with names in the second half had right hemisphere lesions. I even did a chi square test and found it was highly significant. Clearly this was nonsense, and just one of those spurious things that can turn up by chance.

These days it is easy to see how often meaningless 'significant' results occur by running analyses on simulated data - see this blogpost for instance. In my view, all statistics classes should include such exercises.
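As an illustration of the kind of exercise this might involve, the sketch below (generic R code, with made-up numbers, not tied to any particular study) generates pure noise for 30 'participants' on ten unrelated variables and counts how many of the 45 pairwise correlations come out 'significant' at p < .05. Run it a few times and spurious findings appear with depressing regularity:

    set.seed(1)                      # change or drop the seed to see different runs
    n_subjects  <- 30
    n_variables <- 10                # e.g. ten unrelated test scores
    sim <- matrix(rnorm(n_subjects * n_variables), nrow = n_subjects)

    # p-values for all 45 pairwise correlations among the ten variables
    pvals <- c()
    for (i in 1:(n_variables - 1)) {
      for (j in (i + 1):n_variables) {
        pvals <- c(pvals, cor.test(sim[, i], sim[, j])$p.value)
      }
    }

    sum(pvals < 0.05)       # how many 'significant' correlations in pure noise?
    1 - 0.95^length(pvals)  # ~0.90: chance of at least one, if the tests were independent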

So you've done your exploratory analysis, got an exciting finding, but are nervous as to whether it is real. What do you do? The answer is you need a confirmatory study. In the field of genetics, failure to realise this led to several years of stasis, cogently described by Flint et al (2010). Genetics really highlights the problem, because of the huge numbers of possible analyses that can be conducted. What was quickly learned was that most exciting effects don't replicate. The bar has accordingly been set much higher, and most genetics journals won't consider publishing a genetic association unless replication has been demonstrated (Munafo & Flint, 2011). This is tough, but it has meant that we can now place confidence in genetics results. (It also has had a positive side-effect of encouraging more collaboration between research groups). Unfortunately, those outside the field of genetics are unaware of these developments, and we are seeing increasing numbers of genetic association studies being published in the neuroscience literature, with tiny samples and no replication.

The important point to grasp is that the meaning of a p-value is completely different if it emerges when testing an a priori prediction, compared with when it is found in the course of conducting numerous analyses of a dataset. Here, for instance, are outputs from 15 runs of a 4-way ANOVA on random data, as described here:
Each row shows the p-values for the outputs (main effects then interactions) from one run of a 4-way ANOVA on a new set of random data. For a slightly more legible version see here

If I approached a dataset specifically testing the hypothesis that there would be an interaction between group and task, then the chance of a p-value of .05 or less would be 1 in 20  (as can be confirmed by repeating the simulation thousands of times - in a small number of runs it's less easy to see). But if I just looked for significant findings, it's not hard to find something on most of these runs. An exploratory analysis is not without value, but its value is in generating hypotheses that can then be tested in an a priori design.

So replication is needed to deal with the uncertainties around exploratory analysis. How does pre-registration fit into the picture? Quite simply, it makes explicit the distinction between hypothesis-generating (exploratory) and hypothesis-testing research, which is currently completely blurred. As in the example above, if you tell me in advance what hypothesis you are testing, then I can place confidence in the uncorrected statistical probabilities associated with the predicted effects. If you haven't predicted anything in advance, then I can't.

This doesn't mean that the results from exploratory analyses are necessarily uninteresting, untrue, or unpublishable, but it does mean we should interpret them as what they are: hypothesis-generating rather than hypothesis-testing.

I'm not surprised at the outcry against pre-registration. This is mega. It would require most of us to change our behaviour radically. It would turn on its head the criteria used to evaluate findings: well-conducted replication studies, currently often unpublishable,  would be seen as important, regardless of their results. On the other hand, it would no longer be possible to report exploratory analyses as if they are hypothesis-testing. In my view, unless we do this we will continue to waste time and precious research funding chasing illusory truths.

References

Flint, J., Greenspan, R. J., & Kendler, K. S. (2010). How Genes Influence Behavior. Oxford University Press.

Munafo, M., & Flint, J. (2011). Dissecting the genetic architecture of human personality. Trends in Cognitive Sciences, 15(9), 395-400. DOI: 10.1016/j.tics.2011.07.007

Friday, 21 June 2013

Discussion meeting vs conference: in praise of slower science

Pompeii mosaic
Plato conversing with his students
As time goes by, I am increasingly unable to enjoy big conferences. I'm not sure how much it's a change in me or a change in conferences, but my attention span shrivels after the first few talks. I don't think I'm alone. Look around any conference hall and everywhere you'll see people checking their email or texting. I usually end up thinking I'd be better off staying at home and just reading stuff.

All this made me start to wonder, what is the point of conferences?  Interaction should be the key thing that a conference can deliver. I have in the past worked in small departments, grotting away on my own without a single colleague who is interested in what I'm doing. In that situation, a conference can reinvigorate your interest in the field, by providing contact with like-minded people who share your particular obsession. And for early-career academics, it can be fascinating to see the big names in action. For me, some of the most memorable and informative experiences at conferences came in the discussion period. If X suggested an alternative interpretation of Y's data, how did Y respond: with good arguments or with evasive arrogance? And how about the time that Z noted important links between the findings of X and Y that nobody had previously been aware of, and the germ of an idea for a new experiment was born?

I think my growing disaffection with conferences is partly fuelled by a decline in the amount and standard of discussion at such events. There's always a lot to squeeze in, speakers will often over-run their allocated time, and in large meetings, meaningful discussion is hampered by the acoustic limitations of large auditoriums. And there's a psychological element too: many people dislike public discussion, and are reluctant to ask questions for fear of seeming rude or self-promotional (see comments on this blogpost for examples). Important debate between those doing cutting-edge work may take place at the conference, but it's more likely to involve a small group over dinner than those in the academic sessions.

Last week, the Royal Society provided the chance for me, together with Karalyn Patterson and Kate Nation, to try a couple of different formats that aimed to restore the role of discussion in academic meetings. Our goal was to bring together researchers from two fields that were related but seldom made contact: acquired and developmental language disorders. Methods and theories in these areas have evolved quite separately, even though the phenomena they deal with overlap substantially.

The Royal Society asks for meeting proposals twice a year, and we were amazed when they not only approved our proposal, but suggested we should have both a Discussion Meeting at the Royal Society in London, and a smaller Satellite meeting at their conference centre at Chicheley Hall in the Buckinghamshire countryside.

We wanted to stimulate discussion, but were aware that if we just had a series of talks by speakers from the two areas, they would probably continue as parallel, non-overlapping streams. So we gave them explicit instructions to interact. For the Discussion meeting, we paired up speakers who worked on similar topics with adults or children, and encouraged them to share their paper with their "buddy" before the meeting. They were asked to devote the last 5-10 minutes of their talk to considering the implications of their buddy's work for their own area. We clearly invited the right people, because the speakers rose to this challenge magnificently. They also were remarkable in all keeping to their allotted 30 minutes, allowing adequate time for discussion. And the discussion really did work: people seemed genuinely fired up to talk about the implications of the work, and the links between speakers, rather than scoring points off each other.

After two days in London, a smaller group of us, feeling rather like a school party, were wafted off to Chicheley in a special Royal Society bus. Here we were going to be even more experimental in our format. We wanted to focus more on early-career scientists, and thanks to generous funding from the Experimental Psychology Society, we were able to include a group of postgrads and postdocs. The programme for the meeting was completely open-ended. Apart from a scheduled poster session, giving the younger people a chance to present their work, we planned two full days of nothing but discussion. Session 1 was the only one with a clear agenda: it was devoted to deciding what we wanted to talk about.

We were pretty nervous about this: it could have been a disaster. What if everyone ran out of things to say and got bored? What if one or two loud-mouths dominated the discussion? Or maybe most people would retire to their rooms and look at email. In fact, the feedback we've had concurs with our own impressions that it worked brilliantly. There were a few things that helped make it a success.
  • The setting, provided by the Royal Society, was perfect. Chicheley Hall is a beautiful stately home in the middle of nowhere. There were no distractions, and no chance of popping out to do a bit of shopping. The meeting spaces were far more conducive to discussion than a traditional lecture theatre.
  • The topic, looking for shared points of interest in two different research fields, encouraged a collaborative spirit, rather than competition.
  • The people were the right mix. We'd thought quite carefully about who to invite; we'd gone for senior people whose natural talkativeness was powered by enthusiasm rather than self-importance. People had complementary areas of expertise, and everyone, however senior, came away feeling they'd learned something.
  • Early-career scientists were selected from those applying, on the basis that their supervisor indicated they had the skills to participate fully in the experience. Nine of them were selected as rapporteurs, and were required to take notes in a break-out session, and then condense 90 minutes of discussion into a 15-minute summary for the whole group.  All nine were quite simply magnificent in this role, and surpassed our expectations. The idea of rapporteurs was, by the way, stimulated by experience at Dahlem conferences, which pioneered discussion-based meetings, and subsequent Strüngmann forums, which continue the tradition.
  • Kate Nation noted that at the London meeting, the discussion had been lively and enjoyable, but largely excluded younger scientists. She suggested that for our discussions at Chicheley, nobody over the age of 40 should be allowed to talk for the first 10 minutes. The Nation Rule proved highly effective - occasionally broken, but greatly appreciated by several of the early career scientists, who told us that they would not have spoken out so much without this encouragement.
I was intrigued to hear from Uta Frith that there is a Slow Science movement, and I felt the whole experience fitted with their ethos: encouraging people to think about science rather than frenetically rushing on to the next thing. Commentary on this has focused mainly on the day-to-day activities of scientists and publication practices (Lutz, 2012). I haven't seen anything specifically about conferences from the Slow Science movement (and since they seem uninterested in social media, it's hard to find out much about them!), but I hope that we'll see more meetings like this, where we all have time to pause, ponder and discuss ideas.  

Reference
Lutz, J. (2012). Slow science. Nature Chemistry, 4(8), 588-589. DOI: 10.1038/nchem.1415

Monday, 17 June 2013

Research fraud: More scrutiny by administrators is not the answer

I read this piece in the Independent this morning and an icy chill gripped me. Fraudulent researchers have been damaging Britain's scientific reputation and we need to do something. But what? Sadly, it sounds like the plan is to do what is usually done when a moral panic occurs: increase the amount of regulation.

So here is my, very quick, response – I really have lots of other things I should be doing, but this seemed urgent, so apologies for typos etc.

According to the account in the Independent, universities will not be eligible for research funding unless they sign up to a Concordat for Research Integrity, which entails, among other things, that they "will have to demonstrate annually that each team member’s graphs and spreadsheets are precisely correct."

We already have massive regulation around the ethics of research on human participants that works on the assumption that nobody can be trusted, so we all have to do mountains of paperwork to prove we aren't doing anything deceptive or harmful. 

So, you will ask, am I in favour of fraud and sloppiness in research? Of course not. Indeed, I devote a fair part of my blog to criticisms of what I see as dodgy science: typically, not outright fraud, but rather over-hyped or methodologically weak work, which is, to my mind, a far greater problem. I agree we need to think about how to fix science, and that many of our current practices lead to non-replicable findings. I just don't think more scrutiny by administrators is the solution. To start scrutinising datasets is just silly: this is not where the problem lies.

So what would I do? The answers fall into three main categories: incentives, publication practices, and research methods.

Incentives is the big one. I've been arguing for years that our current reward system distorts and damages science. I won't rehearse the arguments again: you can read them here. The current Research Excellence Framework is, to my mind, an unnecessary exercise that further discourages researchers from doing slow and careful work. My first recommendation is therefore that we ditch the REF and use simpler metrics to allocate research funding to universities, freeing up a great deal of time and money, and improving the security of research staff. Currently, we have a situation where research stardom, assessed by REF criteria, is all-important. Instead of valuing papers in top journals, we should be valuing research replicability.

Publication practices are problematic, mainly because the top journals prioritize exciting results over methodological rigour. There is therefore a strong temptation to do post hoc analyses of data until an exciting result emerges. Pre-registration of research projects has been recommended as a way of dealing with this - see this letter to the Guardian on which I am a signatory.  It might be even more effective if research funders adopted the practice of requiring researchers to specify the details of their methods and analyses in advance on a publicly-available database. And once the research was done, the publication should contain a link to a site where data are openly available for scrutiny – with appropriate safeguards about conditions for re-use.

As regards research methods, we need better training of scientists to become more aware of the limitations of the methods that they use. Too often statistical training is a dry and inaccessible discipline. All scientists should be taught how to generate random datasets: nothing is quite as good at instilling a proper understanding of p-values as seeing the apparent patterns in data that will inevitably arise if you look hard enough at some random numbers. In addition, not enough researchers receive training in best practices for ensuring quality of data entry, or in exploratory data analysis to check the numbers are coherent and meet assumptions of the analytic approach.

In my original post on expansion of regulators, I suggested that before a new regulation is introduced, there should be a cold-blooded cost-benefit analysis that considers, among other things, the cost of the regulation both in terms of the salaries of people who implement it, and the time and other costs to those affected by it. My concern is that among the 'other costs' is something rather nebulous that could easily get missed. Quite simply, doing good research takes time and mental space of the researchers. Most researchers are geeks who like nothing better than staring at data and thinking about complicated problems. If you require them to spend time satisfying bureaucratic requirements, this saps the spirit and reduces creativity.

I think we can learn much from the way ethics regulations have panned out. When a new system was first introduced in response to the Alder Hey scandal, I'm sure many thought it was a good idea. It has taken several years for the full impact to be appreciated. The problems are documented in a report by the Academy of Medical Sciences, which noted "Urgent changes are required to the regulation and governance of health research in the UK because unnecessary delays, bureaucracy and complexity are stifling medical advances, without additional benefits to patient safety"

If the account in the Independent is to be believed, then the Concordat for Research Integrity could lead to a similar outcome. I'm glad I will retire before it is fully implemented.

Sunday, 16 June 2013

Overhyped genetic findings: the case of dyslexia

A press release by Yale University Press Office was recently recycled on the Research Blogging website*, announcing that their researchers had made a major breakthrough. Specifically they said "A new study of the genetic origins of dyslexia and other learning disabilities could allow for earlier diagnoses and more successful interventions, according to researchers at Yale School of Medicine. Many students now are not diagnosed until high school, at which point treatments are less effective." The breathless account by the Press Office is hard to square with the abstract of the paper, which makes no mention of early diagnosis or intervention, but rather focuses on characterising a putative functional risk variant in the DCDC2 gene, named READ1, and establishing its association with reading and language skills.

I've discussed why this kind of thing is problematic in a previous blogpost, but perhaps a figure will help. The point is that in a large sample you can have a statistically strong association between a condition such as dyslexia and a genetic variant, but this does not mean that you can predict who will be dyslexic from their genes.

Proportions with risk variants estimated from Scerri et al (2011)
In this example, based on one of the best-replicated associations in the literature, you can see that most people with dyslexia don't have the risk version of the gene, and most people with the risk version of the gene don't have dyslexia. The effect sizes of individual genetic variants can be very small even when the strength of genetic association is large.
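To make this concrete, here is a toy calculation in R. The numbers are invented for illustration and are not the Scerri et al estimates: suppose 10% of children are dyslexic, 30% of the population carries the risk variant, and carriers have 1.5 times the risk of non-carriers.

    # Invented illustrative figures - not estimates from Scerri et al
    prevalence    <- 0.10   # proportion of children with dyslexia
    p_carrier     <- 0.30   # proportion of the population carrying the risk variant
    relative_risk <- 1.5    # risk of dyslexia in carriers relative to non-carriers

    # risk in non-carriers, chosen so the overall prevalence comes out right
    p_dys_noncarrier <- prevalence / (relative_risk * p_carrier + (1 - p_carrier))
    p_dys_carrier    <- relative_risk * p_dys_noncarrier

    p_dys_carrier                           # ~0.13: most carriers are not dyslexic
    p_dys_carrier * p_carrier / prevalence  # ~0.39: most dyslexic children are not carriers

Even with a genuine 1.5-fold increase in risk, only about 13% of carriers would be dyslexic, and about 60% of dyslexic children would not carry the variant: a real association, but useless for individual prediction.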

So what about the results from the latest Yale press release? Do they allow for more accurate identification of dyslexia on the basis of genes? In a word, no. I was pleased to see that the authors reported the effect sizes associated with the key genetic variants, which makes it relatively easy to estimate their usefulness in screening. In addition to identifying two sequences in DCDC2 associated with risk of language or reading problems, the authors noted an interaction with a risk version of another gene, KIAA0319, such that children with risk versions in both genes were particularly likely to have problems.  The relevant figure is shown here.

Update: 30th December 2014 - The authors have published an erratum indicating that Figure 3A was wrong. The corrected and original versions are shown below and I have amended conclusions in red.
Corrected Fig 3A from Powers et al (2013)

Original Fig 3A from Powers et al (2013)



There are several points to note from this plot, bearing in mind that dyslexia or SLI would normally only be diagnosed if a child's reading or language scores were at least 1.0 SD below average.
  1. For children who have either the KIAA0319 or the DCDC2 risk variant, but not both, the average score on reading and language measures is no more than 0.1 SD below average.
  2. For those who have both risk factors together, some tests give scores that are 0.2 to 0.3 SD below average, but this applies to only a subset of the reading/language measures. On nonword reading, often used as a diagnostic test for dyslexia, there is no evidence of any deficit in those with both risk versions of the genes. On the two language measures, the deficit hovers around 0.15 SD below the mean.
  3. The tests that show the largest deficits in those with two risk factors are measures of IQ rather than reading or language. Even here, the degree of impairment in those with two risk factors together indicates that the majority of children with this genotype would not fall in the impaired range (see the rough calculation after this list).
  4. The number of children with the two risk factors together is very small, around 2% of the population.
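As a rough illustration of point 3: if scores are normally distributed with unchanged variance, shifting a group's mean down by 0.3 SD (the upper end of the deficits above) raises the proportion scoring below the conventional -1 SD cut-off from about 16% to about 24%, so roughly three-quarters of the double-risk group would still score above it. In R:

    pnorm(-1, mean = 0)     # ~16% of unselected children score below -1 SD
    pnorm(-1, mean = -0.3)  # ~24% when the group mean is shifted down by 0.3 SD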
In sum, I think this is an interesting paper that might help us discover more about how genetic variation works to influence cognitive development by affecting brain function. The authors present the data in a way that allows us to appraise the clinical significance of the findings quite easily. However, the results indicate that, far from having translational potential for diagnosis and treatment, these genetic effects are subtle and unlikely to be useful for that purpose.

*It is unclear to me whether the Yale University Press Office are actively involved in gatecrashing Research Blogging, or whether this is just an independent 'blogger' who is recycling press releases as if they are blogposts.

Reference
Powers, N., Eicher, J., Butter, F., Kong, Y., Miller, L., Ring, S., Mann, M., & Gruen, J. (2013). Alleles of a Polymorphic ETV6 Binding Site in DCDC2 Confer Risk of Reading and Language Impairment. The American Journal of Human Genetics. DOI: 10.1016/j.ajhg.2013.05.008
Scerri, T. S., Morris, A. P., Buckingham, L. L., Newbury, D. F., Miller, L. L., Monaco, A. P., . . . Paracchini, S. (2011). DCDC2, KIAA0319 and CMIP are associated with reading-related traits. Biological Psychiatry, 70, 237-245. doi: 10.1016/j.biopsych.2011.02.005
 

Friday, 7 June 2013

Interpreting unexpected significant results

©www.cartoonstock.com
Here's a question for researchers who use analysis of variance (ANOVA). Suppose I set up a study to see if one group (e.g. men) differs from another (women) on brain response to auditory stimuli (e.g. standard sounds vs deviant sounds – a classic mismatch negativity paradigm). I measure the brain response at frontal and central electrodes located on the two sides of the head. The nerds among my readers will see that I have here a four-way ANOVA, with one between-subjects factor (sex) and three within-subjects factors (stimulus, hemisphere, electrode location). My hypothesis is that women have bigger mismatch effects than men, so I predict an interaction between sex and stimulus, but the only result significant at p < .05 is a three-way interaction between sex, stimulus and electrode location. What should I do?

a) Describe this as my main effect of interest, revising my hypothesis to argue for a site-specific sex effect
b) Describe the result as an exploratory finding in need of replication
c) Ignore the result as it was not predicted and is likely to be a false positive

I'd love to do a survey to see how people respond to these choices; my guess is many would opt for a) and few would opt for c). Yet in this situation, the likelihood of the result being a false positive is very high – much higher than many people realise.   
Many people assume that if an ANOVA output is significant at the .05 level, there's only a one in twenty chance of it being a spurious chance effect. We have been taught that we do ANOVA rather than numerous t-tests because ANOVA adjusts for multiple comparisons. But this interpretation is quite wrong. ANOVA adjusts for the number of levels within a factor, so, for instance, the probability of finding a significant effect of group is the same regardless of how many groups you have. ANOVA makes no adjustment to p-values for the number of factors and interactions in your design. The more of these you have, the greater the chance of turning up a "significant" result.
So, for the example given above, what is the probability of finding something significant at .05?
The four-way ANOVA has 15 terms (four main effects, six two-way interactions, four three-way interactions and one four-way interaction), so the probability of finding no significant effect is .95^15 = .46. It follows that the probability of finding at least one significant result is .54.
And for a three-way ANOVA there are seven terms (three main effects, three two-way interactions and one three-way interaction), giving a probability of .95^7 = .70 of no significant effect, and hence a .30 probability of something significant.
So, basically, if you do a four-way ANOVA, and you don't care what results come out, provided something is significant, you have a slightly greater than 50% chance of being satisfied. This might seem like an implausible example: after all, who uses ANOVA like this? Well, unfortunately, this example corresponds rather closely to what often happens in electrophysiological research using event-related potentials (ERPs). In this field, the interest is often in comparing a clinical and a control group, and so some results are more interesting than others: the main effect of group, and the seven interactions with group, are the principal focus of attention. But hypotheses about exactly what will be found are seldom clearcut: excitement is generated by any p-value associated with a group term that falls below .05. There's a one in three chance that at least one of these eight group terms will have a p-value this low. This means that the potential for 'false positive psychology' in this field is enormous (Simmons et al, 2011).
A corollary of this is that researchers can modify the likelihood of finding a "significant" result by selecting one ANOVA design rather than another. Suppose I'm interested in comparing brain responses to standard and deviant sounds. One way of doing this is to compute the difference between ERPs to the two auditory stimuli and use this difference score as the dependent variable:  this reduces my ANOVA from a 4-way to a 3-way design, and gives fewer opportunities for spurious findings. So you will get a different risk of a false positive, depending on how you analyse the data.

Another feature of ERP research is that there is flexibility in how electrodes are handled in an ANOVA design: since there is symmetry in electrode placement, it is not uncommon to treat hemisphere as one factor, and electrode site as another. The alternative is just to treat electrode as a repeated measure. This is not a neutral choice: the chances of spurious findings are greater if one adopts the first approach, simply because it adds a factor to the analysis, plus all the interactions with that factor.

I stumbled across these insights into ANOVA when I was simulating data using a design adopted in a recent PLOS One paper that I'd commented on. I was initially interested in looking at the impact of adopting an unbalanced design in ANOVA: this study had a group factor with sample sizes of 20, 12 and 12. Unbalanced designs are known to be problematic for repeated measures ANOVA and I initially thought this might be the reason why simulated random numbers were giving such a lot of "significant" p-values. However, when I modified the simulation to use equal sample sizes across groups, the analysis continued to generate far more low p-values than I had anticipated, and I eventually twigged that this is simply what you get if you use a 4-way ANOVA. For any one main effect or interaction, the probability of p < .05 was one in twenty: but the probability that at least one term in the analysis would give p < .05 was closer to 50%.
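For anyone who wants to try this for themselves, here is a stripped-down sketch in R. It is not the simulation I ran on the PLOS One design (which had repeated measures): to keep it short it treats all four factors as between-subjects with two levels each, and the helper function one_run and the cell size n_per_cell are just names for this illustration. The family-wise logic is the same: with 15 terms in the table, something comes out "significant" in roughly half the runs.

    one_run <- function(n_per_cell = 4) {
      # fully crossed 2x2x2x2 between-subjects design filled with pure noise
      d <- expand.grid(A = factor(1:2), B = factor(1:2),
                       C = factor(1:2), D = factor(1:2))
      d <- d[rep(1:nrow(d), each = n_per_cell), ]
      d$y <- rnorm(nrow(d))
      p <- summary(aov(y ~ A * B * C * D, data = d))[[1]][["Pr(>F)"]]
      any(p < 0.05, na.rm = TRUE)   # TRUE if any of the 15 terms reaches p < .05
    }

    set.seed(1)
    mean(replicate(1000, one_run()))  # roughly 0.5, in line with 1 - 0.95^15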
The analytic approach adopted in the PLOS One paper is pretty standard in the field of ERP. Indeed, I have seen papers where 5-way or even 6-way repeated measures ANOVA is used. When you do an ANOVA and it spews out the results, it's tempting to home in on the results that achieve the magical significance level of .05 and then formulate some kind of explanation for the findings. Alas, this is an approach that has left the field swamped by spurious results.
There have been various critiques of analytic methods in ERP, but I haven't yet found any that have focussed on this point. Kilner (2013) has noted the bias that arises when electrodes or windows are selected for analysis post hoc, on the basis that they give big effects. Others have noted problems with using electrode as a repeated measure, given that ERPs at different electrodes are often highly correlated. More generally, statisticians are urging psychologists to move away from using ANOVA to adopt multi-level modelling, which makes different assumptions and can cope, for instance, with unbalanced designs. However, we're not going to fix the problem of "false positive ERP" by adopting a different form of analysis. The problem is not just with the statistics, but with the use of statistics for what are, in effect, unconstrained exploratory analyses. Researchers in this field urgently need educating in the perils of post hoc interpretation of p-values and the importance of a priori specification of predictions.
I've argued before that the best way to teach people about statistics is to get them to generate their own random data sets. In the past, this was difficult, but these days it can be achieved using free statistical software, R. There's no better way of persuading someone to be less impressed by p < .05 than to show them just how readily a random dataset can generate "significant" findings. Those who want to explore this approach may find my blog on twin analysis in R useful for getting started (you don't need to get into the twin bits!).
The field of ERP is particularly at risk of spurious findings because of the way in which ANOVA is often used, but the problem of false positives is not restricted to this area, nor indeed to psychology. The mindset of researchers needs to change radically, with a recognition that our statistical methods only allow us to distinguish signal from noise in the data if we understand the nature of chance.
Education about probability is one way forward. Another is to change how we do science to make a clear distinction between planned and exploratory analyses. This post was stimulated by a letter that appeared in the Guardian this week on which I was a signatory. The authors argued that we should encourage a system of pre-registration of research, to avoid the kind of post hoc interpretation of findings that is so widespread yet so damaging to science.

Reference

Simmons, Joseph P., Nelson, Leif D., & Simonsohn, Uri (2011). False-positive psychology. Psychological Science, 1359-1366. DOI: 10.1037/e636412012-001

This article (Figshare version) can be cited as:
Bishop, Dorothy V M (2014): Interpreting unexpected significant findings. figshare.
http://dx.doi.org/10.6084/m9.figshare.1030406




PS. 2nd July 2013
There's remarkably little coverage of this issue in statistics texts, but Mark Baxter pointed me to a 1996 manual for SYSTAT that does explain it clearly. See: http://www.slideshare.net/deevybishop/multiway-anova-and-spurious-results-syt
The authors noted "Some authors devote entire chapters to fine distinctions between multiple comparison procedures and then illustrate them within a multi-factorial design not corrected for the experiment-wise error rate." 
They recommend doing a Q-Q plot to see if the distribution of p-values is different from expectation, and using Bonferroni correction to guard against type I error.

They also note that the different outputs from an ANOVA are not independent if they are based on the same mean squares denominator, a point that is discussed here:
Hurlburt, R. T., & Spiegel, D. K. (1976). Dependence of F Ratios Sharing a Common Denominator Mean Square. The American Statistician, 30(2), 74-78. doi: 10.2307/2683798
These authors conclude (p 76)
It is important to realize that the appearance of two significant F ratios sharing the same denominator should decrease one's confidence in rejecting either of the null hypotheses. Under the null hypothesis, significance can be attained either by the numerator mean square being "unusually" large, or by the denominator mean square being "unusually" small. When the denominator is small, all F ratios sharing that denominator are more likely to be significant. Thus when two F ratios with a common denominator mean square are both significant, one should realize that both significances may be the result of unusually small error mean squares. This is especially true when the numerator degrees of freedom are not small compared to the denominator degrees of freedom.