Saturday 26 January 2013

An alternative to REF2014?

After blogging last week about use of journal impact factors in REF2014, many people have asked me what alternative I'd recommend. Clearly, we need a transparent, fair and cost-effective method for distributing funding to universities to support research. Those designing the REF have tried hard over the years to devise such a method, and have explored various alternatives, but the current system leaves much to be desired.

Consider the current criteria for rating research outputs, designed by someone with a true flair for ambiguity:
Rating Definition
4* Quality that is world-leading in terms of originality, significance and rigour
3* Quality that is internationally excellent in terms of originality, significance and rigour but which falls short of the highest standards of excellence
2* Quality that is recognised internationally in terms of originality, significance and rigour
1* Quality that is recognised nationally in terms of originality, significance and rigour

Since only 4* and 3* outputs will feature in the funding formula, then a great deal hinges on whether research is deemed “world-leading”, “internationally excellent” or “internationally recognised”. This is hardly transparent or objective. That’s one reason why many institutions want to translate these star ratings into journal impact factors. But substituting a discredited, objective criterion for a subjective criterion is not a solution.

The use of bibliometrics was considered but rejected in the past. My suggestion is that we should reconsider this idea, but in a new version. A few months ago, I blogged about how university rankings in the previous assessment exercise (RAE) related to grant income and citation rates for outputs. Instead of looking at citations for individual researchers, I used Web of Science to compute an H-index for the period 2000-2007 for each department, by using the ‘address’ field to search. As noted in my original post, I did this fairly hastily and the method can get problematic in cases where a Unit of Assessment does not correspond neatly to a single department. The H-index reflected all research outputs of everyone at that address – regardless of whether they were still at the institution or entered for the RAE. Despite these limitations, the resulting H-index predicted the RAE results remarkably well, as seen in the scatterplot below, which shows H-index in relation to the funding level following from RAE. This is computed by number of full-time staff equivalents multiplied by the formula:
    .1 x 2* + .3  x 3* + .7 x 4*
(N.B. I ignored subject weighting, so units are arbitrary).
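To make the two quantities concrete, here is a minimal sketch in Python of how an H-index and the unweighted funding score could be computed. The function names and the input numbers are made up for illustration, and I am assuming that the 2*, 3* and 4* terms are the percentages of a department's quality profile at each level; this is not the script used for the analysis above.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def funding_score(fte, pct_2star, pct_3star, pct_4star):
    """FTE staff multiplied by .1 x 2* + .3 x 3* + .7 x 4* (no subject weighting)."""
    quality = 0.1 * pct_2star + 0.3 * pct_3star + 0.7 * pct_4star
    return fte * quality


# Hypothetical inputs, purely for illustration (not RAE data)
example_citations = [50, 30, 12, 8, 5, 4, 1, 0]
example_h = h_index(example_citations)  # five papers have at least 5 citations each
example_score = funding_score(fte=20, pct_2star=30, pct_3star=40, pct_4star=20)

print(example_h, example_score)
```

For a departmental H-index the citation list would simply pool every paper returned by the address search, rather than the papers of one researcher.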

Psychology (Unit of Assessment 44), RAE2008 outcome by H-index
Yes, you might say, but the prediction is less successful at the top end of the scale, and this could mean that the RAE panels incorporated factors that aren’t readily measured by such a crude score as H-index. Possibly true, but how do we know those factors are fair and objective? In this dataset, one variable that accounted for additional variance in outcome, over and above departmental H-index, was whether the department had a representative on the psychology panel: if they did, then the trend was for the department to have a higher ranking than that predicted from the H-index. With panel membership included in the regression, the correlation (r) increased significantly from .84 to .86, t = 2.82, p = .006. It makes sense that if you are a member of a panel, you will be much more clued up than other people about how the whole process works, and you can use this information to ensure your department’s submission is strategically optimal. I should stress that this was a small effect, and I did not see it in a handful of other disciplines that I looked at, so it could be a fluke. Nevertheless, with the best intentions in the world, the current system can’t ever defend completely against such biases.
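For readers who want to see how an increase in correlation of this kind is tested, here is a rough sketch using synthetic data. Everything in it is an assumption for illustration: the number of departments, the simulated effect of panel membership, and the variable names are invented, and it is not a reanalysis of the RAE figures. It compares the fit with H-index alone against the fit with panel membership added, and converts the gain in R-squared into the magnitude of a t statistic for the added predictor.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def r_squared_two_predictors(y, x1, x2):
    """R-squared for y regressed on x1 and x2, from the pairwise correlations."""
    ry1, ry2, r12 = pearson(y, x1), pearson(y, x2), pearson(x1, x2)
    return (ry1**2 + ry2**2 - 2 * ry1 * ry2 * r12) / (1 - r12**2)


# Synthetic data only: hypothetical departments, not the RAE results.
random.seed(1)
n = 60
h_index = [random.uniform(10, 80) for _ in range(n)]
on_panel = [float(i % 4 == 0) for i in range(n)]  # 1.0 = department has a panel member
outcome = [2.0 * h + 15.0 * p + random.gauss(0, 20) for h, p in zip(h_index, on_panel)]

r2_reduced = pearson(outcome, h_index) ** 2
r2_full = r_squared_two_predictors(outcome, h_index, on_panel)

# Magnitude of the t statistic for the added predictor: the square root of the
# F-change with 1 and n - 3 degrees of freedom.
t_added = math.sqrt((r2_full - r2_reduced) / (1 - r2_full) * (n - 3))

print(f"R (H-index only) = {math.sqrt(r2_reduced):.2f}")
print(f"R (H-index + panel membership) = {math.sqrt(r2_full):.2f}")
print(f"|t| for panel membership = {t_added:.2f}")
```

The point of the sketch is that a modest rise in R can still correspond to a sizeable t value when the first predictor already explains most of the variance.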

So overall, my conclusion is that we might be better off using a bibliometric measure such as a departmental H-index to rank departments. It is crude and imperfect, and I suspect it would not work for all disciplines – especially those in the humanities. It relies solely on citations, and it's debatable whether that is desirable. But for sciences, it seems to be pretty much measuring whatever the RAE was measuring, and it would seem to be the lesser of various possible evils, with a number of advantages compared to the current system. It is transparent and objective, it would not require departments to decide who they do and don’t enter for the assessment, and most importantly, it wins hands down on cost-effectiveness. If we'd used this method instead of the RAE, a small team of analysts armed with Web of Science would have been able to derive the necessary data in a couple of weeks, giving outcomes virtually identical to those of the RAE. The money saved both by HEFCE and individual universities could be ploughed back into research. Of course, people will attempt to manipulate whatever criterion is adopted, but this one might be less easily gamed than some others, especially if self-citations from the same institution are excluded.

It will be interesting to see how well this method predicts RAE outcomes in other subjects, and whether it can also predict results from the REF2014, where the newly-introduced “impact statement” is intended to incorporate a new dimension into assessment.

Saturday 19 January 2013

Journal Impact Factors and REF 2014

In 2014, British institutions of Higher Education are to be evaluated in the Research Excellence Framework (REF), an important exercise on which their future funding depends. Academics are currently undergoing scrutiny by their institutions to determine whether their research outputs are good enough to be entered in the REF. Outputs are to be assessed in terms of  "‘originality, significance and rigour’, with reference to international research quality standards."
Here's what the REF2014 guidelines say about journal impact factors:

"No sub-panel will make any use of journal impact factors, rankings, lists or the perceived standing of publishers in assessing the quality of research outputs."

Here are a few sources that explain why it is a bad idea to use impact factors to evaluate individual research outputs:
Stephen Curry's blog
David Colquhoun letter to Nature
Manuscript by Brembs & Munafo on "Unintended consequences of journal rank"
Editage tutorial

Here is some evidence that the REF2014 statement on impact factors is being widely ignored:

Jenny Rohn Guardian blogpost

And here's a letter I wrote yesterday to the representatives of RCUK who act as observers on REF panels about this. I'll let you know if I get a reply.

18th January 2013

To: Ms Anne-Marie Coriat: Medical Research Council   
Dr Alf Game: Biotechnology and Biological Sciences Research Council   
Dr Alison Wall: Engineering and Physical Sciences Research Council   
Ms Michelle Wickendon: Natural Environment Research Council   
Ms Victoria Wright: Science and Technology Facilities Council   
Dr Fiona Armstrong: The Economic and Social Research Council    
Mr Gary Grubb: Arts and Humanities Research Council    

Dear REF2014 Observers,

I am contacting you because a growing number of academics are expressing concerns that, contrary to what is stated in the REF guidelines, journal impact factors are being used by some Universities to rate research outputs. Jennifer Rohn raised this issue here in a piece on the Guardian website last November:

I have not been able to find any official route whereby such concerns can be raised, and I have evidence that some of those involved in the REF, including senior university figures and REF panel members, regard it as inevitable and appropriate that journal impact factors will be factored in to ratings - albeit as just one factor among others. Many, perhaps most, of the academics involved in panels and REF preparations grew up in a climate where publication in a high impact journal was regarded as the acme of achievement. Insofar as there are problems with the use of impact factors, they seem to think the only difficulty is the lack of comparability across sub-disciplines, which can be adjusted for. Indeed, I have been told that it is naïve to imagine that this statement should be taken literally: "No sub-panel will make any use of journal impact factors, rankings, lists or the perceived standing of publishers in assessing the quality of research outputs."

Institutions seem to vary in how strictly they are interpreting this statement and this could lead to serious problems further down the line. An institution that played by the rules and submitted papers based only on perceived scientific quality might challenge the REF outcome if they found the panel had been basing ratings on journal impact factor. The evidence for such behaviour could be reconstructed from an analysis of outputs submitted for the REF.

I think it is vital that RCUK responds to the concerns raised by Dr Rohn to clarify the position on journal impact factors and explain the reasoning behind the guidelines on this. Although the statement seems unambiguous, there is a widespread view that the intention is only to avoid slavish use of impact factors as a sole criterion, not to ban their use altogether. If that is the case, then this needs to be made explicit. If not, then it would be helpful to have some mechanism whereby academics could report institutions that flout this rule.

Yours sincerely

(Professor) Dorothy Bishop

Colquhoun, D. (2003). Challenging the tyranny of impact factors Nature, 423 (6939), 479-479 DOI: 10.1038/423479a

P.S. 21/1/13
This post has provoked some excellent debate in the Comments, and also on Twitter. I have collated the tweets on Storify here, and the Comments are below. They confirm that there are very divergent views out there about whether REF panels are likely to, or should, use journal impact factor in any shape or form. They also indicate that this issue is engendering high levels of anxiety in many sections of academia.

P.P.S. 30/1/13
I now have a response from Graeme Rosenberg, REF Manager at HEFCE, who kindly agreed that I could post relevant content from his email here. This briefly explains why impact factors are disallowed for REF panels, but notes that institutions are free to flout this rule in their submissions, at their own risk. The text follows:

I think your letter raises two sets of issues, which I will respond to in turn. 

The REF panel criteria state clearly that panels will not use journal impact factors in the assessment. These criteria were developed by the panels themselves and we have no reason to doubt they will be applied correctly. The four main panels will oversee the work of the sub-panels throughout the assessment process, and it is part of the main panels' remit to ensure that all sub-panels apply the published criteria. If there happen to be some individual panel members at this stage who are unsure about the potential use of impact factors in the panels' assessments, the issue will be clarified by the panel chairs when the assessment starts. The published criteria are very clear and do not leave any room for ambiguity on this point.

The question of institutions using journal impact factors in preparing their submissions is a separate issue. We have stated clearly what the panels will and will not be using to inform their judgements. But institutions are autonomous and ultimately it is their decision as to what forms of evidence they use to inform their selection decisions. If they choose to use journal impact factors as part of the evidence, then the evidence for their decisions will differ to that used by panels. This would no doubt increase the risk to the institution of reaching different conclusions to the REF panels. Institutions would also do well to consider why the REF panels will not use journal impact factors - at the level of individual outputs they are a poor proxy for quality. Nevertheless, it remains the institution's choice.

Friday 11 January 2013

Genetic variation and neuroimaging: some ground rules for reporting research

Those who follow me on Twitter may have noticed signs of tetchiness in my tweets over the past few weeks. In the course of writing a review article, I’ve been reading papers linking genetic variants to language-related brain structure and function. This has gone more slowly than I expected for two reasons. First, the literature gets ever more complicated and technical: both genetics and brain imaging involve huge amounts of data, and new methods for crunching the numbers are developed all the time. If you really want to understand a paper, rather than just assuming the Abstract is accurate, it can be a long, hard slog, especially if, like me, you are neither a geneticist nor a neuroimager. That’s understandable and perhaps unavoidable. The other reason, though, is less acceptable. For all their complicated methods, many of the papers in this area fail to tell the reader some important and quite basic information. This is where the tetchiness comes in. Having burned my brains out trying to understand what was done, I then realise that I have no idea about something quite basic like the sample size. The initial assumption is that I’ve missed it, and so I wade through the paper again, and the Supplementary Material, looking for the key information. Only when I’m absolutely certain that it’s not there, am I reduced to writing to the authors for the information. So this is a plea – to authors, editors and reviewers. If a paper is concerned with an association between a genetic variant and a phenotype (in my case the interest is in neural phenotypes, but I suspect this applies more widely) then could we please ensure that the following information is clearly reported in the Methods or Results section:

1. What genetic variant are we talking about? You might think this is very simple, but it’s not: for instance, one of the genes I’m interested in is CNTNAP2, which has been associated with a range of neurodevelopmental disorders, especially those affecting language. The evidence for a link between CNTNAP2 and developmental disorders comes from studies that have examined variation in single-nucleotide polymorphisms or SNPs. These are segments of DNA that are useful in revealing differences between people because they are highly variable. DNA is composed of four bases, C, T, G, and A in paired strands. So for instance, we might have a locus where some people have two copies of C, some have two copies of T, and others have a C and a T. SNPs are not necessarily a functional part of the gene itself – they may be in a non-coding region, or so close to a gene that variation in the SNP co-occurs with variation in the gene. Many different SNPs can index the same gene. So for CNTNAP2, Vernes et al (2008) tested 38 SNPs, ten of which were linked to language problems. So we have to decide which SNP to study – or whether to study all of them. And we have to decide how to do the analysis. For instance, SNP rs2710102 can take the form CC, CT or TT. We could look for a dose response effect (CC < CT < TT) or we could compare CC/CT with TT, or we could compare CC with CT/TT. Which of these we do may depend on whether prior research suggests the genetic effect is additive or dominant, but for brain imaging studies grouping can also be dictated by practical considerations: it’s usual to compare just two groups and to combine genotypes to give a reasonable sample size. If you’ve followed me so far, and you have some background in statistics, you will already be starting to see why this is potentially problematic. If the researcher can select from ten possible SNPs, and two possible analyses, the opportunities for finding spuriously ‘significant’ results are increased.
If there are no directional predictions – i.e. we are just looking for a difference between two groups, but don’t have a clear idea of what type of difference will be associated with ‘risk’ – then the number of potentially ‘interesting’ results is doubled.
For CNTNAP2, I found two papers that had looked at brain correlates of SNP rs2710102. Whalley et al (2011) found that adults with the CC genotype had different patterns of brain activation from CT/TT individuals. However, the other study, by Scott-van Zeeland et al (2010), treated CC/CT as a risk genotype that was compared with TT. (This was not clear in the paper, but the authors confirmed it was what they did).
Four studies looked at another SNP - rs7794745, on the basis that an increased risk of autism had been reported for the T allele in males. Two of them (Tan et al, 2010; Whalley et al, 2011) compared TT vs TA/AA and two (Folia et al, 2011; Kos et al, 2012) compared TT/TA with AA. In any case, the ground is rather cut from under the feet of these researchers by a recent failure to replicate an association of this SNP with autism (Anney et al, 2012).
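The multiplicity problem in point 1 can be put into rough numbers. The sketch below assumes every test is independent and run at the conventional .05 level. That is a simplification, because nearby SNPs and alternative groupings of the same genotypes are correlated, so the real figure will differ. The function name is mine, for illustration only.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests


snps = 10       # candidate SNPs to choose from
groupings = 2   # e.g. CC/CT vs TT, or CC vs CT/TT

for n_tests in (1, snps, snps * groupings):
    rate = familywise_error_rate(n_tests)
    print(f"{n_tests:2d} tests: {rate:.2f}")
```

Under these assumptions the chance of at least one spurious ‘significant’ result rises from 5% for a single test to about 40% for ten tests and about 64% for twenty.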

2. Who are the participants? It’s not very informative to just say you studied “healthy volunteers”. There are some types of study where it doesn’t much matter how you recruited people. A study looking at genetic correlates of cognitive ability isn’t one of them. Samples of university students, for instance, are not representative of the general population, and aren’t likely to include many people with significant language problems.

3. How many people in the study had each type of genetic variant? And if subgroup analyses are reported, how many people in each subgroup had each type of genetic variant? I've found that papers in top-notch journals often fail to provide this basic information.
Why is this important? For a start, likelihood of showing significant activation of a brain region will be affected by sample size. Suppose you have 24 people with genotype A and 8 with genotype B. You find significant activation of brain region X in those with genotype A, but not for those with genotype B. If you don’t do an explicit statistical comparison of groups (you should - but many people don’t) you may be misled into concluding that brain activation is defective in genotype B – when in fact you just have low power to detect effects in that group because it is so small.
In addition, if you don’t report the N, then it’s difficult to get an idea of the effect size and confidence interval for any effect that is reported. The reasons why this is optimal are well-articulated here. This issue has been much discussed in psychology, but seems not to have permeated the field of genetics, where reliance on p-values seems the norm. In neuroimaging it gets particularly complicated, because some form of correction for ‘false discovery’ will be applied when multiple comparisons are conducted. It’s often hard to work out quite how this was done, and you can end up staring at a table that shows brain regions and p-values, with only a vague idea of how big a difference there actually is between groups.
 Most of the SNPs that are being used in brain studies are ones that were found to be associated with a behavioural phenotype in large-scale genomic studies where the sample size would include hundreds if not thousands of individuals, so small effects could be detected. Brain-based studies often use sample sizes that are relatively small, but some of them find large, sometimes very large, effects. So what does that mean? The optimistic interpretation is that a brain-based phenotype is much closer to the gene effect, and so gives clearer findings. This is essentially  the argument used by those who talk of ‘endophenotypes’ or ‘biomarkers’. There is, however, an alternative, and much more pessimistic view, which is that studies linking genotypes with brain measures are prone to generate false positive findings, because there are too many places in the analysis pipeline where the researchers have opportunities to pick and choose the analysis that brings out the effect of interest most clearly. Neuroskeptic has a nice blogpost illustrating this well-known problem in the neuroimaging area; matters are only made worse by uncertainty re SNP classification (point 1).
A source of concern here is the unpublishability of null findings. Suppose you did a study where you looked at, say, 40 SNPs and a range of measures of brain structure, covering the whole brain. After doing appropriate corrections for multiple comparisons, nothing is significant. The sad fact is that your study is unlikely to find a home in a journal. But is this right? After all, we don’t want to clutter up the literature with a load of negative results. The answer depends on your sample size, among other things. In a small sample, a null result might well reflect lack of statistical power to detect a small effect. This is precisely why people should avoid doing small studies: if you find nothing, it’s uninterpretable. What we need are studies that allow us to say with confidence whether or not there is a significant gene effect.
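The 24 versus 8 example in point 3 can be illustrated with a quick power calculation. This is only a sketch: it uses a normal approximation to a two-sided one-sample test (which somewhat overstates power in samples this small), and the effect size of 0.6 is a hypothetical value that I have assumed to be the same in both genotype groups.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(effect_size, n, alpha=0.05):
    """Approximate power of a two-sided one-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Ignores the negligible chance of rejecting in the wrong direction.
    return z.cdf(effect_size * sqrt(n) - z_crit)


d = 0.6  # hypothetical standardised effect, identical in both groups
power_a = approx_power(d, 24)  # genotype A, n = 24
power_b = approx_power(d, 8)   # genotype B, n = 8

print(f"Power with n = 24: {power_a:.2f}")
print(f"Power with n = 8:  {power_b:.2f}")
```

With these assumed values the larger group detects the effect most of the time (power of roughly 0.84) and the smaller group less than half the time (roughly 0.40), even though the underlying effect is identical. That is why "significant in A, not significant in B" is not evidence of a group difference.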

4. How do the genetic/neuroimaging results relate to cognitive measures in your sample?  Your notion that ‘underactivation of brain area X’ is an endophenotype that leads to poor language, for instance, doesn’t look very plausible if people who have such underactivation have excellent language skills. Out of five papers on CNTNAP2 that I reviewed, three made no mention of cognitive measures, one gathered cognitive data but did not report how it related to genotype or brain measures, and only one provided some relevant, though sketchy, data.

5. Report negative findings. The other kind of email I’ve been writing to people is one that says – could you please clarify whether your failure to report on the relationship between X and Y was because you didn’t do that analysis, or whether you did the analysis but failed to find anything. This is going to be an uphill battle, because editors and reviewers often advise authors to remove analyses with nonsignificant findings. This is a very bad idea as it distorts the literature.

And last of all....
A final plea is not so much to journal editors as to press officers. Please be aware that studies of common SNPs aren't the same as studies of rare genetic mutations. The genetic variants in the studies I looked at were all relatively common in the general population, and so aren't going to be associated with major brain abnormalities. Sensationalised press releases can only cause confusion:
This release on the Scott-Van Zeeland et al (2010) study described neuroimaging findings from CNTNAP2 variants that are found in over 70% of the population. It claims that:
  • “A gene variant tied to autism rewires the brain”
  • “Now we can begin to unravel the mystery of how genes rearrange the brain's circuitry, not only in autism but in many related neurological disorders.”
  • “Regardless of their diagnosis, the children carrying the risk variant showed a disjointed brain. The frontal lobe was over-connected to itself and poorly connected to the rest of the brain”
  • “If we determine that the CNTNAP2 variant is a consistent predictor of language difficulties, we could begin to design targeted therapies to help rebalance the brain and move it toward a path of more normal development.”
Only at the end of the press release are we told that “One third of the population [sic: should be two thirds] carries this variant in its DNA. It's important to remember that the gene variant alone doesn't cause autism, it just increases risk.”

Anney, R., Klei, L., Pinto, D., Almeida, J., Bacchelli, E., Baird, G., . . . Devlin, B. (2012). Individual common variants exert weak effects on the risk for autism spectrum disorders. Human Molecular Genetics, 21(21), 4781-4792. doi: 10.1093/hmg/dds301

Folia, V., Forkstam, C., Ingvar, M., Hagoort, P., & Petersson, K. M. (2011). Implicit artificial syntax processing: Genes, preference, and bounded recursion. Biolinguistics, 5.

Kos, M., et al. (2012). CNTNAP2 and language processing in healthy individuals as measured with ERPs. PLOS One, 7.

Scott-Van Zeeland, A., Abrahams, B., Alvarez-Retuerto, A., Sonnenblick, L., Rudie, J., Ghahremani, D., Mumford, J., Poldrack, R., Dapretto, M., Geschwind, D., & Bookheimer, S. (2010). Altered functional connectivity in frontal lobe circuits is associated with variation in the autism risk gene CNTNAP2. Science Translational Medicine, 2(56), 56-56. doi: 10.1126/scitranslmed.3001344

Tan, G. C., Doke, T. F., Ashburner, J., Wood, N. W., & Frackowiak, R. S. (2010). Normal variation in fronto-occipital circuitry and cerebellar structure with an autism-associated polymorphism of CNTNAP2. Neuroimage, 53, 1030.

Vernes, S. C., Newbury, D. F., Abrahams, B., Winchester, L., Nicod, J., Groszer, M., . . . Fisher, S. (2008). A functional genetic link between distinct developmental language disorders. New England Journal of Medicine, 359, 2337-2345.

Whalley, H. C., et al. (2011). Genetic variation in CNTNAP2 alters brain function during linguistic processing in healthy individuals. Am. J. Med. Genet. B, 156B, 941.