Replication studies have been much in the news lately,
particularly in the field of psychology, where a great deal of discussion has
been stimulated by the Reproducibility
Project spearheaded by Brian Nosek.
Replication of a study is an important way to test the
reproducibility and generalisability of the results. It has been a standard
requirement for publication in reputable journals in the field of genetics for
several years (see Kraft et al., 2009).
However, at interdisciplinary boundaries, the need for replication may not be
appreciated, especially where researchers from other disciplines include
genetic associations in their analyses. I’m interested in documenting how far
replications are routinely included in genetics papers that are published in
neuroscience journals, and so I attempted to categorise a set of papers on this
basis.
I’ve encountered many unanticipated obstacles in the course
of this study (unintelligible
papers and uncommunicative
authors, to name just two I have blogged about), but I had not expected to
find it difficult to make this binary categorisation. It has become clear,
however, that there are nuances to the idea of replication. Here are two
that I have encountered:
a) Studies which include a straightforward
Discovery and Replication sample, but which fail to reproduce the original
result in the Replication sample. The authors then proceed to analyse the data
with both samples combined and conclude that the original result is still
there, so all is okay. Now, as far as I am concerned, you can’t treat this as a
successful replication; the best you can say of it is that it is an extension
of the original study to a larger sample size. But if, as is typically the case, the original
result was afflicted by the Winner's Curse, whereby only inflated effect
estimates pass the significance filter at the discovery stage, then the
combined result will be biased (see the simulation sketch after point b).
b) Studies which use different phenotypes for
Discovery and Replication samples. On the one hand, one can argue that such
studies are useful for identifying how generalisable the initial result is to
changes in measures. It may also be the only practical solution if using
pre-existing samples for replication, as one has to use whatever measures are available.
The problem is that there is an asymmetry in terms of how the results are then
treated. If the same result is obtained with a new sample using different
measures, this can be taken as strong evidence that the genotype is influencing
a trait regardless of how it is measured. But when the Replication sample fails
to reproduce the original result, one is left with uncertainty as to whether it
was a type I error, or a genuine finding that is sensitive to the measure used. I've
found that people are very reluctant to treat failures to replicate as
undermining the original finding in this circumstance.
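To see why the combined analysis in scenario (a) cannot rescue a failed replication, here is a minimal simulation sketch. It assumes a simple two-group comparison (genotype carriers vs non-carriers) with a hypothetical true effect, sample size and significance filter; none of the numbers come from any real study.

```python
# Sketch of the Winner's Curse in a combined Discovery + Replication analysis.
# All parameters are hypothetical, chosen only to illustrate the bias.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
true_d = 0.2      # true standardised group difference (hypothetical)
n = 50            # per-group sample size in each study (hypothetical)
alpha = 0.05
n_sims = 20_000

disc_est, comb_est = [], []
for _ in range(n_sims):
    # Discovery sample: genotype carriers vs non-carriers
    a1, b1 = rng.normal(true_d, 1, n), rng.normal(0, 1, n)
    t, p = stats.ttest_ind(a1, b1)
    if p >= alpha:
        continue  # only 'significant' discoveries get published: the selection step
    # Replication sample: same design, but no selection applied to it
    a2, b2 = rng.normal(true_d, 1, n), rng.normal(0, 1, n)
    disc_est.append(a1.mean() - b1.mean())
    comb_est.append(np.concatenate([a1, a2]).mean()
                    - np.concatenate([b1, b2]).mean())

print(f"true effect:              {true_d:.2f}")
print(f"mean published discovery: {np.mean(disc_est):.2f}")  # inflated by selection
print(f"mean combined estimate:   {np.mean(comb_est):.2f}")  # still biased upward
```

In runs of this sketch the discovery estimates that survive the significance filter come out at more than double the true effect, and the combined estimate, because it still contains the selected discovery data, stays inflated even though the replication sample on its own is unbiased.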
I’m reminded of arguments in the field of social psychology,
where failures to reproduce well-known phenomena are often attributed to minor
changes in procedure or to the experimenters' lack of 'flair'. The problem is
that while this interpretation could be valid, there is another, less
palatable, interpretation, which is that the original finding was a type I
error. This is particularly likely when
the original study was underpowered or the phenotype was measured using an
unreliable instrument.
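The arithmetic behind that claim is easy to make explicit. The snippet below computes the positive predictive value of a 'significant' finding, following the familiar Ioannidis-style calculation; the power, alpha and prior figures are hypothetical, chosen to represent an underpowered field.

```python
# Sketch: positive predictive value (PPV) of a 'significant' result.
# Power, alpha and prior are hypothetical illustration values.
def ppv(power: float, alpha: float, prior: float) -> float:
    """P(the effect is real | p < alpha)."""
    true_pos = power * prior          # real effects correctly detected
    false_pos = alpha * (1 - prior)   # null effects passing by chance
    return true_pos / (true_pos + false_pos)

alpha = 0.05
prior = 0.10  # suppose 1 in 10 tested associations is real
for power in (0.8, 0.5, 0.2):
    print(f"power = {power:.1f}: PPV = {ppv(power, alpha, prior):.2f}")
```

On these assumptions the PPV falls from 0.64 at 80% power to 0.31 at 20% power, so most significant findings from badly underpowered studies are false positives; an unreliable measure lowers effective power and drags the figure down further.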