I read this piece in the Independent this morning and an icy chill gripped me. Fraudulent researchers have been damaging Britain's scientific reputation and we need to do something. But what? Sadly, it sounds like the plan is to do what is usually done when a moral panic occurs: increase the amount of regulation.
So here is my, very quick, response – I really have lots of other things I should be doing, but this seemed urgent, so apologies for typos etc.
According to the account in the Independent, Universities will not be eligible for research funding unless they sign up to a Concordat for Research Integrity which entails, among other things, that they "will have to demonstrate annually that each team member’s graphs and spreadsheets are precisely correct."
We already have massive regulation around the ethics of research on human participants that works on the assumption that nobody can be trusted, so we all have to do mountains of paperwork to prove we aren't doing anything deceptive or harmful.
So, you will ask, am I in favour of fraud and sloppiness in research? Of course not. Indeed, I devote a fair part of my blog to criticisms of what I see as dodgy science: typically, not outright fraud, but rather over-hyped or methodologically weak work, which is, to my mind, a far greater problem. I agree we need to think about how to fix science, and that many of our current practices lead to non-replicable findings. I just don't think more scrutiny by administrators is the solution. To start scrutinising datasets is just silly: this is not where the problem lies.
So what would I do? The answers fall into three main categories: incentives, publication practices, and research methods.
Incentives is the big one. I've been arguing for years that our current reward system distorts and damages science. I won't rehearse the arguments again: you can read them here.
The current Research Excellence Framework is, to my mind, an unnecessary exercise that further discourages researchers from doing slow and careful work. My first recommendation is therefore that we ditch the REF and use simpler metrics to allocate research funding to universities, freeing up a great deal of time and money, and improving the security of research staff. Currently, we have a situation where research stardom, assessed by REF criteria, is all-important. Instead of valuing papers in top journals, we should be valuing research replicability.
Publication practices are problematic, mainly because the top journals prioritize exciting results over methodological rigour. There is therefore a strong temptation to do post hoc analyses of data until an exciting result emerges. Pre-registration of research projects has been recommended as a way of dealing with this - see this letter to the Guardian on which I am a signatory.
It might be even more effective if research funders adopted the practice of requiring researchers to specify the details of their methods and analyses in advance on a publicly-available database. And once the research was done, the publication should contain a link to a site where data are openly available for scrutiny – with appropriate safeguards about conditions for re-use.
As regards research methods, we need better training of scientists to become more aware of the limitations of the methods that they use. Too often statistical training is a dry and inaccessible discipline. All scientists should be taught how to generate random datasets: nothing is quite as good at instilling a proper understanding of p-values as seeing the apparent patterns in data that will inevitably arise if you look hard enough at some random numbers. In addition, not enough researchers receive training in best practices for ensuring quality of data entry, or in exploratory data analysis to check the numbers are coherent and meet assumptions of the analytic approach.
In my original post on expansion of regulators, I suggested that before a new regulation is introduced, there should be a cold-blooded cost-benefit analysis that considers, among other things, the cost of the regulation both in terms of the salaries of people who implement it, and the time and other costs to those affected by it. My concern is that among the 'other costs' is something rather nebulous that could easily get missed. Quite simply, doing good research takes time and mental space of the researchers. Most researchers are geeks who like nothing better than staring at data and thinking about complicated problems. If you require them to spend time satisfying bureaucratic requirements, this saps the spirit and reduces creativity.
I think we can learn much from the way ethics regulations have panned out. When a new system was first introduced in response to the Alder Hey scandal, I'm sure many thought it was a good idea. It has taken several years for the full impact to be appreciated. The problems are documented in a report by the Academy of Medical Sciences, which noted "Urgent changes are required to the regulation and governance of health research in the UK because unnecessary delays, bureaucracy and complexity are stifling medical advances, without additional benefits to patient safety".
If the account in the Independent is to be believed, then the Concordat for Research Integrity could lead to a similar outcome. I'm glad I will retire before it is fully implemented.
Monday, 17 June 2013
Sunday, 16 June 2013
Overhyped genetic findings: the case of dyslexia
A press release by Yale University Press Office was recently recycled on the Research Blogging website*, announcing that their researchers had made a major breakthrough. Specifically they said "A new study of the genetic origins of dyslexia and other learning disabilities could allow for earlier diagnoses and more successful interventions, according to researchers at Yale School of Medicine. Many students now are not diagnosed until high school, at which point treatments are less effective." The breathless account by the Press Office is hard to square with the abstract of the paper, which makes no mention of early diagnosis or intervention, but rather focuses on characterising a putative functional risk variant in the DCDC2 gene, named READ1, and establishing its association with reading and language skills.
I've discussed why this kind of thing is problematic in a previous blogpost, but perhaps a figure will help. The point is that in a large sample you can have a statistically strong association between a condition such as dyslexia and a genetic variant, but this does not mean that you can predict who will be dyslexic from their genes.
In this example, based on one of the best-replicated associations in the literature, you can see that most people with dyslexia don't have the risk version of the gene, and most people with the risk version of the gene don't have dyslexia. The effect sizes of individual genetic variants can be very small even when the strength of genetic association is large.
[Figure: Proportions with risk variants estimated from Scerri et al (2011)]
So what about the results from the latest Yale press release? Do they allow for more accurate identification of dyslexia on the basis of genes? In a word, no. I was pleased to see that the authors reported the effect sizes associated with the key genetic variants, which makes it relatively easy to estimate their usefulness in screening. In addition to identifying two sequences in DCDC2 associated with risk of language or reading problems, the authors noted an interaction with a risk version of another gene, KIAA0319, such that children with risk versions in both genes were particularly likely to have problems. The relevant figure is shown here.
[Figure: Fig 3A from Powers et al (2013)]
There are several points to note from this plot, bearing in mind that dyslexia or SLI would normally only be diagnosed if a child's reading or language scores were at least 1.0 SD below average.
- For children who have either KIAA0319 or DCDC2 risk variants, but not both, the average score on reading and language measures is at most 0.1 SD below average.
- For those who have both risk factors together, some tests give scores that are 0.3 SD below average, but this is only a subset of the reading/language measures. On nonword reading, often used as a diagnostic test for dyslexia, there is no evidence of any deficit in those with both risk versions of the genes. On the two language measures, the deficit hovers around 0.15 SD below the mean.
- The tests that show the largest deficits in those with two risk factors are measures of IQ rather than reading or language. Even here, the degree of impairment in those with two risk factors together indicates that the majority of children with this genotype would not fall in the impaired range.
- The number of children with the two risk factors together is very small, around 1% of the population.
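To put these effect sizes in screening terms, here is a rough sketch of the arithmetic in Python. It rests on assumptions of mine rather than anything reported by Powers et al: that scores are normally distributed with the same SD in every genotype group, and that the diagnostic cutoff sits exactly 1.0 SD below the population mean. The function name is purely illustrative.

```python
from statistics import NormalDist


def proportion_below_cutoff(mean_shift_sd, cutoff_sd=-1.0):
    """Proportion of a group scoring below the cutoff.

    Assumes (hypothetically) normal scores with SD = 1 in each group,
    with the group mean shifted by mean_shift_sd from the population mean.
    """
    return NormalDist(mu=mean_shift_sd, sigma=1.0).cdf(cutoff_sd)


# Mean shifts taken from the points listed above (in SD units).
groups = [
    ("no shift (population as a whole)", 0.0),
    ("one risk variant, 0.1 SD below average", -0.1),
    ("both risk variants, 0.3 SD below average", -0.3),
]

for label, shift in groups:
    print(f"{label}: {proportion_below_cutoff(shift):.1%} below cutoff")
```

Under these assumptions roughly 16% of the whole population falls below the cutoff, compared with about 18% of those with a 0.1 SD deficit and about 24% of those with a 0.3 SD deficit. So even on the tests showing the largest effect, around three-quarters of children with both risk variants would not be in the impaired range.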
*It is unclear to me whether the Yale University Press Office are actively involved in gatecrashing Research Blogging, or whether this is just an independent 'blogger' who is recycling press releases as if they are blogposts.
References
Powers, N., Eicher, J., Butter, F., Kong, Y., Miller, L., Ring, S., Mann, M., & Gruen, J. (2013). Alleles of a polymorphic ETV6 binding site in DCDC2 confer risk of reading and language impairment. The American Journal of Human Genetics. doi: 10.1016/j.ajhg.2013.05.008
Scerri, T. S., Morris, A. P., Buckingham, L. L., Newbury, D. F., Miller, L. L., Monaco, A. P., . . . Paracchini, S. (2011). DCDC2, KIAA0319 and CMIP are associated with reading-related traits. Biological Psychiatry, 70, 237-245. doi: 10.1016/j.biopsych.2011.02.005
Friday, 7 June 2013
Interpreting unexpected significant results
[Cartoon: ©www.cartoonstock.com]
a) Describe this as my main effect of interest, revising my hypothesis to argue for a site-specific sex effect
b) Describe the result as an exploratory finding in need of replication
c) Ignore the result as it was not predicted and is likely to be a false positive
I'd love to do a survey to see how people respond to these choices; my guess is many would opt for a) and few would opt for c). Yet in this situation, the likelihood of the result being a false positive is very high – much higher than many people realise.
Many people assume that if an ANOVA output is significant at the .05 level, there's only a one in twenty chance of it being a spurious chance effect. We have been taught that we do ANOVA rather than numerous t-tests because ANOVA adjusts for multiple comparisons. But this interpretation is quite wrong. ANOVA adjusts for the number of levels within a factor, so, for instance, the probability of finding a significant effect of group is the same regardless of how many groups you have. ANOVA makes no adjustment to p-values for the number of factors and interactions in your design. The more of these you have, the greater the chance of turning up a "significant" result.
So, for the example given above, the probability of finding something significant at .05 is as follows:
For the four-way ANOVA example above, we have 15 terms (four main effects, six 2-way interactions, four 3-way interactions and one 4-way interaction) and the probability of finding no significant effect is .95^15 = .46. It follows that the probability of finding something significant is .54.
And for a three-way ANOVA there are seven terms (three main effects, three 2-way interactions and one 3-way interaction), and p (something significant) = .30.
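The arithmetic behind these figures can be sketched in a few lines of Python. This is a minimal illustration, not a general-purpose tool: the function names are my own, and the calculation assumes, as the figures above do, that the tests of the different terms are independent of one another.

```python
from math import comb


def n_terms(n_factors):
    """Number of main effects plus interactions in a full factorial ANOVA.

    There are C(n, k) terms involving exactly k factors, so the total
    is the sum over k = 1..n, which equals 2**n - 1.
    """
    return sum(comb(n_factors, k) for k in range(1, n_factors + 1))


def p_at_least_one_significant(n_tests, alpha=0.05):
    """Chance that at least one of n_tests independent null tests has p < alpha."""
    return 1 - (1 - alpha) ** n_tests


for n_factors in (3, 4):
    k = n_terms(n_factors)
    p = p_at_least_one_significant(k)
    print(f"{n_factors}-way ANOVA: {k} terms, p(something significant) = {p:.2f}")
```

Passing a larger number of factors to `n_terms` shows how quickly the number of terms, and with it the chance of a spurious "significant" result, grows as factors are added to the design.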
So, basically, if you do a four-way ANOVA, and you don't care what results come out, provided something is significant, you have a slightly greater than 50% chance of being satisfied. This might seem like an implausible example: after all who uses ANOVA like this? Well, unfortunately, this example corresponds rather closely to what often happens in electrophysiological research using event-related potentials (ERPs). In this field, the interest is often in comparing a clinical and a control group, and so some results are more interesting than others: the main effect of group, and the seven interactions with group are the principal focus of attention. But hypotheses about exactly what will be found are seldom clearcut: excitement is generated by any p-value associated with a group term that falls below .05. With eight terms involving group, there's a one in three chance (1 - .95^8 = .34) that one of them will have a p-value this low. This means that the potential for 'false positive psychology' in this field is enormous (Simmons et al, 2011).
A corollary of this is that researchers can modify the likelihood of finding a "significant" result by selecting one ANOVA design rather than another. Suppose I'm interested in comparing brain responses to standard and deviant sounds. One way of doing this is to compute the difference between ERPs to the two auditory stimuli and use this difference score as the dependent variable: this reduces my ANOVA from a 4-way to a 3-way design, and gives fewer opportunities for spurious findings. So you will get a different risk of a false positive, depending on how you analyse the data.

Another feature of ERP research is that there is flexibility in how electrodes are handled in an ANOVA design: since there is symmetry in electrode placement, it is not uncommon to treat hemisphere as one factor, and electrode site as another. The alternative is just to treat electrode as a repeated measure. This is not a neutral choice: the chance of spurious findings is greater if one adopts the first approach, simply because it adds a factor to the analysis, plus all the interactions with that factor.
I stumbled across these insights into ANOVA when I was simulating data using a design adopted in a recent PLOS One paper that I'd commented on. I was initially interested in looking at the impact of adopting an unbalanced design in ANOVA: this study had a group factor with sample sizes of 20, 12 and 12. Unbalanced designs are known to be problematic for repeated measures ANOVA and I initially thought this might be the reason why simulated random numbers were giving such a lot of "significant" p-values. However, when I modified the simulation to use equal sample sizes across groups, the analysis continued to generate far more low p-values than I had anticipated, and I eventually twigged that this was because this is what you get if you use 4-way ANOVA. For any one main effect or interaction, the probability of p < .05 was one in twenty: but the probability that at least one term in the analysis would give p < .05 was closer to 50%.
The analytic approach adopted in the PLOS One paper is pretty standard in the field of ERP. Indeed, I have seen papers where 5-way or even 6-way repeated measures ANOVA is used. When you do an ANOVA and it spews out the results, it's tempting to home in on the results that achieve the magical significance level of .05 and then formulate some kind of explanation for the findings. Alas, this is an approach that has left the field swamped by spurious results.
There have been various critiques of analytic methods in ERP, but I haven't yet found any that have focussed on this point. Kilner (2013) has noted the bias that arises when electrodes or windows are selected for analysis post hoc, on the basis that they give big effects. Others have noted problems with using electrode as a repeated measure, given that ERPs at different electrodes are often highly correlated. More generally, statisticians are urging psychologists to move away from using ANOVA to adopt multi-level modelling, which makes different assumptions and can cope, for instance, with unbalanced designs. However, we're not going to fix the problem of "false positive ERP" by adopting a different form of analysis. The problem is not just with the statistics, but with the use of statistics for what are, in effect, unconstrained exploratory analyses. Researchers in this field urgently need educating in the perils of post hoc interpretation of p-values and the importance of a priori specification of predictions.
I've argued before that the best way to teach people about statistics is to get them to generate their own random data sets. In the past, this was difficult, but these days it can be achieved using the free statistical software R. There's no better way of persuading someone to be less impressed by p < .05 than to show them just how readily a random dataset can generate "significant" findings. Those who want to explore this approach may find my blog on twin analysis in R useful for getting started (you don't need to get into the twin bits!).
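For anyone who wants a feel for this before opening R, here is a minimal sketch in Python using only the standard library. It is a deliberate simplification and not the simulation I ran: rather than a 4-way ANOVA, each imaginary study compares two groups of pure noise on 15 independent measures (standing in for the 15 terms of a 4-way design) with a simple z-test that assumes the SD is known to be 1. All names and default values are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean


def simulate_study(n_measures=15, n_per_group=20, rng=random):
    """Return two-sided p-values for one study in which nothing is going on.

    Both groups are drawn from the same normal distribution (mean 0, SD 1),
    so every 'significant' difference is a false positive.
    """
    std_normal = NormalDist()
    # Standard error of a difference between two means when SD is known to be 1.
    se = (2 / n_per_group) ** 0.5
    p_values = []
    for _ in range(n_measures):
        group_a = [rng.gauss(0, 1) for _ in range(n_per_group)]
        group_b = [rng.gauss(0, 1) for _ in range(n_per_group)]
        z = (mean(group_a) - mean(group_b)) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def proportion_with_false_positive(n_studies=2000, alpha=0.05, seed=1, **kwargs):
    """Proportion of simulated null studies with at least one p < alpha."""
    rng = random.Random(seed)
    hits = sum(
        any(p < alpha for p in simulate_study(rng=rng, **kwargs))
        for _ in range(n_studies)
    )
    return hits / n_studies


print("15 measures:", proportion_with_false_positive())
print("1 measure:  ", proportion_with_false_positive(n_measures=1))
```

With one measure per study the proportion should hover around one in twenty; with 15 it should come out in the region of a half, in line with the .54 worked out earlier. Changing `seed` or `n_measures` is a quick way to see how stable that pattern is.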
The field of ERP is particularly at risk of spurious findings because of the way in which ANOVA is often used, but the problem of false positives is not restricted to this area, nor indeed to psychology. The mindset of researchers needs to change radically, with a recognition that our statistical methods only allow us to distinguish signal from noise in the data if we understand the nature of chance.
Education about probability is one way forward. Another is to change how we do science to make a clear distinction between planned and exploratory analyses. This post was stimulated by a letter that appeared in the Guardian this week on which I was a signatory. The authors argued that we should encourage a system of pre-registration of research, to avoid the kind of post hoc interpretation of findings that is so widespread yet so damaging to science.
Reference
Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psychology. Psychological Science, 1359-1366. doi: 10.1037/e636412012-001
Tuesday, 4 June 2013
Bishopblog catalogue (updated 4th June 2013)
[Cartoon. Source: http://www.weblogcartoons.com/2008/11/23/ideas/]
Those of you who follow this blog may have noticed a lack of thematic coherence. I write about whatever is exercising my mind at the time, which can range from technical aspects of statistics to the design of bathroom taps. I decided it might be helpful to introduce a bit of order into this chaotic melange, so here is a catalogue of posts by topic.
Language impairment, dyslexia and related disorders
Autism
Autism diagnosis in cultural context (16 May 2011)
Are our ‘gold standard’ autism diagnostic instruments fit for purpose? (30 May 2011)
How common is autism? (7 Jun 2011)
Autism and hypersystematising parents (21 Jun 2011)
An open letter to Baroness Susan Greenfield (4 Aug 2011)
Susan Greenfield and autistic spectrum disorder: was she misrepresented? (12 Aug 2011)
Psychoanalytic treatment for autism: Interviews with French analysts (23 Jan 2012)
The ‘autism epidemic’ and diagnostic substitution (4 Jun 2012)
Developmental disorders/paediatrics
The hidden cost of neglected tropical diseases (25 Nov 2010)
The National Children's Study: a view from across the pond (25 Jun 2011)
The kids are all right in daycare (14 Sep 2011)
Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012)
Genetics
Where does the myth of a gene for things like intelligence come from? (9 Sep 2010)
Genes for optimism, dyslexia and obesity and other mythical beasts (10 Sep 2010)
The X and Y of sex differences (11 May 2011)
Review of How Genes Influence Behaviour (5 Jun 2011)
Getting genetic effect sizes in perspective (20 Apr 2012)
Moderate drinking in pregnancy: toxic or benign? (21 Nov 2012)
Genes, brains and lateralisation (22 Dec 2012)
Genetic variation and neuroimaging (11 Jan 2013)
Have we become slower and dumber? (15 May 2013)
Neuroscience
Neuroprognosis in dyslexia (22 Dec 2010)
Brain scans show that… (11 Jun 2011)
Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012)
Neuronal migration in language learning impairments (2 May 2012)
Sharing of MRI datasets (6 May 2012)
Genetic variation and neuroimaging (1 Jan 2013)
Statistics
Book review: biography of Richard Doll (5 Jun 2010)
Book review: the Invisible Gorilla (30 Jun 2010)
The difference between p < .05 and a screening test (23 Jul 2010)
Three ways to improve cognitive test scores without intervention (14 Aug 2010)
A short nerdy post about the use of percentiles (13 Apr 2011)
The joys of inventing data (5 Oct 2011)
Getting genetic effect sizes in perspective (20 Apr 2012)
Causal models of developmental disorders: the perils of correlational data (24 Jun 2012)
Data from the phonics screen (1 Oct 2012)
Moderate drinking in pregnancy: toxic or benign? (1 Nov 2012)
Flaky chocolate and the New England Journal of Medicine (13 Nov 2012)
Journalism/science communication
Orwellian prize for scientific misrepresentation (1 Jun 2010)
Journalists and the 'scientific breakthrough' (13 Jun 2010)
Science journal editors: a taxonomy (28 Sep 2010)
Orwellian prize for journalistic misrepresentation: an update (29 Jan 2011)
Academic publishing: why isn't psychology like physics? (26 Feb 2011)
Scientific communication: the Comment option (25 May 2011)
Accentuate the negative (26 Oct 2011)
Publishers, psychological tests and greed (30 Dec 2011)
Time for academics to withdraw free labour (7 Jan 2012)
Novelty, interest and replicability (19 Jan 2012)
2011 Orwellian Prize for Journalistic Misrepresentation (29 Jan 2012)
Time for neuroimaging (and PNAS) to clean up its act (5 Mar 2012)
Communicating science in the age of the internet (13 Jul 2012)
How to bury your academic writing (26 Aug 2012)
High-impact journals: where newsworthiness trumps methodology (10 Mar 2013)
Blogging as post-publication peer review (21 Mar 2013)
A short rant about numbered journal references (5 Apr 2013)
Schizophrenia and child abuse in the media (26 May 2013)
Social Media
A gentle introduction to Twitter for the apprehensive academic (14 Jun 2011)
Your Twitter Profile: The Importance of Not Being Earnest (19 Nov 2011)
Will I still be tweeting in 2013? (2 Jan 2012)
Blogging in the service of science (10 Mar 2012)
Blogging as post-publication peer review (21 Mar 2013)
Academic life
An exciting day in the life of a scientist (24 Jun 2010)
How our current reward structures have distorted and damaged science (6 Aug 2010)
The challenge for science: speech by Colin Blakemore (14 Oct 2010)
When ethics regulations have unethical consequences (14 Dec 2010)
A day working from home (23 Dec 2010)
Should we ration research grant applications? (8 Jan 2011)
The one hour lecture (11 Mar 2011)
The expansion of research regulators (20 Mar 2011)
Should we ever fight lies with lies? (19 Jun 2011)
How to survive in psychological research (13 Jul 2011)
So you want to be a research assistant? (25 Aug 2011)
NHS research ethics procedures: a modern-day Circumlocution Office (18 Dec 2011)
The REF: a monster that sucks time and money from academic institutions (20 Mar 2012)
The ultimate email auto-response (12 Apr 2012)
Well, this should be easy…. (21 May 2012)
Journal impact factors and REF2014 (19 Jan 2013)
An alternative to REF2014 (26 Jan 2013)
Postgraduate education: time for a rethink (9 Feb 2013)
High-impact journals: where newsworthiness trumps methodology (10 Mar 2013)
Ten things that can sink a grant proposal (19 Mar 2013)
Blogging as post-publication peer review (21 Mar 2013)
The academic backlog (9 May 2013)
Celebrity scientists/quackery
Three ways to improve cognitive test scores without intervention (14 Aug 2010)
What does it take to become a Fellow of the RSM? (24 Jul 2011)
An open letter to Baroness Susan Greenfield (4 Aug 2011)
Susan Greenfield and autistic spectrum disorder: was she misrepresented? (12 Aug 2011)
How to become a celebrity scientific expert (12 Sep 2011)
The kids are all right in daycare (14 Sep 2011)
The weird world of US ethics regulation (25 Nov 2011)
Pioneering treatment or quackery? How to decide (4 Dec 2011)
Psychoanalytic treatment for autism: Interviews with French analysts (23 Jan 2012)
Neuroscientific interventions for dyslexia: red flags (24 Feb 2012)
Women
Academic mobbing in cyberspace (30 May 2010)
What works for women: some useful links (12 Jan 2011)
The burqua ban: what's a liberal response (21 Apr 2011)
C'mon sisters! Speak out! (28 Mar 2012)
Psychology: where are all the men? (5 Nov 2012)
Politics and Religion
Lies, damned lies and spin (15 Oct 2011)
A letter to Nick Clegg from an ex liberal democrat (11 Mar 2012)
BBC's 'extensive coverage' of the NHS bill (9 Apr 2012)
Schoolgirls' health put at risk by Catholic view on vaccination (30 Jun 2012)
Postscript: academic mobbing in cyberspace (31 May 2010)
Parasites, pangolins and peer review (26 Nov 2010)
Humour
Orwellian prize for scientific misrepresentation (1 Jun 2010)
An exciting day in the life of a scientist (24 Jun 2010)
Science journal editors: a taxonomy (28 Sep 2010)
Parasites, pangolins and peer review (26 Nov 2010)
A day working from home (23 Dec 2010)
The one hour lecture (11 Mar 2011)
The expansion of research regulators (20 Mar 2011)
Scientific communication: the Comment option (25 May 2011)
How to survive in psychological research (13 Jul 2011)
Your Twitter Profile: The Importance of Not Being Earnest (19 Nov 2011)
2011 Orwellian Prize for Journalistic Misrepresentation (29 Jan 2012)
The ultimate email auto-response (12 Apr 2012)
Well, this should be easy…. (21 May 2012)
The bewildering bathroom challenge (19 Jul 2012)
Are Starbucks hiding their profits on the planet Vulcan? (15 Nov 2012)
Forget the Tower of Hanoi (11 Apr 2013)
Sunday, 26 May 2013
Schizophrenia and child abuse in the media
A couple of weeks ago, the Observer printed a debate headlined “Do we need to change the way we are thinking about mental illness?” I read it with interest, as I happen to think that we do need to change, and that the new Diagnostic and Statistical Manual of the American Psychiatric Association (DSM5) has numerous problems.
The discussion was opened by Simon Wessely, a member of the Royal College of Psychiatrists, who responded No. He didn’t exactly defend the DSM5, but he disagreed with the criticism that it reduces psychiatry to biology. The Yes response was by Oliver James, an author and clinical psychologist, who attacked the medical model of mental illness, noting the importance of experience, especially childhood experience, in causing psychiatric symptoms. I happen to take a middle way here; there’s ample evidence of biological risk factors for many forms of mental illness, but in our contemporary quest for biomarkers, the role of experience is often sidelined. The idea that you might be depressed because bad things have happened to you goes unmentioned in much contemporary research on affective disorders, for instance.
But, rather than getting into that debate, I want to make a more general point about evidence. In his statement, Oliver James came out with some statistics that surprised me. In particular, he said:
Accurate assessment of child abuse is difficult because it is often hidden away and many cases may be missed. Retrospective accounts of abuse are notoriously hard to validate: false memories can be induced, but true memories may be suppressed. All those writing in this field note the problems of getting accurate data, and the wide variations in rates of abuse reported in the general population, depending on how it is defined. For instance, Fryers and Brugha (2013) noted that for the general population, estimated rates of child physical abuse have ranged from 10% to 31% in males and 6% to 40% in females, and child sexual abuse from 3% to 29% in males and 7% to 36% in females.
Fryers and Brugha focused on prospective, longitudinal studies, taking evidence from over 200 studies. They concluded that “most abuses were associated statistically with almost all classes of disorder (psychosis being largely an exception)” (p. 26), and “schizophrenia and closely related syndromes have not generally been much associated with previous child abuse but the picture is not simple” (p. 27). They noted that one Australian cohort study found an increase in schizophrenic disorders in children who had been sexually abused, though this was an unusual finding in the context of the research literature as a whole.
Other recent meta-analyses have included case-control studies, where participants are recruited in adulthood, and histories of participants with schizophrenia are compared with those of a control group. Matheson et al (2012) summarised seven studies involving a comparison between patients with schizophrenia and non-psychiatric controls, but definitions of adversity varied widely. In some studies, 'adversity' extended beyond abuse, though physical, sexual and emotional abuse predominated in the definitions. Overall, this review gave results similar to those reported by James, with an adversity rate of 58% in the schizophrenia group and 27% in the controls. This high rate in those with schizophrenia depends, however, on one large study that included adversity factors going beyond abuse, such as having a parent with nervous or emotional problems, or a lot of conflict and tension in the household. In this same study, 91% of controls compared with 71% of those with schizophrenia described their childhood as "happy". If this study is excluded, rates of adversity in those with schizophrenia fall to 28% compared to 8% in controls – still a notable and statistically reliable effect, but with less dramatic absolute rates of adversity than those cited by James.
A larger meta-analysis including a total of 36 studies was conducted by Varese et al (2012), who obtained similar results: a higher rate of childhood adversities in those who develop psychosis, with an odds ratio estimated at 2.78 (95% CI = 2.34-3.31). Significant associations of similar magnitude were found for all types of adversity other than parental death. The odds ratio is not, however, the same as a risk ratio, so should not be interpreted as indicating that those with schizophrenia are three times more likely to have suffered abuse. I don’t want to downplay the importance of the association, which is nevertheless striking, and supports the authors' conclusion that clinicians should routinely inquire about adverse events in childhood when seeing patients with psychiatric conditions.
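The gap between an odds ratio and a simple ratio of proportions is easy to see with a little arithmetic. The Python sketch below uses the pooled adversity percentages quoted above from Matheson et al, purely for illustration: it is not the method used in either meta-analysis, the function names are my own, and the odds ratios it produces are not the meta-analytic estimates.

```python
def odds(p):
    """Convert a proportion into odds."""
    return p / (1 - p)


def odds_ratio(p_cases, p_controls):
    """Odds of adversity in cases divided by odds of adversity in controls."""
    return odds(p_cases) / odds(p_controls)


def proportion_ratio(p_cases, p_controls):
    """How many times more common adversity is in cases than in controls."""
    return p_cases / p_controls


# Proportions reporting adversity, as quoted in the text above.
comparisons = [
    ("all seven studies", 0.58, 0.27),
    ("excluding the one large study", 0.28, 0.08),
]

for label, p_cases, p_controls in comparisons:
    print(
        f"{label}: odds ratio = {odds_ratio(p_cases, p_controls):.2f}, "
        f"ratio of proportions = {proportion_ratio(p_cases, p_controls):.2f}"
    )
```

In both comparisons the odds ratio is larger than the ratio of proportions (about 3.7 versus 2.1, and about 4.5 versus 3.5), and the gap widens as adversity becomes more common. This is why an odds ratio of 2.78 cannot be read as "nearly three times as likely".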
Overall, the research literature confirms a reliable association between childhood adversity, including abuse, and schizophrenia in adulthood. The conclusion drawn by James, however, that “abuse is the major cause of psychoses” is not endorsed by any of the academic authors of the reviews I looked at. The complexity of causation in the field of neuroscience and mental health is a topic I hope to return to in a later blogpost, but for the time being, I would recommend another review of this literature by Sideli et al (2012), which discusses possible explanations for links between adversity and psychosis. Most researchers familiar with this area would endorse this quote by Fryers and Brugha (2013), reflecting on our state of knowledge in this area:
References
Fryers, T., & Brugha, T. (2013). Childhood determinants of adult psychiatric disorder. Clinical Practice & Epidemiology in Mental Health, 9 (1), 1-50 DOI: 10.2174/1745017901309010001
Matheson, S. L., Shepherd, A. M., Pinchbeck, R. M., Laurens, K. R., & Carr, V. J. (2013). Childhood adversity in schizophrenia: a systematic meta-analysis. Psychological Medicine, 43(2), 225-238. doi: 10.1017/s0033291712000785
Sideli, L., Mule, A., La Barbera, D., & Murray, R. M. (2012). Do child abuse and maltreatment increase risk of schizophrenia? Psychiatry Investigation, 9, 87-99. doi.org/10.4306/pi.2012.9.2.87
Varese, F., Smeets, F., Drukker, M., Lieverse, R., Lataster, T., Viechtbauer, W., Read, J., van Os, J., & Bentall, R. (2012). Childhood adversities increase the risk of psychosis: A meta-analysis of patient-control, prospective- and cross-sectional cohort studies Schizophrenia Bulletin, 38 (4), 661-671 DOI: 10.1093/schbul/sbs050
Note: Thanks to the anonymous reviewer who noted the problem with my initial analysis.
The discussion was opened by Simon Wessely, a member of the Royal College of Psychiatrists, who responded No. He didn’t exactly defend the DSM5, but he disagreed with the criticism that it reduces psychiatry to biology. The Yes response was by Oliver James, an author and clinical psychologist, who attacked the medical model of mental illness, noting the importance of experience, especially childhood experience, in causing psychiatric symptoms. I happen to take a middle way here; there’s ample evidence of biological risk factors for many forms of mental illness, but in our contemporary quest for biomarkers, the role of experience is often sidelined. The idea that you might be depressed because bad things have happened to you goes unmentioned in much contemporary research on affective disorders, for instance.
But, rather than getting into that debate, I want to make a more general point about evidence. In his statement, Oliver James came out with some statistics that surprised me. In particular, he said:
“13 studies find that more than half of schizophrenics suffered childhood abuse. Another review of 23 studies shows that schizophrenics are at least three times more likely to have been abused than non-schizophrenics. It is becoming apparent that abuse is the major cause of psychoses.”
The frustrating thing about this claim is that no sources were given. I don't work in this area, so I thought I’d see if I could track down the articles cited by James. My initial attempt was based on a hasty search of Web of Knowledge on the morning that the article appeared. I described the results of my searches on a blogpost that day, but a commentator pointed out that I'd limited myself to looking at the link between schizophrenia and sexual abuse, whereas James had been referring to childhood abuse in general. I realised that a fair and proper appraisal of his claims should look at this broader category, and accordingly I removed the original post until I could find time to do a more thorough job.
Accurate assessment of child abuse is difficult because it is often hidden away and many cases may be missed. Retrospective accounts of abuse are notoriously hard to validate: false memories can be induced, but true memories may be suppressed. All those writing in this field note the problems of getting accurate data, and the wide variations in rates of abuse reported in the general population, depending on how it is defined. For instance, Fryers and Brugha (2013) noted that for the general population, estimated rates of child physical abuse have ranged from 10% to 31% in males and 6% to 40% in females, and of child sexual abuse from 3% to 29% in males and 7% to 36% in females.
Fryers and Brugha focused on prospective, longitudinal studies, taking evidence from over 200 studies. They concluded that “most abuses were associated statistically with almost all classes of disorder (psychosis being largely an exception)” (p. 26), and “schizophrenia and closely related syndromes have not generally been much associated with previous child abuse but the picture is not simple” (p. 27). They noted that one Australian cohort study found an increase in schizophrenic disorders in children who had been sexually abused, though this was an unusual finding in the context of the research literature as a whole.
Other recent meta-analyses have included case-control studies, where participants are recruited in adulthood, and histories of participants with schizophrenia are compared with those of a control group. Matheson et al (2013) summarised seven studies involving a comparison between patients with schizophrenia and non-psychiatric controls, but definitions of adversity varied widely. In some studies, 'adversity' extended beyond abuse, though physical, sexual and emotional abuse predominated in the definitions. Overall, this review gave results similar to those reported by James, with an adversity rate of 58% in the schizophrenia group and 27% in the controls. This high rate in those with schizophrenia depends, however, on one large study that included adversity factors going beyond abuse, such as having a parent with nervous or emotional problems, or a lot of conflict and tension in the household. In this same study, 91% of controls compared with 71% of those with schizophrenia described their childhood as "happy". If this study is excluded, rates of adversity in those with schizophrenia fall to 28% compared to 8% in controls: still a notable and statistically reliable effect, but with less dramatic absolute rates of adversity than those cited by James.
A larger meta-analysis including a total of 36 studies was conducted by Varese et al (2012), who obtained similar results: a higher rate of childhood adversities in those who develop psychosis, with an odds ratio estimated at 2.78 (95% CI = 2.34-3.31). Significant associations of similar magnitude were found for all types of adversity other than parental death. The odds ratio is not, however, the same as a risk ratio, so should not be interpreted as indicating that those with schizophrenia are three times more likely to have suffered abuse. I don’t want to downplay the importance of the association, which is nevertheless striking, and supports the authors' conclusion that clinicians should routinely inquire about adverse events in childhood when seeing patients with psychiatric conditions.
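The gap between an odds ratio and a "times more likely" ratio can be shown with simple arithmetic. The sketch below is purely illustrative: it uses the adversity rates from the Matheson et al review quoted earlier (not Varese et al's data), and the function names are my own. It assumes we simply compare the proportion reporting adversity in each group.

```python
def proportion_ratio(p_patients, p_controls):
    """How many times more common adversity is in patients than in controls."""
    return p_patients / p_controls


def odds_ratio(p_patients, p_controls):
    """Ratio of the odds (p / (1 - p)) of adversity in patients vs controls."""
    odds_patients = p_patients / (1 - p_patients)
    odds_controls = p_controls / (1 - p_controls)
    return odds_patients / odds_controls


# Adversity rates (schizophrenia group, control group) as quoted in the text.
scenarios = {
    "all seven studies": (0.58, 0.27),
    "one large study excluded": (0.28, 0.08),
}

for label, (p_sz, p_ctrl) in scenarios.items():
    print(
        f"{label}: ratio of proportions = {proportion_ratio(p_sz, p_ctrl):.2f}, "
        f"odds ratio = {odds_ratio(p_sz, p_ctrl):.2f}"
    )
```

In both scenarios the odds ratio is larger than the ratio of proportions, and the gap widens as adversity becomes more common. This is why an odds ratio close to 3 should not be read as "three times more likely".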
Overall, the research literature confirms a reliable association between childhood adversity, including abuse, and schizophrenia in adulthood. The conclusion drawn by James, however, that “abuse is the major cause of psychoses” is not endorsed by any of the academic authors of the reviews I looked at. The complexity of causation in the field of neuroscience and mental health is a topic I hope to return to in a later blogpost, but for the time being, I would recommend another review of this literature by Sideli et al (2012), which discusses possible explanations for links between adversity and psychosis. Most researchers familiar with this area would endorse this quote by Fryers and Brugha (2013), reflecting on our state of knowledge in this area:
“From all this work an understanding has emerged of the 'cause' of serious mental illness as complex, varied and multi-factorial, encompassing elements of genetic constitution, childhood experience, characteristics of personality, significant life events, the quality of relationships, economic and social situation, life-style choices such as alcohol and other drugs, and aging. Some of these factors have been elucidated to the point of representing acknowledged risk factors for specific forms of mental illness or mental illness in general, such as familial genes, relative poverty, major trauma, excessive alcohol consumption, extreme negative life-events, poor education, and long-term unemployment.
“These may all be experienced in childhood and we do not need research to tell us that poverty, inadequate education and life events such as loss of a parent or displacement as a refugee by war, or trauma such as child sex abuse are bad. Nor should it need evidence of later consequences such as mental illness to argue for the prevention of such situations and experiences. The strongest argument is in terms of human rights. However, the issues are not generally given a high priority and people may think them exaggerated or assume that these things are just part of human life and children get over them anyway. But we should not be willing to accept these as inevitably part of human life, but fight for a better life for our children – and hope thereby for a better life for adults and the whole community.”
But to come back to the impetus for the current blogpost, the point I’d really like to make is that if the Observer wants to run articles like this, where scientific evidence is cited, the editor should ask for sources for the evidence, and should provide these with the article. As noted by Prof Michael O'Donovan in a letter to the Observer, Oliver James is "unknown in the scientific community as a researcher into the origins of psychosis". This does not make his opinions worthless, but if he wants to argue his case from the scientific evidence, then we need to know what evidence he is using, just as we would expect for any reputable scientist making such claims. Most readers don't have the resources or skills to trawl through research databases trying to establish whether the evidence is accurately reported or cherry-picked. As O'Donovan points out, confident assertions about childhood causes of schizophrenia can only cause distress to families affected by this condition, and a responsible newspaper should take care to ensure that such claims have a verifiable basis.
References
Fryers, T., & Brugha, T. (2013). Childhood determinants of adult psychiatric disorder. Clinical Practice & Epidemiology in Mental Health, 9(1), 1-50. doi: 10.2174/1745017901309010001
Matheson, S. L., Shepherd, A. M., Pinchbeck, R. M., Laurens, K. R., & Carr, V. J. (2013). Childhood adversity in schizophrenia: a systematic meta-analysis. Psychological Medicine, 43(2), 225-238. doi: 10.1017/s0033291712000785
Sideli, L., Mule, A., La Barbera, D., & Murray, R. M. (2012). Do child abuse and maltreatment increase risk of schizophrenia? Psychiatry Investigation, 9, 87-99. doi: 10.4306/pi.2012.9.2.87
Varese, F., Smeets, F., Drukker, M., Lieverse, R., Lataster, T., Viechtbauer, W., Read, J., van Os, J., & Bentall, R. (2012). Childhood adversities increase the risk of psychosis: A meta-analysis of patient-control, prospective- and cross-sectional cohort studies. Schizophrenia Bulletin, 38(4), 661-671. doi: 10.1093/schbul/sbs050
Note: Thanks to the anonymous reviewer who noted the problem with my initial analysis.
Wednesday, 15 May 2013
Have we become slower and dumber?
Guest post by Patrick Rabbitt
This week, a paper by Woodley et al (2013) was widely quoted in the media (e.g. Daily Mail, Telegraph). The authors dramatically announced that the average intelligence of populations of Western industrialised societies has fallen since the Victorian era. This is provocative because previous analyses of large archived datasets of intelligence test scores by Flynn and others show the opposite. However, Woodley et al did not examine average intelligence test scores obtained from different generations. They compared 16 sets of data from Simple Reaction Time (SRT) experiments made on groups of people at various times between 1884 and 2002. In all of these experiments volunteers responded to a single light signal by pressing a single response key. Data for women are incomplete, but average SRTs for men increase significantly with year of testing. Because Woodley et al regard SRTs as good inverse proxy measures for intelligence test scores, and as in some senses “purer” measures of intelligence than pencil and paper tests, they concluded that more recent samples are less intelligent than earlier ones.
Throughout their paper the authors argue that higher intelligence of persons alive during the Victorian era can explain why their creativity and achievements were markedly greater than for later, duller generations. We can leave aside the important question of whether there is any sound evidence that creativity and intellectual achievements have declined since a Great Victorian Flowering, because only two of the 16 datasets they compared were collected before Victoria’s death in 1901. The remaining 14 datasets date between 1941 and 2004 and, of these, only four were collected before 1970. So most of the studies analysed were made within my personal working lifespan. This provokes both nostalgia and distrust. Between 1959 and 2004 I collected reaction times (RTs) from many large samples of people, but it would make no sense for me to compare absolute values of group mean RTs that I obtained before and after 1975. This was because, until 1975, like nearly all of my colleagues, the only apparatus I had was Dekatron counters, the Birren Psychomet or the SPARTA apparatus, none of which measured intervals shorter than 100 msec. Consequently, when my apparatus gave a reading of 200 msec, the actual Reaction Time might be anywhere between 200 and 299 msec. Like most of my colleagues I always computed and published mean RTs to three decimal places, but this was pretentious because all the RTs I had collected had been, in effect, rounded down by my equipment. After 1975, easier access to computers and better programs gradually began to allow true millisecond resolution. More investigators took advantage of new equipment and our reports of millisecond averages became less misleading. I am unsurprised that mean RTs computed from post-1975 data were consistently and significantly longer than those for pre-1975 data.
Changes in recording accuracy are a sufficient reason to withhold excitement at Woodley et al’s comparison. It is worth noticing that other methodological issues also make it tricky to compare absolute values for means of RTs that were collected at different times and so with different kinds of equipment. For example, RTs are affected by differences in signal visibility and rise-times to maximum brightness between tungsten lamps, computer monitor displays, neon bulbs and LCDs. The stiffness and “throw” of response buttons will also have varied between the set-ups that investigators used. When comparing absolute values of SRTs, another important factor is whether or not each signal to respond is preceded by a warning signal, whether the periods between warning signals and response signals are constant or variable, and just how long they are (intervals between, approximately, 200 and 800 ms allow faster RTs than shorter or longer ones). Knowing these methodological quirks makes us realise that, in marked contrast to intelligence tests, methodologies for measuring RT have been thoroughly explored but never standardised.
So I do not yet believe that Woodley et al’s analyses show that psychologists of my generation were probably (once!) smarter than our young colleagues (now) are. This seems unlikely, but perhaps if I read further publications by these industrious investigators I may become convinced that this is really the case.
References
Flynn, J. R. (1987). Massive IQ gains in 14 nations - what IQ tests really measure. Psychological Bulletin, 101(2), 171-191. doi: 10.1037/0033-2909.101.2.171
Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. http://dx.doi.org/10.1016/j.intell.2013.04.006
POST SCRIPT, 24th May 2013
Dr Woodley has published a response to my critique on James Thompson's blog. He asks me to answer. I am glad to do so. Sluggishness has been due only to the pleasure of reading the many articles to which Woodley drew my attention. Dorothy’s remorseless archaeology of this trove, summarised in the table below, has provoked much domestic merriment during the past few days. We are grateful to Dr Woodley for this diversion. Here are my thoughts on his comments on my post.
Woodley et al used data from a meta-analysis by Silverman (2010). I am grateful to Prof Silverman for very rapid access to his paper in which he compared average times to make a single response to a light signal from large samples in Francis Galton's anthropometric laboratories and from several later, smaller samples dating from 1941 to 2006. To these Woodley et al added a dataset from Helen Bradford Thompson's 1903 monograph "The mental traits of sex".
As Silverman (2010) trenchantly points out, there is a limit to possible comparisons from these datasets: “In principle, it would be possible to uncover the rate at which RT increased (since the Galton studies) by controlling for potentially confounding variables in a multiple regression analysis. However, this requires that each of these variables be represented by multiple data points, but this requirement cannot be met by the present dataset. Accurately describing change over time also requires that both ends of the temporal dimension be well represented in the dataset and that the dataset be free of outliers (Cohen, Cohen, West, & Aiken, 2003); neither of these requirements can be met …… Thus, it is important to reiterate that the purpose … is not to show that RT has changed according to a specific function over time but rather to show that modern studies have obtained RTs that are far longer than those obtained by Galton.”
Neither Silverman nor Woodley et al seem much concerned that results of comparisons might depend on differences between studies in apparatus and methods, which are shown here, together with temporal resolution where reported.
Since Galton's dataset is the key baseline for the conclusion that population mean RT is increasing, it is worth considering details of his equipment described here and in a wonderful archival paper “Galton’s Data a Century Later” by Johnson et al (1985): “……during its descent the pendulum gives a sight-signal by brushing against a very light and small mirror which reflects a light off or onto a screen, or, on the other hand, it gives a sound-signal by a light weight being thrown off the pendulum by impact with a hollow box. The position of the pendulum at either of these occurrences is known. The position of the pendulum when the response is made is obtained by means of a thread stretched parallel to the axis of the pendulum by two elastic bands one above and one below, the thread being in a plane through the axes of the pendulum, perpendicular to the plane of the pendulum's motion. This thread moves freely between two parallel bars in a horizontal plane, and the pressing of a key causes the horizontal bars to clamp the thread. Thus the clamped thread gives the position of the pendulum on striking the key. The elastic bands provide for the pendulum not being suddenly checked on the clamping. The horizontal bars are just below a horizontal scale, 800 mm. below the point of suspension of the pendulum. Galton provided a table for reading off the distance along the scale from the vertical position of the pendulum in terms of the time taken from the vertical position to the position in which the thread is clamped." (p. 347).
Contemporary journal referees would press authors for reassurance that the apparatus produced consistent values over trials and had no bias to over or underestimate. Obviously this would have been very difficult for Galton to achieve.
In my earlier post I noted that over the mid-to-late 20th century it became obvious that to report reaction times (RT) to three decimal places is misleading if equipment only allows recording in 100 ms steps. In that case a reading of 200 ms will remain until a further 100 ms have elapsed, effectively "rounding down" the RT. Woodley argues that we cannot assume that rounding down occurred. I do not follow his reasoning on this point. He also offers a statistical analysis to confirm that if the temporal resolution of the measure is the only difference between studies, this would not systematically underestimate RT. Disagreement on whether rounding occurred may only be resolved with empirical data comparing recorded and true RTs between equipments.
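The disagreement turns on what a coarse timer does to each reading. The following is a minimal simulation sketch, not a reconstruction of any real apparatus: it assumes hypothetical "true" RTs drawn from a normal distribution (mean 250 ms, SD 40 ms, values chosen only for illustration) and a recording step of 100 ms. It compares a counter that holds its last reading (truncation) with one that rounds to the nearest step.

```python
import random
import statistics

RESOLUTION_MS = 100  # assumed recording step of the hypothetical timer


def record(true_rt_ms, mode):
    """Return the reading a coarse timer would give for one true RT."""
    if mode == "truncate":
        # The counter shows the last completed step: 299 ms reads as 200 ms.
        return (true_rt_ms // RESOLUTION_MS) * RESOLUTION_MS
    if mode == "nearest":
        # An idealised timer that rounds to the closest step instead.
        return round(true_rt_ms / RESOLUTION_MS) * RESOLUTION_MS
    raise ValueError(f"unknown mode: {mode}")


rng = random.Random(1)  # fixed seed so the sketch is repeatable
true_rts = [rng.gauss(250, 40) for _ in range(100_000)]  # hypothetical data

true_mean = statistics.mean(true_rts)
truncate_bias = statistics.mean(record(rt, "truncate") for rt in true_rts) - true_mean
nearest_bias = statistics.mean(record(rt, "nearest") for rt in true_rts) - true_mean

print(f"bias with truncation:         {truncate_bias:.1f} ms")
print(f"bias with rounding to nearest: {nearest_bias:.1f} ms")
```

Under these assumptions, truncation pulls the group mean down by roughly half the recording step, whereas rounding to the nearest step leaves it nearly unbiased. Which of the two behaviours a given historical apparatus showed is exactly the empirical question raised above.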
A general concern with comparisons of RTs between studies is that they are significantly affected by the methodology and apparatus used to collect them. This is not only a matter of differences in resolution: apparatus and methodology can also introduce systematic bias in timing of trials. For a comprehensive account of how minor differences between different 21st century computers and commercial software can flaw comparisons between studies see Plant and Quinlan (2013), who write: "All that we can state with absolute certainty is that all studies are likely to suffer from varying degrees of presentation, synchronization, and response time errors if timing error was not specifically controlled for." I earlier suggested that apparently trivial procedural details can markedly affect RTs. Among these are whether or not participants are given warning signals, whether the intervals between warning signals and response signals are constant or vary across trials and how long these intervals are, the brightness of signal lamps and the mechanical properties of response keys. A further point also turns out to be relevant to assessment of Woodley et al's argument: average values will also depend on the number of trials recorded, and averaged, for each person, and whether outliers are excluded. Note, for instance, that the equipment used in the studies by Deary and Der, though appropriate for the comparisons that they made and reported, did not record RTs for individual trials but an averaged RT for an entire session. This makes it impossible to exclude outliers, as is normal good practice. The point is that comparisons that are satisfactory within the context of a single well-run experiment may be seriously misleading if made between equally scrupulous experiments using different apparatus and procedures. Johnson et al (1985) and Silverman (2010) stress that Galton’s data were wonderfully internally consistent. This reassures us that equipment and methods were well standardised within his own study. It cannot give any assurance that his data can be sensibly compared with those obtained with other very diverse equipments and methodologies.
Another excellent feature of the Galton dataset is that re-testing of part of his large initial sample allowed estimates of reliability of his measures. With his large sample sizes even low values of test-retest correlations were better than chance. Nevertheless it is interesting that the test-retest correlation for visual RT, at .17, on which Silverman’s and Woodley’s conclusions depend, was lower than the next lowest (high frequency auditory acuity, .28), or Snellen eye-chart (.58) and visual acuity (.76 to .79) (Johnson et al, 1985, Table 2).
We do not know whether warning signals were used in Galton's RT studies, or, if so, how long the preparatory intervals between warning and response signals might have been. Silverman (2010) had earlier acknowledged that preparatory interval duration might be an issue but felt that he could ignore it, both because of a report by Teichner that Wundt’s discovery of fore-period duration effects could not be independently substantiated and because he accepted Seashore et al’s (1941) reassurance that there are no effects of fore-period duration on RT.
Ever since a convincing study by Klemmer (1957) it has been recognised that the durations of preparatory intervals do significantly affect reaction times, that the effects of fore-period variation are large and that results cannot be usefully compared unless these factors are taken into consideration. Indeed during the 1960s fore-period effects were the staple topic of a veritable academic industry (see review by Niemi and Näätänen, 1981, and far too many other papers by Bertelson, Nickerson, Sanders, Rabbitt etc. etc). In this context Seashore et al’s (1941) failure to find fore-period effects does not increase our confidence in their results as one of the data points on which Woodley et al’s analysis is based.
Silverman’s lack of interest in fore-period duration was also heightened by Johnson et al’s (1985) comment that, as far as they were able to discover, each of Galton’s volunteers was only given one trial. Silverman implies that if each of Galton’s volunteers only recorded a single RT, variations in preparatory intervals are hardly an issue. It is also arguable that this relaxed procedure might have lengthened rather than shortened RTs. Well… Yes and No. First, it would be nice to know just how volunteers were alerted that their single trial was imminent. By a nod or a wink? A friendly pat on the shoulder? A verbal “Ready”? Second, an important point of using warning signals, and of recording many trials rather than just one, is that the first thing that all of us who run simple RT experiments discover is that volunteers are very prone to “jump the gun” and begin to respond before any signal appears, so recording absurdly fast “RTs” that can be as low as 10 to 60 ms. 20th and 21st century investigators are usually (!) careful to identify and exclude such observations. Many also edit out what they regard as implausibly slow responses. We do not know whether or how either kind of editing occurred in the Galton laboratories. Many participants would have jumped the gun and if this was their sole recorded reaction the effects on group means would have been considerable. If Galton’s staff did edit RTs, either acceptance of impulsive responses or dismissal of very slow responses would reduce means and favour the idea of “Speedy Victorians”.
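How much editing choices can move a group mean is easy to see with a toy example. The sketch below uses ten invented single-trial RTs (not Galton's data), including two anticipations and one very slow response, and hypothetical cut-offs of 100 ms and 1000 ms; the function name and thresholds are assumptions made purely for illustration.

```python
import statistics


def mean_rt(rts_ms, min_ms=None, max_ms=None):
    """Mean RT after optionally dropping responses outside [min_ms, max_ms]."""
    kept = [
        rt for rt in rts_ms
        if (min_ms is None or rt >= min_ms) and (max_ms is None or rt <= max_ms)
    ]
    return statistics.mean(kept)


# Invented sample: one trial per volunteer, in ms.
# 30 and 15 are "jumped the gun" responses; 1500 is an implausibly slow one.
sample = [210, 195, 30, 240, 225, 15, 205, 1500, 230, 220]

print(f"no editing:                   {mean_rt(sample):.1f} ms")
print(f"slow responses dropped only:  {mean_rt(sample, max_ms=1000):.1f} ms")
print(f"anticipations and slow both dropped: "
      f"{mean_rt(sample, min_ms=100, max_ms=1000):.1f} ms")
```

With this invented sample, keeping the anticipations while discarding the slow response gives a mean well below the one obtained when both kinds of outlier are removed, which is the direction of bias described above.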
I would like to stress that my concerns are methodological rather than dogmatic. Investigators of reaction times try to test models for information processing by making small changes in single variables in tasks run on the same apparatus and with exactly the same procedures. This makes us wary of conclusions from comparisons between datasets collected with wildly different equipments, procedures and groups of people. My concerns were shared by some of those whose data are used by Silverman and Woodley et al. For example, the operational guide for the Datico Terry 84 device used by Anger et al states that "A single device has been chosen because it is very difficult to compare reaction time data from different test devices".
Because I have spent most of my working life using RTs to compare the mental abilities of people of different ages I am very much in favour of using RT measurements as a research tool for individual differences. (For my personal interpretation of the relationships between people’s calendar ages and gross brain status and their performance on measures of mental speed, of fluid intelligence, of executive function, and of memory see e.g. Rabbitt et al, 2007). I also strongly believe that mining archived data is a very valuable scientific endeavour and becomes more valuable as the volume of available data exponentially increases. For example, Flynn’s dogged analyses of archived intelligence test scores show that data mining has raised provocative and surprising questions. I also believe, with Silverman, that large population studies provide good epidemiological evidence of the effects of changes in incidence of malnutrition or of misuse of pesticides or antibiotics. I am more amused than concerned when, in line with Galton’s strange eugenic obsessions, they are also discussed as potential illustrations of growing degeneracy of our species due to increased survival odds for the biologically unfit. As I noted in my original post, my only concern is that it is a time-wasting mistake to uncritically treat measurements of Reaction Times as being, in some sense, “purer”, more direct and more trustworthy indices of individual differences than other measures such as intelligence tests. Of course RTs can be sensitive and reliable measures of individual differences but, as things stand, equipments and procedures are not standardised and, because RTs are liable to many methodological quirks, we obtain widely different mean values from different population samples even from apparently very similar tasks.
![]() |
| http://www.flickr.com/photos/sciencemuseum/3321607591/ |
Throughout their paper the authors argue that higher intelligence of persons alive during the Victorian era can explain why their creativity and achievements were markedly greater than for later, duller generations. We can leave aside an important question whether there is any sound evidence that creativity and intellectual achievements have declined since a Great Victorian Flowering because only two of the 16 datasets they compared were collected before Victoria’s death in 1901. The remaining 14 datasets date between 1941 and 2004 and, of these, only four were collected before 1970. So most of the studies analysed were made within my personal working lifespan. This provokes both nostalgia and distrust. Between 1959 and 2004 I collected reaction times (RTs) from many large samples of people but it would make no sense for me to compare absolute values of group mean RTs that I obtained before and after 1975. This was because, until 1975, like nearly all of my colleagues, the only apparatus I had were Dekatron counters, the Birren Psychomet or SPARTA apparatus, none of which measured intervals shorter than 100 msec. Consequently, when my apparatus gave a reading of 200 msec. the actual Reaction Time might be anywhere between 200 and 299 msec. Like most of my colleagues I always computed and published mean RTs to three decimal places, but this was pretentious because all the RTs I had collected had been, in effect, rounded down by my equipment. After 1975, easier access to computers and better programs gradually began to allow true millisecond resolution. More investigators took advantage of new equipment and our reports of millisecond averages became less misleading. I am unsurprised that mean RTs computed from post-1975 data were consistently, and significantly longer than those for pre-1975 data.
Changes in recording accuracy are a sufficient reason to withhold excitement at Woodley et al’s comparison. It is worth noticing that different methodological issues also make it tricky to compare absolute values for means of RTs that were collected at different times and so with different kinds of equipment. For example, RTs are affected by differences in signal visibility and rise-times to maximum brightness between tungsten lamps, computer monitor displays, neon bulbs and LCDs. The stiffness and “throw” of response buttons will also have varied between the set-ups that investigators used. When comparing absolute values of SRTs, another important factor is whether or not each signal to respond is preceded by a warning signal, whether the periods between warning signals and response signals are constant or variable and just how long they are (intervals between, approximately, 200 and 800 ms allow faster RTs than shorter or longer ones). Knowing these methodological quirks makes us realise that, in marked contrast to intelligence tests, methodologies for measuring RT have been thoroughly explored but never standardised.
So I do not yet believe that Woodley et al’s analyses show that psychologists of my generation were probably (once!) smarter than our young colleagues (now) are. This seems unlikely, but perhaps if I read further publications by these industrious investigators I may become convinced that this is really the case.
References
Flynn, J. R. (1987). Massive IQ gains in 14 nations - what IQ tests really measure. Psychological Bulletin, 101(2), 171-191. doi: 10.1037/0033-2909.101.2.171
Woodley, M. A., te Nijenhuis, J., & Murphy, R. (2013). Were the Victorians cleverer than us? The decline in general intelligence estimated from a meta-analysis of the slowing of simple reaction time. Intelligence. http://dx.doi.org/10.1016/j.intell.2013.04.006
POST SCRIPT, 24th May 2013
Dr Woodley has published a response to my critique on James Thompson's blog. He asks me to answer. I am glad to do so. Sluggishness has been due only to the pleasure of reading the many articles to which Woodley drew my attention. Dorothy’s remorseless archaeology of this trove, summarised in the table below, has provoked much domestic merriment during the past few days. We are grateful to Dr Woodley for this diversion. Here are my thoughts on his comments on my post.
Woodley et al used data from a meta-analysis by Silverman (2010). I am grateful to Prof Silverman for very rapid access to his paper in which he compared average times to make a single response to a light signal from large samples in Francis Galton's anthropometric laboratories and from several later, smaller samples dating from 1941 to 2006. To these Woodley et al added a dataset from Helen Bradford Thompson's 1903 monograph "The mental traits of sex".
As Silverman (2010) trenchantly points out, there is a limit to possible comparisons from these datasets: “In principle, it would be possible to uncover the rate at which RT increased (since the Galton studies) by controlling for potentially confounding variables in a multiple regression analysis. However, this requires that each of these variables be represented by multiple data points, but this requirement cannot be met by the present dataset. Accurately describing change over time also requires that both ends of the temporal dimension be well represented in the dataset and that the dataset be free of outliers (Cohen, Cohen, West, & Aiken, 2003); neither of these requirements can be met …… Thus, it is important to reiterate that the purpose … is not to show that RT has changed according to a specific function over time but rather to show that modern studies have obtained RTs that are far longer than those obtained by Galton.”
Neither Silverman nor Woodley et al seem much concerned that results of comparisons might depend on differences between studies in apparatus and methods, which are shown here, together with temporal resolution where reported.
Contemporary journal referees would press authors for reassurance that the apparatus produced consistent values over trials and had no bias to over or underestimate. Obviously this would have been very difficult for Galton to achieve.
In my earlier post I noted that over the mid- to late 20th century it became obvious that to report reaction times (RT) to three decimal places is misleading if equipment only allows centi-second recording. In the latter case a reading of 200 ms will remain until a further 100 ms have elapsed, effectively "rounding down" the RT. Woodley argues that we cannot assume that rounding down occurred. I do not follow his reasoning on this point. He also offers a statistical analysis to confirm that if the temporal resolution of the measure is the only difference between studies, this would not systematically underestimate RT. Disagreement on whether rounding occurred can only be resolved with empirical data comparing recorded and true RTs between equipments.
A general concern with comparisons of RTs between studies is that they are significantly affected by the methodology and apparatus used to collect them. This is not only a matter of differences in resolution: apparatus and methods can also introduce systematic bias in the timing of trials. For a comprehensive account of how minor differences between different 21st century computers and commercial software can flaw comparisons between studies see Plant and Quinlan (2013), who write: "All that we can state with absolute certainty is that all studies are likely to suffer from varying degrees of presentation, synchronization, and response time errors if timing error was not specifically controlled for." I earlier suggested that apparently trivial procedural details can markedly affect RTs. Among these are whether or not participants are given warning signals, whether the intervals between warning signals and response signals are constant or vary across trials and how long these intervals are, the brightness of signal lamps and the mechanical properties of response keys. A further point also turns out to be relevant to assessment of Woodley et al's argument: average values will also depend on the number of trials recorded, and averaged, for each person, and whether outliers are excluded. Note, for instance, that the equipment used in the studies by Deary and Der, though appropriate for the comparisons that they made and reported, did not record RTs for individual trials but an averaged RT for an entire session. This makes it impossible to exclude outliers, as is normal good practice. The point is that comparisons that are satisfactory within the context of a single well-run experiment may be seriously misleading if made between equally scrupulous experiments using different apparatus and procedures. Johnson et al (1985) and Silverman (2010) stress that Galton’s data were wonderfully internally consistent. This reassures us that equipment and methods were well standardised within his own study. It cannot give any assurance that his data can be sensibly compared with those obtained with other very diverse equipments and methodologies.
Another excellent feature of the Galton dataset is that re-testing of part of his large initial sample allowed estimates of reliability of his measures. With his large sample sizes even low values of test/re-test correlations were better than chance. Nevertheless it is interesting that the test-retest correlation for visual RT, at .17, on which Silverman’s and Woodley’s conclusions depend, was lower than the next lowest (high frequency auditory acuity, .28), or Snellen eye-chart (.58) and visual acuity (.76 to .79) (Johnson et al, 1985, Table 2).
We do not know whether warning signals were used in Galton's RT studies, or, if so, how long the preparatory intervals between warning and response signals might have been. Silverman (2010) had earlier acknowledged that preparatory interval duration might be an issue but felt that he could ignore it because Teichner had reported that Wundt’s discovery of fore-period duration effects could not be independently substantiated, and also because he accepted Seashore et al’s (1941) reassurance that there are no effects on RT of fore-period duration.
Ever since a convincing study by Klemmer (1957) it has been recognised that the durations of preparatory intervals do significantly affect reaction times, that the effects of fore-period variation are large and that results cannot be usefully compared unless these factors are taken into consideration. Indeed during the 1960s fore-period effects were the staple topic of a veritable academic industry (see review by Niemi and Naatanen, 1981, and far too many other papers by Bertelson, Nickerson, Sanders, Rabbitt etc. etc.). In this context Seashore et al’s (1941) failure to find fore-period effects does not increase our confidence in their results as one of the data points on which Woodley et al’s analysis is based.
Silverman’s lack of interest in fore-period duration was also heightened by Johnson et al’s (1985) comment that, as far as they were able to discover, each of Galton’s volunteers was only given one trial. Silverman implies that if each of Galton’s volunteers only recorded a single RT, variations in preparatory intervals are hardly an issue. It is also arguable that this relaxed procedure might have lengthened rather than shortened RTs. Well… Yes and No. First, it would be nice to know just how volunteers were alerted that their single trial was imminent. By a nod or a wink? A friendly pat on the shoulder? A verbal “Ready”? Second, an important point of using warning signals, and of recording many trials rather than just one, is that the first thing that all of us who run simple RT experiments discover is that volunteers are very prone to “jump the gun” and begin to respond before any signal appears, so recording absurdly fast “RTs” that can be as low as 10 to 60 ms. 20th and 21st century investigators are usually (!) careful to identify and exclude such observations. Many also edit out what they regard as implausibly slow responses. We do not know whether or how either kind of editing occurred in the Galton laboratories. Many participants would have jumped the gun and if this was their sole recorded reaction the effects on group means would have been considerable. If Galton’s staff did edit RTs, either acceptance of impulsive responses or dismissal of very slow responses would reduce means and favour the idea of “Speedy Victorians”.
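To see how much a single anticipation can move a group mean when each volunteer contributes only one trial, here is a small worked example. The ten RTs are invented for illustration, and the 100 ms cut-off for excluding anticipations is an assumed value, not one taken from any of the studies discussed.

```python
import statistics


def group_mean(rts_ms, min_valid_ms=None):
    """Mean RT, optionally dropping responses faster than min_valid_ms."""
    kept = [rt for rt in rts_ms if min_valid_ms is None or rt >= min_valid_ms]
    return statistics.mean(kept)


# Hypothetical sample of ten volunteers, one trial each:
# nine genuine responses and one anticipation ("jumping the gun").
single_trials = [230, 245, 250, 255, 260, 240, 265, 235, 270, 30]

unedited = group_mean(single_trials)                   # anticipation left in
edited = group_mean(single_trials, min_valid_ms=100)   # anticipation excluded
print(f"unedited mean: {unedited:.1f} ms, edited mean: {edited:.1f} ms")
# prints "unedited mean: 228.0 ms, edited mean: 250.0 ms"
```

In this made-up sample one anticipation in ten lowers the group mean by 22 ms, so whether or not such responses were edited out matters a good deal for any comparison of absolute means.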
I would like to stress that my concerns are methodological rather than dogmatic. Investigators of reaction times try to test models for information processing by making small changes in single variables in tasks run on the same apparatus and with exactly the same procedures. This makes us wary of conclusions from comparisons between datasets collected with wildly different equipments, procedures and groups of people. My concerns were shared by some of those whose data are used by Silverman and Woodley et al. For example, the operational guide for the Datico Terry 84 device used by Anger et al states that "A single device has been chosen because it is very difficult to compare reaction time data from different test devices".
Because I have spent most of my working life using RTs to compare the mental abilities of people of different ages I am very much in favour of using RT measurements as a research tool for individual differences. (For my personal interpretation of the relationships between people’s calendar ages and gross brain status and their performance on measures of mental speed, of fluid intelligence, of executive function, and of memory see e.g. Rabbitt et al, 2007). I also strongly believe that mining archived data is a very valuable scientific endeavour and becomes more valuable as the volume of available data exponentially increases. For example, Flynn’s dogged analyses of archived intelligence test scores show that data mining has raised provocative and surprising questions. I also believe, with Silverman, that large population studies provide good epidemiological evidence of the effects of changes in incidence of malnutrition or of misuse of pesticides or antibiotics. I am more amused than concerned when, in line with Galton’s strange eugenic obsessions, they are also discussed as potential illustrations of growing degeneracy of our species due to increased survival odds for the biologically unfit. As I noted in my original post, my only concern is that it is a time-wasting mistake to uncritically treat measurements of Reaction Times as being, in some sense, “purer”, more direct and more trustworthy indices of individual differences than other measures such as intelligence tests. Of course RTs can be sensitive and reliable measures of individual differences but, as things stand, equipments and procedures are not standardised and, because RTs are liable to many methodological quirks, we obtain widely different mean values from different population samples even from apparently very similar tasks.
Labels: apparatus, history, intelligence, reaction time, secular, Victorians