The new phonics screening test for children
has been highly controversial. I’ve been
surprised at the amount of hostility engendered by the idea of testing
children’s knowledge of how letters and sounds go together. There’s plenty of
evidence that this is a foundational skill for reading, and poor ability to do
phonics is a good predictor of later reading problems. So while I can see there
are aspects of the implementation of the phonics screen that could be
improved, I don’t buy arguments that it
will ‘confuse’ children, or prevent them reading for meaning.
I discovered today that some early data on
the phonics screen had recently been published by the Department for Education,
and my inner nerd was immediately stimulated to visit the website and
download the tables. What I found was
both surprising and disturbing.
Most of the results are presented in terms
of proportions of children ‘passing’ the screen, i.e. scoring 32 or more. There
are tables showing how this proportion varies with gender, ethnic background,
language background, and provision of free school meals. But I was more
interested in raw scores: after all, a cutoff of 32 is pretty arbitrary. I
wanted to see the range and distribution of scores. I found just one table showing the relevant
data, subdivided by gender, and I have plotted the results here.
Data from Table 4, Additional Tables 2, SFR21/2012, Department for Education (weblink above)
This is so striking, and so abnormal, that
I fear it provides clear-cut evidence that the data have been manipulated, so
that children whose scores would put them just one or two points below the
magic cutoff of 32 have been given the benefit of the doubt, and had their
scores nudged up above cutoff.
This is most unlikely to indicate a problem
inherent in the test itself. It looks like human bias that arises when people
know there is a cutoff and, for whatever reason, are reluctant to have children
score below that cutoff. As one who is basically in favour of phonics testing, I’m sorry to put another cat among the
educational pigeons, but on the basis of this evidence, I do query whether
these data can be trusted.
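For anyone who wants to put a rough number on the apparent nudging, one simple check is to interpolate expected counts across the band around the cutoff from the marks on either side, and compare them with the observed counts. Below is a minimal Python sketch of that idea; the counts are invented placeholders, not the published DfE figures.

```python
import numpy as np

# Illustrative counts per mark around the cutoff (NOT the published DfE figures).
marks = np.arange(25, 38)
obs = np.array([9000, 9500, 10000, 10500, 6000, 5000, 4500,   # marks 25-31
                30000, 25000, 18000, 14000, 13500, 13000])    # marks 32-37

window = (marks >= 29) & (marks <= 34)                  # band straddling the cutoff
expected = np.interp(marks[window], marks[~window], obs[~window])

below = marks[window] < 32
deficit = (expected[below] - obs[window][below]).sum()   # children 'missing' below 32
excess = (obs[window][~below] - expected[~below]).sum()  # surplus at or above 32
print(f"missing below 32: {deficit:.0f}; surplus at/above 32: {excess:.0f}")
```

If the deficit just below the cutoff roughly matches the surplus just above it, that is exactly the signature you would expect from scores being nudged over the line.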
For any of the 32+ portion of that graph to make sense, there would have to be an exponential rise from ~25 up to 40 (assuming the top scores haven't been manipulated). Is that what you'd expect to see?
For all those who won't believe that a simple graph can tell you enough to conclude that the data has been manipulated, I should say that I examined the graph before I read your conclusion and my interpretation was just the same as yours: some children who should score below 32 have been put at 32 or 33 (possibly as early as collection time, not necessarily after the fact).
This is another nice illustration of how much a graph can tell, and how important it is to always plot your data before drawing any conclusion.
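In that spirit, replotting the published table takes only a few lines. Here is a minimal sketch, assuming the DfE table has been saved locally as a CSV with columns 'mark' and 'pupils'; the file name and column names are assumptions, not the DfE's actual format.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical local copy of the DfE table: one row per mark (0-40) with a count of pupils.
scores = pd.read_csv("phonics_screen_2012.csv")

plt.bar(scores["mark"], scores["pupils"], color="steelblue")
plt.axvline(x=31.5, color="red", linestyle="--", label="pass mark (32)")
plt.xlabel("Phonics screening check mark (out of 40)")
plt.ylabel("Number of children")
plt.legend()
plt.show()
```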
Franck, it will have been at collection time, as the DfE published the minimum standard threshold in advance of testing.
@Dorothy: Well spotted. People with a stake in the outcome (teachers and school heads) can't be trusted to administer these tests. Sad, but I guess they need to be computerised, and the speech samples scored independently.
While very sad, this is consistent with other analyses of what happens when teachers are judged on their pupils' test results: they cheat (one example linked below, but there are many).
@Steve Jones: Yes, most kids master the skills of reading, including both sounding out and building a basic sight vocabulary, so the distribution on tests of simple words and nonwords is not normal, but negatively skewed (long left tail). But you can literally see the chunk of 29-31 scores sculpted out and bolted on as 32-34 here...
@Franck Ramus: Seems most likely the scores were enhanced at collection time by teachers "giving the benefit of the doubt". If it happened post-collection, that would IMHO be worse. But both are of course inexcusable.
http://www.freakonomics.com/2011/07/06/massive-teacher-cheating-scandal-uncovered-in-atlanta/
I'm no expert on teaching phonics, and have watched the debate as an interested layperson, but I would say that this convincing piece of forensic statistical analysis backs up some of the criticisms of the phonics screen. In a nutshell: it's not seen as a screen, but as a pass/fail test, and a high-stakes one at that. Hence, teachers see it not as a useful diagnostic tool, but as a threat. That may or may not be accurate, but it surely reflects teachers' perceptions of the educational climate and culture that they are working in.
@tim: Yes, most kids master the skills of reading, including both sounding out and building a basic sight vocabulary, so the distribution on tests of simple words and nonwords is not normal, but negatively skewed (long left tail).
ReplyDeletePerhaps, but what equally interesting here is that the median on the distribution falls almost exactly on the 'pass mark' of 32 - statistically it's 32.27 and the deviation in individual medians for boys and girls is of close to 0.5 for both (boys downwards, girls upwards).
That rather suggests that we have a normalised test, in addition to the data manipulation, and it also makes it likely that the test does, in fact, incorporate a sharp increase in difficulty at the pass/fail boundary.
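For anyone wanting to reproduce a figure like that 32.27, here is a minimal sketch of a grouped (interpolated) median computed from counts per mark; the counts below are random placeholders rather than the DfE data, so the number it prints is illustrative only.

```python
import numpy as np

marks = np.arange(0, 41)
counts = np.random.default_rng(1).integers(1000, 20000, size=41)  # placeholder counts

cum = np.cumsum(counts)
half = cum[-1] / 2
i = np.searchsorted(cum, half)                            # mark whose bin contains the median
below = cum[i - 1] if i > 0 else 0
median = (marks[i] - 0.5) + (half - below) / counts[i]    # treat mark i as the bin [i-0.5, i+0.5)
print(f"grouped (interpolated) median: {median:.2f}")
```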
@Deevybee, @Steve Jones, @Frank Ramus and @Unity.
This seems unfair on teachers. I am aware that when my children were going through primary school they were assessed in year one for literacy problems, and those who struggled were given quite extensive help. This help would be "cut off" when the child reached an adequate standard. These interventions took place in year 1. Most state infant and junior schools have nurseries and Early Years forms, and the school will have had three or four years of experience of their children prior to the screen. They will know who needs, and will benefit from, intervention.
You, @Steve Jones, @Frank Ramus and @Unity are clear that you think the data has been manipulated. To do this you must discount the possibility that professional teachers have done their job, identified struggling children and provided targeted help during the Early Years and KS1 stages. It would be helpful if you explained why you all have discounted this possibility.
Robin Cousins
If teachers spot children with special needs and provide appropriate help before the phonics screen is administered (and I hope many of them do), then the effect is to shift the left part of the distribution rightwards. This would not be expected to produce the blips around the 32 cut-off.
Unless the data reported reflect a second administration of the screen after a first administration triggering appropriate responses?
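One way to see why genuine early intervention would not produce the blips, whereas nudging near-miss scores would, is a quick simulation. Everything below is invented for illustration: the ability distribution, the size of the intervention effect, and the nudging rule are all assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
ability = rng.beta(5, 2, size=100_000)        # negatively skewed 'decoding ability'
scores = rng.binomial(40, ability)            # raw marks out of 40

# (a) Intervention: lift the underlying ability of the weakest children, then rescore.
boosted = np.where(ability < 0.5, np.minimum(ability + 0.15, 1.0), ability)
intervened = rng.binomial(40, boosted)

# (b) Manipulation: children scoring 30 or 31 recorded as 32.
nudged = np.where(np.isin(scores, [30, 31]), 32, scores)

bins = np.arange(42) - 0.5
plt.hist(intervened, bins=bins, histtype="step", label="early intervention")
plt.hist(nudged, bins=bins, histtype="step", label="scores nudged at cutoff")
plt.axvline(31.5, color="red", linestyle="--")
plt.xlabel("mark"); plt.ylabel("children"); plt.legend(); plt.show()
```

In this toy example the first scenario shifts the whole left tail smoothly to the right; only the second produces a hole just below 32 and a spike at it.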
In my experience the system is not quite so neat and tidy. Within the school there tends to be an ad hoc collection of helpers, including special needs teachers, SENCOs, teaching assistants, and parents, who take an interest in literacy. There appears to be a general sense of what constitutes under-performance in reading, which is not the same as special needs. Again, from what I've seen, the kids are aware that they have been selected for help and don't like the stigma, even at the age of six or seven. The upshot is that a reasonable number of children can be helped over the line into a "normal" category. There is then a social explanation for why there is a lump just over the pass line and why the progress does not continue. Schools seem to be honest and very adaptive organisations, with an acute sense of where assessment boundaries lie. My belief would be that children, rather than numbers, have been "manipulated" over the 32 cut-off. I agree that the sharpness of the rise at 32 is curious, but the shape of the graph is very much as I would expect. I certainly would not expect continuous or parametric data from a screen on something about which the schools are so sensitive.
Robin Cousins
That's what I call a "Student's wtf-distribution".
In fairness, it could come about through some really strangely constructed test items - you can imagine that if some of the items were actually measuring the same thing, you might have few children who got just 1 of them right rather than 0 or 2... in theory. But some kind of manipulation looks much more likely.
Actually that looks a little bit like the distribution of scores in an oral reading test in children who are learning to read a regularly spelled language. We've published data on this in our 2000 paper. There's a bump at 0 and then a bump at the top end. Children either know little about how to decode words or they've got the trick and can do all or almost all the words.
So it looks like children learning to read an irregular orthography still do this - they either can't decode regularly spelled words that well, or they can decode almost all of them (and most of them can decode almost all of them). I'm not too surprised the middle isn't completely flat, as it is in the regular orthography, since the children are used to being "fooled" by known words.
The blips are probably due to bad data collection etc. I'd agree.
Well spotted!
Thanks for a great analysis. I think the issue is with the subjective nature of these types of tests: I have run decoding tasks with lots of kids and I have been struck by how difficult it can be to a) make out what some kids say and b) decide whether the pronunciation is entirely correct. I don't necessarily think that teachers are being disingenuous, but they would rather give children the benefit of the doubt. And of course they are also under pressure from the government in terms of reaching certain pre-set standards. This makes for poor data collection, but what do we expect? Teachers are teachers, not researchers.
Might it also be something to do with the number of words and non-words in the test? Lots of children -- many of them able readers -- struggled more with the non-words than they did with actual words. Might the unusual distribution of the graph reflect that in some way?
Anonymous: Several people have made a similar point, but I think they're just wrong. Given that there is variation from child to child, I can't see any way that you would get a peak at the cutoff point with a drop on either side. It would be good to see raw data on individual items, but I would bet a very large sum of money on this not being an item effect.
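To illustrate the point, here is a toy Rasch-style simulation with a block of much harder items and normal child-to-child variation in ability. All the numbers are invented; the only claim is qualitative: a step in item difficulty produces a broad bump in the total scores, not a one-mark hole just below the cutoff with a spike at it.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
n_children, n_items = 100_000, 40
ability = rng.normal(1.0, 1.0, size=(n_children, 1))
difficulty = np.concatenate([np.full(32, -1.0), np.full(8, 2.0)])   # 8 much harder items

p_correct = 1 / (1 + np.exp(-(ability - difficulty)))   # simple Rasch-style response model
scores = (rng.random((n_children, n_items)) < p_correct).sum(axis=1)

plt.hist(scores, bins=np.arange(42) - 0.5)
plt.xlabel("mark"); plt.ylabel("children"); plt.show()
```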
See also a similar effect on GCSE scoring: http://blogs.ft.com/ftdata/2012/11/02/english-gcse-and-ofqual/#axzz2B3oGpUsZ
Was this a data set from a first small run of a draft version of the test? Such a bizarre distribution would surely have led to an examiner's facepalm and a swift redesign?
ReplyDeleteI believe there are a lot more pleasant opportunities up front for people who looked over your site. Visit is raeli jewellery for best Jewellery.
ReplyDelete