
The Embedding Formative Assessment (EFA) trial


Quality randomised education trials are possible, but they deserve better reporting


Embedding formative assessment in schools “probably” raises GCSE scores, but only by a little bit

Good teachers continually check their pupils’ grasp of a topic so that their teaching can be tailored to their progress. Not got it, try again. Got it, move on. This sort of formative assessment also helps pupils direct their own study appropriately. It may consist of anything from asking an individual student a question or marking a piece of homework, to class-wide mock exams or quizzes.

Most teachers know this is a good thing and try to do it regularly, but education expert Dylan Wiliam (click here) is a super enthusiast; he’s written books and study guides to help teachers do it better (click here) and travels the world extolling its importance.

Last month the Education Endowment Foundation (EEF) (click here), an outfit which among other things tries to put education on a firm evidence base, published the results of a randomised trial of Wiliam’s programme. They claim it works:

“Students whose teachers were trained in this approach made two months more progress than a similar group of pupils whose teachers did not receive the intervention. The findings have a very high level of security as it was a large and well-run trial, which means we can be confident in the results.”

But the EEF has not always covered itself in glory with its trial reports. A couple of years ago it drew howls of derision, from me and others, for falsely claiming that a negative trial of teaching philosophy to primary pupils improved their mathematical skills (click here). So let’s take a close look.

Appropriately for a teacher/classroom-based intervention, it was a cluster trial with schools as the unit of randomisation. The main report (click here) is turgid and repetitive, but I’ve read it, so you don’t have to. Here’s what happened.

Population – 140 UK secondary schools during the 2015/16 and 2016/17 academic years.

Intervention – Each school implemented Dylan Wiliam’s Embedding Formative Assessment (EFA) programme. They got his EFA pack, a day’s training from the man himself, and ongoing support from the Schools, Students and Teachers (SSAT) network.

Control  – Each school got a one-off payment of £300, but otherwise carried on with ‘business as usual’, with no restrictions on how they took forward formative assessment.

Outcome – The pupils’ “Attainment 8 GCSE scores”, calculated from their top eight GCSEs, each graded from 1 to 9, with maths and English counting double. Max score = 90. Pupils who took fewer than eight subjects scored zero for each unfilled slot (a toy version of this scoring is sketched after this list).

Planned sample size – A design of 120 schools, 60 per group, with 100 pupils per cluster (school) and an assumed intra-cluster correlation of 0.2, was judged to have 80% power at the 0.05 significance level to detect an improvement of 0.2 standard deviations in the mean “Attainment 8 GCSE score”. This 0.2 SD effect size was judged by the EEF advisory panels “to be an acceptable level of improvement from a policy perspective to roll out the intervention more widely”. In a medical trial this would be the minimum clinically important difference (MCID). I guess we could call it the minimum educationally worthwhile difference (MEWD).
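As promised above, here is a toy version of the outcome score. It is only a minimal sketch of the simplified “best eight, English and maths double” rule described in the Outcome item; the official Attainment 8 measure has further rules about which subjects may fill which slots, and the subject names and grades below are hypothetical.

```python
def attainment8(grades):
    """Simplified Attainment 8: English and maths (graded 1-9) count double,
    plus the best six other grades; unfilled slots score zero. The official
    measure has extra rules about which subjects may fill which slots."""
    english = grades.get("English", 0)
    maths = grades.get("Maths", 0)
    others = sorted(
        (g for subject, g in grades.items() if subject not in ("English", "Maths")),
        reverse=True,
    )[:6]
    others += [0] * (6 - len(others))             # pupils with fewer subjects score zero here
    return 2 * english + 2 * maths + sum(others)  # maximum possible score is 90

# A hypothetical pupil who sat only seven subjects
print(attainment8({"English": 6, "Maths": 7, "History": 5, "French": 4,
                   "Biology": 6, "Art": 8, "Music": 5}))   # -> 54
```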
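And for readers who like to check these things, here is a back-of-envelope version of the planned power calculation, using the textbook design-effect correction for clustering. The trialists’ own calculation will have made further assumptions (for example, crediting baseline attainment covariates) which this naive sketch ignores, so it comes out somewhat below the quoted 80%; the point is only to show how the intra-cluster correlation eats into the nominal sample size.

```python
from scipy.stats import norm

def cluster_rct_power(effect_sd, schools_per_arm, pupils_per_school, icc, alpha=0.05):
    """Approximate power of a two-arm cluster-randomised trial to detect an
    effect expressed in standard deviations, using the usual design-effect
    (variance inflation) correction for clustering."""
    design_effect = 1 + (pupils_per_school - 1) * icc
    effective_n_per_arm = schools_per_arm * pupils_per_school / design_effect
    se = (2 / effective_n_per_arm) ** 0.5          # SE of the difference in means, in SD units
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(effect_sd / se - z_crit)

# The planned design: 60 schools per arm, ~100 pupils per school, ICC 0.2, MEWD 0.2 SD
print(round(cluster_rct_power(0.2, schools_per_arm=60, pupils_per_school=100, icc=0.2), 2))
```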

The trial recruited 140 schools, an additional 20 to allow for attrition. Twelve intervention schools eventually gave up on the programme for one reason or another, but excluding them from the intervention arm would have biased the results, so the final analysis covered all 140 randomised schools, analysed by “intention to treat”.

The trial wasn’t registered but there’s a protocol from 2016 (click here), and an undated statistical analysis plan (click here). I couldn’t find any major outcome switching, or other risks to the trial’s integrity. A lot of effort was made to achieve and measure fidelity to the programme, but it turned out that many intervention schools adapted it in unanticipated ways. A planned subgroup analysis of high-fidelity schools was eventually judged impracticable, so the trial was a pragmatic test of the effect of implementing EFA in the real world. Schools were randomised within blocks, based on similar GCSE scores and proportions of pupils eligible for free school meals, so the trial groups ended up balanced on these factors.
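For readers unfamiliar with blocked randomisation, here is a minimal sketch of one common way of doing it: sort the schools on the stratification variables, cut the list into small blocks of similar schools, and randomise within each block. This is not taken from the trial’s own procedure, and the school names and figures are made up.

```python
import random

def blocked_randomise(schools, block_size=2, seed=2016):
    """Illustrative blocked randomisation: sort schools on the stratification
    variables, cut the sorted list into small blocks of similar schools, then
    randomly allocate half of each block to each arm."""
    rng = random.Random(seed)
    ordered = sorted(schools, key=lambda s: (s["gcse_mean"], s["fsm_rate"]))
    allocation = {}
    for i in range(0, len(ordered), block_size):
        block = ordered[i:i + block_size]
        arms = (["EFA", "control"] * block_size)[:len(block)]
        rng.shuffle(arms)
        for school, arm in zip(block, arms):
            allocation[school["name"]] = arm
    return allocation

# Hypothetical schools with prior mean GCSE score and free-school-meals rate
schools = [
    {"name": "School A", "gcse_mean": 48.2, "fsm_rate": 0.31},
    {"name": "School B", "gcse_mean": 47.9, "fsm_rate": 0.29},
    {"name": "School C", "gcse_mean": 55.0, "fsm_rate": 0.12},
    {"name": "School D", "gcse_mean": 54.6, "fsm_rate": 0.14},
]
print(blocked_randomise(schools))
```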

There is a CONSORT flow diagram, and table 4 shows the trial groups were indeed well balanced at baseline.

Results

The trial was negative. The mean intervention group score was only 0.1 standard deviations higher, a difference that could have occurred by chance (P = 0.088) at both the conventional and the pre-specified level of statistical significance (p < 0.05). However, the 95% confidence interval around the effect size ran from -0.01 to 0.21, so the trial had only just failed to rule out the pre-specified minimum worthwhile effect of a gain of 0.2 standard deviations (table 5). For those unfamiliar with the way education trials express their results, the second column from the right, labelled Hedges g, is the difference in mean scores between the trial groups, measured as a proportion of a standard deviation, together with its 95% confidence interval. The right-hand column is the P value, the probability that a difference as large as or larger than the observed one would have occurred by chance if the treatment had no effect. In summary, a negative trial but, as it turned out, also a slightly under-powered one.
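For readers who want to see how those three numbers hang together, here is a minimal sketch that backs out the standard error and an approximate P value from the reported effect size and confidence interval. Rounding in the published figures and the model’s actual degrees of freedom mean it won’t reproduce the report’s P = 0.088 exactly; the point is that the effect size, its confidence interval, and the P value are three views of the same estimate.

```python
from scipy.stats import norm

# Headline numbers as reported: Hedges g = 0.10, 95% CI -0.01 to 0.21
g, ci_low, ci_high = 0.10, -0.01, 0.21

se = (ci_high - ci_low) / (2 * 1.96)   # the CI half-width implies a standard error
z = g / se                             # effect divided by its SE gives a z statistic
p = 2 * norm.sf(z)                     # two-sided p value if the true effect were zero
print(f"SE ~ {se:.3f}, z ~ {z:.2f}, two-sided p ~ {p:.3f}")
```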


Sub-analyses among children eligible for free school meals and among those scoring in the lower tercile (table 6) showed the same non-significant 0.1 SD higher scores as the overall sample, with negligible effect among upper-tercile pupils or on English and maths scores separately (table 7). TEEP is the Teacher Effectiveness Enhancement Programme, another intervention applied in some schools; the analysis of non-TEEP schools was exploratory.

So how come EEF claims it works?

It appears that the report authors (who do not include Dylan Wiliam) quietly decided, after the results were in, that a level of 10% significance was OK, and that an improvement of only 0.1 standard deviation would be worthwhile after all. Hey presto! The result is positive.

Cheeky eh?  Imagine big pharma getting an effect size of half what they had pre-specified as the minimum clinically worthwhile difference, and a P value of 0.088. Imagine if they then announced not only that the smaller effect size was worthwhile after all, but that P <0.1 was what they had been aiming for all along. Doctors would be sceptical.

Or perhaps not. Imagine if the drug manufacturer was working in a difficult field where there was hardly any evidence that anything worked, where all previous trials had been tiny, fatally flawed, or worse, and that the P=0.088 trial had been otherwise well conducted.  Doctors, and even regulators, might well decide that, at least for the moment, we should use that drug.

The same applies to Dylan Wiliam’s Embedding Formative Assessment. This is one of very few large trials in education, and one of even fewer with a proper protocol, a predefined analysis plan and, most important of all, a primary endpoint, GCSE exam results, that matters to parents and pupils. Sure, the benefit was smaller than the authors had hoped for, and the results didn’t quite reach conventional levels of statistical significance, but there’s still a less than 10% probability that a difference as large as the one observed would have occurred by chance. The effect is also plausible.

If I were a head teacher looking to raise my school’s GCSE scores, I’d seriously consider buying Dylan Wiliam’s Embedding Formative Assessment programme. But I wish the EEF would report their trials more honestly.

Jim Thornton

