Several years ago, when Esther Duflo, Abhijit Banerjee, and Michael Kremer won the Nobel Prize for economics, I wrote a blog arguing that we shouldn’t spend too much time criticising Randomised Control Trials (RCTs). However, I feel the need to break my own advice, because in the last year or so the randomista project appears to have gotten completely out of control.
I think there are entirely legitimate uses of RCTs, and some randomistas have made important contributions to knowledge and have very thoughtful things to say. RCTs certainly have a role to play in the drive towards evidence-informed policy, but no method is a panacea and no method is a gold standard. I had hoped that the evidence-based policy crowd would be reasonable, rational, and evidence-led (after all, these are their mantras), but the discursive evolution in the last year or so suggests that I was wrong.
Randomista mission creep
I recently sat on a panel in a webinar by Kantar Public on Ensuring Rigour in Theory-Based Evaluation. Kantar’s Head of Evaluation, Alex Hurrell, diplomatically stressed that RCTs and theory-based evaluation aren’t irreconcilable. I agree with him. But as the webinar wasn’t about RCTs, I was surprised by how many audience questions were about RCTs. My co-panellist Catherine Hutchinson, UK Cabinet Office’s Head of the Evaluation Task Force, affirmed that Cabinet Office’s preference was for RCTs, but reasonably noted that they weren’t always possible, so other options also needed to be considered. Yet, Hutchinson didn’t explain why RCTs were Cabinet Office’s preference. It was said with a shrug of the shoulders, as if it were obvious, even inarguable.
Perhaps we should remember that the Behavioural Insights Team (A.K.A. “Nudge Unit”), staunch advocates for RCTs, was established in 2010 within the Cabinet Office. But, the unquestioned preference should be surprising given that the UK government’s own guidance from two departments where Hutchinson used to work, the Department for International Development in 2012 and HM Treasury in 2020, both stressed that there is no legitimate hierarchy in evaluation designs.
There are plenty of regularly cited articles on the limitations of RCTs, which ought to give us all pause. These provide powerful critiques about external validity, ethics, and human agency, among many others. Yet, none of these are taken seriously by randomistas. Given how much they like to talk about bias and systematic reviews, a recent systematic review by Waddington et al. ought to trouble them. It found that there was approximately no difference in bias between RCTs and study designs with selection on unobservables. Whether RCTs are real gold, gold plate, or fool’s gold I leave to your consideration.
Let’s go back a decade to briefly consider why the Cabinet Office might take such a position. In 2012, the Nudge Unit made a pretty weak attempt to debunk some “myths” about RCTs. Without equivocation, they asserted that ‘randomised controlled trials (RCTs) are the best way of determining whether a policy is working’ and that in some fields it would be ‘bizarre, or even reckless’ not to use them. Their argument amounted to a straw-man critique against the notion that ‘RCTs are always difficult, costly, unethical, or unnecessary.’ Predictably, this was a statement without citation. It’s not my intention to critique their specific arguments, but a few brief considerations seem in order.
According to the Randomized Controlled Trial Lab, ‘RCTs are typically slow, resource intensive, and inflexible.’ They can be quicker and more nimble, taking months rather than years under highly stringent conditions. But, they typically aren’t quick. And as one example from my previous employer below shows, they are often very challenging to implement. Given just how many poorly implemented RCTs there are out there, and their increasing sophistication, easy is not a word one would readily associate with them.
Many RCTs cost millions of dollars, but if you already have existing data and embed them into existing programmes, estimates for “cheap” RCTs tend to range from $50,000 to $300,000. Even recent efforts to save money are talking about close to half a million dollars. I work in international development, and this is more than the total cost of a huge proportion of international development projects, making RCTs ludicrously unaffordable. However, without irony, the Nudge Unit questioned ‘what are the costs of not doing an RCT?’
Each year, there is a new RCT ethics horror story. In late 2020, it was the RCT in Kenya which deprived poor people of water. So disgraceful was their error that the authors hurriedly penned an ethics statement after publication. In late 2022, it was an RCT on the mistreatment of Filipino domestic workers whose authors didn’t realise that the treatment itself was mistreating the treatment group. Yet, most concerning of all is that ethics review boards at highly reputable universities and top-class randomistas who write blogs on the ethics of RCTs don’t seem to understand when an experiment is unethical. This does not mean that most RCTs are unethical, but it suggests that there are systemic problems that even the best can’t (or won’t) recognise.
And as for whether they are unnecessary, as thoughtful randomistas Mary Kay Gugerty and Dean Karlan remind us, ‘many questions and circumstances do not call for RCTs.’ Or, as the late Martin Ravallion noted, ‘RCTs are not feasible for many of the things governments and others do in the name of development.’
Are there enough RCTs?
A study commissioned by the United States Agency for International Development (USAID) in 2012 found that only 3% of its evaluations were experimental impact evaluations. So, with due consideration for the nature of study questions and relevant interventions, and due concern with ethics, there may have been some case to conduct more experimental impact evaluations.
In 2016, World Bank lead economist David McKenzie asked Have RCTs taken over development economics? He pointed out that in 2000, the top-5 journals published 21 articles in development. None of these were RCTs. By 2015, 10 out of 32 were RCTs. So, there was a huge amount of growth. Yet, RCTs were still a small proportion of all development research. According to McKenzie, in 2015, 44 out of 454 development papers published in 14 journals were RCTs. So, ~ 10% of all articles.
By 2018, we began to see more thoughtful work from randomistas like Howard White on theory-based systematic reviews, Dean Karlan and Mary Kay Gugerty on the Goldilocks Challenge of right-fit evidence, and Mary Anne Bates and Rachel Glennerster on the Generalizability Puzzle.
Gugerty and Karlan recognised that the vast majority of Innovations for Poverty Action’s (IPA) 650 RCT studies ‘did not paint a clear picture that led to immediate policy changes,’ and more broadly argued that despite some studies showing value for money, ‘a great deal of money and time has been wasted on poorly designed, poorly implemented, and poorly conceived impact evaluations.’ Gugerty and Karlan pointed out that:
‘The push for more and more impact measurement can not only lead to poor studies and wasted money, but also distract and take resources from collecting data that can actually help improve the performance of an effort.’
I saw hope in a more reflexive and thoughtful conversation about when it was reasonable and appropriate (or not) to conduct experiments. Thus, in 2019 when Duflo, Banerjee, and Kremer won the Nobel Prize, I argued that ‘some randomistas are a lot more thoughtful than their caricature would suggest, fears of a hostile takeover might be somewhat misplaced,’ and that RCTs weren’t quite as ubiquitous as some had suggested. I think this was still somewhat accurate.
By 2020, David Evans noted that RCTs still hadn’t taken over development economics. At this point, as much as a quarter of studies at a large development conference were RCTs. This is a very high proportion for a single method, but one could just about sustain the argument that they hadn’t “taken over development economics.”
As I discussed in a blog last year, between 2013 and 2020, we also saw dramatic growth of RCTs in low- and middle-income countries. The American Economic Association RCT Registry, with over 4,500 RCTs across 159 countries, showed that a huge part of the growth was in Africa and South Asia, as the graph below illustrates.
There is an even sharper curve in Martin Ravallion’s paper on whether the randomistas should continue to rule, based on the International Initiative for Impact Evaluation’s (3ie) data. Ravallion’s response to the above registry data was spot on: ‘Researchers mainly based in the rich world lead randomized controlled trials whose subjects are mainly people in the poor world. Hard to feel comfortable with that.’ But, I guess, that’s how you get tenure.
David Evans helpfully detailed the submissions to last year’s Centre for the Study of African Economies (CSAE) conference on Economic Development in Africa. Of 47 studies, 24 were RCTs. So, more than 50%. This seems to suggest that by 2022 maybe RCTs had taken over development economics, in Oxford, at least.
Of course, not all sectors are the same. Peter Evans, previously Team Leader of the Governance, Conflict, Inclusion and Humanitarian Research Team at the Department for International Development (DFID), told me that around 15% of the budget for studies they commissioned went to RCTs (~2/18 programmes). While I think experiments are inappropriate for a lot of that portfolio, that seems a reasonable and appropriate proportion (see Evans’ defence thread). To me, ~ 10% of interventions and/or funding sounds like a sensible ceiling for such a method, not a floor. Pause to consider what an appropriate proportion ought to be in your sector.
When is enough enough?
Nonetheless, the crusade continued. Just last month, Ananda Young of the USAID-funded Humanitarian Assistance Evidence Cycle (HAEC) project noted that ‘while impact evaluations are an established tool in development, this tool is not yet used regularly in the humanitarian sector.’ But, should it be? Is this the last frontier?
In her view, experimental impact ‘evaluations present a unique opportunity to increase the quality and amount of data and to stretch every dollar to its maximum potential.’ HAEC found that lack of incentives, implementer bandwidth constraints, ethical concerns, short programming timelines, misaligned research partnerships, and lack of funding were some of the core constraints. One might expect a pause for breath at this point. Perhaps these are good reasons not to conduct experimental impact evaluations. Yet, these were all swept under the carpet as if they were trivial concerns.
In a Centre for Excellence for Development Impact and Learning (CEDIL) conference this week, it was argued that perhaps 50% of development interventions can (or should) be ‘subjected to RCTs’ or other experimental impact evaluations.
What evidence tells us that half of all interventions can or should be subjected to RCTs? It turns out that the person who said this, the Director of Development Impact Evaluation (DIME) at the World Bank no less, clarified that they actually said all interventions. It’s an astonishing aspiration which shows that the RCT-industrial complex has truly gotten out of control.
Participants in a CEDIL poll at the conference seemed to agree. When asked whether RCTs had been good for the evaluation field, 83% responded either that ‘no, RCTs had taken away resources from proper evaluation (10%)’ or ‘it was OK, but the balance needs to be restored (73%).’
When I’ve raised concerns about a “bias” towards RCTs within organisations like 3ie, I’ve been asked to reassure them that we both “believe in evidence.” Yet, this kind of straw-man response entirely misses the point. Of course we do. As Yuen Yuen Ang put it, most RCT critiques question the lack of methodological diversity and consider method appropriateness, rather than saying that we should never conduct RCTs.
Or as Carlos Oya put it in his response to Ang, ‘the very fact that they think it’s “against RCTs” is testament to the method-centred style of thinking, and the implicit or explicit assumption that RCT is indeed a “gold standard.”’ As Oya points out, what matters is ‘how we use them, for what purpose and questions.’
(Re)considering appropriateness
In my view, Michael Scriven was wrong to say that RCTs have ‘essentially zero practical application in the field of human affairs.’ This is obviously hyperbole. But, it’s really high time randomistas took critiques seriously. An RCT isn’t a sacred cow. If both Gugerty and Karlan and Ravallion can see that many questions and circumstances don’t even call for RCTs, and that they aren’t feasible for many things governments and other actors actually want to do, why would anyone suggest that half of international development interventions should be evaluated with RCTs?
Last year, I wrote two blogs critiquing a paper, Evaluating complex interventions in international development, where I argued that experimental approaches in general, and not just RCTs, were a poor fit for evaluating complexity. Overall, just as Gugerty and Karlan recognise, there should be more thinking on when experimental approaches are a good fit with the intervention design and/or context. Yet, as many randomistas care little about context and often prefer to change intervention designs to suit their research designs, such cautions likely fall on deaf ears.
There are more and less thoughtful ways to temper the mania. A more thoughtful way is to consider the design triangle developed for DFID in 2012 by Elliot Stern and colleagues. This offers a common thread between thoughtful randomistas, heterodox economists, and evaluation royalty. It focuses on evaluation questions and the desired attributes of a programme (or intervention) before considering what method is best to evaluate it. For many randomistas, the cart unfortunately comes before the horse.
Jean Drèze’s outstanding article last year on the perils of embedded experiments in India is also instructive here. He reminds us that:
‘Sound policy requires not only evidence — broadly understood — but also a good understanding of the issues, considered value judgements, and inclusive deliberation.’
Drèze argues that insufficient time has been given to considering the integrity of the process that leads from evidence to policy.
Consider two real-world examples from my previous employer, CARE International. Ex-colleagues of mine wrote a few blogs on the difficulties of aligning the supposed “gold standard” method with their mission in the Tipping Point Initiative, focused on child, early and forced marriage. The project director, Anne Sprinkel, argued that doing an RCT presented ‘real challenges to our feminist principles of gender transformative programming and CARE’s mission and values.’ In my view, if your methodology fails the mission test, you shouldn’t choose it in the first place. Sprinkel noted that the:
“RCTs come with a slew of methodological demands to maintain the integrity of the research...CARE Nepal put in tremendous effort and a large amount of time to identify all potential sites for implementation, it is nearly impossible to gather such in-depth information quickly or at all — not to mention while balancing feasibility, logistics and need as CARE does in non-study settings.”
So much for the Nudge Unit’s argument that RCTs are easy. And one wonders why the integrity of the research should supersede the integrity of the initiative itself. Sprinkel continued:
“Efforts to ensure the random selection of participants strongly influenced group formation — while we weren’t able to prioritize communities most in need or participants clustered around an accessible community center… For most communities, this process meant much longer distances to travel in order to participate.”
Thus, the RCT didn’t just take up lots of time and resources, it influenced who could and who couldn’t benefit. Even CARE’s approach to social norm change was adapted to fit the method, rather than the other way around.
CARE’s Ghana Strengthening Accountability Mechanisms (GSAM) project was also subjected to an expensive impact evaluation that had all sorts of problems from the start. The evaluation didn’t really assess the full project intervention, but only parts of it that suited the evaluator’s model. There were substantial spill-over effects between intervention and non-intervention districts, which were unavoidable given USAID’s grant stipulation to include radio dissemination as part of the project. There was attrition of over half the sample for one arm of the study. The evaluation made invalid comparisons between treatment arms (short-term vs. medium-term effects), and it assessed impact two years before the project had actually finished. Sounds like a poor study, right?
Yet, this was held up as a study in USAID’s anti-corruption evidence week last year, and it also somehow made it into 3ie’s Development Evidence Portal. 3ie was commissioned by USAID to produce an evidence gap map on good governance (you can find GSAM here). Can you see the causal link?
I was told by the presenter that USAID’s Democracy, Rights, and Governance (DRG) Center, which commissioned the gap maps, hadn’t done many impact evaluations, and GSAM was one of the few they did that was very clearly in the anti-corruption space. It wasn’t really an anti-corruption project. I was the bid writer. But, once again, we see an example of the cart going before the horse. So fixated are we on the method and its supposed intrinsic and inviolable “rigour” that we pay little attention to the potential damage left in its wake, and exclude all other studies from the conversation.
This is the rather sordid politics of evidence-based policy. We compromise our mission and our values, and abdicate our ethical responsibilities. We redesign our interventions, their approach, their coverage, their population. It all sounds like a Faustian bargain. And all of this… for a method?
As most respondents in the CEDIL survey on impact evaluation said, ‘the balance [and reason and proportionality badly] needs to be restored.’