The fall of Icarus
We often have conversations about research and evaluation methods that start from the notion that some methods are, a priori, better than others. Randomised Controlled Trials (RCTs) pooled in systematic reviews typically find their way to the top of the pyramid.
In the past, I’ve referred to RCTs as gold-plated standards, with passing reference to a handful of their limitations, and argued that the very notion of pyramids is fundamentally misguided and unhelpful. As Bédécarrats et al. neatly put it, “all that glitters is not gold.” But, in fact, RCTs have a lot of problems, probably more than you think. I began writing a blog on the positive developments among randomistas in recent years in addressing the method’s weaknesses, but in doing so I evidently first had to outline those weaknesses. So, I came up with a list, and figured I’d share my top ten:
- RCTs have serious theoretical shortcomings stemming from a philosophy based on methodological individualism and the compartmentalisation of problems, which, in turn, limits their explanatory potential and validity (Harvold Kvangraven, 2019; Guérin, 2019; Bédécarrats et al., 2020; and see graphic below).
- RCTs face serious challenges with construct validity, often due to limited or poor theorisation of how interventions are expected to cause effects (Pritchett, 2018; Fischer-MacKey and Fox, 2022; Aston, 2023; Herrera, 2024; see also Ogden, 2020).
- RCTs’ desire for control means they largely (or entirely) ignore context and how contextual features interact with an intervention at multiple levels (Pawson and Tilley, 1997; Tilly, 2000; Pritchett, 2018; Kabeer, 2019; Aston, 2022; Aston, 2023).
- RCTs pay little attention to human agency. Relatedly, they tend not to take into account participants’ background traits or community histories, given their assumption that treatment groups are genuinely equal and their mechanical, homogenising view of human behaviour (Pawson and Tilley, 1997; Krauss, 2017; Kabeer, 2019).
- RCTs inappropriately distort and narrow the scope of interventions. They typically narrow the boundaries and reduce the scale of action (i.e., not multiple levels or society-wide) in order to identify a valid control group and to randomise treatment. This reduces the potential effectiveness of those interventions. There is also the related “trivial significance” critique expressed in several testimonies below (Bédécarrats et al., 2017; Fox, 2019; Ravallion, 2020; Aston, 2023).
- RCTs are poor at explaining how and why change happens because they don’t unpack black boxes (i.e., causal mechanisms). RCTs also often neglect alternative factors contributing to their main reported outcome (Krauss, 2017; Harvold Kvangraven, 2019; Reddy, 2019; Kabeer, 2019; see also Ogden, 2020).
- RCTs are incapable of dealing with issues of complexity. Despite protestations that the way to deal with complexity is to break things down (as in the graphic below), plumbers can’t fix complex problems, and RCTs don’t work well for aggregated phenomena. As Lant Pritchett put it, you “can’t randomise across an economy.” This is also referred to as the “policy sausage” critique (Mayne, 2011; Quinn Patton, 2011; Krauss, 2017; Bicket et al., 2020; Ogden, 2020; Barbrook-Johnson et al., 2021; Aston, 2022).
- RCTs tend to have limited external validity and low capacity for generalisation. Put differently, even if RCTs tell us something about a “particular program in a particular place during a particular point in time, [they don’t] tell you much about the result of even running an exactly identical program carried out in a different context and time” (Deaton and Cartwright, 2017; Pritchett, 2019; Reddy, 2019; Ogden, 2020: 129–30).
- RCTs are highly inflexible, require high implementation fidelity, and are really only suited to localised and highly bounded phenomena. When scaling up (or out), we tend to find local discretion in how interventions are implemented, which makes inflexible methods like RCTs of little use at scale (Pritchett, 2018; Kabeer, 2019; Fox, 2019; Heckman, 2020; Drèze, 2023; Aston, 2023; List, 2024).
- RCTs virtually ignore ethical considerations and can potentially cause a lot of real harm. Given that both of the method’s central features, randomisation and control groups, can cause harm, RCTs may be especially ethically problematic (Ravallion, 2018; Abramowicz and Szafarz, 2019; Hoffmann, 2021; Drèze, 2023; Kinstler, 2024).
If even half of these critiques are fair (I think they all are), it would be reasonable to conclude, as Sanjay G. Reddy does, that RCTs are more cautionary tale than success story. Perhaps the best attempted riposte comes from Timothy Ogden (2020), but in my view, this falls well short.
It’s also reasonable, I think, to conclude that any method with this many flaws can’t seriously be considered a “gold standard.” RCTs are not that special (“nothing magic,” as Martin Ravallion put it), and should not be treated as such. For most other methods, I would actually struggle to come up with ten major limitations. So, by extension, there are clearly no legitimate evidence hierarchies. Arguments to the contrary, such as this recent blog from Matt Barnard, shouldn’t be taken too seriously because they are predicated on the notion that the features of an RCT are always better than the alternatives. This is simply not the case.
If you still aren’t convinced by my list, or if you’re still struggling to make the case for why RCTs might not be the right option, here are a few testimonies for you to consider:
- ‘The RCT design… has essentially zero practical application to the field of human affairs’ (Michael Scriven).
- ‘The ability of RCTs to answer any of the big questions is close to zero’ (Arvind Subramanian).
- ‘RCTs are not feasible for many of the things governments and others do in the name of development’ (Martin Ravallion).
- ‘The implications of using results from [RCTs] as a basis of designing public policies for poverty interventions are dangerous at best’ (Godwin R. Murunga and Ibrahim O. Ogachi).
And if you think I’m merely cherry-picking critiques of bad RCTs, think again. Alexander Krauss assessed the ten most cited RCT studies worldwide and concluded that “these world-leading RCTs that have influenced policy produce biased results.” Yes, RCTs also have biases. And bear in mind, these studies were in the fields of general medicine, biology, and neurology, fields for which RCTs are perhaps most appropriate. So, what hope is there for field experiments in other, less appropriate, sectors?
At a bare minimum, these accounts suggest that RCTs aren’t appropriate or feasible for most things we actually want to know about in the world and have a very narrow range of legitimate application.
Methods are only as good or bad as their appropriateness for answering particular questions and how well they help us understand the phenomena under study. They are neither better nor worse, per se, just more or less appropriate under different circumstances. Some, such as Amber Peterman (and Ogden, 2020), complain that RCTs are singled out for special treatment.
Peterman offers a thoughtful counter-critique, suggesting that RCTs shouldn’t be singled out. She’s right in the sense that many of the above critiques can also be levelled at other experimental approaches and, to a lesser degree, at quasi-experimental approaches. The most amusing riposte to Deaton and Cartwright’s critique came from Pamela Jakiela, who responded: “Nice insight by Deaton and Cartwright, but for some reason they keep spelling ‘study’ as R-C-T.” Had Jakiela said “experiment,” it might have been rather more accurate.
When we use the invective of “randomistas,” we’re really referring to those who do experiments with control groups at least as much as to those who randomise treatments. But we’re also referring to an epistemic fissure between experimental approaches and other approaches to causation (as illustrated in Quadrant Conseil’s Impact Tree, which, as Thomas Delahais reminded me, randomistas typically misinterpret merely as the “qual-quant” debate). It’s deeper than that. As John Mackie put it, it’s not just about data collection methods; it’s about the “cement of the universe.”
My response to Peterman was that if you claim to be special and better than all others, a priori, you can (and should) expect the strongest criticism. Or, at the very least, you should be held to as high a standard as you claim your method deserves. Pride comes before a fall, as Icarus found out.
Despite these serious limitations, RCTs continue to expand. Timothy Ogden (2020) is quite wrong to suggest that RCTs are in the ‘slope of enlightenment’ phase of the hype cycle. We are clearly still at the ‘peak of inflated expectations.’ Facundo Herrera recently shared the following trend from the AEA registry, which illustrates this:
So, why have RCTs expanded? Arvind Subramanian, a former Chief Economic Advisor to the Government of India, argues that this growth is due to RCTs’ promotion by “a very incestuous club of prominent academics, philanthropy [to] mostly weak governments.” Florent Bédécarrats, Isabelle Guérin, and François Roubaud argue that the success of RCTs is driven mainly by a new scientific business model, and it’s the same club that has effectively promoted this business model. Agathe Deveaux-Spatarakis explains this well, looking behind the scenes of the French evidence-based policy movement and what she describes as a “lobbying coalition of evidence suppliers wishing to conduct RCTs.” Yet, when you consider the enormous expense of RCTs, it’s highly questionable that this business model offers good value for money.
Much of this growth still takes the form of “helicopter research,” whereby researchers from the Global North parachute their ideas into new contexts in the South, frequently with a poor understanding of those contexts. As Martin Ravallion put it, “researchers mainly based in the rich world lead randomized controlled trials whose subjects are mainly people in the poor world.” Some, such as Peterman, are strongly against this practice, at least in the abstract. Frequently, in my experience, these researchers treat their in-country researchers poorly, and research subjects even worse. Yet, as Ogden (2020) points out, “randomistas clearly believe that experimentation with human beings is ethical… [because what counts to them is] what can be learned from the experiment.” It is precisely for this abstract reason that research subjects come last.
As I discussed in randomista mania, for all of these reasons, and more, RCTs are substantially over-used. So, we should be rather more reflexive in considering when they are most appropriate and useful.
In the next blog, I aim to consider what randomistas have done to address some of the aforementioned concerns. I will argue that these changes are very welcome and move the field forward. But the changes themselves suggest entropy at the heart of the randomista project and demonstrate that the game is up for any claims to a “gold standard.”