Janet Echelman, 1.8

Pivot or entropy? The end of the gold standard

Thomas Aston
11 min read · May 25, 2024

--

In my previous blog, I documented some of the most common critiques of the supposed “gold standard,” Randomised Controlled Trials (RCTs). These included: (1) theoretical shortcomings; (2) issues with construct validity and poor theorisation; (3) ignoring context; (4) ignoring human agency; (5) distorting and narrowing the scope of interventions; (6) limits in explaining how and why change happens; (7) incapacity to deal with complexity; (8) limited external validity and generalisability; (9) high inflexibility and difficulty implementing at scale; and (10) virtual denial of ethical considerations.

I recently read J-Pal’s blog The next decade of RCT research: What can we learn from recent innovations in methods? I was pleased to find, alongside the usual spin, some thoughtful reflections on how RCTs can improve. So, in this blog, I aim to consider where the randomistas have pivoted away from some of the central tenets of the faith, and as a result, have improved the potential usefulness and relevance of experiments.

This pivot is not entirely new. Timothy Ogden helpfully explained the evolution of RCTs, and Lant Pritchett has even gone so far as to argue:

“None of the original users of randomisation have stuck to their original set of claims because they have learned and acknowledged they were completely wrong, so they have moved onto a completely new set of claims but without being super clear what they learned, how they learned it and why they shifted because it’s never super beneficial to you to tell your funders you made a mistake.”

This is classic Pritchett hyperbole, but I believe the pivots have become more pronounced in recent years. And I think they betray a hamartia (i.e., a fatal flaw) at the centre of the randomista project to set RCTs apart as the gold standard. Arguably, these pivots actually imply entropy. I aim to discuss four of my top 10 weaknesses and what randomistas have done to fix the plumbing: (1) construct validity and poor theory; (2) external validity; (3) complexity and adaptation; and (4) ethics.

Construct validity and poor theory

Randomistas are notoriously bad at definitions, as 3ie’s Development Evidence Portal clearly illustrates. They like neat, simple, and clear classifications. And yet, many things in the real world just aren’t like that. Part of this comes down to the fact that RCTs are based on flawed assumptions of reductionism and methodological individualism. The notion that the world’s problems can be solved by plumbing is problematic (Harvold Kvangraven, 2019). It implies that complex issues can be broken down into smaller parts and then merely reassembled and aggregated. As Yuen Yuen Ang puts it, discussing Esther Duflo’s argument on plumbing, this stems not only from the wrong theory, but arguably the wrong paradigm.

It’s important that you first understand what you’re trying to study or assess; only then do you have any real shot at theorising and actually measuring it. As the J-Pal blog points out, it’s difficult to measure complex, multi-dimensional constructs such as women’s empowerment. To address this, Seema Jayachandran and colleagues created a benchmark measure of women’s empowerment using qualitative data from in-depth interviews and then used machine learning algorithms to identify items from widely used questionnaires. In other words, they worked from the bottom up to help determine what might be measured through surveys. This is never really likely to be enough, as Naila Kabeer’s study shows. But at least it’s a start.
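To give a flavour of that bottom-up approach, here is a minimal sketch (not the authors’ actual code) of how an L1-penalised regression might be used to pick out the questionnaire items that best track a qualitatively derived benchmark. The file name and column names are hypothetical:

```python
# Minimal sketch (not the authors' actual code): use an L1-penalised
# regression to find which survey items best track a benchmark measure
# of empowerment coded from in-depth qualitative interviews.
# The file name and column names below are hypothetical.
import pandas as pd
from sklearn.linear_model import LassoCV

df = pd.read_csv("empowerment_pilot.csv")       # hypothetical pilot dataset
benchmark = df["benchmark_score"]               # score derived from interviews
items = df.drop(columns=["benchmark_score"])    # candidate questionnaire items

model = LassoCV(cv=5).fit(items, benchmark)     # lasso shrinks weak items to zero
selected = items.columns[model.coef_ != 0]
print("Survey items that best approximate the benchmark:", list(selected))
```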

Randomistas are also typically poor at theorising, for many of the same reasons. As Matt Barnard noted recently, “counterfactual evaluations can in principle be ‘black box’ evaluations that just involve measuring specified outcomes without any clear idea about how an intervention is meant to work.” Mercifully, this is not something he recommends. Instead, he argues that we should be identifying mechanisms in theories of change (by which he actually means we should be using research-based social theories, as Huey Chen argued decades ago). Even good RCTs from otherwise clever researchers, such as those in the Metaketa Initiative, fell at the first hurdle due to poor theorisation. If they had read the non-experimental research in the field more carefully, they could probably have predicted they’d get null results across most, if not all, of the studies.

The logic of experimental inference and a misplaced belief in evidence hierarchies unfortunately create a straitjacket which is pretty hard to escape. Arrows flow in only one direction in an artificially contrived vacuum. Yet when randomistas make inferences about an intervention’s net effect, what they’re actually saying is that the intervention + all these contextual conditions = the effect. All the other variables (or rather, causal factors) are baked into the “black box.”

We assume that the researchers are clever enough to identify the correct control variables (despite the fact that they often know little about the context and have poor theory). To me, this is highly doubtful in most cases. Particularly in thematic areas that are poorly defined and understudied, such experiments are potentially calamitous. Indeed, it’s only true to say that “there is only one difference between the two groups” if you have perfectly understood all the potential differences between the two groups and all units of analysis within those groups (see Scriven for an adjacent discussion). I don’t believe anyone has ever achieved this. Even standing on the shoulders of giants, this is a tall order. And given randomistas’ seemingly common lack of concern for context, it’s almost certainly impossible.

However, some randomistas deserve credit for thinking outside the black box. Howard White, in particular, deserves credit for proposing theory-based impact evaluation, theory-based systematic reviews, and later theory-based evidence gap maps. I’m generally not a fan of these because of their evidence hierarchies, but some of these gap maps have started to include a broader range of qualitative and theory-based evidence. My old colleagues Ada Sonnenfield and Nick Moore (and Hugh Sharma Waddington) deserve a good deal of credit for this too. While this won’t solve all the problems of a flawed evidence-based policy perspective, it’s a big step in the right direction towards understanding which gaps are due to methodological blind spots and which are real gaps.

The concession of the need for clear theory suggests a fundamental problem with a “black box” counterfactual epistemology and approach to causation. Randomistas now acknowledge that they need at least a flavour of generative causation to have any hope of explanation.

External validity

Randomistas have historically struggled with issues of external validity, mostly because of their built-in disregard for context. If you genuinely believe you can control for external factors, it’s very hard to take context seriously. Nonetheless, randomistas can at least do better. I’ve explained in the past why I was impressed by Dean Karlan and Mary Kay Gugerty’s thoughtful work on the Goldilocks Challenge of right-fit evidence, and Mary Ann Bates and Rachel Glennerster’s work on the Generalizability Puzzle. These both take experimental approaches forward: firstly, to question when experiments are actually inappropriate (perhaps most importantly), and secondly, to think about how and to what degree it’s reasonable to generalise from the findings of particular studies or evaluations, or collections of them.

Identifying moderating conditions offers a further way forward. One key area of progress which can at least help randomistas to take account of context comes from the Centre of Excellence for Development Impact and Learning (CEDIL). Howard White and Edoardo Masset deserve credit for borrowing from Nancy Cartwright on the construction of middle-level theory and the importance of identifying moderating factors, which stems from configurational and generative approaches to causality. These can help randomistas to better identify potentially relevant factors which may enable, prevent, amplify, or dampen intervention effects, even if they are still (often misguidedly) attempting to control for external factors.
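To illustrate what identifying a moderating factor can look like in analysis, here is a minimal sketch on simulated data; the variable names and effect sizes are invented, and an interaction term stands in for the kind of moderator analysis described above:

```python
# Minimal sketch of a moderator analysis on simulated data.
# Variable names and effect sizes are invented for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),         # randomised treatment assignment
    "local_capacity": rng.integers(0, 2, n),  # hypothetical contextual moderator
})
# The treatment only 'works' where local capacity is high
df["outcome"] = (0.1 * df["treated"]
                 + 0.6 * df["treated"] * df["local_capacity"]
                 + rng.normal(0, 1, n))

# The treated:local_capacity coefficient estimates how context modifies the effect
model = smf.ols("outcome ~ treated * local_capacity", data=df).fit()
print(model.summary().tables[1])
```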

Randomistas can also consider certain aspects of real-world constraints, at least to some degree. John List, who wrote The Voltage Effect, recently argued that you should design your intervention within constraints needed to scale. List reminds us that experiments tend to be designed under optimal conditions of control. So, List recommends that we identify “important mediators, heterogeneity and causal moderators as well as whether the idea remains promising in the face of crucial real-world constraints it will face at scale.” In other words, he recommends that experiments take contextual factors more seriously, and study them more explicitly. List reminds us that “great caution should be taken when drawing conclusions from a localized experiment about a policy implemented at scale.”

Perhaps the best illustration of this comes from Julia Littell, who recently compared different logics of generalisation from systematic reviews. It’s really a masterclass in showing how difficult it is and why we should be very careful about generalising even from meta-reviews. Focusing on proximal similarity seems to be the best option, in my view. As Shadish puts it, “we generalize most confidently to applications where treatments, settings, populations, outcomes, and times are most similar to those in the original research.” The challenge with this for randomistas is that if they lose control over setting these parameters, they can’t generalise findings. The pivot of randomistas towards embedded experiments shows this clearly. As Jean Drèze reminds us, this can “trivialise public policy and compromise the independence of the researchers. It is also a fountain of ethical dilemmas including consent issues, conflicts of interest and compensation norms.” It’s clearly a dead end.

A related point is that experiments also demand fidelity (another key weakness), but randomistas have realised that this too is a problem in the real world. List’s call for identifying mediators and moderators is also a recognition that demanding fidelity at scale is essentially infeasible (as Drèze’s research also suggests). As List puts it:

“The difficulty arises because the object of interest in social science experiments is humans; thus, in most (or potentially all) cases, attempting to develop behavioural laws that parallel those from the natural sciences is a fool’s errand.”

With this kind of issue in mind, Michael Woolcock defined the level of implementer discretion as a key aspect of complexity. Randomistas don’t seem to like complexity. This is one reason why they are so stringent in their demands. This “human problem” outside the lab also signals another problem with experiments and scale — adaptation.

In my view, the concession of looking for mediators and moderators seems to demonstrate a fundamental problem with the very premise of identifying a control group. Better theorisation may help a bit, but the basic problem here is the unrealistic, unscalable, and often unethical, demand for control.

Complexity and adaptation

I’ve already discussed the weaknesses of adaptive trials and factorial designs for addressing issues of complexity, and in the previous blog, I cited the numerous issues with the inflexibility of experiments. Randomistas seem to have no serious answers to the complexity issue, other than a problematic appeal to the notion of restricted complexity (i.e., to “decomplexify” complexity). In my view, there are serious ontological, epistemic, and methodological problems with such an approach, at least with regard to control groups and randomisation.

However, randomistas have a slightly better answer to the adaptation challenge. Timothy Ogden has even argued that randomistas have taken a Problem-driven Iterative Adaptation (PDIA) approach to research. But this PDIA happens between interventions, so Ogden completely misses the point of adaptive management: the fundamental importance of course correction during implementation. It’s simply not remotely feasible to run dozens of adaptive trials during a two- or three-year project.

Nonetheless, another interesting evolution in the last few years is the growth of interest in A/B testing without a control group. The World Food Programme (WFP) stands out as a place which has recently promoted A/B testing in international development. The heart of an RCT is a direct comparison between one or more groups which receive a treatment and a control group which does not receive the (new) treatment (receiving instead a standard treatment, or a placebo). You then randomise which participants are assigned to the treatment and control groups. The combination of these two is what helps to establish the net effects of an intervention.

What A/B testing offers instead is a treatment group and an alternative treatment. Sometimes A/B testing has a (pure) control group, but sometimes not. As WFP put it, ‘a pure control group isn’t necessary to identify which interventions are more effective.’ So, rather than comparing groups that did and didn’t receive the intervention, you’re comparing the relative effectiveness of different treatments. Sometimes this will be an existing intervention with an add-on, and sometimes it will simply be two different interventions, without a control group. You can have the randomisation without a (pure) control group. Randomisation helps to minimise, but does not eliminate, selection biases.
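To make the contrast concrete, here is a stylised simulation (with made-up sample and effect sizes) of the two comparisons: a treatment against a pure control, which estimates a net effect, versus one treatment against another, which only tells you which performs better:

```python
# Stylised simulation of the two comparisons; sample size and effect sizes are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n = 500
control     = rng.normal(0.0, 1.0, n)  # pure control: no (new) treatment
treatment_a = rng.normal(0.5, 1.0, n)  # existing intervention
treatment_b = rng.normal(0.7, 1.0, n)  # intervention with an add-on

# RCT logic: treatment vs pure control estimates a net effect
print("A vs control:", round(treatment_a.mean() - control.mean(), 2),
      "p =", round(stats.ttest_ind(treatment_a, control).pvalue, 4))

# A/B logic: B vs A only says which treatment does better, relative to the other
print("B vs A:", round(treatment_b.mean() - treatment_a.mean(), 2),
      "p =", round(stats.ttest_ind(treatment_b, treatment_a).pvalue, 4))
```

The second comparison tells you nothing about whether either treatment beats receiving no support at all, which is exactly the limitation I turn to next.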

In my view, the concession of having randomisation without a pure control group seems to undermine the basic foundations of experimental impact evaluation — a counterfactual without the intervention. What we’re left with is simply randomising treatments and a comparison.

Ethics

As The Economist discussed recently in How poor Kenyans became economists’ guinea pigs, RCTs still have serious ethical questions to answer. Some — such as Nimi Hoffmann — have thus argued for a moratorium. Yet, once again, what randomistas have done instead is pivot.

Last year, I questioned how wise it was to expand experiments in the humanitarian aid sector. My assumption was that Ananda Young of the Humanitarian Assistance Evidence Cycle (HAEC) was promoting the gold standard — RCTs. But, it turned out I was wrong. She was actually promoting the new silver standard of A/B testing.

The other main thing that A/B testing appears to offer is at least a partial answer to the ethics problem in RCTs. HAEC notes that the common critique of RCTs is that “withholding programming from vulnerable populations to construct a control group is definitively unethical.” No argument there. They cited an implementer:

“There is normally a universal approach to coverage for humanitarian response that makes having a control group really hard. Everybody that needs attention should get it.”

Rightly so. How this is even a matter of debate is baffling.

WFP points out that in emergency contexts, the A/B approach can ‘help to address potential ethical concerns because it doesn’t require a pure control group (e.g. group that doesn’t receive any support).’ HAEC recently came to the same conclusion:

“HAEC encourages its funded impact evaluations to focus on operational research questions utilizing A|B testing approaches that avoid the need for a pure control group.”

It’s hard to disagree that this is preferable to having a control group under such circumstances. But there are still plenty of ethical problems that come with randomisation. List’s paper describes the selection decision as one of “covertness: the participants do not know that they are being randomized into treatment–control or that their behaviour is being scrutinized.” The blog by Sprinkel noted that the RCT presented “real challenges to [CARE’s] feminist principles of gender transformative programming and CARE’s mission and values.” Ensuring random selection “meant much longer distances to travel in order to participate.” So, even without a control group, there are serious selection issues that come with randomised experiments.

In my view, the concession that a pure control group is often unethical shows that the notion of equipoise is often nonsense and that withholding treatment should only be done in situations where there is no genuine need.

In summary then, these pivots demonstrate that RCTs have some fundamental flaws. What we’re ultimately left with, in my opinion, is a carapace of “randomisation.” I think these concessions go a considerable way to demonstrate that there is no “gold standard” method. There never was, and probably never will be.

Testimony from two of the most prominent promoters and producers of RCTs demonstrates that any serious debate on this matter is over. In response to my last blog, the Director of the Australian Centre for Evaluation (ACE), Harry Greenwell, argued that RCTs are appropriate more often than I think they are (fair enough, they may well be), though he noted that my “general proposition is true, and is a good reason for avoiding ‘gold standard’ language.” The Director of Evaluation and Evidence Synthesis at the Global Development Network, Howard White, also responded that: “the appropriate method for the research or evaluation question [is what should drive choices]. No methods absolutism and no evidence hierarchy.”

So, it’s time we moved on.

--

Thomas Aston

I'm an independent consultant specialising in theory-based and participatory evaluation methods.