Mark Rothko, №7

Evaluating complexity, simplistically

Thomas Aston
Feb 16, 2022

A study was recently published by the Centre of Excellence for Development Impact and Learning (CEDIL) entitled Evaluating complex interventions in international development. This is the sort of title that raises great expectations. Complexity is a hugely popular theme and many of us are keen to know more about how to evaluate efforts that seek to achieve results amid complexity.

In April 2021, CEDIL conducted a webinar on the paper, and in July 2021 CEDIL published a blog. In September 2021, I wrote a blog expressing some reservations regarding the focus of the study: its apparent over-emphasis on interventions, its under-emphasis on context, and its choice of some supposedly under-used methods. These methods were: (1) factorial designs; (2) adaptive trials; (3) Qualitative Comparative Analysis (QCA); (4) synthetic control; and (5) agent-based modelling and system dynamics.

On one hand, I think it’s important to look at apparently under-used methods, providing these are appropriate to the task at hand. I would agree that QCA has been under-used, and colleagues and I have explained why it’s an appropriate method for assessing complex change processes in a recent Evaluation article (open access here). Agent-based modelling may well have an argument for increased use (in combination with other methods such as Process Tracing, for example) and synthetic controls are perhaps worth looking into further.

In that Evaluation article, however, we discussed why experimental methods are largely inappropriate to the task of evaluating complex change processes. Such methods unhelpfully conceive of context as a source of bias to be eliminated; they set counterproductive intervention boundaries which constrain intervention scope and scale; and their emphasis on fidelity further constrains the potential for emergence and adaptation and impedes real-time learning, to mention but a few significant weaknesses.

In my blog on the webinar, I explained why we should be looking beyond interventions and take intervention context seriously. Nonetheless, the authors’ focus is on complex interventions, which they define as interventions made up of many components that interact in non-trivial ways. They break this down into the following key characteristics:

[Image: typology of key characteristics of complex interventions. Source: Masset et al. 2021]

I think such a typology is somewhat helpful in determining the potential dimensionality of an intervention (i.e., how complicated and, to a lesser degree, how internally complex). Though as Rick Davies notes, there are some limitations of the typology:

As discussed previously, I find Bamberger et al.’s checklist more helpful than this typology because it has greater emphasis on intervention context, potential interactions, and social complexity. But, I think Masset et al.’s typology still has some merit.

Linear non-linearity and simple complexity

While the typology might be helpful in some way, the tone of the paper, its ontological and epistemic orientation, and its methodological preferences are more concerning. It’s full of unprovoked defensive statements such as:

‘Linear relationships between variable[s] might be more common than it is thought, allowing the use of experiments and quasi-experiments.’

This is followed by the pugilistic assertion that:

‘The relevance of non-linearities should be demonstrated with the data and observation, rather than being postulated.’

While I’d agree that demonstration matters, such assertions indicate that this is a paper on complexity which is, paradoxically, hostile to key aspects of complexity thinking.

The authors dispute Rogers’ (2009) — I think, reasonable — argument that standard methods of causal inference can be employed in the evaluation of “complicated” interventions, but not in the evaluation of “complex” ones. There are then several ostensibly confusing statements in the paper such as:

‘Even if relationships are non-linear, this does not mean they cannot be analysed using linear methods. Non-linear relationships can be linear within the restricted range that is of policy interest.’

By ‘restricting the range of input variables,’ they argue, ‘the response of the outcome becomes linear.’
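To see what this claim amounts to, here is a minimal sketch. It assumes a purely hypothetical quadratic response curve (nothing in the paper specifies one) and measures how badly the best straight line misses that curve over a narrow input range versus a wide one. The function names and ranges are mine, chosen only for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line through the points (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def worst_linear_miss(response, low, high, points=101):
    """Largest gap between the response curve and its best straight line on [low, high]."""
    xs = [low + (high - low) * i / (points - 1) for i in range(points)]
    ys = [response(x) for x in xs]
    slope, intercept = linear_fit(xs, ys)
    return max(abs(y - (slope * x + intercept)) for x, y in zip(xs, ys))


def response(x):
    # Hypothetical response curve, chosen only because it is visibly non-linear.
    return x ** 2


narrow_miss = worst_linear_miss(response, 4.0, 5.0)   # restricted input range
wide_miss = worst_linear_miss(response, 0.0, 10.0)    # full input range

print(f"worst miss on [4, 5]:  {narrow_miss:.3f}")
print(f"worst miss on [0, 10]: {wide_miss:.3f}")
```

The straight line fits well on the narrow range and poorly on the wide one. The linear approximation holds only for as long as the inputs are kept inside the narrow window.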

What emerges clearly in the paper through such statements is an argument to restrict the scope of interventions, and presumably, to limit interactions within interventions and between interventions and context so that they can be more easily studied via preferred methods (much like Rogers’ characterisation of “simple”). The authors acknowledge that researchers ‘neglect complexity and evaluate interventions as though they were not complex by singling out the impact of a single component.’ Yet, throughout the paper, it seems clear that the authors are looking to simplify complexity to make it fit the very same randomista worldview and methodological hierarchy.

Another consistent trend in the authors’ argument is that critiques regarding the appropriateness of experimental and quasi-experimental methods for evaluating complexity have been “exaggerated.” They argue (not unreasonably) that claiming an intervention is complex is not an ‘excuse for not conducting a rigorous evaluation.’ But, by “rigorous evaluation,” the authors seem to mean experimental or quasi-experimental evaluation. The chief aim of the paper seems to be to justify the use of experimental methods to evaluate complex programming through the side door.

I say this because the authors only claim to include methods that ‘compare changes in outcomes in an intervention group against changes that would have occurred in the absence of the intervention’ (a nod to 3ie’s restricted counterfactual definition of impact evaluation rather than the more open OECD definition). In fact, this is not even accurate for the methods included in the paper. QCA does not fit such a definition in the first place (it’s a set-theoretic method). There is also a clear (and wholly unjustified) methodological hierarchy behind the authors’ argument.

Masset et al.:

‘Did not include methods with a more questionable causal approach, such as process tracing or contribution analysis, and we did not include qualitative approaches.’

Except, of course, Qualitative Comparative Analysis (right?).

The authors:

‘Decided to err on the side of caution by excluding methods that require strong assumptions for being causal, such as process tracing.’

What makes these methods “questionable” is not clearly explained. Such statements arise in the paper as non sequitur cheap shots at methods the authors seem not to understand. These seem wholly unnecessary and unreasonable, given that the explicit focus of the paper is on supposedly under-used methods. Such statements are stranger still when one considers that a substantial proportion of grants provided and papers published by CEDIL were for process tracing studies (including one of which I was part).

For some reason, QCA, factorial experiments, and modelling methods are judged to have strong assumptions, but process tracing and contribution analysis are not. This position itself seems to be full of questionable assumptions. For anyone who has actually used QCA (I have, only once), it’s clear that the strength of those assumptions relies on the strength of within-case evidence prior to developing “truth tables.” Methods such as process tracing, despite their limitations, are designed precisely for assessing the strength of within-case evidence. And, of course, experiments and modelling also sometimes (perhaps, regularly) rely on highly questionable assumptions.
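A minimal sketch of a crisp-set truth table shows how much rides on the case coding. The cases, conditions, and outcomes below are entirely hypothetical, and real QCA software does considerably more (minimisation, coverage, and so on); this only groups cases by configuration and reports the share of cases in each row that show the outcome.

```python
from collections import defaultdict

# Hypothetical cases: (name, (condition_1, condition_2), outcome), all coded 0/1.
cases = [
    ("case_1", (1, 1), 1),
    ("case_2", (1, 1), 1),
    ("case_3", (1, 0), 0),
    ("case_4", (0, 1), 1),
    ("case_5", (0, 1), 0),
    ("case_6", (0, 0), 0),
]


def truth_table(coded_cases):
    """Group cases by configuration; report row size and share showing the outcome."""
    rows = defaultdict(list)
    for _name, configuration, outcome in coded_cases:
        rows[configuration].append(outcome)
    return {
        configuration: {"n": len(outcomes), "consistency": sum(outcomes) / len(outcomes)}
        for configuration, outcomes in rows.items()
    }


table = truth_table(cases)

# Suppose closer within-case work shows case_5 did achieve the outcome after all.
recoded_cases = [
    (name, configuration, 1 if name == "case_5" else outcome)
    for name, configuration, outcome in cases
]
recoded_table = truth_table(recoded_cases)

print(table[(0, 1)], recoded_table[(0, 1)])
```

Recoding a single case turns a contradictory row into a fully consistent one. That is why the quality of the within-case evidence behind each 0 and 1 matters so much.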

The study is also full of curiously breathy statements such as ‘eventually, an RCT can be carried out to test the effectiveness of the interventions identified in the second stage.’ It’s as though, somehow, an RCT will come to save us from our complexity malaise.

Simple studies for simple interventions?

As I mentioned in my previous blog, I think we need to question the appropriateness of factorial and adaptive experiments for assessing complexity. As the paper is now published, let’s take a brief look at what it found and whether my concerns were justified.

Factorial designs

Factorial designs are randomised experiments that assess the impact of different treatments and of their interactions.
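As a rough illustration of what "interactions" means here, the sketch below takes invented cell means for a 2-by-2 design and computes the two main effects and the interaction as a difference in differences. The numbers are hypothetical, and conventions differ on how the interaction term is scaled.

```python
# Hypothetical mean outcomes for the four cells of a 2-by-2 factorial design.
# Keys are (received A, received B); the numbers are invented for illustration.
cell_means = {
    (0, 0): 10.0,  # neither treatment (control)
    (1, 0): 12.0,  # treatment A only
    (0, 1): 13.0,  # treatment B only
    (1, 1): 18.0,  # both treatments
}


def factorial_effects(means):
    """Return main effects of A and B and their interaction from 2-by-2 cell means."""
    effect_a_without_b = means[(1, 0)] - means[(0, 0)]
    effect_a_with_b = means[(1, 1)] - means[(0, 1)]
    effect_b_without_a = means[(0, 1)] - means[(0, 0)]
    effect_b_with_a = means[(1, 1)] - means[(1, 0)]
    return {
        # Main effect: the effect of one treatment averaged over levels of the other.
        "main_a": (effect_a_without_b + effect_a_with_b) / 2,
        "main_b": (effect_b_without_a + effect_b_with_a) / 2,
        # Interaction: how much the effect of A changes when B is also present.
        "interaction": effect_a_with_b - effect_a_without_b,
    }


effects = factorial_effects(cell_means)
print(effects)
```

A design that only compares each arm against the control, without the combined cell, cannot estimate the interaction term at all.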

The paper reviewed the use of factorial designs in the evaluation of development interventions and found 27 studies. Nearly all the studies consisted of 2-by-2 factorials and, with just one exception, they were all designed as multi-arm trials rather than as factorial designs. In other words, according to Masset et al., only one of the studies reviewed employed a true factorial design. So, we appear to be drawing conclusions from a very small sample. And more importantly, as the authors themselves note (notwithstanding the above exception, I think):

‘These studies are extremely simple and do not try to assess the effects of different interactions between interventions.’

So, whether due to misapplication or mistaken identity, from the evidence presented in the paper, factorial designs appear to offer limited value added to assessing complex interventions (and much less, complex processes or systems).

What about adaptive trials?

Adaptive trials

An adaptive trial is a randomised experiment that allows for changes in the study design during implementation based on the data collected in the early stages of the study.
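In its simplest form, the logic can be sketched as a two-stage "pick the winner" rule. This is a naive illustration under my own assumptions, with hypothetical arm names and data; it is not the allocation rule of any study discussed here, and real adaptive designs often use probabilistic rules that account for sampling noise.

```python
def success_rates(results):
    """Observed success rate per arm from {arm: (successes, participants)}."""
    return {arm: successes / n for arm, (successes, n) in results.items()}


def second_stage_allocation(results, remaining_participants):
    """Assign everyone still to be enrolled to the arm that did best in stage one."""
    rates = success_rates(results)
    best_arm = max(rates, key=rates.get)
    return {
        arm: (remaining_participants if arm == best_arm else 0)
        for arm in results
    }


# Hypothetical stage-one data: (participants with the outcome, participants enrolled).
stage_one = {"arm_a": (12, 100), "arm_b": (18, 100), "control": (9, 100)}

allocation = second_stage_allocation(stage_one, remaining_participants=300)
print(allocation)
```

Everything the design can "adapt" to is fixed in advance: the arms, the outcome measure, and the decision rule.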

Masset et al. note that most empirical applications of adaptive trials have been in drug testing. No surprise there. Their review found only two studies that employed an adaptive design in the evaluation of a development intervention.

In the first intervention, one group of job-seeking Syrian refugees was provided a small unconditional transfer, another group received personal coaching to prepare for a job interview, and the control group received an information flyer. Study participants were then reassigned to the most effective treatment based on observed employment outcomes.

In the second intervention, on agriculture extension services in India, the project contacted farmers by phone and administered a short questionnaire to enrol farmers in the programme. There were six different types of phone call.

Based on Masset et al.’s own criteria for complex interventions, I would argue that both of these are very simple, and the interactions seem relatively trivial, given their phasing. In practice, these appear to be akin to multi-arm, multi-stage trials, but with the opportunity to pick the preferred intervention treatment after the first phase.

As far as I can tell from the paper, the interventions cited involve few components, appear to incentivise few and simple behaviours, seem to involve few groups, consist of few sectors, engage few stakeholders, are applied at limited scale, have high levels of standardisation, and appear to offer relatively limited opportunities for emergent outcomes.

So, what does this tell us? I think what the paper actually tells us is that experimental methods are useful for relatively simple interventions, not meaningfully complex ones.

Consider how these methods might handle the evaluative challenges of a highly complex programme such as the Partnership to Engage, Learn and Reform (PERL) in Nigeria. The programme (which I worked on) aimed to promote governance reforms in highly diverse socio-political contexts. Until recently, it worked across numerous locations at district, state, and federal levels. It worked in several sectors in many of these locations (internal revenue, budget, local governance, health and education, among others). Reforms were not selected a priori, but emerged based on several layers of political economy analyses which helped to understand which reforms might gain traction (or not), and the reforms targeted evolved over the course of the programme. PERL worked on both the supply side to build state capacity and political incentives to deliver reforms and with diverse civil society groups, private sector actors, and media organisations to build demand for reforms, and it brought together regional platforms to adapt reforms and practices across states. It should come as little surprise, then, that the programme had many components and that we’re talking about many difficult processes of behaviour change.

For the most part, experimental methods prevent us from designing such a programme in the first place. And they seem to be a poor fit for evaluating a complex programme like PERL.

Isolated… standardised… stable

Now, this doesn’t mean that experiments should never be used in complex programming (or contexts). While I can think of almost no appropriate uses, there might perhaps be isolated corners of interventions or phases of interventions where experiments could potentially add some value. As the UK Treasury’s Magenta Book points out, experimental methods are:

‘Weak at dealing with contextualisation…. Difficult to apply where causal pathways are complex… hide the emergence of new phenomena in a complex system. [It can be…] inappropriate or counterproductive to try to standardise the intervention or isolate a control group…This is not to say that experimental designs cannot be useful in complex environments, if one element can be isolated and is relatively standardised, and the wider context stable.’

But, it should be obvious that such conditions are very rare in complex environments and might only be possible in a peripheral corner of an intervention which is standardised, self-contained, segregated and stable (i.e., a vacuum). That is, experimental (and quasi-experimental) methods are appropriate in conditions which are intrinsically or artificially simple.

Of course, even if we can think of an appropriate use, we have to assume that finding a valid control group is not only feasible but desirable (control groups often run counter to the aims of complex interventions like PERL), that costs are non-prohibitive and offer good value for money (even supposedly low-cost trials are immensely expensive and beyond the reach of most programmes), and that there are limited ethical issues (we have all heard the horror stories).

To put the challenge back to the authors:

‘The relevance of [experimental methods] should be demonstrated with the data and observation, rather than being postulated.’

My conclusion from reading the paper and the data presented in it is that while these two experimental methods may be more appropriate than RCTs (or, less inappropriate), their value addition for assessing complex interventions and processes remains very limited in both theory and practice. But, have a read for yourself and see if you agree or not.



Thomas Aston

I'm an independent consultant specialising in theory-based and participatory evaluation methods.