Success, failure, or both?
In a recent must-read blog, Howard White pointed out that ‘qualitative evaluation approaches find 80% of interventions successful whereas the consensus from effectiveness studies is that 80% of interventions don’t work.’
“Big, if true,” as the meme goes.
White presents this finding as proof of positive bias in qualitative studies. He doesn’t cite the evidence for the finding, but I’m sure he has some. While I don’t doubt there is some positive bias in “qualitative” studies, I don’t think he’s making a like-for-like comparison within the same interventions, so the comparison is misleading. I also think he misses an important countervailing point: quantitative effectiveness studies may also display some degree of negative bias.
Firstly, there are construct validity issues, which I’ll discuss at greater length below. I’d argue that many, if not most, experimental studies are assessing too few dependent (outcome) variables related to the potential effects of interventions and may even be assessing the wrong things in the first place. And when these targets are not hit, interventions are deemed a total failure. There is no room for equivocation or nuance.
Secondly, as I discussed in my blog Randomista mania, we should remember that interventions are often distorted to meet the stringent requirements of experimental impact evaluation designs. They need to be simple, tightly bounded, and controllable. This restricts what interventions can be and do. Jonathan Fox has critiqued the selective profiling of “rigorous” evidence from small-scale interventions with modest doses in the transparency and accountability sector. Some such interventions, the Metaketa Initiative for example, seemed doomed to failure from the very beginning. It was no surprise at all that they found null effects. The utterly perplexing, and often unethical, pretence of equipoise in experimental designs (i.e., supposed uncertainty about the comparative therapeutic merits of each arm in a trial) means we can waste millions of dollars on things that reading non-experimental studies would suggest probably wouldn’t be very effective in the first place. But, as such evidence isn’t considered “rigorous” by randomistas, it is regularly ignored.
Many “qualitative” (or more accurately case-based) designs tend to be less restricted in their scope in the search for outcomes. And thus, logically, they have greater opportunities for finding a wider range of positive and negative outcomes. Hence, it’s entirely possible the stark differences in findings are at least partly a result of the fact that they are simply looking at different things in different ways. This is discussed at length in Gary Goertz and James Mahoney’s A Tale of Two Cultures.
Researchers and evaluators ultimately make political choices regarding which effects they want to highlight and which they want to hide. The Improving Teacher Performance and Accountability (KIAT Guru) project in Indonesia, which introduced social accountability mechanisms in schools to improve student test scores, is a vivid example, but it demonstrates the opposite of what White contends. It looks like a great success in the quantitative study focused on a few key variables (test scores) and in many respects looks like a failure in the qualitative study looking at a broader set of variables (e.g., the wider curriculum and educational processes). Another highly selective synthesis seems to bury the bad news. As Grazielli Zimmer Santos and I found, depending on which variables you look at, it’s both a success and a failure, and what counts is how you weigh up the positive and negative effects of any intervention.
As Florencia Guerzovich, Alix Wadeson and I discussed, in the transparency, accountability and participation sector, these political choices have led to hyperbolic tales of triumph and disaster. With less than a handful of experimental studies which showed null effects in just one country, foundations and bilateral donors collectively lost their composure. Nathaniel Heller recently suggested that a failure to more convincingly demonstrate that citizen engagement, anti-corruption efforts, and governance reforms lead to development outcomes was part of the real cause of Global Integrity’s recent demise. Good organisations are closing down partly because of how we play the evidence game; so, the stakes are high.
Mapping the territory
The point about assessing too few dependent variables is that we’re often only covering a small part of that territory, and this may be misleading (i.e., it may lead to false negatives). In an earlier blog, I called attention to some of E. Jane Davidson’s excellent work on rubrics. In her presentation, she has a graphic which fundamentally changed how I saw evaluation, and measurement in particular. It illustrates that when we select indicators to measure or assess something, we’re not covering the whole outcome domain, but just a part of it.
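To make the coverage point concrete, here is a minimal Python sketch. The outcome dimensions and indicator names are hypothetical, invented purely for illustration; they are not taken from Davidson’s graphic or from any study discussed in this blog. The sketch simply shows that a set of indicators can be checked against the wider outcome domain to see how much of the territory is left unmeasured.

```python
# Hypothetical outcome domain for an imagined education accountability project.
OUTCOME_DOMAIN = {
    "test_scores",
    "teacher_attendance",
    "curriculum_breadth",
    "parent_participation",
    "student_wellbeing",
}

# Hypothetical indicators, each mapped to the dimensions it actually captures.
INDICATORS = {
    "maths_exam_result": {"test_scores"},
    "attendance_register": {"teacher_attendance"},
}


def coverage(indicators, domain):
    """Return (share of the domain covered, set of dimensions left unmeasured)."""
    # Union of everything the indicators capture, restricted to the domain.
    covered = set().union(*indicators.values()) & domain
    share = len(covered) / len(domain)
    return share, domain - covered


share, unmeasured = coverage(INDICATORS, OUTCOME_DOMAIN)
print(f"Share of outcome domain covered: {share:.0%}")
print("Unmeasured dimensions:", sorted(unmeasured))
```

In this made-up case, a null result on the two measured indicators would say nothing about the three dimensions that were never assessed, which is where a false negative could hide.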
Failing to understand the problem of limited coverage inherent in indicators is at least part of the reason for the hyperbolic doom narrative I mentioned above.
The allure of precision is that it seems valid. What we don’t want, however, is to hit the target, but miss the point, as Chris Mowles perceptively discussed recently in relation to the UK government’s deceptive attempt to hit COVID-19 test targets. In a previous blog on basketball statistics, I discussed the perils of metric fixation and false precision, as exposed by books like The Metric Society and The Tyranny of Metrics. So, you can read more there about when neat measurement is appropriate.
Conceptual ambiguity and fuzziness
The challenge of measuring change and results in voice and accountability work has long been a headache for researchers and evaluators. After all, as Julia Fischer-Mackay and Jonathan Fox suggest, many concepts in the governance field are ambiguous and contested. As a result, as Jeremy Holland et al. note, ‘identifying a set of indicators that can simplify and capture complex processes and relationships that are transformed through voice and accountability interventions is often difficult.’ Similarly, as Transparency International point out, ‘the measurement of corruption is a longstanding challenge for both academics and the policy community, due to the absence of unanimously agreed-upon definitions and the widespread belief that, due to its informal and hidden nature, corruption is an unobservable phenomenon.’ Unsurprisingly, every year, there’s a new blog critiquing the validity of Transparency International’s Corruption Perceptions Index.
Hitting the wrong target is more common than you might think. Fischer-Mackay and Fox recently discussed the perils of “slippery indicators,” wherein they found that several prominent studies in the accountability field had measurement validity issues because they did ‘not measure the real-world processes they claim[ed] to address.’ As Fox noted in the blog above, many experiments, in particular, use very indirect indicators of concepts like participation or accountability (evaluators sometimes call concepts “criteria”). I’d argue there are at least two reasons for this. Many experimental scholars have weak theory and often have a pretty tenuous grasp of the sectors they are assessing. Fischer-Mackay and Fox concluded that studies that claimed to be about community monitoring appeared not to be doing community monitoring in the first place. So, to say that the interventions “failed” might be pretty meaningless at best, and seriously misleading at worst.
Fuzzy concepts (or fuzzy proxies) are a challenge we’re facing head on in a new Results for Development (R4D) project, the Governance Action Hub. The aim of the Hub is for coalitions to define what issues they want to address, and we want these coalitions to be able to tell their own stories about change. So, we can’t (and shouldn’t) be too prescriptive a priori. Recognising this allows us to be open about our own assumptions, including about the direction of travel and our theory of change. It also allows us to discuss this and embed learning around it within locally-led processes, negotiating our own assumptions with those of the local coalitions. In sum, we’re trying to get comfortable with fuzziness.
As a result, rather than developing highly precise outcome-level indicators which may miss the point, tools like basket indicators and rubrics seem a more appropriate place to start. While fuzziness has its risks, we need some degree of flexibility for concepts which escape reductive measurement, and fuzziness allows us to capture appropriate degrees of variation in the types of outcomes coalitions might (or might not) achieve. You should only be highly precise if you’re really clear what the target is, and that hitting that target is actually meaningful for what you’re really trying to change.
I recall a conversation with Marina Apgar about when theories of change are good enough to test. She told me perhaps after a year or so. The same surely applies to many outcome-level indicators.
The challenge of overlapping concepts
By now, you may have noticed that I’ve used transparency, accountability and participation, citizen engagement, social accountability, voice, corruption, and governance almost interchangeably. These are all overlapping fields of work. In reviewing a dozen evidence reviews in the sector, Zimmer Santos and I found that what was mostly being reviewed was transparency and responsiveness, not accountability. The figure below illustrates the number of mentions of key terms:
The ugly pie chart below illustrates the number of empirical references to these key terms in the reviews. Scholars long held that “strong” forms of enforceability, such as sanctions and litigation, were crucial to achieving outcomes and that what they pejoratively called “weak” forms of citizen engagement were less effective. The data reveal that this was, and remains, more a theoretical proposition than an empirical “fact.”
Given the frequent gap between theory and practice, we might do well to define the boundaries of some concepts like social accountability more clearly, and then distinguish these from other supportive areas of work like advocacy. This would help to avoid the malaise of 3ie’s Development Evidence Portal in defining a concept like accountability, as I discussed in a previous blog. Assuming that all paths to these concepts neatly overlap is also highly questionable. I’ve argued in the past that there are sometimes important trade-offs in results depending on which are the primary dependent variables (outcomes of interest) you hope to make progress on. We should be attentive to both positive and negative interaction effects.
Responsiveness is very often the main outcome area we’re looking at in the governance sector because it’s a flexible proxy of progress on development outcomes of concern to citizens. The definition I tend to use is a straightforward one written for the Department for International Development (DFID) governance advisors some time ago by Mick Moore and Graham Teskey: responsiveness is an action or series of actions by which governments ‘identify and then meet the needs or wants of the people’ (Moore and Teskey, 2006: 3). I like this definition because it’s about action(s) which we can substantiate, as well as giving us an idea of the people to whom governments should respond.
Different levels or depths of change
We can also hit targets at different levels, of course. Outcome Mapping, for example, has a change ladder with (usually) three different levels of change, as illustrated below.
We can consider different levels of change for many key concepts in the governance field. Brendan Halloran from the International Budget Partnership (IBP) argues that there may be different depths to responsiveness, and there may also be overlaps between responsiveness and accountability (what he calls “accountable responsiveness”). Halloran distinguishes “responses” (relatively one-off, isolated actions) and “responsiveness” (sustained and reliable patterns of positive response by governments to citizens). I’d personally call the latter institutionalised responsiveness, but I think Halloran helpfully captures the fact that there are different levels (or depths) of response, which can be captured in a scale or rubric. IBP has a kind of scale they use to capture the level of commitment in relation to their Open Budget Survey (OBS), for example. Ultimately, once we’ve defined our concepts, we need to consider whether there are different levels of change or whether it’s simply a matter of presence or absence. I find that most people are comfortable with either three or five levels of change, but more than that becomes a bit overwhelming. Fewer than three, and you may as well have a checklist or a simple in/out measure.
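For readers who like to see such a scale written down, here is a minimal Python sketch of a three-level rubric built on Halloran’s distinction between “responses” and “responsiveness.” The level names, the function, and its inputs are my own hypothetical simplification for illustration; this is not IBP’s scale, nor a tool Halloran proposes, and in practice the judgement about whether a pattern is “sustained” would rest on qualitative evidence rather than a single flag.

```python
from enum import IntEnum


class ResponsivenessLevel(IntEnum):
    """Hypothetical three-level rubric; labels are illustrative only."""
    NO_RESPONSE = 0      # no government action observed
    RESPONSE = 1         # relatively one-off, isolated actions
    RESPONSIVENESS = 2   # sustained and reliable pattern of positive response


def rate(actions_observed: int, sustained_pattern: bool) -> ResponsivenessLevel:
    """Place a case on the rubric.

    actions_observed: number of substantiated government actions (assumed input).
    sustained_pattern: the evaluator's judgement that responses are sustained
    and reliable (assumed to be made elsewhere, from qualitative evidence).
    """
    if actions_observed < 0:
        raise ValueError("actions_observed cannot be negative")
    if actions_observed == 0:
        return ResponsivenessLevel.NO_RESPONSE
    if sustained_pattern:
        return ResponsivenessLevel.RESPONSIVENESS
    return ResponsivenessLevel.RESPONSE


# Usage: three isolated actions without a sustained pattern.
print(rate(3, sustained_pattern=False).name)
```

Keeping the rubric to three ordered levels reflects the preference stated above; adding a fourth or fifth level would only mean extending the enumeration and the rules that separate adjacent levels.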
All in all, there’s a balance to be struck here. It’s the classic Goldilocks and the Three Bears fairy tale. We need clear enough concepts and conceptual boundaries to ensure we’re assessing similar enough and meaningful things. But, when we measure (or assess) these concepts, we need to be careful not to choose overly precise indicators which risk us measuring the wrong thing, or only one of the many important things we want to cover. We need to consider whether there are different levels of change or not, and ensure that if there are, we’re not left with an overwhelming task of measuring more than we can manage, or more than we can make sense of.