Method evangelists and zealots need not apply
Written with Megan Colnar
In the last 20 years, evaluation methods for social change efforts have proliferated. What was once a fairly bland landscape is now a thriving field with dozens of methods ranging in application, purpose, and rigor. Better Evaluation identifies 26 umbrella approaches from appreciative inquiry to utilization-focused evaluation, and many of these have multiple credible and tested ‘offshoots’ of their own. So, what is an evaluator to do in the face of so many potential choices?
Any evaluation of social change should have at its core the purpose of assessing change over a particular period by understanding what happened and the relationship between these changes and the social change actors’ efforts. The best evaluations are used with learning and adaptation in mind to promote changes in practice or behavior of social change actors working on these issues or in this context in the future. Most evaluators who have been responsible for assessing social change over the life of a program, project, or strategy know that you must be prepared to pull approaches from across the method spectrum to serve it. There are times when you need a quick and regular pulse check (e.g., After Action Review) and others where you need to draw conclusions about cause and effect (e.g., impact evaluations). The more complex the problem, the more likely it is that you will need to draw from multiple methods over time to effectively evaluate the social change efforts. And yet, we still find so much single-method evangelism in the evaluation community. As Rick Davies puts it:
Unfortunately, it’s all too common among evaluators to prioritize particular evaluation tools, methods, and approaches over all else. So much so, that entire careers are built on the premise of delivering one type of evaluation, and that can lead to…well, method dogmatism.
Make no mistake, we need deep, thoughtful evaluation practice — and good methods need “experts” to support their use in the world. But when method/approach is seen as THE answer, we have a problem. Especially when a method is synonymous with your livelihood, you stop thinking about whether, when, and how a method is useful, and instead you think of how to use the method as much as possible and often as “purely” as possible. You own the method, and not the wider purpose of what evaluation is meant to offer in the first place. This is an issue for many method communities. But, in this case we want to look at Outcome Harvesting and recent battles over its credibility.
Outcome Harvesting & The Dutch Ministry of Foreign Affairs
Outcome Harvesting is a method that collects (“harvests”) evidence of what has changed (defined as “outcomes”) and then, working backwards, determines whether and how an intervention has contributed to these changes.
We’re both evaluators who’ve had internal and external roles in organizations. We see method pluralism and flexibility as the cornerstone of good evaluation practice. We also have specialized training and ample experience in several methods, including Outcome Harvesting (OH). In fact, we’re both regular users and commissioners of OH for formative and summative evaluations: we have sung its praises in blogs and journals and have committed substantial resources to its use.
A few years ago, the Dutch Ministry of Foreign Affairs (MoFA) took the courageous step of focusing on outcomes rather than outputs in reporting for the Dialogue and Dissent program (2016–2021), one of the MoFA’s largest programs. As a result, it recommended the use of Outcome Harvesting to many of the program’s grantees.
However, in 2020, the Policy and Operations Evaluation Department (IOB) updated its Evaluation Quality Criteria. Its guidance for qualitative methods contained a bombshell for Outcome Harvesting. The guidelines noted that: “Outcome Harvesting has gained popularity amongst practitioners and evaluators… [but the] IOB recommends explicitly against the use of Outcome Harvesting.” The report continued by stating that the method could neither evaluate effectiveness nor “validly establish the contribution of interventions to observed outcomes,” and went on to challenge the method on the independence of both evaluators and sources, on bias, and on its approach to triangulation.
Understandably, this was met with confusion among grantees and evaluation practitioners, given that the MoFA had previously recommended the method. The IOB conducted an evaluation review in 2019 to inform the new guidance. We know from personal experience, and even from some of those on the evaluation review panels, that the quality of these evaluations was mixed. This led the IOB to determine that OH was not appropriate. So, it advised against its use, and grantees thus had to downplay or reconsider the method in their final evaluations.
The IOB’s guidance on qualitative approaches relied heavily on a paper on addressing attribution of cause and effect in small n impact evaluations. In the paper, Howard White and Daniel Phillips recommend four theory-based qualitative evaluation methods: 1) Realist Evaluation; 2) General Elimination Methodology; 3) Process Tracing; and 4) Contribution Analysis. In sum, they argued that these were suitable methods for making a plausible claim of effectiveness because they make explicit hypotheses, develop causal chains, and test for rival explanations. On the other hand, they found more participatory qualitative methods to be less suitable because “they do not set out to address attribution of cause and effect as explicitly.” Outcome Harvesting wasn’t even addressed in White and Phillips’ paper, but comparable methods like Outcome Mapping (OM) and Most Significant Change (MSC) were.
In our view, attribution is the wrong starting point for most MoFA programming. It’s unfortunate to see the IOB so singularly focused on attribution, when so much of the work MoFA funds has little hope of establishing clear causal attribution between the interventions delivered and the outcomes observed, as in the Dialogue and Dissent program. Frankly, (sole) attribution is difficult, if not impossible, to establish for complex, multi-actor work.
Moreover, the IOB’s wholesale dismissal of participatory methods is concerning. It displays a very narrow-minded appraisal of “rigor,” which we find to be both outdated and inaccurate.
Rick Davies’ response to the IOB’s critique of participatory methods like MSC was the following:
“MSC is about identifying changes that people value, not substantiating causal claims. Though once identified, causal mechanisms could then be investigated by one means or another.”
In our view, this is the right spirit in which to take the IOB’s critique. Identifying changes that people value is hugely important, and this is something which is also a key strength of Outcome Harvesting.
On the surface, it seems unfair that OH was singled out especially.
We also now understand that the Campbell Collaboration is advising the IOB on this issue. Given that it is a chiefly quantitative-focused organization, we wonder whether it is really the most appropriate body to advise on qualitative methods. But we look forward to hearing what is recommended. In practice though, we’ve seen too many sloppy Outcome Harvesting evaluations. As a donor said to one of us recently, “we’ve all been involved in a bad outcome harvest, haven’t we?”
Then, is there any merit to the IOB’s criticism? And perhaps more importantly, how should we react to and address this kind of criticism when it arises?
Outcome Harvesting and its discontents
An aversion to planning and under-investment in data collection
Outcome Harvesting sells itself as a method for which you can gather data ex-post. You often hear OH evaluators and program managers alike lauding this feature of the method. But in practice, it often seems to excuse a lack of planning and investment in effective monitoring and data gathering as the work is implemented. It’s relatively common to see a consultancy scope of work on Peregrine where the client hasn’t gathered much (or any) data, didn’t take the time to identify potential outcomes at the beginning (however loosely), and yet still wants to do an evaluation. We know of several organizations which misuse the word “complex” as philosophical cover to avoid serious planning and monitoring. The default in these scenarios seems to be — do an outcome harvest.
Evaluators cannot create data out of thin air, and the longer a program goes without data gathering tied to its intended (or unintended) outcomes and impacts, the harder it is to evaluate at any point, using any method. So, in these moments, where OH has become the preferred method, data to inform the evaluation is being created — often through participatory exercises (e.g., focus group discussions) by those closest to the program or intervention — and the program logic which may or may not have existed previously is retro-fitted to try to tell the story of what happened.
This isn’t the story of all Outcome Harvest evaluations, of course. The supposed impossibility of planning used to be a key tension point between Outcome Harvesting and Outcome Mapping, at least for some advocates of OH. Thankfully, there has been some more recent openness to complementarities, yet the above problems are partly an externality of that original fissure. A substantial proportion of OH practitioners and commissioners now do explicitly seek to gather at least some monitoring data, but it is nonetheless increasingly the method of choice for unplanned, unmeasured, and unmonitored social impact programs and policies. We need greater critical reflection around monitoring progress. “Complexity” is an erroneous excuse.
Substantiation: An Achilles’ Heel?
OH is sold as a participatory method that can be applied by and with teams — a huge relief in a world of methods where program implementers are largely kept at an arm’s length. Teams’ closeness to the process often makes OH an effective reflection exercise, leading to dynamic conversations and allowing unanticipated outcomes to emerge as part of the evaluation process. However, this proximity and how it is managed is part of what makes the method so predisposed to certain biases and blind spots.
In our view, the problem is not OH’s participatory character itself (participation can actually enhance rigor in various ways), but rather the method’s substantiation (i.e., verification) step. An evidence review of the use of OH within the Dialogue and Dissent program found that the vast majority of OH evaluations submitted skipped the critically important substantiation step.
The method’s originator, Ricardo Wilson-Grau, refers to substantiation options to verify accuracy and to deepen and broaden understanding of the outcome. Substantiation comes down to finding a third party to go on the record to corroborate, refute, or refine statements. These parties are supposed to be knowledgeable, authoritative, independent, and accessible. However, finding an appropriate third party is very challenging, as you need someone with sufficient information about the program/intervention but also enough distance to have not been involved in its implementation or ideation. The recommended email survey template, in particular, has lots of potential for courtesy and confirmation biases. Yet, there is little discussion of the quality of evidence in OH (i.e., its “probative value”). It is high time for more critical reflection on this issue.
To many outsiders of the method, skipping this step seems unbelievable, but in truth the method itself has always had a fairly agnostic approach to this step. Ricardo Wilson-Grau’s definitive book on the method notes that substantiation isn’t always necessary (p. 89). In his view, the process itself generates “highly reliable data about actual change in behavior that took place, and how an intervention contributed.” Yet, this is not always the case.
This agnostic guidance, and the propensity of so many organizations to skip or under-invest in the substantiation step, is a key weakness. Poor quality (or absent) substantiation was likely at least one major reason that the IOB was so concerned about OH. On this point, we tend to agree. Weak substantiation is Outcome Harvesting’s Achilles’ heel.
Reaction by Outcome Harvesters
With these questions in mind and decent knowledge of OH’s methodological strengths and weaknesses, we decided to join the Outcome Harvest Community of Practice discussion focused on reviewing and responding to the IOB’s omission of Outcome Harvesting from its list of recommended evaluation methods in 2021. Megan was in the middle of overseeing a massive OH evaluation at the time and Tom was supporting one of the grantees in the Dialogue and Dissent program to prepare for their final evaluation, which drew on several rounds of OH after their mid-term. But we have to confess, we also went with some sense of trepidation. We both felt a bit uncomfortable reading the back and forth over the community’s listserv and the conspicuous lack of reflexivity in the responses to the IOB’s decision.
We completely agreed that the IOB’s direct targeting of OH was unfair and believe that the Ministry had come to a decision about a method because of poor implementation and not because of unique flaws in the method. Yet, we also thought that the IOB’s wider points on validity, triangulation, and insufficient independence had some merit to them, and were worth reflecting on among the method’s leading practitioners. The meeting-goers then split up into breakout groups that roughly aligned with the major criticisms leveled by the IOB.
We each chose a different breakout group. The small group session on attribution was kicked off by a long-time OH practitioner who categorically rejected the IOB’s claims on this point. When one of us in the group responded, “shouldn’t we pause to consider whether there is any merit to their concerns?”, this same practitioner jumped in once again to dismiss this line of thinking. This community, a theoretical ‘safe space’ for understanding and supporting OH practice, turned out to have very little space for doubt among the faithful.
To our knowledge, the conversation has not evolved in the community after the workshop, but in our view, there needs to be greater critical reflection. While there has been some more recent openness to combine methods, a wider reflection on methodological weaknesses has not been forthcoming. It is incumbent on all methods communities to do this.
Despite this, both of us continue using OH and plan to for the foreseeable future. We also continue to have great respect and appreciation for the work done by and learning supported via the OH Community of Practice. Yet, we are convinced that these ‘safe spaces’ should be used not only for learning about how to apply a method, but also for critique to improve a method and its practice, to maximize its strengths as well as shore up its weaknesses.
Single-method worship, your time is near
No approach, tool, method, or technique is perfect. There are no gold standards. However, evaluation “brands” often act as if they had miracle cures and as if their evaluators were members of an exalted priesthood. Of course, you might immediately think of the Randomistas, whose fervor for Randomized Controlled Trials (RCTs) never let a pesky thing like being ‘a completely ill-suited method for evaluating complex change’ stand in the way of their evangelism and determination to apply them anyway, despite the many ethical concerns that arise with applying the method in social science research. Yet this issue is not confined to the Randomistas. The exchanges on the realist evaluation listserv (RAMESES) alone are testament to a comparable credal observance and method policing. Some prominent Bayesian process tracers appear to have much of the same misplaced zealotry.
But where does this zealotry come from? Why can’t evaluation itself be our ‘higher calling’, with various and continually refined methods pulled in and used to fulfill this purpose? Why wouldn’t evaluators be the FIRST to recognize (and react to) shortcomings of each method they use and propose?
Our friend and fellow consultant, Dave Algoso, had this to say in late 2020:
“Consultants and academics alike face real economic incentives to carve out their space and convince people that the particular nuances of their methods make a difference. It ties to the idea of expertise, that the more arcane and jargon-filled your approach, the more advanced it must be.”
These incentives create problems; they make us more insular and more hostile to criticism. Perhaps even worse are the consequences these reactions have for how the evaluation sector as a whole is regarded and relied upon. While Outcome Harvesting does not have the same notoriety for self-regard as many Realists or Randomistas, we fear that the Outcome Harvesting community fell into this trap in failing to critically engage with the IOB’s critique. Though we hesitate to recommend the (re-)emergence of the method police, we think that committed practitioners should regularly review how a method is used, seeking both to improve its application and to discard or disavow its traps and failures. And we can both fully appreciate how and why financial and professional pressure might result when a major funder disavows a primary pillar of your livelihood. At the end of the day though, reactions like this aren’t helping us to move forward, nor do they make evaluators better partners for social change.
So, we’re here to say enough (again and still) of attaching yourself to a single method, evaluators. Even more importantly, enough with the endless promotion of one method as superior to all the rest! Every method has tradeoffs, weaknesses, and limitations. Costs, context, questions, scope, and level of certainty required will lead you down different paths and methodological considerations. For the modern evaluator, we guess we’re saying that the “gold standard” should be using the most appropriate methods for the questions at hand. In this context, the industry’s “model evaluator” would be method agnostic, prioritizing the ability to pluck from the wide-ranging methodological spectrum and comfortably blend and adapt these methods to serve the questions and moment at hand.
Ruth Levine, one of our favorite evidence champions, issued a rallying call to the nearly 1,700 evaluators, researchers, and other participants gathered for USAID’s Evidence Day in 2017:
“We have to lift this conversation above internecine debates about methods […] We have to assert, strongly, persistently that failing to use facts and evidence in decision making about matters of consequence is not only dumb, but wrong — deeply, irretrievably wrong.”
Unfortunately, in the years since this gathering, we in the evaluation community appear to have made little progress in the method wars. This rallying call must remain at the forefront of the evaluation field’s agenda if we are to have any hope of being the kind of partners that social change actors deserve.