Fu Baoshi, Scholar Strolling

Balancing biases in evaluation

Thomas Aston

--

I’ve read a lot on bias recently, and I figured it’s time we had an honest (and proportional) conversation about it.

As Megan Colnar and I discussed recently, in 2020, the Policy and Operations Evaluation Department (IOB) at the Dutch Ministry of Foreign Affairs updated their Evaluation Quality Criteria. This included a pointed critique of Outcome Harvesting and other participatory evaluation methods. It relied heavily on an important paper by Howard White and Daniel Phillips on addressing attribution of cause and effect in small n impact evaluations. In a recent paper on the unfinished evidence revolution, Howard White doubled down on this critique of certain small-n methods, and I wanted to take a look at that critique.

According to 3ie’s quality appraisal checklist for systematic reviews, ‘bias is a systematic error or deviation from the truth in results or inferences.’ From the guidance on how to use the risk of bias tool (ROBIS), we find that ‘bias occurs if systematic flaws or limitations… distort the results.’ So, ostensibly, bias is bad and should be avoided at all costs.

3ie tell us that the main types of bias in systematic reviews are related to selection, performance (contamination), attrition, detection, and reporting. ROBIS focuses on study eligibility criteria, study identification and selection, data collection and study appraisal, and synthesis and findings. Other tools also commonly mention publication bias. So, bias is clearly a big deal for quants.

ROBIS, n.d.

Small-n biases

Yet, small-n biases are somewhat different. These are often cognitive biases. The most common arguably fall into three broad types: (1) selection biases, (2) respondent biases, and (3) evaluator biases. Selection biases refer to the non-representative choice of cases and respondents (i.e., casing/sampling). Respondent biases include things such as self-serving bias, social acceptability bias, and courtesy bias. And evaluator biases typically include things such as confirmation bias, contract renewal bias, and friendship bias.

In their review of small n impact evaluations, White and Phillips came to the conclusion that some methods were better than others at making causal claims and reducing small-n biases. They recommended Realist Evaluation, General Elimination Methodology, Process Tracing, and Contribution Analysis. However, they recommended against Most Significant Change, the Success Case Method, the Method for Impact Assessment of Programs and Projects (MAPP), and Outcome Mapping.

They argued that the unrecommended methods don’t make causal explanation their primary goal, and that when they do attempt it, they ‘rely on the actions and reported experiences and perceptions of stakeholders in order to do so.’

This is therefore a critique of participatory methods in general, given the assumption that the reported experiences and perceptions are ineluctably biased, and thus distort the results. The IOB made a similar point about triangulation and independence in Outcome Harvesting. Megan Colnar and I pointed out that there are some potential weaknesses in Outcome Harvesting in relation to the substantiation step, as I’ll discuss below. But, to reject a method wholesale seems unwarranted. And indeed, as participatory approaches solicit multiple perspectives, they tend to enhance triangulation anyway. Rigour can actually be inclusive, and humane.

Is there really a hierarchy of methods?

White’s critique is wide-ranging. He criticises inductive approaches and the apparent lack of transparency in qualitative methods, in general. Neither criticism is particularly well founded, in my view. Deductive approaches also have various flaws (whether quantitative or qualitative), and it’s not as though quantitative methods are somehow immune from transparency issues.

White then goes on to suggest that an “effects of causes” approach to evaluation is a prevalent source of bias. He sees an “effects of causes” approach as asking questions such as “we did this, what happened as a result?” On the other hand, he argues that the alternative “causes of effects” approach poses questions such as “this happened, what explains it?”

White doesn’t tell us what his source for these definitions is, but the one colleagues and I used in a recent article comes from Gary Goertz and James Mahoney (2012: 41–42). They argue that an effects-of-causes approach comes from a ‘quantitative culture…estimating the average effects of particular variables within population samples.’ A causes-of-effects approach, on the other hand, stems from a ‘qualitative culture… [It] start[s] with events that have occurred in the real world and move[s] backwards to ask about the causes… [assessing] whether factors are necessary or jointly sufficient for specific outcomes.’

White boils the supposed problem down to what he considers a “crude” interview question such as ‘I am working for Agency X to conduct an evaluation of Project Y. What has been the impact of Project Y?’ He’s right that leading questions suffer from courtesy bias. What is less immediately clear is what this might have to do with a method hierarchy. So, let’s take a look at that.

One of White’s preferred methods is Realist Evaluation. It’s a method with various advantages, as I’ve discussed at length in the past. But, realist interviews have serious problems with confirmation, courtesy, and agreement bias. The realist interview’s “teacher–learner cycle” is an object lesson in leading questions.

In contrast, in their Total Quality Framework book, Margaret Roller and Paul Lavrakas remind us to:

‘Moderate the specificity conveyed regarding the purpose of the study before an interview, and that they [interviewees] should rarely be told about the specific hypothesis being studied because interviewees may respond to questions by telling interviewers what they think interviewers want to hear.’

In addition to this baked-in courtesy bias, the risk of confirmation bias in Realist Evaluation stems from its focus on testing specific hypotheses, rather than from the absence of a theory to test (as in methods like Outcome Harvesting). Hence, more inductive and/or theory-blind methods can be preferable where this kind of bias is a concern.

White notes that Outcome Harvesting and the Qualitative Impact Protocol (QuIP) both adopt a causes-of-effects approach. As colleagues and I discussed, most of the qualitative approaches mentioned above also take such an approach (at least by Goertz and Mahoney’s definition).

White suggests that Outcome Harvesting asks stakeholders what their main achievements are and then constructs and tests a causal chain from the intervention to the outcomes. This is not entirely accurate. In fact, you write an outcome statement and a contribution statement separately, and there is no explicit causal chain (something which Vaessen et al. 2020 consider a key weakness of the method). See below:

Wilson-Grau and Britt, 2012: 13

Wilson-Grau and Britt’s (2012) guidance further points out that ‘it may be useful to include other dimensions such as the history, context, contribution of other social actors, and emerging evidence of impact on people’s lives or the state of the environment.’ So, you don’t (or shouldn’t) make claims in a vacuum. In my view, a lot of the perceived problems with Outcome Harvesting are as likely to be due to poor-quality evaluations as to inherent problems with the method itself.

I think that Outcome Harvesting does have some potential problems of courtesy, confirmation, and agreement bias (as well as question order bias), particularly related to the email survey format below:

For me, this format suffers from the same problem as the realist interview: you present your outcome statement and then ask respondents to what degree they agree that the information is accurate. It leaves more latitude to disagree than a realist interview does, because it asks the respondent to explain any disagreement explicitly. But when you ask people whether they agree or not (especially when two of the response options are forms of agreement and only one is disagreement), there is a higher chance they will agree (agreement bias). For this reason, I would conduct an interview instead of an email survey, and I wouldn’t tell the interviewee what the outcome statement was, because this is similar to telling them what my hypothesis was.

But, Richard Smith (my own Outcome Harvesting trainer) had a different view. He told me that, in his experience, people feel more confident disagreeing by email than in person. So, for some people, an email survey may be better than an interview at reducing courtesy bias. I think we’re both right. And this made me think that we’re not really eradicating biases (which seems impossible in any evaluation or research), but balancing out different biases when we use different methods.

This is mostly about specific tools rather than methods, per se. Roller and Lavrakas suggest that social desirability bias may be less likely in a telephone interview because this mode puts less pressure on the interviewee than an in-person interview. And yet, in an online in-depth interview, respondents are less likely to reveal sensitive information. So, there is no unimpeachable tool or mode of data collection. There are usually trade-offs.

White argues that QuIP has some advantages over Outcome Harvesting. For me, its main strength is in its coding structure/process. White highlights that, in QuIP, interviews are explicitly oriented around the question ‘this change happened, what may have caused it?’ This seems very sensible. In many respects, all that is required is an open, rather than a closed, question. There is nothing unique to any particular method about asking open interview or survey questions based on a causes-of-effects approach. It may be that open questions are more common in Process Tracing and QuIP than in Realist Evaluation and Outcome Harvesting, but there’s no intrinsic (epistemic) reason why this need be the case.

However, White points out that QuIP is also designed in such a way that neither the interviewer nor the respondent is aware of the intervention being evaluated (they are “blindfolded”). Blindfolding also features in Veil of Ignorance Process Tracing (VoIPT). This innovation was met with some hostility by several of the most eminent process tracers. I personally disagreed with much of Tasha Fairfield’s critique (which uncritically marketed her alternative approach), but I found Derek Beach’s critique of blindfolding convincing. His response begins by challenging the suggestion that confirmation bias is an endemic problem in Process Tracing. I don’t think it is either. In fact, the explicit assessment of rival claims would seem to count against this.

Beach went on to argue that using inexperienced research assistants (RAs) to collect and code empirical material is problematic. For Beach:

‘[An] inexperienced RA would lack the case-related knowledge to be able to do a good interview (e.g., follow-up questions), and more critically, would not be able to engage in proper source criticism of interviews, archival documents, or secondary sources — all of which would contain implicit or explicit bias that requires significant theoretical and empirical knowledge to evaluate. If it was an important factor in the changes it is meant to have affected, then it will come up.’

I think it’s hard to argue with this. On one hand, we’re trying to reduce the potential confirmation bias which might arise from an RA knowing the theory (or even the intervention) and seeking out answers that would fit such an explanation. This is definitely sensible, in principle. I’ve certainly had RAs who knew the intervention and, despite an open question template designed to avoid leading questions, got frustrated and asked leading questions anyway (so I couldn’t use the findings without massive caveats). But, on the other hand, when an RA doesn’t really know what they’re looking at (or for), the quality of the interview is likely to be more limited. I’d be interested to hear how the QuIP designers handle this challenge in practice.

Hence, it’s worth reflecting on whether the risk of confirmation bias might actually be lower than the risk of a false negative. Indeed, how much added value might blindfolding bring that you wouldn’t get from more open questions and a more competent (and honest) interviewer? Or might there be other ways to limit the information shared with interviewees (ethically), and not tell them your theory, while still sharing relevant information with RAs so that they can ask the most useful questions to test a theory?

Again, I think that we’re talking about balancing different biases here. No method has all the answers.

For White, ‘the most important bias is probably simply ignoring other possible explanations of changes which may otherwise be attributed to the project.’ Some methods are better than others at countering this. The General Elimination Methodology and Process Tracing are explicitly geared towards assessing rival hypotheses (or claims). But, while this is built into these methods, there’s nothing stopping other methods from doing the same. Isn’t this what we do in research? For instance, in the early stages of a realist interview you can explore rival theories. There’s also no obvious reason why you couldn’t have a ‘strongly disagree’ response option and a follow-up question in the Outcome Harvesting survey template (or an interview) to ask what else might explain the change (as Jess Dart did). Or you can combine Outcome Harvesting with Process Tracing, as I’ve done in the past. You don’t necessarily have to share the explanation with respondents anyway. As I’ve argued above, these are mostly choices related to specific tools rather than methods as a whole.

Where I agreed with Fairfield in her critique (albeit, in my view, unrelated to VoIPT) was her point that ‘dishonest scholars can always find ways to be dishonest, regardless of whatever constraints are imposed by the discipline.’ I’d say that a lack of integrity eats bias for lunch. So, yes, let’s take biases seriously, whatever method (or combination of methods) we’re using. But let’s not impose a strict hierarchy of methods which simply reproduces the anchoring biases (often, with quite poorly founded assumptions) of our own preferred methods.

--

Thomas Aston

I'm an independent consultant specialising in theory-based and participatory evaluation methods.