“Real” process tracing, part 5: evidence

Thomas Aston
Dec 28, 2020
Wassily Kandinsky, Fragment 1 for Composition VII

One key dimension I wanted to cover in this dialogue between Realist Evaluation (RE) and Process Tracing (PT) is evidence. I’ve already given an account of what good evidence looks like in PT using the example of a murder case. So, this blog will consider the merits of RE’s approach to evidence and then briefly compare it with PT.

Contrary to the comments made on the RAMESES list recently, asking how to address the weight of evidence in RE is not a silly question at all. It is a very reasonable question that merits comment and discussion.

Context-dependent explanation and evidence

Ever the contrarians, some realists have called into question the concepts of evidence, validity, and rigour. In Thomas Schwandt’s Dictionary of Qualitative Inquiry, evidence is ‘information that has a bearing on determining the validity of a claim.’

Yet, for Joseph Maxwell (2012), evidence is relational to a particular claim (theory, hypothesis, etc.). It can’t be assessed in a context-independent way. Maxwell (2012) thus argues that whether something counts as evidence depends on how the fact is obtained and on the plausibility of alternative claims. No disagreement here.

Perhaps unknowingly, this mirrors how Process Tracing views evidence: assessing the quality of evidence in a context-dependent way, linked to a particular hypothesis, and assessing the plausibility of alternative claims are all central to it. Not all evidence is equal. Some is very powerful with respect to your hypothesis (or theory) whereas other evidence may be very weak. This inferential power is referred to as “probative value.” I mentioned this in a previous blog on sampling and discussed it at length in the blog Miracles, false confessions, and what good evidence looks like.

In my view, Maxwell (2012) rightly challenges a context-free and method-based understanding of evidence quality and validity. He suggests that we should question the positivist criteria of internal validity, external validity, and reliability. Instead, he argues that criteria such as credibility, authenticity, and transferability might be more appropriate for assessing complex processes of change. Maxwell’s (2012) perspective is that validity should ‘pertain[…] to the accounts or conclusions reached by using a particular method in a particular context, not the method itself.’

Probative value is also one of three criteria Schwandt recommends in his chapter of the book What Counts as Credible Evidence in Applied Research and Evaluation Practice?, alongside credibility and relevance. While probative value is a really useful concept for identifying credible evidence and uncovering context-dependent mechanistic explanation, I’ve never seen it discussed in realist research. We’re advised to sample based on mechanistic potential (Emmel, 2013), and this is pretty close to the concept of probative value. So, it seems a big missed opportunity that Maxwell (2012) failed to mention it. Perhaps surprisingly, the strength of evidence behind inferences is also given extremely light treatment in the RAMESES reporting standards (see Wong et al. 2016). Rigour seems to be based more on philosophical fit than on the causal leverage which comes from the evidence itself.

Yet one could argue that “rigour” itself may be up for challenge. Hallie Preskill and Jewlya Lynn, for instance, argue that we should redefine rigour for evaluation in complex adaptive settings. They propose the criteria of:

  1. Quality of thinking: The extent to which the evaluation’s design and implementation engages in deep analysis that focuses on patterns, themes, and values, seeks alternative explanations and interpretations, grounded in the research literature; and looks for outliers that offer different perspectives.
  2. Credibility and legitimacy of claims: The extent to which the data is trustworthy, including the confidence in the findings; the transferability of findings to other contexts; the consistency and repeatability of the findings; and the extent to which the findings are shaped by respondents, rather than evaluator bias, motivation, or interests.
  3. Cultural responsiveness and context: The extent to which the evaluation questions, methods, and analysis respect and reflect the stakeholders’ values and context, their definitions of success, their experiences and perceptions, and their insights about what is happening.
  4. Quality and value of the learning process: The extent to which the learning process engages the people who most need the information, in a way that allows for reflection, dialogue, testing assumptions, and asking new questions, directly contributing to making decisions that help improve the process and outcomes.

I hadn’t thought hard enough on this, but I was recently challenged on my overly narrow and perhaps somewhat traditional view by Chris Roche. And I think he was right to challenge me. What constitutes good evidence, rigour, and validity is perhaps more contestable than we might assume.

Indeed, this might speak to a wider concern of hijacking the very meaning of impact evaluation: focusing too much on “causal analysis,” and not enough on description, valuation, explanation, or prediction. I might myself be guilty of contributing to this. In any case, all of the above refer in some way to what evidence we have behind the claims we make.

Evidentiary fragments

How we understand the weight of evidence isn’t merely a matter of volume. It can also be about density or depth. If we’re concerned not only with measurement, but also meaning, then the strength of evidence is not determined, simply, by having more of it. This perspective led Ray Pawson to argue that rather than entire studies, evidential fragments should be the unit of analysis for realist syntheses. The point here is that embedded within even “poor” studies there might be good evidence that reveals new insights about mechanisms.

Pawson et al. make the perceptive point that:

‘Excluding all but a tiny minority of relevant studies on the grounds of “rigour” would reduce rather than increase the validity and generalisability of review findings since different primary studies contribute different elements to the rich picture that constitutes the overall synthesis of evidence’ (Pawson et al. 2004: 20).

They are certainly right to underscore that there is no legitimate hierarchy of evidence which can be judged purely on the methods employed. Randomised Controlled Trials (RCTs) do not necessarily sit at the top, nor do descriptive case studies and opinion pieces necessarily sit at the bottom. Indeed, even “good” studies may well have poor evidence for answering the questions you have. Particularly if you’re looking for evidence on mechanisms, even Nobel laureates Banerjee, Duflo, and Kremer might have quite poor evidence to offer, as I found out recently looking at how sanctions might contribute to education outcomes (or not). Why? Because RCTs typically lack an explanation (or even a description) of these connections. In fact, the nearest thing to mechanistic evidence they offer is quite commonly little more than the authors’ speculations based on gaps in their own evidence.

When we talk of a hierarchy of evidence, I can’t help but think about the hierarchy of genres in art and how these have evolved over time (I’m an art fan. Humour me). History painting captured the most important subjects of religion and mythology (equating to RCTs), portrait painting came next (quasi-experimental designs), followed by genre painting (equivalent to before and after studies), landscapes (descriptive case studies), animal painting, and then still life (opinion pieces). Of course, this all changed. Portraits came to be seen as worthy of consideration as monarchs were deposed and societies secularised, landscapes rose with the Grand Tour expeditions of the British upper classes, and Impressionism gained acceptance as many people became acquisitively middle class.

It’s right to question the hierarchy and even (perhaps) to cherry-pick good evidence from bad studies, but the question then becomes how we define “good evidence.” Maxwell (2012) draws our attention to different types of validity (descriptive, interpretive, theoretical, and evaluative) to demonstrate that the terrain is more complex than it often appears. Failing to reflect on the probative value of evidence presents two potential risks for RE, however.

The first risk is failing to consider seriously enough how the fact is obtained and what this means in context. The big problem with realist interviews (as I’ll discuss in the next blog) is that if you reveal your theory to an interviewee, you would generally expect them to confirm it. Unless one can somehow control the various respondent biases wrapped up in the teacher-learner cycle, testimony elicited from realist interviewing will tend to have low probative value.

The second is the risk of failing to distinguish between descriptive and explanatory (or causal) evidence. The problem with eliminating a hierarchy of evidence based on methods is that you might fail to distinguish evidentiary “nuggets of wisdom” which are merely descriptive (“straw-in-the-wind,” in PT test language) from those which likely have higher explanatory power (“hoop tests” and “smoking guns”). Whether something is a nugget or not ultimately relies on its probative value for whatever theory one is developing, refining, or testing.

Bayesian logic

Finally, extending Maxwell’s (2012) view that whether something counts as evidence depends on how the fact is obtained, Process Tracing offers another useful concept: Bayesian logic. Essentially, Bayesian logic is about updating our prior beliefs and our confidence in a causal explanation in light of new information. Here’s a really accessible video on Bayes’ theorem for those that are interested. Developing and refining theory is precisely about updating our prior beliefs and revising our confidence in particular causal explanations, so it’s surprising to me that (to my knowledge) this connection hasn’t been made before.

Process Tracing’s evidence tests are now increasingly argued to be underpinned by Bayesian logic. You can combine Bayesian logic with probative value: you update your confidence in your hypothesis based on what different forms of evidence (and how they are obtained) can do for your hypothesis. Does the new evidence point towards disconfirming your theory (“hoop test”) or confirming your theory (“smoking gun”)? What can the way the evidence was obtained tell you about how credible it is? What does that new information do for your hypothesis?
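
To make the updating concrete, here is a minimal sketch in Python of how a hoop test and a smoking gun might shift confidence in a hypothesis. Everything in it is an illustrative assumption rather than part of any Process Tracing standard: the function names are hypothetical, and the prior and the likelihoods are made-up numbers chosen only to show the asymmetry between the two tests.

```python
def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' theorem: P(H | E) from the prior P(H) and the two likelihoods."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1 - prior))


def posterior_if_absent(prior, p_e_given_h, p_e_given_not_h):
    """Posterior when we looked for the evidence and did NOT find it."""
    return posterior(prior, 1 - p_e_given_h, 1 - p_e_given_not_h)


PRIOR = 0.5  # assumed starting confidence in the hypothesis

# Assumed likelihoods: (P(evidence | H true), P(evidence | H false)).
# Hoop test: we almost certainly see the evidence if H is true,
# but we might well see it anyway if H is false.
hoop = (0.95, 0.60)
# Smoking gun: we often will not see the evidence even if H is true,
# but we almost never see it if H is false.
smoking_gun = (0.30, 0.02)

for name, (p_h, p_not_h) in [("hoop test", hoop), ("smoking gun", smoking_gun)]:
    found = posterior(PRIOR, p_h, p_not_h)
    absent = posterior_if_absent(PRIOR, p_h, p_not_h)
    print(f"{name}: found -> {found:.2f}, not found -> {absent:.2f}")
```

With these assumed numbers, passing the hoop test only nudges confidence up (to roughly 0.61), while failing it drops confidence sharply (to roughly 0.11). Finding the smoking gun pushes confidence up to roughly 0.94, while not finding it barely hurts (roughly 0.42). That asymmetry is why a hoop test is mostly useful for disconfirming and a smoking gun for confirming.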

In particular, your evidence may have more confirmatory power when you do not expect to find that evidence. It’s powerful evidence, in part, because you’re unlikely to find it. For instance, an interviewee isn’t likely to tell you something that appears to be against their own personal interests. If, let’s say, the current President of the United States were to privately reveal to an aide on camera that he had colluded with Russia to win the election, we would be quite likely to believe it. We would believe it because of how the evidence was obtained; because of its context and the incentives the President has to hide such information, rather than reveal it. However, if he were to tell us in public that he didn’t collude with Russia, we wouldn’t give that testimony much weight. Why? Because that’s exactly what we’d expect him to say.
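
One way to put rough numbers on this intuition is the likelihood ratio: how much more likely we are to see a piece of evidence if the hypothesis is true than if it is false. The sketch below is only an illustration. The function names are hypothetical and every probability is invented for the example, not an estimate of anything real.

```python
def likelihood_ratio(p_e_given_h, p_e_given_not_h):
    """How many times more likely the evidence is under H than under not-H."""
    return p_e_given_h / p_e_given_not_h


def update_with_ratio(prior_prob, ratio):
    """Odds form of Bayes' theorem: posterior odds = prior odds * likelihood ratio."""
    prior_odds = prior_prob / (1 - prior_prob)
    posterior_odds = prior_odds * ratio
    return posterior_odds / (1 + posterior_odds)


PRIOR = 0.2  # assumed prior confidence that collusion happened

# Private admission caught on camera: rare even if collusion happened (assumed 1%),
# but vanishingly rare if it did not (assumed 0.01%).
admission = likelihood_ratio(0.01, 0.0001)

# Public denial: almost certain whether or not collusion happened (both assumed).
denial = likelihood_ratio(0.99, 0.999)

print(f"admission: ratio ~{admission:.0f}, "
      f"confidence -> {update_with_ratio(PRIOR, admission):.2f}")
print(f"denial: ratio ~{denial:.2f}, "
      f"confidence -> {update_with_ratio(PRIOR, denial):.2f}")
```

Under these assumptions the admission has a likelihood ratio of about 100 and lifts confidence from 0.2 to roughly 0.96, whereas the denial has a ratio close to 1 and leaves confidence at about 0.20. The admission is powerful precisely because we would so rarely expect to see it.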

That’s the power of Bayesian logic and probative value. I hope I’ve convinced you that these concepts can potentially help make the most of evidentiary fragments and nuggets, and shed light on mechanisms in a more compelling way. And if you still don’t believe me, read my past blog Miracles, false confessions, and what good evidence looks like. You might also be interested in the Quality of Evidence Rubrics I put together recently, or a seminar I just did for the Centre for Development Impact on this topic.

In the final blog in the series, I will look at RE’s approach to interviews, and question whether this may be a bridge too far.



Thomas Aston

I'm an independent consultant specialising in theory-based and participatory evaluation methods.