Miracles, false confessions, and what good evidence looks like

Thomas Aston
Dec 21, 2020


Yves Klein, Ant 88

In my last blog, I wrote about how power, politics, and prestige (and perhaps inertia) may be getting in the way of improving monitoring, evaluation and learning together. These power plays have led us to believe there is a gold standard of scientific evidence. In this blog, I want to persuade you there’s another way to think about evidence which is just as powerful, and which employs a logic we use every day.

Making a Murderer

Last year I got hooked on the Netflix show Making a Murderer, an American true crime documentary which tells the story of Steven Avery, a working class man from a small town in Wisconsin who served 18 years in prison after being wrongfully convicted of sexual assault and attempted murder. He was later charged with and convicted of the murder of another woman, with his teenage nephew convicted as an accessory to the murder.

Overall, it’s an extraordinary case of conviction based on what looks like thin evidence. Like The Innocent Man, the John Grisham book which was also turned into a Netflix show, it’s as much a tale of rustbelt America and class prejudice as it is about murder. Ringing in my ears is the phrase from the defendant’s father: “poor people lose. Poor people lose all the time.” In any case, both shows got me thinking anew about what good evidence looks like.

In the book What Counts as Credible Evidence in Applied Research and Evaluation Practice?, I was expecting to find all the answers. I didn’t find them, but Michael Scriven was as lucid as ever in his chapter, and Thomas Schwandt’s suggestion that we should actually be looking for a combination of credibility, relevance, and probative value rang true.

Process tracing offers the opportunity to assess how firm proof might be through evidence tests. These tests allow you to see how far general causal indications stand up to scrutiny, to help confirm a specific explanation and to reject rival explanations. A common way to describe process tracing is to see it as akin to the work of a detective like Sherlock Holmes investigating a crime or a lawyer presenting evidence to a jury.

To do this in process tracing, you (i) predict what (typically) observable evidence you would expect and/or hope to find if your explanation of who did it were true, and assess its “probative value”; (ii) gather empirical evidence; and (iii) assess whether you can trust the evidence found for each key step in the case (Beach and Pedersen, 2019: 4, 178). You can then make a judgement about how confident you are that your explanation is correct.

Looking for a miracle

In process tracing, different pieces of evidence are classified and graded on the basis of their supposed inferential power or “probative value.” Effectively, this is how well the evidence fits your hypothesis compared with how well it fits rival explanations. You can find a longer explanation of probative value and Bayes’ formula, which underpins this logic, in a three-part blog from Gavin Stedman-Bryce here.

A little known fact is that Bayes’ formula, the formula which helps scientists assess the probability that something is true based on new data, was formulated by a Presbyterian minister in order to identify what evidence would be strong enough to prove god’s existence. That is to say, evidence of miracles. I think we can all agree that miracles are pretty uncommon. Today, Bayes’ formula underpins much of statistics and data science, but also some forms of process tracing. Judea Pearl has a nice account of this in The Book of Why: The New Science of Cause and Effect.
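To make the logic concrete, here is a minimal sketch of Bayes’ formula in Python. The function name and all of the numbers are hypothetical, chosen only to illustrate the point that evidence is strong when you would rarely see it if your explanation were false.

```python
def posterior(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' formula: P(H | E) = P(E | H) * P(H) / P(E).

    prior               -- confidence in the hypothesis before seeing the evidence
    p_evidence_if_true  -- chance of seeing this evidence if the hypothesis is true
    p_evidence_if_false -- chance of seeing this evidence if the hypothesis is false
    """
    numerator = p_evidence_if_true * prior
    p_evidence = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / p_evidence


# Hypothetical numbers, for illustration only.
# Evidence that would turn up almost as often if the hypothesis were false
# barely moves you from a 50/50 starting point.
common = posterior(prior=0.5, p_evidence_if_true=0.9, p_evidence_if_false=0.8)

# Evidence that would almost never turn up if the hypothesis were false
# (a "miracle" under the rival explanation) moves you a long way.
rare = posterior(prior=0.5, p_evidence_if_true=0.9, p_evidence_if_false=0.01)

print(f"common evidence: {common:.2f}")
print(f"rare evidence:   {rare:.2f}")
```

The second call differs from the first only in how likely the evidence would be if the hypothesis were false, which is what probative value is about.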

In a murder case, you might expect to find evidence that the accused can be placed at the scene of the crime, that the timing matches, that the police sketch looks like the suspect. These might be “hoop tests,” because if the suspect can’t be placed at the scene of the crime or the timing doesn’t match, for example, you should probably rule out the suspect. You might also call a hoop test a “plausibility probe.” Ultimately, it’s evidence you need, because if you don’t find it or this evidence doesn’t fit, your explanation is implausible and your case falls apart. “Hoop tests” are thus useful to disconfirm a hypothesis, but aren’t enough to confirm it.

You might then search for more compelling, but far less common, evidence that specifically links the accused to the details of the crime, such as fingerprints, eyewitness testimony or even confessions. This is evidence you hope to find but don’t necessarily expect to find. This is a “smoking gun” test. Fingerprints ought to be highly specific to a particular person (as an identical twin, I should know). Ultimately, it’s evidence that’s unique. Murderers don’t always leave fingerprints, so they’re less common. Of course, seeing a photo or video of the suspect’s smoking gun aimed at the recently deceased would be even less common evidence. It is this that makes it strong evidence. Together, these two types of evidence (what you expect to find and what you hope to find) allow us to more accurately adjudicate between rival explanations. Adapting Fairfield and Charman’s language, not failing hoop tests whispers in favour of a hypothesis but passing a smoking gun test allows you to shout in favour of a given hypothesis (Befani and Stedman-Bryce, 2016; Fairfield and Charman, 2017).
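The difference between whispering and shouting can be sketched with the same Bayesian arithmetic. This is only an illustration: the `update` helper and every probability below are assumptions made up for the example, not figures from any of the sources cited.

```python
def update(prior, p_e_if_true, p_e_if_false, found):
    """Return the updated confidence in a hypothesis after looking for evidence.

    If the evidence was not found, the update uses the probabilities of
    *not* seeing it under each scenario.
    """
    if not found:
        p_e_if_true, p_e_if_false = 1.0 - p_e_if_true, 1.0 - p_e_if_false
    numerator = p_e_if_true * prior
    return numerator / (numerator + p_e_if_false * (1.0 - prior))


PRIOR = 0.5  # hypothetical starting confidence that the suspect did it

# Hoop test (e.g. suspect was near the scene): almost certain if guilty,
# but also fairly likely for plenty of innocent people.
hoop_pass = update(PRIOR, 0.95, 0.60, found=True)
hoop_fail = update(PRIOR, 0.95, 0.60, found=False)

# Smoking gun test (e.g. fingerprints on the weapon): you only hope to find
# it if guilty, but it is very unlikely if innocent.
gun_pass = update(PRIOR, 0.30, 0.02, found=True)
gun_fail = update(PRIOR, 0.30, 0.02, found=False)

print(f"hoop passed: {hoop_pass:.2f}   hoop failed: {hoop_fail:.2f}")
print(f"gun passed:  {gun_pass:.2f}   gun failed:  {gun_fail:.2f}")
```

With these made-up numbers, passing the hoop test nudges confidence up only a little while failing it is close to fatal; passing the smoking gun test raises confidence sharply while failing it costs relatively little.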

Evidence like a matching time and place is just circumstantial, and a likeness is just a similarity. It’s rarely enough to convict someone, even when taken together as an accumulated body of evidence, as you could still ask “so what?” However, fingerprints found at the scene of the crime, eyewitness testimony, and confessions often are deemed sufficient to convict someone.

Evidence that can rule out your explanation should be taken a lot more seriously than evidence that can help establish that your story of events is plausible. So, in a murder case, this might be evidence of another person’s fingerprints or a confession by somebody else admitting they committed the crime rather than the suspect.

However, for project evaluation, most available evidence is likely to help you show that what you did happened as you say it did (hoop tests) rather than whether there is a unique connection between what you did and what caused change (smoking guns). As a rule of thumb, the more unique your evidence is to your intervention, the better. To go back to the murder example, evidence that is uniquely traceable to you pretty much rules out other suspects. See here for an example of potentially forged meeting records in development projects in Bangladesh, which illustrates why documentary evidence often isn’t very unique.

Below you can see the formal evidence tests proposed by Van Evera (1997), who also used the example of murder, adapted by David Collier (2011):
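As a rough sketch, the four tests are commonly summarised by whether passing the test is necessary and/or sufficient for affirming the causal inference. The Python names and the wording of the implications below are my own paraphrase and should be read as assumptions, not as a reproduction of the original table.

```python
from typing import NamedTuple


class EvidenceTest(NamedTuple):
    name: str
    necessary: bool   # must the hypothesis pass this test to survive?
    sufficient: bool  # does passing this test on its own confirm the hypothesis?


# The four tests, classified along the two dimensions.
TESTS = [
    EvidenceTest("straw-in-the-wind", necessary=False, sufficient=False),
    EvidenceTest("hoop", necessary=True, sufficient=False),
    EvidenceTest("smoking gun", necessary=False, sufficient=True),
    EvidenceTest("doubly decisive", necessary=True, sufficient=True),
]


def classify(necessary: bool, sufficient: bool) -> str:
    """Name the test that matches a necessary/sufficient combination."""
    for test in TESTS:
        if (test.necessary, test.sufficient) == (necessary, sufficient):
            return test.name
    raise ValueError("no matching test")


def implication(test: EvidenceTest, passed: bool) -> str:
    """Paraphrased consequence for the hypothesis of passing or failing a test."""
    if passed and test.sufficient:
        return "hypothesis confirmed"
    if not passed and test.necessary:
        return "hypothesis eliminated"
    return "hypothesis somewhat strengthened" if passed else "hypothesis somewhat weakened"


for t in TESTS:
    print(f"{t.name:18} pass: {implication(t, True):33} fail: {implication(t, False)}")
```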

These formal tests can certainly help structure your thinking, but at the end of the day, you’re making judgements based on whatever paradigm you come from and beliefs you hold. What evidence you’re looking for reflects this frame, and how you adjudicate its value also reflects what is deemed credible and relevant to a particular audience. There is always some form of bias and power dynamics involved.

Fingerprints are just traces

The Yves Klein painting above almost certainly has a trace of the model, but in fact, traces often diminish as time passes, just as memory fades. By pure coincidence, my cousin happens to be a forensic anthropologist who studies fingerprints.

Forensics, such as fingerprint analysis, are a crucial area for investigation, as is the search for eyewitnesses at the scene of the crime. However, it may surprise you to hear that eyewitness accounts typically have pretty low probative value. Meta-analyses suggest that mistaken eyewitness identification occurred in 75% or more of cases in which a convicted individual was later exonerated on the basis of DNA evidence in the USA. Indeed, in the state of Virginia, as many as 11.6% of convictions based partly on DNA evidence would support a claim of wrongful conviction.

The burden of proof for murder is generally “beyond reasonable doubt.” This means that there is effectively no reasonable doubt, or that it is simply implausible for a reasonable person to doubt. The CIA referred to 90% confidence as “beyond reasonable doubt” (CIA, 1968: 5 in Beach and Pedersen, 2019: 179). So, reasonable doubt might actually be quite a lot of doubt if you think about it.

One key question in the show was whether Avery’s nephew (a cognitively impaired 16-year-old), under coercion from the police (which we can see from a video recording), produced a false confession or not. On the surface, a confession sounds like really strong evidence. It certainly seems stronger than some circumstantial evidence. After all, why would you confess to a crime you didn’t actually commit?

In fact, even confessions can be highly suspect. More than a quarter of overturned wrongful convictions in the USA involve a false confession. The Innocence Project has exonerated hundreds of prisoners, and one fifth of these were convicted chiefly due to false confessions. So, who you choose as credible witnesses really matters, as does the relevance of the information they can provide about the case and the probative value of the specific evidence they provide. As we can see from the Avery case, socioeconomic profiles make a difference to perceived credibility.

Realist evaluators recommend interview sampling based on Context Mechanism Outcome (CMO) investigation potential (Pawson and Tilley, 1997). That is, who is likely to shed light on what, how, and why an outcome was achieved. While this makes good sense, I find interview sampling (and data analysis) based on probative value to be more compelling still. Different interviewees can help confirm, refute, or refine your theory, but you should give more weight to the testimony of those who you wouldn’t expect to agree with you. Unless you’ve elicited a false confession by leading the interviewee too much, testimony which appears to go against an interviewee’s own (assumed) best interests is likely to be more credible than testimony which conforms to their interests and which simply parrots back your preferred explanation.

So, not all evidence is equal, and no evidence is perfect. However, thinking hard about what good evidence looks like can really help increase your confidence that you have a good explanation.

In the next blog, I will look at actors, relationships, and how to be more realistic in assessing behaviour change.

Thanks to Kaia Ambrose for comments.



Thomas Aston

I'm an independent consultant specialising in theory-based and participatory evaluation methods.