This post follows from the previous review of Richard Carrier’s “Proving History”, which attempts to use Bayes’s Theorem to prove Jesus didn’t exist. In my review I point out a selection of the mathematical problems with that book, even though I quite enjoyed it. This post is designed to explain what Bayes’s Theorem actually does, and show why it isn’t particularly useful outside of specific domains. It is a journey through basic probability theory, for folks who aren’t into math (though I’ll assume high-school math). It is designed to be simple, and therefore is rather long. I will update it and clarify it from time to time.
Let’s think about the birth of Christianity. How did it happen? We don’t know, which is to say there are a lot of different things that could have happened. Let’s use an illustration to picture this.
Complex diagram, eh? I want this rectangle to represent all possible histories: everything that could have happened. In math we call this rectangle the ‘universe’, but meant metaphorically: the universe of possibilities. In the rectangle each point is one particular history. So there is one point which is the actual history, the one-true-past (OTP in the diagram below), but we don’t know which it is. In fact, we can surely agree we’ve no hope of ever finding it, right? To some extent there will always be things in history that are uncertain.
When we talk about something happening in history, we aren’t narrowing down history to a point. If we consider the claim “Jesus was the illegitimate child of a Roman soldier”, there is a range of possible histories involving such a Jesus. Even if we knew 100% that were true, there would be a whole range of different histories including that fact.
Napoleon moved his knife in a particular way during his meal on January 1st 1820, but he could have moved that knife in any way, or been without a knife, and the things we want to say about him wouldn’t change. His actual knife manipulation is part of the one-true-past, but totally irrelevant for Napoleonic history [1].
So any claim about history represents a whole set of possible histories. We draw such sets as circles. And if you’re a child of the new math, you’ll recognize the above as a Venn diagram. But I want to stress what the diagram actually means, so try to forget most of your Venn diagram math for a while.
At this point we can talk about what a probability is.
There are essentially an infinite number of possible histories (the question of whether it is literally infinite is one for the philosophy of physics, but even if finite, it would be so large as to be practically infinite for the purpose of our task). So each specific history would be infinitely unlikely. We can’t possibly say anything useful about how likely any specific point is; we can’t talk about the probability of a particular history.
So again we turn to our sets. Each set has some likelihood of the one-true-past lying somewhere inside it. How likely is it that Jesus was born in Bethlehem? That’s another way of asking how likely it is that the one-true-past lies in the set of possible histories that we would label “Jesus Born in Bethlehem”. The individual possibilities in the set don’t have a meaningful likelihood, but our historical claims encompass many possibilities, and as a whole those claims do have meaningful likelihood. In other words, when we talk about how likely something was to have happened, we are always talking about a set of possibilities that match our claim.
We can represent the likelihood on the diagram by drawing the set bigger or smaller. If we have two sets, one drawn double the size of the other, then the one-true-past is twice as likely to be in the one that is drawn larger.
So now we can define what a probability is for a historical claim. A probability is a ratio of the likelihood of a set, relative to the whole universe of possibilities. Or, in terms of the diagram, what fraction of the rectangle is taken up by the set of possibilities matching our claim?
If we can somehow turn likelihood into a number (let’s say that the likelihood of a set S is a number written L(S)), and if the universe is represented by the set U, probability can be mathematically defined as:

$$P(S) = \frac{L(S)}{L(U)}$$
But where do these ‘likelihood’ numbers come from? That’s a good question, and one that turns out to be very hard to give an answer for that works in all cases. But for our purpose, just think of them as a place-holder for any of a whole range of different things we could use to calculate a probability. For example: if we were to calculate the probability of rolling 6 on a die, the likelihood numbers would be the number of sides: the likelihood of rolling a 6 would be 1 side, the likelihood of rolling anything would be 6 sides, so the probability of rolling a six is 1/6. If we’re interested in the probability of a scanner diagnosing a disease, the likelihoods would be the numbers of scans: on top would be the number of successful scans, the number on the bottom would be the total number of scans. We use the abstraction as a way of saying “it doesn’t much matter what these things are, as long as they behave in a particular way, the result is a probability”.
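If it helps to see the die and scanner examples as code, here is a minimal Python sketch of probability as the ratio of a set’s likelihood to the universe’s likelihood. The function name `probability` and the scan counts are my own inventions for illustration; they are not measurements from anywhere.

```python
from fractions import Fraction


def probability(likelihood_of_set, likelihood_of_universe):
    """Return P(S) = L(S) / L(U) as an exact fraction."""
    return Fraction(likelihood_of_set, likelihood_of_universe)


# Die: one side shows a six, out of six sides in total.
p_six = probability(1, 6)

# Scanner: hypothetical counts, invented purely for this sketch.
successful_scans = 1900
total_scans = 2000
p_scan = probability(successful_scans, total_scans)

print(p_six)   # 1/6
print(p_scan)  # 19/20
```

Notice that the code doesn’t care whether the likelihoods are sides or scans, which is the point of the abstraction.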
Now we’ve got to probabilities, we’ve used these ‘likelihoods’ as a ladder, and we can move on. We only really worry about how the probability is calculated when we have to calculate one, and then we do need to figure out what goes on the top and bottom of the division.
In this diagram we have two sets. These are two claims, or two sets of possible histories. The sets may overlap in any combination. If no possible history could match both claims (e.g. “Jesus was born in Bethlehem” and “Jesus was born in Nazareth”), then the two circles wouldn’t touch [kudos if you are thinking “maybe there are ways both could be kind-of true” – that’s some math for another day]. Or it might be that the claims are concentric (“Jesus was born in Bethlehem”, “Jesus was born”): any possibility in one set will always be in the other. Or they may, as in this case, overlap (“Jesus was born in Nazareth”, “Jesus was born illegitimately”).
I’ve been giving examples of sets of historical claims, but there is another type of set that is important: the set of possible histories matching something that we know happened. Of all the possible histories, how many of them produce a New Testament record that is similar to the one we know?
This might seem odd. Why does our universe include things we know aren’t true? Why are there possibilities which lead to us never having a New Testament? Why are there histories where we have a surviving comprehensive set of writings by Jesus? Can’t we just reject those outright? The unhelpful answer is that we need them for the math to work. As we’ll see, Bayes’s Theorem requires us to deal with the probability that history turned out the way it did. I’ll give an example later of this kind of counter-factual reasoning.
So we have these two kinds of set. One kind which are historical claims, and the other which represent known facts. The latter are often called Evidence, abbreviated E, the former are Hypotheses, or H. So let’s draw another diagram.
In the diagram, H∩E means the intersection of sets H and E – the set of possible histories where we both see the evidence and where our hypothesis is true (you can read the mathematical symbol ∩ as “and”).
Here is the basic historical problem. We have a universe of possible histories. Some of those histories could have given rise to the evidence we know, some might incorporate our hypothesis. We know the one true past lies in E, but we want to know how likely it is to be in the overlap, rather than the bit of E outside H. In other words, how likely is it that the Hypothesis is true, given the Evidence we know?
Above, I said that probability is how likely a set is, relative to the whole universe. This is a simplification we have to revisit now. Probability is actually how likely one set is, relative to some other set that completely encompasses it (a superset in math terms).
We’re not actually interested in how likely our Hypothesis is, relative to all histories that could possibly have been. We’re only interested in how likely our hypothesis is, given our evidence: given that the one-true-past is in E.
So the set we’re interested in is the overlap where we have the evidence and the hypothesis is true. And the superset we want to compare it to is E, because we know the one-true-past is in there (or at least we are willing to assume it is). This is what is known as a conditional probability. It says how likely is H, given that we know or assume E is true: we write it as P(H|E) (read as “the probability of H, given E”). And from the diagram it should be clear the answer is:

$$P(H \mid E) = \frac{P(H \cap E)}{P(E)}$$

It is the ratio of the size of the overlap, relative to the size of the whole of E. This is the same as our previous definition of probability, only before we were comparing it to the whole universe U, now we’re comparing it to just the part of U where E is true [2].
We could write all probabilities as conditional probabilities, because ultimately any probability is relative to something. We could write P(S|U) to say that we’re interested in the probability of S relative to the universe. We could, but it would be pointless, because that is what P(S) means. Put another way, P(S) is just a conveniently simplified way of writing P(S|U).
So what is a conditional probability doing? It is zooming in, so we’re no longer talking about probabilities relative to the whole universe of possibilities (most of which we know aren’t true anyway), we’re now zooming in, to probabilities relative to things we know are true, or we’re willing to assume are true. Conditional probabilities throw away the rest of the universe of possibilities and just focus on one area: for P(H|E), we zoom into the set E, and treat E as if it were the universe of possibilities. We’re throwing away all those counter-factuals, and concentrating on just the bits that match the evidence.
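Here is a small Python sketch of that zooming in. It swaps our universe of histories for a toy one, the six faces of a fair die, purely so the sets are small enough to count; the sets `E` and `H` and the helper `prob` are assumptions of the sketch, not anything historical.

```python
from fractions import Fraction

# A toy universe: the six faces of a fair die (an assumption of this sketch).
U = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}   # evidence: "the roll was even"
H = {4, 5, 6}   # hypothesis: "the roll was above three"


def prob(claim, given=U):
    """Share of `given` taken up by the part of `claim` that lies inside it."""
    return Fraction(len(claim & given), len(given))


p_h = prob(H)             # relative to the whole universe
p_h_given_e = prob(H, E)  # zoomed in: E is treated as the universe

print(p_h)          # 1/2
print(p_h_given_e)  # 2/3

# The same answer comes from the ratio P(H and E) / P(E).
assert p_h_given_e == prob(H & E) / prob(E)
```

Calling `prob(H)` with no second argument is the code version of writing P(H) instead of P(H|U).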
The equation for conditional probability is simple, but in many cases it is hard to find P(H∩E), so we can manipulate it a little, to remove P(H∩E) and replace it with something simpler to calculate.
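Bayes’s Theorem is one of many such manipulations. We can use some basic high school math to derive it:

$$
\begin{aligned}
P(H \mid E) &= \frac{P(H \cap E)}{P(E)} \\
P(H \mid E)\,P(E) &= P(H \cap E) = P(E \cap H) = P(E \mid H)\,P(H) \\
P(H \mid E)\,P(E) &= P(E \mid H)\,P(H) \\
P(H \mid E) &= \frac{P(E \mid H)\,P(H)}{P(E)}
\end{aligned}
$$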
Step-by-step math explanation: The first line is just the formula for conditional probability again. If we multiply both sides by P(E) (and therefore move it from one side of the equation to the other) we get the first two parts on the second line. We then assume that P(H∩E) = P(E∩H) (in other words, the size of the overlap in our diagram is the same regardless of which order we write the two sets), which means that we can get the fourth term on the second line just by changing over E and H in the first term. Line three repeats these two terms on one line without the P(H∩E) and P(E∩H) in the middle. We then divide by P(E) again to get line four, which gives us an equation for P(H|E) again.
What is Bayes’s Theorem doing? Notice the denominator is the same as for conditional probability, P(E), so what Bayes’s Theorem is doing is giving us a way to calculate P(H∩E) differently. It is saying that we can calculate P(H∩E) by looking at the proportion of H taken up by H∩E, multiplied by the total probability of H. If I want to find the amount of water in a cup, I could say “it’s half the cup, the cup holds half a pint, so I have one half times half a pint, which is a quarter of a pint”. That’s the same logic here. The numerator of Bayes’s theorem is just another way to calculate P(H∩E).
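To check that the numerator really is just P(H∩E) in disguise, here is a minimal Python sketch on a toy universe (the six faces of a fair die, which is my assumption for illustration, as are the sets `E` and `H` and the helper `prob`).

```python
from fractions import Fraction

# Toy universe, invented for this sketch: the faces of a fair die.
U = {1, 2, 3, 4, 5, 6}
E = {2, 4, 6}   # evidence: "the roll was even"
H = {4, 5, 6}   # hypothesis: "the roll was above three"


def prob(claim, given):
    """Share of `given` taken up by the part of `claim` that lies inside it."""
    return Fraction(len(claim & given), len(given))


p_h = prob(H, U)            # 1/2
p_e = prob(E, U)            # 1/2
p_e_given_h = prob(E, H)    # 2/3: the proportion of H taken up by the overlap
p_overlap = prob(H & E, U)  # 1/3: P(H and E) counted directly

# The Bayes numerator is the overlap, calculated a different way.
assert p_e_given_h * p_h == p_overlap

# So Bayes's Theorem and the plain conditional probability formula agree.
p_h_given_e = p_e_given_h * p_h / p_e
assert p_h_given_e == prob(H, E)
print(p_h_given_e)  # 2/3
```

With a die we can count the overlap directly, so Bayes’s Theorem buys us nothing here; it only earns its keep when the overlap is the hard thing to get at.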
So what is Bayes’s Theorem for? It lets us get to the value we’re interested in, P(H|E), if we happen to know, or can calculate, the other three quantities: the probability of each set, P(H) and P(E) (relative to the universe of possibilities), and the probability of seeing the evidence if the hypothesis were true, P(E|H). Notice that, unlike the previous formula, we’ve now got three things to find in order to use the equation. And either way, we still need to calculate the probability of the evidence, P(E).
Bayes’s Theorem can also be useful if we could calculate P(H∩E), but with much lower accuracy than we can calculate P(H) and P(E|H). Then we’d expect our result from Bayes’s Theorem to be a more accurate value for P(H|E). If, on the other hand, we could measure P(H∩E), or we had a different way to calculate that, we wouldn’t need Bayes’s Theorem.
Bayes’s Theorem is not a magic bullet; it is just one way of calculating P(H|E). In particular it is the simplest formula for reversing the condition: if you know P(E|H), you use Bayes’s Theorem to give you P(H|E) [3].
So the obvious question is: if we want to know P(H|E), what shall we use to calculate it? Either of the two formulae above needs us to calculate P(E): in the universe of possible histories, how likely are we to have ended up with the evidence we have? Can we calculate that?
And here things start to get tricky. I’ve never seen any credible way of doing so. What would it mean to find the probability of the New Testament, say?
Even once we’ve done that, we’d only be justified in using Bayes’s Theorem if our calculations for P(H) and P(E|H) are much more accurate than we could manage for P(H∩E). Is that true?
I’m not sure I can imagine a way of calculating either P(H∩E) or P(E|H) for a historical event. How would we credibly calculate the probability of the New Testament, given the Historical Jesus? Or the probability of having both New Testament and Historical Jesus in some universe of possibilities? If you want to use this math, you need to justify how on earth you can put numbers on these quantities. And, as we’ll see when we talk about how these formulae magnify errors, you’ll need to do more than just guess.
But what of Carrier’s (and William Lane Craig’s) favoured version of Bayes’s Theorem? It is derived from the normal version by observing:

$$P(E) = P(E \cap H) + P(E \cap {\sim}H)$$

in other words, the set E is just made up of the bit that overlaps with H and the bit that doesn’t (~H means “not in H”), so because

$$P(E \cap H) = P(E \mid H)\,P(H)$$

(which was the rearrangement of the conditional probability formula we used on line two of our derivation of Bayes’s Theorem), we can write Bayes’s Theorem as

$$P(H \mid E) = \frac{P(E \mid H)\,P(H)}{P(E \mid H)\,P(H) + P(E \mid {\sim}H)\,P({\sim}H)}$$
Does that help?
I can’t see how. This is just a further manipulation. The bottom of this equation is still just P(E), we’ve just come up with a different way to calculate it, one involving more terms [4]. We’d be justified in doing so, only if these terms were obviously easier to calculate, or could be calculated with significantly lower error than P(E).
If these terms are estimates, then we’re just using more estimates that we haven’t justified. We’re still having to calculate P(E|H), and now P(E|~H) too. I cannot conceive of a way to do this that isn’t just unredeemable guesswork. And it is telling that nobody I’ve seen advocate Bayes’s Theorem in history has actually worked through such a process with anything but estimates.
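To see that the longer version is the same calculation with P(E) spelled out, here is a minimal Python sketch. The three input values are hypothetical, picked only to make the arithmetic easy to follow; they are not estimates of anything historical.

```python
from fractions import Fraction

# Hypothetical inputs, chosen only so the fractions come out tidy.
p_h = Fraction(3, 10)              # P(H)
p_e_given_h = Fraction(4, 5)       # P(E|H)
p_e_given_not_h = Fraction(1, 10)  # P(E|~H)

p_not_h = 1 - p_h                  # P(~H)

# P(E) built from the bit of E inside H plus the bit of E outside H.
p_e = p_e_given_h * p_h + p_e_given_not_h * p_not_h

short_form = p_e_given_h * p_h / p_e
long_form = (p_e_given_h * p_h) / (
    p_e_given_h * p_h + p_e_given_not_h * p_not_h
)

print(p_e)         # 31/100
print(long_form)   # 24/31
assert short_form == long_form
```

The two forms can’t disagree, because the long denominator is P(E). All that changes is which numbers you have to supply, and here every one of them was made up.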
This is bad news, and it might seem that Bayes’s Theorem could never be useful for anything. But there are cases when we do have the right data.
Let’s imagine that we’re trying a suspect for murder. The suspect has a DNA match at the scene (the Evidence). Our hypothesis is that the DNA came from the suspect. What is P(H|E) – the probability that the DNA is the suspect’s, given that it is a match? This is a historical question, right? We’re asked to find what happened in history, given the evidence before us. We can use Bayes here, because we can get all the different terms.
P(E|H) is simple – what is the probability our test would give a match, given the DNA was the suspect’s? This is the accuracy of the test, and is probably known. P(E) is the probability that we’d get a match regardless. We can use a figure for the probability that two random people would have matching DNA. P(H) is the probability that our suspect is the murderer, in the absence of evidence. This is the probability that any random person is the murderer (if we had no evidence, we’d have no reason to suspect any particular person). So the three terms we need can be convincingly provided, measured, and their errors calculated. And, crucially, these three terms are much easier to calculate, with lower errors, than if we used the P(H∩E) form. What could we measure to find the probability that the suspect is the murderer and their DNA matched? Probably nothing – Bayes’s Theorem really is the best tool to find the conditional probability we’re interested in.
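Here is a rough Python sketch of the trial calculation. Every number in it is hypothetical (the population size, the test accuracy, the random match rate), and the function name `posterior` is mine. One further assumption to flag: to keep the inputs consistent with each other, the sketch builds P(E) from the match rate for people who are not the source, rather than plugging in a single “match regardless” figure.

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Bayes's Theorem with P(E) expanded over H and not-H."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e


# All figures below are invented for illustration.
population = 1_000_000        # people who could have left the sample
p_h = 1 / population          # prior: no reason to pick any one person
p_match_if_source = 0.99      # test accuracy when the DNA is the suspect's
p_random_match = 1e-7         # chance an unrelated person matches

p_source_given_match = posterior(p_h, p_match_if_source, p_random_match)
print(round(p_source_given_match, 3))  # about 0.908
```

With these made-up figures the match takes us from one in a million to roughly 91%. What matters is that each input is the kind of thing a lab or a census could actually measure.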
While we’re thinking about this example, I want to return briefly to what I said about counter-factual reasoning. Remember I said that Bayes’s Theorem needs us to work with a universe of possibilities where things we know are true, might not be true? The trial example shows this. We are calculating the probability that the suspect’s DNA would match the sample at the crime scene – but this is counter-factual, because we know it did (otherwise we’d not be doing the calculation). We’re calculating the probability that the DNA would match, assuming the suspect were the murderer, but again, this is counter-factual, because the DNA did match, and we’re trying to figure out whether they are the murderer. This example shows that the universe of possibilities we must consider has to be bigger than the things we know are true. We have to work with counter-factuals, to get the right values.
So Bayes’s Theorem is useful when we have the right inputs. Is it useful in history? I don’t think so. What is the P(E) if the E we’re interested in is the New Testament? Or Josephus? I simply don’t see how you can give a number that is rooted in anything but a random guess. I’ve not seen it argued with any kind of rational basis.
So ultimately we end up with this situation. Bayes’s Theorem is used in these kinds of historical debates to feed in random guesses and pretend the output is meaningful. I hope if you’ve been patient enough to follow along, you’ll see that Bayes’s Theorem has a very specific meaning, and that when seen in the cold light of day for what it is actually doing, the idea that it can be numerically applied to general questions in history is obviously ludicrous.
But, you might say, in Carrier’s book he pretty much admits that numerical values are unreliable, and suggests that we can make broad estimates, erring on the side of caution and do what he calls an a fortiori argument – if a result comes from putting in unrealistically conservative estimates, then that result can only get stronger if we make the estimates more accurate. This isn’t true, unfortunately, but for that, we’ll have to delve into the way these formulas impact errors in the estimates. We can calculate the accuracy of the output, given the accuracy of each input, and it isn’t very helpful for a fortiori reasoning. That is a topic for another part.
As is the little teaser from earlier, where I mentioned that, in subjective historical work, sets that seem not to overlap can be imagined to overlap in some situations. This is another problem for historical use of probability theory, but to do it justice we’ll need to talk about philosophical vagueness and how we deal with that in mathematics.
Whether I get to those other posts or not, the summary is that both of them significantly reduce the accuracy of the conclusions that you can reach with these formulae, if your inputs are uncertain. It doesn’t take much uncertainty on the input before you lose any plausibility for your output.
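As a rough taste of that, and not the proper error analysis I’ve deferred, here is a small Python sketch that holds two inputs fixed and varies only the guess for P(E|~H). All the values are hypothetical and the function name `posterior` is my own.

```python
def posterior(p_h, p_e_given_h, p_e_given_not_h):
    """Bayes's Theorem with P(E) expanded over H and not-H."""
    p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)
    return p_e_given_h * p_h / p_e


# Hypothetical fixed inputs; only the guess for P(E|~H) changes.
p_h = 0.5
p_e_given_h = 0.5

results = {}
for guess in (0.1, 0.3, 0.5, 0.7, 0.9):
    results[guess] = posterior(p_h, p_e_given_h, guess)
    print(f"P(E|~H) = {guess:.1f}  ->  P(H|E) = {results[guess]:.3f}")
```

With these inputs the answer runs from above 0.8 down to below 0.4, depending on nothing but which guess you happened to feed in.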
[1] Of course, we can hypothesize some historical question for which it might not be irrelevant. Perhaps we’re interested in whether he was sick that day, or whether he was suffering a degenerating condition that left his hands compromised. Still, the point stands, even those claims still encompass a set of histories, they don’t refer to a single point.
[2] Our definition of probability involved L(S) values, what happened to them? Why are we now dividing probabilities? Remember that a Likelihood, L(S), could be any number that represented how likely something was. So something twice as likely had double the L(S) value. I used examples like number of scans or number of sides of a die, but probability values also meet those criteria, so they can also be used as L(S) values. The opposite isn’t true, not every Likelihood value is a probability (e.g. we could have 2,000 scans, which would be a valid L(S) value, but 2,000 is not a valid probability).
[3] Though Bayes’s Theorem is often quoted as being a way to reverse the condition P(H|E) from P(E|H), it does still rely on P(E) and P(H). You can do further algebraic manipulations to find these quantities, one of which we’ll see later to calculate P(E). Here the nomenclature is a bit complex. Though Bayes’s Theorem is a simple algebraic manipulation of conditional probability, further manipulation doesn’t necessarily mean a formula is no longer a statement of Bayes’s Theorem. The presence of P(E|H) in the numerator is normally good enough for folks to call it Bayes’s Theorem, even if the P(E) and P(H) terms are replaced by more complex calculations.
[4] You’ll notice, however, that P(E|H)P(H) is on both the top and the bottom of the fraction now. So it may seem that we’re using the same estimate twice, cutting down the number of things to find. This is only partially helpful, though. If I write a follow-up post on errors and accuracy, I’ll show why I think that errors on top and bottom can pull in different directions.