Bayes’ Theorem for Programmers

What Is Belief?

We believe some things, and not others. We believe that if we put our key in the lock, it will open the door. We believe it’s unlikely that if we push on a locked door, it will open. We believe it’s more likely that a given football team will win a specific game than that they will win it by a margin of 10 points. But what does that mean? A popular way to indicate our belief in something, as in the case of the football team, is to bet on it.

While mathematicians talk in probabilities, for example, “there is a 2/3 probability that the 49ers will win the game”, gamblers talk in odds: “the odds are two-to-one that the 49ers will win”. The gamblers’ notation is frequently more useful. Programmers, however, talk in bits. Bits are just a funny way of saying “log to the base of 2”. Scientists usually use log to the base of e, and engineers use log to the base of 10 (multiplied by a convenience factor of 10 for historical reasons, and called “decibels”).

If the odds are one-to-one, for example with evenly matched football teams, the log is log(1/1)=log(1)=0. When the odds favor the first team, for example at 2:1 (corresponding to a probability of 2/3), the bits are log(2/1)=log(2)=1. What if one of the teams is a huge favorite, and the odds are 32-to-1? The bits are log(32/1)=log(32)=5. What about the other team winning, at odds of 1-to-32? log(1/32)=-log(32)=-5. A little bit of algebra should convince you that this symmetry holds in general. This notation is called “logit” (for weird historical reasons), but we will call it “belief”. Belief, therefore, is measured in bits!
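
Since we will be converting between probabilities, odds, and bits for the rest of this essay, here is a minimal sketch in Python; the helper names belief() and probability() are mine, not anything standard:

from math import log2

def belief(p):
    # Belief, in bits: the base-2 log of the odds p : (1 - p).
    return log2(p / (1 - p))

def probability(b):
    # The inverse conversion: from bits of belief back to a probability.
    return 2 ** b / (1 + 2 ** b)

print(round(belief(1/2), 3))    # evenly matched teams: 0.0 bits
print(round(belief(2/3), 3))    # a 2:1 favorite: 1.0 bit
print(round(belief(32/33), 3))  # a 32:1 favorite: 5.0 bits
print(round(belief(1/33), 3))   # the underdog: -5.0 bits, i.e. -belief(32/33)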

So we have our first result:

Belief(An event happens)=-Belief(The event does not happen)

We indicate events, for clarity, by upper case letters, and so in short: Belief(A)=-Belief(not A).

Here is a way to measure our beliefs, approximately. Assume we have a fair coin. We call H the event “the coin comes up heads”, and H[n] the event “the coin comes up heads n times in a row”. What are our beliefs in the various H[n]s?

Belief(H)=0 (the odds are 50:50, so log(50/50)=log(1)=0)
Belief(H[2])=log(1/3)=(approximately) -1.5 bits
Belief(H[3])=log(1/7)=(approximately) -3 bits
Belief(H[4])=log(1/15)=(approximately) -4 bits

Belief(H[n]) = (approximately) -n bits for large n
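
As a quick sanity check, here are the same numbers in code (nothing here beyond the conversions above):

from math import log2

for n in range(1, 5):
    p = 0.5 ** n                          # probability of n heads in a row
    print(n, round(log2(p / (1 - p)), 2))
# prints 1 0.0, 2 -1.58, 3 -2.81, 4 -3.91: approaching -n as n grows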

By our earlier result, we also know what our belief in “not H[n]” should be! Now, when you want to measure your belief in something, ask yourself “would I prefer betting on it to betting on H[5]? To betting on not-H[5]?” Note that as a rational gambler (hah!) you should prefer betting on A to betting on B if, and only if, Belief(A)>Belief(B). So by measuring your preferences, you measure your belief, to the nearest bit. As a curious aside, what does it mean to “believe absolutely” in something? If your probability for A is 1, then the odds for A are 1-to-0: infinity, in other words! Since log(infinity)=infinity, an “absolute belief” is an “infinite belief”. Similarly, to “disbelieve absolutely” is to assign negative infinite belief. Mostly, we will assume that all our beliefs are finite: it makes the math easier! Sometimes we will say “strength of belief” instead of “belief”, when we want to emphasize that we mean the technical term introduced here.

Now we can formally define the term “overconfidence”. We call someone “overconfident” if we can find a large number of beliefs she holds with strength at least X bits (X a positive number), and significantly more than a fraction 1/(1+2^X) of them (roughly 1 in 2^X) turn out to be wrong. (That fraction is 1-expit(X), where expit is the inverse of logit; measured in bits, expit(t)=2^t/(1+2^t).) For example, suppose Barry has 100 things he believes with strength over 5 bits. We expect no more than about 1 out of 32 of them to be wrong, or about 3. If 10 of those beliefs turn out to be wrong, Barry is overconfident.

Note that, assuming Barry at least obeys the laws of probability theory, a belief in something with strength -5 bits means believing in its opposite with strength 5 bits, so limiting ourselves to positive Xs is not really a limitation at all. An example of someone who is overconfident, for comic relief, is C-3PO: “Sir, the possibility of successfully navigating an asteroid field is approximately 3,720 to 1”. Han Solo is completely correct in telling C-3PO never to tell him the odds, considering 3PO’s roughly 12 bits of belief that the asteroid field will kill them.

Similarly, we define “underconfidence”. We call someone “underconfident” if we can find a large number of beliefs he holds with strength no more than X bits, and significantly fewer than a fraction 1/(1+2^X) of them turn out to be wrong. For example, if Sally has 100 things she believes with strength under 5 bits, and only one of them is wrong, Sally is underconfident.

If someone is neither overconfident nor underconfident, we call them “well-calibrated”. If they say the odds are 3,720-to-1, we can expect them to be wrong about once in every 3,720 times they say that. For the sake of this explanation, we will assume that being well-calibrated is a desirable property of our beliefs.
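
Here is a sketch of what checking someone’s calibration might look like in code. The bucketing by strength and the slack factor are my own arbitrary choices for illustration, not part of the definition:

def check_calibration(track_record, x_bits, slack=3.0):
    # track_record: list of (strength_in_bits, turned_out_true) pairs.
    # Look at the beliefs held with roughly x_bits of strength and compare
    # the observed number of misses to the expected one, 1 in (1 + 2**x_bits).
    batch = [ok for bits, ok in track_record if abs(bits - x_bits) < 0.5]
    if not batch:
        return "no beliefs at this strength"
    wrong = sum(1 for ok in batch if not ok)
    expected_wrong = len(batch) / (1 + 2 ** x_bits)
    if wrong > slack * expected_wrong:
        return "overconfident"
    if wrong < expected_wrong / slack:
        return "underconfident"
    return "well-calibrated, as far as this sample can tell"

# Roughly Barry's situation: 100 beliefs held at about 5 bits, 10 of them wrong.
print(check_calibration([(5, True)] * 90 + [(5, False)] * 10, x_bits=5))  # overconfident
# Roughly Sally's situation: 100 beliefs held at about 5 bits, only 1 wrong.
print(check_calibration([(5, True)] * 99 + [(5, False)] * 1, x_bits=5))   # underconfident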

We introduce our second concept: conditional probability. There are many more color-blind men than women. You see a mother with a baby in her arms. What is the probability that the baby is color blind? If you are in the United States, the answer is about 3.7% (and so your belief in a random baby being color blind is about -5 bits). Ah, but now you ask the mother “is it a boy or a girl?” and she answers “a boy”. Now you are no longer asking about the probability of “the baby is color blind”, but about the probability of “the baby is color blind” given “the baby is a boy”, and that probability is about 7% (or about -3.5 bits of belief).

Now, it will sometimes be the case that you will want to invert the direction of the conditionality. Why? Suppose someone is standing in front of you in a mask. You cannot tell their sex, but you want to know if they’re male or female (genetically). You give them a test for color blindness, and it shows they are color blind. Your prior belief, since your friend who likes practical jokes picked them randomly off the street and put a mask on them, is Belief(male)=0. But now you know they are color blind. For brevity, we will define A as “the person is male” and E as “the person is color blind”. What should your answer be? We are going to introduce a little bit more notation for that. For E to tell us anything about A, it must be the case that the probability of E given A is different from the probability of E given not-A. Statisticians measure the ratio Probability(E given A)/Probability(E given not A) and call it “the likelihood ratio”. We are going to take the log of that, and call it “Evidence(E gives about A)”. The name “Evidence” is justified by Bayes’ theorem, which says:

Belief(A after seeing E) = Belief(A) + Evidence(E gives about A)

Now, the probability of a male being color blind is 7%. The probability of a female being color blind is 0.4%. 7/0.4 is 17.5, or about 4 bits’ worth. Belief(A) is 0. Therefore, Belief(A given E) is about 4.
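
In code, this update is just a few lines. A sketch, using the rounded figures from above (7% and 0.4%):

from math import log2

p_colorblind_given_male   = 0.07
p_colorblind_given_female = 0.004

prior = 0.0                                             # Belief(male): a coin flip, 0 bits
evidence = log2(p_colorblind_given_male / p_colorblind_given_female)
posterior = prior + evidence
print(round(posterior, 1))                              # about 4.1 bits
print(round(2 ** posterior / (1 + 2 ** posterior), 2))  # probability of male: about 0.95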

This is all there is to Bayes’ theorem! Bayes’ theorem says: “If your beliefs are well-calibrated, then they will still be well-calibrated after taking E into account, according to the formula above.” It means that to stay well-calibrated, we must add exactly the evidence. If we add a little bit more, or a little bit less, we lose calibration and become over- or underconfident.

A useful mnemonic for remembering this formula is “our belief after seeing evidence should be the sum of the prior belief and the evidence”. If we were writing it in a program, we might write something like:

Belief[A] += Evidence_for_A

This formulation explains why this process of evaluating probabilities based on Bayes’ theorem and the observed evidence is called “updating” or “updating beliefs based on evidence”.

The nice thing about this formulation of Bayes’ theorem is that it’s additive:

Belief(A given E1 and E2)=Belief(A)+Evidence(E1 gives about A)+Evidence(E2 gives about A, given E1)

In many cases, when there is no connection between E1 and E2 (in a certain formal sense), this further simplifies to:

Belief(A given E1 and E2)=Belief(A)+Evidence(E1 gives about A)+Evidence(E2 gives about A)

A case where it does not simplify is, for example, retaking a test. Our belief that a student knows the material should update further if they fail a second test, but not by as much as when they failed the first one: the student might simply be afraid of tests.

The exact conditions for simplification can be stated as “E1 and E2 are independent given either A or not A”. When is this condition likely to be fulfilled? When the pieces of evidence are generated independently once we fix which hypothesis is true. Here is an example: you have one loaded die, which falls on 1 about half the time, and one fair die, which falls on each number 1/6 of the time. You pick one of the two dice at random, roll it three times, and it comes out “111”. What are the chances it is the loaded die?

We first notice that whether the die is loaded or not, the throws are independent of each other. Therefore, we can use the simplified formula above:

Belief(loaded die given 111)=Belief(loaded die)+3*Evidence(roll of 1 gives about loaded)=~0+3*1.5=4.5

Again, we remind ourselves that log(P(1 given loaded)/P(1 given fair)) is the evidence for the die being loaded given a single throw that comes up 1.

Converting back, we see the probability is somewhere around 0.96.
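
Here is the whole die calculation as a sketch in code, done both ways: with the additive evidence formula, and directly from the likelihood ratio of the full “111” sequence, to check that they agree:

from math import log2

p_one_loaded, p_one_fair = 1/2, 1/6

# Additive form: a prior of 0 bits, plus the same evidence for each roll of 1.
evidence_per_one = log2(p_one_loaded / p_one_fair)            # about 1.58 bits (rounded to 1.5 above)
belief_loaded = 0.0 + 3 * evidence_per_one                    # about 4.75 bits

# Direct form: the likelihood ratio of the whole sequence "111".
likelihood_ratio = (p_one_loaded ** 3) / (p_one_fair ** 3)    # 27
print(round(belief_loaded, 2), round(log2(likelihood_ratio), 2))  # both about 4.75
print(round(likelihood_ratio / (1 + likelihood_ratio), 2))        # probability about 0.96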

Sometimes, calculating Probability(E given not A) is infeasible: “not A” is simply too big. In this case, we can use Bayes’ theorem to compare just two mutually exclusive theories, A and B, by assuming that one of them is true. We note that in this case:

Probability(E given (A or B) and not A)=Probability(E given B)
Probability(E given (A or B) and A)=Probability(E given A)

Therefore

Belief(A assuming A or B after seeing E)=Belief(A assuming A or B)+Evidence(E gives for A assuming A or B)

The same comments about the additivity of evidence still apply, of course. This is the form of testing usually done in science: while there is no a priori assurance that one of our theories is true, this measures the relative strength of their predictions. We call log(Probability(E given A)/Probability(E given B)) the “relative evidence” E gives to differentiate between A and B.
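
A minimal sketch of the two-theory comparison in code (the function names are mine):

from math import log2

def relative_evidence(p_e_given_a, p_e_given_b):
    # Bits of evidence E gives for A over B, assuming one of A and B is true.
    return log2(p_e_given_a / p_e_given_b)

def relative_belief(prior_bits_for_a_over_b, *evidence_terms):
    # Additivity still applies: just sum the prior and the evidence terms.
    return prior_bits_for_a_over_b + sum(evidence_terms)

# The loaded-vs-fair die again, starting from even prior odds:
print(round(relative_belief(0.0, *[relative_evidence(1/2, 1/6)] * 3), 2))  # about 4.75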

Here is an example of applying the ideas behind Bayes’ theorem to information theory. Let us assume we have a message M we wish to send, of 5 bits. The message is “maximum-entropy”: our listener on the other side has no prior information about any of the bits beyond Belief(bit=0)=0.

In this case, the prior belief is Belief(The other side wanted to send M)=-5. We wish to transmit evidence about the message, to get the other side to guess it. Unfortunately, our channel garbles bits with probability 1/4; the chance that a bit we send arrives ungarbled is therefore 3/4. Assume that we have lots and lots of functions, which we will indicate by H, from five-bit messages to one bit, such that each H is 0 on exactly half the messages, and any two distinct Hs agree on exactly half the messages.

We wish to make sure that the belief assigned to the correct hypothesis is Belief(M)>0. How many bits do we need to send?

We have Belief(M)=-5 on prior information. Our assumptions make the pieces of evidence independent (this is why we assume that any two Hs agree on exactly half of the messages), so we need more than 5 bits of evidence to drag it over the 0 threshold. How much evidence does a single transmission give?

Let us choose a single function, H. We send H(M) to the other side. When our listener sees a bit arrive, she assumes it might be garbled. We will call E the event “the bit that arrives at the other side is H(M)”. Because she is interested in the probability of the message being M, she calculates the likelihood ratio: Probability(E given M)/Probability(E given not M). It is easier to calculate Probability(E) than Probability(E given not M), and using it instead is a fairly accurate approximation, off by only about 2% (and the accuracy improves quickly as the number of bits in the message grows). P(E) is 1/2, because garbling a bit that already has probability 1/2 leaves its probability at 1/2. As can easily be seen, P(E given M)=3/4.
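
The claim that Probability(E given not M) is close to P(E)=1/2 can be checked by brute force over all 32 five-bit messages. A sketch (the particular H used here, the parity of the bits, is just one example of such a function):

from itertools import product

def parity(message):
    # One example of an H: the parity of the five bits (0 on exactly half of the messages).
    return sum(message) % 2

messages = list(product([0, 1], repeat=5))
m = messages[0]                                # the message actually sent
others = [x for x in messages if x != m]

# E is "the arriving bit equals H(m)".  The channel flips the transmitted bit
# with probability 1/4, so P(E given m' was sent) is 3/4 when H(m') == H(m),
# and 1/4 otherwise.
p_e_given_not_m = sum(0.75 if parity(x) == parity(m) else 0.25 for x in others) / len(others)
print(round(p_e_given_not_m, 3))               # 0.492, close to 1/2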

Hence, the evidence from a bit that arrives correctly is log((3/4)/(1/2))=log(3/2)=~0.58 bits. A bit that arrives garbled, however, transmits log((1/4)/(1/2))=-1 bit of evidence. The average transmission therefore sends about 0.58*0.75-1*0.25=~0.19 bits, so we need to transmit about 27 bits to make Belief(M given E_1,…,E_27)>0.
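
The same calculation as a sketch in code (the target of “more than 5 bits” comes from the prior of -5):

from math import log2

p_ungarbled = 0.75
ev_correct   = log2(p_ungarbled / 0.5)        # about +0.58 bits when the bit arrives intact
ev_incorrect = log2((1 - p_ungarbled) / 0.5)  # exactly -1 bit when it arrives garbled
avg_evidence = p_ungarbled * ev_correct + (1 - p_ungarbled) * ev_incorrect
print(round(avg_evidence, 3))                 # about 0.189 bits per transmission
print(round(5 / avg_evidence, 1))             # about 26.5, so roughly 27 transmissions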

Note that in practical information theory it would be infeasible to calculate Belief(M given E_i) for every possible M, and therefore various optimizations are necessary. However, the calculation above answers the theoretical question: “how many noisy bits do we need to send to specify the message?” Any practical scheme has to transmit at least about 27 bits on average; any fewer, and the message is under-specified.

Acknowledgements: I would like to thank Eliezer Yudkowsky for his Gentle introduction to Bayes’ theorem which taught me about the logit notation, and inspired this essay.

Bonus content: Proof of Bayes’ Theorem

This is just a sequence of algebraic manipulations. Read this if you are not convinced that my formulation of Bayes’ theorem is correct:

The formula for conditional probability:

P(A|B) = P(A and B)/P(B)

[In other words, if someone is well-calibrated, they must obey this formula.]

Therefore

P(B|A) = P(A and B)/P(A)

If we divide those two formulae, we get

P(A|B)/P(B|A) = P(A)/P(B)

Or

P(A|B) = P(B|A)P(A)/P(B)

Now let’s apply it to (not A):

1-P(A|B) = P(not A|B) = P(B|not A)P(not A)/P(B) = P(B|not A)(1-P(A))/P(B)

Let’s divide those two formulas by each other:

P(A|B)/(1-P(A|B)) = (P(B|A)/P(B|not A))(P(A)/(1-P(A)))

Now let’s take the log of both sides:

log(P(A|B)/(1-P(A|B))) = log((P(B|A)/P(B|not A))*(P(A)/(1-P(A)))) = log(P(B|A)/P(B|not A)) + log(P(A)/(1-P(A)))

But that’s the definition of logit, and evidence!

Belief(A given B) = Evidence(B gives about A) + Belief(A)
