The Fundamentals of Bayes’ Theorem in Python

Getting back to the basics of Bayes' Theorem using Python.

Thomas Bayes and Bayesianism

Thomas Bayes was a rather obscure 18th Century English clergyman and it is not even certain when and where he was born, but it was around 1701 and possibly in Hertfordshire just north of London. His only mark on history is the eponymous Bayes' Theorem but the name Bayesian is now used in many different areas, sometimes with only tenuous links to the original theorem.

This gives the impression that Bayesianism is a huge and complex field covering not just probability but extending in to philosophy, computer science and beyond. In this article I will get back to the basics of the theorem, firstly by applying it to its "standard" example of medical tests, and then writing a simple demonstration of its use in Python.

Bayes' Theorem

Bayes' Theorem is basically a simple formula so let's start by chalking it up.

Bayes Theorem

You may be familiar with the P(A) notation used in probability theory to denote the probability of a specific outcome or event, A. We use decimals and all probabilities add to 1, so for example if we know 1% of the population has a certain disease then P(ill) = 0.01, P(healthy) = 0.99 and 0.01 + 0.99 = 1.

The | symbol used in the formula extends the notation to indicate the probability of a certain outcome given an existing state, and the | can be read as "given". If in the above example we assume a test is available for the disease then Bayes' Theorem allows us to calculate the probability of a person having the disease given a positive test result.

You might assume that the probability of someone having a disease if they test positive is 1, and conversely the probability is 0 if they test negative. Unfortunately no medical test is perfect: some people with the disease will test negative and some people who do not have the disease will test positive. Even with a highly accurate test this can lead to some startlingly inaccurate results, as we will see.

Let's make up a few fictitious numbers for an equally fictitious disease, just for demonstration purposes. We need to know the population and the percentage which has the disease. We also need a couple of numbers to describe the accuracy of the test: what percentage of people with the disease test positive, and what percentage of people who are healthy test negative. These are the sensitivity and specificity.

Population1,000,000
Ill1% or 0.01
Sensitivity99% or 0.99
Specificity99% or 0.99

Now let's assume everyone has been tested and we have the following figures:

Number of ill people (1% of population)10,000
Ill people correctly tested positive (99%)9,900
Ill people incorrectly tested negative (1%)100
Number of healthy people (99% of population)990,000
Healthy people correctly tested negative (99%)980,100
Healthy people incorrectly tested positive (1%)9,900

The sensitivity and specificity rates of 99% look impressive, but as you can see from the previous table the number of healthy people who wrongly tested positive (shown in bold) is exactly the same as the number of ill people who correctly tested positive (again shown in bold). Therefore if a person tests positive there is only a probability of 0.5 that they are actually ill.

Plugging the Numbers into the Formula

Using the process above we established the probability of a person testing positive actually having the disease. However, it was a messy process which can be simplified by using the formula for Bayes' Theorem.

This is the theorem applied to our sample problem, which as you can see gives us the 0.5 result we are looking for.

Bayes Theorem

The values above the line are straightforward, and come straight from our table of known data. However, the part below the line, P(positive), needs to be calculated from:

P(healthy) * P(positive|healthy) + P(ill) * P(positive|ill)

This gives us the overall probability of testing positive, irrespective of whether the subject is ill or healthy.

Let's Code It

We can stare at a (virtual) blackboard all day but to fully understand what's going on it's a good idea to implement the formula in code. This also gives us the opportunity to change values quickly and easily to see how this affects the outcome.

The code for this project is all in one short file called bayes.py, and you can download it as a zip file or clone/download the Github repository.

Source Code Links

ZIP File
GitHub

This is the source code in its entirety.

bayes.py

def main():

    """
    Call 2 functions to calculate conditional probabilities,
    firstly using basic arithmetic and then using Bayes' formula.
    """

    population = 1000000

    # These 3 variables are for the known probabilities.
    # Change them to see the effect on P(ill|positive)
    P_ill = 0.01
    P_positive_if_ill = 0.99 # sensitivity
    P_negative_if_healthy = 0.99 # specificity

    calculate_without_bayes(population, P_ill, P_positive_if_ill, P_negative_if_healthy)

    print()

    calculate_with_bayes(P_ill, P_positive_if_ill, P_negative_if_healthy)


def calculate_without_bayes(population, P_ill, P_positive_if_ill, P_negative_if_healthy):

    """
    Calculate P(ill | positive) without Bayes' formula.
    This is more laborious but shows how the result is
    calculated using basic arithmetic.
    """

    heading = "Calculate P(ill | positive) without Bayes' Theorem"
    print(heading)
    print("=" * len(heading) + "\n")

    percent_ill = P_ill * 100
    number_ill = population * P_ill
    number_healthy = population * (1 - P_ill)
    ill_positive = number_ill * P_positive_if_ill
    healthy_positive = number_healthy * (1 - P_negative_if_healthy)
    P_ill_if_positive = ill_positive / (ill_positive + healthy_positive)

    print(f"Population:                {population}")
    print(f"Percent ill:               {percent_ill}%")
    print(f"Number ill:                {number_ill:>.0f}")
    print(f"Number healthy:            {number_healthy:>.0f}")
    print(f"P(positive if ill):        {P_positive_if_ill}")
    print(f"P(negative if healthy):    {P_negative_if_healthy}")
    print(f"Ill and test positive:     {ill_positive:>.0f}")
    print(f"Healthy but test positive: {healthy_positive:>.0f}")
    print(f"P(ill | positive):        {P_ill_if_positive:>.2f}")


def calculate_with_bayes(P_ill, P_positive_if_ill, P_negative_if_healthy):

    """
    Calculate P(ill | positive) with Bayes' Theorem.
    """

    P_healthy = 1 - P_ill
    P_positive_if_healthy = 1 - P_negative_if_healthy
    P_ill_if_positive = (P_positive_if_ill * P_ill) / ((P_healthy * P_positive_if_healthy) + (P_ill * P_positive_if_ill))

    heading = "Calculate P(ill | positive) with Bayes' Theorem"
    print(heading)
    print("=" * len(heading) + "\n")

    print(f"P(ill):                 {P_ill}")
    print(f"P(healthy):             {P_healthy}")
    print(f"P(positive if ill):     {P_positive_if_ill}")
    print(f"P(positive if healthy): {P_positive_if_healthy:>.2f}\n")

    print("                                        P(positive if ill) * P(ill)")
    print("P(ill | positive) = -------------------------------------------------------------------")
    print("                      P(healthy) * P(positive if healthy) + P(ill) * P(positive if ill)")

    print("\n")

    print(f"                                               {P_positive_if_ill} * {P_ill}")
    print("                   = -------------------------------------------------------------------")
    print(f"                                         {P_healthy} * {P_positive_if_healthy:>.2f} + {P_ill} * {P_positive_if_ill}")

    print("\n")

    print(f"                   = {P_ill_if_positive:>.2f}")


main()

The main Function

Here we just create a few variables for the population and probabilities which are then passed to the two functions which calculate the probability of being ill if testing positive.

The calculate_without_bayes Function

In this function we calculate a few interim values from the specified population and probabilities, and them use them for our ultimate goal of finding the probability of being ill if testing positive.

All the values are then printed which gives an intuitive idea of the process, but this is a bit long-winded so in the next function we'll do it the "correct" way using Bayes' formula.

The calculate_with_bayes Function

Firstly we need to calculate a couple more probabilities from those we already know: the probability of being healthy and the probability of testing positive if healthy. After doing this we can go ahead and implement Bayes' Theorem.

The rest of the function is taken up with printing out the results, including the interim calculations.

Let's Run It

Now we can go ahead and run the program with this command:

Run

python3.8 bayes.py

The output is

Program Output

Calculate P(ill | positive) without Bayes' Theorem
==================================================

Population:                1000000
Percent ill:               1.0%
Number ill:                10000
Number healthy:            990000
P(positive if ill):        0.99
P(negative if healthy):    0.99
Ill and test positive:     9900
Healthy but test positive: 9900
P(ill | positive):         0.50

Calculate P(ill | positive) with Bayes' Theorem
===============================================

P(ill):                 0.01
P(healthy):             0.99
P(positive if ill):     0.99
P(positive if healthy): 0.01

                                        P(positive if ill) * P(ill)
P(ill | positive) = -------------------------------------------------------------------
                      P(healthy) * P(positive if healthy) + P(ill) * P(positive if ill)


                                               0.99 * 0.01
                  = -------------------------------------------------------------------
                                         0.99 * 0.01 + 0.01 * 0.99


                  = 0.50

You might want to experiment with different sensitivities and specificities. The 99% ones I used are actually very high and many real world medical tests are much less accurate, which as you have probably realised means that the chances of a person having a disease if they test positive can be very low.

So does this mean that mass testing or screening of patients even if they have no symptoms is too inaccurate to be worthwhile? This is really a matter of opinion, but if you hear of or have personal experience of misdiagnoses then please bear in mind Thomas Bayes and his theorem.