# Normal Distributions in Python One of the most useful bits of number-crunching you can do is to calculate the probability distribution of a set of data in the earnest hope that it will be a reasonable fit for one of the recognised distributions such as the normal distribution. In this project I will write a Python class to calculate the normal distribution for a given data set.

## The Normal Distribution

Most people will have an intuitive understanding of the normal distribution even if they have no formal education in statistics. As an example let's think about people's heights: a few people are very short, rather more are fairly short, a lot are around average height, and then the numbers drop back down as we start to consider heights above average, eventually reaching zero.

If we draw a graph of heights and their respective frequencies we might expect to see something like this. (Note this is just fictitious data I have created for this project.) The bars show actual data, whereas the red circles show the theoretical normal distribution for this particular data. As you can see there is a very good but not perfect fit, for example the frequency of 66 inches is 21, but the theoretical frequency assuming the data fits a normal distribution is 20.1516.

So how do we calculate, for example, 20.1516 from our data? Short answer: using the formula below, borrowed from Wikipedia. The variables used in this formula are:

• x - the individual data value, eg. 21
• μ - the arithmetic mean of the data
• σ - the standard deviation of the data The formula looks slightly intimidating but later on I'll implement it in Python to calculate the normal values for each item of data. In fact that is at the core of this whole project - to iterate a set of data and apply the above formula to calculate the values plotted as red circles.

## The Plan

Before starting to code let's consider our requirements and how we are going to implement them. Basically what we want to end up with is a list of dictionaries with an item for each different value in the original data, and attributes for the following:

• The value (eg. height)
• The actual frequency of that value
• The actual probability of that value
• The theoretical frequency according to the normal distribution
• The theoretical probability according to the normal distribution

For simplicity I am restricting the data to integers, each of which will be counted separately when calculating frequencies. A refinement would be to allow for real values and to group these into class intervals for the purposes of calculating frequencies. A further refinement could be to extend the code to calculate other types of probability distribution.

## Coding

Create a new folder and within it create the following empty files. You can download the source code as a zip or clone/download it from Github if you prefer.

• data.py
• normaldistribution.py
• main.py

In the real world we would be getting our data from a database or some kind of file but for the purposes of this project I have written a function to populate an array with hard-coded sample data. In a couple of earlier projects on this site I have used randomly generated sample data but here we need data which roughly follows a normal distribution so have resorted to hard coding it. I have on my list of future projects a program to generate random data following certain probability distributions but for now this will have to suffice.

Open data.py and enter the function body.

data.py

def generate_data():

"""
Create a list of data roughly following a normal distribution.
"""

data = [66,64,65,68,69,71,69,67,66,64,66,60,65,68,66,66,67,66,60,
63,69,64,66,68,65,65,71,65,70,67,66,69,67,68,70,60,66,63,
67,67,68,68,63,72,69,71,63,69,67,64,67,71,68,62,64,66,68,
65,65,66,62,67,63,67,64,67,67,64,69,69,64,62,63,63,70,65,
68,66,67,66,64,62,68,71,63,65,67,66,69,66,67,70,70,66,69,
66,64,62,64,64,61,67,65,67,68,69,71,65,61,71,69,67,61,66,
61,66,65,61,65,70,63,65,64,70,67,62,66,68,66,65,68,65,65,
70,68,68,62,63]

return data

The generate_data function simply creates and returns a list of hard-coded sample data. Now open normaldistribution.py and enter or paste the following. I'll run through it in detail afterwards.

normaldistribution.py

import math

class NormalDistribution(object):

"""
Has list attributes for data and the data's probability distribution, which
is calculated by calling the calculate_prob_dist method. Also attributes for totals.
"""

def __init__(self):

"""
Create attributes with default values.
"""

self.data = []
self.probability_distribution = []

self.total_frequency = 0
self.total_probability = 0.0
self.total_normal_probability = 0.0
self.total_normal_frequency = 0.0

def calculate_prob_dist(self):

"""
Iterate the data, adding dictionary to probability distribution list for new values
or incrementing frequencies of existing ones.
Then calculates actual probabilities, and normal probabilities and frequencies.
"""

self.total_frequency = len(self.data)

data_total = 0
sum_of_squares = 0

for item in self.data:

# try to find current value in probability distribution list
index = self.__index_of(item, self.probability_distribution)

if index < 0:

# if not present add to list
self.probability_distribution.append({"value": item, "frequency": 1, "probability": 0, "normal_probability": 0, "normal_frequency": 0})

else:

# if present increment frequency
self.probability_distribution[index]["frequency"] += 1

data_total += item
sum_of_squares += (item ** 2)

# calculate mean, variance and standard deviation
mean = data_total / self.total_frequency
variance = (sum_of_squares - ((data_total ** 2) / self.total_frequency)) / self.total_frequency
stddev = variance ** 0.5

# calculate probabilities of each unique value
for pd in self.probability_distribution:

pd["probability"] = pd["frequency"] / len(self.data)

pd["normal_probability"] = ((1.0 / (stddev * math.sqrt(2.0 * math.pi))) * (pow(math.e, -1.0 * ((pow((pd["value"] - mean), 2.0)) / ( variance * 2.0)))))

pd["normal_frequency"] = pd["normal_probability"] * len(self.data)

self.total_probability += pd["probability"]
self.total_normal_probability += pd["normal_probability"]
self.total_normal_frequency += pd["normal_frequency"]

self.probability_distribution.sort(key = lambda k: k['value'])

def print_prob_dist(self):

"""
Print details of each item in the probability distribution and totals.
"""

print("Value | Probability | Normal Prob | Freq | Normal Freq\n------------------------------------------------------");

for item in self.probability_distribution:

print("{:5d} |{:12.4f} |{:12.4f} |{:5d} |{:12.4f}".format(item["value"], item["probability"], item["normal_probability"], item["frequency"], item["normal_frequency"]))

print("------------------------------------------------------")

print("      |{:12.4f} |{:12.6f} |{:5.0f} |{:12.4f}".format(self.total_probability, self.total_normal_probability, self.total_frequency, self.total_normal_frequency))

print("------------------------------------------------------")

def __index_of(self, n, probdist):

"""
Return index of n if already in probdist, or -1 if not.
"""

for i in range(0, len(probdist)):

if probdist[i]["value"] == n:

return i

return -1

The __init__ function simply sets up a few attributes with default values. The data attribute is to be set by calling code, the rest being calculated by the class.

Now we move on to calculate_prob_dist, the central method of the whole project.

After setting total_frequency and creating a couple of variables for running totals we loop through the data. For each value we need to add a new dictionary to the probability_distribution list with a frequency of 1 if the value is not already in the list, or just increment the relevant frequency if it is. Finding the index, if any, is farmed out to a separate function called __index_of. Also in the loop we increment data_total and sum_of_squares for later use.

We can now calculate the arithmetic mean, variance and standard deviation, before iterating the probability_distribution list to fill in the missing values.

For each discrete value we calculate the following three values:

• probability - simply the frequency divided by size

• normal_probability - this uses the equation shown above but repeated here • normal_frequency - this is the normal probability multiplied by the number of data items

We also increment the three totals, total_probability, total_normal_probability and total_normal_frequency.

At the moment the probability_distribution list is in order of the first occurence of each value so finally we sort it. Note the key argument of the sort method is a lambda returning 'value'.

Now let's look at the print_prob_dist function. This prints out a heading and then iterates the probability_distribution list, printing out the contents in a table before printing out the totals.

Finally comes the __index_of function. This iterates the specified list to find the number n and returns its index or, if the loop ends without finding it, it returns -1.

Now the class and function to generate data are finished so we can try them out. Open main.py and type or paste this code.

main.py

import data
import normaldistribution

def main():

"""
Demonstrate the NormalDistribution class with a dataset
created using the data module.
"""

print("-----------------------")
print("| codedrome.com       |")
print("| Normal Distribution |")
print("-----------------------\n")

d = data.generate_data()

nd = normaldistribution.NormalDistribution()

nd.data = d

nd.calculate_prob_dist()

nd.print_prob_dist()

main()

This is all quite straightforward - we just grab a list of sample data, create a NormalDistribution object and set its data attribute. We then call its calculate_prob_dist and print_prob_dist methods.

We have now finished coding so run the program with the following in your terminal.

Running

python3.7 main.py

This is the program output. The normal probability and frequency totals don't quite add up to 1 and 138 respectively due to roundings.

Program Output

-----------------------
| codedrome.com       |
| Normal Distribution |
-----------------------

Value | Probability | Normal Prob | Freq | Normal Freq
------------------------------------------------------
60 |      0.0217 |      0.0120 |    3 |      1.6597
61 |      0.0362 |      0.0255 |    5 |      3.5189
62 |      0.0507 |      0.0473 |    7 |      6.5242
63 |      0.0725 |      0.0766 |   10 |     10.5773
64 |      0.0942 |      0.1087 |   13 |     14.9952
65 |      0.1232 |      0.1347 |   17 |     18.5894
66 |      0.1522 |      0.1460 |   21 |     20.1516
67 |      0.1377 |      0.1384 |   19 |     19.1024
68 |      0.1087 |      0.1147 |   15 |     15.8342
69 |      0.0870 |      0.0832 |   12 |     11.4773
70 |      0.0580 |      0.0527 |    8 |      7.2747
71 |      0.0507 |      0.0292 |    7 |      4.0320
72 |      0.0072 |      0.0142 |    1 |      1.9542
------------------------------------------------------
|      1.0000 |    0.983269 |  138 |    135.6912
------------------------------------------------------