This project shows how to calculate simple statistics from a list of numbers. It covers the most basic areas of classical statistics, which might seem a bit old-fashioned in an era of big data and machine learning algorithms, but even the most complex data science investigations are likely to start out with a few simple statistics.
I don't want to mix the discussion of the statistics themselves with the discussion of the code, so I will run through the statistics first. If you understand this material already, just skip straight to Coding.
The statistics we'll calculate include the following (the code also reports the count, total, minimum, median, maximum, overall range and a coefficient of skewness):
- Arithmetic Mean
- Lower Quartile
- Upper Quartile
- Inter-Quartile Range
- Standard Deviation of Population
- Standard Deviation of Sample
- Variance of Population
- Variance of Sample
Many of these are self-explanatory but a few might not be familiar so I will give a brief overview of those.
Quartiles and Medians
I am sure everyone understands the arithmetic mean: it is what most people think of as the "average" and is just the total of all the numbers divided by the count. It is one of several values known as "measures of central tendency" which are intended to give an idea of a central or typical value. However, if the data is not evenly distributed the mean can give a distorted impression, in which case the median gives a better idea of a typical or central value. The median is quite simply the middle value when the data is sorted into order.
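To make the distortion concrete, here is a minimal sketch using made-up salary figures. The numbers and variable names are purely illustrative and are not part of the project code. One extreme value drags the mean well away from the bulk of the data, while the median stays put.

```python
# Hypothetical data: four typical values plus one extreme one.
salaries = [21000, 23000, 24000, 26000, 250000]

# Arithmetic mean: the total divided by the count.
mean = sum(salaries) / len(salaries)

# Median: with an odd count this is simply the middle value
# of the sorted data.
ordered = sorted(salaries)
median = ordered[len(ordered) // 2]

print("Mean:  ", mean)    # pulled upwards by the single large value
print("Median:", median)  # still sits among the typical values
```

Here the mean comes out at 68800.0 even though four of the five values are below 27000, whereas the median is 24000.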
The quartiles are examples of percentiles, i.e. the values found a certain percentage of the way through the sorted data. The lower and upper quartiles are the values 25% and 75% along respectively. Their purpose is to complement or even replace the minimum and maximum values, which might be what are known as "outliers", i.e. values significantly lower or higher than the main body of the data which therefore give a misleading impression of the range. I have used the terms lower quartile, median and upper quartile, but the terms 1st, 2nd and 3rd quartile are also widely used. Quartiles are probably the most widely used percentiles but any percentage can be used, and the 10th and 90th percentiles (the first and last deciles) are also common.
Calculating quartiles and the median sounds simple: just make sure the data is sorted and pick the relevant values from it. However, it's not quite so simple, because if there is an even number of values in the data there is no single middle value. In this case we take the mean of the two central values. Irrespective of whether the overall count is odd or even, the counts of the lower and upper halves may themselves be odd or even, so the quartiles require the same approach as the median.
In order to show how the quartiles and median are calculated in each of the four possible combinations, I will show some sample data.
Firstly, the overall count is odd, but the count of the two halves used to calculate the quartiles is even. The blue cells show the quartiles or the cells averaged to calculate the quartiles. The green cells are the median or cells averaged to calculate the median. Note that if the count is odd we ignore the median when calculating the quartiles.
Secondly, the overall count is odd, but this time the count of the two halves used to calculate the quartiles is also odd so we just pick the middle values for the quartiles.
Next, the count is even and the count of the two halves used to calculate the quartiles is also even. The two median values are included in the values used to calculate the quartiles.
Lastly, the count is even but the count of the two halves used to calculate the quartiles is odd.
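The four cases can be summarised in a short sketch. This is not the project's own code, just an illustration of the same rule under my own assumptions: split the sorted data into a lower and an upper half (leaving out the median itself when the count is odd) and take the median of each half. The function names `median_of_sorted` and `quartiles` are hypothetical, and the sketch assumes at least two values.

```python
def median_of_sorted(values):
    """Median of a list that is already sorted."""
    count = len(values)
    middle = count // 2
    if count % 2 == 0:
        # Even count: mean of the two central values.
        return (values[middle - 1] + values[middle]) / 2
    # Odd count: the single middle value.
    return values[middle]


def quartiles(data):
    """Return (lower quartile, median, upper quartile).

    Assumes data holds at least two values.
    """
    ordered = sorted(data)
    count = len(ordered)
    half = count // 2
    lower_half = ordered[:half]
    # For an odd count the median itself is left out of both halves.
    upper_half = ordered[half + 1:] if count % 2 else ordered[half:]
    return (median_of_sorted(lower_half),
            median_of_sorted(ordered),
            median_of_sorted(upper_half))


# One example for each of the four cases.
print(quartiles(range(1, 10)))   # odd count (9), even halves (4)
print(quartiles(range(1, 12)))   # odd count (11), odd halves (5)
print(quartiles(range(1, 13)))   # even count (12), even halves (6)
print(quartiles(range(1, 11)))   # even count (10), odd halves (5)
```

Slicing the data into halves is easier to read than juggling indexes, at the cost of creating two short temporary lists.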
Measures of central tendency give no impression of how widely the data is spread so we will also calculate the range (maximum - minimum) and the inter-quartile range. The latter of course is useful in eliminating the misleading effects of any outliers. Such values are known as measures of spread and another is the standard deviation which deserves a section to itself.
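As a rough illustration of why the inter-quartile range copes better with outliers than the overall range, the sketch below compares two made-up data sets which differ only in their largest value. The helper `spread` is hypothetical and deliberately simplified: it assumes the count is a multiple of four, so that each half has an even count.

```python
def spread(data):
    """Return (overall range, inter-quartile range).

    Simplifying assumption: the count is a multiple of four, so each
    quartile is the mean of the two central values of its half.
    """
    ordered = sorted(data)
    half = len(ordered) // 2
    quarter = half // 2
    lower_quartile = (ordered[quarter - 1] + ordered[quarter]) / 2
    upper_quartile = (ordered[half + quarter - 1] + ordered[half + quarter]) / 2
    return ordered[-1] - ordered[0], upper_quartile - lower_quartile


typical = [10, 12, 13, 15, 16, 18, 19, 21]
with_outlier = [10, 12, 13, 15, 16, 18, 19, 210]

print(spread(typical))       # overall range and inter-quartile range
print(spread(with_outlier))  # only the overall range changes
```

The overall range jumps from 11 to 200 when the outlier is introduced, while the inter-quartile range stays at 6.0 in both cases.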
Variances and Standard Deviations
The standard deviation can be thought of as the average (i.e. mean) amount by which values differ from the mean. That is not a precise definition but it gives an impression of what it signifies. Of course the actual mean of the differences between each value and the mean would be 0, as the positive and negative differences cancel out. To get round this the variance is calculated as the mean of the squares of those differences, and the square root of the variance is taken to obtain the standard deviation. The sample versions of both statistics divide by the count minus 1 rather than by the count.
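The steps just described can be sketched as follows. The data is made up and chosen so that the results are round numbers, and the variable names are my own rather than those used in the project code.

```python
import math

# Hypothetical data chosen so that the results are round numbers.
data = [2, 4, 4, 4, 5, 5, 7, 9]
count = len(data)
mean = sum(data) / count

# The plain differences from the mean cancel each other out...
differences = [x - mean for x in data]
sum_of_differences = sum(differences)

# ...so the differences are squared before being averaged.
sum_of_squared_differences = sum(d ** 2 for d in differences)

# Population versions divide by the count, sample versions by count - 1.
variance_population = sum_of_squared_differences / count
variance_sample = sum_of_squared_differences / (count - 1)

standard_deviation_population = math.sqrt(variance_population)
standard_deviation_sample = math.sqrt(variance_sample)

print("Sum of differences:", sum_of_differences)
print("Population variance and standard deviation:",
      variance_population, standard_deviation_population)
print("Sample variance and standard deviation:",
      variance_sample, standard_deviation_sample)
```

For this data the mean is 5, the population variance is 4 and the population standard deviation is 2. The sample figures are slightly larger because they divide by 7 instead of 8.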
The standard deviation is a useful indicator in its own right but along with the variance it is also used to calculate other statistics such as various coefficients of skewness, as we shall see later, as well as in correlations and regressions and many other applications.
If you want a detailed description of standard deviation take a look at the Wikipedia article.
Another statistic we will calculate is a coefficient of skewness. I say "a" rather than "the" as there are plenty to choose from. The one I am using is Pearson's second skewness coefficient (median skewness), which is three times the difference between the mean and the median, divided by the standard deviation. Again you might like to read the Wikipedia article for full details, but briefly this gives an indicator of how asymmetric the data is around the median.
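As a minimal sketch of that coefficient, the hypothetical function below (the name `pearson_median_skewness` is my own) uses the population standard deviation, which is an assumption on my part, and is applied to two small made-up data sets.

```python
import math


def pearson_median_skewness(data):
    """Pearson's second skewness coefficient:
    3 * (mean - median) / standard deviation.

    Assumption: the population standard deviation is used, and the
    values are not all identical (which would mean dividing by zero).
    """
    ordered = sorted(data)
    count = len(ordered)
    mean = sum(ordered) / count

    middle = count // 2
    if count % 2 == 0:
        median = (ordered[middle - 1] + ordered[middle]) / 2
    else:
        median = ordered[middle]

    variance = sum((x - mean) ** 2 for x in ordered) / count
    return 3 * (mean - median) / math.sqrt(variance)


print(pearson_median_skewness([1, 2, 3, 4, 5]))           # symmetric data
print(pearson_median_skewness([2, 4, 4, 4, 5, 5, 7, 9]))  # longer tail to the right
```

Symmetric data gives a coefficient of 0, while a positive value indicates that the mean is above the median, i.e. the data has a longer tail of high values.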
Coding
At the heart of this project is a class which holds each of the statistics we will be calculating, along with a method to actually calculate them.
Create a new folder somewhere and then create two empty files in it, statistics.py and main.py. You can download the source code as a zip or clone/download from GitHub if you prefer.
Let's start by looking at statistics.py.
    import math


    class Statistics():

        """
        Class has a single attribute for data, and a number of attributes
        for various statistics on that data.
        After creating instance, set data attribute and call calculate
        to populate statistics attributes.
        A single instance can therefore be used repeatedly on various data sets.
        Also has methods to print data and statistics.
        """

        def __init__(self):

            """
            Simply create a set of attributes with default values.
            """

            self.data = []
            self.count = 0
            self.total = 0
            self.arithmetic_mean = 0
            self.minimum = 0
            self.lower_quartile = 0
            self.median = 0
            self.upper_quartile = 0
            self.maximum = 0
            self.overall_range = 0
            self.inter_quartile_range = 0
            self.standard_deviation_population = 0
            self.standard_deviation_sample = 0
            self.variance_population = 0
            self.variance_sample = 0
            self.skew = 0

        def output_data(self):

            """
            Iterate and print data
            """

            print("Number of items: " + str(len(self.data)))

            for i, v in enumerate(self.data):
                print(str(i) + "\t" + str(v))

        def output_statistics(self):

            """
            Print statistics in a neat format.
            """

            print("Count:                         " + str(self.count))
            print("Total:                         " + str(self.total))
            print("Arithmetic mean:               " + str(self.arithmetic_mean))
            print("Minimum:                       " + str(self.minimum))
            print("Lower quartile:                " + str(self.lower_quartile))
            print("Median:                        " + str(self.median))
            print("Upper quartile:                " + str(self.upper_quartile))
            print("Maximum:                       " + str(self.maximum))
            print("Overall range:                 " + str(self.overall_range))
            print("Inter quartile range:          " + str(self.inter_quartile_range))
            print("Standard deviation population: " + str(self.standard_deviation_population))
            print("Standard deviation sample:     " + str(self.standard_deviation_sample))
            print("Variance population:           " + str(self.variance_population))
            print("Variance sample:               " + str(self.variance_sample))
            print("Skew:                          " + str(self.skew))

        def __is_even(self, n):
            return n % 2 == 0

        def calculate(self):

            """
            Calculate statistics from data.
            Individual calculations are described in comments.
            """

            sum_of_squares = 0
            lower_quartile_index_1 = 0
            lower_quartile_index_2 = 0

            # at least two values are needed, as the sample variance divides by count - 1
            if len(self.data) < 2:
                raise ValueError("data must contain at least two values")

            # data needs to be sorted for median etc
            self.data.sort()

            # count is just the size of the data set
            self.count = len(self.data)

            # initialize total to 0, and then iterate data
            # calculating total and sum of squares
            self.total = 0
            for i in self.data:
                self.total += i
                sum_of_squares += i ** 2

            # the arithmetic mean is simply the total divided by the count
            self.arithmetic_mean = self.total / self.count

            # method of calculating median and quartiles is different for odd and even count
            # integer division (//) is used throughout as list indexes must be integers
            if self.__is_even(self.count):

                self.median = (self.data[(self.count // 2) - 1] + self.data[self.count // 2]) / 2

                if self.__is_even(self.count // 2):  # even / even

                    lower_quartile_index_1 = (self.count // 2) // 2
                    lower_quartile_index_2 = lower_quartile_index_1 - 1

                    self.lower_quartile = (self.data[lower_quartile_index_1] + self.data[lower_quartile_index_2]) / 2
                    self.upper_quartile = (self.data[self.count - 1 - lower_quartile_index_1] + self.data[self.count - 1 - lower_quartile_index_2]) / 2

                else:  # even / odd

                    lower_quartile_index_1 = ((self.count // 2) - 1) // 2

                    self.lower_quartile = self.data[lower_quartile_index_1]
                    self.upper_quartile = self.data[self.count - 1 - lower_quartile_index_1]

            else:

                self.median = self.data[((self.count + 1) // 2) - 1]

                if self.__is_even((self.count - 1) // 2):  # odd / even

                    lower_quartile_index_1 = ((self.count - 1) // 2) // 2
                    lower_quartile_index_2 = lower_quartile_index_1 - 1

                    self.lower_quartile = (self.data[lower_quartile_index_1] + self.data[lower_quartile_index_2]) / 2
                    self.upper_quartile = (self.data[self.count - 1 - lower_quartile_index_1] + self.data[self.count - 1 - lower_quartile_index_2]) / 2

                else:  # odd / odd

                    lower_quartile_index_1 = (((self.count - 1) // 2) - 1) // 2

                    self.lower_quartile = self.data[lower_quartile_index_1]
                    self.upper_quartile = self.data[self.count - 1 - lower_quartile_index_1]

            # the data is sorted so the minimum and maximum are the first and last values
            self.minimum = self.data[0]
            self.maximum = self.data[self.count - 1]

            # the range is difference between the minimum and the maximum
            self.overall_range = self.maximum - self.minimum

            # and the inter-quartile range is the difference between the upper and lower quartiles
            self.inter_quartile_range = self.upper_quartile - self.lower_quartile

            # this is the formula for the POPULATION variance
            self.variance_population = (sum_of_squares - ((self.total ** 2) / self.count)) / self.count

            # the standard deviation is the square root of the variance
            self.standard_deviation_population = math.sqrt(self.variance_population)

            # the formula for the sample variance is slightly different in that it uses count - 1
            self.variance_sample = (sum_of_squares - ((self.total ** 2) / self.count)) / (self.count - 1)

            # the sample standard deviation is the square root of the sample variance
            self.standard_deviation_sample = math.sqrt(self.variance_sample)

            # this is Pearson's second skewness coefficient, one of many measures of skewness
            # (it cannot be calculated if every value is identical, as the standard deviation is then 0)
            self.skew = (3.0 * (self.arithmetic_mean - self.median)) / self.standard_deviation_population
The __init__ method adds an empty list to the object and then an attribute for each statistic we will be calculating, each initialised to 0.
The output_data method simply iterates and prints the list of data.
Next we have output_statistics which prints each statistic on a separate line. The private __is_even method simply checks for evenness using the modulo operator %.
Lastly comes the core calculate method. Rather than trying to describe the numerous calculations separately I have added comments throughout the code.
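The variance comments in calculate use a shortcut formula based on the total and the sum of squares, rather than subtracting the mean from every value. As a brief sketch of why the two approaches agree, here is the algebra in my own notation (not taken from the code): N is the count, x_i are the values and mu is the arithmetic mean, so the sum of x_i corresponds to self.total and the sum of x_i squared to sum_of_squares.

```latex
\begin{aligned}
\sum_{i=1}^{N}(x_i-\mu)^2
  &= \sum_{i=1}^{N}x_i^2 \;-\; 2\mu\sum_{i=1}^{N}x_i \;+\; N\mu^2 \\
  &= \sum_{i=1}^{N}x_i^2 \;-\; \frac{\left(\sum_{i=1}^{N}x_i\right)^2}{N}
  \qquad\text{since } \mu=\frac{1}{N}\sum_{i=1}^{N}x_i \\[6pt]
\sigma^2 &= \frac{1}{N}\left(\sum_{i=1}^{N}x_i^2-\frac{\left(\sum_{i=1}^{N}x_i\right)^2}{N}\right)
  \qquad
s^2 = \frac{1}{N-1}\left(\sum_{i=1}^{N}x_i^2-\frac{\left(\sum_{i=1}^{N}x_i\right)^2}{N}\right)
\end{aligned}
```

The practical benefit is that the total and the sum of squares can both be accumulated in a single pass through the data.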
Now we can move on to main.py where we put our code to use.
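    # this is our own statistics.py: when main.py is run from this folder it
    # takes precedence over the standard library module of the same name
    import statistics
    import random


    def main():

        """
        Create a set of random data and use the statistics class
        to calculate and print statistics.
        """

        print("-----------------")
        print("| codedrome.com |")
        print("| Statistics    |")
        print("-----------------\n")

        s = statistics.Statistics()

        # build a list of 12 random integers between 1 and 128 inclusive
        data = []
        for i in range(0, 12):
            data.append(random.randint(1, 128))

        s.data = data

        s.calculate()

        s.output_data()

        s.output_statistics()


    main()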
Firstly a Statistics object is created and then a list for the data. Random numbers are added to the list, which is then set as the Statistics object's data. Then all we need to do is call the calculate, output_data and output_statistics methods.
Run main.py with your Python 3 interpreter from the folder containing both files, which will give you something like this...
    -----------------
    | codedrome.com |
    | Statistics    |
    -----------------

    Number of items: 12
    0       6
    1       29
    2       31
    3       45
    4       55
    5       71
    6       78
    7       80
    8       92
    9       108
    10      120
    11      126
    Count:                         12
    Total:                         841
    Arithmetic mean:               70.08333333333333
    Minimum:                       6
    Lower quartile:                38.0
    Median:                        74.5
    Upper quartile:                100.0
    Maximum:                       126
    Overall range:                 120
    Inter quartile range:          62.0
    Standard deviation population: 36.37411701868361
    Standard deviation sample:     37.991526168424194
    Variance population:           1323.0763888888887
    Variance sample:               1443.3560606060603
    Skew:                          -0.3642700108209949
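If you want to convince yourself that the figures are right, the short sketch below recalculates several of them directly from the twelve values listed in the output above. It is a standalone check with my own variable names and does not use the Statistics class.

```python
import math

# The twelve sorted values from the sample output above.
data = [6, 29, 31, 45, 55, 71, 78, 80, 92, 108, 120, 126]

count = len(data)
total = sum(data)
sum_of_squares = sum(x ** 2 for x in data)

mean = total / count

# Even count, so the median is the mean of the two central values.
median = (data[count // 2 - 1] + data[count // 2]) / 2

# Same shortcut formulas as used for the population and sample variances.
variance_population = (sum_of_squares - total ** 2 / count) / count
variance_sample = (sum_of_squares - total ** 2 / count) / (count - 1)

# Pearson's second skewness coefficient.
skew = 3 * (mean - median) / math.sqrt(variance_population)

print("Total:              ", total)
print("Arithmetic mean:    ", mean)
print("Median:             ", median)
print("Variance population:", variance_population)
print("Variance sample:    ", variance_sample)
print("Skew:               ", skew)
```

The printed values should agree with the corresponding lines of the sample output, allowing for floating point rounding in the last digit or two.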