Correlation is the process of quantifying the relationship between two sets of values, and in this post I will be writing code in Python to calculate possibly the best-known type of correlation - the Pearson Correlation Coefficient.
An Overview of Correlation
Consider the following three data sets and their graphs, or, more accurately, scatter plots. The x-axis represents an independent variable, and the y-axis represents a dependent variable.
Data Set 1
In this data set all the points are on a straight line - there is clearly a relationship between them.
Data Set 2
Here the data is a bit more scattered but there is still a clear relationship, subject to some minor fluctuations.
Data Set 3
In this last set of data the points appear completely random, with little or no relationship between x and corresponding y values.
Quantifying the Relationships
Relationships within bivariate data such as the examples above can be precisely quantified by calculating the Pearson Correlation Coefficient, usually referred to as ρ, the Greek letter rho. This gives us a number between -1 and 1, -1 being a perfect negative correlation and 1 being a perfect positive correlation. Values in between represent various degrees of imperfect correlation, with 0 being none at all.
As rough estimates, in Data Set 1 we would expect ρ to be 1, in Data Set 2 we would expect ρ to be around 0.9, and in Data Set 3 ρ would be close to 0.
Full details of the Pearson Correlation Coefficient are on Wikipedia. In short the formula is
Formula for Pearson's Correlation Coefficient
ρX,Y = cov(X,Y) / (σX * σY)
where cov is the covariance of the data paired data, and σ is the standard deviation of the data. These will be explored as we get to them.
There is a famous phrase in statistics: "correlation does not imply causation". A strong correlation between two variables might just be by chance, even where common sense might make you believe the association could be genuine. A strong correlation should only ever be used as a basis for further investigation, particularly in this age of Big Data number crunching where seemingly unrelated data sets can be compared quickly and cheaply.
Create a new folder and then create the following empty files:
You can also download the source code from the Downloads page or clone/download from Github if you prefer.
Source Code Links
import math def pearson_correlation(independent, dependent): """ Implements Pearson's Correlation, using several utility functions to calculate intermediate values before calculating and returning rho. """ # covariance independent_mean = _arithmetic_mean(independent) dependent_mean = _arithmetic_mean(dependent) products_mean = _mean_of_products(independent, dependent) covariance = products_mean - (independent_mean * dependent_mean) # standard deviations of independent values independent_standard_deviation = _standard_deviation(independent) # standard deviations of dependent values dependent_standard_deviation = _standard_deviation(dependent) # Pearson Correlation Coefficient rho = covariance / (independent_standard_deviation * dependent_standard_deviation) return rho def _arithmetic_mean(data): """ Total / count: the everyday meaning of "average" """ total = 0 for i in data: total+= i return total / len(data) def _mean_of_products(data1, data2): """ The mean of the products of the corresponding values of bivariate data """ total = 0 for i in range(0, len(data1)): total += (data1[i] * data2[i]) return total / len(data1) def _standard_deviation(data): """ A measure of how individual values typically differ from the mean_of_data. The square root of the variance. """ squares =  for i in data: squares.append(i ** 2) mean_of_squares = _arithmetic_mean(squares) mean_of_data = _arithmetic_mean(data) square_of_mean = mean_of_data ** 2 variance = mean_of_squares - square_of_mean std_dev = math.sqrt(variance) return std_dev
Most of the work done by pearson_correlation is delegated to other functions, which calculate various intermediate values needed to actually calculate the correlation coefficient. Firstly we need the arithmetic means of both sets of data, and the mean of the products of the twin datasets. Next we calculate the covariance.
Covariance is a rather complex concept which you might like to look into on Wikipedia or elsewhere, but an easy way to remember how to calculate it is mean of products minus product of means. The mean of products is the mean of each pair of values multiplied together, and the product of means is simply the means of the two sets of data multiplied together.
Next we need the standard deviations of the two datasets. Standard deviation can be thought of roughly as the average amount by which the individual values vary from the mean. (It is not that exactly, which is a different figure known as mean of absolute deviation or MAD, but the concepts are similar.) The standard deviation is the square root of the variance which uses squared values to eliminate negative differences. An easy way to remember how to calculate a variance is mean of squares minus square of mean. The mean of squares is simple enough, just square all the values and calculate their mean; and the square of mean is self-explanatory. Once you have calculated the variance don't forget to calculate its square root to get the standard deviation.
Having calculated all the intermediate values we can then use them to implement the formula from above:
Formula for Pearson's Correlation Coefficient
ρX,Y = cov(X,Y) / (σX * σY)
The three private functions follow, which calculate the arithmetic mean, mean of products and standard deviation. These have wide uses beyond Pearson's Correlation and could alternatively be implemented as part of a general purpose statistical library.
Before we try out the above code we need a few pieces of data so I have created a function called populatedata in data.py which return a list populated with one of the three datasets show above.
def populatedata(independent, dependent, dataset): """ Populates two lists with one of three sets of bivariate data suitable for testing and demonstrating Pearson's Correlation """ del independent[:] del dependent[:] if dataset == 1: independent.extend([10,20,40,45,60,65,75,80]) dependent.extend([32,44,68,74,92,98,110,116]) return True elif dataset == 2: independent.extend([10,20,40,45,60,65,75,80]) dependent.extend([40,40,60,80,90,110,100,130]) return True elif dataset == 3: independent.extend([10,20,40,45,60,65,75,80]) dependent.extend([100,10,130,90,40,80,180,50]) return True else: return False
Now we can move on to our main function.
import data import pearsoncorrelation def main(): """ Iterates the three available sets of data and calls function to calculate rho. Then prints the data and Pearson Correlation Coefficient. """ print("-------------------------") print("| codedrome.com |") print("| Pearson's Correlation |") print("-------------------------\n") independent =  dependent =  for d in range(1, 4): if data.populatedata(independent, dependent, d) == True: rho = pearsoncorrelation.pearson_correlation(independent, dependent) print("Dataset %d\n---------" % d) print("Independent data: " + (str(independent))) print("Dependent data: " + (str(dependent))) print("Pearson Correlation Coefficient rho = %1.2f\n" % rho) else: print("Cannot populate Dataset %d" % d) main()
We firstly create a pair of lists and then loop through the three available datasets, populating the lists and calculating the correlation coefficient. Finally the data and correlation are printed.
We have now finished writing code so can run it with the following command:
The output is:
------------------------- | codedrome.com | | Pearson's Correlation | ------------------------- Dataset 1 --------- Independent data: [10, 20, 40, 45, 60, 65, 75, 80] Dependent data: [32, 44, 68, 74, 92, 98, 110, 116] Pearson Correlation Coefficient rho = 1.00 Dataset 2 --------- Independent data: [10, 20, 40, 45, 60, 65, 75, 80] Dependent data: [40, 40, 60, 80, 90, 110, 100, 130] Pearson Correlation Coefficient rho = 0.96 Dataset 3 --------- Independent data: [10, 20, 40, 45, 60, 65, 75, 80] Dependent data: [100, 10, 130, 90, 40, 80, 180, 50] Pearson Correlation Coefficient rho = 0.21
Earlier I gave rough estimates of ρ for the three data sets as 1, 0.9 and 0. As you can see they are actually 1.0, 0.96 and 0.21 to 2dp. You can change the accurancy printed if you wish.
Next Step: Regression
If you have discovered a strong correlation in your data your next step will probably be to put numbers to that relationship by working out the formula of the straight line which best fits the data. That process is called linear regression and will be the subject of a future post.