Pearson Correlation in Python

Correlation is the process of quantifying the relationship between two sets of values, and in this post I will be writing code in Python to calculate possibly the best-known type of correlation - the Pearson Correlation Coefficient.

An Overview of Correlation

Consider the following three data sets and their graphs, or, more accurately, scatter plots. The x-axis represents an independent variable, and the y-axis represents a dependent variable.

Data Set 1

xy
1032
2044
4068
4574
6092
6598
75110
80116

dataset1

In this data set all the points are on a straight line - there is clearly a relationship between them.

Data Set 2

xy
1040
2040
4060
4580
6090
65110
75100
80130

dataset2

Here the data is a bit more scattered but there is still a clear relationship, subject to some minor fluctuations.

Data Set 3

xy
10100
2010
40130
4590
6040
6580
75180
8050

dataset3

In this last set of data the points appear completely random, with little or no relationship between x and corresponding y values.

Quantifying the Relationships

Relationships within bivariate data such as the examples above can be precisely quantified by calculating the Pearson Correlation Coefficient, usually referred to as ρ, the Greek letter rho. This gives us a number between -1 and 1, -1 being a perfect negative correlation and 1 being a perfect positive correlation. Values in between represent various degrees of imperfect correlation, with 0 being none at all.

As rough estimates, in Data Set 1 we would expect ρ to be 1, in Data Set 2 we would expect ρ to be around 0.9, and in Data Set 3 ρ would be close to 0.

Full details of the Pearson Correlation Coefficient are on Wikipedia. In short the formula is

Formula for Pearson's Correlation Coefficient

ρX,Y = cov(X,Y) / (σX * σY)

where cov is the covariance of the data paired data, and σ is the standard deviation of the data. These will be explored as we get to them.

Caveat

There is a famous phrase in statistics: "correlation does not imply causation". A strong correlation between two variables might just be by chance, even where common sense might make you believe the association could be genuine. A strong correlation should only ever be used as a basis for further investigation, particularly in this age of Big Data number crunching where seemingly unrelated data sets can be compared quickly and cheaply.

Coding

Create a new folder and then create the following empty files:

  • pearsoncorrelation.py
  • data.py
  • main.py

You can also download the source code from the Downloads page or clone/download from Github if you prefer.

Source Code Links

ZIP File
GitHub

pearsoncorrelation.py

import math


def pearson_correlation(independent, dependent):

    """
    Implements Pearson's Correlation, using several utility functions to
    calculate intermediate values before calculating and returning rho.
    """

    # covariance
    independent_mean = _arithmetic_mean(independent)
    dependent_mean = _arithmetic_mean(dependent)
    products_mean = _mean_of_products(independent, dependent)
    covariance = products_mean - (independent_mean * dependent_mean)

    # standard deviations of independent values
    independent_standard_deviation = _standard_deviation(independent)

    # standard deviations of dependent values
    dependent_standard_deviation = _standard_deviation(dependent)

    # Pearson Correlation Coefficient
    rho = covariance / (independent_standard_deviation * dependent_standard_deviation)

    return rho


def  _arithmetic_mean(data):

    """
    Total / count: the everyday meaning of "average"
    """

    total = 0

    for i in data:
        total+= i

    return total / len(data)


def  _mean_of_products(data1, data2):

    """
    The mean of the products of the corresponding values of bivariate data
    """

    total = 0

    for i in range(0, len(data1)):
        total += (data1[i] * data2[i])

    return total / len(data1)


def  _standard_deviation(data):

    """
    A measure of how individual values typically differ from the mean_of_data.
    The square root of the variance.
    """

    squares = []

    for i in data:
        squares.append(i ** 2)

    mean_of_squares = _arithmetic_mean(squares)
    mean_of_data = _arithmetic_mean(data)
    square_of_mean = mean_of_data ** 2
    variance = mean_of_squares - square_of_mean
    std_dev = math.sqrt(variance)

    return std_dev

Most of the work done by pearson_correlation is delegated to other functions, which calculate various intermediate values needed to actually calculate the correlation coefficient. Firstly we need the arithmetic means of both sets of data, and the mean of the products of the twin datasets. Next we calculate the covariance.

Covariance is a rather complex concept which you might like to look into on Wikipedia or elsewhere, but an easy way to remember how to calculate it is mean of products minus product of means. The mean of products is the mean of each pair of values multiplied together, and the product of means is simply the means of the two sets of data multiplied together.

Next we need the standard deviations of the two datasets. Standard deviation can be thought of roughly as the average amount by which the individual values vary from the mean. (It is not that exactly, which is a different figure known as mean of absolute deviation or MAD, but the concepts are similar.) The standard deviation is the square root of the variance which uses squared values to eliminate negative differences. An easy way to remember how to calculate a variance is mean of squares minus square of mean. The mean of squares is simple enough, just square all the values and calculate their mean; and the square of mean is self-explanatory. Once you have calculated the variance don't forget to calculate its square root to get the standard deviation.

Having calculated all the intermediate values we can then use them to implement the formula from above:

Formula for Pearson's Correlation Coefficient

ρX,Y = cov(X,Y) / (σX * σY)

The three private functions follow, which calculate the arithmetic mean, mean of products and standard deviation. These have wide uses beyond Pearson's Correlation and could alternatively be implemented as part of a general purpose statistical library.

Before we try out the above code we need a few pieces of data so I have created a function called populatedata in data.py which return a list populated with one of the three datasets show above.

data.py

def populatedata(independent, dependent, dataset):

    """
    Populates two lists with one of three sets of bivariate data
    suitable for testing and demonstrating Pearson's Correlation
    """

    del independent[:]
    del dependent[:]

    if dataset == 1:
        independent.extend([10,20,40,45,60,65,75,80])
        dependent.extend([32,44,68,74,92,98,110,116])
        return True

    elif dataset == 2:
        independent.extend([10,20,40,45,60,65,75,80])
        dependent.extend([40,40,60,80,90,110,100,130])
        return True

    elif dataset == 3:
        independent.extend([10,20,40,45,60,65,75,80])
        dependent.extend([100,10,130,90,40,80,180,50])
        return True

    else:
        return False

Now we can move on to our main function.

main.py

import data
import pearsoncorrelation


def main():

    """
    Iterates the three available sets of data
    and calls function to calculate rho.
    Then prints the data and Pearson Correlation Coefficient.
    """

    print("-------------------------")
    print("| codedrome.com         |")
    print("| Pearson's Correlation |")
    print("-------------------------\n")

    independent = []
    dependent = []

    for d in range(1, 4):

        if data.populatedata(independent, dependent, d) == True:

            rho = pearsoncorrelation.pearson_correlation(independent, dependent)

            print("Dataset %d\n---------" % d)
            print("Independent data: " + (str(independent)))
            print("Dependent data:   " + (str(dependent)))
            print("Pearson Correlation Coefficient rho = %1.2f\n" % rho)
        else:
            print("Cannot populate Dataset %d" % d)


main()

We firstly create a pair of lists and then loop through the three available datasets, populating the lists and calculating the correlation coefficient. Finally the data and correlation are printed.

We have now finished writing code so can run it with the following command:

Program Output

python3.7 main.py

The output is:

Program Output

-------------------------
| codedrome.com         |
| Pearson's Correlation |
-------------------------

Dataset 1
---------
Independent data: [10, 20, 40, 45, 60, 65, 75, 80]
Dependent data:   [32, 44, 68, 74, 92, 98, 110, 116]
Pearson Correlation Coefficient rho = 1.00

Dataset 2
---------
Independent data: [10, 20, 40, 45, 60, 65, 75, 80]
Dependent data:   [40, 40, 60, 80, 90, 110, 100, 130]
Pearson Correlation Coefficient rho = 0.96

Dataset 3
---------
Independent data: [10, 20, 40, 45, 60, 65, 75, 80]
Dependent data:   [100, 10, 130, 90, 40, 80, 180, 50]
Pearson Correlation Coefficient rho = 0.21

Earlier I gave rough estimates of ρ for the three data sets as 1, 0.9 and 0. As you can see they are actually 1.0, 0.96 and 0.21 to 2dp. You can change the accurancy printed if you wish.

Next Step: Regression

If you have discovered a strong correlation in your data your next step will probably be to put numbers to that relationship by working out the formula of the straight line which best fits the data. That process is called linear regression and will be the subject of a future post.