Pearson Correlation in C

dataset3

Correlation is the process of quantifying the relationship between two sets of values and in this post I will be writing code in C to calculate possibly the best-known type of correlation - the Pearson Correlation Coefficient.

An Overview of Correlation

Consider the following three data sets and their graphs, or, more accurately, scatter plots. The x-axis represents an independent variable, and the y-axis represents a dependent variable.

Data Set 1

xy
1032
2044
4068
4574
6092
6598
75110
80116

dataset1

In this data set all the points are on a straight line - there is clearly a relationship between them.

Data Set 2

xy
1040
2040
4060
4580
6090
65110
75100
80130

dataset2

Here the data is a bit more scattered but there is still a clear relationship, subject to some minor fluctuations.

Data Set 3

xy
10100
2010
40130
4590
6040
6580
75180
8050

dataset3

In this last set of data the points appear completely random, with little or no relationship between x and corresponding y values.

Quantifying the Relationships

Relationships within bivariate data such as the examples above can be precisely quantified by calculating the Pearson Correlation Coefficient, usually referred to as ρ, the Greek letter rho. This gives us a number between -1 and 1, -1 being a perfect negative correlation and 1 being a perfect positive correlation. Values in between represent various degrees of imperfect correlation, with 0 being none at all.

As rough estimates, in Data Set 1 we would expect ρ to be 1, in Data Set 2 we would expect ρ to be around 0.9, and in Data Set 3 ρ would be close to 0.

Full details of the Pearson Correlation Coefficient are on Wikipedia. In short the formula is

Formula for Pearson Correlation Coefficient

ρX,Y = cov(X,Y) / (σX * σY)

where cov is the covariance of the data paired data, and σ is the standard deviation of the data. These will be explored as we get to them.

Caveat

There is a famous phrase in statistics: "correlation does not imply causation". A strong correlation between two variables might just be by chance, even where common sense might make you believe the association could be genuine. A strong correlation should only ever be used as a basis for further investigation, particularly in this age of Big Data number crunching where seemingly unrelated data sets can be compared quickly and cheaply.

Coding

For this project we will write a function to create sample arrays of data corresponding to that shown above, and then write the code necessary to calculate ρ for each of them.

This project consists of the following files which you can download as a zip, or clone/download the Github repository.

  • main.c
  • data.h
  • data.c
  • pearsoncorrelation.h
  • pearsoncorrelation.c

Source Code Links

ZIP File
GitHub

Let's look first at main.c.

main.c

#include<stdio.h>
#include<stdlib.h>
#include<stdbool.h>

#include"data.h"
#include"pearsoncorrelation.h"

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
void printdata(double* independent, double* dependent, int size);

//--------------------------------------------------------
// FUNCTION main
//--------------------------------------------------------
int main(void)
{
    puts("-----------------------");
    puts("| codedrome.com       |");
    puts("| Pearson Correlation |");
    puts("-----------------------");

    double independent[8];
    double dependent[8];
    double rho;

    for(int dataset = 1; dataset <=3; dataset++)
    {
        if(populatedata(independent, dependent, dataset))
        {
            printf("Data Set %d\n-----------\n", dataset);

            printdata(independent, dependent, 8);

            rho = pearson_correlation(independent, dependent, 8);

            printf("Pearson Correlation Coefficient rho = %lf\n\n", rho);
        }
    }

    return EXIT_SUCCESS;
}

//--------------------------------------------------------
// FUNCTION printdata
//--------------------------------------------------------
void printdata(double* independent, double* dependent, int size)
{
    for(int i = 0; i < size; i++)
    {
        printf("%3.0lf\t%3.0lf\n", independent[i], dependent[i]);
    }
}

In main we declare a pair of double arrays to hold our bivariate data. The size of these is hard-coded in as this is only testing and demonstration code. The actual function which calculates the correlation will be general-purpose, and will take data of any size. We also declare the variable rho to take the correlation coefficient.

The populatedata function can populate arrays with any of the three sample data sets above, so we for-loop from 1 to 3 to calculate the correlation for each in turn. The data set number and the data itself is printed by calling the very trivial printdata function, after which we call the not-so-trivial pearson_correlation function and print the result.

Next comes the populatedata function; this is the prototype...

data.h

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
bool populatedata(double independent[8], double dependent[8], int dataset);

...and this is the implementation.

data.c

#include<string.h>
#include<stdio.h>
#include<stdbool.h>

//--------------------------------------------------------
// FUNCTION populatedata
//--------------------------------------------------------
bool populatedata(double independent[8], double dependent[8], int dataset)
{
    bool success = true;

    switch(dataset)
    {
        case 1:
            memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8]));
            memcpy(dependent, (double[8]){32,44,68,74,92,98,110,116}, sizeof(double[8]));
            break;

        case 2:
            memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8]));
            memcpy(dependent, (double[8]){40,40,60,80,90,110,100,130}, sizeof(double[8]));
            break;

        case 3:
            memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8]));
            memcpy(dependent, (double[8]){100,10,130,90,40,80,180,50}, sizeof(double[8]));
            break;

        default:
            puts("Invalid dataset requested");
            success = false;
            break;
    }

    return success;
}

Basically all we are doing here is copying hard-coded data into the arrays supplied as arguments and returning true. If the dataset index is not 1, 2 or 3 we return false. Note that we checked this return value in main.

Now the moment you've all been waiting for - we will now actually code up the pearson_correlation function. The prototype is in pearsoncorrelation.h.

pearsoncorrelation.h

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
double pearson_correlation(double* independent, double* dependent, int size);

And this is pearsoncorrelation.c.

pearsoncorrelation.c

#include<stdlib.h>
#include<stdbool.h>
#include<stdio.h>
#include<string.h>
#include<math.h>

#include"pearsoncorrelation.h"

//--------------------------------------------------------
// FUNCTION PROTOTYPES
//--------------------------------------------------------
static double arithmetic_mean(double* data, int size);
static double mean_of_products(double* data1, double* data2, int size);
static double standard_deviation(double* data, int size);

//--------------------------------------------------------
// FUNCTION pearson_correlation
//--------------------------------------------------------
double pearson_correlation(double* independent, double* dependent, int size)
{
    double rho;

    // covariance
    double independent_mean = arithmetic_mean(independent, size);
    double dependent_mean = arithmetic_mean(dependent, size);
    double products_mean = mean_of_products(independent, dependent, size);
    double covariance = products_mean - (independent_mean * dependent_mean);

    // standard deviations of independent values
    double independent_standard_deviation = standard_deviation(independent, size);

    // standard deviations of dependent values
    double dependent_standard_deviation = standard_deviation(dependent, size);

    // Pearson Correlation Coefficient
    rho = covariance / (independent_standard_deviation * dependent_standard_deviation);

    return rho;
}

//--------------------------------------------------------
// FUNCTION arithmetic_mean
//--------------------------------------------------------
static double arithmetic_mean(double* data, int size)
{
    double total = 0;

    // note that incrementing total is done within the for loop
    for(int i = 0; i < size; total += data[i], i++);

    return total / size;
}

//--------------------------------------------------------
// FUNCTION mean_of_products
//--------------------------------------------------------
static double mean_of_products(double* data1, double* data2, int size)
{
    double total = 0;

    // note that incrementing total is done within the for loop
    for(int i = 0; i < size; total += (data1[i] * data2[i]), i++);

    return total / size;
}

//--------------------------------------------------------
// FUNCTION standard_deviation
//--------------------------------------------------------
static double standard_deviation(double* data, int size)
{
    double squares[size];

    for(int i = 0; i < size; i++)
    {
        squares[i] = pow(data[i], 2);
    }

    double mean_of_squares = arithmetic_mean(squares, size);
    double mean = arithmetic_mean(data, size);
    double square_of_mean = pow(mean, 2);
    double variance = mean_of_squares - square_of_mean;
    double std_dev = sqrt(variance);

    return std_dev;
}

As you can see much of the arithmetic is farmed out to separate functions:

  • arithmetic_mean
  • mean_of_products
  • standard_deviation

Just to recap, the formula for the Pearson Correlation Coefficient is

Formula for Pearson Correlation Coefficient

ρX,Y = cov(X,Y) / (σX * σY)

This is implemented near the end of the pearson_correlation function, most of which is concerned with calculating the three components of the formula:

  • The covariance of the bivariate data
  • The standard deviation of the independent data
  • The standard deviation of the dependent data

The code is hopefully self-documenting enough not to need an explanation of the actual calculation but I will add a couple of brief notes about covariance and standard deviation.

Covariance is a rather complex concept which you might like to look into on Wikipedia or elsewhere, but an easy way to remember how to calculate it is mean of products minus product of means. The mean of products is the mean of each pair of values multiplied together, and the product of means is simply the means of the two sets of data multiplied together.

Standard deviation can be thought of roughly as the average amount by which the individual values vary from the mean. (It is not that exactly, which is a different figure known as mean of absolute deviation or MAD, but the concepts are similar.) The standard deviation is the square root of the variance which uses squared values to eliminate negative differences. An easy way to remember how to calculate a variance is mean of squares minus square of mean. The mean of squares is simple enough, just square all the values and calculate their mean; and the square of mean is self-explanatory. Once you have calculated the variance don't forget to calculate its square root to get the standard deviation.

Compile and Run

We have now finished writing code so can actually compile and run it. In the terminal type

Compiling

gcc main.c data.c pearsoncorrelation.c -std=c11 -lm -o main

Note all three .c files are included, and we also need -lm as we're using math.h.

Now run the program with

Running

./main

The output is:

Program Output

-----------------------
| codedrome.com       |
| Pearson Correlation |
-----------------------

Data Set 1
-----------
 10      32
 20      44
 40      68
 45      74
 60      92
 65      98
 75     110
 80     116
Pearson Correlation Coefficient rho = 1.000000

Data Set 2
-----------
 10      40
 20      40
 40      60
 45      80
 60      90
 65     110
 75     100
 80     130
Pearson Correlation Coefficient rho = 0.960215

Data Set 3
-----------
 10     100
 20      10
 40     130
 45      90
 60      40
 65      80
 75     180
 80      50
Pearson Correlation Coefficient rho = 0.207800

Earlier I gave rough estimates of ρ for the three data sets as 1, 0.9 and 0. As you can see they are actually 1, 0.960215 and 0.2078.

Next Step: Regression

If you have discovered a strong correlation in your data your next step will probably be to put numbers to that relationship by working out the formula of the straight line which best fits the data. That process is called linear regression and will be the subject of a future post.

Leave a Reply

Your email address will not be published. Required fields are marked *