Correlation is the process of quantifying the relationship between two sets of values and in this post I will be writing code in C to calculate possibly the best-known type of correlation - the Pearson Correlation Coefficient.
An Overview of Correlation
Consider the following three data sets and their graphs, or, more accurately, scatter plots. The x-axis represents an independent variable, and the y-axis represents a dependent variable.
Data Set 1
x | y |
---|---|
10 | 32 |
20 | 44 |
40 | 68 |
45 | 74 |
60 | 92 |
65 | 98 |
75 | 110 |
80 | 116 |
In this data set all the points are on a straight line - there is clearly a relationship between them.
Data Set 2
x | y |
---|---|
10 | 40 |
20 | 40 |
40 | 60 |
45 | 80 |
60 | 90 |
65 | 110 |
75 | 100 |
80 | 130 |
Here the data is a bit more scattered but there is still a clear relationship, subject to some minor fluctuations.
Data Set 3
x | y |
---|---|
10 | 100 |
20 | 10 |
40 | 130 |
45 | 90 |
60 | 40 |
65 | 80 |
75 | 180 |
80 | 50 |
In this last set of data the points appear completely random, with little or no relationship between x and corresponding y values.
Quantifying the Relationships
Relationships within bivariate data such as the examples above can be precisely quantified by calculating the Pearson Correlation Coefficient, usually referred to as ρ, the Greek letter rho. This gives us a number between -1 and 1, -1 being a perfect negative correlation and 1 being a perfect positive correlation. Values in between represent various degrees of imperfect correlation, with 0 being none at all.
As rough estimates, in Data Set 1 we would expect ρ to be 1, in Data Set 2 we would expect ρ to be around 0.9, and in Data Set 3 ρ would be close to 0.
Full details of the Pearson Correlation Coefficient are on Wikipedia. In short the formula is
Formula for Pearson Correlation Coefficient
ρX,Y = cov(X,Y) / (σX * σY)
where cov is the covariance of the data paired data, and σ is the standard deviation of the data. These will be explored as we get to them.
Caveat
There is a famous phrase in statistics: "correlation does not imply causation". A strong correlation between two variables might just be by chance, even where common sense might make you believe the association could be genuine. A strong correlation should only ever be used as a basis for further investigation, particularly in this age of Big Data number crunching where seemingly unrelated data sets can be compared quickly and cheaply.
Coding
For this project we will write a function to create sample arrays of data corresponding to that shown above, and then write the code necessary to calculate ρ for each of them.
This project consists of the following files which you can download as a zip, or clone/download the Github repository.
- main.c
- data.h
- data.c
- pearsoncorrelation.h
- pearsoncorrelation.c
Source Code Links
Let's look first at main.c.
main.c
#include<stdio.h> #include<stdlib.h> #include<stdbool.h> #include"data.h" #include"pearsoncorrelation.h" //-------------------------------------------------------- // FUNCTION PROTOTYPES //-------------------------------------------------------- void printdata(double* independent, double* dependent, int size); //-------------------------------------------------------- // FUNCTION main //-------------------------------------------------------- int main(void) { puts("-----------------------"); puts("| codedrome.com |"); puts("| Pearson Correlation |"); puts("-----------------------"); double independent[8]; double dependent[8]; double rho; for(int dataset = 1; dataset <=3; dataset++) { if(populatedata(independent, dependent, dataset)) { printf("Data Set %d\n-----------\n", dataset); printdata(independent, dependent, 8); rho = pearson_correlation(independent, dependent, 8); printf("Pearson Correlation Coefficient rho = %lf\n\n", rho); } } return EXIT_SUCCESS; } //-------------------------------------------------------- // FUNCTION printdata //-------------------------------------------------------- void printdata(double* independent, double* dependent, int size) { for(int i = 0; i < size; i++) { printf("%3.0lf\t%3.0lf\n", independent[i], dependent[i]); } }
In main we declare a pair of double arrays to hold our bivariate data. The size of these is hard-coded in as this is only testing and demonstration code. The actual function which calculates the correlation will be general-purpose, and will take data of any size. We also declare the variable rho to take the correlation coefficient.
The populatedata function can populate arrays with any of the three sample data sets above, so we for-loop from 1 to 3 to calculate the correlation for each in turn. The data set number and the data itself is printed by calling the very trivial printdata function, after which we call the not-so-trivial pearson_correlation function and print the result.
Next comes the populatedata function; this is the prototype...
data.h
//-------------------------------------------------------- // FUNCTION PROTOTYPES //-------------------------------------------------------- bool populatedata(double independent[8], double dependent[8], int dataset);
...and this is the implementation.
data.c
#include<string.h> #include<stdio.h> #include<stdbool.h> //-------------------------------------------------------- // FUNCTION populatedata //-------------------------------------------------------- bool populatedata(double independent[8], double dependent[8], int dataset) { bool success = true; switch(dataset) { case 1: memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8])); memcpy(dependent, (double[8]){32,44,68,74,92,98,110,116}, sizeof(double[8])); break; case 2: memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8])); memcpy(dependent, (double[8]){40,40,60,80,90,110,100,130}, sizeof(double[8])); break; case 3: memcpy(independent, (double[8]){10,20,40,45,60,65,75,80}, sizeof(double[8])); memcpy(dependent, (double[8]){100,10,130,90,40,80,180,50}, sizeof(double[8])); break; default: puts("Invalid dataset requested"); success = false; break; } return success; }
Basically all we are doing here is copying hard-coded data into the arrays supplied as arguments and returning true. If the dataset index is not 1, 2 or 3 we return false. Note that we checked this return value in main.
Now the moment you've all been waiting for - we will now actually code up the pearson_correlation function. The prototype is in pearsoncorrelation.h.
pearsoncorrelation.h
//-------------------------------------------------------- // FUNCTION PROTOTYPES //-------------------------------------------------------- double pearson_correlation(double* independent, double* dependent, int size);
And this is pearsoncorrelation.c.
pearsoncorrelation.c
#include<stdlib.h> #include<stdbool.h> #include<stdio.h> #include<string.h> #include<math.h> #include"pearsoncorrelation.h" //-------------------------------------------------------- // FUNCTION PROTOTYPES //-------------------------------------------------------- static double arithmetic_mean(double* data, int size); static double mean_of_products(double* data1, double* data2, int size); static double standard_deviation(double* data, int size); //-------------------------------------------------------- // FUNCTION pearson_correlation //-------------------------------------------------------- double pearson_correlation(double* independent, double* dependent, int size) { double rho; // covariance double independent_mean = arithmetic_mean(independent, size); double dependent_mean = arithmetic_mean(dependent, size); double products_mean = mean_of_products(independent, dependent, size); double covariance = products_mean - (independent_mean * dependent_mean); // standard deviations of independent values double independent_standard_deviation = standard_deviation(independent, size); // standard deviations of dependent values double dependent_standard_deviation = standard_deviation(dependent, size); // Pearson Correlation Coefficient rho = covariance / (independent_standard_deviation * dependent_standard_deviation); return rho; } //-------------------------------------------------------- // FUNCTION arithmetic_mean //-------------------------------------------------------- static double arithmetic_mean(double* data, int size) { double total = 0; // note that incrementing total is done within the for loop for(int i = 0; i < size; total += data[i], i++); return total / size; } //-------------------------------------------------------- // FUNCTION mean_of_products //-------------------------------------------------------- static double mean_of_products(double* data1, double* data2, int size) { double total = 0; // note that incrementing total is done within the for loop for(int i = 0; i < size; total += (data1[i] * data2[i]), i++); return total / size; } //-------------------------------------------------------- // FUNCTION standard_deviation //-------------------------------------------------------- static double standard_deviation(double* data, int size) { double squares[size]; for(int i = 0; i < size; i++) { squares[i] = pow(data[i], 2); } double mean_of_squares = arithmetic_mean(squares, size); double mean = arithmetic_mean(data, size); double square_of_mean = pow(mean, 2); double variance = mean_of_squares - square_of_mean; double std_dev = sqrt(variance); return std_dev; }
As you can see much of the arithmetic is farmed out to separate functions:
- arithmetic_mean
- mean_of_products
- standard_deviation
Just to recap, the formula for the Pearson Correlation Coefficient is
Formula for Pearson Correlation Coefficient
ρX,Y = cov(X,Y) / (σX * σY)
This is implemented near the end of the pearson_correlation function, most of which is concerned with calculating the three components of the formula:
- The covariance of the bivariate data
- The standard deviation of the independent data
- The standard deviation of the dependent data
The code is hopefully self-documenting enough not to need an explanation of the actual calculation but I will add a couple of brief notes about covariance and standard deviation.
Covariance is a rather complex concept which you might like to look into on Wikipedia or elsewhere, but an easy way to remember how to calculate it is mean of products minus product of means. The mean of products is the mean of each pair of values multiplied together, and the product of means is simply the means of the two sets of data multiplied together.
Standard deviation can be thought of roughly as the average amount by which the individual values vary from the mean. (It is not that exactly, which is a different figure known as mean of absolute deviation or MAD, but the concepts are similar.) The standard deviation is the square root of the variance which uses squared values to eliminate negative differences. An easy way to remember how to calculate a variance is mean of squares minus square of mean. The mean of squares is simple enough, just square all the values and calculate their mean; and the square of mean is self-explanatory. Once you have calculated the variance don't forget to calculate its square root to get the standard deviation.
Compile and Run
We have now finished writing code so can actually compile and run it. In the terminal type
Compiling
gcc main.c data.c pearsoncorrelation.c -std=c11 -lm -o main
Note all three .c files are included, and we also need -lm as we're using math.h.
Now run the program with
Running
./main
The output is:
Program Output
----------------------- | codedrome.com | | Pearson Correlation | ----------------------- Data Set 1 ----------- 10 32 20 44 40 68 45 74 60 92 65 98 75 110 80 116 Pearson Correlation Coefficient rho = 1.000000 Data Set 2 ----------- 10 40 20 40 40 60 45 80 60 90 65 110 75 100 80 130 Pearson Correlation Coefficient rho = 0.960215 Data Set 3 ----------- 10 100 20 10 40 130 45 90 60 40 65 80 75 180 80 50 Pearson Correlation Coefficient rho = 0.207800
Earlier I gave rough estimates of ρ for the three data sets as 1, 0.9 and 0. As you can see they are actually 1, 0.960215 and 0.2078.
Next Step: Regression
If you have discovered a strong correlation in your data your next step will probably be to put numbers to that relationship by working out the formula of the straight line which best fits the data. That process is called linear regression and will be the subject of a future post.