I recently wrote an article on Z-Scores which help with the problem of comparing two or more sets of exam grades when each set may have differing averages and spreads.
Even if you just have one exam grade you still have the problem of understanding how it compares to the rest of the grades, and in this post I'll write a Python implementation of percentile ranks which offer one possible solution to this problem.
For any given exam grade you can calculate its percentile rank, for example in the set of example grades I'll use for this project 75% has a percentile rank of 59, which means that candidates with 75% did better than 59% of the other candidates. This gives a much better indication of how the candidate measures up than just the grade on its own.
To calculate a set of percentile ranks we need to carry out the following steps:
- Calculate the frequencies of all scores in order of scores
- Calculate each frequency's percentage of the whole
- Calculate the cumulative totals of the percentages
- Set percentile ranks to the PREVIOUS cumulative percentage total
We use the previous cumulative total as we are calculating the percentage of candidates who got a lower grade than the current one. The percentile rank of the lowest grade is 0 because nobody got less than the lowest grade. Also, nobody can get a percentile rank of 100% - you can't score more than everyone else including yourself.
This project consists of two Python files:
- percentileranks.py
- percentileranks_demo.py
as well as a CSV file, grades.csv, which contains as set 150 fictitious exam grades. The files can be downloaded as a zip, or you can clone/download the Github repository.
Source Code Links
Let's look at percentileranks.py first.
percentileranks.py
import collections import math def calculate_percentile_ranks(data): """ Takes a list of numeric data and returns a list of dictionaries containing the discrete data values on order and their corresponding percentile ranks. Intermediate values are also included. """ percentile_ranks = [] frequencies = collections.Counter(data) frequencies = sorted(frequencies.items()) count = len(data) cumulative_percentage = 0 for item in frequencies: grade = int(item[0]) frequency = int(item[1]) percentage = frequency / count percentile_rank = cumulative_percentage cumulative_percentage += percentage percentile_ranks.append({"grade": grade, "frequency": frequency, "percentage": percentage, "cumulative_percentage": cumulative_percentage, "percentile_rank": percentile_rank}) return percentile_ranks def print_percentile_ranks(percentile_ranks): """ Print the data structure from calculate_percentile_ranks in a table format """ heading = "|Grade|Frequency|Percentage|Cumulative|Percentile Rank|" width = len(heading) print("-" * width) print(heading) print("-" * width) formatstring = "|{:5.0f}|{:9.0f}|{:9.2f}%|{:9.2f}%|{:15.0f}|" for row in percentile_ranks: print(formatstring.format(row["grade"], row["frequency"], row["percentage"] * 100, row["cumulative_percentage"] * 100, row["percentile_rank"] * 100)) print("-" * width)
Firstly we create an empty list which will hold the end results, and then use a collections.Counter to obtain the frequencies of the input data, which is then sorted. We then initialise the cumulative_percentage variable to 0; as I mentioned above the lowest grade has a percentile rank of 0, and this variable is used to set the percentile ranks. I have also used a count variable to tidy up the code a bit.
Now we iterate frequencies, assigning the five pieces of data we need for each grade to variables. The first two are grade and frequency which come straight from the frequencies list. For the rest we need to do a bit of arithmetic. The percentage is simply frequency divided by count. (Note that it is a decimal fraction rather than a percentage, this being easier to use in other calculations. We can multiply by 100 if we want to output it as an actual percentage.) Next we set percentile_rank to cumulative_percentage; for the lowest grade this will still be 0, but we then calculate it ready for the next grade.
Now we add a dictionary to the percentile_ranks list using the variables from above. I could have carried out the calculations within this step but it would have resulted in a long and messy bit of code so it's much neater this way.
Lastly we just need to return percentile_ranks.
The print_percentile_ranks function takes the data structure created by calculate_percentile_ranks and prints it out in a table format.
Let's move on to percentileranks_demo.py where we'll try out the module.
percentileranks_demo.py
import csv import percentileranks def main(): """ Test the percentileranks module with a set of exam grades from a CSV file. """ print("--------------------") print("| codedrome.com |") print("| Percentile Ranks |") print("--------------------\n") try: f_in = open("grades.csv") r = csv.DictReader(f_in, fieldnames=['grade']) grades = [] for item in r: grades.append(item['grade']) f_in.close() percentile_ranks = percentileranks.calculate_percentile_ranks(grades) percentileranks.print_percentile_ranks(percentile_ranks) except Exception as e: print(e) main()
Firstly we read in a set of grades using csv.DictReader and then copy it to a list. This is then passed to calculate_percentile_ranks, the result of which is passed to print_percentile_ranks.
Now let's run the program.
Run
python3.7 main.py
The output is
Program Output (partial)
-------------------- | codedrome.com | | Percentile Ranks | -------------------- ------------------------------------------------------- |Grade|Frequency|Percentage|Cumulative|Percentile Rank| ------------------------------------------------------- | 34| 2| 1.33%| 1.33%| 0| | 36| 2| 1.33%| 2.67%| 1| | 37| 3| 2.00%| 4.67%| 3| | 38| 2| 1.33%| 6.00%| 5| | 39| 2| 1.33%| 7.33%| 6| | 40| 1| 0.67%| 8.00%| 7| . . . | 91| 1| 0.67%| 90.67%| 90| | 92| 2| 1.33%| 92.00%| 91| | 93| 2| 1.33%| 93.33%| 92| | 94| 7| 4.67%| 98.00%| 93| | 95| 2| 1.33%| 99.33%| 98| | 96| 1| 0.67%| 100.00%| 99| -------------------------------------------------------
I have included the intermediate values Frequency, Percentage and Cumulative Percentage. These show how the end result is arrived at but aren't really necessary so you might like to leave them out.