Some time ago I wrote a post on calculating a selection of statistics from a list on numbers. I got some criticism from people saying that for some of the statistics I should have used Python built-in functions or functions from the Python Standard Library statistics module.
Doing so, however, would cause each of those functions to iterate over the entire dataset. If you want to calculate a number of different statistics in one go you can increase efficiency considerably with just one iteration.
I started writing a simple experiment to calculate the minimum, maximum, sum, mean and standard deviation of a list of numbers using Python's own functions, calculating them again using a single loop, and then comparing the performance.
I then decided to expand the experiment somewhat, firstly by running the plain Python code with PyPy instead of CPython, and then re-writing the Python as Cython. This article explores these experiments and presents the results.
CPython, PyPy and Cython
Python provides the following built-in functions for use with an iterable of numbers:
- min
- max
- sum
and these functions (among others) in the statistics module
- mean
- pstdev (standard deviation of a population)
These are fine if you just need a single statistic from a relatively small data set. However, if you need more than one statistic you can avoid repeated iterations by calculating them during or after a single loop.
Furthermore you can speed up that loop itself using PyPy, or by re-writing the function as Cython.
The standard CPython implementation of Python is an interpreter and therefore inherently slow, whereas PyPy is a just-in-time (JIT) compiler which provides significant performance improvements just by using it instead of CPython with no changes to you code.
At the time of writing (February 2019) PyPy implements Python up to 3.5.3. I won't explore it in detail here but this is its website https://pypy.org/ and it is definitely worth installing and trying out.
If you are willing to do a bit more work in search of increased performance then Cython is really worth getting into. It is a superset of Python to which you can add strong typing, and which is transpiled to C and then compiled into extension modules for use by regular Python.
Perhaps a typical workflow would be to write everything in Python, identify any bottlenecks in the code, and split that code out to Cython.
The improvement in performance depends very much on what the code does and whether you add typing in appropriate places, but it can be startlingly. An ideal use for Cython would be code which iterates and crunches large amounts of numeric data, such as this project.
If you are interested in Cython I recommend this tutorial which will get you up and running in no time Optimizing with Cython Introduction - Cython Tutorial.
The Project
Below is an empty scoreboard listing the various environments and methods. During this project I will implement and time each combination, filling in the scoreboard along the way.
Implementation | Function | Time(ms) | Index |
---|---|---|---|
CPython | python_functions | ||
CPython | single_iteration | ||
PyPy | python_functions | ||
PyPy | single_iteration | ||
Cython | python_functions | ||
Cython | single_iteration |
The project consists of the following files which you can download as a zip, or you can clone/download the Github repository if you prefer.
- speedexperiments.py
- functiontimer.py
- statsfunctions.py
- statsfunctionscython.pyx (note the .pyx extension)
- setup.py
Source Code Links
speedexperiments.py
import random import sys import os import functiontimer import statsfunctions import statsfunctionscython def main(): print("--------------------") print("| codedrome.com |") print("| SpeedExperiments |") print("--------------------\n") print("Using {}\n".format(os.path.basename(sys.executable))) print("Generating random data...\n") random.seed() data = [random.randint(1, 100) for i in range(1000000)] functiontimer.time_function(statsfunctions.python_functions, data, 10) functiontimer.time_function(statsfunctions.single_iteration, data, 10) # functiontimer.time_function(statsfunctionscython.python_functions, # data, # 10) # functiontimer.time_function(statsfunctionscython.single_iteration, # data, # 10) main()
At the top of the main function the name of the currently running executable is displayed. In this project it will be some version of CPython or PyPy.
Next a list of a million random numbers is generated. We then call functiontimer.time_function twice with a different function each time, the random data and a count of 10. The two separate functions called here, as well as functiontimer.time_function, will be described later.
Finally there are two further calls, commented out for the time being, which carry out the same tasks as the first two but using Cython functions.
functiontimer.py
import time import statistics def time_function(function, arguments, runcount): """ Runs the function with the arguments the specified number of times. Prints the minimum, maximum and mean durations in milliseconds. """ print("Running " + function.__name__ + "...") durations = [] for r in range(0, runcount): start_time = time.time() function(arguments) end_time = time.time() duration_ms = (end_time - start_time) * 1000 durations.append(duration_ms) print("min: {:6.0f}ms".format(min(durations))) print("max: {:6.0f}ms".format(max(durations))) print("mean: {:6.0f}ms".format(statistics.mean(durations)))
The Python Standard Library has a module called timeit which provides a function with the same name used to "time small bits of Python code". It is actually quite basic and doesn't do what I wanted to do here so I resorted to writing my own function.
It takes a function which it calls the specified number of times with the supplied arguments, and then prints out the shortest, longest and average times the individual runs took. The shortest times are the ones we are really interested in for benchmarking as they are the ones slowed down the least by the system doing other stuff at the same time, and they are the ones I shall use for filling in the scoreboard.
statsfunctions.py
import statistics def python_functions(data): """ Calculate a few statistics from the numeric list using Python functions. Inefficient due to repeated iteration. """ stats = {} stats["len"] = len(data) stats["min"] = min(data) stats["max"] = max(data) stats["sum"] = sum(data) stats["mean"] = statistics.mean(data) stats["pstdev"] = statistics.pstdev(data) return stats def single_iteration(data): """ Calculate a few statistics from the numeric list using single iteration for increased efficiency. """ sum_of_squares = 0 stats = {} stats["len"] = len(data) stats["min"] = data[0] stats["max"] = data[0] stats["sum"] = 0 for n in data: stats["min"] = min(stats["min"], n) stats["max"] = max(stats["max"], n) stats["sum"] += n sum_of_squares += n**2 stats["mean"] = stats["sum"] / stats["len"] stats["pstdev"] = ((sum_of_squares / stats["len"]) - (stats["mean"]**2))**0.5 return stats
The two functions in this file take a list of numeric values and create dictionaries containing a few statistics on this data. The first, python_functions, use Python functions of functions in the statistics module. As I mentioned above each of these (except len) need to iterate the entire data set.
The second function, single_iteration, iterates the data just once as its name suggests. Within the loop we pick up the minimum and maximum values as well as keeping running totals of the sum and sum of squares. The running totals are then used after the loop to calculate the mean and standard deviation.
statsfunctionscython.pyx
import statistics def python_functions(list data): """ Calculate a few statistics from the numeric list using Python functions. Inefficient due to repeated iteration. This Cython implementation is no more efficient than the Python version and is only included for demonstration. """ cdef int length = len(data) cdef double minval = min(data) cdef double maxval = max(data) cdef double total = sum(data) cdef double mean = statistics.mean(data) cdef double std_dev = statistics.pstdev(data) stats = {} stats["len"] = length stats["min"] = minval stats["max"] = maxval stats["sum"] = total stats["mean"] = mean stats["pstdev"] = std_dev return stats def single_iteration(list data): """ Calculate a few statistics from the numeric list using single iteration for increased efficiency. Very significant improvement on Python version due to typing. """ cdef int length = len(data) cdef double minval = data[0] cdef double maxval = data[0] cdef double total = 0 cdef double sum_of_squares = 0 cdef double mean = 0 cdef double std_dev = 0 cdef double n for n in data: minval = min(minval, n) maxval = max(maxval, n) total += n sum_of_squares += n**2 mean = total / length std_dev = ((sum_of_squares / length) - (mean**2))**0.5 stats = {} stats["len"] = length stats["min"] = minval stats["max"] = maxval stats["sum"] = total stats["mean"] = mean stats["pstdev"] = std_dev return stats
The Cython versions of the functions work in the same way as the Python versions but have additional typing information, removing the need for dynamic type checking.
I must stress that the first Cython function, python_functions, still uses the same Python functions as the non-Cython version so although I have added as much typing as possible it doesn't really help. As I stated in the docstring I have only included this function for demonstration and completeness.
The second Cython function, single_iteration, is however a huge improvement on the Python version. It doesn't rely on any Python functions so is free to use typing information to speed things up considerably.
setup.py
from distutils.core import setup from Cython.Build import cythonize setup(ext_modules = cythonize('statsfunctionscython.pyx'))
Cython requires the intermediate steps of converting .pyx files to C, and then compiling the C to an extension module. This file contains the code to do this, and can be used with any Cython project just by editing the argument to cythonize. In fact you could expand this file to take the filename as a command line argument to use it with any Cython source file.
Now the coding is finished we need to run it in three ways, starting with the functions in statsfunctions.py with CPython.
Run - CPython
python3.7 speedexperiments.py
After a rather boring wait we'll get something like this:
Program Output - CPython
-------------------- | codedrome.com | | SpeedExperiments | -------------------- Using python3.7 Generating random data... Running python_functions... min: 5251ms max: 5278ms mean: 5260ms Running single_iteration... min: 1481ms max: 1505ms mean: 1492ms
Now lets run the statsfunctions.py code again, this time with PyPy.
Run - PyPy
pypy3 speedexperiments.py
A slightly shorter wait gives us:
Program Output - PyPy
--------------------- | codedrome.com | | Speed Experiments | --------------------- Using pypy3 Generating random data... Running python_functions... min: 1631ms max: 1654ms mean: 1638ms Running single_iteration... min: 70ms max: 75ms mean: 71ms
Finally we'll run the Cython code. Comment out the first two function calls in main, uncomment the second two, and run these two commands.
Run - Cython
python3.7 setup.py build_ext --inplace python3.7 speedexperiments.py
Another long wait for the first bit to run, and then a very short wait for the second bit:
Program Output - Cython
--------------------- | codedrome.com | | Speed Experiments | --------------------- Using python3.7 Generating random data... Running python_functions... min: 5351ms max: 5429ms mean: 5393ms Running single_iteration... min: 23ms max: 27ms mean: 24ms
If you look in the folder where you have your source code you'll find two new files, statsfunctionscython.c and another with a name like statsfunctionscython.cpython-37m-x86_64-linux-gnu.so, depending on your version of Python, your architecture and your OS. (The C source code is theoretically human-readable...!)
Plugging the minimum times into the scoreboard gives us this.
Implementation | Function | Time(ms) | Index |
---|---|---|---|
CPython | python_functions | 5251 | 1 |
CPython | single_iteration | 1481 | 3.55 |
PyPy | python_functions | 1631 | 3.22 |
PyPy | single_iteration | 70 | 75.01 |
Cython | python_functions | 5351 | 0.98 |
Cython | single_iteration | 23 | 228.30 |
The first row, python_functions on CPython, is used as a baseline with an index of 1. The other indexes show how many times faster (or slower in one case) the other runs are.
You can also carry out further comparisons if you wish, for example 1481 / 70 = 21, so just by using PyPy we speed up the exact same functions by a factor of 21 with this particular combination of code and data size.
The Magic Line of Code
The single_iteration function in statsfunctionscython.pyx includes this humble line of code:
cdef double n
It doesn't look much but it frees up the code from dynamically checking the type of the variable n on each iteration of the loop. With the test data I have used that's a million checks it doesn't have to do.
In the test runs I got a minimum time of 23ms for this function, but commenting out this line increases the minimum to 760ms. That's over 32 times longer! That's why I have dubbed it "The Magic Line of Code".