Everyone understands averages, both their meaning and how to calculate them. However, there are situations, particularly when dealing with real-time data, when a conventional average is of little use because it includes old values which are no longer relevant and merely give a misleading impression of the current situation.
The solution to this problem is to use moving averages, i.e. the average of the most recent values rather than all values, which is the subject of this post.
To illustrate the problem I will show part of the output of the program I'll write in this post. It shows the last rows of a set of server response times.
Program Output - tail end of 1000 rows
    ------------------
    | codedrome.com  |
    | MovingAverages |
    ------------------

    -------------------------------------------------
    |          value|overall average| moving average|
    -------------------------------------------------
    .
    .
    .
    |          46.00|          29.77|          31.00|
    |          20.00|          29.76|          32.75|
    |          38.00|          29.76|          37.25|
    |          44.00|          29.78|          37.00|
    |          49.00|          29.80|          37.75|
    |          36.00|          29.80|          41.75|
    |          11.00|          29.79|          35.00|
    |          20.00|          29.78|          29.00|
    |          40.00|          29.79|          26.75|
    |          10.00|          29.77|          20.25|
    |          47.00|          29.78|          29.25|
    |          13.00|          29.77|          27.50|
    |          24.00|          29.76|          23.50|
    |          38.00|          29.77|          30.50|
    |          31.00|          29.77|          26.50|
    |          50.00|          29.79|          35.75|
    |          32.00|          29.79|          37.75|
    |          21.00|          29.78|          33.50|
    |          42.00|          29.80|          36.25|
    |         165.00|          29.93|          65.00|
    |         256.00|          30.16|         121.00|
    |         419.00|          30.55|         220.50|
    |         329.00|          30.85|         292.25|
    |         128.00|          30.94|         283.00|
    -------------------------------------------------
Most times in the left-hand column are between 10ms and 50ms and can be considered normal, but the last few shoot up considerably. The second column shows overall averages, which we might use to monitor the server for any problems. However, the large number of normal times included in these averages means that although the server has slowed down considerably for the last few requests, the averages have hardly risen at all and we wouldn't realise anything was wrong. The last column shows 4-point moving averages, i.e. the averages of only the last four values. These of course do increase considerably and so alarm bells should start to ring.
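As a quick check on those figures, here is a small sketch using values copied from the table above. The variable names are mine, and the overall-average step is only approximate because it starts from a rounded figure.

```python
# The last four values from the table above.
last_four = [256, 419, 329, 128]

# 4-point moving average: the mean of only the most recent four values.
moving_average = sum(last_four) / len(last_four)
print(moving_average)  # 283.0

# How far a single slow request moves the overall average: the 419ms value is
# the 998th row, and the overall average before it was about 30.16.
previous_overall = 30.16
step = (419 - previous_overall) / 998
print(round(step, 2))  # 0.39
```

A value nearly fourteen times the norm shifts the overall average by less than half a millisecond, while the moving average jumps by almost 100ms.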
Having explained both the problem and its solution, let's write some code. This project consists of the following files, which can be downloaded as a zip; alternatively, you can clone or download the GitHub repository if you prefer.
- movingaverageslist.py
- movingaverages_test.py
The movingaverageslist.py file implements a class which maintains a list of numerical values, and each time a new value is added the overall average and moving average up to that point are also calculated.
movingaverageslist.py
    class MovingAveragesList(object):

        """
        This class implements a list to which numeric values can be appended.
        Doing so actually appends a dictionary containing three values:
        "value" - the value added
        "average" - the arithmetic mean of all values up to and including the current one
        "movingaverage" - the arithmetic mean of the specified number of previous values
        The underlying list can be accessed using objectname.data.
        """

        def __init__(self, points):

            """
            The points argument specifies how many previous values
            should be used to calculate each moving average.
            """

            self.data = []
            self.points = points

        def append(self, n):

            """
            Adds a dictionary of value, overall average and moving average to the list.
            """

            average = self.__calculate_overall_average(n)
            moving_average = self.__calculate_moving_average(n)

            self.data.append({"value": n, "average": average, "movingaverage": moving_average})

        def __calculate_overall_average(self, n):

            length = len(self.data)

            if length == 0:
                average = n
            else:
                average = (((self.data[length - 1]["average"]) * length) + n) / (length + 1)

            return average

        def __calculate_moving_average(self, n):

            length = len(self.data)

            if length == 0:
                moving_average = n
            elif length <= self.points - 1:
                moving_average = (((self.data[length - 1]["average"]) * length) + n) / (length + 1)
            else:
                moving_average = ((self.data[length - 1]["movingaverage"] * self.points) - (self.data[length - self.points]["value"]) + n) / self.points

            return moving_average

        def __str__(self):

            """
            Create a grid from the data in the list.
            """

            items = []

            items.append("-" * 49 + "\n")
            items.append("|          value|overall average| moving average|\n")
            items.append("-" * 49 + "\n")

            for item in self.data:
                items.append("|{:15.2f}|{:15.2f}|{:15.2f}|\n"
                             .format(item["value"], item["average"], item["movingaverage"]))

            items.append("-" * 49)

            return "".join(items)
In __init__ we simply create an empty list and set the points attribute, i.e. the number of values used to calculate each moving average.
In the append method, the overall and moving averages are calculated using separate methods which I'll come to in a minute. Then a dictionary containing the new value and the two averages is appended to the list.
In __calculate_overall_average we don't need to add up all the values each time. Instead we can multiply the previous average by the current length of the list, which recovers the sum of the existing values, and then add the new value. This is then divided by the length + 1, i.e. the length the list will be when the new value is added.
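The same update can be sketched as a standalone function. The name update_mean is hypothetical and not part of the class; the sample values are the first four visible rows of the table above.

```python
def update_mean(previous_mean, count, new_value):
    """Return the mean of count + 1 values, given the mean of the first count values."""
    # previous_mean * count recovers the sum of the values seen so far.
    return (previous_mean * count + new_value) / (count + 1)


values = [46, 20, 38, 44]

running = 0.0
for count, value in enumerate(values):
    running = update_mean(running, count, value)

# The incremental result should agree with the mean calculated from scratch,
# allowing for tiny floating-point differences.
from_scratch = sum(values) / len(values)
print(running, from_scratch)
```

Each update is a fixed amount of work regardless of how many values have already been added.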
The __calculate_moving_average function uses a similar technique but is more complex as it has to allow for the list not yet having reached the length of the number of points. In this situation it just calculates the mean of whatever data the list has.
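Once the list holds at least as many values as there are points, the update amounts to removing the oldest value from the window and adding the newest. A minimal sketch follows, with the hypothetical name slide_moving_average, checked against two adjacent rows of the table above.

```python
def slide_moving_average(previous_moving_average, points, oldest_value, new_value):
    """Update a full-window moving average when one value leaves and one arrives."""
    # previous_moving_average * points recovers the sum of the current window.
    window_sum = previous_moving_average * points
    return (window_sum - oldest_value + new_value) / points


# The window 42, 165, 256, 419 has a moving average of 220.50.
# When 329 arrives, 42 drops out of the window.
updated = slide_moving_average(220.5, 4, oldest_value=42, new_value=329)
print(updated)  # 292.25
```

One thing to be aware of with this technique is that each result is derived from the previous one, so floating-point rounding errors can accumulate over a very long run. Recalculating from the actual window values now and then would be one way to correct for that.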
Lastly we implement __str__ which returns the data in a table format suitable for outputting to the console.
The MovingAveragesList class is now complete so let's put together a simple demo.
movingaverages_test.py
    import random

    import movingaverageslist


    def main():

        print("------------------")
        print("| codedrome.com  |")
        print("| MovingAverages |")
        print("------------------\n")

        response_times_ms = populate_response_times()

        print(response_times_ms)

        # Quick demo of accessing the list directly.
        print(response_times_ms.data[-1])


    def populate_response_times():

        """
        Create a MovingAveragesList object and populate it with random response times.
        """

        response_times_ms = movingaverageslist.MovingAveragesList(4)

        # Add a large number of normal times
        for t in range(1, 996):
            response_times_ms.append(random.randint(10, 50))

        # Add a few excessively long times
        for t in range(1, 6):
            response_times_ms.append(random.randint(100, 500))

        return response_times_ms


    main()
In main we call populate_response_times to get a MovingAveragesList object with 1000 items, and then print the object. As we implemented __str__ in the class this will be called and therefore we'll see the table described above.
I have also added a line which prints the last item in the list just to show how to access the most recent value and averages. A possible enhancement would be to wrap this in a method to avoid rummaging around in the inner workings of the class.
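One way that enhancement might look is sketched below. It is written as a standalone function called latest (a hypothetical name) so that it runs on its own; in the class it would be a method using self.data. A stand-in object with the same data layout takes the place of a real MovingAveragesList here.

```python
from types import SimpleNamespace


def latest(averages_list):
    """Return the most recent entry, or None if nothing has been added yet."""
    if not averages_list.data:
        return None
    return averages_list.data[-1]


# Stand-in with the same .data layout as MovingAveragesList, holding the
# last row of the table shown earlier.
stand_in = SimpleNamespace(
    data=[{"value": 128, "average": 30.94, "movingaverage": 283.0}])

print(latest(stand_in)["movingaverage"])  # 283.0
print(latest(SimpleNamespace(data=[])))   # None
```

Returning None for an empty list avoids the IndexError that data[-1] would raise before any values have been added.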
The populate_response_times function creates a MovingAveragesList object with a points value of 4. This is probably too low for practical purposes but it does make manual testing easier!
It then adds a large number of "normal" values to it; remember that each time a value is added new overall and moving averages are also added. Then a few large numbers are added to simulate a server problem before we return the object.
Now we can run the program like this...
Running the program
python3.7 movingaverages_test.py
I won't repeat the output but you'll see 1000 rows of data whizzing up your console.
Possible Improvements
The MovingAveragesList class has been tailored to demonstrating the problem it solves and how it does so. In a production environment this is unnecessary, and there are a few improvements which could make the class more efficient and useful.
- We could drop the overall averages
- Only the latest moving average could be kept
- We could delete the oldest value each time a new one is added, just keeping a restricted number of the latest values
- We could forget the list concept entirely and just keep a single moving average, updated from any new values added
- We could include a threshold and function to be called if the threshold is exceeded, for example sending out emails if the server response time slows to an unacceptable level
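Several of these ideas can be combined in a rough sketch like the one below. The class name ThresholdMovingAverage, its parameters and the callback are all assumptions of mine, not part of the project. It uses collections.deque with maxlen to discard the oldest value automatically, and recalculates the average from the window each time, which is a different technique from the incremental update used in MovingAveragesList.

```python
from collections import deque


class ThresholdMovingAverage:
    """Hypothetical sketch: keeps only the latest values and a single moving average."""

    def __init__(self, points, threshold, on_exceeded):
        # With maxlen set, appending to a full deque drops the oldest value.
        self.window = deque(maxlen=points)
        self.threshold = threshold
        self.on_exceeded = on_exceeded
        self.moving_average = None

    def append(self, n):
        self.window.append(n)
        # Recalculate from the window: simple, and cheap for a small window.
        self.moving_average = sum(self.window) / len(self.window)
        if self.moving_average > self.threshold:
            self.on_exceeded(self.moving_average)
        return self.moving_average


# Collect alerts in a list; a real callback might send an email instead.
alerts = []
monitor = ThresholdMovingAverage(points=4, threshold=100, on_exceeded=alerts.append)

# The last eight values from the table at the top of this post.
for response_time in [32, 21, 42, 165, 256, 419, 329, 128]:
    monitor.append(response_time)

print(alerts)  # [121.0, 220.5, 292.25, 283.0]
```

The first slow request (165ms) only lifts the moving average to 65, so the alert fires from the second slow request onwards. A larger number of points would delay it further, which is the trade-off between responsiveness and false alarms.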