Zipf’s Law in Python

Zipf's Law describes a probability distribution where each frequency is the reciprocal of its rank multiplied by the highest frequency. Therefore the second highest frequency is the highest multiplied by 1/2, the third highest is the highest multiplied by 1/3 and so on.

This is best illustrated with a graph.

In this post I will write a project in Python to apply Zipf's Law to what is probably it's best known use, that of analysing word frequencies in a piece of text.

The Zipfian Distribution can be applied to many different areas including populations, incomes, company revenues and so on, as well as words in a piece of text as I mentioned above. The word frequencies of a single piece of text are unlikely to be a good fit - for that you really need a large and varied range of texts. However, applying it to a single piece of text is a simple way to demonstrate the distribution, and if we implement it in Python we can also see a few useful language techniques in action along the way.

The Problem

For this project I want to implement the following features:

  • Take a text file and count the frequencies of individual words

  • Create a data structure containing the top n words in descending order, together with the following pieces of information for each word:

    • The word frequency

    • The Zipfian fraction (1/1, 1/2, 1/3 etc,)

    • The Zipfian frequency

    • The difference between the actual and Zipfian frequency

    • The percentage difference between the two frequencies

In this project I'll use a handful of slightly lesser know Python features which I'll list here just to give a sneak preview, and then I'll describe them in more detail later on.

  • The string split method to split text into words

  • maketrans and translate to remove punctuation and digits

  • collections.Counter to count word frequencies

The project consists of the following two Python files, as well as dracula.txt which contains the full text of Bram Stoker's novel which we'll use as the input. The files can be downloaded as a zip from the Downloads page, or you can clone/download the Github repository.

  • zipfslaw.py

  • zipfslawtest.py

Source Code Links

ZIP File
GitHub

This is the full code of zipfslaw.py.

zipfslaw.py

import collections


def generate_zipf_table(text, top):

    """
    Create a list of dictionaries containing the top
    most frequent words, their frequencies and
    other Zipfian data.
    """

    text = _remove_punctuation(text)

    text = text.lower()

    top_word_frequencies = _top_word_frequencies(text, top)

    zipf_table = _create_zipf_table(top_word_frequencies)

    return zipf_table


def _remove_punctuation(text):

    """
    Removes the characters:
    !\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~0123456789
    from the text.
    """

    chars_to_remove = "!\"#$%&'()*+,-./:;<=>?@[\]^_`{|}~0123456789"

    tr = str.maketrans("", "", chars_to_remove)

    return text.translate(tr)


def _top_word_frequencies(text, top):

    """
    Create a list of tuples containing the most
    frequent words and their frequencies
    in descending order.
    """

    # With no argument, split() separates the string
    # by 1 or more consecutive instances of whitespace.
    words = text.split()

    # Create a collections.Counter instance from an
    # iterable, in this case our list of words.
    word_frequencies = collections.Counter(words)

    # most_common() gives us a list of tuples
    # containing words and their frequencies,
    # in descending order of frequency.
    top_word_frequencies = word_frequencies.most_common(top)

    return top_word_frequencies


def _create_zipf_table(frequencies):

    """
    Takes the list created by _top_word_frequencies
    and inserts it into a list of dictionaries,
    along with the Zipfian data.
    """

    zipf_table = []

    top_frequency = frequencies[0][1]

    for index, item in enumerate(frequencies, start=1):

        relative_frequency = "1/{}".format(index)
        zipf_frequency = top_frequency * (1 / index)
        difference_actual = item[1] - zipf_frequency
        difference_percent = (item[1] / zipf_frequency) * 100

        zipf_table.append({"word": item[0],
                           "actual_frequency": item[1],
                           "relative_frequency": relative_frequency,
                           "zipf_frequency": zipf_frequency,
                           "difference_actual": difference_actual,
                           "difference_percent": difference_percent})

    return zipf_table


def print_zipf_table(zipf_table):

    """
    Prints the list created by generate_zipf_table
    in table format with column headings.
    """

    width = 80

    print("-" * width)
    print("|Rank|    Word    |Actual Freq | Zipf Frac  | Zipf Freq  |Actual Diff |Pct Diff|")
    print("-" * width)

    format_string = "|{:4}|{:12}|{:12.0f}|{:>12}|{:12.2f}|{:12.2f}|{:7.2f}%|"

    for index, item in enumerate(zipf_table, start=1):

        print(format_string.format(index,
                                   item["word"],
                                   item["actual_frequency"],
                                   item["relative_frequency"],
                                   item["zipf_frequency"],
                                   item["difference_actual"],
                                   item["difference_percent"]))

    print("-" * width)

generate_zipf_table

This is the core function and which takes a string and a top argument which specifies the maximum number of items to return the frequencies of. (Data sets following the Zipfian distribution will often have a long tail of very low frequencies which aren't worth considering or trying to fit to the reciprocal of the rank.) It then generates and returns the data structure described in the bullet points above.

_remove_punctuation

This is a short and simple but quite interesting function. It first creates a string containing all the characters we want to remove, basically punctuation plus numbers 0 to 9. It's not very sophisticated and I have to admit that this whole project is really only suitable for use with ASCII-only text, but it works with Dracula.

Next we use the three-argument version of str.maketrans. This is a static string method which creates a table of character replacement mappings, in this case meaning no characters will be replaced (the first two arguments therefore being empty strings) but the characters in the third argument are to be removed.

Finally we return the result of calling translate on the text with the translation table created in the previous line.

_top_word_frequencies

I have gone a bit over the top with comments in this function just to make it clear how the rather less well known bits of code actually work. Firstly we split the text into a list of words and then use that list to construct a Counter. We then call the Counter object's most_common method to get a sorted list of the top words.

_create_zipf_table

We then take the list generated by the previous function and use it as the basis of a new list containing all the additional Zipfian Distribution stuff.

Within a loop through the list we calculate all the various extra values before adding them as a dictionary to a new list.

print_zipf_table

This final function simply takes the data structure from _create_zipf_table and prints it in a neat table. The formatting string is long and unweildy so I have assigned it to a separate variable, not something I have to do often!

Now we can move on to trying out the module in.

zipfslawtest.py

import zipfslaw


def main():

    print("-----------------")
    print("| codedrome.com |")
    print("| Zipf's Law    |")
    print("-----------------\n")

    try:

        f = open("dracula.txt", "r")
        text = f.read()
        f.close()

        zipf_table = zipfslaw.generate_zipf_table(text, 135)

        zipfslaw.print_zipf_table(zipf_table)

    except IOError as e:

        print(e)


main()

After some very mundate code to read a text file we pass it to generate_zipf_table and then pass the result to print_zipf_table.

Now we can run the code with this command:

Run

python3.7 zipfslawtest.py

The output is

Program Output (partial)

-----------------
| codedrome.com |
| Zipf's Law    |
-----------------

--------------------------------------------------------------------------------
|Rank|    Word    |Actual Freq | Zipf Frac  | Zipf Freq  |Actual Diff |Pct Diff|
--------------------------------------------------------------------------------
|   1|the         |        7965|         1/1|     7965.00|        0.00| 100.00%|
|   2|and         |        5849|         1/2|     3982.50|     1866.50| 146.87%|
|   3|i           |        4530|         1/3|     2655.00|     1875.00| 170.62%|
|   4|to          |        4523|         1/4|     1991.25|     2531.75| 227.14%|
|   5|of          |        3726|         1/5|     1593.00|     2133.00| 233.90%|
|   6|a           |        2933|         1/6|     1327.50|     1605.50| 220.94%|
|   7|in          |        2539|         1/7|     1137.86|     1401.14| 223.14%|
|   8|he          |        2531|         1/8|      995.62|     1535.38| 254.21%|
|   9|that        |        2424|         1/9|      885.00|     1539.00| 273.90%|
|  10|it          |        2064|        1/10|      796.50|     1267.50| 259.13%|
|  11|was         |        1870|        1/11|      724.09|     1145.91| 258.25%|
|  12|as          |        1580|        1/12|      663.75|      916.25| 238.04%|
|  13|for         |        1517|        1/13|      612.69|      904.31| 247.60%|
|  14|is          |        1513|        1/14|      568.93|      944.07| 265.94%|
|  15|we          |        1498|        1/15|      531.00|      967.00| 282.11%|
|  16|his         |        1461|        1/16|      497.81|      963.19| 293.48%|
|  17|me          |        1394|        1/17|      468.53|      925.47| 297.53%|
|  18|not         |        1378|        1/18|      442.50|      935.50| 311.41%|
|  19|you         |        1356|        1/19|      419.21|      936.79| 323.47%|
|  20|with        |        1320|        1/20|      398.25|      921.75| 331.45%|
|  21|my          |        1215|        1/21|      379.29|      835.71| 320.34%|
|  22|all         |        1149|        1/22|      362.05|      786.95| 317.36%|
|  23|be          |        1121|        1/23|      346.30|      774.70| 323.70%|
|  24|at          |        1087|        1/24|      331.88|      755.12| 327.53%|
|  25|on          |        1072|        1/25|      318.60|      753.40| 336.47%|
|  26|have        |        1065|        1/26|      306.35|      758.65| 347.65%|
|  27|so          |        1065|        1/27|      295.00|      770.00| 361.02%|
|  28|her         |        1044|        1/28|      284.46|      759.54| 367.01%|
|  29|had         |        1033|        1/29|      274.66|      758.34| 376.11%|
|  30|but         |        1028|        1/30|      265.50|      762.50| 387.19%|
|  31|him         |         926|        1/31|      256.94|      669.06| 360.40%|
|  32|she         |         800|        1/32|      248.91|      551.09| 321.41%|

As you can see Dracula isn't a good fit for the Zipfian Distribution - few individual texts are. The code in this project uses a rather naive series of reciprocals of ranks, but more sophisticated methods of calculating the Zipfian probabilities might provide a better fit. However, when counting words these formulae require a value for the number of words in the language of the text. This is of course such a vague concept that it is pretty much impossible to arrive at a suitable value.

This project has been tailored to calculating the Zipfian distribution for words in a piece of text. However, as I stated above, the distribution can be applied to many types of data and often with a better fit. The code in this post could be enhanced to be more general-purpose, creating a probability distribution from a list of any data type.