Frequency Analysis in Python

In a previous post I implemented a very simple and very insecure substitution cypher. It is insecure because each letter in the original text is always encrypted the same way, for example the most common letter "e" might always be encrypted as "h", so if we find that "h" is the most common letter in the encrypted text then we can assume it represents "e". This can be carried out for all letters, a process called frequency analysis which is the subject of this post.

The Caesar Shift Cypher implemented in the previous post uses a one-to-one mapping between plaintext characters and encrypted characters by shifting along the alphabet by a fixed amount (rolling round to "a" if you get to "z"). If you know such a cypher has been used you only need to decrypt one letter and you can work out the shift and therefore crack the entire code. Frequency analysis works with a shift cypher but also any substitution cypher where each letter always encrypts to just one other letter, so can be used on more random cyphers.

The frequency analysis process consists of the following steps:

  • Create a list of plaintext frequencies, sorted by frequency

  • Create a list of encrypted text frequencies, sorted by frequency

  • Merge these lists to create an estimated mapping between encrypted letters and plaintext letters

  • Use this mapping to create an approximate decryption of the encrypted text

I use the words "estimated" and "approximate" because there is no guarantee the relative letter frequencies in the plaintext original of the encrypted text will match those of the sample text used to create a decryption mapping. However, given texts of reasonable length it is probable that the majority of frequencies will be in the same order, and those that are not will often be just one or two places out enabling us to manually edit our mappings to get a better result.

The Code

This project consists of two Python files, and also a couple of text files:

  • frequencyanalysis.py

  • frequencyanalysis_test.py

  • the_sign_of_four_plaintext.txt

  • encrypted.txt

The first text file is the whole of the Sherlock Holmes novel The Sign of Four and will be used to create a list of letter frequencies which will then be used to attempt a decryption of encrypted.txt.

The files can be downloaded in a zip, or you can clone/download from Github.

Source Code Links

ZIP File
GitHub

This is the source code of frequencyanalysis.py.

frequencyanalysis.py

from operator import itemgetter
import json


def create_decryption_dictionary(plaintext_filepath, encrypted_filepath, dictionary_filepath):

    """
    Create an estimated mapping between encrypted letters and
    plaintext letters by comparing the frequencies in the
    plaintext and encrypted text.
    The dictionary is then saved as a JSON file.
    """


    sample_plaintext = _readfile(plaintext_filepath)
    encrypted_text = _readfile(encrypted_filepath)

    sample_plaintext_frequencies = _count_letter_frequencies(sample_plaintext)
    encrypted_text_frequencies = _count_letter_frequencies(encrypted_text)

    decryption_dict = {}
    for i in range(0, 26):
        decryption_dict[encrypted_text_frequencies[i][0]] = sample_plaintext_frequencies[i][0].lower()

    f = open(dictionary_filepath, "w")
    json.dump(decryption_dict, f)
    f.close


def decrypt_file(encrypted_filepath, decrypted_filepath, dictionary_filepath):

    """
    Use the dictionary to decrypt the encrypted file
    and save the result.
    """


    encrypted_text = _readfile(encrypted_filepath)

    f = open(dictionary_filepath, "r")
    decryption_dict = json.load(f)
    f.close()

    decrypted_list = []

    for letter in encrypted_text:
        asciicode = ord(letter.upper())
        if asciicode >= 65 and asciicode <= 90:
            decrypted_list.append(decryption_dict[letter])

    decrypted_text = "".join(decrypted_list)

    f = open(decrypted_filepath, "w")
    f.write(decrypted_text)
    f.close()


def _count_letter_frequencies(text):

    """
    Create a dictionary of letters A-Z and count the frequency
    of each in the supplied text.
    Lower case letters are converted to upper case.
    All other characters are ignored.
    The returned data structure is a list as we need to sort it by frequency.
    """


    frequencies = {}

    for asciicode in range(65, 91):
        frequencies[chr(asciicode)] = 0

    for letter in text:
        asciicode = ord(letter.upper())
        if asciicode >= 65 and asciicode <= 90:
            frequencies[chr(asciicode)] += 1

    sorted_by_frequency = sorted(frequencies.items(), key = itemgetter(1), reverse=True)

    return sorted_by_frequency


def _readfile(path):

    f = open(path, "r")
    text = f.read()
    f.close()
    return text

create_decryption_dictionary

Firstly we read the contents of the two files using the _readfile function, and then call _count_letter_frequencies with each in turn to get dictionaries of the letter frequencies. The dictionaries come to us ready-sorted by frequency which is crucial.

We then create a new dictionary and, within a loop, add key/value pairs with the encrypted letters as keys and plaintext letters as values. We don't add the actual frequencies as they are irrelevant; we are just interested in the frequency orders.

Finally the mappings dictionary is saved to the specified JSON file.

decrypt_file

This function uses the mappings dictionary in the specified dictionary file to decrypt the encrypted file and save it to the specified path.

Firstly we read the encrypted text into memory, and then the mapping dictionary. Then we create an empty list to hold the individual letters of the decrypted text.

Next we iterate the letters of the encrypted text, getting the ASCII code of the uppercase version of each letter. If it is an actual letter we use it as a dictionary key to get the corresponding decrypted letter which is added to decrypted_list.

Finally we use join to stick all the letters in decrypted_list together into a single string. This is far more efficient than appending letters one by one to a string. Finally we just need to save the decrypted text to a file.

_count_letter_frequencies

This function takes a string and creates a dictionary using the 26 letters of the alphabet as keys and their frequencies as values, sorted by frequency.

After creating an empty dictionary we add the keys to it with frequencies of 0. Next we iterate the text, incrementing the frequencies by 1 for each letter. Finally the dictionary is sorted in descending order and returned.

_readfile

Text files are read three times in the code above so I included a short utility function to simplify things very slightly - it just opens, reads and closes the file before returning the contents. There is no exception handling in any of the above code as this is done by the calling code.

That's the module finished so we just need a few lines of code to test it.

frequencyanalysis_test.py

import frequencyanalysis


def main():

    """
    Demonstration of the frequencyanalysis module.
    """


    print("----------------------")
    print("| codedrome.com      |")
    print("| Frequency Analysis |")
    print("----------------------\n")

    try:
        frequencyanalysis.create_decryption_dictionary("the_sign_of_four_plaintext.txt",
                                                       "encrypted.txt",
                                                       "decryption_dict.json")
    except Exception as e:
        print(e)

    try:
        frequencyanalysis.decrypt_file("encrypted.txt",
                                       "decrypted.txt",
                                       "decryption_dict.json")
    except Exception as e:
        print(e)


main()

Ignoring the heading and exception handling there are only two lines of code in main, calling the two "public" functions in frequencyanalysis.py. Firstly we call create_decryption_dictionary to create the dictionary and save it as JSON. Then we call decrypt_file to generate decrypted.txt.

The code is finished so run it with this command:

Run

python3.7 frequencyanalysis_test.py

If you open the folder where your files are you'll find a new file called decrypted.txt which starts like this.

decrypted.txt - 1st attempt

msapterurhserdomksoduehurhserdomksoduehysoyahwhwaddcvercdateintseuorninghhavewpontsohenotinfreqwentommahionhysenseyahwpaddnigstyahheatelattsebreakfahttabdeihtoolwpontseseartsrwganlpimkelwptsehtimkysimsowrvihitorsaldeftbesinlsiutsenigstbeforeityahafinetsimkpiemeofyoolbwdbowhsealeloftsehortysimsihknoynahapenangdaycerxwhtwnlertsesealyahabroalhidverbanlneardcaninmsamrohhtoxauehuortiuerurmhfrousihfrienlhoftsemmsyahengravelwponityitstselateityahxwhthwmsahtimkahtseodlfahsionelfauidcpramtitionerwheltomarrclignifielhodilanlreahhwringyeddyathonysatlocowuakeofitsoduehyahhittingyitssihbamktoueanlisalgivensiunohignofucommwpationsoylilcowknoyysatiyahloingibedievecowsaveecehintsebamkofcowrsealisaveatdeahtayeddpodihselhidverpdatelmoffeepotinfrontofuehailsebwtteddueyathonysatlocowuakeofowrvihitorhhtimkhinmeyesavebeenhownfortwnateahtouihhsiuanlsavenonotionofsiherranltsihammilentadhowvenirbemouehofiuportanmedetuesearcowremonhtrwmttseuanbcanejauinationofititsinkhailifoddoyingahfarahimowdltseuetsolhofucmoupaniontsatlruortiuerihahwmmehhfwdedlerdcuelimaduanyeddehteeuelhinmetsoheysoknoysiugivesiutsihuarkoftseirappremiationgoolhailsoduehejmeddentitsinkadhotsattseprobabiditcihinfavowrofsihbeingamowntrcpramtitionerysoloehagreatleadofsihvihitingonfootyschobemawhetsihhtimktsowgsoriginaddcavercsanlhoueonesahbeenhoknomkelabowttsatimansarldciuagineatoynpramtitionerm

It doesn't look very decrypted but if we think the encrypted text is a book then it is probable that it starts with the word "chapter". It actually starts with "msapter" which is promising as only the first two letters are different.

Open decryption_dict.json and swap the following values. (Make sure you swap the lower case values, not the upper case keys.)

  • swap m and c
  • swap s and h

Comment out the call to frequencyanalysis.create_decryption_dictionary so we don't overwrite the edited dictionary but use the one we just edited, and then run the program again. This is the new decrypted.txt file.

decrypted.txt - 2nd attempt

chapterursherdockhoduesursherdockhoduesyhoyaswswaddmvermdateintheuorningssavewponthosenotinfreqwentoccasionsyhenheyaswpaddnightyasseatelatthebreakfasttabdeistoolwponthehearthrwganlpickelwpthestickyhichowrvisitorhaldeftbehinlhiuthenightbeforeityasafinethickpieceofyoolbwdbowshealelofthesortyhichisknoynasapenangdaymerxwstwnlerthehealyasabroalsidverbanlneardmaninchacrosstoxauesuortiuerurcsfrouhisfrienlsofthecchyasengravelwponityiththelateityasxwstswchastickastheodlfashionelfauidmpractitionerwseltocarrmlignifielsodilanlreasswringyeddyatsonyhatlomowuakeofithoduesyassittingyithhisbacktoueanlihalgivenhiunosignofumoccwpationhoylilmowknoyyhatiyasloingibedievemowhaveemesinthebackofmowrhealihaveatdeastayeddpodishelsidverpdatelcoffeepotinfrontofuesailhebwtteddueyatsonyhatlomowuakeofowrvisitorssticksinceyehavebeensownfortwnateastouisshiuanlhavenonotionofhiserranlthisaccilentadsowvenirbecouesofiuportancedetuehearmowreconstrwcttheuanbmanejauinationofitithinksailifoddoyingasfarasicowdltheuetholsofumcoupanionthatlruortiuerisaswccessfwdedlerdmuelicaduanyeddesteeuelsincethoseyhoknoyhiugivehiuthisuarkoftheirappreciationgoolsailhoduesejceddentithinkadsothattheprobabiditmisinfavowrofhisbeingacowntrmpractitioneryholoesagreatleadofhisvisitingonfootyhmsobecawsethisstickthowghoriginaddmavermhanlsoueonehasbeensoknockelabowtthaticanharldmiuagineatoynpractitionerc

The text now starts with the word "chapter", followed by "ursherdockhoduesursherdockhodues". (The character after the word "chapter" would usually be a number and therefore not encrypted by our simple cypher unless spelled out in words.) If we think the text might be about a character called "Mr Sherlock Holmes" we can easily identify which letters need to be swapped in the decryption dictionary:

  • u and m
  • d and l

Make these changes in the JSON file and run the program. This is the third version of the decrypted text.

decrypted.txt - 3rd attempt

chaptermrsherlockholmesmrsherlockholmesyhoyaswswalluverulateinthemorningssavewponthosenotinfreqwentoccasionsyhenheyaswpallnightyasseatedatthebreakfasttableistoodwponthehearthrwgandpickedwpthestickyhichowrvisitorhadleftbehindhimthenightbeforeityasafinethickpieceofyoodbwlbowsheadedofthesortyhichisknoynasapenanglayuerxwstwndertheheadyasabroadsilverbandnearluaninchacrosstoxamesmortimermrcsfromhisfriendsofthecchyasengravedwponityiththedateityasxwstswchastickastheoldfashionedfamilupractitionerwsedtocarrudignifiedsolidandreasswringyellyatsonyhatdouowmakeofitholmesyassittingyithhisbacktomeandihadgivenhimnosignofmuoccwpationhoydiduowknoyyhatiyasdoingibelieveuowhaveeuesinthebackofuowrheadihaveatleastayellpolishedsilverplatedcoffeepotinfrontofmesaidhebwttellmeyatsonyhatdouowmakeofowrvisitorssticksinceyehavebeensownfortwnateastomisshimandhavenonotionofhiserrandthisaccidentalsowvenirbecomesofimportanceletmehearuowreconstrwctthemanbuanejaminationofitithinksaidifolloyingasfarasicowldthemethodsofmucompanionthatdrmortimerisaswccessfwlelderlumedicalmanyellesteemedsincethoseyhoknoyhimgivehimthismarkoftheirappreciationgoodsaidholmesejcellentithinkalsothattheprobabilituisinfavowrofhisbeingacowntrupractitioneryhodoesagreatdealofhisvisitingonfootyhusobecawsethisstickthowghoriginalluaveruhandsomeonehasbeensoknockedabowtthaticanhardluimagineatoynpractitionerc

We now have what can be interpreted as a chapter number, a chapter title and the first three words of the chapter itself. After that comes "yhoyaswswalluverulate" which with a bit of thought could be "whowasusuallyverylate" so make the following swap in the JSON file and run the program.

  • y and w

This is the next version of decrypted.txt.

decrypted.txt - 4th attempt

chaptermrsherlockholmesmrsherlockholmeswhowasysyalluverulateinthemorningssaveyponthosenotinfreqyentoccasionswhenhewasypallnightwasseatedatthebreakfasttableistoodyponthehearthrygandpickedypthestickwhichoyrvisitorhadleftbehindhimthenightbeforeitwasafinethickpieceofwoodbylboysheadedofthesortwhichisknownasapenanglawuerxystyndertheheadwasabroadsilverbandnearluaninchacrosstoxamesmortimermrcsfromhisfriendsofthecchwasengravedyponitwiththedateitwasxystsychastickastheoldfashionedfamilupractitionerysedtocarrudignifiedsolidandreassyringwellwatsonwhatdouoymakeofitholmeswassittingwithhisbacktomeandihadgivenhimnosignofmuoccypationhowdiduoyknowwhatiwasdoingibelieveuoyhaveeuesinthebackofuoyrheadihaveatleastawellpolishedsilverplatedcoffeepotinfrontofmesaidhebyttellmewatsonwhatdouoymakeofoyrvisitorssticksincewehavebeensoynfortynateastomisshimandhavenonotionofhiserrandthisaccidentalsoyvenirbecomesofimportanceletmehearuoyreconstryctthemanbuanejaminationofitithinksaidifollowingasfarasicoyldthemethodsofmucompanionthatdrmortimerisasyccessfylelderlumedicalmanwellesteemedsincethosewhoknowhimgivehimthismarkoftheirappreciationgoodsaidholmesejcellentithinkalsothattheprobabilituisinfavoyrofhisbeingacoyntrupractitionerwhodoesagreatdealofhisvisitingonfootwhusobecaysethisstickthoyghoriginalluaveruhandsomeonehasbeensoknockedaboytthaticanhardluimagineatownpractitionerc

We nearly have "whowasusuallyverylate" so make this final edit to the dictionary...

  • y and u

...and run the program again. This is the fifth version of decrypted.txt.

decrypted.txt - 5th attempt

chaptermrsherlockholmesmrsherlockholmeswhowasusuallyverylateinthemorningssaveuponthosenotinfrequentoccasionswhenhewasupallnightwasseatedatthebreakfasttableistooduponthehearthrugandpickedupthestickwhichourvisitorhadleftbehindhimthenightbeforeitwasafinethickpieceofwoodbulbousheadedofthesortwhichisknownasapenanglawyerxustundertheheadwasabroadsilverbandnearlyaninchacrosstoxamesmortimermrcsfromhisfriendsofthecchwasengraveduponitwiththedateitwasxustsuchastickastheoldfashionedfamilypractitionerusedtocarrydignifiedsolidandreassuringwellwatsonwhatdoyoumakeofitholmeswassittingwithhisbacktomeandihadgivenhimnosignofmyoccupationhowdidyouknowwhatiwasdoingibelieveyouhaveeyesinthebackofyourheadihaveatleastawellpolishedsilverplatedcoffeepotinfrontofmesaidhebuttellmewatsonwhatdoyoumakeofourvisitorssticksincewehavebeensounfortunateastomisshimandhavenonotionofhiserrandthisaccidentalsouvenirbecomesofimportanceletmehearyoureconstructthemanbyanejaminationofitithinksaidifollowingasfarasicouldthemethodsofmycompanionthatdrmortimerisasuccessfulelderlymedicalmanwellesteemedsincethosewhoknowhimgivehimthismarkoftheirappreciationgoodsaidholmesejcellentithinkalsothattheprobabilityisinfavourofhisbeingacountrypractitionerwhodoesagreatdealofhisvisitingonfootwhysobecausethisstickthoughoriginallyaveryhandsomeonehasbeensoknockedaboutthaticanhardlyimagineatownpractitionerc

It's not quite perfect - you might want to identify and make a couple more swaps - but it is entirely readable and recognisable as the first chapter of The Hound of the Baskervilles.