Exploring Unicode in Python

You might be surprised to hear that Unicode has room for 1,114,112 code points, 137,994 of which are currently allocated to characters. These include letters and other symbols from a huge variety of alphabets, as well as punctuation, numbers and general-purpose symbols.
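
The first figure is easy to check from Python itself. The short sketch below is an illustration rather than part of the project: it prints the size of the code space and shows that allocated characters are scattered across it, so a character can sit at a code point far larger than the number of allocated characters.

```python
import sys
import unicodedata

# sys.maxunicode is the highest code point, 0x10FFFF.
code_space_size = sys.maxunicode + 1
print(code_space_size)  # 1114112

# Allocated characters are not packed at the low end of the code space.
# U+E0001 (decimal 917505) is a long-standing example.
high_code_point = 0xE0001
print(high_code_point, unicodedata.name(chr(high_code_point)))  # 917505 LANGUAGE TAG
```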

Each character has a name and a category to help you track down the ones you need, and in this project I will write a Python module which returns details of those satisfying specified search criteria.

The Project

This project consists of the following two files, which can be downloaded as a zip, or you can clone/download the GitHub repository if you prefer.

  • unicodefilter.py
  • unicodefilter_test.py

Source Code Links

ZIP File
GitHub

Let's look first at unicodefilter.py.

unicodefilter.py

import unicodedata


def get_characters(character_name_like="", category_name_like=""):

    """
    Returns a list of dictionaries holding
    details of the Unicode characters
    which satisfy the search criteria
    given in the arguments.
    """

    character_name_like = character_name_like.lower()
    category_name_like = category_name_like.lower()

    category_names = _create_category_names()

    ucl = []

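    # Scan the whole code space (0 to 0x10FFFF): 137,994 is a count of
    # allocated characters, not the highest code point in use.
    for n in range(0, 0x110000):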

        try:

            character = chr(n)
            name = unicodedata.name(character)
            category = unicodedata.category(character)
            category_name = category_names[category]

            if character_name_like in name.lower() \
            and category_name_like in category_name.lower():

                cd = {"codepoint_dec": n,
                      "codepoint_hex": format(n, "X"),
                      "character": character,
                      "name": name,
                      "category": category,
                      "category_name": category_name}

                ucl.append(cd)

        except ValueError:

            # unicodedata.name raises ValueError for code points with no name
            pass

    return ucl


def _create_category_names():

    category_names = {}

    # Letter
    category_names["Lu"] = "Letter, uppercase"
    category_names["Ll"] = "Letter, lowercase"
    category_names["Lt"] = "Letter, titlecase"
    category_names["Lm"] = "Letter, modifier"
    category_names["Lo"] = "Letter, other"

    # Mark
    category_names["Mn"] = "Mark, nonspacing"
    category_names["Mc"] = "Mark, spacing combining"
    category_names["Me"] = "Mark, enclosing"

    # Number
    category_names["Nd"] = "Number, decimal digit"
    category_names["Nl"] = "Number, letter"
    category_names["No"] = "Number, other"

    # Punctuation
    category_names["Pc"] = "Punctuation, connector"
    category_names["Pd"] = "Punctuation, dash"
    category_names["Ps"] = "Punctuation, open"
    category_names["Pe"] = "Punctuation, close"
    category_names["Pi"] = "Punctuation, initial quote"
    category_names["Pf"] = "Punctuation, final quote"
    category_names["Po"] = "Punctuation, other"

    # Symbol
    category_names["Sm"] = "Symbol, math"
    category_names["Sc"] = "Symbol, currency"
    category_names["Sk"] = "Symbol, modifier"
    category_names["So"] = "Symbol, other"

    # Separator
    category_names["Zs"] = "Separator, space"
    category_names["Zl"] = "Separator, line"
    category_names["Zp"] = "Separator, paragraph"

    # Other
    category_names["Cc"] = "Other, control"
    category_names["Cf"] = "Other, format"
    category_names["Cs"] = "Other, surrogate"
    category_names["Co"] = "Other, private use"
    category_names["Cn"] = "Other, not assigned"

    return category_names

First we import unicodedata from the Python standard library. It provides a number of functions, but I will only be using it to get the full name and category of each character.

get_characters

The get_characters function takes two arguments, which are search strings for character names and category names respectively. Their defaults are empty strings, which will return all characters. They are converted to lower case in the first two lines of the function to make the search case-insensitive.

Categories come to us from the unicodedata module as two-character strings so we need some way of obtaining the full category names. I have done this using a dictionary created by the _create_category_names function which I'll describe further down.

Next we create an empty list and then loop over code points, starting at 0. Within a try/except we use the chr function to obtain the character for code point n, and then call unicodedata.name to get the character's name. Not every code point has one: control characters, for example, have no Unicode name, and neither do unassigned code points. For these, unicodedata.name raises a ValueError exception, which we can simply ignore with pass.
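
To see the exception handling in isolation, here is a small sketch. The helper name safe_name is made up for this illustration and is not part of unicodefilter.py. It also shows that unicodedata.name accepts an optional default value, which avoids the exception altogether.

```python
import unicodedata


def safe_name(code_point):
    """Return the Unicode name of a code point, or None if it has none."""
    try:
        return unicodedata.name(chr(code_point))
    except ValueError:
        return None


print(safe_name(0x41))  # LATIN CAPITAL LETTER A
print(safe_name(0x0A))  # None (line feed is a control character)

# Passing a default as the second argument suppresses the ValueError
print(unicodedata.name(chr(0x0A), "<no name>"))  # <no name>
```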

If no exception is raised we use unicodedata.category to get the character's category as a two-letter code, and then look up the full category name in the category_names dictionary.

Next we check whether the two search strings occur in the character name and the category name respectively. Note that lower() is called on both names, again to make the function case-insensitive. If both match, we create a dictionary with six pieces of information about the current character and add it to the list.
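
The matching rule can be sketched on its own, as below. The function name matches and its parameter names are assumptions for this example only. The point to notice is that an empty string is a substring of every string, which is why the default arguments let every character through.

```python
def matches(name, category_name, name_like="", category_like=""):
    """Case-insensitive substring test on both the name and the category name."""
    return (name_like.lower() in name.lower()
            and category_like.lower() in category_name.lower())


print(matches("DIGIT FIVE", "Number, decimal digit", "digit", "NUMBER"))  # True
print(matches("DIGIT FIVE", "Number, decimal digit", "letter"))           # False
print(matches("DIGIT FIVE", "Number, decimal digit"))                     # True
```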

_create_category_names

As I mentioned above, categories have a two-character code: the first character (upper case) stands for the major category and the second (lower case) for the minor category.

This function simply creates the dictionary used in get_characters, with the category codes as keys and the category names as values.
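
As a quick illustration of the two-part codes, the sketch below asks unicodedata.category for a few familiar characters and splits each code into its major and minor parts. The small major_names dictionary is an assumption for this example; the module itself maps the full two-character codes.

```python
import unicodedata

# Hypothetical lookup for the major category letter only
major_names = {"L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
               "S": "Symbol", "Z": "Separator", "C": "Other"}

for ch in ["A", "a", "5", "+", " "]:
    code = unicodedata.category(ch)   # for example "Lu" for "A"
    major, minor = code[0], code[1]
    print(repr(ch), code, major_names[major], minor)
```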

Now let's write a bit of code to test the module.

unicodefilter_test.py

import unicodedata

import unicodefilter


def main():

    print("-----------------")
    print("| codedrome.com |")
    print("| Unicode       |")
    print("-----------------\n")

    print("Unicode version {}\n".format(unicodedata.unidata_version))

    ucl = unicodefilter.get_characters(character_name_like="coptic",
                                       category_name_like="number")

    for uc in ucl:

        print("| {:<6} | {:6} | {:4} | {:72} | {:2} | {:32} |"
              .format(uc["codepoint_dec"],
              uc["codepoint_hex"],
              uc["character"],
              uc["name"],
              uc["category"],
              uc["category_name"]))

    print("\n{} characters in filtered list\n".format(len(ucl)))


main()

The unicodedata module provides a unidata_version attribute which holds the version of Unicode it supports. I wrote this code using Python 3.7, which was released in June 2018 when the latest version of Unicode was 11, so that is what unidata_version reports, as you can see in the screenshots below. At the time of writing (May 2019) the current version of Unicode is 12.1, which I assume will be supported in Python 3.8, which is currently in Beta.
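
If you want to check which Unicode version your own interpreter carries, a minimal sketch follows. The output depends on the Python build, so no particular value is shown.

```python
import sys
import unicodedata

# unidata_version is read as a plain string such as "11.0.0"
version = unicodedata.unidata_version
major = int(version.split(".")[0])

print("Python {}.{} uses Unicode {}".format(sys.version_info[0],
                                            sys.version_info[1],
                                            version))
print("Major Unicode version:", major)
```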

The call to unicodefilter.get_characters includes arguments to get Coptic numbers, and is followed by a loop to print them in a table. The widths for character names and category names might look a bit high but some of the names are very long.
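
To see what the format specification does with a single row, here is a small sketch using a hand-made dictionary for the letter A in place of a real search result. The row_format name is introduced just for this example.

```python
row_format = "| {:<6} | {:6} | {:4} | {:72} | {:2} | {:32} |"

# Hand-made sample with the same keys that get_characters returns
uc = {"codepoint_dec": 65,
      "codepoint_hex": "41",
      "character": "A",
      "name": "LATIN CAPITAL LETTER A",
      "category": "Lu",
      "category_name": "Letter, uppercase"}

row = row_format.format(uc["codepoint_dec"], uc["codepoint_hex"],
                        uc["character"], uc["name"],
                        uc["category"], uc["category_name"])

print(row)
print(len(row))  # 141, as long as each value fits within its column width
```

Each number after the colon is a minimum column width, so short values are padded with spaces and the table columns line up.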

Now we can run the program.

Run

python3.7 unicodefilter_test.py

This is our list of Coptic numbers.

This is the output with a character_name_like argument of "chess".
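
If you want to try a name search without the module, the following is a minimal standalone sketch. The function find_by_name is hypothetical and much simpler than get_characters: it filters on the name only and scans the whole code space, so which characters it finds depends on the Unicode version of your Python.

```python
import sys
import unicodedata


def find_by_name(fragment):
    """Return (code point, character, name) tuples whose name contains fragment."""
    fragment = fragment.upper()   # Unicode names are upper case
    found = []
    for n in range(sys.maxunicode + 1):
        # The default "" avoids a ValueError for code points without a name
        name = unicodedata.name(chr(n), "")
        if fragment in name:
            found.append((n, chr(n), name))
    return found


chess = find_by_name("chess")

# Show the first few matches with their code points in hexadecimal
for n, character, name in chess[:6]:
    print("U+{:04X} {} {}".format(n, character, name))

print("{} characters found".format(len(chess)))
```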