You might be surprised to hear that Unicode has room for 1,114,112 code points, 137,994 of which have currently been allocated to characters. They include letters and other symbols from a huge variety of alphabets, punctuation, numbers and general-purpose symbols.
Each character has a name and a category to help you track down the ones you need, and in this project I will write a Python module which returns details of those satisfying specified search criteria.
The Project
This project consists of the following two files, which can be downloaded as a zip, or you can clone/download the GitHub repository if you prefer.
- unicodefilter.py
- unicodefilter_test.py
Let's look first at unicodefilter.py.
unicodefilter.py
import unicodedata


def get_characters(character_name_like="", category_name_like=""):

    """
    Returns a list of dictionaries holding details of the
    Unicode characters which satisfy the search criteria
    given in the arguments.
    """

    # Lower-case the search strings so matching is case-insensitive
    character_name_like = character_name_like.lower()
    category_name_like = category_name_like.lower()

    category_names = _create_category_names()

    ucl = []

    # Code points 0 to 137993 (0x21B09)
    for n in range(0, 137994):

        try:

            character = chr(n)
            # Raises ValueError if the code point has no name
            name = unicodedata.name(character)
            category = unicodedata.category(character)
            category_name = category_names[category]

            if character_name_like in name.lower() \
               and category_name_like in category_name.lower():

                cd = {"codepoint_dec": n,
                      "codepoint_hex": format(n, "X"),
                      "character": character,
                      "name": name,
                      "category": category,
                      "category_name": category_name}

                ucl.append(cd)

        except ValueError:
            # Unnamed code point: skip it
            pass

    return ucl


def _create_category_names():

    category_names = {}

    # Letter
    category_names["Lu"] = "Letter, uppercase"
    category_names["Ll"] = "Letter, lowercase"
    category_names["Lt"] = "Letter, titlecase"
    category_names["Lm"] = "Letter, modifier"
    category_names["Lo"] = "Letter, other"

    # Mark
    category_names["Mn"] = "Mark, nonspacing"
    category_names["Mc"] = "Mark, spacing combining"
    category_names["Me"] = "Mark, enclosing"

    # Number
    category_names["Nd"] = "Number, decimal digit"
    category_names["Nl"] = "Number, letter"
    category_names["No"] = "Number, other"

    # Punctuation
    category_names["Pc"] = "Punctuation, connector"
    category_names["Pd"] = "Punctuation, dash"
    category_names["Ps"] = "Punctuation, open"
    category_names["Pe"] = "Punctuation, close"
    category_names["Pi"] = "Punctuation, initial quote"
    category_names["Pf"] = "Punctuation, final quote"
    category_names["Po"] = "Punctuation, other"

    # Symbol
    category_names["Sm"] = "Symbol, math"
    category_names["Sc"] = "Symbol, currency"
    category_names["Sk"] = "Symbol, modifier"
    category_names["So"] = "Symbol, other"

    # Separator
    category_names["Zs"] = "Separator, space"
    category_names["Zl"] = "Separator, line"
    category_names["Zp"] = "Separator, paragraph"

    # Other
    category_names["Cc"] = "Other, control"
    category_names["Cf"] = "Other, format"
    category_names["Cs"] = "Other, surrogate"
    category_names["Co"] = "Other, private use"
    category_names["Cn"] = "Other, not assigned"

    return category_names
Firstly we import unicodedata from the Python standard library. This provides a number of functions but I will just be using it to get the full name and category of each character.
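To make those two lookups concrete, here is a minimal sketch of unicodedata.name and unicodedata.category used on their own. The name_and_category helper is a hypothetical wrapper written just for this illustration and is not part of the project files.

```python
import unicodedata


def name_and_category(character):
    """Return the (name, category code) pair for one named character."""
    return unicodedata.name(character), unicodedata.category(character)


# A letter, a digit and a currency symbol (the euro sign, U+20AC)
for character in ("A", "5", "\u20ac"):
    name, category = name_and_category(character)
    # Print the code point rather than the character itself so the
    # output does not depend on the terminal's encoding
    print("U+{:04X} | {} | {}".format(ord(character), name, category))
```

Each call returns the official Unicode name in upper case and the two-character category code.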
get_characters
The get_characters function takes two arguments, which are search strings for character names and category names respectively. Both default to an empty string, and because an empty string is a substring of every string, the defaults match all characters. The arguments are converted to lower case in the first two lines of the function to make the search case-insensitive.
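The matching rule can be shown in isolation. This is only a sketch of the idea: the matches function is a hypothetical helper, and the names passed to it are ordinary strings used as sample data.

```python
def matches(search_text, candidate):
    """
    Case-insensitive substring test. An empty search string is
    contained in every string, so it matches any candidate.
    """
    return search_text.lower() in candidate.lower()


print(matches("", "LATIN CAPITAL LETTER A"))        # empty filter matches
print(matches("Latin", "LATIN CAPITAL LETTER A"))   # case is ignored
print(matches("chess", "LATIN CAPITAL LETTER A"))   # no match
```

Using the in operator keeps the search simple: there are no wildcards or regular expressions, just "contains this text".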
Categories come to us from the unicodedata module as two-character strings so we need some way of obtaining the full category names. I have done this using a dictionary created by the _create_category_names function which I'll describe further down.
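Next we create an empty list before iterating over the code points from 0 to 137,993. Within a try/except we use the chr function to obtain the character equivalent of n, and then use unicodedata.name to attempt to get the character's name. Not every code point has a name: control characters, for example, do not, and neither do unassigned code points. For these unicodedata.name raises a ValueError exception, which we can just ignore with pass. Bear in mind that 137,994 is a count of allocated characters, not the highest code point. Code points are not allocated contiguously, so this loop stops at U+21B09 and never reaches characters allocated above it; to scan the whole code space the upper bound would need to be sys.maxunicode + 1, which is 1,114,112.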
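As an alternative to the try/except, unicodedata.name also accepts a second argument which is returned instead of raising ValueError. The sketch below uses that form. The named_code_points generator is a hypothetical helper, not part of unicodefilter.py, and its default upper bound of sys.maxunicode + 1 is an assumption about how you might cover the whole code space.

```python
import sys
import unicodedata


def named_code_points(start=0, stop=sys.maxunicode + 1):
    """
    Yield (code_point, name) for every code point in
    range(start, stop) which has a Unicode name.
    """
    for n in range(start, stop):
        # Passing a default avoids the ValueError for unnamed code points
        name = unicodedata.name(chr(n), None)
        if name is not None:
            yield n, name


# The first 128 code points: control characters are skipped
ascii_named = list(named_code_points(0, 128))
print("{} named characters below 128".format(len(ascii_named)))
print(ascii_named[0])
```

Scanning the full range takes noticeably longer than a partial scan, so passing start and stop is useful when you already know which block you are interested in.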
If we make it to the next line we use unicodedata.category to get the character's category as a two-character code, and then get the full category name from the category_names dictionary.
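The code-to-name lookup could be sketched like this. CATEGORY_NAMES here is a deliberately small, hypothetical subset of the full dictionary, and the fallback text for a missing code is an assumption added for the illustration; the module itself indexes the dictionary directly.

```python
import unicodedata

# Hypothetical subset of the full category dictionary
CATEGORY_NAMES = {"Lu": "Letter, uppercase",
                  "Nd": "Number, decimal digit",
                  "Sc": "Symbol, currency"}


def category_name(character):
    """Return the full category name, or a fallback for unlisted codes."""
    code = unicodedata.category(character)
    return CATEGORY_NAMES.get(code, "Unknown category ({})".format(code))


print(category_name("A"))
print(category_name("7"))
print(category_name(" "))   # "Zs" is not in the subset above
```

With the complete dictionary the fallback is never used, but dict.get is a cheap safeguard if the table is ever incomplete.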
Next we check to see whether the two filter criteria are in the character name and category name respectively. Note that lower() is also called on the names being searched, again to make the function case-insensitive. If there is a match we create a dictionary with six pieces of information about the current character and then add it to the list.
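Here is a small sketch of how one of these dictionaries is put together for a single code point. The describe function is hypothetical and leaves out category_name to stay self-contained, and the "U+" string is an extra shown only to illustrate zero-padded hexadecimal formatting.

```python
import unicodedata


def describe(n):
    """Build a details dictionary for the named code point n."""
    character = chr(n)
    return {"codepoint_dec": n,
            "codepoint_hex": format(n, "X"),   # upper-case hex, no prefix
            "character": character,
            "name": unicodedata.name(character),
            "category": unicodedata.category(character)}


details = describe(0x20AC)
print(details["codepoint_dec"], details["codepoint_hex"], details["name"])

# The conventional U+ notation pads to at least four hex digits
print("U+{:04X}".format(0x41))
```

Holding both the decimal and hexadecimal forms is convenient because Unicode charts use hexadecimal while chr and ord work with plain integers.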
_create_category_names
As I mentioned above, categories have a two-character code: the first character (in upper case) stands for the major category and the second (in lower case) for the minor category.
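Because the first character alone identifies the major category, it is possible to group characters without the full table. The following is a sketch only: MAJOR_CATEGORIES and major_category are hypothetical names, with the major category names taken from the comments in _create_category_names.

```python
import unicodedata

# Major categories keyed by the first character of the code
MAJOR_CATEGORIES = {"L": "Letter",
                    "M": "Mark",
                    "N": "Number",
                    "P": "Punctuation",
                    "S": "Symbol",
                    "Z": "Separator",
                    "C": "Other"}


def major_category(character):
    """Return the major category name for a character."""
    code = unicodedata.category(character)
    return MAJOR_CATEGORIES[code[0]]


for character in ("A", "5", "!", "+", " "):
    print(repr(character), major_category(character))
```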
This function simply creates the dictionary used in get_characters, with the category codes as keys and the category names as values.
Now let's write a bit of code to test the module.
unicodefilter_test.py
import unicodedata

import unicodefilter


def main():

    print("-----------------")
    print("| codedrome.com |")
    print("| Unicode       |")
    print("-----------------\n")

    print("Unicode version {}\n".format(unicodedata.unidata_version))

    ucl = unicodefilter.get_characters(character_name_like="coptic",
                                       category_name_like="number")

    for uc in ucl:

        print("| {:<6} | {:6} | {:4} | {:72} | {:2} | {:32} |"
              .format(uc["codepoint_dec"],
                      uc["codepoint_hex"],
                      uc["character"],
                      uc["name"],
                      uc["category"],
                      uc["category_name"]))

    print("\n{} characters in filtered list\n".format(len(ucl)))


main()
The unicodedata module provides a unidata_version attribute, a string holding the version of the Unicode database it uses. I wrote this code using Python 3.7 which was released in June 2018 when the latest version of Unicode was 11, so that is the value of unidata_version, as you can see in the screenshots below. At the time of writing (May 2019) the current version of Unicode is 12.1 which I assume will be supported in Python 3.8 which is currently in Beta.
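If you want to act on the version rather than just print it, the string can be split into numbers. This is a minimal sketch; unicode_version_tuple is a hypothetical helper and it assumes the version string consists of dot-separated integers.

```python
import unicodedata


def unicode_version_tuple():
    """Parse unidata_version (a dotted string) into a tuple of ints."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


print("Unicode version {}".format(unicodedata.unidata_version))
print("Supports Unicode 12 or later:", unicode_version_tuple() >= (12,))
```

The result depends on the Python installation running the code, so two machines can legitimately print different values.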
The call to unicodefilter.get_characters includes arguments to get Coptic numbers, and is followed by a loop to print them in a table. The widths for character names and category names might look a bit high but some of the names are very long.
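The column widths come from the format specifiers in the print call. The sketch below uses a cut-down row with four columns to show how they behave; format_row is a hypothetical helper. Numbers are right-aligned by default, which is why the first column uses < to force left alignment, while strings are already left-aligned.

```python
def format_row(codepoint_dec, codepoint_hex, character, name):
    """Format one table row with fixed-width, left-aligned columns."""
    return "| {:<6} | {:6} | {:4} | {:32} |".format(codepoint_dec,
                                                    codepoint_hex,
                                                    character,
                                                    name)


row = format_row(65, "41", "A", "LATIN CAPITAL LETTER A")
print(row)
print(len(row))

# Default alignment differs between numbers and strings
print("[{:6}]".format(65))     # right-aligned
print("[{:6}]".format("41"))   # left-aligned
```

A width is a minimum, not a maximum, so a name longer than its column pushes the rest of the row to the right instead of being truncated.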
Now we can run the program.
Run
python3.7 unicodefilter_test.py
This is our list of Coptic numbers.
This is the output with a character_name_like argument of "chess".