The Unicode Algorithms

This article is part of a series intended to demystify the Unicode standard and explore in some depth the aspects that programmers most often struggle with. It is recommended that you follow the articles in sequence, as the concepts in later articles build upon the ones explained in earlier ones.

  1. I ❤ Unicode (Intro)
  2. The Character Set
  3. The Encodings
  4. The Algorithms (This article)

⌘ The Rundown

  1. Beyond defining a character set and encodings, the Unicode standard also elevates itself to address problems of a higher order when it comes to handling text
  2. Characters are intended to form words and sentences in languages, each of which has rules that can be subtly different when it comes to performing common computing tasks such as sorting
  3. Most Unicode algorithms are implemented in a library called ICU (International Components for Unicode) and can be easily tapped into programmatically
  4. ICU itself uses CLDR (Common Locale Data Repository) as a structured data source for applying the algorithm in a way that is compliant with a given region or language (a locale)

As seen in the other two parts of this series, Unicode concerns itself with defining a universal set of characters and representing those characters as sequences of bytes. The Unicode standard also concerns itself with text as a broader domain. In this section, we are going to dive a bit deeper into those higher-order concerns: sorting text, comparing it, segmenting it, and rendering it correctly.

All of these tasks are defined by algorithms, which are specified as part of the Unicode standard.

The collation algorithm

Collation is a fancy term for sorting: the goal is to go from an unordered collection of Unicode strings to one where the strings are in the correct order. What is critical to understand here is that the “correct order” is neither universal nor invariant: it changes from person to person, and it changes over time.

Different scripts and different languages impose different rules for deciding the ordering of strings. Every code point does have a numeric U+ value, but that value does not imply any meaningful ordering. The Unicode Collation Algorithm is specified as Technical Report 10 in the Unicode standard.

There is generally a notion of basic order across the distinct characters of a script, but languages show incredible variety in the ordering rules they define: how diacritics and casing are ranked, whether sequences of characters contract into a single collation unit or single characters expand into several, and so on.

In other words, trying to sort text without knowing the expectations of the person who will read the result is pointless: you need to know the language and the context in which the sorted content will be used and presented.

The Unicode Collation Algorithm is therefore more of a framework within which collation can be defined and implemented for a given context. The goal of the algorithm is to translate a string of Unicode code points into a sort key: an array of numeric weights. Sort keys can then be compared as plain binary data to establish the order of the source strings they were computed from.

The following Java snippet demonstrates how to build a collator that properly sorts a list of strings:

package co.eiggen.unicode;

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Collation {
  public static void main(String... args) {
    // German has two collation variants; the "phonebook" one sorts
    // "Ä" as if it were written "Ae".
    ULocale germanPhoneBookLocale = ULocale.GERMAN.setKeywordValue("collation", "phonebook");
    Collator germanPhoneBookCollator = Collator.getInstance(germanPhoneBookLocale);

    List<String> words = new ArrayList<>();
    words.add("Affekt");
    words.add("Äquivalent");

    for (String word : words) {
      // The sort key is exposed as a byte array; print it as hexadecimal.
      byte[] collationKey = germanPhoneBookCollator.getCollationKey(word).toByteArray();
      BigInteger collationKeyAsInteger = new BigInteger(1, collationKey);
      System.out.printf("Collation key for %s is %s\n", word, collationKeyAsInteger.toString(16));
    }

    System.out.printf("Initial order: %s\n", words);
    // Collator implements Comparator, so it can be passed to sort() directly.
    Collections.sort(words, germanPhoneBookCollator);
    System.out.printf("Sorted order: %s\n", words);
  }
}

The code yields the following output:

Collation key for Affekt is 2a3434323e50010a01dc0900
Collation key for Äquivalent is 2a324a523a542a403244500145480d01dcc60d00
Initial order: [Affekt, Äquivalent]
Sorted order: [Äquivalent, Affekt]

As can be seen in the above snippet, the most popular implementation of the Unicode Collation Algorithm is ICU, which stands for International Components for Unicode. It is a library developed for C/C++ and Java that implements the Unicode algorithms on top of CLDR, the Common Locale Data Repository.

CLDR, as the name implies, is a repository of data that pertains to regions and languages of the world, and combinations thereof. Specifically for the purpose of sorting, CLDR encodes all the differences mentioned above: diacritics, casing, expansions and contractions, and so on. CLDR is also an authority for a range of other software internationalization tasks: formatting of dates, numbers, and currencies, pluralization rules, calendars, segmentation, and more.

Even though CLDR is maintained by the Unicode Consortium, its scope goes well beyond the handling of text. In addition to Java and C/C++, CLDR bindings have been created for Ruby and JavaScript, although the JavaScript one is being progressively superseded by the built-in Intl object.
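For quick experiments without pulling in ICU, the JDK also ships a collator under java.text. Its data is less current than ICU's, but it is enough to illustrate the concept of collation strength. A minimal sketch (the French sample words are arbitrary):

```java
import java.text.Collator;
import java.util.Locale;

public class JdkCollation {
  public static void main(String... args) {
    Collator collator = Collator.getInstance(Locale.FRENCH);

    // At the default (tertiary) strength, accents matter:
    // "cote" sorts before "côte".
    System.out.println(collator.compare("cote", "c\u00F4te") < 0);  // true

    // At primary strength, accent and case differences are ignored,
    // so the two strings compare as equal.
    collator.setStrength(Collator.PRIMARY);
    System.out.println(collator.compare("cote", "c\u00F4te") == 0); // true
  }
}
```

Lowering the strength is a common trick for accent-insensitive matching, which ICU's Collator supports through the same setStrength method.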

The normalization algorithm

Normalization addresses a problem that occurs when you want to compare strings rather than order them. As the first part of this series explained, Unicode defines a space of over a million code points, some of which are strikingly similar to others.

For instance, the standard includes code points that explicitly represent typographic ligatures: groupings of multiple characters that are rendered on screen as a single font glyph. One of the most common ligatures in the Latin script is U+FB01 LATIN SMALL LIGATURE FI (ﬁ). Ligatures do not carry any standalone semantics; they are purely a presentation artifact.

The normalization forms, defined at the standard level by Technical Report 15, are intended to solve this problem. “Why is normalization important?”, you might ask.

The most common use case is indexing: if you want to build a search engine or a database engine, a lot of the work consists in making sense of the indexed content and making it queryable. Both the stored content and the queries should be normalized, so that a text containing the “ﬁ” ligature actually matches an input such as “fi”.

The same logic applies to a number of other characters beyond ligatures: accented letters, half-width forms, circled digits, Roman numerals, and so on.

The normalization algorithm defines four normal forms, which have different characteristics and yield distinct results: NFD (canonical decomposition), NFC (canonical decomposition followed by canonical composition), NFKD (compatibility decomposition), and NFKC (compatibility decomposition followed by canonical composition).
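These forms are easy to experiment with: the JDK exposes them through java.text.Normalizer, and ICU offers the equivalent functionality through its Normalizer2 class. A small sketch with the fi ligature:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class Normalization {
  public static void main(String... args) {
    String ligature = "\uFB01"; // U+FB01 LATIN SMALL LIGATURE FI

    // Canonical forms leave the ligature untouched...
    System.out.println(Normalizer.normalize(ligature, Form.NFC));  // still one code point

    // ...while compatibility forms decompose it into "f" + "i".
    System.out.println(Normalizer.normalize(ligature, Form.NFKC)); // "fi", two code points
  }
}
```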

Which form is best suited depends on the task at hand. The table below shows the result of testing the normalized forms of two input strings for equality:

| Left string | Left code points | Right string | Right code points | NFD | NFC | NFKD | NFKC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| foo | U+66 LATIN SMALL LETTER F, U+6F LATIN SMALL LETTER O, U+6F LATIN SMALL LETTER O | bar | U+62 LATIN SMALL LETTER B, U+61 LATIN SMALL LETTER A, U+72 LATIN SMALL LETTER R | false | false | false | false |
| ﬁ | U+FB01 LATIN SMALL LIGATURE FI | fi | U+66 LATIN SMALL LETTER F, U+69 LATIN SMALL LETTER I | false | false | true | true |
| Å | U+212B ANGSTROM SIGN | Å | U+C5 LATIN CAPITAL LETTER A WITH RING ABOVE | true | true | true | true |
| ｶ | U+FF76 HALFWIDTH KATAKANA LETTER KA | カ | U+30AB KATAKANA LETTER KA | false | false | true | true |
| ⑦ | U+2466 CIRCLED DIGIT SEVEN | 7 | U+37 DIGIT SEVEN | false | false | true | true |
| 8 | U+38 DIGIT EIGHT | ⁸ | U+2078 SUPERSCRIPT EIGHT | false | false | true | true |
| ⅷ | U+2177 SMALL ROMAN NUMERAL EIGHT | viii | U+76 LATIN SMALL LETTER V, U+69 LATIN SMALL LETTER I, U+69 LATIN SMALL LETTER I, U+69 LATIN SMALL LETTER I | false | false | true | true |
| ñ | U+F1 LATIN SMALL LETTER N WITH TILDE | ñ | U+6E LATIN SMALL LETTER N, U+303 COMBINING TILDE | true | true | true | true |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | ế | U+65 LATIN SMALL LETTER E, U+302 COMBINING CIRCUMFLEX ACCENT, U+301 COMBINING ACUTE ACCENT | true | true | true | true |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | é̂ | U+65 LATIN SMALL LETTER E, U+301 COMBINING ACUTE ACCENT, U+302 COMBINING CIRCUMFLEX ACCENT | false | false | false | false |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | ế | U+EA LATIN SMALL LETTER E WITH CIRCUMFLEX, U+301 COMBINING ACUTE ACCENT | true | true | true | true |
| ế | U+65 LATIN SMALL LETTER E, U+302 COMBINING CIRCUMFLEX ACCENT, U+301 COMBINING ACUTE ACCENT | é̂ | U+65 LATIN SMALL LETTER E, U+301 COMBINING ACUTE ACCENT, U+302 COMBINING CIRCUMFLEX ACCENT | false | false | false | false |

As can be seen from these results, the largest difference in normalization results stems from the distinction between canonical and compatibility forms: the canonical normalization is very strict, while the compatibility one matches code points based on semantics.

In these examples, the choice between decomposition (D) and composition (C) never matters as long as it is applied consistently to both sides. Which one you pick is up to you, but be consistent in your choice: if you compared the normalized outputs of, e.g., NFKD and NFKC for the same input string, they would be non-equal for any string that contains composed characters:

| Input | Code points | NFKD = NFKC? |
| --- | --- | --- |
| foo | U+66 LATIN SMALL LETTER F, U+6F LATIN SMALL LETTER O, U+6F LATIN SMALL LETTER O | true |
| ﬁ | U+FB01 LATIN SMALL LIGATURE FI | true |
| ñ | U+F1 LATIN SMALL LETTER N WITH TILDE | false |
| Å | U+212B ANGSTROM SIGN | false |
| Å | U+C5 LATIN CAPITAL LETTER A WITH RING ABOVE | false |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | false |
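The “ñ” row can be checked with java.text.Normalizer: NFKD leaves the string decomposed, while NFKC recomposes it, so the two outputs are canonically equivalent but not equal byte-for-byte. A quick sketch:

```java
import java.text.Normalizer;
import java.text.Normalizer.Form;

public class FormsDiffer {
  public static void main(String... args) {
    String n = "\u00F1"; // ñ, the precomposed form

    String nfkd = Normalizer.normalize(n, Form.NFKD); // "n" + U+303 COMBINING TILDE
    String nfkc = Normalizer.normalize(n, Form.NFKC); // back to the single code point U+F1

    System.out.println(nfkd.equals(nfkc)); // false
    System.out.println(nfkd.length());     // 2
    System.out.println(nfkc.length());     // 1
  }
}
```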

The code used to generate these tables is available here.

Miscellaneous other algorithms

Segmentation

As we just saw in the previous section about normalization, a single character may actually be represented by several code points; such combining sequences are canonically equivalent to their precomposed form. What this means pragmatically is that the end user doesn't care how the underlying text is formed: they care about the end result they can see and interact with. Whether “ñ” is written as U+F1 LATIN SMALL LETTER N WITH TILDE or as U+6E LATIN SMALL LETTER N followed by U+303 COMBINING TILDE makes no difference.

The problem becomes more complex with more advanced forms of character composition: “👨🏿‍🦲”, for instance, is actually a combination of four (!) code points, but when rendered in a UI it should still behave as a single character. Segmentation (specified in Unicode Technical Report 29) allows the programmatic chunking of a Unicode string into clusters of code points, called grapheme clusters, that match more closely what an end user perceives as characters.
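The gap between code points and perceived characters is easy to observe. The sketch below uses the JDK's built-in java.text.BreakIterator, purely to stay dependency-free; ICU exposes a near-identical API under com.ibm.icu.text.BreakIterator, with more up-to-date segmentation data:

```java
import java.text.BreakIterator;

public class Graphemes {
  public static void main(String... args) {
    // "ế" built from three code points: e + combining circumflex + combining acute
    String s = "e\u0302\u0301";
    System.out.println("UTF-16 code units: " + s.length()); // 3

    // The character instance iterates over grapheme clusters,
    // i.e. user-perceived characters.
    BreakIterator it = BreakIterator.getCharacterInstance();
    it.setText(s);
    it.first();
    int clusters = 0;
    while (it.next() != BreakIterator.DONE) {
      clusters++;
    }
    System.out.println("user-perceived characters: " + clusters); // 1
  }
}
```

This is why string.length() is almost never the right way to count “characters” as a user understands them.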

Segmentation doesn't just address the grouping of code points into higher-level symbols; it also defines how to break a text into words, sentences, and paragraphs. This has several use cases: counting characters or words, moving the text cursor, handling double-click selection, and deciding where lines may wrap.

ICU ships with a segmentation implementation compliant with the Unicode standard.
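The JDK offers the same facility through java.text.BreakIterator; the sketch below splits a sentence into words (the iterator also yields the spaces and punctuation between words, so a small filter keeps only letter-bearing chunks):

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class Words {
  public static void main(String... args) {
    String text = "Hello, world!";

    BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
    it.setText(text);

    List<String> words = new ArrayList<>();
    int start = it.first();
    for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
      String chunk = text.substring(start, end);
      // Keep only chunks that contain at least one letter.
      if (chunk.codePoints().anyMatch(Character::isLetter)) {
        words.add(chunk);
      }
    }
    System.out.println(words); // [Hello, world]
  }
}
```

Swapping in com.ibm.icu.text.BreakIterator gives access to the CLDR-driven rules, which matter for languages such as Thai or Japanese that do not separate words with spaces.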

The Bidirectional Algorithm

The bidirectional algorithm (Unicode Technical Report 9) is perhaps the most abstract of them all, making it arduous to grasp and to implement correctly. You should almost never need to interact with it directly; instead, rely on the implementation at the OS or browser level to do the right thing.

The writing systems of the world each have their own natural directionality. Latin, Greek, and many other scripts are written left to right (LTR), while scripts such as Hebrew and Arabic are written and read right to left (RTL). The bidirectional algorithm concerns itself with the rendering of text in mixed contexts, e.g. when Arabic and Latin characters appear in the same sentence or paragraph.

It introduces formatting characters that can be used to assign segments of text an explicit directionality, which may be different from the one they would have by default. These characters have no semantics of their own and are only hints to systems tasked with rendering text to the user:

The directional formatting characters are used only to influence the display ordering of text. In all other respects they should be ignored—they have no effect on the comparison of text or on word breaks, parsing, or numeric analysis.
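Although rendering is best left to the platform, the run structure the algorithm produces can be inspected programmatically. The sketch below uses the JDK's java.text.Bidi class on a paragraph mixing Latin text with an arbitrary Hebrew word:

```java
import java.text.Bidi;

public class BidiRuns {
  public static void main(String... args) {
    // Latin text (LTR) with an embedded Hebrew word (RTL): "shalom"
    String mixed = "abc \u05E9\u05DC\u05D5\u05DD xyz";

    Bidi bidi = new Bidi(mixed, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
    System.out.println("mixed directionality: " + bidi.isMixed()); // true

    // The algorithm splits the paragraph into directional runs;
    // odd embedding levels are right-to-left.
    for (int i = 0; i < bidi.getRunCount(); i++) {
      System.out.printf("run %d: chars [%d, %d), embedding level %d%n",
          i, bidi.getRunStart(i), bidi.getRunLimit(i), bidi.getRunLevel(i));
    }
  }
}
```

Here the paragraph resolves to three runs: the leading Latin text, the Hebrew word, and the trailing Latin text. A renderer would lay out the middle run right to left while keeping the overall paragraph left to right.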