The Unicode Algorithms
This article is the third in a series intended to demystify the Unicode standard and explore in some depth the aspects that programmers have issues with. It is recommended that you follow the articles in sequence, as the concepts in later articles build upon those explained in earlier ones.
- I ❤ Unicode (Intro)
- The Character Set
- The Encodings
- The Algorithms (This article)
As seen in the other two parts of this series, Unicode concerns itself with defining a universal set of characters and representing those characters as sequences of bytes. The Unicode standard also concerns itself with text as a broader domain. In this article, we are going to dive a bit deeper into:
- How is text sorted? How do you define an ordering across characters in a set as vast as the one defined by Unicode?
- How is text compared? We saw that Unicode sometimes defines distinct code points for the same entity, e.g. U+FF76 HALFWIDTH KATAKANA LETTER KA (ｶ) and U+30AB KATAKANA LETTER KA (カ). What happens when those two are compared?
All of these questions are answered by algorithms, which are specified as part of the standard.
The collation algorithm
Collation is a fancy term for sorting: your goal in this case is to go from an unordered collection of Unicode strings to a state where these are in the correct order. What is critical to understand here is that the “correct order” is neither universal nor invariant: it changes from person to person, and it changes over time.
Different scripts and different languages impose different rules for deciding the ordering of strings. Every code point has a numeric value (its U+ number), but that value does not imply any meaningful ordering. The Unicode Collation Algorithm is specified in Unicode Technical Report 10.
There is generally a notion of basic order across distinct characters in a script, but languages have an incredible variety when it comes to defining the ordering rules:
- Characters that have diacritical marks (e.g. accents) generally follow specific ordering rules, or are occasionally treated as characters separate from the one they are graphically based on.
- Groups of multiple characters may be contracted as a single token for the purpose of sorting the string, e.g. “ch” in Czech sorts after “h”, whereas it sorts after the “c” character in Spanish.
- Characters may be expanded into clusters of multiple distinct characters, e.g. “ä” in certain German contexts expands to “ae”, which means a word such as “Äquivalent” will be sorted before “Affekt”.
- When whole phrases rather than single words are compared, the rules for handling punctuation such as dashes may vary.
- Text may need to be normalized, which will be covered in detail further down in this post.
- Where the script has letter casing, the case of letters might factor into the ordering rule.
In other words, trying to sort text without knowing the expectations of the person who will read the result is pointless: you need to know the language and the context in which the sorted content will be used and presented.
The Unicode Collation Algorithm is therefore more of a framework for collation to be defined and implemented for a given context. The goal of the algorithm is to translate strings of Unicode code points into a sort key, which is an array of 16-bit integers. The sort keys can then be compared in binary to establish the order of the source strings they were computed for.
The following snippet of code demonstrates how to build a collator to properly sort a list of strings, in Java:
```java
package co.eiggen.unicode;

import com.ibm.icu.text.Collator;
import com.ibm.icu.util.ULocale;

import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Collation {

    public static void main(String... args) {
        // German "phonebook" collation expands "ä" to "ae" for sorting purposes.
        ULocale germanPhoneBookLocale = ULocale.GERMAN.setKeywordValue("collation", "phonebook");
        Collator germanPhoneBookCollator = Collator.getInstance(germanPhoneBookLocale);

        List<String> words = new ArrayList<String>();
        words.add("Affekt");
        words.add("Äquivalent");

        // Print each word's sort key: a byte array that can be compared in binary.
        for (String word : words) {
            byte[] collationKey = germanPhoneBookCollator.getCollationKey(word).toByteArray();
            BigInteger collationKeyAsInteger = new BigInteger(1, collationKey);
            System.out.printf("Collation key for %s is %s\n", word, collationKeyAsInteger.toString(16));
        }

        System.out.printf("Initial order: %s\n", words.toString());
        Collections.sort(words, germanPhoneBookCollator);
        System.out.printf("Sorted order: %s\n", words.toString());
    }
}
```
The code yields the following output:
```
Collation key for Affekt is 2a3434323e50010a01dc0900
Collation key for Äquivalent is 2a324a523a542a403244500145480d01dcc60d00
Initial order: [Affekt, Äquivalent]
Sorted order: [Äquivalent, Affekt]
```
As can be seen in the above snippet, the most popular implementation of the Unicode Collation Algorithm is ICU, which stands for International Components for Unicode. It’s a library developed for C/C++ and Java that provides bindings to CLDR, the Common Locale Data Repository.
CLDR, as the name implies, is a repository of data that pertains to regions and languages of the world, and the combination thereof. Specifically for the purpose of sorting, CLDR encodes all the differences mentioned above: diacritics, casing, expansion and contraction, etc… CLDR is also an authority for a range of tasks pertaining to software internationalization: formatting of dates, numbers, currencies, pluralization rules, calendars, segmentation, etc…
Even though CLDR is maintained by the Unicode consortium, its scope goes well beyond the handling of text. In addition to Java and C/C++, bindings of CLDR have been created for Ruby and JavaScript, even though the latter is being progressively replaced by the Intl object.
The normalization algorithm
Normalization comes into play when you want to compare strings rather than order them. As the first part of this series explained, Unicode defines over a million code points, with some of them being strikingly similar to others.
In other cases, the standard includes code points that explicitly represent typographic ligatures: groupings of multiple characters that are rendered on screen as a single font glyph. One of the most common ligatures in the Latin script is U+FB01 LATIN SMALL LIGATURE FI (ﬁ). Ligatures carry no standalone semantics; they are purely a presentation artifact.
The normalization forms, defined at the standard level by Technical Report 15, address this issue. “Why is normalization important?”, you might ask.
The most common use case is indexing: if you want to build a search engine, or a database engine, a lot of the work consists in making sense of the indexed content and making it queryable. Both the indexed content and the user’s query should be normalized, so that text containing the “ﬁ” ligature actually matches an input such as “fi”.
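To make this concrete, here is a minimal sketch of such a normalization step using ICU’s Normalizer2 class (the `contentMatchesQuery` helper is hypothetical, purely for illustration):

```java
import com.ibm.icu.text.Normalizer2;

public class SearchNormalization {

    // Hypothetical helper: normalize both the indexed content and the query to NFKC
    // so that compatibility-equivalent strings (like the "fi" ligature and "fi") match.
    static boolean contentMatchesQuery(String content, String query) {
        Normalizer2 nfkc = Normalizer2.getNFKCInstance();
        return nfkc.normalize(content).contains(nfkc.normalize(query));
    }

    public static void main(String... args) {
        // The content uses U+FB01 LATIN SMALL LIGATURE FI, the query uses plain "f" + "i".
        System.out.println(contentMatchesQuery("\uFB01le not found", "file")); // prints true
    }
}
```

Picking NFKC here is a design choice: compatibility normalization is lossy, so it is typically applied to the copy of the text used for matching, not to the stored original.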
The same logic applies to a number of other characters beyond ligatures:
- Unicode has dedicated code points for superscript numbers, e.g. U+2078 SUPERSCRIPT EIGHT (⁸)
- Full-width and half-width Katakana characters and Latin characters, which were covered in the first post in this series
- Unicode defines circled alphanumeric code points such as U+2466 CIRCLED DIGIT SEVEN (⑦)
- Old Roman numerals such as U+216F ROMAN NUMERAL ONE THOUSAND (Ⅿ), or even composed forms such as U+2177 SMALL ROMAN NUMERAL EIGHT (ⅷ)
The normalization algorithm defines 4 normal forms which have different characteristics and will yield distinct results:
- A normalization form may be concerned with canonical equivalence (NF) or compatibility equivalence (NFK). Canonical normalization only treats as equivalent sequences that represent the exact same abstract character (such as a precomposed letter and its decomposed counterpart), while compatibility normalization is more lenient and also considers distinct code points or combinations of code points that share the same meaning to be equivalent.
- A normalization form may be concerned with the composed (C) or decomposed (D) form of the underlying characters. Some diacritical characters have a standalone, precomposed form, e.g. U+F4 LATIN SMALL LETTER O WITH CIRCUMFLEX (ô), but may also be written as a sequence of U+6F LATIN SMALL LETTER O (o) and U+302 COMBINING CIRCUMFLEX ACCENT ( ̂). In the latter case, the order in which the code points are combined affects the normalized form of the string (see the short sketch after this list).
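As a quick illustration of the composed/decomposed distinction, the following sketch (again relying on ICU’s Normalizer2; the JDK’s java.text.Normalizer would work just as well) prints the code points of the NFC and NFD forms of “ô”:

```java
import com.ibm.icu.text.Normalizer2;

public class ComposedVsDecomposed {
    public static void main(String... args) {
        // NFC composes "o" + COMBINING CIRCUMFLEX ACCENT into the precomposed character...
        String composed = Normalizer2.getNFCInstance().normalize("o\u0302");
        // ...while NFD decomposes the precomposed character back into the sequence.
        String decomposed = Normalizer2.getNFDInstance().normalize("\u00F4");

        composed.codePoints().forEach(cp -> System.out.printf("NFC: U+%04X%n", cp));   // U+00F4
        decomposed.codePoints().forEach(cp -> System.out.printf("NFD: U+%04X%n", cp)); // U+006F, U+0302
    }
}
```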
Picking the form best suited for the task depends on your goal. The table below shows the result of testing the normalized forms of two input strings for equality:
| Left string | Left code points | Right string | Right code points | NFD | NFC | NFKD | NFKC |
|---|---|---|---|---|---|---|---|
| foo | U+66 LATIN SMALL LETTER F, U+6F LATIN SMALL LETTER O, U+6F LATIN SMALL LETTER O | bar | U+62 LATIN SMALL LETTER B, U+61 LATIN SMALL LETTER A, U+72 LATIN SMALL LETTER R | false | false | false | false |
| ﬁ | U+FB01 LATIN SMALL LIGATURE FI | fi | U+66 LATIN SMALL LETTER F, U+69 LATIN SMALL LETTER I | false | false | true | true |
| Å | U+212B ANGSTROM SIGN | Å | U+C5 LATIN CAPITAL LETTER A WITH RING ABOVE | true | true | true | true |
| ｶ | U+FF76 HALFWIDTH KATAKANA LETTER KA | カ | U+30AB KATAKANA LETTER KA | false | false | true | true |
| ⑦ | U+2466 CIRCLED DIGIT SEVEN | 7 | U+37 DIGIT SEVEN | false | false | true | true |
| 8 | U+38 DIGIT EIGHT | ⁸ | U+2078 SUPERSCRIPT EIGHT | false | false | true | true |
| ⅷ | U+2177 SMALL ROMAN NUMERAL EIGHT | viii | U+76 LATIN SMALL LETTER V, U+69 LATIN SMALL LETTER I, U+69 LATIN SMALL LETTER I, U+69 LATIN SMALL LETTER I | false | false | true | true |
| ñ | U+F1 LATIN SMALL LETTER N WITH TILDE | ñ | U+6E LATIN SMALL LETTER N, U+303 COMBINING TILDE | true | true | true | true |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | ế | U+65 LATIN SMALL LETTER E, U+302 COMBINING CIRCUMFLEX ACCENT, U+301 COMBINING ACUTE ACCENT | true | true | true | true |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | é̂ | U+65 LATIN SMALL LETTER E, U+301 COMBINING ACUTE ACCENT, U+302 COMBINING CIRCUMFLEX ACCENT | false | false | false | false |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | ế | U+EA LATIN SMALL LETTER E WITH CIRCUMFLEX, U+301 COMBINING ACUTE ACCENT | true | true | true | true |
| ế | U+65 LATIN SMALL LETTER E, U+302 COMBINING CIRCUMFLEX ACCENT, U+301 COMBINING ACUTE ACCENT | é̂ | U+65 LATIN SMALL LETTER E, U+301 COMBINING ACUTE ACCENT, U+302 COMBINING CIRCUMFLEX ACCENT | false | false | false | false |
As can be seen from these results, the largest difference in normalization results stems from the distinction between canonical and compatibility forms: canonical normalization is very strict, while compatibility normalization matches code points based on their semantics.
None of the examples show a case where the composed/decomposed distinction matters, as long as it is applied consistently to both sides. Which of the two you pick is up to you, but be consistent in your choice: if you were to compare, e.g., the NFKD and NFKC outputs of the same input string, they would differ for any string that contains characters with a composed form:
| Input | Code points | NFKD =? NFKC |
|---|---|---|
| foo | U+66 LATIN SMALL LETTER F, U+6F LATIN SMALL LETTER O, U+6F LATIN SMALL LETTER O | true |
| ﬁ | U+FB01 LATIN SMALL LIGATURE FI | true |
| ñ | U+F1 LATIN SMALL LETTER N WITH TILDE | false |
| Å | U+212B ANGSTROM SIGN | false |
| Å | U+C5 LATIN CAPITAL LETTER A WITH RING ABOVE | false |
| ế | U+1EBF LATIN SMALL LETTER E WITH CIRCUMFLEX AND ACUTE | false |
The code used to generate these tables is available here.
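That linked code is not reproduced here, but the equality checks in the tables could be sketched roughly as follows, again with ICU’s Normalizer2 (the string pair is just one row of the table above):

```java
import com.ibm.icu.text.Normalizer2;

public class NormalizedEquality {
    public static void main(String... args) {
        String left = "\u212B";  // Å, U+212B ANGSTROM SIGN
        String right = "\u00C5"; // Å, U+C5 LATIN CAPITAL LETTER A WITH RING ABOVE

        // Compare the two strings under each of the four normalization forms.
        System.out.println("NFD:  " + Normalizer2.getNFDInstance().normalize(left)
                .equals(Normalizer2.getNFDInstance().normalize(right)));
        System.out.println("NFC:  " + Normalizer2.getNFCInstance().normalize(left)
                .equals(Normalizer2.getNFCInstance().normalize(right)));
        System.out.println("NFKD: " + Normalizer2.getNFKDInstance().normalize(left)
                .equals(Normalizer2.getNFKDInstance().normalize(right)));
        System.out.println("NFKC: " + Normalizer2.getNFKCInstance().normalize(left)
                .equals(Normalizer2.getNFKCInstance().normalize(right)));
    }
}
```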
Miscellaneous other algorithms
Segmentation
As we just saw in the previous section about the normalization algorithm, a single character may actually be represented by several code points; these combining character sequences are semantically equivalent to their precomposed form. What this means pragmatically is that the end user doesn’t care how the underlying text is encoded: they care about the end result that they can see and interact with. Whether “ñ” is written as U+F1 LATIN SMALL LETTER N WITH TILDE or as U+6E LATIN SMALL LETTER N followed by U+303 COMBINING TILDE makes no difference.
The problem becomes more complex with more advanced forms of composing characters, e.g. “👨🏿🦲” is actually a combination of four (!) code points, but when it is rendered in a UI, it should still be represented as a single character. Segmentation (specified in Unicode Technical Report 29) is intended to allow the programmatic chunking of a Unicode string into clusters of code points, called grapheme clusters, that match more closely what an end user might expect.
Segmentation doesn’t just address the grouping of code points into higher-level symbols, it also defines how to break a text into words, sentences and paragraphs. There are several use cases for this:
- Computing the user-visible length of a string
- Sentence-level breaking is useful for search engines to index content
- Implementing text selection, e.g. double-clicking on a word extends the selection to the full word
ICU ships with a segmentation implementation compliant with the Unicode standard.
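For instance, a minimal sketch of grapheme-level segmentation with ICU’s BreakIterator, counting user-perceived characters rather than code units, could look like this:

```java
import com.ibm.icu.text.BreakIterator;

public class GraphemeCount {

    // Count user-perceived characters (grapheme clusters) rather than UTF-16 code units.
    static int graphemeCount(String text) {
        BreakIterator boundaries = BreakIterator.getCharacterInstance();
        boundaries.setText(text);
        int count = 0;
        while (boundaries.next() != BreakIterator.DONE) {
            count++;
        }
        return count;
    }

    public static void main(String... args) {
        String decomposed = "e\u0302\u0301"; // U+65, U+302, U+301 (renders as "ế")
        System.out.println(decomposed.length());       // 3 UTF-16 code units
        System.out.println(graphemeCount(decomposed)); // 1 grapheme cluster
    }
}
```

The same BreakIterator class also exposes word, sentence, and line instances for the other kinds of segmentation mentioned above.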
The Bidirectional Algorithm
The bidirectional algorithm (Unicode Technical Report 9) is perhaps the most abstract of them all, making it arduous to grasp and to implement correctly. You should almost never need to interact with it directly; instead, rely on the implementation at the OS or browser level to do the right thing.
Writing systems of the world each have their own natural directionality. Latin, Greek, and many other scripts are written left to right (LTR), while scripts such as Hebrew and Arabic are written and read from right to left (RTL). The bidirectional algorithm concerns itself with the rendering of text in mixed contexts, e.g. when Arabic and Latin characters appear in the same sentence or paragraph.
It introduces formatting characters that can be used to give segments of text an explicit directionality, which may differ from the one they would have by default. These characters have no semantics of their own and are only hints to the systems tasked with rendering text to the user:
The directional formatting characters are used only to influence the display ordering of text. In all other respects they should be ignored—they have no effect on the comparison of text or on word breaks, parsing, or numeric analysis.
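Purely to illustrate what the algorithm computes, the following sketch uses the JDK’s java.text.Bidi class (an implementation of the bidirectional algorithm) to split a mixed-direction string into directional runs; in real applications the text-rendering layer performs this step for you:

```java
import java.text.Bidi;

public class BidiRuns {
    public static void main(String... args) {
        // A Latin sentence containing the Hebrew word "shalom" (written right to left).
        String mixed = "hello \u05E9\u05DC\u05D5\u05DD world";

        Bidi bidi = new Bidi(mixed, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        for (int i = 0; i < bidi.getRunCount(); i++) {
            String run = mixed.substring(bidi.getRunStart(i), bidi.getRunLimit(i));
            // Even embedding levels are left-to-right, odd levels are right-to-left.
            String direction = (bidi.getRunLevel(i) % 2 == 0) ? "LTR" : "RTL";
            System.out.printf("Run %d (%s): %s%n", i, direction, run);
        }
    }
}
```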