The Unicode Encodings

This article is the second in a series intended to demystify the Unicode standard and explore in some depth the aspects that programmers have issues with. It is recommended that you follow the articles in sequence, as the concepts in later articles build upon the ones explained in earlier ones.

  1. I ❤ Unicode (Intro)
  2. The Character Set
  3. The Encodings (This article)
  4. The Algorithms

⌘ The Rundown

  1. Unlike most other character sets, there is no straightforward mapping between Unicode code points and bytes
  2. You must specify an encoding when converting bytes into Unicode code points, and code points into bytes
  3. All Unicode encodings support the full set of 1,114,112 possible Unicode code points
  4. The Unicode standard defines several encodings which are called transformation formats: UTF-32, UTF-16 and UTF-8 are all part of the Unicode standard
  5. UTF-16 leverages an indexing scheme called surrogate pairs, which was devised when the character set grew past the BMP
  6. UTF-8 is by far the most popular encoding in the world and the recommended one for several document formats, such as JSON and HTML

Perhaps Unicode’s greatest insight is the decoupling of characters from bytes. Before Unicode, most systems used for representing strings equated the bytes read from, e.g., a file, to the characters in memory. This made systems easy to understand, but it inherently limited them.

For instance, the C programming language defines a very minimal character set that contains the basic Latin characters, and it defines the built-in char type as follows:

If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

When it came down to representing those characters in memory and writing them to a file, virtually all C implementations settled on ASCII: the ASCII character ‘a’ was mapped to 0x61, and that was also the byte value of the character ‘a’ when declared in C. That is not the case in Unicode: the lowercase Latin letter ‘a’ is defined as U+0061. That’s a subtle difference but a critical one: the U+ notation is how you know the value refers to a Unicode code point and not to bytes.

Encodings sit between bytes and code points. They consist of rules for translating code points into bytes and bytes into code points. This implies that you cannot simply take bytes and turn them into a meaningful Unicode string of text without knowing which encoding was used to write those bytes.

Unicode defines several encodings, all of them designed to support the full range of possible code points across all planes. Encodings are designated by the “UTF” prefix, which stands for “Unicode Transformation Format”.
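
To make this concrete, here is a minimal sketch in Go (my choice of language; its standard library ships unicode/utf8, unicode/utf16 and encoding/binary) that turns a single code point into bytes under each of the three transformation formats, assuming big-endian byte order for UTF-16 and UTF-32:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

func main() {
	r := rune(0x1F60D) // U+1F60D SMILING FACE WITH HEART-SHAPED EYES

	// UTF-8: one to four bytes per code point.
	buf8 := make([]byte, 4)
	n := utf8.EncodeRune(buf8, r)
	fmt.Printf("UTF-8:  % X\n", buf8[:n]) // F0 9F 98 8D

	// UTF-16 (big-endian): one or two 16-bit chunks per code point.
	chunks := utf16.Encode([]rune{r})
	buf16 := make([]byte, 2*len(chunks))
	for i, u := range chunks {
		binary.BigEndian.PutUint16(buf16[2*i:], u)
	}
	fmt.Printf("UTF-16: % X\n", buf16) // D8 3D DE 0D

	// UTF-32 (big-endian): always four bytes per code point.
	buf32 := make([]byte, 4)
	binary.BigEndian.PutUint32(buf32, uint32(r))
	fmt.Printf("UTF-32: % X\n", buf32) // 00 01 F6 0D
}
```

Three different byte sequences, one code point: without knowing which encoding produced the bytes, a reader cannot get back to U+1F60D.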

UTF-32

UTF-32 is the simplest and most straightforward of all Unicode encodings. It’s a fixed-length encoding where each code point is encoded over 32 bits. The U+ value of a code point is directly converted into bytes with the same value:

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 4 | 21 | U+0000 | U+10FFFF | 00000000 (0x00) | 000xxxxx | xxxxxxxx | xxxxxxxx |

Because there are “only” a little over one million Unicode code points, while 32 bits can represent up to roughly 4 billion values, the 11 high bits are always set to 0. This makes UTF-32 an extremely wasteful encoding.

One useful property is that it makes indexing into a string of bytes trivial: the code point at index n is simply the nth 32-bit chunk in your string. But there’s little practical use for this: Unicode code points may combine with adjacent ones to form the actual grapheme, which is the unit of text that is truly meaningful to end users.
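
Because the mapping is direct, a hand-rolled UTF-32 encoder and indexer fit in a few lines. The sketch below is mine, assumes big-endian byte order and skips validation (a real implementation would reject surrogates and anything above U+10FFFF):

```go
// utf32EncodeBE writes a code point as 4 big-endian bytes.
func utf32EncodeBE(r rune) [4]byte {
	return [4]byte{
		byte(r >> 24), // always 0x00: code points never need more than 21 bits
		byte(r >> 16),
		byte(r >> 8),
		byte(r),
	}
}

// utf32DecodeBE reads the code point at index i: it is simply the i-th 32-bit chunk.
func utf32DecodeBE(b []byte, i int) rune {
	off := 4 * i
	return rune(uint32(b[off])<<24 | uint32(b[off+1])<<16 | uint32(b[off+2])<<8 | uint32(b[off+3]))
}
```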

Example

Here’s an example of select code points and how they are encoded into bytes using UTF-32:

| Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|
| U+61 | LATIN SMALL LETTER A | a | 0x00 | 0x00 | 0x00 | 0x61 |
| U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0x00 | 0x00 | 0x27 | 0xA4 |
| U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0x00 | 0x01 | 0x30 | 0xE6 |
| U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0x00 | 0x01 | 0xF6 | 0x0D |

UTF-16

UTF-16 is a more complex encoding, perhaps the most peculiar of all the Unicode encodings. It’s a variable-length encoding with 16-bit chunks: a given code point may be encoded using one or two chunks. As with the other encodings, it supports all Unicode code points, and it uses the following structure:

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 2 | 16 | U+0000 | U+D7FF | xxxxxxxx | xxxxxxxx | | |
| N/A | N/A | U+D800 | U+DFFF | N/A | N/A | N/A | N/A |
| 2 | 16 | U+E000 | U+FFFF | xxxxxxxx | xxxxxxxx | | |
| 4 | 20 | U+10000 | U+10FFFF | 110110xx | xxxxxxxx | 110111xx | xxxxxxxx |

The reason UTF-16 is more complex than other encodings lies in its tight coupling with the definition of the character set itself. At the beginning of this article, I said that Unicode’s greatest insight was the decoupling between characters and bytes: this is true today but was not always the case.

The reason behind the 17 planes

In its very first version, Unicode defined a single plane of 65,536 elements, and the UTF-16 encoding was a direct map from U numbers to bytes; this was a very simple architectural choice for implementors of the standard to follow.

When the BMP started running short of space and the need to grow beyond it became clear, Unicode was confronted with a choice: either break the existing 16-bit design and move to wider chunks, or find a way to address the new code points using the 16-bit chunks already in place.

The Unicode Consortium chose the latter option: out of the already packed BMP, the standard would reserve a range of 2,048 code points that would be guaranteed never to be allocated: U+D800 → U+DFFF. This space would instead be used to create an index for pointing at code points outside of the BMP.

UTF-16 needed to remain an encoding where each chunk takes 16 bits. Characters beyond the BMP would necessarily take two 16-bit chunks to be represented; UTF-16 calls such a pair of chunks a surrogate pair. This required further dividing the 2,048-code-point block into two blocks of 1,024 values each, U+D800 → U+DBFF and U+DC00 → U+DFFF: one for the high surrogate and one for the low surrogate.

Representing 1,024 values requires 10 significant bits, which means that a single 1,024-value block is able to map 2¹⁰ values. Since there are two blocks, their indexing powers multiply, bringing the total number of supplementary code points to 2¹⁰ × 2¹⁰ = 2²⁰.

In other words, sacrificing just over two thousand code points in the BMP opened up the space for 2²⁰ brand new code points. Since a plane contains 2¹⁶ code points, the total number of planes outside of the BMP is 2²⁰ ÷ 2¹⁶ = 2⁴ = 16, bringing the total count of planes to 17.

Even though the scheme was first published in 1996 in Unicode 2.0, it took until 2004 for implementors such as Sun to surface this new mechanism to developers in Java 1.5.

That’s the secret of UTF-16, and the reason it is tightly coupled to the Unicode standard itself: the U+D800 → U+DFFF range of code points is guaranteed by the standard never to be assigned, and its values are instead used for indexing characters outside the BMP, exclusively so that UTF-16 can represent them (neither UTF-32 nor UTF-8 has that problem). When this is needed, the two resulting 16-bit chunks are called a surrogate pair.

The UTF-16 scheme

Encoding a character in one of the non-BMP planes requires some bit fiddling (a code sketch follows the steps):

  1. Take the U value of the code point and treat it as a number, e.g. U+1F976 → 0x1F976
  2. Subtract 0x10000 (65536); since the maximum code point value is U+10FFFF (1114111), the remaining value falls in the range 0x0–0xFFFFF (0–1048575)
    1. Left-pad the value with 0s, up to 20 bits
    2. Take the high ten bits of that value (range 0x000–0x3FF, 0–1023) and add 0xD800 (55296) to form the high surrogate (range 0xD800–0xDBFF, 55296–56319)
    3. Take the low ten bits of that value (also in range 0x000–0x3FF) and add 0xDC00 (56320) to form the low surrogate (range 0xDC00–0xDFFF, 56320–57343)
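
Written out in code, the steps above look like the following minimal Go sketch (the function name is mine and no validation is performed; it only handles code points at U+10000 and above):

```go
// encodeSurrogatePair converts a supplementary code point (U+10000 and above)
// into its high and low surrogates. Sketch only: no validation is performed.
func encodeSurrogatePair(r rune) (high, low uint16) {
	v := uint32(r) - 0x10000           // 20 significant bits, range 0x00000–0xFFFFF
	high = uint16(0xD800 + (v >> 10))  // top 10 bits    -> 0xD800–0xDBFF
	low = uint16(0xDC00 + (v & 0x3FF)) // bottom 10 bits -> 0xDC00–0xDFFF
	return high, low
}
```

For U+1F60D this returns 0xD83D and 0xDE0D, which is where the byte values in the UTF-16 example table further down come from.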

Properties

UTF-16 has properties that make it a good general-purpose encoding. First, any code point in the BMP is encoded over 16 bits, a use case that covers most languages on Earth in a decently low number of bits. Second, despite the apparent complexity of the encoding rules, decoding a UTF-16 stream is relatively simple to implement in code: if a chunk is in the high surrogate range, consume another chunk and combine the two.
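
Here is that rule as a rough Go sketch; the helper is mine and approximates what Go’s utf16.Decode does, except that unpaired surrogates are not handled:

```go
// decodeUTF16 turns a slice of 16-bit chunks into code points.
// Sketch only: unpaired surrogates are passed through as-is, whereas a real
// decoder would reject them or substitute U+FFFD.
func decodeUTF16(units []uint16) []rune {
	var out []rune
	for i := 0; i < len(units); i++ {
		u := units[i]
		if u >= 0xD800 && u <= 0xDBFF && i+1 < len(units) {
			// High surrogate: consume the next chunk and combine the two.
			lo := units[i+1]
			out = append(out, 0x10000+(rune(u-0xD800)<<10)+rune(lo-0xDC00))
			i++
		} else {
			out = append(out, rune(u))
		}
	}
	return out
}
```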

Examples

Below is an example of the same code points being encoded to UTF-16¹:

| Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|
| U+61 | LATIN SMALL LETTER A | a | 0x00 | 0x61 | | |
| U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0x27 | 0xA4 | | |
| U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0xD8 | 0x0C | 0xDC | 0xE6 |
| U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0xD8 | 0x3D | 0xDE | 0x0D |

UTF-8

UTF-8 is by far the most popular of the Unicode encodings and the one most developers are familiar with. Like UTF-16, it is a variable-length encoding, but its chunks are 1 byte (8 bits) each. Just like the other Unicode encodings, any code point may be encoded and decoded using UTF-8. The scheme divides the set of code points into four contiguous ranges and represents them over 1, 2, 3 or 4 bytes:

| Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx (0xC0) | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx (0xE0) | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx (0xF0) | 10xxxxxx | 10xxxxxx | 10xxxxxx |

Encoding a code point is a relatively simple task, especially compared to UTF-16. Let’s take U+801 SAMARITAN LETTER BIT (ࠁ) as an example (a code sketch follows the steps):

  1. From the code point, determine the number of bytes needed to encode it using the ranges in the table above (3 bytes in this case)
  2. Take the U number of the code point and turn it into binary: U+801 → 0x801 = 0b100000000001
  3. Pad left with 0s to reach the target number of payload bits, which is 16 in our case: 0b100000000001 → 0b0000100000000001
  4. Fill in the blanks starting from the left:
    1. The first byte will take the form 0b1110xxxx → 0b11100000 (0xE0)
    2. The second byte will take the form 0b10xxxxxx → 0b10100000 (0xA0)
    3. The third byte will take the form 0b10xxxxxx → 0b10000001 (0x81)
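
The same branching, generalized to all four ranges from the table above, can be sketched in Go as follows (a hypothetical function of mine; unlike the standard library’s utf8.EncodeRune, it does not reject surrogates or out-of-range values):

```go
// encodeUTF8 writes a code point as 1 to 4 UTF-8 bytes. Sketch only.
func encodeUTF8(r rune) []byte {
	switch {
	case r <= 0x7F: // 1 byte, 7 payload bits
		return []byte{byte(r)}
	case r <= 0x7FF: // 2 bytes, 11 payload bits
		return []byte{0xC0 | byte(r>>6), 0x80 | byte(r&0x3F)}
	case r <= 0xFFFF: // 3 bytes, 16 payload bits
		return []byte{0xE0 | byte(r>>12), 0x80 | byte(r>>6&0x3F), 0x80 | byte(r&0x3F)}
	default: // 4 bytes, 21 payload bits
		return []byte{0xF0 | byte(r>>18), 0x80 | byte(r>>12&0x3F), 0x80 | byte(r>>6&0x3F), 0x80 | byte(r&0x3F)}
	}
}
```

Calling it with 0x801 yields 0xE0, 0xA0, 0x81, the same bytes derived by hand above.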

Decoding UTF-8 is also relatively simple, as the first byte of a sequence tells you how many bytes participate in the encoding of the code point.
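
Classifying that first byte can be sketched as follows (a hypothetical helper; continuation bytes and invalid leading bytes both return 0 here):

```go
// utf8SequenceLength reports how many bytes the UTF-8 sequence starting with
// lead occupies, based solely on the leading byte's bit pattern.
func utf8SequenceLength(lead byte) int {
	switch {
	case lead&0x80 == 0x00: // 0xxxxxxx
		return 1
	case lead&0xE0 == 0xC0: // 110xxxxx
		return 2
	case lead&0xF0 == 0xE0: // 1110xxxx
		return 3
	case lead&0xF8 == 0xF0: // 11110xxx
		return 4
	default: // 10xxxxxx (continuation) or invalid
		return 0
	}
}
```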

As of 2020, UTF-8 is by far the most popular encoding for text. There are a few reasons for this, but the most obvious one is the encoding’s backwards compatibility with ASCII. ASCII is a Latin-centric codec that was incredibly popular in the early days of computing and was still in use through the 2010s. To this day, it is still fairly common for programs to special-case characters that fall outside the set of characters defined in ASCII.

The genius of UTF-8 is that any valid ASCII content is naturally valid UTF-8 content. In other words, a file written as ASCII text when the standard was initially published in 1963 (almost 60 years ago!) can be read by a program interpreting it as UTF-8 in 2020.

On the Web, this encoding went from being used by ~5% of pages in 2006 to just under 65% in 2012 and 95% in 2020. UTF-8 benefited from being the recommended encoding for HTML5 documents, JSON, and a variety of other formats and standards. It is also the internal encoding for strings in the Go and Rust programming languages.

The following table shows the encoded values of the same code points used in the previous examples:

| Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|
| U+61 | LATIN SMALL LETTER A | a | 0x61 | | | |
| U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0xE2 | 0x9E | 0xA4 | |
| U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0xF0 | 0x93 | 0x83 | 0xA6 |
| U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0xF0 | 0x9F | 0x98 | 0x8D |

A note on byte-order marks

Whenever a data serialization format encodes data over multiple bytes, the question of endianness must be addressed. A UTF-16 implementation that naively relies on the native byte order of the host machine might produce different output than a machine using the opposite endianness. This is an issue for both UTF-32 and UTF-16.

To that effect, the Unicode standard gives one code point a special status: U+FEFF ZERO WIDTH NO-BREAK SPACE. This code point, which is called a byte-order mark (BOM), is expected to be present as the first character of byte strings that encode Unicode text over multi-byte chunks. Because each encoding and byte order turns U+FEFF into a distinct byte sequence, a decoder can inspect those leading bytes to determine how to read the rest of the stream.

This is a step that is generally handled transparently by implementations and that you shouldn’t have to pay attention to.
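
For the curious, here is roughly what that detection amounts to. The helper below is a sketch of mine, but the byte sequences it looks for are simply U+FEFF encoded under each transformation format and byte order:

```go
// detectBOM inspects the first bytes of a buffer and reports which, if any,
// byte-order mark it starts with. The UTF-32 checks must come first, since a
// UTF-32 LE BOM begins with the same two bytes as a UTF-16 LE BOM.
func detectBOM(b []byte) string {
	switch {
	case len(b) >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF:
		return "UTF-32, big-endian"
	case len(b) >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00:
		return "UTF-32, little-endian"
	case len(b) >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF:
		return "UTF-8" // legal, though UTF-8 has no byte-order ambiguity to resolve
	case len(b) >= 2 && b[0] == 0xFE && b[1] == 0xFF:
		return "UTF-16, big-endian"
	case len(b) >= 2 && b[0] == 0xFF && b[1] == 0xFE:
		return "UTF-16, little-endian"
	default:
		return "no BOM"
	}
}
```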


  1. The byte-order mark was omitted from these examples; please refer to the note on byte-order marks later in this article to understand their purpose. ↩︎