The Unicode Encodings
This article is the second in a series intended to demystify the Unicode standard and explore in some depth the aspects that programmers have issues with. It is recommended that you follow the articles in sequence, as the concepts in later articles build upon the ones explained in earlier ones.
- I ❤ Unicode (Intro)
- The Character Set
- The Encodings (This article)
- The Algorithms
Perhaps Unicode’s greatest insight is the decoupling of characters from bytes. Before Unicode, most systems used for representing strings equated the bytes read from, e.g., a file, to the characters in memory. It made systems easy to understand but inherently limited.
For instance, the C programming language defines a very minimal character set, containing only basic Latin characters, and defines the built-in char type as follows:
If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.
When it came down to representing those characters in memory and writing them to a file, all C implementations settled on ASCII. The ASCII character 'a' was mapped to 0x61, and 0x61 was also the byte value of the character 'a' when declared in C. That is not the case in Unicode: the lowercase Latin letter ‘a’ is defined as U+0061. That’s a subtle difference but a critical one: the U+ notation is how you know the value refers to a Unicode code point and not to bytes.
Encodings sit between bytes and code points. They consist of rules for translating code points into bytes and bytes into code points. This implies that you cannot just take bytes and turn them into a meaningful Unicode string of text without knowing which encoding was used to write those bytes.
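To make that concrete, here is a tiny Python illustration (a throwaway sketch, not tooling from this series): the exact same two bytes are one character under one encoding and two characters under another.

```python
# The same bytes yield different text depending on which encoding you assume.
data = bytes([0xC3, 0xA9])

print(data.decode("utf-8"))    # 'é'  (one code point: U+00E9)
print(data.decode("latin-1"))  # 'Ã©' (two code points: U+00C3, U+00A9)
```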
Unicode defines several encodings, and all of them are designed to support the full range of possible code points across all planes. Encodings are designated by the “UTF” prefix, which stands for “Unicode Transformation Format”.
UTF-32
UTF-32 is the simplest and most straightforward of all Unicode encodings. It’s a fixed-length encoding where each code point is encoded over 32 bits. The U+ value of a code point is directly converted into bytes with the same value:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
4 | 21 | U+0000 | U+10FFFF | 00000000 (0x00) | 000xxxxx | xxxxxxxx | xxxxxxxx |
Because there are “only” just over one million possible Unicode code points while 32 bits can represent about 4 billion values, 11 bits are always set to 0. This makes UTF-32 an extremely wasteful encoding.
One useful purpose for it is to trivially index a code point in a string of bytes: the code point at index n is simply the nth 32-bit chunk in your string. But there’s little practical use to this: Unicode code points may combine with adjacent ones to form the actual grapheme, which is the unit of text that is truly meaningful to end users.
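As a rough sketch of both properties (the function names here are mine, not a standard API), UTF-32 encoding and code point indexing fit in a few lines of Python:

```python
# A minimal sketch of UTF-32BE: fixed-width chunks make indexing trivial.
def utf32_encode(code_points):
    # Each code point becomes exactly four big-endian bytes (no BOM here).
    return b"".join(cp.to_bytes(4, "big") for cp in code_points)

def utf32_code_point_at(data, n):
    # The code point at index n is simply the nth 32-bit chunk.
    return int.from_bytes(data[4 * n : 4 * n + 4], "big")

encoded = utf32_encode([0x61, 0x1F60D])
assert encoded == bytes([0x00, 0x00, 0x00, 0x61, 0x00, 0x01, 0xF6, 0x0D])
assert utf32_code_point_at(encoded, 1) == 0x1F60D
```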
Example
Here’s an example of select code points and how they are encoded into bytes using UTF-32:
Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|
U+61 | LATIN SMALL LETTER A | a | 0x00 | 0x00 | 0x00 | 0x61 |
U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0x00 | 0x00 | 0x27 | 0xA4 |
U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0x00 | 0x01 | 0x30 | 0xE6 |
U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0x00 | 0x01 | 0xF6 | 0x0D |
UTF-16
UTF-16 is a more complex encoding, perhaps the most peculiar of all Unicode encodings. It’s a variable-length encoding with 16-bit chunks: a given code point may be encoded using one or two chunks. As with the other encodings, it supports all Unicode code points and uses the following structure:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
2 | 16 | U+0000 | U+D7FF | xxxxxxxx | xxxxxxxx | | |
N/A | N/A | U+D800 | U+DFFF | N/A | N/A | N/A | N/A |
2 | 16 | U+E000 | U+FFFF | xxxxxxxx | xxxxxxxx | | |
4 | 20 | U+10000 | U+10FFFF | 110110xx | xxxxxxxx | 110111xx | xxxxxxxx |
The reason UTF-16 is more complex than other encodings lies in its tight coupling with the definition of the character set itself. At the beginning of this article, I said that Unicode’s greatest insight was the decoupling between characters and bytes: this is true today but was not always the case.
The reason behind the 17 planes
In its very first version, Unicode defined a single plane of 65,536 elements, and the UTF-16 encoding was a direct map from U numbers to bytes, which was a very simple architectural choice for implementors of the standard to follow.
When the BMP started running short of space and the need to grow beyond it became clear, Unicode was confronted with a choice:
- Deprecate UTF-16 and define a new encoding that would cover the full breadth of the new, larger space. This had the downside of asking implementors to roll out and adopt a brand new encoding.
- Define a scheme that would make the encoding more complex but allow backwards compatibility with existing implementations.
The Unicode Consortium chose the latter option: out of the already packed BMP, the standard would reserve a range of 2,048 code points that would be guaranteed to never be allocated: U+D800–U+DFFF. This space would instead be used to create an index for pointing at code points outside of the BMP.
UTF-16 needed to remain an encoding where chunks take 16 bits each, so characters beyond the BMP would necessarily take two 16-bit chunks to be represented; UTF-16 calls such a pair of chunks a surrogate pair. This required further dividing the 2,048-value block into 2 blocks of 1,024 values each, U+D800–U+DBFF and U+DC00–U+DFFF: one for the high surrogate and one for the low surrogate.
1,024 values require 10 bits to represent, which means that a single 1,024-value block is able to map to 2¹⁰ values. Since there are 2 blocks, their indexing powers are multiplied, bringing the total number of supplementary code points to 2¹⁰ × 2¹⁰ = 2²⁰.
In other words, sacrificing just over two thousand code points in the BMP opened up the space for 2²⁰ brand new code points. Since a plane contains 2¹⁶ code points, the total number of planes outside of the BMP is 2²⁰ ÷ 2¹⁶ = 2⁴ = 16, bringing the total count of planes to 17.
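For the skeptical reader, the arithmetic is easy to check with a throwaway snippet (not part of any library, just the numbers above):

```python
# Sanity-checking the surrogate arithmetic described above.
supplementary = 2**10 * 2**10        # code points reachable via a surrogate pair
assert supplementary == 2**20        # 1,048,576 supplementary code points
assert supplementary // 2**16 == 16  # 16 extra planes, plus the BMP, makes 17
```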
Even though the scheme was first published in 1996 in Unicode 2.0, it took until 2004 for implementors such as Sun to surface this new mechanism to developers in Java 1.5.
That’s the secret of UTF-16, and the reason it is tightly coupled to the Unicode standard itself: the U+D800–U+DFFF range of code points is guaranteed by the standard to never be assigned, and its values are instead used for indexing characters outside the BMP, exclusively for the purpose of allowing UTF-16 to do so (neither UTF-32 nor UTF-8 has that problem). When this is needed, the two resulting 16-bit chunks are called a surrogate pair.
The UTF-16 scheme
Encoding a character in the non-BMP planes requires some bit fiddling (a code sketch follows these steps):
- Take the U value of the code point and map it to a numeric value, e.g. U+1F976 → 0x1F976
- Subtract 0x10000 (65536); since the maximum code point value is U+10FFFF (1114111), the remaining value is in the range 0x0–0xFFFFF (0–1048575)
- Left-pad the value with 0s, up to 20 bits
- Take the high ten bits of that value (range 0x000–0x3FF, 0–1023) and add 0xD800 (55296) to form the high surrogate (range 0xD800–0xDBFF, 55296–56319)
- Take the low ten bits of that value (also in range 0x000–0x3FF) and add 0xDC00 (56320) to form the low surrogate (range 0xDC00–0xDFFF, 56320–57343)
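Here is what those steps look like in Python. This is a sketch under the assumption of big-endian output and well-formed input; the function name is mine, and a real encoder would also reject code points in the surrogate range itself.

```python
# A sketch of UTF-16BE encoding for a single code point.
def utf16_encode_code_point(cp):
    if cp <= 0xFFFF:
        # BMP code point: a single 16-bit chunk.
        return cp.to_bytes(2, "big")
    # Supplementary code point: build a surrogate pair.
    v = cp - 0x10000               # now in range 0x0..0xFFFFF (20 bits)
    high = 0xD800 + (v >> 10)      # high ten bits -> 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)     # low ten bits  -> 0xDC00..0xDFFF
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")

assert utf16_encode_code_point(0x1F60D) == bytes([0xD8, 0x3D, 0xDE, 0x0D])
```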
Properties
UTF-16 has properties that make it a good general-purpose encoding. First, any code point in the BMP is encoded over 16 bits: this covers most languages on Earth in a decently low number of bits. Second, despite the apparent complexity of the encoding rules, decoding a UTF-16 stream is relatively simple to implement in code: if a chunk is in the high surrogate range, consume another chunk and combine the two.
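A decoding sketch, assuming big-endian chunks and well-formed input (no error handling, no BOM detection), shows how little work the surrogate logic actually takes:

```python
# A sketch of UTF-16BE decoding: combine a high surrogate with the next chunk.
def utf16_decode(data):
    code_points, i = [], 0
    while i < len(data):
        chunk = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if 0xD800 <= chunk <= 0xDBFF:
            # High surrogate: consume the low surrogate and combine the two.
            low = int.from_bytes(data[i:i + 2], "big")
            i += 2
            chunk = 0x10000 + ((chunk - 0xD800) << 10) + (low - 0xDC00)
        code_points.append(chunk)
    return code_points

assert utf16_decode(bytes([0x00, 0x61, 0xD8, 0x3D, 0xDE, 0x0D])) == [0x61, 0x1F60D]
```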
Examples
Below is an example of the same code points being encoded to UTF-16¹:
Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|
U+61 | LATIN SMALL LETTER A | a | 0x00 | 0x61 | | |
U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0x27 | 0xA4 | | |
U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0xD8 | 0x0C | 0xDC | 0xE6 |
U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0xD8 | 0x3D | 0xDE | 0x0D |
UTF-8
UTF-8 is by far the most popular of the Unicode encodings and the one most developers are familiar with. Like UTF-16, it is a variable-length encoding, where code points are encoded using chunks of 1 byte (8 bits). Just like the other Unicode encodings, any code point may be encoded and decoded using UTF-8. The scheme is as follows: UTF-8 divides the set of code points into four contiguous ranges and represents them over 1, 2, 3 or 4 bytes:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
2 | 11 | U+0080 | U+07FF | 110xxxxx (0xC0) | 10xxxxxx | | |
3 | 16 | U+0800 | U+FFFF | 1110xxxx (0xE0) | 10xxxxxx | 10xxxxxx | |
4 | 21 | U+10000 | U+10FFFF | 11110xxx (0xF0) | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Encoding a code point is a relatively simple task, especially compared to UTF-16. Let’s take U+801 SAMARITAN LETTER BIT (ࠁ) as an example (a code sketch follows these steps):
- From the code point, determine the number of bytes needed to encode it, using the ranges in the table above (3 bytes in this case)
- Take the U number of the code point and turn it into binary: U+801 → 0x801 = 0b100000000001
- Left-pad with 0s to reach the target number of code point bits, which is 16 in our case: 0b100000000001 → 0b0000100000000001
- Fill in the blanks starting from the left:
  - The first byte will take the form 0b1110xxxx → 0b11100000 (0xE0)
  - The second byte will take the form 0b10xxxxxx → 0b10100000 (0xA0)
  - The third byte will take the form 0b10xxxxxx → 0b10000001 (0x81)
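The same steps, generalized to all four ranges, fit in a short Python sketch (the function name is mine; in practice you would simply call str.encode("utf-8")):

```python
# A sketch of UTF-8 encoding for a single code point
# (no validation of surrogates or out-of-range values).
def utf8_encode_code_point(cp):
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode_code_point(0x801) == bytes([0xE0, 0xA0, 0x81])
```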
Decoding UTF-8 is also relatively simple, as the first byte tells you how many bytes are expected to participate in the encoding of the code point.
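As a small illustration of that point (again a sketch, with a made-up function name), the leading byte alone is enough to determine the length of the sequence:

```python
# Determine how many bytes a UTF-8 sequence spans from its leading byte.
def utf8_sequence_length(first_byte):
    if first_byte < 0x80:
        return 1                     # 0xxxxxxx
    if first_byte >> 5 == 0b110:
        return 2                     # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3                     # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4                     # 11110xxx
    raise ValueError("continuation byte or invalid leading byte")

assert [utf8_sequence_length(b) for b in (0x61, 0xC3, 0xE2, 0xF0)] == [1, 2, 3, 4]
```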
As of 2020, UTF-8 is by far the most popular encoding for text. There are a few reasons for this, but the obvious one is the encoding’s backwards compatibility with ASCII. ASCII is a Latin-centric codec which was incredibly popular in the early days of computing and was still in use through the 2010s. To this day, it is fairly common for programs to special-case characters that fall outside the set of characters defined in ASCII.
The genius of UTF-8 is that any valid ASCII content is naturally valid UTF-8 content. In other words, a file written as ASCII text when the standard was initially published in 1963 (almost 60 years ago!) can be read by a program interpreting it as UTF-8 in 2020.
On the Web, this encoding went from being used by ~5% of pages in 2006 to just under 65% in 2012 and 95% in 2020. UTF-8 benefited from being the recommended encoding for HTML5 documents, JSON, and a variety of other formats and standards. It is also the internal encoding for strings in the Go and Rust programming languages.
The following table shows the encoded values of the same code points used in the previous examples:
Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|
U+61 | LATIN SMALL LETTER A | a | 0x61 | | | |
U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0xE2 | 0x9E | 0xA4 | |
U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0xF0 | 0x93 | 0x83 | 0xA6 |
U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0xF0 | 0x9F | 0x98 | 0x8D |
A note on byte-order marks
Whenever a data serialization format encodes data over multiple bytes, the question of endianness must be addressed. A UTF-16 implementation that naively relies on the native byte order of the host machine might produce output different from that of a machine using the opposite endianness. This can be an issue for UTF-32 and UTF-16.
To that effect, the Unicode standard defines a code point that has a special status: U+FEFF ZERO WIDTH NO-BREAK SPACE. This code point, which is called a byte-order mark (BOM), is expected to be present as the first code point of byte strings that encode Unicode strings over multi-byte chunks:
- Upon encoding a string, encoders would prefix it with the BOM and then run the normal encoding logic in their native endianness
- When reading bytes that are expected to encode code points over multiple bytes, decoders would peek at the first chunk of bytes (a small sketch follows this list):
  - If the first two bytes are 0xFE 0xFF, the rest of the bytes should be decoded as big-endian
  - If the first two bytes are 0xFF 0xFE, the rest of the bytes should be decoded as little-endian
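Here is a rough sketch of that sniffing step for UTF-16 (the function name is mine; real decoders also fall back to a default byte order, or to external information, when no BOM is present):

```python
# Detect the byte order of a UTF-16 stream from its BOM, if any.
def split_utf16_bom(data):
    if data[:2] == b"\xFE\xFF":
        return "big", data[2:]      # UTF-16BE
    if data[:2] == b"\xFF\xFE":
        return "little", data[2:]   # UTF-16LE
    return None, data               # no BOM: byte order must come from elsewhere

order, payload = split_utf16_bom(bytes([0xFF, 0xFE, 0x61, 0x00]))
assert order == "little" and payload.decode("utf-16-le") == "a"
```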
This is a step that is generally transparently handled by implementations and that you shouldn’t have to pay attention to:
- UTF-32 would normally require a BOM (encoded over 4 bytes, i.e. the byte sequence 0x00 0x00 0xFE 0xFF or 0xFF 0xFE 0x00 0x00 depending on byte order), but since the encoding is practically never used, implementations are inconsistent and may omit it.
- UTF-16 remains a relatively common encoding, and most encoders automatically prefix the BOM to the bytes they produce. This was the case for the Java code used to generate the encoding table above: I chose to omit the BOM from the output, but it was actually present.
- UTF-8 operates at the byte level, and therefore doesn’t need the BOM for either encoding or decoding. It occasionally happens that UTF-16 content re-encoded into UTF-8 embeds a BOM, but implementations are expected to ignore it.
1. The byte-order mark was omitted from these examples; please refer to the paragraph on byte-order marks later in this article to understand its purpose. ↩︎