The Unicode Encodings
This article is the second in a series intended to demystify the Unicode standard and explore in some depth the aspects that programmers have issues with. It is recommended that you follow the articles in sequence, as the concepts in later articles build upon the ones explained in earlier ones.
- I ❤ Unicode (Intro)
- The Character Set
- The Encodings (This article)
- The Algorithms
Perhaps Unicode’s greatest insight is the decoupling of characters from bytes. Before Unicode, most systems used for representing strings equated the bytes read from, e.g., a file, to the characters in memory. It made systems easy to understand but inherently limited.
For instance, the C programming language defines a very minimal character set, containing only basic Latin characters, and defines the built-in char type as follows:
If a member of the basic execution character set is stored in a char object, its value is guaranteed to be nonnegative. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.
When it came down to representing those characters in memory and writing them to a file, all C implementations settled on ASCII. The ASCII character 'a' was mapped to 0x61, and 0x61 was also the byte value of the character 'a' when declared in C. That is not the case in Unicode: the lowercase Latin letter ‘a’ is defined as U+0061. That’s a subtle difference but a critical one: the U+ notation is how you know the value refers to a Unicode code point and not to bytes.
Encodings sit between bytes and code points. They consist of rules for translating code points into bytes and bytes into code points. This implies that you cannot just take bytes and turn them into a meaningful Unicode string of text without knowing which encoding was used to write those bytes.
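To make that concrete, here is a tiny Python illustration (a throwaway sketch, not tooling from this series): the exact same two bytes are one character under one encoding and two characters under another.

```python
# The same bytes yield different text depending on which encoding you assume.
data = bytes([0xC3, 0xA9])

print(data.decode("utf-8"))    # 'é'  (one code point: U+00E9)
print(data.decode("latin-1"))  # 'Ã©' (two code points: U+00C3, U+00A9)
```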
Unicode defines several encodings, and all of them are designed to support the full range of possible code points across all planes. Encodings are designated by the “UTF” prefix, which stands for “Unicode Transformation Format”.
UTF-32
UTF-32 is the simplest and most straightforward of all Unicode encodings. It’s a fixed-length encoding where each code point is encoded over 32 bits. The U+ value of a code point is directly converted into bytes with the same value:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
4 | 21 | U+0000 | U+10FFFF | 00000000 (0x00) | 000xxxxx | xxxxxxxx | xxxxxxxx |
Because there are “only” just over one million possible Unicode code points while 32 bits can represent about 4 billion values, 11 bits are always set to 0. This makes UTF-32 an extremely wasteful encoding.
One useful purpose for it is to trivially index a code point in a string of bytes: the code point at index n is simply the nth 32-bit chunk in your string. But there’s little practical use to this: Unicode code points may combine with adjacent ones to form the actual grapheme, which is the unit of text that is truly meaningful to end users.
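As a rough sketch of both properties (the function names here are mine, not a standard API), UTF-32 encoding and code point indexing fit in a few lines of Python:

```python
# A minimal sketch of UTF-32BE: fixed-width chunks make indexing trivial.
def utf32_encode(code_points):
    # Each code point becomes exactly four big-endian bytes (no BOM here).
    return b"".join(cp.to_bytes(4, "big") for cp in code_points)

def utf32_code_point_at(data, n):
    # The code point at index n is simply the nth 32-bit chunk.
    return int.from_bytes(data[4 * n : 4 * n + 4], "big")

encoded = utf32_encode([0x61, 0x1F60D])
assert encoded == bytes([0x00, 0x00, 0x00, 0x61, 0x00, 0x01, 0xF6, 0x0D])
assert utf32_code_point_at(encoded, 1) == 0x1F60D
```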
Example
Here’s an example of select code points and how they are encoded into bytes using UTF-32:
Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|
U+61 | LATIN SMALL LETTER A | a | 0x00 | 0x00 | 0x00 | 0x61 |
U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0x00 | 0x00 | 0x27 | 0xA4 |
U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0x00 | 0x01 | 0x30 | 0xE6 |
U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0x00 | 0x01 | 0xF6 | 0x0D |
UTF-16
UTF-16 is a more complex encoding, perhaps the most peculiar of all Unicode encodings. It’s a variable-length encoding with 16-bit chunks: a given code point may be encoded using one or two chunks. As with the other encodings, it supports all Unicode code points and uses the following structure:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
2 | 16 | U+0000 | U+D7FF | xxxxxxxx | xxxxxxxx | | |
N/A | N/A | U+D800 | U+DFFF | N/A | N/A | N/A | N/A |
2 | 16 | U+E000 | U+FFFF | xxxxxxxx | xxxxxxxx | | |
4 | 20 | U+10000 | U+10FFFF | 110110xx | xxxxxxxx | 110111xx | xxxxxxxx |
The reason UTF-16 is more complex than other encodings lies in its tight coupling with the definition of the character set itself. At the beginning of this article, I said that Unicode’s greatest insight was the decoupling between characters and bytes: this is true today but was not always the case.
The reason behind the 17 planes
In its very first version, Unicode defined a single plane of 65,536 elements, and the UTF-16 encoding was a direct map from U numbers to bytes, which was a very simple architectural choice for implementors of the standard to follow.
When the BMP started running short of space and the need to grow beyond it became clear, Unicode was confronted with a choice:
- Deprecate UTF-16 and define a new encoding that would cover the full breadth of the new, larger space. This had the downside of asking implementors to roll out and adopt a brand new encoding.
- Define a scheme that would make the encoding more complex but allow backwards compatibility with existing implementations.
The Unicode Consortium chose the latter option: out of the already packed BMP, the standard would reserve a range of 2,048 code points that would be guaranteed to never be allocated: U+D800–U+DFFF. This space would instead be used to create an index for pointing at code points outside of the BMP.
UTF-16 needed to remain an encoding where chunks take 16 bits each, so characters beyond the BMP would necessarily take two 16-bit chunks to be represented; UTF-16 calls such a pair of chunks a surrogate pair. This required further dividing the 2,048-value block into 2 blocks of 1,024 values each, U+D800–U+DBFF and U+DC00–U+DFFF: one for the high surrogate and one for the low surrogate.
1,024 values require 10 bits to represent, which means that a single 1,024-value block is able to map to 2¹⁰ values. Since there are 2 blocks, their indexing powers are multiplied, bringing the total number of supplementary code points to 2¹⁰ × 2¹⁰ = 2²⁰.
In other words, sacrificing just over two thousand code points in the BMP opened up the space for 2²⁰ brand new code points. Since a plane contains 2¹⁶ code points, the total number of planes outside of the BMP is 2²⁰ ÷ 2¹⁶ = 2⁴ = 16, bringing the total count of planes to 17.
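For the skeptical reader, the arithmetic is easy to check with a throwaway snippet (not part of any library, just the numbers above):

```python
# Sanity-checking the surrogate arithmetic described above.
supplementary = 2**10 * 2**10        # code points reachable via a surrogate pair
assert supplementary == 2**20        # 1,048,576 supplementary code points
assert supplementary // 2**16 == 16  # 16 extra planes, plus the BMP, makes 17
```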
Even though the scheme was first published in 1996 in Unicode 2.0, it took until 2004 for implementors such as Sun to surface this new mechanism to developers in Java 1.5.
That’s the secret of UTF-16, and the reason it is tightly coupled to the Unicode standard itself: the U+D800–U+DFFF range of code points is guaranteed by the standard to never be assigned, and its values are instead used for indexing characters outside the BMP, exclusively for the purpose of allowing UTF-16 to do so (neither UTF-32 nor UTF-8 has that problem). When this is needed, the two resulting 16-bit chunks are called a surrogate pair.
The UTF-16 scheme
Encoding a character in the non-BMP planes requires some bit fiddling (a code sketch follows these steps):
- Take the U value of the code point and map it to a numeric value, e.g. U+1F976 → 0x1F976
- Subtract 0x10000 (65536); since the maximum code point value is U+10FFFF (1114111), the remaining value is in the range 0x0–0xFFFFF (0–1048575)
- Left-pad the value with 0s, up to 20 bits
- Take the high ten bits of that value (range 0x000–0x3FF, 0–1023) and add 0xD800 (55296) to form the high surrogate (range 0xD800–0xDBFF, 55296–56319)
- Take the low ten bits of that value (also in range 0x000–0x3FF) and add 0xDC00 (56320) to form the low surrogate (range 0xDC00–0xDFFF, 56320–57343)
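Here is what those steps look like in Python. This is a sketch under the assumption of big-endian output and well-formed input; the function name is mine, and a real encoder would also reject code points in the surrogate range itself.

```python
# A sketch of UTF-16BE encoding for a single code point.
def utf16_encode_code_point(cp):
    if cp <= 0xFFFF:
        # BMP code point: a single 16-bit chunk.
        return cp.to_bytes(2, "big")
    # Supplementary code point: build a surrogate pair.
    v = cp - 0x10000               # now in range 0x0..0xFFFFF (20 bits)
    high = 0xD800 + (v >> 10)      # high ten bits -> 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)     # low ten bits  -> 0xDC00..0xDFFF
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")

assert utf16_encode_code_point(0x1F60D) == bytes([0xD8, 0x3D, 0xDE, 0x0D])
```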
Properties
UTF-16 has properties that make it a good general-purpose encoding. First, any code point in the BMP is encoded over 16 bits: this covers most languages on Earth in a decently low number of bits. Second, despite the apparent complexity of the encoding rules, decoding a UTF-16 stream is relatively simple to implement in code: if a chunk is in the high surrogate range, consume another chunk and combine the two.
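A decoding sketch, assuming big-endian chunks and well-formed input (no error handling, no BOM detection), shows how little work the surrogate logic actually takes:

```python
# A sketch of UTF-16BE decoding: combine a high surrogate with the next chunk.
def utf16_decode(data):
    code_points, i = [], 0
    while i < len(data):
        chunk = int.from_bytes(data[i:i + 2], "big")
        i += 2
        if 0xD800 <= chunk <= 0xDBFF:
            # High surrogate: consume the low surrogate and combine the two.
            low = int.from_bytes(data[i:i + 2], "big")
            i += 2
            chunk = 0x10000 + ((chunk - 0xD800) << 10) + (low - 0xDC00)
        code_points.append(chunk)
    return code_points

assert utf16_decode(bytes([0x00, 0x61, 0xD8, 0x3D, 0xDE, 0x0D])) == [0x61, 0x1F60D]
```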
Examples
Below is an example of the same code points being encoded to UTF-16¹:
Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|
U+61 | LATIN SMALL LETTER A | a | 0x00 | 0x61 | | |
U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0x27 | 0xA4 | | |
U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0xD8 | 0x0C | 0xDC | 0xE6 |
U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0xD8 | 0x3D | 0xDE | 0x0D |
UTF-8
UTF-8 is by far the most popular of the Unicode encodings and the one most developers are familiar with. Like UTF-16, it is a variable-length encoding, where code points are encoded using chunks of 1 byte (8 bits). Just like the other Unicode encodings, any code point may be encoded and decoded using UTF-8. The scheme is as follows: UTF-8 divides the set of code points into four contiguous ranges and represents them over 1, 2, 3 or 4 bytes:
Number of bytes | Bits for code point | First code point | Last code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|---|
1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
2 | 11 | U+0080 | U+07FF | 110xxxxx (0xC0) | 10xxxxxx | | |
3 | 16 | U+0800 | U+FFFF | 1110xxxx (0xE0) | 10xxxxxx | 10xxxxxx | |
4 | 21 | U+10000 | U+10FFFF | 11110xxx (0xF0) | 10xxxxxx | 10xxxxxx | 10xxxxxx |
Encoding a code point is a relatively simple task, especially compared to UTF-16. Let’s take U+801 SAMARITAN LETTER BIT (ࠁ) as an example (a code sketch follows these steps):
- From the code point, determine the number of bytes needed to encode it, using the ranges in the table above (3 bytes in this case)
- Take the U number of the code point and turn it into binary: U+801 → 0x801 = 0b100000000001
- Left-pad with 0s to reach the target number of code point bits, which is 16 in our case: 0b100000000001 → 0b0000100000000001
- Fill in the blanks starting from the left:
  - The first byte will take the form 0b1110xxxx → 0b11100000 (0xE0)
  - The second byte will take the form 0b10xxxxxx → 0b10100000 (0xA0)
  - The third byte will take the form 0b10xxxxxx → 0b10000001 (0x81)
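The same steps, generalized to all four ranges, fit in a short Python sketch (the function name is mine; in practice you would simply call str.encode("utf-8")):

```python
# A sketch of UTF-8 encoding for a single code point
# (no validation of surrogates or out-of-range values).
def utf8_encode_code_point(cp):
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

assert utf8_encode_code_point(0x801) == bytes([0xE0, 0xA0, 0x81])
```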
Decoding UTF-8 is also relatively simple, as the first byte tells you how many bytes are expected to participate in the encoding of the code point.
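As a small illustration of that point (again a sketch, with a made-up function name), the leading byte alone is enough to determine the length of the sequence:

```python
# Determine how many bytes a UTF-8 sequence spans from its leading byte.
def utf8_sequence_length(first_byte):
    if first_byte < 0x80:
        return 1                     # 0xxxxxxx
    if first_byte >> 5 == 0b110:
        return 2                     # 110xxxxx
    if first_byte >> 4 == 0b1110:
        return 3                     # 1110xxxx
    if first_byte >> 3 == 0b11110:
        return 4                     # 11110xxx
    raise ValueError("continuation byte or invalid leading byte")

assert [utf8_sequence_length(b) for b in (0x61, 0xC3, 0xE2, 0xF0)] == [1, 2, 3, 4]
```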
As of 2020, UTF-8 is by far the most popular encoding for text. There are a few reasons for this, but the obvious one is the encoding’s backwards compatibility with ASCII. ASCII is a Latin-centric codec which was incredibly popular in the early days of computing and was still in use through the 2010s. To this day, it is fairly common for programs to special-case characters that fall outside the set of characters defined in ASCII.
The genius of UTF-8 is that any valid ASCII content is naturally valid UTF-8 content. In other words, a file written as ASCII text when the standard was initially published in 1963 (almost 60 years ago!) can be read by a program interpreting it as UTF-8 in 2020.
On the Web, this encoding went from being used by ~5% of pages in 2006 to just under 65% in 2012 and 95% in 2020. UTF-8 benefited from being the recommended encoding for HTML5 documents, JSON, and a variety of other formats and standards. It is also the internal encoding for strings in the Go and Rust programming languages.
The following table shows the encoded values of the same code points used in the previous examples:
Code point | Name | Glyph | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
---|---|---|---|---|---|---|
U+61 | LATIN SMALL LETTER A | a | 0x61 | | | |
U+27A4 | BLACK RIGHTWARDS ARROWHEAD | ➤ | 0xE2 | 0x9E | 0xA4 | |
U+130E6 | EGYPTIAN HIEROGLYPH E017A | 𓃦 | 0xF0 | 0x93 | 0x83 | 0xA6 |
U+1F60D | SMILING FACE WITH HEART-SHAPED EYES | 😍 | 0xF0 | 0x9F | 0x98 | 0x8D |
A note on byte-order marks
Whenever a data serialization format encodes data over multiple bytes, the question of endianness must be addressed. A UTF-16 implementation that naively relies on the native byte order of the host machine might produce output different from that of a machine using the opposite endianness. This can be an issue for UTF-32 and UTF-16.
To that effect, the Unicode standard defines a code point that has a special status: U+FEFF ZERO WIDTH NO-BREAK SPACE. This code point, which is called a byte-order mark (BOM), is expected to be present as the first code point of byte strings that encode Unicode strings over multi-byte chunks:
- Upon encoding a string, encoders would prefix it with the BOM and then run the normal encoding logic in their native endianness
- When reading bytes that are expected to encode code points over multiple bytes, decoders would peek at the first chunk of bytes (a small sketch follows this list):
  - If the first two bytes are 0xFE 0xFF, the rest of the bytes should be decoded as big-endian
  - If the first two bytes are 0xFF 0xFE, the rest of the bytes should be decoded as little-endian
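Here is a rough sketch of that sniffing step for UTF-16 (the function name is mine; real decoders also fall back to a default byte order, or to external information, when no BOM is present):

```python
# Detect the byte order of a UTF-16 stream from its BOM, if any.
def split_utf16_bom(data):
    if data[:2] == b"\xFE\xFF":
        return "big", data[2:]      # UTF-16BE
    if data[:2] == b"\xFF\xFE":
        return "little", data[2:]   # UTF-16LE
    return None, data               # no BOM: byte order must come from elsewhere

order, payload = split_utf16_bom(bytes([0xFF, 0xFE, 0x61, 0x00]))
assert order == "little" and payload.decode("utf-16-le") == "a"
```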
This is a step that is generally transparently handled by implementations and that you shouldn’t have to pay attention to:
- UTF-32 would normally require a BOM (encoded over 4 bytes, i.e. the byte sequence 0x00 0x00 0xFE 0xFF or 0xFF 0xFE 0x00 0x00 depending on byte order), but since the encoding is practically never used, implementations are inconsistent and may omit it.
- UTF-16 remains a relatively common encoding, and most encoders automatically prefix the BOM to the bytes they produce. This was the case for the Java code used to generate the encoding table above: I chose to omit the BOM from the output, but it was actually present.
- UTF-8 operates at the byte level, and therefore doesn’t need the BOM for either encoding or decoding. It occasionally happens that UTF-16 content re-encoded into UTF-8 embeds a BOM, but implementations are expected to ignore it.
1. The byte-order mark was omitted from these examples; please refer to the paragraph on byte-order marks later in this article to understand its purpose. ↩︎