I ❤️ Unicode
Introduction
If you work in computer science, “It’s probably just a Unicode character” is a sentence you have probably heard thrown around more than once in your career. It’s one the people in the room generally shrug off as one of those inevitable quirks of working with computers before moving on with their day because, well, what else is there to do today?
In another instance, you might have found yourself trying to make sense of a blob of bytes using various flavors and combinations of ASCII, then UTF-8, then converting back to binary, without always understanding what went on or how you got it working. But that email parsing functionality you were tasked to deliver is working now, at least until you start shipping your software in Bulgaria or Uzbekistan, at which point you’ll hopefully be two jobs further along in your career.
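If you have never watched that failure unfold up close, here is a minimal sketch in Java (the language used for the rest of this series) of what decoding the same bytes with two different charsets looks like; the sample string and the charsets involved are purely illustrative:

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    public static void main(String[] args) {
        // "café" serialized to bytes using UTF-8: the 'é' becomes two bytes, 0xC3 0xA9.
        byte[] utf8Bytes = "café".getBytes(StandardCharsets.UTF_8);

        // Decoding those same bytes as Latin-1 interprets each byte as its own
        // character, which is how the familiar "cafÃ©" mojibake appears.
        System.out.println(new String(utf8Bytes, StandardCharsets.ISO_8859_1)); // cafÃ©

        // Decoding with the charset that was actually used round-trips cleanly.
        System.out.println(new String(utf8Bytes, StandardCharsets.UTF_8)); // café
    }
}
```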
And there is this set of JIRA tickets, all carefully labelled `i18n` by your predecessor, where your users report being unable to get the right search results when they search for the content they have hosted in the CMS you sell them. Some of them even provide reproducible examples of issues with your software’s indexing. But there are few enough of them that you brush it off as an edge case: “It’s probably just a Unicode character or something like that”. You decide to further label all the tickets with `intern` and `starter-task`.
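As a teaser for the kind of mismatch that might hide behind such tickets, here is a minimal sketch assuming the indexed content and the search query happen to use different, though visually identical, character sequences; the strings and the choice of normalization form are only illustrative:

```java
import java.text.Normalizer;

public class SearchMismatchDemo {
    public static void main(String[] args) {
        // Two renderings of "résumé" that look identical on screen: the first uses
        // the precomposed character U+00E9, the second uses 'e' followed by the
        // combining acute accent U+0301.
        String indexed = "r\u00E9sum\u00E9";
        String query = "re\u0301sume\u0301";

        // A byte-for-byte comparison (which is what a naive index lookup amounts to) fails.
        System.out.println(indexed.equals(query)); // false

        // Normalizing both sides to the same form (NFC here) makes them match.
        String a = Normalizer.normalize(indexed, Normalizer.Form.NFC);
        String b = Normalizer.normalize(query, Normalizer.Form.NFC);
        System.out.println(a.equals(b)); // true
    }
}
```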
A primer on Unicode
Joking aside, this series is intended as a moderately advanced explainer for people who want to educate themselves about what Unicode is. The reason it exists is that there are very few places on the Internet that strike the right balance between Joel Spolsky’s The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (which is an absolute must-read) and the technical portal on the official Unicode website (the homepage got a welcome revamp in 2020; let’s hope it extends to the documentation sections too).
Unicode is a lot of different things to different people. It’s easier to work backwards from what the goal of Unicode is: enabling everyone to use their own language on any digital platform. That’s a lofty goal, especially considering it was formulated all the way back in the early 1990s.
However, unlike most standards and specifications that came in the wake of a dozen others, Unicode has actually risen to become the prevailing set of modern standards for managing text on billions of computers and phones. Since there’s an XKCD for everything, there’s also an anti-XKCD for everything:
I will repeat myself: Unicode is a lot of different things to different people. That’s because the Unicode standard is a vast and complex one: it attempts to formalize the complexities and intricacies of text, which is a thorny and culturally touchy domain. It seeks to bridge the gap between computers and programmers, who expect rules and structures, and languages and humans, who are messy and history-rich.
For this series, I will focus on three aspects of the Unicode standard, each of which makes an appearance in the short sketch after this list:
- The character set: what is the Unicode character set and how does it differ from ones that came before?
- The encodings: what exactly is an encoding and why is it a requirement for Unicode I/O?
- The algorithms: beyond characters and codepoints, what higher-level problems does the Unicode standard consider and address?
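Here is that sketch, using nothing but the Java standard library; the character, the encodings and the normalization form are arbitrary picks for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

public class UnicodePreview {
    public static void main(String[] args) {
        String snowman = "☃"; // U+2603 SNOWMAN

        // The character set: every character is assigned a numeric code point.
        System.out.printf("U+%04X%n", snowman.codePointAt(0)); // U+2603

        // The encodings: the same code point serializes to different byte sequences
        // depending on which encoding is used for I/O.
        System.out.println(Arrays.toString(snowman.getBytes(StandardCharsets.UTF_8)));    // [-30, -104, -125]
        System.out.println(Arrays.toString(snowman.getBytes(StandardCharsets.UTF_16BE))); // [38, 3]

        // The algorithms: operations such as normalization are defined over whole
        // strings, above the level of individual code points.
        System.out.println(Normalizer.isNormalized("e\u0301", Normalizer.Form.NFC)); // false
    }
}
```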
Code and examples
The code used to generate the tables in this series (and more) can be found on GitHub at jsilland/unicode. The examples are in Java but are intended to demonstrate concepts that are applicable and accessible in most languages that have access to a port of CLDR. Contributions and examples in other programming languages are welcome.