I find your lack of 🎉 disturbing
This story dates back to when I worked at Strava. At the time, Strava ran its main web and API service on Ruby on Rails v3 (which has since been upgraded) and stored most of the data in MySQL.
Despite its status as the go-to social network for millions of athletes, Strava was missing support for emojis until the second half of 2017. This wasn’t even something that came up on our radar as a feature request from users, but the content on Strava looked flat compared to other places where users are free to type in text. Facebook, Twitter, Slack and even GitHub are places where emojis have spread like wildfire: they are just part of the world’s vocabulary, and Strava users were left behind.
But even before the engineering team looked seriously into supporting emojis, one of my coworkers asked why the caption of a picture taken on another employee’s activity showed signs that we actually did support them: specifically, the caption simply read as two red hearts: “❤️❤️”
The persistence layer
Most entities at Strava are persisted in MySQL databases. The main instance encoded most strings using `utf8`, which is an incredibly misleading name in the context of MySQL. As you can see in the linked documentation, this encoding cannot actually encode a large portion of the Unicode codepoints:
- “No support for supplementary characters (BMP characters only)”
- “A maximum of three bytes per multibyte character”
Strava’s core database was set up before MySQL introduced the `utf8mb4` character set, which extends the support of UTF-8-encoded text to the actual full range of codepoints defined by Unicode.
However, pictures (and their captions) are managed by a separate backend, which has its own database. The captions persisted in that database can contain emojis because the column was explicitly set to support them. As a matter of fact, one could easily `INSERT` values that contained any codepoint and retrieve them later, as long as this was done from the MySQL CLI.
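The three-byte cap is exactly what locks emojis out: most of them live outside the Basic Multilingual Plane and take four bytes in UTF-8. A quick Ruby check (the variable names here are mine, just for illustration) shows the difference:

```ruby
# MySQL's `utf8` stores at most 3 bytes per character; `utf8mb4` allows 4.
heart = "\u2764"     # U+2764 HEAVY BLACK HEART, inside the BMP
party = "\u{1F389}"  # U+1F389 PARTY POPPER, outside the BMP

heart.bytesize  # => 3, fits in MySQL's utf8
party.bytesize  # => 4, requires utf8mb4
```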
The serving layer
Now that we know that the data in the database may contain an emoji, let’s see how it is returned to the client. Using httpie and jq, we can specifically look at that field in an API response and examine the bytes¹:
```
$ http "https://www.strava.com/api/v3/activities/<redacted>/photos?photo_sources=true" \
    'Authorization: Bearer <redacted>' | jq '.[8].caption' | xxd
00000000: 22e2 9da4 efb8 8fe2 9da4 efb8 8f22 0a    "............".
```
We see the same sequence of bytes, repeated twice: `e29da4efb88f`. This is the raw hexadecimal representation of content that is UTF-8 encoded. Decoding that sequence to Unicode codepoints yields two of them:
- `U+2764 HEAVY BLACK HEART`: this character is actually not in the Emoji block (range: U+1F600..U+1F64F) but in the Dingbats block (range: U+2700..U+27BF)
- `U+FE0F VARIATION SELECTOR-16`: this is a variation selector; it specifies a variant of the previous character’s glyph and, in this case, I assume it changes the color to red.
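The decoding step can be reproduced in Ruby by packing the hex back into bytes and reading them as UTF-8 (a quick sketch, not from the original post):

```ruby
# Rebuild the raw bytes from the hex dump and decode them as UTF-8
bytes = ["e29da4efb88f"].pack("H*")
text  = bytes.force_encoding(Encoding::UTF_8)
text.codepoints.map { |cp| "U+%04X" % cp }  # => ["U+2764", "U+FE0F"]
```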
What’s interesting is that the heart codepoint actually appears in the exhaustive list of Emoji codepoints on the Unicode website. The character, which was originally defined in 1995, was grandfathered into the emoji range despite having a codepoint in a very different range than other emojis.
We’ve established that the content is actually not an emoji, and can therefore be rendered properly by Rails v3. Problem solved. However, the premise of my coworker’s question was that emojis simply do not work, and it’s worth exploring why that is the case.
❤️ but no 🎉
At the time I worked there, Strava’s main frontend was a relatively vanilla Rails setup running v3.2. Let’s look into how that version of Rails encodes strings in JSON; the code below is a simplified version of `activesupport/lib/active_support/json/encoding.rb`²:
```ruby
string = "\u2764\uFE0F"

# Re-encode to UTF-8 (replacing undefined characters), then view as raw bytes
string = string.encode(::Encoding::UTF_8, :undef => :replace).force_encoding(::Encoding::BINARY)

# First pass: escape control characters, the double quote and the backslash
# (escape_regex and ESCAPED_CHARS are declared elsewhere in the file)
json = string.gsub(escape_regex) { |s| ESCAPED_CHARS[s] }

# Second pass: chunk the bytes into UTF-8 multibyte sequences and escape them
json = json.gsub(/([\xC0-\xDF][\x80-\xBF]|
                   [\xE0-\xEF][\x80-\xBF]{2}|
                   [\xF0-\xF7][\x80-\xBF]{3})+/nx) { |s|
  s.unpack("U*").pack("n*").unpack("H*")[0].gsub(/.{4}/n, '\\\\u\&')
}
json = %("#{json}")
json.force_encoding(::Encoding::UTF_8)
```
Making sense of this logic is not easy, to say the least:

- The string is first encoded to UTF-8 (undefined characters are replaced with the replacement character) and then force-encoded to binary. This enables the rest of the code to dig into the binary representation of the text encoded in UTF-8.
- A first round of escaping takes place:
  - `gsub` is a method on Ruby strings that takes in a regular expression and yields each match in the receiver to a block, which is tasked with determining the string to replace the match with.
  - `escape_regex` is essentially `[\x00-\x1F"\\]`, i.e. bytes `0x00` to `0x1F`, the double quote (a reserved character in JSON) and the backslash.
  - `ESCAPED_CHARS` is a mapping that contains the escaped version of these, e.g. `\x17` → `\u0017`.
- A second round of replacement takes place, again using `gsub`:
  - The code first chunks the string into groups of two, three or four bytes depending on the range of the leading byte: `0xC0`-`0xDF` → 2 bytes, `0xE0`-`0xEF` → 3 bytes, `0xF0`-`0xF7` → 4 bytes.
  - Each chunk of bytes then goes through another conversion pipeline that uses Ruby’s `pack` and `unpack` methods:
    - decoded as a UTF-8 character (`unpack("U*")`)
    - converted into big-endian 16-bit values (`pack("n*")`)
    - finally converted into a hexadecimal string (`unpack("H*")`)
The chunking in 2/3/4 bytes is done based on the correct byte prefixes for UTF-8, where codepoints are encoded on 1-4 bytes:

| Number of bytes | Bits for codepoint | First codepoint | Last codepoint | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|---|---|
| 1 | 7 | U+0000 | U+007F | 0xxxxxxx | | | |
| 2 | 11 | U+0080 | U+07FF | 110xxxxx (0xC0) | 10xxxxxx | | |
| 3 | 16 | U+0800 | U+FFFF | 1110xxxx (0xE0) | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | U+10000 | U+10FFFF | 11110xxx (0xF0) | 10xxxxxx | 10xxxxxx | 10xxxxxx |
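The truncation at the heart of the bug can be observed by running the pipeline by hand on both characters (a sketch; the byte strings are written out explicitly):

```ruby
# 3-byte chunk for U+2764: the codepoint fits in 16 bits, so it survives
"\xE2\x9D\xA4".unpack("U*")          # decodes to the codepoint 0x2764
[0x2764].pack("n*").unpack("H*")[0]  # => "2764"

# 4-byte chunk for U+1F389: pack("n*") silently keeps only the low 16 bits
"\xF0\x9F\x8E\x89".unpack("U*")          # decodes to the codepoint 0x1F389
[0x1F389].pack("n*").unpack("H*")[0]     # => "f389" (the leading 1 is gone)
```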
As one might suspect from the complicated logic implemented here, the final conversion pipeline is why Rails doesn’t support emojis out of the box. All responses to XHRs and the entirety of the API downstream traffic go through this JSON encoder. This block of code was once removed and then re-added because the fix was deemed to be changing the behavior of a stable version (later versions of Rails do not have that issue).
Interestingly, rendering an emoji character in an HTML context does work even in v3.2 because the encoding path is completely independent ¯\\\_(ツ)\_/¯. But what actually happens? Knowing the escaping logic is faulty, we’d expect the dingbat character to go through that phase unscathed, while a “real” emoji will come out mangled:
- For “❤️” as an input (`string = "\u2764\uFE0F"`), the return value is `"\"\\u2764\\ufe0f\""`.
- For “🎉” as an input (`string = "\u{1F389}"`), which is a character in the Emoji block, this returns `"\"\\uf389\""`.
These outputs alone do not allow us to blame the JSON encoder: the two look roughly similar, and it’s unclear whether either or both of them are good or faulty.
API clients
To understand why the output of the second call is actually an issue, we need to look into what happens on the receiving end. JSON stands for “JavaScript Object Notation”: a data representation native to JavaScript, so we should continue to investigate in that context. Using the Chrome inspector, it’s trivial to convert our content from JSON into JavaScript strings:
```
JSON.parse("\"\\u2764\\ufe0f\"")
"❤️"
JSON.parse("\"\\uf389\"")
""
```
This is the evidence we were looking for all along. It took going all the way to the client but we finally know the issue is real: only one of the two characters is failing to be unescaped correctly by a real-world JSON parser.
To the JSON RFC…
Now that we know that the encoder is faulty, we should look into the root cause of the issue, and specifically understand why there is such a triple conversion at the last step. That will require a dive into the JSON RFC:
> Any character may be escaped. If the character is in the Basic Multilingual Plane (U+0000 through U+FFFF), then it may be represented as a six-character sequence: a reverse solidus, followed by the lowercase letter u, followed by four hexadecimal digits that encode the character’s code point. The hexadecimal letters A though F can be upper or lowercase. So, for example, a string containing only a single reverse solidus character may be represented as “\u005C”.
The heart character is in the BMP, the Basic Multilingual Plane (range: U+0000..U+FFFF), so that section applies and the output of the escaper makes sense.
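For a BMP character, producing the six-character sequence is a one-liner: format the codepoint as four hex digits behind a literal backslash-u. A sketch, with `escape_bmp` being a hypothetical helper name:

```ruby
# Escape a single BMP character as the six-character \uXXXX JSON sequence
def escape_bmp(char)
  raise ArgumentError, "BMP only" if char.ord > 0xFFFF
  "\\u%04x" % char.ord
end

escape_bmp("\u2764")  # => the six characters \u2764
escape_bmp("\uFE0F")  # => the six characters \ufe0f
```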
…And back again
What does the spec have to say about characters that are not in the BMP, however?
> To escape an extended character that is not in the Basic Multilingual Plane, the character is represented as a twelve-character sequence, encoding the UTF-16 surrogate pair.
That should be the case for the party popper character (U+1F389), but the JSON encoder shipping with Rails 3 doesn’t do that. It did escape the character but actually stripped the leading 1, which is kind of important 🎉. As per the spec, such characters may be encoded in UTF-16 and represented using the two resulting code units. Converting 🎉 to UTF-16 yields “D83C DF89”, so let’s try that:
```
JSON.parse("\"\\ud83c\\udf89\"")
"🎉"
```
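The surrogate pair can be derived from the codepoint with the standard UTF-16 arithmetic, and Ruby’s own encoder confirms the result (a sketch, not from the original post):

```ruby
cp   = 0x1F389              # U+1F389 PARTY POPPER
v    = cp - 0x10000         # 20 remaining bits, split across two 16-bit units
high = 0xD800 + (v >> 10)   # equals 0xD83C
low  = 0xDC00 + (v & 0x3FF) # equals 0xDF89

# Cross-check with Ruby's UTF-16 encoder
"\u{1F389}".encode(Encoding::UTF_16BE).unpack("n*").map { |u| "%04X" % u }
# => ["D83C", "DF89"]
```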
We have our first step towards a fix: if we can get the encoder to spit out the correct UTF-16 surrogate pair representing the “🎉” character, we could mitigate the issue. But that seems like a fairly painful mitigation: do we really need to jump through all those hoops just to represent emojis in JSON? The sad truth is that all that escaping is not even needed to begin with. The spec also contains the following language:
> JSON text SHALL be encoded in Unicode. The default encoding is UTF-8.
UTF-8 is just an encoding, one that supports all Unicode codepoints, including emojis. And the spec didn’t say that such characters must be escaped; it said that they may be. So let’s try to be literal for one second:
```
JSON.parse("\"🎉\"")
"🎉"
```
This is exactly the path taken by the JSON escaper in Rails 4+: it mostly became a passthrough for text content which is already encoded in UTF-8.
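That passthrough behavior can be approximated with a minimal escaper that only touches the characters JSON actually reserves and leaves UTF-8 alone. A sketch: `escape_json` and `JSON_ESCAPES` are hypothetical names, not the actual Rails API:

```ruby
# Only ", \ and control characters must be escaped; everything else,
# including emojis, is valid UTF-8 inside a JSON string.
JSON_ESCAPES = {
  '"' => '\\"', "\\" => "\\\\",
  "\b" => '\b', "\f" => '\f', "\n" => '\n', "\r" => '\r', "\t" => '\t'
}.freeze

def escape_json(string)
  escaped = string.gsub(/["\\\x00-\x1f]/) do |c|
    JSON_ESCAPES[c] || "\\u%04x" % c.ord
  end
  %("#{escaped}")
end

escape_json("🎉")   # => "\"🎉\"", passed through as UTF-8
escape_json("a\nb") # => "\"a\\nb\""
```

This is the spirit of the Rails 4 fix: escape the reserved characters, and pass the rest of the UTF-8 text through untouched.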
Step by step
The full solution to supporting emojis on Rails 3 and MySQL 5 is as follows:

- Convert your MySQL `VARCHAR` and `TEXT` columns to use `utf8mb4` as an encoding. Depending on the volume of data you have, this may take a while.
- Set `utf8mb4` as the encoding in your database connection settings.
- Monkey-patch the default Rails JSON encoder; you can do that in an initializer, e.g. `config/emoji_support.rb`. The commit that was proposed on the 3.2 branch to fix this very issue is a good place to start.
- Test, test, test. Depending on the age of your Rails application, there can be subtle bugs that have never manifested with the lower range of Unicode codepoints. One that wasn’t caught in time for Strava is this one.
1. You may object that this command is not useful because it doesn’t actually expose the raw bytes as returned by the server: both `httpie` and `jq` have built-in JSON parsers, which unescape the real data sent from the server. The actual content can be observed by piping the output of `http` directly into `xxd`. Unsurprisingly, `"\u2764\ufe0f"` appears in it:

   ```
   000012e0: 6361 7074 696f 6e22 3a22 5c75 3237 3634  caption":"\u2764
   000012f0: 5c75 6665 3066 5c75 3237 3634 5c75 6665  \ufe0f\u2764\ufe
   00001300: 3066 222c 2274 7970 6522 3a6e 756c 6c2c  0f","type":null,
   ```

   This is a valid point. With that said, and independently of the findings presented here, the mechanisms defined by the JSON RFC are meant to be transparent. What really matters to the user isn’t what’s on the wire, it’s the decoded, unescaped data, which the initial command does actually surface. ↩︎

2. With `ESCAPED_CHARS` declared as per the constant in the source. ↩︎