Optimizing Unicode data for fun and profit

Just the default ICU Unicode data library is some 20MB. We don't need most of those data and since we are anal about executable size and also about dependencies, I implemented all of this shit by hand.

I could have packed the data with some state-of-art compressor, but that is a dependency, and it blows up executable size. Instead I went for some fun and tried to bring the data sizes down with very simple algorithms.

Grapheme clusters boundary data

These data allow Rift to place it's caret correctly and therefore not break Unicode text even if we cannot render it.

Algorithm requires 17811 codepoints in 1,345 ranges (a space of 921,600 codepoints) to be assigned one of 14 categories. You then use these 14 categories to detect whether there is a valid caret offset between a pair of codepoints using algorithm defined in Unicode.

Naive storage with two 32-bit codepoints and an 8-bit category would be some 12105 bytes.

The optimization I made is this:

  • Expanded the ranges to an array of 921,600 categories, each sitting at index of corresponding codepoint.
  • RLE encoded them down.

Result is 3,674 bytes in total.

Detection algorithm itself requires a state machine which is encoded into a table of 98 bytes. So it totals at 3,772 bytes. That allows the actual algorithm code to be this tiny:

static inline int
corune_get_cluster_break_property(CoRune rune) {

static inline int
corune_is_cluster_break(int state, int category) {
    return CORUNE_CLUSTER_BREAK_STATES[((state + 1) << 4) + category];

static inline int
corune_get_cluster_break_state(int category) {
    return CORUNE_CLUSTER_BREAK_STATES[category];

I've documented how we went about optimizing the algorithm in a separate article written in 2016.

General category data

These data allow Rift to tell whether a codepoint is a lowercase, uppercase, number or any of the 30 categories defined by Unicode. It is a replacement for <ctype.h>'s isalpha, isdigit etc.

Data is one of 30 categories assigned to one of 3,823 ranges. Again with simple storage of two 32-bit codepoints and 8-bit category we'd end up with 34,407 bytes.

To optimize, similarly to previous data set:

  • I've expanded the ranges into a flat array of categories each at index of corresponding codepoint.
  • RLE encoding with pair {category, count}
  • Encoded both pair values with UTF-8.

Result is 7,748 bytes total.

Case mapping and folding data

These data are used for transforming lowercase characters to uppercase and vice versa. They also contain rules for case folding which is used to transform text in case-insensitive search.

I'm using simple version of the data, with one-to-one mapping, so the data is really a pair of two from-to codepoints.

The data contains:

  • 1,393 codepoint pairs for to-lowercase (*some 11,144 bytes)
  • 1,410 codepoint pairs for to-uppercase (*some 11,280 bytes)
  • 1,414 codepoint pairs for case-folding (*some 11,312 bytes)

Some 33,736 bytes total, (*) when stored as simple pair of 32-bit values.

Naturally you'd expect the uppercase and lowercase tables to be symmetric, but they are not. There's 37 differences. Case folding table is quite similar to data in lowercase too, it differs in 195 entries.

You probably know where I'm heading with this. Instead of storing each table fully, I only store to-lowercase table in full (because it's the smallest) and for the remaining two I only store the differences.

Distance between a and A is 32, and it is so also for most of other pairs above ASCII range. Similarly distance between source codepoints is quite small. So another optimization I do is that instead of two codepoints I store two deltas {last - source, destination - source}. These two values are then encoded with UTF-8.

In the end we get:

  • 3,234 bytes for base to-lowercase table.
  • 106 bytes for difference to-uppercase table.
  • 588 bytes for difference case-folding table.

That's 3928 bytes in total.


Most of optimizations do use UTF-8 as a "compression algorithm". It works because:

  • The numbers involved are quite small, so instead of wasting most of 32 bits for most of numbers, we can get down to 8 or 16 bits for majority of the data.
  • We are also encoding codepoints, codepoint deltas or 8-bit data. So we always fit into 4-byte UTF-8 sequences.
  • We already need UTF-8 implementation in a text editor, so there's no additional cost for single-purpose compression algorithm.

For instance general category data stored without UTF-8 "compression" would occupy 15kB instead of current 8kB.

This was mostly fun, but it got us some good base for further improvements. For one, I already have a fairly tiny LZW implemented (though not linked).

However even with this simple dance, we got the data from 80,248 to 15,448 bytes. Additional use of my tiny LZW pushes the data down to 6,081 bytes. The data is unpacked to memory for fast lookups and occupy 2,160,976 bytes.

The code for all this will be available as a separate tiny MIT library once it's stabilized and battle-tested in Rift.