Compression: Run-Length Encoding

Perhaps the simplest method of compression is run length encoding. When faced with the string "AAAAAAAAAAAAAAAAA", if asked to describe it to another person, most people's first instinct would be to count the number of As, and tell the other person "17 'A's", at which point the other person perversely writes down the string "SEVENTEEN AYES", showing up one of the problems with this approach: distinguishing talking about the data from the data itself.

First, let's calculate the length in bits of some strings, so we can better compare performance. If we assume that our alphabet consists solely of upper case letters, then:
* "AAAAAAAAAAAAAAAAA" is 79.9 bits long
* "AAAABBBBAAAABBBB" is 75.2 bits long
* "ABCDEFGHIJKLMNOPQRSTUVWXYZ" is 122.2 bits long

We could add 10 new symbols corresponding to the digits 0-9, and send "AAAAAAAAAAAAAAAAA" as "17A" (15.5 bits, equivalent to 3.29 original characters), "AAAABBBBAAAABBBB" as "4A4B4A4B" (41.3 bits, equivalent to 8.8 original characters), and "ABCDEFGHIJKLMOPQRSTUVWXYZ" as "ABCDEFGHIJKLMOPQRSTUVWXYZ". Unfortunately, "ABCDEFGHIJKLMOPQRSTUVWXYZ" how has a length of 134.4 bits, longer than the original by about 11.75 bits (about 2.5 letters), because each symbol now represents a choice between 36 different options compared to the 26 before.

Rather than add 10 new symbols, let's try another approach. Let's add just one new symbol, say '!'. Whenever we see it, we interpret the next symbol as a value between 2 and 28 (A=2, B=3, ..., Z=27, !=28), and repeat the character after that that many times. If the next symbol is also a !, we multiply the first number by 27 and interpret the next symbol as a number between 0 and 26 to add to it. This can repeat as many times as necessary before we finally see a symbol that's not !, which then gets repeated as many times as the number we've been building up. This would make our previous examples "!QA" (14.26 bits; 3.03 original characters), "!CA!CB!CA!CB" (57 bits; 12.1 original characters) and "ABCDEFGHIJKLMOPQRSTUVWXYZ" (123.6 bits; 26.3 characters). The last example is still a bit longer, but now it's just 1.41 bits longer (0.3 letters).

Finally, we can repurpose one of the letters, say Z, as a special symbol. Whenever we see a Z, we behave as we did when we saw the ! in the previous example, but with one difference: instead of values going up to 27 or 26, it only goes up to 25 or 24 - there is no extra character, and if we see a Z, we output a Z. We how have "ZQA" (14.1 bits; 3 original characters), "ZCAZCBZCAZCB" (56.4 bits; 12 original characters) and "ABCDEFGHIJKLMNOPQRSTUVWXYZZ" (126.9 bits; 27 original characters).

If all letters are equally likely to appear, then on average, a message with no repetitions would be expected to increase in size by 10% by using numbers, 1.1% for a dedicated escape character, and 3.8% if we repurpose an existing character for escape. English text doesn't have a uniform letter distribution, though, so if we're encoding standard English text, we'd expect an actual increase of 0.074%.

If we don't have a good idea as to how frequently any given character might appear, another approach is to have multiple modes. To make things easier, I'm going to switch to working with bytes, rather than alphabet letters - the larger numbers make things, and working with numbers rather than letters makes treating them as numbers easier - no A=1, B=2, etc.

The default state of the modal run-length decoder is the command state. In this state, it reads in an (unsigned) byte n, and then determines what to do next: if the byte has a value less than 128 then the next n+1 bytes are to inserted directly into the output. If it's greater than or equal to 128, then we subtract 126 for this value, and repeat the next byte that many times. This method has an overhead of 1 extra byte every 128 or so, but this is also a worst case (on a sufficiently long input) - the only reason to have a swap modes more frequently than that is if there was a run of a length 3 or greater. (A run of length two is best if following a literal block of length less than 127, as it costs one byte to get into Run mode, one byte to indicate the byte to repeat). With a run of length three, the cost to express the run and return to literal mode is the same as the cost of just including the run in the last literal block (if any).