Compression: Introduction

    What is compression?


A compression algorithm is a map from a set of strings A to another set of strings B, with the goal of mapping the strings in A that are believed to be more likely to occur to shorter strings in B. There are two important details here: "shorter", and "more likely".
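
As a toy illustration (the letters and codeword lengths below are made up, not taken from any real frequency table), here is a code that shortens strings assumed to be likely at the cost of lengthening rare ones. The codewords form a prefix code, so the map stays reversible:

    # A toy code: the (assumed) frequent letter "e" gets a short
    # codeword, the rare letter "z" a long one. The lengths are
    # illustrative, not drawn from real letter frequencies.
    codebook = {"e": "0", "t": "10", "z": "1111110"}

    def encode(text: str) -> str:
        # Concatenate codewords; a prefix code keeps this reversible.
        return "".join(codebook[c] for c in text)

    print(len(encode("eete")))  # 5 bits for a likely string
    print(len(encode("zzzz")))  # 28 bits for an unlikely one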

Because the alphabet of A might have more or fewer symbols than that of B, we need to refine our concept of length. If we wrote down a binary string, then drew boxes around every four bits, and considered each box to be a symbol, we wouldn't consider this new string any more concise than the original, even though it requires only one quarter as many of the new symbols to write. Similarly, converting a binary string to hexadecimal doesn't make it shorter in the way that we're looking for, even though it might take up less space on paper and fewer pen strokes to write: each hexadecimal digit is equivalent to one of the boxes. For this reason, we measure string lengths in bits, even if the actual output is written in whole bytes at a time.
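
A quick sketch of that accounting, assuming an alphabet of equally likely symbols: a symbol drawn from an alphabet of size k carries log2(k) bits, so converting binary to hexadecimal changes the symbol count but not the bit count:

    import math

    def bit_length(symbol_count, alphabet_size):
        # Bits needed for symbol_count symbols drawn from an alphabet
        # of alphabet_size equally likely symbols: log2(k) bits each.
        return symbol_count * math.log2(alphabet_size)

    binary = "101001101111"                 # 12 binary symbols
    hexform = format(int(binary, 2), "x")   # "a6f", 3 hex symbols

    print(bit_length(len(binary), 2))    # 12.0 bits
    print(bit_length(len(hexform), 16))  # 12.0 bits: no shorter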

The detail that we're only trying to shorten frequently appearing strings is also an important one, and leads to one of the fundamental principles of compression: there are always more strings of a given length than there are shorter ones. If there are k symbols in the alphabet, then the number of strings of length n is k^n, but the number of strings of length 0 through n-1 (counting the empty string) is the geometric sum 1 + k + ... + k^(n-1) = (k^n - 1)/(k - 1). If k = 2, that works out to 2^n - 1, so you're always short by exactly one. For larger values of k, the situation gets even worse.
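
The counts are easy to verify numerically; a minimal sketch:

    def strings_of_length(k, n):
        return k ** n

    def shorter_strings(k, n):
        # Lengths 0 through n-1, including the empty string;
        # equals (k**n - 1) // (k - 1) for k > 1.
        return sum(k ** i for i in range(n))

    for k in (2, 3, 16):
        print(k, strings_of_length(k, 4), shorter_strings(k, 4))
    # k=2:  16 vs 15   (short by exactly one)
    # k=3:  81 vs 40
    # k=16: 65536 vs 4369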

If we ignore the inner workings of the compression algorithm and just look at the input and output, there is one important property that we want: for every given output, there is only one input that yields it. If more than one input yields the same output, then there is no way to reverse the process from the output alone - any metadata that might give the decompressor a clue as to which input file to recreate must be counted as part of the output, and as part of its length. No-one would consider an algorithm that merely trimmed the final byte from a file and appended it to the file's name to be a proper compression algorithm, even though it did just shorten the file on disc by one byte - to reconstruct the original file somewhere else, one must now also transmit the compressed file's name.
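
The trim-the-last-byte scheme above makes a good negative example. Written out as a sketch, it is plainly not injective, so no decompressor can recover the original from the output alone:

    # A toy "compressor" that drops the last byte. Two different
    # inputs collapse to one output, so no decompressor can exist.
    def bad_compress(data: bytes) -> bytes:
        return data[:-1]

    assert bad_compress(b"cat!") == bad_compress(b"cats")
    # The dropped byte has to be stored somewhere (the file name,
    # in the example above), and that storage is part of the output.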

This brings us to the counting argument, which simply states that you can't perform an injective map from a larger set to a smaller set. Since the set of strings of any given length is larger than the set of all shorter strings, no algorithm can compress every one of them - at least one must come out no shorter than it went in. For a simple example, consider writing the numbers from 100 to 999 on red slips of paper, and devising some kind of compression scheme that will compress as many of them as possible. Now write the numbers from 0 to 99 on blue slips of paper. For every slip in the red pile, pick out of the blue pile the number that your compression scheme would map it to, and staple them together. Once you've finished, you're going to have either a lot of red slips with no blue slips stapled to them, or at least one blue slip with a lot of red slips stapled to it.
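
The slips-of-paper experiment can also be run in code. The function below is an arbitrary stand-in for "your compression scheme" - any map from the 900 red numbers to the 100 blue ones gives the same conclusion:

    from collections import Counter

    def scheme(n):
        # An arbitrary stand-in: any function from 100..999
        # down to 0..99 will do for this demonstration.
        return n % 100

    staples = Counter(scheme(n) for n in range(100, 1000))
    blue, count = staples.most_common(1)[0]
    print(count)  # >= 9: some blue slip collects at least 900/100 red slips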

In other words, there is no perfect compression algorithm that will work on all inputs. You can cut your losses by marking each output with a leading bit that says whether the rest of the output has been compressed or not, but now every output has grown in size by one bit. A lot of compression programs do something similar to handle files that they cannot compress. The rest of the savings comes from considering which files the program is likely to see, and reducing redundancy in those. Mathematically, the two approaches are equivalent, but in practice implementations tend to keep them separate, or apply methods from both camps in sequence.
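
A minimal sketch of the escape-flag idea, using zlib as a stand-in for the inner compressor, and spending a whole flag byte where a single bit would do in principle:

    import os
    import zlib

    def pack(data: bytes) -> bytes:
        packed = zlib.compress(data)
        if len(packed) < len(data):
            return b"\x01" + packed  # flag: payload is compressed
        return b"\x00" + data        # flag: payload stored as-is

    def unpack(blob: bytes) -> bytes:
        flag, payload = blob[:1], blob[1:]
        return zlib.decompress(payload) if flag == b"\x01" else payload

    redundant = b"abab" * 100
    noise = os.urandom(64)  # random bytes are very unlikely to compress
    assert unpack(pack(redundant)) == redundant
    assert unpack(pack(noise)) == noise
    # Incompressible input grows by exactly the one-byte flag, never more.
    print(len(pack(redundant)), len(pack(noise)))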