Splay trees, compression, encryption, and embedding (by alaric)
But the fun doesn't end there. Splay trees can be used for data compression, too.
There are two main kinds of data compression: entropy coding and dictionary coding. Both kinds of compressor treat the input as a string of symbols - usually 8-bit bytes - and the output as a string of bits; the decompressor, obviously, does the reverse.
Entropy coding considers each symbol in turn, and tries to predict what the next symbol will be based on past statistics; it assigns short sequences of output bits to symbols that it thinks are likely to appear next, and long sequences to those it thinks unlikely. An entropy compressor reading English text will consider the space (which occurs once per word!) and common letters like e, t, a, i and o as very likely, so it will give them short codes (generally 4 bits or fewer, thus a saving over the 8 bits used to encode them in uncompressed text). Smarter compressors base their prediction on the past few characters, so that the letter 'u' - normally uncommon, and so given a long code - gets a very short code if the previous character was 'q'.
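As a quick back-of-the-envelope check of that claim, here's a tiny Python sketch (the sample string is purely illustrative) that counts symbol frequencies in a scrap of English and works out each symbol's "ideal" code length, -log2(probability):

```python
from collections import Counter
from math import log2

sample = "the quick brown fox jumps over the lazy dog " * 50
counts = Counter(sample)
total = sum(counts.values())
for sym, n in counts.most_common(5):
    # common symbols come out well under the 8 bits of their raw encoding
    print(repr(sym), round(-log2(n / total), 2), "bits")
```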
Dictionary coding, on the other hand, is more concerned with sequences of symbols than with what the symbols actually are. A dictionary coder looks for repeated sequences, and tries to encode them as multiple references to one copy of the sequence. Its output is a mixture of literal symbols - ones that didn't appear to be part of a repeated sequence, or that form the first appearance of such a sequence - and "backreferences" to past sequences, encoded as a number of symbols to skip back and a number of symbols to copy from that point. The easiest example to explain is LZ77, in which the encoder keeps a record of the past 2KB or so of input symbols, and looks ahead at the next 32 or so input symbols. It searches the past history for the longest string that matches the upcoming symbols; if it finds a match of more than 3 or 4 symbols, it encodes them as a backreference. Otherwise, it outputs the first of the upcoming symbols as a literal symbol, takes it off the upcoming list and shoves it onto the history list, reads the next symbol of the input onto the end of the upcoming list, and repeats the process.
Given a text like:
"monkeys like monkeys like monkeys"
It would output:
"monkeys like [13/7][13/13]"
Where [13/7] means 'copy 7 symbols from what appeared 13 symbols ago', and [13/13] likewise means 'copy 13 symbols from 13 symbols back'.
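To make the process concrete, here's a minimal, deliberately naive sketch of the LZ77 idea; the window, lookahead and minimum-match sizes are just illustrative, and a real encoder would use a much faster match search:

```python
def lz77_encode(text, window=2048, lookahead=32, min_match=3):
    out, i = [], 0
    while i < len(text):
        best_len, best_dist = 0, 0
        # naive longest-match search over the history window
        for j in range(max(0, i - window), i):
            length = 0
            while (length < lookahead and i + length < len(text)
                   and text[j + length] == text[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            out.append((best_dist, best_len))   # backreference: (skip back, copy count)
            i += best_len
        else:
            out.append(text[i])                 # literal symbol
            i += 1
    return out

def lz77_decode(tokens):
    text = []
    for tok in tokens:
        if isinstance(tok, tuple):
            dist, length = tok
            for _ in range(length):
                text.append(text[-dist])        # copy from `dist` symbols back
        else:
            text.append(tok)
    return ''.join(text)

tokens = lz77_encode("monkeys like monkeys like monkeys")
print(tokens)
assert lz77_decode(tokens) == "monkeys like monkeys like monkeys"
```

Note that a greedy longest-match search like this one actually collapses the two repeats into a single [13/20]-style backreference; exactly how the matches get split up depends on the encoder's strategy and lookahead limit.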
Often, the two methods are combined; the DEFLATE algorithm used in ZIP files works by running the input through a dictionary compressor, then using entropy compressors on the literals and the two parts of each backreference. This provides a double win - the literal text the dictionary system cannot compress gets entropy compressed, but also the backreferences are compressed; certain lengths of sequences are more common than others, so will be encoded with shorter codes.
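For what it's worth, Python's zlib module implements exactly this LZ77-plus-Huffman combination, so the double win is easy to see on a repetitive input:

```python
import zlib

data = b"monkeys like monkeys like monkeys " * 100
packed = zlib.compress(data)
print(len(data), "bytes ->", len(packed), "bytes")
assert zlib.decompress(packed) == data
```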
It turns out that splay trees provide a reasonable form of entropy coding. It's not quite as good as Huffman coding or Arithmetic coding, but it's a lot simpler to implement, so is great for limited environments.
Basically, you store a tree containing all the different symbols. There are some technical differences in how the tree works, but essentially, to encode each symbol, you look it up in the tree and output a bit string describing the route you took from the root down to that symbol. You then splay that symbol up the tree, so that if it occurs again soon, it will be nearer the root and so have a shorter sequence. The most common symbols hover near the top of the tree, so they get the shortest sequences. Voila!
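Here's a minimal sketch of one way to realise that - the assumptions are mine: a small fixed alphabet, symbols stored at the leaves so that root-to-leaf paths form a prefix code, and the "splay" done as the uncle-swapping semi-splay used in splay-prefix coding, which keeps the symbols at the leaves:

```python
class Node:
    def __init__(self, symbol=None):
        self.symbol = symbol                     # None for internal nodes
        self.left = self.right = self.parent = None

def build_tree(symbols):
    """Build a balanced starting code tree over `symbols`; returns (root, leaf map)."""
    level = [Node(s) for s in symbols]
    leaves = list(level)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                p = Node()
                p.left, p.right = level[i], level[i + 1]
                level[i].parent = level[i + 1].parent = p
                nxt.append(p)
            else:
                nxt.append(level[i])             # odd one out moves up unpaired
        level = nxt
    return level[0], {leaf.symbol: leaf for leaf in leaves}

def code_for(leaf):
    """Bits for the root-to-leaf path: 0 = went left, 1 = went right."""
    bits, node = [], leaf
    while node.parent is not None:
        bits.append(0 if node.parent.left is node else 1)
        node = node.parent
    return bits[::-1]

def semi_splay(leaf):
    """Swap the accessed leaf with its 'uncle', then repeat from the grandparent;
    this roughly halves the leaf's depth, pulling recent symbols towards the root."""
    node = leaf
    while node.parent is not None and node.parent.parent is not None:
        p, g = node.parent, node.parent.parent
        uncle = g.right if g.left is p else g.left
        if p.left is node: p.left = uncle
        else:              p.right = uncle
        if g.left is uncle: g.left = node
        else:               g.right = node
        node.parent, uncle.parent = g, p
        node = g

def splay_encode(text, symbols):
    _, leaf_map = build_tree(symbols)
    bits = []
    for s in text:
        bits.extend(code_for(leaf_map[s]))       # emit the path to the symbol...
        semi_splay(leaf_map[s])                  # ...then move the symbol up the tree
    return bits
```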
The decoder must start off with the same initial tree layout as the encoder; it then reads in a sequence of bits that express a path down the tree until it encounters a symbol, which it then outputs. It must then mimic what the encoder did to its tree, and splay that node, then start again from the root. As it extracts each symbol, its tree keeps pace with the encoder's tree, ensuring that both agree on what symbol appears where.
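Continuing that sketch, the matching decoder walks the bits down its own copy of the tree and performs exactly the same splay after each symbol; the explicit length argument here just stands in for a real end-of-stream marker:

```python
def splay_decode(bits, symbols, length):
    root, _ = build_tree(symbols)                # must match the encoder's start state
    out, it = [], iter(bits)
    for _ in range(length):
        node = root
        while node.symbol is None:               # follow bits down to a leaf
            node = node.right if next(it) else node.left
        out.append(node.symbol)
        semi_splay(node)                         # mirror the encoder's tree update
    return ''.join(out)

text = "monkeys like monkeys like monkeys"
alphabet = sorted(set(text))
bits = splay_encode(text, alphabet)
print(len(bits), "bits, versus", len(text) * 8, "uncompressed")
assert splay_decode(bits, alphabet, len(text)) == text
```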
However, the fact that the tree is constantly changing, in a way that depends on every symbol encoded so far, corresponds to a desirable property of encryption systems: that a single bit change in the input data stream causes the output stream to become very different after that point. This means that splay compression has the makings of a passable encryption algorithm. If you start the encoding by first looking up (and splaying) a string of symbols in the tree, without bothering to output the resulting paths, then the initial state of the tree depends on that string of symbols; the same string of symbols must be made available to the decoder for it to know where to begin - so it's basically a password. There are weaknesses, so it should not be used alone except in low-security environments, but I suspect that it could be very useful as part of a larger system.
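Continuing the sketch above, "keying" the tree might look something like this: both ends splay a shared password's symbols through their trees before any real data flows, emitting no bits at all, so only someone with the password starts from the right tree shape.

```python
def keyed_tree(symbols, key):
    """Starting tree whose shape depends on a shared secret `key` string."""
    root, leaf_map = build_tree(symbols)
    for s in key:
        semi_splay(leaf_map[s])                  # reshape the tree; output nothing
    return root, leaf_map
```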
A common "encoding pipeline" for data is dictionary encoding, entropy encoding, then encryption. For example, PGP will DEFLATE the input text (which, as mentioned above, consists of dictionary encoding followed by entropy encoding), then encrypt it.
I suspect there'd be some value in making a DEFLATE-like composition of dictionary encoding, splay-tree-based entropy encoding with encryption thrown in, then conventional encryption with a separate key. To frustrate chosen-plaintext attacks on the splay encryption described above, random symbols can be looked up (and splayed) in the tree without bothering to waste space encoding their paths, as long as the random sequence of symbols, and the points in time at which to look them up, depend on the key so that the decoder can perform the same lookups too.
That way, the splay tree is mainly there for compression, but also gives you a nearly-free added layer of security against chosen-plaintext attacks on the conventional encryption algorithm.
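One way that masking could be sketched, again building on the earlier code: a key-seeded stream decides when to perform extra, never-encoded splays between real symbols, so the same plaintext under different keys yields different trees and hence different bit streams. (random.Random is just a stand-in here; a real design would need a proper keyed generator, and the whole construction is unvetted.)

```python
import random

def masked_splay_encode(text, symbols, key):
    _, leaf_map = build_tree(symbols)
    prng = random.Random(key)                    # both ends derive this from the key
    bits = []
    for s in text:
        if prng.random() < 0.25:                 # key-dependent "dummy" lookup
            semi_splay(leaf_map[prng.choice(symbols)])
        bits.extend(code_for(leaf_map[s]))
        semi_splay(leaf_map[s])
    return bits
```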
However, there is more.