[Standards] Fun and games with Huffman
dave at cridland.net
Tue Aug 4 09:00:32 UTC 2009
So, I've been playing with compression algorithms, by way of fun and
In particular, I've built a 2nd-order adaptive Huffman encoder in
Python. It's not designed to be particularly efficient, but it's
possible to build one of these in about 64k or so, I think - that's
considerably less than the 256k that a DEFLATE compressor uses. (A
zlib-based codec costs 300k total at maximum compression, so it's not
equal, whereas this would be.)
The simple explanation of how it works is that it picks an encoding
based on the probability so far of the symbol occuring given the last
one. More probable symbols gain shorter codes, and the most probable
usually ends up with a 1- or 2-bit code. Unknown symbols generate
additional codes, and if they've never occured before at all, get
Thus "Test" compresses from 4 to 6 - actually encoded as SYM_NYT "T"
SYM_NYT "e" ... SYM_FLUSH, so there's 9 symbols in there.
"TestTestTestTest" on the other hand compresses from 12 to 9 -
actually those last three "Test" strings compress to 1-bit per symbol
each, because the compressor has learnt the notion.
Now, there is a point to all this, shocking as it may seem. What I
wanted to do was examine what happens if we change the symbol-set we
input into the compression engine, so (and you'll see this at the end
of the attached Python code) I wrote out a simple bit of XMPP -
actually a misleadingly simple bit.
A DEFLATE compressor should do well on this, because there's lots of
My Huffman encoder, working on the string itself, brings it from 314
to 165 - that's 52.55% - and DEFLATE manages to get 36.94%. So, the
moral of the story here is that, on data which in principally textual
in nature, like XML, you don't want to be using a low-order adaptive
Huffman encoder. But we all knew that anyway - dictionary compressors
are pretty spiffy on long repeated substrings, and DEFLATE uses a
secondary (1st order, non-adaptive) Huffman.
But what happens if - instead of compressing the data as a string,
the Huffman alphabet includes the events generated by parsing the XML?
When we compress "<presence/>" traditionally, we compress it using 11
symbols - one for each letter. This is handy, as it means it generic
for any text, but as it happens, we don't have text - we have XML,
and we can use that to our advantage.
So, instead, my compressor can compress it using symbols which
describe the XML rather than the textual representation of it - in
this case, amusingly, it'll make no difference, as it comes out as
SYM_STARTTAG, "presence", SYM_ENDTAG, SYM_CLOSETAG. It makes slightly
more difference if we compress "<presence></presence>", as the
string-based compressor has nearly double the symbols, whereas a
XML-based one has exactly the same.
The moment I do this, then I'm marginally beating DEFLATE - no mean
feat in half the memory, albeit it might just be a product of the
I can also add in one more trick, which'll certainly increase the
memory consumption, but ought to gain some serious savings - if I
allow the compressor to identify certain strings - like element and
attribute names - that are likely to be repeated, and assign them
symbols, then in principle similar structures in XML should compress
Testing shows that at least on the one fragment I've tried, there's a
slim improvement - about 1.5% overall, or 3.7% relative to the
The memory would increase, but on an entity with several sessions,
these strings are likely to be shared between sessions, so I'm not
entirely convinced I've broken the bank quite yet on memory usage,
but it's certainly a more complex proposition to measure.
I'll adapt my script - and I make no apologies for how ugly the code
is or the quantity of obtuse debugging data it outputs - to read in
actual XMPP transcripts and see how it performs. Meanwhile, have fun.
(Oh, and no, I've not written the XML-based decompressor, yet.)
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at dave.cridland.net
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
-------------- next part --------------
A non-text attachment was scrubbed...
Size: 16249 bytes
Desc: not available
More information about the Standards