[Standards] Fun and games with Huffman

Dave Cridland dave at cridland.net
Tue Aug 4 09:00:32 UTC 2009


So, I've been playing with compression algorithms, by way of fun and  
games.

In particular, I've built a 2nd-order adaptive Huffman encoder in  
Python. It's not designed to be particularly efficient, but it's  
possible to build one of these in about 64k or so, I think - that's  
considerably less than the 256k that a DEFLATE compressor uses. (A  
zlib-based codec costs 300k total at maximum compression, so it's not  
equal, whereas this would be.)

The simple explanation of how it works is that it picks an encoding  
based on the probability so far of the symbol occuring given the last  
one. More probable symbols gain shorter codes, and the most probable  
usually ends up with a 1- or 2-bit code. Unknown symbols generate  
additional codes, and if they've never occured before at all, get  
dumped literally.

Thus "Test" compresses from 4 to 6 - actually encoded as SYM_NYT "T"  
SYM_NYT "e" ... SYM_FLUSH, so there's 9 symbols in there.

"TestTestTestTest" on the other hand compresses from 12 to 9 -  
actually those last three "Test" strings compress to 1-bit per symbol  
each, because the compressor has learnt the notion.

Now, there is a point to all this, shocking as it may seem. What I  
wanted to do was examine what happens if we change the symbol-set we  
input into the compression engine, so (and you'll see this at the end  
of the attached Python code) I wrote out a simple bit of XMPP -  
actually a misleadingly simple bit.

A DEFLATE compressor should do well on this, because there's lots of  
repeated substrings.

My Huffman encoder, working on the string itself, brings it from 314  
to 165 - that's 52.55% - and DEFLATE manages to get 36.94%. So, the  
moral of the story here is that, on data which in principally textual  
in nature, like XML, you don't want to be using a low-order adaptive  
Huffman encoder. But we all knew that anyway - dictionary compressors  
are pretty spiffy on long repeated substrings, and DEFLATE uses a  
secondary (1st order, non-adaptive) Huffman.

But what happens if - instead of compressing the data as a string,  
the Huffman alphabet includes the events generated by parsing the XML?

When we compress "<presence/>" traditionally, we compress it using 11  
symbols - one for each letter. This is handy, as it means it generic  
for any text, but as it happens, we don't have text - we have XML,  
and we can use that to our advantage.

So, instead, my compressor can compress it using symbols which  
describe the XML rather than the textual representation of it - in  
this case, amusingly, it'll make no difference, as it comes out as  
SYM_STARTTAG, "presence", SYM_ENDTAG, SYM_CLOSETAG. It makes slightly  
more difference if we compress "<presence></presence>", as the  
string-based compressor has nearly double the symbols, whereas a  
XML-based one has exactly the same.

The moment I do this, then I'm marginally beating DEFLATE - no mean  
feat in half the memory, albeit it might just be a product of the  
data.

I can also add in one more trick, which'll certainly increase the  
memory consumption, but ought to gain some serious savings - if I  
allow the compressor to identify certain strings - like element and  
attribute names - that are likely to be repeated, and assign them  
symbols, then in principle similar structures in XML should compress  
exceedingly well.

Testing shows that at least on the one fragment I've tried, there's a  
slim improvement - about 1.5% overall, or 3.7% relative to the  
XML-based compression.

The memory would increase, but on an entity with several sessions,  
these strings are likely to be shared between sessions, so I'm not  
entirely convinced I've broken the bank quite yet on memory usage,  
but it's certainly a more complex proposition to measure.

I'll adapt my script - and I make no apologies for how ugly the code  
is or the quantity of obtuse debugging data it outputs - to read in  
actual XMPP transcripts and see how it performs. Meanwhile, have fun.

(Oh, and no, I've not written the XML-based decompressor, yet.)

Dave.
-- 
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at dave.cridland.net
  - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
  - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
-------------- next part --------------
A non-text attachment was scrubbed...
Name: comp.py
Type: text/x-python
Size: 16249 bytes
Desc: not available
URL: <http://mail.jabber.org/pipermail/standards/attachments/20090804/6c5396f1/attachment.py>


More information about the Standards mailing list