[Standards] Binary data over XMPP

Dave Cridland dave at cridland.net
Wed Nov 7 16:11:21 UTC 2007

On Wed Nov  7 15:02:57 2007, Michal 'vorner' Vaner wrote:
> Can't compression solve this? Does anyone know how base64-encoded
> data grows or shrinks when it is put through zlib? It would be nice
> to know how far it is worth going with the blob transfers and
> modifications to the protocol.

I've been accused - on this list - of treating compression as a
panacea. But it's not a substitute for efficiency. The overhead of
Base64 encoding is recovered to a degree by a good minimal-redundancy
algorithm, but the encoding tends to shield patterns from a
dictionary algorithm. DEFLATE runs a Lempel-Ziv dictionary algorithm
first, then Huffman coding, which is a minimal-redundancy algorithm.
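
As a rough illustration of the "recovered to a degree" part, here is a
minimal Python sketch that base64-encodes some pseudo-random bytes and
then runs only the Huffman stage of DEFLATE over the result. The helper
name huffman_only_size and the sample size are made up for this
example; zlib.Z_HUFFMAN_ONLY is the zlib strategy that skips the
Lempel-Ziv matching step.

```python
import base64
import random
import zlib


def huffman_only_size(data: bytes) -> int:
    """Size of data as raw DEFLATE using Huffman coding only (no LZ77)."""
    comp = zlib.compressobj(level=9, wbits=-15, strategy=zlib.Z_HUFFMAN_ONLY)
    return len(comp.compress(data) + comp.flush())


# Assumption: 30000 pseudo-random bytes stand in for incompressible data.
n = 30000
raw = random.Random(0).getrandbits(8 * n).to_bytes(n, "big")
b64 = base64.b64encode(raw)

print("raw bytes:          ", len(raw))
print("base64 bytes:       ", len(b64))
print("base64 + Huffman:   ", huffman_only_size(b64))
```

Base64 uses only 64 of the 256 possible byte values, so a Huffman coder
can spend roughly 6 bits per character instead of 8, which is where the
recovery comes from.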

Luckily, practice is easier than theory. Grab some suitable data,
compress it, base64+compress it, and compare all the sizes. Gzip is a  
useful tool to do this - the results aren't 100% accurate due to gzip  
overhead, but are close to the zlib compression we use in the  
application layer of XMPP, and are pretty close to DEFLATE (as we  
should be using, and as TLS uses).
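
The same comparison can be scripted. The following is only a sketch in
Python, using zlib rather than the gzip command line tool, so the
numbers will differ slightly from gzip's. The function names
size_report and print_report are invented for this example, and
base64.encodebytes is used because it produces the traditional
line-wrapped form.

```python
import base64
import zlib


def size_report(data: bytes) -> dict:
    """Return the size in bytes of each encode/compress pipeline."""
    b64 = base64.encodebytes(data)       # base64 with newlines
    z = zlib.compress(data, 9)           # compressed only
    z_b64 = base64.encodebytes(z)        # compressed, then base64
    return {
        "raw": len(data),
        "b64": len(b64),
        "b64.z": len(zlib.compress(b64, 9)),
        "z": len(z),
        "z.b64": len(z_b64),
        "z.b64.z": len(zlib.compress(z_b64, 9)),
    }


def print_report(data: bytes) -> None:
    """Print each size and its percentage of the original size."""
    sizes = size_report(data)
    for name, size in sizes.items():
        percent = 100 * size / sizes["raw"] if sizes["raw"] else 0.0
        print(f"{name:8} {size:8d} {percent:6.1f}%")
```

To try it on a file, read the file in binary mode and pass the bytes
to print_report, for example print_report(open(path, "rb").read())
with path set to whatever file you want to measure.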

I took a C source file, and found this:

-rwxr-xr-x 1 dwd dwd  36K 2007-11-07 15:43 connection.c
The original file. (100%)
-rw-r--r-- 1 dwd dwd  49K 2007-11-07 15:44 connection.c.b64
Base64 encoded, traditionally, with newlines. (135%)
-rw-r--r-- 1 dwd dwd  15K 2007-11-07 15:44 connection.c.b64.gz
Base64, then gzipped. (40%)
-rw-r--r-- 1 dwd dwd 8.1K 2007-11-07 15:44 connection.c.gz
Just gzipped. Note it's nearly half the size of the base64-then-gzip
result. We'll use this as an incompressible object, so the second
percentage from here on is relative to it. (22% / 100%)
-rw-r--r-- 1 dwd dwd  11K 2007-11-07 15:45 connection.c.gz.b64
Gzipped, then base64. (30% / 135%)
-rw-r--r-- 1 dwd dwd 8.4K 2007-11-07 15:45 connection.c.gz.b64.gz
Now gzip it again. In principle, this should have recovered the  
base64 encoding, but note that it hasn't. (23% / 103%)

This suggests to me that not only does gzip not recover the base64  
encoding fully - although close - but base64 encoding prior to  
compression really hurts the compressor.

Note that compressing first, then base64 encoding, then compressing  
*again* actually gave better results than base64 *then* compressing,  
meaning that almost every file transfer we do under base64 should be  
compressed first.
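
A minimal sketch of "compress first, then base64" in Python, assuming
the payload has to travel as text. The names pack_for_transfer and
unpack_from_transfer are hypothetical helpers for this example and are
not taken from any XMPP specification.

```python
import base64
import zlib


def pack_for_transfer(data: bytes) -> str:
    """Compress the payload, then base64-encode it as ASCII text."""
    return base64.b64encode(zlib.compress(data, 9)).decode("ascii")


def unpack_from_transfer(text: str) -> bytes:
    """Reverse pack_for_transfer: base64-decode, then decompress."""
    return zlib.decompress(base64.b64decode(text))
```

The receiver has to know that the payload was compressed before
encoding, so in practice this would need to be signalled somehow.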

Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
  - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
  - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade
