[Standards] [LONG] Jeez, Sorry (was: All your problems, solved ; ))

Dave Cridland dave at cridland.net
Tue Aug 28 10:51:29 UTC 2007


On Mon Aug 27 22:56:39 2007, Jonathan Chayce Dickinson wrote:
> 2. Furthermore, compression methods used by servers (which tend to 
> be fast rather than small-and-effective) will, at best, return the 
> data to near its original length.

No, that's not true, and I admit I wasn't clear in my explanation 
last night, which (when I read it this morning) might well have led 
you to that belief.

The entropy of any given data, and thus its theoretical 
compressibility, is largely unaffected by Base64. Yes, it's true that 
Base64 reduces the effectiveness of most compression algorithms, but 
you only lose a few percent of efficiency, rather than merely 
clawing back the Base64 expansion. So if you have a file which is 50% 
compressible normally, and you Base64 encode it, the compressed 
result will probably end up only a little bigger than before.

By way of example, I created a snmpwalk dump (highly compressible), 
and looked at the file sizes:

Unencoded: 203255 (100%)
Base64: 274574 (135%)
Gzip: 19792 (9.7%)
Base64+Gzip: 39423 (19.4%)

You can test this yourself quite easily, assuming you're willing 
to pretend that the only compression algorithm in existence is 
DEFLATE. Simply use a zlib implementation to compress various files 
with and without Base64 encoding, and see what sizes you get 
afterward. I do recommend the exercise, as it's quite easy to do - 
the results hold even if you merely use gzip on the command line, 
as I did here. Something like the sketch below would do.
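A minimal sketch of such a test program, assuming zlib's one-shot 
compress() and a quick hand-rolled Base64 encoder; error handling is 
mostly omitted, and you'd link with -lz:

/* Sketch: compare DEFLATE output sizes for a file, raw and
 * Base64-encoded. */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Minimal Base64 encoder, with standard '=' padding. */
static unsigned char *base64(const unsigned char *in, size_t n,
                             size_t *outlen)
{
    static const char tab[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        "0123456789+/";
    unsigned char *out = malloc(((n + 2) / 3) * 4 + 1);
    size_t i, o = 0;
    for (i = 0; i < n; i += 3) {
        unsigned v = in[i] << 16;
        if (i + 1 < n) v |= in[i + 1] << 8;
        if (i + 2 < n) v |= in[i + 2];
        out[o++] = tab[(v >> 18) & 63];
        out[o++] = tab[(v >> 12) & 63];
        out[o++] = (i + 1 < n) ? tab[(v >> 6) & 63] : '=';
        out[o++] = (i + 2 < n) ? tab[v & 63] : '=';
    }
    *outlen = o;
    return out;
}

/* DEFLATE a buffer at the default compression level and report the
 * compressed size. */
static size_t deflated_size(const unsigned char *in, size_t n)
{
    uLongf destlen = compressBound(n);
    unsigned char *dest = malloc(destlen);
    compress(dest, &destlen, in, n);
    free(dest);
    return destlen;
}

int main(int argc, char **argv)
{
    if (argc < 2) return 1;
    FILE *f = fopen(argv[1], "rb");
    fseek(f, 0, SEEK_END);
    size_t n = ftell(f);
    rewind(f);
    unsigned char *buf = malloc(n);
    fread(buf, 1, n, f);
    fclose(f);

    size_t b64len;
    unsigned char *b64 = base64(buf, n, &b64len);
    printf("Unencoded:      %zu\n", n);
    printf("Base64:         %zu\n", b64len);
    printf("Deflate:        %zu\n", deflated_size(buf, n));
    printf("Base64+Deflate: %zu\n", deflated_size(b64, b64len));
    return 0;
}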

FWIW, you can't quite get the same test to work for XMPP streams, 
since you have to take blocking into account - the compressor must 
flush at each stanza boundary, so the receiver isn't left waiting on 
half-delivered data - but if you're reasonably comfortable with C 
programming, it shouldn't take long to rig up a test program for 
that, should it interest you. You may find it informative to look at 
the IMAP COMPRESS extension, for which quite a bit of experimental 
implementation work was done.
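For the stream case, the shape of the test is roughly this - a 
sketch, with made-up stanzas, in which deflate() is flushed with 
Z_SYNC_FLUSH at each stanza boundary so the receiver can act on each 
stanza immediately; each flush costs a few octets, which the 
whole-file test above doesn't capture:

/* Sketch: stanza-at-a-time DEFLATE, flushing at each boundary as a
 * compressed XMPP stream (or IMAP COMPRESS) must. Link with -lz. */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    /* Hypothetical stanzas standing in for real traffic. */
    const char *stanzas[] = {
        "<message to='a@example.com'><body>hello</body></message>",
        "<message to='a@example.com'><body>again</body></message>",
    };
    unsigned char out[512];
    size_t total = 0;
    z_stream z;
    memset(&z, 0, sizeof z);
    deflateInit(&z, Z_DEFAULT_COMPRESSION);

    for (int i = 0; i < 2; i++) {
        z.next_in = (Bytef *)stanzas[i];
        z.avail_in = strlen(stanzas[i]);
        z.next_out = out;
        z.avail_out = sizeof out;
        /* Z_SYNC_FLUSH byte-aligns the output and pushes everything
         * buffered so far; the LZ77 dictionary survives, so later
         * stanzas still compress against earlier ones. */
        deflate(&z, Z_SYNC_FLUSH);
        total += sizeof out - z.avail_out;
        printf("after stanza %d: %zu octets on the wire\n",
               i + 1, total);
    }
    deflateEnd(&z);
    return 0;
}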

(As a brief aside, email does require in-band binary data transfer, 
because the peers might not be online at the same time, so the 
circumstances are radically different there).

You'll note that Base64 certainly harms the compression - in this 
case most likely because the Lempel-Ziv backreferencing cannot locate 
matches unless they happen to fall on a 3-octet boundary - but you 
certainly still see compression way beyond mere recovery of the 
Base64 overhead. This is why I suggested that Base64 was reasonable 
for small amounts of data, as the additional overhead doesn't amount 
to much.
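To see the alignment effect concretely: Base64 maps each 3-octet 
group to 4 characters, so a repeat whose distance is a multiple of 3 
survives the encoding verbatim, while any other distance scrambles 
it. A toy demonstration (simplified encoder: no padding, so inputs 
are kept to multiples of 3 octets):

/* A repeat at distance 6 (multiple of 3) survives Base64 verbatim;
 * a repeat at distance 4 does not, so LZ77 can no longer match it. */
#include <stdio.h>

static void b64(const unsigned char *in, int n, char *out)
{
    static const char tab[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
        "0123456789+/";
    int i, o = 0;
    for (i = 0; i < n; i += 3) {
        unsigned v = (in[i] << 16) | (in[i + 1] << 8) | in[i + 2];
        out[o++] = tab[(v >> 18) & 63];
        out[o++] = tab[(v >> 12) & 63];
        out[o++] = tab[(v >> 6) & 63];
        out[o++] = tab[v & 63];
    }
    out[o] = '\0';
}

int main(void)
{
    char out[32];
    b64((const unsigned char *)"abcdefabcdef", 12, out);
    printf("abcdefabcdef -> %s\n", out); /* YWJjZGVmYWJjZGVm */
    b64((const unsigned char *)"abcdabcdabcd", 12, out);
    printf("abcdabcdabcd -> %s\n", out); /* YWJjZGFiY2RhYmNk */
    return 0;
}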

Counter-intuitively, for data generally considered difficult to 
compress, you'll get better recovery of the Base64 overhead - I 
suspect this is because Huffman coding is largely unaffected by 
Base64. With a JFIF (JPEG) file:

Unencoded: 172892 (100%)
Base64: 233558 (135%)
Gzip: 168377 (97.4%)
Base64+Gzip: 172667 (99.7%)

Here, Base64 costs you a mere 2.3 percentage points after 
compression (99.7% against 97.4%), not really enough to worry about. 
And note that incompressible files are quite common in transfers.

>> 3. Which they then abandoned in favor of "SOAP Message Transmission
>> Optimization Mechanism" (MTOM).
> 
> Why do we have MTOM now? Probably DIME. Note: Optimization.
> 
> 
Just to pick up on this: the name of a protocol is usually a 
political or marketing decision. XMPP is itself a case in point, but 
a more interesting example is IMAP, which used to be "Interactive 
Mail Access Protocol", but was renamed to "Internet Message Access 
Protocol" some time ago, as the marketing needs changed. I wouldn't 
read too much into what a protocol is called; it seems unlikely 
they'd call it "Workaround For Poor Framing Of Binary Data In XML 
Mechanism", after all.

Dave.
-- 
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
  - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
  - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade


