[Standards] [LONG] Jeez, Sorry (was: All your problems, solved ; ))
dave at cridland.net
Tue Aug 28 10:51:29 UTC 2007
On Mon Aug 27 22:56:39 2007, Jonathan Chayce Dickinson wrote:
> 2. Furthermore compression methods used by servers (which tend to
> be fast instead of -small-effective) will, at best, return the data
> to near its original length.
No, that's not true, and I admit I wasn't clear in my explanation
last night, which might (when I read it this morning) lead you to
that conclusion.
The entropy of any given data, and thus its theoretical
compressibility, is largely unaffected by Base64. Yes, it's true that
Base64 reduces the effectiveness of the majority of compression
algorithms, but you lose only a few percent of efficiency, rather
than merely clawing back the Base64 overhead and nothing more. So if
you've a file which is 50% compressible normally, and you Base64
encode it, the compressed result will probably end up only a little
bigger than the compressed original.
By way of example, I created an snmpwalk dump (highly compressible),
and looked at the file sizes:
Unencoded: 203255 (100%)
Base64: 274574 (135%)
Gzip: 19792 (9.7%)
Base64+Gzip: 39423 (19.4%)
You can test this yourself quite easily, assuming that you're willing
to pretend that the only compression algorithm in existence is
DEFLATE. Simply use a zlib implementation to compress various files
with and without Base64 encoding, and see what sizes you get
afterward. I do recommend the exercise, as it's quite easy to do -
the results hold even if you merely use gzip on the command line, as
I did here.
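Something along these lines is enough (a minimal sketch in Python,
using the stock zlib and base64 modules; pass it any filename you
like):

  import base64
  import sys
  import zlib

  # Read whichever file is named on the command line.
  with open(sys.argv[1], "rb") as f:
      raw = f.read()

  # Compress the raw octets and their Base64 encoding at level 9,
  # then report both sizes relative to the original file.
  for label, data in (("Unencoded", raw),
                      ("Base64", base64.b64encode(raw))):
      compressed = zlib.compress(data, 9)
      print("%-10s %8d -> %8d (%.1f%% of original)"
            % (label, len(data), len(compressed),
               100.0 * len(compressed) / len(raw)))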
FWIW, you can't quite get the same test to work for XMPP streams,
since you have to take blocking into account - the compressor must be
flushed at each stanza boundary, which costs a little efficiency -
but if you're reasonably comfortable with C programming, it shouldn't
take long to rig up a test program for that, should it interest you.
You may find it informative to look at the IMAP COMPRESS extension,
for which quite a bit of experimental implementation was done.
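To give a flavour of what blocking means, here's a rough sketch in
Python rather than C (zlib exposes the same flush semantics either
way; the stanzas are made-up examples):

  import zlib

  # Hypothetical stanzas - any short XML snippets will do.
  stanzas = [
      b"<message to='juliet@example.net'><body>Hullo</body></message>",
      b"<message to='romeo@example.net'><body>Hullo back</body></message>",
  ] * 50

  # One-shot compression of the whole stream, flushed only at the end:
  bulk = zlib.compress(b"".join(stanzas), 9)

  # Stanza-at-a-time compression, sync-flushed after every stanza so
  # the receiver can parse each one immediately - the XMPP situation:
  comp = zlib.compressobj(9)
  streamed = b""
  for stanza in stanzas:
      streamed += comp.compress(stanza)
      streamed += comp.flush(zlib.Z_SYNC_FLUSH)

  print("flushed once:       %d octets" % len(bulk))
  print("flushed per stanza: %d octets" % len(streamed))

Each sync flush costs a few octets, so a bulk file test slightly
overstates what a live stream will achieve.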
(As a brief aside, email does require in-band binary data transfer,
because the peers might not be online at the same time, so the
circumstances are radically different there).
You'll note that Base64 certainly harms the compression - in this
case most likely because the Lempel-Ziv backreferencing cannot locate
matches unless they happen to fall on a 3-octet boundary - but you
still see compression well beyond mere recovery of the Base64
overhead. This is why I suggested that Base64 was reasonable for
small amounts of data, as the additional overhead doesn't amount to
much.
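You can see the boundary problem directly: a repeated pattern only
survives Base64 encoding as a repeated string when its length is a
multiple of three. A tiny illustration (Python, made-up byte
strings):

  import base64

  # A 6-octet pattern (a multiple of 3) lines up with the Base64
  # groups, so the repetition survives encoding for LZ77 to find:
  print(base64.b64encode(b"ABCDEF" * 3))
  # -> b'QUJDREVGQUJDREVGQUJDREVG'

  # A 4-octet pattern straddles the 3-octet groups, so each copy
  # encodes differently and the redundancy is hidden:
  print(base64.b64encode(b"ABCD" * 3))
  # -> b'QUJDREFCQ0RBQkNE'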
Counter-intuitively, for data generally considered difficult to
compress, you'll get better recovery of the Base64 overhead - I
suspect this is because the Huffman coding stage is largely
unaffected by Base64, as it can still pack the 64-symbol alphabet
back down towards six bits per character. With a JFIF file:
Unencoded: 172892 (100%)
Base64: 233558 (135%)
Gzip: 168377 (97.4%)
Base64+Gzip: 172667 (99.9%)
Here, the Base64 is costing you a mere 2.5% in compressed size - not
really enough to worry about. And note that effectively
incompressible files are quite common to transfer.
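If you fancy testing that suspicion, zlib lets you switch off the
Lempel-Ziv matching and keep only the Huffman stage - a quick sketch
(Python again, same caveats; pass it any filename):

  import base64
  import sys
  import zlib

  def huffman_only(data):
      # Z_HUFFMAN_ONLY disables the Lempel-Ziv matching, leaving
      # just the Huffman coding stage of DEFLATE.
      c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                           zlib.DEF_MEM_LEVEL, zlib.Z_HUFFMAN_ONLY)
      return c.compress(data) + c.flush()

  with open(sys.argv[1], "rb") as f:
      raw = f.read()

  for label, data in (("Unencoded", raw),
                      ("Base64", base64.b64encode(raw))):
      print("%-10s %8d -> %8d"
            % (label, len(data), len(huffman_only(data))))

Since Base64 output uses only 64 distinct symbols, Huffman coding
alone claws back most of the 33% expansion, which squares with the
JFIF numbers above.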
>> 3. Which they then abandoned in favor of "SOAP Message Transmission
>> Optimization Mechanism" (MTOM).
> Why do we have MTOM now? Probably DIME. Note: Optimization.
Just to pick up on this, the name of a protocol is usually a
political or marketing decision. XMPP is itself a case in point, but
a more interesting example is IMAP, which used to be "Interactive
Mail Access Protocol", but was renamed to "Internet Message Access
Protocol" some time ago, as the marketing needs changed. I wouldn't
read too much into what a protocol is called; it seems unlikely
they'd call it "Workaround For Poor Framing Of Binary Data In XML
Mechanism", after all.
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade