[Standards] [LONG] Jeez, Sorry (was: All your problems, solved ; ))
Dave Cridland
dave at cridland.net
Tue Aug 28 05:51:29 CDT 2007
On Mon Aug 27 22:56:39 2007, Jonathan Chayce Dickinson wrote:
> 2. Furthermore compression methods used by servers (which tend to
> be fast instead of -small-effective) will, at best, return the data
> to near it's original length.
No, that's not true. I admit I wasn't clear in my explanation
last night, which (reading it again this morning) might lead you to
that belief.
The entropy of any given data, and thus its theoretical
compressibility, is largely unaffected by Base64. Yes, it's true that
Base64 reduces the compression achieved by the majority of
algorithms, but you only lose a few percent efficiency, rather than
merely recovering the Base64 overhead. So if you've a file which
is 50% compressible normally, and you Base64 encode it and then
compress it, the result will probably be only a little bit bigger
than compressing the original directly would give.
By way of example, I created an snmpwalk dump (highly compressible),
and looked at the file sizes:
Unencoded: 203255 (100%)
Base64: 274574 (135%)
Gzip: 19792 (9.7%)
Base64+Gzip: 39423 (19.4%)
You can test this yourself quite easily, assuming that you're willing
to pretend that the only compression algorithm in existence is
DEFLATE. Simply use a zlib implementation to compress various files
with and without base64 encoding, and see what sizes you get
afterward. I do recommend the exercise, as it's quite easy to do -
the results are valid even if you're merely using gzip on the command
line, as I did here.
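
As a rough sketch of that exercise, the following Python uses the standard library's base64 and gzip modules to take the same four measurements for any byte string. The function name size_report and the file argument are illustrative assumptions, not the commands used for the figures above. Note that b64encode emits no line breaks, so its overhead comes out nearer 133% than the 135% you get from an encoder that wraps lines at 76 columns.

```python
import base64
import gzip
import sys


def size_report(data):
    """Return byte counts for the raw, Base64, gzip and Base64+gzip forms."""
    encoded = base64.b64encode(data)  # no line wrapping, unlike many CLI tools
    return {
        "Unencoded": len(data),
        "Base64": len(encoded),
        "Gzip": len(gzip.compress(data)),
        "Base64+Gzip": len(gzip.compress(encoded)),
    }


if __name__ == "__main__":
    # Hypothetical usage: python size_report.py some_file
    with open(sys.argv[1], "rb") as handle:
        report = size_report(handle.read())
    total = report["Unencoded"] or 1  # avoid dividing by zero on an empty file
    for label, size in report.items():
        print(f"{label}: {size} ({100 * size / total:.1f}%)")
```

Running it over a few files of different kinds (text dumps, images, archives) should show how far the Base64+Gzip figure sits from the plain Gzip one in each case.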
FWIW, you can't quite get the same test to work for XMPP streams,
since you have to take blocking into account, but if you're
reasonably comfortable with C programming, it shouldn't take long to
rig up a test program for that, should it interest you. You may find
looking at the IMAP COMPRESS extension, for which quite a bit of
experimental implementation was done, informative.
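
For anyone who would rather not start in C, here is a minimal Python sketch of such a test program. It assumes that "blocking" means flushing the compressor at every stanza boundary so the peer can decode each stanza as it arrives; the stanzas themselves are made up for illustration, and compress_stanzas is a hypothetical helper name.

```python
import zlib


def compress_stanzas(stanzas):
    """Compress stanzas on one shared DEFLATE stream, flushing after each.

    Z_SYNC_FLUSH makes every chunk decodable as soon as it arrives, at the
    cost of a few extra bytes per flush.
    """
    compressor = zlib.compressobj()
    chunks = []
    for stanza in stanzas:
        chunk = compressor.compress(stanza) + compressor.flush(zlib.Z_SYNC_FLUSH)
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    # Made-up stanzas, purely for illustration.
    stanzas = [
        f"<message to='user@example.com' id='m{i}'><body>hello {i}</body></message>".encode()
        for i in range(50)
    ]
    flushed = sum(len(chunk) for chunk in compress_stanzas(stanzas))
    one_shot = len(zlib.compress(b"".join(stanzas)))
    print(f"flushed per stanza: {flushed} bytes, one shot: {one_shot} bytes")
```

Comparing the two totals gives a feel for what the per-stanza flush costs; swapping Base64 payloads into the stanza bodies would extend the comparison to encoded binary data.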
(As a brief aside, email does require in-band binary data transfer,
because the peers might not be online at the same time, so the
circumstances are radically different there).
You'll note that base64 certainly harms the compression - in this
case most likely because the Lempel-Ziv backreferencing cannot locate
matches unless they happen on a 3-octet boundary - but you certainly
still see compression way beyond mere recovery of the Base64. This is
why I suggested that Base64 was reasonable for small amounts of data,
as the additional overhead doesn't amount to much.
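
The 3-octet boundary effect can be illustrated with a small sketch. It checks whether the Base64 text of a payload still appears verbatim once the same payload is preceded by a few filler bytes; the payload, the zero-byte filler and the helper name appears_unchanged are assumptions chosen for the demonstration.

```python
import base64


def appears_unchanged(payload, offset):
    """Report whether payload's own Base64 text survives inside the encoding
    of the same payload preceded by `offset` filler bytes."""
    standalone = base64.b64encode(payload).decode("ascii")
    shifted = base64.b64encode(b"\x00" * offset + payload).decode("ascii")
    return standalone in shifted


if __name__ == "__main__":
    payload = b"repeated-phrase"  # 15 octets, so its encoding has no padding
    for offset in range(6):
        print(offset, appears_unchanged(payload, offset))
```

Only offsets that are a multiple of three leave the encoded text intact, so a repeated byte sequence can show up as up to three different character sequences after encoding, which gives the matcher less to work with.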
Counter-intuitively, for data generally considered difficult to
compress, you'll get better recovery of the Base64 overhead - I
suspect this is due to Huffman encoding being largely unaffected by
Base64. With a JFIF file:
Unencoded: 172892 (100%)
Base64: 233558 (135%)
Gzip: 168377 (97.4%)
Base64+Gzip: 172667 (99.7%)
Here, you're looking at a mere 2.3% increase in size, not
really enough to worry about. And note that incompressible files are
quite common to transfer.
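
One way to probe the Huffman suspicion is to switch off Lempel-Ziv matching altogether and see how much of the Base64 overhead entropy coding alone recovers. The sketch below uses zlib's Z_HUFFMAN_ONLY strategy on Base64-encoded random bytes; the random bytes are only a stand-in for an incompressible file such as the JFIF above, and huffman_only_size is a hypothetical helper name.

```python
import base64
import os
import zlib


def huffman_only_size(data):
    """Compress with zlib's Huffman-only strategy (no Lempel-Ziv matching)."""
    compressor = zlib.compressobj(level=9, strategy=zlib.Z_HUFFMAN_ONLY)
    return len(compressor.compress(data) + compressor.flush())


if __name__ == "__main__":
    raw = os.urandom(99_999)         # stand-in for incompressible data
    encoded = base64.b64encode(raw)  # 64-symbol alphabet, 6 payload bits per octet
    print(f"raw: {len(raw)}  Base64: {len(encoded)}  "
          f"Base64+Huffman only: {huffman_only_size(encoded)}")
```

Since Base64 spreads the input over 64 roughly equally likely symbols, a Huffman code needs about 6 bits for each 8-bit character, which is close to the 3:4 ratio needed to undo the encoding overhead.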
>> 3. Which they then abandoned in favor of "SOAP Message Transmission
>> Optimization Mechanism" (MTOM).
>
> Why do we have MTOM now? Probably DIME. Note: Optimization.
>
>
Just to pick up on this, the name of a protocol is usually a
political or marketing decision. XMPP is itself a case in point, but
a more interesting example is IMAP, which used to be "Interactive
Mail Access Protocol", but was renamed to "Internet Message Access
Protocol" some time ago, as the marketing needs changed. I wouldn't
read too much into what a protocol is called; it seems unlikely
they'd call it "Workaround For Poor Framing Of Binary Data In XML
Mechanism", after all.
Dave.
--
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
- acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
- http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade