[Standards] Binary data over XMPP
dave at cridland.net
Tue Nov 6 16:21:15 UTC 2007
On Tue Nov 6 15:25:44 2007, Tomasz Sterna wrote:
> On 06-11-2007, Tue at 14:56 +0000, Dave Cridland wrote:
> > I'm not following something. So encode the octets #x00 #x01 #x02
> > #x5D #x3E, and tell me what you get.
> Like this:
> Binary <-> Encoded
> 0x00 <-> 0xC4, 0x80
> 0x01 <-> 0xC4, 0x81
Ah, okay - so you're adding 0x100 to these. I thought this would
yield 3-octet characters, hence my confusion.
> 0x20 <-> 0x20
> 0x21 <-> 0x21
> 0x7F <-> 0x7F
> 0x80 <-> 0xC2, 0x80
> 0xFF <-> 0xC3, 0xBF
> > I get three bytes that are not legal in a CDATA section, followed
> > by a sequence of bytes which decode (via UTF-8) to "]]>", which
> > in turn would end the CDATA section.
> Good point.
> We either transfer this chunk with &...; escaping, or just transcode
> the 0x5D bytes to a 2-byte UTF-8 character. (Maybe '>' to '»' :)
Or add 0x100 again. (I checked this time, 0x5D encodes to 0xC5 0x9D).
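
The mapping discussed so far can be sketched in Python. This is only an
illustration under stated assumptions: every octet below 0x20 is shifted by
0x100 (the table only shows 0x00 and 0x01), 0x5D is shifted too so that "]]>"
can never appear, and all other octets map to the code point of the same
value. The names SHIFT, SHIFTED, encode and decode are made up for this
sketch and do not come from any XMPP library.

```python
# Sketch of the "add 0x100" mapping; assumptions are noted inline.
SHIFT = 0x100
# Assumption: all octets below 0x20, plus 0x5D (']'), get shifted.
SHIFTED = set(range(0x20)) | {0x5D}


def encode(data: bytes) -> bytes:
    """Map each octet to a code point, then UTF-8 encode the result."""
    chars = (chr(b + SHIFT) if b in SHIFTED else chr(b) for b in data)
    return "".join(chars).encode("utf-8")


def decode(encoded: bytes) -> bytes:
    """Reverse the mapping: code points at or above 0x100 were shifted."""
    out = bytearray()
    for ch in encoded.decode("utf-8"):
        cp = ord(ch)
        out.append(cp - SHIFT if cp >= SHIFT else cp)
    return bytes(out)


if __name__ == "__main__":
    sample = bytes([0x00, 0x01, 0x02, 0x5D, 0x3E])
    print(encode(sample).hex(" "))  # c4 80 c4 81 c4 82 c5 9d 3e
    print(decode(encode(sample)) == sample)
```

Because 0x5D is never emitted as a single octet, and UTF-8 continuation
octets all lie in 0x80-0xBF, the encoded stream cannot contain "]]>".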
However, using this technique, truly random data will expand by -
roughly - 60.5%. Base64 beats this, at only 33%. There's only 101
octets that are legal single-byte UTF-8 octets that we can allow
safely in CDATA sections, by my count, so that leaves 155 that are
encoded as two octets each.
Base64 operates by encoding 6 bits into an alphabet of 64 symbols;
encoding 7 bits needs an alphabet of 2^7, or 128 symbols, and would
give us growth of 14.2% - we don't have 128 symbols to play with,
though. We could choose an additional 17 double-octet symbols, in
which case we'd see growth of 20.5% overall. Slightly better than
So we'd encode each 7 bits using an alphabet of #x9 | #xA | #xD |
[#x20-#x3D] | [#x3F-#x5C] | [#x5E-#x111], which would then be UTF-8
encoded, and be roughly 90% of the size of base64.
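
Some of the growth figures in this message can be reproduced with a few
lines of Python. The sketch below assumes uniformly random input, so every
symbol of the alphabet is equally likely; the function name growth and its
parameters are hypothetical, chosen only for this illustration.

```python
from fractions import Fraction


def growth(bits_per_symbol: int, single_octet: int, double_octet: int) -> Fraction:
    """Fractional size increase for an alphabet in which each symbol
    carries bits_per_symbol bits of input and is UTF-8 encoded as
    either one or two octets. Assumes all symbols are equally likely."""
    symbols = single_octet + double_octet
    avg_octets = Fraction(single_octet + 2 * double_octet, symbols)
    # 8 input bits per octet, bits_per_symbol input bits per output symbol.
    return avg_octets * Fraction(8, bits_per_symbol) - 1


if __name__ == "__main__":
    # One symbol per input octet: 101 single-octet and 155 double-octet.
    print(float(growth(8, 101, 155)))  # 0.60546875
    # Base64: 64 single-octet symbols, 6 bits each.
    print(growth(6, 64, 0))            # 1/3
    # Ideal 7-bit alphabet of 128 single-octet symbols.
    print(growth(7, 128, 0))           # 1/7
```

The same function can be called with other single/double splits to compare
candidate alphabets before committing to one.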
However, I think you need to factor in the overhead that no
encoder/decoder library exists for this, and each individual
implementation would have to code one, (or wait for someone else to
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade