[Standards] Binary data over XMPP

Dave Cridland dave at cridland.net
Tue Nov 6 16:21:15 UTC 2007


On Tue Nov  6 15:25:44 2007, Tomasz Sterna wrote:
> On Tue, 06-11-2007 at 14:56 +0000, Dave Cridland wrote:
> > I'm not following something. So encode the octets #x00 #x01 #x02
> > #x5D #x3E, and tell me what you get.
> 
> Like this:
> 
> Binary <-> Encoded
> 0x00 <-> 0xC4, 0x80
> 0x01 <-> 0xC4, 0x81
> ...

Ah, okay - so you're adding 0x100 to these. I thought this would  
yield 3-octet characters, hence my confusion.


> 0x20 <-> 0x20
> 0x21 <-> 0x21
> ..
> 0x7F <-> 0x7F
> 0x80 <-> 0xC2, 0x80
> ..
> 0xFF <-> 0xC3, 0xBF
> 
> 
Right.



> > I get three bytes that are not legal in a CDATA section, followed by
> > a sequence of bytes which decode (via UTF-8) to "]]>", which in turn
> > would end the CDATA section.
> 
> Good point.
> We either transfer this chunk using &...; escaping, or just transcode
> the 0x3E or 0x5D bytes to 2-byte UTF-8 characters. (Maybe '>' to '»' :)
> 
> 
Or add 0x100 again. (I checked this time, 0x5D encodes to 0xC5 0x9D).
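
For reference, the mapping discussed so far can be sketched in Python
roughly as follows. This is only an illustration: the names
encode_octets and decode_text are hypothetical, and the set of shifted
octets (everything below 0x20, plus 0x5D and 0x3E) is an assumption read
off the quoted table and the "]]>" discussion, not something the thread
pins down.

```python
# Octets shifted up by 0x100 before UTF-8 encoding (assumed set): all C0
# controls, plus ']' (0x5D) and '>' (0x3E) so that "]]>" can never appear.
UNSAFE = set(range(0x20)) | {0x3E, 0x5D}
OFFSET = 0x100


def encode_octets(data: bytes) -> bytes:
    """Map each octet to a code point and return the UTF-8 encoding."""
    chars = (chr(b + OFFSET) if b in UNSAFE else chr(b) for b in data)
    return "".join(chars).encode("utf-8")


def decode_text(encoded: bytes) -> bytes:
    """Reverse encode_octets: shifted code points lose the 0x100 offset."""
    out = bytearray()
    for ch in encoded.decode("utf-8"):
        cp = ord(ch)
        out.append(cp - OFFSET if cp >= OFFSET else cp)
    return bytes(out)


sample = bytes([0x00, 0x01, 0x02, 0x5D, 0x3E])
wire = encode_octets(sample)
print(wire.hex(" "))  # c4 80 c4 81 c4 82 c5 9d c4 be
```

Octets 0x80-0xFF are left as code points U+0080-U+00FF, which UTF-8
already encodes as two octets (0xC2 0x80 through 0xC3 0xBF), matching
the quoted table.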

However, using this technique, truly random data will expand by
roughly 62%. Base64 beats this, at only 33%. There are only 97 octets
that are legal single-byte UTF-8 octets that we can safely allow in
CDATA sections, by my count (#x9, #xA, #xD and #x20-#x7F, less ']' and
'>'), so that leaves 159 that are double-byte.
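
A quick way to check this kind of estimate, as a sketch only. The set
SAFE_SINGLE and the function expected_expansion are hypothetical names,
and the definition of "safe" (tab, LF, CR and 0x20-0x7F, minus ']' and
'>') is an assumption; change the set and the percentage moves with it.

```python
# Single-octet values assumed safe inside a CDATA section: tab, LF, CR and
# 0x20-0x7F, minus ']' and '>' (to rule out "]]>").
SAFE_SINGLE = ({0x09, 0x0A, 0x0D} | set(range(0x20, 0x80))) - {0x3E, 0x5D}


def expected_expansion(single_count: int, total: int = 256) -> float:
    """Fractional growth for uniformly random octets (0.5 means +50%).

    single_count octet values cost one octet on the wire; the remaining
    values cost two.
    """
    double_count = total - single_count
    return (single_count + 2 * double_count) / total - 1


BASE64_GROWTH = 4 / 3 - 1  # every 3 input octets become 4 output characters

print(len(SAFE_SINGLE), "single-octet values assumed safe")
print(f"shifted encoding: {expected_expansion(len(SAFE_SINGLE)):.1%}")
print(f"base64:           {BASE64_GROWTH:.1%}")
```

The comparison only holds for uniformly random input; mostly-ASCII
payloads would expand far less under the shifted encoding.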

Base64 operates by encoding 6 bits into an alphabet of 64 symbols;
encoding 7 bits needs an alphabet of 2^7, or 128 symbols, and would
give us growth of 14.3% if every symbol were a single octet. We don't
have 128 single-octet symbols to play with, though. The alphabet below
has 97 single-octet symbols, so we'd have to choose an additional 31
double-octet symbols, in which case random data would grow by about
42% overall. Slightly worse than base64.

So we'd encode each 7 bits using an alphabet of #x9 | #xA | #xD |
[#x20-#x3D] | [#x3F-#x5C] | [#x5E-#x9E], which would then be UTF-8
encoded, and be roughly 106% of the size of base64.

On top of that, I think you need to factor in the overhead that no
encoder/decoder library exists for this, so each individual
implementation would have to code one (or wait for someone else to
do so).
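
To give a feel for what such an encoder/decoder would involve, here is
a minimal Python sketch of the 7-bit packing. It is not an existing
library: the names build_alphabet, encode7, decode7 and expected_growth
are hypothetical, and it assumes the alphabet is simply the first 128
code points of the set listed above. The growth figure it prints
depends entirely on how many alphabet entries need two octets.

```python
def build_alphabet():
    """First 128 code points of the set listed in the post (assumption)."""
    candidates = [0x09, 0x0A, 0x0D]
    candidates += range(0x20, 0x3E)   # ' ' to '=', skipping '>'
    candidates += range(0x3F, 0x5D)   # '?' to '\', skipping ']'
    candidates += range(0x5E, 0x112)  # '^' onwards
    # Note: XML parsers normalize CR to LF, so #xD would need replacing
    # in practice.
    return candidates[:128]


ALPHABET = build_alphabet()
INDEX = {cp: i for i, cp in enumerate(ALPHABET)}


def encode7(data: bytes) -> bytes:
    """Pack the input into 7-bit groups, one alphabet symbol per group."""
    acc = bits = 0
    out = []
    for octet in data:
        acc = (acc << 8) | octet
        bits += 8
        while bits >= 7:
            bits -= 7
            out.append(chr(ALPHABET[(acc >> bits) & 0x7F]))
        acc &= (1 << bits) - 1
    if bits:  # left-align the leftover bits in a final, zero-padded symbol
        out.append(chr(ALPHABET[(acc << (7 - bits)) & 0x7F]))
    return "".join(out).encode("utf-8")


def decode7(encoded: bytes) -> bytes:
    """Reverse encode7; fewer than 8 trailing padding bits are dropped."""
    acc = bits = 0
    out = bytearray()
    for ch in encoded.decode("utf-8"):
        acc = (acc << 7) | INDEX[ord(ch)]
        bits += 7
        if bits >= 8:
            bits -= 8
            out.append((acc >> bits) & 0xFF)
            acc &= (1 << bits) - 1
    return bytes(out)


def expected_growth() -> float:
    """Growth on uniformly random input: mean octets per symbol times 8/7."""
    mean = sum(len(chr(cp).encode("utf-8")) for cp in ALPHABET) / len(ALPHABET)
    return mean * 8 / 7 - 1


payload = bytes(range(256))
assert decode7(encode7(payload)) == payload
print(f"expected growth on random data: {expected_growth():.1%}")
```

No length prefix is needed in this sketch: n octets always produce
ceil(8n/7) symbols, and the decoder recovers exactly floor(7m/8) octets
from m symbols, so the padding bits never form an extra octet.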

Dave.
-- 
Dave Cridland - mailto:dave at cridland.net - xmpp:dwd at jabber.org
  - acap://acap.dave.cridland.net/byowner/user/dwd/bookmarks/
  - http://dave.cridland.net/
Infotrope Polymer - ACAP, IMAP, ESMTP, and Lemonade


