[Standards-JIG] Still not sure ...

Jacek Konieczny jajcus at bnet.pl
Fri Sep 10 07:08:09 UTC 2004

On Thu, Sep 09, 2004 at 01:45:51PM -0700, Chris Mullins wrote:
> > The difference is that what has been standardized is very much 
> > dependent on some implementations while counting Unicode
> > codepoints ("characters") would have been more universal.
> What parts are implementation specific? That seems to me to be totally
> implementation independent. Bytes are bytes.
> Counting characters is actually implementation specific. And worse
> than that, it's locale and language specific as well. Turning
> a normalized UTF-8 string into graphemes (combining all the combining
> characters, translating the surrogate pairs into 32-bit codepoints,
> and so forth) or into "eye characters" is alot more work. It's also
> going to have different results depending on the end-users system
> (assuming this is done client-side). . 

No one asks you to count characters. IMHO what should be counted are
Unicode code points. A Unicode code point is the smallest element/entity
(I know both have their own meaning in the XML world) of an XML document.
JIDs are to be checked before sending them to the wire. In most cases
they would be represented as (a set of) Unicode string(s) internally.
Converting them to UTF-8 is the I/O stream's responsibility, not the
application's. Requiring a byte count forces the application to convert
the JID to a UTF-8 string just to count bytes. This doesn't seem good to me.
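
To illustrate the difference, here is a minimal sketch in Python (picked
only for illustration). The function names are hypothetical and not taken
from any XMPP library, and the 1023 default is the per-part byte limit as
I read it in RFC 3920, so treat it as an assumption:

```python
# Sketch: two ways of measuring one part of a JID.
# All names here are hypothetical, chosen for illustration only.

def codepoint_length(part: str) -> int:
    """Count Unicode code points; a Python 3 str is a code point sequence."""
    return len(part)


def utf8_byte_length(part: str) -> int:
    """Count bytes on the wire; this needs an explicit encode step."""
    return len(part.encode("utf-8"))


def within_byte_limit(part: str, limit: int = 1023) -> bool:
    # The default limit is an assumption (see the note above the code).
    return utf8_byte_length(part) <= limit


node = "za\u017c\u00f3\u0142\u0107"  # "zażółć": 6 code points, 4 non-ASCII
print(codepoint_length(node))  # 6
print(utf8_byte_length(node))  # 10
```

The code point count comes straight from the internal string, while the
byte count exists only after encoding, which is the extra conversion I am
complaining about.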

And servers which treat the XMPP stream as a stream of bytes and not a
stream of Unicode code points are broken. E.g. jabberd 1.4.x is broken --
it doesn't check that its input is valid UTF-8 (or fails to do it properly)
and sometimes passes invalid characters through -- making the recipient
server or client disconnect instead of the (broken) sender's client.
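
The kind of check I mean can be sketched as follows, again in Python and
again with a hypothetical function name. It only tests whether the bytes
are well-formed UTF-8; it does not check which characters XML itself
allows, and it is not how any particular server does it:

```python
# Sketch: reject input that is not well-formed UTF-8 before passing it on.
# is_valid_utf8 is a hypothetical helper, not part of any XMPP server.

def is_valid_utf8(data: bytes) -> bool:
    """Return True only if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("za\u017c".encode("utf-8")))  # True
print(is_valid_utf8(b"\xc3\x28"))                 # False: bad continuation byte
```

On a real stream a multi-byte sequence may be split across two reads, so
an incremental decoder (e.g. the one returned by Python's
codecs.getincrementaldecoder("utf-8")) would be a better fit than
decoding each chunk on its own.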

Unfortunately it is probably too late to fix the XMPP specification now...  :-(
But the specification seems even more incorrect, because it says nothing
about the encoding where it states the byte limit on each part of the JID.
We may guess it should be UTF-8, but in a specification it should be
stated clearly. It's a pity we didn't notice and discuss that.

But that is not a big protocol flaw and if it cannot be fixed now then
let it stay.

