[Standards-JIG] Still not sure ...

David Waite dwaite at gmail.com
Fri Sep 10 13:20:31 UTC 2004


On Fri, 10 Sep 2004 09:08:09 +0200, Jacek Konieczny <jajcus at bnet.pl> wrote:
> No one asks you to count characters. IMHO what should be counted are
> Unicode code-points. A Unicode code-point is the smallest element/entity (I
> know both have their own meaning in the XML world) of an XML document.
> JIDs are to be checked before sending them to the wire. In most cases
> they would be represented as (a set of) Unicode string(s) internally.
> Converting them to UTF-8 is the io-stream's responsibility, not the
> application's. Requiring a byte count requires the application to convert
> the JID to a UTF-8 string just to count bytes. This doesn't seem good to me.

The requirement for the internal representation of unicode is not set
by the spec, while there are requirements for external representation
(wire protocol). It makes sense that these limitations would be
represented in terms of that wire protocol and not of an arbitrarily
chosen user representation.
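
To illustrate, here is a rough Python sketch (the sample node identifier
is made up for the example). The same string has a different size in each
internal representation an implementation might pick, while its UTF-8
wire form has exactly one size:

```python
# Hypothetical node identifier with non-ASCII characters (illustration only).
node = "za\u017c\u00f3\u0142\u0107"

code_points = len(node)                      # a Python str is a sequence of code points
utf8_bytes = len(node.encode("utf-8"))       # size on the XMPP wire
utf16_bytes = len(node.encode("utf-16-le"))  # size in a UTF-16 internal buffer
utf32_bytes = len(node.encode("utf-32-le"))  # size in a UCS-4 internal buffer

print(code_points, utf8_bytes, utf16_bytes, utf32_bytes)  # 6 10 12 24
```

Only the UTF-8 figure is something two independent implementations are
guaranteed to agree on without first agreeing on an internal format.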

Also, if you look at SMTP, DNS, HTTP, and so on, all required and
recommended limits are set in terms of bytes. Internationalized
domain names are limited in terms of bytes, and such domain names are
already part of a JID. So even if a JID could somehow be retrofitted
to consider username and resource in terms of codepoints, the IDN
would still be in terms of bytes. Doesn't sound like that would make
things easier.
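
As a small sketch of the IDN point, using Python's built-in "idna" codec
(IDNA 2003) and a made-up domain: the DNS limit of 63 octets per label
applies to the encoded (ACE) form, so it is a byte limit, not a
code-point limit.

```python
# Hypothetical internationalized domain (illustration only).
domain = "b\u00fccher.example"

# The stdlib "idna" codec applies nameprep and Punycode to each label.
ace = domain.encode("idna")
labels = ace.split(b".")

MAX_LABEL_OCTETS = 63  # DNS label limit, counted in octets

for label in labels:
    print(label, len(label), len(label) <= MAX_LABEL_OCTETS)
```

The first label has 6 code points in its Unicode form but 13 octets once
encoded, and it is the latter that the DNS limit counts.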

> And servers which treat the XMPP stream as a stream of bytes and not a
> stream of Unicode codepoints are broken. E.g. jabberd 1.4.x is broken -- it
> doesn't check its input to be valid UTF-8 (or fails to do it properly) and
> sometimes passes invalid characters through -- making the recipient
> server or client disconnect instead of the (broken) sender's client.

That a server implementation does not properly support UTF-8 does not
mean that all servers which handle UTF-8 rather than UTF-16/UCS-4
internally are broken. And my understanding was that the acceptance of
invalid UTF-8 was fixed a while ago with a switch to a newer version
of expat.
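
For what it's worth, checking that input is well-formed UTF-8 does not
force a server to keep anything other than bytes around. A minimal sketch
in Python (the function name is mine; this is not how jabberd or expat do
it):

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8, using strict decoding."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


good = "jaj\u00e7us".encode("utf-8")  # made-up sample text
bad_continuation = b"\xc3\x28"         # lead byte followed by a non-continuation byte
overlong = b"\xc0\xaf"                 # overlong encoding of "/", not legal UTF-8

print(is_valid_utf8(good), is_valid_utf8(bad_continuation), is_valid_utf8(overlong))
```

The decoded result is thrown away; the server can go on routing the
original byte buffer untouched.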

There is nothing about the Jabber or XMPP protocol which requires
character-level comprehension of data in order to route packets,
_other than_ the requirement that JIDs be normalized, and one of the
more popular normalization libraries (libidn) works on UTF-8 byte
arrays.

> Unfortunately it is probably too late to fix the XMPP specification now...  :-(
> But the specification seems even more incorrect, because it says nothing
> about encoding where it states the byte limit on each part of the JID.
> We may guess it should be UTF-8, but in the specification it should be
> stated clearly. It's a pity we didn't notice and discuss that
> earlier.

The wire representation MUST be UTF-8 (core section 11.5), and a JID
limitation is only specified to aid in interoperable implementations.
IMHO it is pretty clearly a limitation in terms of bytes of
UTF-8-encoded codepoints.
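
Read that way, the check is cheap to write. A sketch in Python, assuming
the per-part limit is 1023 bytes as I read the core spec (verify the
number against your copy; the helper name is made up):

```python
# Assumed per-part limit from the XMPP core spec; check it before relying on it.
MAX_PART_BYTES = 1023


def part_fits(part: str) -> bool:
    """True if one JID part (node, domain, or resource) fits the byte limit."""
    return len(part.encode("utf-8")) <= MAX_PART_BYTES


ascii_part = "a" * 1023          # 1023 code points, 1023 bytes
two_byte_part = "\u017c" * 512   # 512 code points, 1024 bytes

print(part_fits(ascii_part), part_fits(two_byte_part))
```

The second part has about half as many code points as the first yet is
the one that fails, which is why the unit of the limit needs to be
unambiguous.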

-David Waite


