[Standards-JIG] Re: Still not sure ...
stpeter at jabber.org
Mon Sep 13 16:46:56 UTC 2004
In article <20040910070809.GB18591 at serwis2.beta>,
Jacek Konieczny <jajcus at bnet.pl> wrote:
> On Thu, Sep 09, 2004 at 01:45:51PM -0700, Chris Mullins wrote:
> > > The difference is that what has been standardized is very much
> > > dependent on some implementations while counting Unicode
> > > codepoints ("characters") would have been more universal.
> > What parts are implementation specific? That seems to me to be totally
> > implementation independent. Bytes are bytes.
> > Counting characters is actually implementation specific. And worse
> > than that, it's locale and language specific as well. Turning
> > a normalized UTF-8 string into graphemes (combining all the combining
> > characters, translating the surrogate pairs into 32-bit codepoints,
> > and so forth) or into "eye characters" is a lot more work. It's also
> > going to have different results depending on the end user's system
> > (assuming this is done client-side).
> No one is asking you to count characters. IMHO what should be counted are
> Unicode code points. A Unicode code point is the smallest element/entity (I
> know both have their own meaning in the XML world) of an XML document.
> JIDs are to be checked before sending them to the wire. In most cases
> they would be represented as (set of) Unicode string(s) internally.
> Converting them to UTF-8 is the I/O stream's responsibility, not the
> application's. Requiring a byte count forces the application to convert
> the JID to a UTF-8 string just to count bytes. This doesn't seem good to me.
> And servers which treat the XMPP stream as a stream of bytes and not a stream
> of Unicode code points are broken. E.g. jabberd 1.4.x is broken -- it doesn't
> check that its input is valid UTF-8 (or fails to do it properly) and
> sometimes passes invalid characters through -- making the recipient
> server or client disconnect instead of the (broken) sender's client.
> Unfortunately it is probably too late to fix the XMPP specification now... :-(
> But the specification seems even more incorrect, because it says nothing
> about encoding where it states the byte limit on each part of the JID.
> We may guess it should be UTF-8, but in the specification it should be
> stated clearly. It's a pity we didn't notice and discuss that earlier.
> But that is not a big protocol flaw and if it cannot be fixed now then
> let it stay.
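
To make the distinction in the quoted discussion concrete, here is a minimal Python sketch comparing the two ways of measuring one part of a JID. It assumes the byte limit applies to the UTF-8 encoding (the guess made above, not something the quoted text confirms) and uses 1023 bytes as the per-part limit; the function names and the sample node are hypothetical, chosen only for illustration.

```python
# Sketch only: assumes the per-part JID limit is measured on UTF-8 bytes.
MAX_PART_BYTES = 1023  # assumed per-part limit (node, domain, or resource)


def codepoint_length(part: str) -> int:
    """Number of Unicode code points (Python str is a code point sequence)."""
    return len(part)


def utf8_length(part: str) -> int:
    """Number of bytes once the part is encoded as UTF-8."""
    return len(part.encode("utf-8"))


def part_within_limit(part: str) -> bool:
    """Check the byte limit; note this requires encoding the part first."""
    return utf8_length(part) <= MAX_PART_BYTES


node = "zażółć"  # hypothetical node identifier with non-ASCII letters
print(codepoint_length(node), utf8_length(node))
```

For this sample the two counts differ, because each of the four accented letters takes two bytes in UTF-8. A part made of 512 such letters is well under 1023 code points yet exceeds 1023 bytes, which is the practical gap between the two positions in the thread.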
The XMPP specs are Proposed Standards. As soon as the RFC numbers are
issued (should be any day now), we'll start gathering implementation
experience in the XMPP WG and figuring out if certain things need to be
documented more clearly in the specs before they advance to Draft
Standards in the IETF's process. So that would be an appropriate time to
bring up concerns such as these. The specs are *not* fixed in stone yet
and people *will* have another chance to suggest improvements, but (1)
the venue for discussing them is the xmppwg at jabber.org list and (2) the
time is not now (no, please don't post more about this stuff while we
are so close to getting RFC numbers, it just makes my life harder),
although you are free to modify the xmppimp wiki page if you would like
to note the topic: