[Standards-JIG] Still not sure ...
thoutbeckers at splendo.com
Fri Sep 10 13:04:37 UTC 2004
On Fri, 10 Sep 2004 09:08:09 +0200, Jacek Konieczny <jajcus at bnet.pl> wrote:
> On Thu, Sep 09, 2004 at 01:45:51PM -0700, Chris Mullins wrote:
>> > The difference is that what has been standardized is very much
>> > dependent on some implementations while counting Unicode
>> > codepoints ("characters") would have been more universal.
>> What parts are implementation specific? That seems to me to be totally
>> implementation independent. Bytes are bytes.
>> Counting characters is actually implementation specific. And worse
>> than that, it's locale and language specific as well. Turning
>> a normalized UTF-8 string into graphemes (combining all the combining
>> characters, translating the surrogate pairs into 32-bit codepoints,
>> and so forth) or into "eye characters" is a lot more work. It's also
>> going to have different results depending on the end user's system
>> (assuming this is done client-side).
> No one asks you to count characters. IMHO what should be counted are
> Unicode code-points.
A Unicode code point is basically a Unicode character. However, that
includes characters > 0xFFFF too, so UTF-16 (used by many environments to
store Unicode internally) does not map every Unicode character to a single
UTF-16 code unit. The only encoding that does that is UTF-32.
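
As a rough illustration, here is a minimal Python 3 sketch of that point.
The helper name utf16_code_units is made up for this example; it relies
only on the standard struct module and on Python 3 strings being
sequences of code points.

```python
import struct

def utf16_code_units(text: str) -> list:
    """Return the UTF-16 code units of text as hex strings (hypothetical helper)."""
    data = text.encode("utf-16-be")   # big-endian, no byte order mark
    count = len(data) // 2            # every UTF-16 code unit is 2 bytes
    return [hex(unit) for unit in struct.unpack(">%dH" % count, data)]

for ch in ("\u00e9", "\U0001D11E"):
    # len(ch) counts code points, so it is 1 for both characters
    print(hex(ord(ch)), len(ch), utf16_code_units(ch))
```

U+00E9 fits in one code unit, while U+1D11E (above 0xFFFF) needs a
surrogate pair, so a UTF-16 "length" of 2 there still means one code
point.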
> Unicode code-point is the smallest element/entity (I
> know both have their own meaning in the XML world) of XML document.
So I'm not sure what you mean here. I'd say the smallest element in an XML
document depends on the encoding. And even then what do you count as an
entity? In a UTF-8 document you could say the smallest entity is a UTF-8
code-unit; 1 byte. In a UTF-16 document that would be 2 bytes. On the
other hand you could say "a Unicode character should be the smallest
entity". Then your entities have variable lengths in UTF-8 and UTF-16
documents, but fixed length in UTF-32 documents.
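
A small Python 3 sketch of the variable versus fixed widths, assuming
four sample characters picked only for illustration (byte_widths and
SAMPLES are names invented here, not anything from a spec):

```python
# One sample per UTF-8 width: ASCII, Latin-1 range, CJK, and above 0xFFFF.
SAMPLES = ["a", "\u00e9", "\u4e2d", "\U0001D11E"]

def byte_widths(ch: str) -> tuple:
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    # The -le variants are used so that no byte order mark is counted.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))

for ch in SAMPLES:
    print("U+%04X" % ord(ch), byte_widths(ch))
```

The UTF-8 column grows from 1 to 4 bytes, the UTF-16 column is 2 or 4,
and only the UTF-32 column stays at 4 for every character.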
> JIDs are to be checked before sending them to the wire. In most cases
> they would be represented as (set of) Unicode string(s) internally.
> Converting them to UTF-8 is io-stream responsibility, not application's.
> Requiring to count bytes requires the application to convert JID to
> UTF-8 string just to count bytes. This doesn't seem good to me.
"Characters" are platform dependent: on some platforms they are UTF-16,
some UTF-32, but also on some BIG5 etc. Thus if you use the "character
counting" mechanism of your platform, the result on another platform could
be different. But converting it to UTF-8 and counting the bytes will
always get you the same result. The alternative is choosing to convert it
to something different than UTF-8, something that more closely represents
the number of characters the user sees on his screen. UTF-16 comes a bit
closer to that, but it's really UTF-32 that covers it all. But as you can
see we still need to convert them to something, whether it's to UTF-8,
UTF-16 or UTF-32. It all depends on what we choose to settle on and what
your platform/environment default is. And if your platform/environment
does not support that conversion it *will* be your application's
responsibility (or you should use a better platform). In short, you need
more than io-streams to work properly with Unicode!
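
For what it's worth, the "convert to UTF-8 and count bytes" check is
short in an environment that has the conversion built in. A minimal
Python 3 sketch follows; the function names are invented for this
example, and the 1023-byte default is my reading of the per-part JID
limit in RFC 3920, so treat it as an assumption and check the spec.

```python
MAX_PART_BYTES = 1023  # assumed per-part limit; verify against the XMPP spec

def utf8_length(part: str) -> int:
    """Number of bytes the JID part occupies once encoded as UTF-8."""
    return len(part.encode("utf-8"))

def part_fits(part: str, limit: int = MAX_PART_BYTES) -> bool:
    """True if the UTF-8 encoding of part is within the byte limit."""
    return utf8_length(part) <= limit

ascii_name = "a" * 400       # 1 byte per character in UTF-8
cjk_name = "\u4e2d" * 400    # 3 bytes per character in UTF-8

print(utf8_length(ascii_name), part_fits(ascii_name))
print(utf8_length(cjk_name), part_fits(cjk_name))
```

The result is the same on every platform, whatever the internal string
representation is, but the same 400 characters pass or fail depending
on the script they are written in.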
> And servers which treat the XMPP stream as a stream of bytes and not a
> stream of Unicode codepoints are broken. E.g. jabberd 1.4.x is broken -- it
> doesn't check its input to be valid UTF-8 (or fails to do it properly) and
> sometimes passes invalid characters through -- making the recipient
> server or client disconnect instead of the (broken) sender's client.
> Unfortunately it is probably too late to fix XMPP specification now...
> But the specification seems even more incorrect, because it says nothing
> about encoding where it says about byte limit on each part of the JID.
> We may guess it should be UTF-8, but in specification it should be
> stated clearly. It's a pity we didn't notice and discuss that
For some reason XMPP mandates the *use* of UTF-8, so no need to guess
there. This compared to the XML specification that mandates *support* for
UTF-8 and UTF-16 and allows for other encodings. There will probably be a
few million or billion Chinese, Indians, Japanese, Koreans, etc. who won't
be too happy to find that aside from having to send a lot more data than
they should, they also have to pick usernames with fewer characters. (I'm
not taking a stand in the politics of this, I'm just predicting they won't