[Standards-JIG] Still not sure ...

Tijl Houtbeckers thoutbeckers at splendo.com
Fri Sep 10 13:04:37 UTC 2004

On Fri, 10 Sep 2004 09:08:09 +0200, Jacek Konieczny <jajcus at bnet.pl> wrote:

> On Thu, Sep 09, 2004 at 01:45:51PM -0700, Chris Mullins wrote:
>> > The difference is that what has been standardized is very much
>> > dependent on some implementations while counting unicode
>> > codepoints ("characters") would have been more universal.
>> What parts are implementation specific? That seems to me to be totally
>> implementation independent. Bytes are bytes.
>> Counting characters is actually implementation specific. And worse
>> than that, it's locale and language specific as well. Turning
>> a normalized UTF-8 string into graphemes (combining all the combining
>> characters, translating the surrogate pairs into 32-bit codepoints,
>> and so forth) or into "eye characters" is a lot more work. It's also
>> going to have different results depending on the end-user's system
>> (assuming this is done client-side).
> No one asks you to count characters. IMHO what should be counted are
> Unicode code-points.

A Unicode code point is basically a Unicode character. However, that
includes characters > 0xFFFF too. So UTF-16 (used by many environments to
store Unicode internally) does not map every Unicode character to a single
UTF-16 code unit. The only encoding that does that is UTF-32.
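
To make the surrogate pair point concrete, here is a small sketch. It
assumes Python 3 (where a str is a sequence of code points), and
utf16_code_units is a helper name I made up for the illustration. It
shows one character above 0xFFFF turning into two UTF-16 code units.

```python
import struct

def utf16_code_units(text):
    """Return the 16-bit code units of text when encoded as UTF-16."""
    raw = text.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(">%dH" % (len(raw) // 2), raw))

# U+1D11E (MUSICAL SYMBOL G CLEF) lies above 0xFFFF.
clef = "\U0001D11E"
units = utf16_code_units(clef)

# One code point, but two code units (a surrogate pair) in UTF-16.
print(len(clef), [hex(u) for u in units])
```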

> Unicode code-point is the smallest element/entity (I
> know both have their own meaning in the XML world) of XML document.

So I'm not sure what you mean here. I'd say the smallest element in an XML
document depends on the encoding. And even then, what do you count as an
entity? In a UTF-8 document you could say the smallest entity is a UTF-8
code unit: 1 byte. In a UTF-16 document that would be 2 bytes. On the
other hand you could say "a Unicode character should be the smallest
entity". Then your entities have variable lengths in UTF-8 and UTF-16
documents, but a fixed length in UTF-32 documents.
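
As a rough illustration of the variable versus fixed lengths, the sketch
below (Python 3 assumed, encoded_widths is a hypothetical helper name)
prints how many bytes a single character occupies in each encoding. The
four sample characters are my own picks.

```python
ENCODINGS = ("utf-8", "utf-16-le", "utf-32-le")  # explicit byte order, so no BOM is added

def encoded_widths(ch):
    """Map each encoding name to the byte length of one character."""
    return {enc: len(ch.encode(enc)) for enc in ENCODINGS}

# Latin "a", e-acute, a CJK ideograph, and a character above 0xFFFF.
for ch in ("a", "\u00e9", "\u4e2d", "\U0001D11E"):
    print("U+%04X" % ord(ch), encoded_widths(ch))
```

Only the UTF-32 column stays constant. The UTF-8 column grows from 1 to
4 bytes, and the UTF-16 column jumps from 2 to 4 for the last character.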

> JIDs are to be checked before sending them to the wire. In most cases
> they would be represented as (set of) Unicode string(s) internally.
> Converting them to UTF-8 is io-stream responsibility, not application's.
> Requiring to count bytes requires the application to convert JID to
> UTF-8 string just to count bytes. This doesn't seem good to me.

"Characters" are platform dependent: on some platforms they are UTF-16,
on some UTF-32, but on others BIG5 etc. Thus if you use the "character
counting" mechanism of your platform, the result on another platform could
be different. But converting the string to UTF-8 and counting the bytes
will always get you the same result. The alternative is to convert it
to something other than UTF-8, something that more closely represents
the number of characters the user sees on his screen. UTF-16 comes a bit
closer to that, but it's really UTF-32 that covers it all. But as you can
see we still need to convert to something, whether it's UTF-8,
UTF-16 or UTF-32. It all depends on what we choose to settle on and what
your platform/environment default is. And if your platform/environment
does not support that conversion it *will* be your application's
responsibility (or you should use a better platform). In short, you need
more than io-streams to work properly with Unicode!
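
A minimal sketch of the "convert to UTF-8 and count the bytes" check,
assuming Python 3. The names utf8_length and jid_part_fits are made up
for this example, and the 1023 byte limit per JID part is my reading of
the XMPP core specification, so treat it as an assumption and verify it
against the spec text.

```python
# Assumed per-part limit for a JID (node, domain, resource); verify against the spec.
MAX_JID_PART_BYTES = 1023

def utf8_length(text):
    """Length of text in bytes once encoded as UTF-8."""
    return len(text.encode("utf-8"))

def jid_part_fits(part, limit=MAX_JID_PART_BYTES):
    """True if one JID part stays within the byte limit."""
    return utf8_length(part) <= limit

print(utf8_length("jos\u00e9"))     # 4 code points, 5 bytes
print(jid_part_fits("a" * 1023))    # exactly at the limit
print(jid_part_fits("a" * 1024))    # one byte over
```

The result does not depend on how the platform stores strings
internally, which is the whole point of counting after the conversion.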

> And servers which treat the XMPP stream as a stream of bytes and not a
> stream of Unicode codepoints are broken. E.g. jabberd 1.4.x is broken -- it
> doesn't check its input to be valid UTF-8 (or fails to do it properly) and
> sometimes passes invalid characters through -- making the recipient
> server or client disconnect instead of the (broken) sender's client.
> Unfortunately it is probably too late to fix the XMPP specification now... :-(
> But the specification seems even more incorrect, because it says nothing
> about encoding where it states the byte limit on each part of the JID.
> We may guess it should be UTF-8, but in the specification it should be
> stated clearly. It's a pity we didn't notice and discuss that
> earlier.

For some reason XMPP mandates the *use* of UTF-8, so no need to guess
there. This is in contrast to the XML specification, which mandates
*support* for UTF-8 and UTF-16 and allows for other encodings. There will
probably be a few million or billion Chinese, Indians, Japanese, Koreans,
etc. who won't be too happy to find that, aside from having to send a lot
more data than they should (most of their characters take three bytes in
UTF-8 against two in UTF-16), they also have to pick usernames with fewer
characters. (I'm not taking a stand in the politics of this, I'm just
predicting they won't be happy)
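
To put rough numbers on "fewer characters", here is a sketch (Python 3
assumed). The helper name max_chars and the sample characters are my
own, and the 1023 byte limit is again my reading of the XMPP core
specification rather than something stated in this thread.

```python
def max_chars(sample_char, byte_limit=1023):
    """How many copies of sample_char fit in byte_limit bytes of UTF-8."""
    return byte_limit // len(sample_char.encode("utf-8"))

samples = {
    "Latin a": "a",
    "Latin e-acute": "\u00e9",
    "Devanagari ha": "\u0939",
    "CJK ideograph": "\u4e2d",
    "Hiragana a": "\u3042",
    "Hangul han": "\ud55c",
}

for name, ch in samples.items():
    print(name, max_chars(ch))
```

Under that assumed limit a name made of such three byte characters gets
about a third of the length available to a plain ASCII name.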
