[Standards-JIG] Still not sure ...
cmullins at winfessor.com
Thu Sep 9 18:27:42 UTC 2004
Initially 1023 bytes struck me as strange as well.
The problem, though, is what other metrics are there to use? As I came up to speed on Unicode, I found this topic to be very confusing. There are combining characters, surrogate pairs, graphemes, and many, many subtleties in what counts as a "character". These also change depending on the regional dialect the user's display system is set to (different dialects of, say, Mandarin will represent certain sequences as one grapheme, or two graphemes). The problem is frightfully complex, and once stringprep normalization rules are included it only gets more complex.
After playing with this for some time, I came to the conclusion that bytes really are the only unambiguous thing to count.
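To make the ambiguity concrete, here is a small sketch in Python (my choice for illustration, not something from the spec). The helper name `counts` and the sample strings are made up for this example. It shows that the same visible text gives different answers depending on whether you count code points, UTF-16 code units, or UTF-8 bytes, and that normalization changes the answer again.

```python
import unicodedata

def counts(s: str) -> dict:
    """Return three different 'lengths' of the same string (hypothetical helper)."""
    return {
        "code_points": len(s),                          # Python str is indexed by code point
        "utf16_units": len(s.encode("utf-16-le")) // 2, # what a 16-bit char type would count
        "utf8_bytes": len(s.encode("utf-8")),           # what the 1023-byte limit counts
    }

composed = "\u00e9"      # e-acute as a single precomposed code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
clef = "\U0001D11E"      # MUSICAL SYMBOL G CLEF, outside the BMP (surrogate pair in UTF-16)

for sample in (composed, decomposed, clef):
    print(repr(sample), counts(sample))

# The two e-acute spellings render the same but only compare equal after normalization.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

Grapheme counting is left out on purpose: the Python standard library has no grapheme segmentation, which is itself a hint at how much harder that metric is to pin down.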
In .NET land, I can say: System.Text.Encoding.UTF8.GetBytes(utf8string) and it'll return a byte array, whose length is the number to check. Java almost certainly has something similar (although Java, pre 1.5, fails to deal with surrogate pairs, so I'm not sure what the answer there is).
There are pages and pages of documentation about this on the Unicode site.
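The same encode-then-count idea can be sketched in Python. This is only an illustration under my own assumptions: the names `MAX_PART_BYTES`, `utf8_length` and `part_fits` are hypothetical, and the sketch checks the string it is handed without applying any stringprep profile first.

```python
MAX_PART_BYTES = 1023  # per-portion limit discussed in this thread

def utf8_length(text: str) -> int:
    """Number of bytes the text occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))

def part_fits(part: str) -> bool:
    """True if one JID portion is within the byte limit (no stringprep applied here)."""
    return utf8_length(part) <= MAX_PART_BYTES

# ASCII text costs one byte per character, so 1023 characters just fit.
print(part_fits("a" * 1023))
# e-acute costs two bytes each, so 512 of them already exceed the limit.
print(part_fits("\u00e9" * 512))
```

The point of the design is that the check never has to decide what a "character" is: it encodes once and compares a single integer.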
From: Matthias Wimmer [mailto:m at tthias.net]
Sent: Thu 9/9/2004 1:08 AM
To: standards-jig at jabber.org
Subject: [Standards-JIG] Still not sure ...
I am still not sure whether it was a good idea for XMPP Core 3.1 to
limit the length of each portion of a JID to 1023 bytes in UTF-8 encoding.
This might seem to be a good choice for programs using 8 bit character
types ... but it makes it hard to check if a JID is valid if you use
wide character types like wchar_t in modern C/C++ or the standard character
type of Java.
For most modern languages it seems to be easier to check the number of
characters in a string than the number of bytes in a corresponding UTF-8
Fon: +49-(0)70 0770 07770 http://web.amessage.info
HAM: DB1MW xmpp:mawis at amessage.info