[Standards-JIG] Still not sure ...

Chris Mullins cmullins at winfessor.com
Thu Sep 9 18:27:42 UTC 2004

Initially 1023 bytes struck me as strange as well. 
The problem, though, is what other metrics are there to use? As I came up to speed on Unicode, I found this topic to be very confusing. There are combining characters, surrogate pairs, graphemes, and many, many subtleties in what counts as a "character". These also change depending on the regional dialect the user's display system is set to (different dialects of, say, Mandarin will represent certain sequences as one grapheme, or two graphemes). The problem is frightfully complex, and once stringprep normalization rules are included it only gets more complex. 
After playing with this for some time, I've come to think that bytes really are the only workable way to count it. 
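To make the ambiguity concrete, here is a minimal Python sketch. The language choice, the helper name `counts`, and the sample strings are assumptions made for illustration; none of them come from this thread. It measures the same text as Unicode code points, as UTF-16 code units (roughly what a 16-bit wide character type counts), and as UTF-8 bytes.

```python
import unicodedata


def counts(s: str) -> dict:
    """Return three different 'lengths' of the same string (hypothetical helper)."""
    return {
        # Python's len() on a str counts Unicode code points.
        "code_points": len(s),
        # Each UTF-16 code unit is 2 bytes; surrogate pairs take two units.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # The metric the 1023-byte limit is expressed in.
        "utf8_bytes": len(s.encode("utf-8")),
    }


composed = "\u00e9"      # "é" as a single precomposed code point
decomposed = "e\u0301"   # "e" followed by a combining acute accent
astral = "\U0001D11E"    # musical G clef, outside the Basic Multilingual Plane

for sample in (composed, decomposed, astral):
    print(repr(sample), counts(sample))

# Normalization can change the counts: NFC folds the decomposed form
# into the precomposed one.
print(unicodedata.normalize("NFC", decomposed) == composed)
```

The two spellings of "é" render identically but measure as 1 versus 2 code points (2 versus 3 UTF-8 bytes), and the clef is 1 code point but 2 UTF-16 units and 4 UTF-8 bytes. Only the byte count is unambiguous once the encoding is fixed.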
In .NET land, I can say: System.Text.Encoding.UTF8.GetBytes(utf8string) and it'll return me a byte array, whose length is the number to check against the limit. Java almost certainly has something similar. (Although Java, pre 1.5, fails to deal with surrogate pairs, so I'm not sure what the answer there is.) 
There are pages and pages of documentation about this on the Unicode site. 
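The same byte-count check can be sketched in Python. This is only an illustration under stated assumptions: the names `MAX_JID_PART_BYTES` and `jid_part_within_limit` are hypothetical, and the input is assumed to have already gone through whatever stringprep processing applies, which this sketch does not perform.

```python
# Limit discussed in this thread for each portion of a JID, in UTF-8 bytes.
MAX_JID_PART_BYTES = 1023


def jid_part_within_limit(part: str) -> bool:
    """Check one JID portion against the byte limit (hypothetical helper).

    Assumes `part` is already stringprep-processed; only the length is checked.
    """
    # Encode to UTF-8 and count bytes, not characters.
    return len(part.encode("utf-8")) <= MAX_JID_PART_BYTES


# 1023 ASCII characters are exactly 1023 bytes, so they fit.
print(jid_part_within_limit("a" * 1023))
# 512 copies of "é" are only 512 characters but 1024 bytes, so they do not.
print(jid_part_within_limit("\u00e9" * 512))
```

The second call shows why a character count cannot stand in for the byte check: a string well under 1023 characters can still exceed 1023 bytes.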
Chris Mullins

	-----Original Message----- 
	From: Matthias Wimmer [mailto:m at tthias.net] 
	Sent: Thu 9/9/2004 1:08 AM 
	To: standards-jig at jabber.org 
	Subject: [Standards-JIG] Still not sure ...

	Hi list!
	I am still not sure whether it was a good idea that XMPP Core 3.1
	limits the length of each portion of a JID to 1023 bytes in UTF-8 encoding.
	This might seem to be a good choice for programs using 8 bit character
	types ... but it makes it hard to check if a JID is valid if you use
	wide character types like wchar_t in modern C/C++ or the standard character
	type of Java.
	For most modern languages it seems to be easier to check the number of
	characters in a string than the number of bytes in a corresponding UTF-8
	byte sequence.
	Tot kijk
	Fon: +49-(0)70 0770 07770       http://web.amessage.info
	HAM: DB1MW                      xmpp:mawis at amessage.info
