[Standards-JIG] Still not sure ...
dwaite at gmail.com
Thu Sep 9 21:06:39 UTC 2004
Regardless of whether a different code-point representation could be
standardized as more universal across potential implementation
environments, the wire protocol requires UTF-8 support. It makes
sense that, if a size requirement is to be instituted, it would be
expressed in terms of the wire protocol and its wire encoding format,
rather than by specifying additional requirements to support other
character encodings and/or normalization requirements.
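
To make the units under discussion concrete, here is a minimal sketch in Python. The helper name `measure` and the sample strings are purely illustrative assumptions, not part of any JEP or library. It shows that the same text has different sizes depending on whether you count UTF-8 bytes, Unicode code points, or UTF-16 code units, and that canonical decomposition (NFD) changes all of those counts for text that displays identically.

```python
import unicodedata


def measure(text: str) -> dict:
    """Return the size of `text` under three different counting rules."""
    return {
        # Size on the wire, since the wire encoding is UTF-8.
        "utf8_bytes": len(text.encode("utf-8")),
        # Python strings are sequences of Unicode code points.
        "code_points": len(text),
        # Size in UTF-16 code units (surrogate pairs count as two).
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# "e with acute" as a single precomposed code point (U+00E9).
composed = "\u00e9"
# The same visible character as "e" followed by combining acute (U+0301).
decomposed = unicodedata.normalize("NFD", composed)
# A code point outside the Basic Multilingual Plane (U+1D11E).
clef = "\U0001D11E"

for label, sample in (("composed", composed),
                      ("decomposed", decomposed),
                      ("clef", clef)):
    print(label, measure(sample))

# Decoding the UTF-8 bytes gives back exactly the same code points.
assert clef.encode("utf-8").decode("utf-8") == clef
```

A limit stated in UTF-8 bytes can be checked directly against what arrives on the wire, whereas a limit stated in code points requires decoding first. Neither unit corresponds to what a user perceives as a character, as the composed and decomposed samples show.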
On Thu, 9 Sep 2004 20:51:18 +0200, Matthias Wimmer <m at tthias.net> wrote:
> Hi Chris!
> Chris Mullins wrote on 2004-09-09 11:27:42:
> > The problem, though, is what other metrics are there to use? As I came up to speed on Unicode, I found this topic to be very confusing. There are combining characters, surrogate pairs, graphemes, and many, many subtleties in what counts as a "character". These also change depending on the regional dialect the user's display system is set to (different dialects of, say, Mandarin will represent certain sequences as one grapheme, or two graphemes). The problem is frightfully complex, and once stringprep normalization rules are included it only gets more complex.
> > After playing with this for some time, bytes, actually, is really the only way to count it.
> I don't agree. All the problems you describe are there if you count
> bytes as well. There is a one-to-one mapping from a UTF-8 byte sequence
> to a sequence of Unicode code points ("characters"). Therefore all
> problems you have in the one representation you will have in the other
> one as well.
> The difference is that what has been standardized is very much dependent
> on some implementations while counting Unicode code points ("characters")
> would have been more universal.
> See you
> Fon: +49-(0)70 0770 07770 http://web.amessage.info
> HAM: DB1MW xmpp:mawis at amessage.info
> Standards-JIG mailing list
> Standards-JIG at jabber.org