[Standards-JIG] Still not sure ...
m at tthias.net
Thu Sep 9 18:51:18 UTC 2004
Chris Mullins wrote on 2004-09-09 11:27:42:
> The problem, though, is what other metrics are there to use? As I came up to speed on Unicode, I found this topic to be very confusing. There are combining characters, surrogate pairs, graphemes, and many, many subtleties in what counts as a "character". These also change depending on the regional dialect the user's display system is set to (different dialects of, say, Mandarin will represent certain sequences as one grapheme, or two graphemes). The problem is frightfully complex, and once stringprep normalization rules are included it only gets more complex.
> After playing with this for some time, bytes, actually, is really the only way to count it.
I don't agree. All the problems you describe are there if you count
bytes as well. There is a one-to-one mapping from a well-formed UTF-8
byte sequence to a sequence of Unicode code points ("characters").
Therefore every problem you have in one representation you will have
in the other as well.
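
To illustrate the point, here is a small Python sketch (not part of any
JEP; the function names and sample strings are my own choices for
illustration). It counts the same strings both ways and checks that
decoding the UTF-8 bytes gives back exactly the same code points. The
combining-character ambiguity shows up in both counts:

```python
import unicodedata


def code_point_count(text: str) -> int:
    """Number of Unicode code points in the string."""
    return len(text)


def utf8_byte_count(text: str) -> int:
    """Number of bytes in the UTF-8 encoding of the string."""
    return len(text.encode("utf-8"))


def round_trips(text: str) -> bool:
    """True if encoding to UTF-8 and decoding again yields the same code points."""
    return text.encode("utf-8").decode("utf-8") == text


# Hypothetical sample strings: the same visible "e with acute" written two ways,
# plus one code point outside the Basic Multilingual Plane.
samples = {
    "composed": "\u00e9",        # single precomposed code point
    "decomposed": "e\u0301",     # base letter + combining acute accent
    "non-BMP": "\U0001F600",     # needs a surrogate pair in UTF-16
}

for label, text in samples.items():
    print(label, code_point_count(text), utf8_byte_count(text), round_trips(text))

# Normalization changes the code point count and the byte count alike.
normalized = unicodedata.normalize("NFC", samples["decomposed"])
print(code_point_count(normalized), utf8_byte_count(normalized))
```

The composed and decomposed forms differ in length under either metric,
so switching to bytes does not make the "what is a character" question
go away.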
The difference is that what has been standardized is very much dependent
on some implementations, while counting Unicode code points ("characters")
would have been more universal.
Fon: +49-(0)70 0770 07770 http://web.amessage.info
HAM: DB1MW xmpp:mawis at amessage.info