[Standards-JIG] Still not sure ...

Matthias Wimmer m at tthias.net
Thu Sep 9 18:51:18 UTC 2004

Hi Chris!

Chris Mullins wrote on 2004-09-09 11:27:42:
> The problem, though, is what other metrics are there to use? As I came up to speed on Unicode, I found this topic to be very confusing. There are combining characters, surrogate pairs, graphemes, and many, many subtleties in what counts as a "character". These also change depending on the regional dialect the user's display system is set to (different dialects of, say, Mandarin will represent certain sequences as one grapheme, or two graphemes). The problem is frightfully complex, and once stringprep normalization rules are included it only gets more complex.
> After playing with this for some time, bytes, actually, is really the only way to count it. 

I don't agree. All the problems you describe are still there if you
count bytes. There is a one-to-one mapping between a UTF-8 byte sequence
and a sequence of Unicode code points ("characters"). Therefore every
problem you have in one representation you will have in the other
as well.
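
To illustrate the point, here is a minimal Python sketch. It is not taken
from any Jabber implementation, and the helper name `describe` is made up
for this example. It shows that encoding to UTF-8 and decoding again
round-trips exactly, and that a combining sequence changes the count in
both metrics, so counting bytes does not make that ambiguity go away.

```python
import unicodedata


def describe(text: str) -> dict:
    """Count a string both ways and check the UTF-8 round trip."""
    encoded = text.encode("utf-8")
    return {
        "code_points": len(text),          # Python str length is in code points
        "utf8_bytes": len(encoded),        # length of the UTF-8 encoding
        "round_trips": encoded.decode("utf-8") == text,
    }


# The same visible character "é", written two different ways.
composed = "\u00e9"       # one precomposed code point
decomposed = "e\u0301"    # "e" followed by COMBINING ACUTE ACCENT

composed_info = describe(composed)
decomposed_info = describe(decomposed)

# Both metrics disagree between the two spellings of the same grapheme.
print(composed_info)
print(decomposed_info)

# Normalization (here NFC) is what makes the two spellings comparable,
# regardless of whether bytes or code points are counted afterwards.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == composed
print(same_after_nfc)
```

Whichever unit is chosen for a length limit, the two spellings of the
same grapheme still give different counts until they are normalized.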

The difference is that what has been standardized depends very much
on particular implementations, whereas counting Unicode code points
("characters") would have been more universal.

See you

Fon: +49-(0)70 0770 07770       http://web.amessage.info
HAM: DB1MW                      xmpp:mawis at amessage.info