[Standards-JIG] Still not sure ...

Matthias Wimmer m at tthias.net
Thu Sep 9 21:21:26 UTC 2004


Hi Chris!

Chris Mullins wrote on 2004-09-09 13:45:51:
> Perhaps I'm misunderstanding your argument. The word "characters" is throwing me off. 

That's why I tried to be more exact and wrote about codepoints in my
last post. A UTF-8 sequence maps 1:1 to a sequence of Unicode
codepoints.

Let me give an example:
"ä" can be represented either as U+00E4 or as the sequence U+0061
U+0308.

You told me that this causes a problem if we count codepoints instead of
bytes. But if that is a problem for you, you have the same problem with
the corresponding UTF-8 byte sequences.

For U+00e4 the sequence is "c3 a4".
For U+0061 U+0308 the sequence is "61 cc 88"

This is what I mean by a one-to-one mapping. Wherever there are two ways
to express the same thing in codepoints, there are the same number of
different UTF-8 encodings.

If you care that a string can be encoded in different ways with
different lengths, the same problem exists at the byte level: the
corresponding UTF-8 sequences have different lengths too.
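
To make the example above concrete, here is a minimal sketch in Python
(the helper name describe is only an illustration, not from any spec).
It prints the codepoints and the UTF-8 bytes for both spellings of "ä"
and checks that decoding the bytes gives back the same codepoints:

```python
# Two ways to write "ä": precomposed, and "a" plus a combining diaeresis.
precomposed = "\u00e4"
decomposed = "\u0061\u0308"

def describe(s):
    """Return (list of codepoints, UTF-8 bytes as a hex string)."""
    codepoints = [f"U+{ord(c):04X}" for c in s]
    utf8_hex = s.encode("utf-8").hex(" ")  # bytes.hex(sep) needs Python 3.8+
    return codepoints, utf8_hex

for s in (precomposed, decomposed):
    cps, utf8_hex = describe(s)
    print(len(cps), "codepoint(s)", cps, "->", utf8_hex)

# The mapping is reversible: the bytes decode to exactly the same codepoints.
for s in (precomposed, decomposed):
    assert s.encode("utf-8").decode("utf-8") == s
```

Both the codepoint count (1 vs. 2) and the byte count (2 vs. 3) differ
between the two spellings, so the ambiguity exists at either level.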

> > The difference is that what has been standardized is very much 
> > dependent on some implementations while counting Unicode 
> > codepoints ("characters") would have been more universal.
>  
> What parts are implementation specific? That seems to me to be totally implementation independent. Bytes are bytes.

Yes and no ... the implementation-specific part is that not every
application uses UTF-8 internally. By counting the codepoints used to
encode a string you are at a higher level of abstraction, which depends
much less on how an implementation represents the string
internally.
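
A small sketch of that point, again in Python. The sample string and the
helper codepoint_count are my own illustration; the three encodings
stand in for different internal representations an application might
use:

```python
# Sample string: "ä" (U+00E4) followed by the euro sign (U+20AC).
sample = "\u00e4\u20ac"

def codepoint_count(data: bytes, encoding: str) -> int:
    """Decode the bytes and count the Unicode codepoints they represent."""
    return len(data.decode(encoding))

# The same string in three possible internal representations.
representations = {
    "utf-8": sample.encode("utf-8"),
    "utf-16-le": sample.encode("utf-16-le"),
    "utf-32-le": sample.encode("utf-32-le"),
}

for encoding, data in representations.items():
    # The byte length varies with the encoding; the codepoint count does not.
    print(encoding, len(data), "bytes,", codepoint_count(data, encoding), "codepoints")
```

A limit expressed in bytes means something different for each of these
representations, while a limit expressed in codepoints is the same for
all of them.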

> Counting characters is actually implementation specific. And worse than that, it's locale and language specific as well.

Why would Unicode codepoints be locale dependent? I don't understand what
you are calling a "character". And I don't understand your argument
about stringprep either ... you say that it can change the number of
codepoints needed, which is true ... but it also changes the number of
bytes needed for the corresponding UTF-8 sequence.
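
As a rough illustration in Python: stringprep profiles apply NFKC
normalization among other steps, so the sketch below uses only
unicodedata.normalize("NFKC", ...) and is not a full stringprep
implementation. The helper name sizes and the two sample strings are
assumptions chosen for the example:

```python
import unicodedata

def sizes(s):
    """Return (number of codepoints, number of UTF-8 bytes) for a string."""
    return len(s), len(s.encode("utf-8"))

samples = {
    "a + combining diaeresis": "\u0061\u0308",
    "fi ligature": "\ufb01",
}

for name, original in samples.items():
    # Only the normalization step; mapping and prohibited-character
    # checks of a real stringprep profile are left out here.
    normalized = unicodedata.normalize("NFKC", original)
    print(name, sizes(original), "->", sizes(normalized))
```

For both samples the normalization changes the codepoint count and the
UTF-8 byte count together, so counting bytes does not avoid the issue.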


See you
    Matthias

-- 
Fon: +49-(0)70 0770 07770       http://web.amessage.info
HAM: DB1MW                      xmpp:mawis at amessage.info