[Standards-JIG] Still not sure ...

Chris Mullins cmullins at winfessor.com
Thu Sep 9 20:45:51 UTC 2004


Matthias Wimmer:

> There is a one-to-one mapping from a UTF-8 byte sequence
> to a sequence of unicode code points ("characters").
 
Perhaps I'm misunderstanding your argument. The word "characters" is throwing me off. 
 
There is not a 1:1 mapping between code points and what the user sees (call them "eye characters"). This mapping changes depending on a wide variety of circumstances. 
 
There is not a 1:1 mapping between code points and graphemes. Often a grapheme is composed of multiple code points (a base character plus combining characters). Nor is there a 1:1 mapping between graphemes and "eye characters". 
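
To make the code point versus grapheme distinction concrete, here is a minimal Python sketch. The sample string and variable names are my own, picked purely for illustration; both forms below render as the same single grapheme.

```python
import unicodedata

# Hypothetical sample: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "e\u0301"
# NFC composes the pair into the single code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

for label, s in (("decomposed", decomposed), ("composed", composed)):
    code_points = len(s)                 # Python str length counts code points
    utf8_bytes = len(s.encode("utf-8"))  # length once encoded as UTF-8
    print(f"{label}: {code_points} code point(s), {utf8_bytes} UTF-8 byte(s)")
```

One grapheme on screen, but two code points in one form and one code point in the other.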
 
All these mappings also change depending on whether you are before or after the NFKC normalization step required by StringPrep. 
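
A rough sketch of how the counts move across that step, using Python's standard unicodedata module. It shows only the NFKC part, not a full StringPrep profile (no mapping tables or prohibited-character checks), and the sample strings and the counts() helper are my own assumptions.

```python
import unicodedata

def counts(s):
    """Return (code points, UTF-8 bytes) for a string."""
    return len(s), len(s.encode("utf-8"))

# Hypothetical samples chosen because NFKC changes each of them.
samples = {
    "fi ligature (U+FB01)": "\ufb01",
    "e + combining acute": "e\u0301",
    "fullwidth A (U+FF21)": "\uff21",
}

for name, raw in samples.items():
    normalized = unicodedata.normalize("NFKC", raw)
    print(name, counts(raw), "->", counts(normalized))
```

The ligature goes from one code point to two, the combining sequence goes from two to one, and the fullwidth letter stays at one code point but shrinks from three bytes to one.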
 
A set of combining characters (two distinct Unicode code points) combines to form a single grapheme that can also (usually) be represented by a single precomposed code point. Which one is the character? Are there 2 characters ('cause it was originally 2 combining character code points) or is there 1 character (the resulting grapheme)? The graphemes may also undergo some changes when they are rendered for view. What about surrogate pairs? In UTF-16 these appear as two distinct code units, yet together they encode a single code point, and that code point takes 4 bytes in UTF-8. 
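
For the surrogate pair case, a small Python sketch, assuming UTF-8 as defined in RFC 3629 (at most 4 bytes per code point). The sample character U+10400 is an arbitrary choice of mine from outside the Basic Multilingual Plane.

```python
import struct

# Hypothetical sample: U+10400 DESERET CAPITAL LETTER LONG I (outside the BMP).
ch = "\U00010400"

# In UTF-16 the single code point is split into two 16-bit code units.
utf16 = ch.encode("utf-16-be")
high, low = struct.unpack(">2H", utf16)

# In UTF-8 the same code point is one 4-byte sequence; no surrogates involved.
utf8 = ch.encode("utf-8")

print("code points:", len(ch))
print("UTF-16 code units:", hex(high), hex(low))
print("UTF-8 bytes:", len(utf8), utf8.hex())
```

So the pair is one code point, two UTF-16 code units, and four UTF-8 bytes, and which of those numbers you get depends on what you decide to count.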
 
In Microsoft land the Unicode rendering engine is called Uniscribe, and it's... weird. Uniscribe does "language-specific orthographic analysis" - that's not something that I think we're prepared to do yet. With the xml:lang attributes in most stanzas we probably have enough information to do it, but... Ick. 

> The difference is that what has been standardized is very much 
> dependant on some implementations while counting unicode 
> codepoints ("characters") would have been more universal.
 
What parts are implementation-specific? That seems to me to be totally implementation-independent. Bytes are bytes. 
 
Counting characters is actually implementation-specific. And worse than that, it's locale- and language-specific as well. Turning a normalized UTF-8 string into graphemes (combining all the combining characters, translating the surrogate pairs into 32-bit code points, and so forth) or into "eye characters" is a lot more work. It's also going to have different results depending on the end user's system (assuming this is done client-side). 
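
A sketch of the three ways of counting side by side, in Python. The measure() helper and the sample string are hypothetical, and the grapheme count is only a crude approximation (it just skips code points in the Unicode "Mark" categories); real grapheme segmentation follows the rules in UAX #29 and needs a proper library.

```python
import unicodedata

def measure(s):
    """Count a string three ways; the grapheme figure is an approximation."""
    return {
        "bytes": len(s.encode("utf-8")),
        "code_points": len(s),
        # Crude stand-in for grapheme clusters: ignore combining marks.
        "approx_graphemes": sum(
            1 for c in s if not unicodedata.category(c).startswith("M")
        ),
    }

# Hypothetical sample: "resume" with two combining acute accents.
raw = "re\u0301sume\u0301"

print(measure(raw))
print(measure(unicodedata.normalize("NFKC", raw)))
```

For this sample the byte count is 10 before normalization and 8 after, the code point count is 8 and then 6, and the approximate grapheme count is 6 both times. The byte count is the only one of the three that needs no Unicode tables at all.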

-- 
Chris Mullins

