[Standards-JIG] Still not sure ...

Chris Mullins cmullins at winfessor.com
Thu Sep 9 21:52:19 UTC 2004

Matthias Wimmer wrote:
> For U+00e4 the sequence is "c3 a4".
> For U+0061 U+0308 the sequence is "61 cc 88"

> This is what I mean with they have a one-to-one mapping. Where you find
> two ways to express the same thing with codepoints you find the same
> number of different utf-8 encoding.

I understand now what you mean. 
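
For anyone who wants to check those byte sequences, here is a minimal Python sketch (the helper name `utf8_hex` is my own, not from any spec) that prints the UTF-8 bytes for both spellings of "ä":

```python
# Two ways to write the same visible letter.
precomposed = "\u00e4"   # U+00E4, a with diaeresis as one codepoint
decomposed = "a\u0308"   # U+0061 followed by U+0308 COMBINING DIAERESIS


def utf8_hex(s):
    """Return the UTF-8 encoding of s as space-separated hex bytes."""
    return " ".join(f"{b:02x}" for b in s.encode("utf-8"))


print(utf8_hex(precomposed))  # c3 a4
print(utf8_hex(decomposed))   # 61 cc 88
```

Two different codepoint sequences give two different byte sequences, which is the one-to-one mapping described above.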

> > What parts are implementation specific? 
> Yes and no ... the implementation specific thing is that 
> not every application is using utf-8 internally. 

I can't think of any applications that use UTF-8 internally (perhaps libidn?). These days I write all my code in .NET, and every string I handle is represented internally as UTF-16. Any time I need to put something on the wire, in order to be XMPP compliant, I need to translate it to UTF-8 using an appropriate encoder. Likewise, everything I pull off the wire needs to be converted from UTF-8 into UTF-16.
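
As a rough illustration of that round trip, here is a Python sketch. It is not the .NET code I actually use; the function names are made up for the example, and the "internal" representation is simulated as little-endian UTF-16 bytes:

```python
def utf16_to_wire(utf16_bytes: bytes) -> bytes:
    """Convert internal UTF-16 (little-endian, no BOM) to UTF-8 for the wire."""
    return utf16_bytes.decode("utf-16-le").encode("utf-8")


def wire_to_utf16(utf8_bytes: bytes) -> bytes:
    """Convert UTF-8 read off the wire back to internal UTF-16."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


internal = "h\u00e4llo".encode("utf-16-le")  # 5 code units, 10 bytes
wire = utf16_to_wire(internal)               # 6 bytes, the umlaut takes two
assert wire_to_utf16(wire) == internal       # the round trip is lossless
```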

Most Windows programming is like this as well (using wchar, tchar, and the other wide-character representations), and the native wide-character Windows API calls all take UTF-16.

I believe (although I'm not sure) Java also stores data internally as UTF-16. 

> By counting codepoints used to encode a string you are on a 
> higher abstraction level, that does not that much depend on 
> how a implementation is representing the string internally.

I actually have a much harder time doing this than getting the byte count. Because my strings are stored internally as UTF-16, I need to resolve all the surrogate pairs (which are a UTF-16-only construct) into UTF-32 codepoints. If it weren't for this, things would be easier.

... but because of this, the algorithm that goes "For Each Char in MyString: Count++: Next" is broken. I would end up with all sorts of errors if the original (UTF-16) string contained surrogate pairs, since each pair would be counted as two characters.

The only way I can see this general algorithm working is if the string were encoded as UTF-32. Then we could count codepoints and get a meaningful result.
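
To make the surrogate problem concrete, here is a Python sketch of what resolving the pairs looks like when all you have is UTF-16 code units. Both function names are hypothetical, and Python strings are not UTF-16 internally, so the code units are produced explicitly for the example:

```python
import struct


def utf16_code_units(s):
    """Return the UTF-16 code units of s as a list of 16-bit integers."""
    data = s.encode("utf-16-le")
    return list(struct.unpack(f"<{len(data) // 2}H", data))


def count_code_points(units):
    """Count codepoints, treating a high+low surrogate pair as one."""
    count = 0
    i = 0
    while i < len(units):
        is_high = 0xD800 <= units[i] <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        i += 2 if (is_high and has_low) else 1  # skip both halves of a pair
        count += 1
    return count


units = utf16_code_units("a\U0001D11E")  # "a" plus U+1D11E, outside the BMP
naive = len(units)                       # what the For Each loop reports
actual = count_code_points(units)        # what we actually want
```

Here the naive count is 3 and the surrogate-aware count is 2, because U+1D11E is stored as the pair D834 DD1E.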

> > Counting characters is actually implementation specific.
> > And worse than that, it's locale and language specific as well.

> Why are codepoints of unicode locale dependant? 

That's more of a rendering issue than anything else. If we define "character" as a UTF-32 codepoint then this is largely a moot point. If we define "character" as "something a user sees" then we would have all sorts of issues. I guess it boils down to "don't use the word character."

I think the "ideal" case would be "number of UTF-32 codepoints", but that's not something likely to happen any time soon. So in lieu of that, bytes.
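
To show how far apart the candidate measures can drift, here is a small Python sketch (the function name `length_measures` is just for illustration) that reports all three counts for one string:

```python
def length_measures(s):
    """Return the length of s in UTF-8 bytes, UTF-16 code units and codepoints."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s.encode("utf-32-le")) // 4,
    }


one_code_point = length_measures("\u00e4")      # precomposed a-umlaut
two_code_points = length_measures("a\u0308")    # same letter, decomposed
outside_bmp = length_measures("\U0001D11E")     # needs a surrogate pair
```

The user sees one "character" in each case, yet the three strings come out as 2, 3 and 4 UTF-8 bytes and as 1, 2 and 1 codepoints. None of the counts matches what the user sees in every case, which is why the word "character" causes trouble.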

Unicode. Ick.

Chris Mullins
