[Standards-JIG] Still not sure ...
cmullins at winfessor.com
Thu Sep 9 21:52:19 UTC 2004
Matthias Wimmer Wrote:
> For U+00e4 the sequence is "c3 a4".
> For U+0061 U+0308 the sequence is "61 cc 88"
> This is what I mean with they have a one-to-one mapping. Where you find
> two ways to express the same thing with codepoints you find the same
> number of different utf-8 encoding.
I understand now what you mean.
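The quoted byte sequences are easy to check. A minimal sketch in Java (assuming a standard JDK; `getBytes` transcodes the internal UTF-16 string to UTF-8):

```java
import java.nio.charset.StandardCharsets;

public class Utf8Forms {
    // Helper: render a byte array as space-separated lowercase hex
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Precomposed form: U+00E4 (a-umlaut as a single codepoint)
        System.out.println(hex("\u00e4".getBytes(StandardCharsets.UTF_8)));  // c3 a4
        // Decomposed form: U+0061 U+0308 (plain 'a' + combining diaeresis)
        System.out.println(hex("a\u0308".getBytes(StandardCharsets.UTF_8))); // 61 cc 88
    }
}
```

Two distinct codepoint sequences, two distinct UTF-8 byte sequences: the one-to-one mapping described above.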
> > What parts are implementation specific?
> Yes and no ... the implementation specific thing is that
> not every application is using utf-8 internally.
I can't think of any applications that use UTF-8 internally (perhaps libidn?). These days I write all my code in .NET - everything I do, string-wise, is represented internally as UTF-16. Any time I need to put something on the wire, in order to be XMPP compliant, I have to translate it to UTF-8 using an appropriate encoder. Likewise, everything I pull off the wire needs to be decoded from UTF-8 into UTF-16.
Most Windows programming is like this as well (using all the wchar, tchar, and other wide character representations) and certainly all the Windows API calls are UTF-16.
I believe (although I'm not sure) that Java also stores strings internally as UTF-16.
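In Java the same internal/wire split looks like this (a minimal sketch; the `java.nio.charset` machinery does the transcoding):

```java
import java.nio.charset.StandardCharsets;

public class WireEncoding {
    public static void main(String[] args) {
        // Internally, a Java String is a sequence of UTF-16 code units
        String stanza = "<message>h\u00e4llo</message>";

        // Going onto the wire: transcode UTF-16 -> UTF-8
        byte[] wire = stanza.getBytes(StandardCharsets.UTF_8);

        // Coming off the wire: decode UTF-8 back into a UTF-16 string
        String decoded = new String(wire, StandardCharsets.UTF_8);

        System.out.println(decoded.equals(stanza));  // true: lossless round trip
    }
}
```

Note that `wire.length` and `stanza.length()` already disagree here (the umlaut is one UTF-16 code unit but two UTF-8 bytes), which is exactly why "length" needs a precise definition.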
> By counting codepoints used to encode a string you are on a
> higher abstraction level, that does not that much depend on
> how a implementation is representing the string internally.
I actually have a much harder time doing this than getting the byte count. Because my strings are stored internally as UTF-16, I need to resolve all the surrogate pairs (a UTF-16-only construct) into UTF-32 codepoints. If it weren't for this, things would be easier.
... but because of this, the algorithm that goes "For Each Char in MyString: Count++: Next" is broken. I would end up with all sorts of errors if the original (UTF-16) string contained surrogate pairs.
The only way I can see this general algorithm working is if the string were encoded as UTF-32. Then we could count codepoints and get a meaningful result.
> > Counting characters is actually implementation specific.
> > And worse than that, it's locale and language specific as well.
> Why are codepoints of unicode locale dependant?
That's more of a rendering issue than anything else. If we define "character" as a UTF-32 codepoint, then this is largely a moot point. If we define "character" as "something a user sees", then we would have all sorts of issues. I guess it boils down to "don't use the word character."
I think the "ideal" case would be "number of UTF-32 codepoints", but that's not likely to happen any time soon. So in lieu of that, bytes.