[Standards] RTT, take 2

Simon McVittie simon.mcvittie at collabora.co.uk
Wed Jun 22 18:56:06 UTC 2011


On Wed, 22 Jun 2011 at 13:05:48 -0400, Mark Rejhon wrote:
> UTF16 and UTF16LE, and even UCS2 has same behaviour in my RTT spec, so I
> just say "16-bit Unicode".  Java, C#, ObjectiveC stores strings in 16-bit,
> and the various flavours of Unicode C++ STL and stdlib++ also store strings
> in 16-bit as well. Extensive research and testing shows they all process in
> flat mode like an array of 16-bit integers

IMO you should either count Unicode codepoints (the underlying data model)
or bytes of UTF-8 (the XMPP wire protocol). Counting in units of whatever a
particular implementation uses internally, if that isn't one of those two,
seems attractive if you happen to use that implementation, but it complicates
things further for everyone who doesn't.
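As a rough illustration (my own example string, not anything from the RTT
spec), here is how those three counts can differ in Java for text containing
a character outside the Basic Multilingual Plane:

    import java.nio.charset.StandardCharsets;

    public class CountDemo {
        public static void main(String[] args) {
            // "a" followed by U+1D11E MUSICAL SYMBOL G CLEF (outside the BMP),
            // written here as its UTF-16 surrogate pair
            String s = "a\uD834\uDD1E";

            // Unicode codepoints (the data model): 2
            System.out.println(s.codePointCount(0, s.length()));

            // UTF-16 code units, i.e. what String.length() counts: 3
            System.out.println(s.length());

            // UTF-8 bytes as sent on the XMPP wire: 1 + 4 = 5
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);
        }
    }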

Yes, Windows, Java etc. store UTF-16 and express their APIs in terms of 16-bit
units rather than codepoints, but that's an accident of history - early Unicode
developers thought 16 bits would be enough for everybody (UCS-2), and when they
realised that wasn't true, UTF-16 was invented as a way to avoid changing their
16-bit wchar_t.

wchar_t is 32 bits (enough for any Unicode codepoint) on some platforms,
notably Linux, and probably most other Unix systems.

As an implementation detail, within a UTF-16 string viewed as an array of
16-bit units, I believe it's possible to tell whether a unit is part of a
surrogate pair just by looking at its value, without further context; so
it's relatively easy to advance a "cursor" by one Unicode codepoint (which
means advancing by either one or two 16-bit units).
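
A minimal sketch of that cursor logic in Java (where char is a UTF-16 code
unit); the method name advanceByOneCodepoint is mine, purely for illustration:

    // Advance an index into a UTF-16 char array by one Unicode codepoint.
    // A high surrogate (0xD800-0xDBFF) is recognisable from its value alone,
    // and together with the following low surrogate forms one codepoint.
    static int advanceByOneCodepoint(char[] units, int index) {
        if (Character.isHighSurrogate(units[index])
                && index + 1 < units.length
                && Character.isLowSurrogate(units[index + 1])) {
            return index + 2;   // surrogate pair: one codepoint, two units
        }
        return index + 1;       // BMP character (or lone surrogate): one unit
    }

In practice the standard library already provides this, e.g.
Character.offsetByCodePoints(CharSequence, int, int).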

    S


