[Standards] RTT, take 2

Gunnar Hellström gunnar.hellstrom at omnitor.se
Thu Jun 23 09:43:23 UTC 2011


Mark said in the UTF-8 / UTF-16 discussion:
> However, I am thinking of following Simon's excellent suggestion.
>
> What do you think of his suggestion of using "code point" counting for 
> length and position attributes?
> That'd pretty much essentially turn XMPP RTT equivalently into a 
> standard for editing an array of 32-bit integers instead (allow use of 
> native UCS4 string functions in programming languages that stores 
> strings in UCS4 format).  It makes my 16-bit programming slightly more 
> complicated, but much easier than counting in UTF8.  It might be a 
> better long term goal.
>
> Opinion?
>
Yes, counting in code points is the right decision. You do not need to 
comment on what that means for the programmer.
Some may want to work in native UTF-8. In that case a Unicode code point is 
well defined as a UTF-8 sequence of 1 to 4 bytes, which is easy to isolate.
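
As a rough illustration of the UTF-8 case, here is a minimal Python sketch. The function name is made up for this example, and it assumes the input is already well-formed UTF-8. Every code point starts with exactly one byte that is not a continuation byte (10xxxxxx), so counting those bytes gives the code point count.

```python
def count_code_points_utf8(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (hypothetical helper).

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new code point. No validation is done here.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


# One 1-byte, one 2-byte, one 3-byte and one 4-byte sequence.
sample = "a\u00e9\u20ac\U0001F600".encode("utf-8")
n_bytes = len(sample)
n_code_points = count_code_points_utf8(sample)
```

Here the sample is 10 bytes long but counts as 4 code points.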

Some may want to work in UTF-16. They then need to watch out for 16-bit 
values in the surrogate range U+D800 to U+DFFF and count each pair of such 
values as 1 code point, while every other 16-bit value is 1 code point.
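
A minimal Python sketch of the UTF-16 case, operating on a list of 16-bit values. The function name is hypothetical, and treating an unpaired surrogate as 1 code point is an assumption of this sketch, not something the protocol text has decided.

```python
def count_code_points_utf16(units) -> int:
    """Count code points in a sequence of 16-bit UTF-16 values.

    A high surrogate (0xD800-0xDBFF) followed by a low surrogate
    (0xDC00-0xDFFF) counts as 1 code point. Any other value, including
    an unpaired surrogate (an assumption of this sketch), counts as 1.
    """
    count = 0
    i = 0
    while i < len(units):
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
        count += 1
    return count


# "a", U+1F600 (encoded as the pair D83D DE00), "b"
units = [0x0061, 0xD83D, 0xDE00, 0x0062]
n_code_points = count_code_points_utf16(units)
```

The four 16-bit values above make up 3 code points.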

And some may want to work in the 32-bit form of Unicode (UCS-4 / UTF-32), 
where every value is exactly 1 code point.

Just specify in the protocol that p and n are counted in Unicode code 
points.
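
To show what that means for an implementation that stores text as 16-bit values, here is a small Python sketch that maps a code point position p to an index in a UTF-16 array. The function name, the 0-based position and the error handling are all assumptions made for this example; the protocol text does not define them here.

```python
def utf16_index_of_code_point(units, p: int) -> int:
    """Return the index in `units` where code point number p starts.

    p is 0-based (an assumption of this sketch); p equal to the total
    number of code points returns len(units), i.e. the end of the text.
    """
    i = 0
    for _ in range(p):
        if i >= len(units):
            raise IndexError("position is beyond the end of the text")
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1
    return i


# "a", U+1F600 (pair D83D DE00), "b"
units = [0x0061, 0xD83D, 0xDE00, 0x0062]
index_of_b = utf16_index_of_code_point(units, 2)
```

In this example the third code point ("b", p = 2) starts at array index 3, because the code point before it occupies two 16-bit values. A length n can be handled the same way by converting both p and p + n to array indexes.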

/Gunnar


