[Standards] RTT, take 2

Gunnar Hellström gunnar.hellstrom at omnitor.se
Fri Jun 24 08:57:13 UTC 2011


Remko Tronçon wrote:
> [ I don't like writing me-too e-mails, but you beat me by a minute to
> sending the exact same mail, so I'm doing it anyway ;-) ]
>
>> So I'd say that we should refer to characters in a string, and deal with
>> Unicode code-points in the abstract. I'd expect that implementations would
>> convert this internally into whatever made sense for them.
> I think it would be the first protocol to depend on knowing how to
> count code points (I haven't needed it before), but I also think it's
> the only sensible thing to do, because you could end up with incorrect
> encodings using the protocol otherwise.
>
> Anyway, for applications that don't use Unicode libraries, rolling
> your own codepoint count isn't very hard, at least for utf-8.
>
We just need a concise way to express lengths and positions within the
Unicode string. In Unicode, a single character as the user perceives it
can be composed of several characters, for example a base letter followed
by a combining accent. The word "characters" on its own is therefore at
risk of being ambiguous and needs clarification.
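
To illustrate the ambiguity, here is a small Python sketch. The variable
names are only for illustration and are not taken from the specification.
It shows the same visible letter represented once as a single code point
and once as a base letter plus a combining accent. In Python 3, len() on
a str counts code points.

```python
import unicodedata

# The same visible letter "é" in two representations.
composed = "\u00e9"        # one code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"     # two code points: "e" + COMBINING ACUTE ACCENT

# In Python 3, len() on a str counts Unicode code points.
composed_length = len(composed)
decomposed_length = len(decomposed)

# NFC normalization folds the two-code-point form into the single one.
normalized = unicodedata.normalize("NFC", decomposed)
same_after_nfc = normalized == composed
```

A reader would call both strings "one character", yet they differ in
code point count, which is why a length or position given in
"characters" is open to interpretation.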

RFC 5198 Network Unicode says:
"Unicode identifies each character by an integer, called its "code
    point", in the range 0-0x10ffff.  These integers can be encoded into
    byte sequences for transmission in at least three standard and
    generally-recognized encoding forms, all of which are completely
    defined in The Unicode Standard and the documents cited below:"

It is this "Unicode code point" that is meant by the length and position
parameters in this specification, regardless of which encoding form is
used to represent the code point.
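
As a rough sketch of what this means for an implementation, the Python
below counts the same string in three units. The function names are
hypothetical, and the UTF-8 counter assumes well-formed input. It relies
on the fact that every UTF-8 continuation byte has the bit pattern
10xxxxxx, so counting the other bytes gives the number of code points.

```python
def count_code_points_utf8(data: bytes) -> int:
    """Count code points in well-formed UTF-8 (assumed valid input).

    Continuation bytes look like 0b10xxxxxx, so every byte that does
    not match that pattern starts a new code point.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def count_utf16_code_units(text: str) -> int:
    """Count UTF-16 code units (two bytes each, no byte order mark)."""
    return len(text.encode("utf-16-le")) // 2


# "a", "é" (U+00E9) and an emoji outside the Basic Multilingual Plane.
text = "a\u00e9\U0001F600"
utf8_bytes = text.encode("utf-8")

byte_length = len(utf8_bytes)
code_point_length = count_code_points_utf8(utf8_bytes)
utf16_length = count_utf16_code_units(text)
```

The three values differ for this string, so a sender counting bytes or
UTF-16 code units and a receiver counting code points would disagree on
where a position lies.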

Since RFC 5198 uses both "character" and "code point", and "character"
is slightly ambiguous, I suggest using the term "Unicode code point".

cheers,
Gunnar



