[Standards] RTT, take 2
gunnar.hellstrom at omnitor.se
Fri Jun 24 08:57:13 UTC 2011
Remko Tronçon wrote:
> [ I don't like writing me-too e-mails, but you beat me by a minute to
> sending the exact same mail, so I'm doing it anyway ;-) ]
>> So I'd say that we should refer to characters in a string, and deal with
>> Unicode code-points in the abstract. I'd expect that implementations would
>> convert this internally into whatever made sense for them.
> I think it would be the first protocol to depend on knowing how to
> count code points (I haven't needed it before), but I also think it's
> the only sensible thing to do, because you could end up with incorrect
> encodings using the protocol otherwise.
> Anyway, for applications that don't use Unicode libraries, rolling
> your own codepoint count isn't very hard, at least for utf-8.
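Indeed, as a minimal sketch (my own illustration, not from the thread): in UTF-8 every code point begins with exactly one non-continuation byte, so counting code points reduces to skipping bytes of the form 10xxxxxx.

```python
def count_code_points(data: bytes) -> int:
    """Count Unicode code points in UTF-8 encoded bytes.

    Continuation bytes match the bit pattern 10xxxxxx; every code
    point contributes exactly one byte that is NOT a continuation
    byte, so counting those counts code points.
    """
    return sum(1 for b in data if b & 0xC0 != 0x80)

# "é" (U+00E9) is one code point but two UTF-8 bytes
assert count_code_points("héllo".encode("utf-8")) == 5
```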
We just need a concise way to express lengths and positions within the
Unicode string. In Unicode, some characters can be composed of several
code points. The word "characters" on its own therefore risks being
ambiguous and needs clarification.
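To make the ambiguity concrete (a hypothetical example of mine, using Python, whose `len()` counts code points): the same visible character "é" can be one precomposed code point or two decomposed ones.

```python
import unicodedata

s1 = "\u00e9"                           # "é" precomposed: 1 code point
s2 = unicodedata.normalize("NFD", s1)   # "e" + U+0301 combining accent: 2 code points

# One "character" to the user, but different code-point counts
assert len(s1) == 1
assert len(s2) == 2
# Both normalize back to the same precomposed form
assert unicodedata.normalize("NFC", s2) == s1
```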
RFC 5198 Network Unicode says:
"Unicode identifies each character by an integer, called its "code
point", in the range 0-0x10ffff. These integers can be encoded into
byte sequences for transmission in at least three standard and
generally-recognized encoding forms, all of which are completely
defined in The Unicode Standard and the documents cited below:"
It is this "Unicode code point" that is meant by the length and position
parameters in this specification, regardless of which encoding form an
implementation uses internally to represent the Unicode string.
With RFC 5198 using both "character" and "code point", and
"character" being slightly ambiguous, I suggest using the term "Unicode
code point".
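A quick sketch of why "Unicode code point" is the robust unit (my illustration): the code-point count of a string is the same under every encoding form, even when byte and code-unit counts differ.

```python
s = "\U0001F600"  # an emoji outside the BMP: one code point

assert len(s) == 1                      # 1 code point (Python counts code points)
assert len(s.encode("utf-8")) == 4      # 4 bytes in UTF-8
assert len(s.encode("utf-16-be")) == 4  # 2 UTF-16 code units (a surrogate pair)
assert len(s.encode("utf-32-be")) == 4  # 1 UTF-32 code unit
```

So a length or position given in code points is interpreted identically by a UTF-8, UTF-16, or UTF-32 implementation.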