[Standards] XEP-0301 0.5 comments - Unicode characters
gunnar.hellstrom at omnitor.se
Thu Jul 26 21:33:57 UTC 2012
I think we have not solved this issue yet.
On 2012-07-25 11:06, Kevin Smith wrote:
>> >22.214.171.124 - "A single UTF-8 encoded character equals one code point" -
>> >this isn't true, is it?
>> >If we instead say
>> >"A single UTF-8 encoded Unicode Character equals one code point",
>> >that is true, and then we need to define Unicode Character as the Character
>> >concept used in the Unicode standard.
>> >And maybe add a note saying "Note that some visible characters are composed
>> >of more than one Unicode Character."
> My concern here is that the lack of precision about normalisation is
> worrying. I'm not yet convinced that nothing's going to change
> composition anywhere important - and one code point (unicode
> character) in one place could be more than one code point (unicode
> character) elsewhere. I'm feeling quite uncomfortable about the effect
> this will potentially have on interoperability - and I think it could
> easily be solved by saying "before calculating the rtt transforms to
> send, the sender must apply normalisation to the string, and before
> applying the transformations to the rtt buffer, the recipient must
> apply normalisation to them", where we pick one of the normalisation
> types and stick with it. The other option suggested to me when I was
> asking people about the effect this would have on interop was to
> require RTT to include what normalisation is used, so the sender would
> send an update with normalisation=NFKC or whatever.
I think that normalization in the endpoints is manageable. It should
just be done outside the path where the p and n calculations are done.
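
To illustrate why the order matters, here is a minimal Python sketch. It
assumes that p and n are counted in Unicode code points, and the helper
name normalized_length is made up for this example. It only shows that the
same visible text has different code point counts in different
normalization forms, and that the counts agree once both sides apply the
same form before counting.

```python
import unicodedata

# The same visible word in two normalization forms.
composed = "caf\u00e9"       # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"    # 'e' followed by combining acute accent (U+0301)

def normalized_length(text: str, form: str = "NFC") -> int:
    """Hypothetical helper: code point count after applying one agreed form."""
    return len(unicodedata.normalize(form, text))

# Python's len() on a str counts code points, so the raw counts differ.
raw_composed = len(composed)      # 4
raw_decomposed = len(decomposed)  # 5

# After both sides normalize to the same form, the counts agree.
nfc_composed = normalized_length(composed)      # 4
nfc_decomposed = normalized_length(decomposed)  # 4
```

So an endpoint that normalizes first and then calculates p and n on the
normalized text stays consistent with a peer doing the same.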
But Kevin indicated that network equipment might also perform Unicode
normalization. Then we must introduce some suitable rule against that.
E.g. "If network equipment performs Unicode normalization of <rtt/>
elements, then it must recalculate n and p after that action."
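
A rough sketch of what such a recalculation could look like, again in
Python and again assuming p is a code point position in the real-time
text buffer. The functions insert_at and recalculate_p are hypothetical,
and the sketch assumes the intermediary can see the text before the
insertion point and that p does not fall inside a combining sequence.

```python
import unicodedata

def insert_at(buffer: str, p: int, text: str) -> str:
    """Insert text at code point position p."""
    return buffer[:p] + text + buffer[p:]

def recalculate_p(original: str, p: int, form: str = "NFC") -> int:
    """Hypothetical: map position p in `original` to the matching position
    in the normalized text. Only valid when p does not split a base
    character from its combining marks."""
    return len(unicodedata.normalize(form, original[:p]))

sender_text = "cafe\u0301s"   # decomposed: 6 code points
p = 5                          # sender wants to insert before the final 's'

# An intermediary normalizes the text to NFC: now 5 code points.
relay_text = unicodedata.normalize("NFC", sender_text)

# Using the old p on the normalized text inserts in the wrong place.
stale = insert_at(relay_text, p, "!")

# Recalculating p first gives the position the sender intended.
fixed = insert_at(relay_text, recalculate_p(sender_text, p), "!")

# What the sender intended, expressed in the normalized form.
intended = unicodedata.normalize("NFC", insert_at(sender_text, p, "!"))
```

Here stale ends up with the "!" after the "s", while fixed matches
intended, with the "!" before the "s".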