[Standards] XEP-0301 0.5 comments -Unicode characters

Gunnar Hellström gunnar.hellstrom at omnitor.se
Fri Jul 27 06:59:58 UTC 2012


On 2012-07-27 00:04, Mark Rejhon wrote:
>
>
> On 2012-07-26 5:34 PM, "Gunnar Hellström" <gunnar.hellstrom at omnitor.se> wrote:
> >
> > I think we have not solved this issue yet.
> >
> > On 2012-07-25 11:06, Kevin Smith wrote:
> >>>
> >>> >4.5.4.3 - "A single UTF-8 encoded character equals one code point" -
> >>> >this isn't true, is it?
> >>> >
> >>> >If we instead say
> >>> >"A single UTF-8 encoded Unicode Character equals one code point."
> >>> >that is true, and then we need to define Unicode Character as the
> >>> >Character concept used in the Unicode standard.
> >>> >And maybe a note saying that "Note that some visible characters
> >>> >are composed of more than one Unicode Character."
>
<GH>Has this proposal been captured, agreed on, and turned into changes?
It is important for uniform calculation of n and p.
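
To make the distinction concrete, here is a minimal Python sketch. It
assumes only that n and p are counted in Unicode code points, as
proposed above; the variable names are illustrative. The same visible
character gives different counts depending on how it is composed:

```python
# The same visible character "é" in two different compositions.
composed = "\u00e9"      # one precomposed code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Python's len() on a str counts code points, not visible characters.
print(len(composed))     # 1
print(len(decomposed))   # 2

# UTF-8 byte counts differ again, so bytes are not a usable unit either.
print(len(composed.encode("utf-8")))    # 2
print(len(decomposed.encode("utf-8")))  # 3
```

Both ends therefore have to count the same unit, and see the same
composition, for n and p to agree.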


> >>
> >> My concern here is the lack of precision about normalisation. I'm
> >> not yet convinced that nothing's going to change composition
> >> anywhere important - and one code point (unicode character) in one
> >> place could be more than one code point (unicode character)
> >> elsewhere. I'm feeling quite uncomfortable about the effect this
> >> will potentially have on interoperability - and I think it could
> >> easily be solved by saying "before calculating the rtt transforms to
> >> send, the sender must apply normalisation to the string, and before
> >> applying the transformations to the rtt buffer, the recipient must
> >> apply normalisation to them", where we pick one of the normalisation
> >> types and stick with it. The other option suggested to me when I was
> >> asking people about the effect this would have on interop was to
> >> require RTT to include what normalisation is used, so the sender would
> >> send an update with normalisation=NFKC or whatever.
> >
> > I think that normalization in the endpoints is manageable. It should
> > just be done outside the path where p and n calculations are done.
> > But Kevin indicated that network equipment might also do Unicode
> > normalization. Then we must introduce some suitable rule against that.
> >
> > E.g. "If network equipment performs Unicode normalization of <rtt/>
> > elements, then it must recalculate n and p after that action."
>
> Generally, in most reasonable situations in XMPP, normalizing an
> already-normalized Unicode string results in no changes. Kevin says
> to specify a normalization format, but how do we know what
> normalization network equipment uses? So we have to carefully choose
> the normalization standard that is least likely to be affected by
> further unexpected passes of normalization.
>
> Anyway, as long as you normalize first at the sender end, any further
> normalization is usually harmless. There are different standards of
> normalization, so researching which specific normalization to choose
> in advance has merit, but the following should be factored in:
>
> - It only affects mid message editing for the most part; where 99 
> percent plus of typing is at the end.
>
> - If servers and network equipment violate standards and rudely
> modify code points, message inconsistencies are generally erased
> during the once-every-10-seconds Message Reset (or final message
> delivery in <body/>).
>
> - Do a full, complete normalization so that from then on, most/all
> normalization subsets are unlikely to have damaging effects on
> real-time text in these rare situations.
>
> - In my experience, I have not run into any situation where it is
> an issue.
>
> - Are there special situations? Do country-wide Great Firewalls
> modify code points in text-based packets, for example? Presently, I
> feel this is beyond the scope of XEP-0301, and the rest of the
> real-time message is probably a lost cause until the next line.
>
> - Again, rare normalization damage (which I have never seen, not even
> with realjabber.org, talk.l.google.com, or Openfire) is self-repairing
> anyway via Message Reset.
>
> - I did many tests; I copied and pasted torture test strings including
> funny bidirectional text with lots of superimposed characters and
> strange Unicode emoticons, and they transmit/edit in sync on both
> ends. I will keep testing....
>
> Personally, I think the Unicode Code Point handling is fine but I 
> agree several minor edits may be needed, such as the need to specify a 
> strict/fuller sender normalization standard (before the rtt encode) so 
> that further normalization is unlikely to affect code points.
>
> Thanks
> Mark Rejhon
>
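The claim that a second normalization pass is harmless can be checked
with a small Python sketch. It uses the standard unicodedata module;
the sample string and the helper name nfc are only illustrative. It
suggests that the claim holds when the later pass uses the same form,
but not when the form differs:

```python
import unicodedata


def nfc(text):
    """Normalize text to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", text)


# "Ström" typed with "o" plus COMBINING DIAERESIS (6 code points).
sample = "Stro\u0308m"

once = nfc(sample)
twice = nfc(once)

print(len(sample), len(once))  # 6 5
print(once == twice)           # True: a second NFC pass changes nothing

# A later pass in a different form does change the code point count.
as_nfd = unicodedata.normalize("NFD", once)
print(len(as_nfd))             # 6
```
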
<GH>RFC 5198 (Network Unicode) specifies Unicode Normalization Form NFC,
with good motivation. I suggest that we introduce:
"Use of Unicode Normalization Form NFC as specified by RFC 5198 is
strongly RECOMMENDED."
"Servers and other network elements SHOULD avoid modifying the contents
of <t/> elements."
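
A minimal sketch of what the sender side could look like under this
recommendation, in Python. The helper names to_nfc and diff_edit are
hypothetical, and the sketch assumes that an erase removes n code
points before position p and that an insert places text at position p.
It is not taken from the specification text. The point is only that
normalization happens before any n or p is calculated:

```python
import unicodedata


def to_nfc(text):
    """Normalize to NFC before any position arithmetic is done."""
    return unicodedata.normalize("NFC", text)


def diff_edit(old, new):
    """Hypothetical helper: one erase plus one insert turning old into new.

    Returns (erase_p, erase_n, insert_p, insert_text), all counted in
    code points of the NFC-normalized strings.
    """
    old, new = to_nfc(old), to_nfc(new)

    # Length of the common prefix.
    start = 0
    while start < len(old) and start < len(new) and old[start] == new[start]:
        start += 1

    # Length of the common suffix that does not overlap the prefix.
    end = 0
    while (end < len(old) - start and end < len(new) - start
           and old[len(old) - 1 - end] == new[len(new) - 1 - end]):
        end += 1

    erase_n = len(old) - start - end
    erase_p = start + erase_n          # erase n code points before p
    insert_text = new[start:len(new) - end]
    return erase_p, erase_n, start, insert_text


# The user corrects "cafe" to "café", typed as "e" + combining accent.
edit = diff_edit("cafe au lait", "cafe\u0301 au lait")
print(edit)  # (4, 1, 3, 'é')
```

Because both strings are normalized first, the decomposed input is
sent as the single code point U+00E9 and the positions are counted on
the NFC form only.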

I agree that only under extremely rare conditions can this have any bad
effect. All of the following must hold:
1. There are middle boxes fiddling with the contents of <t/> elements.
2. Mid-message editing involves unusual combined Unicode characters
that have a different number of code points in different
normalizations.
3. The middle box changes from one normalization to another in a way
that changes the number of code points.
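
For illustration, here is a small Python sketch of the rare case where
all three conditions hold. The middle box behaviour is purely
hypothetical and apply_insert is an illustrative receiver helper, not
anything from the specification:

```python
import unicodedata


def apply_insert(buffer, text, p):
    """Insert text at code point position p (illustrative receiver helper)."""
    return buffer[:p] + text + buffer[p:]


# Sender's buffer in NFC: "café lait", 9 code points.
sender_buffer = unicodedata.normalize("NFC", "cafe\u0301 lait")

# Mid-message edit: insert " au" right after "café", so p = 4 in NFC.
p, text = 4, " au"
expected = apply_insert(sender_buffer, text, p)

# Assumption: a middle box re-normalized the earlier text to NFD, so the
# receiver's buffer holds 10 code points instead of 9.
receiver_buffer = unicodedata.normalize("NFD", sender_buffer)
actual = apply_insert(receiver_buffer, text, p)

print(len(sender_buffer), len(receiver_buffer))          # 9 10
print(unicodedata.normalize("NFC", actual) == expected)  # False
```

With p = 4 the insert lands between the "e" and its combining accent
on the receiver side, so the accent ends up on the wrong letter until
the next Message Reset repairs the buffer.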

So, with the change proposed we are quite safe.

It is too bad that we cannot use RFC 5198 completely for our Unicode. It 
requires use of CRLF, while XML apparently requires LF.

Gunnar

