[Standards] XEP-0301 0.5 comments -Unicode characters

Mark Rejhon markybox at gmail.com
Fri Jul 27 00:55:47 UTC 2012


On Thu, Jul 26, 2012 at 6:04 PM, Mark Rejhon <markybox at gmail.com> wrote:
>
> On 2012-07-26 5:34 PM, "Gunnar Hellström" <gunnar.hellstrom at omnitor.se>
> wrote:
>>
>> I think we have not solved this issue yet.
>>
>> On 2012-07-25 11:06, Kevin Smith wrote:
>>>>
>>>> >4.5.4.3 - "A single UTF-8 encoded character equals one code point" -
>>>> >this isn't true, is it?
>>>> >
>>>> >If we instead say
>>>> >"A single UTF-8 encoded Unicode Character equals one code point."
>>>> >Is true, and then we need to define Unicode Character as the Character
>>>> >concept used in the Unicode standard.
>>>> >And maybe a note saying that "Note that some visible characters are
>>>> > composed
>>>> >of more than one Unicode Character."
>>>
>>> My concern here is the lack of precision about normalisation is
>>> worrying me. I'm not yet convinced that nothing's going to change
>>> composition anywhere important - and one code point (unicode
>>> character) in one place could be more than one code point (unicode
>>> character) elsewhere. I'm feeling quite uncomfortable about the effect
>>> this will potentially have on interoperability - and I think it could
>>> easily be solved by saying "before calculating the rtt transforms to
>>> send the sender must apply normalisation to the string and before
>>> applying the transformations to the rtt buffer the recipient must
>>> apply normalisation to them, where we pick one of the normalisation
>>> types and stick with it. The other option suggested to me when I was
>>> asking people about the effect this would have on interop was to
>>> require RTT to include what normalisation is used, so the sender would
>>> send an update with normalisation=NFKC or whatever.
>>
>> I think that normalization in the endpoints is manageable. It should
>> just be done outside the path where p and n calculations are done.
>> But Kevin indicated that network equipment might also do Unicode
>> normalization. Then we must introduce some suitable rule against that.
>>
>> E.g. "If network equipment makes Unicode normalization of <rtt/> elements,
>> then they must recalculate n and p after that action."
>
> Generally, in most reasonable situations in XMPP, normalizing an
> already-normalized Unicode string, results in no changes.  Kevin says to
> specify a normalization format, but how do we know what normalization
> network equipment uses?   So we have to carefully choose the normalization
> standard that is least likely to be affected by further unexpected passes of
> normalization.
>
> Anyway, as long as you normalize first at the sender end, any further
> normalization is usually harmless.  There are different standards of
> normalization, so research in choosing specific normalization in advance,
> has merit, but factoring into:
>
> - It only affects mid message editing for the most part; where 99 percent
> plus of typing is at the end.
>
> - If servers and network equipment violate standards and rudely modify
> code points, message inconsistencies are generally erased during the
> once-every-10-seconds Message Reset (or final message delivery in <body/>)
>
> - Do a full, complete normalization so that from thereafter, most/all
> normalization subsets likely has no damaging effects to real-time text in
> these rare situations.
>
> - Experience has shown I have not run into any situation where it is an
> issue.
>
> - Are there special situations?  Do country-wide Great Firewalls modify
> code points in text-based packets, for example?  Presently, I feel this is
> beyond scope of XEP-0301 and the rest of the real-time message is probably a
> lost cause, until the next line.
>
> - Again, rare normalization damage (which I have never seen, not even with
> realjabber.org, talk.l.google.com, or Openfire) is self repairing anyway via
> Message Reset.
>
> - I did many tests; I copied and pasted torture test strings including funny
> bidirectional text with lots of superimposed characters and strange Unicode
> emoticons, and they transmit/edit in sync on both ends.  I will keep
> testing....
>
> Personally, I think the Unicode Code Point handling is fine but I agree
> several minor edits may be needed, such as the need to specify a
> strict/fuller sender normalization standard (before the rtt encode) so that
> further normalization is unlikely to affect code points.
>
> Thanks
> Mark Rejhon

Checking the Unicode standards, I realize that what I was describing
is various NFC algorithms re-normalizing NFC (NFC being the common
normalization that composes combining characters into their most
compact form).
Now that we have the appropriate terminology (NFC, NFD, NFKC, NFKD),
I see I hadn't realized that's what you were referring to.
http://unicode.org/reports/tr15/
I am now assuming that is what Kevin and Gunnar are referring to: the
four different "normal forms".

I will now speak in proper Unicode normalization terminology (NFC,
NFD, NFKC, NFKD)
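
To make the four normal forms concrete, here is a minimal sketch using
Python's standard unicodedata module.  The sample string and the helper
names (code_point_counts, is_idempotent) are my own illustrations, not
anything from XEP-0301.  It shows that the same visible text has a
different number of code points in each form, and that re-normalizing an
already-normalized string to the same form changes nothing.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def code_point_counts(text):
    """Return {form: number of code points} for each Unicode normal form."""
    return {form: len(unicodedata.normalize(form, text)) for form in FORMS}

def is_idempotent(text, form):
    """True if normalizing twice gives the same result as normalizing once."""
    once = unicodedata.normalize(form, text)
    return unicodedata.normalize(form, once) == once

# Illustrative sample: precomposed e-acute (U+00E9) and the "fi" ligature (U+FB01).
sample = "caf\u00e9 \ufb01"
counts = code_point_counts(sample)
stable = all(is_idempotent(sample, form) for form in FORMS)
```

The counts differ because NFD splits the accented letter into a base
letter plus a combining accent, and the compatibility forms (NFKC, NFKD)
also expand the ligature into two letters.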

Assuming the path is standards-compliant (standard XML processors),
I've found it doesn't matter if the two ends use different normal
forms (e.g. one end executes NFC normalization before <rtt/> encode
and the other end converts to NFD after <rtt/> decode), as long as the
real-time message is unaffected in transit.
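
A rough sketch of that case, assuming a much-simplified model where a
text insertion is just (text, position in code points).  The
insert_text helper is hypothetical and stands in for applying a <t/>
action element; it is not a real XEP-0301 implementation.

```python
import unicodedata

def insert_text(buffer, text, p):
    """Insert text at code point index p (simplified stand-in for a <t/> action)."""
    return buffer[:p] + text + buffer[p:]

# Sender normalizes to NFC before encoding, i.e. outside the codec chain.
sender_buffer = unicodedata.normalize("NFC", "cafe\u0301")  # 4 code points
action = ("s", len(sender_buffer))                          # add "s" at p=4

# Transit leaves the text untouched, so the recipient holds the same code points.
recipient_buffer = sender_buffer
sender_buffer = insert_text(sender_buffer, *action)
recipient_buffer = insert_text(recipient_buffer, *action)

# After decoding, the recipient may convert to NFD for its own purposes.
recipient_display = unicodedata.normalize("NFD", recipient_buffer)
```

Both buffers stay in sync because the position was counted and applied
against identical code points; the NFD conversion happens only after
the action has been applied.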

About XML parser / server / network driven normalization:
--- XML processors do normalize attribute values (out of necessity, for
comparing attributes), but they do not modify the normal format of the
Unicode strings within tags.  Real-time text is transmitted in the
inner text of a <t/> element, so any normal format (NFC, NFD,
NFKC, NFKD) is acceptable as long as subsequent action elements use
the same normal format (e.g. NFC for one <rtt/> element and NFD for
the next <rtt/> element would be a big "no-no") ...
--- Some XML parsers do provide the ability to turn on normalization
of Unicode text (as a flag), so either that has to be disabled, or you
simply normalize first (to the same normal format, e.g. NFC), so that
parser-driven normalization has no effect.   If we specify NFC in
XEP-0301, that mention won't help when the XML parser is currently
configured to normalize to a different normal format (e.g. NFD).   So
naming a form is a plus and a minus at the same time; if I am given a
choice, I prefer not to mention a specific normal form.
--- In actual practice, by default, most XML parser libraries do not
normalize automatically for you (at least without developer consent)
--- I've found XMPP servers don't normalize on the server side
(jabber.org / talk.l.google.com / OpenFire).   If there are different
servers that each convert to a different specific normal form, then
that is bad for XEP-0301 interop of mid-message edits anyway.  Though
if I had to mention anything, then NFC normalization is probably the
one I should mention -- though that would still be affected by any
servers that decide to convert to NFD / NFKC / NFKD.
(If such servers exist, then -- bad, server, bad, bad!)
--- TCP/IP routing doesn't modify normalization.  That would be
tantamount to packet content modification, which is largely a big
no-no anyway, and beyond scope of XEP-0301.
--- It's not "the end of the world" if there's a normalization
catastrophe, since two methods cause quick recovery of any
normalization-messed-up edits (the Message Reset and the <body/>
delivery), and normalization concerns can essentially be bypassed
altogether with "Basic Real Time Text".   Although that is not a good
excuse to rely on such mechanisms (originally designed for backwards
compatibility and good user experience during MUC/simultaneous login),
it's worth pointing this out, and further experience by multiple
vendors can tighten up the standard in regards to normal formats
during the Draft stage.
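
To illustrate the failure case and the recovery discussed in the points
above, here is a sketch under stated assumptions.  The rude_hop function
is a hypothetical intermediary that rewrites in-transit text to NFD (I
have not seen a real server do this), and insert_text is a simplified
stand-in for applying a <t/> action element.  The Message Reset is
modelled as simply replacing the recipient's buffer with the full text.

```python
import unicodedata

def insert_text(buffer, text, p):
    """Insert text at code point index p (simplified stand-in for a <t/> action)."""
    return buffer[:p] + text + buffer[p:]

def rude_hop(text):
    """Hypothetical intermediary that rewrites in-transit text to NFD."""
    return unicodedata.normalize("NFD", text)

# First action: the sender's NFC text arrives decomposed, one code point longer.
first = "caf\u00e9 au lait"
sender = insert_text("", first, 0)
recipient = insert_text("", rude_hop(first), 0)

# Second action: mid-message insert of "s" at p=4, counted on the sender's buffer.
sender = insert_text(sender, "s", 4)
damaged = insert_text(recipient, rude_hop("s"), 4)  # lands before the combining accent

# Message Reset: the whole message is retransmitted and replaces the buffer.
recipient = rude_hop(sender)
```

After the second action the accent attaches to the inserted "s" instead
of the "e", so the two ends disagree.  After the reset, the recipient's
text is canonically equivalent to the sender's again, even though the
hop still decomposes it.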

Based on the above, my conclusion is that it is NOT necessary to
specify the normal format to use -- just that normalization should
occur outside of the RTT codec chain (encoding on sender, to decoding
on recipient).
 -- I feel the advantage of being normalization-agnostic outweighs the
risk of the chain doing its own normalization.   Senders can use any
Unicode normal format (NFC, NFD, NFKC, NFKD) before encoding the
<rtt/> element, and as long as the channel is standards-compliant
(standards-compliant XML parser that doesn't modify normal forms
inside inner text), it's not going to be converted to a different
normal format.   It'll come out intact on the other end (at least on
all XEP-0301-functioning XMPP chains I've ever tried).

Note: I do realize that some wording tweaks are needed to make the
Unicode stuff readable to a wider audience.  But I still
think it's not necessary to specify a normal form.  (Although I
*could* mention that NFC is the preferred normal form at the
sender-side, since any XML parsers and XMPP servers that decide to do
'rude' normalization will usually normalize to the most
bandwidth-compact normalization format -- which is NFC -- but I've not
even seen this happen in the real world.)

There are pros and cons to mentioning a specific normal form.
The best move is to not specify a normalization format (NFC, NFD, NFKC,
NFKD), but if a format has to be mentioned for senders during the
pre-RTT-encode step, I'd say "SHOULD be NFC" -- since experience shows
it does not need to be REQUIRED.

Thanks
Mark Rejhon


