[Standards] XEP-0301 Real-Time Text: Unicode normalization, bidirectional, right-to-left text, etc. -- Comments needed

Kurt Zeilenga Kurt.Zeilenga at Isode.COM
Mon Jul 2 18:45:36 UTC 2012


On Jul 2, 2012, at 11:10 AM, Mark Rejhon wrote:

> Hello --
> XEP-0301 already supports all known forms of Unicode allowed to be transmitted in a <body/> over XMPP.
> This even includes bidirectional Unicode, too.   No protocol changes are needed. 
> 
> However, I need to rewrite Section 4.5.4 "Ensuring Accuracy of Attribute Values" to improve its explanations.
> http://xmpp.org/extensions/xep-0301.html#ensuring_accuracy_of_attribute_values
> 
> The tricky challenge is trying to explain it all in the fewest possible words.
> I would appreciate your comments.   (Ignore the wiki-style tags, a side effect of the software I use to edit my XEP)

Seems to basically address the issues raised so far on the list (a couple of minor editorial comments below)...

But I wonder if the XEP needs to say something about edits that change valid text into valid text, but where the edit itself might be invalid text.  Consider: if the user replaces a single glyph in a message, is it allowed to send just the code points that changed, or is it necessary to send all the code points of each glyph that was changed?  That is, consider the text "tschuss" and a change that adds a diaeresis over the 'u'.  Using decomposed characters, that is a change of U+0075 to U+0075,U+0308.  Is it okay to send RTT which inserts U+0308, instead of RTT which replaces U+0075 with U+0075,U+0308?
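
To make the question concrete, here is a minimal Python sketch of the two candidate edits. It assumes the real-time message is modelled as a sequence of code points, and `insert_text` and `replace_text` are hypothetical helpers for illustration only, not anything defined by the XEP. It shows that both edits yield the same final text, while only the first one transmits a lone combining mark.

```python
import unicodedata

def insert_text(message: str, text: str, p: int) -> str:
    """Hypothetical helper: insert text at code point position p."""
    return message[:p] + text + message[p:]

def replace_text(message: str, text: str, p: int, n: int) -> str:
    """Hypothetical helper: replace n code points starting at position p."""
    return message[:p] + text + message[p + n:]

before = "tschuss"  # the 'u' is at code point index 4

# Edit A: send only the new code point, an orphaned combining diaeresis.
edit_a = insert_text(before, "\u0308", 5)

# Edit B: resend every code point of the changed glyph.
edit_b = replace_text(before, "u\u0308", 4, 1)

# Both edits leave the recipient with the same decomposed text.
same_result = edit_a == edit_b

# The payload of edit A on its own is just a combining mark.
payload_is_combining = unicodedata.combining("\u0308") != 0
```

Either way the recipient ends up with the same eight code points; the open question is only whether the payload of edit A is acceptable on the wire.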


> 
> __________
> 
> 4.5.4 Ensuring Accuracy Of Attribute Values
> 
> Real-time text is generated based on text normally allowed to be transmitted within the <body/> element.
> 
> Incorrectly calculated p and n values may lead to inconsistencies between the sender and recipient during real-time editing. The Unicode characters of the real-time text need to make it transparently from the sender to the recipient, without further Unicode character modifications. This covers the whole chain, from the sender's creation of real-time text to the recipient's processing of real-time text. Transparent transmission of Unicode characters is possible with sender pre-processing, as long as the transmission from the sender to the recipient remains standards-compliant, including compliant XML processors and compliant XMPP servers.
> 
> Any inconsistencies that occur during real-time text editing (e.g. non-compliant XMPP server that modifies messages) will recover during the next [[Message Reset]], and also via [[[Basic Real-Time Text]]].
> 
> 
> 
> 4.5.4.1 Guidelines For Senders
> 
> Senders MUST generate real-time text based on the plain text version of the sender's message with all processing completed. Processing includes Unicode normalization, conversion of emoticon graphics to text, removal of illegal characters, line-break conversion, and all other Unicode character modifications. This MAY be done internally in parallel to the sender's displayed version of the message (e.g. graphics, formatting, {{XEP-0071}}).
> 
> For the purpose of calculating n and p values, line breaks MUST be treated as a single character, if line breaks are used within real-time text. It is noted that conversion of line breaks into a single LINE FEED U+000A is REQUIRED for XML processors, according to section 2.11 of {{{XML}}}.

It seems odd to me to use REQUIRED in a note to the reader.

> 
> 
> 
> 4.5.4.2 Guidelines For Recipients
> 
> For recipients, p and n are calculated relative to real-time text obtained from a compliant XML processor, before any further Unicode character modifications. (This includes recipient-side Unicode normalization. In an ideal and compliant scenario, normalizing an already normalized Unicode string will result in no character modifications, and will not cause any issues.) Recipients MUST NOT do Unicode normalization (or any other code point modifications) on their internal copy of the real-time message, for accurate processing of subsequent action elements. A copy of this real-time message is processed separately for display.
> 
> Note that [[Element <t/> – Insert Text]] is allowed to contain any subset sequence of Unicode characters from the real-time message. This may result in certain situations where the text transmitted in <t/> elements is allowed to be temporarily an incorrectly-formed Unicode string (e.g. orphaned standalone combining mark, orphaned direction-change character for bidi Unicode, etc.) but becomes correct when inserted into the middle of the recipient's real-time message, and passes recipient validation/normalization with no character modifications. Note that a compliant XML processor does not modify or fix Unicode errors caused by taking only a subset of characters from correctly-formed Unicode text. One alternative way for implementers to visualize this is to treat the Unicode text as an array of individual code points, and treat the p and n values accordingly.
> 
> 
> 
> 4.5.4.3 Unicode Character Counting
> 
> For platform-independent interoperability, calculations of p and n values MUST be based on Unicode code points. Counts of Unicode code points are equal to counts of UTF-8 encoded characters (not bytes) for platforms operating entirely in UTF-8.
> 
> However, different platforms use different internal Unicode encodings (i.e. string format), which may be different from the transmission encoding (UTF-8) for XMPP. Consider these factors:
> 
> 	• Multiple Unicode code points may represent one displayable Unicode glyph (e.g. combining marks).
> Action elements operate on Unicode code points, not on displayable character glyphs.
> 
> 	• Characters U+10000 through U+10FFFF are single code points, but are represented as multiple surrogate code units in certain Unicode encodings (e.g. UTF-16).
> Action elements operate on Unicode code points, not on individual surrogate code units.
> 
> 	• Some Unicode encodings use a variable number of bytes per Unicode character (e.g. UTF-8).
> Action elements operate on Unicode code points, not on individual bytes.
> 
> Incorrectly calculated p and n values may cause scrambled text during real-time message editing for many languages. This scrambled text persists until full message delivery, or [[[Message Reset]]]. From the perspective of p and n values, a real-time message is treated as equivalent to an editable array of Unicode code points, even if not necessarily stored as such.
> 
> Any existing Unicode text direction can be used (right-to-left, left-to-right, and bidirectional). Length and position values (p and n) are relative to the internal Unicode text of the real-time message, independently of the directionality of actual displayed text.
> 
> 
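
As a rough illustration of the three counting pitfalls listed in 4.5.4.3, the Python sketch below compares code point counts with UTF-16 code unit counts and UTF-8 byte counts. The function names `counts` and `code_point_to_utf16_offset` are made up for this example and are not part of the XEP. Python strings are already indexed by code point, which is why `len(text)` stands in for the p and n basis here.

```python
def counts(text: str) -> dict:
    """Return the length of text in three different units."""
    return {
        # Basis for p and n: one entry per Unicode code point.
        "code_points": len(text),
        # UTF-16 code units: code points above U+FFFF take two.
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 bytes: one to four per code point.
        "utf8_bytes": len(text.encode("utf-8")),
    }

def code_point_to_utf16_offset(text: str, p: int) -> int:
    """Map code point position p to an offset in UTF-16 code units.

    A platform that stores strings as UTF-16 would need a conversion
    like this before applying a received p value to its own buffer.
    """
    return len(text[:p].encode("utf-16-le")) // 2

# One displayed glyph, two code points (u + combining diaeresis).
decomposed = counts("u\u0308")

# One code point outside the BMP, two UTF-16 surrogate code units.
astral = counts("\U0001F600")

# Position 2 in code points is offset 3 in UTF-16 code units.
offset = code_point_to_utf16_offset("a\U0001F600b", 2)
```

On a platform whose native strings are UTF-16, using the code unit count directly as p or n would be off by one for every character above U+FFFF that precedes the edit position.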



