[Standards] XEP-0301 Real-Time Text: Unicode normalization, bidirectional, right-to-left text, etc. -- Comments needed

Mark Rejhon markybox at gmail.com
Mon Jul 2 18:10:46 UTC 2012


Hello --
XEP-0301 already supports all known forms of Unicode allowed to be
transmitted in a <body/> over XMPP, including bidirectional Unicode.
No protocol changes are needed.

However, I need to rewrite Section 4.5.4 "Ensuring Accuracy of Attribute
Values" to improve its explanations.
http://xmpp.org/extensions/xep-0301.html#ensuring_accuracy_of_attribute_values

The tricky challenge is trying to explain it all in the fewest possible
words.
I would appreciate your comments. (Ignore the wiki-style tags; they are a
side effect of the software I use to edit my XEP.)

__________

4.5.4 Ensuring Accuracy Of Attribute Values

Real-time text is generated based on text normally allowed to be
transmitted within the <body/> element.

Incorrectly calculated *p* and *n* values may lead to inconsistencies
between the sender and recipient during real-time editing. The Unicode
characters of the real-time text need to pass transparently from the
sender to the recipient, without further Unicode character modifications.
This applies to the whole chain, from the sender's creation of real-time
text to the recipient's processing of real-time text. Transparent
transmission of Unicode characters is possible with sender pre-processing,
as long as the transmission from the sender to the recipient remains
standards-compliant, including compliant XML processors and compliant XMPP
servers.

Any inconsistencies that occur during real-time text editing (e.g. a
non-compliant XMPP server that modifies messages) will be corrected at the
next [[Message Reset]], and also via [[[Basic Real-Time Text]]].

4.5.4.1 Guidelines For Senders

Senders MUST generate real-time text based on the plain text version of the
sender's message with all processing completed. Processing includes Unicode
normalization, conversion of emoticon graphics to text, removal of illegal
characters, line-break conversion, and all other Unicode character
modifications. This MAY be done internally in parallel to the sender's
displayed version of the message (e.g. graphics, formatting, {{XEP-0071}}).

For the purpose of calculating *n* and *p* values, line breaks MUST be
treated as a single character, if line breaks are used within real-time
text. Note that conversion of line breaks into a single LINE FEED U+000A
is REQUIRED for XML processors, according to section 2.11 of {{{XML}}}.
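
As an illustration only, the sender-side processing described above could
be sketched in Python as follows. The function name preprocess_for_rtt is
hypothetical, and NFC is an assumed choice of normalization form (the text
above does not mandate a particular one). The illegal-character filter
follows the Char production of XML 1.0.

```python
import re
import unicodedata

# Anything outside the XML 1.0 "Char" production is removed.
_ILLEGAL_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def preprocess_for_rtt(text):
    """Return the plain text that p and n values are calculated against.

    Hypothetical helper: the name and the choice of NFC are assumptions.
    """
    # Line-break conversion: every line break becomes one LINE FEED (U+000A),
    # so it counts as exactly one character.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Removal of characters that are illegal in XML 1.0.
    text = _ILLEGAL_XML_CHARS.sub("", text)
    # Unicode normalization is done by the sender, before any counting.
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    processed = preprocess_for_rtt("line1\r\nline2")
    print(len(processed))  # the line break counts as one character
```

Running all processing first, and only then deriving *p* and *n* from the
result, keeps the sender's counts aligned with what a compliant XML
processor hands to the recipient.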

4.5.4.2 Guidelines For Recipients

For recipients, *p* and *n* are calculated relative to real-time text
obtained from a compliant XML processor, before any further Unicode
character modifications. (This includes recipient-side Unicode
normalization. In an ideal and compliant scenario, normalizing an already
normalized Unicode string will result in no character modifications, and
will not cause any issues.) Recipients MUST NOT do Unicode normalization
(or any other code point modifications) on their internal copy of the
real-time message, so that subsequent action elements are processed
accurately. A copy of this real-time message is processed separately for
display.

Note that [[Element <t/> – Insert Text]] is allowed to contain any subset
sequence of Unicode characters from the real-time message. As a result, the
text transmitted in a <t/> element is allowed to be temporarily an
incorrectly-formed Unicode string (e.g. an orphaned standalone combining
mark, or an orphaned direction-change character for bidi Unicode) that
becomes correct when inserted into the middle of the recipient's real-time
message, and passes recipient validation/normalization with no character
modifications. Note that a compliant XML processor does not modify or fix
Unicode errors caused by taking only a subset of characters from
correctly-formed Unicode text. One way for implementers to visualize this
is to treat the Unicode text as an array of individual code points, and to
treat the *p* and *n* values accordingly.
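
The "array of code points" view can be sketched in Python, whose str type
is already indexed by code point. This is a minimal illustration, not a
reference implementation: insert_text is a hypothetical helper, and
treating *p* as a 0-based code point index is an assumption of the sketch.

```python
import unicodedata


def insert_text(message, text, p):
    """Insert text into message at code point position p.

    Hypothetical helper; p is assumed to be a 0-based code point index.
    """
    return message[:p] + text + message[p:]


if __name__ == "__main__":
    # The <t/> payload is a lone combining acute accent: on its own it is
    # an orphaned combining mark.
    payload = "\u0301"
    print(unicodedata.combining(payload) != 0)

    # Inserted after "x", it forms a well-formed sequence. NFC leaves it
    # unchanged because "x" + U+0301 has no precomposed form.
    message = insert_text("xy", payload, 1)
    print(unicodedata.normalize("NFC", message) == message)

    # If the text had not been normalized by the sender, recipient-side
    # normalization would change the code point count and shift later
    # p values: "e" + U+0301 composes to the single code point U+00E9.
    unnormalized = insert_text("ey", payload, 1)
    print(len(unnormalized), len(unicodedata.normalize("NFC", unnormalized)))
```

The last case shows why the internal copy is left untouched: normalizing
it would shorten the array that subsequent action elements index into.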

4.5.4.3 Unicode Character Counting

For platform-independent interoperability, calculations of *p* and *n*
values MUST be based on Unicode code points. Counts of Unicode code points
are equal to counts of UTF-8 encoded characters (not bytes) for platforms
operating entirely in UTF-8.

However, different platforms use different internal Unicode encodings (i.e.
string format), which may be different from the transmission encoding
(UTF-8) for XMPP. Consider these factors:

   - Multiple Unicode code points may represent one displayable Unicode
     glyph (e.g. combining marks).
     *Action elements operate on Unicode code points, not on displayable
     character glyphs.*

   - Characters U+10000 through U+10FFFF are single code points, but are
     represented as multiple surrogate code units in certain Unicode
     encodings (e.g. UTF-16).
     *Action elements operate on Unicode code points, not on individual
     surrogate code units.*

   - Some Unicode encodings use a variable number of bytes per Unicode
     character (e.g. UTF-8).
     *Action elements operate on Unicode code points, not on individual
     bytes.*
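
The three factors above can be made concrete with a short Python sketch
that counts the same text in code points, UTF-16 code units, and UTF-8
bytes. The helper name count_units is hypothetical; only the code point
count is the one relevant to *p* and *n*.

```python
def count_units(text):
    """Count one string three ways (hypothetical illustration helper)."""
    return {
        # Python's len() on str counts Unicode code points.
        "code_points": len(text),
        # Each UTF-16 code unit is two bytes; surrogate pairs take two units.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # UTF-8 uses one to four bytes per code point.
        "utf8_bytes": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    # One displayed glyph, two code points ("e" plus combining acute).
    print(count_units("e\u0301"))
    # One code point outside the BMP (U+1F600): a surrogate pair in UTF-16.
    print(count_units("\U0001F600"))
```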

Incorrectly calculated *p* and *n* values may cause scrambled text during
real-time message editing for many languages. This scrambled text persists
until full message delivery, or [[[Message Reset]]]. From the perspective
of *p* and *n* values, a real-time message is treated as equivalent to an
editable array of Unicode code points, even if it is not necessarily stored
as such.
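
On platforms whose native strings are indexed by UTF-16 code units, a
cursor offset would need to be converted to a code point index before it
is used as a *p* value. The following Python sketch shows one way such a
conversion might look. The function name is hypothetical, and mapping an
offset that falls inside a surrogate pair to the following code point is
an arbitrary choice made for this sketch.

```python
def utf16_offset_to_code_point_index(text, utf16_offset):
    """Convert a UTF-16 code unit offset into a code point index.

    Hypothetical helper: text is a sequence of code points (a Python str),
    utf16_offset is what a UTF-16 based platform would report.
    """
    units_seen = 0
    for index, char in enumerate(text):
        if units_seen >= utf16_offset:
            return index
        # Code points above U+FFFF occupy two UTF-16 code units.
        units_seen += 2 if ord(char) > 0xFFFF else 1
    return len(text)


if __name__ == "__main__":
    text = "a\U0001F600b"  # 3 code points, 4 UTF-16 code units
    for offset in (0, 1, 3, 4):
        print(offset, utf16_offset_to_code_point_index(text, offset))
```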

Any existing Unicode text direction can be used (right-to-left,
left-to-right, and bidirectional). Length and position values (*p* and *n*)
are relative to the internal Unicode text of the real-time message,
independently of the directionality of actual displayed text.