[Standards] XEP-0301 Real-Time Text: Unicode normalization, bidirectional, right-to-left text, etc. -- Comments needed

Peter Saint-Andre stpeter at stpeter.im
Tue Jul 3 19:06:07 UTC 2012

On 7/2/12 3:35 PM, Mark Rejhon wrote:
> On 2012-07-02 2:45 PM, "Kurt Zeilenga" <Kurt.Zeilenga at isode.com> wrote:
>> But I wonder if the XEP needs to say something about changes from valid
>> text to valid text where the edit itself might be invalid text?
>> Consider: if the user replaces a single glyph in the message, is it
>> allowed to send just the code points that changed, or is it necessary to
>> send all the code points of each glyph that was changed?  That is,
>> consider the text "tschuss", which is then changed to add a diaeresis
>> over the 'u'.  Using decomposed characters, that is a change of U+0075
>> to U+0075,U+0308.  Is it okay to send RTT which inserts U+0308 instead
>> of replacing U+0075 with U+0075,U+0308?
> Either way is allowed, though all my implementations use a "sends
> differences only" methodology, with success on all public XMPP servers
> tried so far.
> It is already covered in the second paragraph of the rewritten Section
> "Guideline for Recipients" shown below:
> Note that [[Element <t/> – Insert Text]] is allowed to contain any
> subset sequence of Unicode characters from the real-time message. This
> may result in certain situations where the text transmitted in <t/>
> elements is allowed to be temporarily an incorrectly-formed Unicode
> string (i.e. orphaned standalone combining mark, orphaned
> direction-change character for bidi Unicode, etc.) but becomes correct
> when inserted into the middle of the recipient's real-time message, and
> passes recipient validation/normalization with no character
> modifications. Note that a compliant XML processor does not modify or
> fix Unicode errors caused by taking only a subset of characters from
> correctly-formed Unicode text. One alternative way for implementers to
> visualize this, is to visualize the Unicode text as an array of
> individual code points, and treat the p and n values accordingly.
>
> A minor edit to clarify this for multiple characters forming one
> glyph is to add "incompletely formed glyphs" to the list in the
> parentheses.  Would that make sense?

Do you mean multiple code points forming one character? I still find the
use of the term 'glyph' confusing here and would prefer to leave it out
if possible, because it doesn't seem that we're really talking about
"The actual, concrete image of a glyph representation having been
rasterized or otherwise imaged onto some display surface." I think it's
best if RTT talks about characters and code points.
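
To make the code point framing concrete, here is a minimal sketch of the
"tschuss" case discussed above. It is not taken from the XEP or from any
implementation. insert_text() is a hypothetical helper, and its pos
argument only loosely stands in for a position counted in code points.
The sketch relies on Python strings being indexed by code point.

```python
import unicodedata


def insert_text(message: str, text: str, pos: int) -> str:
    """Hypothetical helper: insert `text` before code point index `pos`."""
    # Python str is a sequence of code points, so plain slicing matches
    # the "array of individual code points" view of the message.
    return message[:pos] + text + message[pos:]


original = "tschuss"
fragment = "\u0308"  # COMBINING DIAERESIS alone: an orphaned combining mark

# The fragment is not a well-formed string by itself.
assert unicodedata.combining(fragment) != 0

# 'u' sits at index 4, so the mark is inserted at index 5, right after it.
edited = insert_text(original, fragment, 5)

# The result is the decomposed sequence U+0075 U+0308 in context.
assert [f"U+{ord(c):04X}" for c in edited[4:6]] == ["U+0075", "U+0308"]

# Eight code points, but seven characters once composed (NFC).
assert len(edited) == 8
assert unicodedata.normalize("NFC", edited) == "tsch\u00fcss"
assert len(unicodedata.normalize("NFC", edited)) == 7
```

Sending only the fragment works here because the position is counted in
code points. If it were counted in composed characters, sender and
recipient could disagree about where index 5 falls.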


Peter Saint-Andre
