[Standards] review of XEP-0301, sections 1-5 (Advice needed on Peter's comments)

Mark Rejhon markybox at gmail.com
Thu Aug 23 20:06:56 UTC 2012


Hello Peter,

Thanks for clarifying the unclear areas -- appreciated!
I do have some small further inquiries about Unicode handling:

On Thu, Aug 23, 2012 at 11:20 AM, Peter Saint-Andre <stpeter at stpeter.im>
wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 8/22/12 1:24 PM, Mark Rejhon wrote:
>> Hello,
>>
>> I've managed to address most of Peter's section 1-5 concerns.
>> However, for the remainder -- I need advice from anyone on
>> unaddressed parts of Peter's comments about XEP-0301 (
>> http://xmpp.org/extensions/xep-0301.html ) There are only five
>> major areas of clarifications I need relating to Peter's recent
>> comments.

[snip]

>> ******* CLARIFICATION #5 ******** Peter complimented that the
>> Unicode section was much better.
>>
>> http://xmpp.org/extensions/xep-0301.html#accurate_processing_of_action_elements
>>
>> However, suggestions of further clarifications are also welcome:
>>
>>>> OLD Multiple Unicode code points (e.g. combining marks,
>>>> accents) can form a combining character sequence.
>>>>
>>>> NEW Multiple Unicode code points (e.g. combining marks,
>>>> accents) can form a combining character sequence. In addition,
>>>> some combining character sequences (represented by multiple
>>>> code points) can be transformed into a visually equivalent
>>>> composite character (represented by a single code point), or
>>>> vice-versa (e.g., under Unicode normalization).
>>>
>>> [Comment & Change Made] That's true.  But as we already both
>>> know,
>
> But implementers might not.

Sending implementations might normalize, so receiving clients may receive
normalized text anyway.  That's why I wrote: "(However, recipients SHOULD NOT
assume this behavior from sending clients. See Guidelines for Recipients)."

Last sentence of 3rd paragraph, Section 4.7.2:
http://xmpp.org/extensions/xep-0301.html#guidelines_for_senders
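
To illustrate the point about normalization, here is a minimal Python sketch
(not from the spec) using the standard unicodedata module.  The two sample
strings are my own choices for illustration: one combining sequence has a
single-code-point composite form, the other does not, so NFC leaves it as
multiple code points.

```python
import unicodedata

# "e" + COMBINING ACUTE ACCENT (U+0301): a composite form exists (U+00E9).
composable = "e\u0301"
# "q" + COMBINING ACUTE ACCENT: Unicode defines no composite character for it.
uncomposable = "q\u0301"

nfc_composable = unicodedata.normalize("NFC", composable)
nfc_uncomposable = unicodedata.normalize("NFC", uncomposable)


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(nfc_composable))    # one code point after NFC
print(code_points(nfc_uncomposable))  # still two code points after NFC
```

A recipient therefore cannot count on every combining sequence arriving as
one code point, whether or not the sender normalizes.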

Receiving clients that are unable to render a combining character sequence
will just display it the same way they would for a normal <body/>.  A common
way for GUIs in messaging clients to handle unrecognized Unicode is to
display a sequence of blocks ([] [] []) or question marks (? ? ?), one
placeholder per Unicode code point.  So it looks exactly the same as when a
sending client transmits, in <body/>, a sequence of combining characters
that the recipient client does not recognize.
I'm speaking of pre-existing clients, of course --

You can't prevent senders from sending a sequence of combining characters
that the recipient may not recognize, so recipients that do not support it
will just display it as a sequence of unrecognized Unicode characters,
typically boxes/blocks/placeholder characters (whatever the operating
system supports).

I have observed that exactly the same thing happens with combining
characters in real time text (provided, no unexpected code point
modifications take place to internally-stored real-time messages -- as I
already specify in the specification).

Actual RealJabber testing, in a client that does differential encoding
(section 6.4.1 compliant), shows consistency of Unicode behaviour for
unrecognized Unicode for <body/> versus for <rtt/> -- unrecognized
characters are simply displayed using placeholder characters (the
appearance of the placeholder characters is implementation/platform
specific)

The Unicode.org NFC algorithm clearly specifies behaviour for
un-combinable sequences of code points (those that cannot be replaced by a
composite character), and this means senders can still potentially send
them, and recipients need to do something *minimum* --

e.g. minimum behaviour might be to treat code points like array elements to
be inserted into an array of code points -- and pass this string along to
the GUI, using the same rendering mechanism for <rtt/> that is normally used
to render <body/> -- voilà -- it results in exactly the same unhandled
Unicode character handling (placeholder characters) in recipients that do
not support a specific sequence.  So, text via <rtt/> will render the same
as text via <body/> -- including sequences of combining characters sent by
the sender.
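
A rough Python sketch of that minimum behaviour, under my own assumptions:
the function names insert_text and render and the position parameter are
hypothetical, not taken from the spec, and only insertion is shown.  The
message is kept as a list of code points, and the joined string is what
would be handed to the same rendering path used for <body/>.

```python
def insert_text(message, text, position=None):
    """Insert text into a list of code points at a code-point index.

    With no position given, the text is appended (an assumption made
    for this sketch).
    """
    if position is None:
        position = len(message)
    # list(text) splits a Python str into individual code points.
    message[position:position] = list(text)
    return message


def render(message):
    """Join the code points into the string passed to the GUI."""
    return "".join(message)


rtt_message = []
insert_text(rtt_message, "q")        # base character
insert_text(rtt_message, "\u0301")   # combining mark arrives separately
insert_text(rtt_message, "a", 0)     # insertion at an explicit position

body_text = "aq\u0301"               # the same text as sent in <body/>
print(render(rtt_message) == body_text)
```

No code point is altered or merged on the way in, so whatever the GUI does
with an unsupported sequence in <body/> it also does for <rtt/>.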

So in the ideal "I followed the spec properly" situation, <rtt/> shows no
different "unhandled Unicode" behaviour than <body/>.


>>> not all combining character sequences can be sent as a single
>>> composite character (e.g. single code point).   So I had hoped
>>> that was automatically implied, but I guess I have to teach more
>>> Unicode here, eh?  :-)
>
> Nothing is automatic and in specifications I prefer not to trust in
> the power of implication. :)

We can't stop senders from sending sequences of combining characters, not
even for <body/>.  Recipients that do not support them (for either <body/>
or <rtt/>) will simply fall back to the normal handling mechanisms for
unrecognized Unicode characters.  Ideally, <rtt/> should be no different
from <body/> behavior in this regard, and implementations that follow
differential encoding (Monitoring Text Changes Instead Of Key Presses) will
generally have exactly this behavior, even in recipients that do not
support valid sequences of combining characters.

Given this perspective, do I have to explain handling of unrecognized
sequences of Unicode combining characters?

>>> "Multiple Unicode code points (e.g. combining marks, accents) can
>>> form a combining character sequence. This can also occur in
>>> situations where there isn't a visually equivalent composite
>>> character of a single code point (e.g. when doing Unicode
>>> normalization)" Is this shorter version acceptable?
>
> No, because it's not as accurate.

See above.

>>> The standalone combining mark will never be displayed -- it's
>>> only during transmission.
>
> Only what during transmission?

The incomplete sequence would only exist during transmission (inside an
Insert Text <t></t> wrapper).
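
As a simplified stand-in for a differential encoder (this is not the
algorithm text of section 6.4.1, and diff_insert is a hypothetical name), the
Python sketch below compares the previous and current text and returns only
the inserted code points.  When the user adds an accent to an existing
sequence, the only thing transmitted is the lone combining mark.

```python
import unicodedata


def diff_insert(old, new):
    """Return (position, inserted) where new is old plus one insertion.

    Positions are counted in code points.  Raises ValueError if the
    change is not a pure insertion (deletions are out of scope here).
    """
    # Length of the common prefix.
    prefix = 0
    while prefix < len(old) and prefix < len(new) and old[prefix] == new[prefix]:
        prefix += 1
    # Length of the common suffix, not overlapping the prefix.
    suffix = 0
    while (suffix < len(old) - prefix and suffix < len(new) - prefix
           and old[-1 - suffix] == new[-1 - suffix]):
        suffix += 1
    if prefix + suffix != len(old):
        raise ValueError("change is not a pure insertion")
    return prefix, new[prefix:len(new) - suffix]


old_text = "e\u0323"          # valid two-code-point sequence
new_text = "e\u0323\u0301"    # valid three-code-point sequence
position, inserted = diff_insert(old_text, new_text)

# The transmitted text is a standalone combining mark.
print(position, [f"U+{ord(ch):04X}" for ch in inserted])
print(unicodedata.combining(inserted) != 0)
```

The recipient inserts that mark at the given position and ends up with the
complete sequence again, so the standalone mark is never displayed by itself.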

>>> See differential encoding according to section 6.4.1 (e.g.
>>> turning a valid two-character sequence into a valid
>>> three-character sequence, by transmitting only the combining mark
>>> detected by differential encoder algorithm in section 6.4.1)
>>>
>>> Perhaps I need to add an additional sentence to make this little
>>> tidbit clearer? If so, what do you suggest?
>
> Maybe just clarify that you're talking about "modifying a valid
> complete combining character sequence, to a new valid combining
> character sequence" -- that wasn't clear to me.

Will do.  Thanks!

Mark Rejhon