[Standards] UPDATED: XEP-0301 (In-Band Real Time Text) -- candidate for LAST CALL

Mark Rejhon markybox at gmail.com
Mon Jul 23 04:17:21 UTC 2012

(Peter, please go ahead and proceed to publish 0.5 I submitted ASAP -- so
that people have context when reading this.   Fixes can go into 0.6, if
deemed important)

On Sun, Jul 22, 2012 at 6:00 PM, Gunnar Hellström <
gunnar.hellstrom at omnitor.se> wrote:

>  11. Section 4.1, Example 1, Line 9 ,  make the text part "my Ju"  ,  so
>> that it is obvious that it is not about word by word transmission.
>> 12. Section 4.1, Example 1, Line 15 ,  make the text part "liet" only,
>> so that it is obvious that it is not about word by word transmission.
>   13. Section 4.2.2 event='new' third line.  change "display, and then
>> process" to "reception, and then process text and"    . Because we must not
>> assume that all applications display the text. "
>  11/12. Edit Deferred -- It is merely an introductory example. Also, if
> people chunk text instead of preserving key press intervals, then
> whole-word burst transmission is greatly preferred over broken-word burst
> transmission.
> But why do you want to confuse the reader with giving the impression that
> transmission is word-wise, when it is time-sampled in reality. I suggest to
> accept my edit proposal in order to not cause wrong impression what it is
> all about.

It's all a matter of perspective -- It is relative.  I suspect more than
half of people here would agree your suggestion is NOT simpler in this case
... Primary reason: My opinion is that the first introductory example MUST
be as simple as possible.  I think most would agree with me here.  There is
no wrong impression to convey here, because other subsequent examples are
self explanatory on what's allowed (breaking up text, turning things into
single keypresses, key press intervals, and the new example I added to
v0.5, really makes it much easier to understand key press intervals.).  But
the bottom line, it is an introductory example, and the introductory
example must be as simple as possible to explain.
... Secondary explanation:  When displayed in forced-color-code XML on the
website (i.e. published at
http://www.xmpp.org/extensions/xep-0301.html)... the transmitted
real-time text words are no longer separately
color-highlighted like the draft copy in the Word version.  So the full
words are easier to pick out at a glance than fragmented words would be.
... Tertiary explanation: We need to view this specification from a less
experienced developer perspective. People who are less experienced with
protocols (we are protocol authors, other people are not), need to be able
to see the simplest possible example (see primary reason)
(Even if you only agree with one or two of the above reasons, that should
be good enough, no?)

18. Consider deleting the "Forward Delete" d action element. It cannot be
> used with the default value for p because that would point outside the
> real-time message. Therefore, a p must always be calculated and included.
> Then it is equal in complexity to use it as Backspace. Having both just
> seem to add complexity to implementations. ( It would have been different
> and of value if it worked from a current cursor position.)   But if you
> have good reasons, e.g. easily matching some editing operation result, you
> can keep it.
> 18. Edit deferred -- Explanation given in long email.
> Forward delete just introduces complexity. Since you do not have the
> concept of "current position" in the specification, a forward delete and a
> backspace of anything else than the last character are equally long in
> coding.  But, if you want to have these two codings of the same operation,
> I can accept it.

About complexity: It only adds 5 lines of complexity to the implementation:

About reasoning:
... Reason 1. There are situations where it made a lot of sense to have the
two separate, including recipient-side time-smoothed display which was
something you also suggested.  For example, <e n="5"/> can
be automatically converted to the equivalent <e/><e/><e/><e/><e/> for
time-smoothed display with the cursor animated backwards.  And <d p='10'
n='5'/> can automatically be converted to the equivalent <d p='10'/><d
p='10'/><d p='10'/><d p='10'/><d p='10'/> for time-smoothed display with
the cursor staying stationary.   If we merged the two, then we can't have
distinctive time-smoothed display of either. (As I recall, you're a strong
proponent of time-smoothed display)  But of course, it might not be that
important, even to you.
... Reason 2. Ability to do accurate journalling of edits, for emergency
purposes.  However, this reason can become moot, especially if we're not
using the 'n' argument, since a single-character backspace transmitted can
be indistinguishable from a single-character delete operation (even for
time-smoothed display).
... Reason 3. It slightly simplifies "Monitoring Key Presses Directly" for
http://xmpp.org/extensions/xep-0301.html#monitoring_key_presses_directly ...
(I know that's not the preferred method)
... Reason 4. It simplifies visualizing of text block deletes (i.e. cut
operations), since you're deleting from normal start position.
... There are other reasons.
I'd like comments from other people once the v0.5 is up on the page
(hopefully by tomorrow), so other people have context on what we're talking
about here.  Let's wait till v0.5 is up so there's context...
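
As a rough illustration of Reason 1, the expansion of a multi-character
backspace or forward delete into single steps can be sketched in Python.
This is a sketch only: the function names are hypothetical, and the
semantics of <e/> and <d/> are assumed from the description in this email
(backspace erases before position p, defaulting to the end of the message;
forward delete removes starting at position p), not copied from the
specification text.

```python
# Sketch only: semantics assumed from this email, not from the XEP text.
# A real-time message is treated as a sequence of Unicode code points,
# which is what a Python str already is.

def backspace(msg, p=None, n=1):
    """<e p=.. n=../>: erase n code points just before position p (default: end)."""
    if p is None:
        p = len(msg)
    start = max(0, p - n)
    return msg[:start] + msg[p:]


def forward_delete(msg, p, n=1):
    """<d p=.. n=../>: delete n code points starting at position p."""
    return msg[:p] + msg[p + n:]


def smooth_backspace(msg, n):
    """Replay <e n='N'/> as N single <e/> steps, yielding each intermediate frame."""
    for _ in range(n):
        msg = backspace(msg)
        yield msg


def smooth_forward_delete(msg, p, n):
    """Replay <d p='P' n='N'/> as N single <d p='P'/> steps (position stays fixed)."""
    for _ in range(n):
        msg = forward_delete(msg, p)
        yield msg


text = "0123456789abcdefghij"
e_frames = list(smooth_backspace(text, 5))           # edit point moves backwards
d_frames = list(smooth_forward_delete(text, 10, 5))  # edit point stays at 10
```

In both cases the last frame equals the result of the single batched
action, so a receiver could substitute the stepwise replay for display
purposes without changing the final text.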

19. Edit deferred -- Explanation given in previous email. It helps reader
>> associate WHICH definition of "character" we are using. Even the RFC's say
>> that the word has multiple interpretations, so it's appropriate here in the
>> title. The title is like a glossary entry, and the contents explain we're
>> using code points as the method of counting characters.
>  I still regard this dangerous and confusing. We are counting Unicode
> code points, and that needs to be clear in all explanations.

We will have to agree to disagree -- I think it's safer and less confusing:
Did you know there are 47 occurrences of the word "character" in the whole

Therefore, I prefer not to remove the word "Character" in the heading
"Unicode Character Counting".  Thus, it is like the heading of an extended *
glossary* definition here -- and it is in my opinion safer and less
confusing.   Obviously, the section is too big to move to the glossary
section, but I am open to alternate ideas of defining the word "character"
from this mailing list.
For this, I defer to public comment (once 0.5 is up).

  20. Edit deferred -- I didn't like adding the paragraph either, but
> following your suggestion will complicate implementations.  If I do your
> suggestion, it will no longer be easy to do "Monitoring Message Changes
> Instead Of Key Presses"
> http://xmpp.org/extensions/xep-0301.html#sending_realtime_text because I
> would no longer be able to treat the real-time message as easily as if it
> was essentially "an array of code points". You are a strong advocate of
> this method too, and I'm sure you agree with me you don't want to
> complicate section 6.4.1
> I think that typing of characters resulting in a multiple of code points
> will result in these code points being submitted to display at the same
> time, and therefore easily can be put into the same <t/> element.  This is
> valid for example for the combining diacritical marks 300 -36F, that
> normally are displayed together with their base character.
> http://unicode.org/charts/PDF/U0300.pdf
> Usually nothing is displayed on the sending side until both have been
> typed.

That is generally true, but there are situations where a single letter is
refreshed on the sender end as it gains additional combining marks.  You
are familiar with this too, I imagine: A valid displayable glyph (or
"displayable character", as some call it) rendered by multiple Unicode code
points (for example, a standard Unicode character plus a single combining
diacritical mark, such as an umlaut mark) gets expanded into a more complex
displayable glyph by the insertion of a second combining diacritical mark
(such as a grave).   Many environments require you to specify all marks
beforehand to output the character all at once, but in some environments,
the displayable character is immediately redisplayed after each added
combining mark, in order to visually show the progress of adding multiple
combining marks to the same displayed glyph.   It can be a feature built
into a textbox field that cannot be overridden, at least for certain
diacritic operations.

In this specific situation, if you are doing "Monitoring Message Changes
Instead Of Key Presses", that method would detect that the only
changed code point is a single combining character.  This would be
transmitted by itself.  That single code point is a valid transmission
within an Insert Text action element <t>X</t> where X would be a single
Unicode code point of a combining character (without any accompanying
text), being inserted into the destination string to add an additional
diacritic mark to an existing displayed glyph.   This is a valid operation
that can realistically occur.
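
The situation above can be reproduced with a small Python sketch of a
message-change monitor. This is only an illustration under stated
assumptions: diff_code_points is a hypothetical helper using a simple
common-prefix/common-suffix comparison, not the algorithm from the
specification, and the message is treated as an array of code points
(a Python str).

```python
import unicodedata


def diff_code_points(old, new):
    """Return (position, removed_count, inserted_text) between two messages.

    Hypothetical helper: trims the common prefix and common suffix and
    reports whatever differs in between, counted in Unicode code points.
    """
    limit = min(len(old), len(new))
    prefix = 0
    while prefix < limit and old[prefix] == new[prefix]:
        prefix += 1
    suffix = 0
    while suffix < limit - prefix and old[-1 - suffix] == new[-1 - suffix]:
        suffix += 1
    removed = len(old) - prefix - suffix
    inserted = new[prefix:len(new) - suffix]
    return prefix, removed, inserted


# "u" + COMBINING DIAERESIS, then the sender's text box adds a COMBINING GRAVE.
before = "u\u0308"
after = "u\u0308\u0300"

position, removed, inserted = diff_code_points(before, after)

# unicodedata.combining() returns a non-zero combining class for combining marks.
is_lone_combining_mark = len(inserted) == 1 and unicodedata.combining(inserted) != 0
```

Here the only change detected is the single code point U+0300 inserted at
position 2, so a sender built this way would end up with a <t/> element
whose entire content is one combining mark.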

Putting both in the same t-element simplifies for both the transmitter and
> the receiver. The receiver does not need to handle an outstanding
> combinable diacritical mark waiting for its base character.
> There would also be no risk that text in edits combine in an erroneous way
> with already existing code points, before next message arrives containing
> the correct second half of the character.
> So, keeping combined characters together is a good goal and simplification
> and should be adviced with a "SHOULD".

I do agree that the normative "SHOULD" is reasonable, though that adds an
additional sentence to the paragraph.  This can be considered during public
comments, or can even wait until LAST CALL.  I'd like to hear feedback from
others about section "Accurate Processing of Action Elements", once 0.5 is

> Yes, good to distinguish between service discovery, and activating
> support.
> There is something missing in a sentence in version 0.4, chapter 5.
> In order for an application to determine whether an entity supports this
> protocol, where possible it SHOULD use the dynamic, presence-based profile
> of service discovery defined in .
> What was your intention after "in"?

I don't see the error.  It refers to XEP-0115.   Perhaps it is a browser
cache issue -- try the Refresh button at
It should say "In order for an application to determine whether an entity
supports this protocol, where possible it SHOULD use the dynamic,
presence-based profile of service discovery defined in Entity Capabilities
(XEP-0115) [14 <http://xmpp.org/extensions/xep-0301.html#nt-id229147>]. "

This is actually a copy-and-paste from another XEP, and is consistent
with how most XEPs treat XEP-0115, so "Determining Support" in v0.4
and v0.5 is more consistent with other XEPs.  You will observe that I still
include XEP-0085 style implicit discovery, however, since it must
work in all situations that chat states work in.

In version 0.4,  section 6.2 looks complex and need further restructuring
> now before I can judge the final result of the protocol.

Section 6.2 is actually rather simple if you interpret it from the
flexibility of choice:
I should point out:
- Activation and Deactivation is optional, as I mentioned in the first
- Some implementers definitely require activation/deactivation, including a
method that can be similar to the activation of audio/video.
- Other implementers (like you) need it always active at all times.
- If you support it, you don't need to support all activation methods --
just one. (i.e. "Accept after confirm" is okay, or even "Accept" only).
 This is just a list of suggested activation methods; you can support just
one, two, or three of those methods -- it is not a suggestion to support ALL
activation methods!  (Is this the part you're getting confused by?  If so,
I can adjust the wording to clarify.)
- Technically, implementers can do anything (and implementers have asked for
the ability to do so) -- it can be a button, it can be a menu, it can be a
preferences/option, etc.  We don't strictly define what's allowed and
what's not allowed, or how they do the UI -- section 6.2 simply
points out general business rules of activation/deactivation and how it
affects the protocol.

Even so, it's not even part of the Protocol section, since it's a purely
optional section -- you can ignore this section and immediately begin
transmitting <rtt/> if Section 5 Determining Support permits you to do so.
 That's it.   For best interoperability, you can also listen for incoming
<rtt/> in the absence of Determining Support (i.e. an invisible contact
sending you an <rtt event='init'/>) -- the same as an invisible contact sending
you an XEP-0085 Chat State without revealing themselves first and without
providing disco first.  (As you know already in previous emails, XEP-0085
section 5.1 also allows implicit discovery by sending a single chat state
to signal support for chat states.)
So essentially that means you are allowed to just simply begin immediately
transmitting <rtt/> upon either (1) Determining Support permits you to do
so, or (2) If you receive incoming <rtt/> elements (that single <rtt
event='init'/> implicit discovery mentioned in section 6.2) ...
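
The two conditions above can be written down as a very small Python sketch.
It is only an illustration: the class and attribute names are hypothetical,
and how "Determining Support" arrives at its answer is left outside the
sketch and represented by a plain boolean.

```python
from dataclasses import dataclass


@dataclass
class RttPeerState:
    """Per-contact state deciding whether we may start sending <rtt/>."""

    disco_permits: bool = False      # outcome of Section 5 "Determining Support"
    seen_incoming_rtt: bool = False  # any <rtt/> received, e.g. <rtt event='init'/>

    def on_incoming_rtt(self):
        # Implicit discovery: the peer has shown it speaks the protocol.
        self.seen_incoming_rtt = True

    def may_transmit(self):
        # Either condition (1) or condition (2) is sufficient.
        return self.disco_permits or self.seen_incoming_rtt


# An invisible contact: no disco information is available at first.
invisible_contact = RttPeerState()
allowed_before = invisible_contact.may_transmit()
invisible_contact.on_incoming_rtt()
allowed_after = invisible_contact.may_transmit()
```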

There is implementer demand -- I am relieved to be including section 6.2 in
XEP-0301 because many implementers have talked to me about *their own
ideas* of activation/deactivation methods, which might not
interoperate very well.  By covering general "business rules" for
activation/deactivation methods, for those implementers that need real-time
text to be activatable/deactivatable in a manner that they need to
implement it in, I can ensure interoperability with other clients that
implement a different kind of activation method.   I realize this may be a
hassle for those accessible clients that assume real-time text should
always be attempted and always be turned on, but such implementers
have to see it as a necessary evil and accept that other implementers
may not want to do that -- they might want an accept/reject mechanism of
some kind.   Section 6.2 simply provides generic business rules of
activation/deactivation which, if followed, will eliminate situations of
non-interoperability (i.e. two willing clients that decide not to talk to
each other).   It will interop between clients that always activate
immediately (like what I described to you for assistive programs), and
clients that choose to use an activation method.   All combinations will
work, even in situations where contacts are in private mode/invisible (just
like XEP-0085 chat states will work, too).

Technically, even though I've strengthened "Determining Support" including
XEP-0115 for parity with other XEPs, the spec still allows you to do
implicit discovery (a la XEP-0085 Chat State style) so that it works in all
situations that XEP-0085 works in, using XEP-0301 as an enhanced "Typing"
chat state in some implementations.  Peter and Kevin have indicated that
this is acceptable, although I know Matt had some misgivings.  (I did,
however, make it much closer in spirit to XEP-0085.)

I'd like them to review 0.5 when it is up -- I just emailed it to Peter
yesterday, so any changes will have to go into 0.6, but I want more public
comment by multiple people on the gray areas we are still talking about.

Mark Rejhon