[Standards] Comments on XEP-0301 (possible impact on -0308 in Section 4.2.3)

Mark Rejhon markybox at gmail.com
Sat Aug 4 10:38:08 UTC 2012

On Fri, Aug 3, 2012 at 4:54 PM, Paul E. Jones <paulej at packetizer.com> wrote:
> Mark, et al,

Excellent comments.  (Now replying to remainder of unreplied comments)

> Section 1:
> "and is favored by deaf and hard of hearing individuals who prefer text
> conversation"
> I suggest you strike the above. What I have been....

It is a compromise between the accessibility folks (and the need to
get this spec included in accessibility documents, such as Access
Board), and the mainstream folks (including you).  It is hard to
satisfy both sides.   I will investigate further toning down, since
Peter also agrees with you, but it's a balancing act.   I agree it has
broad utility.

> I would suggest these slight changes:
> "Real-time text is suitable for smooth and rapid communication,
> complementing the existing en bloc mode for sending text messages.

I would not use the term "en bloc" as it's not a phrase common in XMPP
terminology.  Perhaps "message-by-message mode for sending text
messages" is more appropriate.

> Section 2:
> "Next Generation 9-1-1 / 1-1-2 emergency services"
> This leaves out so many countries; very America/Europe centric.  See
> http://en.wikipedia.org/wiki/Emergency_telephone_number.  Should we just get
> rid of "9-1-1 / 1-1-2"?  Point is you want this for next generation
> emergency services anywhere in the world.  I would not capitalize "Next
> Generation", either.

Good point.  If Peter agrees with you, I'd genericize it but will still
mention the example.  Such as:
- "Text messaging to next-generation emergency services (such as 9-1-1
and 1-1-2, etc.)"
Or even just (e.g. 9-1-1).

> Section 4:
> Showing <body> not be included in the previous message containing <rtt> in
> Example 1 might lead people to believe this is expected. I would suggest
> making the first example one that had <body> in the end, since I suspect
> that will be the typical case.  Perhaps a word about this somewhere might be
> useful (if not already covered).

It's valid to do it either way.
Sending <body/> together with the final <rtt/>, or separately from
it -- both behaviors are valid.

If you're typing fast and then hitting Enter quickly (before the 700ms
interval is up), it's quite efficient to send the last <rtt/> in the
same stanza as the <body/>.  You could even send two consecutive
stanzas rapidly instead (one each for the <rtt/> and for the <body/>).
If you're typing slowly, and hit Enter seconds after finishing
the message, your last <rtt/> can easily be sent several seconds
before the final <body/>.

Comments?  Peter, Paul?  Any spec clarifications warranted here?

> Section 4.2.1:
> Why is "seq" only 31 bits?  Since the same memory is consumed for 31 or 32
> bits, why not just makes it an unsigned 32-bit integer?  And why worry about
> wrap-around?  I would allow it to occur.  Specify the behavior.

I used to define it, but that required more complex wording,
because I had to accommodate languages that don't have easy
unsigned integers (e.g. Java doesn't have a native unsigned integer
type).  Syncing up wraparound behaviours is not worthwhile when
Message Resets occur once every 10 seconds.  When you're transmitting
<rtt/> every 0.7 seconds, you've incremented only about 15 times in
that period.  I don't see situations where incrementing would cause a
wraparound within a human lifetime, unless you deliberately set the
seq value very close to MAXINT.

That's why I thought it was simpler to just skip defining a wraparound.
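
To put rough numbers on this, here is a back-of-the-envelope sketch in
Python.  The 0.7 second interval and the 10 second reset period are the
figures above; the constant names are purely illustrative, and the
"never resets" case is a hypothetical worst case rather than expected
behavior.

```python
import math

TRANSMIT_INTERVAL_S = 0.7   # <rtt/> transmission interval (from the text above)
RESET_PERIOD_S = 10         # Message Reset period (from the text above)
SEQ_VALUES = 2 ** 31        # number of distinct values in a 31-bit seq

# How many times seq is incremented between two consecutive Message Resets.
increments_per_reset = math.ceil(RESET_PERIOD_S / TRANSMIT_INTERVAL_S)

# Hypothetical worst case: a sender that starts at 0, transmits <rtt/>
# continuously at the interval above, and never restarts its seq value.
seconds_to_wrap = SEQ_VALUES * TRANSMIT_INTERVAL_S
years_to_wrap = seconds_to_wrap / (365.25 * 24 * 3600)

print(increments_per_reset)
print(round(years_to_wrap, 1))
```

So the counter moves by roughly 15 between resets, and even a sender
that never reset would need decades of uninterrupted typing to run
through 31 bits.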

I welcome alternatives though.  Peter, Paul, comments?

> Section 4.4:
> "be approximately 0.7 second" -> " be approximately 0.7 seconds"
> I would even suggest saying 700ms, as I think that reads better.
> Section 4.5.1:
> "Wait n thousandths of a second."
> I would prefer "wait n milliseconds", especially since the wait time might
> be 2300ms or more, for example.
> Section
> "Support the transmission" --> "Supports the transmission"
> Section
> "Support the behavior of Backspace" --> "Supports the behavior of backspace"
> Section
> Suggest changing:
> "Allow the transmission of intervals, between real-time text actions, to
> support the pauses between key presses."
>     To:
> "Allow for the transmission of intervals between real-time text actions to
> recreate pauses between key presses."
> "Wait n thousandths of a second" --> "Wait n milliseconds"
> Section 4.7:
> " non-compliant servers that modifies messages" --> " non-compliant servers
> that modify messages"
> Section 4.7.2:
> "line breaks MUST be treated as a single character, if line breaks are used
> within real-time text."
> -->
> "any line breaks MUST be treated as a single character."

Peter seems to agree with you on all of the minor edits above,
so I'll add them to my to-do list for LC (unless I'm
asked to do a 0.7 before LC).

> Section 4.5.2:
> "default value of n MUST be 1" -> "default value of n is 1"

Peter, shouldn't RFC 2119 normative language be used here?

> "For the purpose of this specification, the word "character" represents a
> single Unicode code point. See Unicode Character Counting."
> Shouldn't the above be moved to Section 3?

As part of the glossary?   That's an interesting idea.
Peter, do you have comment about defining character in a glossary item?
Example Glossary item:

character: For the purposes of this specification, "character"
represents a single Unicode code point.

Or do you think it's best to keep it in-scope with the beginning of
the "devil-in-the-details" which sections 4.5.1 and 4.5.2 slowly start
diving into?

> Question on this:
> "Also, if a Body Element arrives, pauses SHOULD be interrupted to prevent a
> delay in message delivery."
> Do you want to prevent a delay or realize a delay?  I believe you want the
> entire <rtt> element to be fully processed, including delays, before acting
> on <body>.  I'm not sure how to word that, but the above sentence was not
> clear to me.

Observe that I only use a "SHOULD", not a "MUST".
Actually, you do WANT to interrupt delays, because otherwise you're
lagging the body delivery.   Real-time text recipients should not be
penalized with a delay in final message delivery.  Otherwise you're at
a disadvantage to other clients not running real-time text.   Also,
the software might be already handling <body/> deliveries
synchronously (e.g. existing instant messaging software), and you
don't want to modify that logic when adding the real-time text
feature.  Otherwise you're adding buffering for <body/> and further
complicating the retrofitting of an existing instant messaging client
by modifying its <body/> timing logic where it's not really necessary
to do so...

That may occasionally mean the last few keypresses surge at the end
of a message, if the sender hits Enter quickly after finishing their
message.  But that's "fair" and not noticeable by most users, as the
largest surge is only 700 milliseconds worth of typing, which would be
equivalent to one word maximum.  No issue occurs if there's a pause
before the message is sent; people often take at least half a second
before hitting Enter anyway.   And people hit Enter less often
during real-time text.

Also, if you've got ping spikes and then get 2 or 3 seconds of
backlogged <rtt/>, you might be 2 or 3 seconds behind in seeing the
typing.  You'd rather see the typing surge to catch up after a
latency spike.    If your implementation does not have logic to catch
up on the fly (as mentioned in the last paragraph of section 6.5
http://xmpp.org/extensions/xep-0301.html#receiving_realtime_text ...),
then the <body/> catch-up will do it for you.

And...if you're using shorter intervals (e.g. 300ms, or even 100ms for
LAN-based XMPP), then the surge is pretty much unnoticeable for the
most part, since it often takes at least 100ms for someone to hit
Enter after finishing typing :-)

Or if you want your implementation to finish playing <rtt/> before
displaying <body/>, you can.
Or if you wish, you can even play back instantly (skip the remainder of
the <w/> elements) and then compare to <body/> to make sure the final
real-time message is identical to <body/>.

There are many ways to interpret this -- all still interoperable -- but
interrupting the pauses is obviously the fairest (everyone equally sees
the full body), the easiest, and the most retrofittable way (for
existing clients).
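
As a rough illustration of the "interrupt pauses when <body/> arrives"
behavior, here is a minimal receiver-side sketch in Python.  The class
and method names are hypothetical, and it only models appended text
(<t/>) and pauses (<w/>); erasing and mid-message insertion are left
out to keep it short.

```python
from collections import deque


class RttPlayback:
    """Receiver-side playback queue (hypothetical sketch, append-only)."""

    def __init__(self):
        self.text = ""          # real-time text displayed so far
        self.queue = deque()    # pending ("t", str) and ("w", int_ms) actions

    def enqueue(self, actions):
        """Queue the action elements of a newly received <rtt/>."""
        self.queue.extend(actions)

    def step(self):
        """Apply one queued action; return how many ms to pause afterwards."""
        if not self.queue:
            return 0
        kind, value = self.queue.popleft()
        if kind == "t":
            self.text += value
            return 0
        return value  # a <w/> pause

    def on_body(self, body):
        """<body/> arrived: skip the remaining pauses and finish at once."""
        while self.queue:
            kind, value = self.queue.popleft()
            if kind == "t":
                self.text += value
        matches = self.text == body   # optional consistency check
        self.text = ""                # the next real-time message starts empty
        return body, matches


playback = RttPlayback()
playback.enqueue([("t", "Hel"), ("w", 300), ("t", "lo")])
playback.step()                       # displays "Hel"
final, matches = playback.on_body("Hello")
```

The <body/> is displayed without waiting out the 300ms pause, and the
comparison flags the case where the real-time text had drifted from the
final message.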

> Section 6.3:
> Whether there is a visible cursor or not, the client has to take steps to
> render text properly.  Since a cursor is not something sent via the
> protocol, I see no point talking about it.  I'd remove this section.

It's increasingly realistic to just remove this section now that
I eliminated the <c/> elements by requiring senders to send empty <t/>
elements.   Implementers could technically implement this on their
own.   As you may remember, RealJabber has a remote cursor -- and
it is quite useful.
I'd still like to keep mention of a remote cursor somewhere, since
it's relevant from the perspective of why a client can choose to
transmit empty <t/> elements (solely for the purpose of a remote
cursor) -- so it can't be removed entirely.

Personally, I'd rather keep the section there, but perhaps shorten it
significantly (perhaps half its size or less).

> Section 6.4.4:
> I'm not sure what this is telling me.  Why is <t> and <e> "unsuitable for
> most general-purpose clients"? And why encourage a device to use reset
> rather than provide more complete support?  We know rendering is the bigger
> challenge, but receivers must accept what is sent.  I see no reason to
> suggest a sender be lazy.
> I'd suggest removing this section unless there is something here of high
> value that's going over my head.

A long 1.5-hour one-on-one talk with a future implementer (that
controls over 100 million users) revealed they would prefer this
algorithm.   Section 6.4.4 has a fairly high value as a result.
Append-only real-time text can still preserve key press intervals
while appending is being done.
The section can be removed while still allowing the implementer to use
that algorithm.
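
For illustration only, an append-only sender along those lines might
look like the Python sketch below.  The function name is hypothetical,
the event="reset" attribute is my reading of how XEP-0301 marks a
Message Reset, and the seq and xmlns attributes are left out for
brevity.

```python
from xml.sax.saxutils import escape


def encode_change(old_text, new_text):
    """Return an <rtt/> for a text change, or None if nothing changed.

    Appended text is sent as a plain <t/>; any other kind of edit falls
    back to retransmitting the whole message as a Message Reset.
    """
    if new_text == old_text:
        return None
    if new_text.startswith(old_text):
        appended = new_text[len(old_text):]
        return "<rtt><t>%s</t></rtt>" % escape(appended)
    # Assumption: a reset carries the full current text of the message.
    return '<rtt event="reset"><t>%s</t></rtt>' % escape(new_text)


print(encode_change("Hel", "Hello"))
print(encode_change("Helo", "Hello"))
```

The first call only appends, so it produces a small <t/>; the second is
a mid-message correction, so the sketch resends the whole text instead
of computing an erase.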

However, I should also note that I am a user of Sprint Captioned
Telephone at www.sprintcaptel.com, which uses proprietary HTML-based
append-only real-time text (voice recognition transcription).
Corrections are done by addendums rather than backspacing.   This is a
perfect second use case of append-only real-time text.   Also, a lot
of relay services (i711.com, relaycall.com, ip-relay.com, etc.)
use proprietary HTML-based real-time text that does not use
editing but only backspacing.   This is a THIRD perfect use case.
I'd rather such services (which I use, since relay services are the
only real way I can make phone calls to hearing people) implement
standards-based real-time text such as XEP-0301.

> Section 7.4.2:
> It seems that all of the examples show <w> used between every key
> press.  However, if sampling the input buffer (as recommended earlier in the
> text), one may not know the time between keystrokes.  Perhaps the device
> samples the buffer and sees:
> "a"
> "app"
> This would translate to:
> <rtt><t>a</t></rtt>
> <rtt><t>a</t><w n="100"/>pp</rtt>
> Right?

No...for "a" then "app"
1. When you sample first-pass, you've detected the addition of "a"
(difference between "" and "a")
2. When you sample second-pass, you've detected the addition of "pp"
(difference between "a" and "app")
3. The <w/> interval between steps 1 and 2 is the difference in system
time between step 1 and step 2

Which results in:
<rtt><t>a</t><w n="100"/><t>pp</t></rtt>

RealJabber samples on every text change event rather than at a 100ms
interval, so there'll be <w/> elements for each character.
Section 6.4.1 already says "In addition, if Preserving Key Press
Intervals is supported, then Element <w/> – Wait Interval records the
time elapsed between text change events."
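
The three sampling steps above can be sketched in Python roughly as
follows.  The class name and the millisecond timestamps passed in are
hypothetical, only appended text is handled, and the seq and xmlns
attributes are omitted.

```python
from xml.sax.saxutils import escape


class AppendSampler:
    """Turn successive samples of the input buffer into action elements."""

    def __init__(self):
        self.last_text = ""
        self.last_time_ms = None   # time of the previous detected change
        self.actions = []

    def sample(self, text, now_ms):
        """Compare the buffer with the previous sample and record the change."""
        if text == self.last_text:
            return
        if not text.startswith(self.last_text):
            raise ValueError("this sketch only handles appended text")
        if self.last_time_ms is not None:
            # <w/> is the system time elapsed between the two detected changes.
            self.actions.append('<w n="%d"/>' % (now_ms - self.last_time_ms))
        appended = text[len(self.last_text):]
        self.actions.append("<t>%s</t>" % escape(appended))
        self.last_text = text
        self.last_time_ms = now_ms

    def flush(self):
        """Wrap the buffered actions in one <rtt/> for transmission."""
        xml = "<rtt>" + "".join(self.actions) + "</rtt>"
        self.actions = []
        return xml


sampler = AppendSampler()
sampler.sample("a", 0)       # first pass: "" -> "a"
sampler.sample("app", 100)   # second pass: "a" -> "app", 100ms later
rtt = sampler.flush()
print(rtt)
```

With these two samples the sketch produces the same element as shown
above, with "pp" wrapped in its own <t/>.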

> Related to <w>, suppose I type "h" and then "e" with about a 100ms delay.
> Further, suppose the IM client's 700ms timer fires and sends "h" on the wire
> like this:
> <rtt><t>h</t></rtt>
> Now, the client restarts the 700ms timer, after which time it sends:
> <rtt><w n="100"/><t>e</t></rtt>
> Is this correct?  So, there was a 700ms "collection" delay, some message
> transmission delay (perhaps 100 or 200ms) and then an artificial delay
> inserted of 100ms.  So, between "h" and "e", the user might actually wait
> 700+200+100 = 1000 milliseconds?
> Or, does the receiver maintain a running clock and as soon as the message
> arrives, it sees that w=100, but it's internal "wait timer" is already at
> 700+200ms, so it displays "e" immediately?  (I assume this is the case and
> it should be described.)

That would be a poor implementation.
Two better implementations:

1. (PREFERRED, RealJabber method) If you use a resetting timer (a timer
that restarts after an idle period), your timer restarts upon the
first keypress, so both keypresses 100ms apart will fit in the same
<rtt/>, rather than being split into separate <rtt/> elements.


2. If you have a proper synchronous timer implementation (independent
of the first keypress after idle), and you type "h" near the end of a
700ms interval, you are going to end up with:

<rtt><w n="650"/><t>h</t></rtt>

Then in the next 700ms cycle, you transmit:

<rtt><w n="50"/><t>e</t></rtt>

In both situations, you're going to have the same 700ms lag (assuming
previous action elements in <rtt/> were already buffered).
Regardless, in both of these correct scenarios, "h" will be displayed
100ms before "e"
(assuming stable network ping :-)
I didn't think I needed to explain these scenarios in the spec....but should I?
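
For what it's worth, the synchronous-timer scenario (method 2) can be
checked with a small Python sketch.  The function names, the tuple
representation of action elements, and the fixed latency are all
hypothetical simplifications; the receiver model simply plays each
<rtt/> as soon as it arrives and the previous one has finished.

```python
from xml.sax.saxutils import escape


def batch_synchronous(keypresses, interval_ms=700):
    """Group (time_ms, char) keypresses into one batch per timer interval.

    Returns a list of (send_time_ms, actions); actions are ("w", ms) and
    ("t", char) tuples, with waits measured from the interval start or
    from the previous keypress in the same interval.
    """
    batches = []
    for time_ms, char in keypresses:
        index = time_ms // interval_ms
        send_time = (index + 1) * interval_ms
        if not batches or batches[-1][0] != send_time:
            batches.append((send_time, []))
            cursor = index * interval_ms   # start of this interval
        actions = batches[-1][1]
        if time_ms > cursor:
            actions.append(("w", time_ms - cursor))
        actions.append(("t", char))
        cursor = time_ms
    return batches


def render(actions):
    """Serialize one batch of actions as an <rtt/> element (no seq/xmlns)."""
    parts = ['<w n="%d"/>' % v if k == "w" else "<t>%s</t>" % escape(v)
             for k, v in actions]
    return "<rtt>" + "".join(parts) + "</rtt>"


def playback_times(batches, latency_ms=0):
    """Return (char, display_time_ms) pairs for a simple receiver model."""
    shown, free_at = [], 0
    for send_time, actions in batches:
        clock = max(send_time + latency_ms, free_at)
        for kind, value in actions:
            if kind == "w":
                clock += value
            else:
                shown.append((value, clock))
        free_at = clock
    return shown


batches = batch_synchronous([(650, "h"), (750, "e")])
for _, actions in batches:
    print(render(actions))
print(playback_times(batches, latency_ms=200))
```

Under these assumptions the two <rtt/> elements match the ones shown
above, and the receiver displays "e" 100ms after "h" whatever the
(stable) latency is.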


Mark Rejhon
