[Standards] Use of XEP-0198 resumption under adverse network conditions

Dave Cridland dave at cridland.net
Wed Nov 4 11:46:58 UTC 2020

Hey all,

We (that is, myself and others from Forward Clinical Ltd, my employer) have
been doing some extensive work to support high latency networks such as
Satellite Links, in relation to our work with UK Defence Medical Services.
Our "long thin" links cover the C2S link.

We believe these findings are more generally useful than just SATCOM - in
particular, we think these will help with the adverse network conditions
found in hospitals (where people keep putting in lifts and lots of cables,
giving lots of blackspots), and general applicability with mobile use of

TL;DR: When the session has a ping timeout, do push notifications, but
otherwise leave it open - mobile clients will often recover after several
minutes have passed.

We assume that established sessions may be in several connectivity states
from the point of view of the server, typically:

"Live" - a session is genuinely live and can be used for communication.
"Unresponsive" - the session has a TCP connection associated with it, but
it is unresponsive to pings etc.
"Resumable" - the session has no TCP session, but 198 resumption was
negotiated and the session remains available.
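
As a rough illustration, the three states above can be modelled like this.
It is only a sketch: the Session fields and the classify() helper are
hypothetical names invented for this example, not taken from any server.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class SessionState(Enum):
    LIVE = "live"
    UNRESPONSIVE = "unresponsive"
    RESUMABLE = "resumable"


@dataclass
class Session:
    # All three fields are illustrative assumptions.
    has_tcp: bool             # a TCP connection is still associated
    answers_pings: bool       # the last ping was answered in time
    resumption_enabled: bool  # XEP-0198 resumption was negotiated


def classify(session: Session) -> Optional[SessionState]:
    """Return the connectivity state, or None if no session remains."""
    if session.has_tcp:
        if session.answers_pings:
            return SessionState.LIVE
        return SessionState.UNRESPONSIVE
    if session.resumption_enabled:
        return SessionState.RESUMABLE
    return None
```

Without negotiated resumption, losing the TCP connection leaves nothing to
classify, which is why classify() can return None.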

We expect that the majority of servers will immediately move a session
detected as unresponsive into the resumable state by closing the TCP
session, and starting a (relatively short) timeout.

In the process of doing so, unacknowledged stanzas will be processed for
push notifications etc as needed, and errors will be sent as appropriate.
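
A minimal sketch of that conventional handling, for comparison with what
follows. TrackedSession, on_unresponsive() and the 60 second default are
all assumptions made up for this example; real servers will differ, and
error generation is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrackedSession:
    tcp_open: bool = True
    state: str = "unresponsive"
    # Stanzas sent but not yet acknowledged via XEP-0198.
    unacked: List[str] = field(default_factory=list)
    resume_deadline: Optional[float] = None


def on_unresponsive(session: TrackedSession, now: float,
                    resume_timeout: float = 60.0) -> List[str]:
    """Conventional handling: drop TCP and start a short resumption timer.

    Returns the unacknowledged stanzas, which the caller would then
    process for push notifications.
    """
    session.tcp_open = False
    session.state = "resumable"
    session.resume_deadline = now + resume_timeout
    return list(session.unacked)
```

Note that the TCP connection is gone as soon as the ping times out, so a
client that recovers later can only resume, not carry on.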

Through network analysis (and "thanks" to a bug in the server which caused
some useful logging), we were able to examine not only when sessions went
into the unresponsive state, but also when the client subsequently sent
traffic on that session. This often happened well after the session had
fallen into the resumable state - this resulted in an error, as the session
had been closed.

Having seen the result of this in the logging of the server, we followed up
by looking for the same logging output on the production system, where the
majority of users are using WiFi or 4G within hospitals. Coverage is often
poor, and the WiFi overused, so clinicians often operate on a weak 4G
signal, or highly contended WiFi. Think FOSDEM.

Again, we observed clients recovering sometimes well after the ping timeout
had triggered. Had these clients been able to, they could have continued to
use the same TCP session without any disruption (or, for that matter, any
additional RTTs re-establishing).

The usual approach here seems to be to increase the timeout required to
move a session from "live" to "unresponsive" when pinged. However, this has
the effect of delaying push notifications while the session is, in effect,
in limbo.

Our proposal is that when a session is found to be unresponsive, the server
starts sending push notifications for unacknowledged (and future) messages,
but otherwise leaves the TCP session open instead of moving the session to
the resumable state. Only after a significantly longer timeout should the
TCP session be terminated (at which point the session is destroyed
entirely).
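
The proposal amounts to two separate timeouts, which might be sketched as
below. The Policy class, the phase names and both timeout values are
placeholders chosen for illustration; we are not recommending specific
numbers here.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    LIVE = "live"                  # normal delivery over TCP
    UNRESPONSIVE_PUSHING = "push"  # TCP kept open, push notifications sent
    DESTROYED = "destroyed"        # TCP closed, session destroyed


@dataclass(frozen=True)
class Policy:
    ping_timeout: float = 30.0       # hypothetical: silence before pushing
    teardown_timeout: float = 600.0  # hypothetical: silence before teardown


def phase_for(silent_for: float, policy: Policy = Policy()) -> Phase:
    """Map seconds of client silence to the handling phase."""
    if silent_for < policy.ping_timeout:
        return Phase.LIVE
    if silent_for < policy.teardown_timeout:
        # Push notifications start here, but the TCP session stays open
        # so a recovering client can simply carry on using it.
        return Phase.UNRESPONSIVE_PUSHING
    return Phase.DESTROYED
```

With these placeholder values, a client that has been silent for 200
seconds is in the UNRESPONSIVE_PUSHING phase: it has been receiving push
notifications and can still reuse its existing connection.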

This means that a client recovering network after several minutes will find
the connection still live (in effect), whereas if it never recovers, it
will still get the push notifications in a timely manner.

There are likely to be downsides with this approach; in particular, presence
state will be badly affected. PSA could help here. Overall, though, we
believe that this will substantially improve the effective performance of
C2S over high latency, high contention links.

I hope this is useful!
