[Standards] Use of XEP-0198 resumption under adverse network conditions
Ruslan N. Marchenko
me at ruff.mobi
Wed Nov 4 14:00:15 UTC 2020
On Wednesday, 04.11.2020, at 11:46 +0000, Dave Cridland wrote:
> Due to network analysis (and "thanks" to a bug in the server which
> caused some useful logging), we were able to examine not only when
> sessions went into the unresponsive state, but also when the client
> subsequently sent traffic on that session. This often happened well
> after the session had fallen into the resumable state - this resulted
> in an error, as the session had been closed.
> Having seen the result of this in the logging of the server, we
> followed up by looking for the same logging output on the production
> system, where the majority of users are using WiFi or 4G within
> hospitals. Coverage is often poor, and the WiFi overused, so
> clinicians often operate on a weak 4G signal, or highly contended
> WiFi. Think FOSDEM.
> Again, we observed clients recovering sometimes well after the ping
> timeout had triggered. Had these clients been able to, they could
> have continued to use the same TCP session without any disruption
> (or, for that matter, any additional RTTs re-establishing).
> The usual approach here seems to be to increase the timeout required
> to move a session from "live" to "unresponsive" when pinged. However,
> this has the effect of delaying push notifications while the session
> is, in effect, in limbo.
> Our proposal is that when a session is found to be unresponsive, the
> server starts sending push notifications for unacknowledged (and
> future) messages, but otherwise leaves the session live when
> resumable. Only after a significantly longer timeout should the TCP
> session be terminated (and at that point destroy the session
This matches my observations as well. If the session is not too active,
TCP recovery is instant: the snd/rcv buffers are flushed, then the
queues are flushed, and everything carries on as if nothing happened.
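For illustration, the server-side behaviour proposed in the quoted text
could be sketched roughly as below. This is only a sketch under
assumptions: the class, the state names, and both timeout values are
hypothetical, and acknowledgement handling is simplified to "any ack
clears the whole queue" rather than real XEP-0198 `h` counting.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    LIVE = "live"
    UNRESPONSIVE = "unresponsive"  # ping timed out, TCP session kept open
    CLOSED = "closed"              # TCP session terminated


@dataclass
class Session:
    ping_timeout: float = 30.0       # assumed value, in seconds
    teardown_timeout: float = 600.0  # assumed "significantly longer" value
    state: State = State.LIVE
    last_ack: float = 0.0
    unacked: list = field(default_factory=list)
    pushed: list = field(default_factory=list)  # stands in for a push service

    def deliver(self, stanza):
        """Queue a stanza; also push it if the client is not responding."""
        self.unacked.append(stanza)
        if self.state is not State.LIVE:
            self.pushed.append(stanza)

    def on_ack(self, now):
        """Client traffic seen on the still-open TCP session."""
        if self.state is State.CLOSED:
            return
        self.last_ack = now
        self.unacked.clear()  # simplification: treat everything as acked
        self.state = State.LIVE

    def tick(self, now):
        """Periodic check of how long the client has been silent."""
        silent = now - self.last_ack
        if self.state is State.LIVE and silent >= self.ping_timeout:
            # Start pushing, but leave the TCP session alone.
            self.state = State.UNRESPONSIVE
            self.pushed.extend(self.unacked)
        if self.state is State.UNRESPONSIVE and silent >= self.teardown_timeout:
            self.state = State.CLOSED
```

The point of the two separate timeouts is that push delivery starts
early while the TCP session is only given up much later, so a client
that comes back in between simply continues on the same connection.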
> This means that a client recovering network after several minutes
> will find the connection still live (in effect), whereas if it never
> recovers, it will still get the push notifications in a timely
> There are likely to be downsides with this approach; particularly
> presence state will be badly affected. PSA could help here. Overall,
> though, we believe that this will substantially improve the effective
> performance of C2S over high latency, high contention links.
I'm leaning towards ignoring all the timers altogether and caring only
about how this affects UX. If TCP is still holding up, let it be; if it
got EOF/EOS/Timeout (from either side), let's just do a resumption
reconnect. We're reconnecting continuously anyway.
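As a rough client-side sketch of that policy: application-level timers
are ignored, and only a transport failure triggers a XEP-0198 resume.
The event names and the `Client` class are hypothetical; the `<resume/>`
element with its `h` and `previd` attributes in the `urn:xmpp:sm:3`
namespace is the one defined by XEP-0198.

```python
import xml.etree.ElementTree as ET

# Hypothetical names for transport-level failures reported by the socket.
TRANSPORT_FAILURES = {"eof", "eos", "timeout"}


class Client:
    def __init__(self, previd):
        self.previd = previd  # stream management id from <enabled/>
        self.h = 0            # count of stanzas handled so far
        self.sent = []        # resume requests that would go on a new stream

    def resume_request(self):
        """Build the XEP-0198 <resume/> element as a string."""
        el = ET.Element(
            "resume",
            {"xmlns": "urn:xmpp:sm:3", "h": str(self.h), "previd": self.previd},
        )
        return ET.tostring(el, encoding="unicode")

    def on_event(self, event):
        """Return 'keep' to stay on the current TCP session, else 'resume'."""
        if event == "stanza":
            self.h += 1
            return "keep"
        if event in TRANSPORT_FAILURES:
            self.sent.append(self.resume_request())
            return "resume"
        # Anything else, e.g. an application-level ping timer, is ignored.
        return "keep"
```

With this, a sequence such as stanza, ping timer expiry, stanza, EOF
keeps the connection through the first three events and only sends a
resume request on the last one.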