[Standards] LAST CALL: XEP-0353 (Jingle Message Initiation)

Ralph Meijer ralphm at ik.nu
Tue Sep 3 14:14:49 UTC 2019

On 03/09/2019 15.02, Andrew Nenakhov wrote:
> Tue, 3 Sep 2019 at 17:14, Philipp Hancke <fippo at goodadvice.pages.de>:
>     0353 was explicitly designed for push (by not including the full
>     payload
>     due to size constraints) in conjunction with 0357 and should not go to
>     MAM (hence no body).
>     This has some sad consequences like the lack of a message acting as a
>     call data record in the users history.
> We're using Processing hints <store> element to make an archive store 
> such messages. https://xmpp.org/extensions/xep-0313.html#hints
>     The way 0353 is supposed to work is:
>     1) you are offline but have a push-enabled client
>     (there is the more interesting scenario where some clients of yours are
>     online but none does jingle... and you would need to send a push
>     notification to your offline client that does... that is a generic
>     issue
>     however)
>     2) you get a push notification with the <propose/> element and know the
>     sender's full JID, the session id and (FYI) the media types involved
>     3) your client requests that session at the sender. If that session
>     doesn't exist anymore the sender will respond with a message stanza
>     with
>     type=error and <item-not-found/> (and potentially the jingle
>     <unknown-session/>)
> No, I think push notifications do not work the way you describe. 
> XEP-0357 says that a published <notification/> MAY contain additional 
> custom information, however, all our implementations of Push 
> notifications assume that NO additional information is relayed through 
> third-party services (ejabberd, as I recall, doesn't even support 
> publishing such additional information). Thus, we get NO <propose/> in 
> push notifications, NO message text, nothing. Just notification to an 
> app that it should wake up and update its data. Consequently, the only 
> ways we can get this information are MAM and offline messages. Since 
> offline messages perform poorly when there are multiple user devices, it 
> leaves us with just MAM.
> I strongly oppose any suggestions to make push notifications work 
> differently. If we start sending payload about calls via FCM/APNS, why 
> stop at calls? We can just send full message text via push notifications 
> as Telegram does. And at that point, why mess with XMPP at all? An 
> FCM-only messenger can be coded in a week, it'll send and receive full 
> message text via FCM, store messages on FCM cloud database and all will 
> work admirably well.


Let me start out by saying that, like Philipp, my reading of the current
version of XEP-0353 is also that there is additional information shared
with the client over FCM/APNS. However, I must also say that neither
XEP-0353 nor XEP-0357 makes it clear how this should work exactly. This
is a problem, because it results in proprietary solutions for doing the
same thing. At minimum this should get another look and more concrete
examples of how existing push services would *actually* be used.

Andrew's description of their use of XEP-0353 introduces new elements in
the existing namespace, and adds a new iq-based exchange. Assuming this
is an experiment, I do want to point out that before it would go into
actual use, those would need to be moved to their own private namespace,
or be included in a next version of this specification (which might
require a new namespace).

With that all out of the way, let me describe what we did at VEON, as I
briefly alluded to in my presentation at the Summit [1]. Our approach
made calls server-mediated: the Jingle session-initiate was sent to
the callee's bare JID (their 'account'). The server then could send the
IQ on to online resources or via FCM/APNS. The latter notifications
actually did carry payload, including the media descriptions and
transport information (candidates).
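
As a rough illustration of the server-mediated routing described above,
here is a minimal Python sketch. It is not VEON's implementation; the
Account type, the route_session_initiate function and the device tokens
are all hypothetical names made up for this example. It also assumes
the server fans out to online resources and push devices at the same
time, which the description above does not spell out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Account:
    """Hypothetical view of a callee's account as seen by the server."""
    bare_jid: str
    online_resources: List[str]  # resource parts with a live XMPP session
    push_devices: List[str]      # opaque FCM/APNS device tokens


def route_session_initiate(account: Account, payload: str) -> List[Tuple[str, str, str]]:
    """Return (channel, target, payload) deliveries for a call to a bare JID.

    Assumption: online resources get the IQ over XMPP, and every
    registered push device gets the same payload in a notification.
    """
    deliveries = []
    for resource in account.online_resources:
        deliveries.append(("xmpp", f"{account.bare_jid}/{resource}", payload))
    for token in account.push_devices:
        deliveries.append(("push", token, payload))
    return deliveries


if __name__ == "__main__":
    callee = Account("juliet@example.org", ["laptop"], ["device-token-1"])
    for delivery in route_session_initiate(callee, "<jingle action='session-initiate'/>"):
        print(delivery)
```

The point of the sketch is only that the caller addresses the account,
and the server decides which devices hear about the call and over which
channel.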

The primary reason for doing it like this is speeding up the
negotiation. A receiving client can:

  * start evaluating the payload
  * configure the media library (in our case libwebrtc) to start media
streams on the device
  * re-establish the XMPP connection to the server
  * possibly request credentials for TURN
  * set up a TURN connection
  * retrieve vCards / avatars for the caller
  * etc.

In the meantime it can present the screen for allowing the user to
accept the call. As soon as the slow human user then presses 'accept',
the client is immediately ready to fully establish the call. This saves
several round-trips, and thus many centiseconds compared to XEP-0353 in
its current form, and even more if you have a model that relies on
re-establishing the XMPP connection before starting all of the above.
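
To make the timing argument concrete, here is a minimal asyncio sketch
of running the preparation steps listed above concurrently with the
wait for the user. The step names and durations are invented
placeholders, and the prepare and wait_for_accept coroutines merely
sleep; a real client would do actual network and media work there.

```python
import asyncio


async def prepare(step: str, seconds: float) -> str:
    """Stand-in for one preparation step; just sleeps, then reports its name."""
    await asyncio.sleep(seconds)
    return step


async def wait_for_accept(seconds: float) -> bool:
    """Stand-in for the user looking at the screen and pressing 'accept'."""
    await asyncio.sleep(seconds)
    return True


async def handle_incoming_call():
    # Start all preparation steps right away, without awaiting them yet.
    preparation = asyncio.gather(
        prepare("configure media library", 0.03),
        prepare("re-establish XMPP connection", 0.05),
        prepare("request TURN credentials", 0.02),
        prepare("retrieve caller vCard", 0.01),
    )
    # The human is the slow part; the preparation runs in the background.
    accepted = await wait_for_accept(0.1)
    # By now the preparation has usually finished already.
    done = await preparation
    return accepted, done


if __name__ == "__main__":
    print(asyncio.run(handle_incoming_call()))
```

With the payload already in the notification, none of these steps has
to wait for an extra request to the caller, which is where the saved
round-trips come from.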

I want to note that our use cases where the XMPP connection might have
high latency but the actual media flows are local (even within office

Obviously, we also ran into the issue that notification payloads have
size limits. Thiago Camargo wrote a specialized compression library,
Shogun, to tackle this. It relies on a pre-shared dictionary for better

[1] <https://test.ralphm.net/publications/xmpp_chat_voip/>


