Hi Goffi,
See inline comments. Sorry for the wall of text and if it overlaps with
one of the mails you wrote since I started writing this.
On Mon, 2024-05-20 at 16:51 +0200, Goffi wrote:
> There are many benefits to using CBOR:
> - It is smaller. While individual pieces of data may be tiny, the
>   cumulative amount is significant, and efficiency is crucial.
The cumulative saving is about 10-20% [1]. This isn't really a huge
improvement, and almost all events will fit into a single network-layer
frame anyway, further reducing the impact of encoding size.
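For what it's worth, the kind of measurement behind [1] boils down to
something like this (a TypeScript sketch, assuming the cbor-x npm
package; the event shape is made up for illustration):

```ts
import { encode } from "cbor-x";

// A made-up mouse event payload, just to compare encoded sizes.
const event = { type: "mousemove", x: 0.4231, y: 0.7763, buttons: 0 };

const jsonBytes = new TextEncoder().encode(JSON.stringify(event)).byteLength;
const cborBytes = encode(event).byteLength;

// The difference is a handful of bytes per event, which disappears
// once the event sits alone in a network frame anyway.
console.log({ jsonBytes, cborBytes });
```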
> - Segmentation is inherent in CBOR, so you always know if you have
>   all the data. This is beneficial for optimization and security.
Segmentation is also inherent to SCTP, the protocol WebRTC data
channels use to transfer content frames. There is no win in segmenting
the same segments twice.
> - Encoding and decoding CBOR are much more efficient, essential for
>   quick and efficient data processing, especially by low-resource
>   devices (like Arduinos).
Not untrue, but probably negligible given the resource use of IP, UDP,
DTLS and SCTP - all part of the protocol stack you're building on and
thus involved in every event to be processed. DTLS encryption in
particular is going to be much more resource-hungry than the
difference between a CBOR parser and a JSON parser. And notably, CBOR
encoding is not a native function in web browsers, so if the web is a
goal of this thing (and it seemingly is, given all the references to
web tech in the XEP), CBOR is probably not much better than JSON.
> > - If we define a protocol for remote control, I would prefer this
> >   to be a <message>-based protocol that can be used either using a
> >   traditional XMPP connection or via XEP-0247 Jingle XML Streams.
> Using server-based <message> would be highly inefficient. Why send
> gamepad data to the server, incurring delays and extra processing,
> when you can send it directly from your local network?
XEP-0247 Jingle XML Streams doesn't need to go via the server; it uses
Jingle just like your proposed protocol. While the XEP hasn't been
maintained for some time and makes odd references to other XEPs,
nothing in it forbids using it with WebRTC data channels. In fact, it
has recently been discussed as a useful tool for all kinds of things
(like initial device crypto setup or device-to-device MAM).
And of course, latency when sending via a server might be sub-perfect,
but it's very similar to the latency you would see if the network
environment requires a TURN server, which is one of the ways to use
Jingle. And as mentioned, there are valid use cases for input where
low latency is not that crucial. Think of keyboard input to a remote
shell - essentially what SSH does - which is not uncommonly routed
through proxies/tunnels that add latency. Of course, for game input,
drawing and 3D modeling, that's probably not an option. It depends a
lot on the use case, and that's why flexibility is very much a good
idea. Building something that is exclusively/primarily designed around
a web browser XMPP client connected via Jingle WebRTC data channels
doesn't sound like flexibility was part of the design.
> Regarding direct XML streams, CBOR is still more efficient.
> Additionally, the protocol is based on web APIs, and CBOR provides a
> direct mapping. Using XML would require reinventing the wheel.
Just as you can "directly" map data from JSON objects in a web browser
to CBOR, you can directly map them to XML. A direct mapping isn't
really a good idea in either case, though: for example, if you used
enumerated keys in CBOR instead of a string map, you could drastically
reduce the payload size and improve parsing speed.
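Something like this is what I mean (TypeScript sketch using the cbor-x
npm package; the key numbering is entirely made up, not from any spec):

```ts
import { encode } from "cbor-x";

// String-keyed map: every single event repeats "x", "y", "buttons"
// as text strings on the wire.
const verbose = { x: 0.5, y: 0.25, buttons: 1 };

// Integer-keyed map (a JS Map keeps the keys as numbers): each key
// encodes to a single CBOR byte. A plain array works just as well.
const compact = new Map<number, number>([[0, 0.5], [1, 0.25], [2, 1]]);

console.log(encode(verbose).byteLength);  // larger
console.log(encode(compact).byteLength); // smaller and faster to parse
```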
> The protocol described here is for input sending and potentially
> other features like clipboard sharing, gamepad, and haptic feedback.
> In combination with existing specifications, one use case can be
> remote desktop. The goal is to reuse existing XMPP building blocks
> to simplify implementation. That's what XMPP is for: coordinating
> specifications.
As I mentioned in another email: if you really feel like using RTP for
screen content transfer, you can always decide to use the RFB protocol
(or something else) only for the input part. I took it as an example
of an existing protocol that (among other features) has logic for
remote control input. Using RFB for screen transfer may be an adjacent
topic, but it is not a requirement.
> We already have an A/V transmission protocol. With WebRTC, it's
> extremely efficient regarding latency and bandwidth. It's suitable
> for remote desktop streaming, including robust network traversal
> mechanisms.
Network traversal is on a completely different layer than the protocol
used to transfer screen content (RTP vs. RFB). Nothing speaks against
running the RFB protocol over WebRTC data channels. Running RFB over
WebSockets in web browsers isn't well specified anywhere either, but
it is still widely deployed [2].
> And, as mentioned, the protocol comes from Web APIs because they are
> simple, well-documented, and provide a well-thought-out abstraction
> of the hardware.
Web APIs are designed around what browsers can reasonably do on the
machines they run on. That doesn't mean they are well thought out for
generic purposes.
I just played with
https://w3c.github.io/uievents/tools/key-event-viewer.html, and it's
still unclear to me, when pressing modifier keys, which events are
emitted when and what the supposed state of the modifier flag is for
those events. I found the behavior to be inconsistent between browsers
(and probably operating systems) and even between different keys in
the same browser. I bet this is not intended, but as the specification
and MDN don't really tell me what the correct behavior would be, I
can't really blame the browsers either.
What I learned is that, as a web developer, you must be prepared to
see modifier flags set without a keydown event having been emitted for
the press of the corresponding modifier key, and keyup events emitted
without a corresponding keydown event indicating the key was in fact
pressed.
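In practice that means a client feeding key events into a remote
control protocol has to reconcile the modifier flags on every event
instead of trusting keydown/keyup pairs. A defensive TypeScript sketch
(sendSyntheticKey is a hypothetical stand-in for the protocol's send
function):

```ts
// Re-read the modifier flags on every key event and synthesize the
// presses/releases we never got dedicated events for.
declare function sendSyntheticKey(key: string, pressed: boolean): void;

const MODIFIERS = ["Shift", "Control", "Alt", "Meta"];
const state = new Map<string, boolean>();

function reconcile(ev: KeyboardEvent) {
  for (const mod of MODIFIERS) {
    const now = ev.getModifierState(mod);
    if ((state.get(mod) ?? false) !== now) {
      state.set(mod, now);
      sendSyntheticKey(mod, now);
    }
  }
}

window.addEventListener("keydown", reconcile);
window.addEventListener("keyup", reconcile);
```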
So I definitely don't agree with the claim that something must be
well-documented and well-thought-out just because it comes from the
Web...
> Low latency is crucial for inputs, especially for devices like
> gamepads, touchpads, and mice. Even with keyboards, low latency can
> be important, for instance, when playing a game.
I'm not saying there aren't any cases where low latency is important;
where I disagree is that this is the case on all occasions. If you
don't have low-latency feedback from the remote device, low latency
for input is very likely not crucial.
Anyway, I remain unconvinced that the XSF is the place to specify a
remote control protocol from scratch (which is what sections 8 and 9
of the XEP are about), mostly because I feel the XSF does not have the
competence to do so (i.e. we will probably do things terribly wrong
due to lack of experience in the field).
That doesn't mean we don't need /something/ in XMPP to do the
signaling for whatever is used to send remote control events. And
using Jingle for this (be it with WebRTC data channels or any other
Jingle transport) totally makes sense for low latency.
--
There are a bunch of things I would suggest that are not related to
this discussion at all.
Instead of `<device type="keyboard"/>`, I would go with `<keyboard/>`,
allowing attributes to be added for more information where it fits
(e.g. a mouse could have an optional buttons attribute with the number
of buttons on the mouse, and a gamepad might want to provide its
layout, etc.). This also means that to add new devices outside this
specification, one can just use a
`<gamepad xmlns="urn:xmpp:remote-control:gamepad:0"/>` or similar. As
a general guideline, I feel attributes should only be used if the set
of possible values is finite.
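To illustrate (element names and namespaces here are hypothetical, not
taken from the XEP), a receiving client could then handle known devices
and skip unknown extensions the way XMPP does everywhere else:

```ts
// Hypothetical device listing following the suggestion above.
const xml = `<devices xmlns="urn:xmpp:remote-control:0">
  <keyboard/>
  <mouse buttons="5"/>
  <gamepad xmlns="urn:xmpp:remote-control:gamepad:0" layout="xbox"/>
</devices>`;

const doc = new DOMParser().parseFromString(xml, "text/xml");
for (const el of Array.from(doc.documentElement.children)) {
  if (el.namespaceURI === "urn:xmpp:remote-control:0") {
    console.log("known device:", el.localName, el.getAttribute("buttons"));
  } else {
    // Devices from other namespaces can simply be ignored by
    // implementations that don't know them.
    console.log("extension device:", el.namespaceURI, el.localName);
  }
}
```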
I would strongly opt not to make the use of data channels a SHOULD in
this protocol. It really doesn't matter for the purpose of this
protocol, and you don't want to have to upgrade it if a new transport
protocol becomes available that is a better fit. Jingle does the
abstraction of streaming vs. datagram precisely so that application
protocols don't need to deal with it.
There is a lot of specification for interaction with the Jingle RTP and
WebRTC protocols. This seems mostly unnecessary.
- You already write in the requirements that everything should work
  even without Jingle RTP.
- You state that one MUST use the same "WebRTC session" (what is that,
  even?) for both Jingle RTP and Remote Control. I don't see why. Of
  course, reusing existing sessions in Jingle often makes sense
  (that's why it's a feature), but it definitely doesn't need a MUST
  here.
- You write explicitly that Remote Control can be added with
  content-add to existing Jingle RTP sessions. This is already given
  by the Jingle specification, which doesn't limit what content can be
  added to a session (e.g. you can also add a file transfer to an
  existing call).
- You say that touch devices should not be used when no video RTP
  session is active. I don't see why this shouldn't be possible. I own
  a drawing tablet that doesn't have a screen but still is an absolute
  pointing device (aka "touch"). If that device was connected via
  XMPP, it wouldn't need an RTP session to transfer its input.
- You say that absolute mouse events should not be used when no video
  RTP session is active. I also don't see why this restriction is in
  place - same as above.
For both touch and mouse, you use x and y coordinates "relative to the
video stream". What does that mean? x and y are doubles, so are they
supposed to be relative to the screen, with only values between 0 and
1 (inclusive) being valid? If x and y are absolute values in pixels,
why are they doubles? And if they are pixel values, are they pixels of
the screen or pixels of the video (the video might use a lower
resolution than the actual screen)? I would suggest going with
relative values 0-1. If you want to use an absolute value in pixels, I
suggest making it screen pixels and signaling the screen dimensions
outside and independent of the RTP video resolution.
Wheel events don't have a screen coordinate. I'm pretty sure they
should have one, as the cursor position during the scroll matters a
lot.
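As a sketch of what I mean (TypeScript; sendPointer is a hypothetical
stand-in for the protocol's send function), the sender maps pointer
positions to 0-1 fractions of the shared screen, and wheel events
carry the same coordinates:

```ts
declare function sendPointer(x: number, y: number, deltaY?: number): void;

const video = document.querySelector("video")!;
const clamp = (v: number) => Math.min(Math.max(v, 0), 1);

// Map a position on the local <video> element to 0..1 fractions,
// independent of both the video resolution and the remote screen
// resolution; the receiver multiplies by its real screen size.
function toRelative(ev: MouseEvent): { x: number; y: number } {
  const rect = video.getBoundingClientRect();
  return {
    x: clamp((ev.clientX - rect.left) / rect.width),
    y: clamp((ev.clientY - rect.top) / rect.height),
  };
}

video.addEventListener("pointermove", (ev) => {
  const { x, y } = toRelative(ev);
  sendPointer(x, y);
});

// Wheel events include the cursor position, so the remote side knows
// which window the scroll applies to.
video.addEventListener("wheel", (ev) => {
  const { x, y } = toRelative(ev);
  sendPointer(x, y, ev.deltaY);
});
```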
If I understood correctly, you specify that a session is a screen
share session by adding a remote control content without any device.
This remote control content would thus effectively be unused, but
would still require the setup of a data channel. That doesn't seem
like good protocol design. The fact that a video is a screen share
should be communicated outside this specification, and this
specification should not be involved at all in such a case (as it's
not remote control). A remote control content without devices should
be invalid.
Marvin
[1] https://gist.github.com/mar-v-in/003bedfcafb9e49a6ba6083ae374088b
[2] https://github.com/novnc/noVNC/wiki/Projects-and-companies-using-noVNC