[Standards] Re: Proposed XMPP Extension: Jingle Remote Control

21 May 2024

Hi Goffi,

On Tue, 2024-05-21 at 12:47 +0200, Goffi wrote:
...
  I know that, I've just ruled out using <message> through the server as
  it has been proposed in another feedback.
Why do you rule that out? Because you don't see a purpose, when my
whole point is that I do see a purpose? Of course I can send whatever
CBOR/JSON you come up with as a base64 blob inside a <message> for my
use case, but then I wonder why not handle it in the first place.

...
  From a quick glance at the Wikipedia page, I see "In terms of
  transferring clipboard data, "there is currently no way to transfer
  text outside the Latin-1 character set".[5] A common pseudo-encoding
  extension solves the problem by using UTF-8 in an extended
  format.[2]: § 7.7.27", which makes me suspicious though.
RFB definitely is old, so these kinds of things are expected. And, while
I see that you added clipboard as a potential future extension, it
seems odd to complain that RFB has a suboptimal implementation of a
feature your proposed XEP currently doesn't have at all.

...
  One of the design goal of my proposal is to have something really
  simple and straightforward to implement.
RFB isn't really hard to implement either. And there are a ton of
implementations out there already.

...
  There is no modifier flag used in the specification. There is the key
  value, and the location number. From my tests, it's consistent and
  corresponds to the documentation for the browsers that I've tried
  (Firefox and Chromium).
I know that your specification doesn't transfer the modifier flags,
probably assuming they are superfluous. However, if your browser client
were to naively send the key events it receives as-is, without further
checking for plausibility, things would go wrong: I tested pressing the
keys that would logically result in the events meta down, control down,
control up, meta up, and here are the results on different browsers:
https://imgur.com/a/zVxDAVa

From what I understand, the state of keyup and keydown events in the
web API doesn't need to be consistent (e.g. there can be keydown
without keyup and vice-versa). Do we want the same behavior for this
protocol or something else?
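
If the answer is "something else", one option would be for the sending
side to enforce consistency before forwarding anything. Below is a
minimal sketch of that idea, assuming a key is identified by its key
value and location number as in the proposal. The class and method
names are hypothetical, and dropping repeated keydowns is just one
possible policy (it also suppresses auto-repeat), not something the XEP
specifies.

```python
class KeyStateTracker:
    """Forward only key events that keep down/up pairs consistent."""

    def __init__(self):
        # Set of (key, location) pairs currently considered pressed.
        self._down = set()

    def feed(self, kind, key, location=0):
        """Return the list of events that should be forwarded."""
        ident = (key, location)
        if kind == "keydown":
            if ident in self._down:
                return []  # already down: drop the duplicate
            self._down.add(ident)
            return [("keydown", key, location)]
        if kind == "keyup":
            if ident not in self._down:
                return []  # keyup without a matching keydown: drop it
            self._down.discard(ident)
            return [("keyup", key, location)]
        raise ValueError(f"unknown event kind: {kind!r}")

    def release_all(self):
        """Synthesize keyup for every key still down (e.g. on focus loss)."""
        events = [("keyup", key, loc) for key, loc in sorted(self._down)]
        self._down.clear()
        return events
```

With this, a lone "control up" that the browser reports without a prior
"control down" is never sent, and calling release_all() when the
session ends or the window loses focus avoids keys getting stuck on the
remote side.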

...

  I'm not saying there aren't any cases where low-latency is important,
  where I disagree is that this is the case in all occasions. If you
  don't have low latency feedback from the remote device, low latency
  for input is very likely not crucial.

  I have the feeling that you only see this specification with the
  remote desktop use case point of view. There are other use cases, and
  one another major one is to use a device as input for another one in
  the same physical location: use of a smartphone as ad-hoc touch pad or
  gamepad for instance. And if low latency is easily achieved, I still
  don't see the point to have other mechanism because in some niche case
  low latency is not that annoying (but still is, it's always annoying).
I think you misunderstood my point. Using a smartphone as a touch pad
or gamepad while playing a game on a screen next to you is a case of
low latency feedback (you can see the screen with low latency). An
example of where you don't need low latency would be blindly typing
into a remote shell, because you won't get feedback there (except after
confirming a command, which is probably not low latency).

...

  Anyway, I remain not convinced that XSF is the place to specify a
  remote control protocol from scratch (which is what sections 8 and 9
  of the XEP are about). Mostly because I feel the XSF does not have the
  competence for doing so (aka. we will probably do things terribly
  wrong, due to lack of experience in the field).

  Again, it is not from scratch. It's re-using existing protocols, in a
  simple, working, easy-to-implement, and efficient way.
I was talking about the remote control protocol, which is what runs on
the topmost layer (inside the WebRTC data channel or whatever other
Jingle transport is used). This protocol is mostly from scratch (it's
loosely based on web API events, but then only taking an arbitrarily
picked subset of events and event properties).

...
  The goal here is to be sure that it will work with web clients, as
  data channels are currently the only way to have direct connection
  with browsers. I can reformulate to only suggest it and get rid of the
  SHOULD.
Which isn't an issue if web clients are not relevant for my use case.
And honestly, any kind of pointing to "you should support web clients"
sounds weird to me. It certainly is interesting that we can support web
clients, but that really shouldn't leak into unrelated specifications
(and this one is totally unrelated to the web).

...
  WebRTC has sessions pretty much like Jingle; its ID is what you have
  in the o= line of your SDP.
My point is: Either it's a Jingle session or it's not part of XMPP.
Jingle doesn't use WebRTC. It just happens that WebRTC APIs are
somewhat compatible with Jingle (because they are based on Jingle), but
from an XMPP perspective, you never have WebRTC sessions. I don't know
exactly what it means to be in the same WebRTC session, but whatever
you want here, make it more explicit, because people who don't use
WebRTC APIs should not be required to first read the WebRTC specs (or
probably implementations' source code) to figure out what you mean by
that.

...
  The issue is that video feed is used in this case to get the screen
  dimension. Without it, we can't get touch event which use absolute
  position (while for mouse, there is a relative position mode for
  exactly this use case).
That's a problematic design. As I said, clients might scale the video
to reduce bandwidth use. Dino also has logic to adjust the video
resolution of cameras depending on available bandwidth.

And as I understood for mouse, it's not relative to the screen, but
relative to the previous position, aka a movement vector, like reported
from touchpads.
A screen-relative position, where 0,0 is the upper left corner, 0.5,0.5
is the center of the screen and 1,1 is the lower right corner, would
work independently of the target screen resolution.
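
To make the screen-relative idea concrete, here is a minimal sketch of
how a receiving client might map such a position onto whatever
resolution its screen currently has. The function name and the clamping
of 1.0 to the last pixel are my assumptions for illustration, not
anything from the XEP.

```python
def to_pixels(x_rel, y_rel, width, height):
    """Map a position in [0, 1] x [0, 1] to an integer pixel on the screen.

    width and height are the receiver's *current* screen dimensions, so
    the sender never needs to know them (or learn when they change).
    """
    if not (0.0 <= x_rel <= 1.0 and 0.0 <= y_rel <= 1.0):
        raise ValueError("relative coordinates must be within [0, 1]")
    # 1.0 would land one past the last pixel, so clamp to width/height - 1.
    px = min(int(x_rel * width), width - 1)
    py = min(int(y_rel * height), height - 1)
    return px, py
```

The same event (0.5, 0.5) then hits the center on a 1920x1080 screen
and on a 1280x720 one, regardless of how the video was scaled in
between.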

...
  An alternative would be to specify screen dimension when establishing
  the remote control session.
Might work, but then you also need to cover the case where the screen
resolution changes during remote control.

...
  No, its value is in pixels, the same as for the Web API. Its double
  because pixels can be subdivided (High-DPI displays, transformations).
  I realize that, besides the link to MDN, this is not explicitly
  stated; I'll add a notice in future revisions.
The Web API uses double because they did weird things for HiDPI. On the
hardware layer, there are only pixels and if you click on a point on
the screen, it will always be on a pixel (at least in all OSes that I
am aware of). The transformation of HiDPI in browsers abstracts away
from actual pixels, and 1px might be more or less than a physical
pixel. But why would you want to carry this abstraction through the
network to a system that shouldn't care about what browsers can do and
what they think a pixel is?

...
  It was just to handle the case where no device is accepted, there was
  2 options:
  - reject it totally
  - say it's a simple screen share session.

  I've chosen the later one. But indeed, data channel is then useless.
  Can change it for the other option.
We also don't allow Jingle file transfers of no file or RTP contents
without any codecs. As this protocol is for remote control, it should
remain entirely unused for screen share only.

...
  - I'm not hard set on technologies, and I'm OK to get rid of CBOR is
  there is consensus on it. I personally still think that it's a
  superior solution.
To me the use of CBOR here feels not well motivated, except for obscure
"better performance" reasons before having done any measurement to back
that claim. From an XMPP perspective, something in a Jingle XML stream
would be more canonical (because it reuses the stack we already have in
every XMPP client anyway) and anything deviating from that IMO should
be well reasoned.

If you're reasoning that CBOR provides a significant performance gain
over XML, then why is it not a priority to figure out how we use CBOR
instead of XML everywhere in XMPP (e.g. by creating some XML<>CBOR
translation and using that as an optional stream feature)?
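
A cheap first measurement would be message size. The sketch below
hand-encodes one event as CBOR (only the small subset of RFC 8949
needed for maps, text strings and unsigned integers) and compares it
with compact JSON and with an XML element. The event fields, the
element name and the urn:example namespace are made up for illustration
and are not taken from the XEP. Size alone says nothing about parsing
cost or latency, so this is a starting point rather than evidence
either way.

```python
import json
import struct
from xml.sax.saxutils import quoteattr

NS = "urn:example:remote-control"  # hypothetical namespace
EVENT = {"type": "keydown", "key": "a", "location": 0}  # hypothetical event


def cbor_head(major, n):
    """Encode a CBOR initial byte plus argument (RFC 8949, section 3)."""
    if n < 24:
        return bytes([(major << 5) | n])
    if n < 0x100:
        return bytes([(major << 5) | 24, n])
    if n < 0x10000:
        return bytes([(major << 5) | 25]) + struct.pack(">H", n)
    if n < 0x100000000:
        return bytes([(major << 5) | 26]) + struct.pack(">I", n)
    return bytes([(major << 5) | 27]) + struct.pack(">Q", n)


def cbor_encode(value):
    """Encode unsigned ints, text strings and maps; nothing else."""
    if isinstance(value, int) and not isinstance(value, bool) and value >= 0:
        return cbor_head(0, value)  # major type 0: unsigned integer
    if isinstance(value, str):
        raw = value.encode("utf-8")
        return cbor_head(3, len(raw)) + raw  # major type 3: text string
    if isinstance(value, dict):
        out = cbor_head(5, len(value))  # major type 5: map
        for k, v in value.items():
            out += cbor_encode(k) + cbor_encode(v)
        return out
    raise TypeError(f"unsupported type: {type(value).__name__}")


def to_xml(event):
    """Render the event as one self-closing element with attributes."""
    attrs = " ".join(
        f"{k}={quoteattr(str(v))}" for k, v in event.items() if k != "type"
    )
    return f"<{event['type']} xmlns={quoteattr(NS)} {attrs}/>"


def encoded_sizes(event):
    """Return the encoded size in bytes for each format."""
    return {
        "cbor": len(cbor_encode(event)),
        "json": len(json.dumps(event, separators=(",", ":")).encode("utf-8")),
        "xml": len(to_xml(event).encode("utf-8")),
    }


if __name__ == "__main__":
    print(encoded_sizes(EVENT))
```

A real comparison would also have to account for what the transport and
any stream compression add on top, and for the cost of carrying a
second serialization stack in every client.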

...
  - regarding using RFB for input events only, I'll have a deeper look
  at the spec and evaluate it. It may be an option it is comparable in
  ease of implementation, efficiency and flexibility to the current
  proposal.
I want to repeat that I haven't verified that RFB is a particularly
good fit for the purpose, I just know it's very popular.

Best,
Marvin
