Hi Marvin,
On Monday, 20 May 2024 at 22:48:42 UTC+2, Marvin W wrote:
Hi Goffi,
See inline comments. Sorry for the wall of text and if it overlaps with
one of the mails you wrote since I started writing this.
On Mon, 2024-05-20 at 16:51 +0200, Goffi wrote:
There are many benefits to using CBOR:
[SNIP]
The cumulative amount is about 10-20% [1]. This isn't really a huge
improvement and almost all events will fit into a single network layer
frame anyway, further reducing the impact of encoding size.
[SNIP]
Segmentation is also inherent to SCTP, the protocol webrtc data
channels use to transfer content frames. There is no win in segmenting
the same segments twice.
Note that while recommended, WebRTC Data Channel is not mandatory, and any
streaming transport may be used. Your arguments are only valid for WebRTC Data
Channels.
- Encoding and decoding CBOR are much more
efficient, essential for
quick and
efficient data processing, especially by low-resource devices (like
Arduinos).
Not untrue, but probably negligible given the resource use of IP, UDP,
DTLS, SCTP - all part of the protocol stack you're building on and thus
involved in every event to be processed. Especially DTLS encryption is
going to be much more resource hungry than the difference between CBOR
parser and JSON parser. And notably, CBOR encoding is not a native
function in web browsers, so if web is a goal of this thing (and
seemingly it is, given all the references to web tech in the XEP), CBOR
is probably not much better than JSON.
Working on the web is a goal, but it should of course work outside the web too
(I currently have a web implementation for the controlling device, and CLI ones
for a basic controlling device and for the controlled device).
CBOR is not native, but there are many implementations available.
Anyway, I'm not hard set on CBOR. If the consensus is to get rid of it, we can
get rid of it.
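For concreteness, here is a rough sketch of the size difference for a single pointer event. The event shape and field names are my own illustration (not the protoXEP's exact wire schema), and the encoder is a deliberately minimal hand-rolled one, not a real CBOR library; in this toy case the CBOR form is roughly a quarter smaller, and real-world figures will vary:

```python
import json
import struct

def cbor_encode(obj):
    """Deliberately minimal CBOR encoder for this sketch: small maps,
    short UTF-8 text strings, and unsigned ints below 65536 only."""
    if isinstance(obj, dict):                      # major type 5: map (< 24 pairs)
        out = bytes([0xA0 | len(obj)])
        for key, value in obj.items():
            out += cbor_encode(key) + cbor_encode(value)
        return out
    if isinstance(obj, str):                       # major type 3: text (< 24 bytes)
        data = obj.encode("utf-8")
        return bytes([0x60 | len(data)]) + data
    if isinstance(obj, int) and 0 <= obj < 65536:  # major type 0: unsigned int
        if obj < 24:
            return bytes([obj])
        if obj < 256:
            return bytes([0x18, obj])
        return bytes([0x19]) + struct.pack(">H", obj)
    raise TypeError(f"unsupported value: {obj!r}")

event = {"type": "mousemove", "x": 640, "y": 480}
json_bytes = json.dumps(event, separators=(",", ":")).encode("utf-8")
cbor_bytes = cbor_encode(event)
print(len(json_bytes), len(cbor_bytes))  # 36 26
```

The integers are where CBOR wins here (3 bytes each instead of 3-4 ASCII digits plus delimiters); the string keys cost the same either way.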
Regarding the choice of web APIs, it's only because sending events, especially
keyboard events, is hard to do well. There are many different ways to encode
them depending on the platform, and various kinds of keyboards with special
characters. The Web API is simple, documented, and abstracts this complexity
away. The web has been around for 35 years; it has already gone through the
rough patches. But again, I'm not against switching if there is something as
simple and complete.
[SNIP]
XEP-0247 Jingle XML streams doesn't need to go via the server, it uses
Jingle just like your proposed protocol.
I know that; I've just ruled out using <message> through the server, as was
proposed in other feedback.
While the XEP isn't maintained
for some time and makes weird references to other XEPs, nothing in it
forbids using it with webrtc data channels. In fact this has been
discussed as a useful tool for all kinds of things recently (like
initial device crypto setup or device-to-device MAM).
In general I love the idea of XEP-0247 for many use cases. I just feel that
XML is not well suited to this particular use case.
And of course latency when sending via a server might
be sub-perfect,
but it's a very similar latency you would see if the network
environment requires to use a TURN server, which is one of the ways to
use Jingle.
TURN relay is a worst-case scenario. And even then, it's more efficient because
you don't have to wait for server queue handling and <message> processing.
And as mentioned, there are valid use cases for
having
input in cases where low latency is not that crucial. Think of keyboard
input to a remote shell - essentially what SSH does - which is not
uncommon to be routed through proxies/tunnels that add latency. Of
course for game input, drawing and 3d modeling, that's probably not an
option. It depends a lot on the usecase and that's why flexibility is
very much a good idea. Building something that is exclusively/primarily
designed around having a web browser XMPP client connected via Jingle
webrtc datachannels doesn't sound like flexibility was part of the
design.
It is not designed around having a web browser at all! Being inspired by a Web
API doesn't make it browser-centric; otherwise HTTP Upload would be designed
for web browsers too. The fact is that there has been, and still is, an
enormous amount of engineering put into web technologies, and many good things
have emerged from there, like WebRTC, WebSockets, WebAssembly, etc.
And again, I already have a non-web implementation (as well as a web one).
Sure, with SSH latency is less of a problem (while still annoying), but the
current mechanism works in all cases and is simple and efficient. Adding
another mechanism just because "there are valid use cases for having input in
cases where low latency is not that crucial" would only add complexity.
Just as you can "directly" map data from
JSON objects from a web
browser to CBOR, you can directly map them to XML. It's not really a
good idea to do such a direct mapping in both cases though (e.g. if you
used enumerated keys in CBOR instead of a string map, you can
drastically reduce the payload size and improve parsing speed).
For a specification to succeed, there is a balance to strike between
efficiency, ease of implementation, and flexibility. I believe the string map,
with selective mapping of data, achieves that balance.
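To make Marvin's enumerated-keys point concrete, here is a toy comparison. The integer-to-field assignment (0 = event type, 1 = x, 2 = y, with 2 standing for "mousemove" in some agreed event table) is invented for this sketch; a real scheme would have to be defined by the spec, which is exactly the readability/efficiency trade-off being discussed:

```python
import struct

def enc(v):
    """Toy CBOR encoder: ints below 65536, short text strings, small maps."""
    if isinstance(v, int):
        return bytes([v]) if v < 24 else bytes([0x19]) + struct.pack(">H", v)
    if isinstance(v, str):
        data = v.encode("utf-8")
        return bytes([0x60 | len(data)]) + data
    out = bytes([0xA0 | len(v)])  # map header, fewer than 24 pairs
    for key, val in v.items():
        out += enc(key) + enc(val)
    return out

# string-keyed map (protoXEP style) vs. a hypothetical enumerated-key form
string_keyed = {"type": "mousemove", "x": 640, "y": 480}
int_keyed = {0: 2, 1: 640, 2: 480}
print(len(enc(string_keyed)), len(enc(int_keyed)))  # 26 11
```

The enumerated form is less than half the size, at the cost of a wire format that is meaningless without the enumeration table at hand.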
[SNIP]
As I mentioned in another email: If you really feel like using RTP for
screen content transfer, you can always decide to only use the RFB
protocol (or something else) for the input part. I took it as an
example for an existing protocol that (among other features) has logic
for remote control input.
Again I'm not hard set on chosen technologies.
I'm not familiar with the internals of RFB, and will look at it. If it's a
good fit, I'm not against replacing the current events wire format with it.
From a quick glance at the Wikipedia page, I see that for clipboard data
"there is currently no way to transfer text outside the Latin-1 character
set", and that a common pseudo-encoding extension solves the problem by using
UTF-8 in an extended format; this makes me suspicious, though.
One of the design goals of my proposal is to have something really simple and
straightforward to implement.
Using RFB for screen transfer may be an adjacent topic, but not a
requirement.
The discussed specification focuses on remote controlling a device, rather than
screen/audio transfer. It explains how to use it in conjunction with the
current specification for A/V calls for remote desktop, but designing the
desktop transfer protocol is out of scope.
Another XEP may be specified if XEP-0167 proves not to be sufficient for desktop
transfer, and this proposal will be usable with it without issue. Such a XEP
could utilize RFB, SPICE, or whatever.
[SNIP]
I just played with the
https://w3c.github.io/uievents/tools/key-event-viewer.html and it's
still unclear to me when pressing modifier keys, which events are
emitted when and what is the supposed state of the modifier flag for
those events. I figured that the behavior is inconsistent between
browsers (and probably operating systems) and also between different
keys in the same browser. I bet this is not intended, but as the
specification and MDN don't really tell me what the correct behavior
would be, I can't really blame the browsers either.
There is no modifier flag used in the specification. There is the key value, and
the location number. From my tests, it's consistent and corresponds to the
documentation for the browsers that I've tried (Firefox and Chromium).
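As a sketch of what that means on the wire: the field names below follow the UI Events KeyboardEvent API (the location constants are the spec's own values), while the exact wire schema is whatever the protoXEP defines:

```python
# KeyboardEvent location constants, as defined by the UI Events spec
DOM_KEY_LOCATION_STANDARD = 0
DOM_KEY_LOCATION_LEFT = 1
DOM_KEY_LOCATION_RIGHT = 2
DOM_KEY_LOCATION_NUMPAD = 3

# pressing the right-hand Shift key: the payload carries the key value and
# the location number; there is no separate modifier-state bitmask to keep
# in sync between both sides
event = {"type": "keydown", "key": "Shift", "location": DOM_KEY_LOCATION_RIGHT}
```

The controlled device reconstructs modifier state from the keydown/keyup pairs themselves, which avoids the flag-consistency ambiguity discussed above.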
I'm not saying there aren't any cases where
low-latency is important,
where I disagree is that this is the case in all occasions. If you
don't have low latency feedback from the remote device, low latency for
input is very likely not crucial.
I have the feeling that you only see this specification from the remote
desktop use case point of view. There are other use cases, and another major
one is using a device as input for another one in the same physical location:
using a smartphone as an ad hoc touchpad or gamepad, for instance. And if low
latency is easily achieved, I still don't see the point of having another
mechanism because in some niche cases latency is not that annoying (it still
is; it's always annoying).
Anyway, I remain not convinced that XSF is the place to specify a
remote control protocol from scratch (which is what sections 8 and 9 of
the XEP are about). Mostly because I feel the XSF does not have the
competence for doing so (aka. we will probably do things terribly
wrong, due to lack of experience in the field).
Again, it is not from scratch. It reuses existing protocols in a simple,
working, easy-to-implement, and efficient way.
Thank you for your feedback; as for the rest of your message, I'll take it
into account for the next revision if the protoXEP is accepted.
Instead of `<device type="keyboard"/>`
I would go with `<keyboard />`,
allowing for attributes to be added for more information where there is
fit (e.g. for a mouse have an optional buttons attribute with the
number of buttons that are on the mouse, or for a gamepad, you might
want to provide the layout, etc). This also means that to extend new
devices outside this specification, one can just have a `<gamepad
xmlns="urn:xmpp:remote-control:gamepad:0" />` or similar. As a general
guideline, I feel attributes should only be used if the set of possible
values is finite.
The specification says that other child elements can be used in <device> for
parameters. But your proposition may be cleaner; I'll consider it for a next
revision if the protoXEP is accepted. Thanks!
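For comparison, Marvin's suggestion would look something like this (the attribute names and values are illustrative, drawn from his examples, not from any published spec):

```xml
<!-- current protoXEP style -->
<device type="keyboard"/>

<!-- suggested style: one element per device, attributes for parameters -->
<keyboard/>
<mouse buttons="5"/>

<!-- devices defined outside this specification get their own namespace -->
<gamepad xmlns="urn:xmpp:remote-control:gamepad:0" layout="xbox"/>
```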
I would strongly opt to not make the use of datachannels a SHOULD in
this protocol. It really doesn't matter for the purpose of this
protocol and you don't want to need to upgrade this protocol if a new
transport protocol becomes available that would be a better fit. Jingle
does the abstraction to streaming vs datagram, so that application
protocols don't need to deal with it.
The goal here is to be sure that it will work with web clients, as data
channels are currently the only way to have a direct connection with browsers.
I can reformulate to only suggest it and get rid of the SHOULD.
There is a lot of specification for interaction with the Jingle RTP and
WebRTC protocols. This seems mostly unnecessary.
- You already write in the requirements that everything should work
even without Jingle RTP
- You put that one MUST use the same "WebRTC session" (what is that
even) for both Jingle RTP and Remote Control. I wouldn't know why this
is. Of course using existing sessions in Jingle often makes sense
(that's why it's a feature), but it definitely doesn't need a MUST
here.
WebRTC has sessions pretty much like Jingle does; the session ID is what you
have in the o= line of your SDP.
The goal here is to reuse the connection, and to know which streams are used
for what. However, this is not ideal, I agree. I have a plan to get rid of
this section and work on a separate specification to add metadata to
distinguish which streams are used for what.
- You write explicitly that Remote Control can be
added with content-
add to existing Jingle RTP sessions. This is already given by the
Jingle specification, which doesn't limit what content can be added to
a session (e.g. you can also add a file transfer to an existing call).
- You say that touch devices should not be used when
no video RTP
session is active. I don't see why this shouldn't be possible. I do own
a drawing tablet that doesn't have a screen but still is an absolute
pointing device (aka "touch"). If that device was connected via XMPP,
it wouldn't need a RTP session to transfer its input.
The issue is that the video feed is used in this case to get the screen
dimensions. Without it, we can't handle touch events, which use absolute
positions (while for the mouse, there is a relative position mode for exactly
this use case).
An alternative would be to specify the screen dimensions when establishing the
remote control session.
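That alternative might look something like this (a purely hypothetical element sketched for this discussion, not in the protoXEP):

```xml
<!-- hypothetical: advertise screen geometry at session setup so absolute
     (touch) coordinates can be interpreted without an active video stream -->
<device type="touchscreen">
  <screen width="1920" height="1080"/>
</device>
```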
- You say that absolute mouse events should not be
used when no video
RTP session is active. I also don't see why this restriction is in
place - same as above.
For both touch and mouse you use x and y coordinates "relative to the
video stream". What does that mean? x and y are doubles, so are they
supposed to be relative to the screen, so only values between 0 and 1
(inclusive) are valid?
No, its value is in pixels, the same as for the Web API. It's a double because
pixels can be subdivided (High-DPI displays, transformations). I realize that,
besides the link to MDN, this is not explicitly stated; I'll add a notice in
future revisions.
The Web API initially used int, and then moved to double. That's the kind of
reason why I'm using a mapping of the Web API: they went through that, and the
types are carefully chosen.
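A concrete case of why a double is needed (devicePixelRatio is the browser's name for the display scale factor; the numbers are made up for illustration):

```python
# on a High-DPI display with a scale factor of 2, an odd physical pixel
# maps to a fractional CSS pixel, so an int coordinate would lose precision
device_x = 971          # physical pixel reported by the input device
device_pixel_ratio = 2  # display scale factor (window.devicePixelRatio)
css_x = device_x / device_pixel_ratio
print(css_x)  # 485.5
```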
[SNIP]
Wheel events don't have a screen coordinate. I'm pretty sure they
should have those, as the cursor position for the movement does matter
a lot.
Cursor position is handled by other devices (mouse or touch). The wheel by
itself doesn't have any position (it can be an independent device not linked
to a mouse).
If I understood correctly, you specify that a session is a screen share
session by adding a remote control content without any device. This
remote control content would thus effectively not be used, but still
require setup of a data channel. This doesn't seem like a good
protocol.
The fact that a video is a screen share should be
communicated outside this specification and this specification should
not be involved at all in such a case (as it's not a remote control). A
remote control without devices should be invalid.
It was just to handle the case where no device is accepted; there were two
options:
- reject it entirely
- say it's a simple screen share session.
I've chosen the latter. But indeed, the data channel is then useless. I can
change it to the other option.
Thanks for the time you took to review the spec and write this feedback.
As a summary:
- I'm not hard set on technologies, and I'm OK with getting rid of CBOR if
there is consensus on it. I personally still think that it's a superior
solution.
- Regarding using RFB for input events only, I'll have a deeper look at the
spec and evaluate it. It may be an option if it is comparable in ease of
implementation, efficiency, and flexibility to the current proposal.
- I will take other feedback into account for a future revision.
Thanks!
Best,
Goffi