[Standards] Support for stickers (custom emojis)
xmpp at larma.de
Fri Oct 25 14:25:43 UTC 2019
On 10/25/19 3:15 PM, Sam Whited wrote:
> On Thu, Oct 24, 2019, at 18:32, Marvin W wrote:
> XMPP uses UTF-8, and there's almost no reason to use anything but UTF-8.
I do agree that this is true inside XMPP, but the data being transported
inside XMPP might be transcoded for non-XMPP transports (examples: bridges
to other networks, clients that don't do XMPP on c2s connections), and
for those use cases different encodings might occur. We shouldn't focus
on non-UTF-8 encodings, but considering them doesn't hurt either.
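To illustrate the transcoding concern, here is a minimal Python sketch. The sample text and the helper name byte_offset are made up for this example and do not come from any XEP. It shows that the byte offset of the same position changes with the encoding, while the codepoint offset does not.

```python
# Hypothetical sample text: "héllo " followed by the EU flag
# (two regional indicator codepoints).
text = "h\u00e9llo \U0001F1EA\U0001F1FA"

# Python's str is indexed by codepoint, so this is a codepoint offset.
cp_offset = text.index("\U0001F1EA")


def byte_offset(s: str, cp_off: int, encoding: str) -> int:
    """Byte offset of codepoint position cp_off once s is encoded."""
    return len(s[:cp_off].encode(encoding))


# Same position, different byte offsets depending on the transport encoding.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, byte_offset(text, cp_offset, enc))
```

A bridge that re-encodes the body would have to recompute byte-based markers, whereas codepoint-based markers survive unchanged.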
> This problem exists with codepoints too, though to a lesser extent and
> it may be less clear how it should be handled in all cases. For example,
> in the middle of a multi-codepoint emoji or country flag.
Yes and no. Multi-codepoint emojis are still valid characters when
split, whereas multi-byte codepoints cannot be split. There is nothing
wrong with displaying the flag 🇪🇺 as 🇪🇺 *, so your implementation
is always capable of strictly following any markup being done on a
codepoint basis, even if the markup border is inside a multi-codepoint
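The difference between splitting a multi-codepoint emoji and splitting a multi-byte codepoint can be sketched in Python. This is only an illustration; the helper split_is_valid_utf8 is a made-up name.

```python
# The EU flag is two codepoints: REGIONAL INDICATOR E + REGIONAL INDICATOR U.
flag = "\U0001F1EA\U0001F1FA"

# Splitting between the two codepoints yields two valid strings.
first, second = flag[:1], flag[1:]
first_bytes = first.encode("utf-8")    # encodes without error
second_bytes = second.encode("utf-8")  # encodes without error

raw = flag.encode("utf-8")  # 8 bytes, 4 per codepoint


def split_is_valid_utf8(data: bytes, pos: int) -> bool:
    """True if both halves of data, cut at byte pos, decode as UTF-8."""
    try:
        data[:pos].decode("utf-8")
        data[pos:].decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(split_is_valid_utf8(raw, 4))  # cut between the two codepoints
print(split_is_valid_utf8(raw, 2))  # cut inside the first codepoint
```

The cut at byte 4 lands between the codepoints and both halves decode; the cut at byte 2 lands inside a codepoint and neither half is valid UTF-8.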
> There's also the minor problem of having to decode all the bytes up to
> the start position at the application layer if we have to count
Some programming languages handle strings as Unicode codepoints instead
of bytes. I agree that this would be an issue for non-messaging content
(e.g. large files), but I don't think that is what we are talking about.
For messaging content, it's no issue that the client has to decode all
the bytes - it will be required to do so anyway for displaying.
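As a rough sketch of why this costs nothing extra in a codepoint-indexed language: the function name apply_span and the sample payload below are assumptions made for illustration, not part of any proposal. The decode step is the same one the client needs for display, and the codepoint offsets then apply directly as slice indices.

```python
from typing import Tuple


def apply_span(payload: bytes, start: int, end: int) -> Tuple[str, str, str]:
    """Split a UTF-8 message body at codepoint offsets start and end."""
    # The decode the client has to do anyway before displaying the text.
    text = payload.decode("utf-8")
    # Python str slices count codepoints, so no extra scanning is needed.
    return text[:start], text[start:end], text[end:]


# Hypothetical message body: "héllo ", the EU flag, then "!".
body = "h\u00e9llo \U0001F1EA\U0001F1FA!".encode("utf-8")
before, marked, after = apply_span(body, 6, 8)
print(repr(before), repr(marked), repr(after))
```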
> With bytes you only have two checks: is the start and the
> end marker on a byte boundary? If so the string in the middle can be
> assumed to be valid.
Assuming you meant codepoint boundary instead of byte boundary, I agree
that this would also be an option, as long as we make sure people
actually do these checks. I personally prefer codepoints, but both are
valid and sane options - as long as we don't go with grapheme clusters or
anything like that, we are fine IMO.
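For the byte-based variant, the two checks could look roughly like this in Python. It is a sketch under the assumption that the whole body has already been validated as UTF-8; the function names are made up.

```python
def on_codepoint_boundary(data: bytes, pos: int) -> bool:
    """True if byte offset pos does not fall inside a UTF-8 sequence.

    Assumes data as a whole is already known to be valid UTF-8.
    """
    if pos < 0 or pos > len(data):
        return False
    if pos == len(data):
        return True  # end of data is always a boundary
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[pos] & 0xC0) != 0x80


def span_is_valid(data: bytes, start: int, end: int) -> bool:
    """Check both markers; the bytes in between are then valid UTF-8 too."""
    return (
        start <= end
        and on_codepoint_boundary(data, start)
        and on_codepoint_boundary(data, end)
    )


# Hypothetical body: "héllo " followed by the EU flag (15 bytes in UTF-8).
body = "h\u00e9llo \U0001F1EA\U0001F1FA".encode("utf-8")
print(span_is_valid(body, 7, 15))  # the whole flag
print(span_is_valid(body, 7, 9))   # end marker inside a codepoint
```

A receiver that skips these checks could end up slicing in the middle of a codepoint, which is why they would need to be mandatory if byte offsets were chosen.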
* I put a zero-width space in there to ensure your mail client is not
going to merge the two characters.