[Standards] Support for stickers (custom emojis)

Marvin W xmpp at larma.de
Fri Oct 25 14:25:43 UTC 2019


On 10/25/19 3:15 PM, Sam Whited wrote:
> On Thu, Oct 24, 2019, at 18:32, Marvin W wrote:
> XMPP uses UTF-8, and there's almost no reason to use anything but UTF-8.

I do agree that this is true inside XMPP, but the data being transported
inside XMPP might be transcoded to a non-XMPP transport (examples: bridges
to other networks, clients that don't do XMPP on c2s connections) and
for those use cases different encodings might occur. We shouldn't focus
on non-UTF-8 encodings, but considering them also doesn't hurt.

> This problem exists with codepoints too, though to a lesser extent and
> it may be less clear how it should be handled in all cases. For example,
> in the middle of a multi-codepoint emoji or country flag.

Yes and no. Multi-codepoint emojis are still valid characters when
split, whereas multi-byte codepoints cannot be split. There is nothing 
wrong with displaying the flag 🇪🇺 as 🇪​🇺 *, so your implementation 
is always capable of strictly following any markup being done on a
codepoint basis, even if the markup border is inside a multi-codepoint 
emoji.
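
To illustrate the difference, here is a minimal Python sketch. It is only
an illustration under my own assumptions, not taken from any XEP; the
helper name decodes() is hypothetical. It splits the EU flag once between
its two codepoints and once inside the UTF-8 bytes of the first codepoint:

```python
# The EU flag is two regional indicator codepoints: E followed by U.
flag = "\U0001F1EA\U0001F1FA"

# Splitting between the codepoints leaves two strings that are each valid.
left, right = flag[:1], flag[1:]

# In UTF-8 each of these codepoints takes 4 bytes, so the flag is 8 bytes.
raw = flag.encode("utf-8")


def decodes(data: bytes) -> bool:
    """Hypothetical helper: True if data is valid UTF-8 on its own."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


codepoint_split_ok = decodes(raw[:4]) and decodes(raw[4:])  # on a boundary
byte_split_ok = decodes(raw[:2])  # cut inside the first codepoint
```

A split at a codepoint boundary yields two displayable halves, while a
split inside a codepoint yields bytes that cannot be decoded at all.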

> There's also the minor problem of having to decode all the bytes up to
> the start position at the application layer if we have to count
> codepoints. 

Some programming languages handle strings in Unicode codepoints instead
of bytes. I agree that this would be an issue for non-messaging content
(i.e. large files), but I don't think that is what we are talking about.
For messaging content, it's no issue that the client has to decode all
the bytes - it will be required to do so anyway for displaying.
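
As a rough sketch of what that decoding step costs, the following Python
function maps a codepoint index to a byte offset with a single linear
scan. The function name is hypothetical and the code assumes the input is
already valid UTF-8:

```python
def byte_offset_of_codepoint(data: bytes, index: int) -> int:
    """Return the byte offset at which codepoint number `index` starts.

    Assumes `data` is valid UTF-8. An index equal to the number of
    codepoints maps to len(data), i.e. the position just past the end.
    """
    seen = 0
    for pos, byte in enumerate(data):
        # Continuation bytes look like 10xxxxxx; anything else starts
        # a new codepoint.
        if byte & 0xC0 != 0x80:
            if seen == index:
                return pos
            seen += 1
    if seen == index:
        return len(data)
    raise IndexError("codepoint index out of range")
```

The scan touches each byte at most once, which is negligible for a chat
message and only becomes a concern for very large payloads.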

> With bytes you only have two checks: is the start and the
> end marker on a byte boundary? If so the string in the middle can be
> assumed to be valid.

Assuming you meant codepoint boundary instead of byte boundary, I agree 
that this would also be an option, as long as we make sure people 
actually do these checks. I personally prefer codepoints, but both are 
valid and sane options - as long as we don't go with grapheme clusters or
anything like that, we are fine IMO.
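
For the byte-offset variant, the two checks could look roughly like the
Python sketch below. The function names are hypothetical, and the code
assumes the message body itself is valid UTF-8:

```python
def on_codepoint_boundary(data: bytes, offset: int) -> bool:
    """True if `offset` does not point into the middle of a codepoint."""
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True  # the end of the data is always a boundary
    # A continuation byte (10xxxxxx) means we are inside a codepoint.
    return data[offset] & 0xC0 != 0x80


def valid_byte_range(data: bytes, start: int, end: int) -> bool:
    """Check both markers; the bytes in between are then whole codepoints."""
    return (
        start <= end
        and on_codepoint_boundary(data, start)
        and on_codepoint_boundary(data, end)
    )
```

A receiving client would reject or ignore markup whose range fails this
check rather than try to render half a codepoint.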

Marvin

-- 

* I put a zero-width space in there to ensure your mail client is not 
going to merge the two characters.

