[Standards] Support for stickers (custom emojis)

Marvin W xmpp at larma.de
Thu Oct 24 18:32:04 UTC 2019

On 10/21/19 4:06 PM, Jonathan Lennox wrote:
> The right concept here is probably "grapheme clusters", as defined in
> Unicode Standard Annex 29.  ICU has support for this.

We should refrain from using things like grapheme clusters in wire 
formats, as those are subject to changes in upcoming Unicode versions 
and thus the wire format would be understood differently depending on 
the Unicode version implemented by the client.

Technically we could also agree on using a certain Unicode version now
and for all eternity, but this sounds like a stupid concept and will
cause people to use ICU or similar, which will break eventually as the
standard changes.

We should strive for maximum compatibility. This gives us basically
two options: bytes and codepoints. As our encoding is fixed to UTF-8 per
RFC 6120, both would be equally understandable by clients. However, there
are two good reasons against bytes:
1) At some point we might want to allow the usage of UTF-16 or any other
encoding. Byte counts would have to be translated when re-encoding, which
a server is probably unable to do generically.
2) There is no useful meaning to starting a link or bold span inside a
codepoint. Depending on the tech stack used, it might cause developers
to unintentionally allow the generation of invalidly encoded strings,
causing all kinds of issues (including potential security impact).
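
To illustrate both points, here is a minimal Python sketch. The helper
names byte_offset and is_valid_utf8 are made up for this example and are
not taken from any XEP or library. It shows that the same codepoint index
maps to different byte offsets in UTF-8 and UTF-16, and that a byte
offset can land inside a codepoint and yield invalid UTF-8.

```python
def byte_offset(text: str, cp_index: int, encoding: str = "utf-8") -> int:
    """Translate a codepoint index into a byte offset for one encoding.

    Hypothetical helper: Python's str is indexed by codepoint, so we
    encode the prefix and count the resulting bytes.
    """
    return len(text[:cp_index].encode(encoding))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "h\u00e9llo"  # U+00E9 is one codepoint but two bytes in UTF-8

# The codepoint index just after U+00E9 is 2 in every encoding,
# but the matching byte offset depends on the encoding.
utf8_offset = byte_offset(text, 2, "utf-8")
utf16_offset = byte_offset(text, 2, "utf-16-le")

# A byte offset of 2 falls in the middle of U+00E9 in UTF-8.
# Cutting there leaves a truncated, invalid sequence.
broken_prefix = text.encode("utf-8")[:2]
broken_is_valid = is_valid_utf8(broken_prefix)

# Cutting at a codepoint index always yields a valid string.
clean_prefix_is_valid = is_valid_utf8(text[:2].encode("utf-8"))
```

A codepoint index survives re-encoding unchanged, while every byte
offset would have to be recomputed by whoever re-encodes the stanza.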

Thus, I would vote for using codepoints. This would of course raise the
question of what happens if multiple codepoints form a single grapheme
and an offset points inside that grapheme. The rule should just be that
clients should not do that on outgoing data. If a client receives input
pointing inside a grapheme, it's implementation-defined whether the
grapheme is included, excluded or split. In practice this shouldn't
happen, so I doubt it is really worth defining a rule for it in the
respective XEP, but this would also be an option.
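
As a rough sketch of that edge case, again in Python: the function
split_at_span and the half-open (start, end) convention are assumptions
for illustration only, not something a XEP defines. The flag is built
from two regional indicator codepoints, and the span covers only the
first of them.

```python
from typing import Tuple


def split_at_span(text: str, start: int, end: int) -> Tuple[str, str, str]:
    """Split text around a half-open span given in codepoint indices.

    Hypothetical helper; the span convention is an assumption.
    """
    return text[:start], text[start:end], text[end:]


# Regional indicators D and E, usually rendered as a single flag.
flag = "\U0001F1E9\U0001F1EA"
message = "Hi " + flag + "!"

# A span of (3, 4) covers only the first half of the flag grapheme.
before, marked, after = split_at_span(message, 3, 4)

# Each part is still a valid string and encodes without error,
# even though the grapheme has been split.
encoded_parts = [part.encode("utf-8") for part in (before, marked, after)]
```

Here the receiving client gets three well-formed strings. Whether it
renders the marked part as half a flag, widens the span to the whole
flag, or drops it is the implementation-defined part.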

By the way, the often-mentioned flag example is not consistent across
browsers either; try https://larma.de/splitflag.html with various
browsers and browser versions. (Bonus task: build a browser detector
based on flag rendering.)

