[Standards] Support for stickers (custom emojis)
teddsterr at outlook.com
Sat Oct 26 13:52:26 UTC 2019
I presume that the majority of implementations will do the UTF-8 decoding before or during XML parsing, so if offsets are specified in bytes they will likely have to re-encode the string just to cross-reference those byte offsets with the codepoint* offsets they actually need.
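To illustrate the round trip I mean, here is a rough Python sketch, assuming the implementation already holds the decoded text as a string. The function name byte_to_codepoint_offset is made up for this example and is not taken from any XEP or library.

```python
def byte_to_codepoint_offset(text: str, byte_offset: int) -> int:
    """Map a UTF-8 byte offset onto a codepoint offset in `text`.

    Hypothetical helper: the already-decoded string has to be re-encoded
    so the byte offset can be cross-referenced.
    """
    encoded = text.encode("utf-8")  # the extra re-encoding step
    if not 0 <= byte_offset <= len(encoded):
        raise ValueError("byte offset out of range")
    # Decoding the prefix raises UnicodeDecodeError if the offset
    # lands in the middle of a multi-byte codepoint.
    return len(encoded[:byte_offset].decode("utf-8"))
```

A real client would probably cache the encoded form or build an index once per message rather than re-encoding for every offset.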
For those which must operate at the byte level, anything other than byte offsets is going to be awkward. You can still manage without fully decoding UTF-8, however: all non-head (continuation) bytes have the pattern 10xxxxxx, so counting only head bytes will lead you to the correct start-of-codepoint, though it's obviously a little more work than direct indexing.
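As a sketch of that head-byte counting, assuming the input is already valid UTF-8 (no validation is attempted), something like the following would do. The name byte_offset_of_codepoint is invented for illustration.

```python
def byte_offset_of_codepoint(data: bytes, cp_index: int) -> int:
    """Return the byte offset at which codepoint number `cp_index` starts.

    Only head bytes are counted; continuation bytes (bit pattern
    10xxxxxx) are skipped. Assumes `data` is well-formed UTF-8.
    """
    if cp_index < 0:
        raise ValueError("codepoint index must be non-negative")
    seen = 0
    for pos, byte in enumerate(data):
        if (byte & 0xC0) != 0x80:  # not a continuation byte, so a head byte
            if seen == cp_index:
                return pos
            seen += 1
    if seen == cp_index:
        return len(data)  # offset just past the last codepoint
    raise IndexError("codepoint index out of range")
```

This is a linear scan, which is the "little more work" compared with indexing directly by byte.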
Byte offsets share every error case that codepoint offsets have, plus one more: a byte offset can land mid-codepoint, which is impossible if codepoints are your units.
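For what it's worth, the extra check a byte-based receiver would need is small. A minimal sketch, again assuming valid UTF-8 input and using a made-up function name:

```python
def is_codepoint_boundary(data: bytes, offset: int) -> bool:
    """Report whether `offset` falls on a codepoint boundary in `data`.

    Hypothetical helper; assumes `data` is well-formed UTF-8.
    """
    if not 0 <= offset <= len(data):
        return False  # out of range: an error case codepoint offsets share
    if offset == len(data):
        return True  # the end of the buffer is always a boundary
    # An offset pointing at a continuation byte (10xxxxxx) is mid-codepoint.
    return (data[offset] & 0xC0) != 0x80
```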
As for mid-glyph offsets, is it such a problem beyond possibly displaying badly? Where it's assumed to be an error, an easy solution would be to quietly round the start/end offsets to the start/end of their glyphs - obviously this is handled most efficiently by the display layer, but presumably that's the only place it matters anyway.
From another angle, I'd position XMPP above XML, and XML above the text encoding scheme used (UTF-8), so then it seems wrong to be concerning ourselves with details of the encoding scheme from the top level.
* It's probably worth mentioning that there are a number of confusions people have with Unicode, and saying 'character' when they mean 'codepoint' is one of them (as they're equivalent for the single-codepoint characters they're familiar with.)