[Standards] XEP-0372: References

Sam Whited sam at samwhited.com
Mon Mar 12 14:56:04 UTC 2018

On Mon, Mar 12, 2018, at 09:20, Jonas Wielicki wrote:
> FWIW, I’d argue that using codepoints is much saner, 
> because those are readily available and consistent across all receiving 
> entities of the same message. (Even though this potentially introduces some 
> interesting edge-cases when somebody creates a reference starting in the 
> middle of, e.g., an emoji sequence.)

I agree that "characters" needs to be specified, but codepoints (or, to be pedantic, "scalar values", which are a subset of codepoints and are all we need to worry about in XMPP/UTF-8 land) are probably the wrong way to do it. Codepoints don't really get us anything over bytes: just as a scalar value can be made up of multiple bytes, a glyph (or "grapheme cluster") may be made up of multiple scalar values, so, as you pointed out, a range could still end in the middle of a grapheme cluster.
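To make the three units concrete, here is a rough Python sketch. The helper name count_units is made up for illustration and is not from the XEP. Python's standard library has no grapheme cluster segmentation, so the sketch only counts scalar values and UTF-8 bytes; both sample strings are a single user-perceived character.

```python
from typing import Tuple


def count_units(text: str) -> Tuple[int, int]:
    """Return (scalar values, UTF-8 bytes) for text.

    Hypothetical helper for illustration only. len() on a Python str
    counts code points, which for well-formed text are scalar values.
    """
    return len(text), len(text.encode("utf-8"))


# "e" followed by U+0301 COMBINING ACUTE ACCENT: one grapheme cluster.
accented = "e\u0301"
# MAN, ZWJ, WOMAN, ZWJ, GIRL: one emoji sequence, one grapheme cluster.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(count_units(accented))  # (2, 3)
print(count_units(family))    # (5, 18)

# A range counted in scalar values can still cut a cluster in half:
# this keeps the "e" and drops its combining accent.
print(repr(accented[:1]))     # 'e'
```

Neither count matches what the user sees as one character, which is why switching from bytes to scalar values does not remove the "range ends mid-glyph" problem.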

In my mind there are only two things that make sense here:

- Use bytes and come up with a way to handle bad ranges that begin or end in the middle of a UTF-8 sequence
- Use grapheme clusters and require that everyone implement the segmentation algorithm
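For the bytes option, one possible policy (purely a sketch, not something the XEP specifies) is for the receiver to widen a bad range until both edges sit on UTF-8 sequence boundaries. The function names, the half-open [begin, end) convention, and the choice to widen rather than reject are all assumptions made for illustration, and the body is assumed to be valid UTF-8 already.

```python
from typing import Tuple


def is_boundary(data: bytes, i: int) -> bool:
    """True if byte offset i does not fall inside a UTF-8 sequence."""
    if i == 0 or i == len(data):
        return True
    # Continuation bytes always have the bit pattern 10xxxxxx.
    return (data[i] & 0xC0) != 0x80


def snap_range(data: bytes, begin: int, end: int) -> Tuple[int, int]:
    """Widen the half-open byte range [begin, end) to sequence boundaries.

    Hypothetical policy: move begin backwards and end forwards until
    neither edge splits a multi-byte sequence.
    """
    if not 0 <= begin <= end <= len(data):
        raise ValueError("range outside message body")
    while not is_boundary(data, begin):
        begin -= 1
    while not is_boundary(data, end):
        end += 1
    return begin, end


body = "naïve 👍".encode("utf-8")
# Offset 3 is the second byte of "ï"; offset 9 is inside the emoji.
begin, end = snap_range(body, 3, 9)
print(begin, end)                        # 2 11
print(body[begin:end].decode("utf-8"))   # ïve 👍
```

Rejecting the reference outright would be just as easy to implement; the point is only that detecting a bad edge needs nothing more than a bit mask, which is not true of grapheme cluster segmentation.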

I lean towards bytes because it keeps things simple and I doubt we'd see much adoption if we used grapheme clusters, though I do worry about implementations assuming that everything is ASCII.


Sam Whited
sam at samwhited.com
