[Standards] XEP-0372: References

Jonas Wielicki jonas at wielicki.name
Mon Mar 12 15:17:07 UTC 2018

On Monday, 12 March 2018 15:56:04 CET Sam Whited wrote:
> On Mon, Mar 12, 2018, at 09:20, Jonas Wielicki wrote:
> > FWIW, I’d argue that using codepoints is much saner,
> > because those are readily available and consistent across all receiving
> > entities of the same message. (Even though this potentially introduces
> > some interesting edge-cases when somebody creates a reference starting in
> > the middle of, e.g., an emoji sequence.)
> I agree that "characters" needs to be specified, but codepoints (or, to be
> pedantic, "scalar values" which is a subset of codepoints and is all we
> need to worry about in XMPP/UTF-8 land) 

This is true, XML restricts to Scalar Values. Thanks, I didn’t know that term.
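
For reference, a minimal sketch of the Char production from XML 1.0 written
as a Python predicate. The function name is mine and purely illustrative; the
point is only that what XML allows is a subset of the Scalar Values, with the
surrogate range excluded:

```python
def is_xml_char(cp: int) -> bool:
    """True if code point `cp` matches the XML 1.0 `Char` production.

    `is_xml_char` is a hypothetical helper name, not part of any library.
    """
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # stops right before the surrogates
        or 0xE000 <= cp <= 0xFFFD      # excludes U+FFFE and U+FFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )
```

Surrogates (U+D800 to U+DFFF) never match, so every accepted value is a
Scalar Value.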

> is probably the wrong way to do it.
> Codepoints don't really get us anything over bytes, 

They do, because implementations may not even have the byte representation 
around or accessible: as you say, XML operates on Unicode Scalar Values, not 
on Scalar Values encoded to UTF-8 bytes.

(Now, you could argue that XMPP on its lowest level indeed is UTF-8 encoded, 
which is true. However, having an XML parser in between usually abstracts you 
away from that, so you might not have access to the UTF-8 encoded bytes but 
only to sequences of Scalar Values, and still have a perfectly good (I would 
personally even argue, superior) XML implementation.)
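
A small illustration of the difference, as a sketch only: the stanza content
and the variable names are made up, and Python's ElementTree merely stands in
for whatever XML parser an implementation uses. The parser hands back a
string of Scalar Values, and the byte offset has to be recomputed by encoding
the text again:

```python
import xml.etree.ElementTree as ET

# Made-up message body: "Grüße, 😀 @jonas", written with escapes so the
# exact code points are unambiguous.
stanza = "<body>Gr\u00fc\u00dfe, \U0001F600 @jonas</body>"
text = ET.fromstring(stanza).text  # a str, i.e. a sequence of code points

# Offset of the mention counted in Scalar Values: directly available.
start = text.index("@jonas")

# Offset counted in UTF-8 bytes: only available by re-encoding the prefix.
byte_start = len(text[:start].encode("utf-8"))

print(start, byte_start)  # prints "9 14"
```

The two numbers diverge as soon as the text before the reference contains
anything outside ASCII.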

> because just as
> scalar values can be made up of multiple bytes, glyphs (or "grapheme
> clusters") may be made up of multiple scalar values (and, as you pointed
> out, the range could end in the middle of a grapheme cluster that uses
> multiple scalar values).
> In my mind there are only two things that make sense here:
> - Use bytes and come up with a way to handle bad ranges that end in the
> middle of a UTF-8 sequence 

That proposal does not make sense at all. It doesn’t solve the issue of having 
a range start or end in the middle of a grapheme cluster, and it introduces 
extra complexity by requiring implementations to re-obtain a UTF-8 
representation of the character data (or keep it around). Sounds like the 
worst of both worlds (Grapheme Clusters vs. Scalar Values). XML Character Data 
is specified in Scalar Values (they call it Characters, but it really is a 
Scalar Value minus U+FFFE, U+FFFF and most C0 control characters), so it makes 
most sense to re-use that.
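
To make the first point concrete, here is a rough sketch with made-up text
and a hypothetical helper name. Byte offsets add a new failure mode (a range
that cuts a UTF-8 sequence), while a "valid" byte range can still split a
grapheme cluster exactly like a Scalar Value range can:

```python
# 'e' + COMBINING ACUTE ACCENT (one grapheme cluster), then an emoji.
text = "e\u0301\U0001F600"
raw = text.encode("utf-8")  # 1 + 2 + 4 = 7 bytes


def slice_bytes(data: bytes, start: int, end: int):
    """Decode data[start:end]; return None if the range cuts a UTF-8 sequence.

    Hypothetical helper, only meant to show the extra error case.
    """
    try:
        return data[start:end].decode("utf-8")
    except UnicodeDecodeError:
        return None


cut_sequence = slice_bytes(raw, 0, 2)   # ends inside the combining accent
cut_cluster = slice_bytes(raw, 0, 1)    # valid UTF-8, but splits the cluster
same_with_scalars = text[0:1]           # Scalar Value range, same split
```

Here `cut_sequence` is `None`, while `cut_cluster` and `same_with_scalars`
are both the bare "e" without its accent.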

> - Use grapheme clusters and require that
> everyone implement the segmentation algorithm

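This will bring us all kinds of issues with different Unicode versions: the 
segmentation rules and character properties change between versions, so 
sender and receiver may disagree on where a cluster begins or ends.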

> I lean towards bytes because it keeps things simple and 

Then let’s stay with Scalar Values, which is what XML works with, instead of 
using a lower-level representation.

> I doubt we'd see
> much adoption if we used grapheme clusters

I agree.

If we want to find a middle way, I would suggest the following (in addition to 
switching to Scalar Values as the index base for @start and @end):

> If an implementation uses references to apply any kind of markup to the 
> text range referred to by the reference, it SHOULD ensure that the @start 
> points to the beginning of a Grapheme Cluster and @end to the end of a 
> Grapheme Cluster.

An implementation may also just attach something to the whole message or in 
another way interpret the reference without paying attention to @start and 
@end.
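
A rough sketch of how a client applying markup could follow that SHOULD.
Several assumptions here: the helper names are hypothetical, the offsets
count Scalar Values with an exclusive end, and the boundary test is only a
crude approximation (combining marks and ZERO WIDTH JOINER), not the full
segmentation algorithm from UAX #29:

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER, used in emoji sequences


def is_boundary(text: str, i: int) -> bool:
    """Approximate check whether offset i lies between two grapheme clusters.

    Simplification: only combining marks (general category M*) and ZWJ are
    treated as cluster-internal. Real segmentation has many more rules.
    """
    if i <= 0 or i >= len(text):
        return True
    if unicodedata.category(text[i]).startswith("M") or text[i] == ZWJ:
        return False
    return text[i - 1] != ZWJ


def snap_range(text: str, start: int, end: int):
    """Widen (start, end) outwards until both ends sit on a boundary."""
    while not is_boundary(text, start):
        start -= 1
    while not is_boundary(text, end):
        end += 1
    return start, end


# "n", "e" + combining acute accent, "e": a range covering only the accent
# is widened to cover the whole accented letter.
print(snap_range("ne\u0301e", 2, 3))  # prints "(1, 3)"
```

A receiving client could run something like this before applying markup and
leave the offsets untouched for any other interpretation of the reference.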

kind regards,