[Standards] XEP-0372: References
flo at geekplace.eu
Mon Mar 12 17:01:39 UTC 2018
On 12.03.2018 16:17, Jonas Wielicki wrote:
> On Monday, 12 March 2018 15:56:04 CET Sam Whited wrote:
>> On Mon, Mar 12, 2018, at 09:20, Jonas Wielicki wrote:
>> because just as
>> scalar values can be made up of multiple bytes, glyphs (or "grapheme
>> clusters") may be made up of multiple scalar values (and, as you pointed
>> out, the range could end in the middle of a grapheme cluster that uses
>> multiple scalar values).
>> In my mind there are only two things that make sense here:
>> - Use bytes and come up with a way to handle bad ranges that end in the
>> middle of a UTF-8 sequence
> That proposal does not make sense at all. It doesn’t solve the issue of having
> a range start or end in the middle of a grapheme cluster, and it introduces
> extra complexity by requiring implementations to re-obtain a UTF-8
> representation of the character data (or keep it around). Sounds like the
> worst of both worlds (Grapheme Clusters vs. Scalar Values). XML Character Data
> is specified in Scalar Values (they call it Characters, but it really is a
> Scalar Value minus \uFFFF and \uFFFE), so it makes most sense to re-use that.
>> - Use grapheme clusters and require that
>> everyone implement the segmentation algorithm
> This will bring us all kinds of issues with different Unicode versions.
>> I lean towards bytes because it keeps things simple and
> Then let’s stay with Scalar Values, which is what XML works with, instead of
> using a lower-level representation.
I'm also leaning towards this.
We could additionally specify that a pointer to the start or the middle
of a grapheme cluster is not recommended and, if found, should be
treated as a pointer to the cluster itself.
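
To make the units under discussion concrete, here is a minimal Python
sketch. It is an illustration only, not text proposed for the XEP. The
function names are hypothetical, and the check for combining marks is a
deliberately simplified stand-in for the real grapheme cluster
segmentation algorithm (UAX #29), which covers many more cases. It
assumes the string contains no lone surrogates, so that indexing a
Python str is indexing by scalar value.

```python
import unicodedata


def utf8_offset_to_scalar_offset(text: str, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a scalar value offset.

    Raises ValueError for a "bad" offset that is out of bounds or ends
    in the middle of a multi-byte UTF-8 sequence.
    """
    data = text.encode("utf-8")
    if not 0 <= byte_offset <= len(data):
        raise ValueError("byte offset out of bounds")
    try:
        return len(data[:byte_offset].decode("utf-8"))
    except UnicodeDecodeError:
        raise ValueError("byte offset falls inside a UTF-8 sequence") from None


def snap_range(text: str, begin: int, end: int) -> tuple[int, int]:
    """Widen the half-open range [begin, end), given in scalar values,
    so that it does not separate a base character from its combining marks.

    ASSUMPTION: "combining class != 0" approximates a cluster boundary.
    This is NOT full UAX #29 segmentation.
    """
    # A begin that points at a combining mark moves back to the base character.
    while 0 < begin < len(text) and unicodedata.combining(text[begin]) != 0:
        begin -= 1
    # An end that is followed by combining marks moves forward past them.
    while end < len(text) and unicodedata.combining(text[end]) != 0:
        end += 1
    return begin, end


# "e" + U+0301 COMBINING ACUTE ACCENT + "x":
# one visible "é" made of two scalar values, followed by "x".
sample = "e\u0301x"
scalar_length = len(sample)                # counted in scalar values
byte_length = len(sample.encode("utf-8"))  # counted in UTF-8 bytes
```

With this sample, snap_range(sample, 1, 3) widens the range to start at
offset 0, so a pointer into the middle of the cluster is treated as a
pointer to the whole cluster. Note that the result of real segmentation
can change between Unicode versions, which is the concern raised above.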