[Standards] Proposed XMPP Extension: Character counting in message bodies

Marvin W xmpp at larma.de
Mon Dec 7 15:22:24 UTC 2020


Hi,

On 04.12.20 21:23, Florian Schmaus wrote:
> And I am in favor of code points because it allows us to aim for the
>  extended grapheme cluster algorithm, while also allowing for the 
> "simply count code points" fallback.

The rationale in XEP-0426 already discusses why it uses code points
instead of grapheme clusters:

> The most obvious way of counting characters is to count them how 
> humans would. This sounds easy when only having western scripts in 
> mind but becomes more complicated in other scripts and most 
> importantly is not well-defined across Unicode versions. New unicode 
> versions regularly added new possibilities to build grapheme 
> clusters, including from existing code points. To be forward 
> compatible, counting grapheme clusters, graphemes, glyphs or similar 
> is thus not an option.
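
To make the difference between the units concrete, here is a minimal
Python sketch. It is only an illustration, not text from the XEP; the
function name count_units is made up for this example. It relies on the
fact that a Python str is a sequence of code points, so len() already
counts code points.

```python
def count_units(text: str) -> dict:
    """Return the length of text in three different units."""
    return {
        # A Python str is a sequence of Unicode code points.
        "code_points": len(text),
        # Each UTF-16 code unit is two bytes (no BOM with the -le variant).
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# MAN + ZERO WIDTH JOINER + WOMAN + ZERO WIDTH JOINER + GIRL
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(count_units(family))
# {'code_points': 5, 'utf16_units': 8, 'utf8_bytes': 18}
```

The code point count of this string is 5 under every Unicode version.
Whether a client treats it as one grapheme cluster or several depends on
the Unicode version and segmentation rules its library implements.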

Also, I forgot to mention that grapheme clusters are locale-specific
(for example, "ch" is considered a single grapheme cluster in Slovak).
TR#29 even says:

> The Unicode definitions of grapheme clusters are defaults: not meant
> to exclude the use of more sophisticated definitions of tailored
> grapheme clusters where appropriate.

Finally, I don't think that it's generally inappropriate to point inside
a grapheme cluster (even if that's hard to implement). An example of
where it seems appropriate to reference a part of a grapheme cluster is
this: https://larma.de/grapheme.html
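
As a rough sketch of what pointing inside a grapheme cluster looks like
with code point offsets (this is a separate toy case, not necessarily the
one on the linked page, and slice_code_points is a hypothetical helper
name):

```python
import unicodedata


def slice_code_points(text: str, begin: int, end: int) -> str:
    """Return the code points in the half-open range [begin, end)."""
    # Python str indexing is already by code point.
    return text[begin:end]


# 'e' followed by a combining accent: typically rendered as a single "é"
text = "e\u0301"

accent = slice_code_points(text, 1, 2)
print(unicodedata.name(accent))
# COMBINING ACUTE ACCENT
```

With code point offsets, the range 1..2 addresses only the accent. With
grapheme cluster offsets, there would be no way to express that range.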

Marvin

