[Standards] Proposed XMPP Extension: Character counting in message bodies
Marvin W
xmpp at larma.de
Mon Dec 7 15:22:24 UTC 2020
Hi,
On 04.12.20 21:23, Florian Schmaus wrote:
> And I am in favor of code points because it allows us to aim for the
> extended grapheme cluster algorithm, while also allowing for the
> "simply count code points" fallback.
XEP-0426 already discusses why it's using code points instead of
grapheme clusters in its rationale:
> The most obvious way of counting characters is to count them how
> humans would. This sounds easy when only having western scripts in
> mind but becomes more complicated in other scripts and most
> importantly is not well-defined across Unicode versions. New unicode
> versions regularly added new possibilities to build grapheme
> clusters, including from existing code points. To be forward
> compatible, counting grapheme clusters, graphemes, glyphs or similar
> is thus not an option.
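
To make the difference concrete, here is a minimal Python sketch of code point counting. The sample strings and the helper name `count_code_points` are illustrative assumptions and are not taken from the XEP. The sketch relies only on the fact that a Python `str` is a sequence of Unicode code points, so the count does not depend on which Unicode version the interpreter ships with.

```python
def count_code_points(body: str) -> int:
    # A Python str is a sequence of Unicode code points,
    # so len() counts code points directly.
    return len(body)


# Illustrative sample bodies (assumptions, not from the XEP).
samples = {
    # Four ASCII letters.
    "plain": "cafe",
    # 'e' followed by U+0301 COMBINING ACUTE ACCENT: rendered as one
    # user-perceived character, but stored as two code points.
    "combining": "cafe\u0301",
    # MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL:
    # typically rendered as a single family emoji.
    "zwj_emoji": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

counts = {name: count_code_points(text) for name, text in samples.items()}
```

A grapheme-cluster count for the last two samples would depend on the segmentation rules the counting side implements, whereas the code point counts above are fixed by the stored string alone.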
Also, I forgot to mention that grapheme clusters are locale-specific
(example: "ch" is considered a single grapheme cluster in Slovak).
TR#29 even says:
> The Unicode definitions of grapheme clusters are defaults: not meant
> to exclude the use of more sophisticated definitions of tailored
> grapheme clusters where appropriate.
Finally, I don't think that it's generally inappropriate to point inside
a grapheme cluster (even if that's hard to implement). An example of
where it seems appropriate to reference a part of a grapheme cluster is
this: https://larma.de/grapheme.html
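
The linked page is not reproduced here. As a separate, hypothetical illustration of pointing inside a grapheme cluster, the Python sketch below selects one member of a ZWJ emoji sequence by code point offsets. The helper `slice_by_code_points` and the half-open `[begin, end)` offset convention are assumptions made for this example only.

```python
def slice_by_code_points(body: str, begin: int, end: int) -> str:
    """Return the code points in the half-open range [begin, end).

    The half-open convention is an assumption for this sketch.
    """
    if not 0 <= begin <= end <= len(body):
        raise ValueError("offsets out of range")
    return body[begin:end]


# MAN, ZERO WIDTH JOINER, WOMAN, ZERO WIDTH JOINER, GIRL.
# Usually rendered as one grapheme cluster (a family emoji).
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

# Code point offsets 2..3 address only the WOMAN code point,
# which lies strictly inside the cluster.
second_member = slice_by_code_points(family, 2, 3)
```

With grapheme-cluster offsets, the whole family emoji would be the smallest addressable unit, so a reference to a single member could not be expressed.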
Marvin