[Standards] Proposed XMPP Extension: Character counting in message bodies
flo at geekplace.eu
Mon Dec 7 18:34:02 UTC 2020
Hi Marvin :)
On 12/7/20 4:22 PM, Marvin W wrote:
> On 04.12.20 21:23, Florian Schmaus wrote:
>> And I am in favor of code points because it allows us to aim for the
>> extended grapheme cluster algorithm, while also allowing for the
>> "simply count code points" fallback.
> XEP-0426 already discusses why it's using codepoints instead of
> grapheme clusters in its rationale:
> Also I forgot to mention that grapheme clusters are locale specific
> (example: "ch" is considered a single grapheme cluster in Slovak).
We do have xml:lang, don't we?
> Finally, I don't think that it's generally inappropriate to point inside
> a grapheme cluster (even if that's hard to implement). An example of
> where it seems appropriate to reference a part of a grapheme cluster is
> this: https://larma.de/grapheme.html
Fair point. (I am not sure about the relevance, though).
Let us ignore grapheme clusters for a moment and focus on XEP-0426: Have
you considered Unicode normalization? Especially when a text that was
originally in decomposed form is normalized to composed form. This would
corrupt the code point indexes.
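
To make the concern concrete, here is a minimal Python sketch using only
the standard library's unicodedata module. The sample string and the
helper name code_point_index are made up for illustration; nothing here
is taken from XEP-0426 itself.

```python
import unicodedata

# "é" in decomposed form: "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301 ok"
# The same text after something on the path normalized it to NFC.
composed = unicodedata.normalize("NFC", decomposed)


def code_point_index(text, needle):
    """Return the code point offset of needle in text (hypothetical helper).

    Python's str is indexed by code point, so str.index gives the offset
    a code-point-counting implementation would compute.
    """
    return text.index(needle)


index_before = code_point_index(decomposed, "ok")  # offset as sent
index_after = code_point_index(composed, "ok")     # offset after NFC
```

The two offsets differ by one, because "e" plus the combining accent
collapses into a single code point under NFC. A reference computed
against the decomposed text would point one position too far in the
composed text.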
XMPP does not require any Unicode normal form. Nor does XML 1.0 (as far
as I can tell). Furthermore, XMPP does not require that the Unicode form
is preserved in transit.
Hence it would be perfectly possible that the Unicode normal form of
text exchanged via XMPP changes between hops. While I am not aware of an
implementation that does that, it is not forbidden. And if you think
that this will never happen, please also keep in mind that stanzas
may be persisted in a database, for example when put in the MAM archive,
and a database engine may perform normalization of the data.
I think that due to this, XEP-0426 should specify that counting happens
with the text in NFC form. Or am I missing something?
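
A rough sketch of what "counting in NFC form" could look like, again in
Python with only the standard library. The function names nfc_length and
nfc_slice are hypothetical, and this only illustrates the suggestion; it
is not what XEP-0426 currently specifies.

```python
import unicodedata


def nfc_length(text):
    """Count code points after normalizing text to NFC (hypothetical helper)."""
    return len(unicodedata.normalize("NFC", text))


def nfc_slice(text, begin, end):
    """Resolve a [begin, end) code point range against the NFC form of text."""
    return unicodedata.normalize("NFC", text)[begin:end]


# The body as the sender wrote it (decomposed "é") ...
as_sent = "cafe\u0301 ok"
# ... and as it might come back from an archive that normalized it.
as_stored = unicodedata.normalize("NFC", as_sent)

# Both sides normalize before counting, so the same range selects the
# same text no matter which form they actually hold.
from_sent = nfc_slice(as_sent, 5, 7)
from_stored = nfc_slice(as_stored, 5, 7)
```

Since NFC normalization is idempotent, both sides arrive at the same
code point sequence whether or not an intermediate hop or database has
already normalized the text.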