[Standards] Proposed XMPP Extension: Character counting in message bodies

Florian Schmaus flo at geekplace.eu
Mon Dec 7 18:34:02 UTC 2020

Hi Marvin :)

On 12/7/20 4:22 PM, Marvin W wrote:
> On 04.12.20 21:23, Florian Schmaus wrote:
>> And I am in favor of code points because it allows us to aim for the
>> extended grapheme cluster algorithm, while also allowing for the
>> "simply count code points" fallback.
> XEP-0426 already discusses why it's using codepoints instead of
> grapheme clusters in its rationale:
> […]
> Also I forgot to mention that grapheme clusters are locale-specific
> (example: "ch" is considered a single grapheme cluster in Slovak).

We do have xml:lang, don't we?

> Finally, I don't think that it's generally inappropriate to point inside
> a grapheme cluster (even if that's hard to implement). An example of
> where it seems appropriate to reference a part of a grapheme cluster is
> this: https://larma.de/grapheme.html

Fair point. (I am not sure about the relevance, though).

Let us ignore grapheme clusters for a moment and focus on XEP-0426: Have
you considered Unicode normalization? In particular, if a text that was
originally in decomposed form is normalized to composed form, the number
of code points changes, which would corrupt the code point indexes.
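
To illustrate the concern, here is a minimal Python sketch. The sample
string and the offsets are made up for illustration and are not taken
from XEP-0426; Python's str indexes happen to count code points, which
is why it is convenient here.

```python
import unicodedata

# Body as sent: "é" in decomposed form, i.e. "e" followed by
# U+0301 COMBINING ACUTE ACCENT.
decomposed = "Cafe\u0301 ok"

# The same body after some hop or database normalized it to NFC.
composed = unicodedata.normalize("NFC", decomposed)

# Hypothetical reference created by the sender: code points [6, 8).
begin, end = 6, 8

# Against the original body the reference selects the intended text.
referenced_by_sender = decomposed[begin:end]

# Against the normalized body the same offsets are shifted by one,
# because "e" + U+0301 collapsed into the single code point U+00E9.
referenced_by_receiver = composed[begin:end]

print(len(decomposed), len(composed))
print(repr(referenced_by_sender), repr(referenced_by_receiver))
```

The two bodies render identically, yet the same pair of offsets selects
different text in each.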

XMPP does not require any Unicode normalization form. Nor does XML 1.0
(as far as I can tell). Furthermore, XMPP does not require that the
normalization form of a text is preserved in transit.

Hence it would be perfectly possible that the Unicode normalization form
of text exchanged via XMPP changes between hops. While I am not aware of
an implementation that does that, it is not forbidden. And if you think
that this will never happen, then please also keep in mind that stanzas
may be persisted in a database, for example when put in the MAM archive,
and a database engine may perform normalization of the data.

I think that due to this, XEP-0426 should specify that counting happens 
with the text in NFC form. Or am I missing something?
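
Roughly what I have in mind, as a sketch only: both sides normalize the
body to NFC before counting. The helper names nfc_offsets and nfc_slice
are hypothetical and appear in no XEP, and the "stored" string merely
simulates an intermediary that changed the normalization form.

```python
import unicodedata
from typing import Tuple


def nfc_offsets(body: str, needle: str) -> Tuple[int, int]:
    """Sender side: locate needle, counting code points of the NFC body."""
    nfc_body = unicodedata.normalize("NFC", body)
    nfc_needle = unicodedata.normalize("NFC", needle)
    begin = nfc_body.index(nfc_needle)  # raises ValueError if absent
    return begin, begin + len(nfc_needle)


def nfc_slice(body: str, begin: int, end: int) -> str:
    """Receiver side: apply the offsets to the NFC form of the body."""
    return unicodedata.normalize("NFC", body)[begin:end]


sent = "Cafe\u0301 ok"    # decomposed form, as typed by the sender
stored = "Caf\u00e9 ok"   # composed form, e.g. after a database round trip

begin, end = nfc_offsets(sent, "ok")

# Both variants of the body now resolve the reference to the same text.
from_sent = nfc_slice(sent, begin, end)
from_stored = nfc_slice(stored, begin, end)
print(begin, end, repr(from_sent), repr(from_stored))
```

With this, the offsets no longer depend on which normalization form the
body had on the wire or in the archive.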

- Florian
