[Standards] Proposed XMPP Extension: Character counting in message bodies

Marvin W xmpp at larma.de
Mon Dec 7 22:34:29 UTC 2020


Hi,

On 07.12.20 19:34, Florian Schmaus wrote:
> We do have xml:lang, don't we?

Unfortunately, it doesn't help in all cases. It's perfectly fine to write
a message with xml:lang="en":

> "chlapec" is "boy" in slowak

In Slovak, "ch" is a single letter, so with Slovak tailoring this is 27
grapheme clusters, but I guess most Western people would count it as 28.

> Let us ignore grapheme clusters for a moment and focus on XEP-0426: 
> Have you considered Unicode normalization? Especially when a text 
> that was originally in decomposed form is normalized to composed 
> form. This would corrupt the code point indexes.
> 
> [..]
> 
> I think that due to this, XEP-0426 should specify that counting 
> happens with the text in NFC form. Or am I missing something?


Normalization was already discussed with regard to XEP-0426 (not sure if
that was on list or in chat). Normalizing any text as part of processing
is modifying the content (as per the XML specification). For most
purposes we assume that the server does not modify the <body> of a
message or any other XML element which isn't clearly the server's domain
to modify. If we assume servers modify the (XML) content of <body>, any
attempt at character counting becomes worthless anyway.
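
To illustrate the concern, here is a minimal sketch using Python's
standard unicodedata module. The sample string is made up for this
example; it only shows that NFC normalization can shift code point
positions, not how any particular server behaves.

```python
import unicodedata

# Hypothetical body text: the "é" is in decomposed form,
# i.e. "e" followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301 au lait"

# What an entity applying NFC would turn the text into.
composed = unicodedata.normalize("NFC", decomposed)

# Code point counts differ: the two code points "e" + U+0301 become one.
print(len(decomposed), len(composed))

# The code point index of "au" shifts by one, so a reference created
# against the decomposed text points at the wrong place afterwards.
print(decomposed.index("au"), composed.index("au"))
```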

I could imagine going for something like:

> Receiving or intermediary entities SHOULD NOT apply Unicode 
> normalization to the text referenced from character counting. If 
> entities apply Unicode normalization, they SHOULD update all 
> positions, indices and lengths derived from character counting if 
> required. It is RECOMMENDED that entities creating the original 
> stanzas use NFC form.
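
As a rough sketch of what "updating positions" could look like for an
entity that does normalize: the helper below (remap_offset is a
hypothetical name, not something from XEP-0426) maps a code point offset
in the original text to the matching offset in the NFC form. It assumes
the offset does not fall inside a sequence that NFC would compose, and
raises an error otherwise.

```python
import unicodedata


def remap_offset(text: str, offset: int) -> int:
    """Map a code point offset in `text` to the offset in NFC(text).

    Hypothetical helper: only valid if normalizing the two halves
    separately gives the same result as normalizing the whole text.
    """
    new_prefix = unicodedata.normalize("NFC", text[:offset])
    new_suffix = unicodedata.normalize("NFC", text[offset:])
    if new_prefix + new_suffix != unicodedata.normalize("NFC", text):
        # The offset splits a sequence that NFC combines across the cut.
        raise ValueError("offset falls inside a composed sequence")
    return len(new_prefix)


# Made-up sample: "é" written as "e" + U+0301 COMBINING ACUTE ACCENT.
sample = "cafe\u0301 au lait"
new_offset = remap_offset(sample, 6)  # offset of "au" in the original
```

An offset such as 4 (between "e" and the combining accent) has no
counterpart in the NFC text, which is why the sketch rejects it rather
than guessing.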

Marvin

