[Standards] Proposed XMPP Extension: Character counting in message bodies

Florian Schmaus flo at geekplace.eu
Wed Dec 9 07:59:03 UTC 2020


On 12/7/20 11:34 PM, Marvin W wrote:
> On 07.12.20 19:34, Florian Schmaus wrote:
>> We do have xml:lang, don't we?
> 
> Unfortunately, it doesn't help in all cases. It's perfectly fine to write
> a message with xml:lang="en":
> 
> "chlapec" is "boy" in Slovak
> 
> This is 27 grapheme clusters, but I guess most western people would
> count it as 28.

But the recipient would be able to apply the same rules regarding 
localization as the sender when counting grapheme clusters.


>> Let us ignore grapheme clusters for a moment and focus on XEP-0426:
>> Have you considered Unicode normalization? Especially when a text
>> that was originally in decomposed form is normalized to composed
>> form. This would corrupt the code point indexes.
>>
>> [..]
>>
>> I think that due to this, XEP-0426 should specify that counting
>> happens with the text in NFC form. Or am I missing something?
> 
> I could imagine going for something like:

Yes, that definitely goes in the right direction.


> Receiving or intermediary entities SHOULD NOT apply Unicode
> normalization to the text referenced from character counting.

I am not sure that you can (or that we should) put normative text that 
applies to intermediate hops into XEP-0426. The XEP could/should limit 
itself to normative clauses for the endpoints exchanging character 
counting data.


> If
> entities apply Unicode normalization, they SHOULD update all
> positions, indices and lengths derived from character counting if
> required.

As above. I think this would need at least a discoverable disco#info 
feature. But even then, I doubt that this is useful in a normative form. 
However, it probably cannot hurt to have XEP-0426 spell this out as an 
informative recommendation.


> It is RECOMMENDED that entities creating the original
> stanzas use NFC form.

Now that is the part I really like and which I believe to be missing 
from XEP-0426. +1
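
To illustrate why the form matters, here is a minimal Python sketch. The 
sample body is made up for illustration and does not come from any XEP. 
Python's str is indexed by code point, so len() and slicing show 
directly how the indexes shift once a decomposed text is composed:

```python
import unicodedata

# Hypothetical message body with "é" in decomposed form: "e" + U+0301.
decomposed = "cafe\u0301 menu"
# The same text after an entity has normalized it to NFC.
composed = unicodedata.normalize("NFC", decomposed)

# Python strings are sequences of code points, so len() counts code points.
print(len(decomposed))  # 10
print(len(composed))    # 9

# A range counted against the decomposed text ...
begin, end = 6, 10
print(decomposed[begin:end])  # menu
# ... points at the wrong code points once the text has been composed.
print(composed[begin:end])    # enu
```

If the sender had counted against the NFC form in the first place, the 
range for "menu" would have been 5 to 9.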

I also suggest that the receiving side is considered. For example: 
"Entities that receive character counted text should normalize the 
counted text to Unicode Normalization Form C (NFC) [1] prior to 
evaluating the character indexes."

1: https://unicode.org/reports/tr15/
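
A rough sketch of what that could look like on the receiving side, in 
Python. The function name and the begin/end semantics (code point 
offsets, end exclusive) are assumptions made for this example and are 
not taken from XEP-0426:

```python
import unicodedata


def extract_counted_range(body: str, begin: int, end: int) -> str:
    """Return the code points [begin, end) of the NFC form of body.

    Hypothetical helper: assumes the sender counted code points against
    the NFC form of the text, with end being exclusive.
    """
    # Normalize first, so that the indexes are evaluated against the
    # same form the sender is assumed to have used for counting.
    nfc_body = unicodedata.normalize("NFC", body)
    if not 0 <= begin <= end <= len(nfc_body):
        raise ValueError("range lies outside of the normalized text")
    return nfc_body[begin:end]


# The body arrives decomposed (e.g. because some hop rewrote it), but
# the sender's indexes refer to the NFC form.
received = "cafe\u0301 menu"
print(extract_counted_range(received, 5, 9))  # menu
```

This only works if both sides agree on NFC. If the sender counted 
against a non-normalized text, normalizing on reception would shift the 
indexes instead of fixing them.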

- Florian
