[Standards] Proposed XMPP Extension: Character counting in message bodies
Florian Schmaus
flo at geekplace.eu
Wed Dec 9 07:59:03 UTC 2020
On 12/7/20 11:34 PM, Marvin W wrote:
> On 07.12.20 19:34, Florian Schmaus wrote:
>> We do have xml:lang, don't we?
>
> Unfortunately, it doesn't help in all cases. It's perfectly fine to write
> a message with xml:lang="en":
>
> "chlapec" is "boy" in slowak
>
> This is 27 grapheme clusters, but I guess most western people would
> count it as 28.
But the recipient would be able to apply the same rules regarding
localization as the sender when counting grapheme clusters.
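To make the 27 versus 28 concrete, here is a minimal Python sketch. It is
only a toy: the helper names and the one-entry digraph table are made up
for illustration, and real locale tailoring would come from CLDR data, not
from a hand-written tuple. It assumes that a locale-aware counter treats
the Slovak digraph "ch" as one user-perceived character.

```python
def count_code_points(text: str) -> int:
    """Count Unicode code points (what len() gives for a Python str)."""
    return len(text)


def count_with_digraphs(text: str, digraphs: tuple) -> int:
    """Toy locale tailoring: count each listed digraph as a single unit.

    Hypothetical helper, not a real grapheme cluster segmenter.
    """
    lowered = text.lower()
    count = len(text)
    for digraph in digraphs:
        # Each occurrence collapses len(digraph) code points into one unit.
        count -= lowered.count(digraph) * (len(digraph) - 1)
    return count


# Assumption: simplified, incomplete digraph list for the "sk" locale.
SLOVAK_DIGRAPHS = ("ch",)

message = '"chlapec" is "boy" in slowak'
code_points = count_code_points(message)
tailored = count_with_digraphs(message, SLOVAK_DIGRAPHS)
```

Both sides get the same number only if they apply the same rules, which is
the point above: the count depends on the tailoring, not just on the text.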
>> Let us ignore grapheme clusters for a moment and focus on XEP-0426:
>> Have you considered Unicode normalization? Especially when a text
>> that was originally in decomposed form is normalized to composed
>> form. This would corrupt the code point indexes.
>>
>> [..]
>>
>> I think that due to this, XEP-0426 should specify that counting
>> happens with the text in NFC form. Or am I missing something?
>
> I could imagine going for something like:
Yes, that definitely goes in the right direction.
> Receiving or intermediary entities SHOULD NOT apply Unicode
> normalization to the text referenced from character counting.
I am not sure that you can (or that we should) put normative text that
applies to intermediate hops into XEP-0426. The XEP could/should limit
itself to normative clauses for the endpoints exchanging character
counting data.
> If
> entities apply Unicode normalization, they SHOULD update all
> positions, indices and lengths derived from character counting if
> required.
As above. I think this would need at least a discoverable disco#info
feature. But even then, I doubt that this is useful in a normative form.
However, it probably cannot hurt to have XEP-0426 spell this out as a
recommendation in an informative way.
> It is RECOMMENDED that entities creating the original
> stanzas use NFC form.
Now that is the part I really like and which I believe to be missing
from XEP-0426. +1
I also suggest that the receiving side is considered. For example:
"Entities that receive character counted text should normalize the
counted text to Unicode Normalization Form C (NFC) [1] prior to
evaluating the character indexes."
1: https://unicode.org/reports/tr15/
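
A small Python sketch of why this matters, using only the standard
library's unicodedata module. The function name referenced_text and the
sample string are made up for illustration; this is not text from
XEP-0426. It assumes indexes count code points, with an exclusive end.

```python
import unicodedata


def referenced_text(body: str, begin: int, end: int) -> str:
    """Normalize the body to NFC, then slice by code point index.

    Hypothetical receiver-side helper; 'end' is exclusive.
    """
    normalized = unicodedata.normalize("NFC", body)
    return normalized[begin:end]


# Decomposed form: 'e' followed by U+0301 COMBINING ACUTE ACCENT.
decomposed = "cafe\u0301 menu"
composed = unicodedata.normalize("NFC", decomposed)

# The sender counted on the NFC form, where "menu" spans [5, 9).
begin, end = 5, 9

with_nfc = referenced_text(decomposed, begin, end)
without_nfc = decomposed[begin:end]  # indexes applied to the raw body
```

Here the decomposed body is one code point longer than the composed one,
so applying the sender's indexes without normalizing first selects " men"
instead of "menu".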
- Florian