[Standards] Proposed XMPP Extension: Character counting in message bodies

Ralph Meijer ralphm at ik.nu
Sat Dec 21 10:07:03 UTC 2019


On December 21, 2019 10:57:02 AM GMT+01:00, Florian Schmaus <flo at geekplace.eu> wrote:
>On 18.12.19 16:00, Marvin W wrote:
>> It's indeed a good question if anything in XMPP allows servers or
>> in-between entities to do normalization. I was under the assumption that
>> servers do not change the codepoints. In XML [1] Characters with
>> multiple possible representations in ISO/IEC 10646 (e.g. characters with
>> both precomposed and base+diacritic forms) match only if they have the
>> same representation in both strings. Thus by XML specification,
>> normalization is changing the body.
>
>I think it may be a bit far-fetched to deduce from the XML "string
>match" definition that XMPP entities have no freedom at all to transform
>the Unicode string representation to a certain degree. At least I am
>currently missing the link from the XML "string match" definition to
>"XMPP entities must use this when serializing/de-serializing XML".
>
>If we can make that link, then we do not need normalization. And we
>probably want to clearly state that requirement in rfc6120bis, because
>it is not obvious (at least for me).

I'd be quite sad if the character data were normalized/canonicalized. Also, I haven't seen this anywhere in XMPP implementations outside of JID matching.
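
To illustrate why normalizing character data would matter for offsets, here is a minimal Python sketch. The sample body and the helper name `code_point_offset` are made up for illustration and do not come from any XEP. It shows that the same visible text has a different number of code points in NFC and NFD, so an offset computed on one form points elsewhere in the other.

```python
import unicodedata


def code_point_offset(text: str, needle: str) -> int:
    """Return the code point index of the first occurrence of needle.

    Hypothetical helper, only used for this illustration.
    """
    return text.index(needle)


# Sample body (an assumption): "café ok" with a precomposed é (U+00E9).
body = "caf\u00e9 ok"

nfc = unicodedata.normalize("NFC", body)  # é stays one code point
nfd = unicodedata.normalize("NFD", body)  # é becomes e + U+0301

# Same visible text, different code point counts.
print(len(nfc), len(nfd))

# The offset of "ok" shifts by one after decomposition.
print(code_point_offset(nfc, "ok"), code_point_offset(nfd, "ok"))

# The two forms do not compare equal code point by code point.
print(nfc == nfd)
```

If an entity between sender and receiver switched forms, any offset into the body that was computed before the change would be off by one for everything after the é.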


>> Also the main reason why we shouldn't ask for Unicode normalization to
>> happen is that different Unicode versions have different normalizations.
>> Thus if the sender normalizes with Unicode version X and calculates
>> offsets from that, then receiver normalizes with Unicode version Y and
>> determines the offsets there, they can end up pointing to different
>> characters.
>
>We need Unicode agility anyway in XMPP, which I do not believe to be a
>big issue. Especially since Unicode is likely to introduce fewer
>changes with every future standard version.

Quite (except for JIDs). This is also a reason why, for example, grapheme cluster counting would bring us a world of pain.
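
As a rough sketch of what is stable and what is not: code points and encoded lengths can be counted without any Unicode tables, whereas grapheme cluster boundaries depend on the segmentation rules of a specific Unicode version. The Python standard library has no grapheme segmentation, so the example below only shows the table-independent counts. The sample string and the helper name `count_units` are assumptions for illustration.

```python
import unicodedata


def count_units(text: str) -> dict:
    """Count a string in units that need no Unicode character tables.

    Hypothetical helper, only used for this illustration.
    """
    return {
        "code_points": len(text),
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# Sample (an assumption): man + ZWJ + woman + ZWJ + girl, usually
# rendered as a single "family" emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

counts = count_units(family)
print(counts)

# The Unicode version of the character database depends on the runtime,
# so anything derived from it can differ between sender and receiver.
print(unicodedata.unidata_version)
```

As far as I know, recent versions of the segmentation rules treat this sequence as a single grapheme cluster, while older versions split it at the emoji boundaries. The code point count is the same under either version.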


-- 
Cheers,

ralphm

