[Standards] Proposed XMPP Extension: Character counting in message bodies

Marvin W xmpp at larma.de
Wed Dec 18 15:00:44 UTC 2019


It is indeed a good question whether anything in XMPP allows servers or 
other intermediate entities to normalize. I was under the assumption that 
servers do not change the code points. The XML specification [1] says that 
characters with multiple possible representations in ISO/IEC 10646 (e.g. 
characters with both precomposed and base+diacritic forms) match only if 
they have the same representation in both strings. Thus, by the XML 
specification, normalization changes the body.
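
To make the precomposed vs. base+diacritic point concrete, here is a 
small Python sketch. The sample strings and the helper name 
codepoint_offset are made up for illustration and are not taken from 
the proposed XEP. It only shows that the two forms are different code 
point sequences, so an offset counted in one does not hold in the other.

```python
import unicodedata

# Two bodies that render the same but use different code points.
precomposed = "caf\u00e9!"   # "é" as the single code point U+00E9
decomposed = "cafe\u0301!"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def codepoint_offset(body: str, needle: str) -> int:
    """Hypothetical helper: offset of needle in body, counted in code points.

    Python's str is indexed by code point, so str.index is sufficient here.
    """
    return body.index(needle)


# The strings do not match code point by code point.
same_codepoints = precomposed == decomposed

# The offset of "!" depends on which representation was counted.
offset_pre = codepoint_offset(precomposed, "!")
offset_dec = codepoint_offset(decomposed, "!")

# Normalizing the decomposed body to NFC yields the precomposed one,
# i.e. normalization really does change the code points of the body.
nfc_equals_precomposed = unicodedata.normalize("NFC", decomposed) == precomposed
```

If an intermediate entity turned one form into the other, an offset the 
sender computed for "!" would be off by one on the receiving side.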

Also, the main reason why we shouldn't require Unicode normalization is 
that different Unicode versions can normalize the same text differently. 
If the sender normalizes with Unicode version X and calculates offsets 
from that, and the receiver then normalizes with Unicode version Y and 
determines the offsets there, the two can end up pointing to different 
characters.
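
A rough way to see the version problem in Python, as a sketch only: the 
standard unicodedata module also ships a frozen Unicode 3.2.0 database 
as unicodedata.ucd_3_2_0. The sample character below is my own choice: 
U+1B06 was, as far as I know, added in Unicode 5.0 with a canonical 
decomposition, so a 3.2.0 normalizer should treat it as unassigned and 
leave it alone. This assumes the interpreter's main database is Unicode 
5.0 or newer.

```python
import unicodedata

# U+1B06 BALINESE LETTER AKARA TEDUNG followed by a plain "x".
# Assumption: U+1B06 is unassigned in Unicode 3.2.0 and decomposes to
# U+1B05 U+1B35 in later versions.
body = "\u1b06x"

# "Sender" normalizes with the old database, "receiver" with the current one.
old_nfd = unicodedata.ucd_3_2_0.normalize("NFD", body)
new_nfd = unicodedata.normalize("NFD", body)

# Offset of "x" in code points, as each side would count it.
offset_old = old_nfd.index("x")
offset_new = new_nfd.index("x")

# The two sides disagree on where "x" is.
offsets_differ = offset_old != offset_new
```

Both sides "normalized to NFD" here, yet they disagree on the offset of 
"x" because they used different versions of the Unicode data.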

[1] https://www.w3.org/TR/REC-xml/#dt-match

On 12/18/19 11:59 AM, Florian Schmaus wrote:
> But I wonder if we
> shouldn't require Unicode normalization, i.e. the sender and receiver
> MUST normalize prior counting.
> 
> Given that nothing in XMPP guarantees you that the Unicode is not
> transformed somewhere in the stanza processing and routing, e.g. gets
> combined, this would be required so that sender and receiver operate on
> the same Unicode data.
> 
> And I believe that there could be cases where such transformations
> actually really happen, e.g. message archives which persist the Unicode
> data in combined form for efficiency reasons.

