[Standards] Proposed XMPP Extension: Character counting in message bodies
xmpp at larma.de
Wed Dec 18 15:00:44 UTC 2019
It's indeed a good question whether anything in XMPP allows servers or
intermediate entities to normalize. I was under the assumption that
servers do not change the codepoints. The XML specification says:
"Characters with multiple possible representations in ISO/IEC 10646
(e.g. characters with both precomposed and base+diacritic forms) match
only if they have the same representation in both strings." Thus, by the
XML specification, normalization changes the body.
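
To make the matching rule concrete, here is a minimal Python sketch using
the standard unicodedata module. The variable and function names are
only illustrative; nothing here comes from the proposed XEP.

```python
import unicodedata

# Two representations of the same visible character "é".
precomposed = "\u00e9"   # single codepoint U+00E9
decomposed = "e\u0301"   # U+0065 followed by combining acute U+0301

def codepoints(s):
    """Return the codepoints of s as a list of U+XXXX strings."""
    return [f"U+{ord(c):04X}" for c in s]

# Compared codepoint by codepoint, the two strings do not match.
same_codepoints = precomposed == decomposed

# They only become equal once one side has been normalized,
# i.e. once its codepoints have been changed.
same_after_nfc = unicodedata.normalize("NFC", decomposed) == precomposed
```

The two strings render identically but differ in length and content, so
an entity that normalizes one into the other has altered the body.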
Also, the main reason why we shouldn't ask for Unicode normalization to
happen is that different Unicode versions have different normalizations.
Thus, if the sender normalizes with Unicode version X and calculates
offsets from that, and the receiver then normalizes with Unicode version
Y and determines the offsets there, they can end up pointing to
different positions in the body.
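
A rough Python sketch of how offsets drift when both sides do not see the
same codepoints. Python's unicodedata is tied to a single Unicode
version, so as an assumption this uses NFD versus NFC as a stand-in for
two normalizations that disagree; span_of and the sample body are
hypothetical.

```python
import unicodedata

def span_of(body, needle):
    """Return (start, end) codepoint offsets of needle within body."""
    start = body.index(needle)
    return start, start + len(needle)

# Sender works on a decomposed body: "cafe" + U+0301 + " ok" (8 codepoints).
sender_body = unicodedata.normalize("NFD", "caf\u00e9 ok")

# Receiver ends up with the composed form: "café ok" (7 codepoints).
receiver_body = unicodedata.normalize("NFC", sender_body)

# Offsets are computed by the sender and applied by the receiver.
start, end = span_of(sender_body, "ok")
sender_view = sender_body[start:end]
receiver_view = receiver_body[start:end]
```

The sender's range covers "ok", while the same range on the receiver's
side covers only "k", because one codepoint disappeared earlier in the
body.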
On 12/18/19 11:59 AM, Florian Schmaus wrote:
> But I wonder if we
> shouldn't require Unicode normalization, i.e. the sender and receiver
> MUST normalize prior counting.
> Given that nothing in XMPP guarantees you that the Unicode is not
> transformed somewhere in the stanza processing and routing, e.g. gets
> combined, this would be required so that sender and receiver operate on
> the same Unicode data.
> And I believe that there could be cases where such transformations
> actually really happen, e.g. message archives which persist the Unicode
> data in combined form for efficiency reasons.