[Standards] Proposed XMPP Extension: Character counting in message bodies

Ralph Meijer ralphm at ik.nu
Fri Dec 20 12:15:09 UTC 2019

On 20-12-2019 12:55, Andrew Nenakhov wrote:
> On Thu, 19 Dec 2019 at 19:02, Ralph Meijer <ralphm at ik.nu> wrote:
>     If you want consistent counting on all platforms and languages,
>     counting Unicode characters seems to be the best way forward.
> We do not dispute that 'counting unicode characters seems the best way 
> forward'. However, we do dispute when to count them. It's more of a 
> preference issue, but we chose to count characters in the XML doc we 
> send, because XML standard is common for any platform and language.

Just to be clear: an XML stream is encoded in UTF-8 and needs additional 
processing (such as entity expansion) to represent a text. While that 
series of UTF-8 encoded characters does itself represent a sequence of 
Unicode characters (let's call it seq1), that sequence is not 
necessarily equivalent to the abstract sequence of characters that 
represents the above-mentioned text (seq2).

Counting in seq1 and counting in seq2 are different things as soon as 
there are CDATA sections, entities, etc., and I consider counting seq1 
to be the wrong approach. I.e., I expect the character count for the 
text in the body element of the following equivalent XML snippets to be 
exactly 1 (the sequence containing the single character U+003C), and not 
4, 5, 9, or 13, regardless of where you choose to count:
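
To make the seq1/seq2 distinction concrete, here is a minimal sketch in 
Python. The four snippets are assumptions: hypothetical encodings of 
U+003C picked so that their raw lengths come out as 4, 5, 9, and 13, and 
not necessarily the ones originally posted. The helper names raw_length 
and text_length are likewise only illustrative.

```python
import xml.etree.ElementTree as ET

# Hypothetical, equivalent ways of writing the single character U+003C
# as the text of a body element (illustrative assumptions only).
SNIPPETS = [
    "<body>&lt;</body>",           # predefined entity
    "<body>&#60;</body>",          # decimal character reference
    "<body>&#x0003C;</body>",      # hexadecimal character reference
    "<body><![CDATA[<]]></body>",  # CDATA section
]

OPEN, CLOSE = "<body>", "</body>"


def raw_length(snippet):
    """seq1: count characters as serialized between the body tags."""
    return len(snippet[len(OPEN):-len(CLOSE)])


def text_length(snippet):
    """seq2: count Unicode code points of the parsed text."""
    return len(ET.fromstring(snippet).text)


if __name__ == "__main__":
    for snippet in SNIPPETS:
        print(snippet, raw_length(snippet), text_length(snippet))
```

Counting after parsing gives the same answer for every serialization, 
while counting the serialized form depends on how the sender happened 
to escape the text.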



