[Standards] Proposed XMPP Extension: Character counting in message bodies

Florian Schmaus flo at geekplace.eu
Fri Dec 4 20:23:57 UTC 2020

On 12/4/20 8:25 PM, Sam Whited wrote:
> On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:
>> My problem with your proposal is that it uses bytes. I don't get why
>> you want to use bytes here.
> Naturally. Likewise my problem with your proposal is that it uses code
> points and I don't get why you'd want to use them here :)

I am beginning to feel that much of your rationale rests on the idea
that you always (or often?) have access to the raw UTF-8 bytes as they
appeared on the wire.

While this is probably true for languages where the String type's native
encoding is also UTF-8, it is usually not true for others. For example,
widely used XML parsers in Java return Java's String type, which is
UTF-16 (or ISO-8859-1 [1]) based. Then there is Python 3, where the str
type is a sequence of Unicode code points. Of course, it would be
possible to design and implement XML parsers in Java and Python that
return strings as they appeared in the parsed XML document/stream.
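
To make the difference concrete, here is a minimal Python 3 sketch that
reports three lengths for the same text: code points, UTF-8 bytes, and
UTF-16 code units. The function name length_views is made up for this
illustration and is not part of any proposal or library; the UTF-16
figure is only meant to approximate what Java's String.length() would
report for the same text.

```python
def length_views(text: str) -> dict:
    """Return the length of `text` under three different units."""
    return {
        # Python 3 str is a sequence of code points, so len() counts those.
        "code_points": len(text),
        # Bytes as they would appear in a UTF-8 encoded stream.
        "utf8_bytes": len(text.encode("utf-8")),
        # UTF-16 code units (2 bytes each; "-le" avoids emitting a BOM).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


if __name__ == "__main__":
    # Sample with one 2-byte and one 4-byte UTF-8 character.
    print(length_views("h\u00e9llo \U0001F600"))
```

The three numbers only agree for pure ASCII text, so two entities have
to count in the same unit to arrive at the same value.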

However, given the wide variety here, I am not sure it is worth taking
any of that into consideration.

Instead, my rationale is based on the idea that you always have access 
to the Unicode code points of the textual content obtained from the XML. 
And I am in favor of code points because it allows us to aim for the 
extended grapheme cluster algorithm, while also allowing for the "simply 
count code points" fallback.

Note that both methods, counting grapheme clusters vs. counting code
points, would, unless I missed a grapheme cluster, yield the same
result for this e-mail.
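
As a rough illustration of where the two counts agree and where they
diverge, here is a Python 3 sketch. Note the assumption: the Python
standard library does not implement extended grapheme cluster
segmentation (UAX #29), so count_approx_graphemes below is a deliberately
crude stand-in that only folds combining marks into the preceding
character. Both function names are hypothetical, and a real
implementation would use a proper UAX #29 segmenter.

```python
import unicodedata


def count_code_points(text: str) -> int:
    """The "simply count code points" fallback."""
    return len(text)


def count_approx_graphemes(text: str) -> int:
    """Crude approximation of grapheme cluster counting (NOT UAX #29).

    Only skips combining marks (general categories Mn, Mc, Me). It does
    not handle ZWJ emoji sequences, regional indicator pairs, Hangul
    jamo, and other cases the real algorithm covers.
    """
    return sum(
        1 for ch in text
        if unicodedata.category(ch) not in ("Mn", "Mc", "Me")
    )


if __name__ == "__main__":
    for sample in ("plain ASCII text", "e\u0301"):  # second: e + combining acute
        print(repr(sample),
              count_code_points(sample),
              count_approx_graphemes(sample))
```

For plain ASCII text both functions return the same number; they only
diverge once the text contains combining sequences.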

- Florian

1: Please ignore this. I have only mentioned it for completeness. If you
are curious, look up "JEP 254: Compact Strings".
