[Standards] Proposed XMPP Extension: Character counting in message bodies
flo at geekplace.eu
Fri Dec 4 20:23:57 UTC 2020
On 12/4/20 8:25 PM, Sam Whited wrote:
> On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:
>> My problem with your proposal is that it uses bytes. I don't get why
>> you want to use bytes here.
> Naturally. Likewise my problem with your proposal is that it uses code
> points and I don't get why you'd want to use them here :)
I am beginning to feel that a lot of your rationale is based on the idea
that you always (or often?) have access to the raw UTF-8 bytes as they
appeared on the wire.
While this is probably true for languages where the String type's native
encoding is also UTF-8, it is usually not true for others. For example,
widely used XML parsers in Java return Java's String type, which is
UTF-16 (or ISO-8859-1 [1]) based. Then there is Python 3, where the str
type is a sequence of Unicode characters (code points). Of course, it
would be possible to design and implement XML parsers in Java and Python
that return strings as they appeared in the parsed XML document/stream.
However, given the wide variety here, I am not sure it is worth taking
any of that into consideration.
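
To illustrate how the three units diverge, here is a minimal Python 3
sketch. The helper name count_units is made up for this example and is
not part of any proposal. It counts the same string as UTF-8 bytes, as
UTF-16 code units (what a Java String length reports), and as code
points (what Python's len() reports for a str).

```python
def count_units(text: str) -> dict:
    """Count one string in three different units (illustrative helper)."""
    return {
        # Bytes as they would appear on a UTF-8 wire.
        "utf8_bytes": len(text.encode("utf-8")),
        # 16-bit code units; "utf-16-le" avoids the byte order mark.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
        # Python 3 str is a sequence of code points.
        "code_points": len(text),
    }


if __name__ == "__main__":
    samples = ["abc", "\u00e9", "\U0001F600"]  # ASCII, e-acute, an emoji
    for sample in samples:
        print(repr(sample), count_units(sample))
```

The three counts agree for plain ASCII but differ as soon as a character
leaves that range, so which one is cheap to compute depends on what the
XML library hands back.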
Instead, my rationale is based on the idea that you always have access
to the Unicode code points of the textual content obtained from the XML.
And I am in favor of code points because it allows us to aim for the
extended grapheme cluster algorithm, while also allowing for the "simply
count code points" fallback.
Note that both methods, counting grapheme clusters and counting code
points, would yield the same result for this e-mail, unless I missed a
grapheme cluster that spans more than one code point.
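
A rough Python 3 sketch of why the two counts agree for text like this
and where they part ways. Both function names are invented for the
example. Note the loud assumption: approx_cluster_count is NOT the
extended grapheme cluster algorithm from UAX #29. It merely skips
combining marks and therefore ignores ZWJ sequences, Hangul jamo,
regional indicator pairs and so on. A real implementation would need a
proper segmentation library.

```python
import unicodedata


def count_code_points(text: str) -> int:
    """The 'simply count code points' fallback."""
    return len(text)


def approx_cluster_count(text: str) -> int:
    """Crude stand-in for grapheme cluster counting (NOT UAX #29).

    Counts only code points whose canonical combining class is 0, so a
    base letter followed by combining marks is counted once.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


if __name__ == "__main__":
    plain = "hello"
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
    for sample in (plain, decomposed):
        print(repr(sample), count_code_points(sample), approx_cluster_count(sample))
```

For plain ASCII both functions return the same number. For the
decomposed sequence the fallback reports two while the approximation
reports one.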
[1]: Please ignore this. I have only mentioned it for completeness. If you
are curious, look up "JEP 254: Compact Strings".