[Standards] Proposed XMPP Extension: Character counting in message bodies

Sam Whited sam at samwhited.com
Fri Dec 4 20:33:38 UTC 2020



On Fri, Dec 4, 2020, at 20:23, Florian Schmaus wrote:
> I begin to feel that a lot of your rationale is based on the idea that
> you always (/often?) have access to the raw UTF-8 bytes as they
> appeared on the wire.

Yes, most of it is.

> While this is probably true for languages where the String type's
> native encoding is also UTF-8, it is usually not true for others. For
> example, widely used XML parsers in Java will return Java's String
> type, which is UTF-16 (or ISO-8859-1 [1]) based.

Yes, this is fair. I was thinking you could probably always get the raw
bytes, but it does look like a lot of these parsers *only* do DOM-based
parsing and don't keep the original representation.

> However, given that there is a wide variety here, I am not sure if it
> is worth taking any of that into consideration.

Yes, fair enough.

> Instead, my rationale is based on the idea that you always have
> access to the Unicode code points of the textual content obtained
> from the XML.

I do not have that access without converting from UTF-8 to code points
in the hot path, where it would be inappropriate. It's effectively the
same thing: I don't want to convert from bytes to code points, and you
don't want to convert from code points to bytes. Some languages will
have to do the conversion either way, so it seems worth using the unit
that allows the most flexibility for the least work. For example, IoT
devices written in C that are optimizing for performance can simply
pass along the bytes as received on the wire (possibly after validating
that the range is accurate).
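
To make the cost of each direction concrete, here is a minimal sketch
of what a bytes-first implementation has to do if it is handed code
point offsets. It is Python only for brevity, the function names are
made up for illustration (they are not from the proposal), and it
assumes the input has already been validated as UTF-8:

```python
def count_code_points(raw: bytes) -> int:
    """Count code points in valid UTF-8 without decoding it.

    Each code point contributes exactly one byte that is not a
    continuation byte (continuation bytes look like 0b10xxxxxx).
    """
    return sum(1 for b in raw if (b & 0xC0) != 0x80)


def code_point_to_byte_offset(raw: bytes, cp_offset: int) -> int:
    """Translate a code point offset into a byte offset by scanning."""
    seen = 0
    for i, b in enumerate(raw):
        if (b & 0xC0) != 0x80:  # first byte of a code point
            if seen == cp_offset:
                return i
            seen += 1
    if seen == cp_offset:  # offset just past the last code point
        return len(raw)
    raise IndexError("code point offset out of range")


# Hypothetical message body as it might arrive on the wire.
body = "h\u00e9llo \U0001F642".encode("utf-8")
print(len(body), count_code_points(body))
print(code_point_to_byte_offset(body, 6))
```

Either way somebody pays for a linear scan; the question is only
which side of the conversion does it.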

> And I am in favor of code points because it allows us to aim for the
> extended grapheme cluster algorithm, while also allowing for the
> "simply count code points" fallback.

If you do bytes you could also easily convert to code points and then
to grapheme clusters. It also allows for the simple "count code points"
or "count bytes" fallback.

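A rough sketch of that layering, again in Python and again with
hypothetical names. Note that the cluster count here is only an
approximation that folds combining marks into the preceding code
point; it is NOT the full extended grapheme cluster algorithm from
UAX #29, which needs the Unicode segmentation tables:

```python
import unicodedata


def counts(raw: bytes) -> dict:
    """Return the length of a UTF-8 payload in several units."""
    text = raw.decode("utf-8")  # bytes -> code points
    clusters = 0
    for ch in text:
        # Simplification (assumption): a combining mark joins the
        # previous cluster; everything else starts a new one.
        if unicodedata.combining(ch) == 0 or clusters == 0:
            clusters += 1
    return {
        "bytes": len(raw),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "code_points": len(text),
        "approx_clusters": clusters,
    }


# "e" + COMBINING ACUTE ACCENT + an emoji outside the BMP.
wire = "e\u0301\U0001F642".encode("utf-8")
print(counts(wire))
```

Starting from bytes, every other unit is one decoding step away, and
the fallbacks fall out of the same function.
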
—Sam

