[Standards] Proposed XMPP Extension: Character counting in message bodies

Sam Whited sam at samwhited.com
Fri Dec 4 19:25:33 UTC 2020



On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:
> Often you don't get raw bytes from your XML parser, but an instance of
> your programming language's native String type. But often your
> programming language provides an API to encode that String to UTF-8
> encoded bytes, which *should* match exactly the bytes on the wire.

That would also be expensive to do every time, and I'd be willing to bet
the XML parser *also* gives you the ability to get bytes. Otherwise what
would it do with XML documents that don't use the same encoding as your
language (again, I know we always use UTF-8, but an XML parser won't
know that and may have to deal with other encodings)? Would it always
implicitly convert every single thing? That seems like it would be a
potentially very slow XML parser if it doesn't have a fallback for me to
say "just give me the raw bytes".

> My problem with your proposal is that it uses bytes. I don't get why
> you want to use bytes here.

Naturally. Likewise my problem with your proposal is that it uses code
points and I don't get why you'd want to use them here :)


>  You most certainly will obtain from your XML parser a type that can
>  be converted to a sequence of Unicode code points.

Right, which is probably UTF-8 encoded bytes. If I have to convert them
all to a series of Unicode code points, that is more expensive. If I
have bytes to begin with, I only have to check that the offsets at the
start/end of the range fall on UTF-8 character boundaries (one of the
nice properties of UTF-8 is you can know if you're at the start of a
character without parsing the whole string) instead of having to
convert everything up to the end. Then I can ignore all the bits in the
middle and deal with them later outside of the hot path if/when I
convert it to a string or whatever for display.
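
To make the boundary check above concrete, here is a minimal sketch in
Go. The helper names validByteRange and onBoundary are hypothetical
(they are not part of any proposal or library), and the sketch assumes
the body as a whole is already known to be valid UTF-8; it only checks
that the two offsets do not land in the middle of a multi-byte
character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// onBoundary reports whether offset i in b is the start of a UTF-8
// encoded character, or the very end of b. Hypothetical helper.
func onBoundary(b []byte, i int) bool {
	return i == len(b) || utf8.RuneStart(b[i])
}

// validByteRange reports whether the half-open byte range [start, end)
// lies inside b and begins and ends on character boundaries. Only the
// two bytes at the edges are inspected; nothing in between is decoded.
// Assumes b as a whole is valid UTF-8.
func validByteRange(b []byte, start, end int) bool {
	if start < 0 || end < start || end > len(b) {
		return false
	}
	return onBoundary(b, start) && onBoundary(b, end)
}

func main() {
	body := []byte("héllo") // "é" is two bytes in UTF-8

	fmt.Println(validByteRange(body, 0, 3)) // true: ends after "é"
	fmt.Println(validByteRange(body, 0, 2)) // false: ends inside "é"
}
```

The cost of the check does not depend on the length of the body, which
is the property I care about on the hot path.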


> Hence I think your proposal should use code points instead. And then,
> if I am not mistaken, your proposal matches my proposal for
> opportunistic interoperability as fallback.

You may be right that it's the same as far as fallback goes. I suspect
that more things will have a conversion from UTF-8 to whatever they use
than a conversion from UTF-32 to whatever they use, but to be fair I
have no proof for that.

Out of curiosity, can you provide an example of an XML decoder that can
*only* give you an instance of a UTF-32 string (or whatever the
language/OS uses)? I can give plenty (the Go one for starters) where you
only get bytes out and it's up to you to figure out what to do with
them. I *could* convert those to a UTF-32 slice, but that would be
unnecessary and expensive in a language designed for performance whereas
if it's a language that's doing implicit conversion to its own thing
it's already doing implicit work and probably isn't optimizing for the
kind of fast-path performance I'd like to get.

I think I should simplify my argument to: most things use UTF-8 or at
least can convert from UTF-8, so we should too. Using code points is
effectively using UTF-32, which most things [citation needed] don't use
by default.

—Sam

