[Standards] Proposed XMPP Extension: Character counting in message bodies
sam at samwhited.com
Fri Dec 4 19:25:33 UTC 2020
On Fri, Dec 4, 2020, at 19:00, Florian Schmaus wrote:
> Often you don't get raw bytes from your XML parser, but an instance of
> your programming language's native String type. But often your
> programming language provides an API to encode that String to UTF-8
> encoded bytes, which *should* match exactly the bytes on the wire.
That would also be expensive to do every time and I'd be willing to bet
the XML parser *also* gives you the ability to get bytes. Otherwise what
would it do with XML documents that don't use the same encoding as your
language (again, I know we always use UTF-8, but an XML parser won't
know that and may have to deal with other encodings)? Would it always
implicitly convert every single thing? That seems like it will be a
potentially very slow XML parser if it doesn't have a fallback for me to
say "just give me the raw bytes".
> My problem with your proposal is that it uses bytes. I don't get why
> you want to use bytes here.
Naturally. Likewise my problem with your proposal is that it uses code
points and I don't get why you'd want to use them here :)
> You most certainly will obtain from your XML parser a type that can
> be converted to a sequence of Unicode code points.
Right, which is probably UTF-8 encoded bytes. If I have to convert them
all to a series of Unicode code points, that is more expensive. If I
have bytes to begin with, I only have to check that the values at the
start/end of the range are valid UTF-8 (one of the nice properties of
UTF-8 is you can know if you're at the start of a character without
parsing the whole string) instead of having to convert everything up to
the end. Then I can ignore all the bits in the middle and deal with them
later outside of the hot path if/when I convert it to a string or
whatever.
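
A minimal sketch of that boundary check in Go, assuming the body is
already well-formed UTF-8 and the range is given as byte offsets. The
name validRange is hypothetical and only for illustration; it is not
part of either proposal. Only the two edge bytes are inspected:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validRange reports whether the byte offsets [start, end) into body both
// fall on UTF-8 character boundaries. It assumes body as a whole is already
// well-formed UTF-8 and never looks at the bytes in the middle of the range.
func validRange(body []byte, start, end int) bool {
	if start < 0 || end > len(body) || start > end {
		return false
	}
	// An offset is a boundary if it is the end of the input or if the byte
	// there is not a UTF-8 continuation byte.
	atBoundary := func(i int) bool {
		return i == len(body) || utf8.RuneStart(body[i])
	}
	return atBoundary(start) && atBoundary(end)
}

func main() {
	body := []byte("héllo wörld") // "é" and "ö" take two bytes each
	fmt.Println(validRange(body, 0, 3)) // true: offset 3 starts "l"
	fmt.Println(validRange(body, 0, 2)) // false: offset 2 is inside "é"
}
```

The cost is constant no matter how long the body is, which is the
property I care about in the hot path.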
> Hence I think your proposal should use code points instead. And then,
> if I am not mistaken, your proposal matches my proposal for
> opportunistic interoperability as fallback.
You may be right that it's the same as far as fallback goes. I suspect
that more things will have a conversion from UTF-8 to whatever they use
than a conversion from UTF-32, but to be fair I have no proof for that.
Out of curiosity, can you provide an example of an XML decoder that can
*only* give you an instance of a UTF-32 string (or whatever the
language/OS uses)? I can give plenty (the Go one for starters) where you
only get bytes out and it's up to you to figure out what to do with
them. I *could* convert those to a UTF-32 slice, but that would be
unnecessary and expensive in a language designed for performance whereas
if it's a language that's doing implicit conversion to its own thing
it's already doing implicit work and probably isn't optimizing for the
kind of fast-path performance I'd like to get.
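
To illustrate the Go case, here is a rough sketch using the standard
encoding/xml decoder, whose xml.CharData token type is a byte slice.
The helper name bodyBytes and the sample stanza are made up for the
example, and it assumes the body contains only character data:

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// bodyBytes returns the character data of the first <body/> element as
// UTF-8 bytes (after the decoder has expanded any entity references).
// It returns nil if the input ends before a complete <body/> is seen.
func bodyBytes(stanza string) ([]byte, error) {
	d := xml.NewDecoder(strings.NewReader(stanza))
	inBody := false
	var out []byte
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return nil, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == "body" {
				inBody = true
			}
		case xml.EndElement:
			if t.Name.Local == "body" {
				return out, nil
			}
		case xml.CharData:
			// CharData is a []byte that is only valid until the next
			// call to Token, so copy it by appending.
			if inBody {
				out = append(out, t...)
			}
		}
	}
}

func main() {
	b, err := bodyBytes("<message><body>héllo</body></message>")
	if err != nil {
		panic(err)
	}
	fmt.Println(len(b)) // 6: "é" is two bytes in UTF-8
}
```

No string or code point conversion happens anywhere unless the caller
asks for it.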
I think I should simplify my argument to: most things use UTF-8 or at
least can convert from UTF-8 so we should too. Using code points is
effectively using UTF-32, which most things don't use by default.
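
As a small illustration of the difference, assuming UTF-8 input held
in a Go string: the byte length is already known, while a code point
index has to be turned into a byte offset by walking the string from
the start. The helper byteOffset is hypothetical:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteOffset converts a code point index into a byte offset by decoding
// every character before it. It returns -1 if the index is out of range.
func byteOffset(s string, codePoint int) int {
	n := 0
	for i := range s { // i is the byte offset of each decoded character
		if n == codePoint {
			return i
		}
		n++
	}
	if n == codePoint {
		return len(s) // index one past the last character
	}
	return -1
}

func main() {
	s := "héllo wörld"
	fmt.Println(len(s))                    // 13 bytes
	fmt.Println(utf8.RuneCountInString(s)) // 11 code points
	fmt.Println(byteOffset(s, 8))          // byte offset of the 9th code point
}
```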