[Standards] Proposed XMPP Extension: Character counting in message bodies
teddsterr at outlook.com
Wed Dec 9 19:09:19 UTC 2020
>> The decoding _should_ be done upfront - that's how you get a valid XML document.
> I don't think this is true. XML is defined as UTF-8 (in this case),
> which is a collection of bytes. They don't have to be separated out and
> transformed into some higher representation of code points. Just because
> Python et al. convert things into UTF-32 strings first doesn't mean
> everything has to.
> Regardless of what language you're using it's trivial to deal with this
> as a UTF-8 byte stream, it is not always trivial to handle this as a UTF-
> 32 integer stream as the example shows.
XML is defined as a sequence of characters; it doesn't specify how those characters must be encoded (though it does require support for both UTF-8 and UTF-16). UTF-7/8/16/32 are encoding schemes, not character representations - people do make the mistake of conflating the two things, but that doesn't mean they are the same.
Unicode doesn't specify the size of characters - they don't have a specific bit-width, they are as large as required; the encoding scheme is then a method to transform characters into a sequence of bytes. It shouldn't matter which encoding scheme is used - UTF-8, UTF-16, ISO-8859-9, ISO-2022-JP, Shift_JIS, EBCDIC are all possibilities - because you're supposed to decode the data into characters before doing anything with it.
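To illustrate the distinction (a sketch in Python, purely illustrative - the specific string is mine, not from the thread): the same sequence of characters has one length in code points, but a different byte length under each encoding scheme.

```python
# Four characters: c, a, f, é - the character count never changes.
s = "caf\u00e9"

assert len(s) == 4                       # 4 code points, regardless of encoding
assert len(s.encode("utf-8")) == 5       # é takes two bytes in UTF-8
assert len(s.encode("utf-16-le")) == 8   # every BMP character takes two bytes
assert len(s.encode("iso-8859-9")) == 4  # é happens to be a single byte here
```

Counting "characters" by counting bytes only works if you first fix an encoding and know how it maps each character.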
The fact that you're able to take advantage of foreknowledge that your data is encoded as UTF-8 is purely because XMPP happens to define it that way, not because XML is defined in terms of any specific encoding scheme. Basing your entire implementation around the expectation of UTF-8 allows you some convenient short-cuts, but much of that only works because XML markup uses ASCII-compatible characters, which conveniently have an equivalent single-byte representation when encoded as UTF-8; with almost any other encoding it simply wouldn't work without some form of decoding first. If you insist on not decoding, and then run into difficulties because you're purposely avoiding handling characters while simultaneously using XML - which is defined as a sequence of characters - then an appropriate response is "what did you expect?"
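The ASCII-compatibility point can be shown concretely (again a hypothetical sketch of mine): scanning raw bytes for markup only works when the encoding maps ASCII markup characters to single ASCII bytes, as UTF-8 does; under UTF-16 the same search fails without decoding first.

```python
doc = "<body>hi</body>"

# UTF-8 is ASCII-compatible, so the markup bytes are literally there.
assert b"<body>" in doc.encode("utf-8")

# UTF-16-LE interleaves a 0x00 byte after each ASCII character,
# so a naive byte-level search for the markup finds nothing.
assert b"<body>" not in doc.encode("utf-16-le")
```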
It's not trivial to handle everything as UTF-8 in implementations where the application receives already decoded strings (a sequence of characters, not bytes) from the XML parser. The most likely way of dealing with that is to re-encode the already decoded data back into UTF-8 just to deal with the offsets, which is precisely the kind of inefficient processing you're suggesting should be avoided. And considering the whole purpose of references is marking sequences of characters, those characters are going to be decoded at some point; you're trying to avoid decoding early, while still validating offsets, only for the decoding to be done later anyway.
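A sketch of that inefficiency (my example, not from the thread): an application handed a decoded string by its parser gets code-point offsets for free, but has to re-encode a prefix just to produce the equivalent UTF-8 byte offset.

```python
# A decoded str, as a typical XML parser would hand to the application.
body = "d\u00eda"  # "día": d, í, a - three characters

# Code-point offset of "a": computed directly on the decoded string.
cp_offset = body.index("a")
assert cp_offset == 2

# UTF-8 byte offset of "a": requires re-encoding the prefix first,
# because í occupies two bytes in UTF-8.
byte_offset = len(body[:cp_offset].encode("utf-8"))
assert byte_offset == 3
```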
Regardless, your argument is still "bytes is more convenient for me, so everyone else should do what's best for me." I don't think that's a good argument.