[Standards] XEP-0372: References
sam at samwhited.com
Mon Mar 12 16:27:02 UTC 2018
On Mon, Mar 12, 2018, at 11:16, Jonas Wielicki wrote:
> libxml2 uses UTF-8 internally, but aliases the xmlChar type, which is intended
> to be used with the xmlUTF8 family of functions providing access to the
> Codepoints. 
As far as I can tell, you just get bytes externally and the XML string type is only for validation internally. I may be wrong, though: it's been a while since I've used libxml2, and I haven't used it all that extensively outside of other language wrappers.
> libexpat uses UTF-8 or UTF-16 (compile time switch, XML_UNICODE) and leaves
> the developer alone with that, as far as I can tell. 
> This one’s funny. It doesn’t specify which encoding you get when you unmarshal
> into a []byte. string is clear, and UTF-8 internally, but string uses the
> concept of runes which are Codepoints. Unmarshaling has the option of
> using either.
Go strings aren't runes, they're byte strings. String literals are valid UTF-8 as long as they contain no byte-level escapes, but a string value can hold arbitrary bytes. Runes are a distinct concept.
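
To make the byte/rune distinction concrete, here is a minimal sketch using only the standard library. The helper name `describe` is made up for illustration and does not come from this thread or from any XMPP library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe (hypothetical helper) reports the length of s in bytes, its length
// in code points (runes), and whether its bytes form valid UTF-8.
func describe(s string) (byteLen, runeLen int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	// "é" takes two bytes in UTF-8, so the two lengths differ.
	fmt.Println(describe("héllo")) // 6 5 true

	// A string is just a sequence of bytes; nothing forces it to be UTF-8.
	fmt.Println(describe(string([]byte{0xff}))) // 1 1 false
}
```

Note that `len` and indexing work on bytes; runes only appear when the string is explicitly decoded, for example with `range` or the `unicode/utf8` functions.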
> I have no knowledge about Java or Rust though.
Rust will more or less do either happily. You can get bytes or code points.
> Acknowledging that XMPP enforces UTF-8 (so it would be a reasonable choice to
> use UTF-8 for everything in an implementation which chooses to go down that
> route), I’m not going to die on this hill.
> Still, using bytes of UTF-8 in a
> layer which clearly operates on Character Data defined in terms of Scalar Values
I am disputing this; it does not "clearly" operate on scalar values in any way.
> is breaking abstractions for no gain (on the contrary, we’ll have to
> add wording on how to handle mid-UTF-8-word ranges;
Now we have to add the same wording about handling mid-scalar-value ranges. There's no difference, except that I have to decode UTF-8 when I might otherwise not have to do so (depending on the system I'm using, my requirements, etc.).
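
As a rough illustration of what "mid-UTF-8-word" handling could look like if ranges were byte offsets, the sketch below checks whether an offset lands on the start of a UTF-8 sequence. The function name `onRuneBoundary` is an assumption for this example only; the XEP does not define such a check.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// onRuneBoundary (hypothetical helper) reports whether byte offset i in s
// falls at the start of a UTF-8 sequence or at the very end of s, rather
// than in the middle of a multi-byte sequence.
func onRuneBoundary(s string, i int) bool {
	if i < 0 || i > len(s) {
		return false
	}
	return i == len(s) || utf8.RuneStart(s[i])
}

func main() {
	s := "aé!" // bytes: 'a' at 0, 'é' at 1-2, '!' at 3
	fmt.Println(onRuneBoundary(s, 1)) // true
	fmt.Println(onRuneBoundary(s, 2)) // false: second byte of 'é'

	// Code point offsets have an analogous problem: "e" followed by a
	// combining acute accent is two code points but one visible character.
	fmt.Println(utf8.RuneCountInString("e\u0301")) // 2
}
```

The byte-offset check needs no decoding, only a look at one byte. An offset counted in code points can still split a base character from its combining mark, so some boundary wording is needed in either scheme.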
> in the case of
> implementations which already get code points one needs an additional encode-
> to-utf8-step + validation; implementations which already work on UTF-8 need
> the utility functions to operate on code points anyways).
This will always be a problem; in the case of a system that operates on UTF-8 and hands me bytes, I now have to re-decode the UTF-8 even though I know it's already valid.
This isn't an argument for one or the other.
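
To illustrate the re-decoding cost mentioned above: if offsets are counted in code points but the data arrives as UTF-8 bytes, mapping an offset to a position means walking the string from the start. The sketch below assumes a made-up helper named `byteOffset`; it is not taken from the XEP or from any existing implementation.

```go
package main

import "fmt"

// byteOffset (hypothetical helper) converts a code point offset n into a byte
// offset in s. It has to decode s from the beginning, because UTF-8 sequences
// vary in length. It returns -1 if n is negative or s has fewer than n code
// points.
func byteOffset(s string, n int) int {
	if n < 0 {
		return -1
	}
	count := 0
	// Ranging over a string decodes one rune per iteration; i is the byte
	// index at which that rune starts.
	for i := range s {
		if count == n {
			return i
		}
		count++
	}
	if count == n {
		return len(s) // offset just past the last code point
	}
	return -1
}

func main() {
	s := "aé!"
	fmt.Println(byteOffset(s, 2)) // 3: '!' starts after the two bytes of 'é'
	fmt.Println(byteOffset(s, 3)) // 4: end of string
	fmt.Println(byteOffset(s, 4)) // -1: out of range
}
```

The walk is linear in the length of the text. A system that already holds code points has the mirror-image cost when offsets are counted in bytes, since it must re-encode to count them.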