[Standards] XEP-0372: References

Sam Whited sam at samwhited.com
Mon Mar 12 16:27:02 UTC 2018

On Mon, Mar 12, 2018, at 11:16, Jonas Wielicki wrote:
> libxml2 uses UTF-8 internally, but aliases the xmlChar type, which is intended 
> to be used with the xmlUTF8 family of functions providing access to the 
> Codepoints. [1]

As far as I can tell, you just get bytes externally and that XML string type is only for validation internally, but I may be wrong: it's been a while since I've used libxml2, and I haven't used it all that extensively outside of other languages' wrappers.

> libexpat uses UTF-8 or UTF-16 (compile time switch, XML_UNICODE) and leaves  
> the developer alone with that, as far as I can tell. [2]

Right, bytes.

> This one’s funny. It doesn’t specify which encoding you get when you unmarshal 
> into a []byte. string is clear, and UTF-8 internally, but string uses the 
> concept of runes which are Codepoints [3]. Unmarshaling has the option of 
> using either.

Go strings aren't runes; they're byte strings. Only string literals (absent byte-level escapes) are guaranteed to be valid UTF-8. Runes are a distinct concept.

> I have no knowledge about Java or Rust though.

Rust will more or less do either happily. You can get bytes or code points.

> Acknowledging that XMPP enforces UTF-8 (so it would be a reasonable choice to 
> use UTF-8 for everything in an implementation which chooses to go down that 
> route), I’m not going to die on this hill.

> Still, using bytes of UTF-8 in a 
> layer which clearly operates on Character Data defined in terms of Scalar 
> Values

I am disputing this; it does not "clearly" operate on scalar values in any way.

> is breaking abstractions for no gain (on the contrary, we’ll have to 
> add wording on how to handle mid-UTF-8-word ranges;

Now we have to add the same wording about handling mid-scalar-value ranges; there's no difference, except that I have to decode UTF-8 when I might otherwise not have to (depending on the system I'm using, my requirements, etc.).

> in the case of 
> implementations which already get code points one needs an additional encode-
> to-utf8-step + validation; implementations which already work on UTF-8 need 
> the utility functions to operate on code points anyways).

This will always be a problem; in the case of a system that operates on UTF-8 and hands me bytes, I now have to re-decode the UTF-8 even though I know it's already valid. This isn't an argument for one or the other.
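
A rough Go sketch of what each side ends up paying, assuming the input is already valid UTF-8 (onBoundary and byteOffset are hypothetical names for this example, not anything from the XEP or an existing library): with byte offsets you need a boundary check, and with code point offsets you need a decoding pass to find where to slice.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// onBoundary reports whether byte offset i in s falls at the start of a
// UTF-8 sequence, or exactly at the end of s. It assumes s is valid UTF-8.
func onBoundary(s string, i int) bool {
	if i < 0 || i > len(s) {
		return false
	}
	return i == len(s) || utf8.RuneStart(s[i])
}

// byteOffset converts a code point offset n into a byte offset by decoding
// s from the start. It returns -1 if n is out of range.
func byteOffset(s string, n int) int {
	if n < 0 {
		return -1
	}
	count := 0
	for i := range s { // range decodes UTF-8 one rune at a time
		if count == n {
			return i
		}
		count++
	}
	if count == n {
		return len(s) // offset one past the last code point
	}
	return -1
}

func main() {
	s := "héllo" // sample input; 'é' occupies bytes 1 and 2

	fmt.Println(onBoundary(s, 1), onBoundary(s, 2))
	fmt.Println(byteOffset(s, 2), byteOffset(s, 5))
}
```

Neither function is expensive; the point is only that whichever unit the spec picks, some implementations need one of them.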

