[Standards] XEP-0372: References
jonas at wielicki.name
Mon Mar 12 16:16:49 UTC 2018
On Monday, 12 March 2018 16:31:57 CET Sam Whited wrote:
> On Mon, Mar 12, 2018, at 10:17, Jonas Wielicki wrote:
> > This is true, XML restricts to Scalar Values. Thanks, I didn’t know that
> > term.
> Your entire reply seems to be hinged on the fact that all XML libraries
> return scalar values
No, I’m arguing in terms of the XML standard, not implementations. But let’s
go down that route.
> , but this isn't true to my knowledge. Most XML
> libraries are going to do whatever the language they're written in does. If
> you're using Python 3, you're going to get scalar values
Yep. (There are numerous XML library bindings for Python, all of which, to my
knowledge, return str (a sequence of code points); I’m not quoting each one.)
> (assuming it's
> returning a string which in Python 3 is scalar values… more or less), if
> you use libxml2, or expat, or the
libxml2 uses UTF-8 internally, but aliases the xmlChar type, which is intended
to be used with the xmlUTF8 family of functions providing access to the
libexpat uses UTF-8 or UTF-16 (compile time switch, XML_UNICODE) and leaves
the developer alone with that, as far as I can tell. 
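
To make the difference between these units concrete, here is a minimal Python
sketch, independent of any XML library. The sample string and the helper name
unit_counts are made up for illustration; it only shows that the same text has
a different length depending on whether you count code points, UTF-8 bytes, or
UTF-16 code units.

```python
# Hypothetical sample text: one ASCII character, then characters that need
# 2, 3 and 4 bytes in UTF-8 (the last one is outside the BMP).
text = "aä€😀"


def unit_counts(s):
    """Return the length of s in (code points, UTF-8 bytes, UTF-16 code units)."""
    code_points = len(s)                           # Python 3 str is indexed by code point
    utf8_bytes = len(s.encode("utf-8"))            # what a UTF-8 based library would count
    utf16_units = len(s.encode("utf-16-le")) // 2  # what a UTF-16 based library would count
    return code_points, utf8_bytes, utf16_units


print(unit_counts(text))  # (4, 10, 5)
```

An offset of 3 therefore points at three different places in this text
depending on which unit the implementation happens to count in.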
> Go encoding/xml library, etc. you're probably going to get bytes.
This one’s funny. It doesn’t specify which encoding you get when you unmarshal
into a []byte. string is clear, and UTF-8 internally, but string uses the
concept of runes, which are code points. Unmarshaling has the option of
doomed anyways, since it can only do UTF-16, which will be a PITA no matter
which route we go.
Acknowledging that XMPP enforces UTF-8 (so it would be a reasonable choice to
use UTF-8 for everything in an implementation which chooses to go down that
route), I’m not going to die on this hill. Still, using bytes of UTF-8 in a
layer which clearly operates on Character Data defined in terms of Scalar
Values is breaking abstractions for no gain (on the contrary, we’ll have to
add wording on how to handle mid-UTF-8-word ranges; in the case of
implementations which already get code points one needs an additional encode-
to-utf8-step + validation; implementations which already work on UTF-8 need
the utility functions to operate on code points anyways).
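
As a rough sketch of the extra work described above, the following Python
helpers convert between a code point range and a UTF-8 byte range. The function
names and the half-open [begin, end) convention are assumptions made for this
example, not something taken from the XEP. Note that only the byte-based
direction needs a check for offsets that land in the middle of a multi-byte
sequence.

```python
def cp_range_to_utf8(s, begin, end):
    """Map the code point range [begin, end) of s to a UTF-8 byte range."""
    if not 0 <= begin <= end <= len(s):
        raise ValueError("range out of bounds")
    # The additional encode-to-UTF-8 step for code point based implementations.
    byte_begin = len(s[:begin].encode("utf-8"))
    byte_end = byte_begin + len(s[begin:end].encode("utf-8"))
    return byte_begin, byte_end


def utf8_range_to_cp(s, byte_begin, byte_end):
    """Map a UTF-8 byte range [byte_begin, byte_end) of s to code points."""
    data = s.encode("utf-8")
    if not 0 <= byte_begin <= byte_end <= len(data):
        raise ValueError("range out of bounds")
    for offset in (byte_begin, byte_end):
        # UTF-8 continuation bytes have the bit pattern 10xxxxxx; an offset
        # pointing at one of them splits a character in half.
        if offset < len(data) and data[offset] & 0xC0 == 0x80:
            raise ValueError("offset falls inside a multi-byte sequence")
    begin = len(data[:byte_begin].decode("utf-8"))
    end = begin + len(data[byte_begin:byte_end].decode("utf-8"))
    return begin, end


sample = "é😀x"  # hypothetical text: 2-byte, 4-byte and 1-byte characters
print(cp_range_to_utf8(sample, 1, 2))  # (2, 6)
print(utf8_range_to_cp(sample, 2, 6))  # (1, 2)
```

Calling utf8_range_to_cp(sample, 3, 6) raises ValueError, which is exactly the
mid-sequence case a byte-based specification would need wording for.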
If this isn’t up-to-date, sorry. I wasn’t able to find anything more
recent on the expat website.