[Standards] XEP-0372: References

Jonas Wielicki jonas at wielicki.name
Mon Mar 12 16:16:49 UTC 2018

On Monday, 12 March 2018 16:31:57 CET Sam Whited wrote:
> On Mon, Mar 12, 2018, at 10:17, Jonas Wielicki wrote:
> > This is true, XML restricts to Scalar Values. Thanks, I didn’t know that
> > term.
> Your entire reply seems to be hinged on the fact that all XML libraries
> return scalar values

No, I’m arguing in terms of the XML standard, not implementations. But let’s 
go down that route.

> , but this isn't true to my knowledge. Most XML
> libraries are going to do whatever the language they're written in does. If
> you're using Python 3, you're going to get scalar values

Yep. (There are numerous XML library bindings for Python, all of which, to my 
knowledge, return str (a sequence of code points); I’m not going to cite each 
one individually.)
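
A minimal sketch of the difference, using a sample string made up for 
illustration (it is not from the thread): in Python 3, len() and indexing on a 
str count code points, while the UTF-8 encoding of the same text has a 
different length.

```python
# Sample text chosen for illustration: 'a' (1 byte in UTF-8),
# U+00E4 (2 bytes) and U+1F600 (4 bytes).
text = "a\u00e4\U0001F600"

# str is a sequence of code points, so len() counts code points.
code_points = len(text)

# The same text measured in UTF-8 bytes.
utf8_bytes = len(text.encode("utf-8"))

print(code_points, utf8_bytes)  # prints "3 7"
```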

> (assuming it's
> returning a string which in Python 3 is scalar values… more or less), if
> you use libxml2, or expat, or the 

libxml2 uses UTF-8 internally, but exposes it through its own xmlChar type, 
which is intended to be used with the xmlUTF8 family of functions that provide 
access to the code points. [1]

libexpat uses UTF-8 or UTF-16 (compile-time switch, XML_UNICODE) and leaves 
the developer alone with that, as far as I can tell. [2]

> Go encoding/xml library, etc. you're probably going to get bytes.

This one’s funny. It doesn’t specify which encoding you get when you unmarshal 
into a []byte. string is clear: it is UTF-8 internally, but it comes with the 
concept of runes, which are code points [3]. Unmarshaling has the option of 
using either.

I have no knowledge about Java or Rust, though. AFAIK with JavaScript one is 
doomed anyway, since it can only do UTF-16, which will be a PITA no matter 
which route we go.
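
To make the UTF-16 pain concrete, here is a rough sketch of the conversion 
such an implementation would need if offsets were counted in code points. The 
helper name utf16_offset and the sample string are hypothetical, chosen only 
for illustration; nothing here is taken from the XEP.

```python
def utf16_offset(text: str, index: int) -> int:
    """Convert a code point index into text to a UTF-16 code unit offset."""
    # Code points above U+FFFF are encoded as a surrogate pair,
    # which takes two UTF-16 code units instead of one.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in text[:index])


sample = "a\U0001F600b"  # 'a', one code point outside the BMP, 'b'

# 'b' is at code point index 2, but at UTF-16 code unit offset 3.
print(utf16_offset(sample, 2))
```

The same kind of translation would be needed for byte offsets as well, so a 
UTF-16 implementation has to convert either way.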

Acknowledging that XMPP enforces UTF-8 (so it would be a reasonable choice to 
use UTF-8 for everything in an implementation which chooses to go down that 
route), I’m not going to die on this hill. Still, using UTF-8 byte offsets in 
a layer which clearly operates on Character Data defined in terms of Scalar 
Values breaks abstractions for no gain. On the contrary, it adds work: we’ll 
have to add wording on how to handle ranges which begin or end in the middle 
of a multi-byte UTF-8 sequence; implementations which already get code points 
need an additional encode-to-UTF-8 step plus validation; and implementations 
which already work on UTF-8 need the utility functions to operate on code 
points anyway.
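
A rough sketch of the extra work byte offsets would bring, assuming the input 
is already valid UTF-8. Both helper names are hypothetical and only meant as 
illustration; neither is part of any XEP or library.

```python
from typing import Tuple


def is_code_point_boundary(data: bytes, offset: int) -> bool:
    """Return True if offset does not fall inside a multi-byte UTF-8 sequence.

    Assumes data is valid UTF-8; this only inspects a single byte.
    """
    if offset < 0 or offset > len(data):
        return False
    if offset == len(data):
        return True
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


def code_point_range_to_utf8(text: str, begin: int, end: int) -> Tuple[int, int]:
    """Translate a half-open code point range into a UTF-8 byte range.

    This is the additional encode step an implementation working on
    code points would need if the wire format used byte offsets.
    """
    byte_begin = len(text[:begin].encode("utf-8"))
    byte_end = byte_begin + len(text[begin:end].encode("utf-8"))
    return byte_begin, byte_end


sample = "a\U0001F600b"          # 'a', a 4-byte character, 'b'
encoded = sample.encode("utf-8")  # 6 bytes in total

print(is_code_point_boundary(encoded, 1))      # start of the 4-byte sequence
print(is_code_point_boundary(encoded, 3))      # inside the 4-byte sequence
print(code_point_range_to_utf8(sample, 1, 2))  # byte range of the emoji
```

With code point offsets, neither the boundary check nor the translation would 
be needed on the receiving side of a code-point-based implementation.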

kind regards,

   [1]: http://www.xmlsoft.org/html/libxml-xmlstring.html
   [2]: https://www.xml.com/pub/1999/09/expat/index.html

        If this isn’t up-to-date, sorry. I wasn’t able to find anything more 
        recent on the expat website.
   [3]: https://blog.golang.org/strings