[Standards] <[CDATA[ in XMPP

Robin Redeker elmex at x-paste.de
Tue Jul 31 12:02:32 UTC 2007


On Mon, Jul 30, 2007 at 06:49:45PM -0700, Rachel Blackman wrote:
> 
> On Jul 30, 2007, at 6:28 PM, Justin Karneges wrote:
> 
> >On Monday 30 July 2007 5:37 pm, Rachel Blackman wrote:
> >>If I send '<item> stpeter at jabber.org</item>' to the server in a
> >>roster add/remove request, it will almost certainly eat that
> >>whitespace at the beginning.
> >
> >It most certainly must not do that.
> 
> I cannot remember which server it was that I encountered, but one  
> consumed all 'extraneous' whitespace (even inside of stanzas) to deal  
> with the common practice of sending whitespace to keepalive a  
> connection.

As far as I know, keepalive whitespace is only sent outside of stanzas. A
server doesn't need to ignore all whitespace; it only has to ignore
whitespace directly inside the top-level <stream> element.
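
To illustrate the point about whitespace, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree.XMLPullParser as a stand-in for a server's stream parser. The function name read_stanzas and the un-namespaced <stream> root are hypothetical simplifications (a real stream opens with a namespaced stream element). Whitespace between stanzas is skipped simply because it never ends up inside a stanza, while whitespace inside <body/> survives.

```python
import xml.etree.ElementTree as ET

def read_stanzas(chunks):
    """Feed raw chunks to an incremental parser and collect complete stanzas.

    A stanza here is any direct child of the (never closed) root element.
    """
    parser = ET.XMLPullParser(events=("start", "end"))
    depth = 0
    stanzas = []
    for chunk in chunks:
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if event == "start":
                depth += 1
            else:
                depth -= 1
                if depth == 1:  # a direct child of the root just closed
                    stanzas.append(elem)
    return stanzas

chunks = [
    "<stream>",
    " ",  # keepalive whitespace between stanzas
    "<message><body>  two leading spaces</body></message>",
    "\n",  # more keepalive whitespace
    "<message><body>second</body></message>",
]
bodies = [stanza.findtext("body") for stanza in read_stanzas(chunks)]
print(bodies)
```

The keepalive whitespace lands in the text or tail of the root's children and is never looked at; nothing has to be stripped from the stanza content itself.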

> I do not dispute that the behavior is likely erroneous, but it was in  
> place in at least one server; I witnessed it when it would eat any  
> leading spaces in a <body/> element.

Yes, sounds like a big bug.

> >>But now let's say I do '<item><![CDATA[ stpeter at jabber.org]]></item>'
> >>-- is that processed as ' stpeter at jabber.org' (with the raw space),
> >>thus requiring a CDATA block any time you want to refer to that JID?
> >>Or is the burden on the server to convert it to \20stpeter at jabber.org
> >>for the sake of compatibility, or what?
> >
> >This:
> >  <item> stpeter at jabber.org</item>
> >
> >and this:
> >  <item><![CDATA[ stpeter at jabber.org]]></item>
> >
> >have identical meaning.  JID escaping does not come into play here  
> >whatsoever.
> 
> Fine.  If we must be pedantic in this discussion rather than actually  
> addressing the issue, how about using a different example.
> 
> What about using CDATA versus JID escaping for a hypothetical JID  
> which should display to the end-user as 'john&mary at family.org'?   
> Should that be a JID-escaped string when you send it to the server?   

You can send it two ways. First, if you don't want the server to close
the connection with an error, you need to perform JID escaping as
described in XEP-0106.

Second, if you want the server to barf back at you with an error, you
only XML-escape the JID in the XML you are generating:

   <message to="john&amp;mary at family.org"></message>

What the server does then depends on the ongoing changes to the rules
for the node part of a JID: it either complains loudly because you used
& in a node part, or it doesn't.

The same goes for the <item> element:

   <item>john&amp;mary at family.org</item>
   <item><![CDATA[john&mary at family.org]]></item>

Both are perfectly fine XML; the server just might not like the JID.

To satisfy the server's opinion about which characters are allowed in a
JID, you need either XEP-0106 here or a new stringprep profile for the
node part.
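
For reference, the escaping step of XEP-0106 can be sketched roughly as follows. This is a minimal Python sketch, not a complete implementation: the function name escape_node is hypothetical, the mapping is the ten-entry table from XEP-0106, and it covers only escaping of the node part (no unescaping, no stringprep, no check for leading or trailing spaces).

```python
# Escape sequences from the XEP-0106 table (disallowed node characters).
_ESCAPES = {
    " ": "\\20", '"': "\\22", "&": "\\26", "'": "\\27", "/": "\\2f",
    ":": "\\3a", "<": "\\3c", ">": "\\3e", "@": "\\40",
}
# Hex codes that form a valid escape sequence when they follow a backslash.
_CODES = ("20", "22", "26", "27", "2f", "3a", "3c", "3e", "40", "5c")

def escape_node(node):
    """Escape the node part of a JID (hypothetical helper, escaping only)."""
    out = []
    for i, ch in enumerate(node):
        if ch == "\\" and node[i + 1:i + 3] in _CODES:
            # A backslash is escaped only when it would otherwise be
            # read as the start of an escape sequence.
            out.append("\\5c")
        else:
            out.append(_ESCAPES.get(ch, ch))
    return "".join(out)

print(escape_node("john&mary"))
```

Note that this layer works on the already-parsed string; it is independent of whether the XML carrying it used entity escaping or CDATA.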

> Should it be CDATA-enclosed?

It doesn't matter. The only choice is whether we _XML_-escape characters
or use a <![CDATA[...]]> section when sending something.

> It can't be sent raw, either way, due to the ampersand.

In case of CDATA we can send it raw.

> My thought is that if we allow CDATA, it should only be allowed in  
> message elements, because otherwise you end up with questions like  
> this.  Of course, that's an un-XML-ish limitation of CDATA, so...

I don't see any problems or open questions here. CDATA has clearly
defined syntax and semantics (see http://www.w3.org/TR/REC-xml/#syntax),
and it can appear _ANYWHERE_ in XML where normal character data can
appear.

When thinking about this, please bear in mind that by the time the server
knows it has to interpret a string as a JID, the string has already been
XML-unescaped. In pseudo-code this means:

   dom_node = parse ("<item><![CDATA[john&mary at family.org]]></item>");
   dom_node->get_characters => "john&mary at family.org"

And:

   dom_node = parse ("<item>john&amp;mary at family.org</item>");
   dom_node->get_characters => "john&mary at family.org"

Now the JID escaping layer comes in, if you don't want the server to
throw you out:

   dom_node = parse ("<item>john\26mary at family.org</item>");
   dom_node->get_characters => "john\26mary at family.org"

And the same with CDATA:

   dom_node = parse ("<item><![CDATA[john\26mary at family.org]]></item>");
   dom_node->get_characters => "john\26mary at family.org"
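
The pseudo-code cases above can be checked against a real parser. Below is a minimal sketch in Python, assuming xml.etree.ElementTree as the parser; item_text is a hypothetical helper name. The strings follow the examples above, with the ampersand written as &amp; in the non-CDATA case, since a bare & outside CDATA is not well-formed XML.

```python
import xml.etree.ElementTree as ET

def item_text(xml):
    """Parse one <item/> element and return its character data."""
    return ET.fromstring(xml).text

# XML layer: CDATA and entity escaping yield the same string.
via_cdata = item_text("<item><![CDATA[john&mary at family.org]]></item>")
via_entity = item_text("<item>john&amp;mary at family.org</item>")

# JID escaping layer: the \26 sequence passes through the parser untouched.
jid_plain = item_text("<item>john\\26mary at family.org</item>")
jid_cdata = item_text("<item><![CDATA[john\\26mary at family.org]]></item>")

# A bare ampersand outside CDATA is rejected by the parser itself.
try:
    item_text("<item>john&mary at family.org</item>")
    raw_accepted = True
except ET.ParseError:
    raw_accepted = False

print(via_cdata == via_entity, jid_plain == jid_cdata, raw_accepted)
```

In each pair the application sees identical character data, so whatever interprets the string as a JID cannot tell which XML form was used.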

It makes no difference to JID escaping or ANY other character data that
we have been and are transmitting. CDATA just makes it easier to write
code in servers and clients that want to do things like this:

   stream->send (
      "<message><body><![CDATA["
      + chatwindow->get_message_from_user
      + "]]></body></message>"
   );

That code without CDATA would look like:

   stream->send (
      "<message><body>"
      + xml_escape (chatwindow->get_message_from_user)
      + "</body></message>"
   );

It saves us the CPU time otherwise spent escaping the message. CDATA
doesn't give us much, but there are certainly cases where it is more
convenient than an escaping function.
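
A runnable version of the two variants might look like the sketch below, in Python. The helper names body_with_cdata, body_with_escape and parsed_body are hypothetical, and xml.sax.saxutils.escape stands in for xml_escape. One assumption beyond the pseudo-code: plain concatenation into CDATA breaks if the user's text contains "]]>", so the sketch splits that sequence across two CDATA sections.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def body_with_cdata(text):
    """Wrap text in CDATA, splitting any ']]>' so it cannot end the section."""
    safe = text.replace("]]>", "]]]]><![CDATA[>")
    return "<message><body><![CDATA[" + safe + "]]></body></message>"

def body_with_escape(text):
    """Entity-escape &, < and > instead of using CDATA."""
    return "<message><body>" + escape(text) + "</body></message>"

def parsed_body(stanza):
    """Return the character data a receiver would see in <body/>."""
    return ET.fromstring(stanza).findtext("body")

# Contains '&', '<' and the sequence ']]>'.
user_input = "if (a[b[0]]>1 && x<y) ok"

print(parsed_body(body_with_cdata(user_input)) == user_input)
print(parsed_body(body_with_escape(user_input)) == user_input)
```

Either way some pass over the user's text is still needed; with CDATA it is a search for a single three-character sequence rather than a per-character substitution.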

Imagine this code (sprintf can be found in most modern languages like
C/C++ or Perl, and even Java has similar constructs):

   data = sprintf (
      "<body><![CDATA[%d: %s/%s]]></body>",
      get_int, chatsession->get_from_nick, chatsession->get_to_nick
   );

That code looks clean, small and precise. With xml_escape it becomes
more cluttered:

   data = sprintf (
      "<body>%d: %s/%s</body>",
      get_int, xml_escape (chatsession->get_from_nick),
      xml_escape (chatsession->get_to_nick)
   );

Of course these are very minor conveniences when writing code. I
barely use CDATA myself, but I can imagine others find it very
convenient. (Would those people speak up and give me an example? I'm
interested, just out of curiosity :)



R


