[Standards] Handling for characters that have entities, but XML does not require them to be escaped

Matthias Wimmer m at tthias.eu
Sun Jul 22 21:25:23 UTC 2007


Hi Robin!


Robin Redeker wrote:
>> Why at all do these characters have to be escaped?
> I guess because many people did implement their own broken XML parsers
> in the past and many couldn't handle real XML, so they enforced escaping
> that character for the backward compatibility. (just a guess)

I can't believe that there are any such problems. There is already 
software producing XML that is valid but does not escape all possible 
characters.

Examples of this are jabberd2 (up to a very recent SVN version), jadc2s 
(to this day), and Psi (which still does not escape " and ' in text nodes).

So there is a lot of software out there that has worked for years, but 
introducing this unnecessary restriction in RFC 3920 made it broken.
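
To illustrate the point, here is a minimal sketch in Python using the standard 
library's ElementTree parser. The element name and the helper names text_of 
and is_well_formed are made up for this example. As far as XML 1.0 is 
concerned, only < and & must be escaped in text nodes (plus > when it closes 
the sequence ]]>), so a conforming parser should accept the unescaped form:

```python
import xml.etree.ElementTree as ET

def text_of(xml_string):
    """Return the text content of the root element (hypothetical helper)."""
    return ET.fromstring(xml_string).text

def is_well_formed(xml_string):
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

# The same text node, once with ", ' and > left unescaped, once fully escaped.
unescaped = text_of("<body>say \"hi\", it's 3 > 2</body>")
escaped = text_of("<body>say &quot;hi&quot;, it&apos;s 3 &gt; 2</body>")

# Both documents are well-formed and carry identical text.
print(unescaped == escaped)

# Only < and & are rejected when left unescaped in a text node.
print(is_well_formed("<body>3 < 2</body>"))
print(is_well_formed("<body>a & b</body>"))
```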

> If you use expat you could get the original string from a text node
> and look for a '>' in that string. But this is an ugly hack that I also
> consider unnecessary.

How do I do this with expat? I have never seen anything like this. 
Normally, at least, expat is used as a SAX-style parser on which you set a 
CharacterDataHandler. The function you register as the 
CharacterDataHandler is passed unescaped UTF-8 data. Within the 
CharacterDataHandler I see no way to determine whether a > has been 
transferred as > or as &gt;.
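
A minimal sketch of what I mean, using Python's binding to expat 
(xml.parsers.expat) rather than the C API. The helper name character_data 
and the sample documents are only for illustration. The handler receives the 
already unescaped text, so the two inputs cannot be told apart from inside it:

```python
from xml.parsers import expat

def character_data(xml_string):
    """Collect everything expat passes to the CharacterDataHandler."""
    chunks = []
    parser = expat.ParserCreate()
    # expat may call the handler several times for one text node,
    # so the pieces are collected and joined afterwards.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(xml_string, True)
    return "".join(chunks)

literal = character_data("<body>a > b</body>")
entity = character_data("<body>a &gt; b</body>")

# Both inputs yield the same character data.
print(literal == entity)
```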

> The RFC should be fixed and software that doesn't parse unescaped > in
> text nodes should be fixed (no one is forced in today's world to write his
> own XML parser, libxml2 (afaik) and expat (for sure) can be convinced to
> handle partially transferred XML documents these days).

Yes ... I'd also say that, because we reuse standards and their 
implementations, we should not force software to reject unescaped 
characters. We should even encourage software to accept these 
unescaped characters.


Matthias


