[Standards-JIG] LAST CALL: JEP-0106 (JID Escaping)

Peter Saint-Andre stpeter at jabber.org
Tue May 3 23:24:02 UTC 2005

On Tue, May 03, 2005 at 06:19:58PM -0500, Peter Saint-Andre wrote:
> On Thu, Apr 21, 2005 at 10:43:11PM +0100, Richard Dobson wrote:
> > >Well, as pointed out in this morning's Jabber Council meeting, I was
> > >looking at the transformations only in one direction. It is perfectly
> > >valid to have domain names that start with 20, 22, 26, 27, 2f, 3a, 3c,
> > >3e, and 40. Consider the case of an MSN user whose email address is
> > >up@3am.com ... once transformed by an MSN gateway, that person's JID
> > >might be:
> > >
> > >up%3am.com@msn.example.com
> > >
> > >However, the characters %3a are now ambiguous: do they signify "@3a"
> > >through an MSN gateway or ":" as decoded in JID escaping? Thus an
> > >application would have no programmatic way of distinguishing
> > >between the following interpretations of that JID:
> > >
> > >1. an entity whose decoded node identifier is "up@3am.com"
> > >
> > >2. an entity whose decoded node identifier is "up:m.com"
> > >
> > >Ambiguity is bad because it breaks things. And one of our cardinal rules
> > >is not to break things.
> > >
> > >Therefore the Council has decided to retain the #xx; escaping mechanism
> > >for the 9 code points (and only for the 9 code points) that are
> > >explicitly disallowed in the Nodeprep profile of stringprep. While this
> > >prevents conforming applications from re-using existing URI-processing
> > >libraries for the purpose of JID escaping, the Council decided that
> > >that's a slight hardship when special-casing the 9 code points in the node
> > >identifier portion of JIDs, and to proceed with advancement of JEP-0106
> > >as-is (actually, with some slight wording changes that I am working on
> > >now).
> > >
> > >A transcript of the Council discussion is here:
> > >
> > >http://jabber.org/muc-logs/council@conference.jabber.org/2005-04-21.html
> > >
> > >Feedback is welcome as always.
> > 
> > I would have to argue that it's an equally slight hardship to alter the 
> > existing MSN transports so we can just use the internet standard, plus how 
> > many MSN users actually have addresses that could present a problem in any 
> > case?? Certainly on my MSN contact list which is about 92 people 86 have 
> > @hotmail.com or @msn.com addresses and of the remaining ones none of them 
> > have domains that could cause a problem.
> > 
> > Overall I truly fail to see the problem here from a real world point of 
> > view and as far as I can see it seems an entirely theoretical problem and 
> > thus shouldn't be holding us back from doing things properly.

Oops, my message did not come through correctly (bad mutt!). I meant to
say:

The question is: what is the proper thing to do?

Some feel that percent-encoding is the proper approach. That is the
approach used in transforming disallowed characters in URLs/URIs.
However, JIDs are not URIs, so while it might be nice to use existing
URI encoding rules for JIDs, that is by no means necessary, and to
assume that percent-encoding is the Right Thing for JIDs is, I think,
misguided. Maybe that's the right approach, but we can't assume so.

The main problem space we care about for JID escaping is the
transformation of existing non-XMPP addresses into JIDs. And here the
most common problem is re-using existing email addresses as JIDs. Now,
RFC 2822 (and before that 822) specifies that the following characters
disallowed in JIDs are allowed in email addresses: & ' /

Note that some other characters allowed in email addresses, for example
the % character, are not disallowed in JIDs.

So some interesting email addresses could be, for example:

etcetera&c@example.com
d'artagnan@example.com
slash/.dot@example.com
cr%zyguy66@example.com

As mailto: URIs, those would be:

mailto:etcetera%26c@example.com
mailto:d%27artagnan@example.com
mailto:slash%2f.dot@example.com
mailto:cr%25zyguy66@example.com

As JIDs converted using percent-encoding, those would be:

etcetera%26c@example.com
d%27artagnan@example.com
slash%2f.dot@example.com
cr%25zyguy66@example.com

As JIDs converted using JEP-0106, those would be:

etcetera#26;c@example.com
d#27;artagnan@example.com
slash#2f;.dot@example.com
cr%zyguy66@example.com

(No need to transform % since it is allowed in XMPP node identifiers.)
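
For illustration, here is a minimal Python sketch of the #xx; transformation
applied to the node identifier only. It is a sketch under assumptions, not
text from the JEP: the names DISALLOWED and escape_node are hypothetical, and
the table covers just the 9 code points named earlier in this thread
(20, 22, 26, 27, 2f, 3a, 3c, 3e, 40).

```python
# Hypothetical helper, not taken from JEP-0106 itself: escape the 9 code
# points disallowed by Nodeprep using the #xx; form discussed in this thread.
DISALLOWED = ' "&\'/:<>@'  # 20 22 26 27 2f 3a 3c 3e 40


def escape_node(node: str) -> str:
    """Return the node identifier with each disallowed code point as #xx;."""
    return "".join(
        f"#{ord(ch):02x};" if ch in DISALLOWED else ch
        for ch in node
    )


# The local parts of the four example addresses above.
for local_part in ["etcetera&c", "d'artagnan", "slash/.dot", "cr%zyguy66"]:
    print(escape_node(local_part))
```

Run on the four local parts above, this should print the same node
identifiers as in the JEP-0106 list, with % passed through untouched.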

Now, you might say, aha, this proves the point -- let's use
percent-encoding! Look, those JIDs are *different*!

Not so fast. Exactly why is it a good thing for the special-cased
transformations for the disallowed characters (as defined in JEP-0106)
to use "standard" percent-encoding? The whole issue here is that these
characters (code points) are special-cased just for JIDs. To my mind,
it's more problematic to use percent-encoding here because we are
talking only about the 9 code points that are disallowed in XMPP node
identifiers. So you can't generally apply URI escaping logic to an
address that you want to transform into a JID -- many of the characters
you would transform using URI rules MUST NOT be transformed when
creating a JID. So now you have to special-case your URI encoding on a
character-by-character basis, no? You can't transform the entire input
using URI rules and have a proper JID come out the other side; instead
you need to feed the URI encoder one character at a time and have your
standard algorithm convert only SP " # & ' / : < > @. At that point it
seems to me that the great advantage of using standard URI rules is no 
longer so wonderful, because you're still doing special-casing. So what
is the big deal about special-casing for those 9 code points but
converting them using #xx; rather than %xx? I understand the desire for
complete consistency and protocol hygiene, I really do. But I don't see
how that is going to make JID escaping any easier for implementors in
this situation. It's a simple switch statement, for Pete's sake!
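
To make the special-casing point concrete, here is a rough Python sketch
comparing a stock URI quoting routine with a per-character encoder limited
to the disallowed code points. urllib.parse.quote is the standard library
function; NODEPREP_DISALLOWED and percent_escape_node are hypothetical names
used only for this illustration, and the table again assumes the 9 code
points 20, 22, 26, 27, 2f, 3a, 3c, 3e, 40.

```python
from urllib.parse import quote

# Assumed table: the 9 code points named in this thread.
NODEPREP_DISALLOWED = ' "&\'/:<>@'


def percent_escape_node(node: str) -> str:
    """Percent-encode only the disallowed code points (hypothetical helper)."""
    return "".join(
        f"%{ord(ch):02x}" if ch in NODEPREP_DISALLOWED else ch
        for ch in node
    )


# A stock URI encoder also transforms characters that are legal in a node
# identifier, such as %.
print(quote("cr%zyguy66", safe=""))       # cr%25zyguy66
# The special-cased version leaves % alone but still needs its own table.
print(percent_escape_node("cr%zyguy66"))  # cr%zyguy66
print(percent_escape_node("slash/.dot"))  # slash%2f.dot
```

Either way the code carries its own table of special characters; only the
output format of each escaped code point differs.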

About the existing MSN gateways, yes, they would need to be modified to
handle percent-encoding. And so would existing rosters! We have a lot of
deployed code in production systems that uses % to escape @ in MSN JIDs
(which in itself is ambiguous, since % is allowed in email addresses).
To simply say "tough luck, time to upgrade" is not very friendly. Why
break things on the network if we don't have to, all in the pursuit of
full consistency with the URI specs even though JIDs are not URIs? I
just don't see the logic, and neither did anyone else on the Council
when we discussed this in the April 21 meeting.
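
The ambiguity in the up%3am.com example quoted above can be reproduced with
a few lines of Python. This is only a sketch: the "replace % with @" step is
my assumption about how the legacy gateway convention described in this
thread would be reversed, and the variable names are illustrative.

```python
from urllib.parse import unquote

node = "up%3am.com"

# Legacy MSN gateway convention as described in this thread: % stands in
# for @ (assumed to be reversed by a plain substitution).
legacy = node.replace("%", "@", 1)

# Percent-decoding: %3a is read as ":".
decoded = unquote(node)

print(legacy)   # up@3am.com
print(decoded)  # up:m.com
```

The same node identifier yields two different addresses depending on which
convention the receiving application assumes.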

Furthermore, adding a service discovery feature for JID escaping (and
one is already included in JEP-0106, see Section 4 of the JEP) is not
going to solve the problem of deployed gateways and all of their
associated roster items (believe me, as a server admin for jabber.org,
I can tell you that there are millions of such roster items out there).

In my judgement and in the judgement of the four Council members who
have voted on this JEP so far (Thomas Muldowney has yet to vote), the
desire for consistency with the URI specs is simply not compelling 
enough to break existing deployments or even risk doing so (e.g., by
introducing rather serious migration issues).

As noted, Thomas Muldowney has yet to vote on this JEP but will probably
do so on the Council list or in the next Council meeting (May 12). Feel
free to try to convince him to vote -1 before then and to make your case
further on this list. A truly compelling argument might even convince
Council members who have already voted +1 to change their votes to -1. 
So far, such a compelling argument has not appeared, at least not in the 
opinion of those who do the voting (i.e., elected Council members).

Personally I think it's time to move on to more important matters and to
accept that we have rough consensus on the JID escaping rules defined in
version 0.5 of JEP-0106. But far be it from me to shut down debate, so
feel free to keep arguing about the matter on this list.


Peter Saint-Andre
Jabber Software Foundation
