[Standards-JIG] JEP-0060: Adjustments for content-based subscriptions

Bob Wyman bob at wyman.us
Sat Jun 12 23:29:19 UTC 2004


	I'm really having trouble squeezing the content-based pubsub service
that we provide at PubSub.com into the confines of JEP-0060.
	The problem is that JEP-0060 was clearly defined with "topic-based"
pubsub in mind. That is, there are nodes (topics) to which people can publish
and subscribe and the expectation is that subscribers will get copies of
everything that is published to the node.
	At PubSub.com, we implement a "content-based" service which assumes
that clients *do not* want to receive everything that is published to any
particular node. What you do in a content-based system is subscribe to a
node/topic by defining a filter or selection query that identifies the
subset of published content that you are interested in. For some topics, it
is likely that you will create multiple subscriptions -- each selecting a
different sub-set of the messages being published via the topic/node. 
	For instance, we have a single node through which we publish a few
million messages every day. The messages are extracted from over 2 million
web logs and 50,000 NNTP newsgroups. Our typical user today has five or six
different subscriptions which are all selecting different subsets of the
messages that are published to the node. The following is a set of what
might be "typical" subscriptions for any one user. All of these
subscriptions are "filters" on a single topic or node.

	1. "Bob Wyman" OR "Robert Wyman"
	2. SOURCE:bobwyman.pubsub.com
	3. (pubsub OR "publish/subscribe" OR "pub/sub" OR "publish and
subscribe")
	4. URI:pubsub.com
	5. URI:bobwyman.pubsub.com
	
	Publishing our data over JEP-0060 is easy. We do it now. What we do
is create a Jabber node for every subscription to our service. This works
fine as long as the subscriptions are being created on the PubSub website --
not via Jabber. For us to be able to support subscriptions being created
within Jabber, we would have to have a way to give a unique identifier to a
subscription. But, in its current form, JEP-0060 provides no means to
identify a subscription. I.e. subscriptions aren't "nodes". In JEP-0060, a
"subscription" is really just part of your affiliation with a node. You're
either subscribed to the node or you're not. JEP-0060 doesn't support
multiple subscriptions to a single node.
	I originally thought I would get around this problem by having
people create "subscriptions" by creating a node and passing the "topic id"
as part of node configuration. But, this doesn't really work very well. The
problem is that it means that we've got this additional set of resources
called "topics" that aren't really nodes but that behave somewhat like
nodes. It is also a problem that in Jabber the creators of nodes are allowed
to publish to their nodes... But, you can't publish to a "subscription"! If
you publish, you need to publish to the topic/node that the subscription
filters, not the subscription itself.
	These and other problems lead me to the conclusion that we should be
treating our "topics" as Jabber Nodes and extending Jabber so that it
returns uniquely named subscriptions when people subscribe to topics/nodes.
For instance, you would subscribe to a node using something like the
following:

<iq type="set"
    from="sample_at_pubsub_dot_com at pubsub.com"
    to="xmpp.pubsub.com"
    id="sub1">
  <pubsub xmlns="http://jabber.org/protocol/pubsub">
    <subscribe
        node="pubsub/topics/weblogs"
        jid="sample_at_pubsub_dot_com at pubsub.com">
      <options>
        <x xmlns="jabber:x:data" type="submit">
          <field var="FORM_TYPE" type="hidden">
            <value>http://jabber.org/protocol/pubsub#subscribe_options</value>
          </field>
          <field var="title">
            <value>Mentions of RSS at PubSub.</value>
          </field>
          <field var="query-string">
            <value>(SOURCE:pubsub.com AND "RSS")</value>
          </field>
        </x>
      </options>
    </subscribe>
  </pubsub>
</iq>

If successful, the server would respond with something like:

<iq type="result"
    from="xmpp.pubsub.com"
    to="sample_at_pubsub_dot_com at pubsub.com"
    id="sub1">
  <pubsub xmlns="http://jabber.org/protocol/pubsub">
    <entity subid="39AB3990989098088323"
            node="pubsub/topics/weblogs"
            jid="sample_at_pubsub_dot_com at pubsub.com"
            affiliation="none"
            subscription="subscribed">
      <options>
        <x xmlns="jabber:x:data" type="submit">
          <field var="FORM_TYPE" type="hidden">
            <value>http://jabber.org/protocol/pubsub#subscribe_options</value>
          </field>
          <field var="title">
            <value>Mentions of RSS at PubSub.</value>
          </field>
          <field var="query-string">
            <value>(SOURCE:pubsub.com AND "RSS")</value>
          </field>
          <field var="xmlLink">
            <value>http://rss.pubsub.com/22/b7/d1e9845b330137935cf3384bd7.xml</value>
          </field>
        </x>
      </options>
    </entity>
  </pubsub>
</iq>

	In the example above, the key difference from JEP-0060 as it stands
is that a "subid" is returned. The subid allows the user and system to keep
track of multiple subscriptions to a single topic/node. The creation of this
subid has a few implications throughout the rest of the system. 
	The most critical impact is, I think, on the messages that get
published to clients. The problem is that a single message may satisfy more
than one subscription and, given that we send very large messages, we don't
want to be forced to send multiple copies of the message. Thus, we need to
be able to list multiple subids for a single message. Although it isn't
pretty, I think this is best done with something like the following:

   <message to='sample_at_pubsub_dot_com at xmpp.pubsub.com' 
     from='pubsub-delivery at xmpp.pubsub.com' >
     <event xmlns='http://jabber.org/protocol/pubsub#event'>
       <items node='pubsub/topics/weblogs'>
         <item id='6802'>
           <subscription>
             <subid>7098709860970897</subid>
             <subid>098789790987888</subid>
           </subscription>
           <pubsub-message xmlns="http://www.pubsub.com/xmlns">
             Message content goes here... This element would be 
             omitted if sending notifications only.
           </pubsub-message>
         </item>
       </items>
     </event>
   </message>
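
	The grouping step behind a message like the one above can be
sketched in a few lines of Python. This is only an illustration, not how
PubSub.com does it: the Subscription class, the coalesce function, the
example.com JIDs and the substring-based "queries" are all hypothetical
stand-ins for a real matching engine.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Subscription:
    subid: str                      # identifier handed back at subscribe time
    jid: str                        # subscriber address (hypothetical values below)
    matches: Callable[[str], bool]  # hypothetical stand-in for the real query


def coalesce(item_text: str, subs: List[Subscription]) -> Dict[str, List[str]]:
    """Map each subscriber JID to every subid that the item satisfies."""
    hits: Dict[str, List[str]] = defaultdict(list)
    for sub in subs:
        if sub.matches(item_text):
            hits[sub.jid].append(sub.subid)
    return dict(hits)


subs = [
    Subscription("7098709860970897", "sample@example.com", lambda t: "RSS" in t),
    Subscription("098789790987888", "sample@example.com", lambda t: "pubsub" in t),
    Subscription("555", "other@example.com", lambda t: "Atom" in t),
]

# Each key would become one outgoing message; the list of subids in the
# value would fill that message's <subscription/> element.
deliveries = coalesce("pubsub now speaks RSS", subs)
print(deliveries)
```

Keying the result by JID is what keeps a large item from being sent to the
same subscriber more than once, however many of their filters it satisfies.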

	In many other places in the spec, there would also need to be
support provided for subids. For instance, when unsubscribing, the user
would need to specify not only the nodeID but also the subID. If a subID was
not specified, then it would be assumed that *all* subscriptions for the
specified node should be deleted.
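	As a rough illustration of that rule, here is a minimal Python
sketch of the bookkeeping a service might keep. It assumes subids are
random hex strings and that subscriptions live in a simple in-memory
table; the SubscriptionRegistry class, its method names and the
example.com JID are hypothetical and not part of JEP-0060.

```python
import uuid
from typing import Dict, Optional, Tuple


class SubscriptionRegistry:
    """In-memory table allowing many subscriptions per (node, jid)."""

    def __init__(self) -> None:
        # subid -> (node, jid, query string)
        self._subs: Dict[str, Tuple[str, str, str]] = {}

    def subscribe(self, node: str, jid: str, query: str) -> str:
        """Create a subscription and return its unique subid."""
        subid = uuid.uuid4().hex.upper()  # assumed subid format
        self._subs[subid] = (node, jid, query)
        return subid

    def unsubscribe(self, node: str, jid: str,
                    subid: Optional[str] = None) -> int:
        """Remove one subscription, or all of them for (node, jid) when
        no subid is given. Returns the number removed."""
        doomed = [
            sid for sid, (n, j, _) in self._subs.items()
            if n == node and j == jid and (subid is None or sid == subid)
        ]
        for sid in doomed:
            del self._subs[sid]
        return len(doomed)

    def count(self, node: str, jid: str) -> int:
        """Number of live subscriptions for (node, jid)."""
        return sum(1 for n, j, _ in self._subs.values()
                   if n == node and j == jid)


reg = SubscriptionRegistry()
node, jid = "pubsub/topics/weblogs", "sample@example.com"
first = reg.subscribe(node, jid, '(SOURCE:pubsub.com AND "RSS")')
second = reg.subscribe(node, jid, "URI:pubsub.com")
third = reg.subscribe(node, jid, '"Bob Wyman" OR "Robert Wyman"')

removed_one = reg.unsubscribe(node, jid, first)  # only the named subscription
removed_rest = reg.unsubscribe(node, jid)        # no subid: all remaining ones
```

Note that an unsubscribe naming a subid the registry does not know removes
nothing; whether that should be reported as an error is left open here.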
	I'm still working on a few issues. For instance, item deletion is a
bit of a problem since it would require that all deleted items be matched
against outstanding subscriptions to determine which subscriptions should
receive the item deleted messages defined in JEP-0060 at Section 8.1.3.
Unfortunately, while this is reasonably practical if the filtering/query
language is reasonably simple, it isn't very practical if notifications
based on event patterns, inter-message dependencies, or context external to
a message are supported. (i.e. it may not be possible at the moment of
retraction to determine what would have matched a subscription at some
earlier moment in time.) 
	Comments? Am I missing some obvious easier solution to the problems
outlined above?

		bob wyman




