[Standards-JIG] JEP-0060: Adjustments for content-based subscriptions

Bob Wyman bob at wyman.us
Sun Jun 13 18:06:41 UTC 2004

Ralph Meijer wrote: 
> see http://ralphm.net/blog/2004/06/13/pubsub.com_xmpp
> Bob also shortly mentions the situation where a user might
> receive items more than once, because they match multiple
> queries. I'm not sure if you would have to solve that server side.
	I think it is critical for this issue to be resolved "server side"
in order to: 
	* Limit bandwidth consumption
	* Enable distributed networks of pubsub servers to work better.

As you said in your blog: 
> I envision a distributed system with news authors 
> publishing news (or any other data) themselves, to which 
> people can subscribe, and having this data spread through 
> hierarchies of pubsub repeaters.
	I share your vision and hope that we'll be able to use Jabber/XMPP
in order to construct such a distributed pubsub network. Without boring you
too much with the theory of distributed pubsub, let me give a concrete
example of how two nodes in such a network might interact:
	In addition to our main service at PubSub.com, I have also built a
little service at http://mystack.com that subscribes to PubSub.com and
creates "stacks" which are bits of HTML that can be included in web pages to
present links to recent blog entries that match some subscription. (If you
go to the http://mystack.com site, the "Mentions of RSS" that you see on the
right side of the page is an example of such a "stack.") Users define their
stacks at mystack.com and then MyStack creates a subscription with
PubSub.com on behalf of the user. PubSub.com delivers the entries that match
the subscriptions to mystack.com and MyStack then builds the stacks. By
subscribing to PubSub.com, instead of getting a full feed from it,
MyStack.com sees only a tiny portion of the full message traffic that flows
through PubSub.com.
	We find that there is a great deal of commonality in the
subscriptions that people create on MyStack. Thus, it is very likely that if
PubSub is to send any single item to MyStack, MyStack will need to insert it
into more than one stack. However, because PubSub.com can attach a list of
potentially many subscription ids to each message it sends out, MyStack
gets only one copy of the message even when an item is to be delivered to
100 stacks. The result is a tremendous saving in bandwidth between MyStack
and PubSub.com.
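	As a rough illustration of the exchange described above, here is a minimal Python sketch. The subscription tuples, the keyword matching, and the names build_deliveries and fan_out are all assumptions made for the example; they are not PubSub.com's or MyStack's actual code or wire format.

```python
from collections import defaultdict

# Hypothetical subscriptions held by the server:
# (subscription id, subscriber, keyword to look for).
SUBSCRIPTIONS = [
    ("s1", "mystack.com", "jabber"),
    ("s2", "mystack.com", "pubsub"),
    ("s3", "example.org", "rss"),
]

def build_deliveries(item_text, subscriptions):
    """Server side: one delivery per subscriber, tagged with every matching id."""
    matched = defaultdict(list)
    text = item_text.lower()
    for sub_id, subscriber, keyword in subscriptions:
        if keyword in text:
            matched[subscriber].append(sub_id)
    return [
        {"to": subscriber, "subscription_ids": ids, "item": item_text}
        for subscriber, ids in matched.items()
    ]

def fan_out(delivery, stacks):
    """Receiver side: put the single copy into every stack named in the message."""
    for sub_id in delivery["subscription_ids"]:
        stacks.setdefault(sub_id, []).append(delivery["item"])

deliveries = build_deliveries("Jabber pubsub news", SUBSCRIPTIONS)
stacks = {}
for delivery in deliveries:
    fan_out(delivery, stacks)
```

	The item matches two of mystack.com's subscriptions, yet only one delivery is built for it; the receiver does the fan-out locally from the list of ids.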
	An alternative approach would have been to build into MyStack.com
the ability to do what is called subscription "covering"[1]. Using this
method, MyStack.com would compare subscriptions and would, for instance,
decide that a subscription for "Jabber mentioned on any site" would "cover"
a subscription for "Jabber only when discussed at some specific site." Thus,
it would combine the two subscriptions into a single subscription and then
re-match on the results that it got from PubSub.com. However, this approach
would make MyStack.com massively more complex than it currently is. It would
require that MyStack.com implement complex code to analyze subscriptions
for commonality, as well as the ability to re-match results to determine
which subscriptions they are destined for. This would convert
MyStack.com from a fun weekend project into a major software development
effort. That would not be a good thing...
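	To make "covering" concrete, here is a deliberately tiny Python sketch. It assumes a subscription is just a keyword plus an optional site, which is far simpler than a real content-based subscription language; the names Subscription, covers, minimal_cover, and rematch are hypothetical.

```python
from typing import NamedTuple, Optional

class Subscription(NamedTuple):
    keyword: str
    site: Optional[str]  # None means "any site"

def covers(general, specific):
    """True if everything matching `specific` also matches `general`."""
    return general.keyword == specific.keyword and (
        general.site is None or general.site == specific.site
    )

def minimal_cover(subs):
    """Keep only subscriptions that no other, different subscription covers."""
    return [s for s in subs if not any(o != s and covers(o, s) for o in subs)]

def matches(sub, entry):
    return sub.keyword in entry["text"].lower() and (
        sub.site is None or sub.site == entry["site"]
    )

def rematch(entry, subs):
    """Local re-match: decide which original subscriptions an entry satisfies."""
    return [s for s in subs if matches(s, entry)]

any_site = Subscription("jabber", None)
one_site = Subscription("jabber", "example.org")

# Only the broader subscription would be sent upstream.
upstream = minimal_cover([any_site, one_site])

# Results coming back must be re-matched against both originals.
entry = {"site": "example.org", "text": "Jabber release notes"}
local = rematch(entry, [any_site, one_site])
```

	Even in this toy form the receiver has to carry its own matching engine, and the work grows quickly once subscriptions can contain arbitrary predicates.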
	While it is probably obvious from the example above that being able
to send to multiple subscriptions is useful for inter-server traffic, it
should also be seen that it is useful even when a server sends a message to
a single client. As you are probably aware, one of the major difficulties
with traditional RSS/Atom news aggregators is detecting "duplicate"
messages. In general, we try to do everything we can to avoid transmitting
duplicates, so that clients do not need to implement complex
duplicate-detection code. Also, when we are sending to
low-bandwidth clients (mobile phones, people on dial-up lines, folk in
Africa...) we try to do our best to reduce the waste of bandwidth. Tagging
messages with multiple subscription ids makes this much easier.
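	For contrast, here is a sketch of the bookkeeping a client would need if the server sent one copy per matching subscription. It assumes every item carries a stable id, which real RSS/Atom feeds often do not provide; the dedupe function and the sample ids are hypothetical.

```python
def dedupe(copies):
    """Merge per-subscription copies into one record per item.

    `copies` is an iterable of (item_id, subscription_id) pairs,
    one pair for every copy the server sent.
    """
    merged = {}
    for item_id, sub_id in copies:
        merged.setdefault(item_id, []).append(sub_id)
    return merged

# item-1 matched two subscriptions, so it arrives twice.
copies = [("item-1", "s1"), ("item-1", "s2"), ("item-2", "s1")]
merged = dedupe(copies)

# Number of transmissions that carried nothing new.
wasted = len(copies) - len(merged)
```

	When the server tags each message with all of its subscription ids, the client receives the merged form directly and none of this code is needed.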
	The distributed network of pubsub servers that many of us hope will
be created in the future will rely on a wide variety of content-distribution
mechanisms. There is a role in this network for topic-based as well as
content-based subscriptions. There is a role in this network for
easy-to-implement inter-server subscriptions as well as more complex schemes
that rely on subscription covering. 
	Being able to tell a server that a particular message satisfies more
than one subscription is an important part of making pubsub practical.

		bob wyman

[1] The Siena project has good research material available on "covering":
