[Standards-JIG] pubsub: cache-last-item

Bob Wyman bob at wyman.us
Mon Feb 6 22:08:12 UTC 2006


Peter Saint-Andre wrote on 30-Jan-2006:
> I want to make sure that the proposed "cache-last-item" functionality
> is clearly understood before we move forward with revisions to JEP-0060.

	I think this caching business needs a great deal more thought.
	Caching the last message published can, I think, only serve
one of two purposes:

	1) Tell you that data has actually been published in the past.
	2) Provide the last state of some resource.

	The first motivation isn't, I think, sufficiently strong to require
such an expensive feature. The second motivation is compelling only in the
limited set of cases where the node provides information about a single
resource, and even then only where historical information about that
resource is useful. Also, the second motivation may require exceptionally
expensive operations in the case of content-based subscriptions that filter
node events.

	As Peter Millard mentioned in a recent reply, required caching can
place a significant resource burden on the server. At PubSub.com, our
JEP-0060 implementation currently services a couple million subscriptions.
Would this require that we cache the last message for each of those millions
of subscriptions? That is a pretty major cost in exchange for some
"simplicity..."
	What is the interaction between this requirement and the ability to
subscribe to Collections? Imagine that I have a stock-quote server that
aggregates nodes for each "Fortune 500" stock in a collection that can be
subscribed to. What do I send when someone subscribes to the collection?
Would I send just the last item published to any of the nodes in the
collection or would I publish the potentially large set of items that are
the last items sent to any member of the collection? If I should only
publish one item -- the last item published to the collection -- then how is
that useful to a subscriber? What good does it do you to have one item out
of potentially hundreds?
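
	The two readings of the collection question can be made concrete with
a minimal sketch. Everything in it is hypothetical and for illustration
only: the Item type, the node names, the payloads, and the idea of a
per-node "last item" cache keyed by node name. None of it comes from
JEP-0060.

```python
from dataclasses import dataclass


@dataclass
class Item:
    node: str      # member node the item was published to
    payload: str   # made-up payload, illustration only
    seq: int       # global publish order; higher means newer


def last_item_in_collection(last_by_node):
    """Reading A: send one item, the newest across all member nodes."""
    return max(last_by_node.values(), key=lambda item: item.seq)


def last_item_per_node(last_by_node):
    """Reading B: send the cached last item of every member node."""
    return sorted(last_by_node.values(), key=lambda item: item.node)


# Hypothetical cache: one "last item" per member node of the collection.
last_by_node = {
    "quotes/AAA": Item("quotes/AAA", "price-a", 3),
    "quotes/BBB": Item("quotes/BBB", "price-b", 7),
    "quotes/CCC": Item("quotes/CCC", "price-c", 5),
}

one = last_item_in_collection(last_by_node)   # a single item, for one stock
many = last_item_per_node(last_by_node)       # one item per member node
```

	Under reading A the new subscriber learns about one stock out of the
whole collection; under reading B the reply grows with the number of member
nodes.
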
	In the case of content-based subscriptions, one can have a single
node that carries data about many different resources. For instance, at
PubSub.com, our JEP-0060 server offers nodes that carry all updates to over
20 million blogs. We publish millions of items every day. It is unlikely
that the last item published to the "weblogs" node will match a newly
created subscription. In order to model the apparently desired behavior
(i.e. always deliver a single result when a subscription is created) we
would have to maintain a retrospective search engine that scanned every item
ever published in order to find the last item that would have matched...
Certainly, there are reasons why this might be desirable -- however, it is
unreasonable to require it.
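
	A minimal sketch of why a single cached item does not help here,
assuming (purely for illustration) that a content-based subscription is a
predicate over items and that the node's history is an in-memory list. The
names last_matching, history and wants_pubsub are hypothetical.

```python
def last_matching(history, predicate):
    """Scan published items newest-first; return the first match or None."""
    for item in reversed(history):
        if predicate(item):
            return item
    return None


# Hypothetical history of a busy node, oldest first.
history = [
    {"blog": "a.example", "text": "notes on jabber pubsub"},
    {"blog": "b.example", "text": "a cooking post"},
    {"blog": "c.example", "text": "a gardening post"},
]


def wants_pubsub(item):
    """Hypothetical content filter of a newly created subscription."""
    return "pubsub" in item["text"]


cached_last = history[-1]                      # what a one-item cache holds
cache_hit = wants_pubsub(cached_last)          # the cache alone is no use here
found = last_matching(history, wants_pubsub)   # requires scanning the history
```

	With one cached item per node the filter simply fails to match; getting
a result for every new subscription means keeping and searching the history,
and the scan is linear in the number of items retained.
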
	In the case of a system which issues fire alarms, it may be that two
years ago the system issued such an alarm but has not published any alarm
data since that time. Should a new subscriber to the fire-alarm node be
presented with the fire-alarm from two years ago? (Yes, they should notice
that it was a long time ago... One might even suggest that a fire-alarm
should eventually be followed by an "all clear." However, why would this
complexity be necessary?)
	I believe there are cases where "last message" is useful. However,
there are many, many cases where it is not. It appears to me that most of
the folk who are commenting so far are thinking only of those cases where it
does make sense. Given that there are many cases where it doesn't make
sense, it would be reasonable to avoid general "should" statements in the
protocol specification and instead rely on application protocols to define
the "SHOULD" cases on a use-case by use-case basis. Thus, a definition of
the use of PubSub for presence applications or in some gaming applications
might say that you "SHOULD" cache the last message -- but the base protocol
MUST NOT...

	bob wyman
 



