[Standards-JIG] proto-JEP: Smart Presence Distribution

Dave Cridland dave at cridland.net
Thu May 18 10:17:49 UTC 2006


On Thu May 18 10:05:09 2006, Carlo v. Loesch wrote:
> Dave Cridland typeth:
> | It's not mathematics, it's information theory - good old Shannon. The
> | information in this proto-jep is only marginally lower, and across
> how can you say that? matthias has showed figures that 59.76% of
> presence stanzas are redundant duplicates. this proto-jep eliminates
> the duplicates, or at least reduces them to the amount of involved
> servers, which must range between 30-50%, if you add real multicast
> it gets even better. so why are you coming up with totally unreal
> ideas, that this proto-jep would hardly bring improvement?
> 
> 
Information. Not redundancy, or octets. I'm not questioning whether 
this proto-JEP would bring improvement or not, it demonstrably will. 
The question is whether the improvement is:

a) Worthwhile based on the cost of implementation. (Is this tricky to 
code right?)
b) Worthwhile based on the cost of deployment. (Is this going to cost 
CPU, memory, etc?)
c) Worthwhile based on the detrimental effects to security. (Is this 
handing over responsibility for policy to a foreign domain?)
d) Worthwhile to do now, or should we be concentrating on simpler 
methods first. (The "low hanging fruit" argument).


> so you think compression does the same job?

No, I did say a "perfect compression mechanism". Specifically, that's 
a theoretical limit that's unattainable in practice, although certain 
patents appear to disagree with me. A compression algorithm that 
produced the same result would have to not only be perfect, but 
maintain state across connections, and would probably use an infinite 
amount of memory, as well as an infinite amount of CPU power.

>  well i can see how
> compression helps boil down the verbose xml syntax, but how will
> it reduce the complete collection of jids which are currently
> being sent over the wire? tokenize the @s and .s in it? great idea!
> 
> 
It won't, but the jids will compress relative to one another anyway. 
Deflate won't tokenize single characters, of course, but the Huffman 
encoding phase will actually reduce those from 8 bits (probably to a 
lot fewer, @ and . being quite common characters in XMPP), and the LZ 
compression will be doing the server name quite nicely, and the 
payloads very well indeed, due to their exact similarity.

Any real-world compression algorithm will reduce the complete 
collection of jids by a significant amount because they're 
self-similar, containing a significant quantity of repetitive, and 
thus redundant, data. This is basic information theory stuff.
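
As a rough illustration of the self-similarity point, here is a minimal sketch using Python's standard zlib module. The JIDs and the domain are invented for the example, so the octet counts it prints are only indicative and are not the figures discussed in this thread.

```python
import zlib

# Hypothetical JIDs, invented for illustration: ten users on one domain.
names = ["alice", "bob", "carol", "dave", "erin",
         "frank", "grace", "heidi", "ivan", "judy"]
jids = [f"{name}@example.org/home" for name in names]

# One JID per line, as a stand-in for "the complete collection of jids".
raw = "\n".join(jids).encode("utf-8")

# Deflate (LZ77 plus Huffman coding) at the highest compression level.
packed = zlib.compress(raw, 9)

print(len(raw), "octets raw ->", len(packed), "octets deflated")
```

The repeated "@example.org/home" suffix is what the LZ stage picks up; only the differing local parts have to be sent more or less literally.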

Of course, you shouldn't let the facts get in the way of an 
opportunity to express your undoubted wit.


> can we have an example of how many bytes say 10 copies of presence with
> different jids compress to in comparison to one copy of presence without
> a 'to' or with a placebo 'to'?
> 
> 
Yes, I already did that in the message you quote above. 154 octets to 
62, if memory serves. Enclosed in a single TCP/IP packet, that's ~200 
to ~100, so roughly a 50% saving on - taking Matthias's figures as 
ideal - 50% of the data, or a 25% saving relative to compression 
alone.
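
The arithmetic behind those rounded numbers can be sketched as follows. The 40-octet header overhead is my assumption (a 20-octet IPv4 header plus a 20-octet TCP header, no options); the 154 and 62 octet payloads and the 50% redundant share are the figures quoted above.

```python
# Assumption: 20-octet IPv4 header + 20-octet TCP header, no options.
HEADER_OCTETS = 40

def on_wire(payload_octets: int) -> int:
    """Octets for one packet carrying the given compressed payload."""
    return payload_octets + HEADER_OCTETS

ten_copies = on_wire(154)  # 194, i.e. "~200"
one_copy = on_wire(62)     # 102, i.e. "~100"

# Saving on the presence traffic itself: about 47%, "roughly 50%".
per_packet_saving = 1 - one_copy / ten_copies

# Share of traffic that is duplicate presence, rounded as in the text.
redundant_share = 0.5

# Saving relative to compression alone: about 24%, "a 25% saving".
overall_saving = per_packet_saving * redundant_share

print(f"{per_packet_saving:.0%} of presence traffic, {overall_saving:.0%} overall")
```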

Compression alone I would expect to be running to approximately a 75% 
saving, and careful use of TCP-level buffering and associated 
compression buffering may improve that without a noticeable latency 
increase. (Although it'd reduce XMPP's effectiveness for 
near-real-time applications, by adding to the latency).

I should point out those figures assume an immediate flush after the 
data (a Z_SYNC_FLUSH, for zlib-fans), and no prior data, so the 
figures are not terribly realistic, and I'd personally only take 
these as an approximate guide. I would expect in practice that the 
difference was less pronounced, because prior data would aid in the 
compression of the presence stanzas, and in addition, the presence 
stanzas would aid in the compression of both subsequent data in 
general, and subsequent presence stanzas in particular. As the 
compression ratio improves, the octet count difference decreases.
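
To see the effect of prior data, here is a minimal sketch that pushes ten presence stanzas through a single deflate stream, doing a sync flush after each one. The stanza contents and addresses are hypothetical, and this is not the test that produced the figures above; it only shows the trend.

```python
import zlib

def presence(to_jid: str) -> bytes:
    # Hypothetical stanza: sender, status text and recipients are invented.
    return (
        f"<presence from='dave@example.net/laptop' to='{to_jid}'>"
        "<show>away</show><status>In a meeting</status></presence>"
    ).encode("utf-8")

def flushed_sizes(stanzas):
    """Compress stanzas on one deflate stream, sync-flushing after each.

    Returns the number of compressed octets emitted per stanza.
    """
    stream = zlib.compressobj()
    sizes = []
    for stanza in stanzas:
        out = stream.compress(stanza) + stream.flush(zlib.Z_SYNC_FLUSH)
        sizes.append(len(out))
    return sizes

recipients = [f"user{i}@example.org" for i in range(10)]
sizes = flushed_sizes([presence(jid) for jid in recipients])
print(sizes)
```

The first stanza pays for itself in full; the later ones are mostly back-references into the compressor's history, so each costs far fewer octets even with the flush overhead.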

By the way, I'd drop that sarcastic tone about the "placebo to", as 
well, because I'm not clear it's redundant in the case where several 
domains are hosted on the same server. Besides which, whatever the 
technical issues, logistically it's too difficult to remove in 
specification.

Dave.

PS: It's not "typeth" - that's second person singular archaic. The 
reason it's archaic is partly because under most circumstances, using 
"thou" is generally over-familiar, like using "tu" in French, for 
example. It was still used in Northern England into my lifetime, 
although the form had changed to "tha types", instead of "thou 
typeth", and is fading fast because of the familiarity it implies.
-- 
           You see things; and you say "Why?"
   But I dream things that never were; and I say "Why not?"
    - George Bernard Shaw


