[Standards-JIG] proto-JEP: Smart Presence Distribution
dave at cridland.net
Thu May 18 10:17:49 UTC 2006
On Thu May 18 10:05:09 2006, Carlo v. Loesch wrote:
> Dave Cridland typeth:
> | It's not mathematics, it's information theory - good old Shannon. The
> | information in this proto-jep is only marginally lower, and
> how can you say that? matthias has showed figures that 59.76% of
> presence stanzas are redundant duplicates. this proto-jep eliminates
> the duplicates, or at least reduces them to the amount of involved
> servers, which must range between 30-50%, if you add real multicast
> it gets even better. so why are you coming up with totally unreal
> ideas, that this proto-jep would hardly bring improvement?
Information. Not redundancy, or octets. I'm not questioning whether
this proto-JEP would bring improvement or not, it demonstrably will.
The question is whether the improvement is:
a) Worthwhile based on the cost of implementation. (Is this tricky to
b) Worthwhile based on the cost of deployment. (Is this going to cost
CPU, memory, etc?)
c) Worthwhile based on the detrimental effects to security. (Is this
handing over responsibility for policy to a foreign domain?)
d) Worthwhile to do now, or should we be concentrating on simpler
methods first. (The "low hanging fruit" argument).
> so you think compression does the same job?
No, I did say a "perfect compression mechanism". Specifically, that's
a theoretical limit that's unattainable in practice, although certain
patents appear to disagree with me. A compression algorithm that
produced the same result would have to not only be perfect, but
maintain state across connections, and would probably use an infinite
amount of memory, as well as an infinite amount of CPU power.
> well i can see how
> compression helps boil down the verbose xml syntax, but how will
> it reduce the complete collection of jids which are currently
> being sent over the wire? tokenize the @s and .s in it? great idea!
It won't, but the jids will compress relative to one another anyway.
Deflate won't tokenize single characters, of course, but the Huffman
encoding phase will actually reduce those from 8 bits each (probably to
a lot fewer, @ and . being quite common characters in XMPP), and the
LZ77 phase will handle the repeated server name quite nicely, and the
payloads very well indeed, since they are identical.
Any real-world compression algorithm will reduce the complete
collection of jids by a significant amount because they're
self-similar, containing a significant quantity of repetitive, and
thus redundant, data. This is basic information theory stuff.
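
As a rough illustration of that point, here is a minimal sketch using
Python's zlib module (deflate). The stanza layout, the example.net and
example.org domains and the user names are all made up for the
example, so the exact byte counts will differ from any real traffic:

```python
import zlib

# Hypothetical presence stanza; only the 'to' address varies between copies.
def presence(to_jid):
    return (f'<presence from="romeo@example.net/orchard" to="{to_jid}">'
            '<show>away</show><status>Gone fishing</status></presence>')

users = ["alice", "bob", "carol", "dave", "erin",
         "frank", "grace", "heidi", "ivan", "judy"]

one = presence("alice@example.org").encode("utf-8")
ten = "".join(presence(u + "@example.org") for u in users).encode("utf-8")

one_z = zlib.compress(one)
ten_z = zlib.compress(ten)

print("one stanza: ", len(one), "->", len(one_z), "octets")
print("ten stanzas:", len(ten), "->", len(ten_z), "octets")
```

The ten stanzas compress to far less than ten times the size of one
compressed stanza, because everything except the local part of each
jid is a repeat of data the compressor has already seen.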
Of course, you shouldn't let the facts get in the way of an
opportunity to express your undoubted wit.
> can we have an example of how many bytes say 10 copies of presence
> different jids compress to in comparison to one copy of presence
> a 'to' or with a placebo 'to'?
Yes, I already did that in the message you quote above. 154 octets to
62, if memory serves. Enclosed in a single TCP/IP packet, that's ~200
to ~100, so roughly a 50% saving on - taking Matthias's figures as
ideal - 50% of the data, or a 25% saving relative to compression
Compression alone I would expect to be running to approximately a 75%
saving, and careful use of TCP-level buffering and associated
compression buffering may improve that without a noticeable latency
increase. (Although it'd reduce XMPP's effectiveness for
near-real-time applications, by adding to the latency).
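
The packet-level arithmetic in the paragraph above can be checked with
a short sketch. The 40 octets of header overhead (20 for IPv4 plus 20
for TCP, no options) is an assumption made here to reproduce the
"~200 to ~100" figures; the message itself does not state it:

```python
# Figures quoted in the message: ten stanzas vs. one, after compression.
payload_before = 154
payload_after = 62

# Assumed overhead: minimal IPv4 header (20) + minimal TCP header (20).
HEADER_OCTETS = 40

packet_before = payload_before + HEADER_OCTETS
packet_after = payload_after + HEADER_OCTETS

# Saving on the presence traffic itself.
presence_saving = 1 - packet_after / packet_before

# Share of traffic that is redundant presence, taking the quoted
# figures as an ideal 50%.
presence_share = 0.5

overall_saving = presence_saving * presence_share

print(packet_before, packet_after)
print(round(presence_saving, 3), round(overall_saving, 3))
```

Under those assumptions the saving comes out at a little under 50% of
the presence traffic and a little under 25% overall, which matches the
rounded figures given above.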
I should point out those figures assume an immediate flush after the
data (a Z_SYNC_FLUSH, for zlib-fans), and no prior data, so the
figures are not terribly realistic, and I'd personally only take
these as an approximate guide. I would expect in practice that the
difference would be less pronounced, because prior data would aid in the
compression of the presence stanzas, and in addition, the presence
stanzas would aid in the compression of both subsequent data in
general, and subsequent presence stanzas in particular. As the
compression ratio improves, the octet count difference decreases.
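
The effect of prior data can be sketched with a single long-lived
deflate stream that is flushed with zlib's Z_SYNC_FLUSH after every
stanza. This is only an illustration: the stanza text and addresses
are hypothetical, and a real server would interleave other traffic on
the same stream:

```python
import zlib

# Hypothetical presence stanza; only the 'to' address varies.
def presence(to_jid):
    return (f'<presence from="romeo@example.net/orchard" to="{to_jid}">'
            '<show>away</show><status>Gone fishing</status></presence>'
            ).encode("utf-8")

# One compressor and one decompressor for the whole connection,
# so the dictionary built from earlier stanzas is kept.
comp = zlib.compressobj()
decomp = zlib.decompressobj()

def send(data):
    # Z_SYNC_FLUSH pushes out everything so far, so the receiver can
    # decode this stanza immediately without waiting for more input.
    return comp.compress(data) + comp.flush(zlib.Z_SYNC_FLUSH)

first_plain = presence("alice@example.org")
second_plain = presence("bob@example.org")

first_wire = send(first_plain)
second_wire = send(second_plain)

received = decomp.decompress(first_wire) + decomp.decompress(second_wire)

print("first stanza on the wire: ", len(first_wire), "octets")
print("second stanza on the wire:", len(second_wire), "octets")
```

The second stanza costs noticeably fewer octets than the first, since
most of it is encoded as a reference back to the first. That is why
figures measured with no prior data overstate the per-stanza cost.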
By the way, I'd drop that sarcastic tone about the "placebo to", as
well, because I'm not clear it's redundant in the case where several
domains are hosted on the same server. Besides which, whatever the
technical issues, logistically it's too difficult to remove in
PS: It's not "typeth" - that's second person singular archaic. The
reason it's archaic is partly because under most circumstances, using
"thou" is generally over-familiar, like using "tu" in French, for
example. It was still used in Northern England into my lifetime,
although the form had changed to "tha types", instead of "thou
typeth", and is fading fast because of the familiarity it implies.
You see things; and you say "Why?"
But I dream things that never were; and I say "Why not?"
- George Bernard Shaw