[Standards] Proposed XMPP Extension: Character counting in message bodies
jonas at wielicki.name
Tue Dec 8 21:32:47 UTC 2020
On Freitag, 4. Dezember 2020 21:33:38 CET Sam Whited wrote:
> On Fri, Dec 4, 2020, at 20:23, Florian Schmaus wrote:
> > I begin to feel that a lot of your rationale is based on the idea that
> > you always (/often?) have access to the raw UTF-8 bytes as they
> > appeared on the wire.
> Yes, most of it is.
> > While this is probably true for languages where the String type's native
> > encoding is also UTF-8, it is usually not true for others. For
> > example, widely used XML parsers in Java will return Java's String
> > type, which is UTF-16 (or ISO-8859-1) based.
> Yes, this is fair, I was thinking you could probably always get the raw
> bytes, but it does look like a lot of these *only* do DOM based parsing
> and don't keep the original representation.
This has nothing to do with DOM vs. whatever. SAX can also give you the data
in the format which is described by the XML model (code points).
So it appears there are two sides, and arguing from the point of view of
programming languages will always give us those who get the raw representation
of the data on the wire (C-ish things) and those who get the high-level
Thus, I propose that we stick with what the standards offer. XMPP is based on
XML in that all data exchanged is somehow wrapped in XML. XML specifies that
all character data (text) is a sequence of Unicode code points. The encoding
on the wire is irrelevant after decoding of XML; on the *abstract* layer, XML
provides sequences of code points, nothing else.
Some libraries always convert to UTF-8 (libxml2), some bindings always offer
some kind of Unicode code points (e.g. Python, which opportunistically chooses
ASCII/UCS-2/UCS-4 depending on the data), some bindings may even expose the
raw bytes and let the user deal with it (I think there was/is a zero-copy
implementation which mostly consisted of strategically replacing XML
metacharacters with NUL bytes in the incoming data).
But all implementations which want to be XMPP and XML 1.0 compliant need to
have some way to convert or offer access to code points, as that’s the XML data
model. Let’s build on that.
Much easier than writing 20 emails on this topic, and that just in this
> > However, given that there is a wide variety here, I am not sure if it
> > is worth taking any of that into consideration.
> Yes, fair enough.
> > Instead, my rationale is based on the idea that you always have
> > access to the Unicode code points of the textual content obtained
> > from the XML.
> I do not have that access without converting from UTF-8 to code points
> in the hot-path where it would be inappropriate. It's effectively the
> same thing: I don't want to convert from bytes to code points, you don't
> want to convert from codepoints to bytes. Some languages will have to do
> the conversion either way, so it seems worth using the thing that allows
> for the most flexibility with the least amount of work in e.g. IoT
> devices using C that are trying to optimize for performance where
> passing along the bytes as received on the wire (possibly with some
> validation that the range is accurate) is acceptable.
Note that you do not have to decode UTF-8 (which can be between O(n) and
O(n^2) depending on the implementation and circumstances) to count code
points; you can certainly do the counting in O(n) (which is the same as
strlen() in C). And it would be similarly easy to write algorithms to do
efficient batched codepoint indexed operations on UTF-8 strings in C (such as
splitting UTF-8 byte ranges based on start/end information or such), if you
really wanted to do such things in C.
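
To illustrate the point about counting without decoding, here is a minimal
sketch in C. It assumes the buffer has already been validated as UTF-8 (an XML
parser has to do that anyway), and the function names utf8_count_codepoints
and utf8_byte_offset are made up for this example, not taken from any library.
It relies on the fact that every code point contributes exactly one byte that
is not a continuation byte (10xxxxxx):

```c
#include <assert.h>
#include <stddef.h>

/* Count the code points in a UTF-8 buffer of `len` bytes.
 * Assumption: the input is valid UTF-8. Each code point has exactly one
 * non-continuation byte, so a single O(n) pass over the bytes suffices. */
size_t utf8_count_codepoints(const unsigned char *s, size_t len)
{
    assert(s != NULL);
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) { /* not a continuation byte */
            count++;
        }
    }
    return count;
}

/* Map a 0-based code point index to the byte offset where that code point
 * starts. An index equal to the number of code points maps to `len`, so it
 * can serve as an exclusive end. Returns (size_t)-1 if out of range. */
size_t utf8_byte_offset(const unsigned char *s, size_t len, size_t index)
{
    assert(s != NULL);
    size_t seen = 0;
    for (size_t i = 0; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) {
            if (seen == index) {
                return i;
            }
            seen++;
        }
    }
    return seen == index ? len : (size_t)-1;
}
```

Splitting out the bytes for a code point range [start, end) is then two offset
lookups, and a batched variant could resolve several indices in the same pass.
No code point values are ever computed.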
However, I also think that the IoT use-case is a bit strawmanny, given that
IoT devices would rarely have to deal with markup or other rich human-facing
formats which require decoding of such codepoint references.
Thus ... I don’t buy this argument. Devices which render markup or references
would have to deal with complexity way beyond this. And they’ll have to do the
decoding anyway to do some kind of text rendering.