[Standards] Proposed XMPP Extension: Character counting in message bodies

Jonas Schäfer jonas at wielicki.name
Tue Dec 8 21:32:47 UTC 2020


On Friday, 4 December 2020 21:33:38 CET Sam Whited wrote:
> On Fri, Dec 4, 2020, at 20:23, Florian Schmaus wrote:
> > I begin to feel that a lot of your rationale is based on the idea that
> > you always (/often?) have access to the raw UTF-8 bytes as they
> > appeared on the wire.
> 
> Yes, most of it is.
> 
> > While this is probably true for languages where the String type's
> > native encoding is also UTF-8, it is usually not true for others. For
> > example, widely used XML parsers in Java will return Java's String
> > type, which is UTF-16 (or ISO-8859-1 [1]) based.
> 
> Yes, this is fair, I was thinking you could probably always get the raw
> bytes, but it does look like a lot of these *only* do DOM based parsing
> and don't keep the original representation.

This has nothing to do with DOM vs. whatever. SAX can also give you the data 
in the format which is described by the XML model (code points).

So it appears there are two sides, and arguing from the point of view of 
programming languages will always leave us with those who get the raw 
representation of the data on the wire (C-ish things) and those who get the 
high-level representation.

Thus, I propose that we stick with what the standards offer. XMPP is based on 
XML in that all data exchanged is somehow wrapped in XML. XML specifies that 
all character data (text) is a sequence of Unicode code points. The encoding 
on the wire is irrelevant once the XML has been decoded; on the *abstract* 
layer, XML provides sequences of code points, nothing else. 
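
As a rough illustration of that point, here is a minimal sketch using 
Python's standard xml.sax module. The names TextCollector and body_text are 
made up for this example, and the UTF-16 document is only there to contrast 
wire encodings (XMPP itself uses UTF-8). The parser should hand the 
application the same sequence of code points for both documents.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler: collects all character data from the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # SAX may deliver text in several pieces; each piece is a str,
        # i.e. a sequence of code points, not raw bytes.
        self.chunks.append(content)


def body_text(document: bytes) -> str:
    """Parse an XML document given as bytes and return its character data."""
    handler = TextCollector()
    xml.sax.parseString(document, handler)
    return "".join(handler.chunks)


# Same text, including one code point outside the BMP.
text = "na\u00efve \U0001F600"
template = '<?xml version="1.0" encoding="%s"?><body>%s</body>'

# Two different wire encodings of the same abstract document.
utf8_doc = (template % ("UTF-8", text)).encode("utf-8")
utf16_doc = (template % ("UTF-16", text)).encode("utf-16")  # includes a BOM

# The byte streams differ, the decoded character data should not.
print(utf8_doc == utf16_doc)
print(body_text(utf8_doc) == body_text(utf16_doc))
```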

Some libraries always convert to UTF-8 (libxml2), some bindings always offer 
some kind of Unicode code points (e.g. Python, which opportunistically chooses 
ASCII/UCS-2/UCS-4 depending on the data), some bindings may even expose the 
raw bytes and let the user deal with it (I think there was/is a zero-copy 
implementation which mostly consisted of strategically replacing XML 
metacharacters with NUL bytes in the incoming data).
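
To make the difference between these representations concrete, the 
following sketch (the function name lengths is just for illustration) 
reports the size of the same text in the three units that keep coming up in 
this thread. Only the code point count is independent of the encoding a 
library happens to use internally.

```python
def lengths(text: str) -> dict:
    """Return the length of `text` in three different units."""
    return {
        # Python's str is a sequence of code points, so len() counts those.
        "code_points": len(text),
        # What a C program reading the wire would see.
        "utf8_bytes": len(text.encode("utf-8")),
        # What e.g. a Java String reports as its length (2 bytes per unit).
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


# One code point each from the 1-, 2-, 3- and 4-byte UTF-8 ranges.
print(lengths("a\u00e4\u20ac\U0001F600"))
# {'code_points': 4, 'utf8_bytes': 10, 'utf16_code_units': 5}
```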

But all implementations which want to be XMPP and XML 1.0 compliant need to 
have some way to convert or offer access to code points, as that’s the XML data 
model. Let’s build on that.

Easy choice.

Much easier than writing 20 emails on this topic, and that just in this 
thread.

> > However, given that there is a wide variety here, I am not sure if it
> > is worth taking any of that into consideration.
> 
> Yes, fair enough.
> 
> > Instead, my rationale is based on the idea that you always have
> > access to the Unicode code points of the textual content obtained
> > from the XML.
> 
> I do not have that access without converting from UTF-8 to code points
> in the hot-path where it would be inappropriate. It's effectively the
> same thing: I don't want to convert from bytes to code points, you don't
> want to convert from code points to bytes. Some languages will have to do
> the conversion either way, so it seems worth using the thing that allows
> for the most flexibility with the least amount of work in e.g. IoT
> devices using C that are trying to optimize for performance where
> passing along the bytes as received on the wire (possibly with some
> validation that the range is accurate) is acceptable.

Note that you do not have to decode UTF-8 (which can be between O(n) and 
O(n^2) depending on the implementation and circumstances) to count code 
points; you can certainly do the counting in O(n) by counting the bytes which 
are not continuation bytes (which costs the same as strlen() in C). It would 
be similarly easy to write algorithms for efficient, batched, 
code-point-indexed operations on UTF-8 strings in C (such as splitting UTF-8 
byte ranges based on start/end information), if you really wanted to do such 
things in C.
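
A minimal sketch of both ideas, written in Python over raw bytes for brevity 
(the same loops translate directly to C). The function names are invented 
for this example, and the input is assumed to be well-formed UTF-8, which an 
XML parser has to verify anyway.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it.

    Every code point has exactly one byte that is not a continuation
    byte (continuation bytes have the bit pattern 10xxxxxx).
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def byte_range(data: bytes, start: int, end: int) -> tuple:
    """Map the code point range [start, end) to byte offsets in one pass."""
    if not 0 <= start <= end:
        raise ValueError("need 0 <= start <= end")
    byte_start = byte_end = None
    seen = 0  # code points whose first byte has been passed so far
    for offset, b in enumerate(data):
        if (b & 0xC0) != 0x80:  # first byte of a new code point
            if seen == start:
                byte_start = offset
            if seen == end:
                byte_end = offset
                break
            seen += 1
    # A boundary may also sit exactly at the end of the data.
    if byte_start is None and seen == start:
        byte_start = len(data)
    if byte_end is None and seen == end:
        byte_end = len(data)
    if byte_start is None or byte_end is None:
        raise IndexError("code point range exceeds the data")
    return byte_start, byte_end


data = "a\u20ac\U0001F600b".encode("utf-8")  # 1 + 3 + 4 + 1 bytes
print(count_code_points(data))   # 4
print(byte_range(data, 1, 3))    # (1, 8)
```

Slicing data[1:8] then yields the UTF-8 bytes of exactly the second and 
third code points, without the text ever having been decoded.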

However, I also think that the IoT use-case is a bit strawmanny, given that 
IoT devices would rarely have to deal with markup or other rich human-facing 
formats which require decoding of such code point references.

Thus ... I don’t buy this argument. Devices which render markup or references 
would have to deal with complexity way beyond this. And they’ll have to do the 
decoding anyway to do some kind of text rendering.

kind regards,
Jonas