[Standards] Proposed XMPP Extension: Character counting in message bodies

Sam Whited sam at samwhited.com
Fri Dec 4 16:41:24 UTC 2020

On Fri, Dec 4, 2020, at 16:10, Florian Schmaus wrote:
> XMPP uses Unicode because XML, upon which XMPP is built, uses Unicode,
> hence I doubt that you will ever find an API where e.g.
> Message.getBody() will return data that is not Unicode encoded, but
> uses some other encoding scheme.

Wasn't that what you were saying? It might be UTF-16 in a JavaScript
implementation, or it might be <non-unicode> in a future where Unicode
is no longer the dominant way of representing characters, or in an East
Asian protocol where Unicode isn't always used (again, I have no
examples of this; I've just been told that other encodings are
frequently used in China and thereabouts, but I can't verify this).

Even among languages that do use Unicode, not all of them provide easy
access to code points. The language itself may not support UTF-8
directly, for example, and may always return bytes, at which point the
user would have to load a UTF-8 package and parse the runes out. Or it
may be using a general-purpose XML library that checks the XML
declaration to find the character encoding. At the application level we
know that it will always be UTF-8, but the XML library doesn't know
that, so it always returns bytes (this one is a real example that I deal
with a lot).

> So, I am sorry, but I do not see your point. Furthermore, the strings
> of all modern programming languages I am aware of allow you to
> derive the Unicode code points they consist of. And from those code
> points one can derive grapheme clusters.

People implement XMPP in languages that aren't modern too. That's
partially a joke, but jokes aside, you're assuming the language will
have some form of encoded string. As with the XML library I mentioned
before, I can imagine many languages won't return a string at all and
will always return bytes. Even if the language has UTF-8 encoded
strings, we might want to return bytes for efficiency (bytes can be
appended to an existing buffer that gets reused; strings are immutable
and therefore require expensive allocations). In Go in particular, which
I write a lot of, I expect this to be the case (I would want to return
bytes to give the user the option of figuring out what to do with
them).
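
A minimal sketch of the buffer-reuse pattern I have in mind, in the
style of the standard library's strconv.AppendInt. The message type and
its AppendBody method are hypothetical and do not come from any existing
XMPP library.

```go
package main

import "fmt"

// message is a hypothetical stanza type that keeps its body as raw bytes.
type message struct {
	body []byte
}

// AppendBody appends the body to dst and returns the extended slice.
// If dst has enough capacity, no new allocation is needed.
func (m *message) AppendBody(dst []byte) []byte {
	return append(dst, m.body...)
}

func main() {
	msgs := []*message{
		{body: []byte("hello")},
		{body: []byte("h\u00e9llo")},
	}

	buf := make([]byte, 0, 64) // allocated once, reused for every message
	for _, m := range msgs {
		buf = m.AppendBody(buf[:0]) // reset length, keep capacity
		fmt.Println(len(buf))       // the caller decides how to count or decode
	}
}
```

A method returning a string would instead have to produce a new
immutable value for each message.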

Bytes are the only way to not make assumptions about the libraries,
languages, etc. being used.

