[Standards] Proposed XMPP Extension: Character counting in message bodies

Florian Schmaus flo at geekplace.eu
Fri Dec 4 14:50:36 UTC 2020


On 12/4/20 3:27 PM, Sam Whited wrote:
> FWIW I was a big proponent of doing it this way too, but I've changed my
> mind after seeing too many grapheme segmentation implementations be
> broken in small, different, ways. My new position is that we have to
> just count bytes and figure out a sane behavior in case someone sends us
> an invalid offset in the middle of a codepoint or something. This is
> encoding agnostic (not that it matters for XMPP) and makes it very easy
> to count: go to that byte offset, check if we're on any sort of UTF-8
> boundary, if so call it a day, if not do whatever the fallback is.
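
For illustration, the byte-offset check described in the quote could be sketched roughly as follows. This is only a sketch: the function name is hypothetical, and it assumes the buffer holds valid UTF-8 and that the fallback for a bad offset is left to the caller.

```python
def is_utf8_boundary(data: bytes, offset: int) -> bool:
    """Return True if `offset` does not point into the middle of a code point.

    Hypothetical helper; assumes `data` is valid UTF-8.
    """
    if offset < 0 or offset > len(data):
        return False  # out of range: the caller applies its fallback
    if offset == len(data):
        return True  # the end of the buffer is always a boundary
    # UTF-8 continuation bytes have the bit pattern 10xxxxxx.
    return (data[offset] & 0xC0) != 0x80


body = "na\u00efve".encode("utf-8")  # "naïve": the ï occupies two bytes
print(is_utf8_boundary(body, 2))  # offset of the first byte of ï
print(is_utf8_boundary(body, 3))  # offset of the continuation byte of ï
```

Note that this only detects code point boundaries; an offset that passes the check can still fall in the middle of a grapheme cluster.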

Counting bytes also reads like it is mixing multiple independent layers: 
the bytes on the wire and the data you receive in the higher layers. For 
example, your XMPP API may provide a method Message.getBody() that returns 
a String. That String will be represented in your programming language's 
native String representation, which may or may not match the bytes on the 
wire.
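
A small sketch of the mismatch, using Python only as an example language (the sample body is made up): the same text has a different length depending on whether you count UTF-8 bytes on the wire, UTF-16 code units (what a Java String counts), or code points (what a Python str counts).

```python
# Hypothetical message body: "naïve" followed by a thumbs-up emoji.
body = "na\u00efve \U0001F44D"

utf8_bytes = len(body.encode("utf-8"))            # bytes as sent on the wire
utf16_units = len(body.encode("utf-16-le")) // 2  # UTF-16 code units
code_points = len(body)                           # Unicode code points

print(utf8_bytes, utf16_units, code_points)  # prints: 11 8 7
```

An offset computed in one of these units is therefore not directly usable in a layer that counts in another.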

I do not know of any alternative: grapheme cluster counting is the only 
sound way to achieve interoperability without excluding our friends from 
all over the world and their characters, which is important to me.

However, I have a counter-proposal that goes in a similar direction to 
yours: even if the specification asks for grapheme clusters, there is 
nothing wrong with falling back to character counting if you haven't 
implemented grapheme cluster counting (yet). I would expect that it will 
just work most of the time (for users of the Arabic alphabet).

While this in no way allows for sound interoperability, it is some 
sort of opportunistic interoperability.
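
A minimal sketch of what such a fallback could look like. Everything here is an assumption for illustration: the function name, the idea of passing in an optional segmenter, and reading "character counting" as counting Unicode code points.

```python
from typing import Callable, Optional, Sequence


def count_characters(
    body: str,
    segmenter: Optional[Callable[[str], Sequence[str]]] = None,
) -> int:
    """Count grapheme clusters if a segmenter is available, else code points.

    `segmenter` is a hypothetical callable that splits a string into
    grapheme clusters, e.g. one backed by a Unicode segmentation library.
    """
    if segmenter is not None:
        return len(segmenter(body))
    # Opportunistic fallback: one code point per "character".
    return len(body)


print(count_characters("hello"))    # same result as grapheme counting
print(count_characters("e\u0301"))  # "e" + combining acute: fallback overcounts
```

The fallback agrees with grapheme cluster counting whenever each user-perceived character is a single code point, and diverges for combining sequences and many emoji.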

- Florian
