[Standards] Proposed XMPP Extension: Character counting in message bodies
flo at geekplace.eu
Fri Dec 4 14:50:36 UTC 2020
On 12/4/20 3:27 PM, Sam Whited wrote:
> FWIW I was a big proponent of doing it this way too, but I've changed my
> mind after seeing too many grapheme segmentation implementations be
> broken in small, different, ways. My new position is that we have to
> just count bytes and figure out a sane behavior in case someone sends us
> an invalid offset in the middle of a codepoint or something. This is
> encoding agnostic (not that it matters for XMPP) and makes it very easy
> to count: go to that byte offset, check if we're on any sort of UTF-8
> boundary, if so call it a day, if not do whatever the fallback is.
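
For reference, the byte-offset check described in the quote could be sketched roughly as follows. This is only an illustration under my own assumptions: the function name is_utf8_boundary is hypothetical, the input is assumed to be valid UTF-8, and the fallback behavior is left out because the quote does not define it.

```python
def is_utf8_boundary(data: bytes, offset: int) -> bool:
    """Return True if `offset` falls between two code points of `data`.

    Assumes `data` is valid UTF-8. Continuation bytes have the bit
    pattern 10xxxxxx, so any other byte starts a new code point.
    """
    if offset < 0 or offset > len(data):
        return False          # offset outside the buffer
    if offset == len(data):
        return True           # end of the buffer is always a boundary
    return (data[offset] & 0xC0) != 0x80


wire = "na\u00efve".encode("utf-8")   # the "ï" occupies bytes 2 and 3
print([is_utf8_boundary(wire, i) for i in range(len(wire) + 1)])
```

Note that this only detects code point boundaries. An offset can pass this check and still fall in the middle of a grapheme cluster.
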
This also reads like it mixes independent layers: the bytes on the wire
versus the data you receive in the higher layers. For example, your XMPP
API may provide a method Message.getBody() that returns a String. That
String uses your programming language's native string representation,
which may or may not match the bytes on the wire.
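
To illustrate the layering issue, here is a small sketch of how the same body yields different counts depending on the representation you look at. The sample string is made up for illustration. The UTF-16 count is what a language with UTF-16 strings (such as Java) would report as the string length.

```python
# Hypothetical message body: "naïve" followed by a space and an emoji.
body = "na\u00efve \U0001F600"

utf8_bytes = len(body.encode("utf-8"))            # bytes on the wire
utf16_units = len(body.encode("utf-16-le")) // 2  # UTF-16 code units
code_points = len(body)                           # Unicode code points

print(utf8_bytes, utf16_units, code_points)       # prints: 11 8 7
```

An offset that is valid in one of these representations does not, in general, refer to the same position in another.
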
As I do not know of any alternative, grapheme cluster counting is the
only sound way to achieve interoperability without excluding our friends
from all over the world and their characters, which is important to me.
However, I have a counter-proposal that goes in a similar direction as
yours: even if the specification asks for grapheme clusters, there is
nothing wrong with falling back to character counting if you haven't
implemented grapheme cluster counting (yet). I would expect that it will
just work most of the time (for users of the Latin alphabet).
While this in no way allows for sound interoperability, it does provide
some sort of opportunistic interoperability.
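
A minimal sketch of such a fallback, under my own assumptions: the function name count_characters is hypothetical, and the third-party "regex" package (whose \X pattern matches one grapheme cluster) stands in for whatever segmentation library an implementation would actually use. If it is not installed, the code falls back to counting code points.

```python
try:
    import regex  # third-party package, assumed here; provides \X
except ImportError:
    regex = None  # no grapheme segmentation available


def count_characters(body: str) -> int:
    """Count grapheme clusters if possible, else count code points."""
    if regex is not None:
        # \X matches one grapheme cluster.
        return len(regex.findall(r"\X", body))
    # Opportunistic fallback: one code point per "character".
    return len(body)


print(count_characters("hello"))    # same result on both paths
print(count_characters("e\u0301"))  # "e" + combining acute: paths differ
```

For plain unaccented text both paths agree. They diverge as soon as combining marks or multi-code-point emoji appear, which is exactly where the fallback stops being interoperable.
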