[Standards] Proposed XMPP Extension: Character counting in message bodies

Florian Schmaus flo at geekplace.eu
Fri Dec 4 20:53:28 UTC 2020


On 12/4/20 9:33 PM, Sam Whited wrote:
>> And I am in favor of code points because it allows us to aim for the
>> extended grapheme cluster algorithm, while also allowing for the
>> "simply count code points" fallback.
> 
> If you do bytes you could also easily convert to codepoints and then to
> grapheme clusters. It also allows for the simple "count codepoints" or
> "count bytes" fallback.

If you count the bytes of the UTF-8 encoded representation, then there 
is no way to have any fallback (as the indexes would be wrong).

Maybe an example can illustrate where I see the advantage of
counting graphemes/code points over counting the bytes of the UTF-8
encoded representation. Consider the following text:

Über

Code points: U+00DC U+0062 U+0065 U+0072
Graphemes:   (U+00DC) (U+0062) (U+0065) (U+0072)
UTF-8 bytes: c3 9c 62 65 72
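
To double-check the rows above, here is a minimal sketch in Python (the
language choice is my assumption; nothing in this thread prescribes one).
It lists the code points and the UTF-8 bytes of the example string, using
the precomposed U+00DC:

```python
# "Über" written with the precomposed code point U+00DC.
text = "\u00dcber"

# One entry per code point, formatted as U+XXXX.
code_points = [f"U+{ord(ch):04X}" for ch in text]

# Hex dump of the UTF-8 encoding, one byte per group (Python 3.8+).
utf8_bytes = text.encode("utf-8").hex(" ")

print(code_points)
print(utf8_bytes)
```

The string has four code points but five UTF-8 bytes, since U+00DC takes
two bytes in UTF-8.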

Assume we want to provide the coordinates for the span that consists of
the first two letters, e.g.:

Über
^^

Then, with a zero-indexed scheme where start is inclusive and end is
exclusive, you end up with

start=0
end=3

if you count bytes.

But you end up with

start=0
end=2

regardless of whether you count code points or graphemes.

This is, of course, because in the example the number of code points and
graphemes is identical. But this allows developers to easily bootstrap
this scheme by simply counting code points in the beginning. I wouldn't
be surprised if that worked so well that they never even switch
to grapheme counting.

- Florian
