[Standards] Proposed XMPP Extension: Character counting in message bodies
flo at geekplace.eu
Fri Dec 4 20:53:28 UTC 2020
On 12/4/20 9:33 PM, Sam Whited wrote:
>> And I am in favor of code points because it allows us to aim for the
>> extended grapheme cluster algorithm, while also allowing for the
>> "simply count code points" fallback.
> If you do bytes you could also easily convert to codepoints and then to
> grapheme clusters. It also allows for the simple "count codepoints" or
> "count bytes" fallback.
If you count the bytes of the UTF-8 encoded representation, then there
is no way to have any fallback (as the indexes would be wrong).
Maybe an example can illustrate where I see the advantage of
counting graphemes/code points over counting the bytes of the UTF-8
encoded representation. Consider the following text:
Code points: U+00DC U+0062 U+0065 U+0072
Graphemes: (U+00DC) (U+0062) (U+0065) (U+0072)
UTF-8 bytes: c3 9c 62 65 72
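The code point and UTF-8 rows can be recomputed with a few lines of Python. This is only a sketch: it assumes the example text is the precomposed string "Über" (U+00DC followed by "ber"), and the variable names are illustrative.

```python
# Example text, assumed to be precomposed: U+00DC U+0062 U+0065 U+0072
text = "\u00dcber"

# One entry per Unicode code point
code_points = [f"U+{ord(ch):04X}" for ch in text]

# UTF-8 encoded representation, one hex pair per byte
utf8_bytes = text.encode("utf-8")
utf8_hex = utf8_bytes.hex(" ")

print("Code points:", " ".join(code_points))
print("UTF-8 bytes:", utf8_hex)
print(len(text), "code points,", len(utf8_bytes), "bytes")
```

U+00DC takes two bytes in UTF-8, so the text has 4 code points but 5 bytes.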
Assume we want to provide the coordinates for the span that consists of
the first two letters, i.e. "Üb".
Then, with a zero-indexed scheme where start is inclusive and end is
exclusive, you end up with
start 0, end 3
if you count bytes.
But you end up with
start 0, end 2
regardless of whether you count code points or graphemes.
This is, of course, because in the example the number of code points and
graphemes is identical. But this allows developers to easily bootstrap
this scheme by simply counting code points in the beginning. I wouldn't
be surprised if it worked so well that they never even switched
to grapheme counting.
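As a rough sketch of the argument, the Python snippet below computes the span of the first two letters of "Über" in UTF-8 bytes and in code points, and shows what happens when code point indexes are applied to the bytes. It then repeats the count for the decomposed spelling (U+0055 U+0308 ...), where code points and graphemes no longer coincide. The helper names are hypothetical, and counting non-combining characters is only a crude stand-in for the extended grapheme cluster algorithm, not an implementation of it.

```python
import unicodedata


def byte_span(text, start, end):
    """Convert a code point span (end exclusive) to a UTF-8 byte span."""
    return (len(text[:start].encode("utf-8")), len(text[:end].encode("utf-8")))


def approx_grapheme_count(text):
    """Crude approximation: count code points that are not combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


composed = "\u00dcber"                               # U+00DC U+0062 U+0065 U+0072
decomposed = unicodedata.normalize("NFD", composed)  # U+0055 U+0308 U+0062 ...

# First two letters, zero-indexed, end exclusive
print(byte_span(composed, 0, 2))   # (0, 3) when counting UTF-8 bytes
print(composed[0:2])               # slicing a str works in code points

# Applying the code point indexes to the bytes cuts in the wrong place:
# bytes 0..2 cover only the first letter
print(composed.encode("utf-8")[0:2].decode("utf-8"))

# Code points vs. approximate graphemes, composed and decomposed
print(len(composed), approx_grapheme_count(composed))
print(len(decomposed), approx_grapheme_count(decomposed))
```

For the precomposed text both counts are 4, which is why the code point fallback gives the same indexes as grapheme counting here. For the decomposed text the code point count rises to 5 while the approximate grapheme count stays at 4, which is the case where the fallback and proper grapheme counting would disagree.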