[Standards] Proposed XMPP Extension: Character counting in message bodies

Marvin W xmpp at larma.de
Wed Dec 9 11:02:31 UTC 2020


Hi,

On 09.12.20 08:59, Florian Schmaus wrote:
> But the recipient would be able to apply the same rules regarding
> localization as the sender when counting grapheme clusters.

Which rules? Unicode does not provide a locale-specific grapheme
clustering algorithm. TR29 only mentions that such algorithms exist and
that it provides a "default" algorithm that can be extended with
locale-specific rules. AFAIK there is no standard that properly defines
grapheme clustering other than the TR29 algorithm, which explicitly
states that it does not produce proper locale-specific grapheme
clusters. The only thing we can do is say "do what TR29 says" (it
actually gives two options, but let's just stick with extended grapheme
clusters). However, TR29 itself does not make any statements regarding
its stability, and Unicode updates in recent years did change TR29
behavior even for existing codepoints. Thus, if we rely on the TR29
algorithm we need to specify a version of it, which in general is a bad
idea.
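
As a rough illustration of why the counting unit matters, here is a
minimal Python sketch. It uses only the standard library, which has no
TR29 implementation, so it shows codepoints and encoded lengths rather
than grapheme clusters. The helper name `counts` is made up for this
example.

```python
import unicodedata


def counts(text: str) -> dict:
    """Return the length of `text` in three version-independent units."""
    return {
        "codepoints": len(text),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        "utf8_bytes": len(text.encode("utf-8")),
    }


# 'e' followed by U+0301 COMBINING ACUTE ACCENT: usually perceived as one character.
decomposed = "e\u0301"
# man + ZWJ + woman + ZWJ + girl: usually rendered as a single family emoji.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(counts(decomposed))  # {'codepoints': 2, 'utf16_units': 2, 'utf8_bytes': 3}
print(counts(family))      # {'codepoints': 5, 'utf16_units': 8, 'utf8_bytes': 18}

# The Unicode version this interpreter was built with; any grapheme
# cluster count would depend on the equivalent version of a TR29 library.
print(unicodedata.unidata_version)
```

None of these three counts change with the Unicode version, whereas a
grapheme cluster count for the same strings can.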

> I also suggest that the receiving side is considered. For example:
> "Entities that receive character counted text should normalize the
> counted text to Unicode Normalization Form C (NFC) [1] form prior
> evaluating the character indexes."

As I mentioned earlier, normalizing changes the codepoints and thus
(at the XML layer) changes the transferred content. In my tests, I
haven't seen any current server implementation doing that. Worst case,
normalizing can result in messages becoming unreadable to the receiving
client that otherwise would have been readable (if the server has a
newer Unicode version than both clients' fonts). So instead of adding
client-side behavior to handle servers doing modifications, I'd rather
codify that servers SHOULD NOT modify the codepoints in <body>. Where we
put this rule is another question.

In my draft I specifically had the rule that if an entity applies
normalization, it has to update the indices if needed. This also
applies to receiving entities, which is incompatible with what you wrote
(or at least I understand that you want to normalize without updating
indices).
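
A minimal sketch of what "updating the indices" could look like, in
Python. The function name `remap_index` is hypothetical, and the
approach assumes that the index does not fall inside a sequence that
normalization combines or reorders. An index between a base letter and
its combining mark has no exact counterpart after NFC.

```python
import unicodedata


def remap_index(text: str, index: int, form: str = "NFC") -> int:
    """Map a codepoint index in `text` to the index in the normalized text.

    Assumption: `index` sits on a normalization boundary, so normalizing
    the prefix alone gives the same result as normalizing the whole
    string and cutting it at the corresponding position.
    """
    if not 0 <= index <= len(text):
        raise ValueError("index out of range")
    return len(unicodedata.normalize(form, text[:index]))


original = "Cafe\u0301 ok"            # 'e' + combining acute accent
normalized = unicodedata.normalize("NFC", original)

old_index = original.index(" ")       # 5 in the original codepoints
new_index = remap_index(original, old_index)

print(old_index, new_index)           # 5 4
print(normalized[new_index] == " ")   # True
```

The result is only as reliable as the local normalization: an entity
with an older Unicode database may leave newer codepoints untouched and
compute different indices.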

Here is the rationale behind that:
Normalization as per TR15 is considered stable, which means that as long
as a string only uses codepoints that are defined in the Unicode version
your code uses, a string your code has normalized will still be
considered normalized by any future Unicode/TR15 version. In other
words, to ensure your client only sends normalized strings (which you
would need to, so that any other entity can apply normalization without
changing indices), you'd have to restrict your client to only send
codepoints that are defined in the Unicode version it supports.
However, in practice, users have been sending codepoints that are not
part of the Unicode version implemented in their clients. This is
because you can practically use new emojis (and their codepoints) as
soon as they appear in popular fonts.

Just to give an example: to support the latest emojis in Android apps,
you can use the "EmojiCompat" support library (which includes a font
with all emojis of the latest version) and thereby become able to
display them. However, the supported Unicode version for all text
processing still remains the version implemented by the ICU4J version
shipped with the operating system. About 60% of Android devices
currently in use run Android 9 or earlier and thus implement Unicode
10.0 or earlier (which was released mid-2017). So 60% of Android devices
would not be able to correctly normalize messages that include the 🦠
microbe emoji. In practice, therefore, sending clients cannot guarantee
to send normalized strings without severely harming user experience by
not accepting new codepoints. This also means that receiving clients
cannot rely on receiving normalized messages or messages where indices
refer to normalized messages.
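
The effect can be approximated in Python, whose `unicodedata` module
also ships a frozen Unicode 3.2 database (`unicodedata.ucd_3_2_0`). In
this sketch that old database stands in for an outdated client; it is
not the ICU4J setup described above, and the helper name
`unassigned_codepoints` is made up for the example.

```python
import unicodedata


def unassigned_codepoints(text: str, ucd=unicodedata) -> list:
    """List the codepoints in `text` that the given database does not know.

    General category "Cn" covers unassigned codepoints (and
    noncharacters), so the database has no normalization data for them.
    """
    return [ch for ch in text if ucd.category(ch) == "Cn"]


microbe = "\U0001F9A0"  # MICROBE, added in Unicode 11.0
message = "hi " + microbe

# Stand-in for a client with an old Unicode database.
old_db = unicodedata.ucd_3_2_0

unknown = unassigned_codepoints(message, ucd=old_db)
print([f"U+{ord(ch):04X}" for ch in unknown])  # ['U+1F9A0']
```

A client that wanted to guarantee normalized output would have to
reject any message for which this list is non-empty, which is exactly
the user experience problem described above.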

Marvin

