[Standards] Proposed XMPP Extension: Character counting in message bodies

Florian Schmaus flo at geekplace.eu
Fri Dec 4 14:15:07 UTC 2020

On 12/4/20 3:03 PM, Andrew Nenakhov wrote:
> Upping a year-old email thread for Florian.

Thanks, but I am well aware of the thread and the situation.

I think the text below mixes aspects of the XML layer with the Unicode layer,
which do not have to get mixed when counting "characters". Ultimately 
what you get out of the textual representation of the <body/> element is 
a sequence of grapheme clusters (identified via extended grapheme 
clustering algorithm). Those are the entities that eventually should get
counted.

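To illustrate why the choice of counting unit matters, here is a minimal sketch using only the Python standard library. The helper name count_units is made up for this example. It compares code points, UTF-8 bytes and UTF-16 code units; it does not implement extended grapheme clustering, which is not available in the standard library and would need a separate Unicode segmentation library.

```python
# Compare several ways of "counting characters" in a message body.
# Assumption: count_units is a hypothetical helper, not part of any XEP.
# Extended grapheme cluster segmentation (UAX #29) is NOT shown here.

def count_units(text: str) -> dict:
    return {
        "code_points": len(text),                           # Unicode code points
        "utf8_bytes": len(text.encode("utf-8")),            # bytes in UTF-8
        "utf16_units": len(text.encode("utf-16-le")) // 2,  # UTF-16 code units
    }

# "e" followed by U+0301 COMBINING ACUTE ACCENT: one user-perceived character.
decomposed = "e\u0301"
# U+1F600 GRINNING FACE: a single code point outside the Basic Multilingual Plane.
emoji = "\U0001F600"

print(count_units(decomposed))  # {'code_points': 2, 'utf8_bytes': 3, 'utf16_units': 2}
print(count_units(emoji))       # {'code_points': 1, 'utf8_bytes': 4, 'utf16_units': 2}
```

Both strings would typically be perceived as a single character, yet none of the three counts agrees on 1 for both of them.
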
Reply containing rant about how impractical grapheme cluster counting is 
in 3, 2, 1… :)

- Florian

> On Wed, 18 Dec 2019 at 20:41, Marvin W <xmpp at larma.de> wrote:
>> [inline]
>> On 12/18/19 3:22 PM, Andrew Nenakhov wrote:
>>> In the end we have settled for counting characters of escaped string, so
>> This sounds like a terrible idea. In encoded XML, ">", "&#x3E;", "&gt;"
>> and "<![CDATA[>]]>" are equivalent. I just tried it out and servers
>> indeed do convert all of those to their shortest well-formed variant
>> (which is ">") so you cannot rely on their reference length at all.
>> Servers may at their discretion convert non-ascii characters to their
>> character reference form (starting with &#). I have seen this at least
>> once happening with emojis.
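The equivalence of these encodings can be checked with a small sketch, assuming Python's standard xml.etree.ElementTree parser and a bare <body> element without the usual XMPP namespace. The variable and function names are illustrative only.

```python
import xml.etree.ElementTree as ET

# Four serializations of a body whose text content is the single character ">".
variants = [
    "<body>></body>",              # literal character
    "<body>&gt;</body>",           # predefined entity reference
    "<body>&#x3E;</body>",         # hexadecimal character reference
    "<body><![CDATA[>]]></body>",  # CDATA section
]

def body_text(xml_source: str) -> str:
    """Parse one serialization and return the element's text content."""
    return ET.fromstring(xml_source).text

for source in variants:
    # The serialized length differs each time; the parsed text does not.
    print(len(source), repr(body_text(source)))
```

Every variant parses to the same one-character string, while the serialized lengths all differ, so an offset computed on the escaped form is not stable across re-serialization.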
>>> to draw *&&&* in a client we count it as a string with a length of 15,
>>> thus <bold> reference points to characters 0..14:
>>> <reference xmlns="urn:xmpp:reference:0" begin="0" end="14"
>>> type="markup"><bold /></reference>
>> Luckily for you, this looks pretty non-standard, so you don't have to
>> deal with your implementation being incompatible with others. Also as
>> soon as XEP-0372 actually becomes more stable, you are technically
>> not standards-compliant, because there is no <bold /> element defined for
>> the namespace "urn:xmpp:reference:0". You are apparently mixing XEP-0372
>> and XEP-0394.
>> That's also a weird way of counting; usually I would expect end to point
>> to the position after the last referenced character - at least that's
>> what you do in most programming languages (e.g. "&amp;&amp;&amp;"[0:14]
>> will give you "&amp;&amp;&amp" without the last ";").
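The numbers in this exchange can be reproduced with a short sketch, assuming Python and its standard xml.sax.saxutils.escape helper as a stand-in for whatever escaping the client performs:

```python
from xml.sax.saxutils import escape

text = "&&&"            # what the user sees
escaped = escape(text)  # "&amp;&amp;&amp;", the XML-escaped form

print(len(text))      # 3
print(len(escaped))   # 15

# Python slices are half-open: [0:14] covers indices 0..13 only,
# so the final ";" at index 14 is left out.
print(escaped[0:14])  # &amp;&amp;&amp
print(escaped[0:15])  # &amp;&amp;&amp;
```

With an inclusive end (as in begin="0" end="14") the range covers all 15 escaped characters; with the half-open convention the same range would need end="15".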
