[Standards] Proposed XMPP Extension: Character counting in message bodies

Ralph Meijer ralphm at ik.nu
Fri Dec 20 12:02:03 UTC 2019

Oops, the following should have been sent to the list.

On 19-12-2019 15:02, Ralph Meijer wrote:
> On 19-12-2019 13:59, Andrew Nenakhov wrote:
>> On Wed, 18 Dec 2019 at 20:12, Ralph Meijer <ralphm at ik.nu> wrote:
>>     My assumption was that we are looking at character data on the
>>     abstract layer /after/ parsing XML. You shouldn't see entities there
>>     (they'd be resolved to their respective characters), nor should you
>>     see <![CDATA[...]]> wrappers.
>> Hm, please define the 'abstract' layer more precisely. Taking the
>> example from the XEP proposal, which is the true abstract layer:
>> this, image.png, or this: image.png? Or the layer with 'codepoints'?
>> Is it really any better than escaped XML text?
>> This approach is also not very practical. When you process stanzas
>> on a server, most often you just take the stanza as is, passing all
>> references data along without converting it to the abstract layer
>> and back. Plus, for a web client this means an additional
>> escaping/unescaping routine every time something is sent or
>> displayed, because browsers require their own escaping.
> Abstract as in the abstract sequence of characters after parsing, 
> however represented by your programming language. If I parse an XML 
> document <blah><![CDATA[less < more]]></blah>, and request the text for 
> the `blah` node, I get an object that encodes the abstract sequence of 
> characters: `less < more`. In Python, for example, that'd be 
> represented by a unicode string object.
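
A minimal sketch of the point above, assuming Python's standard-library
xml.etree.ElementTree as the parser (the thread names Python but no
particular parser). The entity-escaped variant is added here only for
comparison. Both spellings should come out of the parser as the same
abstract sequence of characters:

```python
import xml.etree.ElementTree as ET

# Two wire-level spellings of the same character data.
cdata_doc = "<blah><![CDATA[less < more]]></blah>"
entity_doc = "<blah>less &lt; more</blah>"

# After parsing, neither the CDATA wrapper nor the entity is visible.
cdata_text = ET.fromstring(cdata_doc).text
entity_text = ET.fromstring(entity_doc).text

print(repr(cdata_text))                  # 'less < more'
print(cdata_text == entity_text)         # True
print(len(cdata_text))                   # 11
```

Counting on `cdata_text` rather than on the serialized stanza is what
makes the result independent of how the sender chose to escape it.
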
> See also https://www.unicode.org/versions/Unicode12.1.0/ch03.pdf#G2212 
> for various definitions around characters, code points, glyphs, 
> graphemes, and the like. So yes, you'd be counting ZWJs and such for 
> your example, and I think it tallies up to 7 for just man/man/boy/boy, 
> without Fitzpatrick modifiers, hair variations, hair color, direction.
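
To check the tally of 7, here is a small sketch. It assumes the example
is the man/man/boy/boy family emoji built as a ZWJ sequence (U+1F468,
U+1F466 joined by U+200D); the variable names are made up for
illustration. The UTF-8 and UTF-16 figures are shown only to contrast
with counting code points:

```python
ZWJ = "\u200D"          # ZERO WIDTH JOINER
MAN = "\U0001F468"
BOY = "\U0001F466"

# Assumed sequence: man ZWJ man ZWJ boy ZWJ boy (one visible glyph).
family = ZWJ.join([MAN, MAN, BOY, BOY])

code_points = len(family)                          # Python str counts code points
utf8_bytes = len(family.encode("utf-8"))
utf16_units = len(family.encode("utf-16-le")) // 2

print(code_points)   # 7
print(utf8_bytes)    # 25
print(utf16_units)   # 11
```

A language whose native string length is in UTF-16 code units would
report 11 here, which is why the unit being counted has to be spelled
out.
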
> With regard to having to re-encode for HTML representation: 
> unfortunate as that may be, other situations require other 
> transformations, like encoding in UTF-8, for the text to be used in 
> other systems (UI, storage, etc.).
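
As a rough illustration of that, assuming Python's standard html.escape
for the HTML case: each target system gets its own transformation of
the same abstract text, and the character count is taken before any of
them are applied.

```python
import html

text = "less < more"                   # abstract character sequence

for_html = html.escape(text)           # escaped for display in a browser
for_storage = text.encode("utf-8")     # encoded for storage or the wire

print(for_html)                        # less &lt; more
print(len(text), len(for_html), len(for_storage))   # 11 14 11
```
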
> If you want consistent counting on all platforms and languages, 
> counting Unicode characters seems to be the best way forward.
