[Standards] Proposed XMPP Extension: Character counting in message bodies
ralphm at ik.nu
Fri Dec 20 12:02:03 UTC 2019
Oops, the following should have been sent to the list.
On 19-12-2019 15:02, Ralph Meijer wrote:
> On 19-12-2019 13:59, Andrew Nenakhov wrote:
>> Wed, 18 Dec 2019 at 20:12, Ralph Meijer <ralphm at ik.nu
>> <mailto:ralphm at ik.nu>>:
>> My assumption was that we are looking at character data on the
>> abstract layer /after/ parsing XML. You shouldn't see entities there
>> (they'd be resolved to their respective characters), nor should you
>> see <![CDATA[...]]> wrappers.
>> Hm, please define the 'abstract' layer more precisely. Citing the example
>> from the XEP proposal, which is the true abstract layer?
>> this, image.png, or this:image.png ? Or the layer with 'codepoints'?
>> Is it really any better than escaped XML text?
>> This approach is also not very practical. When you do stanza
>> processing on a server, you most often just take the stanza as is,
>> passing all reference data along without converting it to the
>> abstract layer and back. Plus, for a web client this means an
>> additional escaping/unescaping routine every time something is
>> sent or displayed, because browsers require their own escaping.
> Abstract as in the abstract sequence of characters after parsing,
> however represented by your programming language. If I parse an XML
> document <blah><![CDATA[less < more]]></blah>, and request the text for
> the `blah` node, I get an object that encodes the abstract sequence of
> characters: `less < more`. In Python, for example, that'd be
> represented by a unicode string object.
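To illustrate the point about the abstract layer, here is a minimal sketch
using Python's standard xml.etree.ElementTree parser. It assumes the element
name `blah` from the example above; the helper name `text_of` is made up for
illustration. The CDATA form and the entity-escaped form should both parse to
the same sequence of characters, with no wrappers or entities left over.

```python
import xml.etree.ElementTree as ET


def text_of(xml_source: str) -> str:
    """Parse a small XML document and return the root element's text."""
    root = ET.fromstring(xml_source)
    return root.text or ""


# Two different serializations of the same character data.
from_cdata = text_of("<blah><![CDATA[less < more]]></blah>")
from_entity = text_of("<blah>less &lt; more</blah>")

# After parsing, both are the same abstract sequence of characters.
print(from_cdata == from_entity)
print(len(from_cdata))
```

Counting on the parsed text, rather than on the serialized bytes, means the
result does not depend on which escaping the sender happened to choose.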
> See also https://www.unicode.org/versions/Unicode12.1.0/ch03.pdf#G2212
> for various definitions around characters, code points, glyphs,
> graphemes, and the like. So yes, you'd be counting ZWJs and such for
> your example, and I think it tallies up to 7 for just man/man/boy/boy,
> without Fitzpatrick modifiers, hair variations, hair color, direction.
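As a rough check of that tally, the sketch below builds the man/man/boy/boy
family sequence from its code points (U+1F468, U+1F466, joined by U+200D)
and counts it three ways. The function names are hypothetical, and this is
only an illustration of why code point counts are stable where code unit and
byte counts are not, not a statement of what the XEP mandates.

```python
MAN = "\U0001F468"  # MAN
BOY = "\U0001F466"  # BOY
ZWJ = "\u200D"      # ZERO WIDTH JOINER

# man ZWJ man ZWJ boy ZWJ boy, without any modifiers
family = ZWJ.join([MAN, MAN, BOY, BOY])


def count_code_points(text: str) -> int:
    """Python 3 strings are sequences of code points, so len() suffices."""
    return len(text)


def count_utf16_units(text: str) -> int:
    """Number of 16-bit code units, as seen by e.g. JavaScript or Java."""
    return len(text.encode("utf-16-le")) // 2


def count_utf8_bytes(text: str) -> int:
    """Number of bytes in the UTF-8 encoding used on the wire."""
    return len(text.encode("utf-8"))


print(count_code_points(family))  # 7: four emoji plus three ZWJs
print(count_utf16_units(family))  # 11: each emoji needs a surrogate pair
print(count_utf8_bytes(family))   # 25: 4 bytes per emoji, 3 per ZWJ
```

A platform whose native string length reports UTF-16 code units would need to
convert to code points before counting in order to agree with the first number.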
> With regard to having to re-encode for HTML representation, as
> unfortunate as that may be, other situations require other
> transformations, like encoding in UTF-8, before the text can be
> used in other systems (UI, storage, etc.).
> If you want consistent counting on all platforms and languages,
> counting Unicode characters seems to be the best way forward.