[Standards] Proposed XMPP Extension: Character counting in message bodies

Andrew Nenakhov andrew.nenakhov at redsolution.com
Thu Dec 19 12:59:47 UTC 2019


On Wed, 18 Dec 2019 at 20:12, Ralph Meijer <ralphm at ik.nu> wrote:

> My assumption was that we are looking at character data on the abstract
> layer /after/ parsing XML. You shouldn't see entities there (they'd be
> resolved to their respective characters), nor should you see <![CDATA[]]
> wrappers.
>
Hm, please define the 'abstract' layer more precisely. Taking the example
from the XEP proposal, which is the true abstract layer:
this, [image: image.png], or this: [image: image.png]? Or the layer with
'codepoints'? Is it really any better than escaped XML text?

This approach is also not very practical. When you process stanzas on a
server, most often you just take the stanza as is and pass all reference
data along, without converting the data to the abstract layer and back.
Plus, in a web client this means an additional escaping and unescaping
routine every time something is sent or displayed, because browsers require
their own escaping.
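
To make the two candidate layers concrete, here is a minimal Python sketch
(an illustration only, not code from any XMPP implementation). It assumes
the body on the wire is <body>&amp;&amp;&amp;</body> and compares the
length of the escaped text with the length of the character data after
parsing.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Assumed wire form of the message body (escaped XML text).
wire = "<body>&amp;&amp;&amp;</body>"

# Layer 1: the escaped string exactly as it travels inside the stanza.
escaped_text = wire[len("<body>"):-len("</body>")]

# Layer 2: character data after XML parsing, with entities resolved.
parsed_text = ET.fromstring(wire).text

print(len(escaped_text))  # 15
print(len(parsed_text))   # 3

# Going back to the wire form requires escaping the parsed text again.
print(escape(parsed_text) == escaped_text)  # True
```

An offset computed against one layer is meaningless against the other, so
whichever layer is chosen has to be stated explicitly.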

On Wed, 18 Dec 2019 at 20:41, Marvin W <xmpp at larma.de> wrote:

> [inline]
>
> On 12/18/19 3:22 PM, Andrew Nenakhov wrote:
> > In the end we have settled for counting characters of escaped string, so
>
> This sounds like a terrible idea. In encoded XML, "&gt;", "&#x3E;", ">"
> and "<![CDATA[>]]>" are equivalent. I just tried it out and servers
> indeed do convert all of those to their shortest well-formed variant
> (which is ">") so you cannot rely on their reference length at all.
> Servers may at their discretion convert non-ascii characters to their
> character reference form (starting with &#). I have seen this at least
> once happening with emojis.
>

Why should a standard be concerned about different server implementations
converting anything? If a server, for some reason, converts from one way of
escaping XML to another, of course it should recalculate all references.
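
As a rough illustration of what is at stake, the Python sketch below takes
four spellings of the same character (a predefined entity, a numeric
character reference, the literal character and a CDATA section) and
compares their escaped length with their parsed length. The helper name
parsed_and_escaped_lengths is made up for this example.

```python
import xml.etree.ElementTree as ET

# Four wire spellings that all parse to the single character ">".
variants = ["&gt;", "&#x3E;", ">", "<![CDATA[>]]>"]

def parsed_and_escaped_lengths(fragment):
    """Return (length after parsing, length as written on the wire)."""
    text = ET.fromstring(f"<body>{fragment}</body>").text
    return len(text), len(fragment)

lengths = {v: parsed_and_escaped_lengths(v) for v in variants}

for variant, (parsed_len, escaped_len) in lengths.items():
    print(f"{variant!r}: parsed={parsed_len} escaped={escaped_len}")
```

The parsed length is 1 in every case, while the escaped length differs per
spelling. Offsets counted on the escaped string therefore stay valid only
if whoever re-escapes the body also rewrites the offsets.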


> > to draw *&amp;&amp;&amp;* in a client we count it as string with a length of 15,
> > thus <bold> reference points to characters 0..14:
> > <reference xmlns="urn:xmpp:reference:0" begin="0" end="14"
> > type="markup"><bold /></reference>
>
> Luckily for you, this looks pretty non-standard,


...


> You are apparently mixing XEP-0372 and XEP-0394.
>

I am not mixing them. XEP-0394 is pathetic, ill-conceived nonsense, which
couldn't even use the same attribute names as the preceding references XEP:
0372 uses 'begin' and 'end' and 0394 uses 'start' and 'end'.
Standards, right.

We chose to ignore both 394 and 385 and have developed a very uniform way to
do all the things we need in messages - markup, links, images, voice
messages, files, locations, etc. So far our 'non-standard' way of using
references is in fact way more 'standard' than what is currently suggested
by this mish-mash of different XEPs.

On Wed, 18 Dec 2019 at 21:00, Ralph Meijer <ralphm at ik.nu> wrote:

> On 18-12-2019 16:40, Marvin W wrote:
> > [..]
> >
> > Also that's a weird counting there, usually I would expect end to
> > point to the position after the last referenced character - at least
> > that's what you do in most programming languages (e.g.
> > "&amp;&amp;&amp;"[0:14] will give you "&amp;&amp;&amp" without the
> > last ";").
>
> I'd not be opposed to changing the definition of 'end' here. Twitter
> Entities [1] also points to the character after.


Should we really be blindly fixed on copying Twitter's approach when, in
fact, we have a significantly different use case? For one, Twitter entities
are ALWAYS separated by some symbol (a space or a punctuation mark). They
never have a URL next to a hashtag without some separator between them. The
advantage of this approach is that you can derive the length of a reference
by subtracting begin from end, but in return you end up with weird
intersecting ranges:
<body>*this**is**not**good*</body>

0..4: bold
4..6: italic
6..9: underscore
9..13: bold italic
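
The ranges above can be checked with a short Python sketch. It assumes the
visible text of the body is "thisisnotgood" (the asterisks being markup
residue of the plain-text rendering) and reads each pair as "end points to
the character after the last one". The inclusive form is shown next to it
for comparison.

```python
# Assumed visible text of the body, without the formatting markers.
text = "thisisnotgood"

# Ranges as listed above: end is the position after the last character.
half_open = [
    (0, 4, "bold"),
    (4, 6, "italic"),
    (6, 9, "underscore"),
    (9, 13, "bold italic"),
]

# Python slices use the same end-exclusive convention.
spans = [text[begin:end] for begin, end, _ in half_open]

# Length falls out as end - begin.
lengths = [end - begin for begin, end, _ in half_open]

# Same ranges with end pointing at the last referenced character.
inclusive = [(begin, end - 1, style) for begin, end, style in half_open]

print(spans)      # ['this', 'is', 'not', 'good']
print(lengths)    # [4, 2, 3, 4]
print(inclusive)  # no boundary number is shared between adjacent ranges
```

In the end-exclusive form each boundary (4, 6, 9) appears in two adjacent
ranges, which is the "intersecting" look described above. In the inclusive
form every index belongs to exactly one range.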

Not really cool, right? Also, by Twitter's own rules, the last pair of
indices should be 9..12, not 9..13:

> The second integer represents the location of the first non-URL character
> occurring after the URL *(or the end of the string if the URL is the last
> part of the Tweet text)*


(emphasis mine). Since Twitter does not use null-terminated strings, an
entity pointing to the full "&amp;&amp;&amp;" tweet would have indices
[0, 14], not [0, 15].

With all this written, I think it is safe to put to rest references (sic)
to the Twitter way of doing things. We thank them for the inspiration, but
that's it. We have different use cases.

The cited example of programming languages is valid only in part. Yes, it
is so in Java or Python, but not in Swift, Objective-C or Erlang. The last
three use the index of the first character and a length, which is actually
my favourite approach.
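
For reference, the three conventions discussed in this thread (inclusive
end, end pointing after the last character, and start plus length) convert
into one another trivially. A small Python sketch, with helper names made
up for this example:

```python
def inclusive_to_half_open(begin, end):
    """End index of the last character -> end index after the last character."""
    return begin, end + 1

def half_open_to_start_length(begin, end):
    """End-exclusive range -> (index of first character, length)."""
    return begin, end - begin

s = "&amp;&amp;&amp;"

inclusive = (0, len(s) - 1)                           # (0, 14)
half_open = inclusive_to_half_open(*inclusive)        # (0, 15)
start_length = half_open_to_start_length(*half_open)  # (0, 15)

# Python slicing is end-exclusive, so the half-open pair selects everything.
print(s[half_open[0]:half_open[1]] == s)  # True

# Feeding the inclusive end into a slice drops the final ";".
print(s[0:14])  # &amp;&amp;&amp
```

Whichever convention a XEP picks, the text and the examples have to agree
on it, because an off-by-one here silently clips or extends every
reference.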

On Wed, 18 Dec 2019 at 21:59, Marvin W <xmpp at larma.de> wrote:

> I don't think it really is a "change", in XEP-394 it is already defined
> this way ("the last affected codepoint is the one just before end" [1])
> and the example in XEP-372 [2] also counts that way (char 72 is the "J"
> of and char 78 is the space after "Juliet"). Only the text misleadingly
> says "An end attribute is similarly used for the index of the last
> character of the reference.", so this may need a clarification.
>

Well, I strongly object. The text of XEP-0372 clearly says that the end
attribute is the index of the last referenced character, not of the
character succeeding it. So the right thing to do here is to change the
value of the 'end' attribute in the example from '78' to '77'.

(Btw, did anyone but us implement this XEP at all?)

On the 'already defined' 394: as we have learned from the XEP-0071 debacle,
even widely implemented XEPs can be deprecated with vague reasoning, so
deprecating a contradictory XEP that, to my knowledge, wasn't even
implemented anywhere shouldn't be too much of an issue.

-- 
Andrew Nenakhov
CEO, redsolution, OÜ
https://redsolution.com <http://www.redsolution.com>

