[Standards] Proposed XMPP Extension: Character counting in message bodies

Marvin W xmpp at larma.de
Thu Dec 19 19:48:55 UTC 2019

On 12/19/19 1:59 PM, Andrew Nenakhov wrote:
> Is it really any better than escaped XML text?

Yes. Any sane implementation of XML parsers would resolve references as 
part of the parsing, so you would have to do extra work to find out what 
references were in the text before.
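
To illustrate the point, here is a minimal sketch using Python's standard 
library parser (my choice of parser and sample bodies are only assumptions 
for the example, any conforming parser should behave the same way). Three 
different spellings of the same character arrive at the application as 
identical text, with no trace of which reference was used:

```python
import xml.etree.ElementTree as ET

# Three spellings of the same body text: a predefined entity, a decimal
# character reference and a hexadecimal character reference.
variants = [
    "<body>a &amp; b</body>",
    "<body>a &#38; b</body>",
    "<body>a &#x26; b</body>",
]

# The parser resolves the references while parsing; .text holds the
# resolved string only.
texts = [ET.fromstring(v).text for v in variants]
print(texts)
```

All three entries are the same five-character string, so counting on the 
escaped form would need information the parser has already thrown away.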

> Plus, when doing the web client this means an additional 
> escaping - deescaping routine every time when something is 
> sent-displayed, cause browsers require their own escaping.

I hope that any web client would not use innerHTML or similar techniques 
to display the message body, but instead rely on 
document.createTextNode() which expects a string without references. 
Similarly inputElement.value and element.textContent give you their 
strings without references. In general, HTML/JS do their best to 
abstract away from references, because why should an application 
developer deal with that?

Also HTML uses a different set of predefined references than XML and has 
different requirements - &auml; is valid in HTML but not in XML (without 
it being defined as an entity in a DTD).
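
A small sketch of that difference, assuming Python's html and 
xml.etree.ElementTree modules as stand-ins for an HTML and an XML consumer 
(the variable names are made up for the example). It feeds the named 
reference for ä (`&auml;`) to both:

```python
import html
import xml.etree.ElementTree as ET

# HTML defines the named reference, so it resolves to the character.
html_text = html.unescape("&auml;")

# XML only predefines lt, gt, amp, apos and quot. Without a DTD that
# declares it, the same reference is a parse error.
try:
    ET.fromstring("<body>&auml;</body>")
    xml_accepts_auml = True
except ET.ParseError:
    xml_accepts_auml = False

print(html_text, xml_accepts_auml)
```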

> Why should standard be concerned about different server implementations 
> converting anything?  If a server does some converting for some reason 
> from one way of escaping XML to another, of course it should recalculate 
> all references.

On the XML layer (which is what XMPP builds on) this "conversion" does 
not change anything (the texts stay the same), that's why it is 
perfectly valid for a server to do it. The protocol on top of XML (and 
subsequently XMPP) should not deal with references, they are resolved on 
the layer below. That's why it is a bad idea to assume specific 
characters to be represented using certain references, because you can't 
control that (you can only assume things).
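
As a rough sketch of what such a "conversion" looks like, assuming a 
server that simply parses and re-serializes stanzas (modelled here with 
Python's xml.etree.ElementTree, and a made-up body): the bytes on the wire 
change, the text does not, and offsets counted on the text stay valid.

```python
import xml.etree.ElementTree as ET

# What the sender wrote, using numeric character references.
sent = "<body>&#60;g&#62; &#38; more</body>"

# A server that parses and re-serializes may pick other references.
element = ET.fromstring(sent)
reserialized = ET.tostring(element, encoding="unicode")

# The receiver parses again and gets exactly the same text.
received_text = ET.fromstring(reserialized).text

print(sent != reserialized)            # the markup differs
print(received_text == element.text)   # the text is identical
print(received_text.index("g"))        # offset of "g" in the text
```

An offset counted on the escaped markup of `sent` would point somewhere 
else entirely after the re-serialization.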

So I tried with Xabber/xabber.org and either your server or the client 
(I guess it's the server) seems to fail to properly do what you just 
said it should: When sending the message

<message type="chat">
   <reference xmlns='urn:xmpp:reference:0' begin='1' end='1' 
   <reference xmlns='urn:xmpp:reference:0' begin='3' end='3' 

it is displayed as


with g and ; in bold.

> So far our 'non-standard' way of using 
> references is in fact way more 'standard' than what is currently 
> suggested by this mish-mash of different XEPs.

I guess we have different definitions of a standard. This mish-mash of 
different XEPs is a publicly viewable standard proposal. I am not aware 
of any documentation of what Xabber is doing.

> Not really cool, right? 

What's bad about that? I would say that having "0..0 bold" is pretty 
weird, because it sounds like an empty range (it starts and ends at the 
same point, so it must be empty).

>     The second integer represents the location of the first non-URL
>     character occurring after the URL *(or the end of the string if the
>     URL is the last part of the Tweet text)*

I think you are misunderstanding them here. I am pretty sure "the end of 
the string" is *after* the last character, not the last character.

> Cited example of programming languages is valid only in part. Yes, it is 
> so in java or python, but not so in swift, obj-c or erlang. The last 
> three use index of the first character and length, which is  actually my 
> favourite approach.

I don't think it really makes sense to discuss which programming 
language is the one that matters most, but:
- Swift has two operators "ABCDE"[2...4] = "CDE" and "ABCDE"[2..<4] = "CD"
- Objective-C substring functions require index and length
- Erlang uses 1-based indices, string:sub_string("ABCDE", 2, 4) = "BCD", 
thus is equivalent to python [1:4]

Also when you prefer index of first char and length, why not use <ref 
begin="2" length="2" /> then? For languages that take string length, you 
currently have to calculate length = end+1-begin (because you chose to 
have end one less than everyone else does).
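
To make the two conventions concrete, a minimal sketch in Python (the 
function names are hypothetical, they are not taken from any XEP or 
library). It compares an exclusive end, as in the XEP-394 wording, with an 
inclusive end, and shows the length calculation for each:

```python
def slice_half_open(text, begin, end):
    """end is the index just past the last affected code point."""
    return text[begin:end]


def slice_inclusive(text, begin, end):
    """end is the index of the last affected code point."""
    return text[begin:end + 1]


def to_begin_length(begin, end, inclusive):
    """Convert a (begin, end) pair to (begin, length)."""
    length = end - begin + 1 if inclusive else end - begin
    return begin, length


print(slice_half_open("ABCDE", 1, 4))   # BCD
print(slice_inclusive("ABCDE", 1, 3))   # BCD
print(slice_half_open("ABCDE", 0, 0))   # empty string
print(slice_inclusive("ABCDE", 0, 0))   # A
print(to_begin_length(1, 4, inclusive=False))  # (1, 3)
print(to_begin_length(1, 3, inclusive=True))   # (1, 3)
```

With the exclusive end, begin == end is the empty range and the length is 
just end - begin. With the inclusive end there is no way to express an 
empty range at all.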

> On Wed, 18 Dec 2019 at 21:59, Marvin W <xmpp at larma.de 
> <mailto:xmpp at larma.de>> wrote:
>     I don't think it really is a "change", in XEP-394 it is already defined
>     this way ("the last affected codepoint is the one just before end" [1])
>     and the example in XEP-372 [2] also counts that way (char 72 is the "J"
>     of "Juliet" and char 78 is the space after "Juliet"). Only the text misleadingly
>     says "An end attribute is similarly used for the index of the last
>     character of the reference.", so this may need a clarification.
> Well. I strongly object.

Either we need to change the text in XEP-372 slightly or we have to 
change the examples in XEP-372 and the text and examples in XEP-394 
(because both should do the same). I see you have a strong opinion on 
the one side for some reason.

> ( Btw, did anyone but us implement this XEP at all?  )

Converse has an implementation of XEP-372 for mentions (the only use case 
that is properly defined in that XEP IMO).

> On 'already defined' 394. As we have learned from 0071 debacle, even 
> widely implemented XEPs can be deprecated with vague reasoning, so 
> deprecating a contradictory XEP that, to my knowledge, wasn't even 
> implemented anywhere, shouldn't be too much of an issue.

Sure, we could deprecate XEP-394, but I don't see a proper replacement 
for it yet. I consider the thing Xabber is doing more like a misuse of 
XEP-372, which according to its abstract defines a method for one XMPP 
stanza to provide references to another entity, such as mentioning 
users, HTTP resources, or other XMPP resources -  not a way for putting 
markup everywhere. I would rather get rid of XEP-372 (which has a 
lot of unclear things and pending TODOs in it) than XEP-394 (which 
can surely be improved).
