[Standards] Need sanity check on an example in XEP-0393: Message Styling

Tedd Sterr teddsterr at outlook.com
Sat Nov 7 14:53:21 UTC 2020

> > Whereas I say: the first asterisk CAN'T BE an open (it's followed
> > immediately by a close), … I don't think there is a rule violation in
> > either case precisely because this isn't specified
> I think that's covered by "Spans are always parsed from the beginning of
> the byte stream to the end". This in my mind meant we can't go backwards
> to decide if the original start element really was one or not. Maybe
> that's what needs to be clarified here. I wonder if it would be a
> violation of the rules to add it.

You don't need to go backwards: you decide whether the current character is a valid open according to the next character, which you have to do anyway to check for spaces. It doesn't decide there's an open and then possibly go back and re-evaluate; the entire string is parsed once from beginning to end, and directives are identified in place based entirely on the characters immediately before and after.

> > I'm not sure there is anything in the current rules that would require
> > the extra lookahead; you should be able to parse the string once to
> > identify all of the potential directives and then construct the spans
> > using that list.
> That's still unbounded look ahead, requiring that you range over the
> entire string to make sure one span actually is styled.

The directives themselves are identified using a lookahead of one, not unbounded; spans are then constructed based solely on the identified directives (without going through the string again). You might argue that you shouldn't have to build an explicit list of directives and then construct the spans, but doing this recursively still builds the list implicitly on the stack.
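
As a rough illustration of that two-step approach, here is a minimal Python sketch. It is not taken from the XEP or from any implementation: the names `classify` and `build_spans` are made up, only `*`, `_` and `~` are handled, and the open/close tests are a simplified reading of the rules (an open needs start-of-string or whitespace before it, and neither whitespace nor the same directive character after it; a close needs neither of those before it). Nesting validation, preformatted text and blocks are ignored.

```python
DIRECTIVES = "*_~"  # assumption: only these three span directives


def classify(text):
    """Single pass; each decision looks only at the previous and next character."""
    found = []
    for i, ch in enumerate(text):
        if ch not in DIRECTIVES:
            continue
        # Treat the string boundaries as whitespace (simplifying assumption).
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        can_open = prev.isspace() and not nxt.isspace() and nxt != ch
        can_close = not prev.isspace() and prev != ch
        if can_open or can_close:
            found.append((i, ch, can_open, can_close))
    return found


def build_spans(directives):
    """Pair opens with closes using only the list, never the original string."""
    spans = []
    pending = {}  # directive character -> index of its unmatched open
    for i, ch, can_open, can_close in directives:
        if can_close and ch in pending:
            spans.append((pending.pop(ch), i))
        elif can_open and ch not in pending:
            pending[ch] = i
        # Anything else is left as plain text.
    return spans
```

Under these assumptions, `build_spans(classify("*text *text*"))` gives `[(0, 11)]` (the middle asterisk stays text), and `"*text _text ~text"` gives no spans without the string ever being rescanned.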

> > Repeat searching trying to find the best match doesn't sound
> > very lazy.
> You don't have to do repeated searching, just one which is potentially
> unbounded in the forward direction. The way my code works right now is
> basically as you've described (it does actually repeat in several cases,
> but that's just because it made it easier to reason about and I didn't
> care about performance for short messages, it could be rewritten to do
> it in one pass instead of finding the close elements, then recursing
> into the bytes inside of them to look for more spans), except that you
> can't go back and change your mind on start tokens because I don't think
> the current rules allow that.

Given a very-contrived-to-prove-the-point input such as: "*text _text ~text"
You would see the asterisk, decide that's an open, and then search all the way to the end of the string looking for a matching close - you wouldn't find one, so you'd have to go back and say the asterisk isn't an open.
Then you'd get to the underscore, decide that's an open, and then search all the way to the end of the string looking for a matching close - you wouldn't find one, so you'd have to go back and say the underscore isn't an open.
Then you'd get to the tilde, decide that's an open, and then search all the way to the end of the string looking for a matching close - you wouldn't find one, so you'd have to go back and say the tilde isn't an open.
That looks very much like repeatedly searching substrings for a match.
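
To put a number on that, here is a hedged Python sketch of the search-forward strategy as described in this thread. The function `scan_forward` and its `examined` counter are hypothetical, the open/close tests are simplified (whitespace checks only, and only `*`, `_`, `~`), and it does not recurse into span contents.

```python
DIRECTIVES = "*_~"  # assumption: only these three span directives


def scan_forward(text):
    """Commit to each open, then search ahead for its close.

    Returns (spans, examined), where examined counts the characters
    visited while searching for closes.
    """
    spans = []
    examined = 0
    i = 0
    while i < len(text):
        ch = text[i]
        # Treat the string boundaries as whitespace (simplifying assumption).
        prev = text[i - 1] if i > 0 else " "
        nxt = text[i + 1] if i + 1 < len(text) else " "
        if ch in DIRECTIVES and prev.isspace() and not nxt.isspace():
            close = None
            # Start at i + 2 so there is at least one character of content.
            for j in range(i + 2, len(text)):
                examined += 1
                if text[j] == ch and not text[j - 1].isspace():
                    close = j
                    break
            if close is not None:
                spans.append((i, close))
                i = close + 1
                continue
            # No close found: the open is demoted to text and we carry on.
        i += 1
    return spans, examined
```

For `"*text _text ~text"` this returns no spans after examining 27 characters of a 17-character string, because each failed open triggers its own scan to the end.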

My way would identify the possibly-open directives in one pass, and then attempt to construct spans from those - which would fail due to a lack of matching closes.

Either way, you can't say for definite that an open starts a span until after you've checked further, i.e. this isn't a context-free grammar.

> > 3. {*text *text*} Example 1 is the same as in your examples; 2 is
> >    basically the same (the space comes after the close); but for 3 the
> >    space comes before the second asterisk which invalidates it as a
> >    close, thus making it a possible-but-not-directive-present-between-two-
> >    directives. That's easily identified without searching the whole
> >    string, and no rules are broken.
> I don't understand this example, sorry. This is how it works today and
> is consistent with skipping the middle * in the "***" example.

You had mentioned the possibility of potentially-a-directive characters appearing within valid spans, where those characters themselves should be treated as text - this was an example of that. And, yes, this is how it currently works; my point was that it's still handled correctly doing it my way and no rules are violated (you had suggested otherwise).

Whether '***' is consistent with this example is where we differ.
Your version sees the open and starts searching for a matching close. The next character is an asterisk, but that would mean no intervening text, so it's not a valid close; you then move on to the next asterisk, which is a valid close - resulting in {***}.
My version sees a possible open, but the following character makes it invalid (just as a space would), so it's not an open; the next asterisk is also not a valid open for the same reason - resulting in *** (just text).
Neither way violates the rules because the relevant decision isn't specified - that is, either: tenaciously hold onto the open and desperately try to find some kind of match (giving up anyway if you reach the end of the string without finding one), or simply say it's not a valid open and move on without having to check the entire rest of the string just in case.
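
The two readings differ in a single test, which a small Python sketch can isolate. Both functions are hypothetical and only consider a span starting at the first character, with one directive character; they are meant to show the unspecified decision, not to implement the XEP.

```python
def eager_span(text, d="*"):
    """Keep the open and take the first close that has text before it."""
    if text[:1] != d or text[1:2] in ("", " "):
        return None
    for j in range(2, len(text)):  # start at 2 so the span is never empty
        if text[j] == d and text[j - 1] != " ":
            return (0, j)
    return None


def strict_span(text, d="*"):
    """Reject the open if the next character is a space or the directive itself."""
    if text[:1] != d or text[1:2] in ("", " ", d):
        return None
    for j in range(2, len(text)):
        if text[j] == d and text[j - 1] != " ":
            return (0, j)
    return None
```

`eager_span("***")` returns `(0, 2)`, i.e. {***}, while `strict_span("***")` returns `None`, i.e. plain text; both agree on `"*a*"`.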

> > Anyway, let's not continue to spam the mailing list - sorry everyone!
> > - we can continue this debate elsewhere if necessary.
> This is the place to discuss XEPs, so this seems fine. I'd love to get
> others' opinions on whether this is underspecified or if I'm just
> overthinking it and this seems like the place to do it.

Fair enough.

> > The best thing to do is write some code that follows the rules and see
> > what that leads to - that should also allow you to identify where the
> > rules are underspecified and generate consistent examples.
> That's what I have done. I have written multiple implementations which
> led me to realizing that this part is confusing and possibly
> underspecified. Now that's why I'm having this conversation: I'm trying
> to figure out what the current specification means if you follow it. One
> of my implementations had "***" unformatted, one of them had it all
> strong and I'm trying to figure out which one is right, and/or where
> things need to be clarified. I can see the argument for both, but I'm
> unsure if it's underspecified, or if it's just unclear and one or the
> other is right.
> Maybe it would be more productive to ask what other implementations have
> done? If there's broad consensus I can just clarify the rules to mean
> whatever everyone is already doing.

If you have implemented it and even you came up with multiple versions with different behaviours, while following the rules, then clearly the rules are underspecified and we couldn't expect anybody else to follow the rules and still come up with a consistent implementation.

One problem with this is that you (or anyone) will implement it with some internal understanding of how you expect it to work, without that necessarily being explicit in the rules; while somebody else might have a different understanding and produce something different, but still consistent with the rules given. So, yes, it might be better to make the rules more explicit to match whatever others have already done.
