[Members] wiki.xmpp.org data recovery

Guus der Kinderen guus.der.kinderen at gmail.com
Fri Jun 23 10:07:37 UTC 2017


I've manually restored my application pages and all pages from Tobi's
archive that start with Summer_of_Code

From that, I've learned that these manual modifications are needed for a
page that is transformed using the xidel / pandoc combination mentioned
earlier (a rough sed sketch covering the regex-based ones follows below):

   - The table of contents needs to be removed (Mediawiki will add one
   automatically)
   - Everything that matches the regex <span [^>]*> needs to be removed
   (these were used to create anchors for the old ToC, I think)
   - Everything that matches </span> needs to be removed (closing tags for
   the anchors mentioned above)
   - The old context root of the wiki was /web/, while the new one is
   /index.php/ - search the text for web/, which will turn up some old
   references to pages and/or user profiles
   - Some pages start with a level 2 header - for these pages you'll have
   to reduce all header levels by one.
   - Generally, get rid of <div> and <br> tags
   - Images that are used on some pages are lost
   - Where images were used, there is now a two-column table, each column
   having a fixed width of 50%. You should drop that fixed 50% width.

After that, Mediawiki's preview can be used for smell-testing your
resulting page.
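
For the regex-based items, something like the sed pass below might cover
most of it. This is an untested sketch: page.mediawiki is a placeholder
for the pandoc output, the exact width attribute pandoc emits for the
image tables may differ, and the ToC removal and header-level shift still
need to be done by hand.

$ sed -E -i \
    -e 's|<span [^>]*>||g' \
    -e 's|</span>||g' \
    -e 's|</?div[^>]*>||g' \
    -e 's|<br ?/?>||g' \
    -e 's|width="50%"||g' \
    page.mediawiki
$ grep -n 'web/' page.mediawiki   # lists the old /web/ links for manual fixing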

On 22 June 2017 at 17:03, Goffi <goffi at goffi.org> wrote:

> On Thursday, 22 June 2017 at 10:06:05 CEST, Guus der Kinderen wrote:
> > Oh, that's actually handy. I'm not much of a bash scripter, but by
> > combining xidel (to select the part of the HTML that is the article
> > content) and pandoc (for conversion to the Mediawiki format), I'm getting
> > something that is pretty close. Example:
> >
> > $ xidel --html Edwin_Mons_Application_2011.html --css "#mw-content-text" |
> >     pandoc --from html --to mediawiki
> >
> > Can someone improve on that?
>
> We can also use weboob with webcontentedit to automate publishing on the
> wiki, something like:
>
> $ xidel --html Edwin_Mons_Application_2011.html --css "#mw-content-text" |
>   pandoc --from html --to mediawiki |
>   webcontentedit edit Edwin_Mons_Application_2011
>
> Add curl or wget to the mix, and I think we can make a script that handles
> this reasonably well; we can fix any remaining issues by hand afterwards.
>
> I'm too busy right now to work on a script, but it should not be very
> complicated to do.
>
> Goffi
>
>
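
For reference, a rough sketch of the loop Goffi describes, assuming the
old pages are still reachable under their /web/ URLs (the host and page
list below are placeholders) and reusing the webcontentedit invocation
from his example:

$ for page in Edwin_Mons_Application_2011 Summer_of_Code_2011; do
    wget -q -O "$page.html" "http://OLD-WIKI-HOST/web/$page"
    xidel --html "$page.html" --css "#mw-content-text" |
      pandoc --from html --to mediawiki |
      webcontentedit edit "$page"
  done

The sed cleanups from above could be slotted in between pandoc and
webcontentedit.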

