Converting Moinmoin Text To Markdown

Whatever next WikiEngine I work with after MoinMoin, I'll want the SmartAscii to be MarkDown, just because it's the winning standard.

What differences in syntax do I have to worry about?

The meta-question is: which Mark Down?

Common Mark
whichever flavor is used in Flask https://pythonhosted.org/Flask-Markdown/
whichever flavor is used in WebPy
actually for either Flask or WebPy I could probably swap in a Common Mark library without much trouble.

So I guess I'll have to see whether those differences matter...

(Also note I already tweaked MoinMoin during Cloning Zwiki With Moinmoin.)

Will strip beginning z from BlogBit pages.

hmm, should I start breaking down post types? (cf IndieWeb)

Italics/bold are fine as-is, no conversion needed.

Bullet-lists (where are double-returns prohibited? before? between?), esp nested.

Pre-formatted text

inline - backticks - no change
blocks - Common Mark makes "fenced code blocks" when surrounded by triple-ticks. (Opening set must be on its own line. Does it need double-break before? Test!)

Oct'2015: just decided to switch from Smashed Together Words to Free Link for Automatic Linking.

Links to other sites are different: [label](url)

and just raw URL doesn't auto-link?!?! Have to do <url>. Ugh I have a lot of those, and they're not even always on a line by themselves
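A minimal sketch of wrapping raw URLs in angle brackets so CommonMark treats them as autolinks, including ones in the middle of a line. The function name and the exact pattern are assumptions, not the code I actually ran; it skips URLs that already sit inside `<...>`, a `(url)` link target, or an HTML attribute.

```python
import re

# A raw http(s) URL that is not already preceded by <, (, [, a quote, or =.
RAW_URL = re.compile(r"""(?<![<(\["'=])\bhttps?://[^\s<>()\[\]"']+""")


def wrap_raw_urls(text):
    """Turn bare URLs into <url> autolinks, leaving trailing punctuation outside."""
    def repl(match):
        url = match.group(0)
        trailing = ''
        # A comma or period at the end is almost always sentence punctuation.
        while url and url[-1] in '.,;:!?':
            trailing = url[-1] + trailing
            url = url[:-1]
        return '<' + url + '>' + trailing

    return RAW_URL.sub(repl, text)


print(wrap_raw_urls("see http://example.com/a, and more"))
```

Known gap in this sketch: URLs that contain parentheses get cut at the first paren, so those would still need a manual pass.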

Images? Linked images? Do I have any?

Nov15'2015: have run the "scraping" code to make a metadata file and copy the latest MoinMoin version-file of each page to a re-named target, so I can download and convert.

I think the first conversion step is to replace the Smashed Together Words with double-brackets

and batch-convert to Expanding Wiki Words
- both for page titles, and wherever referenced in other pages
and deal with BlogBit z-prefix cases
- just decided to change pretty Title to look like Blogbit With Pretty Long Title Have To Decide How To Render (2015-11-12)
  - nope changed back to (2015-11-12) Blogbit With Pretty Long Title Have To Decide How To Render to keep order similar to URL
- and eliminate - right after end of WikiWord for plural (and maybe other cases I'm not remembering)
How?
- already have list of pages; run code over list to make Expanding Wiki Words
- manually review list, manually override exceptions (mainly single-word WikiWord-s) - should I copy this to a separate list?
- update my WikiGraph list so I have every WikiWord reference
  - already have a fresh list, but have to manually review weird UniCode cases my code isn't handling right
  - hmm, also have issue with WikiWord-s that are used but don't have a page. Maybe make separate list of those and manually review, etc. (better to review merged list)
- run my mapping list against that so I have nice list of substitutions specific to every page
- then run code that steps through every page and does manual replacement of each case from the WikiGraph map
  - check out the cases where a BlogBit is referenced on a page, since those already have double brackets
- then delete any - immediately following ]]
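The name-expansion part of the steps above could be sketched roughly like this. All function names are hypothetical, and the `overrides` dict stands in for the manually reviewed exceptions list; the real page_names.txt format is not assumed here.

```python
import re


def expand_wiki_word(word, overrides=None):
    """Split a SmashedTogetherWord into space-separated words.

    `overrides` holds manually reviewed exceptions (e.g. single-word names).
    """
    overrides = overrides or {}
    if word in overrides:
        return overrides[word]
    # Insert a space wherever a capital follows a lower-case letter or digit.
    return re.sub(r'(?<=[a-z0-9])(?=[A-Z])', ' ', word)


def blogbit_title(page_name, overrides=None):
    """Drop the leading z and put the date in parens before the expanded title."""
    match = re.match(r'z?(\d{4}-\d{2}-\d{2})-?(.*)$', page_name)
    if not match:
        return expand_wiki_word(page_name, overrides)
    date, rest = match.groups()
    return '(%s) %s' % (date, expand_wiki_word(rest, overrides))


def drop_plural_dash(text):
    """Delete a dash that immediately follows a closing double bracket."""
    return text.replace(']]-', ']]')


print(blogbit_title('z2015-11-12-BlogbitWithPrettyLongTitle'))
```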

Then do

raw URL-s
image URL-s

Then do regular external links - that should just be a regex in text editor.
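For reference, a sketch of that regex in Python rather than a text editor. It assumes the old external links look like `[http://example.com Some Label]` (single brackets, URL first, then the label); if the source pages use a different form the pattern needs adjusting.

```python
import re

# Assumed old form: [url label]  ->  CommonMark form: [label](url)
OLD_LINK = re.compile(r'\[(https?://[^\s\]]+)\s+([^\]]+)\]')


def convert_external_links(text):
    """Swap the URL and label into CommonMark's [label](url) order."""
    return OLD_LINK.sub(r'[\2](\1)', text)


print(convert_external_links("[http://example.com Example Site]"))
```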

Then all the other SmartAscii stuff above...

Converting SmashedTogetherWords: realized my current code has lots of false positives, catching strings that are inside URLs, etc. Discovered this while browsing the graph subset of WikiWords that don't have pages. Options:

assume bad cases aren't SmashedTogetherWords, just cap-strings. Browse list, delete non-words (by deleting the line from the look-up file, it doesn't get substituted when found)
add other bits of MarkDown conversion, run overall script against batch, then run against HTML validator.

Doing links

Fix the way that saving an Edit redirects to FrontPage rather than to the page just edited. Done Mar03'2016

Doing lists

to avoid getting p tags around list items, keep it "tight", so get rid of the empty lines (double-returns): but how to do this?
maybe could keep state as stepping through lines. Have to make sure I exit properly, so keep double-return between end of list and next paragraph.
''or maybe could do regex!''
first-level list items don't need leading space
ugh nesting requirements are confusing, prefer to experiment
if first level has no leading space
child needs 2 leading spaces
grandchild needs 4, etc.
if first level has 1 leading space
child needs 3
grandchild needs 5
hmm do I want to keep leading space for top-level? I prefer not-to. On the other hand, PikiPiki requires it. Conclusion: get rid of it.
so, given that MoinMoin pages always have top-level having 1 leading space... if a line in MoinMoin has n leading spaces, then for CommonMark should become:
if n=1, then nn=0
if n>1, then nn= (n-1)*2 (so actually don't need previous special case) - done Mar05?
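A sketch of the two list rules above: re-indent each bullet from n leading spaces to (n-1)*2, and keep state while stepping through lines so blank lines between items are dropped but the blank line between the end of a list and the next paragraph survives. Function names are hypothetical, and it only handles `*` bullets.

```python
import re

MOIN_BULLET = re.compile(r'^( +)\* (.*)$')   # MoinMoin: at least 1 leading space
MD_BULLET = re.compile(r'^ *\* ')            # after conversion


def convert_list_indent(line):
    """n leading spaces become (n - 1) * 2, so top-level items start at column 0."""
    match = MOIN_BULLET.match(line)
    if not match:
        return line
    n = len(match.group(1))
    return ' ' * ((n - 1) * 2) + '* ' + match.group(2)


def tighten_lists(lines):
    """Drop blank lines between list items; keep them before a following paragraph."""
    out = []
    pending_blanks = []   # blank lines seen since the last list item
    in_list = False
    for line in lines:
        if line.strip() == '':
            if in_list:
                pending_blanks.append(line)
            else:
                out.append(line)
            continue
        if MD_BULLET.match(line):
            pending_blanks = []           # blanks were between two items: drop
            in_list = True
        else:
            out.extend(pending_blanks)    # list ended: keep the separator
            pending_blanks = []
            in_list = False
        out.append(line)
    out.extend(pending_blanks)
    return out


def convert_lists(text):
    lines = [convert_list_indent(line) for line in text.split('\n')]
    return '\n'.join(tighten_lists(lines))


print(convert_lists(" * a\n\n * b\n  * c\n\nnext para"))
```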

Also

row of dashes does horizontal rule fine
Italics doesn't cover multiple lines - confirmed. Options? (Found in 92 files from 16k)
regexp - nope: multi-line, leading bullets, etc.
write the Python code - ugh not even consistent; sometimes ital-open is by itself on line, sometimes in middle of line; sometimes ital-close is by itself, sometimes at end of line
manually conform start/end, then batch-run code
decision: ignore, fix by hand when I run across it (might use Python for that, but manually set up the start/end)

Images

CommonMark doesn't give you a way to pass img width or put break around it; it actually surrounds with p tags!
and you can't just have the truly-raw URL, or raw-surrounded-with-angle-brackets.
hmm try just putting in parens
''maybe I should batch convert to nicer chunk of HTML?''
yikes some of them already are
and some have local URLs
put examples from current pages into a test doc (with name of source); test possible target outcomes in WikiFlux. Conclusion: need ![label](url) format.
- (update: also note CommonMark way to do linebreak, which might make sense before an image)
plan
convert raw URLs (jpeg, jpg, png) surrounded by whitespace into bracket-paren syntax
in TextWrangler, search-grep is (\s)(http[^ ]+(jpg|jpeg|png))(\s)
should I just do this in TextWrangler instead of code?
between this and already-HTML cases, vast majority handled, so ignore rest
maybe future feature to turn all CommonMark images into fancier output (right-flush, smaller, link-to-full...)
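If doing the image step in code instead of TextWrangler, a sketch along the lines of the search-grep above. It uses lookarounds instead of capturing the surrounding whitespace, and the choice of the file name as the label is an assumption (the notes don't say what label to use).

```python
import re

# A raw image URL bounded by whitespace (or the start/end of the text).
RAW_IMAGE = re.compile(r'(?<!\S)(https?://\S+\.(?:jpe?g|png))(?!\S)', re.IGNORECASE)


def convert_raw_images(text):
    """Turn whitespace-delimited image URLs into ![label](url)."""
    def repl(match):
        url = match.group(1)
        label = url.rsplit('/', 1)[-1]   # assumption: file name as the label
        return '![%s](%s)' % (label, url)

    return RAW_IMAGE.sub(repl, text)


print(convert_raw_images("before http://example.com/pics/cat.jpg after"))
```

Images that are already wrapped in HTML or angle brackets are left untouched, which matches the plan of ignoring the already-HTML cases.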

fenced code blocks - Mar30

start/end with 3 ticks not 3 curly-brackets
needs to open at start of line - so have to adjust those
my closings are fine
again, should I just use TextWrangler?
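A sketch of doing the fence conversion in code, assuming the old blocks open with three curly-brackets (possibly indented or at the end of a text line) and close with three closing curly-brackets. Function name is hypothetical.

```python
FENCE = '`' * 3   # three backticks


def convert_code_blocks(text):
    """Replace curly-bracket blocks with fenced blocks, forcing the opener to column 0."""
    out = []
    in_block = False
    for line in text.split('\n'):
        if not in_block and '{{{' in line:
            before, rest = line.split('{{{', 1)
            if before.strip():
                out.append(before.rstrip())   # text before the opener keeps its own line
            out.append(FENCE)
            in_block = True
            line = rest
            if not line.strip():
                continue
        if in_block and '}}}' in line:
            code, after = line.split('}}}', 1)
            if code.strip():
                out.append(code)
            out.append(FENCE)
            in_block = False
            if after.strip():
                out.append(after.lstrip())
            continue
        out.append(line)
    return '\n'.join(out)


print(convert_code_blocks("intro {{{\nx = 1\n}}}\nafter"))
```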

When creating a new page, it isn't starting out by making the title SpaceSeparated - ToDo
need to add more InterWiki cases (in code): ISBN, ASIN - '''check ASIN'''
check the number of files - seems way low

Pondering auto-post architecture

cases for auto-posting
conversion/bootup/migration
InstaPaper
TwittEr
should the ongoing bits be combined? should everything be combined?
where/how will I run them?
probably never hands-off, will scrape/process, then review/tweak by hand, then post
could have scraper code separated-by-partner, with central AutoPost code
for ongoing probably better to post via HTTP, but up-front might be on-server - nah just do HTTP API
actually not API, just HttpPost like with form
pass created_date, with modified_date being same. Done Apr05
hmm, CSRF? do I even have that working?

Next: AutoPost for converted MoinMoin pages

have working v1 - Apr14'2016
using requests library to do HttpPost with cookies
bugs
title field has dbl-square-brackets around it
body has asterisk at the end - yikes the real issue is that only the first line of the body got posted!
but dates are right!
fix brackets and 1st-line-only bugs: Apr15
now posting whole directory! Apr15
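Roughly what a directory-posting loop like this looks like. This is a sketch, not the actual AutoPost code: the form field names `title` and `body`, the `meta` dict, and the function names are assumptions. `session` is expected to be something like a logged-in `requests.Session()` so the cookies ride along with each post.

```python
import os


def build_post(path, title, created):
    """Build the form fields for one page."""
    with open(path, encoding='utf-8') as handle:
        body = handle.read()              # the whole file, not just the first line
    return {
        'title': title.strip('[]'),       # no double brackets around the title
        'body': body,
        'created_date': created,
        'modified_date': created,         # same as created for migrated pages
    }


def autopost_directory(session, post_url, directory, meta):
    """POST every file listed in meta (file name -> (title, created_date))."""
    results = []
    for name in sorted(os.listdir(directory)):
        if name not in meta:
            continue
        title, created = meta[name]
        data = build_post(os.path.join(directory, name), title, created)
        response = session.post(post_url, data=data)   # plain form post, not an API
        results.append((name, response.status_code))
    return results
```

Returning the status codes makes it easy to review by hand which pages failed before moving on.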

'''Back to SmashedTogetherWords''' Apr16

Weird cases

typical BlogBit pages
if my new style is to have the date piece in parens, then a space, what happens to the URL?
in naming the page, I give URL which doesn't change, then tweak the title. So non-issue.
in referring to such a page from a different page - ugh it doesn't compress the spacing, am I doing that on purpose? Apparently, because it ''does'' compress when referring to page that doesn't start with a number. That's in wikiweb/url_from_wiki_word(str).
where I want to change the spacing
what I want (in referring to a BlogBit page from inside another page)
typically pasting in URL-name, so that should work as-is
don't see need to add the parens in this usage
but could definitely see adding the spacing - and should do this in bulk-conversion
so wikiweb code should remove any spaces - but be careful capwords() might remove dashes
fixed wikiweb Apr25
early pages that just have tie-breaking single letter following the date? There are 600 of them!
if I rename them, have to catch any links to them!
process I could use
manually review each file
don't rename the file, but rename its entry in 1 meta file: page_names.txt
then in conversion process
catch any link and rename it
rename the output file

But I don't even have the regular link-conversion working yet!

ugh, realize in page_names.txt I have the paren-formatted BlogBit title, not the simpler format. So, to make the ref-handle, I'll have to check for that and convert in code. (Considered "fixing" the list file, but that creates redundant text so that changes have to be edited in both places.)
finish first working cut, but seeing some weird final results. Realize that should do this ''after'' running the regular link conversion code. Now working Apr26.

Run a batch, auto-post. Looking pretty good.

issue where a WikiWord is inside parens - get an extra space. Have to track that down.
it's not in the MarkDown, it's in the rendering of MarkDown.
ah, line already commented as suspicious in wikiweb/repl_wiki_word().....
isolate and comment out that logic. Now things look good. Apr26.
then lots of fixing of ugly WikiWord cases in the meta file - done Apr28
argh am I going to also review the PagelessWords meta file? Right now falling back to just use generic rule on all those cases, but that isn't kosher either. Or maybe it's ok, can correct as I find them.... but nervous about bad cases... should review file to look for some of them, go back to source to see if fail, or weird case. Then decide what to do.
did lots of pattern cleaning
searched for numbers and lots of spaces to dump suspicious strings
then dropped any string that lacked a lower-case letter! Because these aren't pages anyway... done Apr29
next - rewrite wikiweb/repl_wiki_word() to use pageless_words then fall back to just passing through - no rule-based bracketing - Apr29
finish all the posting: Apr30
derp realize I never ran all the regex stuff noted at the end of my convert script! But going to review anyway
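The lower-case filter and the lookup-then-pass-through rule from the notes above, as a sketch. Names are hypothetical, and both lookups are assumed to be dicts from the smashed name to the space-separated name; this is not the actual wikiweb code.

```python
def has_lower(text):
    """True if the string contains at least one lower-case letter."""
    return any(ch.islower() for ch in text)


def clean_pageless(words):
    """Drop all-caps strings and the like: these aren't pages anyway."""
    return [word for word in words if has_lower(word)]


def render_word(word, page_names, pageless_words):
    """Bracket only known names; anything else passes through untouched."""
    if word in page_names:
        return '[[' + page_names[word] + ']]'
    if word in pageless_words:
        return '[[' + pageless_words[word] + ']]'
    return word   # no rule-based bracketing as a fallback


print(clean_pageless(['OSAF', 'SomeWord', 'HTTP']))
```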

Review sample of pages:

raw links not done! (need to record regex)
actually, weirdly inconsistent!
mis-linking subset-wikiword: AI inside AIML! http://localhost:8082/wiki/2014-07-06-AiAssistedCoaching
ISBN links not working (actually ISBN itself wants to link, but not chain)
See a working InterWiki case but it uses brackets. Raw case doesn't work.
pages that were renamed have ## comment at top, which is now getting converted to an H2. Maybe just remove that line?

Working on InterWiki

set regexp
probably have to run this ''before'' scripts so individual names don't get bracketed
grr doesn't help - it removed the existing brackets!?!?
options for getting through here
nope, makes more sense to fix afterwards - catch ]]:
important to make sure any space-names are defined in page_names to stay rendered as SmashedTogetherWords
need code cases for MUPC, PhilJones
hrm another edge-case: where page-name piece matches one of my page names, which gets space-separated, then initial rendering of WikiWikiWeb:WardCunningham becomes WikiWikiWeb:Ward Cunningham - options
live with it
do separate regex for it
prevent it with totally different approach
another fail - ISBN where I left the dashes in, only gets matched/bracketed up to the first dash (at least if I use \w+ as my regex pattern). Options:
replace those cases separately (remove the dashes before starting?)
loosen my core InterWiki to include dashes - probably better - yes that works
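To show why the dashes matter, a small sketch comparing `\w+` with a looser `[\w-]+`. The prefix list and the double-bracket output form are assumptions for illustration; the actual InterWiki regex is not recorded here.

```python
import re

# Strict pattern stops at the first dash; the loose one lets dashes through.
STRICT = re.compile(r'\b(ISBN|ASIN|WikiWikiWeb):(\w+)')
LOOSE = re.compile(r'(?<!\[)\b(ISBN|ASIN|WikiWikiWeb):([\w-]+)')


def bracket_interwiki(text):
    """Wrap raw prefix:name references in double brackets, skipping bracketed ones."""
    return LOOSE.sub(r'[[\1:\2]]', text)


sample = "see ISBN:0-596-00281-5 now"
print(STRICT.search(sample).group(0))   # only matches up to the first dash
print(bracket_interwiki(sample))
```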

Another problem: false-positive matches: once a WikiWord string is identified in a doc, all cases of it get replaced with its mapped-value. But this is even true when it's part of a longer word. So, for instance, OSAF gets replaced even when it's inside OSAFStatusReports

just live with it, since it's pretty rare?
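If it ever stops being rare enough to live with, one possible fix is to only replace a name when it is not glued to other letters or digits, and to try longer names first. A sketch with hypothetical names, where `mapping` stands in for the per-page substitution list:

```python
import re


def apply_mapping(text, mapping):
    """Replace each mapped name only where it stands alone as a whole word."""
    if not mapping:
        return text
    # Longest names first so OSAFStatusReports wins over OSAF at the same spot.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r'(?<![A-Za-z0-9])('
        + '|'.join(re.escape(name) for name in names)
        + r')(?![A-Za-z0-9])'
    )
    return pattern.sub(lambda match: mapping[match.group(1)], text)


print(apply_mapping("OSAF and OSAFStatusReports", {'OSAF': '[[OSAF]]'}))
```

This does nothing about names that appear inside URLs (a slash counts as a boundary), so that case would still need separate handling.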

["z2016-06-17-SettingUpFlaskAtLinodeForWikiFlux"]

Jun30'2016 omg just realized my new BlogBit URL model means all my old URLs will break! Not Cool! So need to redirect them.

fixed

Nov12'2016 start conversion of content changed since last batch (a year ago!)

modify scrape function to only grab/list pages that were added/changed since last time
run script that generates page_names.txt and pageless_words.txt
manually edit page_names.txt
realize pageless_words.txt is wrong because it's working from subset. So toss it and use the old one (will be missing a little but that's ok) (also note it now has some false entries, for cases where I had a WikiWord but no page last year, but added a page since)
run conversion
fix InterWiki, then ISBN/ASIN
fix images
next: scan other pages for WikiWord substitution. Maybe it's using pages-list for full list of names to match on? Need separate file?

Nov25: work with new meta pages, re-run. Haven't reviewed yet.

Flip switch! ["z2016-11-21-PushWikilogMigrationThanksgiving"]

Nov29: review converted pages

some bad cases where wikiname is inside phrase. Will catch some by hand, move along.
issue with raw https links: fixed manually
ISBN, ASIN, InterWiki
next: images

(Have been hand-finishing some pages to post manually.)

Dec10

fenced blocks
hr tags
next: variation of auto-post code to test for pre-existence, do edit

Dec24

copy over production code to replace local, so can start working on changes to do post-edit
code not working! issue with CommonMark versions, VirtualEnv issue prob.
Dec26: Nope, the case is just different for the HtmlRenderer() call.

Dec28

finish new autopost handler
actually post 2016 backlog of 500+ pages!

Dec29

17102 records in 'nodes'
grr wikigraph db has 17191, and that isn't current
newest raw/full graph file from 2012 had 15664
had 16395 in May'2014 https://twitter.com/BillSeitz/status/466002169930723328
should export list from WikiGraph and see what's missing: copy (select url from pages where space_name='WebSeitzWiki') to '/etc/postgresql/9.1/pages_wikigraph.txt'
ugh the difference is even worse because there are a bunch of cases where there are pages in wikiflux but ''not'' wikigraph, typically because of newer pages that didn't get added to the graph. So that means there are ''more'' MoinMoin pages missing than mere subtraction would imply.
plan: script to
local: list pages only missing from MoinMoin
at BlueHost: scrape that list of pages
then have to convert, autopost, etc.
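The list-the-missing-pages step is basically a set difference over two one-URL-per-line exports. A sketch with hypothetical function names, taking any iterable of lines (for example open file objects):

```python
def read_urls(lines):
    """Collect non-empty, stripped lines into a set."""
    return {line.strip() for line in lines if line.strip()}


def missing_from(source_lines, target_lines):
    """URLs present in the source export but absent from the target export."""
    return sorted(read_urls(source_lines) - read_urls(target_lines))


print(missing_from(['PageA\n', 'PageB\n', 'PageC\n'], ['PageB\n']))
```

Using sets also collapses duplicate lines, which is worth remembering when the raw record counts don't line up.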

Jan11'2017

finally generate diff-list. Oh, not even 20! (Think there was a dupes issue somewhere)

Edited: 2020-09-20 12:22:05.942318

Bill Seitz