(2017-04-15) Updating WikiGraph Db

I haven't updated the WikiGraph db since migrating to WikiFlux, and probably not for a long time before that. (I wasn't doing much blogging for the previous couple of years, because I'd shifted my reading to InstaPaper. But still...)

OK, the local file that gets generated during the update process was last updated Mar23'2016.

WikiFlux became the live site on Nov26'2016.

And the last files from my initial migration are dated Dec10'2016. What are these? Maybe batches of markdown-conversion fixes?

Updating isn't as simple as it sounds

  • short of rescraping every page on the site
    • which I don't feel like doing
      • mainly because of the way I cheated and back-dated pages that I migrated. Then again, I generally did that using dates pulled from BlueHost, so maybe that's ok...
    • actually, since I have all the migrated and InstaPaper-autoposted pages as local CommonMark files, I could scrape those pretty easily.
      • on the other hand, I know that my markup-migration process wasn't perfect, so I'm concerned about screwing up pages unnecessarily...

So, need to...

  • query WikiFlux db for pages created/modified during Mar23-Nov26 (query sketch after this list); scrape local migration files for those pages
  • scrape local migration pages from Nov26-Dec10
  • scrape all my local InstaPaper auto-post pages
  • build list of WikiFlux pages not present at all in WikiGraph; probably need to web-scrape at least some of those
  • maybe also web-scrape WikiFlux pages that were added/changed since Dec10 but not included in the previous InstaPaper list...
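
A minimal sketch of that first date-range query, assuming a SQLite-style connection and a hypothetical pages table with name and modified columns - the real WikiFlux schema and column names may differ:

```python
import sqlite3

# Hypothetical db file and schema; adjust to the real WikiFlux setup.
conn = sqlite3.connect("wikiflux.db")
rows = conn.execute(
    """SELECT name FROM pages
       WHERE modified BETWEEN '2016-03-23' AND '2016-11-26'
       ORDER BY name"""
).fetchall()
for (name,) in rows:
    print(name)  # feed this list to the local-file scrape
```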

Apr16: starting with those first 2 bullets

  • make list file for each, combine (gross total 425, unique 333)
  • look for an example - 2003-05-18-BurkeTerrorists. Where is it?!
  • find it in Trash, in the migrate_posted_real folder, which has 10k items - why isn't it 16-17k?!?
  • decide I'll try to just run against what I have, save any fails and then go find them
  • create forward-links file as used in the past (sketch after this list) - done, and looks like no fails
  • run script on WikiGraph server to insert records - done!
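
Roughly what that forward-links pass looks like, assuming WikiNames are plain CamelCase runs and the output format is one page-TAB-target pair per line - the exact format WikiGraph expects may differ:

```python
import re
from pathlib import Path

# Assumed WikiName pattern: two or more CamelCase chunks.
WIKIWORD = re.compile(r"\b(?:[A-Z][a-z0-9]+){2,}\b")

def forward_links(folder: str, listfile: str, outfile: str) -> None:
    # listfile is the combined unique-333 list from above, one name per line
    wanted = set(Path(listfile).read_text().split())
    with open(outfile, "w") as out:
        for path in sorted(Path(folder).glob("*.md")):
            if path.stem not in wanted:
                continue
            # each distinct WikiName in the page body becomes a forward link
            for target in sorted(set(WIKIWORD.findall(path.read_text()))):
                out.write(f"{path.stem}\t{target}\n")
```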

Apr16: graph-file for InstaPaper posts

  • they're all in 1 folder
  • but every file still starts with the instapaper_article_id
    • for now just use a regex on the final file to strip the prefix (sketch after this list)
  • this process also hits files whose contents I stripped out (because sometimes I combined related pages before auto-posting, or added a new page to an existing one), so in the future should probably skip files of 0 length.
  • insert frontlinks for 82 files. Done.
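
A sketch of both tweaks, assuming the filenames look like instapaper_article_id-PageName.md - the folder name and separator here are guesses:

```python
import re
from pathlib import Path

# Assumed prefix: the numeric instapaper_article_id plus a separator.
PREFIX = re.compile(r"^\d+[-_]")

for path in Path("instapaper_posts").glob("*.md"):  # hypothetical folder
    if path.stat().st_size == 0:
        continue  # emptied file: its contents were merged into another page
    page = PREFIX.sub("", path.stem)
    print(page)  # build the frontlinks entries from this name
```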

Ugh, discover an issue - because I've made WikiFlux case-insensitive for WikiNames, a page can now refer to Facebook and it will link to FaceBook. But when I'm on the FaceBook page, the backlinks query only finds references written as FaceBook, not Facebook. So I need to adjust the code for some case-insensitivity (rough query sketch below).

  • this will probably also be relevant to helping Phil Jones
  • hmm also notice that the Visible Backlinks aren't in alphabetical order, that's odd.
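
Roughly the kind of query adjustment I mean - the mentions table and page_mentions field turn up later in these notes, but the page column name and SQLite flavor here are guesses (the ORDER BY would also address the alphabetical-order oddity):

```python
import sqlite3

conn = sqlite3.connect("wikigraph.db")  # hypothetical db file
rows = conn.execute(
    # compare lower-to-lower so Facebook and FaceBook hit the same rows
    "SELECT page FROM mentions "
    "WHERE LOWER(page_mentions) = LOWER(?) "
    "ORDER BY page",
    ("FaceBook",),
).fetchall()
```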

But before fixing those issues, should finish adding the last 2 buckets of links from above.

Apr22: looking at WikiGraph db

  • realize I already made some of this case-insensitive, because the pages table has a name field which is lower-case and a url field which is CamelCase
    • hrm also noting that the BlogBit pages still have initial z - did I hack around that elsewhere?
      • oh yeah, I let WikiGraph return the initial-z, but when I ask WikiFlux for that page it strips the z and redirects. So it works from that standpoint. But it's time to fix WikiGraph. But after finishing this add-pages process. Though, hrm, just realized I'm at risk for dupes...
  • dump sql list of pages from WikiFlux and WikiGraph
    • WikiFlux has 17261
    • WikiGraph has 17512 (and I'll just regex in editor to strip leading zs) (updated below)
    • which is "interesting" because you'd expect WikiFlux to have more
    • maybe WikiGraph includes "Help" pages from MoinMoin but I didn't migrate those?
    • well, I'll have to make 2 separate comparison lists (sketch after this list)...
    • hmm discover lots of empty (\N) records in WikiGraph
      • gonna have to figure out what's going on in db (later)
      • after stripping those have 16991
      • no equivalent issue in WikiFlux list
      • which at least makes the counts logical, but will still compare both directions
    • wikigraph_not_wikiflux list is just 17, and many are obsolete pages. Saving but taking no action for now.
    • wikiflux_not_wikigraph list is 288
  • running graph-create script, variant I used on the 'gap' list
    • seeing lots of lists repeated across pages!
      • Check the old gap-graph file, don't see it happening there.
      • fix an obvious bug in script, re-run, still have problem
      • look at a page with wrong entries, find local file - ah, it's in a different location
      • this helps me find/fix another bug - now script fails in a logical way
      • add new location to check
      • now runs
      • only get 121 entries (many pages not found at all)
      • and the pages from the new location have bad links - duh, because those were the scraped pages, not the ones converted to CommonMark. So need to find a later version of those. Ah, probably in Trash.
    • digging through Trash
      • looks like I want files from Apr-May'2016
      • hrm, many dupes; the more-recent version has a timestamp in its name
        • ah, those have correct ISBN link
        • easier to take the raw-name files and mass-regex-fix the links? Esp since there are some cases without dupes...
      • there are 30k files in Trash, which is making scrolling painful. Maybe time to play some ls games
      • make an ls file. Strip distant dates, leave Jan-May'2016 files, that's still 29k. Going to sniff around a bit at them...
      • end up leaving the files in Trash, using the ls file as an index
      • still only get 121 matches, but at least they make sense now.
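
The comparison pass promised above - a sketch assuming one page name per line in each dump file (these filenames are made up), dropping the \N rows and stripping the legacy leading z from BlogBit-style names before comparing:

```python
import re

def load(path):
    names = set()
    for line in open(path):
        name = line.strip()
        if not name or name == r"\N":
            continue  # the empty records found in the WikiGraph dump
        # strip the leading z only when it fronts a date, e.g. z2016-01-17-...
        names.add(re.sub(r"^z(?=\d{4}-)", "", name))
    return names

flux = load("wikiflux_pages.txt")    # hypothetical dump filenames
graph = load("wikigraph_pages.txt")
with open("wikigraph_not_wikiflux", "w") as f:
    f.write("\n".join(sorted(graph - flux)))
with open("wikiflux_not_wikigraph", "w") as f:
    f.write("\n".join(sorted(flux - graph)))
```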

Apr23: looking at this list of pages I don't have local contents for

  • hmm, a whole mess of them are InstaPaper pages - I know I added those already, right?
    • check some by hand, yes they're really there
  • go back and look at dump of pages from WikiGraph - last BlogBit page was 2016-01-17-TrumpSupportersAuthoritarian which is ridiculous! Going to go do some db querying.
  • aha! Back when I changed the structure to make it case-insensitive, I must not have changed the insert code, because all these records have a null url field. Do a batch update, then re-export.
  • getting more weird results. OMG, there's a total mix of cases in the WikiGraph db - in some records name is all-lower, in others url is. Major mess; have to go back and figure out how it was supposed to work.
    • ah, my PrivateWiki has a WikiGraph With FreeLinks page, which I'm now copying over.
    • so url is supposed to be the URL (duh), and name is supposed to be (a) stripped of punctuation and spaces (note I haven't done that to my own BlogBit pages, I should change that), and (b) made all-lower-case.
    • so next step is to fix my db (consolidated sketch after this list)
      • fixed case
      • stripped leading z
      • hmm, what if I strip leading dates entirely? Could break uniqueness... (fallback - strip dashes? Pointless but consistent...) Not doing anything yet...
      • in the mentions table, strip the leading z; then make the page_mentions field all-lower, and tweak the code to search lower.
  • fix batch-insert code for future.
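
Those fixes consolidated into one sketch, again assuming SQLite and the table/column names from these notes (which mentions field carries the leading z is a guess, and the punctuation/space-stripping part of the spec is deliberately left out):

```python
import sqlite3

conn = sqlite3.connect("wikigraph.db")  # hypothetical db file
conn.executescript("""
    -- backfill the url field that the old insert code left null
    UPDATE pages SET url = name WHERE url IS NULL;
    -- name is supposed to be the all-lower-case lookup key
    UPDATE pages SET name = LOWER(url);
    -- strip the legacy leading z from BlogBit-style names (z2016-...)
    UPDATE pages SET name = SUBSTR(name, 2), url = SUBSTR(url, 2)
        WHERE name LIKE 'z2%';
    UPDATE mentions SET page_mentions = SUBSTR(page_mentions, 2)
        WHERE page_mentions LIKE 'z2%';
    -- lower-case mention targets so lookups compare lower-to-lower
    UPDATE mentions SET page_mentions = LOWER(page_mentions);
""")
conn.commit()
```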
