InstaPaper

Center of my online reading Highlighting And Annotating process. Working on scraping my highlights to auto-post to the WikiLog.

ReadItLater app/service to scrape (typically longer-form) web articles and save clean-text format for reading later, typically with Data Synch to a Mobile device for Off-Line reading.

Created by Marco Arment. Launched 2008.

For over a year I had the free service, and used the free/lite Insta Fetch client on my Archos70 Tablet.

Sept'2012: started paying for full service, and bought the official Android client for my Archos70. (via Amazon AppStore since Archos didn't get Google store)

The bookmark to save doesn't work well with GitHub Markdown pages - use http://instapaper.com/hello2?url=

  • actually, that seemed to work in the web UI, but didn't synch anything useful to my Android app
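A minimal sketch of building that save link by hand, assuming the target page URL should be percent-encoded before it is appended (the `save_url` helper name is made up for illustration, and this uses Python 3's `urllib.parse` rather than the Python 2 of the notes below):

```python
from urllib.parse import quote

# Base URL taken from the note above.
INSTAPAPER_SAVE = "http://instapaper.com/hello2?url="

def save_url(page_url):
    """Return the Instapaper save link for page_url (hypothetical helper)."""
    # safe="" also encodes ":" and "/", so the target cannot be confused
    # with the query string of the Instapaper URL itself.
    return INSTAPAPER_SAVE + quote(page_url, safe="")

print(save_url("https://github.com/user/repo/blob/master/README.md"))
```

As noted, saving this way seemed to work in the web UI only, so this is a convenience for that path rather than a fix for the sync problem.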

Dec'2013: Bought Nexus Seven. Everything migrated over nicely.

Nov'2014: frustrated that Amazon-store version is still stuck at v2.9.2, while official version is up to v4.2. They say if I just install the new app, it will synch up so I won't lose any of my inventory. Update: wow, it was so fast, I think the old local data file must have been left behind for re-use?


insta_repost

Sept'2015: would like to use their API to suck down items I've Liked/Starred/Hearted, so I can post them to my WikiLog. I want to scrape my highlights, the original piece title, and maybe even the first sentence/paragraph to quote.

The Python instapaperlib package that comes up most obviously uses the "simple" API which offers little except adding a URL to your account.

The instacache code uses the "full" API, so I'm going to just steal pieces of that to create my own stuff. I want to store my stuff in files so I can browse/edit them more easily before

Oct07: applying for API OAuth token.

Oct09: stepping through code one line at a time

  • when I get to client.request("%s/oauth/access_token" I get an error, which seems to be related to linking in an old version of OpenSSL.
  • python -c "import ssl; print ssl.OPENSSL_VERSION" from here gives me OpenSSL 0.9.7l 28 Sep 2006 which is bad
  • going to try updating OpenSSL
    • but brew update gives
xcrun: error: active developer path ("/Applications/Xcode.app/Contents/Developer") does not exist, use xcode-select to change
Error: Failure while executing: git init 
  • update: ended up re-installing Xcode, which fixed some issues
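The one-liner above can be turned into a small check that runs before any API call. This is only a sketch in Python 3 syntax: the 1.0.1 threshold is an assumption (that is the release where OpenSSL gained TLS 1.2), not something the API documents, and `openssl_is_recent` is a made-up helper name.

```python
import ssl

# Assumed threshold: TLS 1.2 support arrived in OpenSSL 1.0.1.
MIN_OPENSSL = (1, 0, 1)

def openssl_is_recent(version_info=None, minimum=MIN_OPENSSL):
    """True if the linked OpenSSL is at least `minimum` (major, minor, fix)."""
    if version_info is None:
        # Tuple of five ints: (major, minor, fix, patch, status).
        version_info = ssl.OPENSSL_VERSION_INFO
    return tuple(version_info[:3]) >= minimum

print(ssl.OPENSSL_VERSION)
print("recent enough:", openssl_is_recent())
```

Running it inside each VirtualEnv shows which interpreter is linked against the old library.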

Nov09: nudging forward with VirtualEnv for Flask

  • typing piece at a time, now get past resp, token = client.request(...) successfully, so have token value
  • but everything I try to do after that gives me 403 response
  • ah, looks like need to create new Client instance with the token
  • and need to create the token object rather than just passing in the token value?
  • yep that works
  • can ask for folder_id='starred' in payload to get the starred/liked/hearted items
  • results look like:
'[
{"type": "meta"}, 

{"username": "fluxent@gmail.com", "user_id": 1761795, "type": "user", "subscription_is_active": "1"}, 

{"hash": "pzvgEV0W", "description": "", "bookmark_id": 653717723, "private_source": "", "title": "Open issues: lessons learned building an open source business", "url": "http://werd.io/2015/open-issues-lessons-learned-building-an-open-source-business", "progress_timestamp": 1447170795, "time": 1447162969, "progress": 0.966090679169, "starred": "1", "type": "bookmark"} ]'
  • trying to get the first-line of the original text can include lots of junk, and there's no clear end to the line except maybe </p>\n
<figure><a href="https://www.flickr.com/photos/kid_pro_quo/243281786"><img alt="South Park" src="https://farm1.staticflickr.com/92/243281786_d03baeab9d_z.jpg?zz=1"/></a><figcaption><a href="https://www.flickr.com/photos/kid_pro_quo/243281786">South Park</a></figcaption></figure>\n<p><strong><em>Prologue</em>:</strong></p>\n
  • getting the highlights for a bookmark gives a list for which one entry looks like:
{"highlight_id": 1816475, "text": "we didn\'t know how we were going to pay rent, and growth was linear. For a project, we were doing well. For a company, we weren\'t doing well - and there were still only two of us", "bookmark_id": 653717723, "time": 1446956770, "position": 0, "type": "highlight"}, 
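The token steps and the shape of the results can be sketched roughly as follows. Assumptions: the access-token call returns a body of the form `oauth_token=...&oauth_token_secret=...`, the client is the third-party python-oauth2 package (imported as `oauth2`), and all three function names are made up for illustration. It is written for Python 3, unlike the Python 2 session above.

```python
import json
from urllib.parse import parse_qs

def parse_access_token(body):
    """Split an 'oauth_token=...&oauth_token_secret=...' body into (key, secret)."""
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    fields = parse_qs(body)
    return fields["oauth_token"][0], fields["oauth_token_secret"][0]

def make_authed_client(consumer_key, consumer_secret, token_body):
    """Build a client that signs requests with the user's token.

    Assumes the third-party python-oauth2 package (imported as oauth2).
    """
    import oauth2  # imported lazily so the parsing helpers work without it
    key, secret = parse_access_token(token_body)
    consumer = oauth2.Consumer(consumer_key, consumer_secret)
    token = oauth2.Token(key, secret)      # a Token object, not the raw string
    return oauth2.Client(consumer, token)  # a new Client that carries the token

def split_by_type(response_body):
    """Group the flat JSON list from a listing call by each item's 'type'."""
    groups = {}
    for item in json.loads(response_body):
        groups.setdefault(item["type"], []).append(item)
    return groups
```

With the results shown above, `split_by_type(body)["bookmark"]` would hold the starred items and `["user"]` the account record.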

Jan21'2017: being up-and-running with WikiFlux for a month now, getting back to this.

  • argh getting the same old-SSL error noted Oct09 above! Suspect VirtualEnv issue... derp yeah have to use my Flask VirtualEnv instead of WebPy one. Then print ssl.OPENSSL_VERSION gives OpenSSL 0.9.8za 5 Jun 2014

Jan22

  • turn all script bits into working code, as far as getting list
  • ToDo: handle UniCode content, decide when to turn to ascii (get unidecode package), when to allow
  • hrm time attribute reflects when I bookmarked it, not when page was written
    • accept/embrace that?
    • double-check that page-get-text function doesn't have anything - confirmed
    • scrape head of original source for metatag? use urllib2.response.info().dict['last-modified'] - but how often will it be there? Will try this first.
    • in some cases can parse url, but yuck
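One way to express the "try the header first, otherwise accept the save time" idea is sketched below. It assumes the response headers are available as a plain mapping and that the bookmark's `time` field is a Unix timestamp in seconds; `article_date` is a hypothetical helper, written for Python 3.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def article_date(headers, saved_epoch):
    """Prefer the Last-Modified header; fall back to the bookmark's save time."""
    # Header names are case-insensitive, so normalise the keys first.
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("last-modified")
    if raw:
        try:
            return parsedate_to_datetime(raw).date()
        except (TypeError, ValueError):
            pass  # malformed header: fall through to the save time
    return datetime.fromtimestamp(saved_epoch, tz=timezone.utc).date()
```

Note that Last-Modified reflects when the server last changed the page, which is not necessarily when the article was written.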

Jan23: dealing with unicode before time

  • thought if I just stripped unicode from chars going into url then I'd be ok
  • but can't just use file.write() with unicode either, sigh have to dig in
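One way around the `file.write()` problem is to open the file with an explicit encoding, so the text never goes through an implicit ASCII conversion. A minimal sketch, assuming the title is already a unicode string (`write_title` is a made-up helper name):

```python
import io

def write_title(path, title):
    """Write a unicode title to path as UTF-8 text."""
    # io.open accepts an encoding on both Python 2 and Python 3; on
    # Python 2 the value written must be a unicode object, not a str.
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(title)
        f.write(u"\n")
```

The alternative is to keep a plain `open()` and encode each string by hand before writing it.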

Feb18

  • UniCode/UTF8 solution, just f.write(title.encode("utf-8")) - seems to work
  • now getting HTTP error 403 on one article when grabbing the original to get the date MetaTag. Can grab target URL with browser fine. I wonder if it's blocking unknown agents?
  • grab it with CUrl and see it's giving a 301 redirect from http to https. Just putting this inside the try for now, let it skip these cases. Now grabbing all successfully. Next: grab the highlights.
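The "put it inside the try and skip" approach might look like the sketch below, in Python 3's `urllib` rather than `urllib2`. The browser-like User-Agent is an untested guess at the 403 (the notes only wonder about agent blocking), the `opener` parameter exists so the function can be exercised without a network, and `fetch_headers` is a hypothetical name.

```python
import urllib.error
import urllib.request

def fetch_headers(url, opener=urllib.request.urlopen, timeout=10):
    """Return the response headers for url, or None if the fetch fails."""
    # Some sites reject the default Python-urllib agent, so send a
    # browser-like User-Agent (the value here is an arbitrary example).
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    try:
        response = opener(request, timeout=timeout)
    except (urllib.error.HTTPError, urllib.error.URLError):
        return None  # skip this article rather than abort the whole run
    try:
        return dict(response.headers.items())
    finally:
        response.close()
```

A `None` result simply means "no date available from the source", so the caller can fall back to the save time.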

Mar04

  • derp somehow I hadn't updated some variables properly, so I had still been using some old API bits! (And/or they made a recent change.)
  • So now the highlights come along in the main initial call.
  • Interestingly, they aren't children of the bookmarked articles, but just a parallel flat list.
  • Frustratingly, it looks like highlight position is rarely used anymore! Which means if I scroll back in an article to add a highlight, it will be in time-order rather than article-position-order in my excerpts. Ah well.
  • Almost never getting date MetaTag on original articles, and noticing how few original URLs have full date in them, so I think I'm going to have to suck it up and use the save-date... later.
  • Starting to modify my autopost code to handle these posts...
  • Done! Auto-posted 10 pages.
  • Next - delete them from InstaPaper. Done!
  • Status: WikiLog has 17,135 nodes. (hmm don't see a way to get my count of Liked items in InstaPaper)
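Because the highlights arrive as a flat list alongside the bookmarks, they have to be matched up by `bookmark_id`. A minimal sketch, assuming `items` is the decoded JSON list and each highlight carries the fields shown in the earlier sample (`highlights_by_bookmark` is a made-up name):

```python
from collections import defaultdict

def highlights_by_bookmark(items):
    """Map bookmark_id -> list of highlight texts, ordered by position then time."""
    grouped = defaultdict(list)
    for item in items:
        if item.get("type") == "highlight":
            grouped[item["bookmark_id"]].append(item)
    result = {}
    for bookmark_id, highlights in grouped.items():
        # position is often 0 for every highlight, in which case the sort
        # degrades to plain time order, as described above.
        highlights.sort(key=lambda h: (h.get("position", 0), h.get("time", 0)))
        result[bookmark_id] = [h["text"] for h in highlights]
    return result
```

Sorting on `(position, time)` keeps article order whenever positions are present and costs nothing when they are not.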
