Personal Web Archive

System that caches everything you read on the web (maybe you have to hit a Save button, or maybe it defaults to saving, and you hit an Ignore button; or maybe it only saves what you blog about).

see also WikiProxy.

I first read about this from Robot Wisdom in Dec'2000: You certainly, immediately, want it to start archiving everything you read on the Net, for future reference. You want this all to be word-indexed, like a generic search-engine but entirely 'local'. You want all the good stuff to be sorted by topic, into your own personal Yahoo/DMoz, reflecting your priorities... and you want it to watch and alert you whenever a good webpage you've archived/mirrored is updated. When you think a whole website looks good, you probably want to mirror the whole site, so that your future local searches will find whatever it offers on that topic.

Building your personal Yahoo should center around creating custom topic-pages, which you may or may not publish to the Web. These will be like notebook pages that you continually revise as you study the topic-- accumulating links to web originals, to your local mirrors as well, and annotating these for your own reference.
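
The "archive everything, and alert me when an archived page is updated" idea can be sketched in a few lines. This is only an illustration, not anything Robot Wisdom specified: the `PageArchive` class, its in-memory dict, and the use of a SHA-256 digest to notice changes are all assumptions made for the example. A real tool would fetch the pages itself and store them on disk.

```python
import hashlib
import time


class PageArchive:
    """Hypothetical in-memory stand-in for a local archive, keyed by URL."""

    def __init__(self):
        # url -> {"text": ..., "digest": ..., "saved_at": ...}
        self.pages = {}

    def save(self, url, text):
        """Store a page and report 'new', 'updated' or 'unchanged'."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        previous = self.pages.get(url)
        if previous is not None and previous["digest"] == digest:
            # Same content as last time: nothing to alert about.
            return "unchanged"
        self.pages[url] = {
            "text": text,
            "digest": digest,
            "saved_at": time.time(),
        }
        return "new" if previous is None else "updated"
```

Calling `save()` again with the text of a re-fetched page returns `"updated"` only when the content differs, which is the point at which an alert would fire.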

If you're going to save all this, are you going to want a half-decent (boolean, relevance-ranking) search engine?
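
As a rough idea of what "half-decent" might mean at its smallest, here is a sketch of a local word index with boolean AND matching and a crude relevance ranking. The `LocalIndex` class and the scoring rule (sum of query-word counts per page) are assumptions chosen for brevity; a real engine would use something like TF-IDF and also handle OR/NOT, phrases, and stemming.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LocalIndex:
    """Hypothetical inverted index: word -> {url: occurrences}."""

    def __init__(self):
        self.postings = defaultdict(Counter)

    def add(self, url, text):
        # Note: re-adding a URL does not remove words that have since
        # disappeared from the page; a real index would handle that.
        for word, count in Counter(tokenize(text)).items():
            self.postings[word][url] = count

    def search(self, query):
        """Return URLs containing every query word, best match first."""
        words = tokenize(query)
        if not words:
            return []
        # Boolean AND: keep only pages that contain all of the words.
        candidates = [set(self.postings.get(w, {})) for w in words]
        matches = set.intersection(*candidates)
        # Crude relevance: total occurrences of the query words.
        scores = {
            url: sum(self.postings[w][url] for w in words)
            for url in matches
        }
        return sorted(scores, key=lambda u: (-scores[u], u))
```

Desktop files or exported email could be fed through the same `add()` call, with a file path or message ID standing in for the URL.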

If so, are you going to want to search your desktop files as well? And your email which is locked up in MsOutlook?

Hmm, this also gets back to the issue of Home Server vs Hosted Server, in terms of where this sucker is stored. On a Hosted Server you're probably more constrained on storage space. And the performance of a proxy could get annoying, since every page request makes an extra network hop. Though if you really just cache stuff that you blog, that might be OK.


Les Orchard has considered using Google's API to hit its cache for items you can no longer hit directly, so you wouldn't need to save your own copy. Of course, that assumes that the old item doesn't get purged from the cache, and I don't think that assumption holds over a longish period of time. I also don't think it will help with the cases that are most frequently annoying: when a site slides its archives into a Paid black hole (like the Ny Times and Time Mag). The other common case is when a site switches CMSs and its URLs change: the new site will get indexed, Google will see the old URLs no longer work, and they'll get purged.


The Galeon browser has a feature like the "personal yahoo" Barger wanted. Its "smart bookmarks" basically show you where you've been lately, along with all the sites you've bookmarked. Also, I understand that Aaron Swartz wrote a personal web archive for his own use. -- Luke Francl

