Web Robot

Any piece of code which automatically sends HTTP requests.

Reasons to have this:

  • grab contents of non-local pages to populate a Search Engine

  • grab pages for Off-Line browsing

  • scan a list of pages to find which ones have changed (to then alert the robot's user)

  • does an RssAggregator count? It probably should, especially if it also grabs non-RSS URLs based on RSS contents.

There are Game Rule-s for robots to follow (Robot Exclusion):

  • http://www.robotstxt.org/wc/faq.html

  • Robots Txt file

    • the problem with this is that it defines the protected URL-space hierarchically. That can be sufficient for content-type URLs, but is less likely to work for more dynamic pages (that's a poor distinction, but I won't clarify it for now)

    • another case where it's a problem - maybe you want to block most agents (like RssAggregator-s) from grabbing your full content, but let them take your RSS file.

  • RobotsMetaTags (No Index, No Follow)

    • the problem is that the robot has already hit that given page, which is bad if such a page is processor-intensive
  • ?

Some additional rules I think are appropriate:

  • a single required substring ("robot"?) to include somewhere in the User Agent HTTP param, so that a dynamic server can take that into account without having to try and maintain a list of various robot agent names

  • an HREF No Follow attribute similar to the RobotsMetaTags, so that a given page could have some HREFs in it which a robot is welcome to follow, while excluding others.

  • Site List Txt URL to list the URLs on a site in reverse-mod-date order (most recently changed at top), with a last-mod-date-time attribute included. This allows a robot to avoid grabbing pages which haven't changed since its last visit. (Needed detail to handle - dead URLs, so they can be removed from a Search Engine.)

  • some way for a robot to recognize that multiple hosts/domains might be served by the same piece of hardware, so it can cap its request-rate at the hardware level. (Dave Winer has historically complained about this.)

    • in general, some way for a host to request limits on how often its content gets grabbed by any given robot - this is particularly an issue with RssAggregator-s, where some people want to grab every 5 minutes
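The required-substring rule above is simple enough to sketch from the server's side. This assumes the proposed convention ("robot" somewhere in the User Agent value), which is not an actual standard; the agent strings are illustrative:

```python
# Sketch of the proposed convention: a dynamic server checks for one
# agreed-on substring in the User-Agent header, instead of maintaining
# a list of individual robot agent names. "robot" is the proposed
# (hypothetical) required substring.
def is_robot(user_agent: str, required_substring: str = "robot") -> bool:
    """Return True if the client identifies itself as a robot."""
    return required_substring in user_agent.lower()

print(is_robot("ExampleRobot/1.2 (+http://example.com/bot)"))  # identifies as robot
print(is_robot("Mozilla/5.0 (X11; Linux x86_64)"))             # ordinary browser
```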
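A robot's side of the Site List Txt idea might look like the sketch below. The file syntax (URL, tab, ISO last-mod timestamp, newest first) is an assumption, since the proposal leaves the format open:

```python
# Hedged sketch of consuming a hypothetical Site List Txt file.
# Assumed format: one "URL <tab> ISO-timestamp" per line, in
# reverse-mod-date order (most recently changed at top).
from datetime import datetime

SAMPLE_SITELIST = """\
http://example.com/blog/2003/06/30\t2003-06-30T09:15:00
http://example.com/index.rss\t2003-06-29T23:00:00
http://example.com/about\t2003-01-05T12:00:00
"""

def urls_changed_since(sitelist: str, last_visit: datetime) -> list:
    """Return URLs modified after the robot's last visit.
    Stops at the first stale entry, since the list is newest-first."""
    changed = []
    for line in sitelist.splitlines():
        url, stamp = line.split("\t")
        if datetime.fromisoformat(stamp) <= last_visit:
            break  # everything below this line is older still
        changed.append(url)
    return changed

print(urls_changed_since(SAMPLE_SITELIST, datetime(2003, 6, 29)))
# → ['http://example.com/blog/2003/06/30', 'http://example.com/index.rss']
```

The newest-first ordering is what lets the robot stop reading early instead of scanning every URL; handling dead URLs (so a Search Engine can drop them) would need an extra marker in the format.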

