Web Robot
Any piece of code which automatically sends HTTP requests.
Reasons to have this:
- grab contents of non-local pages to populate a Search Engine
- grab pages for Off-Line browsing
- scan a list of pages to find which ones have changed (to then alert the robot's user) - a change-detection sketch follows this list
- does an RssAggregator count? It probably should, esp if it also grabs non-RSS URLs based on RSS contents.
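The change-detection use case can be handled politely with conditional GET requests, so unchanged pages cost almost nothing to check. Below is a minimal sketch using only the Python standard library; the URLs, the User Agent string, and the in-memory cache are placeholders, not part of any standard.

```python
# Minimal change-detection sketch: poll a list of pages and report which ones
# have changed since the last visit, using conditional GET (If-Modified-Since).
# The URLs below are placeholders; a real robot would persist the dates to disk.
import urllib.request
import urllib.error

WATCHED = {
    # url -> Last-Modified value from the previous fetch (None = never fetched)
    "http://example.com/page1.html": None,
    "http://example.com/page2.html": "Tue, 01 Apr 2003 10:00:00 GMT",
}

def check_for_changes(watched):
    changed = []
    for url, last_seen in watched.items():
        req = urllib.request.Request(url, headers={"User-Agent": "ExampleRobot/0.1"})
        if last_seen:
            req.add_header("If-Modified-Since", last_seen)
        try:
            with urllib.request.urlopen(req) as resp:
                # 200 means the server sent fresh content
                watched[url] = resp.headers.get("Last-Modified")
                changed.append(url)
        except urllib.error.HTTPError as err:
            if err.code == 304:      # Not Modified: nothing to alert on
                continue
            raise
    return changed

if __name__ == "__main__":
    print(check_for_changes(WATCHED))
```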
There are Game Rule-s for robots to follow (Robot Exclusion):
- Robots Txt file
  - the problem with this is that it defines the protected URL-space hierarchically, which can be sufficient for content-type URLs but is less likely to work for more dynamic pages (that's a poor distinction, but I won't clarify for now)
  - another case where it's a problem: maybe you want to block most agents (like RssAggregator-s) from grabbing your full content, but let them take your RSS file
- RobotsMetaTags (No Index, No Follow)
  - the problem is that the robot has already hit that given page, which is bad if such a page is processor-intensive. (A fetch sketch honoring both mechanisms follows this list.)
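For concreteness, here is a rough sketch (Python standard library only) of a robot honoring both mechanisms: it checks the Robots Txt file before requesting a page at all, then looks for RobotsMetaTags in whatever it does fetch. The robot name and URLs are made up for illustration.

```python
import urllib.request
import urllib.robotparser
from html.parser import HTMLParser
from urllib.parse import urlparse

AGENT = "ExampleRobot/0.1 (+http://example.com/robot-info)"   # made-up robot name

def allowed_by_robots_txt(url, agent=AGENT):
    """Check the site-wide Robots Txt file before requesting the page at all."""
    parts = urlparse(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(agent, url)

class RobotsMetaParser(HTMLParser):
    """Spot <meta name="robots" content="noindex,nofollow"> in a fetched page."""
    def __init__(self):
        super().__init__()
        self.noindex = False
        self.nofollow = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = (attrs.get("content") or "").lower()
            self.noindex = "noindex" in content
            self.nofollow = "nofollow" in content

def fetch_politely(url):
    if not allowed_by_robots_txt(url):
        return None                       # excluded before any request is sent
    req = urllib.request.Request(url, headers={"User-Agent": AGENT})
    with urllib.request.urlopen(req) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    meta = RobotsMetaParser()
    meta.feed(html)
    # The drawback noted above: the page has already been hit by the time
    # noindex/nofollow is discovered.
    return None if meta.noindex else html
```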
Some additional rules I think are appropriate:
- a single required substring ("robot"?) to include somewhere in the User Agent HTTP header, so that a dynamic server can take that into account without having to try to maintain a list of various robot agent names (server-side sketch below)
- an HREF No Follow attribute similar to the RobotsMetaTags, so that a given page could have some HREFs in it which a robot is welcome to follow, while excluding others (link-filtering sketch below)
- a Site List Txt URL to list the URLs on a site in reverse-mod-date order (most recently changed at top), with a last-mod-date-time attribute included. This allows a robot to avoid grabbing pages which haven't changed since its last visit. (Needed detail to handle: dead URLs, so they can be removed from a Search Engine.) (Consumption sketch below.)
  - This was proposed by InfoSeek years ago.
  - Dave Winer suggested a similar mechanism.
- some way for a robot to recognize that multiple hosts/domains might be served by the same piece of hardware, so it can cap its request rate at the hardware level. (Dave Winer has historically complained about this.) (Rate-cap sketch below.)
- in general, some way for a host to request limits on how often its content gets grabbed by any given robot - this is particularly an issue with RssAggregator-s, where some people want to grab every 5 minutes
  - or is the real solution there some better sort of Publish And Subscribe model?
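A sketch of the first rule from the server's side: a dynamic page decides what to serve based on whether the User Agent string contains the required substring. The substring and the two render functions are purely illustrative.

```python
# Server-side sketch of the "required substring" rule: a dynamic page decides
# whether to serve full output based on whether the User Agent identifies
# itself as a robot. The substring and responses are illustrative only.
def serve_page(user_agent: str) -> str:
    if "robot" in (user_agent or "").lower():
        # Cheap response for any self-identified robot, without maintaining
        # a list of individual agent names.
        return render_summary_for_robots()
    return render_full_dynamic_page()

def render_summary_for_robots() -> str:
    return "<html><body>Summary view for robots.</body></html>"

def render_full_dynamic_page() -> str:
    return "<html><body>Full, processor-intensive dynamic view.</body></html>"
```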
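For the per-HREF No Follow idea, here's a rough link-extraction sketch. The original proposal doesn't name the attribute; this assumes the rel="nofollow" form, so treat the attribute name as an assumption.

```python
# Sketch of a crawler respecting a per-link "no follow" marker, assumed here
# to be spelled rel="nofollow" on the anchor tag.
from html.parser import HTMLParser
from urllib.parse import urljoin

class FollowableLinkExtractor(HTMLParser):
    """Collect hrefs from a page, skipping links the author marked nofollow."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        if "nofollow" in (attrs.get("rel") or "").lower():
            return                      # the page author excluded this link
        href = attrs.get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))

# Example usage with an inline page; a real robot would feed fetched HTML.
page = '<a href="/public">ok</a> <a rel="nofollow" href="/private">skip</a>'
extractor = FollowableLinkExtractor("http://example.com/")
extractor.feed(page)
print(extractor.links)   # ['http://example.com/public']
```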
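The Site List Txt idea might look something like this from the robot's side. The file name, line format, and timestamps are invented for illustration; nothing here is an actual published standard.

```python
# Sketch of a robot consuming a hypothetical Site List Txt file: one URL per
# line with an ISO-8601 last-modified timestamp, most recently changed first.
from datetime import datetime, timezone

SAMPLE_SITE_LIST = """\
http://example.com/blog/2003-04-02  2003-04-02T08:15:00+00:00
http://example.com/blog/2003-03-28  2003-03-28T21:02:00+00:00
http://example.com/about            2002-11-10T12:00:00+00:00
"""

def urls_changed_since(site_list_text, last_visit):
    """Return the URLs modified after the robot's previous visit."""
    changed = []
    for line in site_list_text.splitlines():
        url, stamp = line.split()
        modified = datetime.fromisoformat(stamp)
        if modified <= last_visit:
            break          # list is sorted newest-first, so we can stop here
        changed.append(url)
    return changed

if __name__ == "__main__":
    last_visit = datetime(2003, 3, 30, tzinfo=timezone.utc)
    print(urls_changed_since(SAMPLE_SITE_LIST, last_visit))
```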
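And a sketch of capping the request rate per piece of hardware rather than per domain, by grouping hosts that resolve to the same IP address. The interval and the IP-equals-hardware equivalence are both rough simplifications (virtual hosting and load balancing blur it), not a definitive implementation.

```python
# Cap request rate per physical host rather than per domain name: hostnames
# that resolve to the same IP address share one rate limit.
import socket
import time
from urllib.parse import urlparse

MIN_INTERVAL = 30.0            # seconds between requests to the same IP (arbitrary)
_last_hit_by_ip = {}

def wait_for_turn(url):
    """Block until it is polite to send the next request to url's server."""
    host = urlparse(url).hostname
    ip = socket.gethostbyname(host)          # many domains may map to one IP
    elapsed = time.monotonic() - _last_hit_by_ip.get(ip, 0.0)
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_hit_by_ip[ip] = time.monotonic()
    return ip
```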