| WebSeitz/wikilog |
| Web Robot |
| last edited by BillSeitz on Jun 30, 2008 10:05 pm |
Any piece of code which automatically sends HTTP requests.
Reasons to have this:
grab contents of non-local pages to populate a Search Engine
grab pages for OffLine browsing
scan a list of pages to find which ones have changed (to then alert the robot's user)
does an Rss Aggregator count? It probably should, especially if it also grabs non-RSS [URLs] based on RSS contents.
There are Game Rule-s for robots to follow ([Robot Exclusion]):
the problem with this is that it's hierarchical in how it defines protected [URL]-space. That can be sufficient for content-type [URLs], but is less likely to work for more dynamic pages (that's a poor distinction, but I won't clarify for now)
another case where it's a problem - maybe you want to block most agents (like Rss Aggregator-s) from grabbing your full content, but let them take your RSS file.
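A minimal sketch of the "block the content, allow the RSS file" case, using Python's standard urllib.robotparser. The robots.txt text, the robot name "ExampleRobot" and the example.com URLs are made up for illustration. Note that the Allow line is an extension that was not part of the original exclusion convention, so this only works with robots that understand it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: let robots take the RSS file, nothing else.
# Rules are path prefixes, which is the hierarchical limitation noted above.
ROBOTS_TXT = """\
User-agent: *
Allow: /rss.xml
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# The RSS file matches the Allow prefix before the blanket Disallow.
print(parser.can_fetch("ExampleRobot", "http://example.com/rss.xml"))

# Everything else falls under "Disallow: /".
print(parser.can_fetch("ExampleRobot", "http://example.com/wiki/WebRobot"))
```

Because matching is by path prefix, this kind of rule is easy to write when the feed lives at one fixed path and much harder when the pages to protect are scattered across dynamically generated [URLs].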
[Robots Meta Tags] ([NoIndex], [No Follow])
the problem is that the robot has already hit that given page, which is bad if such a page is processor-intensive
?
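To make the [Robots Meta Tags] problem concrete, here is a rough sketch of how a robot might read those tags with Python's standard html.parser. The class and function names are hypothetical. The point to notice is that the function needs the page's HTML as input, so the server has already paid the cost of generating the page before the robot learns it was not wanted.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated and case-insensitive.
        for part in content.split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html: str) -> set:
    """Return the robots meta directives found in an already-fetched page."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


page = '<html><head><meta name="ROBOTS" content="NoIndex, NoFollow"></head></html>'
found = robots_directives(page)
print("noindex" in found, "nofollow" in found)
```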
Some additional rules I think are appropriate:
a single required substring ("robot"?) to include somewhere in the User-Agent HTTP header, so that a dynamic server can take that into account without having to try and maintain a list of various robot agent names
an HREF [No Follow] attribute similar to the [Robots Meta Tags], so that a given page could have some [HREFs] in it which a robot is welcome to follow, while excluding others.
Site List Txt [URL] to list the [URLs] on a site in reverse-mod-date order (most recently changed at top), with a last-mod-date-time attribute included. This allows a robot to avoid grabbing pages which haven't changed since its last visit. (Needed detail to handle - dead [URLs], so they can be removed from a Search Engine.)
This was proposed by Info Seek years ago.
Dave Winer suggested a similar mechanism.
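A sketch of how a robot might use such a listing. The file format here is purely an assumption (one line per page: the URL, whitespace, then an ISO 8601 last-modified time), as are the example URLs and the function name. Since the list is newest-first, the robot can stop reading at the first entry that is not newer than its last visit.

```python
from datetime import datetime

# Hypothetical Site List Txt contents, most recently changed page first.
SITE_LIST = """\
http://example.com/WebRobot 2008-06-30T22:05:00
http://example.com/SiteListTxt 2008-06-12T09:30:00
http://example.com/WebSpider 2008-05-01T14:00:00
"""


def changed_since(site_list: str, last_visit: datetime) -> list:
    """Return the URLs modified after last_visit, newest first."""
    changed = []
    for line in site_list.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace: URL on the left, time on the right.
        url, stamp = line.rsplit(None, 1)
        if datetime.fromisoformat(stamp) <= last_visit:
            # Reverse-mod-date order means every later line is older still.
            break
        changed.append(url)
    return changed


for url in changed_since(SITE_LIST, datetime(2008, 6, 1)):
    print(url)
```

This does not address the dead-[URL] detail mentioned above; a robot would still need some way to learn that a page it indexed earlier has gone away.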
some way for a robot to recognize that multiple hosts/domains might be served by the same piece of hardware, so it can cap its request-rate at the hardware level. (Dave Winer has historically complained about this.)
in general, some way for a host to request limits on how often its content gets grabbed by any given robot - this is particularly an issue with Rss Aggregator-s, where some people want to grab every 5 minutes
or is the real solution there some better sort of Publish And Subscribe model?
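On the robot side, one rough way to approximate a hardware-level cap is to key the request-rate limit on the resolved IP address rather than on the hostname. This is only a sketch under that assumption: a shared IP does not strictly mean shared hardware (and one host can sit behind several machines), and the class name, the interval, and the hostnames below are all invented for illustration.

```python
import socket
import time


class HardwareThrottle:
    """Spaces out requests per resolved IP, so virtual hosts share one budget."""

    def __init__(self, min_interval, resolve=socket.gethostbyname, clock=time.monotonic):
        self.min_interval = min_interval  # seconds between requests to one IP
        self.resolve = resolve            # hostname -> IP string
        self.clock = clock                # current time in seconds
        self.next_allowed = {}            # IP -> earliest time of the next request

    def wait_time(self, host):
        """Reserve the next slot for host and return how long to wait for it."""
        ip = self.resolve(host)
        now = self.clock()
        start = max(now, self.next_allowed.get(ip, now))
        self.next_allowed[ip] = start + self.min_interval
        return start - now


# Demo with a fake resolver and a frozen clock so no network access is needed.
fake_dns = {"a.example": "10.0.0.1", "b.example": "10.0.0.1", "c.example": "10.0.0.2"}
throttle = HardwareThrottle(5.0, resolve=fake_dns.get, clock=lambda: 100.0)

for host in ["a.example", "b.example", "c.example"]:
    print(host, throttle.wait_time(host))
```

In the demo, a.example and b.example resolve to the same address and so share one budget, while c.example gets its own. This only covers the robot's self-restraint; it gives the host no way to state the limit it wants, which is the harder half of the problem.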