lib::WWW::RobotRules(3)   User Contributed Perl Documentation   lib::WWW::RobotRules(3)

NAME
       WWW::RobotRules - Parse robots.txt files

SYNOPSIS
	require WWW::RobotRules;
	my $robotsrules = new WWW::RobotRules 'MOMspider/1.0';

	use LWP::Simple qw(get);

	my $url = "http://some.place/robots.txt";
	my $robots_txt = get $url;
	$robotsrules->parse($url, $robots_txt);

	$url = "http://some.other.place/robots.txt";
	$robots_txt = get $url;
	$robotsrules->parse($url, $robots_txt);

	# Now we are able to check if a URL is valid for those servers
	# whose "robots.txt" files we have obtained and parsed.
	if ($robotsrules->allowed($url)) {
	    my $c = get $url;
	    ...
	}

DESCRIPTION
       This module parses a robots.txt file as specified in "A
       Standard for Robot Exclusion", described in
       <URL:http://info.webcrawler.com/mak/projects/robots/norobots.html>.
       Webmasters can use the robots.txt file to disallow
       conforming robots access to parts of their WWW server.

       The parsed file is kept in the WWW::RobotRules object, and
       this object provides methods to check if access to a given
       URL is prohibited.  The same WWW::RobotRules object can
       parse multiple robots.txt files.
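
       For illustration, rules for two different servers can be
       parsed into the same object and queried later.  The host
       names, robot name and rules below are invented; the
       robots.txt content can come from any source:

	 use WWW::RobotRules;

	 my $rules = new WWW::RobotRules 'MyRobot/1.0';

	 # rules for two servers, kept in the same object
	 $rules->parse("http://www.example.com/robots.txt",
		       "User-agent: *\nDisallow: /private/\n");
	 $rules->parse("http://www.example.org/robots.txt",
		       "User-agent: *\nDisallow:\n");

	 $rules->allowed("http://www.example.com/private/a.html"); # FALSE
	 $rules->allowed("http://www.example.com/public/a.html");  # TRUE
	 $rules->allowed("http://www.example.org/anything.html");  # TRUE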

METHODS
       $rules = new WWW::RobotRules 'MOMspider/1.0'

       This is the constructor for WWW::RobotRules objects.  The
       first argument given to new() is the name of the robot.

       $rules->parse($url, $content, $fresh_until)

       The parse() method takes as arguments the URL that was
       used to retrieve the /robots.txt file, and the contents of
       the file.

       $rules->allowed($url)

       Returns TRUE if this robot is allowed to retrieve this
       URL.

       $rules->agent([$name])

       Get/set the agent name. NOTE: Changing the agent name will
       clear the robots.txt rules and expire times out of the
       cache.
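
       For example (the robot names and host below are invented),
       the current name can be read back by calling agent() without
       an argument, and setting a different name discards any rules
       parsed so far:

	 use WWW::RobotRules;

	 my $rules = new WWW::RobotRules 'MyRobot/1.0';
	 print $rules->agent, "\n";    # read back the agent name

	 $rules->parse("http://www.example.com/robots.txt",
		       "User-agent: *\nDisallow: /\n");

	 # a new name empties the cache; the rules parsed above
	 # are forgotten
	 $rules->agent('OtherRobot/1.0');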

ROBOTS.TXT
       The format and semantics of the "/robots.txt" file are as
       follows (this is an edited abstract of
       <URL:http://info.webcrawler.com/mak/projects/robots/norobots.html>):

       The file consists of one or more records separated by one
       or more blank lines. Each record contains lines of the
       form

	 <field-name>: <value>

       The field name is case insensitive.  Text after the '#'
       character on a line is ignored during parsing.  This is
       used for comments.  The following <field-names> can be
       used:

       User-Agent
	  The value of this field is the name of the robot the
	  record is describing access policy for.  If more than
	  one User-Agent field is present the record describes an
	  identical access policy for more than one robot. At
	  least one field needs to be present per record.  If the
	  value is '*', the record describes the default access
	  policy for any robot that has not matched any of
	  the other records.

       Disallow
	  The value of this field specifies a partial URL that is
	  not to be visited. This can be a full path, or a
	  partial path; any URL that starts with this value will
	  not be retrieved.

       Examples

       The following example "/robots.txt" file specifies that no
       robots should visit any URL starting with
       "/cyberworld/map/" or "/tmp/":

	 # robots.txt for http://www.site.com/

	 User-agent: *
	 Disallow: /cyberworld/map/ # This is an infinite virtual URL space
	 Disallow: /tmp/ # these will soon disappear
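
       Fed to the module, the file above gives the following answers
       from allowed(); the robot name and the URLs checked here are
       arbitrary:

	 use WWW::RobotRules;

	 my $robots_txt =
	     "User-agent: *\n" .
	     "Disallow: /cyberworld/map/\n" .
	     "Disallow: /tmp/\n";

	 my $rules = new WWW::RobotRules 'AnyBot/1.0';
	 $rules->parse("http://www.site.com/robots.txt", $robots_txt);

	 $rules->allowed("http://www.site.com/index.html");            # TRUE
	 $rules->allowed("http://www.site.com/cyberworld/map/a.html"); # FALSE
	 $rules->allowed("http://www.site.com/tmp/junk.html");         # FALSE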

       This example "/robots.txt" file specifies that no robots
       should visit any URL starting with "/cyberworld/map/",
       except the robot called "cybermapper":

	 # robots.txt for http://www.site.com/

	 User-agent: *
	 Disallow: /cyberworld/map/ # This is an infinite virtual URL space

	 # Cybermapper knows where to go.
	 User-agent: cybermapper
	 Disallow:
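
       With this file, the record that applies depends on the name
       the WWW::RobotRules object was created with; the name
       "SomeBot" below is invented:

	 use WWW::RobotRules;

	 my $robots_txt =
	     "User-agent: *\n" .
	     "Disallow: /cyberworld/map/\n" .
	     "\n" .
	     "User-agent: cybermapper\n" .
	     "Disallow:\n";

	 my $any = new WWW::RobotRules 'SomeBot/1.0';
	 $any->parse("http://www.site.com/robots.txt", $robots_txt);
	 $any->allowed("http://www.site.com/cyberworld/map/a.html");    # FALSE

	 my $mapper = new WWW::RobotRules 'cybermapper/1.0';
	 $mapper->parse("http://www.site.com/robots.txt", $robots_txt);
	 $mapper->allowed("http://www.site.com/cyberworld/map/a.html"); # TRUE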

       This example indicates that no robots should visit this
       site further:

	 # go away
	 User-agent: *
	 Disallow: /
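
       After parsing such a file, allowed() returns FALSE for every
       URL on that server (the robot name is arbitrary):

	 use WWW::RobotRules;

	 my $rules = new WWW::RobotRules 'AnyBot/1.0';
	 $rules->parse("http://www.site.com/robots.txt",
		       "User-agent: *\nDisallow: /\n");

	 $rules->allowed("http://www.site.com/");           # FALSE
	 $rules->allowed("http://www.site.com/index.html"); # FALSE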

SEE ALSO
       LWP::RobotUA(3), WWW::RobotRules::AnyDBM_File(3)
