CRAWL(1)		  BSD General Commands Manual		      CRAWL(1)

NAME
     crawl — a small and efficient HTTP crawler

SYNOPSIS
     crawl [-v level] [-u urlincl] [-e urlexcl] [-i imgincl] [-I imgexcl]
	   [-d imgdir] [-m depth] [-c state] [-t timeout] [-A agent] [-R]
	   [-E external] [url ...]

DESCRIPTION
     The crawl utility starts a depth-first traversal of the web at the
     specified URLs.  It stores all JPEG images that match the configured
     constraints.

     The options are as follows:

     -v level	  The verbosity level of crawl when printing information
		  about URL processing.  The default is 1.

     -u urlincl	  A regex(3) expression that all URLs must match to be
		  included in the traversal.

     -e urlexcl	  A regex(3) expression that determines which URLs will be
		  excluded from the traversal.

     -i imgincl	  A regex(3) expression that all image URLs have to match in
		  order to be stored on disk.

     -I imgexcl	  A regex(3) expression that determines the images that will
		  not be stored.

     -d imgdir	  Specifies the directory under which the images will be
		  stored.

     -m depth	  Specifies the maximum depth of the traversal.  A depth of 0
		  means that only the URLs specified on the command line will
		  be retrieved.  A depth of -1 means unlimited traversal and
		  should be used with caution.

     -c state	  Continues a traversal that was interrupted previously.  The
		  remaining URLs will be read from the file state.

     -t timeout	  Specifies the time in seconds that needs to pass between
		  successive accesses of a single host.  The parameter is a
		  float.  The default is five seconds.

     -A agent	  Specifies the agent string that will be included in all HTTP
		  requests.

     -R		  Specifies that the crawler should ignore the robots.txt
		  file.

     -E external  Specifies an external filter program that can refine which
		  URLs are to be included in the traversal.  The filter
		  program reads the URLs on stdin and outputs a single
		  character on stdout.  An output of ‘y’ indicates that the
		  URL may be included, while ‘n’ means that the URL should be
		  excluded.
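
     A minimal sketch of such a filter follows.  It is not part of the
     crawl distribution: the pattern is purely illustrative, and it assumes
     that crawl presents one URL per line on the filter's stdin and expects
     a single-character answer per URL; this framing should be verified
     against the crawl source.

          /*
           * exturl.c -- hypothetical external filter for crawl -E.
           * Accepts only URLs on example.com; answers 'y' or 'n' per URL.
           */
          #include <regex.h>
          #include <stdio.h>
          #include <string.h>

          int
          main(void)
          {
                  regex_t re;
                  char url[8192];

                  /* Illustrative include pattern. */
                  if (regcomp(&re, "^http://([^/]*\\.)?example\\.com/",
                      REG_EXTENDED | REG_NOSUB) != 0)
                          return (1);

                  while (fgets(url, sizeof(url), stdin) != NULL) {
                          /* Strip the trailing newline, if any. */
                          url[strcspn(url, "\n")] = '\0';
                          putchar(regexec(&re, url, 0, NULL, 0) == 0 ?
                              'y' : 'n');
                          fflush(stdout); /* crawl waits for each answer */
                  }
                  regfree(&re);
                  return (0);
          }

     Compiled to an executable, e.g. exturl, it would be passed as
     crawl -E ./exturl url.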

     The source code of existing web crawlers tends to be very complicated.
     crawl has a very simple design and correspondingly simple source code.

     A configuration file can be used instead of the command line arguments.
     The configuration file also specifies the MIME type that is being used.
     To download objects other than images, the MIME type needs to be
     adjusted accordingly.  For more information, see crawl.conf.
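
     As an illustration only, the MIME type restriction might look like the
     sketch below.  This is not taken from the distribution; the actual
     directive names and syntax must be checked against crawl.conf itself:

          # hypothetical sketch -- verify against the shipped crawl.conf
          mimetype ^image/jpeg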

EXAMPLES
     crawl -m 0 http://www.w3.org/

     Searches for images in the index page of the World Wide Web Consortium
     without following any other links.
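
     A more involved invocation, combining several of the options described
     above (the URL and target directory are only illustrative):

     crawl -m 2 -t 10 -i '\.jpg$' -d /tmp/images http://www.example.com/

     Follows links up to two levels deep starting at www.example.com, waits
     ten seconds between successive accesses to the same host, and stores
     only images whose URLs end in .jpg under /tmp/images.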

ACKNOWLEDGEMENTS
     This product includes software developed by Ericsson Radio Systems.

     This product includes software developed by the University of California,
     Berkeley and its contributors.

AUTHORS
     The crawl utility has been developed by Niels Provos.

BSD				 May 29, 2001				   BSD