glimpseindex man page on Ultrix



       glimpseindex 4.1 - index whole file systems to be searched by glimpse

       Glimpse	(which	stands	for  GLobal IMPlicit SEarch) is a popular UNIX
       indexing and query system that allows you to search through a large set
       of  files  very	quickly.   Glimpseindex	 is  the  indexing program for
       glimpse.	 Glimpse supports most of agrep's options (agrep is our power‐
       ful version of grep) including approximate matching (e.g., finding mis‐
       spelled words), Boolean queries, and even some limited forms of regular
       expressions.  It is used in the same way, except that you don't have to
       specify file names.  So, if you are looking for a  needle  anywhere  in
       your  file  system,  all	 you  have to do is say glimpse needle and all
       lines containing needle will appear preceded by the file name.  See man
       glimpse for details on how to use glimpse.

       Glimpseindex provides three indexing options: a tiny index (2-3% of the
       total size of all files), a small index (7-8%) and a medium-size	 index
       (20-30%).   Search  times  are  normally	 better	 with  larger  indexes
       (although unless files are quite large, the small index is  just	 about
       as  good	 as  the medium one).  To index all your files, you say glimp‐
       seindex ~ for tiny index (where	~  stands  for	the  home  directory),
       glimpseindex -o ~ for small index, and glimpseindex -b ~ for medium.
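
       As a rough illustration of the percentages above, the following Python
       sketch estimates the expected index size for a given amount of data.
       The function name and table are hypothetical and are not part of
       glimpse; the ranges are simply the ones quoted in this page, and real
       index sizes will vary.

```python
# Approximate index sizes quoted in this manual page, as integer
# percentages of the total size of all indexed files.
INDEX_SIZE_PERCENT = {
    "tiny": (2, 3),      # glimpseindex ~
    "small": (7, 8),     # glimpseindex -o ~
    "medium": (20, 30),  # glimpseindex -b ~
}


def estimate_index_size(total_bytes, kind):
    """Return a (low, high) estimate of the index size in bytes."""
    low_pct, high_pct = INDEX_SIZE_PERCENT[kind]
    return (total_bytes * low_pct // 100, total_bytes * high_pct // 100)


if __name__ == "__main__":
    for kind in ("tiny", "small", "medium"):
        print(kind, estimate_index_size(200_000_000, kind))
```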

       Mail to be added to the glimpse mailing
       list.  Mail to report bugs, ask questions,  dis‐
       cuss  tricks  for using glimpse, etc. (this is a moderated mailing list
       with very little traffic, mostly announcements).	 HTML version of these
       manual  pages  can be found in‐
       dexhelp.html Also, see the glimpse home pages in http://glimpse.cs.ari‐

       glimpseindex  [ -abEfFiInostT -w number -dD filename(s) -H directory -M
       number -S number ] directory_name[s]

       Glimpseindex builds an index of all text files in all  the  directories
       specified  and all their subdirectories (recursively).  It is also pos‐
       sible to build several separate indexes	(possibly  even	 overlapping).
       The simplest way to index your files is to say

       glimpseindex -o ~

       The  index  consists  of several files (described in detail below), all
       with the prefix .glimpse_ stored in the user's home  directory  (unless
       otherwise specified with the -H option).	 Files with one of the follow‐
       ing suffixes are not indexed: ".o", ".gz", ".Z", ".z", ".hqx",  ".zip",
       ".tar".	 (Unless  the  -z  option  is  used, see below.)  In addition,
       glimpseindex attempts to determine whether a file is a  text  file  and
       does  not  index	 files that it thinks are not text files.  Numbers are
       not indexed unless the -n option is used.  It is	 possible  to  prevent
       specified  files	 from  being  indexed  by  adding  their  names to the
       .glimpse_exclude file (described below).	 The -o option builds a larger
       index than without it (typically about 7-8% vs. 2-3% without -o) allow‐
       ing for a faster search (1-5 times faster).   The  -b  builds  an  even
       larger  index  and allows an even faster search some of the time (-b is
       helpful mostly when large files are present).  There is an  incremental
       indexing	 option	 -f,  which  updates  an existing index by determining
       which files have been created or modified since the index was built and
       adding  them  to	 the index (see -f).  Glimpseindex is reasonably fast,
       taking about 20 minutes to index 15,000 files of about 200MB (on a DEC
       Alpha  233)  and 2-4 minutes to update an existing index. (Your mileage
       may vary.)  It is also possible to increment the index by adding a spe‐
       cific file (the -a option).

       Once an index is built, searching for a pattern is as easy as saying

       glimpse pattern

       (See man glimpse for all glimpse's options and features.)

       Glimpse	does not automatically index files.  You have to tell it to do
       it.  This can be done manually, but a better way is to set  it  to  run
       every  night.   It is probably a good idea to run glimpseindex manually
       for the first time to be sure it works properly.	 The  following	 is  a
       simple  script  to  run	glimpseindex every night.  We assume that this
       script is stored in a file called glimpse.script:

       glimpseindex -o -t -w 5000 ~ >& .glimpse_out
       at -m 0300 glimpse.script
       (It might be interesting to collect all the outputs of glimpseindex by
       changing >& to >>& so that the file .glimpse_out maintains a history.
       In this case the file must be created before the first time >>& is
       used.  If you use ksh, replace '>& .glimpse_out' with
       '> .glimpse_out 2>&1'.)
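
       The same nightly run can be driven from other tools.  The Python sketch
       below only assembles the command line shown in the script above and
       appends the combined output to a log file, which is the effect of >>&
       in csh.  The function names and the log path are hypothetical; the
       options are the ones used in the script, and scheduling is left to at
       or cron.

```python
import subprocess


def build_nightly_command(directory, min_new_words=5000):
    """Build the argument list for the nightly run shown in the script."""
    return ["glimpseindex", "-o", "-t", "-w", str(min_new_words), directory]


def run_and_log(command, log_path):
    """Run command, appending stdout and stderr to log_path.

    Opening with mode "a" creates the file if it does not exist yet,
    so the log keeps a history across runs.
    """
    with open(log_path, "a") as log:
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode


if __name__ == "__main__":
    # Printing only; call run_and_log() on a machine with glimpseindex.
    print(" ".join(build_nightly_command("/usr1/udi")))
```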

       Glimpseindex  stores  the names of all the files that it indexed in the
       file .glimpse_filenames.	 Each file is listed by its full path name  as
       obtained	  at   the   time   the	 files	were  indexed.	 For  example,
       /usr1/udi/file1.	 Glimpse uses this full	 name  when  it	 performs  the
       search,	so  the	 name  must match the current name.  This may become a
       problem when the indexing  and  the  search  are	 done  from  different
       machines (e.g., through NFS), which may cause the path names to be dif‐
       ferent.	For example, /tmp_mnt/R/xxx/xxx/usr1/udi/file1.	 (The same  is
       true for several other .glimpse files.  See below.)
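
       One way to deal with such a mismatch is to rewrite the path prefix in
       .glimpse_filenames (see also the -R option below).  The Python sketch
       that follows is a hypothetical helper, not part of glimpse.  It assumes
       only that each file name sits at the start of its own line, and it
       leaves every other line untouched.  Work on a copy of the file, and run
       glimpseindex -R afterwards as described under -R.

```python
def rewrite_prefix(lines, old_prefix, new_prefix):
    """Replace old_prefix with new_prefix on lines that start with it."""
    rewritten = []
    for line in lines:
        if line.startswith(old_prefix):
            line = new_prefix + line[len(old_prefix):]
        rewritten.append(line)
    return rewritten


if __name__ == "__main__":
    names = ["/tmp_mnt/R/xxx/xxx/usr1/udi/file1", "/usr1/udi/file2"]
    # Strip the automounter prefix so names match the local view.
    for name in rewrite_prefix(names, "/tmp_mnt/R/xxx/xxx", ""):
        print(name)
```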

       Glimpseindex  does not follow symbolic links unless they are explicitly
       included in the .glimpse_include file (described below).

       Glimpseindex makes an effort to identify non-text files such as	binary
       files,  compressed  files,  uuencoded  files,  postscript files, binhex
       files, etc.  These files are automatically not indexed.	 In  addition,
       all files whose names end with `.o', `.gz', `.Z', `.z', `.hqx', `.zip',
       or `.tar' will not be indexed (unless they are specifically included in
       .glimpse_include - see below).

       The options for glimpseindex are as follows:

       -a     adds  the given file[s] and/or directories to an existing index.
	      Any given directory will be traversed recursively and all	 files
	      will  be	indexed	 (unless  they appear in .glimpse_exclude; see
	      below).  Using this option is generally much faster than	index‐
	      ing  everything  from  scratch, although in rare cases the index
	      may not be as good.  If for some reason the index is full (which
	      can  happen  unless -o or -b are used) glimpseindex -a will pro‐
	      duce an error message and will exit without changing the	origi‐
	      nal index.

       -b     builds  a	 medium-size  index (20-30% of the size of all files),
	      allowing faster search.	This  option  forces  glimpseindex  to
	      store  an	 exact (byte level) pointer to each occurrence of each
              word (except for some very common words belonging to the stop
              list).

       -B     uses a hash table that is 4 times bigger (256K entries instead
	      of 64K) to speed up indexing.  The memory	 usage	will  increase
	      typically	 by  about  2  MB.   This  option is only for indexing
	      speed; it does not affect the final index.

       -d filename(s)
	      deletes the given file(s) from the index.

       -D filename(s)
	      deletes the given file(s) from the list of file names,  but  not
	      from  the	 index.	  This is much faster than -d, and the file(s)
	      will not be found by glimpse.  However, the  index  itself  will
	      not become smaller.

       -E     does not run a check on file types.  Glimpseindex normally attempts
	      to exclude non-text files, but this attempt is not  always  per‐
	      fect.   With  -E,	 glimpseindex  indexes all files, except those
	      that are specifically excluded  in  .glimpse_exclude  and	 those
	      whose file names end with one of the excluded suffixes.

       -f     incremental  indexing.  glimpseindex scans all files and adds to
	      the index only those files that were created or  modified	 after
	      the current index was built.  If there is no current index or if
	      this procedure fails, glimpseindex automatically reverts to  the
	      default  mode (which is to index everything from scratch).  This
	      option may create an inefficient index for several reasons,  one
	      of  which	 is that deleted files are not really deleted from the
	      index.  Unless changes are small, mostly additions,  and	-o  is
              used, we suggest using the default mode as much as possible.

       -F     Glimpseindex receives the list of files to index from standard
              input.

       -H directory
	      Put or update the index and all  other  .glimpse	files  (listed
	      below) in "directory".  The default is the home directory.  When
	      glimpse is run, the -H option must be used to direct glimpse  to
	      this directory, because glimpse assumes that the index is in the
	      home directory (see also the -H option in glimpse).

       -i     Make .glimpse_include (SEE GLIMPSEINDEX FILES)  take  precedence
	      over  .glimpse_exclude,  so  that,  for example, one can exclude
	      everything (by putting *) and then explicitly include files.

       -I     Instead of indexing, only show (print to standard out) the  list
	      of files that would be indexed.  It is useful for filtering pur‐
	      poses.  ("glimpseindex -I dir | glimpseindex -F" is the same  as
	      "glimpseindex dir".)

       -M x   Tells  glimpseindex  to use x MB of memory for temporary tables.
	      The more memory you allow the faster glimpseindex will run.  The
	      default  is  x=2.	  The  value  of x must be a positive integer.
	      Glimpseindex will need more memory than x for other things,  and
	      glimpseindex may perform some 'forks', so you'll have to experi‐
	      ment if you want to use this option.  WARNING: If x is too large
	      you may run out of swap space.

       -n     Index numbers as well as text.  The default is not to index num‐
	      bers.  This is useful when searching for dates or other  identi‐
	      fying numbers, but it may make the index very large if there are
	      lots of numbers.	In general, glimpseindex strips away any  non-
	      alphabetic  character.   For  example, the string abc123 will be
	      indexed as abc if the -n option is not used and as abc123 if  it
              is used.  Glimpseindex provides warnings (in .glimpse_messages) for
	      all files in which more than half the words that were  added  to
	      the  index from that file had digits in them (this is an attempt
	      to identify data files that should  probably  not	 be  indexed).
	      One  can	use the .glimpse_exclude file to exclude data files or
	      any other files.	(See GLIMPSEINDEX FILES.)

       -o     Build a small index rather than tiny (meaning 7-9% of the	 sizes
	      of  all  files  - your mileage may vary) allowing faster search.
	      This option forces glimpseindex to allocate one block  per  file
	      (a  block	 usually contains many files).	A detailed explanation
	      of how blocks affect glimpse can be found in the	glimpse	 arti‐
	      cle.  (See also LIMITATIONS.)

       -R     Recompute .glimpse_filenames_index from .glimpse_filenames.  The
	      file .glimpse_filenames_index speeds up processing.   Glimpsein‐
	      dex  usually  computes  it  automatically.  However, if for some
	      reason one wants to change the path names of the files listed in
	      .glimpse_filenames,  then	 running  glimpseindex	-R  recomputes
	      .glimpse_filenames_index.	 This is useful if the index  is  com‐
	      puted  on	 one  machine,	but  is used on another (with the same
	      hierarchy).  The names of the files listed in .glimpse_filenames
	      are used in runtime, so changing them can be done at any time in
	      any way (as long as just the names not the content is  changed).
	      This  is	not really an option in the regular sense;  rather, it
	      is a program by itself, and it is	 meant	as  a  post-processing
              step.  (Available only from version 3.6.)

       -s     supports	structured  queries.  This option was added to support
	      the Harvest project and it is applicable mostly in that context.
              See STRUCTURED QUERIES below for more information about this
              option and about the Harvest project.

       -S k   The  number  k  determines the size of the stop-list.  The stop-
	      list consists of words that are too common and are  not  indexed
	      (e.g.,  'the'  or	 'and').  Instead of having a fixed stop-list,
	      glimpseindex figures out the words that are too common for every
	      index  separately.   The	rules  are different for the different
              indexing options.  The tiny index contains all words (the savings
              from a stop-list are too small to bother).  For the small index
              (-o), the number k is a percentage threshold: a word will be in
              the stop list if it appears in at least k% of all files.  The
              default value is 80%.  (If there are fewer than 256 files, then
	      the  stop-list is not maintained.)  The medium index (-b) counts
	      all occurrences of all words, and a word is added to  the	 stop-
	      list  if	it  appears  at	 least k times per MByte.  The default
	      value is 500.  A query that includes a  stop  list  word	is  of
	      course less efficient.  (See also LIMITATIONS below.)

       -t     (A  new  option  in  version 3.5.)  The order in which files are
	      indexed is determined by	scanning  the  directories,  which  is
              mostly arbitrary.  With the -t option, combined with either -o
              or -b, the indexed files are stored in reverse order of
              modification age (younger files first).  Results of queries are then
	      automatically returned in this order.  Furthermore, glimpse  can
	      filter results by age; for example, asking to look at only files
	      that are at most 5 days old.

       -T     builds the turbo file.  Starting at version  3.0,	 this  is  the
	      default, so using this option has no effect.

       -w k   Glimpseindex does a reasonable, but not a perfect, job of deter‐
	      mining which files should not be	indexed.   Sometimes  a	 large
	      text  file  should not be indexed; for example, a dictionary may
	      match most queries.  The -w  option  stores  in  a  file	called
	      .glimpse_messages	 (in the same directory as the index) the list
	      of all files that contribute at least k new words to the	index.
	      The  user can look at this list of files and decide which should
	      or should not be indexed.	 The  file  .glimpse_exclude  contains
              files that will not be indexed (see more below).  We recommend
              setting k to about 1000.  This is not an exact measure.  For
              example, if the same file appears twice, then the second copy
              will not contribute any new words to the dictionary (but if you
              exclude the first copy and index again, the second copy will
              contribute them).

       -X     (starting at version 4.0B1) Extract titles from HTML  pages  and
	      add the titles to the index (in .glimpse_filenames).  (This fea‐
	      ture was added to improve the performance of WebGlimpse.)	 Works
	      only  on	files  whose  names  end with .html, .htm, .shtml, and
	      .shtm.  (see glimpse.h/EXTRACT_INFO_SUFFIX to add to these  suf‐
	      fixes.)	The  routine to extract titles is called extract_info,
	      in index/filetype.c.  This feature can be	 modified  in  various
	      ways  to	extract	 info  from  many  filetypes.	The titles are
	      appended to the corresponding filenames with a space  separator.
	      Glimpseindex assumes that filenames don't have spaces in them.

       -z     Allow customizable filtering, using the file .glimpse_filters to
	      perform the programs listed there	 for  each  match.   The  best
              example is compress/decompress.  If .glimpse_filters includes the line
	      *.Z   uncompress <
	      (separated by tabs) then before indexing any file	 that  matches
	      the  pattern "*.Z" (same syntax as the one for .glimpse_exclude)
	      the command listed is executed first  (assuming  input  is  from
	      stdin, which is why uncompress needs <) and its output (assuming
	      it goes to stdout) is indexed.  The file itself is  not  changed
	      (i.e.,  it  stays	 compressed).  Then if glimpse -z is used, the
	      same program is used on these files on the fly.  Any program can
	      be  used (we run 'exec').	 For example, one can filter out parts
	      of files that should not	be  indexed.   Glimpseindex  tries  to
	      apply  all  filters  in  .glimpse_filters	 in the order they are
	      given.  For example, if you want to uncompress a file  and  then
	      extract  some part of it, put the compression command (the exam‐
	      ple above) first	and  then  another  line  that	specifies  the
	      extraction.  Note that this can slow down the search because the
	      filters need to be run before files are searched.
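
       The way a .glimpse_filters file is read for -z can be sketched in
       Python.  This is an illustration under stated assumptions, not the
       glimpse implementation: the helper names are hypothetical, and
       fnmatch is used as a stand-in for the .glimpse_exclude pattern syntax,
       which is not identical to shell globbing.  The sketch only selects the
       commands that would apply to a file, in the order they are listed; it
       does not run them.

```python
import fnmatch


def parse_filters(text):
    """Parse 'pattern<TAB>command' lines, preserving their order."""
    filters = []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        pattern, _, command = line.partition("\t")
        filters.append((pattern.strip(), command.strip()))
    return filters


def commands_for(filename, filters):
    """Return the commands whose pattern matches filename, in file order."""
    # ASSUMPTION: fnmatchcase approximates the real pattern matching.
    return [cmd for pat, cmd in filters if fnmatch.fnmatchcase(filename, pat)]


if __name__ == "__main__":
    filters = parse_filters("*.Z\tuncompress <\n")
    print(commands_for("/usr1/udi/notes.Z", filters))
```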

GLIMPSEINDEX FILES
       All files used by glimpse are located in the directory (or directories)
       where the index is stored and have .glimpse_ as a prefix.  The first
       two files (.glimpse_exclude and .glimpse_include) are  optionally  sup‐
       plied by the user.  The other files are built and read by glimpse.

       .glimpse_exclude
              contains a list of files that glimpseindex is explicitly told to
	      ignore.  In general, the syntax of  .glimpse_exclude/include  is
	      the same as that of agrep (or any other grep).  The lines in the
	      .glimpse_exclude file are matched to the file names, and if they
	      match,  the  files  are  excluded.  Notice that agrep matches to
	      parts  of	 the  string!	e.g.,  agrep   /ftp/pub	  will	 match
	      /home/ftp/pub and /ftp/pub/whatever.  So, if you want to exclude
	      /ftp/pub/core, you just list it, as is, in the  .glimpse_exclude
	      file.   If  you  put  "/home/ftp/pub/cdrom" in .glimpse_exclude,
	      every file name that matches that string will be excluded, mean‐
	      ing all files below it.  You can use ^ to indicate the beginning
	      of a file name, and $ to indicate the end of one,	 and  you  can
	      use  *  and  ?  in  the  usual way.  For example /ftp/*html will
	      exclude	 /ftp/pub/foo.html,    but    will    also     exclude
	      /home/ftp/pub/html/whatever;   if you want to exclude files that
              start with /ftp and end with html, use ^/ftp*html$.  Notice that
	      putting  a  *  at	 the  beginning or at the end is redundant (in
	      fact, in this case glimpseindex will remove the * when  it  does
	      the   indexing).	 No  other  meta  characters  are  allowed  in
	      .glimpse_exclude (e.g., don't use .* or # or |).	Lines  with  *
	      or  ?  must  have	 no  more  than	 30  characters.  Notice that,
	      although the index itself will not be indexed, the list of  file
	      names  (.glimpse_filenames) will be indexed unless it is explic‐
	      itly listed in .glimpse_exclude.

       .glimpse_filters
              See the description above for the -z option.

       .glimpse_include
              contains a list of files that glimpseindex is explicitly told to
	      include  in  the	index  even though they may look like non-text
	      files.  Symbolic links are followed by glimpseindex only if they
	      are  specifically	 included here.	 The syntax is the same as the
	      one for .glimpse_exclude (see there).  If	 a  file  is  in  both
	      .glimpse_exclude and .glimpse_include it will be excluded unless
	      -i is used.

       .glimpse_filenames
              contains the list of all indexed file names, one per line.  This
	      is  an ASCII file that can also be used with agrep to search for
	      a file name leading to a fast find command.  For example,
	      glimpse 'count#\.c$' ~/.glimpse_filenames
	      will output the names  of	 all  (indexed)	 .c  files  that  have
	      'count'  in  their name (including anywhere on the path from the
              index).  Setting the following alias in the .login file may be
              useful:
              alias findfile 'glimpse -h \!:1 ~/.glimpse_filenames'

       .glimpse_index
              contains the index.  The index consists of lines, each starting
	      with a word followed by a list of block numbers (unless  the  -o
	      or  -b  options are used, in which case each word is followed by
	      an offset into the file .glimpse_partitions where	 all  pointers
	      are kept).  The block/file numbers are stored in binary form, so
	      this is not an ASCII file.

       .glimpse_messages
              contains the output of the -w option (see above).

       .glimpse_partitions
              contains the partition of the indexed space into blocks and,
	      when  the index is built with the -o or -b options, some part of
	      the index.  This file is used internally by glimpse and it is  a
	      non-ASCII file.

       .glimpse_statistics
              contains some statistics about the makeup of the index.  Useful
	      for some advanced applications and customization of glimpse.
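
       The matching rules described above for .glimpse_exclude and
       .glimpse_include can be approximated with a short Python sketch.  It is
       an assumption-laden illustration (hypothetical function names, a
       regular expression in place of agrep) that covers only the documented
       behavior: partial matching, ^ and $ as anchors, and * and ? as
       wildcards.

```python
import re


def exclude_pattern_to_regex(pattern):
    """Translate one exclude-style line into a compiled regular expression."""
    parts = []
    last = len(pattern) - 1
    for i, ch in enumerate(pattern):
        if ch == "^" and i == 0:
            parts.append("^")       # beginning of the file name
        elif ch == "$" and i == last:
            parts.append("$")       # end of the file name
        elif ch == "*":
            parts.append(".*")      # any run of characters
        elif ch == "?":
            parts.append(".")       # ASSUMPTION: exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


def is_excluded(path, patterns):
    """True if any pattern matches any part of path (partial match)."""
    return any(exclude_pattern_to_regex(p).search(path) for p in patterns)


if __name__ == "__main__":
    for path in ("/ftp/pub/foo.html", "/home/ftp/pub/html/whatever"):
        print(path, is_excluded(path, ["^/ftp*html$"]))
```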

STRUCTURED QUERIES
       Glimpse can search for Boolean combinations of "attribute=value" terms
       by  using the Harvest SOIF parser library (in glimpse/libtemplate).  To
       search this way, the index must be made	by  using  the	-s  option  of
       glimpseindex  (this  can be used in conjunction with other glimpseindex
       options). For glimpse and glimpseindex to recognize "structured" files,
       they  must be in SOIF format. In this format, each value is prefixed by
       an attribute-name with the size of the value (in bytes) present in "{}"
       after the name of the attribute.  For example, the following lines are
       part of an SOIF file:
       type{17}:       Directory-Listing
       md5{32}:	       3858c73d68616df0ed58a44d306b12ba
       Any  string  can	 serve	as   an	  attribute   name.    Glimpse	 "pat‐
       tern;type=Directory-Listing"  will  search  for "pattern" only in files
       whose type is "Directory-Listing".  The file itself is considered to be
       one  "object"  and  its name/url appears as the first attribute with an
       "@" prefix; e.g., @FILE { http://xxx... } The scope of  Boolean	opera‐
       tions  changes  from  records  (lines)  to  whole files when structured
       queries are used in glimpse (since individual query terms can  look  at
       different attributes and they may not be "covered" by the record/line).
       Note that glimpse can only search for patterns in the  value  parts  of
       the SOIF file: there are some attributes (like the TTL, MD5, etc.) that
       are  interpreted	 by  Harvest's	internal  routines.   See  http://har‐ for more detailed information
       of the SOIF format.
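
       To make the attribute layout concrete, here is a minimal Python sketch
       that parses lines of the form shown above and checks the declared size
       against the value.  It is not the Harvest SOIF parser: the function
       name is hypothetical, only single-line values are handled, and the
       "@" object header and anything else it does not recognize are skipped.

```python
import re

# name{size}: value   (whitespace after the colon is not part of the value)
_ATTR_LINE = re.compile(r"^([^{\s]+)\{(\d+)\}:\s*(.*)$")


def parse_soif_attributes(text):
    """Parse single-line 'name{size}: value' attributes into a dict."""
    attrs = {}
    for line in text.splitlines():
        match = _ATTR_LINE.match(line)
        if not match:
            continue  # not an attribute line this sketch understands
        name, size, value = match.group(1), int(match.group(2)), match.group(3)
        actual = len(value.encode("utf-8"))
        if actual != size:
            raise ValueError("%s: declared %d bytes, found %d" % (name, size, actual))
        attrs[name] = value
    return attrs


if __name__ == "__main__":
    sample = ("type{17}:\tDirectory-Listing\n"
              "md5{32}:\t3858c73d68616df0ed58a44d306b12ba\n")
    print(parse_soif_attributes(sample))
```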

       1.     U. Manber and S. Wu, "GLIMPSE: A Tool to Search  Through	Entire
	      File  Systems,"  Usenix  Winter  1994 Technical Conference (best
	      paper award), San Francisco (January 1994),  pp.	23-32.	 Also,
	      Technical	 Report	 #TR 93-34, Dept. of Computer Science, Univer‐
	      sity of Arizona, October 1993 (a postscript file is available by
	      anonymous		  ftp		at	     ftp://ftp.cs.ari‐

       2.     S. Wu and U. Manber, "Fast Text Searching Allowing Errors," Com‐
	      munications of the ACM 35 (October 1992), pp. 83-91.

       agrep(1),  ed(1), ex(1), glimpse(1), glimpseserver(1), grep(1V), sh(1),

LIMITATIONS
       The index of glimpse is word based.  A pattern that contains more  than
       one  word cannot be found in the index.	The way glimpse overcomes this
       weakness is by splitting any multi-word pattern into its set  of	 words
       and looking for all of them in the index.  For example, glimpse 'linear
       programming' will first consult the index to find all files  containing
       both  linear and programming, and then apply agrep to find the combined
       pattern.	 This is usually an effective solution, but it can be slow for
       cases where both words are very common, but their combination is not.
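
       The two-phase search just described can be sketched with a toy
       word-to-files index in Python.  The data structures and function names
       are hypothetical and far simpler than the real index; the point is
       only that the index narrows the candidates and the files themselves
       are then scanned for the full pattern.

```python
def build_index(files):
    """Map each lower-cased word to the set of file names containing it."""
    index = {}
    for name, text in files.items():
        for word in text.lower().split():
            index.setdefault(word, set()).add(name)
    return index


def candidate_files(index, pattern):
    """Files that contain every word of the pattern, per the index."""
    word_sets = [index.get(w, set()) for w in pattern.lower().split()]
    return set.intersection(*word_sets) if word_sets else set()


def search(index, files, pattern):
    """Narrow by index, then scan each candidate for the whole pattern."""
    hits = []
    for name in sorted(candidate_files(index, pattern)):
        for line in files[name].splitlines():
            if pattern in line:
                hits.append((name, line))
    return hits


if __name__ == "__main__":
    files = {
        "a.txt": "linear programming is fun",
        "b.txt": "linear algebra\nprogramming in C",
    }
    index = build_index(files)
    # b.txt is a candidate (both words occur) but has no combined match.
    print(sorted(candidate_files(index, "linear programming")))
    print(search(index, files, "linear programming"))
```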

       The  index  of glimpse stores all patterns in lower case.  When glimpse
       searches the index it first converts all patterns to lower case,	 finds
       the  appropriate	 files,	 and  then searches the actual files using the
       original patterns.  So, for example, glimpse ABCXYZ will first find all
       files  containing  abcxyz  in any combination of lower and upper cases,
       and then searches these files directly, so only the right cases will be
       found.  One problem with this approach is discovering misspellings that
       are caused by wrong cases.  For example, glimpse -B abcXYZ  will	 first
       search  the  index for the best match to abcxyz (because the pattern is
       converted to lower case); it will find that there are matches  with  no
       errors,	and  will go to those files to search them directly, this time
       with the original upper cases.  If the closest match  is,  say  AbcXYZ,
       glimpse may miss it, because it doesn't expect an error.	 Another prob‐
       lem is speed.  If you search for "ATT", it will look at the  index  for
       "att".	Unless you use -w to match the whole word, glimpse may have to
       search all files containing, for example, "Seattle" which has "att" in
       it.
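
       The speed problem can be seen in a toy model.  In the Python sketch
       below (hypothetical names; the index is just a dictionary from
       lower-case words to file names), a pattern is lower-cased before the
       lookup, and without whole-word matching every indexed word that
       contains the pattern makes its files candidates for a direct search.

```python
def index_candidates(index, pattern, whole_word=False):
    """Return the files that must be scanned for pattern.

    whole_word=True models glimpse -w: only an exact word match counts.
    """
    key = pattern.lower()  # the index stores lower-case words only
    names = set()
    for word, files in index.items():
        if word == key or (not whole_word and key in word):
            names |= files
    return names


if __name__ == "__main__":
    index = {"att": {"phone.txt"}, "seattle": {"travel.txt"}}
    print(sorted(index_candidates(index, "ATT")))
    print(sorted(index_candidates(index, "ATT", whole_word=True)))
```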

       There  is  no  size  limit for simple patterns and simple patterns with
       Boolean AND or OR.  More complicated patterns are currently limited  to
       approximately  30  characters.	Lines  are limited to 1024 characters.
       Records are limited to 48K, and may be truncated	 if  they  are	larger
       than  that.  The limit of record length can be changed by modifying the
       parameter Max_record in agrep.h.

       Each line in .glimpse_exclude or .glimpse_include that contains a *  or
       a ? must not exceed 30 characters in length.

       Glimpseindex does not index words of size > 64.

       A medium-size index (-b) may lead to actually slower query times if the
       files are all very small.

       Under -b, it may be impossible to make the stop list empty.  Glimpsein‐
       dex  is	using the "sort" routine, and all occurrences of a word appear
       at some point on one line.  Sort is limiting the size of lines  it  can
       handle (the value depends on the platform; ours is 16KB).  If the lines
       are too big, the word is added to the stop list.
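
       For reference, the stop-list thresholds described under the -S option
       can be written out as a small Python sketch.  The function names and
       input dictionaries are hypothetical, a MByte is assumed to be
       1024*1024 bytes, and the sort line-length effect mentioned above is
       not modeled.

```python
def small_index_stop_words(files_containing, num_files, k_percent=80):
    """-o rule: words that appear in at least k% of all files.

    files_containing maps each word to the number of files it appears in.
    No stop list is kept when there are fewer than 256 files.
    """
    if num_files < 256:
        return set()
    return {w for w, n in files_containing.items()
            if n * 100 >= k_percent * num_files}


def medium_index_stop_words(occurrences, total_bytes, k_per_mbyte=500):
    """-b rule: words that occur at least k times per MByte of text."""
    mbytes = total_bytes / (1024 * 1024)  # ASSUMPTION: 1 MByte = 2**20 bytes
    return {w for w, n in occurrences.items() if n >= k_per_mbyte * mbytes}


if __name__ == "__main__":
    print(small_index_stop_words({"the": 900, "needle": 3}, 1000))
    print(medium_index_stop_words({"the": 1200, "needle": 10}, 2 * 1024 * 1024))
```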

       Please send bug reports or comments to

       (Only in version 3.6 and above.)
       exit status 0: terminated normally;
       exit status 1: glimpseindex errors (e.g., bad option combos,  no	 files
       were indexed, etc.)
       exit status 2: system errors (e.g., write failed, sort failed, malloc
       failed)
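
       A calling script can branch on these values.  The following Python
       sketch is a hypothetical helper that maps the documented exit statuses
       to short descriptions; any other value is reported as undocumented
       rather than guessed at.

```python
EXIT_STATUS_MEANINGS = {
    0: "terminated normally",
    1: "glimpseindex error (e.g., bad option combination, no files indexed)",
    2: "system error (e.g., a write, sort, or malloc failure)",
}


def describe_exit_status(code):
    """Return a short description of a glimpseindex exit status."""
    return EXIT_STATUS_MEANINGS.get(code, "undocumented exit status %d" % code)


if __name__ == "__main__":
    for code in (0, 1, 2, 7):
        print(code, describe_exit_status(code))
```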

       Udi Manber and Burra Gopal, Department of Computer Science,  University
       of  Arizona,  and  Sun Wu, the National Chung-Cheng University, Taiwan.

			       November 10, 1997	       GLIMPSEINDEX(l)
