catdoc man page on DragonFly

catdoc(1)							     catdoc(1)

NAME
       catdoc - reads MS-Word file and puts its content as plain text on stan‐
       dard output

SYNOPSIS
       catdoc [-vlu8btawxV] [-m number] [ -s charset] [ -d charset] [ -f  out‐
       put-format] file

DESCRIPTION
       catdoc behaves much like cat(1), but it reads an MS-Word file and
       produces human-readable text on standard output. Optionally it can use
       latex(1) escape sequences for characters which have special meaning
       in LaTeX. It also makes some effort to recognize MS-Word tables,
       although it never tries to write correct headers for the LaTeX tabular
       environment. Additional output formats, such as HTML, can be easily
       defined.

       catdoc doesn't attempt to extract formatting information other than
       tables from an MS-Word document, so the output modes differ mainly in
       which characters are escaped and in how characters missing from the
       output charset are represented. See CHARACTER SUBSTITUTION below.

       catdoc uses an internal unicode(4) representation of text, so it can
       convert text when the charset of the source document doesn't match the
       charset of the target system. See CHARACTER SETS below.

       If no file names are supplied, catdoc processes its standard input
       unless it is a terminal. It is unlikely that somebody would type a Word
       document from the keyboard, so if catdoc is invoked without arguments
       and stdin is not redirected, it prints a brief usage message and exits.
       Processing of standard input (even among other files) can be forced by
       using a dash '-' as a file name.

       By default, catdoc wraps lines which are more than 72 chars long and
       separates paragraphs by blank lines. This behavior can be turned off
       with the -w switch. In wide mode catdoc prints each paragraph as one
       long line, suitable for import into word processors which perform word
       wrapping themselves.

OPTIONS
       -a      -  shortcut for -f ascii. Produces ASCII text as output.	 Sepa‐
	       rates table columns with TAB

       -b      - process broken MS-Word file. Normally, catdoc checks whether
               the first 8 bytes of the file are the Microsoft OLE signature.
               If so, it processes the file; otherwise it just copies it to
               stdout. This is intended to allow catdoc to be used as a filter
               for viewing all files with the .doc extension.

       -dcharset
               - specifies destination charset name. The charset file has the
               format described in CHARACTER SETS below and should have a .txt
               extension and reside in the catdoc library directory
               (/usr/local/share/catdoc). By default, the current locale
               charset is used if langinfo support is compiled in.

       -fformat
	       -  specifies  output format as described in CHARACTER SUBSTITU‐
	       TION below.  catdoc comes with two output formats -  ascii  and
	       tex. You can add your own if you wish.

       -l      Causes catdoc to list names of available charsets to the stdout
	       and exit successfully.

       -mnumber
	       Specifies right margin for text	(default 72).  -m 0 is equiva‐
	       lent to -w

       -scharset
               Specifies the source charset (the one used in the Word
               document) if the Word document doesn't contain UTF-16 text.
               When reading RTF documents it is typically not necessary,
               because RTF documents contain an ansicpg specification. But it
               can be set wrongly by Word (I've seen RTF documents in Russian
               where cp1252 was specified). In this case this option takes
               precedence over the charset specified in the document. The
               source_charset statement in the configuration file, however,
               has lower priority than the charset in the document.

       -t      - shortcut for -f tex. Converts all printable chars which have
               special meaning for LaTeX(1) into appropriate control
               sequences. Separates table columns by &.

       -u      - declares that the Word document contains a UNICODE (UTF-16)
               representation of text (as some Word-97 documents do). If
               catdoc fails to convert a Word document correctly with the
               default charset, try this option.

       -8      - declares that the Word document is 8-bit. Use it in case
               catdoc recognizes the file format incorrectly.

       -w      disables word wrapping. By default catdoc output is split into
               lines not longer than 72 characters (or the number specified by
               the -m option) and paragraphs are separated by a blank line.
               With this option each paragraph is one long line.

       -x      causes  catdoc  to  output unknown UNICODE character as \xNNNN,
	       instead of question marks.

       -v      causes catdoc to print some useless information about word doc‐
	       ument structure to stdout before actual start of text.

       -V      outputs catdoc version
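
       The signature check described under -b can be sketched in a few lines
       of Python. This is only an illustration, not catdoc's own code: the
       byte sequence is the well-known signature of OLE2 compound files, and
       the helper names looks_like_ole and filter_decision are hypothetical.

```python
# Well-known 8-byte signature found at the start of OLE2 compound files.
OLE_SIGNATURE = bytes.fromhex("D0CF11E0A1B11AE1")


def looks_like_ole(data: bytes) -> bool:
    """Return True if the buffer starts with the OLE signature."""
    return data[:8] == OLE_SIGNATURE


def filter_decision(data: bytes) -> str:
    """Hypothetical helper mirroring the documented default behaviour:
    process files that carry the signature, copy everything else as is."""
    return "process" if looks_like_ole(data) else "copy"
```

       Under these assumptions, a plain text file that happens to have a .doc
       extension yields "copy", which is what makes the filter usage safe.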

CHARACTER SETS
       When processing an MS-Word file, catdoc uses information about two
       character sets, typically different: input and output. They are stored
       in plain text files in the catdoc library directory. Each line of a
       character set file should contain two whitespace-separated hexadecimal
       numbers: the 8-bit code in the character set and the 16-bit Unicode
       code. Anything from a hash mark to the end of the line is ignored, as
       are blank lines.
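
       A minimal Python sketch of a reader for this file format follows. It
       is not catdoc's parser; the function name parse_charset and the sample
       data are assumptions made for illustration, and lines with fewer than
       two fields are simply skipped.

```python
def parse_charset(text: str) -> dict[int, int]:
    """Map each 8-bit code to its Unicode code point."""
    table = {}
    for line in text.splitlines():
        # Everything from a hash mark to the end of the line is ignored.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        fields = line.split()
        if len(fields) < 2:
            continue  # assumption: incomplete entries are skipped
        # int(..., 16) accepts both "80" and "0x80".
        table[int(fields[0], 16)] = int(fields[1], 16)
    return table


SAMPLE = """\
# illustrative fragment, not a complete charset
0x41  0x0041  # LATIN CAPITAL LETTER A

0x80  0x20AC  # EURO SIGN
"""

print(parse_charset(SAMPLE))
```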

       catdoc distribution includes some of these character  sets.  Additional
       character  set  definitions,  directly usable by catdoc can be obtained
       from ftp.unicode.org. Charset files have .txt suffix,  which  shouldn't
       be specified in command-line or configuration files.

       Note that catdoc is distributed with Cyrillic charsets as default. If
       you are not Russian, you probably don't want it, and should reconfigure
       catdoc at compile time or in the runtime configuration file.

       When dealing with documents with charsets other than the default,
       remember that Microsoft never uses ISO charsets. While letters in, say,
       cp1252 are at the same positions as in ISO-8859-1, some punctuation
       signs would be lost if you specify ISO-8859-1 as the input charset. If
       you use cp1252, catdoc deals with those signs as described in
       CHARACTER SUBSTITUTION below.

CHARACTER SUBSTITUTION
       catdoc converts	MS-Word file into following internal Unicode represen‐
       tation:

       1. Paragraphs are separated by ASCII Line Feed symbol (0x000A)

       2. Table cells within row are separated by ASCII Field Separator symbol
	   (0x001C)

       3. Table rows are separated by ASCII Record Separator (0x001E)

       4. All printable characters, including whitespace, are represented
           with their respective UNICODE codes.
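
       To make the separators listed above concrete, here is a small Python
       sketch that turns a table in this internal representation into text
       rows. The function render_rows is hypothetical, and the assumption
       that each row ends with a Record Separator is an illustration, not a
       statement about catdoc's internals.

```python
PARAGRAPH_SEP = "\u000A"  # ASCII Line Feed, separates paragraphs
CELL_SEP = "\u001C"       # ASCII Field Separator, separates cells in a row
ROW_SEP = "\u001E"        # ASCII Record Separator, separates table rows


def render_rows(internal: str, column_sep: str) -> list[str]:
    """Split a table into rows and join the cells with column_sep."""
    # Empty chunks (e.g. after a trailing separator) are dropped here.
    rows = [row for row in internal.split(ROW_SEP) if row]
    return [column_sep.join(row.split(CELL_SEP)) for row in rows]


table = "a" + CELL_SEP + "b" + ROW_SEP + "c" + CELL_SEP + "d" + ROW_SEP
print(render_rows(table, "\t"))   # TAB-separated columns, as with -a
print(render_rows(table, " & "))  # & between columns, as with -t
```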

       This UNICODE representation is subsequently converted into  8-bit  text
       in target character set using following four-step algorithm:

       1. List of special characters is searched for given Unicode character.
	   If  found,  then  appropriate  multi-character  sequence  is output
	   instead of character.

       2. If there is an equivalent in target character set, it is output.

       3. Otherwise, replacement list is searched and, if there is
           multi-character substitution for this UNICODE char, it is output.

       4. If all above fails, "Unknown char" symbol (question mark) is output.

       The list of special characters and the list of substitutions are
       character set-independent: special chars should be escaped regardless
       of their existence in the target character set (usually they are part
       of US-ASCII, and therefore exist in any character set), and the
       replacement list is searched only for those characters which are not
       found in the target character set.
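
       The four-step algorithm above can be sketched in Python as follows.
       This is an illustration under stated assumptions, not catdoc's code:
       the function convert_char and the three sample tables are hypothetical,
       and substitution strings are assumed to be plain ASCII.

```python
def convert_char(cp, special, charset, replacement, unknown=b"?"):
    """Convert one Unicode code point to bytes in the target charset.

    special, replacement: dict mapping code point -> substitution string
    charset: dict mapping code point -> 8-bit code in the target charset
    """
    if cp in special:                 # step 1: special characters
        return special[cp].encode("ascii")
    if cp in charset:                 # step 2: direct equivalent
        return bytes([charset[cp]])
    if cp in replacement:             # step 3: replacement list
        return replacement[cp].encode("ascii")
    return unknown                    # step 4: "unknown char" symbol


# Hypothetical tables for a TeX-like output format.
special = {0x26: "\\&"}               # '&' must be escaped
charset = {0x41: 0x41, 0x26: 0x26}    # 'A' and '&' exist in the target
replacement = {0x2014: "--"}          # a dash missing from the target

for cp in (0x26, 0x41, 0x2014, 0x4E2D):
    print(hex(cp), convert_char(cp, special, charset, replacement))
```

       Note that 0x26 is present in the sample charset but is still escaped,
       because the special character list is consulted first.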

       These lists are stored in catdoc library directory in files with prefix
       of format name. These files have following format:

       Each line is either a comment (starting with a hash mark) or contains a
       hexadecimal UNICODE value, separated by whitespace from the string
       which is substituted for it. If the string contains no whitespace it
       can be used as is; otherwise it should be enclosed in single or double
       quotes. Usual backslash sequences like '\n' and '\t' can be used in
       these strings.
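
       A minimal Python sketch of a parser for one line of such a map file is
       shown below. The names unescape and parse_map_line are hypothetical.
       Which backslash sequences are honoured beyond \n and \t, and how blank
       lines are treated, are assumptions of this sketch rather than
       documented behavior.

```python
# Assumed escape table: only \n and \t are named in the description above.
ESCAPES = {"n": "\n", "t": "\t"}


def unescape(s: str) -> str:
    """Expand backslash sequences; unknown ones yield the escaped char."""
    out = []
    i = 0
    while i < len(s):
        if s[i] == "\\" and i + 1 < len(s):
            out.append(ESCAPES.get(s[i + 1], s[i + 1]))
            i += 2
        else:
            out.append(s[i])
            i += 1
    return "".join(out)


def parse_map_line(line: str):
    """Return (code_point, substitution), or None for comments and blanks."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    parts = stripped.split(None, 1)
    if len(parts) < 2:
        return None  # assumption: entries without a string are skipped
    code, value = parts[0], parts[1].strip()
    # Strings containing whitespace are enclosed in single or double quotes.
    if len(value) >= 2 and value[0] in "'\"" and value[-1] == value[0]:
        value = value[1:-1]
    return int(code, 16), unescape(value)


print(parse_map_line("2014 --"))
print(parse_map_line('00A0 " "'))
```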

RUNTIME CONFIGURATION
       Upon  startup  catdoc  reads  its  system-wide  configuration  file   (
       /usr/local/etc/catdocrc	)  and	then  user-specific configuration file
       ${HOME}/.catdocrc.

       These files can contain following directives:

       source_charset = charset-name
	       Sets default source charset, which  would  be  used  if	no  -s
	       option specified. Consult configuration of nearby windows work‐
	       station to find one you need.

       target_charset = charset-name
		Sets default output charset. You probably know, which one  you
	       use.

       charset_path = directory-list
               colon-separated list of directories which are searched for
               charset files. This allows you to install additional charsets
               in your home directory. If the first directory component of a
               path is ~, it is replaced by the contents of the HOME
               environment variable. On the MS-DOS platform, if a directory
               name starts with %s, it is replaced with the directory of the
               executable file. An empty element in the list (i.e. two
               consecutive colons) is considered the current directory.

       map_path = directory-list
	       colon-separated list of directories,  which  are	 searched  for
	       special	character  map and replacement map.  Same substitution
	       rules as in charset_path are applied.

       format = format name
	       Output format which would be used  by  default.	 catdoc	 comes
	       with  two formats - ascii and tex but nothing prevents you from
	       writing your own format (set two map files - special  character
	       map and replacement map).

       unknown_char = character specification
               sets the character to output instead of an unknown Unicode
               character (default '?'). The character specification can have
               one of two forms: a character enclosed in single quotes or a
               hexadecimal code.

       use_locale =(yes|no)
               Enables or disables automatic selection of the output charset
               based on system locale settings (default yes, if enabled at
               compile time). If automatic detection is enabled, then output
               charset settings in the configuration files (but not on the
               command line) are ignored, and the current system locale
               charset is used instead. There is no automatic choice of input
               charset based on the locale language, because most modern Word
               files (since Word 97) are Unicode anyway.
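
       The substitution rules for charset_path and map_path can be sketched
       in Python as follows. The function expand_path_list is hypothetical
       and covers only the ~ and empty-element rules; the MS-DOS %s rule is
       left out of this sketch.

```python
import os


def expand_path_list(value: str, home=None) -> list[str]:
    """Expand a colon-separated directory list as described above."""
    if home is None:
        home = os.environ.get("HOME", "")
    dirs = []
    for element in value.split(":"):
        if element == "":
            # An empty element (two consecutive colons) means the
            # current directory.
            dirs.append(".")
        elif element == "~" or element.startswith("~/"):
            # A leading ~ component is replaced by the value of HOME.
            dirs.append(home + element[1:])
        else:
            dirs.append(element)
    return dirs


print(expand_path_list("~/.catdoc::/usr/local/share/catdoc",
                       home="/home/alice"))
```

       The directory names in the example, including ~/.catdoc and
       /home/alice, are placeholders chosen for illustration.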

BUGS
       Doesn't handle fast-saves properly. Prints footnotes as separate	 para‐
       graphs at the end of file, instead of producing correct LaTeX commands.
       Cannot distinguish between empty table cell and end of table row.

SEE ALSO
       xls2csv(1), cat(1), strings(1), utf(4), unicode(4)

AUTHOR
       V.B.Wagner <vitus@45.free.net>

MS-Word reader			Version 0.94.2			     catdoc(1)