DOC2TXT(1)
NAME
doc2txt, doc2ps, wdoc2txt, xls2txt, olefs, mswordstrings, msexceltables
- extract printable text from Microsoft documents
SYNOPSIS
doc2txt [ file.doc ]
doc2ps [ file.doc ]
wdoc2txt [ file.doc ]
xls2txt [ file.xls ]
aux/olefs [ -m mtpt ] file.doc
aux/mswordstrings mtpt/WordDocument
aux/msexceltables [ -qaDnt ] [ -d delim ] [ -c column-range ] [ -w
worksheet-range ] mtpt/Workbook
DESCRIPTION
Doc2txt is an rc(1) script that uses olefs and mswordstrings to extract
the printable text from the body of a Microsoft Word document and write
it on the standard output. Doc2ps is similar, but emits PostScript
corresponding to the document. Wdoc2txt is similar to doc2txt, but
uses plumb(1) to send the output to a new acme(1) window instead.
Xls2txt performs a similar function for Microsoft Excel documents.
Microsoft Office documents are stored in OLE (Object Linking and
Embedding) format, which is a scaled-down version of Microsoft's FAT
file system. Olefs presents the contents of an MS Office document as a
file system on mtpt, which defaults to /mnt/doc. Mswordstrings or
msexceltables may then be used to parse the files inside, extracting a
text stream. Msexceltables may be given options to control the
formatting of its output:
-a Attempt conversion of non-tabular sheets in the workbook
(charts).
-d delim
Sets the inter-field delimiter to the string delim, by default a
single space.
-D Enables debugging output.
-c range
Range is a comma-separated list of column numbers and ranges;
the two ends of a range are separated by a dash. Processing is
limited to the columns named; by default all columns are output.
-n Disables field padding to column width.
-q Disable quoting of textual fields (see quote(2)).
-t Truncate fields to the column width.
-w range
Range is a comma-separated list of worksheet numbers and ranges,
in the same syntax as the -c option above; only the sheets named
are output. Chart pages are always included in the sheet count,
even when they are suppressed.
EXAMPLE
Extract pieces of an MS Excel spreadsheet.
aux/olefs report.xls
aux/msexceltables -q -w 1,7,9-14 -c 3-5 -n -d '@' /mnt/doc/Workbook > rpt.txt
unmount /mnt/doc
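The range arguments given to -w and -c above (1,7,9-14 and 3-5) use
the comma-and-dash syntax described under the -c option. The following
is a minimal Python sketch of how such a string might expand into
individual numbers. It is an illustration only, not the code used by
msexceltables: the function name parse_ranges is hypothetical, and the
sketch assumes that both ends of a range are included.

```python
def parse_ranges(spec):
    """Expand a range string such as '1,7,9-14' into a sorted list of integers.

    Assumption: both ends of a dashed range are inclusive.
    """
    selected = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            # A dashed range: split into its two ends.
            lo, hi = part.split("-", 1)
            lo, hi = int(lo), int(hi)
            if lo > hi:
                raise ValueError(f"descending range: {part!r}")
            selected.update(range(lo, hi + 1))
        else:
            # A single number.
            selected.add(int(part))
    return sorted(selected)


if __name__ == "__main__":
    print(parse_ranges("1,7,9-14"))  # [1, 7, 9, 10, 11, 12, 13, 14]
    print(parse_ranges("3-5"))       # [3, 4, 5]
```

Under these assumptions, -w 1,7,9-14 names eight worksheets and
-c 3-5 names three columns.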
SOURCE
/rc/bin
doc2txt, doc2ps, wdoc2txt, and xls2txt
/sys/src/cmd/aux
the others
SEE ALSO
strings(1)
``Microsoft Word 97 Binary File Format'', at Microsoft's developer
(MSDN) home page.
``LAOLA Binary Structures'', http://user.cs.tu-berlin.de/~schwartz/pmh
``OpenOffice.Org's Excel Documentation'',
http://sc.openoffice.org/excelfileformat.pdf