SYNOPSIS
anytotx [options] [inputfile]
DESCRIPTION
The anytotx program attempts to identify and translate its
input file to ASCII text. This can be used when crawling non-text
file formats (such as PDF and MS-Word), to obtain the plain text for
searching. (The SQL function totext() calls this program
internally.) There is built-in support for many common file formats,
and any new file format can be added by modifying the formats.rule
config file.
The input file is given last on the command line, after any options; if not present, standard input is assumed. The output is the text version of the document, written to standard output. In version 4.02.1047588542 Mar 13 2003 and later, the output is always MIME, and may be multipart/mixed to support multi-file archives such as ZIP files.
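Because the output is MIME, a caller can split it into parts with any standard MIME parser. The following is a minimal sketch using Python's standard email package. The helper name split_parts, the sample message and its boundary are invented for illustration; they are not actual anytotx output, whose exact headers may differ.

```python
from email import message_from_bytes
from email.policy import default


def split_parts(raw: bytes):
    """Return a (headers, content) pair for each leaf part of a MIME message."""
    msg = message_from_bytes(raw, policy=default)
    parts = []
    # walk() yields the message itself and, for multipart, every sub-part.
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        parts.append((dict(part.items()), part.get_content()))
    return parts


# Hypothetical sample only; real anytotx headers may differ.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"first file\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain; charset=us-ascii\r\n"
    b"\r\n"
    b"second file\r\n"
    b"--XX--\r\n"
)

if __name__ == "__main__":
    for headers, text in split_parts(SAMPLE):
        print(headers.get("Content-Type"), repr(text))
```

A single-part (non-archive) message is handled by the same loop, since walk() then yields only the message itself.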
The following options are supported. Some long options that take a value (shown below as --option{=| }VALUE) accept it either attached with = or as a separate following argument; the separate-argument form was added in version 8.
-h
Print synopsis of options.
-p
Select alternate text ordering for PDF conversion. By default,
the text output for PDFs is done linearly, so that hit markup with
pdfxml is done properly. However, this may output text in
a less desirable ordering for text searching, especially with
tables and multi-column pages. The -p option selects
non-linear text output mode.
-pp
Select "pretty-print" mode for PDF conversion.
-s
Keep short lines (3 characters or less) when converting in
-fOTHER mode. By default, short lines are suppressed as
they are often garbage.
-Ppass
Use pass as the password to access protected
files (e.g. certain PDFs).
-l (lower-case el)
Extract hyperlinks from document, where supported.
Each link is printed as a Link: header in the MIME output.
-mNAME
Extract meta data field NAME from document, where
supported. Common meta fields are Title, Subject
and Keywords. Each meta field is printed as a header in
the MIME output.
-M
Extract all known meta data. Varies by input type:
HTML: title
Flash: version, framesize, framerate, framecount
PDF: Author, CreationDate, ModDate, Creator, Producer, Title,
Subject, Keywords, X-Print, X-Change, X-Copy, X-Addnotes,
X-Linear, X-Encrypted, X-Pages, X-PDF-Version, X-Tagged,
X-Filter-Version
MSW, XLS, MSO: Title, Subject, Author, Keywords, Comments,
Template, Last-Author, Revision, Edit-Time, Printed, Created,
Saved, Pages, Words, Chars, Thumbnail, Creator, Security,
Category, Target, Bytes, Lines, Paragraphs, Slides, Notes,
Hidden-Slides, MM-Clips, Scale-Crop, Heading-Pairs, Titles,
Manager, Company, Links-Up-To-Date, X-Filter-Version
TIFF: ImageWidth, ImageLength, DocumentName, ImageDescription,
Make, Model, PageName, PageNumber, Software, DateTime, Artist,
HostComputer, InkNames, TargetPrinter, Copyright
-fCODE
Assume input file is one of the built-in formats indicated by
CODE, which is one of:
PDF for Adobe Acrobat PDF; MIME type application/pdf
HTML for HyperText Markup Language; MIME type text/html
XML for XML; MIME type text/xml
MSW for Microsoft Word; MIME type application/msword
XLS for Microsoft Excel; MIME type application/vnd.ms-excel
PPT for Microsoft PowerPoint; MIME type application/vnd.ms-powerpoint
MSO for other Microsoft formats; MIME type application/x-ms-other
SWF for Shockwave-Flash; MIME type application/x-shockwave-flash
GIF for Graphics Interchange Format; MIME type image/gif.
Added in version 4.02.1046193282 Feb 25 2003.
TIFF for Tag Image File Format; MIME type image/tiff.
Added in version 5.00.1084000000 May 8 2004.
TNEF for Microsoft Transport-Neutral Encoding Format; MIME type
application/tnef. Added in version 4.02.1047588542 Mar 13 2003.
GZIP for gzip files; MIME type application/x-gzip
COMPRESS for compressed files; MIME type application/x-compress
WPD for WordPerfect files; MIME type application/wordperfect
(added in version 7.01; previously handled as OTHER)
AUTO to auto-detect the format (the default)
OTHER for an unknown format; MIME type application/octet-stream
Codes are case-insensitive. The default is to automatically
detect the input file type (i.e. -fAUTO). Note that more file
formats may be supported (via the formats rule file) than are
listed here. It is not usually necessary to specify the input
type; most are detected properly. See also the
--content-type option, which supersedes this. A MIME type
may also be given to -f instead of a code; parameters
(e.g. charset) may be given but are largely ignored (HTML mode
uses charset, as of version 7.02.1416893000 20141125).
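The built-in codes and their MIME types can be kept in a small lookup table when generating -f arguments from a script. This is only a sketch: the names FORMAT_CODES and format_option are invented here, and rejecting codes outside the built-in list is an assumption made for the example, not documented anytotx behavior.

```python
# Built-in format codes and their MIME types, as listed for -f.
FORMAT_CODES = {
    "PDF": "application/pdf",
    "HTML": "text/html",
    "XML": "text/xml",
    "MSW": "application/msword",
    "XLS": "application/vnd.ms-excel",
    "PPT": "application/vnd.ms-powerpoint",
    "MSO": "application/x-ms-other",
    "SWF": "application/x-shockwave-flash",
    "GIF": "image/gif",
    "TIFF": "image/tiff",
    "TNEF": "application/tnef",
    "GZIP": "application/x-gzip",
    "COMPRESS": "application/x-compress",
    "WPD": "application/wordperfect",
    "OTHER": "application/octet-stream",
}


def format_option(value: str = "AUTO") -> str:
    """Build a -f argument from a code (any case) or a MIME type."""
    if "/" in value:
        # A MIME type may be given instead of a code; pass it through.
        return "-f" + value
    code = value.upper()  # codes are case-insensitive
    if code != "AUTO" and code not in FORMAT_CODES:
        # Assumption for this sketch: only the built-in codes are accepted.
        raise ValueError("unknown format code: %r" % value)
    return "-f" + code


if __name__ == "__main__":
    print(format_option("pdf"))
    print(format_option("text/html"))
```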
-g
Print additional information in headers, such as input
file type, translator arguments, etc.
-G
Same as -g, but exit without attempting the actual translation.
-v
Enable verbose output.
-Dnnn
Enable debugging output, level nnn. Default is 0.
Optional nnn added in version 5.01.1110400000 Mar 9 2005.
-uURL
Use URL as the URL of the input file (for informational
purposes, does not fetch anything).
--install-dir{=| }DIR
Set the Texis install dir to use. Default is as installed, or
typically /usr/local/morph3 under Unix. Added in version
4.03.1051600000 Apr 29 2003.
--rule-file{=| }FILE
Use formats rule file FILE. The default is the file
specified by Rule File in the [Anytotx] section of
the conf/texis.ini config file, or if that is not set,
conf/formats.rule in the Texis install dir. If the formats
rule file cannot be found or read, a default internal version is
used. The formats rule file tells anytotx how to identify
and translate file formats; see below for syntax. Added in
version 4.02.1045857437 Feb 21 2003.
--types-config{=| }FILE
Use MIME types config file FILE. The default is the file
specified by Types Config in the [Anytotx] section
of the conf/texis.ini config file, or if that is not set,
conf/mime.types in the Texis install dir. This file maps
MIME types to file extensions, as a fall back for identifying
files (a formats rule file entry is still usually needed). It is
the same format as Apache mime.types files, i.e. each line
is a MIME type followed by zero or more space-separated file
extensions (no dot). Added in version 4.02.1045857437 Feb 21
2003.
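The file format described above is simple enough to read with a few lines of code. The following sketch parses Apache-style mime.types text into an extension-to-type map. The function name and sample content are illustrative, and two details are assumptions rather than documented behavior: that # starts a comment (as in Apache's file) and that the first type listed for an extension wins.

```python
from typing import Dict


def parse_mime_types(text: str) -> Dict[str, str]:
    """Map each file extension (no dot) to its MIME type."""
    ext_to_type: Dict[str, str] = {}
    for line in text.splitlines():
        # Assumption: '#' starts a comment, as in Apache mime.types.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        # A MIME type followed by zero or more extensions.
        mime_type, *extensions = line.split()
        for ext in extensions:
            # Assumption: the first mapping for an extension wins.
            ext_to_type.setdefault(ext, mime_type)
    return ext_to_type


# Illustrative content only.
SAMPLE = """\
# MIME type        extensions
application/pdf    pdf
text/html          html htm
application/x-gzip
"""

if __name__ == "__main__":
    print(parse_mime_types(SAMPLE))
```

A type with no extensions, such as the last sample line, contributes nothing to the map.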
--max-depth{=| }N
Maximum depth to recurse when processing a file. Multiple
translators may need to be run to translate a file to text
(e.g. RTF to HTML to text). Keeping this setting low can prevent
an infinite loop if the content ever "bounces" between types.
The default is 5, which may need to be raised if complex,
multi-level translators are used. Added in version
4.02.1045857437 Feb 21 2003.
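The effect of a depth limit can be illustrated with a toy model. In the sketch below, the translator table, the MIME types x-a and x-b, and the function translation_chain are all hypothetical; how anytotx actually counts depth internally is not specified here, so this shows only the general idea of bounding a chain of translations.

```python
# Hypothetical translator table: input MIME type -> output MIME type.
TRANSLATORS = {
    "application/rtf": "text/html",
    "text/html": "text/plain",
    # A misconfigured pair whose content "bounces" forever:
    "application/x-a": "application/x-b",
    "application/x-b": "application/x-a",
}


def translation_chain(mime_type, max_depth=5):
    """Follow translators until text/plain, giving up after max_depth steps."""
    chain = [mime_type]
    while chain[-1] != "text/plain":
        if len(chain) - 1 >= max_depth:
            raise RuntimeError(
                "max depth %d exceeded: %s" % (max_depth, " -> ".join(chain))
            )
        next_type = TRANSLATORS.get(chain[-1])
        if next_type is None:
            raise LookupError("no translator for %s" % chain[-1])
        chain.append(next_type)
    return chain


if __name__ == "__main__":
    print(translation_chain("application/rtf"))
    try:
        translation_chain("application/x-a")
    except RuntimeError as exc:
        print(exc)
```

Without the limit, the x-a/x-b pair would loop indefinitely; with it, the chain is abandoned after five steps.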
--tmp{=| }DIR
Use directory DIR for temporary files during translation.
The default is the dir specified by the environment variables
TMP, TMPDIR, TEMP or TEMPDIR. If no
environment variable is set, the dir C:\ (Windows) or
/tmp (Unix) is used, or the tmp subdirectory of
the Texis installation directory.
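A script that needs to predict where temporary files will land can mirror this lookup. The sketch below assumes the environment variables are consulted in the order listed and that the first non-empty one wins; neither is confirmed by the text above. It also omits the install-directory tmp fallback, since the conditions under which that applies are not stated. The function name default_tmp_dir is invented.

```python
import os


def default_tmp_dir(environ=None, is_windows=None):
    """Guess the default temp dir from the environment, per the text above."""
    env = os.environ if environ is None else environ
    if is_windows is None:
        is_windows = os.name == "nt"
    # Assumption: variables are checked in this order, first non-empty wins.
    for name in ("TMP", "TMPDIR", "TEMP", "TEMPDIR"):
        value = env.get(name)
        if value:
            return value
    # No variable set: platform default (install-dir fallback not modeled).
    return "C:\\" if is_windows else "/tmp"


if __name__ == "__main__":
    print(default_tmp_dir())
```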
--timeout{=| }NNN
Timeout in seconds; default is 30. Use -1 for no timeout.
Added in version 4.03.1051675200 Apr 30 2003.
--content-type{=| }TYPE
Assume input is MIME type TYPE. The default is to
automatically detect the type. If specified, the MIME type should
be one that has a translator in the formats rule file, or a
built-in type such as application/octet-stream. Added in
version 4.02.1045857437 Feb 21 2003.
Note that unlike -f this option's value can be other than
one of the built-in MIME types: rather than dictating how to
translate the input, the value merely describes it, and will be
looked up in the formats rule file. Parameters (e.g. charset)
are permitted in version 7.02.1416893000 20141125 and later, and
are ignored.
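When the MIME type comes from elsewhere (for example an HTTP Content-Type header), it may carry parameters. The following sketch separates the bare type from its parameters using Python's standard email package, which is one way a caller might see what would be ignored. The helper name split_content_type is invented for this example.

```python
from email.message import EmailMessage


def split_content_type(value: str):
    """Return (bare MIME type, parameter dict) for a Content-Type value."""
    msg = EmailMessage()
    msg["Content-Type"] = value
    # The parsed header exposes the parameters as a read-only mapping.
    params = dict(msg["Content-Type"].params)
    return msg.get_content_type(), params


if __name__ == "__main__":
    print(split_content_type("text/html; charset=UTF-8"))
    print(split_content_type("application/pdf"))
```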
--error-log{=| }FILE
Log errors to FILE. The default is standard error. Added
in version 4.02.1045857437 Feb 21 2003.
--save-files
Save temporary output files in the temp dir. The default is to
delete them. This can be used to see the raw files unpacked from
an archive, before text translation (e.g. for TNEF or ZIP
archives).
--output-enc{=| }CHARSET
Use output encoding CHARSET where possible, e.g. UTF-8.
Added in version 5.01.1110258000 Mar 8 2005.
--expand-ligatures{=| }MIME
Expand single-character Unicode ligatures (e.g. "ffi" character)
into multiple characters for input MIME type MIME. The default is
application/pdf; set to none to turn off. Increases
document searchability, but may affect browser PDF plugin
highlighting. Added in version 5.01.1110258000 Mar 8 2005.
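To see why expansion helps searching, consider a document containing the single "ffi" ligature character: a query for "office" would not match it byte-for-byte. The sketch below shows the general idea with a small mapping of Latin ligatures from the Unicode Alphabetic Presentation Forms block. It is an illustration only, not the mapping anytotx uses; the names LIGATURES and expand_ligatures are invented.

```python
# Single-character Latin ligatures and their multi-character expansions.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# Precomputed translation table for str.translate().
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Replace each ligature character with its component letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    word = "o\ufb03ce"  # "office" written with the ffi ligature
    print("office" in word)                    # ligature hides the match
    print("office" in expand_ligatures(word))  # expansion restores it
```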
--name{=| }STR
Set name of object being processed, for error messages.
Added in version 4.03.1055995200 Jun 19 2003.
--fix-mode{=| }{y|n}
Whether to reset the attributes (mode) of unpacked files so that they
are non-hidden and readable.
Default is y.
Added in version 4.03.1059364800 Jul 28 2003.
--trace-pipe{=| }NNN
Debugging: set pipe trace level NNN.
Added in version 4.04.1071637200 Dec 17 2003.
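When calling anytotx from a script, the long options above can be assembled programmatically. The following sketch builds an argument list in either the --option=VALUE form or the separate-argument form. The function build_command and its parameters are invented for illustration; only the option names themselves come from this page.

```python
def build_command(input_file=None, *, separate_args=False,
                  program="anytotx", **long_options):
    """Build an argv list; keyword names map to long options (_ becomes -)."""
    argv = [program]
    for name, value in long_options.items():
        flag = "--" + name.replace("_", "-")
        if value is True:
            argv.append(flag)  # valueless option, e.g. --save-files
        elif separate_args:
            argv += [flag, str(value)]  # separate-argument form
        else:
            argv.append("%s=%s" % (flag, value))  # assignment form
    if input_file is not None:
        argv.append(input_file)  # input file goes last, after any options
    return argv


if __name__ == "__main__":
    print(build_command("doc.pdf", timeout=60, output_enc="UTF-8"))
    print(build_command("a.zip", separate_args=True, max_depth=8))
```

The resulting list can be passed to subprocess.run(); omitting input_file leaves the program reading standard input.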