SYNOPSIS
anytotx [options] [inputfile]
DESCRIPTION
The anytotx program attempts to identify the format of its input
file and translate it to ASCII text. This is useful when crawling
non-text file formats (such as PDF and MS-Word), to obtain the plain
text for searching. (The SQL function totext() calls this program
internally.) There is built-in support for many common file formats,
and new file formats can be added by modifying the formats.rule
config file.
The input file is given last on the command line, after any options; if not present, standard input is assumed. The output is the text version of the document, written to standard output. In version 4.02.1047588542 Mar 13 2003 and later, the output is always MIME, and may be multipart/mixed to support multi-file archives such as ZIP files.
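Because the output is MIME and may be multipart, a caller usually has to split it into parts before indexing. The following is a minimal Python sketch of one way to do that with the standard email module. The function names are hypothetical, the fallback to UTF-8 for parts that name no charset is an assumption, and convert() assumes anytotx is on the PATH.

```python
import subprocess
from email import message_from_bytes
from typing import List, Tuple


def split_mime_output(raw: bytes) -> List[Tuple[str, str]]:
    """Return (content type, text) for each non-container part of a MIME message."""
    msg = message_from_bytes(raw)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # container only; walk() visits its children
        payload = part.get_payload(decode=True) or b""
        # Assumption: fall back to UTF-8 when the part names no charset.
        charset = part.get_content_charset() or "utf-8"
        text = payload.decode(charset, errors="replace")
        parts.append((part.get_content_type(), text))
    return parts


def convert(path: str) -> List[Tuple[str, str]]:
    """Run anytotx on path and split its MIME output into parts."""
    done = subprocess.run(["anytotx", path], capture_output=True, check=True)
    return split_mime_output(done.stdout)
```

For a single-part message the list has one entry; for an archive such as a ZIP file it would have one entry per translated member.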
The following options are supported. For long options shown with {=| }, the value may be attached with = or given as a separate argument; the separate-argument form was added in version 8 for some of these options.
-h
Print synopsis of options.
-p
Select alternate text ordering for PDF conversion. By default,
the text output for PDFs is done linearly, so that hit markup with
pdfxml is done properly. However, this may output text in a less
desirable ordering for text searching, especially with tables and
multi-column pages. The -p option selects non-linear text output
mode.
-pp
Select "pretty-print" mode for PDF conversion.
-s
Keep short lines (3 characters or fewer) when converting in
-fOTHER mode. By default, short lines are suppressed, as they are
often garbage.
-Ppass
Use pass as the password to access protected files (e.g. certain
PDFs).
-l
(lower-case el) Extract hyperlinks from the document, where
supported. Each link is printed as a Link: header in the MIME
output.
-mNAME
Extract meta data field NAME from the document, where supported.
Common meta fields are Title, Subject and Keywords. Each meta field
is printed as a header in the MIME output.
-M
Extract all known meta data. Varies by input type:
HTML: title
Flash: version, framesize, framerate, framecount
PDF: Author, CreationDate, ModDate, Creator, Producer, Title,
  Subject, Keywords, X-Print, X-Change, X-Copy, X-Addnotes,
  X-Linear, X-Encrypted, X-Pages, X-PDF-Version, X-Tagged,
  X-Filter-Version
MSW,XLS,MSO: Title, Subject, Author, Keywords, Comments, Template,
  Last-Author, Revision, Edit-Time, Printed, Created, Saved, Pages,
  Words, Chars, Thumbnail, Creator, Security, Category, Target,
  Bytes, Lines, Paragraphs, Slides, Notes, Hidden-Slides, MM-Clips,
  Scale-Crop, Heading-Pairs, Titles, Manager, Company,
  Links-Up-To-Date, X-Filter-Version
TIFF: ImageWidth, ImageLength, DocumentName, ImageDescription, Make,
  Model, PageName, PageNumber, Software, DateTime, Artist,
  HostComputer, InkNames, TargetPrinter, Copyright
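Since -l, -m and -M all report their results as headers in the MIME output, a caller can read them back with an ordinary MIME parser. The sketch below is one hedged way to do so in Python. The function name is hypothetical and the sample header values in the usage note are invented for illustration. It collects a header from every part, because the source does not say whether headers appear at the top level or per part for multipart output.

```python
from email import message_from_string
from typing import List


def header_values(mime_text: str, name: str) -> List[str]:
    """Collect every value of header `name` from all parts of a MIME message."""
    msg = message_from_string(mime_text)
    values = []
    for part in msg.walk():
        # Header names are matched case-insensitively by the email module.
        values.extend(part.get_all(name, []))
    return values
```

For example, given output that contains two Link: headers, header_values(output, "Link") returns both values in order, and a field that was not extracted yields an empty list.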
-fCODE
Assume input file is one of the built-in formats indicated by
CODE, which is one of:
PDF
for Adobe Acrobat PDF; MIME type application/pdf
HTML
for HyperText Markup Language; MIME type text/html
XML
for XML; MIME type text/xml
MSW
for Microsoft Word; MIME type application/msword
XLS
for Microsoft Excel; MIME type application/vnd.ms-excel
PPT
for Microsoft PowerPoint; MIME type application/vnd.ms-powerpoint
MSO
for other Microsoft formats; MIME type application/x-ms-other
SWF
for Shockwave-Flash; MIME type application/x-shockwave-flash
GIF
for Graphics Interchange Format; MIME type image/gif. Added in
version 4.02.1046193282 Feb 25 2003.
TIFF
for Tag Image File Format; MIME type image/tiff. Added in version
5.00.1084000000 May 8 2004.
TNEF
for Microsoft Transport-Neutral Encoding Format; MIME type
application/tnef. Added in version 4.02.1047588542 Mar 13 2003.
GZIP
for gzip files; MIME type application/x-gzip
COMPRESS
for files compressed with compress; MIME type
application/x-compress
WPD
for WordPerfect files; MIME type application/wordperfect (added in
version 7.01; previously handled as OTHER)
AUTO
to auto-detect the format (the default)
OTHER
for an unknown format; MIME type application/octet-stream
Codes are case-insensitive. The default is to automatically
detect the input file type (i.e. -fAUTO). Note that more file
formats may be supported (via the formats rule file) than are
listed here. It is not usually necessary to specify the input
type; most are detected properly. See also the --content-type
option, which supersedes this. A MIME type may also be given to -f
instead of a code; parameters (e.g. charset) may be given but are
largely ignored (HTML mode uses charset, as of version
7.02.1416893000 20141125).
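The code list and the rules above (case-insensitive codes, a MIME type accepted in place of a code, parameters ignored) can be captured in a small lookup table. The Python sketch below only mirrors the documented table for use in a wrapper script; it is not how anytotx itself resolves -f, and the names BUILTIN_FORMATS, CODE_FOR_MIME and normalize_format are hypothetical.

```python
from typing import Optional

# Built-in format codes and their MIME types, as listed for -f.
BUILTIN_FORMATS = {
    "PDF": "application/pdf",
    "HTML": "text/html",
    "XML": "text/xml",
    "MSW": "application/msword",
    "XLS": "application/vnd.ms-excel",
    "PPT": "application/vnd.ms-powerpoint",
    "MSO": "application/x-ms-other",
    "SWF": "application/x-shockwave-flash",
    "GIF": "image/gif",
    "TIFF": "image/tiff",
    "TNEF": "application/tnef",
    "GZIP": "application/x-gzip",
    "COMPRESS": "application/x-compress",
    "WPD": "application/wordperfect",
    "OTHER": "application/octet-stream",
}

# Reverse table, for when a MIME type is given instead of a code.
CODE_FOR_MIME = {mime: code for code, mime in BUILTIN_FORMATS.items()}


def normalize_format(value: str) -> Optional[str]:
    """Map a -f argument (code or MIME type) to a built-in code, or None."""
    value = value.strip()
    if "/" in value:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        mime = value.split(";", 1)[0].strip().lower()
        return CODE_FOR_MIME.get(mime)
    code = value.upper()  # codes are case-insensitive
    if code == "AUTO" or code in BUILTIN_FORMATS:
        return code
    return None
```

A wrapper could use normalize_format() to validate user input before passing -f, keeping in mind that the formats rule file may define formats beyond this table.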
-g
Print additional information in headers, such as input
file type, translator arguments, etc.
-G
Same as -g, but quit without attempting actual translation.
-v
Enable verbose output.
-Dnnn
Enable debugging output at level nnn. Default is 0. The level
nnn became optional in version 5.01.1110400000 Mar 9 2005.
-uURL
Use URL as the URL of the input file (for informational purposes
only; nothing is fetched).
--install-dir{=| }DIR
Set the Texis install dir to use. Default is as installed, or
typically /usr/local/morph3 under Unix. Added in version
4.03.1051600000 Apr 29 2003.
--rule-file{=| }FILE
Use formats rule file FILE. The default is the file specified by
Rule File in the [Anytotx] section of the conf/texis.ini config
file, or if that is not set, conf/formats.rule in the Texis install
dir. If the formats rule file cannot be found or read, a default
internal version is used. The formats rule file tells anytotx how
to identify and translate file formats; see below for syntax. Added
in version 4.02.1045857437 Feb 21 2003.
--types-config{=| }FILE
Use MIME types config file FILE. The default is the file specified
by Types Config in the [Anytotx] section of the conf/texis.ini
config file, or if that is not set, conf/mime.types in the Texis
install dir. This file maps MIME types to file extensions, as a
fallback for identifying files (a formats rule file entry is still
usually needed). It is the same format as Apache mime.types files,
i.e. each line is a MIME type followed by zero or more
space-separated file extensions (no dot). Added in version
4.02.1045857437 Feb 21 2003.
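To make the file format concrete, here is a minimal Python sketch that reads text in the Apache mime.types layout described above into an extension-to-type map. The function name is hypothetical, and keeping the first type seen when an extension repeats is an assumption, not documented anytotx behavior.

```python
from typing import Dict


def parse_mime_types(text: str) -> Dict[str, str]:
    """Map each file extension (no dot, lower-cased) to its MIME type."""
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        # Apache mime.types files use '#' to start a comment.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        mime_type, *extensions = line.split()
        for ext in extensions:
            # Assumption: the first mapping for an extension wins.
            mapping.setdefault(ext.lower(), mime_type)
    return mapping
```

A line that lists a MIME type with no extensions is valid and simply contributes nothing to the map.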
--max-depth{=| }N
Maximum depth to recurse when processing a file. Multiple
translators may need to be run to translate a file to text
(e.g. RTF to HTML to text). Keeping this setting low can prevent
an infinite loop if the content ever "bounces" between types.
The default is 5, which may need to be raised if complex,
multi-level translators are used. Added in version
4.02.1045857437 Feb 21 2003.
--tmp{=| }DIR
Use directory DIR for temporary files during translation. The
default is the dir specified by the environment variables TMP,
TMPDIR, TEMP or TEMPDIR. If no environment variable is set, the dir
C:\ (Windows) or /tmp (Unix) is used, or the tmp subdirectory of
the Texis installation directory.
--timeout{=| }NNN
Timeout in seconds; default is 30. Use -1 for no timeout.
Added in version 4.03.1051675200 Apr 30 2003.
--content-type{=| }TYPE
Assume input is MIME type TYPE. The default is to automatically
detect the type. If specified, the MIME type should be one that has
a translator in the formats rule file, or a built-in type such as
application/octet-stream. Added in version 4.02.1045857437 Feb 21
2003.
Note that unlike -f, this option's value can be other than one of
the built-in MIME types: rather than dictating how to translate the
input, the value merely describes it, and will be looked up in the
formats rule file. Parameters (e.g. charset) are permitted in
version 7.02.1416893000 20141125 and later, and are ignored.
--error-log{=| }FILE
Log errors to FILE. The default is standard error. Added in
version 4.02.1045857437 Feb 21 2003.
--save-files
Save temporary output files in the temp dir. The default is to
delete them. This can be used to see the raw files unpacked from
an archive, before text translation (e.g. for TNEF or ZIP
archives).
--output-enc{=| }CHARSET
Use output encoding CHARSET where possible, e.g. UTF-8. Added in
version 5.01.1110258000 Mar 8 2005.
--expand-ligatures{=| }MIME
Expand single-character Unicode ligatures (e.g. the "ffi"
character) into multiple characters for input MIME type MIME. The
default is application/pdf; set to none to turn off. This increases
document searchability, but may affect browser PDF plugin
highlighting. Added in version 5.01.1110258000 Mar 8 2005.
--name{=| }STR
Set name of object being processed, for error messages.
Added in version 4.03.1055995200 Jun 19 2003.
--fix-mode{=| }{y|n}
Whether to fix the attribute mode of unpacked files to
non-hidden/readable. Default is y. Added in version 4.03.1059364800
Jul 28 2003.
--trace-pipe{=| }NNN
Debugging: set pipe trace level NNN. Added in version
4.04.1071637200 Dec 17 2003.
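As a closing illustration, the Python sketch below assembles an anytotx command line from a few of the long options documented above, using the = form of each assignment-style option and placing the input file last. The function and its parameter names are hypothetical conveniences for a wrapper script; only the option spellings come from this page.

```python
from typing import List, Optional


def build_command(
    input_path: Optional[str] = None,
    *,
    timeout: Optional[int] = None,
    max_depth: Optional[int] = None,
    content_type: Optional[str] = None,
    output_enc: Optional[str] = None,
    save_files: bool = False,
) -> List[str]:
    """Build an argv list for anytotx; options first, input file last."""
    argv = ["anytotx"]
    if timeout is not None:
        argv.append(f"--timeout={timeout}")  # seconds; -1 means no timeout
    if max_depth is not None:
        argv.append(f"--max-depth={max_depth}")
    if content_type is not None:
        argv.append(f"--content-type={content_type}")
    if output_enc is not None:
        argv.append(f"--output-enc={output_enc}")
    if save_files:
        argv.append("--save-files")
    if input_path is not None:
        argv.append(input_path)  # omitted: anytotx reads standard input
    return argv
```

Passing the resulting list to subprocess.run() avoids shell quoting problems with file names that contain spaces.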