anytotx - Translate file formats to text

SYNOPSIS

anytotx [options] [inputfile]


DESCRIPTION
The anytotx program attempts to identify its input file and translate it to ASCII text. This can be used when crawling non-text file formats (such as PDF and MS-Word), to obtain the plain text for searching. (The SQL function totext() calls this program internally.) There is built-in support for many common file formats, and new file formats can be added by modifying the formats.rule config file.

The input file is given last on the command line, after any options; if omitted, standard input is read. The output is the text version of the document, written to standard output. In version 4.02.1047588542 Mar 13 2003 and later, the output is always MIME, and may be multipart/mixed to support multi-file archives such as ZIP files.
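Because the output is MIME, it can be read with any standard MIME parser. The following is a minimal sketch in Python, assuming the output has already been captured as bytes. The function name extract_text_parts and the SAMPLE message are made up for illustration; SAMPLE is not actual anytotx output.

```python
from email import message_from_bytes

def extract_text_parts(mime_bytes):
    """Return (content_type, text) for every non-container part of the message."""
    msg = message_from_bytes(mime_bytes)
    parts = []
    for part in msg.walk():
        if part.is_multipart():
            continue  # skip the multipart/mixed wrapper itself
        payload = part.get_payload(decode=True) or b""
        charset = part.get_content_charset() or "utf-8"
        parts.append((part.get_content_type(),
                      payload.decode(charset, errors="replace")))
    return parts

# Hypothetical two-part message standing in for a converted archive.
SAMPLE = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/mixed; boundary="XX"\r\n'
    b"\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"first file\r\n"
    b"--XX\r\n"
    b"Content-Type: text/plain\r\n"
    b"\r\n"
    b"second file\r\n"
    b"--XX--\r\n"
)

if __name__ == "__main__":
    for content_type, text in extract_text_parts(SAMPLE):
        print(content_type, text.strip())
```

A single-part message is handled by the same loop, since walk() then yields only the message itself.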

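The following options are supported. For long options shown as --option{=| }VALUE, the value may be given either in assignment form (--option=VALUE) or as a separate argument (--option VALUE). The separate-argument form was added in version 8, for some of these options.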
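The two long-option spellings can be illustrated with a small helper that only builds an argument list and does not run anything. This is a sketch: build_argv and its parameters are hypothetical names, and report.pdf is a placeholder file name. The options used (--timeout, --output-enc) are ones documented below.

```python
def build_argv(input_file=None, long_options=None, separate=False,
               program="anytotx"):
    """Build an argument list; options come first, the input file last."""
    argv = [program]
    for name, value in (long_options or {}).items():
        if separate:
            # Separate-argument form (version 8 and later, some options).
            argv += ["--" + name, str(value)]
        else:
            # Assignment form.
            argv.append("--{}={}".format(name, value))
    if input_file is not None:
        argv.append(input_file)  # omitted means standard input
    return argv

if __name__ == "__main__":
    opts = {"timeout": 60, "output-enc": "UTF-8"}
    print(build_argv("report.pdf", opts))
    print(build_argv("report.pdf", opts, separate=True))
```

The assignment form is the safer choice when the installed version is unknown, since it predates the separate-argument form.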

-h   Print synopsis of options.

-p   Select alternate text ordering for PDF conversion. By default, PDF text is output linearly, so that hit markup with pdfxml works properly. However, this ordering may be less desirable for text searching, especially with tables and multi-column pages. The -p option selects non-linear text output mode.

-pp   Select "pretty-print" mode for PDF conversion.

-s   Keep short lines (3 characters or fewer) when converting in -fOTHER mode. By default, short lines are suppressed as they are often garbage.

-Ppass   Use pass as the password to access protected files (e.g. certain PDFs).

-l (lower-case el)   Extract hyperlinks from document, where supported. Each link is printed as a Link: header in the MIME output.

-mNAME   Extract meta data field NAME from document, where supported. Common meta fields are Title, Subject and Keywords. Each meta field is printed as a header in the MIME output.

-M   Extract all known meta data. Varies by input type:

  • HTML: title

  • Flash: version, framesize, framerate, framecount

  • PDF: Author, CreationDate, ModDate, Creator, Producer, Title, Subject, Keywords, X-Print, X-Change, X-Copy, X-Addnotes, X-Linear, X-Encrypted, X-Pages, X-PDF-Version, X-Tagged, X-Filter-Version

  • MSW,XLS,MSO: Title, Subject, Author, Keywords, Comments, Template, Last-Author, Revision, Edit-Time, Printed, Created, Saved, Pages, Words, Chars, Thumbnail, Creator, Security, Category, Target, Bytes, Lines, Paragraphs, Slides, Notes, Hidden-Slides, MM-Clips, Scale-Crop, Heading-Pairs, Titles, Manager, Company, Links-Up-To-Date, X-Filter-Version

  • TIFF: ImageWidth, ImageLength, DocumentName, ImageDescription, Make, Model, PageName, PageNumber, Software, DateTime, Artist, HostComputer, InkNames, TargetPrinter, Copyright
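Since links (-l) and meta fields (-m, -M) are printed as headers in the MIME output, they can be read back with a MIME parser. A minimal sketch in Python follows; read_headers is a hypothetical helper, and SAMPLE is an invented message for illustration, not real anytotx output.

```python
from email import message_from_string

# Invented example of a converted document with one meta field and two links.
SAMPLE = (
    "Content-Type: text/plain\n"
    "Title: Quarterly Report\n"
    "Link: http://example.com/a\n"
    "Link: http://example.com/b\n"
    "\n"
    "Body text.\n"
)

def read_headers(mime_text, meta_names):
    """Return (meta, links): requested meta headers that are present, and all Link headers."""
    msg = message_from_string(mime_text)
    meta = {name: msg[name] for name in meta_names if msg[name] is not None}
    links = msg.get_all("Link", [])  # a header may repeat, so collect every value
    return meta, links

if __name__ == "__main__":
    meta, links = read_headers(SAMPLE, ["Title", "Subject", "Keywords"])
    print(meta)
    print(links)
```

Header lookup in the email module is case-insensitive, so "title" and "Title" find the same field.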

-fCODE   Assume input file is one of the built-in formats indicated by CODE, which is one of:

  • PDF for Adobe Acrobat PDF; MIME type application/pdf

  • HTML for HyperText Markup Language; MIME type text/html

  • XML for XML; MIME type text/xml

  • MSW for Microsoft Word; MIME type application/msword

  • XLS for Microsoft Excel; MIME type application/vnd.ms-excel

  • PPT for Microsoft PowerPoint; MIME type application/vnd.ms-powerpoint

  • MSO for other Microsoft formats; MIME type application/x-ms-other

  • SWF for Shockwave-Flash; MIME type application/x-shockwave-flash

  • GIF for Graphics Interchange Format; MIME type image/gif. Added in version 4.02.1046193282 Feb 25 2003.

  • TIFF for Tag Image File Format; MIME type image/tiff. Added in version 5.00.1084000000 May 8 2004.

  • TNEF for Microsoft Transport-Neutral Encoding Format; MIME type application/tnef. Added in version 4.02.1047588542 Mar 13 2003.

  • GZIP for gzip files; MIME type application/x-gzip

  • COMPRESS for compressed files; MIME type application/x-compress

  • WPD for WordPerfect files; MIME type application/wordperfect (added in version 7.01; previously handled as OTHER)

  • AUTO to auto-detect the format (the default)

  • OTHER for an unknown format; MIME type application/octet-stream

Codes are case-insensitive. The default is to automatically detect the input file type (i.e. -fAUTO). Note that more file formats may be supported (via the formats rule file) than are listed here. It is not usually necessary to specify the input type; most are detected properly. See also the --content-type option, which supersedes this. The MIME type may also be given to -f instead of the code; parameters (e.g. charset) may be given but are largely ignored (HTML mode uses charset, as of version 7.02.1416893000 20141125).
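The code-or-MIME-type lookup described above can be sketched as a small table. This is an illustration only: the table repeats the codes listed above, while resolve_format and its matching rules (case-insensitive codes, parameters dropped) are assumptions about reasonable behavior, not the program's actual implementation.

```python
# Built-in codes and MIME types as listed in this section.
BUILTIN_FORMATS = {
    "PDF": "application/pdf",
    "HTML": "text/html",
    "XML": "text/xml",
    "MSW": "application/msword",
    "XLS": "application/vnd.ms-excel",
    "PPT": "application/vnd.ms-powerpoint",
    "MSO": "application/x-ms-other",
    "SWF": "application/x-shockwave-flash",
    "GIF": "image/gif",
    "TIFF": "image/tiff",
    "TNEF": "application/tnef",
    "GZIP": "application/x-gzip",
    "COMPRESS": "application/x-compress",
    "WPD": "application/wordperfect",
    "OTHER": "application/octet-stream",
}

def resolve_format(value):
    """Map a -f style value (code or MIME type) to (code, mime_type), or None if unknown."""
    bare = value.split(";", 1)[0].strip()  # drop parameters such as charset
    code = bare.upper()
    if code == "AUTO":
        return ("AUTO", None)  # auto-detect: no fixed MIME type
    if code in BUILTIN_FORMATS:
        return (code, BUILTIN_FORMATS[code])
    for known_code, mime_type in BUILTIN_FORMATS.items():
        if mime_type == bare.lower():
            return (known_code, mime_type)
    return None

if __name__ == "__main__":
    print(resolve_format("pdf"))
    print(resolve_format("text/html; charset=utf-8"))
```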

-g   Print additional information in headers, such as input file type, translator arguments, etc.

-G   Same as -g, but exit without attempting the actual translation.

-v   Enable verbose output.

-Dnnn   Enable debugging output, level nnn. Default is 0. Optional nnn added in version 5.01.1110400000 Mar 9 2005.

-uURL   Use URL as the URL of the input file (for informational purposes, does not fetch anything).

--install-dir{=| }DIR   Set the Texis install dir to use. Default is as installed, or typically /usr/local/morph3 under Unix. Added in version 4.03.1051600000 Apr 29 2003.

--rule-file{=| }FILE   Use formats rule file FILE. The default is the file specified by Rule File in the [Anytotx] section of the conf/texis.ini config file, or if that is not set, conf/formats.rule in the Texis install dir. If the formats rule file cannot be found or read, a default internal version is used. The formats rule file tells anytotx how to identify and translate file formats; see below for syntax. Added in version 4.02.1045857437 Feb 21 2003.

--types-config{=| }FILE   Use MIME types config file FILE. The default is the file specified by Types Config in the [Anytotx] section of the conf/texis.ini config file, or if that is not set, conf/mime.types in the Texis install dir. This file maps MIME types to file extensions, as a fallback for identifying files (a formats rule file entry is still usually needed). It is the same format as Apache mime.types files, i.e. each line is a MIME type followed by zero or more space-separated file extensions (no dot). Added in version 4.02.1045857437 Feb 21 2003.
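The mime.types line format is simple enough to read by hand. A minimal Python sketch follows, assuming lines that start with # are comments (as in Apache's file); parse_mime_types and the SAMPLE text are illustrative names and data, not part of anytotx.

```python
def parse_mime_types(text):
    """Return a dict mapping lower-cased file extension to MIME type."""
    ext_to_type = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        mime_type, *extensions = line.split()
        for ext in extensions:
            # Keep the first mapping seen for an extension.
            ext_to_type.setdefault(ext.lower(), mime_type)
    return ext_to_type

# Invented sample in the documented format.
SAMPLE = """\
# type                 extensions
application/pdf        pdf
text/html              html htm
application/x-ms-other
"""

if __name__ == "__main__":
    print(parse_mime_types(SAMPLE))
```

A type with no extensions, such as the last sample line, contributes nothing to the extension map.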

--max-depth{=| }N   Maximum depth to recurse when processing a file. Multiple translators may need to be run to translate a file to text (e.g. RTF to HTML to text). Keeping this setting low can prevent an infinite loop if the content ever "bounces" between types. The default is 5, which may need to be raised if complex, multi-level translators are used. Added in version 4.02.1045857437 Feb 21 2003.
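The purpose of a depth limit can be shown with a toy translator chain. This is a sketch under stated assumptions: translate_chain and the next_type table are hypothetical, and counting one level per translator run is a guess at how depth is measured, not a description of anytotx internals.

```python
def translate_chain(start_type, next_type, max_depth=5):
    """Follow type-to-type translations until text/plain, or fail past max_depth steps."""
    chain = [start_type]
    current = start_type
    while current != "text/plain":
        if len(chain) - 1 >= max_depth:
            raise RuntimeError("max depth %d exceeded: %s"
                               % (max_depth, " -> ".join(chain)))
        if current not in next_type:
            raise KeyError("no translator for " + current)
        current = next_type[current]
        chain.append(current)
    return chain

if __name__ == "__main__":
    # RTF to HTML to text, as in the example above.
    rules = {"application/rtf": "text/html", "text/html": "text/plain"}
    print(translate_chain("application/rtf", rules))
    # Content that bounces between two types is stopped by the limit.
    bouncing = {"a/x": "a/y", "a/y": "a/x"}
    try:
        translate_chain("a/x", bouncing)
    except RuntimeError as err:
        print(err)
```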

--tmp{=| }DIR   Use directory DIR for temporary files during translation. The default is the dir specified by the environment variables TMP, TMPDIR, TEMP or TEMPDIR. If no environment variable is set, the dir C:\ (Windows) or /tmp (Unix) is used, or the tmp subdirectory of the Texis installation directory.
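The default lookup can be sketched as follows. Two assumptions are made here: the environment variables are checked in the order listed above, and the fallback to the tmp subdirectory of the Texis installation directory is left out because its precedence is not stated. pick_tmp_dir is a hypothetical helper name.

```python
import os

def pick_tmp_dir(environ, is_windows=False):
    """Return the first non-empty temp dir variable, else the platform default."""
    for name in ("TMP", "TMPDIR", "TEMP", "TEMPDIR"):
        value = environ.get(name)
        if value:
            return value
    return "C:\\" if is_windows else "/tmp"

if __name__ == "__main__":
    print(pick_tmp_dir(os.environ, is_windows=(os.name == "nt")))
```

Passing the environment as a parameter keeps the function easy to check against a plain dict.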

--timeout{=| }NNN   Timeout in seconds; default is 30. Use -1 for no timeout. Added in version 4.03.1051675200 Apr 30 2003.

--content-type{=| }TYPE   Assume input is MIME type TYPE. The default is to automatically detect the type. If specified, the MIME type should be one that has a translator in the formats rule file, or a built-in type such as application/octet-stream. Added in version 4.02.1045857437 Feb 21 2003.

Note that unlike -f this option's value can be other than one of the built-in MIME types: rather than dictating how to translate the input, the value merely describes it, and will be looked up in the formats rule file. Parameters (e.g. charset) are permitted in version 7.02.1416893000 20141125 and later, and are ignored.

--error-log{=| }FILE   Log errors to FILE. The default is standard error. Added in version 4.02.1045857437 Feb 21 2003.

--save-files   Save temporary output files in the temp dir. The default is to delete them. This can be used to see the raw files unpacked from an archive, before text translation (e.g. for TNEF or ZIP archives).

--output-enc{=| }CHARSET   Use output encoding CHARSET where possible, e.g. UTF-8. Added in version 5.01.1110258000 Mar 8 2005.

--expand-ligatures{=| }MIME   Expand single-character Unicode ligatures (e.g. "ffi" character) into multiple characters for input MIME type MIME. Default application/pdf; set none to turn off. Increases document searchability, but may affect browser PDF plugin highlighting. Added in version 5.01.1110258000 Mar 8 2005.
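What ligature expansion means for searching can be shown in a few lines of Python. This is only an illustration: the set of ligatures that anytotx expands is not listed here, so the table below (the Latin ligatures U+FB00 through U+FB04 and U+FB06) is an assumption, and expand_ligatures is a hypothetical function name.

```python
# Single-character Latin ligatures and their multi-character expansions.
LIGATURES = {
    0xFB00: "ff",
    0xFB01: "fi",
    0xFB02: "fl",
    0xFB03: "ffi",
    0xFB04: "ffl",
    0xFB06: "st",
}

def expand_ligatures(text):
    """Replace each ligature character with its plain-letter expansion."""
    return text.translate(LIGATURES)

if __name__ == "__main__":
    word = "o\ufb03ce"  # "office" spelled with the single "ffi" character
    print("office" in word)                    # a plain search misses the word
    print("office" in expand_ligatures(word))  # and finds it after expansion
```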

--name{=| }STR   Set name of object being processed, for error messages. Added in version 4.03.1055995200 Jun 19 2003.

--fix-mode{=| }{y|n}   Whether to reset the attributes (mode) of unpacked files so that they are non-hidden and readable. Default is y. Added in version 4.03.1059364800 Jul 28 2003.

--trace-pipe{=| }NNN   Debugging: set pipe trace level NNN. Added in version 4.04.1071637200 Dec 17 2003.



Copyright © 2024 Thunderstone Software LLC. All rights reserved.     Last updated: Apr 15 2024