Formats Rule File

The formats rule file (typically conf/formats.rule in the install dir; see --rule-file option) tells anytotx how to identify and translate various file formats. Its syntax is loosely based on the Unix magic utility's config format, with extensions. The file was added in version 4.02.1045857437 Feb 21 2003.

Each line specifies a content test, a MIME type, and a translator. Each test is run in order; the first successful test indicates the input is identified as the corresponding MIME type, and the translator is run to translate the input to text. (If the MIME type is specified on the command line via -f or --content-type, the appropriate translator is searched for by MIME type instead of content tests.)

Subtests

If a line begins with a greater-than sign (">"), it is a sub-test, specifying an additional content test for that MIME type. The test level is indicated by the number of such leading greater-than signs; most tests have none and are thus top-level or level 1. A level N test's children are all tests at level N+1 that follow it, up to (but not including) the next level N test. If a test at level N succeeds, its children (if any) are run recursively in config file order. This process continues until a successful test with a non-empty command line and no successful children is found. In this way, complex input types can be identified that require more than one part of the data to be examined.
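The parent/child relationship and the recursive evaluation described above can be sketched in Python. This is only an illustration of one reading of the text, not the anytotx implementation: the Rule class, build_tree() and identify() are hypothetical names, and each content test is modeled as a plain Python function.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rule:
    level: int                      # 1 = no leading ">", 2 = one ">", etc.
    test: Callable[[bytes], bool]   # hypothetical stand-in for a content test
    mimetype: str
    command: str                    # "" means no translator on this line
    children: List["Rule"] = field(default_factory=list)


def build_tree(flat: List[Rule]) -> List[Rule]:
    """Attach each rule to the nearest preceding rule of a lower level."""
    roots: List[Rule] = []
    stack: List[Rule] = []
    for rule in flat:
        # Drop anything at the same or a deeper level: it cannot be a parent.
        while stack and stack[-1].level >= rule.level:
            stack.pop()
        (stack[-1].children if stack else roots).append(rule)
        stack.append(rule)
    return roots


def identify(rules: List[Rule], data: bytes) -> Optional[Rule]:
    """Return the first successful rule with a command and no successful child."""
    for rule in rules:
        if not rule.test(data):
            continue
        deeper = identify(rule.children, data)
        if deeper is not None:
            return deeper
        if rule.command:
            return rule
    return None
```

In this sketch a rule with an empty command line never identifies the input by itself; it only gates its children, which matches the description of sub-tests above.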

Chained rules

Once a translator is identified and run, its output is examined. If it is identified as a known non-text type, another translator may be run to convert it again. For example, an RTF input file may be identified and translated to HTML via one translator. That output is identified as HTML, and is translated again (via a built-in translator) to text. Because of this multiple-pass feature, translators can be used that do not output text, but output a type that another translator can handle.

Directory/archive rules

Some translators may not produce any output at all, but produce a series of files, in a new directory. These translators have %DIROUTPUT% (or %TMP%, see below) in their command line in the formats rule file. After such a translator is run, any files it created are recursively processed by anytotx, and the resulting output will typically be multi-part MIME. In this way, archive formats such as ZIP and tar can be processed.

Other translators may not take a file as input, but instead take a directory tree, usually one unpacked by a previous %DIROUTPUT% rule for a ZIP or tar archive. These rules have the %DIRINPUT% option set (below), and have no content test; the offset, datatype and value fields are each a single dash ("-"). These rules are for archive dirs that are actually a single monolithic document (e.g. Open Document Format file), not a group of distinct files.

Rule format

Each line of the formats rule file is of the following form (blank lines and pound-sign/semicolon comment lines are ignored):

[>...]offset datatype value mimetype commandline

The offset, datatype and value fields specify the content test. (For rules with no content test, e.g. %DIRINPUT% rules, each of these fields is a single dash.) The MIME type is given by mimetype, and the corresponding translator's space-separated command line follows. Each field has a particular syntax, as explained below:

offset
  Specifies the integer file offset to look at in the input. May be decimal, hexadecimal or octal. The test data is read at this offset. (If the rule has no content test, e.g. a %DIRINPUT% rule, the offset, data type and value may each be a single dash.)

The offset may be a range instead of a single value. This is indicated by an optional second offset after the first, separated by a dash. An optional increment may also appear after the second offset, separated by a comma; the default increment is one. If a range is given, the rule is tested at each offset from the first through and including the second, incrementing by the increment each time. For example, an offset of 0x100-0x200,0x10 would indicate that the rule should be tested at offset 0x100, then 0x110, then 0x120, etc. through and including 0x200.
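As an illustration, the offsets covered by a range can be enumerated with a few lines of Python. This is a sketch only: parse_int() and expand_offsets() are hypothetical helpers, and C-style prefixes (0x for hexadecimal, a leading 0 for octal) are assumed. The single-dash "no test" form is not handled here.

```python
def parse_int(text):
    """Parse a decimal, hexadecimal (0x...) or octal (leading 0) integer.

    The C-style prefixes are an assumption of this sketch.
    """
    lowered = text.lower()
    if lowered.startswith("0x"):
        return int(lowered, 16)
    if len(lowered) > 1 and lowered.startswith("0"):
        return int(lowered, 8)
    return int(lowered, 10)


def expand_offsets(spec):
    """Return every offset named by 'first[-second[,increment]]'."""
    first, dash, rest = spec.partition("-")
    start = parse_int(first)
    if not dash:
        return [start]              # a single offset, not a range
    second, comma, increment = rest.partition(",")
    step = parse_int(increment) if comma else 1   # default increment is one
    if step <= 0:
        raise ValueError("increment must be positive")
    # The second offset is inclusive, hence the +1.
    return list(range(start, parse_int(second) + 1, step))


offsets = expand_offsets("0x100-0x200,0x10")
print([hex(o) for o in offsets[:3]], "...", hex(offsets[-1]))
```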

If the offset is in parentheses, it is indirect. An indirect offset is of the form: (X[b|s|l|B|S|L[8|16|32|64]][+|-Y]). A value is read at offset X, which is in turn used as the offset for the test data. The value type and size read is determined by an optional suffix after the indirect offset:

b    A byte
s    A little-endian short
l    (lower-case el) A little-endian long
B    A byte
S    A big-endian short
L    A big-endian long

After the indirect suffix, an optional bit size may appear. This overrides the size indicated by the suffix, whose size may vary by platform. The bit size must be 8, 16, 32 or (on platforms that support it) 64.

After the indirect suffix and/or bit size, an optional sub-offset Y may appear. This positive or negative integer is added to the offset value read to compute the offset for the test.
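The indirect lookup can be sketched in Python with the struct module. Several things here are assumptions made for illustration and are not stated by this document: a short is taken as 16 bits and a long as 32 bits, the value read is treated as unsigned, a missing suffix is treated like "l", and resolve_indirect() is a hypothetical name.

```python
import re
import struct

# (X[b|s|l|B|S|L[8|16|32|64]][+|-Y])
# Note: a hexadecimal X directly followed by "b" or "B" is ambiguous in this
# sketch; the letter is read as part of the hex number.
_INDIRECT = re.compile(
    r"^\((0[xX][0-9a-fA-F]+|\d+)([bslBSL])?(8|16|32|64)?"
    r"([+-](?:0[xX][0-9a-fA-F]+|\d+))?\)$"
)
_ASSUMED_BITS = {"b": 8, "s": 16, "l": 32}    # assumed sizes, platform may differ
_FORMATS = {8: "B", 16: "H", 32: "I", 64: "Q"}


def _num(text):
    """Parse a signed decimal, hex (0x) or octal (leading 0) integer."""
    sign = -1 if text.startswith("-") else 1
    digits = text.lstrip("+-").lower()
    if digits.startswith("0x"):
        return sign * int(digits, 16)
    if len(digits) > 1 and digits.startswith("0"):
        return sign * int(digits, 8)
    return sign * int(digits, 10)


def resolve_indirect(spec, data):
    """Return the test offset named by an indirect offset such as '(4s+2)'."""
    match = _INDIRECT.match(spec)
    if match is None:
        raise ValueError("not an indirect offset: %r" % spec)
    x_text, suffix, bits_text, y_text = match.groups()
    suffix = suffix or "l"                      # assumed default
    bits = int(bits_text) if bits_text else _ASSUMED_BITS[suffix.lower()]
    endian = "<" if suffix.islower() else ">"   # lower case = little-endian
    (value,) = struct.unpack_from(endian + _FORMATS[bits], data, _num(x_text))
    return value + (_num(y_text) if y_text else 0)
```

For example, with a little-endian 16-bit value of 16 stored at offset 4, "(4s+2)" resolves to offset 18 in this sketch.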

datatype
  The type and size of data to read at the offset. One of the following values:
byte       A single byte
short      A short value
long       A long value
string     A string value (size determined by value)
date       A Unix time_t date
language   An identifiable language (see below)

An integer (non-string) type may optionally be prefixed by u to indicate an unsigned compare is to be made, be to indicate a big-endian value, and/or le to indicate little-endian. An optional bit-size suffix of 8, 16, 32 or (if supported) 64 may also be appended to override the size indicated by the type (which is platform-dependent).

A string type may be prefixed by i to indicate a case-insensitive compare.

After the data type and optional suffix, an optional value mask may appear for integer types. This is indicated by an ampersand ("&") and integer value (decimal, octal or hex). This value mask will be bit-wise ANDed to the input value before comparing for the test.
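How signedness, byte order, bit size and the value mask combine can be illustrated with Python's struct module. This is a sketch under assumptions: read_integer() is a hypothetical helper that takes the modifiers as explicit arguments rather than parsing the datatype field, and little-endian is used when big_endian is not requested.

```python
import struct

# struct codes for signed integers of each bit size; upper case is unsigned.
_CODES = {8: "b", 16: "h", 32: "i", 64: "q"}


def read_integer(data, offset, bits, big_endian=False, unsigned=False, mask=None):
    """Read one integer at offset and apply the optional value mask."""
    code = _CODES[bits]
    if unsigned:
        code = code.upper()
    byte_order = ">" if big_endian else "<"
    (value,) = struct.unpack_from(byte_order + code, data, offset)
    if mask is not None:
        value &= mask               # bit-wise AND before the comparison
    return value


header = b"\xd0\xcf\x11\xe0"
print(hex(read_integer(header, 0, 16, big_endian=True, unsigned=True)))
print(hex(read_integer(header, 0, 16, big_endian=True, unsigned=True, mask=0xFF00)))
```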

In version 7.01 and later, if the datatype is "language", instead of looking for a specific value, up to 1MB is examined to see if it is in a recognized language (e.g. English, Spanish etc.). If the language probability so determined is over the value threshold, the test passes.

value
  The specified value to compare with the input value for the test. It is an optional operator character followed by a value. The possible operator characters are:

=    Input value must equal specified value (default)
<    Input value must be less than specified value
>    Input value must be greater than specified value
&    Input value must have all bits set that are set in the specified value, i.e. input value bit-wise ANDed with specified value must equal specified value
^    Input value must have at least one bit clear that is set in the specified value, i.e. input value bit-wise ANDed with specified value must not equal specified value
x    No-op: any value will match

The value must be an integer (decimal, octal or hex) for integer types, or a string for string or date types. String values will have C-style escapes translated. A date value must be a Texis-parseable date value. For string and language types, only the operators "=", "<" and ">" are valid. For the language type, the value is a probability, in the range "0.0" through "1.0" or "0%" through "100%".
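The operator semantics above can be written out as a small Python function. This is an illustrative sketch, not anytotx code: compare() and parse_probability() are hypothetical names, and parsing the operator character out of the value field is left out.

```python
def compare(op, input_value, spec_value):
    """Apply one value operator to an input value and a specified value."""
    if op == "=":
        return input_value == spec_value
    if op == "<":
        return input_value < spec_value
    if op == ">":
        return input_value > spec_value
    if op == "&":
        # Every bit set in the specified value is also set in the input.
        return (input_value & spec_value) == spec_value
    if op == "^":
        # At least one bit set in the specified value is clear in the input.
        return (input_value & spec_value) != spec_value
    if op == "x":
        return True                 # no-op: anything matches
    raise ValueError("unknown operator: %r" % op)


def parse_probability(text):
    """Convert a language threshold such as '0.75' or '75%' to a float."""
    if text.endswith("%"):
        return float(text[:-1]) / 100.0
    return float(text)
```

Only "=", "<" and ">" apply to string and language types, so a caller would reject the other operators for those types before calling compare().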

mimetype
  The MIME type associated with this test. Multiple tests can have the same MIME type, e.g. if there are multiple ways to identify it. In version 6 and later, the MIME type may also contain asterisk ("*") wildcards, to match a group of MIME types.

commandline
  The translator, i.e. the space-separated command line with arguments to run to translate input of this MIME type, preferably to text. May be empty (i.e. an empty quoted string, "" or '') to indicate there is no translator; this means that the MIME type is not fully identified by this test and sub-tests must be run.

The command line may contain certain special variables, enclosed in percent-signs ("%"). These variables will be replaced with certain values in the command line, or indicate certain options. Options will be removed from the command line, and should occur first, i.e. before the program name.

%IGNORE%   Option: There is no translator; this MIME type contains no text and is to be ignored. Useful for identifying non-text types like images; otherwise, the fallback -fOTHER mode may be used, which would print garbage. Should be used alone.

%DIROUTPUT%   Replaced with a unique, empty temporary directory, which is created and chdir()'d to before running the translator. This also indicates that the translator is expected to create multiple output files in this temporary dir, e.g. unpack a multi-file ZIP archive. The resulting anytotx output will be multi-part MIME, and each unpacked file will be recursively processed further by anytotx.

If the %DIROUTPUT% variable occurs as one of the first items in the command line (i.e. before the program), then it is an option and is removed from the command line, but all other behavior is the same. This is useful for un-archiving translators that do not take a target dir argument, but nonetheless unpack an archive to the current directory.

In version 5 and earlier, this variable was %TMP%, which is still supported but is deprecated.

%DIRINPUT%   Option: this rule takes a directory (e.g. containing multiple associated files) as input, not a file. Useful for translators that work on an unpacked archive to produce a single output. For example, the Open Document Format translator odf takes the unzipped document directory tree as input (instead of the original .odt file), and outputs the document text. Since there is no file input with %DIRINPUT% rules, there cannot be a content test, so the offset datatype value fields must each be a single dash to indicate no test. Added in version 6. See also the archivemimefile setting, which is typically how these rules are recognized (instead of by content test).

%8.3%   Option: Use MSDOS-style 8.3 filenames where possible. Useful for older MSDOS executable translators that can't handle long filenames. No effect on non-Windows platforms.

%MIME%   Option: The translator produces MIME output, i.e. headers, which will be parsed. Certain headers are significant and will be stripped or replaced in the output; these include: Content-Type, X-Input-Content-Type, Content-Transfer-Encoding and X-Translator-Status. Some of these headers are used by translators to further identify the input, and tell anytotx how to proceed.

%IGNORESTDOUT%   Option: Ignore the standard output of the translator, instead of parsing and/or reporting it. Typically used with some un-archiving translators that produce unwanted standard-out messages in addition to unpacking files. Note: If output goes to a file instead of standard-out, but should be reported, use %OUT% instead. Added in version 5.01.1202350000 20080206.

%IGNORESTDERR%   Option: Ignore the standard-error output of the translator, instead of reporting it as an error.

%ANYTOTX%   Replaced with the path to the running anytotx executable. Used in conjunction with %ANYTOTXFLAGS% to use anytotx to translate a known built-in MIME type. Should only be used for anytotx built-in MIME types.

%ANYTOTXFLAGS%   Replaced with the command-line flags passed to the running anytotx executable, with some modifications. Any --max-depth, --content-type and/or -f arguments are stripped. An appropriate (decremented) --max-depth argument and a -fcontent-type argument are added; also (in version 7.07.1611702000 20210126 and later) a --install-dir=... argument is added if not already present. Thus, the called anytotx will already know the MIME type and will not try to identify it, and will also know the current flags like -g. Should only be used for anytotx built-in MIME types (otherwise a loop occurs and the data is not translated).

%IN% or %IN.ext%   Replaced with the anytotx input file name. This must be given for translators that expect an explicit input filename on their command line. The standard input for the translator will also be redirected from /dev/null (the default is to redirect from the anytotx input file). If the anytotx input is not a file but is standard input, a temporary file will be created and the input copied to it.

The second version (%IN.ext%) is useful where a translator expects its input file to have a certain extension. The input file name that replaces %IN.ext% on the command line will have the extension .ext. If the actual input file name does not, or comes from standard in, an appropriate temporary file will be created and the input copied to it.

%OUT% or %OUT.ext%   Replaced with an output file name. This must be given for translators that expect an explicit output filename on their command line. If given, the standard output for the translator is ignored and this file will be read afterwards; if not given, the standard output from the translator will be used afterwards. A unique empty temporary file is created.

The second version (%OUT.ext%) is useful where a translator expects its output file to have a certain extension. The output file name that replaces %OUT.ext% on the command line will have the extension .ext.

%INSTALLDIR%   Replaced with the Texis install dir.

%BINDIR%   Replaced with the Texis binary dir (same as install dir for Windows, install dir plus "/bin" for Unix).

%LIBDIR%   Replaced with the Texis library dir (typically the lib subdir of the Texis install dir). Added in version 8.

%LOGDIR%   Replaced with the log directory, i.e. [Texis] Log Dir value. For log files. Added in version 8.

%RUNDIR%   Replaced with the run directory, i.e. [Texis] Run Dir value. For run-time-only files, e.g. PID files etc. Added in version 8.

%EXEDIR%   Replaced with the directory of the currently running executable. Added in version 5.01.1214185000 20080622. In version 7 and later, if the executable dir is not determinable, the Texis binary dir will be used.

%%   Replaced with a single percent-sign ("%").

Command line arguments may be quoted (single or double). Under Unix, the enclosed values become a single argument and the quotes are stripped. Under Windows the quotes are untouched and it is up to the translator to parse its command line accordingly. When a special variable is replaced with its value in the command line, the value (and its adjacent non-whitespace characters, if any) will automatically be quoted if it contains spaces and is not already explicitly quoted. For %ANYTOTXFLAGS%, the quoting is applied on a per-argument basis.
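The replacement and automatic quoting described above can be approximated in Python. This is a simplified sketch and not the actual anytotx logic: expand() is a hypothetical function, it handles only replacement variables and %% (options such as %MIME% are not removed), it always quotes with double quotes, and the paths in the example are made up.

```python
import re

_VARIABLE = re.compile(r"%([A-Za-z0-9.]*)%")


def expand(command_line, values):
    """Replace %NAME% variables, quoting any argument that gains a space."""
    result = []
    for token in command_line.split():
        gained_space = False

        def replace(match):
            nonlocal gained_space
            name = match.group(1)
            if name == "":
                return "%"                       # %% is a literal percent sign
            value = values.get(name, match.group(0))   # unknown: leave as-is
            if " " in value:
                gained_space = True
            return value

        expanded = _VARIABLE.sub(replace, token)
        already_quoted = token[0] in "\"'"
        if gained_space and not already_quoted:
            # Quote the value together with its adjacent characters.
            expanded = '"' + expanded + '"'
        result.append(expanded)
    return " ".join(result)


print(expand("%BINDIR%/unzip -d %DIROUTPUT% %IN%",
             {"BINDIR": "/opt/texis/bin",       # hypothetical paths
              "DIROUTPUT": "/tmp/out dir",
              "IN": "/data/a.zip"}))
```

Because the quoting is applied to the whole whitespace-delimited token, a value with spaces followed by "/unzip" ends up as one quoted argument.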

Settings

In version 6 and later, the formats.rule file may also contain the following setting:

archivemimefile file  

Certain file types are actually archives (ZIP files) that describe a single document, not multiple. Some of these archives contain a file that describes the MIME type of the document. These MIME type files can be recognized with the archivemimefile setting. The named file, if seen after an archive (%DIROUTPUT% rule) is unpacked (or a directory is given as input to anytotx instead of a file), is checked for %DIRINPUT% rules' MIME types. If a matching MIME type is found, the %DIRINPUT% rule is then run, instead of the normal recursive processing of the individual files in the directory.

For example, Open Document Format files are really ZIP archives, and contain a file called mimetype that contains the MIME type of the document (e.g. "application/vnd.oasis.opendocument.text"). Thus, after the ZIP rule unpacks an Open Document Format file (like any other ZIP), an archivemimefile mimetype setting would tell anytotx to look for the mimetype file: if it is found, and contains a MIME type that matches a %DIRINPUT% rule, that translator is run. Otherwise, the ZIP file's contents would be processed individually.
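The archivemimefile lookup can be sketched as follows. The function name pick_dirinput_rule() and the dictionary of %DIRINPUT% rules are hypothetical, and only exact MIME type matches are handled (no "*" wildcards); the real matching in anytotx may differ.

```python
import os


def pick_dirinput_rule(unpacked_dir, archive_mime_file, dirinput_rules):
    """Return the %DIRINPUT% command for an unpacked dir, or None.

    dirinput_rules maps a MIME type to that rule's command line. None means
    the files in the directory would be processed individually instead.
    """
    path = os.path.join(unpacked_dir, archive_mime_file)
    if not os.path.isfile(path):
        return None
    with open(path, "r", encoding="ascii", errors="replace") as handle:
        mime_type = handle.read().strip()
    return dirinput_rules.get(mime_type)
```

With an archivemimefile setting naming the file mimetype, a call such as pick_dirinput_rule(unpacked_dir, "mimetype", rules) would select the Open Document Format rule only when that file exists and names a MIME type that one of the rules declares.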


EXAMPLE
An example formats.rule file might be:

0 string =PK\003\004 application/zip %BINDIR%/unzip -d %DIROUTPUT% %IN%

99 byte      x0      application/octet-stream ''
>0 beshort16 =0xd0cf application/msword       %ANYTOTX% %ANYTOTXFLAGS%
>0 beshort16 =0xdba5 application/msword       %ANYTOTX% %ANYTOTXFLAGS%

The first line's test is to check for the string PK followed by ASCII char 3 and ASCII char 4, at offset 0 in the input. If the string matches, the MIME type is application/zip, and the program unzip in the Texis binary dir is run, with the input file as the last argument. Multiple output files are expected to be written to the unique %DIROUTPUT% dir by unzip and will be recursively processed by anytotx.

The next 3 lines are all related, because the last 2 have a greater-than sign indicating they are sub-tests of the one above. The first test matches any byte at offset 99 in the input file. In effect, it verifies the input is at least 100 bytes long. But there is no translator specified ("''"), so the input isn't identified yet. The sub-tests are run: each looks for a different 16-bit big-endian short integer at offset 0. The MIME type and translator are the same for both, and indicate that anytotx should be run to process the file. Since %ANYTOTXFLAGS% will have a --content-type argument appended, the sub-process anytotx will know the type and run its built-in translator directly.
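To make the field layout concrete, the example lines above can be split into their parts with a short Python sketch. parse_rule_line() is a hypothetical helper, not part of anytotx; it assumes fields are separated by whitespace (so any spaces inside a string value are written as escapes) and it does not handle settings lines such as archivemimefile.

```python
def parse_rule_line(line):
    """Split one rule line into its fields, or return None if it is ignored."""
    stripped = line.strip()
    if not stripped or stripped[0] in "#;":
        return None                     # blank line or comment
    level = 1                           # no ">" means top-level (level 1)
    while stripped.startswith(">"):
        level += 1
        stripped = stripped[1:]
    # Four whitespace-separated fields, then the rest is the command line.
    offset, datatype, value, mimetype, *rest = stripped.split(None, 4)
    command = rest[0] if rest else ""
    if command in ("''", '""'):
        command = ""                    # explicitly empty: no translator
    return {
        "level": level,
        "offset": offset,
        "datatype": datatype,
        "value": value,
        "mimetype": mimetype,
        "command": command,
    }


rule = parse_rule_line(">0 beshort16 =0xd0cf application/msword       %ANYTOTX% %ANYTOTXFLAGS%")
print(rule["level"], rule["datatype"], rule["command"])
```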


CAVEATS
The anytotx plugin's availability is license dependent. Contact Thunderstone for details.

Versions of anytotx before 4.02.1045857437 Feb 21 2003 may not print any headers in the output, e.g. if no meta data is requested.


Copyright © Thunderstone Software     Last updated: Apr 15 2024