The formats rule file (typically
conf/formats.rule in the
install dir; see the
--rule-file option) tells anytotx how
to identify and translate various file formats. Its syntax is loosely
based on the Unix
magic utility's config format, with
extensions, and was added in version 4.02.1045857437 (Feb 21 2003).
Each line specifies a content test, a MIME type, and a translator.
Each test is run in order; the first successful test indicates the
input is identified as the corresponding MIME type, and the translator
is run to translate the input to text. (If the MIME type is specified
on the command line, an
appropriate translator is searched for by MIME type instead of content.)
If a line begins with a greater-than sign (">"), it is a
sub-test, specifying an additional content test for that MIME type.
The test level is indicated by the number of such leading
greater-than signs; most tests have none and are thus top-level or
level 1. A level N test's children are all tests at level
N+1 that follow it, up to (but not including) the next level N
test. If a test at level N succeeds, its children (if any) are run
recursively in config file order. This process continues until a
successful test with a non-empty command line and no successful
children is found. In this way, complex input types can be identified
that require more than one part of the data to be examined.
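The sub-test search described above can be sketched in Python. This is only an illustration of the documented ordering, not anytotx's actual implementation: the Rule tuple, children_of() and match() are hypothetical names, each content test is stood in for by a plain function, and the sketch assumes that a successful test with an empty command line and no successful children simply lets the search move on to the next test.

```python
from typing import Callable, List, NamedTuple, Optional


class Rule(NamedTuple):
    level: int                     # 1 = top-level, 2 = one leading ">", etc.
    test: Callable[[bytes], bool]  # stand-in for the offset/datatype/value test
    mimetype: str
    command: str                   # "" means no translator on this line


def children_of(rules: List[Rule], i: int) -> List[int]:
    """Indexes of the level N+1 rules after rules[i], up to the next level N rule."""
    n = rules[i].level
    kids = []
    for j in range(i + 1, len(rules)):
        if rules[j].level <= n:
            break  # next test at level N (or shallower) ends the child list
        if rules[j].level == n + 1:
            kids.append(j)
    return kids


def match(rules: List[Rule], data: bytes,
          candidates: Optional[List[int]] = None) -> Optional[Rule]:
    """Return the first successful rule that has a command and no successful children."""
    if candidates is None:
        candidates = [i for i, r in enumerate(rules) if r.level == 1]
    for i in candidates:
        rule = rules[i]
        if not rule.test(data):
            continue
        # A successful test hands control to its children, in file order.
        hit = match(rules, data, children_of(rules, i))
        if hit is not None:
            return hit
        if rule.command:
            return rule
        # Assumption: no command and no successful child means keep looking.
    return None
```

A parent with an empty command line therefore acts purely as a gate: its children are only consulted when the parent's own test passes.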
Once a translator is identified and run, its output is examined. If it is identified as a known non-text type, another translator may be run to convert it again. For example, an RTF input file may be identified and translated to HTML via one translator. That output is identified as HTML, and is translated again (via a built-in translator) to text. Because of this multiple-pass feature, translators can be used that do not output text, but output a type that another translator can handle.
Some translators may not produce any output at all, but produce a
series of files, in a new directory. These translators have
%DIROUTPUT% (or %TMP%, see below) in their command line
in the formats rule file. After such a translator is run, any files
it created are recursively processed by
anytotx, and the
resulting output will typically be multi-part MIME. In this way,
archive formats such as ZIP and
tar can be processed.
Other translators may not take a file as input, but instead take a
directory tree, usually unpacked from a previous ZIP or other
%DIROUTPUT% rule. These rules have the
%DIRINPUT% option set (below), and have no content test; the
offset, datatype and value fields are each a single dash
("-"). These rules are for archive dirs that are actually
a single monolithic document (e.g. an Open Document Format file),
not a group of distinct files.
Each line of the formats rule file is of the following form (blank lines and pound-sign/semicolon comment lines are ignored):
[>...]offset datatype value mimetype commandline
The offset, datatype and value fields specify
the content test. (For rules with no content test, i.e.
%DIRINPUT% rules, each of these fields is a single dash.)
The MIME type is given by
mimetype, and the corresponding
translator's space-separated command line follows. Each field has a
particular syntax, as explained below:
For a %DIRINPUT% rule, the offset, data type and value may each be a single dash.
The offset may be a range instead of a single value. This is
indicated by an optional second offset after the first, separated
by a dash. An optional increment may also appear after the second
offset, separated by a comma; the default increment is one. If a
range is given, the rule is tested at each offset from the first
through and including the second, incrementing by the increment
each time. For example, an offset of 0x100-0x200,0x10
would indicate that the rule should be tested at offset 0x100,
then 0x110, then 0x120, etc. through and including 0x200.
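As a rough illustration of the range rule, the following Python sketch expands an offset field into the list of offsets that would be tested. The function name expand_offsets is hypothetical, and the sketch assumes offsets are written in decimal or 0x-prefixed hex, in the first-second,increment layout described above.

```python
from typing import List


def expand_offsets(spec: str) -> List[int]:
    """Expand 'first[-second[,increment]]' into the offsets to test, in order."""
    first, _, rest = spec.partition("-")
    if not rest:
        return [int(first, 0)]          # single offset, no range
    second, _, inc = rest.partition(",")
    step = int(inc, 0) if inc else 1    # the default increment is one
    if step <= 0:
        raise ValueError("increment must be positive")
    # The second offset is included, hence the + 1.
    return list(range(int(first, 0), int(second, 0) + 1, step))
```

With this sketch, expand_offsets("0x100-0x200,0x10") yields 17 offsets, starting at 0x100 and ending at 0x200.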
If the offset is in parentheses, it is indirect: a
value is read at the offset X given in the parentheses, and that value is in turn used as the offset
for the test data. The value type and size read is determined by
an optional suffix after the indirect offset:
s  A little-endian short
l  (lower-case el) A little-endian long
S  A big-endian short
L  A big-endian long
After the indirect suffix, an optional bit size may appear. This overrides the size indicated by the suffix, whose size may vary by platform. The bit size must be 8, 16, 32 or (on platforms that support it) 64.
After the indirect suffix and/or bit size, an optional sub-offset Y may appear. This positive or negative integer is added to the offset value read to compute the offset for the test.
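The indirect lookup can be sketched with Python's struct module. This is a hedged illustration only: resolve_indirect is a hypothetical helper, the value is read as unsigned, and the default sizes of 16 bits for a short and 32 bits for a long are assumptions (the text notes the real sizes may vary by platform).

```python
import struct
from typing import Optional

# struct format codes for unsigned integers of each supported bit size
_CODES = {8: "B", 16: "H", 32: "I", 64: "Q"}
# Assumed default sizes for the suffixes; real sizes are platform-dependent.
_DEFAULT_BITS = {"s": 16, "l": 32, "S": 16, "L": 32}


def resolve_indirect(data: bytes, x: int, suffix: str,
                     bits: Optional[int] = None, y: int = 0) -> int:
    """Read a value at offset x and return the offset to use for the test."""
    order = "<" if suffix in ("s", "l") else ">"  # lower-case = little-endian
    size = bits if bits is not None else _DEFAULT_BITS[suffix]
    (value,) = struct.unpack_from(order + _CODES[size], data, x)
    return value + y                              # apply the sub-offset Y
```

For example, if the two bytes at offset 4 are 0x20 0x00, a suffix of s gives a test offset of 0x20, while S gives 0x2000.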
byte      A single byte
short     A short value
long      A long value
string    A string value (size determined by value)
language  An identifiable language (see below)
An integer (non-string) type may optionally be prefixed by
u to indicate an unsigned compare is to be made,
be to indicate a big-endian value, and/or
le to indicate
little-endian. An optional bit-size suffix of 8, 16, 32 or (if
supported) 64 may also be appended to override the size indicated
by the type (which is platform-dependent).
A string type may be prefixed by
i to indicate a case-insensitive compare.
After the data type and optional suffix, an optional value mask
may appear for integer types. This is indicated by an ampersand
("&") and integer value (decimal, octal or hex). This
value mask will be bit-wise ANDed to the input value before
comparing for the test.
In version 7.01 and later, if the
data type is "language", instead of looking for a specific value, up
to 1MB is examined to see if it is in a recognized language
(e.g. English, Spanish etc.). If the language probability so
determined is over the
value threshold, the test passes.
=  Input value must equal specified value (default)
<  Input value must be less than specified value
>  Input value must be greater than specified value
&  Input value must have all bits set that are set in the specified value, i.e. input value bit-wise ANDed with specified value must equal specified value
^  Input value must have cleared any bit that is set in the specified value, i.e. input value bit-wise ANDed with specified value must not equal specified value
x  No-op: any value will match
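The integer comparison, including the optional value mask, might be sketched as follows. The function name compare is hypothetical, ">" is taken to mean greater-than, and the "&" and "^" cases follow the "i.e." wording of the operator descriptions literally; this is an interpretation of the text, not the product's code.

```python
from typing import Optional


def compare(input_value: int, op: str, spec: int,
            mask: Optional[int] = None) -> bool:
    """Apply one operator from the table to an integer read from the input."""
    if mask is not None:
        input_value &= mask             # value mask is ANDed in before comparing
    if op == "=":
        return input_value == spec
    if op == "<":
        return input_value < spec
    if op == ">":
        return input_value > spec
    if op == "&":
        return (input_value & spec) == spec   # all bits of spec are set
    if op == "^":
        return (input_value & spec) != spec   # at least one bit of spec is clear
    if op == "x":
        return True                     # no-op: any value matches
    raise ValueError("unknown operator: " + op)
```

Under these assumptions, a rule with a mask of 0xff and a value of =0x34 would match an input value of 0x1234, since only the low byte is compared.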
The value must be an integer (decimal, octal or hex) for integer
types, or a string for string or date types. String values will
have C-style escapes translated. A date value must be a
Texis-parseable date value. For string and date types,
only the operators "=", "<" and ">" are
valid. For the
language type, the value is a probability,
in the range "0.0" through "1.0", or "0%" through "100%".
The MIME type may contain asterisk ("*") wildcards, to match a group of MIME types.
The command line may be an empty pair of quotes ('' or "") to indicate there is no translator; this means that the MIME type is not fully identified by this test and sub-tests must be run.
The command line may contain certain special variables, enclosed
in percent-signs ("%"). These variables will be replaced
with certain values in the command line, or indicate certain
options. Options will be removed from the command line, and
should occur first, i.e. before the program name.
%IGNORE%
Option: There is no translator; this MIME type contains no text and is to be ignored. Useful for identifying non-text types like images; otherwise, the fallback
-fOTHER mode may be used, which would print garbage. Should be used alone.
%DIROUTPUT%
Replaced with a unique, empty temporary directory, which is created and
chdir()'d to before running the translator. This also indicates that the translator is expected to create multiple output files in this temporary dir, e.g. unpack a multi-file ZIP archive. The resulting
anytotx output will be multi-part MIME, and each unpacked file will be recursively processed further by
anytotx. If the %DIROUTPUT% variable occurs as one of the first
items in the command line (i.e. before the program), then it
is an option and is removed from the command line, but all
other behavior is the same. This is useful for un-archiving
translators that do not take a target dir argument, but
nonetheless unpack an archive to the current directory.
In version 5 and earlier, this variable was %TMP%,
which is still supported but is deprecated.
%DIRINPUT%
Option: This rule takes a directory (e.g. containing multiple
associated files) as input, not a file. Useful for
translators that work on an unpacked archive to produce a
single output. For example, the Open Document Format translator
odf takes the unzipped document directory
tree as input (instead of the original
.odt file), and
outputs the document text. Since there is no file input with
%DIRINPUT% rules, there cannot be a content test, so
the offset, datatype and value fields must each be a single
dash to indicate no test. Added in version 6. See also the
archivemimefile setting, which is typically how these
rules are recognized (instead of by content test).
%8.3%
Option: Use MSDOS-style 8.3 filenames where possible. Useful for older MSDOS executable translators that can't handle long filenames. No effect on non-Windows platforms.
%MIME%
Option: The translator produces MIME output, i.e. headers, which will be parsed. Certain headers are significant and will be stripped or replaced in the output; these include
X-Translator-Status. Some of these headers are used by translators to further identify the input, and tell
anytotx how to proceed.
%IGNORESTDOUT%
Option: Ignore the standard output of the translator, instead of parsing and/or reporting it. Typically used with some un-archiving translators that produce unwanted standard-out messages in addition to unpacking files. Note: If output goes to a file instead of standard-out, but should be reported, use
%OUT% instead. Added in version 5.01.1202350000 20080206.
%IGNORESTDERR%
Option: Ignore the standard-error output of the translator, instead of reporting it as an error.
%ANYTOTX%
Replaced with the path to the running
anytotx executable. Used in conjunction with %ANYTOTXFLAGS% to run
anytotx to translate a known built-in MIME type. Should only be used for
anytotx built-in MIME types.
%ANYTOTXFLAGS%
Replaced with the command-line flags passed to the running
anytotx executable, with some modifications. Some
arguments are stripped. An appropriate (decremented)
--max-depth argument and a --content-type
argument are added. Thus, the called anytotx will
already know the MIME type and will not try to identify it,
and will also know the current flags. Should
only be used for
anytotx built-in MIME types (otherwise
a loop occurs and the data is not translated).
%IN%, %IN.ext%
Replaced with the
anytotx input file name. This must be given for translators that expect an explicit input filename on their command line. The standard input for the translator will also be redirected from
/dev/null (the default is to redirect from the
anytotx input file). If the
anytotx input is not a file but is standard input, a temporary file will be created and the input copied to it.
The second version (
%IN.ext%) is useful where a
translator expects its input file to have a certain extension.
The input file name that replaces
%IN.ext% on the
command line will have the extension
.ext. If the
actual input file name does not, or comes from standard in, an
appropriate temporary file will be created and the input
copied to it.
%OUT%, %OUT.ext%
Replaced with an output file name. This must be given for translators that expect an explicit output filename on their command line. If given, the standard output for the translator is ignored and this file will be read afterwards; if not given, the standard output from the translator will be used afterwards. A unique empty temporary file is created.
The second version (
%OUT.ext%) is useful where a
translator expects its output file to have a certain
extension. The output file name that replaces
%OUT.ext% on the command line will have the extension
.ext.
%INSTALLDIR%
Replaced with the Texis install dir.
%BINDIR%
Replaced with the Texis binary dir (same as install dir for Windows, install dir plus "/bin" for Unix).
%EXEDIR%
Replaced with the directory of the currently running executable. Added in version 5.01.1214185000 20080622. In version 7 and later, if the executable dir is not determinable, the Texis binary dir will be used.
%%
Replaced with a single percent-sign ("%").
Command line arguments may be quoted (single or double). Under Unix,
the enclosed values become a single argument and the quotes are
stripped. Under Windows the quotes are untouched and it is up
to the translator to parse its command line accordingly. When
a special variable is replaced with its value in the command line,
the value (and its adjacent non-whitespace characters, if any)
will automatically be quoted if it contains spaces
and is not already explicitly quoted. For variables that expand to
multiple arguments, the quoting is applied on a per-argument basis.
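The substitution and automatic quoting behavior can be approximated in Python. This is a simplified sketch under several assumptions: expand_argument and expand_command are hypothetical names, variables are looked up in a plain dictionary, option variables are not handled, and an argument counts as "already explicitly quoted" only if it begins with a quote character.

```python
import re
from typing import Dict

_VAR = re.compile(r"%([A-Za-z0-9.]*)%")
# One argument: a fully quoted string, or a run of non-whitespace characters.
_ARG = re.compile(r'''"[^"]*"|'[^']*'|\S+''')


def expand_argument(arg: str, values: Dict[str, str]) -> str:
    """Expand %NAME% variables in one argument, quoting it if a value has spaces."""
    needs_quotes = False

    def repl(m: "re.Match[str]") -> str:
        nonlocal needs_quotes
        name = m.group(1)
        if name == "":
            return "%"                  # "%%" becomes a single percent-sign
        value = values[name]
        if " " in value:
            needs_quotes = True
        return value

    out = _VAR.sub(repl, arg)
    already_quoted = arg[:1] in ("'", '"')
    if needs_quotes and not already_quoted:
        out = '"' + out + '"'           # quote value plus adjacent characters
    return out


def expand_command(command: str, values: Dict[str, str]) -> str:
    """Expand every argument of a command line and rejoin with single spaces."""
    return " ".join(expand_argument(a, values) for a in _ARG.findall(command))
```

For instance, with a hypothetical %IN% value of "/tmp/my file.zip", the argument %IN% expands to a double-quoted file name, while an argument already written as "%IN%" is not quoted a second time.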
In version 6 and later, the
formats.rule file may also contain
the archivemimefile setting.
Certain file types are actually archives (ZIP files) that describe
a single document, not multiple. Some of these archives contain a
file that describes the MIME type of the document. These MIME
type files can be recognized with the archivemimefile
setting. The named file, if seen after an archive (i.e. a
%DIROUTPUT% rule) is unpacked (or a directory is given as input to
anytotx instead of a file), is checked for a match against the
%DIRINPUT% rules' MIME types. If a matching MIME type is found, that
%DIRINPUT% rule is then run, instead of the
normal recursive processing of the individual files in the archive.
For example, Open Document Format files are really ZIP archives,
and contain a file called
mimetype that contains the MIME
type of the document. Thus,
after the ZIP rule unpacks an Open Document Format file (like any
other ZIP), an archivemimefile mimetype setting would tell
anytotx to look for the
mimetype file: if it is
found, and contains a MIME type that matches a %DIRINPUT%
rule, that translator is run. Otherwise, the ZIP file's contents
would be processed individually.
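The archivemimefile decision can be sketched as a small Python function. Everything here is illustrative: pick_dirinput_rule is a hypothetical name, the %DIRINPUT% rules are modeled as a mapping from MIME type pattern to command line, and Python's fnmatch is used as a stand-in for the wildcard matching, whose exact semantics in anytotx are not specified here.

```python
import fnmatch
import os
from typing import Dict, Optional


def pick_dirinput_rule(unpacked_dir: str, mime_file_name: str,
                       dirinput_rules: Dict[str, str]) -> Optional[str]:
    """Return the command of the first %DIRINPUT% rule matching the archive's MIME file.

    Returns None when the named file is absent or no rule matches, in which
    case the unpacked files would be processed individually instead.
    """
    path = os.path.join(unpacked_dir, mime_file_name)
    if not os.path.isfile(path):
        return None
    with open(path, "r", encoding="ascii", errors="replace") as fh:
        mime = fh.read().strip()
    for pattern, command in dirinput_rules.items():
        # "*" in a rule's MIME type matches a group of MIME types.
        if fnmatch.fnmatchcase(mime, pattern):
            return command
    return None
```

A None result corresponds to the fallback case in the text, where the archive's contents are handled one file at a time.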
An example formats.rule file might be:
0 string =PK\003\004 application/zip %BINDIR%/unzip -d %DIROUTPUT% %IN%
99 byte x0 application/octet-stream ''
>0 beshort16 =0xd0cf application/msword %ANYTOTX% %ANYTOTXFLAGS%
>0 beshort16 =0xdba5 application/msword %ANYTOTX% %ANYTOTXFLAGS%
The first line's test is to check for the string PK followed
by ASCII char 3 and ASCII char 4, at offset 0 in the input. If the
string matches, the MIME type is
application/zip, and the
unzip program in the Texis binary dir is run, with the input
file as the last argument. Multiple output files are expected to be
written to the unique
%DIROUTPUT% dir by
unzip and will be
recursively processed by anytotx.
The next 3 lines are all related, because the last 2 have a
greater-than sign indicating they are sub-tests of the one above. The
first test matches any byte at offset 99 in the input file. In
effect, it verifies the input is at least 100 bytes long. But there
is no translator specified (''), so the input isn't
identified yet. The sub-tests are run: each looks for a different
16-bit big-endian short integer at offset 0. The MIME type and
translator are the same for both, and indicate that anytotx
should be run to process the file. Since %ANYTOTXFLAGS% has a
--content-type argument appended, the sub-process
anytotx will know the type and run its built-in translator.
anytotx plugin's availability is license dependent.
Contact Thunderstone for details.
anytotx before 4.02.1045857437 Feb 21 2003
may not print any headers in the output, e.g. if no meta data is