The formats rule file (typically conf/formats.rule in the install dir;
see the --rule-file option) tells anytotx how to identify and translate
various file formats. Its syntax is loosely based on the config format
of the Unix magic utility, with extensions, and was added in version
4.02.1045857437 (Feb 21 2003).
Each line specifies a content test, a MIME type, and a translator.
Each test is run in order; the first successful test indicates the
input is identified as the corresponding MIME type, and the translator
is run to translate the input to text. (If the MIME type is specified
on the command line via -f or --content-type, the appropriate
translator is looked up by MIME type instead of by content tests.)
If a line begins with a greater-than sign (">"), it is a
sub-test, specifying an additional content test for that MIME type.
The test level is indicated by the number of such leading
greater-than signs; most tests have none and are thus top-level or
level 1. A level N test's children are all tests at level
N+1 that follow it, up to (but not including) the next level N
test. If a test at level N succeeds, its children (if any) are run
recursively in config file order. This process continues until a
successful test with a non-empty command line and no successful
children is found. In this way, complex input types can be identified
that require more than one part of the data to be examined.
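The nesting and first-match behavior described above can be sketched in Python. This is only an illustrative model, not the anytotx implementation: the Rule class, build_tree and identify are hypothetical names, content tests are stand-in callables, and the sketch assumes that a child only counts as successful if it (or one of its descendants) yields a translator.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Rule:
    level: int                      # 1 = top-level, 2 = one leading '>', ...
    test: Callable[[bytes], bool]   # stand-in for the offset/datatype/value test
    mimetype: str
    command: str                    # "" means no translator on this rule
    children: List["Rule"] = field(default_factory=list)


def build_tree(flat: List[Rule]) -> List[Rule]:
    """Attach each rule to the nearest preceding rule one level up."""
    roots: List[Rule] = []
    stack: List[Rule] = []          # stack[i] is the open rule at level i + 1
    for rule in flat:
        del stack[rule.level - 1:]  # close rules at this level or deeper
        (stack[-1].children if stack else roots).append(rule)
        stack.append(rule)
    return roots


def identify(rules: List[Rule], data: bytes) -> Optional[Rule]:
    """Return the first successful rule that has a command, preferring children."""
    for rule in rules:              # config file order
        if not rule.test(data):
            continue
        deeper = identify(rule.children, data)
        if deeper is not None:
            return deeper
        if rule.command:
            return rule
    return None
```

A parent with an empty command line never identifies the input on its own here; it only gates access to its sub-tests.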
Once a translator is identified and run, its output is examined. If it is identified as a known non-text type, another translator may be run to convert it again. For example, an RTF input file may be identified and translated to HTML via one translator. That output is identified as HTML, and is translated again (via a built-in translator) to text. Because of this multiple-pass feature, translators can be used that do not output text, but output a type that another translator can handle.
Some translators may not produce any output at all, but instead produce
a series of files in a new directory. These translators have
%DIROUTPUT% (or %TMP%, see below) in their command line in the formats
rule file. After such a translator is run, any files it created are
recursively processed by anytotx, and the resulting output will
typically be multi-part MIME. In this way, archive formats such as ZIP
and tar can be processed.
Other translators may not take a file as input, but instead take a
directory tree, usually unpacked from an archive by a previous ZIP or
tar %DIROUTPUT% rule. These rules have the %DIRINPUT% option set (see
below) and have no content test; the offset, datatype and value fields
are each a single dash ("-"). These rules are for archive dirs that are
actually a single monolithic document (e.g. an Open Document Format
file), not a group of distinct files.
Each line of the formats rule file is of the following form (blank lines and pound-sign/semicolon comment lines are ignored):
[>...]offset datatype value mimetype commandline
The offset, datatype and value fields specify the content test. (For
rules with no content test, e.g. %DIRINPUT% rules, each of these fields
is a single dash.) The MIME type is given by mimetype, and the
corresponding translator's space-separated command line follows. Each
field has a particular syntax, as explained below:
(For a %DIRINPUT% rule, the offset, data type and value may each be a
single dash.)
The offset may be a range instead of a single value. This is
indicated by an optional second offset after the first, separated
by a dash. An optional increment may also appear after the second
offset, separated by a comma; the default increment is one. If a
range is given, the rule is tested at each offset from the first
through and including the second, incrementing by the increment
each time. For example, an offset of 0x100-0x200,0x10
would indicate that the rule should be tested at offset 0x100,
then 0x110, then 0x120, etc. through and including 0x200.
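As a rough illustration of the range syntax, the following Python sketch expands an offset field into the list of offsets that would be tested. The names parse_int and expand_offsets are hypothetical, and the sketch assumes C-style integer notation (0x prefix for hex, leading 0 for octal) for offsets.

```python
from typing import List


def parse_int(text: str) -> int:
    """Parse a C-style integer (assumed notation): 0x.. hex, 0.. octal, else decimal."""
    t = text.strip().lower()
    if t.startswith("0x"):
        return int(t, 16)
    if len(t) > 1 and t.startswith("0"):
        return int(t, 8)
    return int(t)


def expand_offsets(spec: str) -> List[int]:
    """Expand 'first[-last[,increment]]' into every offset to be tested."""
    range_part, _, step_text = spec.partition(",")
    first_text, _, last_text = range_part.partition("-")
    first = parse_int(first_text)
    last = parse_int(last_text) if last_text else first
    step = parse_int(step_text) if step_text else 1   # default increment is one
    if step <= 0:
        raise ValueError("increment must be positive")
    return list(range(first, last + 1, step))         # second offset is inclusive


offsets = expand_offsets("0x100-0x200,0x10")
print(len(offsets), hex(offsets[0]), hex(offsets[1]), hex(offsets[-1]))
```

For the range in the text this yields 17 offsets, from 0x100 through 0x200 in steps of 0x10.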
If the offset is in parentheses, it is indirect. An indirect offset
is of the form (X[b|s|l|B|S|L[8|16|32|64]][+|-Y]). A value is read at
offset X, and that value is in turn used as the offset for the test
data. The type and size of the value read are determined by an
optional suffix after the indirect offset:
b
A byte
s
A little-endian short
l
(lower-case el) A little-endian long
B
A byte
S
A big-endian short
L
A big-endian long
After the indirect suffix, an optional bit size may appear. This overrides the size indicated by the suffix, whose size may vary by platform. The bit size must be 8, 16, 32 or (on platforms that support it) 64.
After the indirect suffix and/or bit size, an optional sub-offset Y may appear. This positive or negative integer is added to the offset value read to compute the offset for the test.
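The indirect-offset lookup can be sketched in Python with the struct module. This is a simplified model under stated assumptions: a short is taken to be 16 bits and a long 32 bits (the text notes these vary by platform), values are read unsigned, X and Y are decimal or 0x-prefixed hex, and a missing suffix is treated as a little-endian long. The function name resolve_indirect is hypothetical.

```python
import re
import struct
from typing import Optional

# suffix -> (struct byte-order prefix, assumed default size in bits)
_SUFFIX = {"b": ("<", 8), "s": ("<", 16), "l": ("<", 32),
           "B": (">", 8), "S": (">", 16), "L": (">", 32)}
_CODES = {8: "B", 16: "H", 32: "I", 64: "Q"}    # unsigned struct codes by bit size
_NUM = r"(?:0[xX][0-9a-fA-F]+|\d+)"
# Note: a hex X directly followed by a b/B suffix is ambiguous in this sketch.
_INDIRECT = re.compile(
    r"^\((" + _NUM + r")(?:([bslBSL])(8|16|32|64)?)?([+-]" + _NUM + r")?\)$")


def _num(text: str) -> int:
    """Parse an optionally signed decimal or 0x-prefixed hex integer."""
    sign = -1 if text.startswith("-") else 1
    digits = text.lstrip("+-")
    base = 16 if digits.lower().startswith("0x") else 10
    return sign * int(digits, base)


def resolve_indirect(spec: str, data: bytes) -> Optional[int]:
    """Return the offset to test, or None if the pointer lies outside the data."""
    m = _INDIRECT.match(spec)
    if m is None:
        raise ValueError("not an indirect offset: %r" % spec)
    x_text, suffix, bits_text, sub_text = m.groups()
    endian, bits = _SUFFIX[suffix or "l"]        # default suffix is an assumption
    if bits_text:
        bits = int(bits_text)                    # explicit bit size overrides suffix
    x = _num(x_text)
    if x + bits // 8 > len(data):
        return None
    (value,) = struct.unpack_from(endian + _CODES[bits], data, x)
    return value + (_num(sub_text) if sub_text else 0)
```

For example, with a little-endian 32-bit value 0x40 stored at offset 0x3c, the spec (0x3cl+4) would resolve to offset 0x44.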
byte
A single byte
short
A short value
long
A long value
string
A string value (size determined by value)
date
A Unix time_t date
language
An identifiable language (see below)
An integer (non-string) type may optionally be prefixed by u to
indicate that an unsigned compare is to be made, by be to indicate a
big-endian value, and/or by le to indicate a little-endian value. An
optional bit-size suffix of 8, 16, 32 or (if supported) 64 may also be
appended to override the size indicated by the type (which is
platform-dependent).
A string type may be prefixed by i to indicate a case-insensitive
compare.
After the data type and optional suffix, an optional value mask
may appear for integer types. This is indicated by an ampersand
("&") and an integer value (decimal, octal or hex). The value mask
is bit-wise ANDed with the input value before comparing for the test.
In version 7.01 and later, if the datatype is "language", then
instead of looking for a specific value, up to 1MB of the input is
examined to see if it is in a recognized language (e.g. English,
Spanish etc.). If the language probability so determined is over the
value threshold, the test passes.
=
Input value must equal specified value (default)
<
Input value must be less than specified value
>
Input value must be greater than specified value
&
Input value must have all bits set that are set in
the specified value, i.e. input value bit-wise ANDed with
specified value must equal specified value
^
Input value must have at least one bit cleared that is
set in the specified value, i.e. input value bit-wise ANDed with
specified value must not equal specified value
x
No-op: any value will match
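The operators above, together with the optional value mask on the data type, can be modeled in a few lines of Python. This is an illustrative sketch for integer types only; the function name value_matches and its parameters are hypothetical and not part of anytotx.

```python
from typing import Optional


def value_matches(op: str, input_value: int, spec_value: int,
                  mask: Optional[int] = None) -> bool:
    """Apply one integer comparison operator from the rule file (sketch)."""
    if mask is not None:
        input_value &= mask                  # value mask is ANDed in first
    if op == "=":
        return input_value == spec_value
    if op == "<":
        return input_value < spec_value
    if op == ">":
        return input_value > spec_value
    if op == "&":                            # all bits of spec_value are set
        return (input_value & spec_value) == spec_value
    if op == "^":                            # at least one such bit is clear
        return (input_value & spec_value) != spec_value
    if op == "x":                            # no-op: anything matches
        return True
    raise ValueError("unknown operator: %r" % op)
```

Note that "&" and "^" are exact opposites of each other for any given pair of values.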
The value must be an integer (decimal, octal or hex) for integer
types, or a string for string or date types. String values will
have C-style escapes translated. A date value must be a
Texis-parseable date value. For string and language types, only the
operators "=", "<" and ">" are valid. For the language type, the
value is a probability, in the range "0.0" through "1.0" or "0%"
through "100%".
The MIME type may contain asterisk ("*") wildcards, to match a group
of MIME types.
The command line may be an empty quoted string ("''") to indicate
there is no translator; this means that the MIME type is not fully
identified by this test and sub-tests must be run.
The command line may contain certain special variables, enclosed
in percent signs ("%"). These variables are either replaced with
values in the command line or indicate options. Options are removed
from the command line, and should occur first, i.e. before the
program name.
%IGNORE%
Option: There is no translator; this MIME type contains no
text and is to be ignored. Useful for identifying non-text
types like images; otherwise, the fallback -fOTHER mode
may be used, which would print garbage. Should be used alone.
%DIROUTPUT%
Replaced with a unique, empty temporary directory, which is
created and chdir()'d to before running the translator.
This also indicates that the translator is expected to create
multiple output files in this temporary dir, e.g. unpack a
multi-file ZIP archive. The resulting anytotx output
will be multi-part MIME, and each unpacked file will be
recursively processed further by anytotx.
If the %DIROUTPUT% variable occurs as one of the first
items in the command line (i.e. before the program), then it
is an option and is removed from the command line, but all
other behavior is the same. This is useful for un-archiving
translators that do not take a target dir argument, but
nonetheless unpack an archive to the current directory.
In version 5 and earlier, this variable was %TMP%,
which is still supported but is deprecated.
%DIRINPUT%
Option: this rule takes a directory (e.g. containing multiple
associated files) as input, not a file. Useful for
translators that work on an unpacked archive to produce a
single output. For example, the Open Document Format
translator odf takes the unzipped document directory
tree as input (instead of the original .odt file), and
outputs the document text. Since there is no file input with
%DIRINPUT% rules, there cannot be a content test, so
the offset, datatype and value fields must each be a single
dash to indicate no test. Added in version 6. See also the
archivemimefile setting, which is typically how these
rules are recognized (instead of by content test).
%8.3%
Option: Use MSDOS-style 8.3 filenames where possible. Useful
for older MSDOS executable translators that can't handle long
filenames. No effect on non-Windows platforms.
%MIME%
Option: The translator produces MIME output, i.e. headers,
which will be parsed. Certain headers are significant and
will be stripped or replaced in the output; these include
Content-Type, X-Input-Content-Type,
Content-Transfer-Encoding and X-Translator-Status. Some of
these headers are used by translators to further identify the
input, and tell anytotx how to proceed.
%IGNORESTDOUT%
Option: Ignore the standard output of the translator, instead
of parsing and/or reporting it. Typically used with some
un-archiving translators that produce unwanted standard-out
messages in addition to unpacking files. Note: If output goes
to a file instead of standard-out, but should be
reported, use %OUT% instead. Added in version
5.01.1202350000 (20080206).
%IGNORESTDERR%
Option: Ignore the standard-error output of the translator,
instead of reporting it as an error.
%ANYTOTX%
Replaced with the path to the running anytotx
executable. Used in conjunction with %ANYTOTXFLAGS%
to use anytotx to translate a known built-in MIME type.
Should only be used for anytotx built-in MIME types.
%ANYTOTXFLAGS%
Replaced with the command-line flags passed to the running
anytotx executable, with some modifications. Any
--max-depth, --content-type and/or -f arguments are
stripped. An appropriate (decremented) --max-depth
argument and a -f content-type argument are added; also,
in version 7.07.1611702000 (20210126) and later, a
--install-dir=... argument is added if not already present.
Thus, the called anytotx will already know the MIME type
and will not try to identify it, and will also know the
current flags like -g. Should only be used for anytotx
built-in MIME types (otherwise a loop occurs and the data
is not translated).
%IN% or %IN.ext%
Replaced with the anytotx input file name. This must
be given for translators that expect an explicit input
filename on their command line. The standard input for the
translator will also be redirected from /dev/null (the
default is to redirect from the anytotx input file).
If the anytotx input is not a file but is standard
input, a temporary file will be created and the input copied
to it.
The second version (%IN.ext%) is useful where a
translator expects its input file to have a certain extension.
The input file name that replaces %IN.ext% on the
command line will have the extension .ext. If the
actual input file name does not, or the input comes from
standard input, an appropriate temporary file will be created
and the input copied to it.
%OUT% or %OUT.ext%
Replaced with an output file name. This must be given for
translators that expect an explicit output filename on their
command line. If given, a unique empty temporary file is
created, the standard output of the translator is ignored, and
this file is read afterwards; if not given, the standard output
of the translator is used instead.
The second version (%OUT.ext%) is useful where a
translator expects its output file to have a certain
extension. The output file name that replaces
%OUT.ext% on the command line will have the extension .ext.
%INSTALLDIR%
Replaced with the Texis install dir.
%BINDIR%
Replaced with the Texis binary dir (same as install dir for
Windows, install dir plus "/bin" for Unix).
%LIBDIR%
Replaced with the Texis library dir (typically the lib subdir of the Texis install dir). Added in version 8.
%LOGDIR%
Replaced with the log directory, i.e. [Texis] Log Dir value. For log files. Added in version 8.
%RUNDIR%
Replaced with the run directory, i.e. [Texis] Run Dir value. For run-time-only files, e.g. PID files etc. Added in version 8.
%EXEDIR%
Replaced with the directory of the currently running
executable. Added in version 5.01.1214185000 (20080622). In
version 7 and later, if the executable dir is not
determinable, the Texis binary dir will be used.
%%
Replaced with a single percent sign ("%").
Command line arguments may be quoted (single or double). Under Unix,
the enclosed values become a single argument and the quotes are
stripped. Under Windows the quotes are untouched and it is up
to the translator to parse its command line accordingly. When
a special variable is replaced with its value in the command line,
the value (and its adjacent non-whitespace characters, if any)
will automatically be quoted if it contains spaces
and is not already explicitly quoted. For %ANYTOTXFLAGS%,
the quoting is applied on a per-argument basis.
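The substitution and automatic quoting rules can be approximated as follows. This Python sketch is an assumption-laden model rather than the actual anytotx logic: expand_token and expand_command are hypothetical names, the command line is split naively on whitespace (so explicitly quoted arguments that already contain spaces are not handled), and option variables such as %MIME% are not modeled.

```python
import re
from typing import Dict

_VAR = re.compile(r"%([A-Za-z0-9_.]*)%")


def expand_token(token: str, values: Dict[str, str]) -> str:
    """Expand %NAME% variables in one whitespace-delimited argument."""
    def repl(match):
        name = match.group(1)
        if name == "":
            return "%"                            # "%%" is a literal percent sign
        return values.get(name, match.group(0))   # unknown names are left as-is

    expanded = _VAR.sub(repl, token)
    explicitly_quoted = token[0] in "'\""
    # Quote the value together with its adjacent non-whitespace characters
    # when the substituted value introduced a space.
    if " " in expanded and not explicitly_quoted:
        expanded = '"' + expanded + '"'
    return expanded


def expand_command(command: str, values: Dict[str, str]) -> str:
    """Expand every argument of a translator command line (naive split)."""
    return " ".join(expand_token(tok, values) for tok in command.split())


print(expand_command(
    "%BINDIR%/unzip -d %DIROUTPUT% %IN%",
    {"BINDIR": "/opt/texis/bin", "DIROUTPUT": "/tmp/any dir", "IN": "in.zip"},
))
```

The paths in the usage line are made up for illustration; only the directory containing a space ends up quoted.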
In version 6 and later, the formats.rule file may also contain
the following setting:
archivemimefile file
Certain file types are actually archives (ZIP files) that describe
a single document, not multiple. Some of these archives contain a
file that describes the MIME type of the document. These MIME
type files can be recognized with the archivemimefile
setting. After an archive (%DIROUTPUT% rule) is unpacked, or when
a directory is given as input to anytotx instead of a file, the
named file (if present) is checked against the MIME types of the
%DIRINPUT% rules. If a matching MIME type is found, that
%DIRINPUT% rule is run instead of the normal recursive
processing of the individual files in the directory.
For example, Open Document Format files are really ZIP archives,
and contain a file called mimetype that contains the MIME
type of the document (e.g. "application/vnd.oasis.opendocument.text").
Thus, after the ZIP rule unpacks an Open Document Format file (like any
other ZIP), an "archivemimefile mimetype" setting would tell
anytotx to look for the mimetype file: if it is found, and contains
a MIME type that matches a %DIRINPUT% rule, that translator is run.
Otherwise, the ZIP file's contents would be processed individually.
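The archivemimefile decision can be sketched as a small lookup. This Python model is hypothetical: pick_dirinput_rule is an invented name, the %DIRINPUT% rules are represented as a plain dict from MIME type to command line, and exact string matching is assumed (wildcards are not modeled).

```python
import os
from typing import Dict, Optional


def pick_dirinput_rule(directory: str, mime_file_name: str,
                       dirinput_rules: Dict[str, str]) -> Optional[str]:
    """Return the command of the matching %DIRINPUT% rule, or None.

    None means: fall back to processing the directory's files individually.
    """
    path = os.path.join(directory, mime_file_name)
    try:
        with open(path, "r", encoding="utf-8", errors="replace") as handle:
            declared = handle.read().strip()
    except OSError:
        return None                  # no MIME type file in this directory
    return dirinput_rules.get(declared)
```

With "archivemimefile mimetype" in effect, the caller would pass "mimetype" as mime_file_name after unpacking the ZIP into directory.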
EXAMPLE
An example formats.rule file might be:
0 string =PK\003\004 application/zip %BINDIR%/unzip -d %DIROUTPUT% %IN%
99 byte x0 application/octet-stream ''
>0 beshort16 =0xd0cf application/msword %ANYTOTX% %ANYTOTXFLAGS%
>0 beshort16 =0xdba5 application/msword %ANYTOTX% %ANYTOTXFLAGS%
The first line's test checks for the string PK followed
by ASCII char 3 and ASCII char 4, at offset 0 in the input. If the
string matches, the MIME type is application/zip, and the
program unzip in the Texis binary dir is run, with the input
file as the last argument. Multiple output files are expected to be
written to the unique %DIROUTPUT% dir by unzip and will be
recursively processed by anytotx.
The next 3 lines are all related, because the last 2 have a
greater-than sign indicating they are sub-tests of the one above. The
first test matches any byte at offset 99 in the input file. In
effect, it verifies the input is at least 100 bytes long. But there
is no translator specified ("''"), so the input isn't
identified yet. The sub-tests are run: each looks for a different
16-bit big-endian short integer at offset 0. The MIME type and
translator are the same for both, and indicate that anytotx
should be run to process the file. Since %ANYTOTXFLAGS% will
have a --content-type argument appended, the sub-process
anytotx will know the type and run its built-in translator
directly.
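To make the field layout of the example concrete, here is a minimal Python sketch that splits such lines into their five fields. It is a simplification with hypothetical names (RuleLine, parse_rule_line): fields are assumed to be separated by whitespace, string values containing spaces are not handled, and escapes in the value are left untranslated.

```python
from typing import NamedTuple, Optional


class RuleLine(NamedTuple):
    level: int        # 1 for top-level, plus one per leading '>'
    offset: str
    datatype: str
    value: str        # operator and value, e.g. "=0xd0cf"
    mimetype: str
    command: str      # "" when the rule has no translator


def parse_rule_line(line: str) -> Optional[RuleLine]:
    """Split one formats.rule line into fields; None for blanks and comments."""
    text = line.strip()
    if not text or text[0] in "#;":
        return None
    level = 1
    while text.startswith(">"):
        level += 1
        text = text[1:]
    parts = text.split(None, 4)          # the command line keeps its spaces
    if len(parts) < 5:
        raise ValueError("expected 5 fields: %r" % line)
    offset, datatype, value, mimetype, command = parts
    if command in ("''", '""'):
        command = ""                     # empty quoted string: no translator
    return RuleLine(level, offset, datatype, value, mimetype, command)


print(parse_rule_line(r">0 beshort16 =0xd0cf application/msword %ANYTOTX% %ANYTOTXFLAGS%"))
```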
CAVEATS
The anytotx plugin's availability is license dependent.
Contact Thunderstone for details.
Versions of anytotx before 4.02.1045857437 (Feb 21 2003)
may not print any headers in the output, e.g. if no meta data is
requested.