The following urlcp settings control how HTML documents are formatted
(e.g. the return value of <urlinfo text>):
8bithtml
(boolean)
If true (default), 8-bit HTML characters are left alone when
formatting HTML text. If false, 8-bit characters are replaced with
the closest 7-bit character(s).
allowinputfiledefault
(boolean)
If true, <input type="file"> default values (i.e. those assigned in
the original HTML, as opposed to those set by <urlcp domvalue>) will
be allowed. If false (the default), such default values will be
suppressed (i.e. empty, as if unset). This is for security, to help
prevent malicious HTML from surreptitiously "stealing" local files by
pre-setting file-upload dialogs. Added in version 6.00.1335222312
20120423. Returns previous value (1 or 0).
allowpunct
(boolean)
Sets whether to allow punctuation in tag/attribute names when
parsing HTML. Added in version 4.0.1001550000 20010926.
Default is on, which aids in parsing of XML-like attributes.
alttxt
(boolean)
If true (default), the text from ALT attributes in IMG and AREA tags
is included in the formatted text. If false, this text is ignored.
This is useful when the ALT text is "gif" or "image" or something
equally inane.
charsetconfigfromfile
(string)
Load charset configuration from the given file. The file format
is a set of charset names, each followed by zero or more
space-separated aliases:
Charset: ISO-8859-1
Aliases: 8859-1 CP819 csISOLatin1 IBM819
Aliases: ISO_8859-1:1987 iso-ir-100 l1 latin1
...
Charsets encountered during fetch processing that match names in
Aliases are canonicalized to their official Charset name. The default
charset config file is set in texis.ini by the [Texis] Charset Config
setting. If that is unspecified, the file conf/charsets.conf in the
install dir is used. Returns 2 on success, 1 on partial success (file
found but has errors), 0 on failure. Added in version 6.
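The file format and alias canonicalization described above can be sketched in Python. This is only an illustration, not the Texis parser: the function names parse_charset_config and canonical_charset are hypothetical, case-insensitive alias matching is an assumption, and the status codes merely mimic the documented 2/1/0 return values.

```python
def parse_charset_config(text):
    """Parse 'Charset:' / 'Aliases:' lines into an alias -> official-name map.

    Returns (alias_map, status); status mimics the documented return values:
    2 = success, 1 = partial success (some unusable lines), 0 = failure.
    """
    alias_map = {}
    current = None   # official name of the charset currently being defined
    errors = 0
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        # Split at the first colon only: aliases such as
        # "ISO_8859-1:1987" contain colons themselves.
        key, sep, value = line.partition(":")
        key = key.strip().lower()
        value = value.strip()
        if sep and key == "charset" and value:
            current = value
            alias_map[current.lower()] = current
        elif sep and key == "aliases" and current is not None:
            for alias in value.split():
                alias_map[alias.lower()] = current
        else:
            errors += 1
    if not alias_map:
        return alias_map, 0
    return alias_map, (1 if errors else 2)


def canonical_charset(name, alias_map):
    """Map a charset name or alias to its official name (assumed
    case-insensitive); unknown names are returned unchanged."""
    return alias_map.get(name.lower(), name)
```

With the sample entries shown earlier, canonical_charset("latin1", alias_map) would yield ISO-8859-1, while a name that appears in no Aliases line is passed through untouched.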
charsetconfigfromtext
(string)
Parses the given string buffer as a charset configuration, in the
same format as a charsetconfigfromfile file. Returns 2 on success, 1
on partial success (file found but has errors), 0 on failure. Added
in version 6.
charsetconverter
(string)
The command and arguments to execute to convert character sets not
known by the internal charset converter. The default (or if set
empty) is the value set in texis.ini by [Texis] Charset Converter. If
that is unspecified, the value
"%INSTALLDIR%/etc/iconv" -f %CHARSETFROM% -t %CHARSETTO% -c
is used. The variables %INSTALLDIR%, %CHARSETFROM% and %CHARSETTO%
will be replaced with the Texis installation directory, source
charset name and target charset name, respectively. Double quotes
should be placed around single arguments that may contain spaces
(e.g. the path to iconv) and will be removed in Unix versions. If the
option %ALL% is given at the start, all charset conversions will be
handled by this converter, even those that the internal converter
knows. If the option %NONE% is given at the start (nothing else
needed), no charset conversions will be handled by the converter;
i.e. only the internal converter will be used.
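As a rough illustration of the variable substitution and the %ALL% / %NONE% prefixes, the following Python sketch expands a converter template into an argument list. It is not how Texis implements this: the function name expand_converter, the mode labels, and the choice to split arguments before substituting are all assumptions made for the example, and the install directory used when calling it is hypothetical.

```python
import shlex

# Default template as quoted in the text above.
DEFAULT_TEMPLATE = '"%INSTALLDIR%/etc/iconv" -f %CHARSETFROM% -t %CHARSETTO% -c'


def expand_converter(template, install_dir, charset_from, charset_to):
    """Return (mode, argv) for a converter template.

    mode is "none" (internal converter only), "all" (external converter
    handles every conversion) or "unknown-only" (external converter handles
    only charsets the internal converter does not know). These labels are
    illustrative, not Texis terminology.
    """
    rest = template.strip()
    if rest.startswith("%NONE%"):
        return "none", []
    mode = "unknown-only"
    if rest.startswith("%ALL%"):
        mode = "all"
        rest = rest[len("%ALL%"):].strip()
    # Split first, so a quoted argument stays one argument even if the
    # substituted install directory contains spaces; shlex drops the quotes.
    values = {
        "%INSTALLDIR%": install_dir,
        "%CHARSETFROM%": charset_from,
        "%CHARSETTO%": charset_to,
    }
    argv = []
    for arg in shlex.split(rest):
        for var, val in values.items():
            arg = arg.replace(var, val)
        argv.append(arg)
    return mode, argv
```

For example, expand_converter(DEFAULT_TEMPLATE, "/opt/texis", "KOI8-R", "UTF-8") gives an argument list starting with /opt/texis/etc/iconv followed by -f KOI8-R -t UTF-8 -c.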
Returns previous setting. Added in version 5.00.1089408135 20040709.
%NONE% added in version 7.02.1415897000 20141113.
charsetpartialconvok
(boolean)
Whether to accept a timeout or non-zero exit of the external charset
translator program, if at least some output was generated.
Sometimes a few bad characters on a page can cause the translator
to generate valid output for the rest of the page, but exit
non-zero; if this setting is on, such partial output will be
accepted. Added (and defaults to on) in version 5.01.1098470049 20041022.
Returns previous setting.
charsetsrc
(string)
Sets the character set to assume is the source for all pages
fetched. Note that this forces the character set, i.e. all pages
are interpreted as this character set even if labelled or detected
differently. This setting should only be used as a last resort to
force the character set for a mis-labelled or undetectable page;
normally only charsetsrcdefault need be set. Note that this does not
affect the charset for the output formatted text; see charsettxt.
Currently recognized charsets are ISO-8859-1, UTF-8, UTF-16, UTF-16BE
and UTF-16LE. Give empty string to set default, which is
none/unknown: do not force the charset, and instead check the next
available source (i.e. charset set explicitly in page). Added in
version 5.
charsetsrcdefault
(string)
Sets the character set to assume for the source of pages fetched,
when it is not forced (with charsetsrc), nor labelled nor detectable;
i.e. a last-resort fallback. Note that this does not affect the
charset for the output formatted text; see charsettxt. This setting
is useful if most pages are correctly labelled, but a few are not
labelled and Vortex is not correctly recognizing them. Give empty
string to set default, which is none/unknown. Added in version 5.
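The precedence between charsetsrc, a page's own label or detected charset, and charsetsrcdefault can be summarized with a small Python sketch. The function and parameter names are hypothetical, and trying the label before the detected charset is an assumption; the text above only states that a forced charset wins and that charsetsrcdefault is the last resort.

```python
def resolve_source_charset(charsetsrc="", labelled=None, detected=None,
                           charsetsrcdefault=""):
    """Pick the charset used to interpret a fetched page.

    Order modelled here:
      1. charsetsrc, if set, forces the charset regardless of the page;
      2. otherwise the charset the page is labelled with, or the one
         detected from its content (relative order is an assumption);
      3. otherwise charsetsrcdefault as a last-resort fallback;
      4. otherwise None (unknown).
    """
    if charsetsrc:
        return charsetsrc
    for candidate in (labelled, detected):
        if candidate:
            return candidate
    return charsetsrcdefault or None
```

A page labelled UTF-8 is thus read as UTF-8 even when charsetsrcdefault is ISO-8859-1, but is read as ISO-8859-1 if charsetsrc is set to that value.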
charsettxt or charsettext
(string)
Sets the character set to return formatted text in (i.e. <urlinfo
text>). The default in version 5 is UTF-8. If the charset of a given
source page is different, its formatted text will be translated to
this charset. The charsettext value may be set to "source" or "src"
to indicate that the source page charset should be used instead. It
may be set to "" (empty string) to reset to the default value.
del
(boolean)
If true (default), the text within <DEL> blocks is included in the
formatted text obtained with <urlinfo text>. If the del setting is
false, this text is deleted. (This setting is the same as ignoredel
but negated.) Added in version 3.01.962850000 20000705.
filedirrobotsfollow
(boolean)
If true (default), file:// directory URLs' HTML will contain a <meta>
robots tag value of follow, which indicates to crawlers that the
pages' links should be followed. If false, the value will be
nofollow. Added in version 5.01.1226709000 20081114.
filedirrobotsindex
(boolean)
If true, file:// directory URLs' HTML will contain a <meta> robots
tag value of index, which indicates to crawlers that the pages'
content itself should be indexed. If false, the value will be
noindex. The default is false, since directory contents are mostly
filenames, which would clutter up the crawler's index. Added in
version 5.01.1226709000 20081114.
formatxmlashtml
(boolean)
If true (default), XML documents are formatted and parsed as HTML
(XSL stylesheets are not currently supported by the internal fetch
formatter). If false, XML documents are left unparsed.
Parsing XML as HTML will tend to return just the content of tags
for formatted text, whereas leaving XML unparsed will return the
entire raw document for formatted text.
Added in version 5.01.1195086345 20071114.
formtxt or formtext
(boolean)
Controls the select, input and textarea settings together. (This
setting is the same as ignoreformtxt or ignoreformtext but negated.)
Added in version 3.01.985900000 20010329.
ftpdirrobotsfollow
(boolean)
If true (default), ftp:// directory URLs' HTML will contain a <meta>
robots tag value of follow, which indicates to crawlers that the
pages' links should be followed. If false, the value will be
nofollow. Added in version 5.01.1226709000 20081114.
ftpdirrobotsindex
(boolean)
If true, ftp:// directory URLs' HTML will contain a <meta> robots tag
value of index, which indicates to crawlers that the pages' content
itself should be indexed. If false, the value will be noindex. The
default is false, since directory contents are mostly filenames,
which would clutter up the crawler's index. Added in version
5.01.1226709000 20081114.
ignoretextselectors
(list)
A list of CSS selectors to match elements (i.e. through and
including balanced close tags, if defined) whose formatted text
should be ignored (e.g. in <urlinfo text>). Only text outside of
ignored elements (and inside/part of keeptextselectors elements, if
given) is retained.
A limited subset of CSS selector syntax is supported. Each item in the list must be a selector as defined by the following pseudo grammar. "!" indicates the preceding parenthetical group must produce at least one of its components. An optional item/group is suffixed with "?"; "*" indicates zero or more occurrences of the item/group may appear; "+" indicates one or more. Fixed-font indicates literal text, including e.g. "[]" and quotes. Non-fixed-font pipe "|" separates alternatives.
|=
| ^=
| $=
| *=
| =
Examples:
#myId | Elements with id attribute equal to myId |
div.myClass | div elements with class attribute containing myClass |
div.myClass p | p elements that are descendants of myClass-class div elements |
.A, .B | Elements with class A or B |
.myClass > span | span elements that are children of myClass-class elements |
div[myAttr=myVal] | div elements with an attribute myAttr whose value is myVal |
Whitespace is permitted around (before/after) the selector; around (and as) a combinator; around a comma operator; and between the parts of an attribute-selector inside the square brackets. Comments (delimited by /* */, newlines permitted within) may appear between/around any parts in the grammar. Matches are case-insensitive, except for attribute-selector values, which match case-sensitively (unless the i attr-modifier is given). Backslash escapes are not supported. A tag must be an HTML 5 tag. Setting added in version 8.01.1664337014 20220927. Returns nonzero on success, 0 on error.
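To make the attribute-selector operators listed above concrete, here is a minimal Python sketch of how each operator compares an element's attribute value with the selector's value. It assumes the operators carry their standard CSS meanings, which this section does not spell out; the function name attr_matches and its ignore_case parameter (standing in for the i attr-modifier) are hypothetical.

```python
def attr_matches(actual, op, expected, ignore_case=False):
    """Test one attribute value against an attribute-selector.

    actual      -- the element's attribute value, or None if absent
    op          -- one of "=", "^=", "$=", "*=", "|="
    expected    -- the value written in the selector
    ignore_case -- models the i attr-modifier; values otherwise match
                   case-sensitively, as the text above states
    """
    if actual is None:
        return False
    if ignore_case:
        actual, expected = actual.lower(), expected.lower()
    if op == "=":        # exact match
        return actual == expected
    if op == "^=":       # value starts with expected
        return expected != "" and actual.startswith(expected)
    if op == "$=":       # value ends with expected
        return expected != "" and actual.endswith(expected)
    if op == "*=":       # value contains expected
        return expected != "" and expected in actual
    if op == "|=":       # exact match, or expected followed by a hyphen
        return actual == expected or actual.startswith(expected + "-")
    raise ValueError("unsupported operator: %r" % (op,))
```

Under these assumptions, div[myAttr=myVal] matches an element whose myAttr is exactly myVal but not one whose value is myval, unless the i modifier is given.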
input
(boolean)
If true (default), the VALUE of <INPUT TYPE=text> tags is included in
the formatted text obtained with <urlinfo text>. If the input setting
is false, this text is deleted. (This setting is the same as
ignoreinput but negated.) Added in version 3.01.985900000 20010329.
keeptextselectors
(list)
A list of CSS selectors to match elements (i.e. through and
including balanced close tags, if defined) whose formatted text
should be kept (e.g. in <urlinfo text>). Only text inside/part of
kept elements (and outside ignoretextselectors elements, if given) is
retained. If no keeptextselectors are given (the default), the entire
document's text is considered kept.
A limited subset of CSS selector syntax is supported; see
ignoretextselectors for details. Setting added in version
8.01.1664337014 20220927. Returns nonzero on success, 0 on error. See
also strictkeepselectors.
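How keeptextselectors, ignoretextselectors and strictkeepselectors combine can be modelled with a short Python sketch. This is one reading of the setting descriptions in this section, not the actual implementation: the document is reduced to a list of text chunks already flagged as inside a kept and/or ignored element, and the function name retained_text and its parameters are hypothetical.

```python
def retained_text(chunks, keep_selectors_given, strict=False):
    """Return the text chunks that survive keep/ignore filtering.

    chunks               -- list of (text, in_keep, in_ignore) tuples, where
                            the flags say whether the chunk lies inside an
                            element matched by a keep or ignore selector
    keep_selectors_given -- whether any keeptextselectors were set
    strict               -- models strictkeepselectors
    """
    # Did the keep selectors match anything in this document?
    any_kept = keep_selectors_given and any(k for _, k, _ in chunks)
    out = []
    for text, in_keep, in_ignore in chunks:
        if in_ignore:
            continue            # ignored text is always dropped
        if any_kept:
            keep = in_keep      # only text inside kept elements survives
        else:
            # No keep selectors, or none matched: everything is kept
            # unless strict mode is on, in which case nothing is.
            keep = not strict
        if keep:
            out.append(text)
    return out
```

With keep selectors that match an article element, only the article text outside ignored elements is returned; for a document where the selectors match nothing, the non-strict default returns all non-ignored text and strict mode returns nothing.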
linelen
(integer)
Sets the formatted-text line length to word-wrap at. The default
is 75. Note that some lines may be longer than this, e.g. if
word-wrap is disabled due to a <PRE> or similar tag.
A value of 0 will set the default (i.e. 75). A value of -1 means
infinite (no word wrap). Added in version 5.01.1119969728 20050628.
Returns previous setting.
minclrdiff
(integer)
The minimum foreground/background color difference that formatted
text must have. If the color difference is less for a given
section of text, the area will be blank instead.
Sometimes extra padded keyword information - intended for web
robots but not human users - is hidden in white-on-white text.
This text is placed to artificially raise a page's visibility in a
search engine. However, since it often contains verbose or even
completely off-topic keywords, such hidden text can be misleading.
By setting minclrdiff, this user-hidden text can be stripped from a
Vortex-fetched page as well, and the resulting page ranked on its
user-visible content only.
The color difference is defined as: abs(R1 - R2) + abs(G1 - G2) +
abs(B1 - B2), where R1, G1, B1 and R2, G2, B2 are the RGB values of
the two colors. The default is 0, which implies that all text is
included.
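The color-difference formula above is simple enough to check by hand; the following Python sketch implements it directly. The function names color_difference and is_visible are hypothetical, and colors are assumed to be (R, G, B) tuples of 0-255 integers.

```python
def color_difference(c1, c2):
    """abs(R1 - R2) + abs(G1 - G2) + abs(B1 - B2) for two RGB tuples."""
    return sum(abs(a - b) for a, b in zip(c1, c2))


def is_visible(fg, bg, minclrdiff=0):
    """Text is kept unless its foreground/background difference is
    less than minclrdiff; the default of 0 therefore keeps everything."""
    return color_difference(fg, bg) >= minclrdiff
```

Black on white has the maximum difference of 765, while white on white has a difference of 0 and is blanked by any minclrdiff of 1 or more.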
nestcomment
(boolean)
Turn on or off nesting of HTML comments. With nesting on, comments may
be nested. The default is off. Added in version 3.01.986950000 20010410.
Aka nestedcomment/nestcomments/nestedcomments.
select
(boolean)
If true (default), the text within <SELECT> blocks is included in the
formatted text obtainable with <urlinfo text>. If the select setting
is false, this text is deleted. (This setting is the same as
ignoreselect but negated.) Added in version 3.01.985900000 20010329.
showwidgets [add|del|set] [radio|select|...|all] [...]
Which <form> input widgets to display in formatted text, with square
brackets or parentheses. Selected or checked widgets are further
indicated with an asterisk. Displaying widgets can be useful to
visualize where they are in relation to text, and which are
selected/checked. The widgets to show are specified in one or more
list arguments; they may be one or more of button, checkbox, file,
hidden, image, password, radio, reset, select, select-one,
select-multiple, submit, text, textarea, or all for all widgets. If
the first token in the list is add, the widgets are added to the
display list; if del, removed; if set (the default), the list is
cleared and set to the specified list. The single widget default may
also be specified (with set) to restore the default setting, which is
to show no widgets. Added in version 5.01.1262085000 20091229.
strictcomment
(boolean)
Turn on or off strict HTML comment parsing. With strict comments on,
comments must start with "<!--", not just "<!". The default is on.
Added in version 3.01.986950000 20010410. Aka strictcomments.
strictkeepselectors
(boolean)
Whether keeptextselectors and keeprefsselectors matching should be
strict. If true, only text/refs matched by such selectors will be
kept; i.e. if they match nothing, or there are no such selectors,
nothing will be kept. If false (the default), all text/refs are kept
in such instances. For example, if keeptextselectors were set to keep
<article>s within documents (to trim cruft outside of them),
documents without any <article>s at all would still be kept in their
entirety, if strictkeepselectors is false. Added in version
8.01.1664337014 20220927. Returns previous flag value.
(Ignore-type selectors are effectively always strict.)
strike
(boolean)
If true (default), the text within <STRIKE> blocks is included in the
formatted text obtained with <urlinfo text>. If the strike setting is
false, this text is deleted. (This setting is the same as
ignorestrike but negated.)
textarea
(boolean)
If true (default), the text within <TEXTAREA> blocks is included in
the formatted text obtained with <urlinfo text>. If the textarea
setting is false, this text is deleted. (This setting is the same as
ignoretextarea but negated.) Added in version 3.01.985900000 20010329.
xmltags
(boolean)
Turns on or off interpretation of XML tags as tags. If on, tags that
start with <? will be interpreted as an unknown HTML tag, i.e.
suppressed from the formatted text. If off, such tags will be taken
as text and will appear in the formatted text output. Added (and
defaults to on) in version 5.01.1105303759 20050109. Returns previous
setting.
utf8badencasiso88591
(boolean)
If true (default), invalid bytes in UTF-8 source documents will be
interpreted as ISO-8859-1 characters (and converted to charsettxt).
If false, such bytes are replaced with question marks ("?") as with
other failed conversions. ISO-8859-1 is (erroneously) placed in UTF-8
text often enough that this assumption can generally be made. Note
that such conversions still cause an error message and non-zero
<urlinfo errnum>, however, unless utf8badencasiso88591err is set to
false; this alerts the user to the erroneous document. Added in
version 5.01.1244765000 20090611.
utf8badencasiso88591err
(boolean)
If true (default), the interpretation of invalid UTF-8 bytes as
ISO-8859-1 (when utf8badencasiso88591 is true) still causes an error
to be reported (if charsetmsgs is true) and returned in errnum. If
false, no error message is generated nor error returned. Note that if
utf8badencasiso88591 is false, utf8badencasiso88591err is ignored, as
invalid UTF-8 bytes are then treated as any other failed conversion
(mapped to question mark). Added in version 5.01.1244765000 20090611.
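The interaction of utf8badencasiso88591 and utf8badencasiso88591err can be sketched in Python with custom codec error handlers. This only models the behavior described above and is not Texis code: the function decode_utf8, its parameters and the handler names are hypothetical, the returned flag stands in for the error message and non-zero errnum, and it is an assumption that an ordinary failed conversion (question mark) always counts as an error. Conversion to charsettxt is not modelled.

```python
import codecs


def _latin1_fallback(err):
    """Reinterpret the invalid bytes as ISO-8859-1 characters."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return bad.decode("iso-8859-1"), err.end


def _question_mark(err):
    """Replace the invalid bytes with a question mark."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return "?", err.end


codecs.register_error("latin1-fallback", _latin1_fallback)
codecs.register_error("question-mark", _question_mark)


def decode_utf8(data, bad_as_iso88591=True, report_error=True):
    """Decode UTF-8 bytes, returning (text, had_error).

    bad_as_iso88591 -- models utf8badencasiso88591
    report_error    -- models utf8badencasiso88591err; ignored when
                       bad_as_iso88591 is false
    """
    try:
        return data.decode("utf-8"), False      # clean input, no error
    except UnicodeDecodeError:
        pass
    if bad_as_iso88591:
        text = data.decode("utf-8", errors="latin1-fallback")
        return text, report_error
    # Treated like any other failed conversion (assumed to be an error).
    return data.decode("utf-8", errors="question-mark"), True
```

For instance, the bytes of "caf" followed by a lone 0xE9 byte decode to "café" with the error flag set by default, to "café" without an error when report_error is false, and to "caf?" when bad_as_iso88591 is false.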