Formatted Text

The following urlcp settings control how HTML documents are formatted (e.g. the return value of <urlinfo text>):

  • 8bithtml (boolean) If true (default), 8-bit HTML characters are left alone when formatting HTML text. If false, 8-bit characters are replaced with the closest 7-bit character(s).

  • allowinputfiledefault (boolean) If true, <input type="file"> default values (i.e. those assigned in the original HTML, as opposed to those set by <urlcp domvalue>) will be allowed. If false (the default), such default values will be suppressed (i.e. empty, as if unset). This is for security, to help prevent malicious HTML from surreptitiously "stealing" local files by pre-setting file-upload dialogs. Added in version 6.00.1335222312 20120423. Returns previous value (1 or 0).

  • allowpunct (boolean) Sets whether to allow punctuation in tag/attribute names when parsing HTML. Added in version 4.0.1001550000 20010926. Default is on, which aids in parsing of XML-like attributes.

  • alttxt (boolean) If true (default), the text from ALT attributes in IMG and AREA tags is included in the formatted text. If false, this text is ignored. This is useful when the ALT text is "gif" or "image" or something equally inane.

  • charsetconfigfromfile (string) Load charset configuration from the given file. The file format is a set of charset names, each followed by zero or more space-separated aliases:

    Charset: ISO-8859-1
    Aliases: 8859-1 CP819 csISOLatin1 IBM819
    Aliases: ISO_8859-1:1987 iso-ir-100 l1 latin1
    ...

    Charsets encountered during fetch processing that match names in Aliases are canonicalized to their official Charset name. The default charset config file is set in texis.ini by the [Texis] Charset Config setting. If that is unspecified, the file conf/charsets.conf in the installation directory is used. Returns 2 on success, 1 on partial success (file found but has errors), 0 on failure. Added in version 6.
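
    The alias canonicalization described above can be sketched in Python. This is only an illustration of the file format, not the Texis parser: the function names parse_charset_config and canonicalize are hypothetical, and case-insensitive matching of names is an assumption.

```python
def parse_charset_config(text):
    """Map each alias (and each official name) to its official Charset name."""
    alias_to_charset = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if ":" not in line:
            continue  # skip blank lines and anything that is not "Key: value"
        # Split on the first colon only; aliases such as ISO_8859-1:1987
        # contain colons themselves.
        key, _, value = line.partition(":")
        key = key.strip().lower()
        value = value.strip()
        if key == "charset" and value:
            current = value
            alias_to_charset[current.lower()] = current
        elif key == "aliases" and current is not None:
            for alias in value.split():
                alias_to_charset[alias.lower()] = current
    return alias_to_charset


def canonicalize(name, alias_to_charset):
    """Return the official name for a charset, or the name itself if unknown."""
    return alias_to_charset.get(name.strip().lower(), name)
```

    With the sample configuration shown earlier, canonicalize("latin1", table) would yield ISO-8859-1, while a name that appears nowhere in the configuration is returned unchanged.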

  • charsetconfigfromtext (string) Parses the given string buffer as a charset configuration, in the same format as a charsetconfigfromfile file. Returns 2 on success, 1 on partial success (text parsed but has errors), 0 on failure. Added in version 6.

  • charsetconverter (string) The command and arguments to execute to convert character sets not known by the internal charset converter. The default (or if set empty) is the value set in texis.ini by [Texis] Charset Converter. If that is unspecified, the value "%INSTALLDIR%/etc/iconv" -f %CHARSETFROM% -t %CHARSETTO% -c is used. The variables %INSTALLDIR%, %CHARSETFROM% and %CHARSETTO% will be replaced with the Texis installation directory, source charset name and target charset name, respectively. Double quotes should be placed around single arguments that may contain spaces (e.g. the path to iconv) and will be removed in Unix versions. If the option %ALL% is given at the start, all charset conversions will be handled by this converter, even those that the internal converter knows. If the option %NONE% is given at the start (nothing else needed), no charset conversions will be handled by the converter; i.e. only the internal converter will be used. Returns previous setting. Added in version 5.00.1089408135 20040709. %NONE% added in version 7.02.1415897000 20141113.
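
    To illustrate the variable substitution and the %ALL%/%NONE% options, here is a rough Python sketch. It is not how Texis itself processes the setting: the function name expand_converter, the mode labels, and the use of shlex to split the command and drop the double quotes are all assumptions made for the example.

```python
import shlex

DEFAULT_TEMPLATE = '"%INSTALLDIR%/etc/iconv" -f %CHARSETFROM% -t %CHARSETTO% -c'


def expand_converter(template, install_dir, charset_from, charset_to):
    """Return (mode, argv) for a charsetconverter-style template.

    mode is "all", "none" or "unknown-only" (hypothetical labels).
    """
    template = template.strip() or DEFAULT_TEMPLATE
    mode = "unknown-only"
    for option, option_mode in (("%ALL%", "all"), ("%NONE%", "none")):
        if template.startswith(option):
            mode = option_mode
            template = template[len(option):].strip()
            break
    if mode == "none":
        return mode, []  # only the internal converter would be used
    argv = []
    # Split first (which removes the double quotes), then substitute, so an
    # install dir containing spaces stays a single argument.
    for arg in shlex.split(template):
        arg = (arg.replace("%INSTALLDIR%", install_dir)
                  .replace("%CHARSETFROM%", charset_from)
                  .replace("%CHARSETTO%", charset_to))
        argv.append(arg)
    return mode, argv
```

    For example, expanding the default template with a hypothetical install dir of /usr/local/morph3 and charsets KOI8-R and UTF-8 gives the argument list /usr/local/morph3/etc/iconv, -f, KOI8-R, -t, UTF-8, -c.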

  • charsetpartialconvok (boolean) Whether to accept a timeout or non-zero exit of the external charset translator program, if at least some output was generated. Sometimes a few bad characters on a page cause the translator to exit non-zero even though it generated valid output for the rest of the page; if this setting is on, such partial output will be accepted. Added (and defaults to on) in version 5.01.1098470049 20041022. Returns previous setting.

  • charsetsrc (string)

    Sets the character set to assume is the source for all pages fetched. Note that this forces the character set, i.e. all pages are interpreted as this character set even if labelled or detected differently. This setting should only be used as a last resort to force the character set for a mis-labelled or undetectable page; normally only charsetsrcdefault need be set. Note that this does not affect the charset for the output formatted text; see charsettxt. Currently recognized charsets are ISO-8859-1, UTF-8, UTF-16, UTF-16BE and UTF-16LE. Give empty string to set default, which is none/unknown - do not force charset, instead check next available source (i.e. charset set explicitly in page). Added in version 5.

  • charsetsrcdefault (string)

    Sets the character set to assume for the source of pages fetched, when it is not forced (with charsetsrc), nor labelled nor detectable; i.e. a last-resort fallback. Note that this does not affect the charset for the output formatted text; see charsettxt. This setting is useful if most pages are correctly labelled, but a few are not labelled and Vortex is not correctly recognizing them. Give empty string to set default, which is none/unknown. Added in version 5.

  • charsettxt or charsettext (string) Sets the character set to return formatted text in (i.e. <urlinfo text>). The default in version 5 is UTF-8. If the charset of a given source page is different, its formatted text will be translated to this charset. The charsettext value may be set to "source" or "src" to indicate that the source page charset should be used instead. It may be set to "" (empty string) to reset to the default value.

  • del (boolean) If true (default), the text within <DEL> blocks is included in the formatted text obtained with <urlinfo text>. If the del setting is false, this text is deleted. (This setting is the same as ignoredel but negated.) Added in version 3.01.962850000 20000705.

  • filedirrobotsfollow (boolean) If true (default), file:// directory URLs' HTML will contain a <meta> robots tag value of follow, which indicates to crawlers that the pages' links should be followed. If false, the value will be nofollow. Added in version 5.01.1226709000 20081114.

  • filedirrobotsindex (boolean) If true, file:// directory URLs' HTML will contain a <meta> robots tag value of index, which indicates to crawlers that the pages' content itself should be indexed. If false, the value will be noindex. The default is false, since directory contents are mostly filenames, which would clutter up the crawler's index. Added in version 5.01.1226709000 20081114.

  • formatxmlashtml (boolean) If true (default), XML documents are formatted and parsed as HTML (XSL stylesheets are not currently supported by the internal fetch formatter). If false, XML documents are left unparsed. Parsing XML as HTML will tend to return just the content of tags for formatted text, whereas leaving XML unparsed will return the entire raw document for formatted text. Added in version 5.01.1195086345 20071114.

  • formtxt or formtext (boolean) Controls the select, input and textarea settings together. (This setting is the same as ignoreformtxt or ignoreformtext but negated.) Added in version 3.01.985900000 20010329.

  • ftpdirrobotsfollow (boolean) If true (default), ftp:// directory URLs' HTML will contain a <meta> robots tag value of follow, which indicates to crawlers that the pages' links should be followed. If false, the value will be nofollow. Added in version 5.01.1226709000 20081114.

  • ftpdirrobotsindex (boolean) If true, ftp:// directory URLs' HTML will contain a <meta> robots tag value of index, which indicates to crawlers that the pages' content itself should be indexed. If false, the value will be noindex. The default is false, since directory contents are mostly filenames, which would clutter up the crawler's index. Added in version 5.01.1226709000 20081114.

  • ignoretextselectors (list) A list of CSS selectors to match elements (i.e. through and including balanced close tags, if defined) whose formatted text should be ignored (e.g. in <urlinfo text>). Only text outside of ignored elements (and inside/part of keeptextselectors elements, if given) is retained.

    A limited subset of CSS selector syntax is supported. Each item in the list must be a selector as defined by the following pseudo grammar. "!" indicates the preceding parenthetical group must produce at least one of its components. An optional item/group is suffixed with "?"; "*" indicates zero or more occurrences of the item/group may appear; "+" indicates one or more. Fixed-font indicates literal text, including e.g. "[]" and quotes. Non-fixed-font pipe "|" separates alternatives.

    • selector = complex-selector-list

    • complex-selector-list = complex-selector ( , complex-selector )*

    • complex-selector = compound-selector ( combinator compound-selector )*

    • compound-selector = ( type-selector? subclass-selector* )!

    • combinator = whitespace | > | + | ~

    • type-selector = tag | *

    • subclass-selector = ( # id ) | ( . class ) | attribute-selector

    • attribute-selector = ( [ attr ] ) | ( [ attr attr-matcher ( value | string-token ) attr-modifier? ] )

    • attr-matcher = ~= | |= | ^= | $= | *= | =

    • attr-modifier = i | s

    • string-token = "value" | 'value'

    • whitespace = ( space | tab | CR | LF | FF )+

    Examples:

    #myId                Elements with id attribute equal to myId
    div.myClass          div elements with class attribute containing myClass
    div.myClass p        p elements that are descendants of myClass-class div elements
    .A, .B               Elements with class A or B
    .myClass > span      span elements that are children of myClass-class elements
    div[myAttr=myVal]    div elements with an attribute myAttr whose value is myVal

    Whitespace is permitted around (before/after) the selector; around (and as) a combinator; around a comma operator; and between the parts of an attribute-selector inside the square brackets. Comments (delimited by /* */, newlines permitted within) may appear between/around any parts in the grammar. Matches are case-insensitive, except for attribute-selector values, which match case-sensitively (unless the i attr-modifier is given). Backslash escapes are not supported. A tag must be an HTML 5 tag. Setting added in version 8.01.1664337014 20220927. Returns nonzero on success, 0 on error.
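
    The grammar lists the attr-matcher operators without spelling out what each one means. The Python sketch below shows their meaning in standard CSS Selectors; that Texis follows the standard semantics exactly is an assumption, and attr_matches is a hypothetical helper, not part of any Texis API.

```python
def attr_matches(actual, matcher, expected, modifier="s"):
    """Test an attribute value against an attr-matcher, per standard CSS.

    actual   -- the element's attribute value, or None if the attribute is absent
    modifier -- "s" (case-sensitive, the default here) or "i" (case-insensitive)
    """
    if actual is None:
        return False
    if modifier == "i":
        actual, expected = actual.lower(), expected.lower()
    if matcher == "=":    # exact match
        return actual == expected
    if matcher == "~=":   # one of a whitespace-separated list of words
        return bool(expected) and expected in actual.split()
    if matcher == "|=":   # exact match, or prefix followed by a hyphen
        return actual == expected or actual.startswith(expected + "-")
    if matcher == "^=":   # starts with
        return bool(expected) and actual.startswith(expected)
    if matcher == "$=":   # ends with
        return bool(expected) and actual.endswith(expected)
    if matcher == "*=":   # contains
        return bool(expected) and expected in actual
    raise ValueError("unknown attr-matcher: %r" % matcher)
```

    Under these rules, div[myAttr=myVal] would not match an element whose myAttr is "myval", whereas div[myAttr=myVal i] would.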

  • input (boolean) If true (default), the VALUE of <INPUT TYPE=text> tags is included in the formatted text obtained with <urlinfo text>. If the input setting is false, this text is deleted. (This setting is the same as ignoreinput but negated.) Added in version 3.01.985900000 20010329.

  • keeptextselectors (list) A list of CSS selectors to match elements (i.e. through and including balanced close tags, if defined) whose formatted text should be kept (e.g. in <urlinfo text>). Only text inside/part of kept elements (and outside ignoretextselectors elements, if given) is retained. If no keeptextselectors are given (the default), the entire document's text is considered kept.

    A limited subset of CSS selector syntax is supported; see ignoretextselectors for details. Setting added in version 8.01.1664337014 20220927. Returns nonzero on success, 0 on error. See also strictkeepselectors.

  • linelen (integer) Sets the formatted-text line length to word-wrap at. The default is 75. Note that some lines may be longer than this, e.g. if word-wrap is disabled due to a <PRE> or similar tag. A value of 0 will set the default (i.e. 75). A value of -1 means infinite (no word wrap). Added in version 5.01.1119969728 20050628. Returns previous setting.
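
    The special values 0 and -1 can be illustrated with a small Python sketch. It uses the standard textwrap module as a stand-in for the Texis formatter, whose exact wrapping rules are not described here; wrap_text is a hypothetical name.

```python
import textwrap

DEFAULT_LINELEN = 75


def wrap_text(text, linelen=0):
    """Word-wrap text the way the linelen values are described above."""
    if linelen == 0:       # 0 selects the default
        linelen = DEFAULT_LINELEN
    if linelen == -1:      # -1 means infinite: no word wrap
        return text
    return textwrap.fill(text, width=linelen)
```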

  • minclrdiff (integer) The minimum foreground/background color difference that formatted text must have. If the color difference is less for a given section of text, the area will be blank instead.

    Sometimes extra padded keyword information - intended for web robots but not human users - is hidden in white-on-white text. This text is placed to artificially raise a page's visibility in a search engine. However, since it often contains verbose or even completely off-topic keywords, such hidden text can be misleading. By setting minclrdiff this user-hidden text can be stripped from a Vortex-fetched page as well, and the resulting page ranked on its user-visible content only.

    The color difference is defined as: abs(R1 - R2) + abs(G1 - G2) + abs(B1 - B2), where R1, etc. are the RGB values for two colors. The default is 0, which implies that all text is included.
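
    The formula above is easy to check by hand or in a few lines of Python. In this sketch the function names are hypothetical, colors are (R, G, B) tuples with components 0-255, and text counts as visible when its difference is not less than minclrdiff, following the description above.

```python
def color_difference(rgb1, rgb2):
    """abs(R1 - R2) + abs(G1 - G2) + abs(B1 - B2) for two (R, G, B) tuples."""
    r1, g1, b1 = rgb1
    r2, g2, b2 = rgb2
    return abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2)


def is_visible(foreground, background, minclrdiff=0):
    """Text is blanked only if its color difference is less than minclrdiff."""
    return color_difference(foreground, background) >= minclrdiff
```

    White on white has a difference of 0 and black on white the maximum of 765. With the default minclrdiff of 0 both are kept; with a value such as 30, white (255,255,255) text on a near-white (250,250,250) background (difference 15) would be blanked.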

  • nestcomment (boolean) Turn on or off nesting of HTML comments. With nesting on, comments may be nested. The default is off. Added in version 3.01.986950000 20010410. Aka nestedcomment/nestcomments/nestedcomments.

  • select (boolean) If true (default), the text within <SELECT> blocks is included in the formatted text obtainable with <urlinfo text>. If the select setting is false, this text is deleted. (This setting is the same as ignoreselect but negated.) Added in version 3.01.985900000 20010329.

  • showwidgets [add|del|set] [radio|select|...|all] [...] Which <form> input widgets to display in formatted text, with square brackets or parentheses. Selected or checked widgets are further indicated with an asterisk. Displaying widgets can be useful to visualize where they are in relation to text, and which are selected/checked. The widgets to show are specified in one or more list arguments; they may be one or more of button, checkbox, file, hidden, image, password, radio, reset, select, select-one, select-multiple, submit, text, textarea, or all for all widgets. If the first token in the list is add, the widgets are added to the display list; if del, removed; if set (the default), the list is cleared and set to the specified list. The single widget default may also be specified (with set) to restore the default setting, which is to show no widgets. Added in version 5.01.1262085000 20091229.
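
    The add/del/set behavior of the list argument can be sketched in Python as follows. This only models the display list, not the formatting itself; apply_showwidgets is a hypothetical name, and treating select, select-one and select-multiple as independent names is an assumption.

```python
ALL_WIDGETS = (
    "button", "checkbox", "file", "hidden", "image", "password", "radio",
    "reset", "select", "select-one", "select-multiple", "submit", "text",
    "textarea",
)


def apply_showwidgets(current, tokens):
    """Return the new set of widgets to show after applying one list argument."""
    tokens = list(tokens)
    action = "set"  # the default when no action token is given
    if tokens and tokens[0] in ("add", "del", "set"):
        action = tokens.pop(0)
    named = set()
    for token in tokens:
        if token == "all":
            named.update(ALL_WIDGETS)
        elif token == "default":
            if action != "set":
                raise ValueError("default may only be used with set")
            # The default setting shows no widgets, so nothing is added.
        elif token in ALL_WIDGETS:
            named.add(token)
        else:
            raise ValueError("unknown widget: %r" % token)
    if action == "add":
        return set(current) | named
    if action == "del":
        return set(current) - named
    return named
```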

  • strictcomment (boolean) Turn on or off strict HTML comment parsing. With strict comments on, comments must start with "<!--", not just "<!". The default is on. Added in version 3.01.986950000 20010410. Aka strictcomments.

  • strictkeepselectors (boolean) Whether keeptextselectors and keeprefsselectors matching should be strict. If true, only text/refs matched by such selectors will be kept; i.e. if they match nothing, or there are no such selectors, nothing will be kept. If false (the default), all text/refs are kept in such instances. For example, if keeptextselectors were set to keep <article>s within documents (to trim cruft outside of them), documents without any <article>s at all would still be kept in their entirety, if strictkeepselectors is false. Added in version 8.01.1664337014 20220927. Returns previous flag value. (Ignore-type selectors are effectively always strict.)
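
    The strict and non-strict cases reduce to a small decision, sketched here in Python. The function kept_text is hypothetical and works on lists of text chunks rather than parsed HTML; matched_text stands for whatever the keep selectors matched, and is empty both when they match nothing and when no selectors are set.

```python
def kept_text(all_text, matched_text, strict=False):
    """Return the text chunks retained under strictkeepselectors semantics."""
    if matched_text:
        return list(matched_text)   # selectors matched: keep only the matches
    if strict:
        return []                   # strict: nothing matched, nothing kept
    return list(all_text)           # non-strict (default): keep everything
```

    In the <article> example above, a document without any <article> corresponds to an empty matched_text: the whole document is returned when strict is false, and nothing when it is true.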

  • strike (boolean) If true (default), the text within <STRIKE> blocks is included in the formatted text obtained with <urlinfo text>. If the strike setting is false, this text is deleted. (This setting is the same as ignorestrike but negated.)

  • textarea (boolean) If true (default), the text within <TEXTAREA> blocks is included in the formatted text obtained with <urlinfo text>. If the textarea setting is false, this text is deleted. (This setting is the same as ignoretextarea but negated.) Added in version 3.01.985900000 20010329.

  • xmltags (boolean) Turns on or off interpretation of XML tags as tags. If on, tags that start with <? will be interpreted as unknown HTML tags, i.e. suppressed from the formatted text. If off, such tags will be taken as text and will appear in the formatted text output. Added (and defaults to on) in version 5.01.1105303759 20050109. Returns previous setting.

  • utf8badencasiso88591 (boolean)

    If true (default), invalid bytes in UTF-8 source documents will be interpreted as ISO-8859-1 characters (and converted to charsettxt). If false, such bytes are replaced with question marks ("?") as with other failed conversions. ISO-8859-1 is (erroneously) placed in UTF-8 text often enough that this assumption can generally be made. Note that such conversions still cause an error message and non-zero <urlinfo errnum>, however, unless utf8badencasiso88591err is set to false; this alerts the user to the erroneous document. Added in version 5.01.1244765000 20090611.

  • utf8badencasiso88591err (boolean) If true (default), the interpretation of invalid UTF-8 bytes as ISO-8859-1 (when utf8badencasiso88591 is true) still causes an error to be reported (if charsetmsgs is true) and returned in errnum. If false, no error message is generated nor error returned. Note that if utf8badencasiso88591 is false, utf8badencasiso88591err is ignored, as invalid UTF-8 bytes are then treated as any other failed conversion (mapped to question mark). Added in version 5.01.1244765000 20090611.
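
    The interaction of utf8badencasiso88591 and utf8badencasiso88591err can be sketched in Python using custom codec error handlers. This is only an approximation of the behavior described above, not the Texis converter: decode_utf8, the handler names and the returned had_error flag (standing in for the error message and errnum) are all hypothetical.

```python
import codecs


def _latin1_fallback(error):
    """Interpret bytes that are invalid UTF-8 as ISO-8859-1 characters."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    bad = error.object[error.start:error.end]
    return bad.decode("iso-8859-1"), error.end


def _question_mark(error):
    """Replace each invalid byte with "?" like other failed conversions."""
    if not isinstance(error, UnicodeDecodeError):
        raise error
    return "?" * (error.end - error.start), error.end


codecs.register_error("latin1-fallback", _latin1_fallback)
codecs.register_error("question-mark", _question_mark)


def decode_utf8(data, bad_as_iso88591=True, report_error=True):
    """Return (text, had_error) for a byte string labelled as UTF-8."""
    try:
        return data.decode("utf-8"), False
    except UnicodeDecodeError:
        pass
    if bad_as_iso88591:
        # Error reporting here is governed by the second setting.
        return data.decode("utf-8", errors="latin1-fallback"), report_error
    # With the fallback off, the error setting is ignored.
    return data.decode("utf-8", errors="question-mark"), True
```

    For instance, the bytes "caf" + 0xE9 + " ok" are not valid UTF-8; with both settings true the sketch returns "café ok" and flags an error, and with the fallback off it returns "caf? ok".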

Copyright © Thunderstone Software     Last updated: Apr 15 2024