Query Processing

These <apicp> settings affect how a query is processed, e.g. what documents it will match.

  • defsuffrm aka defsufrm (boolean, on by default) Whether to remove a trailing vowel, or one consonant of a trailing double-consonant pair, after normal suffix processing, provided the word is still at least minwordlen characters long. This only has an effect if suffix processing is enabled (suffixproc on and the original word is at least minwordlen long). Added in version 3.0.941600000 19991102.

  • edexp (string, empty by default) The default end delimiter expression.

  • eqprefix (string) The name of the equivalence file. Default is builtin, which uses the built-in equivalence list.

  • exactphrase (tri-state, off by default, on by default in tsql) Whether to exactly resolve the noise words in phrases. If on, a phrase such as "state of the art" will only match those exact words; however this may require post-processing to resolve the noise words "of the" (potentially slower). If off, any word is permitted in place of the noise words, and no post-processing is done: faster but potentially less accurate. In version 5.01.1178072161 20070501 and later, may be set to ignorewordposition: same as off, but non-noise words are permitted in any order or position; essentially emulates behavior of a non-inverted Metamorph index with no post-processing, but on a Metamorph inverted index too. Note: In tsql version 5 and earlier the default was on.
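
    The three exactphrase modes can be compared with a rough Python model. This is an illustrative sketch only, not Texis code: the function name phrase_matches, the pre-split word lists and the small NOISE set (a subset of the default noise list) are assumptions made for the example.

```python
# Toy model of the exactphrase modes; not Texis code.
NOISE = {"of", "the", "a", "in"}  # small subset of the default noise list


def phrase_matches(phrase, text, mode="off"):
    """Return True if the word list `phrase` matches the word list `text`."""
    if mode == "ignorewordposition":
        # Non-noise words may appear in any order or position.
        return all(w in text for w in phrase if w not in NOISE)
    n = len(phrase)
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        if mode == "on":
            ok = window == phrase  # every word, noise included, must match
        else:
            # "off": any word is accepted where the phrase has a noise word
            ok = all(p in NOISE or p == t for p, t in zip(phrase, window))
        if ok:
            return True
    return False


phrase = "state of the art".split()
loose = "the state in which art thrives".split()
swapped = "art of the state".split()
print(phrase_matches(phrase, "a state of the art design".split(), "on"))  # True
print(phrase_matches(phrase, loose, "on"))                    # False
print(phrase_matches(phrase, loose, "off"))                   # True
print(phrase_matches(phrase, swapped, "off"))                 # False
print(phrase_matches(phrase, swapped, "ignorewordposition"))  # True
```

    The model leaves out the cost side of the trade-off: in Texis, the "on" mode may need post-processing to resolve the noise words, which this sketch does not simulate.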

  • inced (boolean, on by default) Whether to include the end delimiters in hits. Ignored for w/N (within N chars or words) delimiters.

  • incsd (boolean, off by default) Whether to include the start delimiters in hits. Ignored for w/N (within N chars or words) delimiters.

  • intersects (integer, -1 by default) The default number of intersections (if not given in a query).

  • keepeqvs (boolean, off by default) Whether to use equivalences for words/phrases found in the equivalence file(s) or not.

  • keepnoise (boolean, off by default) Whether to preserve noise words in the query during processing.

  • minwordlen (integer, 255 by default) The minimum word length for prefix/suffix processing to occur. Note that this is different from qminwordlen, which is the minimum word length allowed.

  • noise (list) The noise word list used during query processing. The default noise list is:

    a between got me she upon
    about but gotten mine should us
    after by had more so very
    again came has most some was
    ago can have much somebody we
    all cannot having my someone went
    almost come he myself something were
    also could her never stand what
    always did here no such whatever
    am do him none sure what's
    an does his not take when
    and doing how now than where
    another done i of that whether
    any down if off the which
    anybody each in on their while
    anyhow else into one them who
    anyone even is onto then whoever
    anything ever isn't or there whom
    anyway every it our these whose
    are everyone just ourselves they why
    as everything last out this will
    at for least over those with
    away from left per through within
    back front less put till without
    be get let putting to won't
    became getting like same too would
    because go make saw two wouldn't
    been goes many see unless yet
    before going may seen until you
    being gone maybe shall up your

  • olddelim (boolean, off by default) Whether to emulate "old" delimiter behavior. If turned on, it is possible for a hit to occur outside dissimilar start and end delimiters, such as in this example text:

    start-delim ... end-delim ... hit ... start-delim ... end-delim

    Here the hit is "within" the outermost start and end delimiters, but it's not within the nearest delimiters. With olddelim off (the default), this hit now does not match: it would have to occur within the nearest delimiters, which would have to be in the correct order. (Added in version 3.0.950300000 20000211. Previous versions behave as if olddelim were on.)
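
    The difference between the two behaviors can be modeled on plain offsets. The following Python sketch is an assumption-laden illustration, not Texis code: hit_within_delims is a hypothetical name, delimiter matches are given as ready-made offset lists, and reqsdelim/reqedelim are treated as on.

```python
# Toy model of olddelim; all positions are character offsets in the buffer.
def hit_within_delims(starts, ends, hit, olddelim=False):
    """starts/ends: offsets of start/end delimiter matches; hit: hit offset."""
    has_start_before = any(s < hit for s in starts)
    has_end_after = any(e > hit for e in ends)
    if not (has_start_before and has_end_after):
        return False
    if olddelim:
        # Old behavior: any enclosing start ... end pair is enough.
        return True
    # New behavior: the nearest delimiter on each side must be the right kind.
    nearest_before = max(p for p in starts + ends if p < hit)
    nearest_after = min(p for p in starts + ends if p > hit)
    return nearest_before in starts and nearest_after in ends


# start-delim(0) ... end-delim(10) ... hit(20) ... start-delim(30) ... end-delim(40)
starts, ends = [0, 30], [10, 40]
print(hit_within_delims(starts, ends, 20, olddelim=True))   # True
print(hit_within_delims(starts, ends, 20, olddelim=False))  # False
print(hit_within_delims(starts, ends, 5, olddelim=False))   # True
```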

  • phrasewordproc (string) Which words of a phrase to do suffix/wildcard processing on. The possible values are mono to treat the phrase as a monolithic word (i.e. only last word processed, but entire phrase counts towards minwordlen); none for no suffix/wildcard processing on phrases; or last to process just the last word (default). Note that a phrase is multi-word, i.e. a single word in double-quotes is not considered a phrase, and thus phrasewordproc does not apply. Added in version 4.03.1082000000 20040414. Mode none supported in version 5.01.1127760000 20050926.

  • prefix (list) The prefix list used for prefix processing (if enabled) during search. The default prefix list is:

    ante anti arch auto be bi counter de dis em en ex extra fore hyper in inter mis non post pre pro re semi sub super ultra un

  • prefixproc (boolean, off by default) Whether to do prefix processing.

  • rebuild (boolean, on by default) Whether to do word rebuilding.

  • reqsdelim, reqedelim (boolean, on by default) Whether to require the start (reqsdelim) or end (reqedelim) delimiter to actually be present in a hit. If these are turned off, then the given delimiter need not be found for a hit to match; it's as if the delimiter were "found" at the start or end of the buffer if not present. (Added in version 3.0.950300000 20000211. Previous versions behave as if these settings were off.)
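
    As a rough illustration of the "found at the start or end of the buffer" rule, the Python sketch below locates the region around a hit. It is not Texis code: delimited_region is a hypothetical helper, Python regular expressions stand in for the delimiter expressions, and incsd/inced are not modeled (the returned region excludes both delimiters).

```python
import re


def delimited_region(buf, hit_start, hit_end, sd, ed,
                     reqsdelim=True, reqedelim=True):
    """Return the text between the delimiters around a hit, or None."""
    # Nearest start delimiter that ends at or before the hit.
    before = [m.end() for m in re.finditer(sd, buf[:hit_start])]
    if before:
        left = before[-1]
    elif not reqsdelim:
        left = 0  # treat the delimiter as "found" at the buffer start
    else:
        return None
    # First end delimiter at or after the end of the hit.
    after = re.search(ed, buf[hit_end:])
    if after:
        right = hit_end + after.start()
    elif not reqedelim:
        right = len(buf)  # treat the delimiter as "found" at the buffer end
    else:
        return None
    return buf[left:right]


buf = "one two\n\nthree four"
i = buf.index("three")
print(delimited_region(buf, i, i + 5, r"\n\n", r"\n\n"))                   # None
print(delimited_region(buf, i, i + 5, r"\n\n", r"\n\n", reqedelim=False))  # three four
```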

  • sdexp (string, empty by default) The default start delimiter expression.

  • see (boolean, off by default) Whether to look up "see also" references during equivalence lookup.

  • stringcomparemode (string) Mode and flags for string compares, e.g. equals, less-than etc. It also controls the default mode for most string functions, e.g. <strfold>, <xtree> and <sort>, and the non-case-style flags/mode for the functions lower, upper and initcap. Its value is the same format as the textsearchmode setting, but the default is "unicodemulti, respectcase" - i.e. characters must be identical to match, though ISO-8859-1 vs. UTF-8 encoding may be ignored. Added in version 6. The version 5 and earlier behavior was effectively "ctype, respectcase, iso-8859-1". In version 7 (or compatibilityversion 7) and later, strlst comparisons also use stringcomparemode.

    A regular (B-tree) index will always use the stringcomparemode value that was set at its creation, not the current value. However, when multiple regular indexes exist on the same fields, at search time the Texis optimizer will attempt to use the index whose (creation-time) stringcomparemode is closest to the current value. This allows some dynamic flexibility in supporting queries with different stringcomparemode values (e.g. case-sensitive vs. insensitive). Caveat: A Texis version 5 or earlier should not access or modify a B-tree index created by a version 6 or later Texis, unless stringcomparemode was set to "ctype, respectcase, iso-8859-1" at creation, or index corruption may result, especially if there are hi-bit/Unicode/UTF-8 characters.

  • suffix (list) The suffix list used for suffix processing (if enabled) during search. The default suffix list is:

    ' (single quote) able age aged ager ages al ally ance anced ancer ances ant ary at ate ated ater atery ates atic ed en ence enced encer ences end ent er ery es ess est ful ial ible ibler ic ical ice iced icer ices ics ide ided ider ides ier ily ing ion ious ise ised ises ish ism ist ity ive ived ives ize ized izer izes less ly ment ncy ness nt ory ous re red res ry s ship sion th tic tion ty ual ul ward

  • suffixeq (list) The suffix list used for suffix processing during equivalence lookup. The default suffixeq list is:

    ' (single quote) ies s

  • suffixproc (boolean, on by default) Whether to do suffix processing.

  • textsearchmode (string) Mode and flags for text searches. This controls case-sensitivity and other character-folding aspects of Metamorph text searches. The value consists of a comma-separated list of values: a case-folding style, zero or more optional flags, and a case-folding mode.

    The textsearchmode setting may be altered - instead of cleared and set - by using "+" or "-" in front of the given values to denote adding or removing just those values, rather than clearing the whole setting first. This makes it easier to alter just the desired parts, without having to specify the remainder of the setting. E.g. "+respectcase, ignorewidth, -expandligatures" sets the case style to case-sensitive, turns on ignorewidth and turns off ligature expansion, without changing other flags such as ignorediacritics. (Note that negation ("-") can only be used with values that are "on/off", i.e. the flags; case style and case mode cannot be negated.) "+" and "-" remain in effect for following values, until another "+", "-" or "=" (clear the setting first) is given.
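
    The incremental "+"/"-" form can be modeled in a few lines of Python. This is a sketch of the rules stated above, not Texis's parser: alter_mode and the dict layout are assumptions, only canonical value names are recognized (no aka names or aliases), and the "=" and unprefixed clear-and-set forms are deliberately not modeled.

```python
STYLES = {"respectcase", "ignorecase", "uppercase", "lowercase", "titlecase"}
MODES = {"unicodemulti", "unicodemono", "ctype"}
FLAGS = {"iso-8859-1", "expanddiacritics", "ignorediacritics",
         "expandligatures", "ignorewidth"}


def alter_mode(current, spec):
    """current: dict with 'style', 'mode', 'flags'; returns an updated copy."""
    new = {"style": current["style"], "mode": current["mode"],
           "flags": set(current["flags"])}
    op = None
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        if token[0] in "+-":
            # A new prefix stays in effect for the values that follow it.
            op, token = token[0], token[1:].strip()
        if op is None:
            raise ValueError("this sketch only models specs starting with + or -")
        if token in FLAGS:
            if op == "+":
                new["flags"].add(token)
            else:
                new["flags"].discard(token)
        elif token in STYLES or token in MODES:
            if op == "-":
                raise ValueError("case style and case mode cannot be negated")
            new["style" if token in STYLES else "mode"] = token
        else:
            raise ValueError("unknown value: " + token)
    return new


# Start from the version 6 default textsearchmode.
default = {"style": "ignorecase", "mode": "unicodemulti",
           "flags": {"ignorewidth", "ignorediacritics", "expandligatures"}}
result = alter_mode(default, "+respectcase, ignorewidth, -expandligatures")
print(result["style"], result["mode"], sorted(result["flags"]))
# respectcase unicodemulti ['ignorediacritics', 'ignorewidth']
```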

    The case-folding style determines what case to fold to; it is exactly one of:

    • respectcase aka preservecase aka casesensitive Do not change case at all, for case-sensitive searches.

    • ignorecase aka igncase aka caseinsensitive Fold case for caseless (case-insensitive) matching; this is the default style for textsearchmode. This typically (but not always) means characters are folded to their lowercase equivalents.

    • uppercase Fold to uppercase. Note: This style is for functions that actually return a string, e.g. <strfold>; it should not be used in comparison situations such as indexes and searches as its comparison behavior is undefined. See the stringcomparemode setting.

    • lowercase Fold to lowercase. Note: This style is for functions that actually return a string, e.g. <strfold>; it should not be used in comparison situations such as indexes and searches as its comparison behavior is undefined. See the stringcomparemode setting.

    • titlecase Fold to titlecase. Titlecase means the first character of a word is uppercased, while the rest of the word is lowercased. Note: This style is for functions that actually return a string, e.g. <strfold>; it should not be used in comparison situations such as indexes and searches as its comparison behavior is undefined. See the stringcomparemode setting.

    Any combination of zero or more of the following flags may be given in addition to a case style:

    • iso-8859-1 aka iso88591 Interpret text as ISO-8859-1 encoded. This should only be used if all text is known to be in this character set. Only codepoints U+0001 through U+00FF can be supported. Any UTF-8 text will be misinterpreted.

      If this flag is disabled (the default), text is interpreted as UTF-8, and invalid bytes (if any) are interpreted as ISO-8859-1. This supports all UTF-8 characters, as well as most typical ISO-8859-1 data, if any happens to be accidentally mixed in.

      Typically, this flag is left disabled, and text is stored in UTF-8, since it supports a broader range of characters. Any other character set besides UTF-8 or ISO-8859-1 is not supported, and should be mapped to UTF-8.

    • utf-8 aka utf8 Alias for negating iso-8859-1, i.e. specifying this disables the iso-8859-1 flag.

    • expanddiacritics aka expdiacritics Expand certain phonological diacritics: umlauts over "a", "o", "u" expand to the vowel plus "e" (for German, e.g. "für" matches "fuer"); circumflexes over "e" and "o" expand to the vowel plus "s" (for French, e.g. "hôtel" matches "hostel"). The expanded "e" or "s" is optional-match - e.g. "für" also matches "fur" - but only against a non-optional char; i.e. "hôtel" does not match "hötel" (the "e" and "s" collide), and "für" does not match "füer" (both optional "e"s must match each other). Also, neither the vowel nor the "e"/"s" will match an ignorediacritics-stripped character; this prevents "für" from matching "fuér".

    • ignorediacritics aka igndiacritics Ignore diacritic marks - Unicode non-starter or modifier symbols resulting from NFD decomposition - e.g. diaeresis, umlaut, circumflex, grave, acute, tilde etc.

    • expandligatures aka expligatures Expand ligatures, e.g. "œ" (U+0153) will match "oe". Note that even with this flag off, certain ligatures may still be expanded if necessary for case-folding under ignorecase with case mode unicodemulti; see below.

    • ignorewidth aka ignwidth Ignore half- and full-width differences, e.g. for katakana and ASCII.

    Due to interactions between flags, they are applied in the order specified above, followed by case folding according to the case style (upper/lower etc.). E.g. expanddiacritics is applied before ignorediacritics, because otherwise the latter would strip the characters that the former expands.
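
    For readers who want to see what two of these flags amount to, here is a Python approximation using the standard unicodedata module. It is only an analogy, not Texis's implementation: strip_diacritics drops combining (non-starter) characters after NFD decomposition but does not handle modifier symbols, and fold_width uses NFKC, which folds more than just half- and full-width forms. Both function names are made up for this example.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks (non-starters) left after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)


def fold_width(text):
    """Approximate ignorewidth with NFKC (a superset of width folding)."""
    return unicodedata.normalize("NFKC", text)


print(strip_diacritics("für"))    # fur
print(strip_diacritics("hôtel"))  # hotel
# Full-width Latin letters spelling "Texis", written as escapes for clarity.
print(fold_width("\uff34\uff45\uff58\uff49\uff53"))  # Texis
```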

    A case-folding mode may also be given in addition to the above; this determines how the case-folding style (e.g. upper/lower/title) is actually applied. It is one of the following:

    • unicodemulti Use the builtin Unicode 5.1.0 1-to-N-character folding tables. All locale-independent Unicode characters with the appropriate case equivalent are folded. A single character may fold to up to 3 characters, if needed; e.g. the German es-zett character (U+00DF) will match "ss" and vice-versa under ignorecase. Note that additional ligature expansions may happen if expandligatures is set.

    • unicodemono Use the builtin Unicode 5.1.0 1-to-1-character folding tables. All locale-independent Unicode characters with the appropriate case equivalent are folded. Note that even though this mode is 1-to-1-character, it is not necessarily 1-to-1-byte, i.e. a UTF-8 string may still change its byte length when folded, even though the Unicode character count will remain the same.

    • ctype Use the C ctype.h functions. Case folding will be OS- and locale-dependent; a locale should be set with the SQL locale property. Only codepoints U+0001 through U+00FF can be folded; e.g. most Western European characters are folded, but Cyrillic, Greek etc. are not. Note that while this mode is 1-to-1-character, it is not necessarily 1-to-1-byte, unless the iso-8859-1 flag is also in effect. This mode was part of the default in version 5 and earlier.

    The default case-folding mode is unicodemulti; see below for the version 5 and earlier default, and important caveats.
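
    The 1-to-N versus 1-to-1 distinction, and the note about byte lengths, can be observed with Python's built-in string methods. Python's own (newer) Unicode tables stand in for Texis's builtin 5.1.0 tables here, and str.lower() is used only as a rough stand-in for a 1-to-1 mapping, so treat this as an analogy rather than a reproduction of Texis behavior.

```python
# 1-to-N folding (compare unicodemulti): es-zett folds to "ss".
print("Straße".casefold() == "STRASSE".casefold())  # True

# A 1-to-1 mapping (compare unicodemono) leaves es-zett as a single character.
print("ß".lower() == "ß")  # True

# 1-to-1 in characters is not 1-to-1 in bytes: U+023A lowercases to U+2C65.
upper = "\u023a"
lower = upper.lower()
print(len(upper), len(lower))                                  # 1 1
print(len(upper.encode("utf-8")), len(lower.encode("utf-8")))  # 2 3
```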

    In addition to the above styles, flags and modes, several aliases may be used, and mixed with flags. The aliases have the form:

    [stringcomparemode|textsearchmode][default|builtin]
    where stringcomparemode or textsearchmode refers to that setting's value (if not given: the setting being modified). default refers to the default value (modifiable with texis.ini); builtin refers to the builtin factory default; no suffix refers to the current setting value. E.g. "stringcomparemodedefault,+ignorecase" would obtain the default stringcomparemode setting (from texis.ini if available), but set the case style to ignorecase.
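
    The alias grammar is small enough to express as one regular expression. The Python sketch below only shows how an alias name splits into its two optional parts; parse_alias and the "current" label are assumptions for illustration, and nothing here looks up real setting values.

```python
import re

ALIAS_RE = re.compile(r"(stringcomparemode|textsearchmode)?(default|builtin)?")


def parse_alias(token, setting_being_modified):
    """Return (setting, source) for an alias token, or None if not an alias."""
    m = ALIAS_RE.fullmatch(token)
    if not token or m is None:
        return None
    setting = m.group(1) or setting_being_modified  # omitted: setting being modified
    source = m.group(2) or "current"                # no suffix: current value
    return setting, source


print(parse_alias("stringcomparemodedefault", "textsearchmode"))
# ('stringcomparemode', 'default')
print(parse_alias("builtin", "textsearchmode"))
# ('textsearchmode', 'builtin')
print(parse_alias("textsearchmode", "stringcomparemode"))
# ('textsearchmode', 'current')
print(parse_alias("ignorecase", "textsearchmode"))
# None
```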

    A Metamorph index always uses the textsearchmode value that was set at its initial creation, not the current value. However, when multiple Metamorph indexes exist on the same fields, at search time the Texis optimizer will attempt to use the index whose (creation-time) textsearchmode is closest to the current value.

    The textsearchmode setting was added in Texis version 6; its default is "unicodemulti, ignorecase, ignorewidth, ignorediacritics, expandligatures" (note that UTF-8 text is expected, since iso-8859-1 is not specified in the default). In version 5 and earlier the default was effectively "ctype, ignorecase, iso-8859-1". Caveat: A Texis version 5 or earlier should not access or modify a Metamorph index created by a version 6 or later Texis, unless textsearchmode was set to "ctype, ignorecase, iso-8859-1" at creation, or index corruption may result, especially if there are hi-bit/Unicode/UTF-8 characters.

  • ueqprefix (string) The name of the user equivalence file. Default is empty.

  • withinmode (string) A space- or comma-separated unit and optional type for the "within-N" operator (e.g. w/5). The unit is one of:

    • char for within-N characters

    • word for within-N words

    The optional type determines what distance the operator measures. It is one of the following:

    • radius (the default if no type specified when set) indicates all sets must be within a radius N of an "anchor" set, i.e. there is a set in the match such that all other sets are within N units right of its right edge or N units left of its left edge.

    • span indicates all sets must be within an N-unit span

    Added in version 4.03.1081200000 20040405. The optional type was added in version 5.01.1258712000 20091120; previously the only type was implicitly radius. The default setting for version 5 and earlier is char (i.e. char radius); in version 6 and later the default is word span.
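
    To make the radius and span types concrete, the following Python sketch checks both on the same set positions. It is a simplified model and not Texis code: within is a hypothetical function, each set is an assumed (start, end) pair with an exclusive end, measured in the chosen unit, and the exact edge-counting rules in Texis may differ.

```python
def within(sets, n, kind="radius"):
    """sets: list of (start, end) offsets, end exclusive, in chars or words."""
    if kind == "span":
        # Everything must fit inside one window of n units.
        return max(e for _, e in sets) - min(s for s, _ in sets) <= n
    # radius: some anchor set has every other set within n units of its edges.
    for a_start, a_end in sets:
        if all(s >= a_start - n and e <= a_end + n for s, e in sets):
            return True
    return False


# Three one-word sets at word positions 0, 4 and 8, queried with w/5.
sets = [(0, 1), (4, 5), (8, 9)]
print(within(sets, 5, "radius"))  # True: the middle set can serve as the anchor
print(within(sets, 5, "span"))    # False: the sets stretch across 9 words
```

    In this model span is the stricter type: any match under span also matches under radius with the same N, but not the other way around.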

  • withinproc (boolean, on by default) Whether to process the w/ operator in queries.


DIAGNOSTICS
apicp returns nothing.


EXAMPLE

<apicp "alpostproc" "on">


CAVEATS
The apicp function was added Sep. 13 1996. Various settings were added since then and are unknown to previous versions.

Any apicp calls should take place after USER/PASS statements, but before SQL and fmt calls.

The ability to pass multiple $value arguments for string-list settings was added in version 3.0.996300000 20010728.


SEE ALSO
apiinfo, USER PASS, Metamorph hit markup

The Metamorph Linguistics chapter in the Texis manual


Copyright © Thunderstone Software     Last updated: Apr 15 2024