These <apicp> settings affect how a query is processed,
e.g. which documents it will match.
defsuffrm
aka defsufrm
(boolean, on by default)
Whether to remove a trailing vowel, or one of a trailing double
consonant pair, after normal suffix processing, provided the word is
still minwordlen characters or longer. This only has effect if
suffix processing is enabled (suffixproc on and the
original word is at least minwordlen long). Added in
version 3.0.941600000 19991102.
edexp
(string, empty by default)
The default end delimiter expression.
eqprefix
(string)
The name of the equivalence file. Default is builtin,
which uses the built-in equivalence list.
exactphrase
(tri-state, off by default, on by default in tsql)
Whether to exactly resolve the noise words in phrases. If on, a
phrase such as "state of the art" will only match those
exact words; however this may require post-processing to resolve
the noise words "of the" (potentially slower). If off, any word
is permitted in place of the noise words, and no post-processing
is done: faster but potentially less accurate. In version
5.01.1178072161 20070501 and later, may be set to
ignorewordposition: same as off, but non-noise words are
permitted in any order or position; this essentially emulates the
behavior of a non-inverted Metamorph index with no post-processing,
but on a Metamorph inverted index too.
Note: In tsql version 5 and earlier the default was on.
inced
(boolean, on by default)
Whether to include the end delimiters in hits. Ignored for
w/N (within N chars or words) delimiters.
incsd
(boolean, off by default)
Whether to include the start delimiters in hits. Ignored for
w/N (within N chars or words) delimiters.
intersects
(integer, -1 by default)
The default number of intersections (if not given in a query).
keepeqvs
(boolean, off by default)
Whether to use equivalences for words/phrases found in the
equivalence file(s).
keepnoise
(boolean, off by default)
Whether to preserve noise words in the query during processing.
minwordlen
(integer, 255 by default)
The minimum word length for prefix/suffix processing to occur. Note
that this is different from qminwordlen, which is the
minimum word length allowed.
noise
(list)
The noise word list used during query processing. The default noise
list is:
a | between | got | me | she | upon |
about | but | gotten | mine | should | us |
after | by | had | more | so | very |
again | came | has | most | some | was |
ago | can | have | much | somebody | we |
all | cannot | having | my | someone | went |
almost | come | he | myself | something | were |
also | could | her | never | stand | what |
always | did | here | no | such | whatever |
am | do | him | none | sure | what's |
an | does | his | not | take | when |
and | doing | how | now | than | where |
another | done | i | of | that | whether |
any | down | if | off | the | which |
anybody | each | in | on | their | while |
anyhow | else | into | one | them | who |
anyone | even | is | onto | then | whoever |
anything | ever | isn't | or | there | whom |
anyway | every | it | our | these | whose |
are | everyone | just | ourselves | they | why |
as | everything | last | out | this | will |
at | for | least | over | those | with |
away | from | left | per | through | within |
back | front | less | put | till | without |
be | get | let | putting | to | won't |
became | getting | like | same | too | would |
because | go | make | saw | two | wouldn't |
been | goes | many | see | unless | yet |
before | going | may | seen | until | you |
being | gone | maybe | shall | up | your |
olddelim
(boolean, off by default)
Whether to emulate "old" delimiter behavior. If turned on, it
is possible for a hit to occur outside dissimilar start and end
delimiters, such as in this example text:
start-delim ... end-delim ... hit ... start-delim ... end-delim
Here the hit is "within" the outermost start and end
delimiters, but it is not within the nearest delimiters.
With olddelim off (the default), this hit does not
match: it would have to occur within the nearest delimiters, which
would have to be in the correct order. (Added in version
3.0.950300000 20000211. Previous versions behave as if
olddelim were on.)
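The difference between the two olddelim behaviors can be sketched with a small model. This is an illustrative Python sketch, not Texis code: the token letters and the function name hit_matches are hypothetical, and it assumes both delimiters are required (it ignores reqsdelim and reqedelim).

```python
# Illustrative model only; not Texis code. Tokens are "S" (start delimiter),
# "E" (end delimiter), "H" (the hit) and "." (any other text).

def hit_matches(tokens, olddelim=False):
    """Return True if the single hit "H" is accepted under this model."""
    h = tokens.index("H")
    before = [t for t in tokens[:h] if t in ("S", "E")]
    after = [t for t in tokens[h + 1:] if t in ("S", "E")]
    if olddelim:
        # Old behavior: any start delimiter before the hit and any end
        # delimiter after it is enough, however far away they are.
        return "S" in before and "E" in after
    # Current behavior: the nearest delimiter on each side must be of the
    # right kind, i.e. a start before the hit and an end after it.
    return bool(before) and bool(after) and before[-1] == "S" and after[0] == "E"

# The example text above: start ... end ... hit ... start ... end
text = ["S", ".", "E", ".", "H", ".", "S", ".", "E"]
print(hit_matches(text, olddelim=True))   # True
print(hit_matches(text, olddelim=False))  # False
```

A hit that sits directly between a start and an end delimiter is accepted either way in this model.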
phrasewordproc
(string)
Which words of a phrase to do suffix/wildcard processing on. The
possible values are mono to treat the phrase as a
monolithic word (i.e. only the last word is processed, but the entire
phrase counts towards minwordlen); none for no
suffix/wildcard processing on phrases; or last to process just
the last word (default).
Note that a phrase is multi-word, i.e. a single word in double quotes
is not considered a phrase, and thus phrasewordproc does not apply.
Added in version 4.03.1082000000 20040414. Mode none
supported in version 5.01.1127760000 20050926.
prefix
(list)
The prefix list used for prefix processing (if enabled) during search.
The default prefix list is:
ante anti arch auto be bi counter de dis em en ex extra fore hyper in inter mis non post pre pro re semi sub super ultra un
prefixproc
(boolean, off by default)
Whether to do prefix processing.
rebuild
(boolean, on by default)
Whether to do word rebuilding.
reqsdelim, reqedelim
(boolean, on by default)
Whether to require the start (reqsdelim) or end
(reqedelim) delimiter to actually be present in a hit. If
these are turned off, then the given delimiter need not be found
for a hit to match; it's as if the delimiter were "found" at the
start or end of the buffer if not present. (Added in version
3.0.950300000 20000211. Previous versions behave as if these
settings were off.)
sdexp
(string, empty by default)
The default start delimiter expression.
see
(boolean, off by default)
Whether to look up "see also" references during
equivalence lookup.
stringcomparemode
(string)
Mode and flags for string compares, e.g. equals, less-than etc.
It also controls the default mode for most string functions,
e.g. <strfold>, <xtree> and <sort>, and the
non-case-style flags/mode for the functions lower,
upper and initcap. Its value is the same format as
the textsearchmode setting, but the default is
"unicodemulti, respectcase" - i.e. characters must be
identical to match, though ISO-8859-1 vs. UTF-8 encoding may be
ignored. Added in version 6. The version 5 and earlier behavior
was effectively "ctype, respectcase, iso-8859-1". In
version 7 (or compatibilityversion 7) and later,
strlst comparisons also use stringcomparemode.
A regular (B-tree) index will always use the
stringcomparemode value that was set at its creation, not
the current value. However, when multiple regular indexes exist
on the same fields, at search time the Texis optimizer will attempt
to use the index whose (creation-time) stringcomparemode is
closest to the current value. This allows some dynamic
flexibility in supporting queries with different
stringcomparemode values (e.g. case-sensitive
vs. insensitive). Caveat: A Texis version 5 or earlier
should not access or modify a B-tree index created by a
version 6 or later Texis, unless stringcomparemode was set
to "ctype, respectcase, iso-8859-1" at creation, or index
corruption may result, especially if there are
hi-bit/Unicode/UTF-8 characters.
suffix
(list)
The suffix list used for suffix processing (if enabled) during search.
The default suffix list is:
' (single quote) able age aged ager ages al ally ance anced ancer ances ant ary at ate ated ater atery ates atic ed en ence enced encer ences end ent er ery es ess est ful ial ible ibler ic ical ice iced icer ices ics ide ided ider ides ier ily ing ion ious ise ised ises ish ism ist ity ive ived ives ize ized izer izes less ly ment ncy ness nt ory ous re red res ry s ship sion th tic tion ty ual ul ward
suffixeq
(list)
The suffix list used for suffix processing during
equivalence lookup. The default suffixeq list is:
' (single quote) ies s
suffixproc
(boolean, on by default)
Whether to do suffix processing.
textsearchmode
(string)
Mode and flags for text searches. This controls case-sensitivity
and other character-folding aspects of Metamorph text
searches. The value consists of a
comma-separated list of values: a case-folding style, zero or more
optional flags, and a case-folding mode.
The textsearchmode setting may be altered - instead of
cleared and set - by using "+" or "-" in front
of the given values to denote adding or removing just those
values, rather than clearing the whole setting first. This makes
it easier to alter just the desired parts, without having to
specify the remainder of the setting. E.g. "+respectcase,
ignorewidth, -expandligatures" sets the case style to
case-sensitive, turns on ignorewidth and turns off ligature
expansion, without changing other flags such as
ignorediacritics. (Note that negation ("-") can
only be used with values that are "on/off", i.e. the flags; case
style and case mode cannot be negated.) "+" and
"-" remain in effect for following values, until another
"+", "-" or "=" (clear the setting
first) is given.
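The "+", "-" and "=" rules above can be sketched as a small parser. This is an illustrative Python model, not Texis code: the function name apply_mode_string and the (style, mode, flags) tuple are hypothetical, aliases such as utf-8 are omitted, and it assumes that clearing the setting resets the style to ignorecase and the mode to unicodemulti with no flags.

```python
# Illustrative model only; not Texis code. The value names come from the
# lists in this section; the data layout and reset values are assumptions.
STYLES = {"respectcase", "ignorecase", "uppercase", "lowercase", "titlecase"}
MODES = {"unicodemulti", "unicodemono", "ctype"}
FLAGS = {"iso-8859-1", "expanddiacritics", "ignorediacritics",
         "expandligatures", "ignorewidth"}

def apply_mode_string(current, spec):
    """Apply a textsearchmode-style string to a (style, mode, flags) tuple."""
    style, mode, flags = current[0], current[1], set(current[2])
    op = None  # no operator seen yet
    for raw in spec.split(","):
        token = raw.strip()
        explicit = bool(token) and token[0] in "+-="
        if explicit:
            op, token = token[0], token[1:].strip()
        elif op is None:
            # A leading value with no operator replaces the whole setting.
            op, explicit = "=", True
        if explicit and op == "=":
            # Assumed reset values; flags are cleared.
            style, mode, flags = "ignorecase", "unicodemulti", set()
        if not token:
            continue
        if token in FLAGS:
            if op == "-":
                flags.discard(token)
            else:
                flags.add(token)
        elif op == "-":
            raise ValueError("only flags can be negated: " + token)
        elif token in STYLES:
            style = token
        elif token in MODES:
            mode = token
        else:
            raise ValueError("unknown value: " + token)
    return style, mode, flags

start = ("ignorecase", "unicodemulti",
         {"ignorewidth", "ignorediacritics", "expandligatures"})
style, mode, flags = apply_mode_string(
    start, "+respectcase, ignorewidth, -expandligatures")
print(style, mode, sorted(flags))
# respectcase unicodemulti ['ignorediacritics', 'ignorewidth']
```

In this model ignorediacritics survives because the "+" and "-" operators touch only the values they are given.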
The case-folding style determines what case to fold to; it is exactly one of:
respectcase
aka preservecase
aka casesensitive
Do not change case at all, for case-sensitive searches.
ignorecase
aka igncase
aka caseinsensitive
Fold case for caseless (case-insensitive) matching; this is
the default style for textsearchmode. This typically
(but not always) means characters are folded to their
lowercase equivalents.
uppercase
Fold to uppercase. Note: This style is for functions
that actually return a string, e.g. <strfold>; it should
not be used in comparison situations such as indexes and
searches as its comparison behavior is undefined. See the
stringcomparemode setting.
lowercase
Fold to lowercase. Note: This style is for functions
that actually return a string, e.g. <strfold>; it should
not be used in comparison situations such as indexes and
searches as its comparison behavior is undefined. See the
stringcomparemode setting.
titlecase
Fold to titlecase. Titlecase means the first character of a
word is uppercased, while the rest of the word is lowercased.
Note: This style is for functions that actually return
a string, e.g. <strfold>; it should not be used in
comparison situations such as indexes and searches as its
comparison behavior is undefined. See the
stringcomparemode setting.
Any combination of zero or more of the following flags may be given in addition to a case style:
iso-8859-1
aka iso88591
Interpret text as ISO-8859-1 encoded. This should only be
used if all text is known to be in this character set. Only
codepoints U+0001 through U+00FF can be supported. Any UTF-8
text will be misinterpreted.
If this flag is disabled (the default), text is interpreted as UTF-8, and invalid bytes (if any) are interpreted as ISO-8859-1. This supports all UTF-8 characters, as well as most typical ISO-8859-1 data, if any happens to be accidentally mixed in.
Typically, this flag is left disabled, and text is stored in UTF-8, since it supports a broader range of characters. Any other character set besides UTF-8 or ISO-8859-1 is not supported, and should be mapped to UTF-8.
utf-8
aka utf8
Alias for negating iso-8859-1, i.e. specifying this
disables the iso-8859-1 flag.
expanddiacritics
aka expdiacritics
Expand certain phonological diacritics: umlauts over "a", "o",
"u" expand to the vowel plus "e" (for
German, e.g. "für" matches "fuer");
circumflexes over "e" and "o" expand to the
vowel plus "s" (for French, e.g. "hôtel"
matches "hostel"). The expanded "e" or
"s" is optional-match - e.g. "für" also
matches "fur" - but only against a non-optional
char; i.e. "hôtel" does not match "hötel"
(the "e" and "s" collide), and "für" does
not match "füer" (both optional "e"s must match
each other). Also, neither the vowel nor the
"e"/"s" will match an
ignorediacritics-stripped character; this prevents
"für" from matching "fuér".
ignorediacritics
aka igndiacritics
Ignore diacritic marks - Unicode non-starter or modifier
symbols resulting from NFD decomposition - e.g. diaeresis,
umlaut, circumflex, grave, acute, tilde etc.
expandligatures
aka expligatures
Expand ligatures, e.g. "œ" (U+0153) will match
"oe". Note that even with this flag off, certain ligatures
may still be expanded if necessary for case-folding under
ignorecase with case mode unicodemulti; see
below.
ignorewidth
aka ignwidth
Ignore half- and full-width differences, e.g. for katakana and
ASCII.
Due to interactions between flags, they are applied in the order
specified above, followed by case folding according to the case
style (upper/lower etc.). E.g. expanddiacritics is applied
before ignorediacritics, because otherwise the latter would
strip the characters that the former expands.
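The ordering point can be seen in a simplified model. This is an illustrative Python sketch, not the Texis implementation: the helper names and the small expansion table are hypothetical, and the sketch always expands the extra letter, whereas the real expanddiacritics flag makes it optional-match.

```python
import unicodedata

# Simplified expansion table (assumption: lowercase only, always expanded).
EXPANSIONS = {
    "\u00e4": "ae",  # a with umlaut
    "\u00f6": "oe",  # o with umlaut
    "\u00fc": "ue",  # u with umlaut
    "\u00ea": "es",  # e with circumflex
    "\u00f4": "os",  # o with circumflex
}

def expand_diacritics(text):
    """Replace each listed character with its vowel plus "e" or "s"."""
    return "".join(EXPANSIONS.get(ch, ch) for ch in text)

def ignore_diacritics(text):
    """NFD-decompose the text, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

word = "f\u00fcr"  # "für"
# Documented order: expand first, then strip what is left.
print(ignore_diacritics(expand_diacritics(word)))  # fuer
# Reversed order: the umlaut is stripped before it can be expanded.
print(expand_diacritics(ignore_diacritics(word)))  # fur
```

Applied in the reverse order, the expansion never sees the umlaut.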
A case-folding mode may also be given in addition to the above; this determines how the case-folding style (e.g. upper/lower/title) is actually applied. It is one of the following:
unicodemulti
Use the builtin Unicode 5.1.0 1-to-N-character folding
tables. All locale-independent Unicode characters with the
appropriate case equivalent are folded. A single character
may fold to up to 3 characters, if needed; e.g. the German
es-zett character (U+00DF) will match "ss" and
vice-versa under ignorecase. Note that additional
ligature expansions may happen if expandligatures is
set.
unicodemono
Use the builtin Unicode 5.1.0 1-to-1-character folding tables.
All locale-independent Unicode characters with the appropriate
case equivalent are folded. Note that even though this mode
is 1-to-1-character, it is not necessarily 1-to-1-byte,
i.e. a UTF-8 string may still change its byte length when
folded, even though the Unicode character count will remain
the same.
ctype
Use the C ctype.h functions. Case folding will be OS-
and locale-dependent; a locale should be set with the SQL
locale property. Only codepoints U+0001 through U+00FF
can be folded; e.g. most Western European characters are
folded, but Cyrillic, Greek etc. are not. Note that while
this mode is 1-to-1-character, it is not necessarily
1-to-1-byte, unless the iso-8859-1 flag is also
in effect. This mode was part of the default in version 5
and earlier.
The default case-folding mode is unicodemulti; see below
for the version 5 and earlier default, and important caveats.
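The difference between 1-to-N and 1-to-1 folding can be illustrated with Python's own Unicode tables, used here as a rough stand-in. These are not the Texis Unicode 5.1.0 tables, and str.lower() is only approximately 1-to-1, so results may differ from Texis for some characters. The function names are hypothetical.

```python
def fold_multi(text):
    """1-to-N folding stand-in: one character may fold to several."""
    return text.casefold()

def fold_mono(text):
    """Approximately 1-to-1 folding stand-in."""
    return text.lower()

# Under 1-to-N folding the es-zett (U+00DF) folds to "ss".
print(fold_multi("Stra\u00dfe") == fold_multi("STRASSE"))  # True
print(fold_mono("Stra\u00dfe") == fold_mono("STRASSE"))    # False

# 1-to-1-character is not 1-to-1-byte: U+023A lowercases to U+2C65,
# which takes one more byte in UTF-8.
upper = "\u023a"
lower = fold_mono(upper)
print(len(upper), len(upper.encode("utf-8")))  # 1 2
print(len(lower), len(lower.encode("utf-8")))  # 1 3
```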
In addition to the above styles, flags and modes, several aliases
may be used, and mixed with flags. The aliases have the form:
[stringcomparemode|textsearchmode][default|builtin]
where stringcomparemode or textsearchmode refers to
that setting's value (if not given: the setting being modified).
default refers to the default value (modifiable with
texis.ini); builtin refers to the builtin
factory default; no suffix refers to the current setting value.
E.g. "stringcomparemodedefault,+ignorecase" would obtain
the default stringcomparemode setting (from
texis.ini if available), but set the case style to
ignorecase.
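The alias form above can be sketched as a small pattern match. This is an illustrative Python model, not Texis code; the function name parse_alias and the label "current" for the no-suffix case are hypothetical.

```python
import re

# Optional setting name followed by an optional "default"/"builtin" suffix.
ALIAS = re.compile(r"(stringcomparemode|textsearchmode)?(default|builtin)?")

def parse_alias(token, modifying):
    """Split an alias into (setting, source); return None if not an alias.

    `modifying` is the setting currently being changed; it is used when the
    alias names no setting.
    """
    match = ALIAS.fullmatch(token)
    if not token or match is None:
        return None
    setting = match.group(1) or modifying
    source = match.group(2) or "current"
    return setting, source

print(parse_alias("stringcomparemodedefault", "textsearchmode"))
# ('stringcomparemode', 'default')
print(parse_alias("builtin", "textsearchmode"))
# ('textsearchmode', 'builtin')
print(parse_alias("ignorecase", "textsearchmode"))
# None
```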
A Metamorph index always uses the textsearchmode value that
was set at its initial creation, not the current value. However,
when multiple Metamorph indexes exist on the same fields, at
search time the Texis optimizer will attempt to use the index
whose (creation-time) textsearchmode is closest to the
current value.
The textsearchmode setting was added in Texis version 6; its
default is "unicodemulti, ignorecase, ignorewidth,
ignorediacritics, expandligatures" (note that UTF-8 text is
expected, since iso-8859-1 is not specified in the
default). In version 5 and earlier the default was effectively
"ctype, ignorecase, iso-8859-1". Caveat: A Texis
version 5 or earlier should not access or modify a Metamorph
index created by a version 6 or later Texis, unless textsearchmode
was set to "ctype, ignorecase, iso-8859-1" at creation,
or index corruption may result, especially if there are
hi-bit/Unicode/UTF-8 characters.
ueqprefix
(string)
The name of the user equivalence file. Default is empty.
withinmode
(string)
A space- or comma-separated unit and optional type for the
"within-N" operator (e.g. w/5). The unit is one of:
char
for within-N characters
word
for within-N words
The optional type is one of:
radius
(the default if no type specified when set)
indicates all sets must be within a radius N of an
"anchor" set, i.e. there is a set in the match such that all
other sets are within N units right of its right edge or N
units left of its left edge.
span
indicates all sets must be within an N-unit span.
The default setting for
version 5 and earlier is char (i.e. char radius);
in version 6 and later the default is word span.
withinproc
(boolean, on by default)
Whether to process the w/ operator in queries.
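The radius and span types of withinmode can be contrasted with a simplified model. This is an illustrative Python sketch, not Texis code: each set is reduced to a single position (for example a word index), the function names are hypothetical, and the exact edge and boundary handling in Texis may differ.

```python
def within_radius(positions, n):
    """True if some anchor position has every other position within n units."""
    return any(all(abs(p - anchor) <= n for p in positions)
               for anchor in positions)

def within_span(positions, n):
    """True if all positions fit inside one n-unit span."""
    return max(positions) - min(positions) <= n

# Three sets matched at word positions 10, 14 and 18, queried with w/5.
hits = [10, 14, 18]
print(within_radius(hits, 5))  # True: 14 is an anchor, the others are 4 away
print(within_span(hits, 5))    # False: the sets spread over 8 units
```

Under this model a radius match can cover up to roughly twice the distance of a span match for the same N.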
DIAGNOSTICS
apicp returns nothing.
EXAMPLE
<apicp "alpostproc" "on">
CAVEATS
The apicp function was added Sep. 13 1996. Various settings
were added since then and are unknown to previous versions.
Any apicp calls should take place after
USER/PASS statements, but before SQL and fmt calls.
The ability to pass multiple $value arguments for string-list
settings was added in version 3.0.996300000 20010728.
SEE ALSO
apiinfo, USER, PASS, Metamorph hit markup
The Metamorph Linguistics chapter in the Texis manual