The Engine runs this routine as a preliminary step before actually
executing any search, using the contents of the Prefix and Suffix
Lists. The same routine is also used to get words from the Equivalence
File; in that case it does only the suffix stripping part, using the
contents of the Equiv-Suffix List.
- When it is time to execute a search, the suffix and prefix lists
are each sorted by descending size, then by ascending alphabetical
order. Descending size matters because it lets suffixes and prefixes
be stripped largest to smallest. Alphabetical order serves no
particular purpose other than to provide a predictable ordering
sequence.
- Get the word to be checked from the entered query; e.g.,
"antidisestablishmentarianism".
- Check whether the word's length is greater than or equal to the
minimum word length setting. If so, carry on. If not, there is no
need to morpheme strip the word; it is simply searched for as is;
e.g., "red". The default minimum word length in Texis is 255; a
setting of 5 normally produces the best results with the default
suffix list.
- Check the word against the list of suffixes to see if there is a
match, starting from the largest suffix on the list.
- If a match is found, strip it from the word. Note: this is why
ordering by size is so important. You want to remove suffixes (or
prefixes) largest first, so as not to miss multiple suffixes where
one suffix may be a subset of another.
- Continue checking against the list for the next match. Repeat the
previous two steps (check for a match, strip it) until no more
matches are found. In the case of our example above, based on the
default suffix list, we would be left with the following morpheme
before prefix processing: "antidisestablishmentarian". Note the
following things:
  - The suffix "ism" was on the list and was stripped.
  - Neither "an", "ian", nor "arian" was on the suffix list, so none
    of them was stripped.
  - The suffix "ment" is on the suffix list, but it was not left at
    the end of the word at any point, and therefore was not removed.
  - If "arian" and "ian" were both entered on the suffix list,
    "arian" would be removed first, so as not to remove "ian" and be
    left with "ar" at the end of the word, which would not be
    strippable.
- For suffix checking only, remove any trailing vowels, or one of any
double trailing consonants. This handles words like "strive", which
is correctly stripped down to "striv" so that it won't miss matches
for "striving", etc. (trailing vowel). A word like "travelling" is
first stripped to "travell"; the second "l" must then be stripped so
that you don't miss the word "travel" (trailing double consonant).
Note: this is done only for suffix checking, not prefix checking.
- Now repeat the match-and-strip steps (check the list, strip a
match, continue until no more matches are found) for prefix stripping
against the prefix list. In our example, "antidisestablishmentarian"
would get stripped down to "establishmentarian". This is what you
have left and is what goes to the pattern matcher.
- When something is found, the pattern matcher builds it back up
again to make sure it is truly a match for what you were looking for.
This prevents things like taking "pressure" when you were really
looking for "president", "restive" when you were really looking for
"restaurant", and other such oddities.
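
The stripping steps above can be sketched in a few lines of Python.
This is only an illustration of the described procedure, not the
Engine's actual code. The function names, the tiny SUFFIXES and
PREFIXES lists, and the default of 5 for min_word_len are all
assumptions made for the example; the real default lists are larger.
Two details are interpretations of the text: after each strip the
scan restarts from the largest affix, and "any trailing vowels" is
read as every trailing vowel.

```python
VOWELS = set("aeiou")

# Hypothetical, abbreviated lists for illustration only.
SUFFIXES = ["ism", "ment", "ing"]
PREFIXES = ["anti", "dis"]


def sort_affixes(affixes):
    """Sort by descending length, then ascending alphabetical order."""
    return sorted((a for a in affixes if a), key=lambda a: (-len(a), a))


def strip_affixes(word, affixes, at_end):
    """Strip the largest matching affix repeatedly until none matches."""
    ordered = sort_affixes(affixes)
    stripped = True
    while stripped:
        stripped = False
        for affix in ordered:
            matches = word.endswith(affix) if at_end else word.startswith(affix)
            # Assumption: never strip the entire word away.
            if matches and len(word) > len(affix):
                word = word[:-len(affix)] if at_end else word[len(affix):]
                stripped = True
                break  # restart the scan from the largest affix
    return word


def trim_tail(word):
    """Suffix pass only: drop trailing vowels, or one of a doubled
    trailing consonant."""
    if word and word[-1] in VOWELS:
        while len(word) > 1 and word[-1] in VOWELS:
            word = word[:-1]
    elif len(word) >= 2 and word[-1] == word[-2]:
        word = word[:-1]
    return word


def morpheme(word, suffixes=SUFFIXES, prefixes=PREFIXES, min_word_len=5):
    """Return the root that would be handed to the pattern matcher."""
    if len(word) < min_word_len:
        return word  # too short: searched for as is
    word = strip_affixes(word, suffixes, at_end=True)
    word = trim_tail(word)
    return strip_affixes(word, prefixes, at_end=False)


if __name__ == "__main__":
    for w in ["antidisestablishmentarianism", "strive", "travelling", "red"]:
        print(w, "->", morpheme(w))
```

With these sample lists the sketch reproduces the examples in the
text: "antidisestablishmentarianism" reduces to "establishmentarian",
"strive" to "striv", "travelling" to "travel", and "red" is returned
unchanged because it is shorter than the minimum word length. The
final verification step, where the pattern matcher builds a candidate
back up, is not modeled here.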
Copyright © Thunderstone Software Last updated: Apr 15 2024