Morpheme Stripping Routine

The Engine performs this routine as a preliminary step before executing any search, using the contents of the Prefix and Suffix Lists. The routine is also used to get words from the Equivalence File; in that case it does only the suffix stripping part, using the contents of the Equiv-Suffix List.

  1. When it is time to execute a search, the suffix and prefix lists are each sorted by descending size and, within each size, by ascending alphabetical order. Descending size matters because it lets suffixes and prefixes be stripped largest to smallest. There is no particular reason for alphabetical order except to provide a predictable ordering sequence.

  2. Get the word to be checked from the entered query; e.g., "antidisestablishmentarianism".

  3. Check whether the word's length is greater than or equal to the minimum word length setting. The default in Texis is 255. A setting of 5 normally produces the best results with the default suffix list. If the word is long enough, carry on. If not, there is no need to morpheme strip the word; it is simply searched for as is; e.g., "red".

  4. Check the word against the list of suffixes to see if there is a match, starting from the largest suffix on the list.

  5. If a match is found, strip it from the word. Note: This is why ordering by size is so important. You want to remove suffixes (or prefixes) largest first, so as not to miss multiple suffixes where one suffix may be a subset of another.

  6. Continue checking against the list for the next match. Follow steps 4-5 until no more matches are found. In the case of our example above, based on the default suffix list, we would be left with the following morpheme before prefix processing: "antidisestablishmentarian". Note the following things:

    • The suffix "ism" was on the list and was stripped.

    • None of "an", "ian", or "arian" was on the suffix list, so none of them was stripped.

    • The suffix "ment" is on the suffix list, but it was not left at the end of the word at any point, and therefore was not removed.

    • If "arian" and "ian" were both entered on the suffix list, "arian" would be removed first. Removing "ian" first would leave "ar" at the end of the word, which would not be strippable.

  7. For suffix checking only, remove any trailing vowels, or one of any doubled trailing consonants. The trailing vowel rule handles things like "strive", which is correctly stripped down to "striv" so that it won't miss matches for "striving", etc. The doubled consonant rule handles things like "travelling", which is first stripped to "travell"; the second "l" then has to be stripped so that the word "travel" is not missed. Note: this is only done for suffix checking, not prefix checking.

  8. Now repeat Steps 4-6 for prefix stripping against the prefix list. In our example, "antidisestablishmentarian" would get stripped down to "establishmentarian". This is what you have left and is what goes to the pattern matcher.

  9. When something is found, the pattern matcher builds it back up again to make sure it is truly a match to what you were looking for. This prevents things like taking "pressure" when you were really looking for "president", "restive" when you were really looking for "restaurant", and other such oddities.
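
Steps 1 through 8 can be sketched in Python as follows. This is only an illustration of the procedure described above, not the Engine's actual code. The function names, the short suffix and prefix lists, the choice of "aeiou" as the vowels, the rule that a strip never empties the word, and the choice to restart from the top of the sorted list after each strip are all assumptions made for the example. The rebuild check of step 9 is not shown.

```python
VOWELS = "aeiou"  # assumption: the source does not say which letters count as vowels


def sort_affixes(affixes):
    """Step 1: descending size, then ascending alphabetical order."""
    return sorted(affixes, key=lambda a: (-len(a), a))


def strip_affixes(word, affixes, from_end):
    """Steps 4-6: strip the largest matching affix until nothing matches."""
    ordered = sort_affixes(affixes)
    found = True
    while found:
        found = False
        for affix in ordered:
            matches = word.endswith(affix) if from_end else word.startswith(affix)
            # Assumption: never strip the whole word away.
            if matches and len(word) > len(affix):
                word = word[:-len(affix)] if from_end else word[len(affix):]
                found = True
                break  # restart from the largest affix (assumed behavior)
    return word


def trim_tail(word):
    """Step 7: drop trailing vowels, or one of a doubled trailing consonant."""
    trimmed = word.rstrip(VOWELS)
    if trimmed and trimmed != word:
        return trimmed
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        return word[:-1]
    return word


def morpheme(word, suffixes, prefixes, min_word_len=5):
    """Steps 3-8: return the root that would go to the pattern matcher."""
    if len(word) < min_word_len:
        return word  # too short: searched for as is
    word = strip_affixes(word, suffixes, from_end=True)
    word = trim_tail(word)  # suffix side only
    return strip_affixes(word, prefixes, from_end=False)


# Hypothetical lists for illustration; these are not the default lists.
SUFFIXES = ["ism", "ment", "ing"]
PREFIXES = ["anti", "dis"]

print(morpheme("antidisestablishmentarianism", SUFFIXES, PREFIXES))
# prints "establishmentarian"
print(morpheme("travelling", SUFFIXES, PREFIXES))  # prints "travel"
print(morpheme("strive", SUFFIXES, PREFIXES))      # prints "striv"
print(morpheme("red", SUFFIXES, PREFIXES))         # prints "red"
```

With these lists, "ment" is never stripped from the example word because it is never left at the end of the word. Adding "arian" and "ian" to the hypothetical suffix list changes that: "arian" is removed after "ism", which exposes "ment", and the result becomes "establish".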

Copyright © Thunderstone Software     Last updated: Dec 10 2018