4.7 Index Issues

When you create a Metamorph index of any of the three types, several factors should be taken into account. Above all, the index must be able to find the things you will search for. By default, the index treats a word as a string of 2 to 99 alphanumeric characters and does not index noise words. Texis respects locale settings and asks the system which characters are alphanumeric. Depending on your application, you may want to add or change index expressions, or remove the noise list.

For example, if you have a resume or software products database, you may want to be able to search for C++ or OS/2. You might add an expression with: set addexp='[\alnum+/]{2,99}' which would include + and / in the index. To remove the default expression and supply only your own, use: set delexp=0; You might do that if you need to index single-character words, since the default expression requires at least two characters.
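The effect of adding an expression can be approximated outside Texis. The following is a minimal Python sketch, not Texis code: Python's re syntax stands in for the index expression syntax, [A-Za-z0-9] stands in for \alnum, and the function name index_terms and the sample text are hypothetical. It assumes that an added expression works alongside the default one, so the indexed terms are the union of both sets of matches.

```python
import re

# Python stand-ins for the index expressions (assumption: ASCII-only \alnum).
DEFAULT_EXPR = re.compile(r"[A-Za-z0-9]{2,99}")    # approximates the default expression
EXTRA_EXPR = re.compile(r"[A-Za-z0-9+/]{2,99}")    # approximates [\alnum+/]{2,99}

def index_terms(text, expressions):
    """Return the set of terms matched by any of the given expressions."""
    terms = set()
    for expr in expressions:
        terms.update(expr.findall(text))
    return terms

text = "Senior C++ developer, OS/2 drivers"

# With only the default expression, "C++" is lost and "OS/2" is cut to "OS".
default_terms = index_terms(text, [DEFAULT_EXPR])

# With the extra expression added, both terms are captured whole.
both_terms = index_terms(text, [DEFAULT_EXPR, EXTRA_EXPR])

print(sorted(default_terms))
print(sorted(both_terms))
```

In this sketch the lone "C" of "C++" fails the two-character minimum under the default expression, which is why the term disappears entirely rather than being indexed as "C".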

Another reason to switch index expressions is to handle 8-bit character sets if the locale is not set correctly. The most common expression to use is: [\alnum\x80-\xff]{2,99} which also matches all the high-bit characters. Metamorph also has two settable character sets, wordc and langc, that define what constitutes a word character and a language character. Wordc defines what constitutes a word, and is used to ensure you aren't getting substring matches. Langc is similar; it is used when examining the query to determine whether a term is language, in which case suffix processing is applied and wordc is used to qualify the hit. Wordc by default is [\alpha\'], and langc also includes the hyphen: [\alpha\'\-].
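To see why the high-bit range matters, the sketch below mimics the situation in Python. It is an approximation under stated assumptions, not Texis code: the text is assumed to be Latin-1 encoded, the ASCII-only pattern models a locale that does not report accented letters as alphanumeric, and the variable names are hypothetical.

```python
import re

# Latin-1 bytes: each accented letter is a single byte in the range 0x80-0xff.
data = "café résumé".encode("latin-1")

# Models a locale where only ASCII letters and digits count as alphanumeric.
ascii_only_expr = re.compile(rb"[A-Za-z0-9]{2,99}")

# Approximates [\alnum\x80-\xff]{2,99}: ASCII alphanumerics plus all high-bit bytes.
high_bit_expr = re.compile(rb"[A-Za-z0-9\x80-\xff]{2,99}")

# Without the high-bit range, words are split at every accented byte.
ascii_only = ascii_only_expr.findall(data)

# With the high-bit range, each word is kept whole.
high_bit = high_bit_expr.findall(data)

print(ascii_only)
print(high_bit)
```

The trade-off is that the range accepts every high-bit byte, including any that are punctuation or symbols in the actual character set, so it is a blunt substitute for a correctly configured locale.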

If you have a CD database, you might want to disable the noise list: set keepnoise=1; so that you can search for and find things like "Who are you" by "The Who". You may also want to modify or disable the noise list if you have state abbreviations that will be searched on: IN, ME, and OR are all noise words.
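The impact of the noise list on such data can be illustrated with a small Python sketch. This is not Texis code: the NOISE set is a tiny illustrative subset built from the examples in the text (the real noise list is longer), the keep_noise parameter only imitates the idea behind keepnoise, and lowercasing is a simplification made for the comparison.

```python
import re

# Illustrative subset only (assumption); the real noise list contains more words.
NOISE = {"the", "who", "are", "you", "in", "me", "or"}

# Approximates the default index expression with ASCII alphanumerics.
WORD = re.compile(r"[A-Za-z0-9]{2,99}")

def index_terms(text, keep_noise=False):
    """Return the words that would be indexed, in order of appearance."""
    words = [w.lower() for w in WORD.findall(text)]
    if keep_noise:
        return words
    return [w for w in words if w not in NOISE]

# A title made entirely of noise words leaves nothing to index by default.
title_default = index_terms("Who Are You")
title_kept = index_terms("Who Are You", keep_noise=True)

# A state abbreviation that is also a noise word is dropped by default.
state_default = index_terms("Portland OR")
state_kept = index_terms("Portland OR", keep_noise=True)

print(title_default, title_kept)
print(state_default, state_kept)
```

In this model the song title yields no indexed terms at all unless noise words are kept, which is the failure the CD database example describes.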

Copyright © 2024 Thunderstone Software LLC. All rights reserved.