Unicode textsearchmode, stringcomparemode - Caveat

Texis Version 6 has improved Unicode (international/foreign/hi-bit/UTF-8) character support. Two new settings were introduced: textsearchmode and stringcomparemode. Both accept the same set of values, and give more control over how text searches and string comparisons, respectively, are handled. Some features:

  • Full Unicode case-insensitivity: By default, text searches (e.g. the LIKE operator) are case-insensitive in version 6 for the entire Unicode 5.1 locale-independent character set, not just the given operating system's locale (which may be inconsistent and does not support characters beyond U+00FF).

  • UTF-8 support: UTF-8 is the expected character set, though ISO-8859-1 is still accepted. (Other character sets are converted automatically.)

  • Full-width ASCII: Full-width ASCII characters (used in CJK contexts) match their normal/half-width ASCII counterparts.

  • Diacritics ignored: Diacritic marks (umlauts, accents, etc.) are ignored, so that e.g. "für" matches "fur".

  • Ligatures expanded: Ligatures are expanded to match their expanded counterparts, e.g. "œ" (U+0153) will match "oe".

All of these behaviors can be controlled with the (new in version 6) textsearchmode and stringcomparemode apicp settings (see the Vortex manual for details).
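To make the matching behaviors listed above more concrete, here is a rough Python sketch of the same kind of folding: ligature expansion, full-width mapping, diacritic stripping and locale-independent case folding. It only illustrates the idea and is not how Texis implements it. The function name fold and the LIGATURES table are hypothetical, and the table holds only the "œ" example given above.

```python
import unicodedata

# Hypothetical ligature table for illustration only; the full set of
# ligatures that Texis expands is not listed in this section.
LIGATURES = {"\u0153": "oe", "\u0152": "OE"}  # œ, Œ


def fold(text: str) -> str:
    """Reduce text to a simplified form for loose comparison (sketch)."""
    # Expand ligatures that Unicode normalization does not decompose.
    expanded = "".join(LIGATURES.get(ch, ch) for ch in text)
    # NFKD maps full-width forms to their ASCII counterparts and splits
    # accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", expanded)
    # Drop the combining marks (diacritics).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Case-fold independently of the operating system's locale.
    return stripped.casefold()


# "f\u00fcr" is "für"; "\uff21\uff22\uff23" is full-width "ABC".
print(fold("f\u00fcr") == fold("FUR"))
print(fold("\uff21\uff22\uff23") == fold("abc"))
print(fold("\u0153uvre") == fold("oeuvre"))
```

Under this sketch, two strings "match" when their folded forms are equal, which is roughly the effect the default version 6 settings aim for in text searches.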

Caveat: A version 5 or earlier Texis should not access or modify a regular (B-tree) or Metamorph index originally created by a version 6 or later Texis, unless at creation stringcomparemode was set to "ctype, respectcase, iso-8859-1" (for regular indices) or textsearchmode was set to "ctype, ignorecase, iso-8859-1" (for Metamorph indices). If hi-bit/UTF-8/Unicode characters exist in the data, modifications made by Texis 5 may corrupt the index.

The stringcomparemode setting also affects the functions <xtree>, <strstr>, <strstri>, <substr>, <strcmp>, <strcmpi>, <strncmp>, <strnicmp>, <strlen>, <strrev>, <upper>, <lower>, <sort>, <uniq>, upper(), lower(), initcap(), text2mm() and length(). The length()/<strlen> functions count characters in the charset (e.g. UTF-8 characters), not bytes.
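The difference between counting characters and counting bytes can be seen in a small Python sketch. This is an analogy only, not Texis code; the variable names are made up for the example.

```python
# "f\u00fcr" is the three-character string "für".
text = "f\u00fcr"

# Number of characters, analogous to a character-based length.
char_count = len(text)

# Number of bytes in the UTF-8 encoding: "ü" (U+00FC) takes two bytes.
byte_count = len(text.encode("utf-8"))

print(char_count, byte_count)  # prints "3 4"
```

Code that assumed a byte count, for example to size a buffer, may therefore see different values for non-ASCII data.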

Version 5 and earlier behavior can be made the default again by setting, in texis.ini, [Apicp] Text Search Mode to "ctype, ignorecase, iso-8859-1" and [Apicp] String Compare Mode to "ctype, respectcase, iso-8859-1".
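As a sketch of what those two texis.ini entries might look like, the following Python snippet builds the [Apicp] section with the standard configparser module and prints it. The "Name = value" layout is an assumption about texis.ini syntax and should be checked against the Texis manual; the section name, setting names and values are the ones given above.

```python
import configparser
import io

config = configparser.ConfigParser()
# Keep the setting names exactly as written (configparser lowercases by default).
config.optionxform = str

# Section and setting names as stated above; the "Name = value" layout
# is assumed, not confirmed by this section.
config["Apicp"] = {
    "Text Search Mode": "ctype, ignorecase, iso-8859-1",
    "String Compare Mode": "ctype, respectcase, iso-8859-1",
}

buffer = io.StringIO()
config.write(buffer)
fragment = buffer.getvalue()
print(fragment)
```

The printed fragment could then be merged by hand into an existing texis.ini rather than written over it, since that file usually holds other settings as well.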

