REX Caveats and Commentary

REX is a highly optimized pattern recognition tool modeled after the Unix grep and lex family of tools. Wherever possible, REX's syntax has been kept consistent with these tools, but there are several major departures that may bite those who are used to the grep family.

REX uses a combination of techniques that allow it to operate much faster than similar expression matching tools. Unlike grep, REX is both deterministic and non-directional. This may cause some initial problems for users familiar with grep's way of thinking.

REX always applies repetition operators to the longest preceding expression. It does this so that it can maximize the benefits of using its rapid state skipping pattern matcher.

If you were to give grep the expression: "ab*de+"

It would interpret it as: an "a" then 0 or more "b"s then a "d" then 1 or more "e"s.

REX will interpret this as: 0 or more occurrences of "ab" followed by 1 or more occurrences of "de".
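The difference between the two readings can be sketched in Python. This is only an illustration: Python's re module is not REX, so REX's reading is emulated here by writing the groups out explicitly, and the sample strings are made up for the example.

```python
import re

# Python's re follows the grep-style convention: * and + bind to a single character.
grep_style = re.compile(r"ab*de+")

# Emulation (an assumption, not REX itself) of how REX reads "ab*de+":
# each operator binds to the whole preceding literal, shown with explicit groups.
rex_style = re.compile(r"(?:ab)*(?:de)+")

samples = ["abbbdee", "ababdede", "de"]

# For each sample: (matches under grep-style reading, matches under REX-style reading)
results = {
    s: (bool(grep_style.fullmatch(s)), bool(rex_style.fullmatch(s)))
    for s in samples
}

for s, (as_grep, as_rex) in results.items():
    print(f"{s!r}: grep-style={as_grep}, REX-style={as_rex}")
```

The string "abbbdee" satisfies only the grep-style reading, while "ababdede" and "de" satisfy only the REX-style reading.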

The second technique that provides REX with a speed advantage is its ability to locate patterns both forwards and backwards indiscriminately.

Given the expression: "abc*def", the pattern matcher is looking for "zero to N occurrences of `abc' followed by a `def'".

The following text examples would be matched by this expression:

     abcabcabcabcdef
     def
     abcdef

But consider these patterns if they were embedded within a body of text:

My country 'tis of abcabcabcabcdef sweet land of def, abcdef.

A normal pattern matching scheme would begin by looking for "abc*". Since "abc*" is matched at every position within the text, the normal pattern matcher would plod along checking for "abc*" and then, whether it is there or not, try to match "def". REX examines the expression in search of the most efficient fixed length subpattern and uses it as the root of the search rather than the first subexpression. So, in the example above, REX would not begin searching for "abc*" until it has located a "def".
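The idea of rooting the search at the fixed subpattern can be sketched as follows. This is a simplified illustration under stated assumptions, not REX's actual algorithm: the function name find_anchor_first and its parameters are hypothetical, and it handles only the single case of a repeated literal followed by a fixed literal.

```python
def find_anchor_first(text, prefix="abc", anchor="def"):
    """Return (start, end) spans for zero or more `prefix` followed by `anchor`.

    Hypothetical helper: it locates the fixed `anchor` first, then looks
    backwards for repetitions of `prefix`, instead of testing the repeated
    part at every position in the text.
    """
    spans = []
    pos = text.find(anchor)
    while pos != -1:
        start = pos
        # Extend the match backwards over each whole repetition of the prefix.
        while start >= len(prefix) and text[start - len(prefix):start] == prefix:
            start -= len(prefix)
        spans.append((start, pos + len(anchor)))
        # Continue searching after the anchor just consumed.
        pos = text.find(anchor, pos + len(anchor))
    return spans


text = "My country 'tis of abcabcabcabcdef sweet land of def, abcdef."
matches = [text[s:e] for s, e in find_anchor_first(text)]
print(matches)
```

On the sample sentence this yields the three matches listed earlier: "abcabcabcabcdef", "def", and "abcdef". The backward scan runs only at the places where "def" was actually found.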

There are many other techniques used in REX to improve the rate at which it searches for patterns, but these should have no effect on the way in which you specify an expression.

The three rules that will cause the most problems for experienced grep users are:

  1. Repetition operators are always applied to the longest preceding fixed length expression.

  2. There must be at least one subexpression that has one or more repetitions.

  3. No matched subexpression will be located as part of another.

Rule 1 Example
: "abc=def*" means one "abc" followed by 0 or more "def"s.

Rule 2 Example
: "abc*def*" cannot be located because it matches every position within the text.

Rule 3 Example
: "a+ab" is idiosyncratic because "a+" is a subpart of "ab".
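Rules 1 and 2 can be illustrated with Python's re module. This is an emulation only: Python does not use REX syntax, so REX's "=" (one occurrence) and its grouping behavior are written out with explicit groups, and the test strings are invented for the example.

```python
import re

# Rule 1, emulated: REX's "abc=def*" read as exactly one "abc",
# followed by zero or more occurrences of "def".
rule1 = re.compile(r"(?:abc){1}(?:def)*")
rule1_ok = [bool(rule1.fullmatch(s)) for s in ("abc", "abcdefdef", "abcabc")]

# Rule 2, emulated: in "abc*def*" every subexpression may repeat zero
# times, so the pattern is satisfied by the empty string at every position.
rule2 = re.compile(r"(?:abc)*(?:def)*")
text = "xyz"
empty_hits = [m.span() for m in rule2.finditer(text)]

print(rule1_ok)
print(empty_hits)
```

In the Rule 2 case, text containing neither "abc" nor "def" still produces a zero-length match at each position, which is why such an expression gives the matcher nothing useful to locate.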


Copyright © Thunderstone Software     Last updated: Nov 8 2024