Rex Regular EXpression Pattern Matcher
Q: What does it do?
A: It finds things that nothing else can!
Like:
- Part Numbers
- Social Security Numbers
- Special Codes that Aren't Words
- Headlines
- Proper Names
- Time Stamps
People stick many things in text that you cannot locate with the simple
wildcard matching that most other packages are limited to.
The Rex matcher is one of those features that is not called upon very often,
but when it's needed it will save a tremendous amount of time and aggravation!
Using a text retrieval package without this ability is like driving a car that
doesn't have a spare tire; eventually you'll be stranded.
Below is the rather lengthy description of Rex's syntax.
Fortunately you don't need to learn it because there are macro's and
easy-to-use expression builders within Metamorph to help guide you through.
REX Expression Syntax
Expressions are composed of characters and operators. Operators
are characters with special meaning to Rex. The following
characters have special meaning: "\=+*?{},[]^$.-!" and must
be escaped with a '\' if they are meant to be taken literally.
The string ">>" is also special and if it is to be matched,
it should be written "\>>".
-
A '\' followed by an 'R' or an 'I' mean to begin respecting
or ignoring alphabetic case distinction. (Ignoring case is the
default.) These switches DO NOT apply inside range brackets.
-
A '\' followed by an 'L' indicates that the characters following
are to be taken literally up to the next '\L'. The purpose of
this operation is to remove the special meanings from characters.
-
A sub-expression following '\F' (followed by) or '\P' (preceded by)
can be used to root the rest of an expression to which it is tied.
It means to look for the rest of the expression "as long as followed
by ..." or " as long as preceded by ..." the sub-expression
following the \F or \P, but the designated sub-expression will be
considered excluded from the located expression itself.
-
A '\' followed by one of the following 'C' language character
classes matches that character class: alpha, upper, lower,
digit, xdigit, alnum, space, punct, print, graph,
cntrl, ascii.
-
A '\' followed by one of the following special characters
will assume the following meaning: n=newline, t=tab,
v=vertical tab, b=backspace, r=carriage return,
f=form feed, 0= the null character.
-
A '\' followed by Xn or Xnn where n is a hexadecimal digit
will match that character.
-
A '\' followed by any single character (not one of the above)
matches that character. Escaping
a character that is not a special escape is not recommended, as the
expression could change meaning if the character becomes an escape
in a future release.
-
The character '^' placed anywhere in an expression (except after a
'[') matches the beginning of a line. (same as: \x0A in Unix or
\x0D\x0A in DOS)
-
The character '$' placed anywhere in an expression
matches the end of a line. (\x0A in UNIX, \x0D\x0A in DOS)
-
The character '.' matches any character.
-
A single character not having special meaning matches that
character.
-
A string enclosed in brackets [] matches any single char-
acter from the string. Ranges of ASCII character codes
may be abbreviated as in [a-z] or [0-9]. A '^' occurring as
the first character of the string will invert the meaning
of the range. A literal '-' must be preceded by an '\'.
The case of alphabetic characters is always respected within
brackets.
-
The '>>' operator in the first position of a fixed expression
will force REX to use that expression as the "root" expression
off which the other fixed expressions are matched. This operator
overrides one of the optimizers in REX. This operator can
be quite handy if you are trying to match an expression
with a '!' operator or if you are matching an item that
is surrounded by other items. For example: "x+>>y+z+"
would force REX to find the "y's' first then go backwards
and forwards for the leading "x's" and trailing "z's".
-
The '!' character in the first position of an expression means
that it is NOT to match the following fixed expression.
For example: "start=!finish+" would match the word "start"
and anything past it up to (but not including the word "finish".
Usually operations involving the NOT operator involve knowing
what direction the pattern is being matched in. In these cases
the '>>' operator comes in handy. If the '>>' operator is used,
it comes before the '!'. For example: ">>start=!finish+finish"
would match anything that began with "start" and ended with
"finish". THE NOT OPERATOR CANNOT BE USED BY ITSELF in an
expression, or as the root expression in a compound expression.
Repetition Operators
A regular expression may be followed by a repetition operator in
order to indicate the number of times it may be repeated.
-
An expression followed by the operator "{X,Y}" indicates that
from X to Y occurrences of the expression are to be located. This
notation may take on several forms: "{X}" means X occurrences of
the expression, "{X,}" means from X to N occurrences of the
expression, and "{,Y}" means from 0 (no occurrences) to Y
occurrences of the expression.
-
The '?' operator is a synonym for the operation "{0,1}".
Read as: "Zero or one occurrence."
-
The '*' operator is a synonym for the operation "{0,}".
Read as: "Zero or more occurrences."
-
The '+' operator is a synonym for the operation "{1,}".
Read as: "One or more occurrences."
-
The '=' operator is a synonym for the operation "{1}".
Read as: "One occurrence."
Caveats and Commentary
REX is a highly optimized pattern recognition tool that has been modeled
after the UNIX family of tools: GREP, EGREP, FGREP, and LEX. Wherever
possible REX's syntax has been held consistent with these tools, but
there are several major departures that may bite those who are used to
using the GREP family.
REX uses a combination of techniques that allow it to surpass the speed of
anything similar to it by a very wide margin.
The technique that provides the largest advantage is called
"state-anticipation or state-skipping" which works as follows:
if we were looking for the pattern:
ABCDE
in the text:
AAAAABCDEAAAAAAA
a normal pattern matcher would do the following:
ABCDE
ABCDE
ABCDE
ABCDE
ABCDE
AAAAABCDEAAAAAAA
The state-anticipation scheme would do the following:
ABCDE
ABCDE
AAAAABCDEAAAAAAA
The normal algorithm moves one character at time through the text,
comparing the leading character of the pattern to the current text
character of text, and if they match, it compares the leading pattern
character +1 to the current text character +1 , and so on...
The state anticipation pattern matcher is aware of the length of the
pattern to be matched, and compares the last character of the pattern to
the corresponding text character. If the two are not equal, it moves
over by an amount that would allow it to match the next potential hit.
If one were to count the number of comparison cycles for each pattern
matching scheme using the example above, the normal pattern matcher would
have to perform 13 compare operations before locating the first occurrence
VS. 6 compare operations for the state-anticipation pattern matcher.
One concept to grasp here is that: "The longer the pattern to be found,
the faster the state-anticipation pattern matcher will be." While a
normal pattern matcher will slow down as the pattern gets longer.
Herein lies the first major syntax departure: REX always applies
repetition operators to the longest preceding expression. It does
this so that it can maximize the benefits of using the state-skipping
pattern matcher.
If you were to give GREP the expression : ab*de+
It would interpret it as:
an "a" then 0 or more "b"'s then a "d" then 1 or more "e"'s.
REX will interpret this as:
0 or more occurrences of "ab" followed by 1 or more occurrences of "de".
The second technique that provides REX with a speed advantage is ability
to locate patterns both forwards and backwards indiscriminately.
Given the expression: "abc*def", the pattern matcher is looking for
"Zero to N occurrences of 'abc' followed by a 'def'".
The following text examples would be matched by this expression:
abcabcabcabcdef
def
abcdef
But consider these patterns if they were embedded within a body of text:
My country 'tis of abcabcabcabcdef sweet land of def, abcdef.
A normal pattern matching scheme would begin looking for 'abc*' . Since
'abc*' is matched by every position within the text, the normal pattern
matcher would plod along checking for 'abc*' and then whether it's there
or not it would try to match "def". REX examines the expression
in search of the the most efficient fixed length sub-pattern and uses it
as the root of search rather than the first sub-expression. So, in the
example above, REX would not begin searching for "abc*" until it has located
a "def".
There are many other techniques used in REX to improve the rate at which
it searches for patterns, but these should have no effect on the way in
which you specify an expression.
The three rules that will cause the most problems to experienced GREP users
are:
-
1: Repetition operators are always applied to the longest expression.
-
2: There must be at least one sub-expression that has one or more repetitions.
-
3: No matched sub-expression will be located as part of another.
Rule 1 example:
abc=def* Means one "abc" followed by 0 or more "def"'s .
Rule 2 example:
abc*def* CAN NOT be located because it matches every position within the text.
Rule 3 example:
a+ab Is idiosyncratic because "a+" is a subpart of "ab".
back up
Copyright © 1996 Thunderstone Software