Low level Pattern Matching functions

SYNOPSIS
REX - Deterministic Regular Expression Matcher

FFS   *openrex(byte *s);
FFS   *closerex(FFS *fs);
FFS   *mknegexp(FFS *fs);
byte  *getrex(FFS *fs,byte *buf,byte *end,int operation);
int    rexsize(FFS *ex);

XPM - Approximate Pattern Matcher

XPMS *openxpm(byte  *s,int threshold); /* 0 < threshhold < 101 */
XPMS *closexpm(XPMS *xs);
byte *getxpm(XPMS *xs,byte *buf,byte *end,int operation);

PPM - Parallel String Pattern Matcher

PPMS *closeppm(PPMS *ps);
PPMS *openppm(byte **sl);
byte *getppm(PPMS *ps,byte *buf,byte *end,int operation);

SPM - Single String Pattern Matcher

SPMS *openspm(char *s);
SPMS *closespm(SPMS *fs);
byte *getspm(SPMS *fs,byte *buf,byte *end,int operation);
int   spmhitsz(SPMS *fs);

NPM - Numeric Pattern Matcher

NPMS *opennpm(char *s);
NPMS *closenpm(NPMS *);
byte *getnpm(NPMS *,byte *,byte *,int);


DESCRIPTION
These pattern matchers represent the core search algorithms that are present in the Metamorph program and API. Usage and syntax details are not presented here as they are described in great detail in the Metamorph User Manual. The programmer is encouraged to understand their purpose and usage before implementing their own software using these functions.

All of the pattern matchers behave in a similar fashion from a programming perspective. The open___() call initializes the matcher and returns a "handle" to it, and the close___() call free()s the memory associated with that object's "handle". If the open___() call fails for any reason it will return a cast pointer to NULL.

The arguments passed to the open___() call are as follows:

xxxxxxxxxxxxxxxxx=xxxxxxxxxxxxxxxxxxxxxx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Call             >Parameters Type(s)    > Example                         
openrex() >Rex expression        > "\\x0d\\x0a+"            
openxpm() >any string, threshold > "abc widgets",80         
openppm() >a list of strings     > {"abc","def","ghi",""};  
opennpm() >a numeric expression  > ">1000<=million"         
openspm() >a string expression   > "abc*def"

In all cases, the open___() call is computationally intensive, as each algorithm makes extensive use of dynamic programming techniques. It is generally considered that the pattern matcher will be processing a great deal of information between it's creation and destruction so that the creation overhead is justified by the dramatic increase in search rates.

If the pattern matchers are to be used in conjunction with one another then the programmer should optimize usage by dispatching the pattern matchers in order of relative speed. The following table enumerates the relative search rates:

xxxxxxxxxxxxxx=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Pattern Types >Governing factors                                  
SPM Fastest   >longer strings are faster                          
REX .         >longer expressions are usually faster              
PPM .         >shorter lists with the longest minimum strlen()    
XPM .         >shorter strings are faster                         
NPM Slowest   >nothing makes a difference

The get___() call is responsible for the location of hits within a buffer of information. All of the pattern matchers share a common parameter set for this operation:

(object pointer, byte *buf , byte *end, int operation)

object pointer
The structure pointer that was returned by the open___() call.

byte *buf
A pointer to the first character of the buffer to be searched .

byte *end
A pointer to one character past the last character in the buffer to be searched. ( Usually obtained by the expression buf+bufsize).

int operation
This will be one of the four possible operation codes: SEARCHNEWBUF,CONTINUESEARCH, BSEARCHNEWBUF, or BCONTINUESEARCH.

The SEARCHNEWBUF to locate the first occurrence of a hit within the delineated buffer, and the CONTINUESEARCH operation code indicates that the pattern matcher should locate the next (or succesive) occurrence. A pointer to the matched pattern will be returned or, a (byte *)NULL if no match was located within the buffer for the given call.

The operation codes BSEARCHNEWBUF and BCONTINUESEARCH are only understood by the REX pattern matcher, and are used to search backwards from the end pointer towards the buf pointer. A non NULL return value will point to the beginning of the matched pattern.

Some information about the matched hits for each of the pattern matchers may be obtained by looking into that pattern matcher's structure. The following structure members are the only valid ones for an API programmer to use:

NPMS /* Numeric Pattern Matcher */
{
 int hitsz;  /* the length of the matched quantity */
 double hy;  /* the maximum numeric value of the matched string */
 double hx;  /* the minimum numeric value of the matched string */
};

PPMS
{
 byte **slist;  /* the strings being sought */
 int sn;        /* the index of the located string within slist */
};

XPMS
{
 word thresh;           /* threshold above which a hit can occur */
 word maxthresh;        /* the max possible value a match may have */
 word thishit;          /* the value of this match */
 word maxhit;           /* max value located so far */
 byte maxstr[DYNABYTE]; /* the string with highest match value */
};

SPMS
{
 /* no usable members */
};

FFS  /* aka REX */
{
 /* no usable members */
};

Hit length information for REX and SPM is available through calls to rexsize() and spmhitsz() respectively. Each of these functions return the length of the last hit located by a call to get___(). The reason there are not similar calls available for the othe pattern matchers is because their length is obtainable via the structure.


EXAMPLE

#include "api3.h"

       /* this code breaks some rules in the interest of brevity */

main(argc,argv)
int    argc;
char **argv;
{
 void *pm=(void *)NULL;
 int   i;
 void (*close)();                    /* pointer to the close function */
 void (*get)();                        /* pointer to the get function */
 char ln[80];

 switch(*argv[1])      /* determine search type via leading character */
    {
     case '/' : pm=(void *)openrex(argv[1]+1);
                get=getrex;
                close=closerex;
                break;
     case '#' : pm=(void *)opennpm(argv[1]+1);
                get=getnpm;
                close=closenpm;
                break;
     case '%' : pm=(void *)openxpm(argv[1]+1,80);
                get=getxpm;
                close=closexpm;
                break;
    }
 if(pm==(void *)NULL)                    /* check to see if it opened */
    exit(1);

 while(gets(ln)!=NULL)
    {                                /* see if there hit on this line */
     if((*get)(pm,(byte *)ln,(byte *)(ln+strlen(ln)),SEARCHNEWBUF)
                    !=(byte *)NULL)
         puts(ln);
    }
 (*close)(pm);
 exit(0);
}


Copyright © Thunderstone Software     Last updated: Apr 15 2024
Copyright © 2024 Thunderstone Software LLC. All rights reserved.