SYNOPSIS
<rmcommon $data $template [$maxrm]>
DESCRIPTION
The rmcommon function removes the prefix and suffix text from each
$data value that is shared with the corresponding $template value.
Up to $maxrm characters are removed, rounded down to the nearest word
boundary; the default is the maximum amount of common text. This
function is useful for stripping common header and footer text from
web pages before indexing.
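The stripping described above can be approximated outside Vortex. The
following Python sketch is an illustration only, not the rmcommon
implementation: the name rmcommon_sketch and its helpers are
hypothetical, a "word boundary" is assumed to mean a position next to
whitespace, and the maxrm limit is assumed to apply to the prefix and
the suffix separately. The sample strings are taken from the EXAMPLE
section.

```python
def _common_prefix_len(a, b):
    """Number of leading characters that a and b share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def _is_boundary(text, i):
    """True if position i (between text[i-1] and text[i]) is next to
    whitespace or at either end of the text (assumed boundary rule)."""
    if i <= 0 or i >= len(text):
        return True
    return text[i - 1].isspace() or text[i].isspace()


def rmcommon_sketch(data, template, maxrm=None):
    """Strip the prefix and suffix that data shares with template."""
    # Prefix: shared length, capped at maxrm, rounded down to a boundary.
    n = _common_prefix_len(data, template)
    if maxrm is not None:
        n = min(n, maxrm)
    while n > 0 and not _is_boundary(data, n):
        n -= 1
    rest = data[n:]

    # Suffix: same idea on the reversed strings, using only what is left
    # of data so the prefix and suffix cannot overlap.
    m = _common_prefix_len(rest[::-1], template[::-1])
    if maxrm is not None:
        m = min(m, maxrm)
    while m > 0 and not _is_boundary(rest, len(rest) - m):
        m -= 1
    return rest[:len(rest) - m]


template = "Acme Industries, Inc. [Data] Home Next Previous"
data = "Acme Industries, Inc. Widgets and Gadgets Home Next Previous"
print(rmcommon_sketch(data, template))  # prints: Widgets and Gadgets
```

With a limit such as maxrm=10, this sketch removes less than the full
common text, because each cut is moved back to the nearest boundary
that fits within the limit.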
DIAGNOSTICS
rmcommon returns $data with its common prefix/suffix text removed.
EXAMPLE
<$template = "Acme Industries, Inc. [Data] Home Next Previous">
<rmcommon $data $template>
<SQL NOVARS "insert into webpages
values(counter, $Url, $ret)">
</SQL>
In the above example, $template is set to a template representative
of a typical (formatted) page from a web site, i.e. an actual fetched
page. Like all pages from this site, it contains the same title
prefix and navigation-bar suffix that we want to strip before
indexing, to prevent useless hits on "Acme", for example. By using
this template with <rmcommon> against every fetched page $data, the
prefix/suffix is stripped before insertion into the database. Thus,
if $data was initially "Acme Industries, Inc. Widgets and Gadgets
Home Next Previous", after the <rmcommon> call it would be inserted
as "Widgets and Gadgets".
CAVEATS
The rmcommon function was added in version 3.01.984600000 20010314.