rmcommon - remove common prefix/suffix from text


<rmcommon $data $template [$maxrm]>

The rmcommon function removes from each $data value the prefix and suffix text that it shares with the corresponding $template value. Up to $maxrm characters are removed, rounded down to the nearest word boundary; if $maxrm is not given, all of the common text is removed. This function is useful for stripping common header and footer text from web pages before indexing.

rmcommon returns, in $ret, each $data value with its common prefix/suffix text removed.


<$template = "Acme Industries, Inc. [Data] Home Next Previous">
<rmcommon $data $template>
<SQL NOVARS "insert into webpages
             values(counter, $Url, $ret)">
</SQL>

In the above example, $template is set to the text of a typical (formatted) page from a web site, i.e. an actual fetched page. Like all pages from this site, it contains the same title prefix and navigation-bar suffix, which we want to strip before indexing, for example to prevent useless hits on "Acme". By calling <rmcommon> with this template against every fetched page $data, the prefix and suffix are stripped before insertion into the database. Thus, if $data were initially "Acme Industries, Inc. Widgets and Gadgets Home Next Previous", after the <rmcommon> call it would be inserted as "Widgets and Gadgets".
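
The optional $maxrm argument can be added to the same call to cap how much text is stripped. The following is only a minimal sketch and is not part of the original example: the limit of 10 and the literal $data value are illustrative assumptions. The exact text left in $ret depends on where the word boundaries fall, so no result is shown.

<$template = "Acme Industries, Inc. [Data] Home Next Previous">
<$data = "Acme Industries, Inc. Widgets and Gadgets Home Next Previous">
<rmcommon $data $template 10>

As in the first example, the stripped text is then available in $ret.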

The rmcommon function was added in version 3.01.984600000 20010314.

Copyright © Thunderstone Software     Last updated: Oct 24 2023