14.5 Parsing Search Results

Our last example let us search another site and show its results alongside ours, but it has one drawback: it shows everything that site returns, including banner ads, tables, and other markup that clutters our output.

We'd like to extract just the essential data from the other site (the link, title, and abstract of each result) and then format it in our own style. The following script uses <TIMPORT> and REX expressions to parse the results from a popular search engine:


  <SCRIPT LANGUAGE=vortex>
  
  <A NAME=search>
    <urlcp timeout 10>
    Searching with
    <fmt '<A HREF="http://infoseek.go.com/Titles/?qt=%U">this url</A>'
       $query>.
    Results:<P>
    <submit METHOD=get URL=http://infoseek.go.com/Titles/ qt=$query>
    <$removeme = "\x0d" "<b>" ">><b =[^>]*>" "</b>">
    <sandr $removeme "" $ret>
    <$schema= "
      multiple
      recexpr >>\R<a href\=\x22http://=[^>]+>=!</a>+</a><br>=!<br>*<br>=
      #                            1    2 3     4        5     6    7
      #       Name            Type            Tag
      field   Link            varchar(40)     1-3
      field   Title           varchar(80)     4
      field   Abstract        varchar(180)    6
    ">
    <TIMPORT ROW $schema $ret>
      <send "$Link">$Title<send "</A>"><BR>
      $Abstract
      <P>
    </TIMPORT>
  </A>
  
  <A NAME=main>
    <FORM METHOD=post ACTION=$url/search.html>
      Search the web for: <INPUT NAME=query SIZE=20>
      <INPUT TYPE=submit>
    </FORM>
  </A>
  
  </SCRIPT>

Our <main> function presents a form that submits a query to our <search> function, which passes that query to the remote server with <submit>.
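For readers who know Python better than Vortex, the query-passing step can be sketched as follows. This is only a rough analogue, not Vortex code: the endpoint `SEARCH_URL` and the helper `build_search_url` are hypothetical names, and Python's encoding of spaces may differ from what `%U` produces.

```python
from urllib.parse import urlencode

# Hypothetical endpoint standing in for the remote search engine's URL.
SEARCH_URL = "http://search.example.com/Titles/"

def build_search_url(query: str) -> str:
    """Build a GET URL with the query URL-encoded in the qt parameter."""
    # urlencode escapes characters such as '&' and encodes spaces as '+'.
    return SEARCH_URL + "?" + urlencode({"qt": query})
```

Encoding the query before it goes into the URL keeps characters such as `&` or spaces in the user's input from being misread as part of the URL syntax.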

When the remote search returns, we first clean up the result HTML a bit, using <sandr> to remove carriage returns and the bold tags used for highlighting.
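The clean-up step amounts to replacing every match of several patterns with an empty string. A minimal Python sketch of the same idea, assuming patterns that roughly mirror the `$removeme` list (the names `REMOVE_PATTERNS` and `strip_highlighting` are ours, and Python regular expressions are not REX syntax):

```python
import re

# Rough equivalents of the $removeme list: carriage returns, <b>,
# <b ...attributes>, and </b>. Matching ignores case here, on the
# assumption that the original REX patterns do too.
REMOVE_PATTERNS = [r"\r", r"<b>", r"<b [^>]*>", r"</b>"]

def strip_highlighting(html: str) -> str:
    """Delete every match of each pattern, leaving the enclosed text intact."""
    for pattern in REMOVE_PATTERNS:
        html = re.sub(pattern, "", html, flags=re.IGNORECASE)
    return html
```

Removing these tags first means the record expression does not have to allow for highlighting markup in the middle of a title or abstract.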

Then we use <TIMPORT> to import the HTML, looking for the link, document title, and abstract of each search result. The record expression, recexpr, is a REX expression that matches each result in its entirety.

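For each of these records, we obtain the Link, Title and Abstract fields from numbered subexpressions of the record expression: Link spans subexpressions 1 through 3, Title is subexpression 4, and Abstract is subexpression 6. <TIMPORT> returns these fields as variables and loops over the records, once per search result.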

Inside the <TIMPORT> loop, we simply print out the result variables, in our own format.
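The record-matching and formatting steps can be sketched in Python as well. This is an illustration under assumptions, not a translation of <TIMPORT>: named groups stand in for the numbered subexpressions, and `RECORD`, `parse_results`, and `format_record` are hypothetical names.

```python
import re

# One pattern matches a whole result; named groups play the role of the
# Link, Title, and Abstract fields in the schema.
RECORD = re.compile(
    r'(?P<link><a href="http://[^>]+>)'  # the opening <a ...> tag
    r'(?P<title>.+?)</a><br>'            # text up to the closing </a><br>
    r'(?P<abstract>.*?)<br>',            # text up to the next <br>
    re.DOTALL,
)

def parse_results(html: str) -> list:
    """Return one dict of fields per search result found in the HTML."""
    return [m.groupdict() for m in RECORD.finditer(html)]

def format_record(rec: dict) -> str:
    """Print a result in our own layout: linked title, then the abstract."""
    return f'{rec["link"]}{rec["title"]}</A><BR>\n{rec["abstract"]}\n<P>\n'
```

A short usage example, with made-up input in the shape the pattern expects:

```python
page = '<a href="http://a.example/1">First</a><br>About one<br>'
for rec in parse_results(page):
    print(format_record(rec), end="")
```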

We can run this example to see the results (next page):

Copyright © 2024 Thunderstone Software LLC. All rights reserved.