14.6 Searching Multiple Sites Concurrently


Our previous example showed how to parse a single site's search results and present them in our own format. This gave us the basics of a proxy or portal site.

We can expand this to search multiple sites simultaneously with <fetch PARALLEL>. The following example is a condensed version of our Meta-search demo:


  <SCRIPT LANGUAGE=vortex>
  
  <A NAME=vars>
    <$rawurls =
     "http://infoseek.go.com/Titles/?qt=xyzzy"
     "http://www.excite.com/search.gw?s=xyzzy&c=web&start=0&showSummary=true"
     "http://search.thunderstone.com/texis/websearch/?q=xyzzy&max=10&w3meta=1"
    >
    <$names = 
     "Infoseek"
     "Excite"
     "Thunderstone"
    >
    <$schemas =
     "#Infoseek
     multiple
     recexpr >>\R<a href\=\x22http://=[^>]+>=!</a>+</a><br>=!<br>*<br>=
     #                            1    2 3     4        5     6    7
     #       Name            Type            Tag
     field   Link            varchar(40)     1-3
     field   Title           varchar(80)     4
     field   Abstract        varchar(180)    6
     "
     "#Excite
     multiple
     recexpr >>\% </SMALL>\x0a=<a href=[^>]+>=[^<]+</a>&nbsp;\x0a\- =[^<]+
     #                        1       2    3 4    5                 6    7
     #       Name            Type            Tag
     field   Link            varchar(40)     2-4
     field   Title           varchar(80)     5
     field   Abstract        varchar(180)    7
     "
     "#Thunderstone
     multiple
     recexpr >><dt>=[^<\x0a]+<a =[^>]+>=[^<\x0a]+</a><dd>=[^<\x0a]+<tt>=[^<\x0a]+</tt><br><i>=[^<\x0a]+</i><br><small><a =[^>]+>=[^<\x0a]+</a>=[^<\x0a]+</small><p>\x0a
     #             1        2   3    4 5        6        7        8    9        0           1        2                  3    4 5        6    7        8     9
     #       Name            Type            Tag
     field   Link            varchar(40)     3-5
     field   Title           varchar(80)     6
     field   Abstract        varchar(180)    8
     "
    >
    <$removeme = "\x0d" "<b>" ">><b =[^>]*>" "</b>">
  </A>
  
  <A NAME=search>
    <vars>
    <strfmt "%U" $query>                  <!-- URL-escape query -->
    <sandr "[\?\#\{\}\+\\]" "\\\1" $ret>  <!-- escape sandr replace chars -->
    <sandr "xyzzy" $ret $rawurls>         <!-- put the query in the URLs -->
    <$fetchurls = $ret>                   <!-- URLs plus user's query -->
    <urlcp timeout 20>
    <fetch PARALLEL URLS=$fetchurls $names $schemas>
      <sandr $removeme "" $ret>
      <$html = $ret>
      <hr>
      Results from $names:<P>
      <TIMPORT MAX=3 ROW $schemas $html>
        <send "$Link">$Title<send "</A>"><BR>
        $Abstract
        <P>
      </TIMPORT>
      <flush>
    </fetch>
  </A>
  
  <A NAME=main>
    <FORM METHOD=post ACTION=$url/search.html>
      Search multiple engines for: <INPUT NAME=query SIZE=20>
      <INPUT TYPE=submit>
    </FORM>
  </A>
  
  </SCRIPT>

This script is an expanded version of our previous example. The first function, <vars>, sets up several parallel variables with site-dependent information. For each site, we need to know:

Its raw search URL ($rawurls)
Its "vanilla" site name ($names)
A <TIMPORT> schema to parse its data ($schemas)

The <search> function is modified from our previous example. Since we have several sites to search, each with different form variables, we can't use <submit>. Instead, we use <strfmt> to URL-escape the user's query and <sandr> to substitute it for the xyzzy placeholder in each raw search URL's query string, giving us $fetchurls.
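The substitution step can be sketched outside Vortex as well. The following is a rough Python analogy, not Vortex code: the helper name build_fetch_urls and the sample query are assumptions made for illustration, while the two URLs are copied from the script above. Python's str.replace treats the replacement text literally, so the extra <sandr> escaping step in the Vortex script has no counterpart here.

```python
from urllib.parse import quote

# Raw search URLs with "xyzzy" as the query placeholder (copied from the script above).
RAW_URLS = [
    "http://infoseek.go.com/Titles/?qt=xyzzy",
    "http://search.thunderstone.com/texis/websearch/?q=xyzzy&max=10&w3meta=1",
]

def build_fetch_urls(query, raw_urls, placeholder="xyzzy"):
    """URL-escape the query once, then substitute it into every raw URL."""
    escaped = quote(query, safe="")  # escape everything, including "&" and "/"
    return [url.replace(placeholder, escaped) for url in raw_urls]

# Hypothetical user query containing characters that must be escaped.
fetch_urls = build_fetch_urls("texis & vortex", RAW_URLS)
```

Escaping before substitution matters: an unescaped "&" in the query would otherwise be read as the start of a new form variable by the target site.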

Now we're ready for the actual search. We call <fetch> with the PARALLEL flag to tell it we're fetching multiple URLs simultaneously. <fetch> will start retrieving every URL in $fetchurls at once.

As soon as the first site finishes, its data is returned in $ret and the <fetch PARALLEL> block is entered. We then parse the data as in our previous script, using <TIMPORT> with the appropriate $schemas value.

When the next site returns, <fetch PARALLEL> will loop again, and we parse that site. We proceed for each site that returns. (The PARALLEL flag makes <fetch> a looping statement that takes a matching </fetch>.)

Fastest first

It is important to note that with the PARALLEL syntax, the sites are not simply searched one at a time in the order they appear in $fetchurls. That would waste time. Instead, PARALLEL searches all sites at the same time, so the fastest site is returned first, followed by the next fastest, and so on. If one site is slow or down, it won't delay our access to the others. Rather than taking the sum of all search times to run, our script takes only as long as the single slowest site. Neat, huh?
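To make the timing claim concrete, here is a small Python sketch of the same idea. It is an analogy only and says nothing about how Vortex implements <fetch PARALLEL>: the site names and delays are invented, and fake_fetch simply sleeps instead of touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sites and simulated response delays in seconds; no network is used.
SIMULATED_DELAYS = {"slow-site": 0.4, "fast-site": 0.05, "medium-site": 0.2}

def fake_fetch(site):
    """Stand-in for a real HTTP fetch: sleep for the site's delay."""
    time.sleep(SIMULATED_DELAYS[site])
    return site

def fetch_all(sites):
    """Start every fetch at once; return sites in completion order plus elapsed time."""
    start = time.monotonic()
    finished = []
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        futures = [pool.submit(fake_fetch, site) for site in sites]
        # as_completed hands back each fetch as soon as it finishes.
        for future in as_completed(futures):
            finished.append(future.result())
    return finished, time.monotonic() - start

order, elapsed = fetch_all(list(SIMULATED_DELAYS))
```

With these delays the sites come back fastest first regardless of the order they were listed in, and the elapsed time tracks the slowest delay (about 0.4 seconds) rather than the 0.65-second sum.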

Parallel variables

The other trick with PARALLEL is that we don't know what order the sites will return in; the fastest site may vary from query to query. So how do we know which schema to apply inside the loop? That's where the extra variables given to <fetch PARALLEL> come in. Any variables given after the URLS argument will be looped over in the <fetch> block, but in the same order as the returned sites. So whichever site returns first, we know that $schemas is set to the corresponding schema, and $names is set to the corresponding name.
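The same bookkeeping can be sketched in Python to show why it is needed. This is an analogy under stated assumptions, not a description of Vortex internals: the URLs, names, schema labels, and delays are all hypothetical, and fake_fetch sleeps rather than fetching. Each pending fetch remembers the index of its site, so the matching name and schema can be looked up whenever it completes.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Three parallel lists: index i of each describes the same hypothetical site.
URLS = ["http://a.example/", "http://b.example/", "http://c.example/"]
NAMES = ["Site A", "Site B", "Site C"]
SCHEMAS = ["schema-a", "schema-b", "schema-c"]
DELAYS = [0.4, 0.05, 0.2]  # simulated response times in seconds

def fake_fetch(index):
    """Stand-in for a real fetch: sleep, then return a fake page body."""
    time.sleep(DELAYS[index])
    return "<html>page from " + URLS[index] + "</html>"

def fetch_with_metadata():
    """Yield (name, schema, html) in completion order, each matched to its site."""
    with ThreadPoolExecutor(max_workers=len(URLS)) as pool:
        # Remember which list index each pending fetch belongs to.
        index_of = {pool.submit(fake_fetch, i): i for i in range(len(URLS))}
        for future in as_completed(index_of):
            i = index_of[future]
            yield NAMES[i], SCHEMAS[i], future.result()

results = list(fetch_with_metadata())
```

Here the results arrive as Site B, Site C, Site A, yet each one carries its own schema, which is the guarantee the extra <fetch PARALLEL> variables provide in the script above.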

We ran this script and got the following results (next page):
