14.6 Searching Multiple Sites Concurrently
Our previous example showed how to parse a single site's search results and present them in our own format. This gave us the basics of a proxy or portal site.
We can expand this to search multiple sites simultaneously with <fetch PARALLEL>. The following example is a condensed version of our Meta-search demo:
<SCRIPT LANGUAGE=vortex>

<A NAME=vars>
  <$rawurls =
"http://infoseek.go.com/Titles/?qt=xyzzy"
"http://www.excite.com/search.gw?s=xyzzy&c=web&start=0&showSummary=true"
"http://search.thunderstone.com/texis/websearch/?q=xyzzy&max=10&w3meta=1"
  >
  <$names = "Infoseek" "Excite" "Thunderstone" >
  <$schemas =
"#Infoseek
multiple
recexpr >>\R<a href\=\x22http://=[^>]+>=!</a>+</a><br>=!<br>*<br>=
# 1 2 3 4 5 6 7
# Name Type Tag
field Link varchar(40) 1-3
field Title varchar(80) 4
field Abstract varchar(180) 6
"
"#Excite
multiple
recexpr >>\% </SMALL>\x0a=<a href=[^>]+>=[^<]+</a> \x0a\- =[^<]+
# 1 2 3 4 5 6 7
# Name Type Tag
field Link varchar(40) 2-4
field Title varchar(80) 5
field Abstract varchar(180) 7
"
"#Thunderstone
multiple
recexpr >><dt>=[^<\x0a]+<a =[^>]+>=[^<\x0a]+</a><dd>=[^<\x0a]+<tt>=[^<\x0a]+</tt><br><i>=[^<\x0a]+</i><br><small><a =[^>]+>=[^<\x0a]+</a>=[^<\x0a]+</small><p>\x0a
# 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9
# Name Type Tag
field Link varchar(40) 3-5
field Title varchar(80) 6
field Abstract varchar(180) 8
"
  >
  <$removeme = "\x0d" "<b>" ">><b =[^>]*>" "</b>">
</A>

<A NAME=search>
  <vars>
  <strfmt "%U" $query>                    <!-- URL-escape query -->
  <sandr "[\?\#\{\}\+\\]" "\\\1" $ret>    <!-- escape sandr replace chars -->
  <sandr "xyzzy" $ret $rawurls>           <!-- put the query in the URLs -->
  <$fetchurls = $ret>                     <!-- URLs plus user's query -->
  <urlcp timeout 20>
  <fetch PARALLEL URLS=$fetchurls $names $schemas>
    <sandr $removeme "" $ret>
    <$html = $ret>
    <hr>
    Results from $names:<P>
    <TIMPORT MAX=3 ROW $schemas $html>
      <send "$Link">$Title<send "</A>"><BR>
      $Abstract
      <P>
    </TIMPORT>
    <flush>
  </fetch>
</A>

<A NAME=main>
  <FORM METHOD=post ACTION=$url/search.html>
    Search multiple engines for:
    <INPUT NAME=query SIZE=20>
    <INPUT TYPE=submit>
  </FORM>
</A>

</SCRIPT>
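This script is an expanded version of our previous example. The first function, <vars>, sets up several parallel variables with site-dependent information. For each site, we need to know: the raw search URL, with the placeholder xyzzy where the query goes ($rawurls); the name of the site ($names); and the <TIMPORT> schema for parsing its results ($schemas).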
The <search> function is modified from our previous example. Since we have several sites to search, with different form variables for each, we can't use <submit>. Instead, we use <strfmt> and <sandr> to insert the user's query into our raw search URLs' query strings, giving us $fetchurls.
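The substitution step can be sketched outside Vortex as well. The following Python is only an analogy, not part of the original script: `build_fetch_urls` is a hypothetical helper, and `urllib.parse.quote` merely stands in for `<strfmt "%U">` (its exact escaping may differ from Vortex's). The URLs and the `xyzzy` placeholder are taken from the script above.

```python
from urllib.parse import quote

# Raw search URLs from the script above; "xyzzy" marks where the query goes.
RAW_URLS = [
    "http://infoseek.go.com/Titles/?qt=xyzzy",
    "http://www.excite.com/search.gw?s=xyzzy&c=web&start=0&showSummary=true",
    "http://search.thunderstone.com/texis/websearch/?q=xyzzy&max=10&w3meta=1",
]


def build_fetch_urls(query, raw_urls=RAW_URLS, placeholder="xyzzy"):
    """Return raw_urls with the URL-escaped query in place of the placeholder.

    Hypothetical helper for illustration only.
    """
    # URL-escape the query (the role <strfmt "%U" $query> plays in the script).
    escaped = quote(query, safe="")
    # str.replace inserts its replacement literally, so the extra step that
    # escapes sandr's special replace characters has no counterpart here.
    return [url.replace(placeholder, escaped) for url in raw_urls]


fetch_urls = build_fetch_urls("texis & vortex?")
for url in fetch_urls:
    print(url)
```

Keeping one placeholder across all raw URLs means each site's differing form variables (`qt`, `s`, `q`) stay in the URL data, and the substitution code is the same for every site.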
Now we're ready for the actual search. We call <fetch PARALLEL> with the URLs in $fetchurls. As soon as the first site finishes, its data is returned in $ret and the <fetch PARALLEL> block is entered. We then parse the data as in our previous script, using <TIMPORT> with the appropriate $schemas value.
When the next site returns, <fetch PARALLEL> will loop again, and we parse that site. We proceed for each site that returns. (The PARALLEL flag makes <fetch> a looping statement that takes a matching </fetch>.)
It is important to note here that with the PARALLEL syntax, the sites are not simply searched one at a time in the order they appear in $fetchurls. That would waste time. Instead, PARALLEL searches all sites at the same time. So the fastest site is returned first, followed by the next fastest, etc. If one site is down, it won't slow down our access to the others. Rather than taking the sum of all search times to run, our script only takes as much time as the single slowest site. Neat, huh?
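The timing claim can be illustrated with a small simulation. This is a Python sketch rather than Vortex, and it makes no claim about how <fetch PARALLEL> is implemented: `fake_fetch` is a hypothetical stand-in that sleeps instead of touching the network, and the delays are invented for the demonstration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical response times in seconds; real sites vary from query to query.
DELAYS = {"Infoseek": 0.4, "Excite": 0.1, "Thunderstone": 0.25}


def fake_fetch(name):
    """Stand-in for a network fetch: sleep, then return a dummy page."""
    time.sleep(DELAYS[name])
    return name, "<html>results from %s</html>" % name


def fetch_all_parallel(names):
    """Start every fetch at once; return the site names in completion order."""
    finished = []
    with ThreadPoolExecutor(max_workers=len(names)) as pool:
        futures = [pool.submit(fake_fetch, n) for n in names]
        # as_completed yields each future as soon as its fetch is done,
        # regardless of the order in which the fetches were submitted.
        for fut in as_completed(futures):
            name, _page = fut.result()
            finished.append(name)
    return finished


start = time.monotonic()
order = fetch_all_parallel(list(DELAYS))
elapsed = time.monotonic() - start
print("completion order:", order)
print("elapsed: %.2fs (sum of delays: %.2fs)" % (elapsed, sum(DELAYS.values())))
```

With these delays the elapsed time should sit close to the slowest simulated site (0.4s) rather than the 0.75s a one-at-a-time loop would need, and the names come back shortest delay first.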
The other trick with PARALLEL is we don't know what order the sites will return in - the fastest site may vary from query to query. So how do we know to apply the correct schema inside the loop? That's where the extra vars to <fetch PARALLEL> come in. Any variables given after the URLS argument will be looped over in the <fetch> block, but in the same order as the returned sites. So whatever site returns first, we know that $schemas is set to the corresponding schema, and $names is set to the corresponding name.
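The same bookkeeping can be sketched in Python, again as an analogy and not as a description of Vortex internals. Here the pairing that <fetch PARALLEL> does for us has to be done by hand: each pending fetch remembers its index into the parallel lists. The schema strings, the delays, and `fake_fetch` are placeholders invented for the example.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

# Parallel lists, as in the <vars> function: index i describes site i.
NAMES = ["Infoseek", "Excite", "Thunderstone"]
SCHEMAS = ["#Infoseek schema", "#Excite schema", "#Thunderstone schema"]  # placeholders
DELAYS = [0.3, 0.1, 0.2]  # hypothetical response times in seconds


def fake_fetch(i):
    """Stand-in for fetching site i: sleep, then return a dummy page."""
    time.sleep(DELAYS[i])
    return "<html>page %d</html>" % i


def results_in_completion_order():
    """Return (name, schema, page) tuples in the order the sites finish."""
    rows = []
    with ThreadPoolExecutor(max_workers=len(NAMES)) as pool:
        # Remember which list index each pending fetch belongs to.
        index_of = {pool.submit(fake_fetch, i): i for i in range(len(NAMES))}
        for fut in as_completed(index_of):
            i = index_of[fut]
            # Whatever site finished, pick its own name and schema.
            rows.append((NAMES[i], SCHEMAS[i], fut.result()))
    return rows


rows = results_in_completion_order()
for name, schema, page in rows:
    print(name, "->", schema, "->", page)
```

Whatever order the rows arrive in, each one carries the name and schema that belong to the page it holds, which is the guarantee the extra variables after URLS provide inside the <fetch> block.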
We ran this script and got the following results (next page):