fetch - fetch URLs

 

SYNOPSIS

<fetch [METHOD=$methods] [RENAMETO=$path] $theurl [$rawdoc]>
or
<fetch PARALLEL[=n] [METHOD=$methods] [RENAMETO=$path] [URLS=]$urls
       [$loopvar ...]>
  ...
</fetch>


DESCRIPTION
The fetch function retrieves URLs, using the HTTP GET method (or its equivalent for other protocols), and returns the (unformatted) documents in $ret. The METHOD option, added in version 3.01.971300000 20001011, specifies a parallel list of alternate method(s); each method may be one of OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, MKDIR, or RENAME. Not all methods are supported by all protocols. Some methods are mapped to an equivalent method when used with non-HTTP protocols. If the RENAME method is set (valid only for ftp URLs), the RENAMETO option must also be set to the target file path.
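
For illustration, here is a minimal sketch of a single-URL fetch with an alternate method, and of an ftp rename; the host names and file paths are placeholders, not taken from this manual:

<!-- Sketch: HEAD-fetch a page (placeholder host) -->
<$method = "HEAD">
<$theurl = "http://www.example.com/">
<fetch METHOD=$method $theurl>

<!-- Sketch: rename a file over FTP; RENAMETO gives the new path -->
<$method  = "RENAME">
<$newpath = "/pub/newname.txt">
<$theurl  = "ftp://ftp.example.com/pub/oldname.txt">
<fetch METHOD=$method RENAMETO=$newpath $theurl>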

With the first syntax, a single URL (the first value of $theurl) is retrieved. The urlinfo function can then be called to obtain more information about the page just fetched. If a second argument ($rawdoc) is given, it is used as the raw document source, instead of actually fetching the URL. This provides a way to obtain the text, links, etc. of arbitrary script-generated HTML that isn't present on an actual web page. (To just HTML-decode an arbitrary string without tag removal, word-wrapping or link parsing, see the %H code to strfmt with the ! flag.)
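
As a rough sketch of the second-argument form, script-built HTML can be handed to fetch directly; the markup and URL below are made up for illustration, and the links item to urlinfo is assumed to be available as described in its documentation:

<!-- Sketch: examine script-generated HTML without a network fetch -->
<$rawdoc = "<a href=http://www.example.com/next.html>Next</a>">
<$theurl = "http://www.example.com/docs/index.html">
<fetch $theurl $rawdoc>     <!-- $rawdoc is used; $theurl is not actually fetched -->
<urlinfo links>             <!-- assumed urlinfo item: links parsed from the raw doc -->
<send $ret>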

With the second (loop) syntax, all of the URLs in $urls are fetched in parallel, that is, simultaneously. Once the first (i.e. quickest) URL is completed, the fetch loop is entered, with $ret set to the raw document source. As subsequent URLs are completed, the loop is iterated again, once for each member of $urls. Inside the loop, the urlinfo function can be used to retrieve further information about the URL just completed.

It is important to note with the loop syntax that URLs are returned fastest-first, which might not be the order in which they appear in $urls. For example, suppose two URLs are being fetched, where the first takes 10 seconds to download and the second takes 3 seconds. With the parallel loop syntax, the second will probably be returned first, after 3 seconds; then 7 seconds later the first will be completed. A URL that refers to an unresponsive web server will not hold up other URLs; it is merely returned last, when it times out.

As an aid in tracking which URL was returned in each iteration, the $urls variable and any additional $loopvar variables are looped over inside the fetch, but in the order the URLs are returned rather than their original order. Thus, inside the loop, $urls is set to the URL just retrieved.

The special variables $loop and $next are set and incremented inside the loop as well: $loop starts at 0, $next at 1.
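
A bare-bones sketch of this loop bookkeeping (the output markup is only illustrative):

<!-- Sketch: report completion order; $urls is the URL just retrieved -->
<fetch PARALLEL $urls>
  Fetch $loop completed: $urls <br>
</fetch>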

If an argument to PARALLEL is given, only that many URLs will be fetched simultaneously; the remaining ones are started only as the first ones complete. The default (no argument to PARALLEL) is to start fetching all URLs initially (in version 4 and earlier) or only 10 at a time (version 5 and later).
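
For example, to keep no more than three transfers in flight at once (the limit of 3 is arbitrary, chosen only for this sketch):

<!-- Sketch: at most 3 simultaneous fetches; the rest start as these finish -->
<fetch PARALLEL=3 $urls>
  Got $urls <br>
</fetch>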


DIAGNOSTICS
fetch returns the raw document just fetched (after content/transfer encodings decoded), or the value of $rawdoc.


EXAMPLE
This example uses the loop syntax to search multiple search engines simultaneously. First, the user's query is placed into each URL with sandr. Then the resulting URLs are fetched; because of the PARALLEL flag, the fastest engine will return first. Each page is then post-processed to remove HTML outside the <BODY>/</BODY> tags - since there will be multiple pages concatenated together - and displayed following a <BASE> tag so that the user's browser knows the proper path for the links:

<urlcp timeout 10>
<$rooturls =
  "http://www.searchsite1.com/cgi-bin/search?q=@@@"
  "http://www.searchsite2.com/findit?query=@@@"
  "http://www.searchsite3.com/cgi-bin/whereis?q=@@@&cmd=search"
>
<strfmt "%U" $query>              <!-- URL-escape query -->
<sandr "[\?\#\{\}\+\\]" "\\\1" $ret>  <!-- make sandr-safer -->
<sandr "@@@" $ret $rooturls>      <!-- and insert into URLs -->
<$urls = $ret>
<BODY BGCOLOR=white>
<fetch PARALLEL $urls>
  <sandr ".*>><body=[^>]+>=" "" $ret>  <!-- strip non-BODY -->
  <sandr "</body>=.*"        "" $ret>
  <$html = $ret>
  <urlinfo actualurl>
  <BASE HREF="$ret">              <!-- tell browser the base -->
  <send $html>                    <!-- print site results -->
</fetch>
</BODY>


CAVEATS
The PARALLEL syntax to fetch was added in version 2.1.902500000 19980807. Support for FTP was added in June 1998; for Gopher in version 2.6.938200000 19990924; for HTTPS and javascript: URLs on June 17 2002; and for file:// URLs in version 4.02.1048785541 20030327. Protected file:// URLs (requiring a user/password) are supported for Windows as of version 5.01.1123012366 20050802. RENAMETO was added in version 5.01.1173117000 20070305.

All URLs are returned, even those that cause an error (empty string returned). The urlinfo function can then be used to obtain the error code.
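
A hedged sketch of detecting such failures by their empty return value (urlinfo could then be consulted for the specific error code):

<!-- Sketch: flag URLs that returned an error (empty document) -->
<fetch PARALLEL $urls>
  <if $ret eq "">
    Failed: $urls <br>
  <else>
    Fetched: $urls <br>
  </if>
</fetch>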

In versions prior to 3.0.949000000 20000127, or if <urlcp dnsmode sys> is set, domain name resolution cannot be parallelized due to C library constraints. Thus, an unresponsive name server (as opposed to a web server) may hold up other URLs, or even exceed the <urlcp timeout> setting. In some versions, parallel FTP retrieval is not supported.

Note that $loop and $next are merely incremented inside the loop: they do not necessarily correspond to the array index of the currently returned URL.

As little work as possible should occur inside a fetch loop, as any time-consuming commands could cause in-progress fetches to time out.
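
One way to honor this, sketched below under the assumption that $pages starts out empty before the loop, is to merely accumulate results inside the fetch loop and defer heavier processing until after it ends:

<!-- Sketch: keep the fetch loop body minimal -->
<fetch PARALLEL $urls>
  <$pages = $pages $ret>      <!-- just collect the raw documents -->
</fetch>
<loop $pages>
  <!-- slower per-page processing can safely happen here -->
</loop>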


SEE ALSO
submit, urlinfo, urlcp


Copyright © 2019 Thunderstone Software LLC. All rights reserved.     Last updated: Sep 25 2019