fetch - fetch URLs



<fetch [METHOD=$methods] [RENAMETO=$path] $urls [$rawdoc]>
<fetch PARALLEL[=n] [METHOD=$methods] [RENAMETO=$path] [URL[S]=]$urls
       [$loopvar ...]>

The <fetch> function retrieves URLs using the HTTP GET method (or its equivalent for other protocols), and returns the (unformatted) documents in ret. The METHOD option, added in version 3.01.971300000 20001011, specifies a parallel list of alternate methods; each method may be one of OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, MKDIR, or RENAME. Not all methods are supported by all protocols; some methods are mapped to an equivalent method when used with non-HTTP protocols. If the RENAME method is used (valid only for FTP URLs), the RENAMETO option must also be set, to the target file path.
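For example, an FTP file might be renamed with the RENAME method; the server and file paths below are hypothetical:

<fetch METHOD=RENAME RENAMETO="/pub/report-new.txt"
       "ftp://ftp.example.com/pub/report-old.txt">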

With the non-looping syntax, a single URL (the first value of urls) is retrieved. If the rawdoc argument is given, it is used as the source of the returned raw document, instead of actually fetching the URL. This provides a way to obtain the text, links, etc. of arbitrary script-generated HTML that isn't present on an actual web page. (To just HTML-decode an arbitrary string without tag removal, word-wrapping or link parsing, see the %H code to <strfmt> with the ! flag.) Whether actually fetched or provided by rawdoc, ret is set to the returned raw document when the statement finishes. The <urlinfo> function can then be called to obtain more information about the document.
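As a sketch, rawdoc might be used to parse links out of HTML generated by the script itself; the URL and HTML below are arbitrary placeholders, and the available <urlinfo> items are described under <urlinfo>:

<$html = "<html><body><a href=http://example.com/>A link</a></body></html>">
<fetch "http://placeholder.example.com/" $html>  <!-- no network fetch -->
<urlinfo links>                   <!-- links parsed from $html -->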

With the looping syntax, all of the URLs in urls are fetched in parallel, that is, simultaneously. Once the first (i.e. quickest) URL is completed, the <fetch> loop is entered, with ret set to the returned raw document source. As subsequent URLs are completed, the loop is iterated again, once for each member of urls. Inside the loop, the <urlinfo> function can be used to retrieve further information about the URL just completed.

It is important to note with the loop syntax that URLs are returned fastest-first, which might not be the order in which they appear in urls. For example, suppose two URLs are being fetched, where the first takes 10 seconds to download and the second 3 seconds. With the parallel loop syntax, the second will probably be returned first, after 3 seconds; 7 seconds later the first will complete. A URL that refers to an unresponsive web server will not hold up the other URLs; it is merely returned last, when it times out.

As an aid in tracking which URL was returned in each iteration, the urls argument (if a variable) and any loopvar variables (given after all options and urls) are also looped over by <fetch>, in the same order as the returned URLs. Thus, inside the loop, urls is set to the URL just retrieved.

The special variables loop and next are set and incremented inside the loop as well: loop starts at 0, next at 1.
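A sketch of this tracking, with hypothetical URLs and a parallel list of labels as a loopvar:

<$urls = "http://one.example.com/" "http://two.example.com/">
<$labels = "Engine One" "Engine Two">
<fetch PARALLEL URLS=$urls $labels>
  <!-- $urls and $labels now correspond to the URL just completed -->
  <send "Done: $urls ($labels), iteration $loop">
</fetch>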

If an argument to PARALLEL is given, only that many URLs will be fetched simultaneously; the remaining ones are started as earlier ones complete. The default (no argument to PARALLEL) is to start fetching all URLs at once (in version 4 and earlier) or only 10 at a time (version 5 and later).
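For example, to limit a large fetch to at most 3 simultaneous connections (a sketch; $urls is assumed to already hold the list of URLs):

<fetch PARALLEL=3 URLS=$urls>
  <urlinfo actualurl>             <!-- final URL, after any redirects -->
  <!-- process the raw document in $ret here -->
</fetch>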

<fetch> returns the raw document just fetched (after content/transfer encodings are decoded), or the value of rawdoc if it was given.

This example uses the loop syntax to search multiple search engines simultaneously. First, the user's query is placed into each URL with <sandr>. Then the resulting URLs are fetched; because of the PARALLEL flag, the fastest engine will return first. Each page is then post-processed to remove HTML outside the <BODY>/</BODY> tags - since there will be multiple pages concatenated together - and displayed following a <BASE> tag so that the user's browser knows the proper path for the links:

<urlcp timeout 10>
<$rooturls =
  "http://engine1.example.com/search?q=@@@"   <!-- hypothetical search- -->
  "http://engine2.example.com/query?q=@@@"    <!-- engine URL templates -->
>
<strfmt "%U" $query>              <!-- URL-escape query -->
<sandr "[\?#\{\}\+\\]" "\\\1" $ret>  <!-- make sandr-safer -->
<sandr "@@@" $ret $rooturls>      <!-- and insert into URLs -->
<$urls = $ret>
<fetch PARALLEL urls=$urls>
  <sandr ".*>><body=[^>]+>=" "" $ret>  <!-- strip non-BODY -->
  <sandr "</body>=.*"        "" $ret>
  <$html = $ret>
  <urlinfo actualurl>
  <BASE HREF="$ret">              <!-- tell browser the base -->
  <send $html>                    <!-- print site results -->
</fetch>

The PARALLEL syntax to <fetch> was added in version 2.1.902500000 19980807. Support for FTP was added in June 1998, Gopher in version 2.6.938200000 19990924, HTTPS and javascript: URLs on June 17 2002, and file:// URLs in version 4.02.1048785541 20030327. Protected file:// URLs (requiring user/password) were supported for Windows in version 5.01.1123012366 20050802. RENAMETO was added in version 5.01.1173117000 20070305.

All URLs are returned, even those that cause an error (for which an empty string is returned). The <urlinfo> function can then be used to obtain the error code.

In versions prior to 3.0.949000000 20000127, or if <urlcp dnsmode sys> is set, domain name resolution cannot be parallelized, due to C library constraints. Thus, an unresponsive name server (as opposed to a web server) may hold up other URLs, or even cause the <urlcp timeout> setting to be exceeded. In some versions, parallel FTP retrieval is not supported.

Note that loop and next are merely incremented inside the loop: they do not necessarily correspond to the array index of the currently returned URL.

As little work as possible should occur inside a <fetch> loop, as any time-consuming commands could cause in-progress fetches to time out.

submit, urlinfo, urlcp

Copyright © Thunderstone Software     Last updated: Aug 4 2020