fetch - fetch URLs


<fetch [PARALLEL[=n]] [options] URL[S]=$urls [$loopvar ...] [/]>
[  ...
 </fetch>]

or in version 7 and earlier syntax (deprecated; see syntaxversion pragma, here) either:

<fetch [options] $urls [$downloaddoc]>
<fetch PARALLEL[=n] [options] [URL[S]=]$urls [$loopvar ...]>

The <fetch> function retrieves URLs, using the HTTP GET method (or its equivalent for other protocols), and returns the (unformatted) documents in $ret. Options:


  • METHOD=$methods

    Specifies a parallel (to $urls) list of alternate method(s); each method may be one of OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, MKDIR, or RENAME. Not all methods are supported by all protocols. Some methods are mapped to an equivalent method when used with non-HTTP protocols. If the RENAME method is set (valid only for FTP URLs), the RENAMETO option must also be set to the target file path. Added in version 3.01.971300000 20001011.

  • RENAMETO=$path

    Gives the file path to rename the URL to. Only valid with METHOD=RENAME and FTP URLs. Added in version 5.01.1173117000 20070305.

  • STATUSLINE=$status

    If non-empty, parse $status as the HTTP status line for the response, e.g. "HTTP/1.1 404 Not Found". Only valid with $downloaddoc. Added in version 7.07.1601332714 20200928.

  • HEADERS=$hdrs

    Parse list $hdrs as the HTTP response headers. Each item is a single header, e.g. "Content-Type: application/pdf". Only valid with $downloaddoc. Added in version 7.07.1601332714 20200928.

  • ERRTOKEN=$errtoken

    If non-empty, set $errtoken as the error token for <urlinfo errtoken> (and thus indirectly the <urlinfo errnum> and <urlinfo errmsg> return values as well). Must be a valid token returned by <urlinfo errtoken>, or a number returned by <urlinfo errnum>.

    Errors like DocUnauth are usually automatically derived and set by <fetch> from the user data (e.g. a STATUSLINE of "HTTP/1.1 401 Access Denied"), but errors like ConnTimeout would not be settable with user data (e.g. there is no HTTP code for timeout). In such cases, the ERRTOKEN option can be used to set a <fetch>-standard error code for later <urlinfo errtoken> processing.

    Only valid with $downloaddoc. Overrides any error token induced by status line, headers etc. (e.g. DocUnauth). Added in version 7.07.1605575000 20201116.

  • DOWNLOADDOC=$downloaddoc

    Instead of fetching the given URL, use $downloaddoc as the downloaded (over-the-wire) content. Only valid in a non-looping/non-parallel <fetch>. The STATUSLINE, HEADERS, and/or ERRTOKEN options are only valid when this option is specified. When given, only $downloaddoc is processed: no later redirects etc. will be fetched, even if indicated by processing of HEADERS etc. Some processing and/or information normally available with over-the-wire fetches may not occur or be available, e.g. totaltime. Note that in version 7 syntax, DOWNLOADDOC is not a named option: $downloaddoc must be given as an optional argument immediately after $urls, and only in the non-looping syntax.
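For example, a previously captured raw response might be re-processed through <fetch> as follows (a sketch in version 8 syntax; $savedurl, $rawdoc, $status and $hdrs are illustrative variables, not <fetch> options):

<$status = "HTTP/1.1 200 OK">
<$hdrs = "Content-Type: text/html; charset=utf-8">
<!-- process $rawdoc as if it had just been fetched from $savedurl: -->
<fetch URL=$savedurl DOWNLOADDOC=$rawdoc STATUSLINE=$status HEADERS=$hdrs />
<!-- $ret is now the processed document; no network traffic occurred -->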

With the non-looping syntax, a single URL (the first value of $urls) is retrieved. Whether content is actually fetched or provided by $downloaddoc, $ret is set to the returned raw document when the statement finishes. The <urlinfo> function (here) can then be called to obtain more information about the document.
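A minimal non-looping fetch (self-closed, in version 8 syntax) might look like this; the URL is illustrative:

<fetch URL="http://www.example.com/" />
<$doc = $ret>          <!-- save the document; <urlinfo> overwrites $ret -->
<urlinfo actualurl>    <!-- $ret is now the final (e.g. post-redirect) URL -->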

With the looping syntax, all of the URLs in $urls are fetched in parallel, that is, simultaneously. Once the first (i.e. quickest) URL is completed, the <fetch> loop is entered, with $ret set to the returned raw document source. As subsequent URLs are completed, the loop is iterated again, once for each member of $urls. Inside the loop, the <urlinfo> function can be used to retrieve further information about the URL just completed.

It is important to note with the loop syntax that URLs are returned fastest-first, which might not be the order they are present in $urls. For example, suppose two URLs are being fetched where the first URL takes 10 seconds to download and the other 3 seconds. With the parallel loop syntax, the second will probably be returned first, after 3 seconds; then 7 seconds later the first will be completed. A URL that refers to an unresponsive web server will not hold up other URLs; it is merely returned last, when it times out.

As an aid in tracking which URL was returned in each iteration, the $urls argument (if a variable) and any $loopvar variables (given after all options and $urls) are looped over inside the <fetch>, not in their original order but in the same order as the returned URLs. Thus $urls is set to the URL just retrieved inside the loop.

The special variables $loop and $next are set and incremented inside the loop as well: $loop starts at 0, $next at 1.
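For example, a parallel list of labels can be carried along as a $loopvar; it is re-ordered in step with the returned URLs (a sketch; the URLs and labels are illustrative):

<$urls  = "http://fast.example.com/" "http://slow.example.com/">
<$names = "Fast site"                "Slow site">
<fetch PARALLEL URLS=$urls $names>
  <!-- $names labels the URL just completed, regardless of input order -->
  Got $names ($urls): result $next of 2
</fetch>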

If an argument to PARALLEL is given, only that many URLs will be fetched simultaneously; the remaining ones are started only as the first ones complete. The default (no argument to PARALLEL) is to start fetching all URLs initially (in version 4 and earlier) or only 10 at a time (version 5 and later).
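For example, a throttled fetch keeps only n URLs in flight at once (a sketch):

<fetch PARALLEL=3 URLS=$urls>  <!-- at most 3 simultaneous fetches -->
  ...
</fetch>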

Note that the syntaxversion pragma (here) affects what syntaxes are accepted. In version 8 and later syntax, the <fetch> statement, like most looping statements, is non-looping only if self-closed (looping otherwise); all options (including URLs but excluding loop variables) must be labelled; and all options are accepted (and in any order) for both looping and non-looping versions (though PARALLEL is ignored for non-looping, as it is a single fetch). For other differences in version 7 legacy syntax, see the syntaxversion pragma documentation (here).

<fetch> returns the raw document just fetched (or provided by $downloaddoc), after content/transfer encodings decoded.

This example uses the loop syntax to search multiple search engines simultaneously. First, the user's query is placed into each URL with <sandr>. Then the resulting URLs are fetched; because of the PARALLEL flag, the fastest engine will return first. Each page is then post-processed to remove HTML outside the <BODY>/</BODY> tags - since there will be multiple pages concatenated together - and displayed following a <BASE> tag so that the user's browser knows the proper path for the links:

<urlcp timeout 10>
<$rooturls =                         <!-- hypothetical engine URLs; @@@ marks the query slot -->
  "http://engine1.example.com/search?q=@@@"
  "http://engine2.example.com/search?q=@@@"
>
<strfmt "%U" $query>                 <!-- URL-escape query -->
<sandr "[\?#\{\}\+\\]" "\\\1" $ret>  <!-- make sandr-safer -->
<sandr "@@@" $ret $rooturls>         <!-- and insert into URLs -->
<$urls = $ret>
<fetch PARALLEL URLS=$urls>
  <sandr ".*>><body=[^>]+>=" "" $ret>  <!-- strip non-BODY -->
  <sandr "</body>=.*"        "" $ret>
  <$html = $ret>
  <urlinfo actualurl>
  <BASE HREF="$ret">                 <!-- tell browser the base -->
  <send $html>                       <!-- print site results -->
</fetch>

The PARALLEL syntax to <fetch> was added in version 2.1.902500000 19980807. Support for FTP was added in June 1998, Gopher in version 2.6.938200000 19990924, HTTPS and javascript: June 17 2002, and file:// URLs in version 4.02.1048785541 20030327. Protected file:// URLs (requiring user/pass) supported for Windows in version 5.01.1123012366 20050802.

All URLs are returned, even those that cause an error (empty string returned). The <urlinfo> function can then be used to obtain the error code.
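Thus an error check with <urlinfo> inside the loop might look like this (a sketch; note that <urlinfo> overwrites $ret, so the document is saved first):

<fetch PARALLEL URLS=$urls>
  <$doc = $ret>        <!-- save document before calling <urlinfo> -->
  <urlinfo errnum>
  <if $ret != 0>
    <urlinfo errmsg>
    Error fetching $urls: $ret
  <else>
    <send $doc>        <!-- process the successful fetch -->
  </if>
</fetch>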

In versions prior to 3.0.949000000 20000127, or if <urlcp dnsmode sys> is set, domain name resolution cannot be parallelized due to C lib constraints. Thus, an unresponsive name server (as opposed to a web server) may hold up other URLs, or even exceed the <urlcp timeout> setting. In some versions, parallel FTP retrieval is not supported.

Note that $loop and $next are merely incremented inside the loop: they do not necessarily correspond to the array index of the currently returned URL.

As little work as possible should occur inside a <fetch> loop, as any time-consuming commands could cause in-progress fetches to time out.

The syntaxversion pragma (here) affects the syntax of this statement.

submit, urlinfo, urlcp

Copyright © Thunderstone Software     Last updated: Apr 15 2024