SYNOPSIS
<fetch [PARALLEL[=n]] [options] URL[S]=$urls [$loopvar ...] [/]>
[ ...
</fetch>]
or, in version 7 and earlier syntax (deprecated; see the
syntaxversion pragma), either:
<fetch [options] $urls [$downloaddoc]>
or
<fetch PARALLEL[=n] [options] [URL[S]=]$urls [$loopvar ...]>
...
</fetch>
DESCRIPTION
The <fetch> function retrieves URLs, using the HTTP GET method (or its
equivalent for other protocols), and returns the (unformatted) documents
in $ret. Options:
METHOD
Specifies a parallel (to $urls) list of alternate methods; each method
may be one of OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE, MKDIR, or
RENAME. Not all methods are supported by all protocols. Some methods are
mapped to an equivalent method when used with non-HTTP protocols. If the
RENAME method is set (valid only for FTP URLs), the RENAMETO option must
also be set to the target file path. Added in version 3.01.971300000
20001011.
RENAMETO=$path
Gives the file path to rename the URL to. Only valid with METHOD=RENAME and FTP URLs. Added in version 5.01.1173117000 20070305.
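For example, a rename over FTP might look something like the following
sketch; the host, credentials, and file paths are hypothetical
placeholders, and the labelled-option form assumes version 8 syntax:
<!-- Hypothetical FTP rename; host, user/pass and paths are placeholders -->
<$url     = "ftp://user:password@ftp.example.com/pub/old-name.txt">
<$method  = "RENAME">
<$newpath = "/pub/new-name.txt">
<fetch URLS=$url METHOD=$method RENAMETO=$newpath />
<urlinfo errtoken> <!-- error token for the rename attempt -->
<send $ret>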
STATUSLINE=$status
If non-empty, parse $status as the HTTP status line for the response,
e.g. "HTTP/1.1 404 Not Found". Only valid with $downloaddoc. Added in
version 7.07.1601332714 20200928.
HEADERS=$hdrs
Parse list $hdrs as the HTTP response headers. Each item is a single
header, e.g. "Content-Type: application/pdf". Only valid with
$downloaddoc. Added in version 7.07.1601332714 20200928.
ERRTOKEN=$errtoken
If non-empty, set $errtoken as the error token for <urlinfo errtoken>
(and thus indirectly the <urlinfo errnum> and <urlinfo errmsg> return
values as well). Must be a valid token returned by <urlinfo errtoken>,
or a number returned by <urlinfo errnum>.
Errors like DocUnauth are usually derived automatically and set by
<fetch> from the user data (e.g. a STATUSLINE of "HTTP/1.1 401 Access
Denied"), but errors like ConnTimeout would not be settable with user
data (e.g. there is no HTTP code for timeout). In such cases, the
ERRTOKEN option can be used to set a <fetch>-standard error code for
later <urlinfo errtoken> processing.
Only valid with $downloaddoc. Overrides any error token induced by the
status line, headers etc. (e.g. DocUnauth). Added in version
7.07.1605575000 20201116.
DOWNLOADDOC=$downloaddoc
Instead of fetching the given URL, use $downloaddoc as the downloaded
(over-the-wire) content. Only valid in a non-looping/non-parallel
<fetch>. The STATUSLINE, HEADERS, and/or ERRTOKEN options are only valid
when this option is specified. When given, only $downloaddoc will be
processed: no later redirects etc. will be fetched, even if indicated by
processing of HEADERS etc. Some processing and/or information normally
available with over-the-wire fetches may not occur nor be available,
e.g. totaltime. Note that in version 7 syntax, DOWNLOADDOC is not a
named option: $downloaddoc must be given as an optional argument
immediately after $url, and only for the non-looping syntax.
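As a rough sketch of re-processing already-captured content (all of the
values below are hypothetical placeholders, and the labelled options
assume version 8 syntax):
<!-- Re-process content obtained elsewhere instead of fetching it -->
<$url    = "http://www.example.com/page.txt">
<$rawdoc = "Hello, world.">
<$status = "HTTP/1.1 200 OK">
<$hdrs   = "Content-Type: text/plain">
<fetch URLS=$url DOWNLOADDOC=$rawdoc STATUSLINE=$status HEADERS=$hdrs />
<urlinfo errtoken> <!-- error token derived from the status line -->
<send $ret>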
With the non-looping syntax, a single URL (the first value of $urls) is
retrieved. Whether content is actually fetched or provided by
$downloaddoc, $ret is set to the returned raw document when the
statement finishes. The <urlinfo> function can then be called to obtain
more information about the document.
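A minimal non-looping sketch, assuming version 8 syntax and a
placeholder URL:
<urlcp timeout 10>  <!-- give up after 10 seconds -->
<$url = "http://www.example.com/">
<fetch URLS=$url /> <!-- single, non-looping fetch -->
<$page = $ret>      <!-- raw document just retrieved -->
<urlinfo actualurl> <!-- the actual URL retrieved (see <urlinfo>) -->
<send $ret>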
With the looping syntax, all of the URLs in $urls are fetched
in parallel, that is, simultaneously. Once the first
(i.e. quickest) URL is completed, the <fetch>
loop is entered,
with $ret set to the returned raw document source. As
subsequent URLs are completed, the loop is iterated again, once for
each member of $urls. Inside the loop, the <urlinfo>
function can be used to retrieve further information about the URL
just completed.
It is important to note with the loop syntax that URLs are returned fastest-first, which might not be the order they are present in $urls. For example, suppose two URLs are being fetched where the first URL takes 10 seconds to download and the other 3 seconds. With the parallel loop syntax, the second will probably be returned first, after 3 seconds; then 7 seconds later the first will be completed. A URL that refers to an unresponsive web server will not hold up other URLs; it is merely returned last, when it times out.
As an aid in tracking which URL was returned in each iteration, the
$urls argument (if a variable) and any $loopvar variables (given after
all options and $urls) are looped over in the <fetch>, but in the same
order as the returned URLs, not necessarily their original order. Thus
$urls is set to the URL just retrieved inside the loop.
The special variables $loop and $next are set and incremented inside the loop as well: $loop starts at 0, $next at 1.
If an argument to PARALLEL is given, only that many URLs will be fetched
simultaneously; the remaining ones are started only as the first ones
complete. The default (no argument to PARALLEL) is to start fetching all
URLs initially (in version 4 and earlier) or only 10 at a time (version
5 and later).
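For example, a sketch that limits concurrency while tracking which URL
each iteration corresponds to (the URLs are placeholders):
<$urls =
 "http://www.example.com/a.html"
 "http://www.example.com/b.html"
 "http://www.example.com/c.html"
>
<fetch PARALLEL=2 URLS=$urls> <!-- at most 2 simultaneous fetches -->
<send $urls> <!-- URL just completed, fastest-first -->
<send $loop> <!-- 0, 1, 2: iteration count, not position in the list -->
</fetch>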
Note that the syntaxversion pragma affects what syntaxes are accepted.
In version 8 and later syntax, the <fetch> statement, like most looping
statements, is non-looping only if self-closed (looping otherwise); all
options (including URLs but excluding loop variables) must be labelled;
and all options are accepted (in any order) for both the looping and
non-looping versions (though PARALLEL is ignored for a non-looping,
single fetch). For other differences in version 7 legacy syntax, see the
syntaxversion pragma documentation.
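For instance, under version 8 syntax the two forms might be written as
follows (the URL is a placeholder, and $urls is assumed to be a list of
URLs):
<$url = "http://www.example.com/">
<fetch URLS=$url /> <!-- self-closed: a single, non-looping fetch -->
<fetch URLS=$urls>  <!-- open/close form: loops once per completed URL -->
<send $ret>
</fetch>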
DIAGNOSTICS
<fetch> returns the raw document just fetched (or provided by
$downloaddoc), after any content/transfer encodings have been decoded.
EXAMPLE
This example uses the loop syntax to search multiple search engines
simultaneously. First, the user's query is placed into each URL with
<sandr>
. Then the resulting URLs are fetched; because of the
PARALLEL
flag, the fastest engine will return first. Each page
is then post-processed to remove HTML outside the
<BODY>/</BODY> tags - since there will be multiple
pages concatenated together - and displayed following a <BASE>
tag so that the user's browser knows the proper path for the links:
<urlcp timeout 10>
<$rooturls =
"http://searchsite1.example.com/cgi-bin/search?q=@@@"
"http://searchsite2.example.com/findit?query=@@@"
"http://searchsite3.example.com/cgi-bin/whereis?q=@@@&cmd=search"
>
<strfmt "%U" $query> <!-- URL-escape query -->
<sandr "[\?#\{\}\+\\]" "\\\1" $ret> <!-- make sandr-safer -->
<sandr "@@@" $ret $rooturls> <!-- and insert into URLs -->
<$urls = $ret>
<BODY BGCOLOR=white>
<fetch PARALLEL urls=$urls>
<sandr ".*>><body=[^>]+>=" "" $ret> <!-- strip non-BODY -->
<sandr "</body>=.*" "" $ret>
<$html = $ret>
<urlinfo actualurl>
<BASE HREF="$ret"> <!-- tell browser the base -->
<send $html> <!-- print site results -->
</fetch>
</BODY>
CAVEATS
The PARALLEL syntax to <fetch> was added in version 2.1.902500000
19980807. Support for FTP was added in June 1998, Gopher in version
2.6.938200000 19990924, HTTPS and javascript: URLs on June 17 2002, and
file:// URLs in version 4.02.1048785541 20030327. Protected file:// URLs
(requiring user/pass) have been supported for Windows since version
5.01.1123012366 20050802.
All URLs are returned, even those that cause an error (an empty string
is returned for them). The <urlinfo> function can then be used to obtain
the error code.
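For example, a sketch of per-URL error checking inside the loop,
assuming $urls has already been set:
<fetch PARALLEL URLS=$urls>
<urlinfo errtoken> <!-- error token for the URL just completed -->
<$err = $ret>
<urlinfo errmsg>   <!-- human-readable error message, if any -->
<$msg = $ret>
<send $urls>       <!-- which URL this iteration corresponds to -->
<send $err>
<send $msg>
</fetch>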
In versions prior to 3.0.949000000 20000127, or if <urlcp dnsmode sys>
is set, domain name resolution cannot be parallelized due to C library
constraints. Thus, an unresponsive name server (as opposed to a web
server) may hold up other URLs, or even exceed the <urlcp timeout>
setting. In some versions, parallel FTP retrieval is not supported.
Note that $loop and $next are merely incremented inside the loop: they do not necessarily correspond to the array index of the currently returned URL.
As little work as possible should occur inside a <fetch>
loop,
as any time-consuming commands could cause in-progress fetches to
time out.
The syntaxversion pragma affects the syntax of this statement.
SEE ALSO
submit, urlinfo, urlcp