urlinfo - get detailed page info

 

SYNOPSIS

<urlinfo $name [$which]>


DESCRIPTION
The urlinfo function returns information about the last page retrieved with fetch or submit. Inside a fetch loop (e.g. with the PARALLEL flag), this is the page just returned for the current loop iteration. The $name argument selects what to return; some values take a second $which argument (as noted below). Possible values for $name and what they return are:

     

  • actualurl (string) The last URL retrieved. It may differ from the argument to fetch or submit, e.g. if redirects were followed. See also intermediateurls.

  • allrefs (list)  

    The list of all links, images, frames, iframes and string links from the document; essentially a concatenation of the links, images, frames, iframes and strlinks values, but with duplicate entries from the same tag/attribute tuple removed. Added in version 7.06.1463100000 20160512. Note that the urlcp settings getframes, getiframes and/or getscripts may affect which URLs are returned.

  • authparams (list) The list of names of parsed authentication parameters sent by the server. The value for a particular parameter name can be obtained with authparam $param. Added in version 5.1. Note that authentication parameters may not be available even if authentication is used, if the server does not send them. For example, the second and later requests on a connection may not need parameters, if credentials are sent with the initial request and thus the server does not need to challenge in the response.

  • authparam $param (string) The authentication parameter $param from the server. $param may be realm for the Basic authentication realm, target for the NTLM target (i.e. domain), or serverchallenge for the NTLM server challenge nonce. Added in version 5.1. Authentication parameter names are case-insensitive.

  • authscheme (string) The authentication scheme used. Returns one of the scheme tokens used by <urlcp authschemes>. Added in version 5.01.1239140000 20090407.

  • authschemes (list)

    The list of authentication schemes currently allowed via <urlcp authschemes>. Added in version 7.04.

  • authschemehighest (string) The highest (most secure) authentication scheme used during the entire transaction, i.e. across redirects (if any). E.g. if a Basic authentication protected page was fetched, which then redirected to an anonymous-access page, authschemehighest would return Basic, even though authscheme would return anonymous (from the last page). Returns one of the scheme tokens used by <urlcp authschemes>. Added in version 5.01.1239140000 20090407.

  • charsetconfigtotext (string) The current charset configuration, in the format used by <urlcp charsetconfigfromfile>. Added in version 6.

  • charsetdetected (string) The charset of the source page, as detected by scanning the document, without parsing explicit charset labels. Added in version 5.

  • charsetexplicit (string) The charset of the source page, as explicitly set in a header or <META HTTP-EQUIV> label. Returns Unknown if unknown or not set. Added in version 5.

  • charsetsrc or charsetsource (string) The charset of the source page, as interpreted by the parser. This is taken from the first available source, in descending priority: the charset as set by <urlcp charsetsrc>; the charset explicitly set in the page (header or meta); the charset detected by scanning the document; or the <urlcp charsetsrcdefault> charset. Added in version 5.

  • charsettxt or charsettext (string) The charset of the formatted text (as returned by <urlinfo text>). Added in version 5.

  • contenttype (string) The MIME content type of the page (without any parameters). This may have been derived from the Content-Type header, a <META HTTP-EQUIV> tag, or the URL extension, depending on what is available. In version 7.06.1477065000 20161021 and later, the value is returned lower-case for easier comparison, since media types are case-insensitive.

  • contenttypeparams (list)

    The names of parameters in the MIME content type, if any. In version 7.06.1477065000 20161021 and later, the names are returned lower-case for easier comparison, since media type parameter names are case-insensitive.

  • contenttypeparam (list, 2 args) The value(s) of the content type parameter(s) named $which. Multiple values may be given in $which. Parameter names are case-insensitive.

  • contenttypesrc (string) Returns the source of contenttype and related data, i.e. how it was determined. One of "generated", "header", "doctype", "metaheader", "urlpath", "contentscan" or "unknown". Added in version 5.01.1116341784 20050517. Aka contenttypesource.

  • cookiejar [all] [netscape4x] The contents of the "cookie jar" (Vortex's internal cache of cookies received or set). Returned as a Netscape-cookie-file format text buffer. By default, only persistent (non-session) cookies are returned, i.e. the ones to be preserved across browser invocations. If the argument all is given, all cookies, including session cookies, are returned. Added in version 4.01.1022000000 20020521.

    In version 5.01.1244880000 20090613 and later, a new fifth column was inserted in the output, containing the IsHttpOnly boolean value. To obtain the Netscape-4.x-compatible format of prior versions, set the netscape4x flag. <urlcp cookiejar> will accept input in either format.

  • domvalue $dom Gets the value of the DOM item indicated by $dom. Note that this is not the JavaScript DOM, but the near-parallel page DOM. This can be used to get the submit URL and content for a form on the page just fetched, e.g. document.forms.myForm.submitUrl and document.forms.myForm.submitContent, after optionally setting form input values via <urlcp domvalue>. Added in version 5.

  • downloaddoc (string or varbyte) The network-transferred downloaded document body. This is the same as rawdoc if the document had no content/transfer encodings. If it did have encodings, this is the chunked/compressed/etc. document, before decompression into rawdoc. The downloaded document is normally discarded if different from rawdoc, to save memory; thus it may be empty for documents with encodings. Set <urlcp savedownloaddoc on> (normally off) to preserve the downloaded document (at potential cost in memory). Added in version 5.01.1249203000 20090802. See also rawdoc, which is usually more useful.

  • encodings (list) The list of content/transfer encodings of the response document, in the order they were applied by the server. Known encodings (e.g. gzip) are canonicalized and lowercase. Note that known and enabled encodings are already decoded (in reverse order) in the <fetch> or <urlinfo rawdoc> returned document. Added in version 5.01.1249203000 20090802.

  • frames (list) The list of frame URLs in the document. If the urlcp setting getframes is true, the list is empty since the frames have been fetched and appended to the document.

  • iframes (list) The list of <IFRAME> URLs in the document. If the urlcp setting getiframes is true, the list is empty since the iframes have been fetched and inserted into the document.

  • headers (list) The names of HTTP headers received with the document.

  • header $hdrName (list)

    The full value(s) of the HTTP header(s) with single name $hdrName. Header names are case-insensitive.

  • headervalue $hdrName   The leading value (i.e. before the ";") of the header(s) with single name $hdrName, where the header is in semicolon-parameterized format, i.e.:
          value; param1=val1; param2="val 2"; ...
    Added in version 6.00.1287436000 20101018.

  • headerparams $hdrName

    The parameter name(s) from the semicolon-parameterized header(s) with single name $hdrName. Added in version 6.00.1287436000 20101018.

  • headerparam $hdrName $paramName

    The parameter value(s) of the parameter(s) with single name $paramName from the semicolon-parameterized header(s) with single name $hdrName. Added in version 6.00.1287436000 20101018.

  • errnum (integer) The Vortex fetch error code (not the HTTP or other protocol code), indicating a problem with the fetch. This can be non-zero even for a partially successful fetch, e.g. 15 if the page is too big. 0 indicates a completely successful fetch. See the table below for a list of errnum codes and what they mean.

  • errtoken (string) A string token representing the numeric errnum code, e.g. DocNotFound for error 24 (Document not found). This can be used in scripts as a more readable and self-documenting value than errnum integer values, and more constant than errmsg values (which may change in future releases). See the table below for a list of tokens and corresponding numbers and meanings. Added in version 5.01.1246963000 20090707.

  • errmsg (string) A human-readable string description of the errnum code. See the table below for a list of possible error messages and numbers.

  • httpcode (integer) The value of the protocol response code, if any (for HTTP or FTP). Note that this varies depending on the fetched URL protocol; the errnum value is more consistent. Typical HTTP codes and what they mean are listed below. Note that this is not an exhaustive list, as the protocol code is created and sent by the web server, not Vortex. Codes will also vary for other (non-HTTP) protocols, e.g. FTP:

    • 200 Ok (all 2NN codes)

    • 201 Created

    • 202 Accepted

    • 204 No Content

    • 300 Redirect (all 3NN codes)

    • 301 Moved permanently

    • 302 Moved temporarily

    • 303 See Other

    • 304 Not modified

    • 400 Bad client request (all 4NN)

    • 401 Unauthorized

    • 403 Forbidden

    • 404 Not found

    • 405 Method not allowed

    • 406 Not acceptable

    • 407 Proxy access unauthorized

    • 408 Request timed out

    • 413 Request entity too large

    • 414 Request URI too large

    • 500 Internal server error (all 5NN)

    • 501 Not implemented by server

    • 502 Bad gateway

    • 503 Service unavailable

  • httpmsg (string) The protocol response string, if any (HTTP or FTP). Varies by protocol and server; check errmsg instead for more portable (platform-independent) messages.

  • images (list) The list of image URLs in the document, e.g. <IMG> tags, background images, etc.

     

  • intermediateurls (list)

    The list of intermediate URLs, if any, that were fetched before the final URL returned. This includes redirects, FTP/file dir/file retries, authorization retries, OPTIONS Upgrades, CONNECT tunnels, and proxy retries. Added in version 7.05.1450220000 20151215. See also actualurl.

     

  • links (list)

    The list of non-image link URLs in the document, e.g. <A HREF> tags, <FORM> tags, etc. Same as the return value of the obsolescent urllinks function. Note that frames will be listed as links if the urlcp setting getframes is false, iframes will be listed if getiframes is false, and script sources will be listed if getscripts is false. Note also that JavaScript string links (see strlinks) are not included in this list, as they are unreliable; but ordinary JavaScript links are included.

  • metaheaders (list) The names of <META HTTP-EQUIV> tags in the document.

  • metaheader $hdrName (list)

    The entire value(s) of the <META HTTP-EQUIV> tag(s) with single name $hdrName. Header names are case-insensitive.

  • metaheadervalue $hdrName

    The leading value (i.e. before the ";") of the meta header(s) with single name $hdrName, where the header is in semicolon-parameterized format (see headervalue above for format example). Added in version 6.00.1287436000 20101018.

  • metaheaderparams $hdrName

    The parameter name(s) from the semicolon-parameter-format meta header(s) with single name $hdrName (see headervalue above for format example). Added in version 6.00.1287436000 20101018.

  • metaheaderparam $hdrName $paramName

    The parameter value(s) of the parameter(s) with single name $paramName from the semicolon-parameter-format meta header(s) with single name $hdrName (see headervalue above for format example). Added in version 6.00.1287436000 20101018.

  • metanames (list) The names of <META NAME> tags in the document.

  • metaname $hdrName (list)

    The entire value(s) of the <META NAME> tag(s) with single name $hdrName. Names are case-insensitive.

  • metanamevalue $hdrName

    The leading value (i.e. before the ";") of the meta name tag(s) with single name $hdrName, where the tag is in semicolon-parameter format (see headervalue above for format example). Added in version 6.00.1287436000 20101018.

  • metanameparams $hdrName

    The parameter name(s) from the semicolon-parameter-format meta name tag(s) with single name $hdrName (see headervalue above for format example). Added in version 6.00.1287436000 20101018.

  • metanameparam $hdrName $paramName

    The parameter value(s) of the parameter(s) with single name $paramName from the semicolon-parameter-format meta name tag(s) with single name $hdrName (see headervalue above for format example). Added in version 6.00.1287436000 20101018.

  • originalurl (string) The original URL retrieved (i.e. the one given to fetch or submit). It may differ from the actual last URL retrieved, e.g. if redirects were followed. Added in version 5.01.1205285000 20080311.

     

  • prngdpid [$path] (integer, 2 args) The process ID of the prngd daemon (entropy gatherer) running on Unix file pipe $path, 0 if none detected, -1 on error. If no $path (or an empty one) is given, all standard paths ("/var/run/egd-pool", "/dev/egd-pool", "/etc/egd-pool", "/etc/entropy") and the configured path ([Texis] Entropy Pipe value in texis.ini) are checked. The prngd daemon is used on certain Unix platforms (those without /dev/random) to provide entropy to seed the random number generator for the SSL/HTTPS plugin. The prngdpid value provides a way to check if the daemon is running. Note that not all platforms require an entropy daemon. Added in version 4.01.1031761163 20020911. See also the entropypipe setting of urlcp.

  • putmsgs (list) The fetch-related putmsgs since the most recent <fetch> or <submit>. When called inside a <fetch parallel> loop, only the messages from the just-completed fetch are returned, making disambiguation much easier than with the standard <putmsg> function callback mechanism. If <urlcp putmsg save> is off, no messages will be saved or returned. The message buffer is cleared at the start of each <fetch> or <submit>. If parsing these messages, it may be helpful to turn off <urlcp putmsg pass>, so that the same messages need not be seen and parsed by the script-wide <putmsg> function callback. Added in version 6.

  • processedchunks (strings or varbyte values)

    The ordered list of HTML document chunks that were actually processed during HTML parsing. The concatenation of these is normally the same as rawdoc. However, the chunks may differ if rawdoc is not UTF-8, as the chunks always are. The chunks may also differ from rawdoc if JavaScript was run and modified the document; e.g. some of the chunks may be the output of document.write() statements, whereas rawdoc is always the static original document. The chunks may be zero-length/empty if no HTML processing was done, e.g. for an image. Added in version 6. The concatenation of processedchunks is available as processeddoc, which may be easier to use if individual chunks (e.g. static vs. dynamic content) are not needed.

  • processedchunksbufnums (list of integers) The ordered list of buffer numbers that the corresponding processedchunks values come from. During HTML and JavaScript processing, a document will end up with one or more buffers, the first of which (buffer 0) is the original static document source itself. JavaScript processing may create further buffers (e.g. the output of document.write()). A buffer may end up split into multiple chunks for HTML formatting if such JavaScript output occurs mid-buffer. For example, a document.write() in the middle of an HTML page may result in 3 chunks: the first part of buffer 0 (static doc), all of buffer 1 (generated by JavaScript), and the latter part of buffer 0 (rest of static doc). Added in version 6.

  • processeddoc (string or varbyte)  

    The concatenation of processedchunks. Added in version 7.06.1463504000 20160517.

  • rawdoc (string or varbyte)  

    The document source (after any content/transfer encodings are decoded). Same as the return value of the original fetch or submit. See also downloaddoc.

  • redirs (integer) The number of redirects encountered.

  • saslmechanisms (list)

    The list of enabled SASL mechanisms (under Negotiate authentication). See <urlcp saslmechanisms> for more info. A putmsg is generated if SASL is not supported on the current platform.

  • saslmechanismsavailable (list)   The list of available SASL mechanisms. See <urlcp saslmechanisms> for more info. A putmsg is generated if SASL is not supported on the current platform.

  • saslpluginpath (string)

    The colon-separated path to look for SASL plugins in. See <urlcp saslpluginpath> for more info. A putmsg is generated if SASL is not supported on the current platform.

  • secure (list)   Which parts of the transaction were conducted securely (via SSL). One or more of the following values:

    • request - The final URL request to the server was secure.

    • response - The final response from the server was secure.

    • ancestors - All previous requests and responses that led to the final fetch (i.e. earlier redirects) were secure.

    • descendants - All requests and responses made to sub-objects on the final page (e.g. frames, scripts) were secure.

    • all - All requests and responses for the entire transaction - ancestors (if any), final page, and descendants (if any) - were secure.
    Added in version 5.01.1184803500 20070718. Note that the definition of "secure" for this option only applies to the first-hop network connection (Vortex); if a proxy is used, the transaction(s) from the proxy to the URLs may or may not be secure. See also the insecure option.

  • insecure (list) Which parts of the transaction were insecure, i.e. not conducted securely via SSL. One or more of the following values:

    • request - The final URL request to the server was insecure.

    • response - The final response from the server was insecure.

    • ancestors - One or more previous requests or responses that led to the final fetch (i.e. earlier redirects) were insecure.

    • descendants - One or more requests or responses made to sub-objects on the final page (e.g. frames, scripts) were insecure.

    • all - The request and response for the final page were insecure, one or more ancestors (if any) were insecure, and one or more descendants (if any) were insecure.
    Added in version 5.01.1184803500 20070718. Note that the definition of "insecure" for this option only applies to the first-hop network connection (Vortex); if a proxy is used, the transaction(s) from the proxy to the URLs may or may not be secure. See also the secure option.

  • sslciphers [$group] (string)

    Returns the list of SSL ciphers currently set with <urlcp sslciphers>, or an empty string if none is set (i.e. the OpenSSL default list is in effect). Added in version 7.03.1436205000 20150706.

    In version 7.07 and later, an optional cipher $group may be given, to return the cipher list for that protocol group. The group may be SSL (the default) for protocols TLSv1.2 and below, or TLSv1.3 for TLSv1.3 ciphers; the two lists are independent.

  • sslservercertificate (PEM string)

    Returns the SSL certificate obtained from the server, in PEM format, or empty if none (e.g. no HTTPS/SSL server contacted). If the server is an Apache or Texis Monitor web server, this certificate is typically from the server's SSLCertificateFile setting. The urlutil action sslcertificate may be used to decode the certificate into a human-readable string format. Note that a server certificate may sometimes be obtainable from an HTTPS/SSL server even if the connection fails (e.g. due to verification problems). Added in version 6.00.1320460000 20111104.

  • sslclientcalist (list)

    Returns the list of CA (certificate authority) certificate names that the HTTPS/SSL server requested as acceptable issuers of the client's (Vortex fetch lib's) certificate. (If the server is an Apache or Texis Monitor web server, this list is typically from the server's SSLCADNRequestFile or SSLCACertificateFile setting.) In other words, the certificate set with <urlcp sslcertificatefile> should have been signed by one of these issuers, or the server might reject the connection with a "Cannot complete SSL handshake: ... alert bad certificate" (or "... alert unknown ca") or similar error.

    If an HTTPS/SSL server was not contacted, or the server did not request a client (Vortex) certificate for verification, this list may be empty. Added in version 6.00.1320460000 20111104.

  • sslverifyservererrtoken (string)

    The string token that identifies the reason for the <urlcp sslverifyserver> error, i.e. the token for the reason part of the "Cannot verify certificate from host:port: reason at depth N" message. If no server-certificate verification was performed (e.g. sslverifyserver is off, or no SSL server was contacted), the token is empty or "unknown". If verification was performed successfully (no errors), "Ok" is returned.

    To continue to verify SSL server certificates, but ignore this particular sub-type of verification error, the error can be disabled by adding the token prefixed with a "-" (minus sign) to the <urlcp sslverifyserver> setting. Added in version 6.00.1320460000 20111104. The list of possible tokens is detailed in the SSL Client/Server Certificate Verification appendix. Note that disabling individual sslverifyserver errors should be done with caution, as it can weaken the security provided by those checks.

  • strlinks (list)   The list of JavaScript string links. These may be unreliable or require further processing, so they are not returned as part of the normal links list. See also <urlcp scriptstrlinks>. Added in version 5.00.1086804521 20040609.

  • sspipackages (list)

    The list of SSPI packages enabled/offered under Negotiate authentication. See <urlcp sspipackages> for more info. A putmsg is generated if SSPI is not supported on the current platform (e.g. non-Windows).

  • sspipackagesavailable (list)   The list of available SSPI packages. See <urlcp sspipackages> for more info. A putmsg is generated if SSPI is not supported on the current platform (e.g. non-Windows).

  • strbaseurls (list) The list of JavaScript base URLs corresponding to strlinks. If <urlcp scriptstrlinksabs> is off, this enables the strlinks list to be made absolute, perhaps after some post-processing. Added in version 5.00.1086804521 20040609.

     

  • text (string) The formatted text of the document. Same as the return value of the obsolescent urltext function.

     

  • textformatter (string) A token describing what formatter was used to produce the <urlinfo text> value; one of the following:

    • unknown Formatter is unknown.

    • rawdoc No formatting: text is the raw document source.

    • text Plain-text document formatter.

    • gopher Gopher menu formatter.

    • html HTML document formatter.

    • rss RSS feed formatter.

    • frame Framed document formatter/aggregator.
    Added in version 5.01.1257475000 20091105. rss was added in version 7.02.1407881000 20140812.

  • title (string) The formatted title text of the document.

  • time or totaltime (double) The total time in seconds (including fraction) to retrieve the page. This includes DNS resolution plus content transfer time. Added in version 3.01.966019604 20000811.

  • dnstime (double) The time in seconds (including fraction) to resolve the hostname(s) via DNS. Added in version 3.01.966019604 20000811.

  • transfertime (double) The time in seconds (including fraction) to transfer content to/from the web server. This is a more accurate measure of web server throughput because it does not include the time to resolve the hostname(s). Added in version 3.01.966019604 20000811.
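
As a rough sketch of how several of the values above might be combined after a single fetch, the fragment below saves a few of them into variables and prints a short summary. The URL and the variable names ($orig, $actual, $redirs, $dns, $xfer) are illustrative assumptions, not part of the urlinfo interface.

```vortex
<!-- Sketch only: URL and variable names are hypothetical -->
<fetch "http://www.example.com/">
<urlinfo "originalurl">  <$orig = $ret>
<urlinfo "actualurl">    <$actual = $ret>
<urlinfo "redirs">       <$redirs = $ret>
<urlinfo "dnstime">      <$dns = $ret>
<urlinfo "transfertime"> <$xfer = $ret>
Requested: $orig
Retrieved: $actual after $redirs redirect(s)
DNS time: $dns sec, transfer time: $xfer sec
```

Each urlinfo call overwrites $ret, so each value is copied to its own variable before the next call.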

  The possible errnum, errtoken and errmsg values are:

errnum  errtoken                       errmsg
0       Ok                             Ok
1       ClientErr                      Unknown client error
2       ServerErr                      Server error
3       UnkResponseCode                Unrecognized response code
4       UnkProtocolVersion             Unrecognized protocol version
5       ConnTimeout                    Connection timeout
6       UnkHost                        Unknown host
7       CannotConn                     Cannot connect to host
8       NotConn                        Not connected
9       CannotCloseConn                Cannot close connection
10      CannotWriteConn                Cannot write to connection
11      CannotReadConn                 Cannot read from connection
12      CannotWriteFile                Cannot write to file
13      OutOfMem                       Out of memory
14      PageTrunc                      Page not expected size, possibly truncated
15      MaxPageSizeExceeded            Max page size exceeded, truncated
16      TooManyRedirs                  Too many redirects
17      OffsiteRef                     Off-site or unapproved redirect or frame
18      UnkProtocol                    Unknown/unimplemented access method
19      BadParam                       Bad parameter
20      UnkErr                         Unknown error
21      BadRedir                       Bad redirect
22      DocUnauth                      Document access unauthorized
23      DocForbidden                   Document access forbidden
24      DocNotFound                    Document not found
25      ServerNotImplemented           Server did not recognize request (unimplemented)
26      ServiceUnavailable             Service unavailable
27      UnkMethod                      Unknown request method
28      CannotReadFile                 Cannot read from file
29      CannotLoadLib                  Cannot load dynamic library
30      ScriptErr                      Script error
31      ScriptTimeout                  Script timeout
32      ScriptMemExceeded              Script memory limit exceeded
33      DisallowedProtocol             Disallowed protocol
34      SslErr                         SSL error
35      ProxyUnauth                    Proxy access unauthorized
36      EmbeddedSecurityChange         Embedded object security change
37      DisallowedFilePrefix           Disallowed file prefix
38      DisallowedFileType             Disallowed file type
39      DisallowedNonlocalFileUrl      Disallowed non-local file URL
40      CannotConvertCharset           Cannot convert character set
41      DisallowedAuthScheme           Disallowed authentication scheme
42      SecureTransNotPossible         Secure transaction not possible
43      UnexpectedResponseCode         Unexpected server response
44      DisallowedMethod               Disallowed request method
45      ConnUpgradeToSslRequired       Connection upgrade to SSL required
46      FetchNotPermittedByLicense     Fetch not permitted by license
47      UnknownContentEncoding         Unknown Content- or Transfer-Encoding
48      DisallowedContentEncoding      Disallowed Content- or Transfer-Encoding
49      CannotDecodeContentEncoding    Cannot decode Content- or Transfer-Encoding
50      NotAcceptable                  Client-acceptable version not found
51      CannotVerifyServerCertificate  Cannot verify server certificate
52      ConnectionNotReusable          Connection not reusable
53      CannotTunnelProtocol           Cannot tunnel protocol
54      PacError                       Proxy auto-config error
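
The codes in the table above can be checked after a fetch. The following is a minimal sketch, assuming a script that wants to treat a truncated page (MaxPageSizeExceeded, errnum 15) as usable and report any other error; the URL and the variable names ($doc, $tok, $num, $msg) are illustrative only.

```vortex
<!-- Sketch only: URL and variable names are hypothetical -->
<fetch "http://www.example.com/mypage.html">
<$doc = $ret>
<urlinfo "errtoken">
<$tok = $ret>
<if $tok eq "Ok">
  Fetch succeeded.
<else>
  <urlinfo "errnum"> <$num = $ret>
  <urlinfo "errmsg"> <$msg = $ret>
  <if $tok eq "MaxPageSizeExceeded">
    Page was truncated; continuing with the partial document.
  <else>
    Fetch failed: $tok (errnum $num): $msg
  </if>
</if>
```

Comparing against errtoken rather than errmsg follows the note above that tokens are intended to stay constant across releases, whereas message text may change.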


DIAGNOSTICS
urlinfo returns the requested value(s).


EXAMPLE

<fetch "http://www.somesite.com/mypage.html">
<urlinfo "metanames">
<$names = $ret>
Meta data:
<LOOP $names>
  <urlinfo "metaname" $names>
  $names = <LOOP $ret> "$ret" </LOOP>

</LOOP>
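
A second sketch, in the same style as the example above, shows how headervalue and headerparam might be used to split a semicolon-parameterized header. It assumes the server sends a Content-Type header with a charset parameter; the URL and the variable names ($type, $charset) are hypothetical.

```vortex
<!-- Sketch only: assumes a Content-Type header with a charset parameter -->
<fetch "http://www.somesite.com/mypage.html">
<urlinfo "headervalue" "Content-Type">
<$type = $ret>
<urlinfo "headerparam" "Content-Type" "charset">
<$charset = $ret>
Content-Type value: $type
charset parameter: $charset
```

Going by the descriptions of headervalue and headerparam, a header such as "text/html; charset=UTF-8" would be expected to yield text/html for the first call and UTF-8 for the second; if the header or parameter is absent, the corresponding variable would presumably be empty.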


CAVEATS
The urlinfo function was added in version 2.1.884800000 19980114.

If submit is used with TOFILE, then content and content-derived items such as links are unavailable in urlinfo, because the content was not held in memory for processing.


SEE ALSO
fetch, submit, urlcp


Copyright © Thunderstone Software     Last updated: Dec 10 2018