pdfxml - convert Metamorph hit to PDF markup information

SYNOPSIS

<pdfxml $query $Body [options ...]>

DESCRIPTION
The pdfxml function enables Metamorph hit markup to be viewed in a PDF (Adobe Acrobat) file. It executes the Metamorph $query on a PDF document ($Body) and returns the information needed by a user's PDF browser plugin to mark up the resulting hits.

Note that $Body must be the exact text returned by the PDF plugin for Thunderstone's Webinator or Texis. If it is modified in any way, character and word counts may be off, causing incorrect highlighting information to be sent to the browser.
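To illustrate why an altered $Body breaks highlighting, here is a small Python sketch. It is not Vortex, and it does not claim to reproduce how pdfxml computes its counts; the sample text and the helper name locate are hypothetical. It finds a term by character offset and by word index in text as a plugin might return it, and again in a cleaned-up copy of the same text.

```python
def locate(text, term):
    """Return (character offset, word index) of the first occurrence of term."""
    char_offset = text.find(term)
    word_index = text.split().index(term)
    return char_offset, word_index

# Hypothetical text as a PDF plugin might return it, layout spacing intact.
plugin_text = "Annual  Report\n\n2004 re-\nsults for Texis"
# The same text after a cleanup pass: whitespace collapsed, hyphenation joined.
cleaned_text = "Annual Report 2004 results for Texis"

print(locate(plugin_text, "Texis"))
print(locate(cleaned_text, "Texis"))
```

Both the character offset and the word index of the term differ between the two strings, so positions computed from the cleaned copy would point at the wrong place in the original PDF text.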

Options that may be specified are:

$color - The color in which to show highlighted terms. Must be an RGB color of the form #RRGGBB (per the Adobe specification).
words - Flag indicating that the hit markup should use word mode. Mutually exclusive with characters. This is the default, and should be used with anytotx plugins that use the "Adobe Acrobat TK" library: those prior to version 4.02.1038324681 20021126, for which anytotx --identify does not show a pdf: version. Added in version 4.00.999800000 20010906.
characters - Flag indicating that the hit markup should use character mode. Mutually exclusive with words. This mode should be used with anytotx plugins that use the XPDF library: those after version 4.02.1038324681 20021126, for which anytotx --identify shows a pdf: version. Added in version 4.00.999800000 20010906.
active - Indicates that the browser's PDF viewer should jump to the first match upon displaying the document. Mutually exclusive with passive. This is the default. Added in version 4.00.999800000 20010906.
passive - Indicates that the browser's PDF viewer should not jump to the first match upon displaying the document. Mutually exclusive with active. Added in version 4.00.999800000 20010906.
showhits - Indicates that the matching terms should be included in the XML output as comments. This is primarily for debugging and should generally not be used in a production environment. Added in version 4.00.999800000 20010906.
startpage $pg or startpg $pg - Specifies the page number at which $Body actually starts. The default is 0 (the first page), i.e. $Body is the complete document. This option keeps the browser plugin in sync when a partial-document $Body (e.g. a single page) is used. For example, if $Body actually starts at the third page of the original document, use startpg 2. Added in version 5.00.1092761457 20040817.
charset $charset - Specifies the character set of $Body. The default is ISO-8859-1. Multi-byte character sets such as UTF-8 can cause erroneous highlighting offsets if the character set is not specified with this option. Note that this is the character set of $Body, not necessarily that of the original PDF. Added in version 5.01.1104778576 20050103.
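The charset caveat can be illustrated with a short Python sketch. This is not Vortex and makes no claim about how pdfxml works internally; the sample string and the helper name offsets are hypothetical. It shows that a term's byte position and its character position agree in a single-byte character set such as ISO-8859-1 but diverge in UTF-8, which is the kind of mismatch that arises when multi-byte text is treated as the default character set.

```python
def offsets(text, term, charset):
    """Return (character offset, byte offset) of term in text encoded as charset."""
    char_offset = text.find(term)
    byte_offset = text.encode(charset).find(term.encode(charset))
    return char_offset, byte_offset

# Hypothetical body text containing three accented characters before the term.
body = "Résumé naïve search"

latin1 = offsets(body, "search", "iso-8859-1")  # one byte per character
utf8 = offsets(body, "search", "utf-8")         # accented characters take two bytes
print(latin1, utf8)
```

In UTF-8 the byte offset runs ahead of the character offset by one for each two-byte character that precedes the term.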

DIAGNOSTICS
The pdfxml function returns a list of strings to be sent to the Web browser's PDF viewer plugin.

EXAMPLE

<EXPORT $query>
<EXPORT $id>

<A NAME=xml>
  <SQL MAX=1 "select Body from html where id = $id">
    <pdfxml $query $Body "#00FF00">
    <LOOP $ret>
      $ret
    </LOOP>
  </SQL>
</A>

<A NAME=main>
  ...

  <SQL "select Url, Title, id from html
        where Title\Body like $query">
    <substr $Title 0 14>
    <IF $ret eq "PDF Document (">
      <CAPTURE>#xml=http://$HTTP_HOST$url/xml.txt</CAPTURE>
    <ELSE>
      <$ret = "">
    </IF>
    <A HREF="http://$Url$ret">$Title</A>
  </SQL>
</A>

In this example, the main function excerpt prints links to the documents matching $query, from a Webinator html table. For most documents the link is http:// plus the $Url. For PDF documents, however, a fragment (#xml=...) is appended that contains the URL of the hit markup information. If the user's Web browser is configured with a PDF viewer, the viewer will fetch this hit information URL (which points to the xml function here) and use it to mark up the PDF document.
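The link-building logic of the main function can be restated outside Vortex as a short Python sketch. The function name result_link, the constant name, and the example URLs are hypothetical; only the "PDF Document (" title test and the #xml= fragment come from the example above.

```python
PDF_TITLE_PREFIX = "PDF Document ("  # same 14-character prefix the Vortex example tests

def result_link(url, title, xml_url):
    """Return the href for one search result, adding #xml=... for PDF documents."""
    href = "http://" + url
    if title.startswith(PDF_TITLE_PREFIX):
        # The PDF viewer fetches xml_url to obtain the hit markup information.
        href += "#xml=" + xml_url
    return href

# Hypothetical values for illustration only.
pdf_href = result_link("www.example.com/a.pdf", "PDF Document (a.pdf)",
                       "http://www.example.com/texis/search/xml.txt")
html_href = result_link("www.example.com/a.html", "Home page",
                        "http://www.example.com/texis/search/xml.txt")
print(pdf_href)
print(html_href)
```

Only the PDF result carries the fragment; other documents get a plain link, matching the ELSE branch that sets $ret to an empty string.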

CAVEATS
The pdfxml function was added in version 2.1.864700000 19970527.

The PDF (anytotx) plugin for Thunderstone's Webinator must be used to generate the $Body value passed to pdfxml. Also, the web user's browser must have a configured PDF viewer to fetch and use the PDF markup information.

The XML generated by this function conforms to Adobe's "Highlight File Format" specified in Adobe Technical Note #5172. It does not necessarily conform to any XML "standard".