You may make many changes to Webinator's walk behavior by using Walk
Settings from the administrative interface main menu.
But you are not limited to these features. You may change any and all
aspects of the walker's behavior by modifying the supplied dowalk script.
(The webinatoradmin script supplied with version 4 and earlier releases
has been combined into dowalk for atomicity.)
For details on programming with Texis Web Script (Vortex), see the
manual at the Thunderstone web site, https://www.thunderstone.com/.
The following describes some important points about the internals of the
dowalk script that comes with Webinator. The dowalk script is fairly
heavily commented to aid in finding your way around within it.
The dowalk script actually consists of 2 Vortex script files
concatenated. The first part contains the walker/indexer and settings
reading code. The second part of the file provides the management
interface that is used from a web browser.
The dispatch function is the primary external entry point for
performing a new walk. It loads settings, sets up logging and databases,
then invokes other processes in parallel (according to the maximum servers
setting). When all of the walking is complete it removes commonality
from pages (if that option is set), creates the indices needed for
searching the database, then makes the new database live and deletes
the old database.
The stop function is an external entry point that is used to signal a
walk that is in progress (using <loguser>) that it should stop. The
walkers check for this signal (using <userstats>) at various points and
will quit when it is detected.
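The check-and-quit pattern described above can be sketched in Python. This is only an illustration of cooperative stopping, not the dowalk code: the <loguser> and <userstats> mechanisms are not modeled, and walk_until_stopped, stop_requested and process are hypothetical names.

```python
import threading

def walk_until_stopped(urls, stop_requested, process):
    """Process URLs in order, checking for a stop request before each one."""
    done = []
    for url in urls:
        # Check point: quit as soon as the stop signal is seen.
        if stop_requested.is_set():
            break
        done.append(process(url))
    return done
```

Work already completed before the signal is kept; only the remaining URLs are skipped.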
The reindex function is an external entry point that is used to drop
and recreate the Metamorph index on the html table. This is needed after
changing the word definition expressions.
The remakeindex function is an external entry point that is used to
drop and recreate all indices on the database. It is only for use if one
or more non-Metamorph indices get corrupted by disk errors or the like.
The recat function is an external entry point that is used to
recategorize the html table based on the current (presumably changed)
categories. This may take some time on large walks.
The ifmodified function is an external entry point that is used to
tell the dispatcher to run only if chkneedwalk indicates a walk is needed.
The usage function is called when you invoke dowalk incorrectly and
prints a terse summary of correct usage options.
The doplugin function handles files that are not HTML or text, such as
PDF and MSWord. It determines the correct options for anytotx based on
the fetched page's MIME type or extension. It then calls the dofilt
function which actually runs anytotx to perform the conversion to text
and the extraction of meta information such as Title. It will make up a
title for the document if none is returned by anytotx.
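As a rough Python illustration of the two decisions described above (choosing a conversion by MIME type or extension, and making up a title when none comes back), consider the sketch below. The lookup tables, the function names, the choice to try the MIME type before the extension, and the use of the last path segment as a fallback title are all assumptions made for the example; the actual anytotx options and title logic live in dowalk and are not reproduced here.

```python
import os

# Hypothetical lookup tables; the real anytotx options are not shown here.
BY_MIME = {"application/pdf": "pdf", "application/msword": "msword"}
BY_EXT = {".pdf": "pdf", ".doc": "msword"}

def pick_format(mime_type, url):
    """Try the MIME type first, then fall back to the file extension."""
    if mime_type in BY_MIME:
        return BY_MIME[mime_type]
    path = url.split("?", 1)[0]          # ignore any query string
    ext = os.path.splitext(path)[1].lower()
    return BY_EXT.get(ext)               # None if nothing matches

def choose_title(converted_title, url):
    """Use the converter's title, or make one up from the URL."""
    if converted_title and converted_title.strip():
        return converted_title.strip()
    name = url.split("?", 1)[0].rstrip("/").rsplit("/", 1)[-1]
    return name or url
```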
The settings function calls the defaults, readsettings,
and applysettings functions, in order. This function is called by
most entry points to get default and current settings for a given
profile before proceeding with any work.
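The layering of defaults, stored settings and applied settings can be pictured with a small Python sketch. The function names are reused from the text only for readability; their signatures, the setting names and the values are hypothetical, and the stored settings are modeled as a plain dictionary rather than a database.

```python
def defaults():
    """Built-in values used when a profile does not override them."""
    return {"max_servers": 1, "max_threads": 5, "depth": 10}

def readsettings(stored, profile):
    """Fetch the saved overrides for one profile (empty if none saved)."""
    return dict(stored.get(profile, {}))

def applysettings(base, overrides):
    """Lay the profile's overrides on top of the defaults."""
    merged = dict(base)
    merged.update(overrides)
    return merged

def settings(stored, profile):
    """Run the three steps in order: defaults, read, apply."""
    return applysettings(defaults(), readsettings(stored, profile))
```

Calling the steps in this fixed order means a profile only has to store the values it changes.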
The updatemmindex function is called (sometime after having called
settings) to create or update the Metamorph index on the html table.
The maketables function is called (sometime after having called
settings) to create all of the Webinator tables. This function does
nothing for Webinator-only licenses. For Webinator-only licenses the
tables are created automatically by Texis when the database is created.
The schema may not be changed.
The walk function is the core which walks all desired URLs on a
single site. It always processes breadth first (i.e. it gets all URLs at
a given depth before proceeding to the next level down). Any desired
URLs that reside on a different site are placed into the database's
todo table for processing by the dispatcher.
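A breadth-first, single-site walk of this kind can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the dowalk implementation: get_links is a hypothetical callback that returns the links found on a page, max_depth is an invented parameter, and the todo table is modeled as a plain list.

```python
from urllib.parse import urlsplit

def walk(start_url, get_links, max_depth=2):
    """Visit one site level by level; collect off-site URLs separately."""
    site = urlsplit(start_url).netloc
    seen = {start_url}
    order = []           # URLs in the order they were processed
    todo = []            # off-site URLs left for the dispatcher
    level = [start_url]
    depth = 0
    while level and depth <= max_depth:
        next_level = []
        for url in level:                 # finish this depth completely
            order.append(url)
            for link in get_links(url):
                if link in seen:
                    continue
                seen.add(link)
                if urlsplit(link).netloc != site:
                    todo.append(link)     # different site: defer it
                else:
                    next_level.append(link)
        level = next_level                # only now go one level down
        depth += 1
    return order, todo
```

Every URL at one depth is processed before any URL at the next depth, which is what distinguishes this from a depth-first crawl.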
The fetchset function is used in various places to fetch one
or more URLs (using the maximum threads setting) simultaneously.
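Fetching a set of URLs simultaneously with a thread limit could look like the Python sketch below. It is an illustration only: fetch is a hypothetical callback standing in for the real page fetch, and max_threads stands in for the maximum threads setting.

```python
from concurrent.futures import ThreadPoolExecutor

def fetchset(urls, fetch, max_threads=5):
    """Fetch all URLs with at most max_threads running at once."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max(1, max_threads)) as pool:
        # pool.map returns results in the same order as the input URLs.
        return dict(zip(urls, pool.map(fetch, urls)))
```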
The manglepage function is called before extracting text and hyperlinks
from an HTML page. It allows the page to be modified before processing.
This is where the ignore/keep tags are handled.
The getrobotstxt function fetches the robots.txt file from a
given site and checks for any exclusions for Webinator. These
exclusions are later added to the list of URL rejection patterns.
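To show how robots.txt exclusions and ordinary rejection patterns can combine into one accept/reject decision, here is a Python sketch using the standard library's robots.txt parser. It is not how dowalk does it: the agent name passed in, the is_rejected helper, and the treatment of rejection patterns as simple URL prefixes are all assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

def is_rejected(url, robots_text, reject_prefixes, agent="Webinator"):
    """Reject a URL if robots.txt excludes it or a prefix pattern matches."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())   # parse text already fetched
    if not parser.can_fetch(agent, url):
        return True                          # excluded by robots.txt
    # Otherwise fall back to the configured rejection prefixes.
    return any(url.startswith(prefix) for prefix in reject_prefixes)
```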
The chkneedwalk function is called to check if a rewalk is required.
It fetches the page
to see if the modification date has changed. Or, if the web server does
not provide a modification date it compares the content to what it was
previously. It sets an internal flag if a rewalk is needed.
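The decision described above, modification date first and content comparison as the fallback, can be sketched in Python. The need_walk name, the shape of the previous record, and the use of a SHA-256 digest to compare content are assumptions for illustration; the source does not say how dowalk stores or compares the earlier content.

```python
import hashlib

def need_walk(previous, last_modified, body):
    """Return True if the page changed since the previous check.

    previous: dict with 'last_modified' and 'digest' from the prior check.
    last_modified: the server's modification date, or None if not provided.
    body: the fetched page content as bytes.
    """
    if last_modified is not None:
        # The server reports a date: compare it with the stored one.
        return last_modified != previous.get("last_modified")
    # No date available: compare a digest of the content instead.
    return hashlib.sha256(body).hexdigest() != previous.get("digest")
```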
The putmsg function intercepts error messages to provide special
handling for some, and recording of most.
The go function is an external entry point used by the dispatcher
when it starts up child processes to walk a specific site or set of URLs.
The singles function is an external entry point that is used to
fetch all of the single page URLs. It is called by the dispatcher as
the first parallel process. Therefore single pages will generally be
fetched earliest in a new walk.
The rmlocks function is used to remove any stale locks and monitor
processes on a database and dismantle the locking structure. This is
done before physically removing a database from the system.
The geturl function is a utility function that may be used to find
out what the walker will think about a given URL using the current walk
settings. It is invoked as follows:
texis profile=PROFILE top=THEURL dowalk/geturl.txt
This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.
texis profile=PROFILE top=THEURL dowalk/geturl.txt >FILE.txt
The getrobots function is a utility function that may be used to find
out what the walker will think about a given robots.txt using the current walk
settings. It is invoked as follows:
texis profile=PROFILE top=THEURL dowalk/getrobots.txt
This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.
texis profile=PROFILE top=THEURL dowalk/getrobots.txt >FILE.txt