You may make many changes to Webinator's walk behavior by using
Walk Settings from the administrative interface main menu.
But you are not limited to these features. You may change any and all
aspects of the walker's behavior by modifying the supplied
dowalk script. (The webinatoradmin script supplied with
version 4 and earlier releases has been combined into dowalk
for atomicity.)
For details on programming with Texis Web Script (Vortex), see the
manual at the Thunderstone web site, https://www.thunderstone.com/.
The following describes some important points about the internals of the
dowalk script that comes with Webinator. The dowalk script is
fairly heavily commented to aid in finding your way around within it.
The dowalk script actually consists of two Vortex script files
concatenated. The first part contains the walker/indexer and settings
reading code. The second part of the file provides the management
interface that is used from a web browser.
The dispatch function is the primary external entry point for
performing a new walk. It loads settings, sets up logging and databases,
then invokes other processes in parallel (according to the maximum servers
setting). When all of the walking is complete it removes commonality
from pages (if that option is set), creates the indices needed for
searching the database, then makes the new database live and deletes
the old database.
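
The ordering of these phases can be outlined with a minimal Python sketch. It is not the Vortex code in dowalk: the names walk_site, max_servers and remove_common are hypothetical, and a thread pool stands in for the separate walker processes the dispatcher actually starts.

```python
from concurrent.futures import ThreadPoolExecutor

def dispatch(sites, walk_site, max_servers=2, remove_common=False):
    """Run the dispatch phases in order and return a log of what happened."""
    log = ["load settings", "set up logging and databases"]
    # Walk the sites in parallel, never more than max_servers at once.
    # (Assumption: threads stand in for the real child processes.)
    with ThreadPoolExecutor(max_workers=max_servers) as pool:
        walked = list(pool.map(walk_site, sites))
    log.append("walked %d sites" % len(walked))
    # Post-processing starts only after every walker has finished.
    if remove_common:
        log.append("remove commonality")
    log += ["create indices", "make new database live", "delete old database"]
    return log
```

The point of the sketch is the sequencing: nothing after the parallel walk runs until every walker has returned, and the old database is removed last.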
The stop function is an external entry point that is used to
signal (using <loguser>) a walk that is in progress that it
should stop. The walkers check for this signal (using
<userstats>) at various points and will quit when it is detected.
The reindex function is an external entry point that is used
to drop and recreate the Metamorph index on the html table. This is
needed after changing the word definition expressions.
The remakeindex function is an external entry point that is used
to drop and recreate all indices on the database. It is only for use if
one or more non-Metamorph indices get corrupted by disk errors or such.
The recat function is an external entry point that is used to
recategorize the html table based on the current (presumably changed)
categories. This may take some time on
large walks.
The ifmodified function is an external entry point that is used to
tell the dispatcher to run only if chkneedwalk indicates a walk is needed.
The usage function is called when you invoke dowalk incorrectly
and prints a terse summary of the correct usage options.
The doplugin function handles files that are not HTML or text,
such as PDF and MSWord. It determines the correct options for anytotx
based on the fetched page's MIME type or extension. It then calls the
dofilt function which actually runs anytotx to perform
the conversion to text and the extraction of meta information such as
Title. It will make up a title for the document if none is returned
by anytotx.
The settings function calls the defaults, readsettings,
and applysettings functions, in order. This function is called by
most entry points to get default and current settings for a given
profile before proceeding with any work.
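
As a rough illustration of that call order, the Python sketch below layers saved values over defaults. It assumes that defaults supplies baseline values, readsettings loads what was saved for the profile, and applysettings puts the combination into effect. The setting names and the stored dictionary are hypothetical, not taken from dowalk.

```python
def defaults():
    # Hypothetical baseline values; the real script defines its own set.
    return {"max_servers": 1, "max_threads": 5, "depth": 10}

def readsettings(profile, stored):
    # Return the values saved for this profile (empty if none were saved).
    return dict(stored.get(profile, {}))

def applysettings(base, saved):
    # Saved values take precedence over the defaults.
    merged = dict(base)
    merged.update(saved)
    return merged

def settings(profile, stored):
    # Same order as described above: defaults, readsettings, applysettings.
    base = defaults()
    saved = readsettings(profile, stored)
    return applysettings(base, saved)
```

With this layering, a profile that has saved nothing simply gets the defaults.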
The updatemmindex function is called (sometime after having
called settings) to create or update the Metamorph index on the html table.
The maketables function is called (sometime after having
called settings) to create all of the Webinator tables. This function does
nothing for Webinator-only licenses. For Webinator-only licenses the
tables are created automatically by Texis when the database is created.
The schema may not be changed.
The walk function is the core which walks all desired URLs on a
single site. It always processes breadth first (i.e. it gets all URLs at
a given depth before proceeding to the next level down). Any desired
URLs that reside on a different site are placed into the database's
todo table for processing by the dispatcher.
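
The breadth-first order and the handling of off-site URLs can be sketched in Python as follows. This is only an illustration under stated assumptions: get_links is a hypothetical callback that returns the links found on a page, max_depth is a hypothetical limit, and a plain list stands in for the todo table.

```python
from collections import deque
from urllib.parse import urlsplit

def walk(start_url, get_links, max_depth=2):
    """Breadth-first walk of one site; off-site URLs go to a todo list."""
    site = urlsplit(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited, todo = [], []
    while queue:
        # A FIFO queue guarantees every URL at one depth is handled
        # before any URL at the next depth.
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue
        for link in get_links(url):
            if link in seen:
                continue
            seen.add(link)
            if urlsplit(link).netloc == site:
                queue.append((link, depth + 1))
            else:
                todo.append(link)  # left for the dispatcher
    return visited, todo
```

Called with a start URL and a link-extraction callback, it returns the pages visited in breadth-first order together with the off-site URLs that were set aside.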
The fetchset function is used in various places to fetch one
or more URLs (using the maximum threads setting) simultaneously.
The manglepage function is called before extracting text and hyperlinks
from an HTML page. It allows the page to be modified before processing.
This is where the ignore/keep tags are handled.
The getrobotstxt function fetches the robots.txt file from a
given site and checks for any exclusions for Webinator. These
exclusions are later added to the list of URL rejection patterns.
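
To make the idea concrete, here is a simplified Python sketch that collects Disallow paths from a robots.txt body. It is not how dowalk parses the file. The agent name "webinator" is an assumption, and the sketch merges rules from the wildcard group with those from a matching named group instead of implementing full robots.txt precedence.

```python
def robots_exclusions(robots_txt, agent="webinator"):
    """Collect Disallow paths from groups naming this agent or '*'."""
    exclusions = []
    agents, in_rules = [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                # A User-agent line after rules starts a new group.
                agents, in_rules = [], False
            agents.append(value.lower())
        else:
            in_rules = True
            applies = any(a == "*" or a in agent.lower() for a in agents)
            if field == "disallow" and value and applies:
                exclusions.append(value)
    return exclusions
```

The returned paths could then be appended to whatever list of rejection patterns the walk already uses.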
The chkneedwalk function is called
to check if a rewalk is required. It fetches the page
to see if the modification date has changed. If the web server does
not provide a modification date, it compares the content to what it was
previously. It sets an internal flag if a rewalk is needed.
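
The decision described here might look like the following Python sketch. It assumes the previous check was recorded as a small dictionary and that content is compared through a SHA-256 digest. Both are illustrative choices, not details taken from dowalk, and the function names are hypothetical.

```python
import hashlib

def page_digest(body):
    # Fingerprint of the page content (bytes).
    return hashlib.sha256(body).hexdigest()

def need_walk(previous, last_modified, body):
    """Return True if the page changed since the previous check.

    previous: dict with optional 'last_modified' and 'digest' keys.
    last_modified: the server's modification date, or None if absent.
    """
    if last_modified is not None:
        return last_modified != previous.get("last_modified")
    # No modification date from the server: fall back to the content.
    return page_digest(body) != previous.get("digest")
```

Storing a digest rather than the whole page keeps the comparison cheap, though it gives up the ability to see what changed.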
The putmsg function intercepts error messages to provide special
handling for some, and recording of most.
The go function is an external entry point used by the dispatcher
when it starts up child processes to walk a specific site or set of URLs.
The singles function is an external entry point that is used to
fetch all of the single-page URLs. It is called by the dispatcher as
the first parallel process. Therefore single pages will generally be
fetched earliest in a new walk.
The rmlocks function is used to remove any stale locks and monitor
processes on a database and dismantle the locking structure. This is
done before physically removing a database from the system.
The geturl function is a utility function that may be used to find
out what the walker will think about a given URL using the current walk
settings. It is invoked as follows:
texis profile=PROFILE top=THEURL dowalk/geturl.txt
This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.
texis profile=PROFILE top=THEURL dowalk/geturl.txt >FILE.txt
The getrobots function is a utility function that may be used to find
out what the walker will think about a given robots.txt using the current walk
settings. It is invoked as follows:
texis profile=PROFILE top=THEURL dowalk/getrobots.txt
This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.
texis profile=PROFILE top=THEURL dowalk/getrobots.txt >FILE.txt