Customizing the Walker

 

You may make many changes to Webinator's walk behavior by using Walk Settings from the administrative interface main menu.

But you are not limited to these features. You may change any and all aspects of the walker's behavior by modifying the supplied dowalk script. (The webinatoradmin script supplied with version 4 and earlier releases has been combined into dowalk for atomicity.)

For details on programming with Texis Web Script (Vortex), see the manual at the Thunderstone web site, http://www.thunderstone.com/.

The following describes some important points about the internals of the dowalk script that comes with Webinator. The dowalk script is fairly heavily commented to aid in finding your way around within it.

The dowalk script actually consists of two concatenated Vortex script files. The first part contains the walker/indexer and the code that reads settings. The second part provides the management interface that is used from a web browser.

The dispatch function is the primary external entry point for performing a new walk. It loads settings, sets up logging and databases, then invokes other processes in parallel (according to the maximum servers setting). When all of the walking is complete it removes commonality from pages (if that option is set), creates the indices needed for searching the database, then makes the new database live and deletes the old database.

The stop function is an external entry point that is used to signal a walk in progress (using <loguser>) that it should stop. The walkers check for this signal (using <userstats>) at various points and quit when it is detected.
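
The stop mechanism is cooperative: the walker is not killed, it notices the request at its next check. The following Python sketch illustrates only that pattern. It is not the Vortex code in dowalk and does not use <loguser> or <userstats>; the threading.Event flag and every function name in it are hypothetical stand-ins.

```python
import threading

# Hypothetical stand-in for the signal that the walkers poll.
stop_requested = threading.Event()

def walker(urls, process):
    """Process URLs one by one, checking for a stop request before each page."""
    done = []
    for url in urls:
        if stop_requested.is_set():
            break  # quit cleanly at a safe point
        done.append(process(url))
    return done

def process_and_stop_at_second(url):
    """Pretend to process a page; simulate an external stop request on page 2."""
    if url.endswith("/2"):
        stop_requested.set()
    return url

finished = walker(
    ["http://a.example/1", "http://a.example/2", "http://a.example/3"],
    process_and_stop_at_second,
)
print(finished)
```

The page being processed when the request arrives is completed; the walker stops before starting the next one.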

The reindex function is an external entry point that is used to drop and recreate the Metamorph index on the html table. This is needed after changing the word definition expressions.

The remakeindex function is an external entry point that is used to drop and recreate all indices on the database. It is only for use if one or more non-Metamorph indices get corrupted by disk errors or similar problems.

The recat function is an external entry point that is used to recategorize the html table based on the current (presumably changed) categories. This may take some time on large walks.

The ifmodified function is an external entry point that is used to tell the dispatcher to run only if chkneedwalk indicates a walk is needed.

The usage function is called when you invoke dowalk incorrectly and prints a terse summary of correct usage options.

The doplugin function handles files that are not HTML or text, such as PDF and MSWord. It determines the correct options for anytotx based on the fetched page's MIME type or extension. It then calls the dofilt function which actually runs anytotx to perform the conversion to text and the extraction of meta information such as Title. It will make up a title for the document if none is returned by anytotx.

The settings function calls the defaults, readsettings, and applysettings functions, in order. This function is called by most entry points to get default and current settings for a given profile before proceeding with any work.
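
The order of the three calls matters: defaults come first, stored values override them, and the result is then put into working form. The Python sketch below shows that layering only. It is not the Vortex code in dowalk; the setting names (max_depth, max_threads, base_url), their values, and the stored-profile dictionary are all hypothetical.

```python
# Illustrative only: dowalk is written in Vortex, and every setting name
# and value below is a hypothetical stand-in.

def defaults():
    """Return built-in values used when a profile sets nothing."""
    return {"max_depth": 5, "max_threads": 4, "base_url": ""}

def readsettings(current, stored):
    """Overlay the values stored for a profile on top of the defaults."""
    merged = dict(current)
    merged.update(stored)
    return merged

def applysettings(current):
    """Put the merged settings into working form (here: normalize types)."""
    applied = dict(current)
    applied["max_depth"] = int(applied["max_depth"])
    applied["max_threads"] = max(1, int(applied["max_threads"]))
    return applied

def settings(stored_profile):
    """Call the three steps in the documented order."""
    return applysettings(readsettings(defaults(), stored_profile))

result = settings({"max_depth": "3", "base_url": "http://www.example.com/"})
print(result)
```

A value that the profile does not store (max_threads here) keeps its default.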

The updatemmindex function is called (sometime after having called settings) to create or update the Metamorph index on the html table.

The maketables function is called (sometime after having called settings) to create all of the Webinator tables. This function does nothing for Webinator-only licenses, because in that case the tables are created automatically by Texis when the database is created and the schema may not be changed.

The walk function is the core routine that walks all desired URLs on a single site. It always processes breadth first (i.e. it gets all URLs at a given depth before proceeding to the next level down). Any desired URLs that reside on a different site are placed into the database's todo table for processing by the dispatcher.
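
A breadth-first walk of this kind can be sketched in Python as follows. This is an illustration of the traversal order, not the Vortex code in dowalk: the link graph is a hypothetical dictionary standing in for fetched pages, the todo list stands in for the todo table, and the max_depth parameter is an assumption.

```python
from collections import deque
from urllib.parse import urlparse

def walk(start_url, links, max_depth):
    """Breadth-first walk of one site.

    `links` maps each URL to the URLs found on that page.
    Returns (pages in processing order, off-site URLs left for the dispatcher).
    """
    site = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    todo = []  # stands in for the database's todo table
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth >= max_depth:
            continue  # do not follow links below the depth limit
        for link in links.get(url, []):
            if link in seen:
                continue
            seen.add(link)
            if urlparse(link).netloc == site:
                queue.append((link, depth + 1))
            else:
                todo.append(link)
    return visited, todo

# Hypothetical link graph standing in for fetched pages.
LINKS = {
    "http://a.example/": ["http://a.example/x", "http://a.example/y", "http://b.example/"],
    "http://a.example/x": ["http://a.example/x1"],
    "http://a.example/y": ["http://a.example/y1", "http://a.example/x"],
}

visited, todo = walk("http://a.example/", LINKS, max_depth=2)
print(visited)
print(todo)
```

Because the queue is first-in, first-out, both depth-1 pages are processed before either depth-2 page, and the link to the other site is deferred rather than followed.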

The fetchset function is used in various places to fetch one or more URLs (using the maximum threads setting) simultaneously.
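
The idea of fetching a set of URLs with a cap on simultaneous requests can be sketched in Python with a thread pool. This is only an analogy for the maximum threads setting, not how dowalk implements it; fake_fetch is a hypothetical stand-in for a real HTTP request so that the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor

def fetchset(urls, fetch, max_threads):
    """Fetch every URL with at most `max_threads` requests in flight.

    Results are returned in the same order as `urls`.
    """
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(fetch, urls))

def fake_fetch(url):
    """Hypothetical stand-in for a real HTTP request."""
    return (url, len(url))

urls = ["http://a.example/1", "http://a.example/22", "http://a.example/333"]
pages = fetchset(urls, fake_fetch, max_threads=2)
print(pages)
```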

The manglepage function is called before extracting text and hyperlinks from an HTML page. It allows the page to be modified before processing. This is where the ignore/keep tags are handled.

The getrobotstxt function fetches the robots.txt file from a given site and checks for any exclusions for Webinator. These exclusions are later added to the list of URL rejection patterns.
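
To make the idea of collecting exclusions concrete, here is a simplified Python sketch that reads the text of a robots.txt file and returns the Disallow paths that apply to a given agent. It is not the Vortex code in dowalk. The agent token "Webinator" is assumed for illustration, and the sketch deliberately merges the wildcard group with the agent's own group, which is looser than the precedence rules a full robots.txt parser would follow.

```python
def robots_exclusions(robots_txt, agent):
    """Collect Disallow paths from groups naming `agent` or the wildcard `*`."""
    exclusions = []
    applies = False    # does the current group apply to this agent?
    in_agents = False  # still reading a group's User-agent lines?
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not in_agents:
                applies = False  # a new group starts here
                in_agents = True
            if value == "*" or (value and value.lower() in agent.lower()):
                applies = True
        else:
            in_agents = False
            if field == "disallow" and applies and value:
                exclusions.append(value)
    return exclusions

# Hypothetical robots.txt content.
ROBOTS = """\
User-agent: *
Disallow: /private/

User-agent: othercrawler
Disallow: /

User-agent: Webinator
Disallow: /cgi-bin/   # scripts
Disallow:
"""

patterns = robots_exclusions(ROBOTS, "Webinator")
print(patterns)
```

The returned paths play the role of the exclusions that are added to the list of URL rejection patterns.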

The chkneedwalk function is called to check if a rewalk is required. It fetches the page to see if the modification date has changed or, if the web server does not provide a modification date, compares the content to what it was previously. It sets an internal flag if a rewalk is needed.
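
The decision can be sketched in Python as below. This is an illustration, not the Vortex code in dowalk: the dictionary layout is hypothetical, and comparing SHA-256 digests is an assumption made here as one way to compare content (the source says only that the content is compared).

```python
import hashlib

def chkneedwalk(previous, current):
    """Return True when a rewalk is needed.

    `previous` and `current` are dicts with an optional 'last_modified'
    string and a 'body' (bytes); this layout is hypothetical.
    """
    if current.get("last_modified") is not None:
        # The server supplied a modification date: compare dates.
        return current["last_modified"] != previous.get("last_modified")
    # No modification date: fall back to comparing the content itself.
    old = hashlib.sha256(previous.get("body", b"")).hexdigest()
    new = hashlib.sha256(current.get("body", b"")).hexdigest()
    return old != new

dated_same = chkneedwalk(
    {"last_modified": "Thu, 07 Mar 2019 10:00:00 GMT", "body": b"a"},
    {"last_modified": "Thu, 07 Mar 2019 10:00:00 GMT", "body": b"b"},
)
undated_changed = chkneedwalk({"body": b"old text"}, {"body": b"new text"})
undated_same = chkneedwalk({"body": b"same"}, {"body": b"same"})
print(dated_same, undated_changed, undated_same)
```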

The putmsg function intercepts error messages to provide special handling for some, and recording of most.

The go function is an external entry point used by the dispatcher when it starts up child processes to walk a specific site or set of URLs.

The singles function is an external entry point that is used to fetch all of the single page URLs. It is called by the dispatcher as the first parallel process. Therefore single pages will generally be fetched earliest in a new walk.

The rmlocks function is used to remove any stale locks and monitor processes on a database and dismantle the locking structure. This is done before physically removing a database from the system.

The geturl function is a utility function that may be used to find out what the walker will think about a given URL using the current walk settings. It is invoked as follows:

texis profile=PROFILE top=THEURL dowalk/geturl.txt

This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.

texis profile=PROFILE top=THEURL dowalk/geturl.txt >FILE.txt

The getrobots function is a utility function that may be used to find out what the walker will think about a given robots.txt using the current walk settings. It is invoked as follows:

texis profile=PROFILE top=THEURL dowalk/getrobots.txt

This can generate a lot of output for a page of any size, so you may want to redirect it to a file that you can examine with your favorite viewer/editor.

texis profile=PROFILE top=THEURL dowalk/getrobots.txt >FILE.txt


Copyright © Thunderstone Software     Last updated: Mar 7 2019