Database and File Usage

The Search Appliance maintains a database that contains text from HTML pages, links to other pages, and a list of categories.

When the Search Appliance walker runs, it creates a new database under your specified data directory to hold the new walk. It then dispatches a separate process for each web site it needs to visit, plus one more to handle all of the "Single Pages". Each of these processes retrieves all of the pages in its base list, storing the text of each HTML page in the html table and its hyperlinks in the refs table. All of the desirable URLs from a page that have not been seen before are placed on an internal "todo" list. After all of the base URLs are processed, the same steps are repeated with the internal todo list. When nothing is left in the todo list, processing is complete.
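
The todo-list loop described above can be sketched roughly as follows. This is only an illustration in Python, not the appliance's actual code: the names walk, fetch, is_desirable and the in-memory SITE are all assumptions made for the example, and the html and refs tables are stood in for by a dict and a list. It also assumes every hyperlink is recorded in refs, whether or not it is later walked.

```python
def walk(base_urls, fetch, is_desirable=lambda url: True):
    """Walk pages in rounds: the base list first, then each round's todo list.

    fetch(url) is a hypothetical callable returning (page_text, links).
    """
    html_table = {}        # url -> page text (stands in for the html table)
    refs_table = []        # (from_url, to_url) pairs (stands in for refs)
    seen = set(base_urls)  # every URL queued so far
    current = list(base_urls)

    while current:                      # stop when the todo list is empty
        todo = []
        for url in current:
            text, links = fetch(url)
            html_table[url] = text
            for link in links:
                refs_table.append((url, link))
                if is_desirable(link) and link not in seen:
                    seen.add(link)
                    todo.append(link)
        current = todo                  # repeat with the todo list
    return html_table, refs_table


# Tiny in-memory "site" used in place of real HTTP fetches.
SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/", "/b"]),
    "/b": ("page b", ["/c", "/skip"]),
    "/c": ("page c", []),
    "/skip": ("unwanted", []),
}

html, refs = walk(["/"], SITE.__getitem__, lambda url: url != "/skip")
print(sorted(html))
```

Keeping a single seen set means a page linked from many places is still fetched only once, which is what lets the loop terminate on sites with circular links.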

Once all of the walking is complete, the indices needed for searching are created on the data. The new database is then flagged as the "live" one and the old database is deleted. Your disk must therefore have sufficient space for two complete databases plus the temporary space used during the indexing step.
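
As a rough worked example of that space requirement, the arithmetic might look like the sketch below. The function names are hypothetical, the sizes are made-up figures, and the sketch assumes the new database ends up about as large as the live one, which need not hold for a real walk.

```python
def space_needed_for_new_walk(live_db_bytes, temp_index_bytes):
    """Extra free space needed while the old database still exists."""
    # Assumption: the new database ends up about as large as the live one.
    new_db_bytes = live_db_bytes
    return new_db_bytes + temp_index_bytes


def peak_usage(live_db_bytes, temp_index_bytes):
    """Peak disk usage: two complete databases plus indexing scratch space."""
    return 2 * live_db_bytes + temp_index_bytes


GIB = 1024 ** 3
# Illustrative figures only: a 10 GiB database and 2 GiB of temporary space.
print(peak_usage(10 * GIB, 2 * GIB) // GIB)  # prints 22
```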

The databases are called db1 and db2. The Search Appliance alternates between using these two names.

Note that the above applies to a walk type of New. During a walk type of Refresh, only one database, the "live" one, is used.
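
The choice of database for each walk type could be expressed as in the following sketch. The function name target_database is hypothetical, and the behavior when no live database exists yet (starting with db1, and rejecting a Refresh) is an assumption, not something this manual states.

```python
def target_database(live_db, walk_type):
    """Return the name of the database a walk writes to.

    live_db is "db1", "db2", or None when no walk has completed yet.
    """
    if walk_type == "Refresh":
        if live_db is None:
            # Assumption: a Refresh cannot run without a live database.
            raise ValueError("Refresh needs an existing live database")
        return live_db                  # reuse the live database in place
    if walk_type == "New":
        # Alternate: build into whichever name is not currently live.
        # Assumption: the very first walk uses db1.
        return "db2" if live_db == "db1" else "db1"
    raise ValueError(f"unknown walk type: {walk_type!r}")


print(target_database("db1", "New"))      # prints db2
print(target_database("db1", "Refresh"))  # prints db1
```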

The Search Appliance also maintains a file containing the detailed report for each walk. This file has the same name as the database with .long appended to the end. Also, a single file called summary is maintained with short summary information about the state of the database.

Given a data directory named .../default, the following may be present:

.../default/db1
    an actual walk database
.../default/db2
    an actual walk database
.../default/db1.long
    detailed walk report. Displayed when viewing Walk Status
.../default/db2.long
    detailed walk report. Displayed when viewing Walk Status
.../default/summary
    summary walk report. Displayed as Walk summary when viewing Walk Settings
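
For scripts that need to locate these files, the naming scheme can be captured in a small helper such as the sketch below. The function name walk_files and the path /data/default are hypothetical, chosen only for illustration; substitute your own data directory.

```python
from pathlib import PurePosixPath


def walk_files(data_dir):
    """Map each expected file name to its path under the data directory."""
    base = PurePosixPath(data_dir)
    files = {}
    for db_name in ("db1", "db2"):
        files[db_name] = base / db_name                      # walk database
        files[db_name + ".long"] = base / (db_name + ".long")  # detailed report
    files["summary"] = base / "summary"                      # summary report
    return files


# "/data/default" is a made-up location, not a documented default.
for name, path in walk_files("/data/default").items():
    print(name, "->", path)
```

Any of these may be absent at a given moment, so a caller should check for existence before reading.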

Each setting has a record in the options table of the default database. See section 5.6 for the list of fields in the table. At each complete rewalk, the current options settings are copied into an options table in the walk database. These copied options are not changed as settings are modified, and they are not otherwise used unless a search is performed that sets the database with db instead of setting the profile with pr.
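
The relationship between the current settings and the copy stored with each walk can be illustrated with the sketch below. It models the two options tables as plain Python dicts; the setting names, the function names, and the shape of the search parameters are all hypothetical, and only the db and pr parameter names come from the text above.

```python
import copy


def snapshot_options(current_options):
    """Return an independent copy of the settings for the walk database."""
    return copy.deepcopy(current_options)


def options_for_search(query_params, profile_opts, walk_db_opts):
    """Pick which options a search sees, based on db versus pr."""
    # A search that sets the database with db uses the copied options;
    # a search that sets the profile with pr uses the current settings.
    return walk_db_opts if "db" in query_params else profile_opts


# Hypothetical setting names, for illustration only.
profile_options = {"walk_type": "New", "categories": ["docs", "news"]}

walk_db_options = snapshot_options(profile_options)  # taken at rewalk time

# Later edits to the profile do not reach the walk database's copy.
profile_options["categories"].append("blog")
profile_options["walk_type"] = "Refresh"

print(walk_db_options["categories"])  # prints ['docs', 'news']
```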


Copyright © Thunderstone Software     Last updated: Nov 8 2024