THUNDERSTONE NEWS

Summer 2004 - Archive

Texis Version 5.0

Texis version 5.0 contains many improvements. Most of the new features are in the crawler utilities (the "dowalk" and "search" scripts) that play a core role in both the Webinator and Search Appliance products. The scripts are also provided for optional use or modification with full Texis distributions. Separately, Texis and Vortex themselves gain a variety of optimizations and extensions.

All products have these enhancements:

  • Adaptive Indexing. Much faster refresh crawls, with reduced server load.
  • Spelling checker. Suggests alternate searches, automatically customized to your site.
  • Pause/resume walks. Stop an in-progress walk, resume it later.
  • Unicode/UTF-8 support. Also international characters.
  • Resource limits. Pauses walk if memory usage etc. exceeds settable limits.
  • Multiple user/passwords. Different base URLs with different logins, for multiple users.
  • Exclude-by field. Flexible exclusion based on text, meta-data, keywords etc.

Adaptive Indexing is especially significant. It keeps site and enterprise indexes fresher than ever before. The crawler revisits each page or document according to a separate schedule -- the more often the page changes, the more frequently it visits. The software estimates how often each page or document needs to be re-crawled based on its change history. Pages may be reindexed as often as every minute!
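
To make the idea concrete, here is a minimal Python sketch of one way a crawler could turn a page's change history into a revisit interval. It is not Thunderstone's actual algorithm, which this newsletter does not describe. The halve-or-double rule, the 30-day ceiling, the one-day starting interval, and the function names are all assumptions made for illustration. Only the one-minute floor comes from the text above.

```python
from datetime import timedelta

# Floor taken from the article: pages may be reindexed as often as every minute.
MIN_INTERVAL = timedelta(minutes=1)
# Hypothetical ceiling, chosen only for this illustration.
MAX_INTERVAL = timedelta(days=30)


def next_interval(current, changed):
    """Halve the interval if the page changed since the last visit, else double it."""
    proposed = current / 2 if changed else current * 2
    # Clamp the result to the allowed range.
    return max(MIN_INTERVAL, min(MAX_INTERVAL, proposed))


def estimate_interval(history, start=timedelta(days=1)):
    """Replay a change history (True = page had changed) to get the current interval."""
    interval = start
    for changed in history:
        interval = next_interval(interval, changed)
    return interval
```

With these assumed rules, a page that changed on each of its last three visits would drop from a daily revisit to one every three hours, while a page that never changes would drift toward the ceiling.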

Note that the spelling-checker lexicon is not based on a dictionary. Rather, it is made up of the words contained in each database, i.e., in the index of a document or web page collection (called a profile in the crawler admin module). If you search for a word not in the index, the spell checker suggests the closest matches existing in the index. For this reason the spelling checker works automatically with site-specific terminology, or even in languages other than English!
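
The following Python sketch illustrates the principle of an index-derived lexicon. It is not the Texis implementation: it uses the standard library's difflib as a stand-in for whatever matching Texis performs, and the function names, the similarity cutoff, and the sample documents are hypothetical.

```python
import difflib


def build_lexicon(documents):
    """Collect every distinct word that appears in the indexed documents."""
    words = set()
    for text in documents:
        for token in text.lower().split():
            token = token.strip(".,;:!?\"'()")
            if token:
                words.add(token)
    return words


def suggest(word, lexicon, limit=3):
    """Return the closest indexed words, or nothing if the word is already indexed."""
    word = word.lower()
    if word in lexicon:
        return []
    # The cutoff of 0.7 is an arbitrary choice for this illustration.
    return difflib.get_close_matches(word, sorted(lexicon), n=limit, cutoff=0.7)


# Hypothetical sample collection.
docs = ["Webinator crawls the site", "Texis indexes documents"]
lexicon = build_lexicon(docs)
print(suggest("webinater", lexicon))
```

Because the lexicon is built from the collection itself, a product name such as "webinator" is suggested even though no ordinary dictionary would contain it.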

Texis maintenance customers will receive version 5 automatically according to the semi-annual update schedule, but can request a release sooner if needed. Webinator upgrade pricing is available at the Webinator product page. Search Appliance customers should use the Maintenance - Check for Updates feature periodically to be kept up to date.


Search Appliance Update Note

After updating the Search Appliance to version 5, any scheduled refresh crawls will need to be launched manually once to do a complete crawl, and will then revert to the regularly scheduled refresh. Otherwise a scheduled refresh may do a complete new crawl unexpectedly.

Note that if you have selected "Automatically Discover, Download, and Install Updates" on the Maintenance - Update Preferences page, your appliance should be updating itself to v5 on Monday, August 23. Please follow the procedure above on that day.


"Best Bets" Feature

Customers sometimes ask how to make certain records "always come to the top" in search results. Thunderstone's new Best Bets feature makes that easy to accomplish.

Best Bets is analogous to the "sponsored listings" at the top of search results on Yahoo and its competitors. You may be familiar with how sponsored listings work: the advertiser stores a list of phrases, each associated with a web page. When one of those phrases matches a user's query, the corresponding page link is displayed first, or set apart in a special area.

If you administer a Thunderstone search application for your own organization or web site, Best Bets allows you to become an "advertiser" on your own site. (And without paying advertising fees!)

This tends to come in handy for larger collections. Suppose you determine, for example, that users should see a certain document first whenever they search for the term "repair service". However, many documents in your collection may discuss repair service, and thus match such a query. Just go to the Best Bet settings page, and store that phrase along with the desired document link. Then you are assured that it will appear as the top search result. You also can set highlighting color, page placement, and various other characteristics to make it stand out.
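
Conceptually, the feature behaves like a phrase-to-link lookup applied before the ordinary results are shown. The Python sketch below shows that idea only. It does not reflect how Best Bets is implemented or configured in Thunderstone products, and the function names and document paths are made up for illustration.

```python
def normalize(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def apply_best_bets(query, results, best_bets):
    """Move any link whose stored phrase occurs in the query to the top of the results."""
    padded_query = f" {normalize(query)} "
    promoted = [
        link
        for phrase, link in best_bets.items()
        # Padding with spaces makes the phrase match on whole words only.
        if f" {normalize(phrase)} " in padded_query
    ]
    remaining = [link for link in results if link not in promoted]
    return promoted + remaining


# Hypothetical settings and search results.
best_bets = {"repair service": "/docs/repair-center.html"}
results = ["/docs/a.html", "/docs/repair-center.html", "/docs/b.html"]
print(apply_best_bets("Repair  Service hours", results, best_bets))
```

Queries that contain no stored phrase pass through unchanged, so ordinary ranking is affected only where an administrator has asked for it.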

Best Bets is also included in the Search Appliance and with full Texis distributions, and is available as an extra-cost option with some Webinator versions.


Tech Corner: Grouped Search Results

If you're a Texis developer, how often do you use the SQL GROUP BY clause? It is a very useful feature in text searches, but it tends to be neglected!

If you have developed SQL applications in the past, you probably used GROUP BY in connection with numeric data. But GROUP BY can be very helpful in many cases even with no numeric values. The typical case we see involves tables containing a text field along with one or more category or "metadata" fields.

For example, a customer of ours that operates an industrial laboratory has a database of research reports. At first their Texis Web Script search application used this simple form:

SELECT Title, Author, Date FROM Reports
WHERE Text LIKE $query

However, when hundreds or even thousands of document titles are returned, users need a more refined query to narrow the search. Our customer observed that each researcher tends to create reports that are similar or are about related topics. That made it useful to first see a list of which researchers had published anything about that query. They accomplish it with this:

SELECT Author, COUNT(Author) FROM Reports 
WHERE Text LIKE $query GROUP BY Author

The results would be formatted to look something like this:

Author               count(Author)
Anderson, William    12
Johnson, Mary        22
Smith, Robert        8

Where a query matches hundreds of documents or more, this grouped query tends to return a much more manageable list. When the user clicks on an author's link, the application displays the records by that author that contain the search term:

SELECT Title, Date FROM Reports
WHERE Text LIKE $query AND Author = $Author

The user can examine those documents and, if they are similar, go back to see what the next author published.
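
For readers who want to try the two-step pattern without a Texis installation, here is a small Python sketch using the standard sqlite3 module. Note the assumptions: SQLite's LIKE is a simple substring pattern match, not Texis's full-text LIKE, so the query is wrapped in % wildcards. The sample rows, the function names, and the ORDER BY clause are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Reports (Title TEXT, Author TEXT, Date TEXT, Text TEXT)")
# Invented sample data, not taken from any real collection.
conn.executemany(
    "INSERT INTO Reports VALUES (?, ?, ?, ?)",
    [
        ("Pipe Coatings", "Anderson, William", "2004-01-12", "corrosion of steel pipe"),
        ("Salt Spray Tests", "Anderson, William", "2004-03-02", "corrosion under salt spray"),
        ("Weld Fatigue", "Johnson, Mary", "2004-02-20", "fatigue and corrosion in welds"),
        ("Polymer Aging", "Smith, Robert", "2004-04-05", "UV aging of polymers"),
    ],
)


def authors_for(query):
    """Step 1: list each author with the number of matching reports."""
    return conn.execute(
        "SELECT Author, COUNT(Author) FROM Reports "
        "WHERE Text LIKE ? GROUP BY Author ORDER BY Author",
        (f"%{query}%",),  # substring match standing in for Texis full-text LIKE
    ).fetchall()


def reports_by(query, author):
    """Step 2: list the matching reports for the author the user clicked."""
    return conn.execute(
        "SELECT Title, Date FROM Reports "
        "WHERE Text LIKE ? AND Author = ? ORDER BY Date",
        (f"%{query}%", author),
    ).fetchall()


print(authors_for("corrosion"))
print(reports_by("corrosion", "Anderson, William"))
```

The first call returns one row per author, and the second returns only the reports behind the author the user selected, mirroring the two queries shown above.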


Feedback, suggestions and questions are welcome to
Copyright © 2024 Thunderstone Software LLC. All rights reserved.