Categorizer

Thunderstone Texis Categorizer

Overview

Thunderstone's Texis categorizer (also known as a classifier) automatically attaches categories, subject codes, metadata, and the like, to documents or text records.

Categories add value to full-text search systems in several ways:

Sorting: Categories provide keys for sorting or grouping search results.
Menus: They provide a "controlled vocabulary" that users can select from, instead of - or in addition to - trial-and-error searching.
Browsing: They provide a finite set of hyperlinks that may be "navigated" as a means to browse through data in an organized fashion.

Customers generally begin with a "taxonomy," meaning a predetermined set of categories. For each category, one must provide "training" data, consisting of example documents. The system is nonetheless dynamic: authorized users may create new categories as needed for new types of content.

Each automatic category "recommendation" receives a statistical confidence score. Operation may be manual, automatic, or mixed. In manual mode, an operator accepts or rejects each recommendation. In automatic mode, categories are applied without user intervention. In mixed operation, one designates a confidence score threshold: recommendations above the threshold are accepted automatically, and those below it are held for human review.
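The mixed mode described above can be sketched in a few lines of Python. This is only an illustration of the threshold idea, not the categorizer's own interface: the `Recommendation` class, the `route` function, and the assumption that scores fall between 0.0 and 1.0 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Recommendation:
    doc_id: str
    category: str
    confidence: float  # assumed range 0.0 to 1.0 (not stated in the text)


def route(recommendations, threshold):
    """Split recommendations into auto-accepted and held-for-review lists."""
    accepted, held = [], []
    for rec in recommendations:
        # "Above the threshold" is accepted automatically. Holding a score
        # that exactly equals the threshold is an assumption made here.
        if rec.confidence > threshold:
            accepted.append(rec)
        else:
            held.append(rec)
    return accepted, held


recs = [
    Recommendation("doc-1", "energy", 0.92),
    Recommendation("doc-1", "merger", 0.55),
    Recommendation("doc-2", "agriculture", 0.80),
]
accepted, held = route(recs, threshold=0.80)
```

Under these assumptions, a threshold at or above the highest possible score behaves like manual mode (everything is held), and a threshold below the lowest possible score behaves like automatic mode (everything is accepted).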

Accuracy is a function of both the quantity and quality of the examples. Categorization results approved or corrected by an operator are fed back into the training base, so the categorizer's results become more accurate over time.

Hierarchical category schemes are easily accommodated. The categorizer handles most European languages.

Architecture and Integration

The categorizer is an application of Thunderstone's Texis and Texis Web Script products. The Texis underpinnings provide a broad range of "hooks" for using the categorizer and tying it into other computer applications. The system may be controlled through a web interface by authorized individuals from any location. It may be interconnected to a wide range of other information sources or repositories.

Interconnection may be accomplished by standard data interchange mechanisms including FTP; HTTP with or without XML; ODBC; JDBC; or Perl DBI/DBD. A "web services" application development model is supported. A feature-rich C-callable API is also available.

The categorizer uses both the database and the search-engine features of Texis. The documents are managed in a Texis SQL table. The categories are added as updates to records in the table. The Thunderstone relevance ranking algorithms are used to determine "recommended" categories for each document.
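To make the idea of ranking-based recommendations concrete, here is a toy sketch in Python. It is not Thunderstone's relevance ranking algorithm, which this overview does not describe; it substitutes plain cosine similarity over word counts. The `TRAINING` data and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphabetic words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


# Hypothetical training base: category -> example documents.
TRAINING = {
    "energy": [
        "Oil prices climb as crude supply tightens.",
        "Solar and wind power capacity grows.",
    ],
    "agriculture": [
        "Wheat harvest beats forecasts.",
        "Farmers plant more corn this spring.",
    ],
}


def recommend(text, training):
    """Rank categories by the best similarity to any training example."""
    doc = term_counts(text)
    scores = {
        category: max(cosine(doc, term_counts(ex)) for ex in examples)
        for category, examples in training.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


ranked = recommend("Crude oil supply falls while prices rise.", TRAINING)
```

In this sketch, feeding an approved result back into the training base would amount to appending the document text to the list for its category, which is one simple way to picture how operator feedback can improve later recommendations.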

From a database point of view, a category is a value in a column of a SQL table. A categorization scheme may have multiple columns, and a record may contain multiple values in a column. For example, a database of business news may contain a column "industry" (with possible values such as "energy" and "agriculture") and a column "event" (with possible values such as "merger" and "ipo"). In HTML, these column/value pairs are represented by the tagging convention <meta name=... content=... >.
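As an illustration of the meta tag convention, the following Python sketch reads category columns and values out of an HTML page using the standard library's `html.parser`. The sample page is invented, and representing multiple values in one column by repeating the tag is an assumption, not something this overview specifies.

```python
from collections import defaultdict
from html.parser import HTMLParser


class MetaCategoryParser(HTMLParser):
    """Collect <meta name=... content=...> pairs as column -> list of values."""

    def __init__(self):
        super().__init__()
        self.categories = defaultdict(list)

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name, content = attr.get("name"), attr.get("content")
        if name and content:
            self.categories[name].append(content)


# Hypothetical document; repeated tags model multiple values in one column.
html_doc = """
<html><head>
<meta name="industry" content="energy">
<meta name="industry" content="agriculture">
<meta name="event" content="merger">
</head><body>...</body></html>
"""

parser = MetaCategoryParser()
parser.feed(html_doc)
categories = dict(parser.categories)
```

Here `categories` maps each column name ("industry", "event") to the list of values found for it, mirroring the column-and-values view of a record.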

Categories can be used in conjunction with the Texis thesaurus to enhance text searching. For example, if your data contain documents labeled with a "finance" category, you might designate "banking" as a synonym of "finance". You could then set up a text-search application in which a free-text query on "banking" automatically returns records in the finance category, even if they don't mention banking.
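The synonym behavior can be pictured with a small Python sketch. It does not use the Texis thesaurus or its query syntax; the `SYNONYM_TO_CATEGORY` table, the sample records, and the `search` function are hypothetical stand-ins for the idea.

```python
# Hypothetical synonym table: free-text term -> category it should also match.
SYNONYM_TO_CATEGORY = {"banking": "finance"}

# Hypothetical records, each with free text and a multi-valued category column.
RECORDS = [
    {"id": 1, "text": "Central bank raises interest rates.", "category": ["finance"]},
    {"id": 2, "text": "New banking rules announced.", "category": ["regulation"]},
    {"id": 3, "text": "Wheat harvest beats forecasts.", "category": ["agriculture"]},
]


def search(term, records):
    """Return ids of records matching the term in text or via its category synonym."""
    term = term.lower()
    category = SYNONYM_TO_CATEGORY.get(term)
    hits = []
    for rec in records:
        in_text = term in rec["text"].lower()
        in_category = category is not None and category in rec["category"]
        if in_text or in_category:
            hits.append(rec["id"])
    return hits


banking_hits = search("banking", RECORDS)
```

Record 1 never mentions "banking" but is returned because it carries the finance category; record 2 is returned because the word appears in its text.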

The categorizer is highly scalable. One typical server machine can classify tens of thousands of documents daily. It may be set up to perform categorization on new documents as soon as they are available (real-time operation).