THUNDERSTONE NEWS

Fall 2003 - Archive


New Asian Language Partnership

When creating applications that deal with Asian languages such as Chinese or Japanese, there is more to consider than just being able to store and search Unicode. Other issues include optimizing the search syntax, the user interface, and even the marketing -- in other words, globalization of the entire application.

Thunderstone has partnered with a leading company providing localization and globalization services. The partner has valuable experience in integrating Texis. If you need to use Texis in Asian language applications, we encourage you to contact us to learn more.


AOL Selects Texis for Classifieds

Thunderstone is proud that America Online has selected Thunderstone's Texis software to power a new online classified advertising system. The Texis-based service already is used for storing and searching the Monster job listings on the AOL service. Other classified categories are being market tested.

Texis was selected for its efficient and versatile handling of complex search criteria. Each potential classified category has a different structure -- for example, "Real Estate" needs fields for ZIP code and number of bedrooms. Texis allows developers to easily apply either text-search or database-query logic, or both together, to any set of information. Texis's real-time updating and SQL-oriented application development tools also influenced the choice.


File Format Filter Enhancements

The File Format Plug-in (anytotx) handling of Microsoft Office documents has been updated in a variety of ways. For example, it now extracts document titles and other meta information, and it does a better job of finding text in non-Word files (PowerPoint, Excel, etc.).

By the way, we often are asked whether our products can index this-or-that unusual file format. For example, customers recently asked about indexing engineering drawings and news photographs. The answer is very often yes, even though we may never have dealt with the particular format. That's because the anytotx program is designed to examine even file types it doesn't know, and select out any ASCII text it finds, skipping over the non-ASCII (binary) data. Comments, photo captions, etc. are commonly stored in ASCII within these types of files.
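To illustrate the general idea of pulling ASCII text out of an unknown binary file, here is a minimal Python sketch. It is not the anytotx implementation; the function name, the sample bytes, and the four-character minimum run length are all assumptions made for the example.

```python
import re

def extract_ascii_text(data: bytes, min_len: int = 4) -> list[str]:
    """Return every run of printable ASCII that is at least min_len bytes long.

    Hypothetical helper for illustration only; min_len is an assumed
    threshold for discarding short runs that are probably binary noise.
    """
    # Bytes 0x20 through 0x7e are the printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]

# Made-up bytes standing in for a photo file with an embedded caption.
sample = (
    b"\x00\x01\xff\xd8Caption: Mayor opens bridge"
    b"\x00\x9c\x02ab\x00Photo by J. Smith\xfe"
)

for text in extract_ascii_text(sample):
    print(text)
```

The two-character run "ab" is skipped because it falls below the minimum length, while the caption and the credit line are kept.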

Even if the pertinent information is not ASCII, another enhancement may be of help. We have made it easier to add an external filter to anytotx. This can be a third-party program not necessarily designed for use with Texis or Webinator. For more information about how to accomplish this, please ask Thunderstone Tech Support.

The anytotx enhancements are part of the Webinator and Texis distributions beginning with version 4.3.8. Texis maintenance customers or those with Webinator paid versions 4.0+ may request an update that includes the new plug-in from Tech Support. Other customers may obtain the new plug-in by upgrading Webinator or joining Texis Maintenance. Search appliance customers should go to the "Maintenance" menu on the admin page and select "Check for updates."


Thunderstone Wins Award

We're happy to mention that Thunderstone's Webinator won the Cleveland Area Knowledge Industry Best Software Product award. By the way, our headquarters city of Cleveland, Ohio, USA, and the nearby area are becoming something of a high-tech hotbed, so we feel way ahead of the times in setting up shop here 22 years ago! As in other tech hot spots, first-rate research universities seed the region with talent. (If you're a programmer with a computer science degree, seeking a position in Northeast Ohio, please contact us!)


Tech Corner: Date Sorting Anomalies in Web Site Searching

Although Thunderstone's Webinator search script offers users a "sort by date" option, its sorting is only as good as the dates it gets from each web site it indexes. We see many sites where the dates are spurious, resulting in incorrect search result ordering.

In the web walker script (dowalk), the date stored for a web page is the one returned by the visited web server in the HTTP Last-Modified header. The visited web server in turn usually gets that information from the date stamp of the HTML file, as reported by its file system.
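As a rough illustration of where that value comes from, the Python sketch below formats a file's modification time the way a typical web server would for a Last-Modified header. The function name and the temporary file are assumptions for the example; real servers differ in detail.

```python
import os
import tempfile
from email.utils import formatdate

def last_modified_header(path: str) -> str:
    """Format a file's modification time as an HTTP date string.

    Hypothetical helper: it mirrors what many web servers do, which is
    to read the date stamp from the file system and report it in GMT.
    """
    mtime = os.stat(path).st_mtime
    return formatdate(mtime, usegmt=True)

# Demonstrate with a throwaway file standing in for a page on the server.
with tempfile.NamedTemporaryFile(suffix=".html", delete=False) as tmp:
    tmp.write(b"<html><body>Example page</body></html>")
    page_path = tmp.name

print("Last-Modified:", last_modified_header(page_path))
os.remove(page_path)
```

Because the header is derived from the file system, anything that changes the file's date stamp changes what the walker records, whether or not the content changed.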

But there are a variety of ways that file system date stamps can be misleading. For example, when you copy files to your web server by FTP, the operating system probably will assign the current date to each file. If your intention was just to copy an older file, you lose the original date stamp, and the page will be reported as new instead.

We have seen news sites where almost every page contained an identical Last-Modified date, although the content of those pages consisted of news articles spanning months or years, which had not changed since they were published. Over time, the files were moved from machine to machine, and the original document dates were lost.

Incorrect dates also are common on dynamically generated pages. Such pages usually are based on content extracted from a database. The database might hold correct dates, but if the programmer for that site neglected to incorporate logic for using them, a default value gets assigned instead.

If your web pages are dated correctly, pay attention to keeping them that way! If you need to move or copy files, there are procedures that will preserve the date, such as backup and restore.
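The difference between a copy that loses the date and one that keeps it can be seen in a short Python sketch. This is only one illustration of a date-preserving procedure, using the standard library's shutil.copy and shutil.copy2; the file names and the October 2003 timestamp are made up for the example.

```python
import os
import shutil
import tempfile

# Assumed example timestamp: 1 October 2003, 00:00:00 UTC.
ORIGINAL_MTIME = 1064966400

workdir = tempfile.mkdtemp()
source = os.path.join(workdir, "article.html")
with open(source, "w") as f:
    f.write("<html><body>Old news article</body></html>")

# Give the source file an old date stamp, as if it were published long ago.
os.utime(source, (ORIGINAL_MTIME, ORIGINAL_MTIME))

# shutil.copy copies the content only, so the copy is stamped with the current time.
plain_copy = shutil.copy(source, os.path.join(workdir, "plain_copy.html"))

# shutil.copy2 also tries to copy file metadata, including the modification time.
dated_copy = shutil.copy2(source, os.path.join(workdir, "dated_copy.html"))

plain_mtime = int(os.stat(plain_copy).st_mtime)
dated_mtime = int(os.stat(dated_copy).st_mtime)

print("plain copy keeps original date:", plain_mtime == ORIGINAL_MTIME)
print("copy2 keeps original date:     ", dated_mtime == ORIGINAL_MTIME)

shutil.rmtree(workdir)
```

Most copy and transfer tools have an equivalent choice, so it is worth checking which behavior yours uses before moving a site's files.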

If the web pages or documents you want to index already have incorrect dates, you have two options for fixing them. If correct dates are available within the documents themselves (perhaps as a meta value), you can extract those and substitute them for the HTTP default value during a walk. That requires customizing the dowalk script. With full Texis, you also have the option to directly change the dates stored in the Webinator database, either manually or by a new script containing whatever logic you want to apply.
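The first option, preferring a date found inside the document over the HTTP value, might look something like the Python sketch below. This is not dowalk code (dowalk is a Webinator script with its own language); the meta tag name "date", the YYYY-MM-DD format, and all function and class names are assumptions for illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from html.parser import HTMLParser

class MetaDateParser(HTMLParser):
    """Collect the content of the first <meta name="date"> tag (assumed name)."""

    def __init__(self, meta_name: str = "date"):
        super().__init__()
        self.meta_name = meta_name
        self.content = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.content is not None:
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == self.meta_name:
            self.content = attr.get("content")

def document_date(html: str, last_modified: str) -> datetime:
    """Prefer the date in the page's meta tag; fall back to Last-Modified."""
    parser = MetaDateParser()
    parser.feed(html)
    if parser.content:
        try:
            # Assumed format: the site writes meta dates as YYYY-MM-DD.
            parsed = datetime.strptime(parser.content.strip(), "%Y-%m-%d")
            return parsed.replace(tzinfo=timezone.utc)
        except ValueError:
            pass  # Unusable meta value, so use the HTTP header instead.
    return parsedate_to_datetime(last_modified)

page = '<html><head><meta name="date" content="2001-05-17"></head></html>'
header = "Wed, 01 Oct 2003 00:00:00 GMT"
print(document_date(page, header).date())
```

A site whose dates live somewhere else, such as a byline in the body text, would need its own extraction rule in place of the meta tag lookup.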


Feedback, suggestions and questions are welcome to
Copyright © 2024 Thunderstone Software LLC. All rights reserved.