The crawler provides many ways of controlling what you do and don't crawl. Note that URLs manually specified by you (Base URLs, URL URLs, Single Pages, etc.) are exempt from all inclusion/exclusion rules — they will always be used.
Exclusion is the shotgun approach: any URL that contains the text listed on an exclusion line, anywhere in the URL, will not be included in the walk. The text doesn't need to be a full path or filename; sub-matches are okay.
If you specify "archive" as an exclusion, then "http://www.example.com/archive/index.htm" will be excluded and "http://www.example.com/site/newsarchivefrom2004" will also be excluded.
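To make the sub-match behavior concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the crawler's actual code: the function name is_excluded is made up for this example, the third URL is hypothetical, and the sketch assumes a plain, case-sensitive substring test.

```python
def is_excluded(url, exclusions):
    """Return True if any exclusion text occurs anywhere in the URL.

    Assumption for this sketch: a plain, case-sensitive substring test.
    """
    return any(text in url for text in exclusions)


exclusions = ["archive"]
urls = [
    "http://www.example.com/archive/index.htm",
    "http://www.example.com/site/newsarchivefrom2004",
    "http://www.example.com/products/index.htm",  # hypothetical URL with no match
]
for url in urls:
    status = "excluded" if is_excluded(url, exclusions) else "allowed"
    print(url, "->", status)
```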
Exclusion Prefix works like Exclusion, except that the URL has to match the text starting from the beginning. This gives a bit more control over what exactly matches.
If you specify "http://www.example.com/archive" as an exclusion prefix, then "http://www.example.com/archive/index.htm" will be excluded and "http://www.example.com/archivePages.htm" will be excluded, but "http://www.example.com/site/newsarchivefrom2004" will be allowed.
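The prefix rule can be sketched the same way. This is a rough Python illustration under the assumption that the comparison is a simple case-sensitive "starts with" test; the function name is hypothetical and not part of the product.

```python
def is_excluded_by_prefix(url, exclusion_prefixes):
    """Return True if the URL begins with any of the exclusion prefixes."""
    # str.startswith accepts a tuple and returns True if any entry matches.
    return url.startswith(tuple(exclusion_prefixes))


prefixes = ["http://www.example.com/archive"]
urls = [
    "http://www.example.com/archive/index.htm",
    "http://www.example.com/archivePages.htm",
    "http://www.example.com/site/newsarchivefrom2004",
]
for url in urls:
    status = "excluded" if is_excluded_by_prefix(url, prefixes) else "allowed"
    print(url, "->", status)
```

Note that the prefix is compared character by character, not path segment by path segment, which is why the "archivePages" URL is caught as well.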
Required Prefix is the opposite of Exclusion Prefix: instead of rejecting URLs that DO match the prefix, it rejects URLs that DON'T match it. Both settings are used for weeding out URLs; they just swap which URLs are used and which aren't. Multiple Required Prefixes can be specified, and URLs are allowed if they match at least one.
If you specify "http://www.example.com/archive" as a required prefix, then "http://www.example.com/archive/index.htm" will be used and "http://www.example.com/archivePages.htm" will be used, but "http://www.example.com/site/newsarchivefrom2004" will be excluded.
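Here is a comparable sketch for required prefixes, including the "at least one" rule when several are given. As before, this is an assumption-laden illustration in Python: the function name and the second prefix ("http://www.example.com/docs") are invented for the example.

```python
def passes_required_prefixes(url, required_prefixes):
    """Return True if the URL may be used under the required-prefix rule.

    With no required prefixes configured, every URL passes. Otherwise the
    URL must begin with at least one of them.
    """
    if not required_prefixes:
        return True
    return url.startswith(tuple(required_prefixes))


required = [
    "http://www.example.com/archive",
    "http://www.example.com/docs",  # hypothetical second prefix
]
urls = [
    "http://www.example.com/archive/index.htm",
    "http://www.example.com/docs/manual.htm",
    "http://www.example.com/site/newsarchivefrom2004",
]
for url in urls:
    status = "used" if passes_required_prefixes(url, required) else "excluded"
    print(url, "->", status)
```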
The REX versions of these settings work on similar ideas to Exclusion Prefix and Required Prefix, except you use our powerful REX pattern matcher to specify what should match instead of just a prefix. It's similar to regular expressions but much faster. Please see the REX pages in the Vortex manual on our website (http://www.thunderstone.com/site/vortexman/rex_split.html) for more details on the exact syntax.
If you specify both requirements and exclusions, then URLs must satisfy both to be used — they must not match any Exclusion Prefix, AND they must match at least one Required Prefix (if specified).
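Putting the rules together, the overall decision might be sketched as follows. This is only a rough model in Python, not the crawler's implementation: the function and parameter names are hypothetical, REX matching is left out, and the order of the checks is an assumption (only the outcome matters here). The first check reflects the rule that manually specified URLs are always used.

```python
def should_crawl(url, *, manual_urls=(), exclusions=(),
                 exclusion_prefixes=(), required_prefixes=()):
    """Decide whether a URL is used, combining the rules described above."""
    # Manually specified URLs are exempt from all inclusion/exclusion rules.
    if url in manual_urls:
        return True
    # Exclusion: reject if the text appears anywhere in the URL.
    if any(text in url for text in exclusions):
        return False
    # Exclusion Prefix: reject if the URL begins with any listed prefix.
    if url.startswith(tuple(exclusion_prefixes)):
        return False
    # Required Prefix: if any are specified, the URL must begin with one.
    if required_prefixes and not url.startswith(tuple(required_prefixes)):
        return False
    return True


rules = dict(
    exclusion_prefixes=["http://www.example.com/archive/2004"],  # hypothetical
    required_prefixes=["http://www.example.com/archive"],
)
for url in [
    "http://www.example.com/archive/index.htm",
    "http://www.example.com/archive/2004/old.htm",
    "http://www.example.com/site/news.htm",
]:
    print(url, "->", "used" if should_crawl(url, **rules) else "excluded")
```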
There's an even more powerful way to exclude pages with Exclude by Field, but that's for another Tech Tips article. (Watch for it in next month's newsletter.)
If you have questions about how any of these operate, feel free to contact Thunderstone Support.
Steve Kolowich, a reporter for THE CHRONICLE OF HIGHER EDUCATION, noted what he referred to as Thunderstone's determined efforts at "Out-Googling Google" in his article entitled In Search of a Better Search Engine (http://chronicle.com/free/v55/i24/24a01501.htm) for the February 20, 2009 issue of The Chronicle. He wrote, in part:
The Virginia Bioinformatics Institute at Virginia Tech, facing a thickening swamp of digital documents, opted for Thunderstone's search appliance, which starts at $13,000, about six months ago. The institute uses the device to index reams of unpublished data and notes stored on its intranet. James E. Stoll, who leads Internet projects at the institute, said the appliance allowed research collaborators and other authorized users to retrieve items from across the institute's network of repositories without exposing those documents to the public Web, as basic site-search software would require. Researchers "don't want to be scooped," Mr. Stoll said. "This is their livelihood."
Thunderstone's Fred Harmon (Channel Director and CSO) and Peter Thusat (Communication Director and CMO) will participate as exhibitors (Thunderstone Software Booth: 1045) during the AIIM International Exposition and Conference March 30 - April 2, 2009 at the Pennsylvania Convention Center in Philadelphia, PA.
Conference: March 30 - April 2, 2009
Exhibits: March 31 - April 2, 2009
REGISTER TODAY FOR YOUR FREE EXPO FLOOR PASS
and get access to all keynotes, general sessions,
Expo floor education and the ON DEMAND Expo!
To receive your free pass, use Registration Code: 615M
when you register at WWW.AIIMEXPO.COM
or call +1 888 824 3004.
Your FREE pass comes to you compliments of Thunderstone
Software. Please stop by and visit us at Booth 1045.
The Center for Russian & East European Studies, a sub-unit of the larger University Center for International Studies (UCIS) at the University of Pittsburgh, won a competition a number of years ago to create the Vladimir I. Toumanoff Virtual Library — a collection that includes searchable online documents from many top U.S. researchers and analysts who write about politics, history, sociology, economics and foreign policy related to the states of the former Soviet Union and Central and Eastern Europe. Thunderstone's Webinator indexing and retrieval software enabled the responsible Informatics team to accomplish this goal in an efficient and affordable manner.
The University Center for International Studies (UCIS) provides the organizational framework that supports the University of Pittsburgh's mission to integrate and reinforce all its strands of international scholarship in research, teaching and public service. UCIS includes, in addition to many other highly acclaimed programs and component units, a Center for Russian & East European Studies, an Asian Studies Center, a Center for Latin American Studies, a European Studies Center, an International Business Center (jointly sponsored with the Katz School of Business) and a European Union Center of Excellence (funded by the European Union).
As a thin layer on top of the whole UCIS structure, Central Administration handles all business-related core functions and technology issues. When individuals in any of the sub-units need advice or consulting related to I.T. Services, Knowledge Management, database planning, upgrading of their websites or anything else that would fall into technology-mediated information, they call upon Mark J. Weixel, Director of Informatics at UCIS.
Weixel recalled, "Back in I guess it was '98, I found out about Webinator from a friend of mine who was at Princeton at the time. We had a particular niche here in International Studies, and we wanted to create mini search engines for web content that was specific to certain world regions. We were hoping to create search engines like AltaVista, since Google wasn't even around then, that would allow people to do full-text searching of those websites. But, because we were vetting the list of sites, we thought we could increase the probability that searchers would come across something really relevant to the part of the world we were focusing on.
"We used Webinator to index and search collections of websites that were in and dealt with Russia and Eastern Europe.
"So, that was my original introduction to Webinator. We bought the entry-level product to begin with, and we currently have the Enterprise version. What I really like about it, still, is the fact that it's relatively easy to configure. It's much easier to configure than it was back when we bought the original product, when everything was run through command lines. I like the notion of relevance in terms of returned hits. It seems to make a lot more sense to me than, for example, Google page ranking, which places a much higher priority on popularity than it does on the actual content of the pages where text matches.
"Another thing that has been nice is the fact there is support for synonym matching within the server. And I think Vortex as a scripting language is very powerful. Even though I haven't used it to its fullest ability, it's proven to be quite flexible when we've needed to make modifications."
Download the 3-page UCIS case study PDF here.
Feedback, suggestions and questions are welcome. Send your email to