Page Exclusion, Robots.txt, and Meta-robots

On the first access to a site, the Search Appliance retrieves the file /robots.txt, if it exists, and respects the settings there. Any encountered URL that is disallowed by robots.txt is discarded. Meta robots directives are also respected for each page retrieved. See https://www.robotstxt.org for the robots.txt and meta robots standards.

If there are any HTML trees that you don't want indexed, you can set up a robots.txt file, add meta robots tags to the HTML pages, or use the Search Appliance's various exclusion options. For example, if you had a "text only" version of your web server that duplicated the content of your normal server, you would not want to index it. (On the other hand, if most of your meaningful text is contained in graphics, Java, or JavaScript, you may want to walk the text tree instead of the normal one, since graphics and Java are not searchable.)

Suppose your "text only" pages were all under a directory called /text. The simplest way to prevent traversal of that tree would be to use an exclusion or an exclusion prefix.

The exclusion would look something like this:

/text/

The exclusion prefix would look something like this:

http://www.example.com/text/

Either setting will prevent the Search Appliance from retrieving any pages under the /text tree. It does not prevent other Web robots from retrieving the /text tree. To set up a permanent global exclusion list, create a file called robots.txt in your document root directory. The format of that file is as follows:

User-agent: *
Disallow: /text

The value on the "User-agent" line is the name of the robot to block. "*" means any robot not specifically named (all robots in this case, since no others are named). Alternatively, you can specify the name of a robot. For the Search Appliance it is ThunderstoneSA. You may specify several "Disallow" lines for any given robot (see below). Each "Disallow" value is a simple path prefix and may not contain wildcards.

You may also specify different "Disallow" sets for different robots. Simply insert a blank line and add another "User-agent" line followed by its "Disallow" lines.

Here's a larger example:

User-agent: *
Disallow: /text
Disallow: /junk

User-agent: ThunderstoneSA
Disallow: /text
Disallow: /thunderstonesa

User-agent: Scooter
Disallow: /text
Disallow: /junk
Disallow: /big

The Scooter robot will be blocked from accessing any pages under the /text, /junk, and /big trees. The Search Appliance will be blocked from accessing any pages under /text and /thunderstonesa. All other robots will be blocked from accessing pages under /text and /junk.
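
As a rough check of the example above, the following Python sketch feeds the same rules to the standard library's urllib.robotparser module and asks which paths each robot may fetch. It shows how a generic robots.txt parser reads the file, not the Search Appliance's own implementation. The robot name OtherBot, the helper names, and the page paths are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# The larger robots.txt example from this section.
ROBOTS_TXT = """\
User-agent: *
Disallow: /text
Disallow: /junk

User-agent: ThunderstoneSA
Disallow: /text
Disallow: /thunderstonesa

User-agent: Scooter
Disallow: /text
Disallow: /junk
Disallow: /big
"""


def build_parser(text):
    """Parse robots.txt content given as a string."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(parser, robot, path):
    """Return True if the named robot may fetch the given path."""
    return parser.can_fetch(robot, "http://www.example.com" + path)


if __name__ == "__main__":
    parser = build_parser(ROBOTS_TXT)
    # OtherBot is a made-up name that falls under the "*" group.
    for robot in ("ThunderstoneSA", "Scooter", "OtherBot"):
        for path in ("/text/a.html", "/junk/a.html", "/big/a.html"):
            print(robot, path, allowed(parser, robot, path))
```

A robot that has its own "User-agent" group is checked against that group only, which is why ThunderstoneSA may fetch /junk here even though the "*" group disallows it.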

Use of robots.txt is not enforced in any way. Robots may or may not use it. The Search Appliance will, by default, always look for it and use it if present. This may be disabled by turning off robots.txt under the Robots setting. When using robots.txt you may still use "Exclusions" for manual exclusion.

Meta robots provides another method of controlling robots such as the Search Appliance. Any HTML page may contain a meta tag of the following form in its source:

<meta name="robots" content="WHAT-TO-DO">

WHAT-TO-DO may contain any of the following keywords. Multiple keywords may be used by placing a comma (,) between them.

Keyword     Meaning
INDEX       Index the text of this page
NOINDEX     Don't index the text of this page
FOLLOW      Follow hyperlinks on this page
NOFOLLOW    Don't follow hyperlinks on this page
ALL         Synonym for INDEX,FOLLOW
NONE        Synonym for NOINDEX,NOFOLLOW

Table 4.1: Meta-Robots Flags

Like robots.txt, this is not enforced in any way. Robots may or may not use it. The Search Appliance always indexes and follows hyperlinks by default, so it looks only for NOINDEX, NOFOLLOW, and NONE.
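
The index-and-follow default can be sketched in Python with the standard library's html.parser module. This is a minimal illustration of how a robot might read the tag, not the Search Appliance's actual code. The names MetaRobotsParser and meta_robots_flags are hypothetical, and treating keywords as case-insensitive is an assumption.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collect index/follow flags from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        # Default: index the page and follow its links.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        # Keywords are comma-separated; case-insensitivity is assumed.
        content = attr.get("content") or ""
        keywords = {k.strip().upper() for k in content.split(",")}
        # Only the negative keywords change the defaults.
        if keywords & {"NOINDEX", "NONE"}:
            self.index = False
        if keywords & {"NOFOLLOW", "NONE"}:
            self.follow = False


def meta_robots_flags(html):
    """Return (index, follow) for the given HTML source."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.index, parser.follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="NOINDEX,FOLLOW"></head></html>'
    print(meta_robots_flags(page))
```

Because the defaults are already "index" and "follow", the sketch never needs to act on INDEX, FOLLOW, or ALL.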


Copyright © Thunderstone Software     Last updated: Nov 8 2024