On the first access to a site the file /robots.txt will be retrieved, if it exists. Settings there will be respected. Any encountered URL that is disallowed by robots.txt will be discarded. Meta robots is also respected for each page retrieved. See http://www.robotstxt.org/wc/exclusion.html for the robots.txt and meta robots standards.
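The fetch-once-per-site behavior described above can be sketched roughly as follows. This is an illustration only, not Webinator's implementation: the `RobotsCache` class, its `fetch_text` callback, and the example URLs are hypothetical names introduced here, and the parsing is delegated to Python's standard `urllib.robotparser`.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsCache:
    """Hypothetical helper: loads /robots.txt once per site and filters URLs."""

    def __init__(self, fetch_text, user_agent):
        # fetch_text(url) is an assumed callback that returns the file body
        # as a string, or None if the site has no robots.txt.
        self._fetch_text = fetch_text
        self._user_agent = user_agent
        self._parsers = {}

    def allowed(self, url):
        parts = urlsplit(url)
        site = f"{parts.scheme}://{parts.netloc}"
        parser = self._parsers.get(site)
        if parser is None:
            # First access to this site: retrieve and parse /robots.txt.
            text = self._fetch_text(site + "/robots.txt")
            parser = RobotFileParser()
            # A missing file is treated as an empty rule set (nothing disallowed).
            parser.parse((text or "").splitlines())
            self._parsers[site] = parser
        # Disallowed URLs are reported as False so the caller can discard them.
        return parser.can_fetch(self._user_agent, url)
```

A caller would pass every newly encountered URL through `allowed()` and drop those for which it returns `False`.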
Suppose your "text only" pages were all under a directory called /text. The simplest way to prevent traversal of that tree would be to use the exclusion or exclusion prefix.
The exclusion would look something like this:
The exclusion prefix would look something like this:
That will prevent retrieval of any pages under the /text tree. This does not prevent other Web robots from retrieving the /text tree. To set up a permanent global exclusion list you need to create a file called robots.txt in your document root directory. The format of that file is as follows:
Where "*" is the name of the robot to block. "*" means any robot not specifically named (all robots in this case since no others are named). Or you could specify the name of the robot. For Webinator it would be Webinator. You may specify several "Disallow"s for any given robot (see below). The "Disallow"s are simple path prefixes. They may not contain wildcards.
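As a minimal sketch of how such a file is interpreted, the snippet below feeds a two-line robots.txt to Python's standard `urllib.robotparser`. The file content is an assumption based on the description above (any robot, the /text tree); the `is_allowed` helper and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: block every robot from the /text tree.
ROBOTS_TXT = """\
User-agent: *
Disallow: /text
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Disallow entries are plain path prefixes, so /text/page.html is blocked
# for any robot, while /index.html is not.
blocked = is_allowed(ROBOTS_TXT, "Webinator", "http://www.example.com/text/page.html")
permitted = is_allowed(ROBOTS_TXT, "Webinator", "http://www.example.com/index.html")
```

Because the match is a prefix match, a path such as /textbook.html would also be blocked by `Disallow: /text`.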
You may also specify different "Disallow" sets for different robots. Simply insert a blank line and add another "User-agent" line followed by its "Disallow" lines.
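To illustrate per-robot rule sets, here is a sketch using Python's standard `urllib.robotparser`. The robots.txt content is hypothetical and is not taken from this manual: the paths /cgi-bin and /private and the robot name OtherBot are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one rule set per robot, separated by blank lines.
ROBOTS_TXT = """\
User-agent: Scooter
Disallow: /big

User-agent: Webinator
Disallow: /cgi-bin

User-agent: *
Disallow: /private
"""


def can_fetch(user_agent, path):
    """Check a path on an assumed example host against the rules above."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com" + path)


# A robot that is named explicitly uses only its own rule set;
# every other robot falls back to the "*" set.
results = {
    ("Scooter", "/big/a.html"): can_fetch("Scooter", "/big/a.html"),
    ("Webinator", "/big/a.html"): can_fetch("Webinator", "/big/a.html"),
    ("Webinator", "/cgi-bin/x"): can_fetch("Webinator", "/cgi-bin/x"),
    ("OtherBot", "/private/x"): can_fetch("OtherBot", "/private/x"),
}
```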
Here's a larger example:
The Scooter robot will be blocked from accessing any pages under the [...] /big trees.
Webinator will be blocked from accessing any pages under [...].
All other robots will be blocked from accessing [...].
robots.txt is not enforced in any way. Robots may or may not use it. Webinator will, by default, always look for it and use it if present. This may be disabled by turning off robots.txt under the Robots setting. When using robots.txt you may still use "Exclusions" for manual exclusion.
Meta robots provides another method of controlling robots such as Webinator. Any HTML page may contain a meta tag of the following form in its source:
<meta name="robots" content="WHAT-TO-DO">
WHAT-TO-DO may contain any of the following keywords. Multiple keywords may be used by placing a comma (,) between them.
|Index the text of this page|
|Don't index the text of this page|
|Follow hyperlinks on this page|
|Don't follow hyperlinks on this page|
| Synonym for |
| Synonym for |
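A robot might read this tag along the following lines. This is a sketch using Python's standard `html.parser`, not Webinator's code. The keyword names it checks (NOINDEX, NOFOLLOW, NONE) are assumed from the meta robots standard referenced earlier, and `MetaRobotsParser` and `page_directives` are hypothetical names.

```python
from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects index/follow permissions from <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        # Indexing and following links are both permitted by default.
        self.index = True
        self.follow = True

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() != "robots":
            return
        # Keywords are comma-separated and matched case-insensitively.
        for keyword in (attributes.get("content") or "").split(","):
            keyword = keyword.strip().lower()
            if keyword in ("noindex", "none"):
                self.index = False
            if keyword in ("nofollow", "none"):
                self.follow = False


def page_directives(html):
    """Return (may_index, may_follow) for the given HTML source."""
    parser = MetaRobotsParser()
    parser.feed(html)
    return parser.index, parser.follow
```

Only the restrictive keywords need handling here, because the permissive ones restate the defaults.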
Like robots.txt, this is not enforced in any way. Robots may or may not use it. Webinator always indexes and follows hyperlinks by default, so it only looks for