Custom error pages are important for user experience — they help visitors stay oriented when something goes wrong. But if they aren't set up correctly, they can cause problems for search engines and crawlers, such as unintended indexing of error content, duplicate pages in the index, and wasted crawling of URLs that don't exist.
As a provider of programmable search engine software, we often see these issues during real-world deployments. This post outlines best practices to help avoid these problems and improve crawler behavior.
Return the Correct HTTP Status Code
Every error page should return an appropriate HTTP status code: 404 Not Found for missing pages, 410 Gone for content that has been permanently removed, and a 5xx code for server-side failures.
Avoid returning 200 OK for pages that represent an error. A 200 response tells crawlers the content is valid, which can lead to unintended indexing and link-following.
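As an illustration, here is a minimal sketch of serving a custom error page while keeping the correct status code. It assumes a Python/Flask application and a 404.html template (both are assumptions for the example); most web servers and frameworks offer an equivalent setting.

from flask import Flask, render_template

app = Flask(__name__)

@app.errorhandler(404)
def page_not_found(error):
    # Serve the shared error template, but keep the 404 status code
    # instead of letting the page go out as 200 OK.
    return render_template("404.html"), 404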
Use a Canonical Link if the Error Page is Reused
If your error page is a shared template (e.g., /404.html) and may be returned with a 200 OK, consider adding a canonical link tag in the page header:
<link rel="canonical" href="https://example.com/404.html">
This helps search engines understand that the content is not unique and avoids indexing many different URLs that all show the same error page.
Include Meta Robots Directives
Use the following tag in your error page's <head>:
<meta name="robots" content="noindex, nofollow">
This tells crawlers not to index the page or follow any links on it. It's a useful safeguard, especially if some error pages are returned with a 200 OK status due to technical or legacy reasons.
Be Careful with Relative Links
Navigation links on error pages, such as "Home" or "Contact," can cause unintended crawling behavior if written as relative paths like home.html or ../contact.
For example, if a crawler accesses https://example.com/missing/path/ and the error page includes a link to contact.html, the crawler may request https://example.com/missing/path/contact.html, which likely also doesn't exist.
Recommendations: write navigation links on error pages as root-relative paths (e.g., /contact.html) or absolute URLs (e.g., https://example.com/contact.html) so they resolve the same way no matter which missing URL triggered the error page, as the resolution sketch below illustrates.
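To make the resolution rule concrete, this small sketch uses Python's urllib.parse.urljoin to show how a crawler resolves a relative link versus a root-relative link against the example URLs above (the URLs are purely illustrative):

from urllib.parse import urljoin

error_url = "https://example.com/missing/path/"

# A document-relative link resolves underneath the missing directory...
print(urljoin(error_url, "contact.html"))
# -> https://example.com/missing/path/contact.html

# ...while a root-relative link always resolves against the site root.
print(urljoin(error_url, "/contact.html"))
# -> https://example.com/contact.html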
Use Consistent Titles for Optional Exclusion Rules
Although our crawler does not rely on page heuristics to detect error pages, you can set up manual rules using our “Exclude By Field” feature.
If your error page consistently uses a title like:
<title>404 - Page Not Found</title>
you can define a rule in the crawler to exclude any page with that title from indexing, and optionally from link-following.
Keep in mind: this only works if the title is used consistently across all of your error pages, and the exclusion rule must be updated if the title text ever changes.
Monitor for Crawl Patterns
It's helpful to keep an eye on crawler behavior, especially during initial indexing. Indicators of error-related problems include a spike in requests for URLs that don't exist, many indexed pages sharing the same error-page title, and error content showing up in search results.
Our software provides crawl logs and supports exclusion rules that can be fine-tuned to prevent this type of activity.
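As a rough illustration, the sketch below scans a standard combined-format web server access log for the most frequently requested missing URLs (the access.log file name and the log format are assumptions for the example, not our crawl log format):

import re
from collections import Counter

# Matches the request path and status code in a combined-format access log line.
LOG_PATTERN = re.compile(r'"[A-Z]+ (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})')

not_found = Counter()
with open("access.log") as log:
    for line in log:
        match = LOG_PATTERN.search(line)
        if match and match.group("status") == "404":
            not_found[match.group("path")] += 1

# The most frequently requested missing URLs, e.g. paths generated by
# relative links on an error page, float to the top.
for path, count in not_found.most_common(20):
    print(f"{count:6d}  {path}")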
Conclusion
Custom error pages should support users without misleading crawlers. By following these best practices — correct status codes, proper link handling, and metadata — you can prevent many common issues and improve the quality of your indexed content.
If you use the Thunderstone search engine, features like "Exclude By Field" and crawl logging make it easier to manage how error pages are handled.