Refresh All

Refresh All still starts with the Base URLs and explores from there. But instead of creating a new database and fully downloading and processing each URL, it leaves the already-indexed data in place and checks each URL to see whether its content has changed. New URLs are added to the database, and URLs that are no longer present on the server are removed from it.
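The bookkeeping described above can be sketched roughly as follows. This is an illustration only, not the Search Appliance's actual implementation: the function name `plan_refresh`, the dictionary layout, the example URLs, and the idea of a per-URL "fingerprint" (such as a Last-Modified value or a checksum) are all assumptions made for the example.

```python
def plan_refresh(indexed, found):
    """Compare the existing index with what the current walk found.

    Both arguments are hypothetical dicts mapping URL -> fingerprint
    (for example a Last-Modified string or a checksum).
    """
    # URLs seen on this walk that are not yet in the database.
    added = sorted(url for url in found if url not in indexed)
    # URLs in the database that the walk no longer finds on the server.
    removed = sorted(url for url in indexed if url not in found)
    # URLs present in both, split by whether the fingerprint differs.
    changed = sorted(
        url for url in found if url in indexed and found[url] != indexed[url]
    )
    unchanged = sorted(
        url for url in found if url in indexed and found[url] == indexed[url]
    )
    return {
        "add": added,
        "remove": removed,
        "reprocess": changed,
        "keep": unchanged,
    }


# Hypothetical data: /a is unchanged, /b changed, /c disappeared, /d is new.
indexed = {
    "http://example.com/a": "v1",
    "http://example.com/b": "v1",
    "http://example.com/c": "v1",
}
found = {
    "http://example.com/a": "v1",
    "http://example.com/b": "v2",
    "http://example.com/d": "v1",
}
plan = plan_refresh(indexed, found)
print(plan)
```

Only the URLs under "add" and "reprocess" would need to be processed; those under "keep" stay in the database untouched.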

If a URL's content hasn't changed, the Search Appliance doesn't reprocess the file. If the server supports If-Modified-Since (or the walk is a file:// walk), the content won't even be transferred. This makes the walk much more efficient.
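For readers unfamiliar with If-Modified-Since, the following sketch shows what such a conditional request looks like from a client's side, using Python's standard library. It is not how the Search Appliance itself is implemented; the function names and the idea of storing the previous walk time as a Unix timestamp are assumptions for illustration.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_walk_epoch):
    """Build a GET request that asks for the body only if it has changed.

    last_walk_epoch is a hypothetical Unix timestamp of the previous walk.
    """
    # HTTP dates look like "Thu, 01 Jan 1970 00:00:00 GMT".
    stamp = formatdate(last_walk_epoch, usegmt=True)
    return urllib.request.Request(url, headers={"If-Modified-Since": stamp})


def fetch_if_changed(url, last_walk_epoch, timeout=10):
    """Return the new body as bytes, or None if the server answered 304."""
    request = build_conditional_request(url, last_walk_epoch)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: no body was transferred
            return None
        raise
```

A server that honors the header replies with status 304 and no body when the resource is unchanged, which is why nothing needs to be transferred. A server that ignores the header simply returns the full content every time.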

  • When to use Refresh All walks - Refresh All walks are useful for keeping content up to date once you've established all your walk settings. The walk is guaranteed to see anything that has changed, without needing to fully reprocess every URL every time.

    However, Refresh All walks don't reapply the walk settings to unchanged URLs. For example, a new Data from Field rule to customize the Title will not take effect for a URL whose content hasn't changed. If you change your settings to include more URLs (e.g. by adding extensions, removing exclusions, or adding domains), a Refresh All walk is not likely to find the newly allowed data unless all of the URLs leading to this data have been modified. You should do a New walk once to process these changes.

    For some large collections, especially those whose servers don't support If-Modified-Since, checking every URL every walk may still be too intensive. For these, Refresh walks can be used (see below).

    If more than 30%-50% of your site changes between walks, you may be better off using a New walk instead of Refresh All. Also, many dynamic content generators may not give accurate Last-Modified dates, which will cause every URL to be rewalked. In that case you should use New instead of Refresh All.
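The rule of thumb above can be written down as a small decision helper. This is only a sketch: the function name, its parameters, and the default threshold of 0.30 (the low end of the 30%-50% range) are assumptions, and the right cutoff for a given site is a judgment call.

```python
def suggest_walk_type(changed_urls, total_urls, threshold=0.30,
                      reliable_last_modified=True):
    """Suggest "New" or "Refresh All" from rough change statistics.

    All parameters are hypothetical inputs you would estimate yourself;
    threshold is the fraction of changed URLs above which a New walk
    is suggested.
    """
    if total_urls <= 0:
        raise ValueError("total_urls must be positive")
    # Inaccurate Last-Modified dates make every URL look changed,
    # so Refresh All loses its advantage.
    if not reliable_last_modified:
        return "New"
    fraction_changed = changed_urls / total_urls
    return "New" if fraction_changed > threshold else "Refresh All"


print(suggest_walk_type(100, 1000))   # 10% changed
print(suggest_walk_type(400, 1000))   # 40% changed
```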


Copyright © Thunderstone Software     Last updated: Nov 8 2024

 
