Refresh All still starts with the Base URLs and explores from there.
But instead of creating a new database and fully downloading and processing each
URL, it leaves the already-indexed data in place and checks each of the URLs to
see if the content has changed. New URLs are added to the database, and URLs
that are no longer present on the server are removed from the database.
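
The bookkeeping described above can be pictured with a minimal Python sketch. It is illustrative only and is not the appliance's actual implementation: the function name plan_refresh, the dictionaries, and the idea of comparing a stored content fingerprint per URL are all assumptions made for this example.

```python
def plan_refresh(indexed, found):
    """Compare the existing index with what the current walk found.

    indexed: dict mapping URL -> content fingerprint stored by the last walk
    found:   dict mapping URL -> content fingerprint seen on this walk
    (Both structures are hypothetical, chosen only for illustration.)
    """
    # URLs seen now but not in the index are new and must be added.
    added = sorted(url for url in found if url not in indexed)
    # URLs in the index that the walk no longer finds are removed.
    removed = sorted(url for url in indexed if url not in found)
    # URLs present in both are reprocessed only if the content differs.
    changed = sorted(
        url for url in found if url in indexed and found[url] != indexed[url]
    )
    unchanged = sorted(
        url for url in found if url in indexed and found[url] == indexed[url]
    )
    return {
        "add": added,
        "remove": removed,
        "reprocess": changed,
        "skip": unchanged,
    }
```

Only the "add" and "reprocess" lists need full processing; everything in "skip" keeps its already-indexed data.
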
If a URL's content hasn't changed, the Parametric Search Appliance doesn't
reprocess the file. If the server supports If-Modified-Since (or it's doing a
file:// walk), the content won't even be transferred. This lets the walk be
much more efficient.
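
For readers unfamiliar with If-Modified-Since, the following Python sketch shows what a conditional HTTP request looks like from the client side, using only the standard library. It is not the appliance's code: the function names and the last_walk_epoch parameter (the time of the previous walk, in seconds since the epoch) are assumptions for this example.

```python
import urllib.error
import urllib.request
from email.utils import formatdate


def build_conditional_request(url, last_walk_epoch):
    """Build a GET request that asks for the body only if it has changed
    since last_walk_epoch (hypothetical parameter, seconds since the epoch)."""
    req = urllib.request.Request(url)
    # HTTP dates must be expressed in GMT, e.g. "Thu, 01 Jan 1970 00:00:00 GMT".
    req.add_header("If-Modified-Since", formatdate(last_walk_epoch, usegmt=True))
    return req


def fetch_if_changed(url, last_walk_epoch, timeout=10):
    """Return the new body as bytes, or None if the server answers
    304 Not Modified, in which case no content is transferred."""
    req = build_conditional_request(url, last_walk_epoch)
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None  # unchanged: keep the already-indexed data
        raise
```

A server that ignores the header simply returns the full content with a 200 status, which is why the savings depend on server support.
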
Refresh All walks are useful for keeping content up to date once you've
established all your walk settings. The walk is guaranteed to see anything
that has changed, without needing to fully reprocess every URL every time.
However, Refresh All walks don't apply the walk settings every walk. A
new Data from Field rule to customize the Title will not take effect if
a URL's content hasn't changed. If you change your settings to include more
URLs (e.g. add extensions, remove exclusions, or add domains), a
Refresh All walk is not likely to find the newly allowed data unless
all of the URLs leading to this data have been modified. You should do a
New walk once to process these changes.
For some large collections, especially those whose servers don't support
If-Modified-Since, checking every URL every walk may still be too
intensive. For these, Refresh walks can be used (see below).
If more than 30% to 50% of your site changes between walks, you may be
better off using a New walk instead of Refresh All.
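
The rule of thumb above amounts to a simple ratio check, sketched here in Python. The function name, its parameters, and the default threshold of 0.3 (the low end of the 30% to 50% range) are assumptions for illustration; the appliance does not necessarily expose such a calculation.

```python
def suggest_walk_type(changed_urls, total_urls, threshold=0.3):
    """Suggest a walk type from the fraction of URLs that changed
    between walks. threshold is a hypothetical tunable between 0.3 and 0.5."""
    if total_urls <= 0:
        # Nothing indexed yet, so there is nothing to refresh.
        return "New"
    fraction_changed = changed_urls / total_urls
    return "New" if fraction_changed > threshold else "Refresh All"
```

For example, 100 changed URLs out of 1000 is a 10% change rate and suggests Refresh All, while 600 out of 1000 suggests a New walk.
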
Also, many dynamic content generators may not give accurate Last-Modified
dates, which will cause every URL to be rewalked. In that case you should use
New instead of Refresh All.