Duplicate Check Fields

Syntax: checkboxes to choose fields

These are the fields checked for duplicates when Prevent Duplicates is enabled. The selected fields are concatenated and hashed for each incoming document; if the hash matches that of an existing document, the incoming document is discarded as a duplicate.
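The hash comparison described above can be sketched roughly as follows. This is only an illustration: the helper names (`duplicate_key`, `drop_duplicates`), the use of SHA-256, and the plain string concatenation are assumptions made for the example, not the product's actual implementation.

```python
import hashlib

def duplicate_key(doc, fields):
    """Hash the concatenation of the selected fields (hypothetical helper)."""
    joined = "".join(doc.get(name, "") for name in fields)
    # SHA-256 is an assumption; the real hash function is not documented here.
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

def drop_duplicates(docs, fields=("Body",)):
    """Keep the first document seen for each hash and discard later matches."""
    seen = set()
    kept = []
    for doc in docs:
        key = duplicate_key(doc, fields)
        if key in seen:
            continue  # same hash as an existing document: discard
        seen.add(key)
        kept.append(doc)
    return kept

docs = [
    {"Title": "Home", "Body": "Welcome to our site."},
    {"Title": "Home (copy)", "Body": "Welcome to our site."},
    {"Title": "About", "Body": "About the company."},
]
kept = drop_duplicates(docs)  # only Body is checked, as in the default
```

With only Body checked, the second document is dropped because its body matches the first, even though its title differs. Passing `("Title", "Body")` as the fields would keep all three.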

By default, only Body is checked, because the body makes up most of a document's searchable content. Another document with an identical body should therefore be considered a duplicate even if its title or description differs slightly.

However, errors in processing (e.g. in anytotx) can sometimes cause the bodies of large numbers of documents to become empty. Those documents are then considered duplicates of each other and removed. In this case it may be desirable either to turn off Prevent Duplicates or to check more fields in Duplicate Check Fields.
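The empty-body case can be illustrated with a small sketch. As before, `duplicate_key`, SHA-256, and plain concatenation are assumptions for the example rather than the product's actual behavior.

```python
import hashlib

def duplicate_key(doc, fields):
    """Hash the concatenation of the selected fields (hypothetical helper)."""
    joined = "".join(doc.get(name, "") for name in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Two unrelated documents whose bodies were lost during processing.
broken = [
    {"Title": "Annual report", "Body": ""},
    {"Title": "Price list", "Body": ""},
]

# Checking Body alone: both documents hash the empty string and collide.
body_only = {duplicate_key(d, ("Body",)) for d in broken}

# Checking Title as well: the hashes differ, so neither is discarded.
title_and_body = {duplicate_key(d, ("Title", "Body")) for d in broken}
```

Here `body_only` contains a single hash while `title_and_body` contains two, which is why checking more fields protects such documents from being removed.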

Note: Changing Duplicate Check Fields after a walk has completed (i.e. before a later Refresh type walk) may cause new documents not to be removed as duplicates when expected, since the hashes stored for pre-existing documents were computed from a different set of fields. This will not cause errors or corruption; it may just leave some newly-duplicate documents in the database.
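A rough sketch of why the mismatch occurs, assuming hashes from the earlier walk are stored and compared against hashes computed under the new setting. The helper name, SHA-256, and the storage model are all assumptions for illustration.

```python
import hashlib

def duplicate_key(doc, fields):
    """Hash the concatenation of the selected fields (hypothetical helper)."""
    joined = "".join(doc.get(name, "") for name in fields)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Hash stored during the earlier walk, when only Body was checked.
existing = {"Title": "Home", "Body": "Welcome to our site."}
stored_hashes = {duplicate_key(existing, ("Body",))}

# A true duplicate arrives after the setting is changed to Title + Body.
incoming = {"Title": "Home", "Body": "Welcome to our site."}
new_key = duplicate_key(incoming, ("Title", "Body"))

# The new hash covers different input, so it cannot match the stored one.
missed = new_key not in stored_hashes
```

The incoming document is identical to the existing one, yet `missed` is true because the two hashes were computed over different field sets. Nothing is damaged; the duplicate is simply kept.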

Copyright © Thunderstone Software     Last updated: Mar 7 2019