Entity XML Elements

The XML schema for entities is shown below. Optional elements are shown in square brackets; elements that may repeat are followed by ellipses. The syntax is largely compatible with the Google Search Appliance's Entity Recognition XML format, though there are some differences.

<?xml version="1.0"?>
<instances>
  <instance>
    <name>Counties</name>
    [<case_sensitive>N</case_sensitive>]
    [<apply_case>as_is</apply_case>]
    [<store_term_or_name>term</store_term_or_name>]
    [<store_regex_or_name>regex</store_regex_or_name>]
    [<pattern>(?:[[:upper:]]\w+\s+)+County</pattern>
     ...]
    [<term>Adams County</term>
     ...]
  </instance>
  ...
</instances>

The root element is <instances>, which contains one or more entities, each defined in an <instance> element. Each <instance> has the following children:

  • <name> (required) - The name of the entity. Like Parametric Fields, an entity name must be composed solely of 1 to 29 alphanumerics or underscores (with the first character alphabetic), and the name must not be a SQL keyword.

  • <case_sensitive> (optional) - Whether <term>s match case-sensitively or not; a Y or N value. The default if unspecified is N.

  • <apply_case> (optional) - How to transform the case of text matches, before storing the entity. One of the following values:

    • as_is - Leave text as-is; no transformation

    • lowercase - Lower-case the match

    • uppercase - Upper-case the match

    • titlecase - Title-case the match: capitalize the first letter of each word

    • titlecase_first_word - Title-case just the first word

    The default if unspecified is as_is. Only matches stored from document text are affected. <term> matches stored with <store_term_or_name> set to name or term_tag, and <pattern> matches stored with <store_regex_or_name> set to name, are not modified. When both are specified for the same entity, this lets mixed-case <term> values (e.g. McDuff) retain their custom-specified case when stored, while still canonicalizing the possibly-variant cases of <pattern> matches in text.

  • <store_term_or_name> (optional) - What to store as the entity for <term> matches. One of the following values:

    • term - Store the text matched; this is the default if unspecified. Useful if knowing which <term> matched is significant; e.g. when the terms are a list of cities, and search results will be Grouped By city.

    • name - Store the entity <name> value. Useful when just the existence of the entity matters, i.e. all the terms are synonymous. For example, an entity named Water might have the terms water, H2O and dihydrogen monoxide, with any occurrence stored as Water.

    • term_tag - Store the <term> value. Useful if the specific term matters, and it should be saved with the same case as in the <term>, not the text. E.g. a custom-case <term> like McDuff may match Mcduff, MCDUFF etc. in the text, but the McDuff case variant is what gets stored.

  • <store_regex_or_name> (optional) - What to store as the entity for <pattern> matches. One of the following values:

    • regex - Store the text matched; this is the default if unspecified.

    • name - Store the entity <name> value. Useful if just the existence of the entity matters; e.g. the <pattern>s are looking for credit-card or phone numbers, and the exact digits do not matter, just the fact that the document contains a credit-card or phone number.

    • regex_tagged_as_first_group - Store the text matched by the first parenthetical capture group of the <pattern>. For example, the pattern "Mr\. (\w+)" could be used with regex_tagged_as_first_group to store just the last name found, without the "Mr." title. Note that REX syntax uses the \P and \F operators to indicate what part of the expression to store, and does not support capture groups; thus regex_tagged_as_first_group is not valid for REX <pattern>s.

  • <pattern> (optional; zero or more occurrences) - A regular expression (regex) to match entities in document text. The default syntax is that of Google's RE2 library. REX syntax may also be used, by preceding the expression with \<rex\>. To store just part of the text matched, use a parenthetical capture group in the expression and set <store_regex_or_name> to regex_tagged_as_first_group; or use a REX expression with the \P and \F operators.

    Note: On some platforms, RE2 syntax is not supported, and REX syntax must be used. These platforms will give the error message "REX: RE2 not supported on this platform" when uploading an entity file containing RE2 <pattern>s. (RE2 is supported on Windows, and on Linux 2.6 and later versions except i686-unknown-linux2.6.17-64-32.) RE2 syntax is documented at https://github.com/google/re2/wiki/Syntax.

  • <term> (optional; zero or more occurrences) - A term to find as an entity in document text. The term is searched for exactly, as a phrase (no quotes needed). It is matched case-insensitively, unless <case_sensitive> is set to Y.
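
To illustrate how the storage options above combine, here is a minimal sketch of an entity file. The Water entity and the McDuff term come from the descriptions above; the entity name Surnames is hypothetical, and the element order simply follows the schema shown earlier. It assumes a platform where RE2 syntax is supported.

```xml
<?xml version="1.0"?>
<instances>
  <!-- All terms are synonymous: any match is stored as the entity
       name, "Water", rather than as the text matched. -->
  <instance>
    <name>Water</name>
    <store_term_or_name>name</store_term_or_name>
    <term>water</term>
    <term>H2O</term>
    <term>dihydrogen monoxide</term>
  </instance>
  <!-- "Surnames" is a hypothetical entity name.
       term_tag stores "McDuff" exactly as written in <term>, even if
       the text says "MCDUFF"; <apply_case> does not modify it.
       The pattern stores only its first capture group (the surname,
       without "Mr."). Since that text comes from the document,
       <apply_case> should title-case it. -->
  <instance>
    <name>Surnames</name>
    <apply_case>titlecase</apply_case>
    <store_term_or_name>term_tag</store_term_or_name>
    <store_regex_or_name>regex_tagged_as_first_group</store_regex_or_name>
    <pattern>Mr\. (\w+)</pattern>
    <term>McDuff</term>
  </instance>
</instances>
```

In the second entity, the <term> and the <pattern> are stored differently on purpose: the term keeps its custom case, while the variant cases found by the pattern are normalized.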

Note that more than one entity may be defined in a file, since the <instance> element defining an entity may occur repeatedly.
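
As a sketch of a file defining more than one entity, the example below repeats <instance> twice. The Counties entity reuses the values from the schema above, with its optional elements left at their defaults; the Phone_Numbers entity, its name, and its pattern are hypothetical and only illustrate the name storage option. RE2 syntax is assumed to be supported on the platform.

```xml
<?xml version="1.0"?>
<instances>
  <!-- First entity: stores the matched text (the defaults). -->
  <instance>
    <name>Counties</name>
    <pattern>(?:[[:upper:]]\w+\s+)+County</pattern>
    <term>Adams County</term>
  </instance>
  <!-- Second entity (hypothetical): only the presence of a phone
       number matters, so the entity name is stored instead of the
       digits. The pattern is an illustrative US-style format. -->
  <instance>
    <name>Phone_Numbers</name>
    <store_regex_or_name>name</store_regex_or_name>
    <pattern>\b\d{3}-\d{3}-\d{4}\b</pattern>
  </instance>
</instances>
```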


Copyright © Thunderstone Software     Last updated: Nov 8 2024