The XML schema for entities is below. Optional elements are shown in square brackets; elements that may repeat are followed by ellipses. The syntax is largely compatible with the Google Search Appliance's Entity Recognition XML format, but see here for differences.
<?xml version="1.0"?>
<instances>
  <instance>
    <name>Counties</name>
    [<case_sensitive>N</case_sensitive>]
    [<apply_case>as_is</apply_case>]
    [<store_term_or_name>term</store_term_or_name>]
    [<store_regex_or_name>regex</store_regex_or_name>]
    [<pattern>(?:[[:upper:]]\w+\s+)+County</pattern>
    ...]
    [<term>Adams County</term>
    ...]
  </instance>
  ...
</instances>
The root element is <instances>, which contains one or more entities, each defined in an <instance> element. Each <instance> has the following children:
<name> (required) - The name of the entity. Like Parametric Fields, an entity name must be composed solely of 1 to 29 alphanumerics or underscores (with the first character alphabetic), and the name must not be a SQL keyword.
<case_sensitive> (optional) - Whether <term>s match case-sensitively or not; a Y or N value. The default if unspecified is N.
<apply_case> (optional) - How to transform the case of text matches before storing the entity. One of the following values:
  as_is - Leave text as-is; no transformation
  lowercase - Lower-case the match
  uppercase - Upper-case the match
  titlecase - Title-case the match: capitalize the first letter of each word
  titlecase_first_word - Title-case just the first word
The default if unspecified is as_is. Note that only matches stored from document text are affected: <term> matches when <store_term_or_name> is name or term_tag, and <pattern> matches when <store_regex_or_name> is name, are not modified. This allows mixed-case <term> values - e.g. McDuff - to retain their custom-specified case when stored, while still canonicalizing the possibly-variant cases of <pattern> matches in text, when both are specified for the same entity.
<store_term_or_name> (optional) - What to store as the entity for <term> matches. One of the following values:
  term - Store the text matched; this is the default if unspecified. Useful if knowing which <term> matched is significant; e.g. when looking for a list of cities, and search results will be Grouped By city.
  name - Store the entity <name> value. Useful when just the existence of the entity matters, i.e. all the terms are synonymous. (E.g. an entity named Water with terms water, H2O and dihydrogen monoxide, where any occurrence should be stored as Water.)
  term_tag - Store the <term> value. Useful if the specific term matters, and it should be saved with the same case as in the <term>, not the text. E.g. if a custom-case <term> like McDuff is set, it may match Mcduff, MCDUFF etc. in the text - the McDuff case variant is stored.
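As an illustration of the <term> storage options above, the following is a minimal sketch of an entity file using name and term_tag. The Water terms and the McDuff term come from the examples in the text; the entity name Surnames is hypothetical. The XML comments are annotations for the reader only: this documentation does not state whether the entity-file parser accepts comments, so remove them if in doubt.

```xml
<?xml version="1.0"?>
<instances>
  <!-- All terms are treated as synonyms: any match is stored as the entity name, Water -->
  <instance>
    <name>Water</name>
    <store_term_or_name>name</store_term_or_name>
    <term>water</term>
    <term>H2O</term>
    <term>dihydrogen monoxide</term>
  </instance>
  <!-- Hypothetical entity: terms match case-insensitively (Mcduff, MCDUFF etc.),
       but the value stored is the term as written here, McDuff -->
  <instance>
    <name>Surnames</name>
    <case_sensitive>N</case_sensitive>
    <store_term_or_name>term_tag</store_term_or_name>
    <term>McDuff</term>
  </instance>
</instances>
```

Because both entities store either the name or the term value rather than document text, an <apply_case> setting would not alter what is stored for them.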
<store_regex_or_name> (optional) - What to store as the entity for <pattern> matches. One of the following values:
  regex - Store the text matched; this is the default if unspecified.
  name - Store the entity <name> value. Useful if just the existence of the entity matters; e.g. the <pattern>s are looking for credit-card or phone numbers, and the exact digits do not matter, just the fact that the document contains a credit-card or phone number.
  regex_tagged_as_first_group - Store the text matched by the first parenthetical capture group of the <pattern>. For example, the pattern "Mr\. (\w+)" could be used with regex_tagged_as_first_group to store just the last name found, without the "Mr." title. Note that REX syntax uses the \P and \F operators to indicate what part of the expression to store, and does not support capture groups; thus regex_tagged_as_first_group is not valid for REX <pattern>s.
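The capture-group case can be sketched as a small entity file. The pattern is the one quoted in the text; the entity name Mr_Surnames is an assumption chosen for illustration.

```xml
<?xml version="1.0"?>
<instances>
  <instance>
    <name>Mr_Surnames</name>
    <store_regex_or_name>regex_tagged_as_first_group</store_regex_or_name>
    <pattern>Mr\. (\w+)</pattern>
  </instance>
</instances>
```

Per the description above, document text such as "Mr. Smith" would be expected to store just the captured group, Smith, rather than the whole match.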
<pattern> (optional; zero or more occurrences) - A regular expression (regex) to match entities in document text. The default syntax is that of Google's RE2 library. REX syntax may also be used, by preceding the expression with \<rex\>. To store just part of the text matched, use a parenthetical capture group in the expression and set <store_regex_or_name> to regex_tagged_as_first_group; or use a REX expression with the \P and \F operators.
Note: On some platforms, RE2 syntax is not supported, and REX syntax must be used. These platforms will give the error message "REX: RE2 not supported on this platform" when uploading an entity file containing RE2 <pattern>s. (Windows, Linux 2.6 and later versions except i686-unknown-linux2.6.17-64-32 are supported.) RE2 syntax is documented at https://github.com/google/re2/wiki/Syntax.
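A minimal sketch of an RE2 <pattern> entity where only the presence of a match matters, as in the phone-number case mentioned for <store_regex_or_name>. The entity name Phone_Number and the simplified pattern (three digits, three digits, four digits, separated by hyphens) are assumptions for illustration, not a recommended phone-number matcher.

```xml
<?xml version="1.0"?>
<instances>
  <instance>
    <name>Phone_Number</name>
    <store_regex_or_name>name</store_regex_or_name>
    <pattern>\b\d{3}-\d{3}-\d{4}\b</pattern>
  </instance>
</instances>
```

With <store_regex_or_name> set to name, each match would be stored as Phone_Number rather than as the digits found. If a pattern ever needs a literal & or < character, it has to be written as an XML entity (&amp;amp; or &amp;lt;) for the file to remain well-formed.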
<term> (optional; zero or more occurrences) - A term to find as an entity in document text. The term is searched for exactly, as a phrase (no quotes needed). It is matched case-insensitively, unless <case_sensitive> is set to Y.
Note that more than one entity may be defined in a file, since the <instance> element defining an entity may occur repeatedly.
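For instance, a single file defining two entities might look like the sketch below. The Counties entity reuses the values from the schema at the top of this section; the Cities entity and its terms are hypothetical.

```xml
<?xml version="1.0"?>
<instances>
  <instance>
    <name>Counties</name>
    <pattern>(?:[[:upper:]]\w+\s+)+County</pattern>
    <term>Adams County</term>
  </instance>
  <instance>
    <name>Cities</name>
    <store_term_or_name>term</store_term_or_name>
    <term>Springfield</term>
    <term>Denver</term>
  </instance>
</instances>
```

All optional elements omitted here fall back to the defaults described above, so both entities store the text matched in the document.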