The Zebra internal query engine has been extended to meet specific needs not covered by the bib-1 attribute set query model. These extensions are non-standard and non-portable: most functional extensions are modeled over the bib-1 attribute set, defining type 7 and higher values. There are also the special string type index names for the idxpath attribute set.
Zebra defines a hardwired string index name called _ALLRECORDS. It matches any record contained in the database if used in conjunction with the relation attribute AlwaysMatches (103).
The _ALLRECORDS index name is used for total database export. The search term is ignored and may be empty.
Z> find @attr 1=_ALLRECORDS @attr 2=103 ""
Combination with other index types can be made. For example, to find all records which are not indexed in the Title register, issue one of the two equivalent queries:
Z> find @not @attr 1=_ALLRECORDS @attr 2=103 "" @attr 1=Title @attr 2=103 ""
Z> find @not @attr 1=_ALLRECORDS @attr 2=103 "" @attr 1=4 @attr 2=103 ""
The special string index _ALLRECORDS is experimental, and the provided functionality and syntax may very well change in future releases of Zebra.
Zebra extends the BIB-1 attribute types, and these extensions are recognized regardless of the attribute set used in a search operation query.
Table 5.9. Zebra Search Attribute Extensions
Name | Value | Operation | Zebra version |
---|---|---|---|
Embedded Sort | 7 | search | 1.1 |
Term Set | 8 | search | 1.1 |
Rank Weight | 9 | search | 1.1 |
Term Reference | 10 | search | 1.4 |
Local Approx Limit | 11 | search | 1.4 |
Global Approx Limit | 12 | search | 2.0.8 |
Maximum number of truncated terms (truncmax) | 13 | search | 2.0.10 |
Ignore un-indexed fields (a zero value, the default, throws a diagnostic when an un-indexed field is specified; a non-zero value makes the search return 0 hits) | 14 | search | 2.0.16 |
The embedded sort is a way to specify sorting within a query, removing the need to send a Sort Request separately. It is faster and does not require clients to deal with the Sort Facility.
All ordering operations are based on a lexicographical ordering, except when the structure attribute numeric (109) is used. In this case, ordering is numerical. See Section 2.4.3, “Structure Attributes (type 4)”.
The possible values after attribute type 7 are 1 (ascending) and 2 (descending). The attributes+term (APT) node is separate from the rest and must be @or'ed. The term associated with the APT is the sorting level as an integer, where 0 means primary sort, 1 means secondary sort, and so forth. See also Section 9, “Relevance Ranking and Sorting of Result Sets”.
For example, searching for water, sort by title (ascending):
Z> find @or @attr 1=1016 water @attr 7=1 @attr 1=4 0
Or, searching for water, sort by title ascending, then by date descending:
Z> find @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
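The construction of such queries can be sketched in a few lines of Python. This is only an illustration of the rules above: the function name `with_embedded_sort` and its argument layout are hypothetical and not part of Zebra or YAZ.

```python
def with_embedded_sort(query, sort_keys):
    """Append embedded sort APTs to a PQF query string.

    query      -- an existing PQF query, e.g. "@attr 1=1016 water"
    sort_keys  -- list of (use_attribute, direction) pairs in priority order;
                  direction is "asc" (7=1) or "desc" (7=2)
    (Hypothetical helper, written for illustration only.)
    """
    directions = {"asc": 1, "desc": 2}
    parts = [query]
    for level, (use, direction) in enumerate(sort_keys):
        # The term of each sort APT is its level: 0 primary, 1 secondary, ...
        parts.append(f"@attr 7={directions[direction]} @attr 1={use} {level}")
    # PQF is prefix notation: n sort APTs need n leading @or operators.
    return " ".join(["@or"] * len(sort_keys) + parts)


print(with_embedded_sort("@attr 1=1016 water", [(4, "asc"), (30, "desc")]))
```

With the two sort keys shown, the helper produces the same query string as the second example above.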
Rank weight is a way to pass a value to a ranking algorithm, so that one APT has one value while another has a different one. See also Section 9, “Relevance Ranking and Sorting of Result Sets”.
For example, searching for utah in title with weight 30 as well as any with weight 20:
Z> find @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
Zebra supports the searchResult-1 facility. If the Term Reference Attribute (type 10) is given, that specifies a subqueryId value returned as part of the search result. It is a way for a client to name an APT part of a query.
Experimental. Do not use in production code.
Zebra computes - unless otherwise configured - the exact hit count for every APT (leaf) in the query tree. These hit counts are returned as part of the searchResult-1 facility in the binary encoded Z39.50 search response packages.
When an estimation limit is set on the result set size of an APT leaf, Zebra stops processing that result set once the limit is reached. Hit counts under this limit are still precise, but hit counts over it are estimated using the statistics gathered from the chopped result set.
Specifying a limit of 0 results in exact hit counts.
For example, we might be interested in exact hit count for a, but for b we allow hit count estimates for 1000 and higher.
Z> find @and a @attr 11=1000 b
The estimated hit count facility makes searches faster, as large hit lists only need to be processed partially. It is mostly used in huge databases, where you want to trade exactness of hit counts against speed of execution.
Do not use approximative hit count limits in conjunction with relevance ranking, as re-sorting of the result set only works when the entire result set has been processed.
By default Zebra computes precise hit counts for a query as a whole. Setting attribute 12 makes it perform approximative hit counts instead. It has the same semantics as estimatehits in the Zebra configuration file (see Section 2, “The Zebra Configuration File”).
The attribute (12) can occur anywhere in the query tree. Unlike regular attributes it does not relate to the leaf (APT) - but to the whole query.
Do not use approximative hit count limits in conjunction with relevance ranking, as re-sorting of the result set only works when the entire result set has been processed.
Zebra extends the BIB-1 attribute types, and these extensions are recognized regardless of the attribute set used in a scan operation query.
Table 5.10. Zebra Scan Attribute Extensions
Name | Type | Operation | Zebra version |
---|---|---|---|
Result Set Narrow | 8 | scan | 1.3 |
Approximative Limit | 12 | scan | 2.0.20 |
If attribute Result Set Narrow (type 8) is given for scan, the value is the name of a result set. Each hit count in scan is @and'ed with the result set given.
Consider for example the case of scanning all title fields around the scanterm mozart, then refining the scan by issuing a filtering query for amadeus to restrict the scan to the result set of the query:
Z> scan @attr 1=4 mozart
...
* mozart (43)
mozartforskningen (1)
mozartiana (1)
mozarts (16)
...
Z> f @attr 1=4 amadeus
...
Number of hits: 15, setno 2
...
Z> scan @attr 1=4 @attr 8=2 mozart
...
* mozart (14)
mozartforskningen (0)
mozartiana (0)
mozarts (1)
...
Zebra 2.0.2 and later is able to skip 0 hit counts. This, however, is known not to scale if the number of terms to skip is high. That is most likely to happen when the result set is small, which results in many 0 hit counts.
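The effect of Result Set Narrow can be pictured with a toy model in Python. This is a conceptual sketch only, not Zebra's implementation: the posting lists, record ids, and the function name `narrowed_scan` are made up for illustration.

```python
def narrowed_scan(postings, result_set, skip_zero=False):
    """Return (term, hit count) pairs, with each count restricted to result_set.

    postings   -- dict mapping a scan term to the set of record ids containing it
    result_set -- set of record ids from a previous search
    skip_zero  -- drop terms whose narrowed count is 0
    (Toy model with hypothetical data structures.)
    """
    rows = []
    for term in sorted(postings):
        # "@and" the term's records with the named result set.
        count = len(postings[term].intersection(result_set))
        if skip_zero and count == 0:
            continue
        rows.append((term, count))
    return rows


postings = {"mozart": {1, 2, 3}, "mozartiana": {4}, "mozarts": {2, 5}}
result_set = {2, 3, 5}
print(narrowed_scan(postings, result_set))
print(narrowed_scan(postings, result_set, skip_zero=True))
```

Note that in this model every term still has to be intersected before a zero count can be skipped, which hints at why skipping many terms is costly when the result set is small.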
The attribute-set idxpath consists of a single Use (type 1) attribute. All non-use attributes behave as normal.
This feature is enabled when defining the xpath enable option in the GRS-1 filter *.abs configuration files. If one wants to use the special idxpath numeric attribute set, the main Zebra configuration file zebra.cfg directive attset: idxpath.att must be enabled.
The idxpath is deprecated, may not be supported in future Zebra versions, and should definitely not be used in production code.
This attribute set allows one to search GRS-1 filter indexed records by XPATH like structured index names.
The idxpath option defines hard-coded index names, which might clash with your own index names.
Table 5.11. Zebra specific IDXPATH Use Attributes (type 1)
IDXPATH | Value | String Index | Notes |
---|---|---|---|
XPATH Begin | 1 | _XPATH_BEGIN | deprecated |
XPATH End | 2 | _XPATH_END | deprecated |
XPATH CData | 1016 | _XPATH_CDATA | deprecated |
XPATH Attribute Name | 3 | _XPATH_ATTR_NAME | deprecated |
XPATH Attribute CData | 1015 | _XPATH_ATTR_CDATA | deprecated |
See tab/idxpath.att for more information.
Search for all documents starting with root element /root (using either the numeric or the string use attributes):
Z> find @attrset idxpath @attr 1=1 @attr 4=3 root/
Z> find @attr idxpath 1=1 @attr 4=3 root/
Z> find @attr 1=_XPATH_BEGIN @attr 4=3 root/
Search for all documents where the specific nested XPATH /c1/c2/../cn exists. Notice the very counter-intuitive reverse notation!
Z> find @attrset idxpath @attr 1=1 @attr 4=3 cn/cn-1/../c1/
Z> find @attr 1=_XPATH_BEGIN @attr 4=3 cn/cn-1/../c1/
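Because the reversed notation is easy to get wrong by hand, a small helper can derive the search term from the ordinary path. The following Python sketch assumes plain element names separated by slashes (no predicates or wildcards); the function name `xpath_to_idxpath_term` is hypothetical.

```python
def xpath_to_idxpath_term(path):
    """Turn "/c1/c2/cn" into the reversed term "cn/c2/c1/" used with _XPATH_BEGIN.

    Assumes a simple absolute path of element names only.
    (Hypothetical helper, written for illustration only.)
    """
    steps = [step for step in path.split("/") if step]
    if not steps:
        raise ValueError("path contains no element names: %r" % path)
    # Innermost element first, each step followed by a slash.
    return "/".join(reversed(steps)) + "/"


print(xpath_to_idxpath_term("/root"))
print(xpath_to_idxpath_term("/c1/c2/c3"))
```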
Search for CDATA string text in any element:
Z> find @attrset idxpath @attr 1=1016 text
Z> find @attr 1=_XPATH_CDATA text
Search for CDATA string anothertext in any attribute:
Z> find @attrset idxpath @attr 1=1015 anothertext
Z> find @attr 1=_XPATH_ATTR_CDATA anothertext
Search for all documents which have an XML element node including an XML attribute named creator:
Z> find @attrset idxpath @attr 1=3 @attr 4=3 creator
Z> find @attr 1=_XPATH_ATTR_NAME @attr 4=3 creator
Combining usual bib-1 attribute set searches with idxpath attribute set searches:
Z> find @and @attr idxpath 1=1 @attr 4=3 link/ @attr 1=4 mozart
Z> find @and @attr 1=_XPATH_BEGIN @attr 4=3 link/ @attr 1=_XPATH_CDATA mozart
Scanning is supported on all idxpath indexes, specified either as numeric use attributes or as string index names.
Z> scan @attrset idxpath @attr 1=1016 text
Z> scan @attr 1=_XPATH_ATTR_CDATA anothertext
Z> scan @attrset idxpath @attr 1=3 @attr 4=3 ''
The rules for PQF APT mapping are rather tricky to grasp in the first place. We deal first with the rules for deciding which internal register or string index to use, according to the use attribute or access point specified in the query. Thereafter we deal with the rules for determining the correct structure type of the named register.
Zebra understands four fundamentally different types of access points, of which only the numeric use attribute type access points are defined by the Z39.50 standard. All other access point types are Zebra specific, and non-portable.
Table 5.12. Access point name mapping
Access Point | Type | Grammar | Notes |
---|---|---|---|
Use attribute | numeric | [1-9][0-9]* | directly mapped to string index name |
String index name | string | [a-zA-Z](\-?[a-zA-Z0-9])* | normalized name is used as internal string index name |
Zebra internal index name | zebra | _[a-zA-Z](_?[a-zA-Z0-9])* | hardwired internal string index name |
XPATH special index | XPath | /.* | special xpath search for GRS-1 indexed records |
Attribute set names and string index names are normalized according to the following rules: all single hyphens '-' are stripped, and all upper case letters are folded to lower case.
Numeric use attributes are mapped to the Zebra internal string index according to the attribute set definition in use. The default attribute set is BIB-1, and may be omitted in the PQF query.
According to normalization and numeric use attribute mapping, it follows that the following PQF queries are considered equivalent (assuming the default configuration has not been altered):
Z> find @attr 1=Body-of-text serenade
Z> find @attr 1=bodyoftext serenade
Z> find @attr 1=BodyOfText serenade
Z> find @attr 1=bO-d-Y-of-tE-x-t serenade
Z> find @attr 1=1010 serenade
Z> find @attrset bib1 @attr 1=1010 serenade
Z> find @attrset Bib1 @attr 1=1010 serenade
Z> find @attrset b-I-b-1 @attr 1=1010 serenade
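The normalization rule can be expressed compactly in Python. This is a sketch of the rule as described in this section, not Zebra's own code; the function name `normalize_name` is hypothetical, and the validation step assumes the string index name grammar from Table 5.12, under which hyphens only ever occur singly between alphanumeric characters.

```python
import re

# String index name grammar from Table 5.12.
_NAME = re.compile(r"[a-zA-Z](-?[a-zA-Z0-9])*")


def normalize_name(name):
    """Strip single hyphens and fold to lower case.

    (Hypothetical helper illustrating the documented rule.)
    """
    if not _NAME.fullmatch(name):
        raise ValueError("not a valid string index name: %r" % name)
    return name.replace("-", "").lower()


for name in ("Body-of-text", "bodyoftext", "BodyOfText", "bO-d-Y-of-tE-x-t"):
    print(name, "->", normalize_name(name))
```

All four spellings normalize to the same internal name, which is why the string index queries above are equivalent.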
The numerical use attributes (type 1) are interpreted according to the attribute sets which have been loaded in the zebra.cfg file, and are matched against specific fields as specified in the .abs file which describes the profile of the records which have been loaded. If no use attribute is provided, a default of BIB-1 Use Any (1016) is assumed. The predefined use attribute sets can be reconfigured by tweaking the configuration files tab/*.att, and new attribute sets can be defined by adding similar files in the configuration path profilePath of the server.
String indexes can be accessed directly, independently of which attribute set is in use; the attribute set is simply ignored. The above mentioned name normalization applies. String index names are defined in the configuration files of the indexing filter in use, for example in the GRS-1 *.abs configuration files, or in the alvis filter XSLT indexing stylesheets.
Zebra internal indexes can be accessed directly, according to the same rules as the user defined string indexes. The only difference is that Zebra internal index names are hardwired, all uppercase and must start with the character '_'.
Finally, XPATH access points are only available using the GRS-1 filter for indexing. These access point names must start with the character '/'; they are not normalized, but passed unaltered to the Zebra internal XPATH engine. See Section 2.1.6, “Zebra's special access point of type 'XPath' for GRS-1 filters”.
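As a rough illustration of how the four access point types can be told apart, the grammars from Table 5.12 can be applied directly as regular expressions. The Python sketch below is not how Zebra itself dispatches; the function name `classify_access_point` is hypothetical, and the numeric pattern is assumed to accept zeros after the first digit, since use attribute values such as 1010 and 1016 appear in this section.

```python
import re

# Patterns follow Table 5.12; the numeric one allows zeros after the first
# digit (an assumption, see the note above).
_PATTERNS = [
    ("numeric", re.compile(r"[1-9][0-9]*")),
    ("zebra", re.compile(r"_[a-zA-Z](_?[a-zA-Z0-9])*")),
    ("xpath", re.compile(r"/.*")),
    ("string", re.compile(r"[a-zA-Z](-?[a-zA-Z0-9])*")),
]


def classify_access_point(value):
    """Return "numeric", "zebra", "xpath", "string", or None if nothing matches.

    (Hypothetical helper, written for illustration only.)
    """
    for kind, pattern in _PATTERNS:
        if pattern.fullmatch(value):
            return kind
    return None


for value in ("1016", "_ALLRECORDS", "/root/title", "Body-of-text"):
    print(value, "->", classify_access_point(value))
```

The four patterns cannot overlap, because each requires a different kind of first character (digit, underscore, slash, letter).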
Internally Zebra has in its default configuration several different types of registers or indexes, whose tokenization and character normalization rules differ. This reflects the fact that searching fundamentally different tokens like dates, numbers, bitfields and string based text needs different rule sets.
Table 5.13. Structure and completeness mapping to register types
Structure | Completeness | Register type | Notes |
---|---|---|---|
phrase (@attr 4=1), word (@attr 4=2), word-list (@attr 4=6), free-form-text (@attr 4=105), or document-text (@attr 4=106) | Incomplete field (@attr 6=1) | Word ('w') | Traditional tokenized and character normalized word index |
phrase (@attr 4=1), word (@attr 4=2), word-list (@attr 4=6), free-form-text (@attr 4=105), or document-text (@attr 4=106) | Complete field (@attr 6=3) | Phrase ('p') | Character normalized, but not tokenized index for phrase matches |
urx (@attr 4=104) | ignored | URX/URL ('u') | Special index for URL web addresses |
numeric (@attr 4=109) | ignored | Numeric ('n') | Special index for digital numbers |
key (@attr 4=3) | ignored | Null bitmap ('0') | Used for non-tokenized and non-normalized bit sequences |
year (@attr 4=4) | ignored | Year ('y') | Non-tokenized and non-normalized 4 digit numbers |
date (@attr 4=5) | ignored | Date ('d') | Non-tokenized and non-normalized ISO date strings |
ignored | ignored | Sort ('s') | Used with special sort attribute set (@attr 7=1, @attr 7=2) |
overruled | overruled | special | Internal record ID register, used whenever Relation Always Matches (@attr 2=103) is specified |
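Table 5.13 can be read as a lookup from attribute values to a register type. The Python sketch below is one possible reading of the table and nothing more: the function name `register_type`, its parameters, and the order in which the rows are checked are assumptions, and combinations the table does not list yield None.

```python
# Structures whose register depends on the completeness attribute.
WORD_LIKE = {1, 2, 6, 105, 106}
# Structures that select a register regardless of completeness.
FIXED = {104: "u", 109: "n", 3: "0", 4: "y", 5: "d"}


def register_type(structure, completeness=1, relation=None, sort=None):
    """Map attribute values to a register type following Table 5.13.

    structure    -- value of attribute type 4
    completeness -- value of attribute type 6 (1 = incomplete field)
    relation     -- value of attribute type 2, if given
    sort         -- value of attribute type 7, if given
    (Hypothetical helper; check order is an assumption.)
    """
    if relation == 103:
        # Always Matches overrules structure and completeness.
        return "special"
    if sort in (1, 2):
        return "s"
    if structure in FIXED:
        return FIXED[structure]
    if structure in WORD_LIKE:
        if completeness == 1:
            return "w"
        if completeness == 3:
            return "p"
    return None  # combination not covered by the table


print(register_type(1, 3))    # phrase + complete field
print(register_type(1, 1))    # phrase + incomplete field
print(register_type(109, 3))  # numeric, completeness ignored
```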
If a Structure attribute of Phrase is used in conjunction with a Completeness attribute of Complete (Sub)field, the term is matched against the contents of the phrase (long word) register, if one exists for the given Use attribute. A phrase register is created for those fields in the GRS-1 *.abs file that contain a p-specifier.
Z> scan @attr 1=Title @attr 4=1 @attr 6=3 beethoven
...
bayreuther festspiele (1)
* beethoven bibliography database (1)
benny carter (1)
...
Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography"
...
Number of hits: 0, setno 5
...
Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography database"
...
Number of hits: 1, setno 6
If Structure=Phrase is used in conjunction with Incomplete Field (the default value for Completeness), the search is directed against the normal word registers, but if the term contains multiple words, the term will only match if all of the words are found immediately adjacent, and in the given order. The word search is performed on those fields that are indexed as type w in the GRS-1 *.abs file.
Z> scan @attr 1=Title @attr 4=1 @attr 6=1 beethoven
...
beefheart (1)
* beethoven (18)
beethovens (7)
...
Z> find @attr 1=Title @attr 4=1 @attr 6=1 beethoven
...
Number of hits: 18, setno 1
...
Z> find @attr 1=Title @attr 4=1 @attr 6=1 "beethoven bibliography"
...
Number of hits: 2, setno 2
...
If the Structure attribute is Word List, Free-form Text, or Document Text, the term is treated as a natural-language, relevance-ranked query. This search type uses the word register, i.e. those fields that are indexed as type w in the GRS-1 *.abs file.
If the Structure attribute is Numeric String, the term is treated as an integer. The search is performed on those fields that are indexed as type n in the GRS-1 *.abs file.
If the Structure attribute is URX, the term is treated as a URX (URL) entity. The search is performed on those fields that are indexed as type u in the *.abs file.
If the Structure attribute is Local Number, the term is treated as a native Zebra Record Identifier.
If the Relation attribute is Equals (default), the term is matched in a normal fashion (modulo truncation and processing of individual words, if required). If Relation is Less Than, Less Than or Equal, Greater than, or Greater than or Equal, the term is assumed to be numerical, and a standard regular expression is constructed to match the given expression. If Relation is Relevance, the standard natural-language query processor is invoked.
For the Truncation attribute, No Truncation is the default. Left Truncation is not supported. Process # in search term is supported, as is Regxp-1. Regxp-2 enables the fault-tolerant (fuzzy) search. As a default, a single error (deletion, insertion, replacement) is accepted when terms are matched against the register contents.
Each term in a query is interpreted as a regular expression if the truncation value is either Regxp-1 (@attr 5=102) or Regxp-2 (@attr 5=103). Both query types follow the same syntax, with the following operands:
Table 5.14. Regular Expression Operands
x | Matches the character x. |
. | Matches any character. |
[ .. ] | Matches the set of characters specified, such as [abc] or [a-c]. |
The above operands can be combined with the following operators:
Table 5.15. Regular Expression Operators
x* | Matches x zero or more times. Priority: high. |
x+ | Matches x one or more times. Priority: high. |
x? | Matches x zero or once. Priority: high. |
xy | Matches x, then y. Priority: medium. |
x|y | Matches either x or y. Priority: low. |
( ) | The order of evaluation may be changed by using parentheses. |
If the first character of the Regxp-2 query is a plus character (+), it marks the beginning of a section with non-standard specifiers. The next plus character marks the end of the section. Currently Zebra only supports one specifier, the error tolerance, which consists of one digit.
Since the plus operator is normally a suffix operator the addition to the query syntax doesn't violate the syntax for standard regular expressions.
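The specifier section can be separated from the regular expression with a few lines of Python. This sketch only mirrors the description above and is not taken from Zebra's source: the function name `split_regxp2` is hypothetical, and the fallback tolerance of 1 reflects the default of a single accepted error mentioned earlier.

```python
def split_regxp2(term, default_tolerance=1):
    """Split a Regxp-2 term into (error_tolerance, regular_expression).

    A leading "+<digit>+" section sets the tolerance; without it the
    default applies. (Hypothetical helper, written for illustration only.)
    """
    if not term.startswith("+"):
        return default_tolerance, term
    end = term.find("+", 1)  # the next plus closes the specifier section
    spec = term[1:end] if end != -1 else ""
    if len(spec) != 1 or spec not in "0123456789":
        raise ValueError("expected one digit between plus characters: %r" % term)
    return int(spec), term[end + 1:]


print(split_regxp2("+2+informat.*"))
print(split_regxp2("informat.*"))
```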
For example, a phrase search with regular expressions in the title-register is performed like this:
Z> find @attr 1=4 @attr 5=102 "informat.* retrieval"
Combinations with other attributes are possible. For example, a ranked search with a regular expression:
Z> find @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"