3. Extended Zebra RPN Features

The Zebra internal query engine has been extended to meet specific needs not covered by the bib-1 attribute set query model. These extensions are non-standard and non-portable: most functional extensions are modeled on the bib-1 attribute set, defining attribute types 7 and higher. There are also special string index names for the idxpath attribute set.

3.1. Zebra specific retrieval of all records

Zebra defines a hardwired string index name called _ALLRECORDS. It matches any record contained in the database, if used in conjunction with the relation attribute AlwaysMatches (103).

The _ALLRECORDS index name is used for total database export. The search term is ignored and may be empty.

      Z> find @attr 1=_ALLRECORDS @attr 2=103 ""

It can be combined with other indexes. For example, to find all records which are not indexed in the Title register, issue one of the two equivalent queries:

      Z> find @not @attr 1=_ALLRECORDS @attr 2=103 "" @attr 1=Title @attr 2=103 ""
      Z> find @not @attr 1=_ALLRECORDS @attr 2=103 "" @attr 1=4 @attr 2=103 ""

Warning

The special string index _ALLRECORDS is experimental, and the provided functionality and syntax may very well change in future releases of Zebra.

3.2. Zebra specific Search Extensions to all Attribute Sets

Zebra extends the BIB-1 attribute types, and these extensions are recognized regardless of attribute set used in a search operation query.

Table 5.9. Zebra Search Attribute Extensions

Name	Value	Operation	Zebra version
Embedded Sort	7	search	1.1
Term Set	8	search	1.1
Rank Weight	9	search	1.1
Term Reference	10	search	1.4
Local Approx Limit	11	search	1.4
Global Approx Limit	12	search	2.0.8
Maximum number of truncated terms (truncmax)	13	search	2.0.10
Specifies whether un-indexed fields should be ignored. A zero value (default) throws a diagnostic when an un-indexed field is specified. A non-zero value makes it return 0 hits.	14	search	2.0.16

3.2.1. Zebra Extension Embedded Sort Attribute (type 7)

The embedded sort is a way to specify sorting within a query, thus removing the need to send a Sort Request separately. It is faster, and it does not require clients to deal with the Sort Facility.

All ordering operations are based on a lexicographical ordering, except when the structure attribute numeric (109) is used. In this case, ordering is numerical. See Section 2.4.3, “Structure Attributes (type 4)”.

The possible values of attribute type 7 are 1 (ascending) and 2 (descending). The attributes+term (APT) node is separate from the rest of the query and must be @or'ed with it. The term associated with the APT is the sort level as an integer, where 0 means primary sort, 1 means secondary sort, and so forth. See also Section 9, “Relevance Ranking and Sorting of Result Sets”.

For example, to search for water and sort by title (ascending):

       Z> find @or @attr 1=1016 water @attr 7=1 @attr 1=4 0

Or, to search for water and sort by title ascending, then by date descending:

       Z> find @or @or @attr 1=1016 water @attr 7=1 @attr 1=4 0 @attr 7=2 @attr 1=30 1
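The pattern in the two queries above (one @or per sort key, with the sort level as the term) can be generated mechanically. The following Python sketch is only an illustration: the function name with_embedded_sort and its argument layout are hypothetical and not part of Zebra or YAZ, and the query passed in is assumed to be a complete PQF expression.

```python
def with_embedded_sort(query, sort_keys):
    """Append embedded sort APTs (attribute type 7) to a PQF query.

    sort_keys is a list of (use_attribute, direction) pairs in priority
    order; direction is "asc" or "desc". Hypothetical helper, not a Zebra API.
    """
    directions = {"asc": 1, "desc": 2}
    parts = [query]
    for level, (use, direction) in enumerate(sort_keys):
        # The term of each sort APT is its level: 0 = primary, 1 = secondary.
        parts.append(f"@attr 7={directions[direction]} @attr 1={use} {level}")
    # Every sort APT is @or'ed onto the query, so one @or per sort key.
    return " ".join(["@or"] * len(sort_keys) + parts)


print(with_embedded_sort("@attr 1=1016 water", [(4, "asc"), (30, "desc")]))
```

Called with the title (4) and date (30) keys, this prints the same query string as the second example above.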

3.2.2. Zebra Extension Rank Weight Attribute (type 9)

Rank weight is a way to pass a value to a ranking algorithm, so that one APT has one value while another has a different one. See also Section 9, “Relevance Ranking and Sorting of Result Sets”.

For example, to search for utah in title with weight 30 as well as in any with weight 20:

       Z> find @attr 2=102 @or @attr 9=30 @attr 1=4 utah @attr 9=20 utah
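As a small illustration of how each APT carries its own weight, the query above can be assembled from per-APT pieces. This is a sketch only; weighted_apt is a hypothetical helper name, not something provided by Zebra or YAZ.

```python
def weighted_apt(weight, term, use=None):
    """Build one APT carrying a rank weight (attribute type 9).

    Hypothetical helper: when use is None, no use attribute is emitted.
    """
    attrs = [f"@attr 9={weight}"]
    if use is not None:
        attrs.append(f"@attr 1={use}")
    return " ".join(attrs + [term])


# Relation 102 (relevance) applies to the whole @or; the weights are per APT.
query = "@attr 2=102 @or {} {}".format(
    weighted_apt(30, "utah", use=4),
    weighted_apt(20, "utah"),
)
print(query)
```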

3.2.3. Zebra Extension Term Reference Attribute (type 10)

Zebra supports the searchResult-1 facility. If the Term Reference Attribute (type 10) is given, it specifies the subqueryId value returned as part of the search result. It is a way for a client to name an APT part of a query.

Warning

Experimental. Do not use in production code.

3.2.4. Local Approximative Limit Attribute (type 11)

Zebra computes - unless otherwise configured - the exact hit count for every APT (leaf) in the query tree. These hit counts are returned as part of the searchResult-1 facility in the binary encoded Z39.50 search response packages.

When an estimation limit is set on the result set of an APT leaf, Zebra stops processing that result set once the limit is reached. Hit counts under this limit are still precise, but hit counts over it are estimated using the statistics gathered from the truncated result set.

Specifying a limit of 0 results in exact hit counts.

For example, we might be interested in the exact hit count for a, but for b we allow estimated hit counts at 1000 hits and higher.

       Z> find @and a @attr 11=1000 b

Note

The estimated hit count facility makes searches faster, as large hit lists only need to be processed partially. It is mostly used in huge databases, where you want to trade exactness of hit counts for speed of execution.

Warning

Do not use approximative hit count limits in conjunction with relevance ranking, as re-sorting of the result set only works when the entire result set has been processed.

3.2.5. Global Approximative Limit Attribute (type 12)

By default Zebra computes precise hit counts for a query as a whole. Setting attribute 12 makes it perform approximative hit counts instead. It has the same semantics as the estimatehits setting described in Section 2, “The Zebra Configuration File”.

The attribute (12) can occur anywhere in the query tree. Unlike regular attributes it does not relate to the leaf (APT) - but to the whole query.

Warning

Do not use approximative hit count limits in conjunction with relevance ranking, as re-sorting of the result set only works when the entire result set has been processed.

3.3. Zebra specific Scan Extensions to all Attribute Sets

Zebra extends the BIB-1 attribute types, and these extensions are recognized regardless of attribute set used in a scan operation query.

Table 5.10. Zebra Scan Attribute Extensions

Name	Type	Operation	Zebra version
Result Set Narrow	8	scan	1.3
Approximative Limit	12	scan	2.0.20

3.3.1. Zebra Extension Result Set Narrow (type 8)

If attribute Result Set Narrow (type 8) is given for scan, the value is the name of a result set. Each hit count in scan is @and'ed with the result set given.

Consider for example the case of scanning all title fields around the scanterm mozart, then refining the scan by issuing a filtering query for amadeus to restrict the scan to the result set of the query:

       Z> scan @attr 1=4 mozart
       ...
       * mozart (43)
       mozartforskningen (1)
       mozartiana (1)
       mozarts (16)
       ...
       Z> f @attr 1=4 amadeus
       ...
       Number of hits: 15, setno 2
       ...
       Z> scan @attr 1=4 @attr 8=2 mozart
       ...
       * mozart (14)
       mozartforskningen (0)
       mozartiana (0)
       mozarts (1)
       ...

Zebra 2.0.2 and later are able to skip terms with 0 hit counts. This, however, is known not to scale if the number of terms to skip is high, which is most likely to happen when the result set is small (and therefore produces many 0 hit counts).
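The effect of narrowing can be pictured as intersecting each scan term's record list with the named result set. The Python sketch below models this with plain sets. It is a conceptual illustration, not Zebra's implementation, and the postings, record ids, and function name are made up for the example.

```python
def narrowed_scan(postings, result_set, skip_zero=False):
    """Return (term, hit count) pairs with counts restricted to result_set.

    postings maps a scan term to the set of record ids containing it.
    All names and data here are hypothetical.
    """
    entries = []
    for term in sorted(postings):
        # "@and" of the term's records with the result set.
        count = len(postings[term] & result_set)
        if skip_zero and count == 0:
            continue  # corresponds to skipping terms with 0 hits
        entries.append((term, count))
    return entries


postings = {"mozart": {1, 2, 3, 4}, "mozartiana": {5}, "mozarts": {2, 6}}
amadeus_hits = {1, 2, 3}  # stands in for the result set of the amadeus query

print(narrowed_scan(postings, amadeus_hits))
print(narrowed_scan(postings, amadeus_hits, skip_zero=True))
```

With a small result set most terms intersect to zero, which is why skipping them can require walking a long run of scan terms.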

3.3.2. Zebra Extension Approximative Limit (type 12)

The Zebra Extension Approximative Limit (type 12) is a way to enable approximate hit counts for scan hit counts, in the same way as for search hit counts.

3.4. Zebra special IDXPATH Attribute Set for GRS-1 indexing

The attribute-set idxpath consists of a single Use (type 1) attribute. All non-use attributes behave as normal.

This feature is enabled when defining the xpath enable option in the GRS-1 filter *.abs configuration files. If one wants to use the special idxpath numeric attribute set, the main Zebra configuration file zebra.cfg directive attset: idxpath.att must be enabled.

Warning

The idxpath is deprecated, may not be supported in future Zebra versions, and should definitely not be used in production code.

3.4.1. IDXPATH Use Attributes (type = 1)

This attribute set allows one to search GRS-1 filter indexed records by XPATH like structured index names.

Warning

The idxpath option defines hard-coded index names, which might clash with your own index names.

Table 5.11. Zebra specific IDXPATH Use Attributes (type 1)

IDXPATH	Value	String Index	Notes
XPATH Begin	1	_XPATH_BEGIN	deprecated
XPATH End	2	_XPATH_END	deprecated
XPATH CData	1016	_XPATH_CDATA	deprecated
XPATH Attribute Name	3	_XPATH_ATTR_NAME	deprecated
XPATH Attribute CData	1015	_XPATH_ATTR_CDATA	deprecated

See tab/idxpath.att for more information.

Search for all documents starting with root element /root (either using the numeric or the string use attributes):

       Z> find @attrset idxpath @attr 1=1 @attr 4=3 root/
       Z> find @attr idxpath 1=1 @attr 4=3 root/
       Z> find @attr 1=_XPATH_BEGIN @attr 4=3 root/

Search for all documents where specific nested XPATH /c1/c2/../cn exists. Notice the very counter-intuitive reverse notation!

       Z> find @attrset idxpath @attr 1=1 @attr 4=3 cn/cn-1/../c1/
       Z> find @attr 1=_XPATH_BEGIN @attr 4=3 cn/cn-1/../c1/
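Because the reversed notation is easy to get wrong, a client might derive the search term from an ordinary path. The Python sketch below does this for simple absolute element paths only; the helper name idxpath_begin_term is hypothetical, and the sketch assumes plain element names without predicates or attributes.

```python
def idxpath_begin_term(xpath):
    """Turn an absolute element path such as /c1/c2/c3 into the reversed
    term form shown above for _XPATH_BEGIN searches (c3/c2/c1/).

    Hypothetical helper; handles plain element steps only.
    """
    if not xpath.startswith("/"):
        raise ValueError("expected an absolute path starting with '/'")
    steps = [step for step in xpath.split("/") if step]
    if not steps:
        raise ValueError("path contains no element names")
    # Innermost element first, each step followed by a slash.
    return "/".join(reversed(steps)) + "/"


print(idxpath_begin_term("/root"))
print(idxpath_begin_term("/c1/c2/c3"))
```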

Search for CDATA string text in any element:

       Z> find @attrset idxpath @attr 1=1016 text
       Z> find @attr 1=_XPATH_CDATA text

Search for CDATA string anothertext in any attribute:

       Z> find @attrset idxpath @attr 1=1015 anothertext
       Z> find @attr 1=_XPATH_ATTR_CDATA anothertext

Search for all documents which have an XML element node including an XML attribute named creator:

       Z> find @attrset idxpath @attr 1=3 @attr 4=3 creator
       Z> find @attr 1=_XPATH_ATTR_NAME @attr 4=3 creator

Combining usual bib-1 attribute set searches with idxpath attribute set searches:

       Z> find @and @attr idxpath 1=1 @attr 4=3 link/ @attr 1=4 mozart
       Z> find @and @attr 1=_XPATH_BEGIN @attr 4=3 link/ @attr 1=_XPATH_CDATA mozart

Scanning is supported on all idxpath indexes, whether specified as numeric use attributes or as string index names:

       Z> scan  @attrset idxpath @attr 1=1016 text
       Z> scan  @attr 1=_XPATH_ATTR_CDATA anothertext
       Z> scan  @attrset idxpath @attr 1=3 @attr 4=3 ''

3.5. Mapping from PQF atomic APT queries to Zebra internal register indexes

The rules for PQF APT mapping are rather tricky to grasp in the first place. We deal first with the rules for deciding which internal register or string index to use, according to the use attribute or access point specified in the query. Thereafter we deal with the rules for determining the correct structure type of the named register.

3.5.1. Mapping of PQF APT access points

Zebra understands four fundamentally different types of access points, of which only the numeric use attribute type access points are defined by the Z39.50 standard. All other access point types are Zebra specific, and non-portable.

Table 5.12. Access point name mapping

Access Point	Type	Grammar	Notes
Use attribute	numeric	[1-9][0-9]*	directly mapped to string index name
String index name	string	[a-zA-Z](\-?[a-zA-Z0-9])*	normalized name is used as internal string index name
Zebra internal index name	zebra	_[a-zA-Z](_?[a-zA-Z0-9])*	hardwired internal string index name
XPATH special index	XPath	/.*	special xpath search for GRS-1 indexed records

Attribute set names and string index names are normalized according to the following rules: all single hyphens '-' are stripped, and all upper case letters are folded to lower case.

Numeric use attributes are mapped to the Zebra internal string index according to the attribute set definition in use. The default attribute set is BIB-1, and may be omitted in the PQF query.

According to normalization and numeric use attribute mapping, it follows that the following PQF queries are considered equivalent (assuming the default configuration has not been altered):

       Z> find  @attr 1=Body-of-text serenade
       Z> find  @attr 1=bodyoftext serenade
       Z> find  @attr 1=BodyOfText serenade
       Z> find  @attr 1=bO-d-Y-of-tE-x-t serenade
       Z> find  @attr 1=1010 serenade
       Z> find  @attrset bib1 @attr 1=1010 serenade
       Z> find  @attrset Bib1 @attr 1=1010 serenade
       Z> find  @attrset b-I-b-1 @attr 1=1010 serenade
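The normalization rule stated above (strip hyphens, fold to lower case) is simple enough to sketch in Python. This only illustrates the rule as described in this section; normalize_name is a hypothetical name, not a Zebra function.

```python
def normalize_name(name):
    """Normalize an attribute set or string index name as described above:
    hyphens are stripped and upper case letters are folded to lower case.
    """
    return name.replace("-", "").lower()


index_names = ["Body-of-text", "bodyoftext", "BodyOfText", "bO-d-Y-of-tE-x-t"]
attrset_names = ["bib1", "Bib1", "b-I-b-1"]

# Every spelling collapses to a single normalized name.
print({normalize_name(n) for n in index_names})
print({normalize_name(n) for n in attrset_names})
```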

The numerical use attributes (type 1) are interpreted according to the attribute sets which have been loaded in the zebra.cfg file, and are matched against specific fields as specified in the .abs file which describes the profile of the records which have been loaded. If no use attribute is provided, a default of BIB-1 Use Any (1016) is assumed. The predefined use attribute sets can be reconfigured by tweaking the configuration files tab/*.att, and new attribute sets can be defined by adding similar files in the configuration path profilePath of the server.

String indexes can be accessed directly, independently of which attribute set is in use; the attribute set is simply ignored. The above mentioned name normalization applies. String index names are defined in the configuration files of the indexing filter in use, for example in the GRS-1 *.abs configuration files, or in the alvis filter XSLT indexing stylesheets.

Zebra internal indexes can be accessed directly, according to the same rules as the user defined string indexes. The only difference is that Zebra internal index names are hardwired, all uppercase and must start with the character '_'.

Finally, XPATH access points are only available using the GRS-1 filter for indexing. These access point names must start with the character '/', they are not normalized, but passed unaltered to the Zebra internal XPATH engine. See Section 2.1.6, “Zebra's special access point of type 'XPath' for GRS-1 filters”.
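The four access point types can be told apart by the grammars in Table 5.12. The Python sketch below classifies a use attribute value accordingly. It is an illustration of the table, not Zebra's parser; the function name is hypothetical, and the numeric pattern is assumed to be [1-9][0-9]* so that values such as 1010 and 1016 qualify.

```python
import re

# Patterns follow Table 5.12; the numeric pattern is an assumption (see above).
ACCESS_POINT_PATTERNS = [
    ("numeric", re.compile(r"[1-9][0-9]*")),
    ("string", re.compile(r"[a-zA-Z](-?[a-zA-Z0-9])*")),
    ("zebra", re.compile(r"_[a-zA-Z](_?[a-zA-Z0-9])*")),
    ("xpath", re.compile(r"/.*")),
]


def classify_access_point(value):
    """Return the access point type of value, or None if no grammar matches."""
    for kind, pattern in ACCESS_POINT_PATTERNS:
        if pattern.fullmatch(value):
            return kind
    return None


for value in ["1016", "Body-of-text", "_ALLRECORDS", "/root/title"]:
    print(value, classify_access_point(value))
```

The grammars do not overlap on their first character (digit, letter, underscore, slash), so the order of the patterns does not affect the result.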

3.5.2. Mapping of PQF APT structure and completeness to register type

Internally Zebra has in its default configuration several different types of registers or indexes, whose tokenization and character normalization rules differ. This reflects the fact that searching fundamentally different tokens like dates, numbers, bitfields and string based text requires different rule sets.

Table 5.13. Structure and completeness mapping to register types

Structure	Completeness	Register type	Notes
phrase (@attr 4=1), word (@attr 4=2), word-list (@attr 4=6), free-form-text (@attr 4=105), or document-text (@attr 4=106)	Incomplete field (@attr 6=1)	Word ('w')	Traditional tokenized and character normalized word index
phrase (@attr 4=1), word (@attr 4=2), word-list (@attr 4=6), free-form-text (@attr 4=105), or document-text (@attr 4=106)	Complete field (@attr 6=3)	Phrase ('p')	Character normalized, but not tokenized index for phrase matches
urx (@attr 4=104)	ignored	URX/URL ('u')	Special index for URL web addresses
numeric (@attr 4=109)	ignored	Numeric ('n')	Special index for digital numbers
key (@attr 4=3)	ignored	Null bitmap ('0')	Used for non-tokenized and non-normalized bit sequences
year (@attr 4=4)	ignored	Year ('y')	Non-tokenized and non-normalized 4 digit numbers
date (@attr 4=5)	ignored	Date ('d')	Non-tokenized and non-normalized ISO date strings
ignored	ignored	Sort ('s')	Used with special sort attribute set (@attr 7=1, @attr 7=2)
overruled	overruled	special	Internal record ID register, used whenever Relation Always Matches (@attr 2=103) is specified
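Read as a lookup, the first seven rows of Table 5.13 can be sketched in Python as follows. This covers only the structure and completeness values listed in the table; the sort register and the Always Matches override are left out, and register_type is a hypothetical name rather than part of Zebra.

```python
# Structures that share the word/phrase registers (Table 5.13, rows 1 and 2):
# phrase, word, word-list, free-form-text, document-text.
WORD_LIKE = {1, 2, 6, 105, 106}

# Structures with a dedicated register; completeness is ignored for these.
DEDICATED = {104: "u", 109: "n", 3: "0", 4: "y", 5: "d"}


def register_type(structure, completeness=1):
    """Return the register type character for a structure/completeness pair,
    or None for combinations the table does not list.
    """
    if structure in DEDICATED:
        return DEDICATED[structure]
    if structure in WORD_LIKE:
        if completeness == 1:  # incomplete field
            return "w"
        if completeness == 3:  # complete field
            return "p"
    return None


print(register_type(1))       # phrase, incomplete field
print(register_type(1, 3))    # phrase, complete field
print(register_type(109, 3))  # numeric, completeness ignored
```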

If a Structure attribute of Phrase is used in conjunction with a Completeness attribute of Complete (Sub)field, the term is matched against the contents of the phrase (long word) register, if one exists for the given Use attribute. A phrase register is created for those fields in the GRS-1 *.abs file that contain a p-specifier.

       Z> scan @attr 1=Title @attr 4=1 @attr 6=3 beethoven
       ...
       bayreuther festspiele (1)
       * beethoven bibliography database (1)
       benny carter (1)
       ...
       Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography"
       ...
       Number of hits: 0, setno 5
       ...
       Z> find @attr 1=Title @attr 4=1 @attr 6=3 "beethoven bibliography database"
       ...
       Number of hits: 1, setno 6

If Structure=Phrase is used in conjunction with Incomplete Field (the default value for Completeness), the search is directed against the normal word registers, but if the term contains multiple words, the term will only match if all of the words are found immediately adjacent, and in the given order. The word search is performed on those fields that are indexed as type w in the GRS-1 *.abs file.

       Z> scan @attr 1=Title @attr 4=1 @attr 6=1 beethoven
       ...
       beefheart (1)
       * beethoven (18)
       beethovens (7)
       ...
       Z> find @attr 1=Title @attr 4=1 @attr 6=1 beethoven
       ...
       Number of hits: 18, setno 1
       ...
       Z> find @attr 1=Title @attr 4=1 @attr 6=1 "beethoven  bibliography"
       ...
       Number of hits: 2, setno 2
       ...

If the Structure attribute is Word List, Free-form Text, or Document Text, the term is treated as a natural-language, relevance-ranked query. This search type uses the word register, i.e. those fields that are indexed as type w in the GRS-1 *.abs file.

If the Structure attribute is Numeric String the term is treated as an integer. The search is performed on those fields that are indexed as type n in the GRS-1 *.abs file.

If the Structure attribute is URX the term is treated as a URX (URL) entity. The search is performed on those fields that are indexed as type u in the *.abs file.

If the Structure attribute is Local Number the term is treated as native Zebra Record Identifier.

If the Relation attribute is Equals (default), the term is matched in a normal fashion (modulo truncation and processing of individual words, if required). If Relation is Less Than, Less Than or Equal, Greater than, or Greater than or Equal, the term is assumed to be numerical, and a standard regular expression is constructed to match the given expression. If Relation is Relevance, the standard natural-language query processor is invoked.

For the Truncation attribute, No Truncation is the default. Left Truncation is not supported. Process # in search term is supported, as is Regxp-1. Regxp-2 enables the fault-tolerant (fuzzy) search. As a default, a single error (deletion, insertion, replacement) is accepted when terms are matched against the register contents.

3.6. Zebra Regular Expressions in Truncation Attribute (type = 5)

Each term in a query is interpreted as a regular expression if the truncation value is either Regxp-1 (@attr 5=102) or Regxp-2 (@attr 5=103). Both query types follow the same syntax with the operands:

Table 5.14. Regular Expression Operands

`x`	Matches the character `x`.
`.`	Matches any character.
`[ .. ]`	Matches the set of characters specified; such as `[abc]` or `[a-c]`.

The above operands can be combined with the following operators:

Table 5.15. Regular Expression Operators

`x*`	Matches `x` zero or more times. Priority: high.
`x+`	Matches `x` one or more times. Priority: high.
`x?`	Matches `x` zero or once. Priority: high.
`xy`	Matches `x`, then `y`. Priority: medium.
`x|y`	Matches either `x` or `y`. Priority: low.
`( )`	The order of evaluation may be changed by using parentheses.

If the first character of the Regxp-2 query is a plus character (+), it marks the beginning of a section with non-standard specifiers. The next plus character marks the end of the section. Currently Zebra only supports one specifier, the error tolerance, which consists of one digit.

Since the plus operator is normally a suffix operator the addition to the query syntax doesn't violate the syntax for standard regular expressions.
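The specifier section can be separated from the expression with a few lines of string handling. The Python sketch below follows the description above: a leading plus opens the section, the next plus closes it, and the section holds a single digit. The function name is hypothetical, the default tolerance of 1 is taken from the description of Regxp-2 in Section 3.5.2, and rejecting a malformed section with an error is a choice made for this sketch rather than documented Zebra behaviour.

```python
def split_regxp2_term(term, default_tolerance=1):
    """Split a Regxp-2 term into (error tolerance, regular expression).

    Hypothetical helper illustrating the "+N+" prefix described above.
    """
    if not term.startswith("+"):
        # No specifier section: the whole term is the expression.
        return default_tolerance, term
    end = term.find("+", 1)
    spec = term[1:end] if end != -1 else ""
    if len(spec) != 1 or spec not in "0123456789":
        raise ValueError("specifier section must be a single digit between two '+'")
    return int(spec), term[end + 1:]


print(split_regxp2_term("+2+informat.*"))
print(split_regxp2_term("informat.*"))
```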

For example, a phrase search with regular expressions in the title-register is performed like this:

      Z> find @attr 1=4 @attr 5=102 "informat.* retrieval"

Combinations with other attributes are possible. For example, a ranked search with a regular expression:

      Z> find @attr 1=4 @attr 5=102 @attr 2=102 "informat.* retrieval"
