2. DOM XML filter pipeline configuration

The experimental, loadable DOM XML/XSLT filter module mod-dom.so is invoked by the zebra.cfg configuration statement

     recordtype.xml: dom.db/filter_dom_conf.xml

In this example the DOM XML filter is configured to work on all data files with suffix *.xml, where the configuration file is found in the path db/filter_dom_conf.xml.

The DOM XSLT filter configuration file must be valid XML. It might look like this:

     
     <?xml version="1.0" encoding="UTF-8"?>
     <dom xmlns="http://indexdata.com/zebra-2.0">
       <input>
         <xmlreader level="1"/>
         <!-- <marc inputcharset="marc-8"/> -->
       </input>
       <extract>
         <xslt stylesheet="common2index.xsl"/>
       </extract>
       <store>
         <xslt stylesheet="common2store.xsl"/>
       </store>
       <retrieve name="dc">
         <xslt stylesheet="store2dc.xsl"/>
       </retrieve>
       <retrieve name="mods">
         <xslt stylesheet="store2mods.xsl"/>
       </retrieve>
     </dom>

The root XML element <dom> and all other DOM XML filter elements reside in the namespace xmlns="http://indexdata.com/zebra-2.0".

All pipeline definition elements (i.e. the <input>, <extract>, <store>, and <retrieve> elements) are optional. Missing pipeline definitions are interpreted as do-nothing identity pipelines.

All pipeline definition elements may contain zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformation instructions, which are performed sequentially from top to bottom. The paths in the stylesheet attributes are either relative to Zebra's working directory or absolute paths from the file system root.
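To make the structure of the configuration file concrete, the following Python sketch reads a configuration like the one above and lists each pipeline with its stylesheets in document order. It is only an illustration using the standard library; the function name list_pipelines and the label format are assumptions made for this example and are not part of Zebra.

```python
import xml.etree.ElementTree as ET

ZEBRA_NS = "http://indexdata.com/zebra-2.0"

# Same structure as the example configuration in this section.
CONFIG = """\
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
  </input>
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>
</dom>
"""


def list_pipelines(config_xml):
    """Return (label, [stylesheet paths]) pairs in document order.

    Hypothetical helper for illustration; Zebra itself does this in C.
    """
    root = ET.fromstring(config_xml)
    if root.tag != f"{{{ZEBRA_NS}}}dom":
        raise ValueError("root element must be <dom> in the Zebra namespace")
    pipelines = []
    for child in root:
        kind = child.tag.split("}", 1)[-1]      # strip the namespace part
        name = child.get("name")                # only <retrieve> carries a name
        label = f"{kind}[{name}]" if name else kind
        sheets = [x.get("stylesheet")
                  for x in child.findall(f"{{{ZEBRA_NS}}}xslt")]
        pipelines.append((label, sheets))
    return pipelines


PIPELINES = list_pipelines(CONFIG)
for label, sheets in PIPELINES:
    print(label, sheets)
```

A pipeline with an empty stylesheet list, such as the <input> pipeline here, corresponds to an identity transformation.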

2.1. Input pipeline

The <input> pipeline definition element may contain either one XML Reader definition <xmlreader level="1"/>, used to split an XML collection input stream into individual XML DOM documents at the prescribed element level, or one MARC binary parsing instruction <marc inputcharset="marc-8"/>, which defines a conversion to MARCXML format DOM trees. The allowed values of the inputcharset attribute depend on your local iconv™ set-up.

Both input parsers deliver individual DOM XML documents to the following chain of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations. At the end of this pipeline, the documents are in the common format, used to feed both the <extract> and <store> pipelines.
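The splitting performed by <xmlreader level="1"/> can be pictured with a small Python sketch. It assumes that the root element is at depth 0, so that level 1 selects the direct children of the root; that depth convention, the function name split_at_level, and the sample collection are assumptions for illustration. Unlike a streaming XML reader, this sketch loads the whole collection into memory.

```python
import xml.etree.ElementTree as ET


def split_at_level(xml_text, level):
    """Return the serialized subtrees of all elements at the given depth.

    Assumption: the root element has depth 0.
    """
    root = ET.fromstring(xml_text)
    current = [root]
    for _ in range(level):
        # Descend one level: collect the children of every current element.
        current = [child for elem in current for child in elem]
    records = []
    for elem in current:
        elem.tail = None  # drop whitespace that follows the element
        records.append(ET.tostring(elem, encoding="unicode"))
    return records


COLLECTION = (
    "<collection>"
    "<record><title>A</title></record>"
    "<record><title>B</title></record>"
    "</collection>"
)

RECORDS = split_at_level(COLLECTION, 1)
for record in RECORDS:
    print(record)
```

Each string in RECORDS stands for one individual document that would be passed on to the XSLT transformations of the <input> pipeline.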

2.2. Extract pipeline

The <extract> pipeline takes documents from any common DOM XML format to the Zebra-specific indexing DOM XML format. It may consist of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations, and the outcome is handed to the Zebra core to drive the process of building the inverted indexes. See Section 2.5, “Canonical Indexing Format” for details.

2.3. Store pipeline

The <store> pipeline takes documents from any common DOM XML format to the Zebra-specific storage DOM XML format. It may consist of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations, and the outcome is handed to the Zebra core for deposition into the internal storage system.

2.4. Retrieve pipeline

Finally, there may be one or more <retrieve> pipeline definitions, each of them again consisting of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations. These are used for document presentation after search, and take the internal storage DOM XML to the requested output formats during record present requests.

Multiple <retrieve> pipeline definitions are distinguished by their unique name attributes. These names are the literal schema or element set names used in SRW, SRU and Z39.50 protocol queries.

2.5. Canonical Indexing Format

DOM XML indexing comes in two flavors: pure processing-instruction governed plain XML documents, and - very similar to the Alvis filter indexing format - XML documents containing XML <record> and <index> instructions from the magic namespace xmlns:z="http://indexdata.com/zebra-2.0".

2.5.1. Processing-instruction governed indexing format

The output of the processing instruction driven indexing XSLT stylesheets must contain processing instructions named zebra-2.0. The output of the XSLT indexing transformation is then parsed using DOM methods, and the contained instructions are performed on the elements and their subtrees directly following the processing instructions.

For example, the output of the command

       xsltproc dom-index-pi.xsl marc-one.xml

might look like this:

       
       <?xml version="1.0" encoding="UTF-8"?>
       <?zebra-2.0 record id=11224466 rank=42?>
       <record>
         <?zebra-2.0 index control:0?>
         <control>11224466</control>
         <?zebra-2.0 index any:w title:w title:p title:s?>
         <title>How to program a computer</title>
       </record>
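How the processing instructions relate to the elements that follow them can be sketched in Python with the standard xml.dom.minidom module. The sketch walks the DOM, reads each zebra-2.0 processing instruction, and applies an index instruction to the next sibling element and its subtree. The helper names and the returned data shapes are assumptions for illustration only; character normalization and the actual index building done by Zebra are not modelled.

```python
from xml.dom import minidom

PI_TARGET = "zebra-2.0"

DOC = b"""<?xml version="1.0" encoding="UTF-8"?>
<?zebra-2.0 record id=11224466 rank=42?>
<record>
  <?zebra-2.0 index control:0?>
  <control>11224466</control>
  <?zebra-2.0 index any:w title:w title:p title:s?>
  <title>How to program a computer</title>
</record>
"""


def text_of(node):
    """Concatenate all text found in node and its subtree."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return "".join(text_of(child) for child in node.childNodes)


def next_element(node):
    """Return the first element sibling following node, or None."""
    sibling = node.nextSibling
    while sibling is not None and sibling.nodeType != sibling.ELEMENT_NODE:
        sibling = sibling.nextSibling
    return sibling


def collect(node, record, postings):
    """Fill record attributes and (index pair, text) postings recursively."""
    for child in node.childNodes:
        if (child.nodeType == child.PROCESSING_INSTRUCTION_NODE
                and child.target == PI_TARGET):
            words = child.data.split()
            if not words:
                continue
            if words[0] == "record":
                for word in words[1:]:
                    key, _, value = word.partition("=")
                    record[key] = value
            elif words[0] == "index":
                target = next_element(child)
                if target is not None:
                    text = text_of(target).strip()
                    postings.extend((pair, text) for pair in words[1:])
        elif child.nodeType == child.ELEMENT_NODE:
            collect(child, record, postings)


RECORD, POSTINGS = {}, []
collect(minidom.parseString(DOC), RECORD, POSTINGS)
print(RECORD)
for pair, text in POSTINGS:
    print(pair, "<-", text)
```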

2.5.2. Magic element governed indexing format

The output of the indexing XSLT stylesheets must contain certain elements in the magic xmlns:z="http://indexdata.com/zebra-2.0" namespace. The output of the XSLT indexing transformation is then parsed using DOM methods, and the contained instructions are performed on the magic elements and their subtrees.

For example, the output of the command

       xsltproc dom-index-element.xsl marc-one.xml

might look like this:

       
       <?xml version="1.0" encoding="UTF-8"?>
       <z:record xmlns:z="http://indexdata.com/zebra-2.0"
                 z:id="11224466" z:rank="42">
         <z:index name="control:0">11224466</z:index>
         <z:index name="any:w title:w title:p title:s">
         How to program a computer</z:index>
       </z:record>
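The magic element format can be read with any namespace-aware XML parser. The following Python sketch extracts the record attributes and the index pairs from a document like the one above. The function name read_record and the returned data shapes are assumptions for illustration; stripping the surrounding whitespace is a simplification made here, whereas Zebra applies its own character normalization.

```python
import xml.etree.ElementTree as ET

Z = "{http://indexdata.com/zebra-2.0}"

DOC = """\
<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:rank="42">
  <z:index name="control:0">11224466</z:index>
  <z:index name="any:w title:w title:p title:s">
  How to program a computer</z:index>
</z:record>
"""


def read_record(xml_text):
    """Return (record attributes, [(index pair, text), ...])."""
    root = ET.fromstring(xml_text)
    # iter() also yields the root itself, so a z:record root is found too.
    record_elem = next(root.iter(Z + "record"), None)
    if record_elem is None:
        raise ValueError("no z:record instruction found")
    record = {
        "id": record_elem.get(Z + "id"),
        "rank": record_elem.get(Z + "rank"),
    }
    postings = []
    for index in record_elem.iter(Z + "index"):
        # itertext() gathers all text recursively contained in the element.
        text = "".join(index.itertext()).strip()
        for pair in index.get("name", "").split():
            postings.append((pair, text))
    return record, postings


RECORD, POSTINGS = read_record(DOC)
print(RECORD)
for pair, text in POSTINGS:
    print(pair, "<-", text)
```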

2.5.3. Semantics of the indexing formats

Both indexing formats are defined with equal semantics and behavior in mind:

Zebra specific instructions are either processing instructions named zebra-2.0 or elements contained in the namespace xmlns:z="http://indexdata.com/zebra-2.0".
There must be exactly one record instruction, which sets the scope for the following, possibly nested index and group instructions.
The unique record instruction may have the additional attributes id, rank and type. The id attribute is the value of the opaque ID and may be any string not containing the whitespace character ' '. The rank attribute value must be a non-negative integer; see Section 9, “Relevance Ranking and Sorting of Result Sets”. The type attribute specifies how the record is to be treated. The following values may be given for type:
insert
The record is inserted. If the record already exists, it is skipped (i.e. not replaced).
replace
The record is replaced. If the record does not already exist, it is skipped (i.e. not inserted).
delete
The record is deleted. If the record does not already exist, a warning is issued and the rest of the records in the input stream are skipped.
update
The record is inserted or replaced depending on whether the record exists or not. This is the default behavior, but it may be overridden from "outside" the scope of the DOM filter by zebraidx commands or Extended Services updates.
adelete
The record is deleted. If the record does not already exist, it is skipped (i.e. nothing is deleted).
Note
Requires version 2.0.54 or later.
Note that the value of type is used to determine the action only if the Zebra indexer is running in "update" mode (i.e. zebraidx update) or if the specialUpdate action of the Extended Service Update is used. For this reason, a specialUpdate may end up deleting records!
Multiple and possibly nested index instructions must each contain at least one indexname:indextype pair, and may contain multiple such pairs separated by the whitespace character ' '. In each index pair, the name and the type of the index are separated by a colon character ':'.
Any index name that consists of ASCII letters and follows the standard Zebra rules will do; see Section 3.5.1, “Mapping of PQF APT access points”.
Index types are restricted to the values defined in the standard configuration file default.idx, see Section 2.3, “BIB-1 Attribute Set” and Chapter 10, Field Structure and Character Sets for details.
DOM input documents that do not yield both exactly one valid record instruction and one or more valid index instructions cannot be searched and found. Processing of such invalid documents is therefore aborted, and any content of the <extract> and <store> pipelines is discarded. A warning is issued in the logs.
The group instruction can be used to group indexing material for proximity search, so that a search can require that all terms occur within the same group. It takes an optional unit attribute, which can be one of the known Z39.50 proximity units: sentence (3), paragraph (4), section (5), chapter (6), document (7), element (8), subelement (9), elementType (10). If omitted, the unit element is used.
For example, to search within the same group of unit type chapter, the corresponding Z39.50 proximity search would be: @prox 0 0 0 0 k 6 leftop rightop
Note
The group facility requires Zebra 2.1.0 or later.
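The effect of the type attribute values described in the list above can be summarized as a decision table. The Python sketch below merely restates that list; the function name resolve_action and the outcome labels are hypothetical and do not correspond to anything in the Zebra source code. As stated above, the type value only takes effect in "update" mode or with the specialUpdate action.

```python
# Outcome for each (type, record already exists?) combination,
# restating the descriptions of insert, replace, delete, update and adelete.
ACTIONS = {
    ("insert", False): "insert",
    ("insert", True): "skip",
    ("replace", True): "replace",
    ("replace", False): "skip",
    ("delete", True): "delete",
    ("delete", False): "warn-and-skip-rest",  # rest of input stream skipped
    ("update", True): "replace",
    ("update", False): "insert",
    ("adelete", True): "delete",
    ("adelete", False): "skip",
}


def resolve_action(record_type, exists):
    """Return a hypothetical outcome label for a record instruction.

    A missing type is treated as "update", the documented default.
    """
    key = (record_type or "update", bool(exists))
    if key not in ACTIONS:
        raise ValueError(f"unknown record type: {record_type!r}")
    return ACTIONS[key]


print(resolve_action("insert", True))
print(resolve_action(None, False))
print(resolve_action("adelete", False))
```

The only difference between delete and adelete in this table is the outcome for a record that does not exist.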

The examples work as follows: From the original XML file marc-one.xml (or from an XML record DOM of the same form coming from an <input> pipeline), the <extract> indexing pipeline produces an indexing XML record, which is defined by the record instruction. Zebra uses the content of z:id="11224466" or id=11224466 as the internal record ID and, if static ranking is set, the content of rank=42 or z:rank="42" as the static rank.

In these examples, the following literal indexes are constructed:

       any:w
       control:0
       title:w
       title:p
       title:s

where the indexing type is defined after the literal ':' character. Any value from the standard configuration file default.idx will do. Finally, any text() node content recursively contained inside the <z:index> element, or inside any element following an index processing instruction, is filtered through the appropriate character map for normalization and inserted into the named indexes.
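The syntax of the index pairs is simple enough to check mechanically. The following Python sketch splits an index specification into (name, type) pairs according to the rules given above: pairs are separated by whitespace, and name and type by a colon. The function name parse_index_spec is an assumption for illustration, and the sketch does not check the type against default.idx.

```python
def parse_index_spec(spec):
    """Split 'name:type name:type ...' into a list of (name, type) pairs."""
    pairs = []
    for item in spec.split():
        name, sep, index_type = item.partition(":")
        if not (sep and name and index_type):
            raise ValueError(f"malformed index pair: {item!r}")
        pairs.append((name, index_type))
    if not pairs:
        raise ValueError("an index instruction needs at least one pair")
    return pairs


PAIRS = parse_index_spec("any:w title:w title:p title:s")
for name, index_type in PAIRS:
    print(name, index_type)
```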

Finally, this example configuration can be queried using PQF queries, either transported by Z39.50 (here using yaz-client)

       
       Z> open localhost:9999
       Z> elem dc
       Z> form xml
       Z>
       Z> find @attr 1=control @attr 4=3 11224466
       Z> scan @attr 1=control @attr 4=3 ""
       Z>
       Z> find @attr 1=title program
       Z> scan @attr 1=title ""
       Z>
       Z> find @attr 1=title @attr 4=2 "How to program a computer"
       Z> scan @attr 1=title @attr 4=2 ""

or by the proprietary extensions x-pquery and x-pScanClause to SRU and SRW

       
       http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr 1=title program
       http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr 1=title ""
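The PQF queries in these URLs contain spaces, quotes and other characters that many HTTP clients do not send unescaped. A minimal Python sketch for building percent-encoded request URLs is shown below; the function name sru_url is an assumption for illustration, while the parameter names and the host are taken from the examples above.

```python
from urllib.parse import quote, urlencode

# Proprietary PQF parameter for each SRU operation, as used above.
PQF_PARAM = {"searchRetrieve": "x-pquery", "scan": "x-pScanClause"}


def sru_url(base, operation, pqf):
    """Build a percent-encoded SRU URL carrying a PQF query."""
    params = {
        "version": "1.1",
        "operation": operation,
        PQF_PARAM[operation]: pqf,
    }
    # quote_via=quote encodes spaces as %20 instead of '+'.
    return f"{base}?{urlencode(params, quote_via=quote)}"


SEARCH_URL = sru_url("http://localhost:9999/", "searchRetrieve",
                     "@attr 1=title program")
SCAN_URL = sru_url("http://localhost:9999/", "scan", '@attr 1=title ""')
print(SEARCH_URL)
print(SCAN_URL)
```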

See the section called “The SRU Server” for more information on SRU/SRW configuration, and the section called “YAZ server virtual hosts” or the YAZ CQL section for the details of the YAZ frontend server.

Notice that there are no *.abs, *.est, *.map, or other GRS-1 filter configuration files involved in this process, and that the literal index names are used during search and retrieval.

If we want to support the usual bib-1 Z39.50 numeric access points, it is a good idea to choose string index names defined in the default configuration file tab/bib1.att; see Section 3.4, “The Attribute Set (.att) Files”.