The experimental, loadable DOM XML/XSLT filter module mod-dom.so is invoked by the zebra.cfg configuration statement
recordtype.xml: dom.db/filter_dom_conf.xml
In this example, the DOM XML filter is configured to work on all data files with the suffix *.xml, and the filter configuration file is found at the path db/filter_dom_conf.xml.
The DOM XSLT filter configuration file must be valid XML. It might look like this:
<?xml version="1.0" encoding="UTF8"?>
<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
    <!-- <marc inputcharset="marc-8"/> -->
  </input>
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>
</dom>
The root XML element <dom> and all other DOM XML filter elements reside in the namespace xmlns="http://indexdata.com/zebra-2.0".
All pipeline definition elements, i.e. the <input>, <extract>, <store>, and <retrieve> elements, are optional. Missing pipeline definitions are simply interpreted as do-nothing identity pipelines.
All pipeline definition elements may contain zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformation instructions, which are performed sequentially from top to bottom. The paths in the stylesheet attributes are relative to Zebra's working directory, or absolute to the file system root.
The <input> pipeline definition element may contain either one XML Reader definition <xmlreader level="1"/>, used to split an XML collection input stream into individual XML DOM documents at the prescribed element level, or one MARC binary parsing instruction <marc inputcharset="marc-8"/>, which defines a conversion to MARCXML format DOM trees. The allowed values of the inputcharset attribute depend on your local iconv™ set-up.
Both input parsers deliver individual DOM XML documents to the
following chain of zero or more
<xslt stylesheet="path/file.xsl"/>
XSLT transformations. At the end of this pipeline, the documents
are in the common format, used to feed both the
<extract>
and
<store>
pipelines.
The <extract> pipeline takes documents from any common DOM XML format to the Zebra specific indexing DOM XML format. It may consist of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations, and the outcome is handed to the Zebra core to drive the process of building the inverted indexes. See Section 2.5, “Canonical Indexing Format” for details.
The <store> pipeline takes documents from any common DOM XML format to the Zebra specific storage DOM XML format. It may consist of zero or more <xslt stylesheet="path/file.xsl"/> XSLT transformations, and the outcome is handed to the Zebra core for deposition into the internal storage system.
Finally, there may be one or more
<retrieve>
pipeline definitions, each
of them again consisting of zero or more
<xslt stylesheet="path/file.xsl"/>
XSLT transformations. These are used for document
presentation after search, and take the internal storage DOM
XML to the requested output formats during record present
requests.
Multiple <retrieve> pipeline definitions are distinguished by their unique name attributes. These are the literal schema or element set names used in SRW, SRU and Z39.50 protocol queries.
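To make the pipeline structure described above concrete, the following Python sketch reads a filter configuration and lists each pipeline together with its stylesheets in execution order. It is an illustration only and not part of Zebra; the function name summarize_pipelines and the "retrieve:NAME" label format are assumptions made for this example, and the embedded configuration repeats the one shown earlier without its XML declaration and comment.

```python
import xml.etree.ElementTree as ET

# Namespace of all DOM filter configuration elements (from the text above).
ZEBRA_NS = "http://indexdata.com/zebra-2.0"

CONFIG = """<dom xmlns="http://indexdata.com/zebra-2.0">
  <input>
    <xmlreader level="1"/>
  </input>
  <extract>
    <xslt stylesheet="common2index.xsl"/>
  </extract>
  <store>
    <xslt stylesheet="common2store.xsl"/>
  </store>
  <retrieve name="dc">
    <xslt stylesheet="store2dc.xsl"/>
  </retrieve>
  <retrieve name="mods">
    <xslt stylesheet="store2mods.xsl"/>
  </retrieve>
</dom>
"""


def summarize_pipelines(config_xml):
    """Return [(pipeline label, [stylesheet paths])] in document order.

    Hypothetical helper for illustration; Zebra itself does not expose it.
    """
    root = ET.fromstring(config_xml)
    summary = []
    for pipeline in root:
        # Tags look like "{namespace}extract"; keep only the local name.
        kind = pipeline.tag.split("}", 1)[-1]
        label = kind
        if kind == "retrieve":
            # Retrieve pipelines are told apart by their name attribute.
            label = "retrieve:" + pipeline.get("name", "")
        sheets = [
            xslt.get("stylesheet")
            for xslt in pipeline.findall("{%s}xslt" % ZEBRA_NS)
        ]
        summary.append((label, sheets))
    return summary


if __name__ == "__main__":
    for label, sheets in summarize_pipelines(CONFIG):
        print(label, sheets)
```

A pipeline with an empty stylesheet list, such as the <input> pipeline here, corresponds to zero XSLT transformations after the input parser.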
DOM XML indexing comes in two flavors: plain XML documents governed purely by processing instructions, and, very similar to the Alvis filter indexing format, XML documents containing XML <record> and <index> instructions from the magic namespace xmlns:z="http://indexdata.com/zebra-2.0".
The output of the processing-instruction driven indexing XSLT stylesheets must contain processing instructions named zebra-2.0.
The output of the XSLT indexing transformation is then
parsed using DOM methods, and the contained instructions are
performed on the elements and their
subtrees directly following the processing instructions.
For example, the output of the command
xsltproc dom-index-pi.xsl marc-one.xml
might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?zebra-2.0 record id=11224466 rank=42?>
<record>
  <?zebra-2.0 index control:0?>
  <control>11224466</control>
  <?zebra-2.0 index any:w title:w title:p title:s?>
  <title>How to program a computer</title>
</record>
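The rule that each instruction applies to the element directly following it can be illustrated with a small Python sketch that walks the DOM and pairs every zebra-2.0 processing instruction with the next element sibling. This is only an approximation for explanatory purposes and says nothing about how Zebra implements the parsing; the function name zebra_instructions is an assumption, and the sample repeats the output above without its XML declaration.

```python
from xml.dom import minidom

SAMPLE = """<?zebra-2.0 record id=11224466 rank=42?>
<record>
  <?zebra-2.0 index control:0?>
  <control>11224466</control>
  <?zebra-2.0 index any:w title:w title:p title:s?>
  <title>How to program a computer</title>
</record>
"""


def zebra_instructions(xml_text):
    """Return [(instruction text, name of the element it governs)].

    Hypothetical helper for illustration; not a Zebra API.
    """
    doc = minidom.parseString(xml_text)
    found = []

    def walk(node):
        for child in node.childNodes:
            if (child.nodeType == child.PROCESSING_INSTRUCTION_NODE
                    and child.target == "zebra-2.0"):
                # Skip whitespace text nodes to reach the following element.
                sibling = child.nextSibling
                while (sibling is not None
                       and sibling.nodeType != sibling.ELEMENT_NODE):
                    sibling = sibling.nextSibling
                governed = sibling.tagName if sibling is not None else None
                found.append((child.data, governed))
            walk(child)

    walk(doc)
    return found


if __name__ == "__main__":
    for instruction, element in zebra_instructions(SAMPLE):
        print(instruction, "->", element)
```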
In the element driven format, the output of the indexing XSLT stylesheets must contain certain elements in the magic xmlns:z="http://indexdata.com/zebra-2.0" namespace. The output of the XSLT indexing transformation is then parsed using DOM methods, and the contained instructions are performed on the magic elements and their subtrees.
For example, the output of the command
xsltproc dom-index-element.xsl marc-one.xml
might look like this:
<?xml version="1.0" encoding="UTF-8"?>
<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:rank="42">
  <z:index name="control:0">11224466</z:index>
  <z:index name="any:w title:w title:p title:s">How to program a computer</z:index>
</z:record>
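As a quick way to inspect such output, the following Python sketch extracts the record attributes and the index instructions from an element driven indexing document. It is a reading aid under stated assumptions, not Zebra code: the function name read_record and the shape of its return value are invented for this example, and the sample repeats the output above without its XML declaration.

```python
import xml.etree.ElementTree as ET

# ElementTree writes namespaced names as "{namespace}local".
Z = "{http://indexdata.com/zebra-2.0}"

SAMPLE = """<z:record xmlns:z="http://indexdata.com/zebra-2.0"
          z:id="11224466" z:rank="42">
  <z:index name="control:0">11224466</z:index>
  <z:index name="any:w title:w title:p title:s">How to program a computer</z:index>
</z:record>
"""


def read_record(xml_text):
    """Return (record attributes, [(index pairs, indexed text)]).

    Hypothetical helper for illustration; not a Zebra API.
    """
    record = ET.fromstring(xml_text)
    if record.tag != Z + "record":
        raise ValueError("expected a z:record root element")
    info = {"id": record.get(Z + "id"), "rank": record.get(Z + "rank")}
    entries = []
    for index in record.iter(Z + "index"):
        # Collect all text nodes contained in the z:index subtree.
        text = "".join(index.itertext())
        entries.append((index.get("name").split(), text))
    return info, entries


if __name__ == "__main__":
    info, entries = read_record(SAMPLE)
    print(info)
    for pairs, text in entries:
        print(pairs, repr(text))
```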
Both indexing formats are defined with equal semantics and behavior in mind:
Zebra specific instructions are either processing instructions named zebra-2.0 or elements contained in the namespace xmlns:z="http://indexdata.com/zebra-2.0".
There must be exactly one record instruction, which sets the scope for the following, possibly nested index and group instructions.
The unique record instruction may have the additional attributes id, rank and type. Attribute id is the value of the opaque ID and may be any string not containing the whitespace character ' '. The rank attribute value must be a non-negative integer. See Section 9, “Relevance Ranking and Sorting of Result Sets”.
The type
attribute specifies how the record
is to be treated. The following values may be given for
type
:
insert
The record is inserted. If the record already exists, it is skipped (i.e. not replaced).
replace
The record is replaced. If the record does not already exist, it is skipped (i.e. not inserted).
delete
The record is deleted. If the record does not already exist, a warning is issued and the rest of the records in the input stream are skipped.
update
The record is inserted or replaced depending on whether the record exists or not. This is the default behavior, but it may effectively be changed from outside the scope of the DOM filter by zebraidx commands or Extended Services updates.
adelete
The record is deleted. If the record does not already exist, it is skipped (i.e. nothing is deleted).
Requires version 2.0.54 or later.
Note that the value of type is used to determine the action only if the Zebra indexer is running in "update" mode (i.e. zebraidx update) or if the specialUpdate action of the Extended Service Update is used. For this reason, a specialUpdate may end up deleting records!
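The type values listed above can be restated as a lookup from the type and the record's current existence to the resulting action. The Python sketch below is only such a restatement, assuming the indexer runs in "update" mode or a specialUpdate is used; it is not Zebra code, and the function name resolve_action as well as the outcome labels "skip" and "warn-and-stop" are names chosen for this example.

```python
# (type, record already exists) -> resulting action, restating the list above.
ACTIONS = {
    ("insert", False): "insert",
    ("insert", True): "skip",            # existing record is not replaced
    ("replace", True): "replace",
    ("replace", False): "skip",          # missing record is not inserted
    ("delete", True): "delete",
    ("delete", False): "warn-and-stop",  # warning; rest of input is skipped
    ("update", False): "insert",
    ("update", True): "replace",
    ("adelete", True): "delete",
    ("adelete", False): "skip",          # nothing to delete
}


def resolve_action(record_type="update", exists=False):
    """Return the action for a record; "update" is the documented default.

    Hypothetical helper for illustration; not a Zebra API.
    """
    try:
        return ACTIONS[(record_type, exists)]
    except KeyError:
        raise ValueError("unknown record type: %r" % (record_type,))


if __name__ == "__main__":
    for record_type in ("insert", "replace", "delete", "update", "adelete"):
        for exists in (False, True):
            print(record_type, exists, "->", resolve_action(record_type, exists))
```

The table makes the difference between delete and adelete explicit: only delete stops processing the input stream when the record is missing.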
Multiple and possibly nested index instructions must each contain at least one indexname:indextype pair, and may contain multiple such pairs separated by the whitespace character ' '. In each index pair, the name and the type of the index are separated by a colon character ':'.
Any index name consisting of ASCII letters and following the standard Zebra rules will do; see Section 3.5.1, “Mapping of PQF APT access points”.
Index types are restricted to the values defined in the standard configuration file default.idx, see Section 2.3, “BIB-1 Attribute Set” and Chapter 10, Field Structure and Character Sets for details.
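The syntax of an index instruction described above, space-separated indexname:indextype pairs, can be sketched as a small Python parser. This is an illustration of the syntax only: the function name parse_index_spec is an assumption, and the sketch checks only the pair structure, not whether the names follow the Zebra rules or whether the types are defined in default.idx.

```python
def parse_index_spec(spec):
    """Split "any:w title:w" into [("any", "w"), ("title", "w")].

    Hypothetical helper for illustration; not a Zebra API.
    """
    pairs = []
    for token in spec.split(" "):
        if not token:
            continue  # tolerate repeated spaces
        name, colon, index_type = token.partition(":")
        if not colon or not name or not index_type:
            raise ValueError("malformed index pair: %r" % (token,))
        pairs.append((name, index_type))
    if not pairs:
        raise ValueError("at least one indexname:indextype pair is required")
    return pairs


if __name__ == "__main__":
    print(parse_index_spec("any:w title:w title:p title:s"))
```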
DOM input documents that do not result in both one unique valid record instruction and one or more valid index instructions cannot be searched and found. Therefore, processing of invalid documents is aborted, and any content of the <extract> and <store> pipelines is discarded. A warning is issued in the logs.
The group instruction can be used to group indexing material for proximity search, i.e. to search for material that should all occur within the same group. It takes an optional unit attribute, which can be one of the known Z39.50 proximity units: sentence (3), paragraph (4), section (5), chapter (6), document (7), element (8), subelement (9), elementType (10). If omitted, unit element is used.
For example, in order to search within the same group of unit type chapter, the corresponding Z39.50 proximity search would be:
@prox 0 0 0 0 k 6 leftop rightop
The group facility requires Zebra 2.1.0 or later.
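The relation between the unit name and the number in the proximity query can be sketched as follows. The Python helper below fills in the @prox pattern from the example above for a given unit; the function name group_proximity_query is an assumption, and the default of element simply mirrors the default of the unit attribute.

```python
# Z39.50 proximity unit codes as listed in the text above.
PROXIMITY_UNITS = {
    "sentence": 3,
    "paragraph": 4,
    "section": 5,
    "chapter": 6,
    "document": 7,
    "element": 8,
    "subelement": 9,
    "elementType": 10,
}


def group_proximity_query(left, right, unit="element"):
    """Build a PQF @prox query for two operands within one group.

    Hypothetical helper for illustration; it only reproduces the
    "@prox 0 0 0 0 k <unit code>" pattern from the example above.
    """
    if unit not in PROXIMITY_UNITS:
        raise ValueError("unknown proximity unit: %r" % (unit,))
    return "@prox 0 0 0 0 k %d %s %s" % (PROXIMITY_UNITS[unit], left, right)


if __name__ == "__main__":
    print(group_proximity_query("leftop", "rightop", unit="chapter"))
```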
The examples work as follows: From the original XML file marc-one.xml (or from the XML record DOM of the same form coming from an <input> pipeline), the indexing pipeline <extract> produces an indexing XML record, which is defined by the record instruction. Zebra uses the content of z:id="11224466" or id=11224466 as internal record ID, and, in case static ranking is set, the content of rank=42 or z:rank="42" as static rank.
In these examples, the following literal indexes are constructed:
any:w control:0 title:w title:p title:s
where the indexing type is defined after the
literal ':'
character.
Any value from the standard configuration
file default.idx
will do.
Finally, any text() node content recursively contained inside the <z:index> element, or any element following an index processing instruction, will be filtered through the appropriate char map for character normalization, and will be inserted in the named indexes.
This example configuration can be queried using PQF queries, either transported by Z39.50 (here using yaz-client)
Z> open localhost:9999
Z> elem dc
Z> form xml
Z>
Z> find @attr 1=control @attr 4=3 11224466
Z> scan @attr 1=control @attr 4=3 ""
Z>
Z> find @attr 1=title program
Z> scan @attr 1=title ""
Z>
Z> find @attr 1=title @attr 4=2 "How to program a computer"
Z> scan @attr 1=title @attr 4=2 ""
or by the proprietary extensions x-pquery and x-pScanClause to SRU and SRW
http://localhost:9999/?version=1.1&operation=searchRetrieve&x-pquery=@attr 1=title program
http://localhost:9999/?version=1.1&operation=scan&x-pScanClause=@attr 1=title ""
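The URLs above are shown unencoded for readability. When they are built programmatically, the PQF string usually has to be percent-encoded, since it contains spaces and characters such as '@' and '"'. The Python sketch below shows one way to do this with the standard library; the function name sru_url is an assumption, and the host, port and parameter names are simply those from the example.

```python
from urllib.parse import quote, urlencode


def sru_url(base, operation, pqf_parameter, pqf):
    """Build an SRU URL carrying a PQF query in a proprietary parameter.

    Hypothetical helper for illustration. pqf_parameter is "x-pquery"
    for searchRetrieve and "x-pScanClause" for scan, as in the text above.
    """
    params = [
        ("version", "1.1"),
        ("operation", operation),
        (pqf_parameter, pqf),
    ]
    # quote_via=quote encodes spaces as %20 rather than '+'.
    return base + "?" + urlencode(params, quote_via=quote)


if __name__ == "__main__":
    base = "http://localhost:9999/"
    print(sru_url(base, "searchRetrieve", "x-pquery", "@attr 1=title program"))
    print(sru_url(base, "scan", "x-pScanClause", '@attr 1=title ""'))
```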
See the section called “The SRU Server” for more information on SRU/SRW configuration, and the section called “YAZ server virtual hosts” or the YAZ CQL section for details on the YAZ frontend server.
Notice that there are no *.abs, *.est, *.map, or other GRS-1 filter configuration files involved in this process, and that the literal index names are used during search and retrieval.
In case we want to support the usual bib-1 Z39.50 numeric access points, it is a good idea to choose string index names defined in the default configuration file tab/bib1.att; see Section 3.4, “The Attribute Set (.att) Files”.