The Zebra system is designed to support a wide range of data management applications. The system can be configured to handle virtually any kind of structured data. Each record in the system is associated with a record schema which lends context to the data elements of the record. Any number of record schemas can coexist in the system. Although it may be wise to use only a single schema within one database, the system imposes no such restriction.
The Zebra indexer and information retrieval server consists of the following main applications: the zebraidx indexing maintenance utility, and the zebrasrv information query and retrieval server. Both share some of the same main components, which are presented here.
The virtual Debian package idzebra-2.0 installs all the necessary packages to start working with Zebra, including utility programs, development libraries, documentation and modules.
The core Zebra module is the meat of the zebraidx indexing maintenance utility and the zebrasrv information query and retrieval server binaries. In short, the core libraries are responsible for the following:
Dynamic loading of external filter modules, in case the application is not compiled statically. These filter modules define the indexing, search and retrieval capabilities of the various input formats.
Index maintenance. Zebra maintains Term Dictionaries and ISAM index entries in inverted index structures kept on disk. These are optimized for fast insert, update and delete, as well as good search performance.
Search evaluation by execution of search requests expressed in PQF/RPN data structures, which are handed over from the YAZ server frontend API. Search evaluation includes construction of hit lists according to boolean combinations of simpler searches. Fast performance is achieved by careful use of index structures, and by evaluating specific index hit lists in the correct order.
Ranking and sorting. These components call resorting/re-ranking algorithms on the hit sets. The hit sets might also be pre-sorted, not only using the assigned document IDs, but also using assigned static rank information.
Record presentation, which returns (possibly ranked) result sets, hit numbers, and similar internal data to the YAZ server backend API for shipping to the client. Each individual filter module implements its own specific presentation formats.
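The index maintenance and search evaluation duties described above can be illustrated with a toy model. The following Python sketch is not Zebra's implementation: it keeps everything in memory, and the names `ToyInvertedIndex` and `evaluate` as well as the tuple-based query form are assumptions made for illustration only. It shows how a term dictionary maps terms to hit lists, and how boolean combinations of simpler searches produce a combined hit list.

```python
from collections import defaultdict


class ToyInvertedIndex:
    """In-memory stand-in for a term dictionary with posting lists."""

    def __init__(self):
        # term -> set of IDs of the documents that contain the term
        self.postings = defaultdict(set)

    def insert(self, doc_id, text):
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def delete(self, doc_id):
        for doc_ids in self.postings.values():
            doc_ids.discard(doc_id)

    def hits(self, term):
        # Sorted hit list for one term; empty if the term is unknown.
        return sorted(self.postings.get(term.lower(), ()))


def evaluate(index, query):
    """Evaluate a term or a nested (operator, left, right) tuple."""
    if isinstance(query, str):
        return index.hits(query)
    operator, left, right = query
    left_hits = set(evaluate(index, left))
    right_hits = set(evaluate(index, right))
    if operator == "and":
        return sorted(left_hits & right_hits)
    if operator == "or":
        return sorted(left_hits | right_hits)
    if operator == "not":
        return sorted(left_hits - right_hits)
    raise ValueError("unknown operator: %r" % operator)


index = ToyInvertedIndex()
index.insert(1, "Zebra index server")
index.insert(2, "Zebra retrieval")
index.insert(3, "index maintenance")
print(evaluate(index, ("and", "zebra", ("or", "index", "retrieval"))))  # [1, 2]
```

A real index keeps these structures on disk and chooses the order in which hit lists are combined; the sketch only mirrors the logical model.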
The Debian package libidzebra-2.0 contains all run-time libraries for Zebra; the documentation in PDF and HTML is found in idzebra-2.0-doc; and idzebra-2.0-common includes common essential Zebra configuration files.
The zebraidx indexing maintenance utility loads external filter modules used for indexing data records of different types, and creates, updates and drops databases and indexes according to the rules defined in the filter modules.
The Debian package idzebra-2.0-utils contains the zebraidx utility.
The zebrasrv executable runs the Z39.50/SRU/SRW server and glues together the core libraries and the filter modules into one Information Retrieval server application.
The Debian package idzebra-2.0-utils contains the zebrasrv utility.
The YAZ server frontend is a full-fledged stateful Z39.50 server that takes client connections and forwards search and scan requests to the Zebra core indexer.
In addition to Z39.50 requests, the YAZ server frontend acts as an HTTP server, honoring SRU SOAP requests and SRU REST requests. Moreover, it can translate incoming CQL queries to PQF queries, if correctly configured.
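To make the SRU REST and CQL-to-PQF ideas concrete, here is a minimal Python sketch. It is not the YAZ implementation. The mapping table `CQL_TO_BIB1_USE`, the helper names, and the host, port and database in the example URL are all hypothetical; a real server reads its CQL mapping from configuration and handles full CQL, whereas this sketch only handles a single `index = term` clause.

```python
from urllib.parse import urlencode

# Hypothetical mapping from CQL index names to BIB-1 use attributes
# (4 = title, 1003 = author, 1016 = any). A real server reads such a
# mapping from its configuration.
CQL_TO_BIB1_USE = {
    "dc.title": 4,
    "dc.creator": 1003,
    "cql.anywhere": 1016,
}


def cql_clause_to_pqf(clause):
    """Translate one 'index = term' CQL clause into a PQF string."""
    index_name, separator, term = clause.partition("=")
    if not separator:
        # A bare term is searched in the default index.
        index_name, term = "cql.anywhere", clause
    index_name = index_name.strip()
    term = term.strip().strip('"')
    use_attribute = CQL_TO_BIB1_USE[index_name]
    if " " in term:
        # Quote multi-word terms so that PQF reads them as one term.
        term = '"%s"' % term
    return "@attr 1=%d %s" % (use_attribute, term)


def sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU searchRetrieve URL for a REST-style GET request."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "maximumRecords": maximum_records,
    }
    return base_url + "?" + urlencode(params)


print(cql_clause_to_pqf("dc.title = zebra"))
# Hypothetical server address and database name.
print(sru_search_url("http://localhost:9999/books", "dc.title = zebra"))
```

The client sends CQL inside the SRU URL; the translation to PQF happens on the server side before the query reaches the core indexer.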
YAZ is an Open Source toolkit that allows you to develop software using the ANSI Z39.50/ISO23950 standard for information retrieval. It is packaged in the Debian packages yaz and libyaz.
The hard work of knowing what to index, how to do it, and which part of the records to send in a search/retrieve response is implemented in various filter modules. It is their responsibility to define the exact indexing and record display filtering rules.
The virtual Debian package libidzebra-2.0-modules installs all base filter modules.
The DOM XML filter uses a standard DOM XML structure as internal data model, and can thus parse, index, and display any XML document.
A parser for binary MARC records based on the ISO2709 library standard is provided; it transforms these to the internal MARCXML DOM representation.
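As an illustration of what such a parser has to do, the following Python sketch reads one ISO 2709 record (24-byte leader, directory of 12-byte entries, fields ended by a field terminator) and builds a MARCXML-like element tree. It is a simplified sketch, not Zebra's parser: it assumes the MARC 21 directory layout and UTF-8 data, omits the MARCXML namespace, and the helper `build_record` exists only to produce test input.

```python
import xml.etree.ElementTree as ET

FIELD_END, RECORD_END, SUBFIELD = b"\x1e", b"\x1d", b"\x1f"


def parse_iso2709(record):
    """Return (tag, value) pairs parsed from one binary record."""
    base = int(record[12:17])        # leader bytes 12-16: base address of data
    directory = record[24:base - 1]  # directory ends with a field terminator
    fields = []
    for pos in range(0, len(directory), 12):
        entry = directory[pos:pos + 12]
        tag = entry[0:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        raw = record[base + start:base + start + length - 1]  # drop terminator
        if tag.startswith("00"):
            fields.append((tag, raw.decode("utf-8")))  # control field
        else:
            subfields = [(chunk[0:1].decode("ascii"), chunk[1:].decode("utf-8"))
                         for chunk in raw[2:].split(SUBFIELD)[1:]]
            fields.append((tag, (raw[0:2].decode("ascii"), subfields)))
    return fields


def to_marcxml(fields):
    """Build a MARCXML-like tree (namespace omitted for brevity)."""
    record = ET.Element("record")
    for tag, value in fields:
        if isinstance(value, str):
            ET.SubElement(record, "controlfield", {"tag": tag}).text = value
        else:
            indicators, subfields = value
            datafield = ET.SubElement(record, "datafield", {
                "tag": tag, "ind1": indicators[0], "ind2": indicators[1]})
            for code, text in subfields:
                ET.SubElement(datafield, "subfield", {"code": code}).text = text
    return record


def build_record(fields):
    """Assemble a minimal record from (tag, bytes) pairs; demo helper only."""
    directory, data = b"", b""
    for tag, body in fields:
        body += FIELD_END
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(body), len(data))
        data += body
    directory += FIELD_END
    base = 24 + len(directory)
    leader = b"%05dnam a22%05d a 4500" % (base + len(data) + 1, base)
    return leader + directory + data + RECORD_END


binary = build_record([
    ("001", b"rec-1"),
    ("245", b"10" + SUBFIELD + b"aZebra" + SUBFIELD + b"bindexing"),
])
marcxml = to_marcxml(parse_iso2709(binary))
print(ET.tostring(marcxml, encoding="unicode"))
```

Once the record is in this XML form, it can be processed with the same XML tooling as any other XML document.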
The internal DOM XML representation can be fed into four different pipelines, each consisting of arbitrarily many successive XSLT transformations. These pipelines are for
input parsing and initial transformations,
indexing term extraction transformations,
transformations before internal document storage, and
retrieve transformations from storage to output format.
The DOM XML filter pipelines use XSLT (and, if supported on your platform, even EXSLT). This brings full XPath support to the indexing, storage and display rules of not only XML documents, but also binary MARC records.
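The four-pipeline structure can be sketched as chains of transformations applied one after another. In the Python sketch below, plain functions stand in for XSLT stylesheets, because the standard library has no XSLT processor. The pipeline labels, the function names and the sample record are assumptions for illustration, and the exact data flow between pipelines inside Zebra is not modeled.

```python
import xml.etree.ElementTree as ET


def run_pipeline(steps, data):
    """Apply each transformation in order, feeding the result to the next."""
    for step in steps:
        data = step(data)
    return data


def parse(text):
    return ET.fromstring(text)


def strip_whitespace(doc):
    for element in doc.iter():
        if element.text:
            element.text = element.text.strip()
    return doc


def extract_terms(doc):
    # Produce (index name, term) pairs from every element with text.
    return [(element.tag, word.lower())
            for element in doc.iter() if element.text
            for word in element.text.split()]


def drop_internal_notes(doc):
    for note in doc.findall("note"):
        doc.remove(note)
    return doc


def to_brief(doc):
    brief = ET.Element("brief")
    brief.text = doc.findtext("title")
    return brief


# Hypothetical labels for the four pipelines described in the text.
PIPELINES = {
    "input": [parse, strip_whitespace],
    "extract": [extract_terms],
    "store": [drop_internal_notes],
    "retrieve": [to_brief],
}

source = "<book><title> Zebra Guide </title><note>draft</note></book>"
doc = run_pipeline(PIPELINES["input"], source)
terms = run_pipeline(PIPELINES["extract"], doc)
stored = run_pipeline(PIPELINES["store"], doc)
output = run_pipeline(PIPELINES["retrieve"], stored)
print(terms)
print(ET.tostring(output, encoding="unicode"))
```

Each pipeline is just an ordered list, so adding another transformation means appending one more step.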
Finally, the DOM XML filter allows for static ranking at index time, and for sorting hit lists according to predefined static ranks.
Details on the experimental DOM XML filter are found in Chapter 7, DOM XML Record Model and Filter Module.
The Debian package libidzebra-2.0-mod-dom contains the DOM filter module.
The functionality of the Alvis record model has been improved and replaced by the DOM XML record model. See Section 2.5.1, “DOM XML Record Model and Filter Module”.
The Alvis filter for XML files is an XSLT-based input filter. It indexes element and attribute content of any conceivable XML format using full XPath support, a feature which the standard Zebra GRS-1 SGML and XML filters lacked. The indexed documents are parsed into a standard XML DOM tree, which limits record size to what fits in available memory.
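The idea of indexing element and attribute content through path expressions can be sketched in Python. This is only an analogy: the Alvis filter expresses such rules in XSLT, while the sketch uses the limited XPath subset of `xml.etree.ElementTree`. The rule table `INDEX_RULES`, the index names and the sample record are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical indexing rules: (index name, path, attribute or None).
# With None the element text is indexed, otherwise the attribute value.
INDEX_RULES = [
    ("title", ".//title", None),
    ("author", ".//author", None),
    ("language", ".//title", "lang"),
]


def extract_index_terms(xml_text):
    """Return (index name, term) pairs for one XML record."""
    # The whole record is parsed into an in-memory tree first, which is
    # why record size is bounded by available memory.
    root = ET.fromstring(xml_text)
    terms = []
    for index_name, path, attribute in INDEX_RULES:
        for element in root.findall(path):
            value = element.get(attribute) if attribute else element.text
            if value:
                terms.extend((index_name, word.lower())
                             for word in value.split())
    return terms


record = ('<record><title lang="en">Zebra Indexing</title>'
          '<author>Index Data</author></record>')
print(extract_index_terms(record))
```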
The Alvis filter uses XSLT display stylesheets, which let the Zebra DB administrator associate multiple, different views with the same XML document type. These views are chosen on the fly at search time.
In addition, the Alvis filter configuration is not bound to the arcane BIB-1 Z39.50 library catalogue indexing traditions and folklore, and is therefore easier to understand.
Finally, the Alvis filter allows for static ranking at index time, and for sorting hit lists according to predefined static ranks. This imposes no overhead at all: both search and indexing still perform in O(1), irrespective of document collection size. This feature resembles Google's pre-ranking using their PageRank algorithm.
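The principle behind static ranking can be sketched as follows: if every document receives its rank at index time and the hit lists are kept in rank order, a search can return the best hits without sorting anything at query time. The Python sketch below is a toy model under stated assumptions, not Zebra's implementation, and it makes no claim about Zebra's complexity. The class name `StaticRankIndex` and the convention that a higher number means a better rank are assumptions.

```python
import bisect
from collections import defaultdict


class StaticRankIndex:
    """Toy index whose hit lists are kept sorted by static rank."""

    def __init__(self):
        # term -> list of (-static_rank, doc_id), kept in sorted order
        self.postings = defaultdict(list)

    def insert(self, doc_id, static_rank, text):
        for term in set(text.lower().split()):
            # Negating the rank makes the best-ranked document sort first.
            bisect.insort(self.postings[term], (-static_rank, doc_id))

    def search(self, term, limit=10):
        # The list is already in rank order, so no sorting happens here.
        hits = self.postings.get(term.lower(), [])
        return [doc_id for _, doc_id in hits[:limit]]


ranked = StaticRankIndex()
ranked.insert(1, 5, "zebra index")
ranked.insert(2, 9, "zebra server")
ranked.insert(3, 7, "zebra")
print(ranked.search("zebra", limit=2))  # [2, 3]
```

The trade-off is that the rank must be known when the record is indexed; it cannot depend on the query.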
Details on the experimental Alvis XSLT filter are found in Chapter 8, ALVIS XML Record Model and Filter Module.
The Debian package libidzebra-2.0-mod-alvis contains the Alvis filter module.
The functionality of the GRS-1 record model has been improved and replaced by the DOM XML record model. See Section 2.5.1, “DOM XML Record Model and Filter Module”.
The GRS-1 filter modules described in Chapter 9, GRS-1 Record Model and Filter Modules are all based on the Z39.50 specifications, and it is absolutely mandatory to have the reference pages on BIB-1 attribute sets at hand when configuring GRS-1 filters. The GRS filters come in different flavors, and a short introduction is needed here.
GRS-1 filters of various kinds have also been called ABS filters due to the *.abs configuration file suffix.
The grs.marc and grs.marcxml filters are suited to parse and index binary and XML versions of traditional library MARC records based on the ISO2709 standard. The Debian package for both filters is libidzebra-2.0-mod-grs-marc.
GRS-1 TCL scriptable filters for extensive user configuration come in two flavors: a regular expression filter, grs.regx, which uses TCL regular expressions, and a general scriptable TCL filter called grs.tcl. Both are included in the libidzebra-2.0-mod-grs-regx Debian package.
A general-purpose SGML filter is called grs.sgml. This filter is not yet packaged, but is planned to be in the libidzebra-2.0-mod-grs-sgml Debian package.
The Debian package libidzebra-2.0-mod-grs-xml includes the grs.xml filter, which uses Expat to parse records in XML and turn them into IDZebra's internal GRS-1 node trees. See also the Alvis XML/XSLT filter described above.