The functionality of this record model has been improved and replaced by the DOM XML record model. See Chapter 7, DOM XML Record Model and Filter Module.
The record model described in this chapter applies to the fundamental, structured record type grs, introduced in Section 2.5.3, “GRS-1 Record Model and Filter Modules”.
Many basic subtypes of the grs type are currently available:
grs.sgml
This is the canonical input format described in Section 1.1, “GRS-1 Canonical Input Format”. It uses a simple SGML-like syntax.
grs.marc.type
This allows Zebra to read records in the ISO2709 (MARC) encoding standard. The last parameter, type, names the .abs file (see below) which describes the specific MARC structure of the input record as well as the indexing rules.
The grs.marc filter uses an internal representation which is not XML conformant. In particular, MARC tags are presented as elements with the same name, and XML element names may not start with digits. Therefore this filter is only suitable for systems returning GRS-1 and MARC records. For XML, use the grs.marcxml filter instead (see below).
The loadable grs.marc filter module is packaged in the GNU/Debian package libidzebra2.0-mod-grs-marc.
grs.marcxml.type
This allows Zebra to read ISO2709 encoded records. The last parameter, type, names the .abs file (see below) which describes the specific MARC structure of the input record as well as the indexing rules.
The internal representation for grs.marcxml is the same as for MARCXML. It is slightly more complicated to work with than grs.marc, but it is XML conformant.
The loadable grs.marcxml filter module is also contained in the GNU/Debian package libidzebra2.0-mod-grs-marc.
grs.xml
This filter reads XML records and uses Expat to parse them and convert them into IDZebra's internal grs record model.
Only one record per file is supported, because XML does not allow two documents to "follow" each other (there is no way to know when a document is finished).
This filter is only available if Zebra is compiled with EXPAT support.
The loadable grs.xml filter module is packaged in the GNU/Debian package libidzebra2.0-mod-grs-xml.
grs.regx.filter
This enables a user-supplied Regular Expressions input filter described in Section 1.2, “GRS-1 REGX And TCL Input Filters”.
The loadable grs.regx filter module is packaged in the GNU/Debian package libidzebra2.0-mod-grs-regx.
grs.tcl.filter
Similar to grs.regx but using Tcl for rules, described in Section 1.2, “GRS-1 REGX And TCL Input Filters”.
The loadable grs.tcl filter module is also packaged in the GNU/Debian package libidzebra2.0-mod-grs-regx.
Although input data can take any form, it is sometimes useful to describe the record processing capabilities of the system in terms of a single, canonical input format that gives access to the full spectrum of structure and flexibility in the system. In Zebra, this canonical format is an "SGML-like" syntax.
To use the canonical format, specify grs.sgml as the record type.
Consider a record describing an information resource (such a record is sometimes known as a locator record). It might contain a field describing the distributor of the information resource, which might in turn be partitioned into various fields providing details about the distributor, like this:
    <Distributor>
      <Name> USGS/WRD </Name>
      <Organization> USGS/WRD </Organization>
      <Street-Address>
        U.S. GEOLOGICAL SURVEY, 505 MARQUETTE, NW
      </Street-Address>
      <City> ALBUQUERQUE </City>
      <State> NM </State>
      <Zip-Code> 87102 </Zip-Code>
      <Country> USA </Country>
      <Telephone> (505) 766-5560 </Telephone>
    </Distributor>
The keywords surrounded by <...> are tags, while the sections of text in between are the data elements. A data element is characterized by its location in the tree that is made up by the nested elements. Each element is terminated by a closing tag, beginning with </ and containing the same symbolic tag-name as the corresponding opening tag. The general closing tag, </>, terminates the element started by the last opening tag. The structuring of elements is significant. The element Telephone, for instance, may be indexed and presented to the client differently, depending on whether it appears inside the Distributor element or some other structured data element such as a Supplier element.
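The tree structure and the general closing tag can be illustrated with a small Python sketch. It is not Zebra's grs.sgml reader. The function names parse and paths are hypothetical, and the sketch handles only plain tags and the general closing tag </>, not variants.

```python
import re

# A token is an opening tag <Name>, a closing tag </Name>, the general
# closing tag </>, or a run of text between tags.
TOKEN = re.compile(r"<(/?)([^<>]*)>|([^<]+)")


def parse(text):
    """Parse SGML-like input into nested (tag, children) tuples.

    Illustrative only: children are either strings (data) or further
    (tag, children) tuples.
    """
    root = ("", [])
    stack = [root]
    for closing, name, data in TOKEN.findall(text):
        if data:
            if data.strip():
                stack[-1][1].append(data.strip())
        elif closing:
            name = name.strip()
            if len(stack) == 1:
                raise ValueError("closing tag without an open element")
            # A named closing tag must match the innermost open element;
            # the general closing tag </> closes whatever was opened last.
            if name and stack[-1][0] != name:
                raise ValueError("mismatched closing tag: " + name)
            stack.pop()
        else:
            node = (name.strip(), [])
            stack[-1][1].append(node)
            stack.append(node)
    return root[1]


def paths(nodes, prefix=""):
    """Yield (path, data) pairs, showing each data element's tree location."""
    for node in nodes:
        if isinstance(node, str):
            yield prefix, node
        else:
            yield from paths(node[1], prefix + "/" + node[0])
```

Calling list(paths(parse(record))) on a record yields pairs such as ("/Distributor/Telephone", "(505) 766-5560"), which makes explicit that the same Telephone tag under a different parent is a different data element.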
The first tag in a record describes the root node of the tree that makes up the total record. In the canonical input format, the root tag should contain the name of the schema that lends context to the elements of the record (see Section 2, “GRS-1 Internal Record Representation”). The following is a GILS record that contains only a single element. Strictly speaking, that makes it an illegal GILS record, since the GILS profile includes several mandatory elements. However, Zebra does not validate the contents of a record against the Z39.50 profile; it merely attempts to match up elements of a local representation with the given schema:
    <gils>
      <title>Zen and the Art of Motorcycle Maintenance</title>
    </gils>
Zebra allows you to provide individual data elements in a number of variant forms. Examples of variant forms are textual data elements which might appear in different languages, and images which may appear in different formats or layouts. The variant system in Zebra is essentially a representation of the variant mechanism of Z39.50-1995.
The following is an example of a title element which occurs in two different languages.
    <title>
      <var lang lang "eng">
        Zen and the Art of Motorcycle Maintenance</>
      <var lang lang "dan">
        Zen og Kunsten at Vedligeholde en Motorcykel</>
    </title>
The syntax of the variant element is <var class type value>. The available values for the class and type fields are given by the variant set that is associated with the current schema (see Section 1.1.2, “Variants”).
Variant elements are terminated by the general end-tag </>, by the variant end-tag </var>, by the appearance of another variant tag with the same class and value settings, or by the appearance of another, normal tag. In other words, the end-tags for the variants used in the example above could have been omitted.
Variant elements can be nested. The element

    <title>
      <var lang lang "eng"><var body iana "text/plain">
        Zen and the Art of Motorcycle Maintenance
    </title>

associates two variant components with the variant list for the title element.
Given the nesting rules described above, we could write
    <title>
      <var body iana "text/plain">
        <var lang lang "eng">
          Zen and the Art of Motorcycle Maintenance
        <var lang lang "dan">
          Zen og Kunsten at Vedligeholde en Motorcykel
    </title>
The title element above comes in two variants. Both have the IANA body type "text/plain", but one is in English, and the other in Danish. The client, using the element selection mechanism of Z39.50, can retrieve information about the available variant forms of data elements, or it can select specific variants based on the requirements of the end-user.
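Variant selection can be pictured with a short Python sketch. The data layout and the function select_variants are hypothetical, chosen only for illustration. They do not reflect Zebra's internal structures or the Z39.50 wire format.

```python
# Each variant of an element is modelled as a set of (class, type, value)
# triples plus the data it carries. This layout is an assumption made for
# the example only.
title_variants = [
    (
        {("body", "iana", "text/plain"), ("lang", "lang", "eng")},
        "Zen and the Art of Motorcycle Maintenance",
    ),
    (
        {("body", "iana", "text/plain"), ("lang", "lang", "dan")},
        "Zen og Kunsten at Vedligeholde en Motorcykel",
    ),
]


def select_variants(variants, *wanted):
    """Return the data of every variant that carries all requested triples."""
    required = set(wanted)
    return [data for triples, data in variants if required <= triples]
```

Requesting ("lang", "lang", "dan") returns only the Danish title, while requesting ("body", "iana", "text/plain") returns both, because both variants share that body type.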
In order to handle general input formats, Zebra allows the operator to define filters which read individual records in their native format and produce an internal representation that the system can work with.
Input filters are ASCII files, generally with the suffix .flt. The system looks for the files in the directories given in the profilePath setting in the zebra.cfg file.
The record type for the filter is grs.regx.filter-filename (fundamental type grs, file read type regx, argument filter-filename).
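The three-part structure of such a record type can be shown with a minimal Python sketch. The function split_record_type and the filter name usenet are hypothetical. How Zebra itself parses the string is not described here.

```python
def split_record_type(record_type):
    """Split a record type into (fundamental type, file read type, argument).

    Only the first two dots act as separators, so an argument that itself
    contains a dot stays in one piece. Missing parts are returned as None.
    """
    parts = record_type.split(".", 2)
    parts += [None] * (3 - len(parts))
    return tuple(parts)
```

For example, split_record_type("grs.regx.usenet") gives ("grs", "regx", "usenet"), and split_record_type("grs.sgml") gives ("grs", "sgml", None).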
Generally, an input filter consists of a sequence of rules, where each rule consists of a sequence of expressions, followed by an action. The expressions are evaluated against the contents of the input record, and the actions normally contribute to the generation of an internal representation of the record.
An expression can be any of the following:
INIT
The action associated with this expression is evaluated exactly once in the lifetime of the application, before any records are read. It can be used in conjunction with an action that initializes tables or other resources that are used in the processing of input records.
BEGIN
Matches the beginning of the record. It can be used to initialize variables, etc. Typically, the BEGIN rule is also used to establish the root node of the record.
END
Matches the end of the record, when all of the contents of the record have been processed.
/reg/
Matches regular expression pattern reg from the input record. The operators supported are the same as for regular expression queries. Refer to Section 3.6, “Zebra Regular Expressions in Truncation Attribute (type = 5)”.
BODY
This keyword may only be used between two patterns. It matches everything between (not including) those patterns.
FINISH
The action associated with this expression is evaluated once, before the application terminates. It can be used to release system resources, typically ones allocated in the INIT step.
An action is surrounded by curly braces ({...}), and consists of a sequence of statements. Statements may be separated by newlines or semicolons (;). Within actions, the strings that matched the expressions immediately preceding the action can be referred to as $0, $1, $2, etc.
The available statements are:
begin type [parameter ... ]
Begin a new data element. The type is one of the following:
record
Begin a new record. The following parameter should be the name of the schema that describes the structure of the record, e.g., gils or wais (see below). The begin record call should precede any other use of the begin statement.
element
Begin a new tagged element. The parameter is the name of the tag. If the tag is not matched anywhere in the tagsets referenced by the current schema, it is treated as a local string tag.
variant
Begin a new node in a variant tree. The parameters are class type value.
data parameter
Create a data element. The concatenated arguments make up the value of the data element. The option -text signals that the layout (whitespace) of the data should be retained for transmission. The option -element tag wraps the data up in the tag. The use of the -element option is equivalent to preceding the command with a begin element command, and following it with the end command.
end [type]
Close a tagged element. If no parameter is given, the last element on the stack is terminated. The first parameter, if any, is a type name, similar to the begin statement. For the element type, a tag name can be provided to terminate a specific tag.
unread no
Move the input pointer to the offset of the first character that matches the rule given by no. The first rule from left to right is numbered zero, the second rule is numbered 1, and so on.
The following input filter reads a Usenet news file, producing a record in the WAIS schema. Note that the body of a news posting is separated from the list of headers by a blank line (or rather a sequence of two newline characters).
    BEGIN                { begin record wais }

    /^From:/ BODY /$/    { data -element name $1 }
    /^Subject:/ BODY /$/ { data -element title $1 }
    /^Date:/ BODY /$/    { data -element lastModified $1 }
    /\n\n/ BODY END      {
                           begin element bodyOfDisplay
                           begin variant body iana "text/plain"
                           data -text $1
                           end record
                         }
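For readers more familiar with general-purpose regular expression libraries, the effect of this filter can be approximated in Python. This is only a rough sketch under stated assumptions. The function parse_news is hypothetical, the result is a plain dictionary rather than a GRS-1 record, the variant node is omitted, and header values are stripped of surrounding whitespace, which the filter above does not do.

```python
import re

# Header name -> element tag, mirroring the three header rules of the filter.
HEADER_TAGS = (("From", "name"), ("Subject", "title"), ("Date", "lastModified"))


def parse_news(posting):
    """Approximate the Usenet filter: headers become elements, the rest the body."""
    # The first blank line (two consecutive newlines) separates headers from body.
    headers, _, body = posting.partition("\n\n")
    record = {}
    for field, tag in HEADER_TAGS:
        # Equivalent in spirit to /^Field:/ BODY /$/ with $1 as the value.
        match = re.search(rf"^{field}:(.*)$", headers, re.MULTILINE)
        if match:
            record[tag] = match.group(1).strip()
    # Like "data -text", the body keeps its original layout.
    record["bodyOfDisplay"] = body
    return record
```

The dictionary keys correspond to the element tags used in the filter (name, title, lastModified, bodyOfDisplay).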
If Zebra is compiled with support for Tcl enabled, the statements described above are supplemented with a complete scripting environment, including control structures (conditional expressions and loop constructs), and powerful string manipulation mechanisms for modifying the elements of a record.