As mentioned above, there can be only one indexing stylesheet, and configuring the indexing process amounts to writing an XSLT stylesheet which produces XML output containing the magic elements discussed in Section 1.1, “ALVIS Internal Record Representation”. Obviously, there are many different ways to accomplish this task, and some comments and code snippets are in order to lead our Padawans on the right track to the good side of the force.
Stylesheets can be written in the pull or the push style. Pull means that the output XML structure is taken as the starting point for the internal structure of the XSLT stylesheet, and portions of the input XML are pulled out and inserted into the right spots of the output XML structure. Push XSLT stylesheets, on the other hand, recursively call their template definitions, a process driven by the input XML structure, and produce output XML whenever certain conditions in the input XML are met. The pull type is well suited to input XML with a strong and well-defined structure and semantics, like the following OAI indexing example, whereas the push type might be the only possible way to sort out deeply recursive input XML formats.
A pull stylesheet example used to index OAI harvested records could use some of the following template definitions:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                version="1.0">

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- match on oai xml record root -->
  <xsl:template match="/">
    <z:record z:id="{normalize-space(oai:record/oai:header/oai:identifier)}">
      <!-- you might want to use z:rank="{some XSLT function here}" -->
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- OAI indexing templates -->
  <xsl:template match="oai:record/oai:header/oai:identifier">
    <z:index name="oai_identifier" type="0">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- etc, etc -->

  <!-- DC specific indexing templates -->
  <xsl:template match="oai:record/oai:metadata/oai_dc:dc/dc:title">
    <z:index name="dc_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

  <!-- etc, etc -->

</xsl:stylesheet>
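For contrast, a push-style stylesheet might look roughly like the following. This is only a minimal sketch: it assumes a hypothetical input format in which section elements are nested to arbitrary depth and each carries a title element, and the index name section_title is made up for illustration.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                version="1.0">

  <!-- disable all default text node output -->
  <xsl:template match="text()"/>

  <!-- the document root emits the record wrapper and then hands
       control over to the input structure -->
  <xsl:template match="/">
    <z:record>
      <xsl:apply-templates/>
    </z:record>
  </xsl:template>

  <!-- hypothetical input: fires for every title that is a child of
       a section, at any nesting depth, because the built-in rule for
       elements keeps recursing into child nodes -->
  <xsl:template match="section/title">
    <z:index name="section_title" type="w">
      <xsl:value-of select="."/>
    </z:index>
  </xsl:template>

</xsl:stylesheet>
```

No template here spells out the path from the root to a title. The XSLT built-in rules walk the input tree, and output is produced only where a template matches, which is why this style copes with recursion depths that are not known in advance.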
Notice also that the names and types of the indexes can be defined dynamically in the indexing XSLT stylesheet, according to the content of the original XML records. This opens opportunities for great power and wizardry, as well as grand disaster.
The following excerpt of a push stylesheet might be a good idea if you have strict control of the XML input format (through rigorous checking against well-defined and tight RelaxNG or XML Schema definitions, for example):
<xsl:template name="element-name-indexes">
  <z:index name="{name()}" type="w">
    <xsl:value-of select="'1'"/>
  </z:index>
</xsl:template>
This template creates indexes named after the current node of any input XML file, and assigns a '1' to the index. The example query

  find @attr 1=xyz 1

finds all files which contain at least one xyz XML element. If you cannot control which element names the input files contain, you are asking for disaster and bad karma with this technique.
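Being a named template, element-name-indexes produces nothing until it is called. One way to keep the set of generated index names bounded is to call it only for a fixed list of elements. The following is a sketch under that assumption; the element names title, creator and subject are placeholders, not names taken from any particular input format.

```xml
<!-- call the named template only for a known set of element names
     (title, creator and subject are hypothetical placeholders) -->
<xsl:template match="title | creator | subject">
  <!-- the current node is unchanged inside the called template,
       so name() there yields the name of the matched element -->
  <xsl:call-template name="element-name-indexes"/>
  <!-- continue into child elements -->
  <xsl:apply-templates/>
</xsl:template>
```

With this restriction, at most three index names can ever be created, whatever else the input files contain.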
One variation on the theme of dynamically created indexes is definitely unwise:
<!-- match on oai xml record root -->
<xsl:template match="/">
  <z:record>

    <!-- create dynamic index name from input content -->
    <xsl:variable name="dynamic_content">
      <xsl:value-of select="oai:record/oai:header/oai:identifier"/>
    </xsl:variable>

    <!-- create zillions of indexes with unknown names -->
    <z:index name="{$dynamic_content}" type="w">
      <xsl:value-of select="oai:record/oai:metadata/oai_dc:dc"/>
    </z:index>

  </z:record>
</xsl:template>
Don't be tempted to cross the line to the dark side of the force, Padawan; this leads to suffering and pain, and universal disintegration of your project schedule.
An exchange format can be anything which can be the outcome of an XSLT transformation, as long as the stylesheet is registered in the main Alvis XSLT filter configuration file, see Section 1, “ALVIS Record Filter”. In principle, anything that can be expressed in XML, HTML, or TEXT can be the output of a schema or element set directive during search, as long as the information comes from the original input record XML DOM tree (and not the transformed and indexed XML!!).
In addition, internal administrative information from the Zebra indexer can be accessed during record retrieval. The following example is a summary of the possibilities:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:z="http://indexdata.dk/zebra/xslt/1"
                version="1.0">

  <!-- register internal zebra parameters -->
  <xsl:param name="id" select="''"/>
  <xsl:param name="filename" select="''"/>
  <xsl:param name="score" select="''"/>
  <xsl:param name="schema" select="''"/>

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- use them for display of internal information -->
  <xsl:template match="/">
    <z:zebra>
      <id><xsl:value-of select="$id"/></id>
      <filename><xsl:value-of select="$filename"/></filename>
      <score><xsl:value-of select="$score"/></score>
      <schema><xsl:value-of select="$schema"/></schema>
    </z:zebra>
  </xsl:template>

</xsl:stylesheet>
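Because retrieval stylesheets see the original input record, the same parameters can be combined with content pulled from that record. The following is a sketch of a brief display format for the OAI records used earlier. The output element names brief, identifier and title are made up for illustration, and the stylesheet would still have to be registered in the filter configuration as described above.

```xml
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:oai="http://www.openarchives.org/OAI/2.0/"
                xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                xmlns:dc="http://purl.org/dc/elements/1.1/"
                exclude-result-prefixes="oai oai_dc dc"
                version="1.0">

  <!-- internal zebra parameter, as in the example above -->
  <xsl:param name="score" select="''"/>

  <xsl:output indent="yes" method="xml" version="1.0" encoding="UTF-8"/>

  <!-- brief, identifier and title are hypothetical output names -->
  <xsl:template match="/">
    <brief score="{$score}">
      <identifier>
        <xsl:value-of
            select="normalize-space(oai:record/oai:header/oai:identifier)"/>
      </identifier>
      <!-- one title element per dc:title in the original record -->
      <xsl:for-each select="oai:record/oai:metadata/oai_dc:dc/dc:title">
        <title><xsl:value-of select="."/></title>
      </xsl:for-each>
    </brief>
  </xsl:template>

</xsl:stylesheet>
```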
The source code tarball contains a working Alvis filter example in the directory examples/alvis-oai/, which should get you started.
More example data can be harvested from any OAI-compliant server; see the OAI web site at http://www.openarchives.org/ and the community links at http://www.openarchives.org/community/index.html. A tutorial is available at http://www.oaforum.org/tutorial/.