The character map files define the word tokenization and character
normalization performed before text is inserted into the inverted
indexes. Zebra ships with the predefined character map files
tab/*.chr. Users may add or modify maps according to their needs.
Table 10.1. Character maps predefined in Zebra
File name | Intended type | Description |
---|---|---|
numeric.chr | :n | Numeric digit tokenization and normalization map. All characters not in the set -{0-9}., will be suppressed. Note that floating point numbers are processed correctly, but numbers in scientific (exponential) notation are mangled. |
scan.chr | :w or :p | Word tokenization character map for Scandinavian languages. This one resembles the generic word tokenization character map tab/string.chr; the main differences are the sorting of the special characters üzæäøöå and the equivalence maps according to Scandinavian language rules. |
string.chr | :w or :p | General word tokenization and normalization character map, mostly useful for English texts. Use this to derive your own language tokenization and normalization derivatives. |
urx.chr | :u | URL parsing and tokenization character map. |
@ | :0 | Do-nothing character map used for literal binary indexing. There is no file associated with it, and no normalization or tokenization is performed at all. |
The contents of the character map files are structured as follows:
encoding encoding-name
This directive must be at the very beginning of the file, and it
specifies the character encoding used in the entire file. If
omitted, the encoding ISO-8859-1 is assumed.
For example, one of the test files found at
test/rusmarc/tab/string.chr contains the following encoding
directive:
encoding koi8-r
and the test file test/charmap/string.utf8.chr is encoded in UTF-8:
encoding utf-8
lowercase value-set
This directive introduces the basic value set of the field type. The format is an ordered list (without spaces) of the characters which may occur in "words" of the given type. The order of the entries in the list determines the sort order of the index. In addition to single characters, the following combinations are legal:
Backslashes may be used to introduce three-digit octal or
two-digit hex representations of single characters (the latter
preceded by x). In addition, the combinations
\\, \r, \n, \t, and \s (space; remember that real
space characters may not occur in the value definition)
are recognized, with their usual interpretation.
Curly braces {} may be used to enclose ranges of single characters (possibly using the escape convention described in the preceding point), e.g., {a-z} to introduce the standard range of ASCII characters. Note that the interpretation of such a range depends on the concrete representation in your local, physical character set.
Parentheses () may be used to enclose multi-byte characters, e.g., diacritics or special national combinations (such as Spanish "ll"). When found in the input stream (or a search term), these characters are viewed and sorted as a single character, with a sorting value depending on the position of the group in the value statement.
For example, scan.chr contains the following lowercase
normalization and sorting order:
lowercase {0-9}{a-y}üzæäøöå
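To make the notation concrete, the following Python sketch expands a value-set string into an ordered token list and derives a sort rank from each token's position. It is an illustrative model only, not Zebra's parser: the names expand_value_set, read_char, order, and rank are hypothetical, and the handling of malformed input is an assumption.

```python
def expand_value_set(spec: str) -> list[str]:
    """Expand a value-set string into an ordered list of tokens.

    Handles {x-y} ranges, (xyz) multi-character groups, and the
    backslash escapes described in the text. Illustrative only.
    """
    simple = {"r": "\r", "n": "\n", "t": "\t", "s": " ", "\\": "\\"}

    def read_char(i: int) -> tuple[str, int]:
        # Return the character starting at spec[i] and the index after it.
        if spec[i] != "\\":
            return spec[i], i + 1
        nxt = spec[i + 1]
        if nxt == "x":                        # \xHH: two hex digits
            return chr(int(spec[i + 2:i + 4], 16)), i + 4
        if nxt.isdigit():                     # \OOO: three octal digits
            return chr(int(spec[i + 1:i + 4], 8)), i + 4
        return simple.get(nxt, nxt), i + 2    # \n, \s, \{, and so on

    tokens, i = [], 0
    while i < len(spec):
        if spec[i] == "{":                    # range such as {a-z}
            lo, i = read_char(i + 1)
            assert spec[i] == "-", "expected '-' inside a range"
            hi, i = read_char(i + 1)
            assert spec[i] == "}", "expected '}' to close a range"
            tokens.extend(chr(c) for c in range(ord(lo), ord(hi) + 1))
            i += 1
        elif spec[i] == "(":                  # group such as (ll)
            end = spec.index(")", i)
            tokens.append(spec[i + 1:end])
            i = end + 1
        else:                                 # single (possibly escaped) character
            ch, i = read_char(i)
            tokens.append(ch)
    return tokens


# The position of a token in the list is its sort rank.
order = expand_value_set("{0-9}{a-y}üzæäøöå")
rank = {token: position for position, token in enumerate(order)}
```

With the scan.chr list, rank places ü after y and before z, which is the Scandinavian ordering the directive encodes.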
uppercase value-set
This directive introduces the upper-case equivalences to the value
set (if any). The number and order of the entries in the list
should be the same as in the lowercase directive.
For example, scan.chr contains the following uppercase equivalent:
uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
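The requirement that both lists have the same number and order of entries can be pictured as a position-by-position pairing. The Python sketch below models that pairing for the scan.chr lists; it is not Zebra's code, and the names char_range, fold, and normalize are hypothetical.

```python
def char_range(first: str, last: str) -> list[str]:
    """Return the characters from first to last, inclusive."""
    return [chr(c) for c in range(ord(first), ord(last) + 1)]


# The two lists from scan.chr, written out by hand for this sketch.
lowercase = char_range("0", "9") + char_range("a", "y") + list("üzæäøöå")
uppercase = char_range("0", "9") + char_range("A", "Y") + list("ÜZÆÄØÖÅ")

# Both lists must line up entry by entry for the pairing to make sense.
assert len(lowercase) == len(uppercase)

# Pair each upper-case entry with the lower-case entry at the same position.
fold = dict(zip(uppercase, lowercase))


def normalize(text: str) -> str:
    """Fold upper-case characters; leave everything else unchanged."""
    return "".join(fold.get(ch, ch) for ch in text)
```

If the two lists were misaligned, every entry after the mismatch would fold to the wrong character, which is why the order matters as much as the count.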
space value-set
This directive introduces the characters that separate words in
the input stream. Depending on the completeness mode of the field
in question, these characters either terminate an index entry or
delimit individual "words" in the input stream. The order of the
elements is not significant; otherwise the representation is the
same as for the uppercase and lowercase directives.
For example, scan.chr contains the following space instruction:
space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
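As a rough illustration of the word-delimiting case, the Python sketch below splits input on the same character set as the scan.chr space instruction (octal 001 through 040 plus the ASCII punctuation characters). It models only word tokenization; the complete-field case is not covered, and the names SPACE and split_words are hypothetical.

```python
import string

# Control characters and blank (octal 001-040) plus ASCII punctuation,
# mirroring the set listed in the scan.chr space instruction.
SPACE = {chr(c) for c in range(0o001, 0o040 + 1)} | set(string.punctuation)


def split_words(text: str) -> list[str]:
    """Split text into words, treating every SPACE member as a delimiter."""
    words, current = [], []
    for ch in text:
        if ch in SPACE:
            if current:                 # a delimiter closes the current word
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:                         # flush the final word
        words.append("".join(current))
    return words
```

Runs of delimiters collapse in this sketch, so no empty words are produced.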
map value-set target
This directive maps each member of the value-set on the left to
the character on the right. The character on the right must occur
in the value set (the lowercase directive) of the character set,
but it may be a parenthesis-enclosed multi-octet character. This
directive may be used to map diacritics to their base characters,
or to map HTML-style character representations to their natural
form, etc. The map directive can also be used to ignore leading
articles in searching and/or sorting, and to perform other special
transformations.
For example, scan.chr contains the following map instructions,
among others, to make sure that HTML entity encoded Danish special
characters are mapped to the equivalent Latin-1 characters:
map (&aelig;) æ
map (&oslash;) ø
map (&aring;) å
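The effect of such entity mappings can be sketched in Python as a table of source sequences and target characters applied to the input text. This is only a model of the outcome, not of how Zebra applies map rules internally; MAP_RULES and apply_map are hypothetical names.

```python
# Source sequence -> target character, one entry per map instruction.
MAP_RULES = {
    "&aelig;": "æ",
    "&oslash;": "ø",
    "&aring;": "å",
}


def apply_map(text: str) -> str:
    """Replace every mapped source sequence with its target character."""
    for source, target in MAP_RULES.items():
        text = text.replace(source, target)
    return text
```

Each target here is a member of the scan.chr lowercase list, in line with the rule that the right-hand side must occur in the value set.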
In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, you can also use the character map files to make Zebra ignore leading articles in sorting records, or when doing complete field searching.
This is done using the map directive in the character map file.
In a nutshell, what you do is map certain sequences of characters,
when they occur at the beginning of a field, to a space. Assuming
that the character "@" is defined as a space character in your
file, you can do:
map (^The\s) @
map (^the\s) @
The effect of these directives is to map either 'the' or 'The', followed by a space character, to a space. The hat ^ character denotes beginning-of-field only when complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just as any other character.
Because the default.idx file can be used to associate different
character maps with different indexing types (and you can create
additional indexing types, should the need arise), it is possible
to specify that leading articles should be ignored either in
sorting, in complete-field searching, or both.
If you ignore certain prefixes in sorting, then these will be eliminated from the index, and sorting will take place as if they weren't there. However, if you set the system up to ignore certain prefixes in searching, then these are deleted both from the indexes and from query terms, when the client specifies complete-field searching. This has the effect that a search for 'the science journal' and 'science journal' would both produce the same results.
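The outcome can be modeled with a short Python sketch. It assumes that a leading space produced by the mapping is dropped from the resulting key, and it uses str.lower() as a stand-in for the uppercase/lowercase folding; both are simplifications, and LEADING_ARTICLE and index_key are hypothetical names.

```python
import re

# Matches 'The ' or 'the ' only at the very beginning of the field,
# like the ^ anchor in the two map directives.
LEADING_ARTICLE = re.compile(r"^(?:The|the) ")


def index_key(field: str, complete_field: bool = True) -> str:
    """Return a simplified index or query key for a field.

    The article is mapped to a space only in complete-field (or sort)
    mode; in word mode '^' has no special meaning, so nothing is removed.
    """
    if complete_field:
        field = LEADING_ARTICLE.sub(" ", field)
    # Assumption: leading space characters do not contribute to the key.
    return field.strip().lower()
```

Because the same transformation is applied to both the indexed field and the query term, 'The Science Journal' and 'science journal' end up with the same key in this model.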
equivalent value-set
This directive introduces equivalence classes of strings for searching purposes only. It's a one-to-many conversion that takes place only during search before the map directive kicks in.
For example, given:
equivalent æä(ae)
a search for äsel will match any of æsel, äsel, and aesel.
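The one-to-many expansion can be sketched in Python as follows. This is a model of the described behavior rather than Zebra's search code; EQUIVALENT and expand_term are hypothetical names, and only a single equivalence class is handled.

```python
from itertools import product

# Members of the class declared by: equivalent æä(ae)
EQUIVALENT = ["æ", "ä", "ae"]


def expand_term(term: str) -> set[str]:
    """Return every variant of term obtained by swapping class members."""
    # Try longer members first so that 'ae' is matched as one unit.
    members = sorted(EQUIVALENT, key=len, reverse=True)
    parts, i = [], 0
    while i < len(term):
        for member in members:
            if term.startswith(member, i):
                parts.append(EQUIVALENT)   # any member may stand here
                i += len(member)
                break
        else:
            parts.append([term[i]])        # ordinary character, kept as is
            i += 1
    return {"".join(choice) for choice in product(*parts)}
```

Under this model, expanding any of the three spellings yields the same set of variants, so each of them finds the same records.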