2. Charmap Files

The character map files are used to define the word tokenization and character normalization performed before inserting text into the inverse indexes. Zebra ships with the predefined character map files tab/*.chr. Users are allowed to add and/or modify maps according to their needs.

Table 10.1. Character maps predefined in Zebra

File name	Intended type	Description
`numeric.chr`	`:n`	Numeric digit tokenization and normalization map. All characters not in the set `-{0-9}.,` will be suppressed. Floating-point numbers are handled correctly, but numbers in scientific (exponential) notation are not.
`scan.chr`	`:w` or `:p`	Word tokenization character map for Scandinavian languages. It resembles the generic word tokenization character map `tab/string.chr`; the main differences are the sort order of the special characters `üzæäøöå` and the equivalence maps, which follow Scandinavian language rules.
`string.chr`	`:w` or `:p`	General word tokenization and normalization character map, mostly useful for English texts. Use this as the starting point for your own language-specific tokenization and normalization maps.
`urx.chr`	`:u`	URL parsing and tokenization character map.
`@`	`:0`	Do-nothing character map used for literal binary indexing. There is no file associated with it, and no normalization or tokenization is performed at all.

The contents of the character map files are structured as follows:

encoding encoding-name

This directive must be at the very beginning of the file, and it specifies the character encoding used in the entire file. If omitted, the encoding ISO-8859-1 is assumed.

For example, one of the test files found at test/rusmarc/tab/string.chr contains the following encoding directive:

         encoding koi8-r

and the test file test/charmap/string.utf8.chr is encoded in UTF-8:

         encoding utf-8

lowercase value-set

This directive introduces the basic value set of the field type. The format is an ordered list (without spaces) of the characters which may occur in "words" of the given type. The order of the entries in the list determines the sort order of the index. In addition to single characters, the following combinations are legal:

Backslashes may be used to introduce three-digit octal or two-digit hex representations of single characters (the latter preceded by x). In addition, the combinations \\, \r, \n, \t, and \s (space; remember that real space characters may not occur in the value definition) are recognized, with their usual interpretation.
Curly braces {} may be used to enclose ranges of single characters (possibly using the escape convention described in the preceding point), e.g., {a-z} to introduce the standard range of ASCII characters. Note that the interpretation of such a range depends on the concrete representation in your local, physical character set.
Parentheses () may be used to enclose multi-byte characters, e.g., diacritics or special national combinations (e.g., Spanish "ll"). When found in the input stream (or a search term), these characters are viewed and sorted as a single character, with a sorting value depending on the position of the group in the value statement.

For example, scan.chr contains the following lowercase normalization and sorting order:

         lowercase {0-9}{a-y}üzæäøöå
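
The value-set notation described above can be illustrated with a small Python sketch. It is not Zebra's parser: the function name `expand_value_set` is hypothetical, escapes inside parentheses are not processed, and the rule that a backslash before any other character yields that character literally is an assumption inferred from the `space` example in `scan.chr`.

```python
def expand_value_set(spec: str) -> list[str]:
    """Expand a value-set string into an ordered list of entries (sketch only)."""
    simple = {"\\": "\\", "r": "\r", "n": "\n", "t": "\t", "s": " "}

    def read_char(i: int) -> tuple[str, int]:
        """Read one (possibly escaped) character at index i; return it and the next index."""
        if spec[i] != "\\":
            return spec[i], i + 1
        nxt = spec[i + 1]
        if nxt == "x":  # \xHH: two hex digits
            return chr(int(spec[i + 2:i + 4], 16)), i + 4
        octal = spec[i + 1:i + 4]
        if len(octal) == 3 and all(c in "01234567" for c in octal):
            return chr(int(octal, 8)), i + 4  # \NNN: three octal digits
        # Named escape, or (assumption) a literal character such as \( or \{
        return simple.get(nxt, nxt), i + 2

    entries: list[str] = []
    i = 0
    while i < len(spec):
        if spec[i] == "{":  # {a-z}: inclusive range of single characters
            lo, i = read_char(i + 1)
            if spec[i] != "-":
                raise ValueError("expected '-' inside range")
            hi, i = read_char(i + 1)
            if spec[i] != "}":
                raise ValueError("expected '}' to close range")
            i += 1
            entries.extend(chr(cp) for cp in range(ord(lo), ord(hi) + 1))
        elif spec[i] == "(":  # (ll): one entry made of several characters
            end = spec.index(")", i)
            entries.append(spec[i + 1:end])
            i = end + 1
        else:
            ch, i = read_char(i)
            entries.append(ch)
    return entries


scan_lowercase = expand_value_set("{0-9}{a-y}üzæäøöå")
```

Under these assumptions, the `scan.chr` line expands to 42 entries, with `ü` placed between `y` and `z`; that position in the list is what determines the sort order.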

uppercase value-set

This directive introduces the upper-case equivalences to the value set (if any). The number and order of the entries in the list should be the same as in the lowercase directive.

For example, scan.chr contains the following uppercase equivalent:

         uppercase {0-9}{A-Y}ÜZÆÄØÖÅ
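
The positional pairing of the two directives can be modelled as follows. This is a minimal sketch, not Zebra's implementation: the names `fold`, `rank`, and `sort_key` are illustrative, the two lists are written out by hand rather than parsed from the file, and characters outside the value set are simply dropped.

```python
import string

# Entries of the scan.chr lowercase and uppercase lines, in file order.
lower = list(string.digits + string.ascii_lowercase[:25] + "üzæäøöå")
upper = list(string.digits + string.ascii_uppercase[:25] + "ÜZÆÄØÖÅ")
assert len(lower) == len(upper)  # the lists must line up one-to-one

# The Nth uppercase entry folds to the Nth lowercase entry.
fold = dict(zip(upper, lower))

# The position in the lowercase list is the sort rank.
rank = {ch: pos for pos, ch in enumerate(lower)}


def sort_key(word: str) -> list[int]:
    """Fold to lowercase, then translate each character to its rank."""
    folded = (fold.get(ch, ch) for ch in word)
    return [rank[ch] for ch in folded if ch in rank]


words = ["Zebra", "Åre", "Ægir", "yak", "Über"]
ordered = sorted(words, key=sort_key)
```

With this ordering, `Über` sorts after `yak` and before `Zebra`, and `Åre` sorts last, which follows from the order of the entries in the `lowercase` line.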

space value-set

This directive introduces the characters which separate words in the input stream. Depending on the completeness mode of the field in question, these characters either terminate an index entry, or delimit individual "words" in the input stream. The order of the elements is not significant; otherwise the representation is the same as for the uppercase and lowercase directives.

For example, scan.chr contains the following space instruction:

         space {\001-\040}!"#$%&'\()*+,-./:;<=>?@\[\\]^_`\{|}~
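
As an illustration of the word-splitting case only, the following Python sketch treats the characters from the `scan.chr` line above as separators. The name `split_words` is hypothetical, and the behaviour for complete-field indexing is not modelled here.

```python
import string

# Control characters \001-\040 (octal) plus ASCII punctuation, which is the
# same set of characters as in the scan.chr space line.
SPACE_CHARS = {chr(cp) for cp in range(0o001, 0o040 + 1)} | set(string.punctuation)


def split_words(text: str) -> list[str]:
    """Split text into words at any run of space characters."""
    words: list[str] = []
    current: list[str] = []
    for ch in text:
        if ch in SPACE_CHARS:
            if current:  # a separator ends the current word, if any
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


tokens = split_words("Hans Christian Andersen: Eventyr, 1835")
```

Runs of adjacent separators, such as the colon followed by a blank, produce no empty words in this sketch.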

map value-set target

This directive introduces a mapping between each of the members of the value-set on the left to the character on the right. The character on the right must occur in the value set (the lowercase directive) of the character set, but it may be a parenthesis-enclosed multi-octet character. This directive may be used to map diacritics to their base characters, or to map HTML-style character-representations to their natural form, etc. The map directive can also be used to ignore leading articles in searching and/or sorting, and to perform other special transformations.

For example, scan.chr contains the following map instructions among others, to make sure that HTML entity encoded Danish special characters are mapped to the equivalent Latin-1 characters:

         map (&aelig;)      æ
         map (&oslash;)     ø
         map (&aring;)      å
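
The effect of these three directives can be sketched in Python as a simple substitution pass. The names `MAPS` and `apply_maps` are illustrative, and trying longer sources first is an assumption of this sketch rather than documented Zebra behaviour.

```python
# Source sequence -> target character, mirroring the three map lines above.
MAPS = {"&aelig;": "æ", "&oslash;": "ø", "&aring;": "å"}


def apply_maps(text: str, maps: dict[str, str] = MAPS) -> str:
    """Replace every occurrence of a mapped sequence with its target."""
    sources = sorted(maps, key=len, reverse=True)  # assumption: longest first
    out: list[str] = []
    i = 0
    while i < len(text):
        for src in sources:
            if text.startswith(src, i):
                out.append(maps[src])
                i += len(src)
                break
        else:  # no mapped sequence starts here; copy the character unchanged
            out.append(text[i])
            i += 1
    return "".join(out)


normalized = apply_maps("r&oslash;dgr&oslash;d med fl&oslash;de")
```

After this pass, entity-encoded input and input typed with the native characters yield the same text.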

In addition to specifying sort orders, space (blank) handling, and upper/lowercase folding, you can also use the character map files to make Zebra ignore leading articles in sorting records, or when doing complete field searching.

This is done using the map directive in the character map file. In a nutshell, what you do is map certain sequences of characters, when they occur in the beginning of a field, to a space. Assuming that the character "@" is defined as a space character in your file, you can do:

	 map (^The\s) @
	 map (^the\s) @

The effect of these directives is to map either 'the' or 'The', followed by a space character, to a space. The hat ^ character denotes beginning-of-field only when complete-subfield indexing or sort indexing is taking place; otherwise, it is treated just as any other character.

Because the default.idx file can be used to associate different character maps with different indexing types -- and you can create additional indexing types, should the need arise -- it is possible to specify that leading articles should be ignored either in sorting, in complete-field searching, or both.

If you ignore certain prefixes in sorting, then these will be eliminated from the index, and sorting will take place as if they weren't there. However, if you set the system up to ignore certain prefixes in searching, then these are deleted both from the indexes and from query terms, when the client specifies complete-field searching. This has the effect that a search for 'the science journal' and 'science journal' would both produce the same results.
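
The following Python sketch shows why both queries end up matching the same entries. It is a simplified model, not Zebra's code: the name `complete_field_key` is hypothetical, the article is replaced by a blank that is then collapsed together with other blanks, and `str.lower()` stands in for the folding defined by the `uppercase` directive.

```python
ARTICLES = ("The ", "the ")  # corresponds to map (^The\s) and map (^the\s)


def complete_field_key(field: str) -> str:
    """Normalize a whole field, ignoring one leading article."""
    for article in ARTICLES:
        if field.startswith(article):  # ^ anchors at the beginning of the field
            field = " " + field[len(article):]  # article mapped to a space
            break
    # Assumption: leading blanks are dropped and runs of blanks collapse.
    return " ".join(field.split()).lower()


indexed = complete_field_key("The Science Journal")
queried = complete_field_key("science journal")
```

Because the same normalization is applied to the indexed field and to the query term, an article in the middle of a field, as in "Into the Woods", is left alone.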

equivalent value-set

This directive introduces equivalence classes of strings for searching purposes only. It is a one-to-many conversion that takes place only at search time, before the map directive is applied.

For example, given:

         equivalent æä(ae)

a search for äsel will match any of æsel, äsel, and aesel.
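
The one-to-many expansion can be sketched in Python as follows. The function `expand_term` and its arguments are hypothetical and only model the behaviour described above; how Zebra evaluates the expanded term internally is not shown here.

```python
from itertools import product


def expand_term(term: str, classes: list[list[str]]) -> set[str]:
    """Return all variants of term obtained by swapping equivalent strings."""
    slots: list[list[str]] = []
    i = 0
    while i < len(term):
        for members in classes:
            # Try longer members first so that "ae" is seen before "a".
            candidates = sorted(members, key=len, reverse=True)
            hit = next((m for m in candidates if term.startswith(m, i)), None)
            if hit is not None:
                slots.append(members)  # any member of the class may appear here
                i += len(hit)
                break
        else:  # character belongs to no class; keep it as-is
            slots.append([term[i]])
            i += 1
    return {"".join(combo) for combo in product(*slots)}


# equivalent æä(ae)
variants = expand_term("äsel", [["æ", "ä", "ae"]])
```

Starting from `aesel` or `æsel` instead yields the same three variants, since every member of the class expands to the whole class.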