Section

Name

Section -- defines a document section

indexer.conf search.htm

Synopsis

Section {name} {number} [ {maxlen} [datatype] [when] [format] [cloneflag] [separator] [ {source} {pattern} {replacement} ] ]

Description

When used in search.htm, the Section command requires only the first three parameters and activates recognition of section name references in search queries, for example:


title:word1 body:word2
    
See the section "Restricting search words to a section" in Chapter 11 for details. The Section command serves no other purpose in search.htm. The rest of this article applies mainly to indexer.conf.

name is the section name and number is the section ID, between 0 and 255. Use 0 if you don't want the section to be indexed.

Note: It is recommended to use different section IDs for different document parts. This makes it possible to set different weights for different document parts, as well as to restrict a search to a section at search time.

The maxlen argument specifies the maximum length of the section that is stored in the database. If maxlen is set to 0, the section is not stored in the database and is therefore not available at search time through the methods RESULT::document_property() and RESULT::document_property_html().

The datatype parameter is optional. If the parameter is omitted, then the words of this section are treated as usual words, i.e. they are stored and compared lexicographically.

If the datatype is set to decimal, then the words of this section are treated as decimal numbers with up to 9 integral digits and up to 9 fractional digits. The words of this section are stored as 18-digit words in the format IIIIIIIIIFFFFFFFFF, where IIIIIIIII is the integral part left-padded with zeros, and FFFFFFFFF is the fractional part right-padded with zeros.
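
The padding scheme can be illustrated with a short Python sketch. The helper name encode_decimal_word is hypothetical and is not part of mnoGoSearch. The sketch also assumes non-negative input, because the description above does not say how signs are handled. It is meant only to show why the fixed-width form lets numbers be compared as plain strings.

```python
def encode_decimal_word(text):
    """Return the 18-digit IIIIIIIIIFFFFFFFFF form of a non-negative decimal string.

    Hypothetical helper for illustration; not a mnoGoSearch API.
    """
    text = text.strip()
    if not text:
        raise ValueError("empty input")
    integral, _, fractional = text.partition(".")
    # Drop redundant leading zeros so "007" counts as one integral digit.
    integral = integral.lstrip("0") or "0"
    for part in (integral, fractional):
        if part and not (part.isascii() and part.isdigit()):
            raise ValueError(f"not a plain decimal number: {text!r}")
    if len(integral) > 9 or len(fractional) > 9:
        raise ValueError(f"more than 9 integral or fractional digits: {text!r}")
    # Integral part is left-padded, fractional part is right-padded, both with zeros.
    return integral.rjust(9, "0") + fractional.ljust(9, "0")
```

With this sketch, 12.5 becomes 000000012500000000. The encoded form of 9.5 sorts before the encoded form of 10.25, whereas the raw strings "9.5" and "10.25" sort the other way round when compared lexicographically.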

when is an optional parameter defining when the section is to be created. The following values are possible:

format is a flag telling indexer which parser to use for the section. Two values are understood: text and html.

The format parameter is designed for use with the simple type of HTDBDoc queries (i.e. those consisting of a list of data columns, without full HTTP headers). The default value is text. If your SQL table contains data in HTML format, specify html to have the HTML tags removed. See the section "Indexing SQL tables (htdb:/ virtual URL scheme)" in Chapter 6 for details about simple HTDBDoc queries.

The cloneflag parameter specifies whether the section affects clone detection. It can be DetectClone (or cdon) or NoDetectClone (or cdoff). By default, url.* section values (i.e. the various URL parts) are not taken into account for clone detection, while all other sections are.

separator is a string that separates consecutive chunks of the same section.

User-defined sections

The source, pattern and replacement parameters can be used to extract user-defined sections.

source can include variable references using ${VARNAME} syntax. Multiple variable references are allowed.

pattern represents a regular expression to specify which parts of source should go to the section.

replacement defines how the extracted parts of source are combined into the result. replacement can contain references of the form $n, where n is a number in the range 0-9. Every reference is replaced with the text captured by the n-th parenthesized sub-pattern. $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to obtain the number of the capturing sub-pattern.


# Use a combination of URL and raw body content to extract
# the host part of URL and title into the section "udef"
Section HTTP.Content 0 0
Section udef  1 256 cdoff  "" "${URL}:${HTTP.Content}" "^http://([^/]*)/.*<title>(.*)</title>" "$1 $2"
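
The way source, pattern and replacement interact can be mimicked in Python. This is only a sketch: the function names expand_replacement and extract_section are hypothetical, Python's re module is not the regular expression engine used by indexer, and expanding a $n reference with no matching sub-pattern to an empty string is an assumption rather than documented behavior.

```python
import re

def expand_replacement(replacement, match):
    """Replace every $n (n = 0..9) in `replacement` with the text of group n."""
    def group_text(ref):
        n = int(ref.group(1))
        # Assumption: a reference to a missing or unmatched group expands to "".
        if n > match.re.groups:
            return ""
        return match.group(n) or ""
    return re.sub(r"\$([0-9])", group_text, replacement)

def extract_section(source, pattern, replacement):
    """Return the section text, or None when `pattern` does not match `source`."""
    match = re.search(pattern, source)
    if match is None:
        return None
    return expand_replacement(replacement, match)

# The already expanded "${URL}:${HTTP.Content}" source from the example above,
# with a made-up URL and document body.
source = "http://www.example.com/index.html:<html><title>Hello</title></html>"
pattern = r"^http://([^/]*)/.*<title>(.*)</title>"
print(extract_section(source, pattern, "$1 $2"))  # prints "www.example.com Hello"
```

In this sketch, a source that the pattern does not match yields no section at all (None) instead of an empty one.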
    

Conditional sections

The source, pattern and replacement arguments can also be used to create sections only under certain conditions:


# Create "body" only for the given host name
Section HTTP.Content 0 0
Section body  1 256 cdoff "" "${URL}:${HTTP.Content}" "^http://www.mysite.com/.*<body>(.*)</body>" "$1"
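
The condition can be pictured as a pattern that matches only documents from the wanted host. The following Python sketch is an illustration under assumptions: body_section is a hypothetical name, Python's re module stands in for the regular expression engine used by indexer, and the dots in the host name are escaped here so that they match literal dots only.

```python
import re

# Host-restricted pattern, applied to "URL:content" as in the example above.
BODY_PATTERN = re.compile(r"^http://www\.mysite\.com/.*<body>(.*)</body>")

def body_section(url, content):
    """Return the body text for documents on www.mysite.com, otherwise None."""
    match = BODY_PATTERN.search(f"{url}:{content}")
    return match.group(1) if match else None

page = "<html><body>Hi</body></html>"
print(body_section("http://www.mysite.com/a.html", page))     # prints "Hi"
print(body_section("http://other.example.org/a.html", page))  # prints "None"
```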
    

Special purpose sections

Examples


# Standard document sections
Section body                    1
Section title                   2
Section meta.keywords           3
Section meta.description        4

# Incoming link text
Section ilinktext               5

# URL parts
Section url.file                6
Section url.path                7
Section url.host                8
Section url.proto               9

# Useful meta information
Section Charset                 10
Section Content-Type            11
Section Content-Language        12

# Message/rfc822 headers
Section msg.from                15
Section msg.to                  16
Section msg.subject             17


# MP3 tags
Section MP3.Song                25
Section MP3.Album               26
Section MP3.Artist              27
Section MP3.Year                28

# HTML tag attributes
Section attribute.alt           35
Section attribute.label         36
Section attribute.summary       37
Section attribute.title         38
Section attribute.face          39

# A user-defined section
Section h1                      40      128 "<h1>(.*)</h1>" $1

# User-defined date extracted from the "Date" meta-tag
Section User.Date               0       10 '<META NAME="Date" +CONTENT="([^"]*)">' "$1"

# Replacing Content-Type with application/msword
Section Content-Type            0       64 afterheaders cdoff "" "${URL}" "^http://site/.*[.]doc$" "application/msword"

# Using "afterguesser" in conjunction with ${HTTP.LocalCharsetContent}
Section HTTP.LocalCharsetContent 0      0
Section h1lcs                   41      128 afterguesser cdoff "" "${HTTP.LocalCharsetContent}" "<h1>(.*)</h1>" $1

# Using a simple HTDBDoc query for a SQL table with text and HTML columns
Section column1 1 256 text
Section column2 2 256 html
      

See also

MaxDocSize, MaxWordLength, MinWordLength, UseLocalCachedCopy.