indexer configuration

Specifying the Web space for indexing

When indexer finds a new URL during crawling, it checks whether the URL has a corresponding Web space definition command (Server, Realm or Subnet) in indexer.conf. URLs that do not have a corresponding Web space definition command are not added to the database. Also, URLs that are already present in the search database but no longer have a corresponding Web space definition command are deleted from the database. This can happen after some of the definition commands have been removed from indexer.conf.

The Web space definition commands have the following format:


Server [Method] [SubSection] <pattern> [alias]
Realm [Method] [CaseType] [MatchType] [CmpType] <pattern> [alias]
Subnet [Method] [MatchType] <pattern>

The mandatory parameter pattern specifies a URL, a part of a URL, or a pattern.

The optional parameter Method specifies the action for this command. It can take one of the following values: Allow, Disallow, HrefOnly, CheckOnly, Skip, CheckMP3, CheckMP3Only. By default, the value Allow is used.

  1. Allow

    Allow specifies that all corresponding documents will be indexed and scanned for new links. Depending on Content-Type, an external parser can be executed if needed.

  2. Disallow

    Disallow specifies that all corresponding documents will be ignored and deleted from the database. A combined sketch using this method is shown after this list.

  3. HrefOnly

    HrefOnly specifies that all corresponding documents will only be scanned for new links, but their content won't be indexed. This is useful, for example, when indexing mail archives, where the index pages are only scanned for links to new messages.

    
    Server HrefOnly Page http://www.mail-archive.com/general%40mnogosearch.org/
    Server Allow    Path http://www.mail-archive.com/general%40mnogosearch.org/
    

  4. CheckOnly

    CheckOnly specifies that all corresponding documents will be requested using the HEAD HTTP method instead of the default GET method. When using CheckOnly, only brief information about the documents (such as size, last modification time, content type) will be fetched. This method can be helpful to detect broken links on your site. For example:

    
    Server HrefOnly  http://www.mnogosearch.org/
    Realm  CheckOnly *
    

    These commands instruct indexer to scan all documents on the site www.mnogosearch.org and collect all outgoing links. Brief information about every document outside www.mnogosearch.org will be requested using the HEAD method. After indexing is done, use the indexer -S command to see whether there are any pages with the status 404 Not found.

  5. Skip

    Skip specifies that all corresponding documents will be skipped during indexing. This is useful when you need to temporarily disable reindexing of some sites while keeping them available for search with their previous content. indexer will mark these documents as "fresh" and put them at the end of its queue.

  6. CheckMP3

    CheckMP3 specifies that the corresponding documents will be checked for MP3 tags even if their Content-Type is not audio/mpeg. This is useful when the remote server sends application/octet-stream as the Content-Type for MP3 files. If MP3 tags are found in a document, they are indexed; otherwise, the document is processed further according to its Content-Type.

  7. CheckMP3Only

    This method is very similar to CheckMP3, but if no MP3 tags are found in a document, the document is not processed further.
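
The Allow, Disallow and Skip methods are often combined. The following sketch (the site name and file pattern are hypothetical, not taken from a real configuration) indexes a site, excludes its GIF images and temporarily stops reindexing of an archive area; the more specific commands come first, in line with the ordering rule described in the Section called Using different parameters for a server and its subsections:


Realm  Disallow      *.gif
Server Skip   path   http://site.example.com/archive/
Server Allow         http://site.example.com/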

The optional SubSection parameter specifies the pattern match method, which can be one of the following values: page, path, site, world, with path being the default.

  1. Server path

    All URLs from the same directory match. For example, if Server path http://localhost/path/to/index.html is given, all URLs starting with http://localhost/path/to/ will match this command.

    The following commands have the same effect when searching for a matching Web space definition command:

    
    Server path http://localhost/path/to/index.html
    Server path http://localhost/path/to/index
    Server path http://localhost/path/to/index.cgi?q=bla
    Server path http://localhost/path/to/index?q=bla
    

  2. Server site

    All URLs from the same host match. For example, Server site http://localhost/path/to/a.html allows indexing of the entire site http://localhost/.

  3. Server world

    If the world subsection is specified, absolutely any URL will correspond to this definition command. See the explanation below.

  4. Server page

    Means exact match: only the given URL will match this command. A small sketch combining page with CheckOnly is given after this list.

  5. subsection in news:// schema

    The subsection is always considered site for the news:// URL schema, because unlike ftp:// or http://, the news:// schema has no recursive paths. Use Server news://news.server.com/ to index the whole news server or, for example, Server news://news.server.com/udm to index all messages from the /udm hierarchy.
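
As a small sketch (the host name is hypothetical), the page subsection can be combined with the CheckOnly example shown earlier: the first command matches exactly one document and collects the links from it, while the second one checks the status of every outgoing link with a HEAD request:


Server HrefOnly page http://site.example.com/links.html
Realm  CheckOnly *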

The optional parameter CaseType specifies case sensitivity for string comparison. It can take one of the following values: case - case insensitive comparison, or nocase - case sensitive comparison.
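
The following sketch (the host name is hypothetical) only illustrates where CaseType is placed in a Realm command; per the format above, it comes after the optional Method and before the pattern:


# CaseType controls letter-case handling in the string comparison (see above)
Realm nocase http://site.example.com/Documents/*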

The optional parameter CmpType specifies the comparison type and can take two values: Regex and String. String wildcards are the default match type. You can use the ? and * signs in the pattern; they mean "one character" and "any number of characters" respectively. For example, if you want to index all HTTP sites in the .ru domain, you can use this command:


Realm http://*.ru/*

The Regex comparison type means that the pattern is a regular expression. For example, you can describe everything in the .ru domain using the Regex comparison type:


Realm Regex ^http://.*\.ru/

The optional parameter MatchType can be Match or NoMatch, with Match being the default. Realm NoMatch has the reverse effect: URLs that do not match the given pattern will correspond to this Realm command. For example, use this command to index everything except the .com domain:


Realm NoMatch http://*.com/*

The optional alias argument provides URL rewrite rules, described in detail in the Section called Aliases.

Using different parameters for a server and its subsections

indexer examines the Web space definition commands in the order of their appearance in indexer.conf. Thus, if you want to give different parameters to a site and its subsections, add the command describing a subsection before the command describing the entire site. Imagine that you have a subdirectory containing news articles and want those articles to be reindexed more often than the rest of the site. The following combination can be useful in this case:


# Add subsection
Period 200000
Server http://servername/news/

# Add server
Period 600000
Server http://servername/

These commands give different reindexing periods for the /news/ subdirectory and the rest of the site. indexer will choose the first command for the URL http://servername/news/page1.html.

The default indexer behavior

The default behavior of indexer is to follow the links it finds, provided that they have corresponding Web space definition commands in the indexer.conf file. indexer jumps between sites if both of them have corresponding Web space definition commands. For example, suppose there are two commands:


Server http://www/
Server http://web/

When indexing http://www/page1.html, indexer WILL follow the link http://web/page2.html. Note that although these pages are on different sites, BOTH of them have a corresponding Web space definition command.

If we delete one of the commands, indexer will remove all expired URLs from this server during the next crawling sessions.

Aliases

mnoGoSearch offers a flexible mechanism of aliases and reverse aliases, which makes it possible to index sites while downloading documents from another location. For example, if you index your local web server, it is possible to load pages directly from the hard disk without involving the web server in the crawling process. Another example is building a search engine for a primary site while using its mirror to download the documents.

Different ways of using aliases are described in the next sections.

The Alias indexer.conf command

The Alias indexer.conf command uses this format:


Alias <masterURL> <mirrorURL>

For example, if you wish to index http://www.mnogosearch.ru/ using the nearest German mirror http://www.gstammw.de/mirrors/mnoGoSearch/, you can add these lines into your indexer.conf:


Server http://www.mnogosearch.ru/
Alias  http://www.mnogosearch.ru/  http://www.gstammw.de/mirrors/mnoGoSearch/

When crawling, indexer will download the documents from the mirror site http://www.gstammw.de/mirrors/mnoGoSearch/. At search time search.cgi will display URLs from the master site http://www.mnogosearch.ru/.

Another example: You want to index all sites from the domain udm.net. Suppose one of the servers (e.g. http://home.udm.net/) is stored on the local machine in the directory /home/httpd/htdocs/. These commands will be useful:


Realm http://*.udm.net/
Alias http://home.udm.net/ file:///home/httpd/htdocs/
    

indexer will load documents for the site home.udm.net from the local disk, and will use HTTP for the other sites.

Using different aliases for server parts

Aliases are searched in the order of their appearance in indexer.conf. So, you can create different aliases for a server and its parts:


# First, create alias for example for /stat/ directory which
# is not under common location:
Alias http://home.udm.net/stat/  file:///usr/local/stat/htdocs/

# Then create alias for the rest of the server:
Alias http://home.udm.net/ file:///usr/local/apache/htdocs/

Note: If you change the order of these commands, the alias for the directory /stat/ will never be found.

Using aliases in the Server command

You can specify the location used by indexer as an optional argument in a Server command:


Server  http://home.udm.net/  file:///home/httpd/htdocs/

Using aliases in the Realm command

Aliases in the Realm command are based on regular expressions. The implementation of this feature is similar to PHP's preg_replace() function. Aliases in the Realm command work only with the regex match type and do not work with the string match type.

Use this syntax for Realm aliases:


Realm regex <URL_pattern> <alias_pattern>

When indexer finds a URL matching URL_pattern, it builds an alias using alias_pattern. alias_pattern can contain references of the form $n, where n is a number in the range 0-9. Every reference is replaced with the text captured by the n-th parenthesized sub-pattern; $0 refers to the text matched by the whole pattern. Opening parentheses are counted from left to right (starting from 1) to determine the number of a capturing sub-pattern.

Example: your company hosts a few hundred users with their own domains in the form of www.username.yourname.com. All user sites are stored on the disk in the subdirectory /htdocs under their home directories: /home/username/htdocs/.

You can write this command into indexer.conf (note that the dot '.' character has a special meaning in regular expressions and should be escaped with a '\' sign when dot is used in its literal meaning):


Realm regex (http://www\.)(.*)(\.yourname\.com/)(.*)  file:///home/$2/htdocs/$4

Imagine that indexer processes a document located at http://www.john.yourname.com/news/index.html. The following sub-patterns will be captured:


   $0 = http://www.john.yourname.com/news/index.html (the whole pattern match)
   $1 = http://www.      - subpattern matching (http://www\.)
   $2 = john             - subpattern matching (.*)
   $3 = .yourname.com/   - subpattern matching (\.yourname\.com/)
   $4 = news/index.html  - subpattern matching (.*)

After the matches are found, the sub-patterns $2 and $4 are substituted into alias_pattern, which results in this alias:


file:///home/john/htdocs/news/index.html

The AliasProg command

AliasProg can be useful for a web hosting company that indexes its customers' web sites by loading documents directly from the disk, without involving the HTTP server in the crawling process (to offload the server). The document layout may be too complex to describe with the Server or Realm commands. AliasProg defines an external program that is executed with a URL as its command line argument and prints the corresponding alias to STDOUT. Use $1 in the command line to refer to the URL.

The command in this example uses the replace program from the MySQL distribution to replace the URL substring http://www.apache.org/ with file:///usr/local/apache/htdocs/:


AliasProg  "echo $1 | /usr/local/mysql/bin/mysql/replace http://www.apache.org/ file:///usr/local/apache/htdocs/"
    

You can write your own program, in any preferred programming language, that converts URLs into their aliases.
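
For example, a sed-based rewrite could look like this (a sketch only; the host name and local path are hypothetical):


AliasProg "echo $1 | sed -e 's|^http://www\.example\.com/|file:///var/www/htdocs/|'"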

The ReverseAlias command

The ReverseAlias indexer.conf command allows mapping URLs before a URL is inserted into the database. Unlike the Alias command, which performs mapping right before a document is downloaded, the ReverseAlias command performs mapping immediately after a new link is found.


ReverseAlias http://name2/   http://name2.yourname.com/
Server       http://name2.yourname.com/
    

In the above example, all links with the short server name will be converted to links with the full server name and will be put into the database after the conversion.

Another possible use of the ReverseAlias is stripping off various undesired query string parameters like PHPSESSID=XXXX.

The following example strips the PHPSESSID=XXXX part from URLs like http://www/a.php?PHPSESSID=XXX, when PHPSESSID is the only query string parameter. The question mark is deleted as well:


ReverseAlias regex  (http://[^?]*)[?]PHPSESSID=[^&]*$          $1
    

The next example strips the PHPSESSID=XXXX part from URLs like http://www/a.php?PHPSESSID=xxx&..., that is, when PHPSESSID=XXXX is the very first query string parameter and is followed by other parameters. The ampersand sign & after the PHPSESSID=XXXX part is deleted as well; the question mark ? is not deleted:


ReverseAlias regex  (http://[^?]*[?])PHPSESSID=[^&]*&(.*)      $1$2

The last example strips the PHPSESSID=XXXX part from URLs like http://www/a.php?a=b&PHPSESSID=xxx or http://www/a.php?a=b&PHPSESSID=xxx&c=d, where PHPSESSID=XXXX is not the first parameter. The ampersand sign & before PHPSESSID=XXXX is deleted:


ReverseAlias regex  (http://.*)&PHPSESSID=[^&]*(.*)         $1$2
    

Search-time aliases in search.htm

It is also possible to define aliases in the search template (search.htm). The Alias command in search.htm is identical to the one in indexer.conf, but is applied at search time rather than during crawling.

The syntax of the Alias command in search.htm is similar to indexer.conf:


Alias <find-prefix> <replace-prefix>
    

Suppose your search.htm has the following command:


Alias http://localhost/ http://www.mnogo.ru/
    

When search.cgi returns a page with the URL http://localhost/news/article10.html, the URL will be replaced with http://www.mnogo.ru/news/article10.html.

Note: When you need aliases, you can put them either into indexer.conf (to convert the remote notation to the local notation at crawling time) or into search.htm (to convert the local notation to the remote notation at search time). Use whichever approach is more convenient for you.
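
As an illustration (host names and paths are hypothetical), the two approaches from the note above could look like this:


# indexer.conf: crawl from the local disk, store remote URLs in the database
Server http://www.example.com/
Alias  http://www.example.com/  file:///var/www/htdocs/

# search.htm: crawl the local copy directly, convert to remote URLs at search time
# (in this case indexer.conf would contain: Server file:///var/www/htdocs/)
Alias  file:///var/www/htdocs/  http://www.example.com/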

ServerTable

The quickest way to specify URLs to be indexed by mnoGoSearch is to list them using the Server or Realm commands in the indexer.conf file. However, in some cases users may already have the URLs stored in an SQL database, and it would be much simpler to have mnoGoSearch use this information directly. This can be done using the ServerTable command, which is available in mnoGoSearch starting from version 3.3.7.

When ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo is specified, indexer loads the server records from the given SQL table my_server and the per-server parameters from the table my_srvinfo.

The following sections provide step-by-step instructions on how to create, populate and load Server tables.

Step 1: creating Server table

The tables server and srvinfo that are already present in the mnoGoSearch database are used internally. Do not insert your own URLs into these tables; instead, create your own tables with a similar structure. For example, with MySQL you can do:


CREATE TABLE my_server LIKE server;
CREATE TABLE my_srvinfo LIKE srvinfo;
     

Note: You may find it useful to modify some of the column types, for example, to add the AUTO_INCREMENT attribute to rec_id. However, do not change the column names - mnoGoSearch looks up the columns by their names.
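
For example, with MySQL the modification mentioned in the note might look like this (a sketch only; it assumes rec_id is already the primary key of my_server, as copied from the original server table):


ALTER TABLE my_server MODIFY rec_id int NOT NULL AUTO_INCREMENT;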

Step 2: populating Server table

Now that you have your custom tables, you can load data:


INSERT INTO my_server (rec_id, enabled, command, url) VALUES (1, 1, 'S', 'http://server1/');
INSERT INTO my_srvinfo (srv_id, sname, sval) VALUES (1, 'Period', '30d');

INSERT INTO my_server (rec_id, enabled, command, url) VALUES (2, 1, 'S', 'http://server2/');
INSERT INTO my_srvinfo (srv_id, sname, sval) VALUES (2, 'MaxHops', '3');
    

The columns rec_id, enabled and url must be specified in the INSERT INTO my_server statements.

The columns parent and pop_weight should NOT be specified, as these columns are used by mnoGoSearch internally.

The columns tag, category, ordre, weight can be specified optionally.

my_srvinfo is a child table of my_server. The two tables are joined using the condition my_server.rec_id = my_srvinfo.srv_id.

sname in the table my_srvinfo is the name of a directive that might be specified for the particular URL in indexer.conf. For example, to set a Period of "30d" for the respective URL, insert a record with sname="Period" and sval="30d"; to set MaxHops to "3", insert a record with sname="MaxHops" and sval="3".
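
For example, the following query (a sketch using the table names from this section) lists each URL together with its per-server parameters, joined exactly by the condition mentioned above:


SELECT s.url, i.sname, i.sval
FROM my_server s
JOIN my_srvinfo i ON i.srv_id = s.rec_id;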

The meaning of various columns is explained in the Section called Database schema in Chapter 13.

Note: Look at the data in the srvinfo table for examples of how it is used.

Step 3: loading Server table

Now that you have data in your custom Server tables, you need to specify the new tables in indexer.conf. Just add the following line:


ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo

Note: If the srvinfo parameter is omitted, the parameters are loaded from the table named srvinfo by default.

A quick way to test if your Server table works fine is to insert one or two URLs into the my_server table that do not already exist in your indexer.conf, then run indexer and specify that only the given URLs are to be indexed, e.g.:


./indexer -a -u http://server1/
./indexer -a -u http://server2/
    
If it is working properly, you should see the test URLs being indexed.

Important notes on using Server table

1) You can create as many custom server/srvinfo table pairs as you like, and then specify each pair in the indexer.conf file using a separate ServerTable command with the appropriate values (see the sketch after these notes).

2) Using your own Server table does not stop other URLs that are specified in your indexer.conf from being indexed. indexer will do both. So you can define some non-changing URLs in the indexer.conf file, and put the URLs that tend to come and go into your custom Server table. You can also write some scripts that copy URLs from your own database into your custom Server table used by mnoGoSearch.
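
For example, two pairs could be listed like this (the connection parameters and the second pair of table names are hypothetical):


ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo
ServerTable mysql://user:pass@host/dbname/extra_server?srvinfo=extra_srvinfo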

Server table structure

See the Section called Database schema in Chapter 13 for the meaning of various columns in the Server tables.

FlushServerTable

FlushServerTable sets the active field to inactive for all ServerTable records. Use this command to deactivate all commands in ServerTable before loading new commands from indexer.conf or from another ServerTable.
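
A minimal sketch (assuming FlushServerTable takes no arguments, as the description above suggests): deactivate the previously loaded records first, then load the current set of commands:


FlushServerTable
ServerTable mysql://user:pass@host/dbname/my_server?srvinfo=my_srvinfo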