Chapter 3. Indexing

Table of Contents
Indexing in general
HTTP response codes mnoGoSearch understands
Content-Encoding support
indexer configuration
Using syslog
Disabling Apache logging
Cached copies

Indexing in general

Configuration

Indexer configuration is mostly covered by the indexer.conf-dist file. You can find it in the /etc directory of the mnoGoSearch installation directory. You may also want to take a look at the other *.conf samples in the doc/samples directory of the mnoGoSearch source distribution.

To set up the indexer.conf file, go to the /etc directory of your mnoGoSearch installation, copy indexer.conf-dist to indexer.conf and edit it using a text editor. Typically, the DBAddr command needs to be modified according to your database connection parameters, and a new Server command describing your Web site needs to be added. The other default indexer.conf commands are suitable in most cases and do not need changes. The file indexer.conf is well commented and contains examples for the most important commands, so you should find it easy to configure.
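
For example, a minimal edited indexer.conf could contain just these two commands (the database name, credentials and site URL below are placeholders; replace them with your own values):

DBAddr mysql://user:password@localhost/search/?dbmode=blob
Server http://www.example.com/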

To configure the search front-end search.cgi, copy the file search.htm-dist to search.htm and edit it. Typically, only DBAddr needs to be modified according to your database connection parameters, similarly to indexer.conf. See Chapter 10 for a more detailed description.
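
The DBAddr line in search.htm would typically repeat the one from indexer.conf, for example (placeholder values again):

DBAddr mysql://user:password@localhost/search/?dbmode=blob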

Creating SQL table structure

To create the SQL tables required for mnoGoSearch, use indexer -Ecreate. When started with this argument, indexer opens the file containing the SQL statements necessary to create all SQL tables, according to the database type and storage mode specified in the DBAddr command in indexer.conf. The files with the SQL scripts are typically installed to the /share directory of the mnoGoSearch installation, which is usually /usr/local/mnogosearch/share/mnogosearch/.

Dropping SQL table structure

To drop all SQL tables created by mnoGoSearch, use indexer -Edrop. The files with the SQL statements required to drop all tables previously created by mnoGoSearch are installed in the /share directory of the mnoGoSearch installation.

Note: In some cases, when you need to remove all existing data from the search database and crawl your sites from the very beginning, you can use indexer -Edrop followed by indexer -Ecreate instead of truncating the existing tables (indexer -C). In some databases, recreating the tables works faster than truncating data from the existing tables.
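
For example, the following sequence rebuilds the table structure from scratch:

indexer -Edrop
indexer -Ecreate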

Running indexer

Run indexer periodically (once a week, a day, an hour...), depending on how often the content on your sites changes. You may find it useful to add indexer as a cron job.
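
As a sketch, a crontab entry like the following would start crawling every night at 3 a.m. (the installation path below is an assumption; adjust it to your setup):

0 3 * * * /usr/local/mnogosearch/sbin/indexer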

If you run indexer without any command line arguments, it crawls only new and expired documents; fresh documents are not crawled. You can change the expiration time with the help of the Period command in indexer.conf. The default expiration period is one week. If you need to crawl all documents, including the fresh ones (i.e. without having to wait for their expiration period), use the -a command line option: indexer will mark all documents as expired at startup.
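
For example, to make documents expire after one day instead of the default week, put this command into indexer.conf:

Period 1d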

HTTP redirects

If indexer gets a redirect response (301, 302, 303 HTTP status), the URL from the Location: HTTP header is added to the database.

Note: indexer puts the redirect target into its queue. It does not follow the redirect target immediately after processing a URL with a redirect response.

Crawling time optimization

When downloading documents, indexer tries to do some optimization: it sends the If-Modified-Since HTTP header for the documents it has already downloaded (during previous crawling sessions). If the HTTP server replies "304 Not Modified", then only minor updates are done in the database.

When indexer downloads a document (i.e. when it gets an "HTTP 200 OK" response), it calculates the document checksum using the crc32 algorithm. If the checksum is the same as the previous checksum stored in the database, indexer will not do a full update of the database with the new information about this document. This is also done for optimization purposes, to improve crawling performance.

The -m command line option prevents indexer from sending the If-Modified-Since headers and forces a full update of the database even if the checksum is the same. It can be useful if you have modified indexer.conf: for example, when the Allow and Disallow rules were changed, or new Server commands were added, and therefore you need indexer to parse the old documents once again and add the new links which were ignored under the previous configuration.

Note: Sometimes you may need to force reindexing of some document (or a group of documents), that is, force both downloading the document (even when it has not expired yet) and updating the information about it in the database (even if the checksum has not changed). You may find this command useful:


indexer -am -u http://site/some/document.html

Subsection control

indexer understands the -t, -u and -s command line options to limit actions to only a part of the database: -t forces a limit on Tag, -u forces a limit on the URL substring (using SQL LIKE wildcards), and -s forces a limit on the HTTP status. Each limit option can be specified multiple times. All limit options of the same group are OR-ed, and the groups are AND-ed. For example, if you run indexer -s200 -s304 -u http://site1/% -u http://site2/%, indexer will re-crawl only the documents having HTTP status 200 or 304, and only from the site http://site1/ or the site http://site2/.

Note: The above command line will be internally interpreted into this SQL query when fetching URLs from the queue:


SELECT
  <columns>
FROM
  url
WHERE
  status IN (200,304)
AND
  (url LIKE 'http://site1/%' OR url LIKE 'http://site2/%')
AND
  next_index_time <= <current_time>

How to clear the database

To clear all information from the database, use indexer -C.

By default, indexer asks for confirmation to make sure you really want to delete data from the database:


$ indexer -C
You are going to delete content from the database(s):
pgsql://root@/root/?dbmode=blob
Are you sure?(YES/no)

You can use the -w command line option together with -C to force deleting data without asking for confirmation: indexer -Cw.

You may also delete only a part of the database. All subsection control options are taken into account when deleting data. For example:


indexer -Cw -u http://site/% 

will delete information about all documents from the site http://site/ without asking for confirmation.

Database Statistics

If you run indexer -S, indexer will display the current database statistics, including the number of total and expired documents for each HTTP status:


$ indexer -S

          Database statistics [2008-12-21 15:35:34]

    Status    Expired      Total
   -----------------------------
         0        883        971 Not indexed yet
       200          0        891 OK
       404          0       1585 Not found
   -----------------------------
     Total        883       3447

It is also possible to see database statistics for a certain moment in the future with the help of the -j command line argument, to check when the documents will expire. -j understands time in the format YYYY-MM[-DD[ HH[:MM[:SS]]]], or a time offset from the current time, in the same format as in the Period command. For example, 7d12h means seven days and 12 hours:

$ indexer -S -j 7d12h

          Database statistics [2008-12-29 03:44:19]

    Status    Expired      Total
   -----------------------------
         0        971        971 Not indexed yet
       200        891        891 OK
       404       1585       1585 Not found
   -----------------------------
     Total       3447       3447

From the above output we know that after the given period of time all documents in the database will have expired.

Note: All subsection control options work together with -S.
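
For example, to display statistics for a single site only (the URL is a placeholder):

indexer -S -u http://site1/%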

The meaning of the various status values is given in this list:

  • 0 - a new document (not visited yet)

If the status is not 0, then it is the HTTP response code indexer got when downloading this document. Some of the HTTP codes are:

  • 200 - OK (the document was successfully downloaded)

  • 301 - Moved Permanently (redirect to another URL)

  • 302 - Moved Temporarily (redirect to another URL)

  • 303 - See Other (redirect to another URL)

  • 304 - Not modified (the document has not been modified since last visit)

  • 401 - Authorization required (use login/password for the given URL)

  • 403 - Forbidden (you have no access to this URL)

  • 404 - Not found (the document does not exist)

  • 500 - Internal Server Error (an error in a CGI script, etc)

  • 503 - Service Unavailable (host is down, connection timed out)

  • 504 - Gateway Timeout (read timeout happened during downloading the document)

HTTP 401 means that this URL is password protected. You can use the AuthBasic command in indexer.conf to specify the login:password pair for this URL.
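
As a sketch, an AuthBasic command placed before the corresponding Server command could look like this (the credentials and URL are placeholders):

AuthBasic login:password
Server http://www.example.com/protected/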

HTTP 404 means that you have a broken link in one of your documents (a reference to a resource that does not exist).

Take a look at the HTTP specification for further information on HTTP status codes.

Using indexer for site validation

Run indexer -I to display the list of URLs together with their referrers. It can be useful to find broken links on your site.

Note: If HoldBadHrefs is set to 0, link validation won't work.
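
For example, to keep information about broken links for a month before removing it (a sketch using the same time-offset syntax as Period):

HoldBadHrefs 30d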

Note: All subsection control options work together with -I. For example, indexer -I -s 404 will display the list of the documents with HTTP status 404 Not found together with their referrers where the links to the missing documents were found.

You can even use mnoGoSearch specifically for link validation purposes.

Running multiple indexer instances for crawling

It is always safe to run multiple indexer processes with different indexer.conf files configured to use different databases in their DBAddr commands.

Some databases also allow running multiple indexer crawling processes with the same indexer.conf file. As of mnoGoSearch version 3.3.8, this is possible with MySQL, PostgreSQL and Oracle. Starting from version 3.3.10, running multiple indexer crawling processes is also possible with Microsoft SQL Server. indexer uses the locking mechanisms provided by the database software (such as SELECT ... FOR UPDATE, LOCK TABLE, or TABLOCKX) when fetching crawling targets from the database. This is done to avoid double crawling of the same documents by simultaneous indexer processes.
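
Conceptually, the queue-fetching query shown earlier runs under such a lock; for MySQL or PostgreSQL it could look like this (an illustration only, not the exact statement indexer sends):

SELECT <columns> FROM url
WHERE next_index_time <= <current_time>
FOR UPDATE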

Note: indexer is known to work fine with 30 simultaneous crawling processes with MySQL.

Note: It is not recommended to use the same database with different indexer.conf files. The first process can add new documents to the database, while the second process can delete the same documents because of its different configuration, and this tug-of-war may never stop.

Running indexer with multiple threads

You can start indexer with multiple threads using the -N command line option. For example, indexer -N10 will start 10 crawling threads, which means 10 documents from different locations will be downloaded at the same time, improving crawling performance significantly.

Note: Running 10 instances of indexer is effectively very similar to running a single indexer with 10 threads. You may notice some difference, though, if you terminate (using Ctrl-Break) or kill (using kill(1)) indexer, or if indexer crashes for some reason (e.g. when it hits a bug in the sources). In the case of separate processes, only one process will die and the remaining processes will continue crawling, while in the case of a multi-threaded indexer all threads die and crawling stops completely.