mnoGoSearch 3.4.1 reference manual: Full-featured search engine software

Chapter 5. External parsers

Table of Contents
Supported parser types
Setting up parsers
Preventing indexer from getting stuck on a parser execution
Pipes in a parser command line
Parsers and character sets
The UDM_URL environment variable
External parsers for the most common file types

mnoGoSearch understands a number of file formats out of the box and can index files of these types using its built-in parsers. For other file types it can use external parsers.

An external parser is an executable program that converts a file of some MIME type to one of the types supported by mnoGoSearch. For example, if you want mnoGoSearch to index PostScript files, you can do so with the help of the ps2ascii parser, which reads a PostScript file from STDIN and writes plain text to STDOUT.

Note: External parsers are also often referred to in this manual as filters or converters.

Supported parser types

mnoGoSearch supports four types of external parsers:

read data from STDIN, send the result to STDOUT
read data from a file, send the result to STDOUT
read data from a file, send the result to another file
read data from STDIN, send the result to a file
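
As a rough sketch, the four types map onto the $1 (input file) and $2 (output file) macros described under "Setting up parsers". The converter name myconv and the MIME type application/x-example are hypothetical placeholders, and the STDIN-to-file form is inferred from the description of the macros rather than taken from a tested configuration. Only one such line would be used for a given MIME type.

```
# STDIN -> STDOUT: no macros in the command line
Mime application/x-example text/plain "myconv"

# file -> STDOUT: $1 only
Mime application/x-example text/plain "myconv $1"

# file -> file: both $1 and $2
Mime application/x-example text/plain "myconv $1 > $2"

# STDIN -> file: $2 only
Mime application/x-example text/plain "myconv > $2"
```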

Setting up parsers

Configure MIME types
Make sure your HTTP server sends correct Content-Type headers. For Apache, have a look at the file mime.types; the most common MIME types are already defined there.
If you want to index local files or an FTP server, use the AddType command in indexer.conf to associate file name extensions with their MIME types. For example:
```
AddType text/html *.html
```
Add parsers
Add parser definition commands to indexer.conf. Each command takes three arguments, in the following format:
```
Mime <from_mime> <to_mime> <command line>
```
For example, the following command defines a parser for man pages:
```
# Use deroff for parsing man pages ( *.man )
Mime  application/x-troff-man   text/plain   deroff
```
This parser will take data from STDIN and output results to STDOUT.
Some parsers cannot read from STDIN and require a file to read from. In this case indexer can create a temporary file in /tmp and remove it when the parser is done. Use the $1 macro in the parser command line to substitute the temporary file name. For example, the Mime command for the catdoc MS-Word-to-text converter can look like this:
```
Mime application/msword text/plain "/usr/bin/catdoc -a $1"
```
If your parser writes the result into an output file, use the $2 macro. indexer will replace $2 with the output temporary file name, then start the parser, read the result from this temporary file and delete the file. For example:
```
Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"
```
The parser above reads data from the first temporary file and writes results to the second one. Both temporary files are deleted after indexer has read the parser results. Note that this command is effectively the same as the previous example; the two differ only in the execution method used by indexer: file-to-STDOUT versus file-to-file.
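
The difference between the two execution methods can be tried by hand before editing indexer.conf. The following is only a sketch: convert is a hypothetical stand-in for a real converter such as catdoc, and the temporary files play the roles of $1 and $2. It does not reproduce how indexer is implemented.

```shell
# Stand-in for a real converter such as catdoc (hypothetical):
# prints the file named in $1 with runs of spaces squeezed to one.
convert() {
  tr -s ' ' < "$1"
}

in_file=$(mktemp)    # plays the role of $1
out_file=$(mktemp)   # plays the role of $2
printf 'Hello    world\n' > "$in_file"

# file-to-STDOUT: the result is read from the parser's standard output
via_stdout=$(convert "$in_file")

# file-to-file: the result is read back from the second temporary file
convert "$in_file" > "$out_file"
via_file=$(cat "$out_file")

if [ "$via_stdout" = "$via_file" ]; then
  echo "both methods produced: $via_stdout"
fi

# Both temporary files are removed afterwards, as indexer does.
rm -f "$in_file" "$out_file"
```

Either way the converted text is the same, so the choice between the two forms depends only on where the converter is able to write its output.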

Preventing indexer from getting stuck on a parser execution

To prevent indexer from getting stuck on a parser execution, you can specify the amount of time (in seconds) that indexer waits for an external parser to return results. Use the ParserTimeOut command in indexer.conf for this purpose. For example:

```
ParserTimeOut 600
```

The default value is 300 seconds (5 minutes). If an external parser does not return results within this period of time, indexer will kill the parser process, remove the associated temporary files and continue with the next document in the queue.
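
The effect of the timeout can be imitated in a shell to see how a queue behaves when one parser hangs. This is an illustration only: it assumes the timeout utility (for example from GNU coreutils) is installed, parser_cmd and index_queue are hypothetical names, and indexer enforces ParserTimeOut internally rather than through this utility.

```shell
# Stand-in parser command line (hypothetical): hangs on slow.doc,
# answers at once for every other document.
parser_cmd='case "$1" in slow.doc) exec sleep 5 ;; *) echo "text of $1" ;; esac'

# index_queue LIMIT DOC...: run the parser on each document and
# give up on a document after LIMIT seconds.
index_queue() {
  limit=$1
  shift
  for doc in "$@"; do
    if text=$(timeout "$limit" sh -c "$parser_cmd" sh "$doc"); then
      echo "indexed: $text"
    else
      # The parser was killed; continue with the next document.
      echo "skipped: $doc"
    fi
  done
}

index_queue 1 a.doc slow.doc b.doc
```

The hanging document is skipped after one second and the documents after it are still processed, which is the behavior ParserTimeOut is meant to guarantee.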

Pipes in a parser command line

You can use pipes in a parser command line. For example, these lines will be useful to index gzipped man pages from the local disk:

```
AddType  application/x-gzipped-man  *.1.gz *.2.gz *.3.gz *.4.gz
Mime     application/x-gzipped-man  text/plain  "zcat | deroff"
```
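
A piped command line can be checked from a shell before it goes into indexer.conf. The helper below is a minimal sketch: try_parser is a hypothetical name, and it assumes the command line is interpreted by a shell, as the pipe support suggests. Because deroff may not be installed everywhere, the demonstration uses grep as a crude stand-in that drops troff request lines.

```shell
# try_parser CMDLINE FILE: feed FILE to CMDLINE on STDIN and print
# whatever the command line writes to STDOUT.
try_parser() {
  sh -c "$1" < "$2"
}

# Build a tiny gzipped "man page" to test with.
sample=$(mktemp)
printf '.TH DEMO 1\nhello from a gzipped page\n' | gzip -c > "$sample"

# Stand-in for "zcat | deroff": decompress, then drop lines starting with a dot.
try_parser "zcat | grep -v '^\.'" "$sample"

rm -f "$sample"
```

If the output is the plain text you expect to be indexed, the same quoted command line can be used as the third argument of the Mime command.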

Parsers and character sets

Some parsers produce output in a character set different from the one given in the LocalCharset command. You can specify the parser's output character set in the <to_mime> argument of the Mime command to make indexer convert the parser output to LocalCharset. For example, if catdoc is configured to produce output in the windows-1251 character set but LocalCharset is set to koi8-r, you can use this command for parsing MS Word documents:

```
Mime  application/msword  "text/plain; charset=windows-1251" "catdoc -a $1"
```
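
To see what such a conversion amounts to, the same recoding can be reproduced with the iconv utility. This is only an illustration under the assumption that iconv is installed and knows both character sets; to_local_charset is a hypothetical helper, and indexer performs the conversion itself rather than calling iconv.

```shell
# to_local_charset FROM TO: recode STDIN from the character set declared
# for the parser output (FROM) to the one set in LocalCharset (TO).
to_local_charset() {
  iconv -f "$1" -t "$2"
}

# The Russian word "Привет" as windows-1251 bytes (cf f0 e8 e2 e5 f2),
# written with octal escapes so the snippet itself stays ASCII.
printf '\317\360\350\342\345\362' |
  to_local_charset windows-1251 koi8-r |
  od -An -tx1   # shows the koi8-r bytes: f0 d2 c9 d7 c5 d4
```

The byte values change completely between the two encodings, which is why indexer needs to be told the parser's output character set when it differs from LocalCharset.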

The `UDM_URL` environment variable

When executing a parser, indexer creates the UDM_URL environment variable with the document URL as a value. You can use this variable in the parser scripts.

Note: When running indexer with multiple threads, using the UDM_URL environment variable is not recommended; use the ${URL} variable in the parser command line instead. See Mime for more details.
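
A parser script might use the variable as in the following sketch. The function name with_url_line and the "Source:" line are hypothetical choices made for illustration; the only thing taken from the manual is that UDM_URL holds the document URL while the parser runs. Here the variable is set by hand to imitate indexer.

```shell
# with_url_line: copy STDIN to STDOUT, preceded by a line that records
# the URL of the document being parsed (taken from UDM_URL).
with_url_line() {
  printf 'Source: %s\n' "${UDM_URL:-unknown}"
  cat
}

# Imitate indexer: set the variable, then run the parser on some input.
UDM_URL='http://example.com/docs/a.txt'
export UDM_URL
printf 'body text\n' | with_url_line
```

This prints a "Source:" line with the URL followed by the unchanged body text. With a multi-threaded indexer, the same script could take the URL as a command line argument filled in from ${URL} instead.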

External parsers for the most common file types

MS Word (`*.doc`)

catdoc MS-Word-to-text converter
Home page, Homepage (the Catdoc fork at Alioth project), also listed on Freecode.
indexer.conf:
```
Mime application/msword "text/plain; charset=utf-8"  "catdoc -d utf-8 $1"
```

wvWare MS-Word-to-HTML converter

Home page, also listed on Freecode.

indexer.conf:

```
Mime application/msword    "text/html; charset=utf-8"    "wvHtml --charset=utf-8 $1 -"
```

Tika MS-Word-to-text converter
Home page.
indexer.conf:
```
Mime application/msword    "text/plain; charset=utf-8"    "java -Xms128m -Xmx256m -jar /opt/tika-0.5/tika-app-0.5.jar --text $1"
```
The exact path to the Tika JAR file may vary depending on your system configuration.

MS Excel (`*.xls`)

xls2csv MS-Excel-to-text converter
A part of the catdoc distribution.
indexer.conf:
```
Mime application/vnd.ms-excel   text/plain      "xls2csv $1"
```
xlhtml Excel-XLS-to-HTML converter
Available from the project homepage, also listed on Freecode and SourceForge. The download page includes binaries for Windows; the source code is available at SourceForge.
indexer.conf:
```
Mime application/vnd.ms-excel  text/html  "xlhtml $1"
```

MS PowerPoint (`*.ppt`)

ppthtml PowerPoint-PPT-to-HTML converter
A part of the xlhtml distribution. Available from the project homepage, also listed on Freecode and SourceForge. The download page includes binaries for Windows; the source code is available at SourceForge.
indexer.conf:
```
Mime application/vnd.ms-powerpoint   text/html   "ppthtml $1"
```

MS Word 2007 (`*.docx`)

MS Word 2007 files can be indexed with the help of unzip.

unzip is included in most modern Unix distributions.

indexer.conf:

```
AddType application/vnd.openxmlformats-officedocument.wordprocessingml.document *.docx
Mime application/vnd.openxmlformats-officedocument.wordprocessingml.document text/xml "unzip -p $1 word/document.xml"
```

Rich Text (`*.rtf`)

unrtf RTF-to-HTML converter

Homepage, and FTP download page.

indexer.conf:

```
Mime text/rtf*        text/html  "/usr/local/mnogosearch/sbin/unrtf --html $1"
Mime application/rtf  text/html  "/usr/local/mnogosearch/sbin/unrtf --html $1"
```

rtfx RTF-to-XML converter

Homepage, also listed on Freecode.

indexer.conf:

```
Mime text/rtf*       text/xml "rtfx -w $1 2>/dev/null"
Mime application/rtf text/xml "rtfx -w $1 2>/dev/null"
```

rthc RTF-to-HTML converter

indexer.conf:

```
Mime text/rtf*       text/html "rthc --use-stdout $1 2>/dev/null"
Mime application/rtf text/html "rthc --use-stdout $1 2>/dev/null"
```

Adobe Acrobat (`*.pdf`)

pdftohtml Adobe PDF converter
Homepage (original), Homepage (the Poppler fork).
indexer.conf:
```
Mime application/pdf text/html  "pdftohtml -noframes -enc UTF-8 -i -stdout $1"
```
pdftotext Adobe PDF converter
A part of the xpdf project.
Homepage, Homepage (the Poppler fork), also listed on Freecode.
indexer.conf:
```
Mime application/pdf            text/plain      "pdftotext $1 -"
```

PostScript (`*.ps`)

ps2ascii PostScript converter
A part of the GhostScript project.
Homepage, also listed on Freecode.
indexer.conf:
```
Mime application/postscript    text/plain  "ps2ascii $1"
```

MS Works 2, 3, 4, 5 (2000), and 8 (`*.wps`)

libwps WPS-to-HTML and WPS-to-text converter

Homepage.

indexer.conf:

```
# Text output:
Mime application/vnd.ms-works "text/plain; charset=utf-8"  "wps2text $1"

# HTML output:
Mime application/vnd.ms-works "text/html; charset=utf-8"  "wps2html $1"
```

Corel WordPerfect 4.x and later (`*.wpd`)

libwpd WPD-to-HTML and WPD-to-text converter

Homepage.

indexer.conf:

```
# Text output:
Mime application/vnd.wordperfect "text/plain; charset=utf-8" "wpd2text $1"

# HTML output:
Mime application/vnd.wordperfect "text/html; charset=utf-8" "wpd2html $1"
```

RPM

RPM converter by Mario Lang <lang[at]zid[dot]tu-graz[dot]ac[dot]at>

/usr/local/bin/rpminfo:

```
#!/bin/bash
# Print an HTML summary of the RPM package file given as the first argument.
/usr/bin/rpm -q --queryformat="<html><head><title>RPM: %{NAME} %{VERSION}-%{RELEASE}
(%{GROUP})</title><meta name=\"description\" content=\"%{SUMMARY}\"></head><body>
%{DESCRIPTION}\n</body></html>" -p "$1"
```

indexer.conf:

```
Mime application/x-rpm text/html "/usr/local/bin/rpminfo $1"
```

With this parser, an RPM file appears in search results like this:

```
3. RPM: mysql 3.20.32a-3 (Applications/Databases) [4]
       Mysql is a SQL (Structured Query Language) database server.
       Mysql was written by Michael (Monty) Widenius. See the CREDITS
       file in the distribution for more credits for mysql and related
       things....
       (application/x-rpm) 2088855 bytes
```

If you're using an external parser not listed here, please contribute your parser configuration to <general@mnogosearch.org>.
