mnoGoSearch understands a number of file formats out of the box and can index files of these types using its built-in parsers. For other file types it can use external parsers.
An external parser is an executable program that converts a file of some Mime type to one of the types supported by mnoGoSearch. For example, if you want mnoGoSearch to index PostScript files, you can do it with the help of the ps2ascii parser, which reads a PostScript file from STDIN and writes plain text to STDOUT.
Note: External parsers are also often referred to in this manual as filters or converters.
mnoGoSearch supports four types of external parsers:
read data from STDIN, send the result to STDOUT
read data from a file, send the result to STDOUT
read data from a file, send the result to another file
read data from STDIN, send the result to a file
Configure Mime types
Make sure your HTTP server sends correct Content-Type headers. For Apache, have a look at the file mime.types; the most common Mime types are already defined there.
If you want to index local files or an FTP server, use the AddType command in indexer.conf to associate file name extensions with their Mime types. For example:
AddType text/html *.html
Add parser definition commands. These commands have the following format with three arguments:
Mime <from_mime> <to_mime> <command line>
For example, the following command defines a parser for man pages:
# Use deroff for parsing man pages ( *.man )
Mime application/x-troff-man text/plain deroff
This parser will take data from STDIN and output results to STDOUT.
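When no ready-made tool exists for a format, a short shell script can play the same role as deroff. The sketch below is only an illustration: the function name plain_text_filter and the clean-up rules are hypothetical and not part of mnoGoSearch. It reads a document from STDIN and writes plain text to STDOUT, which is the contract the Mime command above relies on.

```shell
#!/bin/sh
# Hypothetical STDIN-to-STDOUT filter (not shipped with mnoGoSearch).
# It strips carriage returns and drops blank lines from the input.
plain_text_filter() {
    tr -d '\r' | sed '/^[[:space:]]*$/d'
}

# Demo with fixed input, so the behaviour can be checked by hand.
printf 'line one\r\n\r\nline two\n' | plain_text_filter
```

Saved as a standalone executable that calls the function on its own STDIN, such a script could be named in the third argument of a Mime command in place of deroff.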
Some parsers cannot operate on STDIN and require a file to read from. In this case, indexer can create a temporary file in /tmp and remove it when the parser is done. Use the $1 macro in the parser command line to substitute the temporary file name. For example, the Mime command for the catdoc MS-Word-to-text converter can look like this:
Mime application/msword text/plain "/usr/bin/catdoc -a $1"
If your parser writes the result into an output file, use the $2 macro. indexer will replace $2 with the output temporary file name, then start the parser, read the result from this temporary file and delete the file. For example:
Mime application/msword text/plain "/usr/bin/catdoc -a $1 >$2"
The parser above will read data from the first temporary file and write results to the second file. Both temporary files will be deleted after the parser results have been read. Note that this command is effectively the same as the previous example. The two differ only in the execution method used by indexer: file-to-STDOUT versus file-to-file.
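The file-to-file method can also be served by a script that takes the two file names as positional arguments, in the same order as the $1 and $2 macros. The following is a minimal sketch under that assumption; convert_file_to_file is a hypothetical name, and the tr call is only a stand-in for a real converter.

```shell
#!/bin/sh
# Hypothetical file-to-file converter: the first argument is the input file,
# the second argument is the output file.
convert_file_to_file() {
    input_file=$1
    output_file=$2
    # Return a non-zero status if the input file cannot be read.
    [ -r "$input_file" ] || return 1
    # Stand-in for a real conversion: strip carriage returns.
    tr -d '\r' < "$input_file" > "$output_file"
}

# Demo: imitate the two temporary files by hand.
work_dir=$(mktemp -d)
printf 'first line\r\nsecond line\r\n' > "$work_dir/in.txt"
convert_file_to_file "$work_dir/in.txt" "$work_dir/out.txt"
cat "$work_dir/out.txt"
rm -rf "$work_dir"
```

If such a script were installed as, say, /usr/local/bin/example2txt (a hypothetical path), the parser command line would be "/usr/local/bin/example2txt $1 $2".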
To prevent indexer from getting stuck on a parser execution, you can specify how long (in seconds) indexer waits for an external parser to return results. Use the ParserTimeOut command in indexer.conf for this purpose. For example:
ParserTimeOut 600
The default value is 300 seconds (5 minutes). If an external parser does not return results within this period of time, indexer will kill the parser process, remove the associated temporary files and continue with the next document in the queue.
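Before choosing a ParserTimeOut value, it can help to check by hand how a parser behaves under a time limit. The sketch below imitates the limit outside of indexer using the timeout utility from GNU coreutils, which is assumed to be installed; run_with_limit is a hypothetical helper name, and this is not how indexer itself enforces the limit.

```shell
#!/bin/sh
# Hypothetical helper: run a command with a time limit in seconds.
# Relies on GNU coreutils `timeout`, which exits with status 124
# when the command is killed for exceeding the limit.
run_with_limit() {
    limit_seconds=$1
    shift
    status=0
    timeout "$limit_seconds" "$@" || status=$?
    if [ "$status" -eq 124 ]; then
        echo "parser did not finish within ${limit_seconds}s" >&2
    fi
    return "$status"
}

# Demo: a command that finishes well within the limit.
run_with_limit 5 echo "finished in time"
```

For a real check, the command after the limit would be the parser invocation itself, for example run_with_limit 300 /usr/bin/catdoc -a with a sample document as the last argument.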
You can use pipes in a parser command line. For example, these lines will be useful to index gzipped man pages from the local disk:
AddType application/x-gzipped-man *.1.gz *.2.gz *.3.gz *.4.gz
Mime application/x-gzipped-man text/plain "zcat | deroff"
Some parsers can produce output in a character set different from the one given in the LocalCharset command. You can specify the output character set in a parser command line to make indexer convert the parser output to LocalCharset. For example, if catdoc is configured to produce output in the windows-1251 character set but LocalCharset is set to koi8-r, you can use this command for parsing MS Word documents:
Mime application/msword "text/plain; charset=windows-1251" "catdoc -a $1"
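If you are unsure which character set a parser really emits, one way to check is to convert its output by hand and see whether the text is readable. The sketch below assumes that the iconv utility is installed and knows the windows-1251 character set; to_utf8 is a hypothetical helper name and is not part of mnoGoSearch.

```shell
#!/bin/sh
# Hypothetical helper: convert STDIN from the given character set to UTF-8,
# so the result can be inspected in a UTF-8 terminal.
to_utf8() {
    source_charset=$1
    iconv -f "$source_charset" -t UTF-8
}

# Demo: the bytes 0xCF 0xF0 0xE8 (octal 317 360 350) are three
# Cyrillic letters in windows-1251.
printf '\317\360\350\n' | to_utf8 windows-1251
```

Piping real parser output through to_utf8 with the character set named in the Mime command should give readable text; garbled output suggests that the declared character set does not match what the parser produces.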
UDM_URL environment variable
When executing a parser, indexer creates the UDM_URL environment variable with the document URL as a value. You can use this variable in the parser scripts.
Note: When running indexer with multiple threads it's not recommended to use the UDM_URL environment variable; use the ${URL} variable in the parser command line instead. See Mime for more details.
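For a single-threaded indexer run, a parser script can read UDM_URL like any other environment variable. The sketch below is a hypothetical wrapper (the name emit_with_url and the "Source:" line are illustrative only) that puts the document URL in front of the text it passes through.

```shell
#!/bin/sh
# Hypothetical wrapper: print the document URL, then copy STDIN to STDOUT.
emit_with_url() {
    # Fall back to a placeholder when UDM_URL is not set,
    # for example when the script is run by hand.
    printf 'Source: %s\n' "${UDM_URL:-unknown}"
    cat
}

# Demo: set the variable by hand, as indexer would before starting a parser.
UDM_URL='http://example.com/docs/report.doc'
export UDM_URL
printf 'document body\n' | emit_with_url
```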
catdoc MS-Word-to-text converter
Home page, Homepage (the Catdoc fork at Alioth project), also listed on Freecode.
indexer.conf:
Mime application/msword "text/plain; charset=utf-8" "catdoc -d utf-8 $1"
wvWare MS-Word-to-HTML converter
Home page, also listed on Freecode.
indexer.conf:
Mime application/msword "text/html; charset=utf-8" "wvHtml --charset=utf-8 $1 -"
Tika MS-Word-to-text converter
indexer.conf:
Mime application/msword "text/plain; charset=utf-8" "java -Xms128m -Xmx256m -jar /opt/tika-0.5/tika-app-0.5.jar --text $1"
The exact path to the Tika JAR archive file may vary depending on your system configuration.
xls2csv MS-Excel-to-text converter
A part of catdoc distribution.
indexer.conf:
Mime application/vnd.ms-excel text/plain "xls2csv $1"
Excel-XLS-to-HTML converter
Available from the project homepage, also listed on Freecode and SourceForge. The download page includes binaries for Windows and source code at SourceForge.
indexer.conf:
Mime application/vnd.ms-excel text/html "xlhtml $1"
ppthtml PowerPoint-PPT-to-HTML converter
A part of the xlhtml distribution. Available from the project homepage, also listed on Freecode and SourceForge. The download page includes binaries for Windows and source code at SourceForge.
indexer.conf:
Mime application/vnd.ms-powerpoint text/html "ppthtml $1"
MS Word 2007 files can be indexed with the help of unzip.
unzip is included in most modern Unix distributions.
indexer.conf:
AddType application/vnd.openxmlformats-officedocument.wordprocessingml.document *.docx
Mime application/vnd.openxmlformats-officedocument.wordprocessingml.document text/xml "unzip -p $1 word/document.xml"
unrtf RTF-to-HTML converter
Homepage, and FTP download page.
indexer.conf:
Mime text/rtf* text/html "/usr/local/mnogosearch/sbin/unrtf --html $1"
Mime application/rtf text/html "/usr/local/mnogosearch/sbin/unrtf --html $1"
rtfx RTF-to-XML converter
Homepage, also listed on Freecode.
indexer.conf:
Mime text/rtf* text/xml "rtfx -w $1 2>/dev/null"
Mime application/rtf text/xml "rtfx -w $1 2>/dev/null"
rthc RTF-to-HTML converter
indexer.conf:
Mime text/rtf* text/html "rthc --use-stdout $1 2>/dev/null"
Mime application/rtf text/html "rthc --use-stdout $1 2>/dev/null"
pdftohtml Adobe PDF converter
Homepage (original), Homepage (the Poppler fork).
indexer.conf:
Mime application/pdf text/html "pdftohtml -noframes -enc UTF-8 -i -stdout $1"
pdftotext Adobe PDF converter
A part of the xpdf project.
Homepage, Homepage (the Poppler fork), also listed on Freecode.
indexer.conf:
Mime application/pdf text/plain "pdftotext $1 -"
libwps WPS-to-HTML and WPS-to-text converter
indexer.conf:
# Text output:
Mime application/vnd.ms-works "text/plain; charset=utf-8" "wps2text $1"
# HTML output:
Mime application/vnd.ms-works "text/html; charset=utf-8" "wps2html $1"
libwpd WPD-to-HTML and WPD-to-text converter
indexer.conf:
# For text output:
Mime application/vnd.wordperfect "text/plain; charset=utf-8" "wpd2text $1"
# For HTML output:
Mime application/vnd.wordperfect "text/html; charset=utf-8" "wpd2html $1"
RPM converter by Mario Lang <lang[at]zid[dot]tu-graz[dot]ac[dot]at>
/usr/local/bin/rpminfo:
#!/bin/bash
# Print an HTML summary of the RPM package file given as the first argument.
/usr/bin/rpm -q --queryformat="<html><head><title>RPM: %{NAME} %{VERSION}-%{RELEASE} (%{GROUP})</title><meta name=\"description\" content=\"%{SUMMARY}\"></head><body> %{DESCRIPTION}\n</body></html>" -p "$1"
indexer.conf:
Mime application/x-rpm text/html "/usr/local/bin/rpminfo $1"
With this parser, RPM packages appear in search results like this:
3. RPM: mysql 3.20.32a-3 (Applications/Databases) [4] Mysql is a SQL (Structured Query Language) database server. Mysql was written by Michael (Monty) Widenius. See the CREDITS file in the distribution for more credits for mysql and related things.... (application/x-rpm) 2088855 bytes
If you're using an external parser not listed here, please contribute your parser configuration to <general@mnogosearch.org>.