mnoGoSearch 3.4.1 reference manual

Full-featured search engine software


Table of Contents
1. Introduction
mnoGoSearch Features
Where to get mnoGoSearch.
Disclaimer
Authors
Contributors (in no particular order)
Frequently Asked Questions
2. Installing mnoGoSearch
SQL database requirements
Supported operating systems
Tools required for installation
Installing mnoGoSearch
Running search.cgi from inetd / xinetd
Possible installation problems
Creating a binary package
Installation registration
Installing mnoGoSearch PHP module
3. Indexing
Indexing in general
Configuration
Creating SQL table structure
Dropping SQL table structure
Running indexer
HTTP redirects
Crawling time optimization
Subsection control
How to clear the database
Database Statistics
Using indexer for site validation
Running multiple indexer instances for crawling
Running indexer with multiple threads
HTTP response codes mnoGoSearch understands
Content-Encoding support
indexer configuration
Specifying the Web space for indexing
Aliases
ServerTable
FlushServerTable
Using syslog
Disabling Apache logging
Cached copies
Configuring cached copies
Using cached copies at search time
Moving cached copies to another machine
Using the original document as a cached copy source
4. Supported file formats and mime types
Built-in parsers
mnoGoSearch HTML parser
Tag parser
HTML entities
META tags
Links
Comments
5. External parsers
Supported parser types
Setting up parsers
Preventing indexer from getting stuck on a parser execution
Pipes in a parser command line
Parsers and character sets
The UDM_URL environment variable
External parsers for the most common file types
MS Word (*.doc)
MS Excel (*.xls)
MS PowerPoint (*.ppt)
MS Word 2007 (*.docx)
Rich Text (*.rtf)
Adobe Acrobat (*.pdf)
PostScript (*.ps)
MS Works 2, 3, 4, 5 (2000), and 8 (*.wps)
Corel WordPerfect 4.x and later (*.wpd)
RPM
6. Extended indexing features
News extensions
Creating an MP3 search engine
MP3 indexer.conf commands
Restricting search to a certain MP3 section
Indexing SQL tables (htdb:/ virtual URL scheme)
HTDB indexer.conf commands
HTDB variables
Using multiple HTDB sources
Using mnoGoSearch as an external SQL full-text engine
Indexing a database driven Web server
Indexing a program output (exec:/ and cgi:/ virtual URL schemes)
Passing parameters to the cgi:/ virtual scheme
Passing parameters to the exec:/ virtual scheme
Using the exec:/ virtual scheme as an external retrieval system
Mirroring
Creating a mirror
Using a mirror as a crawler cache
Dumping and restoring the search database
Dumping the search database
Restoring the search database
7. mnoGoSearch word index formats
Word modes with an SQL database
Various modes used to store words
Storage mode - single
Storage mode - multi
Storage mode - blob
Live updates emulator with DBMode=blob
Extended features with DBMode=blob
Maximum amount of words collected from a document
Substring search notes
mnoGoSearch performance issues
MySQL performance
Post-indexing optimization
Oracle notes
Introduction
Compilation, Installation and Configuration
IBM DB2 notes
8. Subsections
Tags
Adding tags
Using tags at search time
Using substring tag match
Multiple selections
Using tags with indexer
9. Internationalization
Character sets
Supported character sets
Multiple languages in the same database
Character set conversion
Character set conversion at search time
Character sets aliases
Document character set detection
Automatic character set guesser
The default character set
The default Language
Search pages with multi-lingual interface
Installing a multi-lingual interface
How it works
Possible troubles
Segmenters for Chinese, Thai and Japanese languages
Japanese phrase segmenter
Chinese phrase segmenter
Thai phrase segmenter
The CJK phrase segmenter
Indexing multilingual servers
10. mnoGoSearch templates
Introduction
Processing instructions
Statements
Comments
Data types
Selection statements
Iteration statements
Jump statements
Compound statements
Variable declarations
Expression statements
Primary expression
Postfix expression
Unary expression
Cast expression
Multiplicative expression
Additive expression
Shift expression
Relational expression
Equality expression
AND expression
Exclusive OR expression
Inclusive OR expression
Logical AND expression
Logical OR expression
Conditional expression
Assignment expression
Expression
Functions
The string class
The ENV class
The RESULT class
The QUERYWORD class
The DOCUMENT class
Security issues
11. Searching documents
Using search front-ends
Performing search
Search parameters
Changing weights of the different document parts at search time
Changing importance of individual query words
Using multiple templates
Advanced Boolean search
Restricting search words to a section
Phrase search
Exact section match
How search handles expired documents
Designing search.htm
Relative links in search.htm
Adding a small Search form to the other pages of your site
Ranking documents
Commands affecting document score
Relevancy
Analyzing score values
Popularity
Tracking search queries
Search results cache
Fuzzy search
Ispell
Synonyms
Dehyphenation
Loading synonyms and word forms from the SQL database
Dumping Ispell data
Transliteration
Searching numbers
Accent insensitive search
12. mnoGoSearch cluster
Introduction
How it works
Operations done on the database machines
What a typical XML response looks like
Operations done on the front-end machine
Cluster types
Installing and configuring a merge cluster
Installing and configuring a distributed cluster
Using dump/restore tools to add cluster nodes
Cluster limitations
13. Miscellaneous
Environment variables
Using mnoGoSearch as an embedded search library
libmnogosearch
mnoGoSearch API
The udm-config script
MySQL fulltext parser plugin
Database schema
Reporting bugs
Currently known bugs
Core dump reports
I. Reference
I. mnoGoSearch command reference
AddType -- associates file names or extensions with mime types
AddEncoding -- associates file names or extensions with encoding types
Affix -- loads an Ispell affix file
AjaxLinks -- defines whether to store AJAX links with hash fragments
Alias -- associates master and mirror sites
AliasProg -- calls an external URL rewrite program
Allow -- allows indexing documents that match the given URL pattern
AlwaysFoundWord -- defines a word that is treated as found in any document
AuthBasic -- defines user name and password for basic HTTP authorization
BrowserCharset -- defines browser character set
CachedCopyEncoding -- defines whether to use cached copy compression
CaseFolding -- chooses an alternative case mapping
CheckMP3 -- checks for MP3 meta information
CheckMP3Only -- checks for MP3 meta information
CheckOnly -- checks if a document exists
CollectLinks -- defines what kind of links between documents should be stored in the database (e.g. for popularity rank).
ComplexSynonyms -- defines whether to use phrase-to-word and phrase-to-phrase synonyms
CrawlDelay -- defines the number of seconds to wait between requests to the same server
CrawlerThreads -- sets the number of indexer threads started for crawling
CustomLog -- enables logging to STDOUT using the given format
CVSIgnore -- defines whether to index internal CVS files
DateFactor -- gives lower score to old documents
DateFormat -- defines date format
DBAddr -- sets the database(s) connection string
DefaultContentType -- defines default Content-Type
Dehyphenate -- enables searching for dehyphenated forms of compound words
DefaultLang -- defines default language
DetectClones -- enables or disables clone detection
Disallow -- disallows indexing defined URLs
DNSCacheTimeOut -- defines the maximum lifetime of a cached DNS entry
DocMemCacheSize -- this command is obsolete
DocSizeWeight -- changes document size impact on the document score
DocTimeOut -- defines maximum amount of time spent to download a document
ExcerptSize -- defines maximal excerpt length
ExcerptStopword -- defines whether to highlight stopwords.
ExcerptPadding -- defines excerpt context length
FlushServerTable -- puts the server.active value in sync with indexer.conf
FollowLinks -- defines what kind of links between documents should be followed.
FollowSymLinks -- defines whether to dereference symlinks
ForceIISCharset1251 -- assume that Microsoft IIS servers return windows-1251 character set
GuesserUseMeta -- defines whether to use meta tags for character set detection
GroupBySite -- enables grouping search results by site
HoldBadHrefs -- defines period of time to keep bad documents in the database
HrefOnly -- scans matching documents for links only
HTDBAddr -- describes a connection string to a remote SQL data source
HTDBDoc -- describes a query to fetch a document content from an SQL source
HTDBLimit -- limits the amount of document IDs fetched in a single HTDBList query
HTDBList -- describes a query to fetch document list from an SQL data source
HTTPHeader -- adds a desired header into HTTP requests
IDFFactor -- changes the effect of inverse document frequency
ImportEnv -- imports an environment variable
Include -- includes additional configuration file
Index -- defines whether the document content should be indexed
IndexCacheSize -- sets the amount of RAM indexer uses for the search index cache
IndexerThreads -- sets the number of indexer threads started for indexing
IndexIf -- allows indexing documents whose section matches the given pattern
IndexTime -- defines whether the Last-Modified HTTP header should be processed for date detection
IPRequestPerMinLimit -- limits the number of requests to the same IP address
IspellUsePrefixes -- enables the use of Ispell prefixes at search time
LangMapFile -- loads language map for character set and language guesser
LangMapUpdate -- activates updating of the loaded language maps
Limit -- describes a fast limit
LoadURLBasicInfo -- defines whether to load basic section values to display in search results
LoadChineseList -- loads a Chinese frequency dictionary
LoadTagInfo -- loads tag values to display in search results
LoadThaiList -- loads a Thai word frequency dictionary
LoadURLInfo -- loads extended section values to display in search results
LocalCharset -- defines local character set
Locale -- sets a desired locale
Log2Stderr -- defines whether to print messages to STDERR
LogLevel -- sets verbosity level
MaxDocSize -- defines maximal document size
MaxDocPerSite -- defines maximal document number to pick up from every site
MaxHops -- defines the maximum path length in "mouse clicks"
MaxNetErrors -- defines the maximum number of network errors
MaxWordLength -- defines maximal word length
Mime -- defines external parser for given mime-type
MinCoordFactor -- gives more score to documents having query words closer to the beginning
MinWordLength -- defines minimal word length
MirrorHeadersRoot -- defines root directory for mirrored document headers
MirrorPeriod -- defines fresh period for mirrored files
MirrorRoot -- defines root directory for mirrored documents
NetErrorDelayTime -- defines document processing delay
NewsExtensions -- enables news extensions
NoIndexIf -- disallows indexing documents having a section matching a pattern.
NumSections -- tells the number of sections configured in indexer.conf
NumDistinctWordFactor -- gives more score to documents having more distinct words
NumWordFactor -- gives more score to documents having more found words
ParserTimeOut -- defines maximum allowed parser execution time
Period -- defines crawling period
Phrase2CountFactor -- gives more score to documents having a two-word phrase or subphrase
Phrase3CountFactor -- gives more score to documents having a three-word phrase or subphrase
PopularityFactor -- sets how a document's popularity affects its score
Proxy -- defines HTTP proxy address
ReadTimeOut -- defines stalled connections timeout
Realm -- describes Web-space for indexing, using regex/wild patterns
RemoteCharset -- defines default character set for Server or Realm
RemoteFileNameCharset -- defines default character set of file and directory names
ReplaceVar -- creates or modifies a variable
ResultsLimit -- sets the maximum number of results displayed
ReverseAlias -- rewrites URL before inserting to the database
Robots -- defines whether to respect robots.txt and robot directives (in HTTP headers, meta tags, link attributes).
SaveSectionSize -- defines whether to store section sizes for better relevancy quality
Section -- defines a document section
Server -- describes Web-space for indexing
ServerTable -- loads servers to index from the database
ServerWeight -- defines server weight for Popularity Rank calculation
Skip -- skips visiting the documents with URL matching the given pattern
SkipIf -- skips revisiting the documents with a section matching the given pattern
Spell -- loads an Ispell dictionary file
SQLWordForms -- loads synonyms or word forms from the database
StartHops -- defines Hops value for start URLs
StopwordFile -- loads stopwords file
StrictModeThreshold -- threshold to switch to a less strict search mode
StripAccents -- converts letters to their non-accented counterparts
Subnet -- Subnet
SubstringMatchMinWordLength -- defines minimal word length allowed for substring match
Suggest -- displays suggestions for misspelled search words
Synonym -- loads a synonym list from a file
SyslogFacility -- sets syslog facility
Tag -- assigns a generic grouping tag to a set of documents
URL -- inserts URL into database
UserCacheQuery -- stores a search result to the database using a user-defined SQL query
URLDataThreshold -- improves search performance for queries returning a small number of results
URLSelectCacheSize -- sets URL cache size for indexer
URLSelectSkipLock -- defines whether to skip locking URLs when fetching crawling targets from the database
UseCookie -- defines whether to use per-session cookies during crawling
UseLocalCachedCopy -- whether to use the original document as a source for excerpts and Cached Copy
UseCRC32URLId -- defines whether to use CRC32 for URL ID generation
UseNumericOperators -- defines whether to interpret numeric operators in a search query
UseRangeOperators -- defines whether to recognize range operators in a search query
UseRemoteContentType -- specifies whether to trust the Content-Type HTTP header from the remote servers
UserOrder -- specifies an SQL query for user defined ordering
UsePopularity -- defines whether to calculate popularity during indexing
UseSitemap -- defines whether to use Sitemap Protocol when crawling
UserScore -- specifies an SQL query to calculate user defined score for desired documents.
UserSiteScore -- specifies an SQL query to calculate user defined score for certain sites.
UserScoreFactor -- sets the effect of the UserScore command
VarDir -- defines mnoGoSearch working directory
VaryLang -- defines languages for multilingual indexing
wf -- sets the default weights for different document parts
WordCacheSize -- defines maximum allowed in-memory words cache size
WordDensityFactor -- gives more score to documents having higher word density
WordFormFactor -- gives more score to the original query word form (as opposed to Synonym or Ispell fuzzy forms)
WordDistanceWeight -- changes word distance impact on the document score
II. mnoGoSearch C API function reference
UdmEnvInit -- Allocates or initializes a search context variable
UdmEnvFree -- Closes a search context
UdmAgentInit -- Allocates or initializes a search session variable
UdmAgentFree -- Closes a search session
UdmAgentAddLine -- Adds a configuration command
UdmFind2 -- Executes a search query
UdmResultFree -- Frees a search result
A. mnoGoSearch change history
Changes in 3.4
Changes in 3.4.1 (December 15, 2015)
Index
List of Tables
3-1. Verbose levels
9-1. Supported character sets
9-2. Character set aliases
11-1. Available search parameters
13-1. Environment variables mnoGoSearch understands
13-2. server table schema
13-3. Server parameters in the table srvinfo.
List of Examples
1. UdmEnvInit example #1
2. UdmEnvInit example #2
1. UdmEnvFree example #1
2. UdmEnvFree example #2
1. UdmAgentInit example #1
2. UdmAgentInit example #2
1. UdmAgentFree example #1
2. UdmAgentFree example #2
1. UdmAgentAddLine example
1. UdmFind2 example
2. UdmFind2 - a complete search application example
3. Makefile example
1. UdmResultFree example #1