Fuzzy search

Ispell

mnoGoSearch can use Ispell dictionaries for stemming, i.e. to find different grammatical forms of the same word. Word forms are generated only at search time; indexer stores all words in the database "as is". search.cgi generates all forms of the search words given in the query and takes all of them into account. For example, when processing a query containing the word test, search.cgi will also find other forms of this word, such as testing, tests and tested.

Two types of Ispell files

mnoGoSearch understands two types of Ispell files: affixes and dictionaries. An Ispell affix file contains rules for generating word forms, and has approximately the following format:


Flag V:
       E   > -E, IVE      # As in create > creative
      [^E] > IVE          # As in prevent > preventive
Flag *N:
       E   > -E, ION      # As in create > creation
       Y   > -Y, ICATION  # As in multiply > multiplication
     [^EY] > EN           # As in fall > fallen
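
The way such rules turn a base word into another form can be sketched in Python. This is only an illustration of the rule format shown above, not mnoGoSearch's own code: the tuple layout (condition, letters to strip, letters to append) and the function name apply_rules are assumptions made for this example.

```python
import re

# Each rule: (condition on the word ending, suffix to strip, suffix to append).
# The three entries mirror the "Flag *N" rules quoted above; the tuple layout
# itself is hypothetical and chosen only for this sketch.
RULES_N = [
    ("E", "E", "ION"),       # create > creation
    ("Y", "Y", "ICATION"),   # multiply > multiplication
    ("[^EY]", "", "EN"),     # fall > fallen
]

def apply_rules(word, rules):
    """Return the forms produced by every rule whose condition matches the word ending."""
    word = word.upper()
    forms = []
    for condition, strip, append in rules:
        if re.search(condition + "$", word):
            # Remove the stripped suffix (if any), then add the new ending.
            stem = word[: len(word) - len(strip)] if strip else word
            forms.append(stem + append)
    return forms

print(apply_rules("create", RULES_N))
print(apply_rules("fall", RULES_N))
```

Only the rule whose condition matches the end of the word fires, which is why create and fall take different endings under the same flag.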

An Ispell dictionary file contains a list of the words with rule references, and has the following format:


wop/S
word/DGJMS
wordage/S
wordbook
wordily
wordless/P
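
As a rough illustration of this format, a dictionary line can be split into the word and its rule references as follows. The function name parse_dict_line is hypothetical, and the sketch assumes that each flag is a single letter, as in the listing above.

```python
def parse_dict_line(line):
    """Split 'word/FLAGS' into (word, set of flag letters); the flags part is optional."""
    word, _, flags = line.strip().partition("/")
    return word, set(flags)

print(parse_dict_line("word/DGJMS"))
print(parse_dict_line("wordbook"))
```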
    

Note: mnoGoSearch cannot load Ispell hash files generated with buildhash. Only the original Ispell dictionaries in text format are understood.

Using Ispell

To start using mnoGoSearch with stemming, you need to download and extract an Ispell language package and specify Affix and Spell commands in the search.htm file. The format of the commands is:


Affix {lang} {charset} {Ispell affixes file name}
Spell {lang} {charset} {Ispell dictionary filename}

The first parameter of both commands is a two-letter language abbreviation. The second parameter is the Ispell file character set. The third one is the filename. File names are relative to the mnoGoSearch /etc directory. Absolute paths can also be specified.

Note: Simultaneous loading of multiple languages is possible, e.g.:


Affix en iso-8859-1 en.aff
Spell en iso-8859-1 en.dict
Affix de iso-8859-1 de.aff
Spell de iso-8859-1 de.dict

...will load stemming information for both English and German languages.

Customizing Ispell dictionaries

Some rare words found on your site may not be present in the Ispell dictionaries. You can list these words in a plain text file, one word per line:


rare.dict:
----------
webmaster
intranet
---------

Then you can add Ispell flags to this file, using existing words as examples. Find an existing word that follows similar grammatical rules and copy its flags to the new word. For example, the English dictionary has this line:

postmaster/MS

So, webmaster with the MS flags should be OK:

webmaster/MS

Then copy this file to the /etc directory of your mnoGoSearch installation and add it with the Spell command.

Synonyms

Starting with version 3.2, mnoGoSearch supports synonym-based fuzzy search.

Synonym files are installed into the etc/synonym subdirectory of the mnoGoSearch installation.

To enable synonyms, use the Synonym command in your search template file. For example:


Synonym synonym/english.syn
Synonym synonym/russian.syn

The file names are relative to the /etc directory of your mnoGoSearch installation, or absolute if they begin with /.

Please feel free to send us your own synonym lists.

When creating your own synonym file, you can use the English synonym file as an example. A synonym file must begin with the following two commands:


Language: en
Charset:  us-ascii

The further lines contain synonyms, one group of synonyms per line. For example:


car auto automobile

All words written on the same line are considered to be equal. If you type one of the words in the search form, all other words from the same line will also be found.

An optional Mode command can also be used inside a synonym file. It understands three mode values: roundtrip, oneway and return, with the roundtrip value as default, and also two recursion flags: recursive and final, with recursive as default.

If "Mode: oneway" is specified, then the words written on the same line are not considered as equal synonyms anymore. Only the leftmost word is expanded to other words. For example:


Mode: oneway
car auto automobile

Searching for the word car will also search for auto and automobile, but searching for auto will find neither car nor automobile, and searching for automobile will find neither car nor auto.

If "Mode: return" is specified, then all words are expanded only to the leftmost word, while the leftmost word itself is not expanded. For example:


Mode: return
car auto automobile

Searching for car will search for neither auto nor automobile, but searching for auto will also search for car, and searching for automobile will also search for car.
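
The three mode values can be summarized with a small Python sketch. It models only the behaviour described in this section for a single synonym line; the function name expand and its arguments are assumptions made for this example and are not part of mnoGoSearch.

```python
def expand(word, group, mode="roundtrip"):
    """Return the set of words searched for `word`, given one synonym line `group`."""
    if word not in group:
        return {word}
    leftmost = group[0]
    if mode == "roundtrip":
        # All words on the line are equal synonyms.
        return set(group)
    if mode == "oneway":
        # Only the leftmost word is expanded to the other words.
        return set(group) if word == leftmost else {word}
    if mode == "return":
        # Every word except the leftmost is expanded to the leftmost word.
        return {word} if word == leftmost else {word, leftmost}
    raise ValueError(f"unknown mode: {mode}")

line = ["car", "auto", "automobile"]
for mode in ("roundtrip", "oneway", "return"):
    print(mode, sorted(expand("car", line, mode)), sorted(expand("auto", line, mode)))
```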

If "Mode: recursive" is specified, then any synonym found is passed on to the word form generator to get more word forms (synonyms, Ispell forms, etc.). If "Mode: final" is specified, then the synonyms found are not passed on to the word form generator. For example:


Mode: final
09 2009
09 september

If you search for 2009, search.cgi will also find 09, and vice versa. However, if you search for 2009, it will not find september. In other words, 09 is a synonym for both 2009 and september, but 2009 and september are not synonyms of each other.
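
The difference between the two recursion flags can be sketched in Python as follows. This is a simplified model that considers synonyms only (no Ispell forms) and treats every line as a roundtrip group; the names direct_synonyms and expand_word are hypothetical.

```python
LINES = [("09", "2009"), ("09", "september")]

def direct_synonyms(word, lines):
    """Words that share at least one synonym line with `word`."""
    found = set()
    for group in lines:
        if word in group:
            found.update(group)
    found.discard(word)
    return found

def expand_word(word, lines, recursive):
    """Expand `word`; with recursive=True, found synonyms are expanded again."""
    result = {word}
    frontier = [word]
    while frontier:
        current = frontier.pop()
        for synonym in direct_synonyms(current, lines):
            if synonym not in result:
                result.add(synonym)
                if recursive:
                    frontier.append(synonym)  # pass the synonym on for further expansion
    return result

print(sorted(expand_word("2009", LINES, recursive=False)))  # "final" behaviour
print(sorted(expand_word("2009", LINES, recursive=True)))   # "recursive" behaviour
```

With recursive=False, 2009 is expanded to 09 only; with recursive=True, 09 is expanded again and september is reached as well.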

It is possible to use multiple Mode commands in the same synonym file and thus switch between the oneway, return and roundtrip styles of synonyms for different lines:


Mode: roundtrip
colour color
Mode: oneway
car auto automobile
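
How a Mode command affects only the lines that follow it can be illustrated with a minimal Python reader for this file format. This is a sketch under stated assumptions, not the parser used by mnoGoSearch: the function name parse_synonym_file is hypothetical, the sketch assumes a Mode line may carry several whitespace-separated values, and it does not handle quoted phrases.

```python
STYLES = {"roundtrip", "oneway", "return"}
RECURSION = {"recursive", "final"}

def parse_synonym_file(text):
    """Return a list of (style, recursion, words) tuples, one per synonym line."""
    style, recursion = "roundtrip", "recursive"  # documented defaults
    groups = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        key, sep, value = line.partition(":")
        if sep and key.lower() == "mode":
            # A Mode command changes the setting for all following lines.
            for token in value.split():
                token = token.lower()
                if token in STYLES:
                    style = token
                elif token in RECURSION:
                    recursion = token
            continue
        if sep and key.lower() in ("language", "charset"):
            continue  # header commands, not synonym lines
        groups.append((style, recursion, line.split()))
    return groups

sample = """Language: en
Charset:  us-ascii
Mode: roundtrip
colour color
Mode: oneway
car auto automobile
"""
for entry in parse_synonym_file(sample):
    print(entry)
```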

Starting with version 3.2.34, mnoGoSearch supports a simple type of phrase synonyms:


president "george bush"

That means, if you type the word president, the phrase "george bush" will also be searched.

Starting with version 3.3.9, you can additionally specify ComplexSynonyms yes to enable the complex synonym types phrase-to-word and phrase-to-phrase, so that if you type the query "george bush", the word president will also be searched.

Dehyphenation

Searching for both hyphenated and dehyphenated compound words at the same time is also possible. For example, when searching for the compound word "peace-making", the word peacemaking will also be found. Please refer to the Dehyphenate command description for details.

Loading synonyms and word forms from the SQL database

It is also possible to load synonyms or word forms from the database. Refer to the SQLWordForms command description for details.

Dumping Ispell data

To dump Ispell data in a format suitable for loading into an SQL table for further use with SQLWordForms, copy all Affix and Spell commands from search.htm into indexer.conf, then run indexer -Edumpspell > dump.txt. indexer will write all word forms to the file dump.txt in this format:


...
abate/abate
abate/abating
abate/abated
abate/abater
abate/abates
...

Use database-specific tools and SQL syntax to load the newly created dump file into an SQL table. For example, in the case of MySQL:


CREATE TABLE spell
(
  word varchar(64) not null,
  form varchar(64) not null,
  key(word),
  key(form)
);
LOAD DATA INFILE 'dump.txt' INTO TABLE spell FIELDS TERMINATED BY '/';
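
For databases without a bulk loader comparable to LOAD DATA INFILE, the dump can also be loaded by a short script. The following Python sketch uses the standard sqlite3 module purely as a stand-in database; the function name load_dump and the index names are assumptions made for this example, and the table layout follows the MySQL definition above.

```python
import sqlite3

def load_dump(lines, conn):
    """Load 'word/form' lines into the spell table and return the number of rows added."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS spell ("
        "word VARCHAR(64) NOT NULL, form VARCHAR(64) NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS spell_word ON spell (word)")
    conn.execute("CREATE INDEX IF NOT EXISTS spell_form ON spell (form)")
    rows = []
    for line in lines:
        line = line.strip()
        if "/" not in line:
            continue  # skip anything that is not a word/form pair
        word, form = line.split("/", 1)
        rows.append((word, form))
    conn.executemany("INSERT INTO spell (word, form) VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")
count = load_dump(["abate/abate", "abate/abating", "abate/abated"], conn)
print(count, conn.execute("SELECT form FROM spell WHERE word = 'abate'").fetchall())
```

To load a real dump, pass an open file object, e.g. load_dump(open("dump.txt"), conn).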

Transliteration

Starting with version 3.2.34, mnoGoSearch supports transliteration.

Use the tl=yes parameter to search.cgi to activate transliteration.

Currently, Latin-to-Cyrillic and Cyrillic-to-Latin transliteration is implemented. That is, if you type a word in the Latin script, the Cyrillic word with the same spelling is also searched, and vice versa.
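
The idea can be illustrated with a very small Python sketch. The mapping table below is hypothetical and covers only a few one-to-one letters; the table actually used by mnoGoSearch is not given in this section, and the function names are assumptions made for this example.

```python
# Hypothetical Latin-to-Cyrillic letter table, for illustration only.
LATIN_TO_CYRILLIC = {
    "a": "\u0430",  # Cyrillic letter a
    "k": "\u043a",  # Cyrillic letter ka
    "m": "\u043c",  # Cyrillic letter em
    "o": "\u043e",  # Cyrillic letter o
    "s": "\u0441",  # Cyrillic letter es
    "t": "\u0442",  # Cyrillic letter te
    "v": "\u0432",  # Cyrillic letter ve
}

def to_cyrillic(word):
    """Map each Latin letter through the table; letters not in the table stay unchanged."""
    return "".join(LATIN_TO_CYRILLIC.get(ch, ch) for ch in word.lower())

def search_terms(word):
    """Return the typed word together with its transliterated counterpart."""
    return {word, to_cyrillic(word)}

print(search_terms("moskva"))
```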

Searching numbers

Starting with version 3.2.36, mnoGoSearch supports numeric operators.

When UseNumericOperators is set to yes, the "<" and ">" signs are treated as numeric comparison operators, e.g. "<100" finds all documents which have numbers less than 100 in their body, title or other sections, according to the wf settings. Numeric operators currently work only with databases that support automatic data type comparison between VARCHAR and INT and do not require an explicit type cast. MySQL, PostgreSQL and SQLite are known to work.

If you specify two numeric operators in the same search query, e.g. ">100 <200", then documents having a number greater than 100 and, at the same time, a number less than 200 will be found. That is, the above query does not strictly mean "a number between 100 and 200". A "between"-like operator will be implemented later.
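
This behaviour can be modelled in a few lines of Python. The sketch only mirrors the semantics described above (each operator must be satisfied by some number in the document, not necessarily the same one); the function name matches and the representation of a document as a list of numbers are assumptions made for this example.

```python
def matches(doc_numbers, conditions):
    """True if every condition is satisfied by at least one number in the document."""
    ops = {
        "<": lambda n, limit: n < limit,
        ">": lambda n, limit: n > limit,
    }
    return all(
        any(ops[op](n, limit) for n in doc_numbers)
        for op, limit in conditions
    )

query = [(">", 100), ("<", 200)]   # the query ">100 <200"
print(matches([150], query))       # one number between 100 and 200
print(matches([50, 500], query))   # no number between 100 and 200, still a match
print(matches([50], query))        # nothing greater than 100
```

The second document matches even though neither of its numbers lies between 100 and 200.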

Accent insensitive search

When doing searches, mnoGoSearch relies on the database collation settings, thus accent insensitive searches will be available if your database software supports and is configured to use an accent insensitive collation.

Accent insensitive search with MySQL

To configure mnoGoSearch for accent insensitive searches for German, French, Italian, Portuguese and some other Western languages, use the latin1_german1_ci collation when creating the database you're going to use with mnoGoSearch:


CREATE DATABASE mnogosearch CHARACTER SET latin1 COLLATE latin1_german1_ci;

With this collation, MySQL ignores all diacritical marks, so, for example, searches for the French word cote will also find coté, and vice versa.

Accent insensitive search with Firebird

To configure mnoGoSearch for accent insensitive searches for German, French, Italian, Portuguese etc. with Firebird, use the PT_BR collation. Firebird does not have a database-wide default collation, so the collation must be set in the CREATE TABLE statement for the table bdict. To do so, open the file /usr/local/mnogosearch/share/ibase/create.blob.sql in your favorite text editor and add the CHARACTER SET and COLLATE clauses to the definition of the column word:


CREATE TABLE bdict (
        word VARCHAR(64) CHARACTER SET ISO8859_1 NOT NULL COLLATE PT_BR,
        ...
);

Highlighting collation matches

Starting with the version 3.3.3, mnoGoSearch can recognize the word forms returned by the underlying SQL collation, and use them for generating excerpts and highlighting. For example, if your database is configured to use German DIN-2 based collation (e.g. latin1_german2_ci in MySQL), then searches for the word gross will also return groß. Both word forms will be highlighted. Prior to 3.3.3, only the exact word forms were used for excerpts and highlighting.

Note: Highlighting collation matches works only with DBMode=blob. Adding this feature for DBMode=single and DBMode=multi would have a serious impact on search performance.

Accent insensitive search with other databases

To make accent insensitive searches possible with databases not supporting accent insensitive collations, mnoGoSearch provides the StripAccents command. When StripAccents is set to yes, mnoGoSearch converts all accented letters to their non-accented counterparts. Conversion happens both during indexing (before storing data into the word index), and during search (before looking up in the word index). For example, the French word coté is converted into cote.

Accents are removed only for the word index. They are not removed from section values, so sections (e.g. title, body, CachedCopy) are stored with their original accented letters, which keeps the presentation of search results correct.
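
The effect of this conversion on a single word can be approximated in Python with Unicode decomposition. This is only a sketch of the general technique; mnoGoSearch uses its own conversion, whose exact letter table may differ, and the function name strip_accents is an assumption made for this example.

```python
import unicodedata

def strip_accents(word):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The same function would be applied to indexed words and to query words,
# so that both sides of the lookup use the unaccented form.
print(strip_accents("cot\u00e9"))
```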

Range search

Starting with version 3.3.12, range search is available. When UseRangeOperators is set to yes, the operators [a TO b], [a TO b}, {a TO b] and {a TO b} are interpreted as range operators. The keyword TO must be in upper case.

Square brackets denote inclusive range search, while curly brackets denote exclusive range search. For example,

[apple TO peach}

will find documents having words that are lexicographically between apple and peach, including apple but excluding peach.
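
The bracket semantics can be sketched in Python as follows. The sketch compares words with Python's plain string ordering, whereas the real comparison is performed by mnoGoSearch and the underlying database and may follow a different collation; the names RANGE_RE, parse_range and in_range are assumptions made for this example.

```python
import re

# One opening bracket, two bounds separated by an upper-case TO, one closing bracket.
RANGE_RE = re.compile(r"^([\[{])(\S+) TO (\S+)([\]}])$")

def parse_range(expr):
    """Parse '[a TO b}' into (low, high, include_low, include_high)."""
    m = RANGE_RE.match(expr)
    if m is None:
        raise ValueError(f"not a range operator: {expr!r}")
    open_br, low, high, close_br = m.groups()
    return low, high, open_br == "[", close_br == "]"

def in_range(word, expr):
    """True if `word` falls inside the range described by `expr`."""
    low, high, include_low, include_high = parse_range(expr)
    lower_ok = word >= low if include_low else word > low
    upper_ok = word <= high if include_high else word < high
    return lower_ok and upper_ok

for w in ("apple", "banana", "peach"):
    print(w, in_range(w, "[apple TO peach}"))
```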

Range search operators can be combined with other operators. For example,

title:[apple TO peach]

will find all documents having a word between apple and peach (inclusive) in the title. See the section "Restricting search words to a section" for details on section references.

The query:

"yellow [apple TO peach]"

will find documents having phrases where the first word is yellow, and the second word is in the range between apple and peach.

The query:

"[green TO yellow][apple TO peach]"

will find documents having phrases where the first word is in the range between green and yellow, and the second word is in the range between apple and peach.

The range query syntax can be used in combination with the decimal type sections. For example, the query:

title:t-shirt price:[10.1 TO 200]

will find documents that have t-shirt in the title and a price in the range from 10.1 to 200, provided that the section price is marked as decimal in both indexer.conf and search.htm:

Section price 5 256 decimal
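
Why the decimal marking matters can be seen from a short Python comparison. The sketch only contrasts lexicographic and numeric ordering in general; it does not describe how mnoGoSearch stores or compares decimal sections internally, and both function names are assumptions made for this example.

```python
from decimal import Decimal

def in_string_range(value, low, high):
    """Inclusive range check using lexicographic (string) ordering."""
    return low <= value <= high

def in_decimal_range(value, low, high):
    """Inclusive range check using numeric (decimal) ordering."""
    return Decimal(low) <= Decimal(value) <= Decimal(high)

# A price of 50 lies between 10.1 and 200 numerically, but not as a string,
# because "50" sorts after "200".
print(in_string_range("50", "10.1", "200"))
print(in_decimal_range("50", "10.1", "200"))
```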