Chapter 11. Searching documents

Table of Contents
Using search front-ends
Designing search.htm
Ranking documents
Tracking search queries
Search results cache
Fuzzy search

Using search front-ends

Performing search

Open your preferred front-end in a Web browser:


http://your.web.server/path/to/search.cgi
or

http://your.web.server/path/to/search.php
or

http://your.web.server/path/to/search.pl

To start a search, type the words you want to find and press the SUBMIT button. For example, ``MySQL ODBC''. mnoGoSearch will find documents having the words MySQL and/or ODBC. The best matching documents will be displayed at the top of the search results.

Note: The quote signs `` and '' are not part of the search query. They are used in this example and in the other examples given in the manual to separate search queries from the other text.

Note: mnoGoSearch works case insensitively. The case of the letters in a search query does not matter.

Search parameters

mnoGoSearch front-ends support the following CGI query string parameters (which can be used in the HTML search form variables).

Note: Search parameters can also be set using the ReplaceVar command.

Table 11-1. Available search parameters

q - text parameter with the search query words.
s - a character sequence specifying the result sorting order. Small letters mean ascending order, capital letters mean descending order. The following letters are understood: R or r - for sorting by score, P or p - for sorting by Popularity Rank, D or d - for sorting by modification date, U or u - for sorting by URL, S or s - for sorting by a user defined section (see also the su parameter). The default value is R, which means sorting in descending score order.
su - the user defined section name to sort results by when s=S or s=s is given.

Note: Use the UserOrder command to improve performance of sorting by a user defined section.

sl.* - a section limit. You can limit searches to a certain value of a desired section. For example, sl.title=Top will only search among the documents having title equal to Top. Section values support SQL wildcards % and _.

<SELECT NAME="sl.title" MULTIPLE>
<OPTION VALUE="%2008%">2008</OPTION>
<OPTION VALUE="%2007%">2007</OPTION>
</SELECT>
The above code in the HTML search form will limit searches to the documents having the substrings 2007 or 2008 in their titles, according to the user choice.
fl - loads a fast limit with the given name pattern. The limit should be previously defined using the Limit command. If the fl value starts with the minus character, the limit is treated as an excluding limit. For example, fl=-name restricts search to the documents not covered by the limit name. The SQL LIKE operator is used when loading fast limits at search time, so the % and _ wildcards can be used in the fl pattern. If the pattern matches multiple limits, search is restricted to the documents covered by any of them. If an excluding limit pattern matches multiple limits, search is restricted to the documents covered by none of them.
ps - page size, the number of documents displayed on one page, 10 by default.
np - the current page number, 0 by default (the first page).
offs - search result start point (offset), 0 by default (meaning display starting from the first document). offs is an alternative way to set the desired offset. np=2&ps=10 is effectively the same as offs=20&ps=10, and both mean display 10 documents starting from the 21st. If both offs and np are specified, then np is ignored.

Note: Using offs you can display results starting from an arbitrary offset, even in the "middle" of a page. For example, offs=5&ps=10 means display 10 documents starting from offset 5, that is, from the 6th document.

m - search mode. The values all and any are supported. The default value is all.
wm - word match type. The available values are wrd, beg, end and sub, meaning whole word, word beginning, word ending and word substring match respectively. The default is whole word match. The minimum word length for substring match is controlled by the SubstringMatchMinWordLength command in search.htm. See also the Section called Substring search notes in Chapter 7.
t - a tag limit. Limits search to the documents with the given tag only. This parameter has a similar effect to the -t option on the indexer command line.
ul

A URL limit. Limits search results by a URL pattern. If the ul value represents a relative URL, then search.cgi automatically adds % wildcards before and after the ul value. For example:


<OPTION VALUE="/manual/">
will add the (url LIKE '%/manual/%') condition to the SQL query. If the ul value is an absolute URL with a scheme, then search.cgi will add the % sign only at the end of the value. For example, for:

<OPTION VALUE="http://localhost/">
search.cgi will add (url LIKE 'http://localhost/%') condition.

Note: Using an absolute URL is more efficient as it can use SQL indexes for optimization.

In addition to the automatically added wildcards, you can use your own % and _ wildcards in the pattern. For example:


<OPTION VALUE="http://localhost/%/archive/">

Multiple ul values can be given in the query string, which makes it possible to use a SELECT MULTIPLE input in the HTML search form. Multiple values are joined using the OR condition. For example, when a user selects both options from this list:


<SELECT NAME="ul" MULTIPLE>
<OPTION VALUE="/dir1/">Dir1</OPTION>
<OPTION VALUE="/dir2/">Dir2</OPTION>
</SELECT>
search.cgi will add (url LIKE '%/dir1/%' OR url LIKE '%/dir2/%') condition into the search query.

ue

Limits the search results by excluding the documents matching the given URL pattern.

The ue parameter detects absolute and relative URL patterns and automatically adds wildcards, and supports your own wildcards, similarly to the ul parameter.

Multiple ue parameters are also understood to exclude multiple URL patterns at the same time. Multiple parameters are joined using the AND SQL operator. For example, when a user selects both options from this list:


<SELECT NAME="ue" MULTIPLE>
<OPTION VALUE="/dir1/">Dir1</OPTION>
<OPTION VALUE="/dir2/">Dir2</OPTION>
</SELECT>
search.cgi will add the (url NOT LIKE '%/dir1/%' AND url NOT LIKE '%/dir2/%') condition to the search query.

Note: The ul and ue parameters can be given at the same time.

wf - a weight factor vector. It allows changing the weights of the different document sections at search time. The wf value should be passed in the form of a hexadecimal number. See the explanation below.
nwf - a No section weight factor vector. See the explanation below.
g - a language limit to find documents only in the given language. The value should be a two-letter language abbreviation. See the Section called Indexing multilingual servers in Chapter 9 for details. An HTML form example:

<SELECT NAME="g">
<OPTION VALUE="">All languages
<OPTION VALUE="en">English
<OPTION VALUE="de">German
<OPTION VALUE="ru">Russian
</SELECT>
tmplt - the search template file name (without path), to specify the template file to use instead of the default file search.htm.
type - a Content-Type limit to find documents with the given type, for example application/pdf. Multiple type parameters can be passed in the same query. SQL LIKE patterns are also understood.
sp - defines whether to use stemming. sp=1 tells search.cgi to use the Ispell commands given in search.htm. sp=0 makes search.cgi ignore all Ispell commands and therefore return only the exact word forms entered by the user. The default value is 1. See the Section called Ispell for details.
sy - defines whether to use synonyms. sy=1 allows using the synonym type of fuzzy search. sy=0 makes search.cgi ignore all synonym-related commands. The default value is 1.
tl - defines whether to use the transliteration type of fuzzy search. tl=yes or tl=1 means to use transliteration. tl=no or tl=0 means to switch transliteration off. The default value is 0.
dt - a time limit. Three time limit types are supported.

dt=back limits the results to recent documents, that is, documents modified within the given period of time before now. The period is passed using the dp parameter.

If dt=er is given, then search results are limited to the documents newer or older than the given date value. dx=1 means newer (or after). dx=-1 means older (or before). The date value is specified using the dy, dm, and dd parameters.

If dt=range is given, then search returns documents modified within the given date range. The parameters db and de are used to pass the first and the last dates.

dp - a "recentness" limit, to be used in combination with dt=back. dp should be specified using the xxxA[yyyB[zzzC]] format, where xxx, yyy, zzz are numbers (can be negative!) and A, B, C are field descriptors, similar to the descriptors used by the strptime() and strftime() C functions, with the following meaning: s - second, M - minute, h - hour, d - day, m - month, y - year. For example:

  4h30M     - 4 hours and 30 minutes
  1y6m-15d  - 1 year and six months minus 15 days
  1h-60M+1s - 1 hour minus 60 minutes plus 1 second
dx - the newer/older flag. dx=1 means newer. dx=-1 means older. dx is to be used together with dt=er.
dm - month (when dt=er), starting from 0: 0 - January, 1 - February, ... , 11 - December.
dy - year (when dt=er), using the four digit format. For example: dy=2008.
dd - day (when dt=er), a number in the range 1...31.
db - the beginning date (when dt=range), using the dd/mm/yyyy format.
de - the end date (when dt=range), using the dd/mm/yyyy format.
us - specifies the name of the user defined score list which should be loaded and mixed with the score values internally calculated by mnoGoSearch, according to the UserScore and UserScoreFactor configuration. If us is empty, or there is no UserScore command with the given name, us is ignored.
ss - specifies the name of the user defined site score list which should be loaded and mixed with the scores internally calculated by mnoGoSearch, according to the UserSiteScore and UserScoreFactor configuration. If ss is empty, or there is no UserSiteScore command with the given name, ss is ignored.
GroupBySite - enables or disables grouping results by site. Can be set to yes or no, with the default value no. This parameter has the same effect as the GroupBySite search.htm command.
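
The parameters listed above are ordinary query string variables, so a search URL can be assembled with any URL-encoding helper. The following Python sketch is only an illustration and not part of mnoGoSearch: the host and path are the placeholders used earlier in this chapter, and the helper names build_search_url and first_document_offset are invented for this example. The second helper restates the np/ps/offs rule given in the table.

```python
from urllib.parse import urlencode

# Placeholder front-end URL, as used in the examples earlier in this chapter
BASE_URL = "http://your.web.server/path/to/search.cgi"


def build_search_url(query, **params):
    """Return a search URL; list values become repeated parameters (e.g. ul)."""
    pairs = {"q": query}
    pairs.update(params)
    # doseq=True expands a list into repeated name=value pairs
    return BASE_URL + "?" + urlencode(pairs, doseq=True)


def first_document_offset(ps=10, np=0, offs=None):
    """Zero-based offset of the first displayed document.

    If offs is given, np is ignored, as described for the offs parameter.
    """
    if offs is not None:
        return offs
    return np * ps


url = build_search_url("MySQL ODBC", m="any", ps=10, np=2, ul=["/dir1/", "/dir2/"])
print(url)
print(first_document_offset(ps=10, np=2))  # same start point as offs=20
```

Passing a list for ul produces two ul parameters in the query string, which corresponds to a user selecting both options of a SELECT MULTIPLE input.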

Changing weights of the different document parts at search time

Changing weights (importance) of the different document parts (sections) is possible with the help of the wf HTML form variable passed to search.cgi.

To be able to use this feature, it is recommended to set different section IDs for different document parts in the Section command in indexer.conf. Currently up to 256 separate sections are supported.

Imagine that we have these default sections in indexer.conf:


Section body        1  256
Section title       2  128
Section keywords    3  128
Section description 4  128

The wf value is a string of hexadecimal digits ABCD, where every digit represents a weight factor for the corresponding section. The rightmost digit corresponds to the section with ID=1. If a weight factor for some section is 0, then this section is totally ignored at search time.

For the section configuration given above:


      D is a factor for section 1 (body)
      C is a factor for section 2 (title)
      B is a factor for section 3 (keywords)
      A is a factor for section 4 (description)

Examples:


    wf=0001 will search through the section body only.

    wf=1110 will search through the sections
    title, keywords and description.
    The section body will be ignored.
    
    wf=F421 will search through:
           Description with factor 15 (F hex)
           Keywords with factor 4
           Title with factor 2
           Body with factor 1
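
The digit-to-section mapping can be checked with a few lines of Python. This is a minimal sketch for illustration only: the function name decode_wf is invented here, and the section names are taken from the indexer.conf example above.

```python
def decode_wf(wf):
    """Map section ID -> weight factor for a wf string of hexadecimal digits.

    The rightmost digit belongs to section ID 1, the next one to ID 2, and so on.
    """
    return {section_id: int(digit, 16)
            for section_id, digit in enumerate(reversed(wf), start=1)}


# Section IDs and names from the indexer.conf example above
SECTIONS = {1: "body", 2: "title", 3: "keywords", 4: "description"}

for section_id, factor in decode_wf("F421").items():
    state = "ignored" if factor == 0 else "factor %d" % factor
    print(SECTIONS[section_id], state)
```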

It is also possible to set the default wf value using the wf search.htm command. If wf is omitted in the query and there is no wf command defined in search.htm, all section factors are considered to be equal to 1, which means that all sections have the same weight.

Starting from the version 3.3.0, it is also possible to specify the wf value as a DBAddr search.htm command parameter. This can be useful if you're using multiple DBAddr commands to merge search results from multiple databases and want to give higher or lower score to the results coming from a certain database.

The nwf search parameter uses the same format as wf. If all found words appear only in a single section, then the resulting score becomes lower. It can be used, for example, to ignore spam in the KEYWORDS meta tag. If you use high wf and nwf values for the section corresponding to the KEYWORDS meta tag, then the score will be high only if KEYWORDS match the rest of the document, that is, if the query words appear in KEYWORDS and at the same time in other sections (like title or body). If the query words are found in the section KEYWORDS alone, then the score for this document will be low. Starting from the version 3.3.3, nwf can also be set as a parameter to the DBAddr command in search.htm.

Changing importance of individual query words

The mnoGoSearch search query language allows specifying different importance for individual search query words. The range of possible user-defined importance values is 0-256. The default value is 256 for all query words. You can change the importance of some words using the special keyword importance immediately followed by a number and a colon character:


star wars importance10:movie
In the above example, importance for the words star and wars is 256 (the default value), while importance for the word movie is 10, which makes it less important when ranking found documents.

If you specify importance0: for some query word, for example:


star wars importance0:movie
then this word will be ignored only at ranking time, however this word will still be required if you're doing an m=all search query (i.e. "find all words"). Therefore, in the above example, search will not return documents which don't have the word movie.
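
The importance syntax can be modelled with a short Python sketch. It only illustrates the notation described above and is not the parser used by mnoGoSearch; the names split_importance and IMPORTANCE_RE are invented for this example.

```python
import re

DEFAULT_IMPORTANCE = 256
# The keyword "importance", a number, a colon, then the word it applies to
IMPORTANCE_RE = re.compile(r"^importance(\d+):(.+)$")


def split_importance(query):
    """Return (word, importance) pairs for a whitespace-separated query."""
    result = []
    for token in query.split():
        match = IMPORTANCE_RE.match(token)
        if match:
            result.append((match.group(2), int(match.group(1))))
        else:
            result.append((token, DEFAULT_IMPORTANCE))
    return result


print(split_importance("star wars importance10:movie"))
```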

Using multiple templates

It is often required to use multiple templates with the same search.cgi. There are a few ways to do it. They are given here in the order search.cgi detects the template name.

  1. search.cgi checks the environment variable UDMSEARCH_TEMPLATE. So you can put a path to the desired search template to UDMSEARCH_TEMPLATE.

  2. search.cgi also supports Apache internal redirect. It checks the REDIRECT_STATUS and REDIRECT_URL environment variables. To start using Apache internal redirect you can add these lines into httpd.conf:

    
    AddType text/html .zhtml
    AddHandler zhtml .zhtml
    Action zhtml /cgi-bin/search.cgi
    

    Put search.cgi into your /cgi-bin/ directory. Then put the HTML search templates into your Web server directory using the .zhtml extension, for example template.zhtml. Now you can open the search page by typing this URL in the browser location bar:

    
    http://www.site.com/path/to/template.zhtml
    
    Instead of .zhtml you can configure any other extension of your choice.

  3. search.cgi also checks the URL part after the "search.cgi" substring, which is available in the PATH_INFO environment variable. For example, if you type http://site/search.cgi/search1.html in your browser, search.cgi will open search1.htm as a template file. If you type http://site/search.cgi/search2.html, it will use search2.htm, and so on.

  4. If the above three ways did not work, search.cgi opens a template which has the same name as the script being executed, which it determines by reading the SCRIPT_NAME environment variable. search.cgi opens the template file ETC/search.htm, search1.cgi opens the template file ETC/search1.htm and so on, where ETC is the mnoGoSearch /etc directory (usually /usr/local/mnogosearch/etc). So, you can create a number of symbolic or hard links to the same search.cgi and open it using its different names.
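
The detection order can be summarized in a small Python sketch. It is a model of the four steps as described above, not the actual search.cgi code: the function name pick_template and its return values are invented for this example, and details the text does not specify (how a redirected URL is mapped to a file on disk, and which directory step 3 looks in) are deliberately left out.

```python
import posixpath


def _stem(path):
    """File name without directory and extension, e.g. /a/search1.cgi -> search1."""
    return posixpath.splitext(posixpath.basename(path))[0]


def pick_template(env, etc_dir="/usr/local/mnogosearch/etc"):
    """Return (rule, template) following the four-step order described above."""
    # 1. Explicit template path in the environment
    if env.get("UDMSEARCH_TEMPLATE"):
        return "UDMSEARCH_TEMPLATE", env["UDMSEARCH_TEMPLATE"]
    # 2. Apache internal redirect: the requested template URL is reported as is
    if env.get("REDIRECT_STATUS") and env.get("REDIRECT_URL"):
        return "REDIRECT_URL", env["REDIRECT_URL"]
    # 3. URL part after "search.cgi", e.g. /search1.html -> search1.htm
    stem = _stem(env.get("PATH_INFO", ""))
    if stem:
        return "PATH_INFO", stem + ".htm"
    # 4. Script name, e.g. /cgi-bin/search1.cgi -> ETC/search1.htm
    stem = _stem(env.get("SCRIPT_NAME", "")) or "search"
    return "SCRIPT_NAME", posixpath.join(etc_dir, stem + ".htm")


print(pick_template({"PATH_INFO": "/search1.html", "SCRIPT_NAME": "/cgi-bin/search.cgi"}))
print(pick_template({"SCRIPT_NAME": "/cgi-bin/search1.cgi"}))
```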

Advanced Boolean search

You can compose complex search queries with the help of the Boolean query language.

mnoGoSearch understands the following Boolean operators:

& - logical AND. For example, ``mysql & odbc''. mnoGoSearch will return the documents containing both words mysql and odbc. You can also use + for this operator.

| - logical OR. For example, ``mysql|odbc''. mnoGoSearch will find the documents containing the word mysql, or containing the word odbc.

~ - logical NOT. For example, ``mysql & ~odbc''. mnoGoSearch will find the documents containing the word mysql and not containing the word odbc at the same time. Note that the ~ operator can only exclude the given word from the results. The query ``~mysql & ~odbc'' will return no result.

() - the grouping command to compose more complex queries. For example, ``(mysql | msql) & ~postgres''.

Note: Boolean operators work only in queries having two or more words. search.cgi ignores Boolean operators in queries consisting of a single word. Thus, the query ``~odbc'' will just search for the word odbc without treating the ~ sign as the NOT operator.

Note: Boolean search considers stopwords as found in any documents that contain the other search terms from the same query. For example, if ``the'' is a stopword, the query ``(Jana First)|(Michael Second)|the'' will return all documents that have any of the four non-stopword terms and is effectively the same as ``Jana|First|Michael|Second''.

Note: If a search query consists of more than 64 words, Boolean search results are not predictable.
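
The meaning of the operators can be illustrated by evaluating a query against the set of words of a single document. The Python sketch below is a toy evaluator, not the mnoGoSearch implementation. It assumes that ~ binds tighter than &, and & tighter than |, which this chapter does not state; it requires an explicit operator between words; and it does not model stopwords, single-word queries, or the rule that purely negative queries return nothing. The function name matches is invented for this example.

```python
import re

# Operators and parentheses are single tokens; everything else is a word
TOKEN_RE = re.compile(r"[&+|~()]|[^\s&+|~()]+")


def matches(query, words):
    """Evaluate a Boolean query against the set of words in one document."""
    tokens = TOKEN_RE.findall(query.lower())
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        pos += 1
        return tokens[pos - 1]

    def parse_or():
        value = parse_and()
        while peek() == "|":
            take()
            right = parse_and()  # always parse the right side, then combine
            value = value or right
        return value

    def parse_and():
        value = parse_not()
        while peek() in ("&", "+"):
            take()
            right = parse_not()
            value = value and right
        return value

    def parse_not():
        if peek() == "~":
            take()
            return not parse_not()
        if peek() == "(":
            take()
            value = parse_or()
            if take() != ")":
                raise ValueError("missing closing parenthesis")
            return value
        return take() in words

    return parse_or()


document = {"mysql", "driver"}
print(matches("mysql & ~odbc", document))
print(matches("(mysql | msql) & ~postgres", document))
```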

Restricting search words to a section

Starting from the version 3.2.39, mnoGoSearch understands section name references. For example, ``title:web body:server'' will find the documents having the word web in their titles and at the same time the word server in their bodies. To make search.cgi recognize section names, you need to copy the desired Section commands from indexer.conf to search.htm.

Note: Section name references can be combined with Boolean operators.
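
As an illustration of the syntax, the following Python sketch splits a query into section/word pairs. It is not the mnoGoSearch parser. The function name split_section_refs is invented here, and the known_sections argument stands in for the Section commands copied into search.htm: a prefix is treated as a section reference only if it names a known section.

```python
def split_section_refs(query, known_sections):
    """Return (section, word) pairs; section is None for unrestricted words."""
    pairs = []
    for token in query.split():
        section, colon, word = token.partition(":")
        if colon and word and section in known_sections:
            pairs.append((section, word))
        else:
            # Not a recognized section reference: keep the token unchanged
            pairs.append((None, token))
    return pairs


print(split_section_refs("title:web body:server", {"title", "body"}))
```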

Phrase search

Phrase search is activated by using quote characters around the words. For example, the query ``"search engine"'' will return the documents having the word search immediately followed by the word engine, while the query ``search engine'' (i.e. without the surrounding quotes) will not require the words to be close to each other.

Note: It is possible to combine two or more phrases in the same query, as well as combine phrases with Boolean operators.

Starting from the version 3.2.39, automatic phrase search is forced for complex words having dots, dashes, underscores, commas and slashes (- _ . , /) as delimiters between the word parts. For example, the query ``max_allowed_packet'' automatically searches for the phrase ``"max allowed packet"'', not just for the three separate words.
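
The rewriting of complex words can be sketched in Python as follows. This only illustrates the rule described above, using the delimiter characters listed there; the function name auto_phrase is invented for this example and the real tokenizer may differ in details.

```python
import re

# Delimiters between word parts that trigger automatic phrase search
DELIMITERS = re.compile(r"[-_.,/]")


def auto_phrase(word):
    """Rewrite a complex word as a quoted phrase; leave simple words unchanged."""
    parts = [part for part in DELIMITERS.split(word) if part]
    if len(parts) < 2:
        return word
    return '"' + " ".join(parts) + '"'


print(auto_phrase("max_allowed_packet"))  # "max allowed packet"
```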

Exact section match

Starting from the version 3.3.0, exact section match syntax is available. An exact section match query consists of a section reference (as described in the Section called Restricting search words to a section), followed by = (the EQUAL sign), followed by a phrase in quotes. For example, the search query ``title="search engine"'' will return the documents having title equal to the phrase "search engine".

Exact section match is not available if you set SaveSectionSize to no.

How search handles expired documents

Expired documents are still searchable with their old content.