Ranking documents

By default, mnoGoSearch sorts results by score. The score is calculated as the relevancy value combined with various other factors, which are listed in the Section called Commands affecting document score.

Note: You can also request a non-default document ordering using the s search parameter. See the Section called Search parameters to learn how to order documents by Date, Popularity Rank, URL and other document parameters.

Commands affecting document score

See these manual sections to learn about the various commands that affect document ordering and/or score values: DateFactor, DocSizeWeight, IDFFactor, MinCoordFactor, NumDistinctWordFactor, NumSections, NumWordFactor, UserScore, PopularityFactor, WordDistanceWeight, Phrase2CountFactor, Phrase3CountFactor, WordFormFactor, WordDensityFactor.

Relevancy

Relevancy for every found document is calculated as the cosine of the angle formed by two weight vectors: the vector for the search query and the vector for the found document. The number of coordinates in each vector is equal to the number of words in the search query (NumWords) multiplied by the number of active sections, defined by the NumSections command: NumWords * NumSections. Every coordinate in the vector corresponds to one word in one section; the coordinate value consists of three factors:

Imagine we typed the search query "test document" in the search form, and the search returned this HTML document among the other results:

<HTML>
  <HEAD>
    <TITLE>
      Test
    </TITLE>
  </HEAD>
  <BODY>
    This is a test document to test the score value 
  </BODY>
</HTML>
Also, for simplicity, imagine that NumSections is set to 2 (that is, only the body and title sections are active), wf is set to its default value (the weight factors for all sections are equal to 1), and WordDensityFactor is set to 255 (the strongest density effect).

mnoGoSearch will use these two vectors to calculate relevancy:


  Vq= (1, 1, 1, 1)
for the search query and

  Vd= (1, 0, 0.2, 0.1)
for the above document, calculated as follows:

The cosine value for the above two vectors is 0.634335.

Now imagine that we set wf to "1111181" and thereby made the weight factor for the title section higher. Relevancy will then be calculated using these two vectors:


  Vq= (8, 8, 1, 1)
for the search query and

  Vd= (8, 0, 0.2, 0.1)
for the above document, which will result in the relevancy value 0.704660.
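
The two relevancy values above can be checked with a few lines of Python. This is only a sketch of the cosine formula applied to the example vectors; the function name cosine is ours and is not part of mnoGoSearch.

```python
import math

def cosine(vq, vd):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(q * d for q, d in zip(vq, vd))
    norm_q = math.sqrt(sum(q * q for q in vq))
    norm_d = math.sqrt(sum(d * d for d in vd))
    return dot / (norm_q * norm_d)

# Vectors from the example with the default wf (all section weights are 1).
default_wf = cosine((1, 1, 1, 1), (1, 0, 0.2, 0.1))

# Vectors from the example with wf set to "1111181" (title weight is 8).
title_boost = cosine((8, 8, 1, 1), (8, 0, 0.2, 0.1))

print(f"{default_wf:.6f}")   # 0.634335
print(f"{title_boost:.6f}")  # 0.704660
```

Raising the title weight increases the relevancy of this document because its strongest coordinate (the word "test" in the title) now dominates both vectors.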

The relevancy value calculated as explained above is then combined with various other parameters to get the final score value, for example the average distance between the words in the document, the distance of the words from the beginning of the section, and the other parameters listed in the Section called Commands affecting document score.

Note: In the default configuration mnoGoSearch produces quite small score values, because it expects the words to be found in up to 256 sections and therefore uses 256-coordinate vectors. See the description of the NumSections search.htm command to learn how to specify the real number of sections and thus increase the score values. Changing NumSections does not affect the document order; it only changes the absolute score values for all documents.
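
The effect described in this note can be illustrated with a small Python sketch. It assumes, purely for illustration, that coordinates for unused sections are 1 in the query vector and 0 in the document vector, and that the vector length is NumWords * NumSections. The second document vector is made up, and the helper names are ours; this is not mnoGoSearch code.

```python
import math

def cosine(vq, vd):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(q * d for q, d in zip(vq, vd))
    return dot / (math.sqrt(sum(q * q for q in vq)) *
                  math.sqrt(sum(d * d for d in vd)))

def pad(vec, num_words, num_sections, fill):
    """Extend a vector to num_words * num_sections coordinates."""
    return list(vec) + [fill] * (num_words * num_sections - len(vec))

query = (1, 1, 1, 1)      # two words, two active sections
docs = {
    "A": (1, 0, 0.2, 0.1),  # the example document from this section
    "B": (1, 0, 0.1, 0),    # hypothetical second document
}

scores = {}
for name, doc in docs.items():
    small = cosine(query, doc)                                  # NumSections = 2
    large = cosine(pad(query, 2, 256, 1), pad(doc, 2, 256, 0))  # NumSections = 256
    scores[name] = (small, large)
    print(name, round(small, 6), round(large, 6), round(large / small, 6))
```

Under these assumptions both documents are scaled by the same factor, so the values shrink but the order of the documents stays the same.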

Analyzing score values

Starting with version 3.3.7, mnoGoSearch lets you debug the score values calculated for the found documents, which helps you find the combination of score factors that works best for you. To debug score values, go through these steps:

  1. Add a line of code that displays the DebugScore property of the search environment before the code where the search template presents result statistics, so that it looks roughly like this (assuming the default search.htm):

    
    ...
    <!-- Result statistics: first, last, total found, search time -->
    <?mnogosearch {cout << env.property("DebugScore"); } ?>
    <table class="result-statistics">
    ...
    

  2. Add a new line of code that displays the ID property of documents, near the place where the template displays order, so that it looks like this:
    
          ...
          <span class="order">
          <?mnogosearch cout << res.document_property_html(i, "order")<< '\n';?>
          [<?mnogosearch cout << res.document_property_html(i, "id"); ?>]
          </span>
    ...
    
  3. Open search.cgi in your browser and run some search query consisting of multiple words. You will additionally see the document IDs near document ranks.

  4. Choose a document you want to see the debug information for. Remember its ID (let's say the ID is 100).

  5. Go to your browser's location bar, add &DebugURLID=100 at the very end of the URL and press Enter.

    Note: Now the URL will look approximately like this:

    
    http://hostname/cgi-bin/search.cgi?q=test+query&DebugURLID=100
              

  6. Find a line of this format in between the search form and the results:
    
    DebugScore: url_id=100 RDsum=98 distance=84 (84/1) minmax=0.99091089
                density=0.00196271 numword=0.90135133 wordform=0.00000000
            
    This line gives you an idea of why the score value for the selected document is too high or too low, and helps you fine-tune parameters such as WordDistanceWeight or WordDensityFactor.

Note: Score debug information is currently displayed only for queries with multiple search words. Queries with a single search word don't return debug information.
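
When you compare many documents, steps 5 and 6 can be scripted. The following Python sketch appends DebugURLID to a search URL and splits a DebugScore line into name/value pairs. The helper names add_debug_url_id and parse_debug_score are ours, and the parser assumes the line has the format shown in step 6; fetching the page itself is left out.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def add_debug_url_id(url, doc_id):
    """Return url with DebugURLID=<doc_id> appended to its query string."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("DebugURLID", str(doc_id)))
    return urlunsplit(parts._replace(query=urlencode(query)))

def parse_debug_score(text):
    """Extract name=value pairs from a DebugScore line as floats."""
    return {name: float(value)
            for name, value in re.findall(r"(\w+)=([0-9.]+)", text)}

debug_url = add_debug_url_id(
    "http://hostname/cgi-bin/search.cgi?q=test+query", 100)
print(debug_url)
# http://hostname/cgi-bin/search.cgi?q=test+query&DebugURLID=100

# Sample line copied from step 6 above.
sample = ("DebugScore: url_id=100 RDsum=98 distance=84 (84/1) "
          "minmax=0.99091089 density=0.00196271 numword=0.90135133 "
          "wordform=0.00000000")
factors = parse_debug_score(sample)
print(factors["distance"], factors["density"])
```

Collecting the factors into a dictionary makes it easy to compare two documents side by side while adjusting parameters.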

Popularity

Popularity is a way of measuring the importance of documents. Taking into account popularity strongly improves the quality of search results over a collection of linked hypertext documents.

Popularity is calculated by counting the number and quality of links to a document. The more incoming links a document has, the higher its popularity. The more outgoing links a document has, the less each of its outgoing links is worth. Self-links (when a document refers to itself) are ignored and do not affect the document popularity.
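
This description does not give the exact formula that indexer uses. As a rough illustration only, the following Python sketch implements one simple reading of it: every document shares one unit of link weight equally among its outgoing links, and self-links are dropped. The function name, the sample links, the handling of duplicate links and the scoring rule are all assumptions made for this example.

```python
from collections import defaultdict

def simple_popularity(links):
    """links: iterable of (source, target) pairs; returns {document: score}."""
    # Assumption: self-links and repeated links between the same pair are dropped.
    edges = {(src, dst) for src, dst in links if src != dst}

    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    score = defaultdict(float)
    for src, dst in edges:
        # A link from a document with many outgoing links counts for less.
        score[dst] += 1.0 / out_degree[src]
    return dict(score)

# Hypothetical link graph: "c" links to itself, which is ignored.
links = [("a", "c"), ("b", "c"), ("b", "d"), ("c", "c")]
popularity = simple_popularity(links)
print(popularity)
```

In this toy graph document "c" ends up more popular than "d": it has two incoming links, and the link from "a" is not diluted by other outgoing links.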

To calculate popularity, indexer uses the information collected in the links table during crawling.

Popularity is automatically calculated when you start indexer --index (i.e. perform full re-indexing). Alternatively, popularity can be calculated without recreating the search index by running indexer --rewritepop.

When calculating popularity for a document, indexer takes into account the value of the ServerWeight command associated with this document (which is 1 by default).