indexer.conf
CollectLinks {all | yes | no | inner | outer | site | page | badscheme | bad | hops | filter | persite}...
CollectLinks defines which kinds of links between documents should be stored in the database. This information can be used to calculate popularity rank, as well as for SEO purposes.
Multiple arguments can be given in the same command.
The following argument values are understood:
inner - links to the documents that are covered by Server or Realm commands.
outer - links to the documents that are not covered by any Server or Realm command.
site - links going to the same site.
page - links going to the same document (self-links).
badscheme - links with URL schemes that are normally ignored by indexer, such as mailto: or javascript:.
bad - erroneous links with a malformed address.
filter - links to the documents that should not be crawled because of Disallow rules.
hops - links from the documents that have reached the MaxHops limit.
persite - links to the documents that have reached the MaxDocPerSite limit.
yes - is a synonym for:
CollectLinks inner outer site filter hops persite
Note: Self-links (page), unsupported links (badscheme) and malformed links (bad) are not included in yes.
all - collect all links, including those not covered by yes.
no - do not collect any links.
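How the argument values relate to each other can be illustrated with a small model. The Python sketch below is not mnoGoSearch code: the function name expand_collect_links is hypothetical, and the rule that several arguments are combined as a union (with no contributing nothing) is an assumption made for illustration. It only encodes what is stated above: yes stands for six link kinds, and all additionally covers page, badscheme and bad.

```python
# Illustrative model only; not taken from the mnoGoSearch sources.
YES_KINDS = frozenset({"inner", "outer", "site", "filter", "hops", "persite"})
EXTRA_KINDS = frozenset({"page", "badscheme", "bad"})  # not covered by "yes"
ALL_KINDS = YES_KINDS | EXTRA_KINDS


def expand_collect_links(args):
    """Return the set of link kinds named by a list of CollectLinks arguments.

    Assumption: arguments are combined as a union, and "no" adds nothing.
    """
    kinds = set()
    for arg in args:
        if arg == "yes":
            kinds |= YES_KINDS
        elif arg == "all":
            kinds |= ALL_KINDS
        elif arg == "no":
            continue
        elif arg in ALL_KINDS:
            kinds.add(arg)
        else:
            raise ValueError(f"unknown CollectLinks argument: {arg}")
    return kinds


if __name__ == "__main__":
    print(sorted(expand_collect_links(["yes"])))
    print(sorted(expand_collect_links(["yes", "page"])))
```

Under this model, a command such as CollectLinks yes page would cover the six kinds implied by yes plus self-links.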
Using popularity rank is described in detail in the Section called Popularity in Chapter 11.
mnoGoSearch versions prior to 3.3.0 implicitly collected links between all crawled documents. Starting with version 3.3.0, the default behavior was changed to skip link collection in order to improve crawling performance. As a side effect, popularity rank cannot be calculated in the default configuration. If popularity rank is important for your installation, specify CollectLinks yes in indexer.conf.
Link information is stored in the links table of the mnoGoSearch database, which has the following structure:
CREATE TABLE links (
  url_id int(11) NOT NULL,
  weight float NOT NULL,
  url text NOT NULL,
  src varchar(10) NOT NULL,
  rel varchar(32) NOT NULL,
  linktext text NOT NULL,
  KEY url_id (url_id)
);
Note: The structure can slightly vary depending on the database backend being used.
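Once links are collected, the table can be queried like any other. The sketch below is a minimal illustration using Python's built-in sqlite3 module with an in-memory database, so the column types are adapted to SQLite and the KEY clause becomes a separate index. The sample rows are invented, the helper name links_per_url is hypothetical, and reading the url column as the address the link points to is an assumption. It counts how many stored links refer to each address.

```python
import sqlite3

# In-memory stand-in for the links table (SQLite flavour of the structure above).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE links (
  url_id   INTEGER NOT NULL,
  weight   REAL    NOT NULL,
  url      TEXT    NOT NULL,
  src      TEXT    NOT NULL,
  rel      TEXT    NOT NULL,
  linktext TEXT    NOT NULL
);
CREATE INDEX links_url_id ON links (url_id);
""")

# Invented sample data, for illustration only.
sample = [
    (1, 1.0, "http://www.site.com/", "a", "", "home"),
    (2, 1.0, "http://www.site.com/", "a", "", "back to home"),
    (2, 1.0, "http://www.site.com/page2.html", "link", "alternate", ""),
]
conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?, ?, ?)", sample)


def links_per_url(connection):
    """Return (url, number of stored links) pairs, most linked first."""
    rows = connection.execute(
        "SELECT url, COUNT(*) FROM links "
        "GROUP BY url ORDER BY COUNT(*) DESC, url"
    )
    return rows.fetchall()


if __name__ == "__main__":
    for url, count in links_per_url(conn):
        print(count, url)
```

Against a real installation the same SELECT would be run through the client of whichever database backend is in use.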
The src field stores information about the link source, where the link came from:
a - the a HTML tag:
<a href="http://www.site.com/">link text</a>
frame - the frame HTML tag:
<frame src="frame.html">
iframe - the iframe HTML tag:
<iframe src="http://www.site.com"></iframe>
img - the img HTML tag:
<img src="a.jpg">
meta - the meta HTML tag:
<meta http-equiv="refresh" content="5;URL='http://www.site.com/'">
link - the link HTML tag:
<link rel="alternate" href="page2.html">
area - the area HTML tag:
<area shape="rect" coords="0,0,82,126" href="page.htm">
script - the script HTML tag:
<script src="myscripts.js"></script>
xml - a tag in an XML file, e.g.:
<item>
  <title>title1</title>
  <link>http://site.com/</link>
</item>
htdb - the link was generated by some of the HTDB routines.
redir - the Location header of an HTTP redirect.
conf - the indexer.conf file, e.g. from the Server or URL commands.
cline - the indexer command line, e.g.:
indexer -i -u http://www.site.com/
ufile - the URL file given on the indexer command line, e.g.:
indexer -i -f urllist.txt
robots - a directive in robots.txt, e.g.:
Sitemap: http://www.site.com/sitemap.xml
The rel field stores information from the rel attribute of the link tag, e.g.:
<link rel="canonical" href="http://www.site.com/"/>
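The src and rel fields make it possible to break collected links down by origin. The following sketch again uses Python's sqlite3 module with an in-memory table and invented rows; the helper names are hypothetical, and it assumes that the rel attribute value is stored verbatim (for example, the string canonical), which is not confirmed above.

```python
import sqlite3


def make_sample_db():
    """Build an in-memory links table filled with invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE links (url_id INTEGER NOT NULL, weight REAL NOT NULL, "
        "url TEXT NOT NULL, src TEXT NOT NULL, rel TEXT NOT NULL, "
        "linktext TEXT NOT NULL)"
    )
    rows = [
        (1, 1.0, "http://www.site.com/", "a", "", "link text"),
        (1, 1.0, "http://www.site.com/", "link", "canonical", ""),
        (1, 1.0, "http://www.site.com/page2.html", "link", "alternate", ""),
        (2, 1.0, "http://www.site.com/sitemap.xml", "robots", "", ""),
        (3, 1.0, "http://www.site.com/new/", "redir", "", ""),
    ]
    conn.executemany("INSERT INTO links VALUES (?, ?, ?, ?, ?, ?)", rows)
    return conn


def count_by_src(conn):
    """Return a dict mapping each src value to its number of links."""
    cur = conn.execute("SELECT src, COUNT(*) FROM links GROUP BY src")
    return dict(cur.fetchall())


def canonical_targets(conn):
    """Return addresses of link-tag links whose rel is 'canonical' (assumed verbatim)."""
    cur = conn.execute(
        "SELECT url FROM links WHERE src = 'link' AND rel = 'canonical' "
        "ORDER BY url"
    )
    return [row[0] for row in cur.fetchall()]


if __name__ == "__main__":
    db = make_sample_db()
    print(count_by_src(db))
    print(canonical_targets(db))
```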