Character sets
Supported character sets
mnoGoSearch supports almost all known 8-bit
character sets as well as some multi-byte charsets, including Korean
euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, as well as
utf8. Some multi-byte character sets are not
supported by default because their conversion tables are
rather large and increase the size of the executable
files. See the configure parameters to enable support
for these charsets.
mnoGoSearch also supports the following
Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman,
MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew,
MacCyrillic, MacGujarati.
Several languages in one database
It is often necessary to deal with several
languages simultaneously. The number of supported languages depends on
the character set that mnoGoSearch
uses to store data. The character set is specified with the
LocalCharset command.
UTF-8 mode
When UTF-8 is specified in the
LocalCharset command, you may work with any
language supported by Unicode, which covers over 650 languages. However, UTF-8 may consume more disk space (up to twice as much for some languages), leading to slower indexing and searching.
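The size difference can be checked with a short Python sketch. It is only an illustration and not part of mnoGoSearch; the sample words, the legacy charsets chosen for comparison, and the helper name size_ratio are assumptions made for this example.

```python
# Compare how many bytes the same text needs in UTF-8 and in a
# single-byte charset. Sample words and charsets are illustrative only.
samples = {
    "English": ("hello", "latin_1"),
    "Russian": ("привет", "koi8_r"),
    "Greek": ("γειά", "iso8859_7"),
}


def size_ratio(text: str, legacy_charset: str) -> float:
    """Return len(UTF-8 bytes) / len(legacy charset bytes) for the text."""
    return len(text.encode("utf-8")) / len(text.encode(legacy_charset))


for language, (text, charset) in samples.items():
    print(f"{language}: UTF-8 is {size_ratio(text, charset):.1f}x the size of {charset}")
```

Latin letters take one byte in both encodings, while Cyrillic and Greek letters take two bytes each in UTF-8, which is where the "up to twice" figure comes from for such languages.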
non-UTF-8 mode
Since every character set includes Latin
characters, any character set supports at least two languages:
English and one or more other languages. US-ASCII
is an exception: it supports only Latin characters.
When using
mnoGoSearch in standard (non-UTF-8) mode,
you may use as many languages as you like if they all belong to the same
language group.
Language groups
Language group | Languages | Character sets
Group 1 | Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish | ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland
Group 2 | Eastern Europe: Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovene | CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian
Group 4 | Baltic | CP1257, ISO 8859-4, ISO 8859-13
Group 5 | Cyrillic: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian | CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic
Group 6 | Arabic | CP864, CP1256, ISO 8859-6, MacArabic
Group 7 | Greek | CP869, CP1253, ISO 8859-7, MacGreek
Group 8 | Hebrew | CP1255, ISO 8859-8, MacHebrew
Group 9 | Turkish | CP857, CP1254, ISO 8859-9, MacTurkish
Group 101 | Japanese | Shift-JIS, EUC-JP, ISO-2022-JP
Group 102 | Simplified Chinese (PRC) | EUC-GB
Group 103 | Traditional Chinese (ROC) | Big 5
Group 104 | Korean | EUC-KR
Group 105 | Thai | CP874, TIS 620, MacThai
Group 106 | Vietnamese | CP1258
Group 107 | Indian | MacGujarati
Group 108 | Georgian | geostd8
Unicode | Over 650 languages | UTF-8 (Unicode)
For example, if your search engine is configured to
use a LocalCharset from the 5th group (Cyrillic), you
may index servers containing documents in Bulgarian, Byelorussian,
Macedonian, Russian, Serbian and Ukrainian. Indexing a multi-language
document in UTF-8 is possible as well; however,
indexer will extract and save only the Cyrillic content
from the page. To support over 650 languages, use the
LocalCharset utf-8 command.
Recoding
indexer recodes all
documents to the character set specified in the
LocalCharset command in
indexer.conf.
Internally, recoding is implemented using Unicode. Please note
that some recodings may lose data. For example, recoding between
any Greek and Russian charsets loses all national characters. This
does not matter for single-language sites. If you want to build
a multilingual search engine, use the UTF8 character set as the
LocalCharset.
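The data loss described above can be reproduced with Python's built-in codecs. This is a minimal sketch of recoding through Unicode, not mnoGoSearch's own code; the function name recode and the choice of "?" as the replacement for unmappable characters are assumptions of this example.

```python
def recode(data: bytes, source_charset: str, target_charset: str) -> bytes:
    """Recode bytes from one charset to another through Unicode.

    Characters that do not exist in the target charset are replaced
    with '?', which is how the data loss becomes visible here.
    """
    text = data.decode(source_charset)
    return text.encode(target_charset, errors="replace")


# A Greek page recoded to a Cyrillic charset keeps only the Latin part.
greek_page = "Καλημέρα world".encode("iso8859_7")
recoded = recode(greek_page, "iso8859_7", "cp1251")
print(recoded)  # b'???????? world'

# Recoding inside one language group is lossless.
russian_page = "мир".encode("koi8_r")
print(recode(russian_page, "koi8_r", "cp1251").decode("cp1251"))  # мир
```

Recoding the same Greek page to UTF-8 instead would keep every character, which is why UTF-8 is the safe choice for a multilingual index.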
Recoding at search time
You may use the BrowserCharset
command to choose the charset used to display search
results. BrowserCharset may differ from
LocalCharset.
Character set aliases
Each charset is recognized by a number of
aliases. Web servers can return the same charset in different
notations. For example, iso-8859-2, iso8859-2 and latin2 are the same
charset. The search engine understands the following charset name
aliases:
Charset aliases
iso-8859-1:
CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1
iso-8859-10:
CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6
iso-8859-11:
ISO-8859-11, TIS-620, TIS620, TACTIS
iso-8859-13:
ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7
iso-8859-14:
ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8
iso-8859-15:
ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998
iso-8859-16:
ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000
iso-8859-2:
CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2
iso-8859-3:
CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3
iso-8859-4:
CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4
iso-8859-5:
CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988
iso-8859-6:
ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987
iso-8859-7:
CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987
iso-8859-8:
CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988
iso-8859-9:
CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5
armscii-8:
ARMSCII-8
big5:
BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5
cp1250:
CP1250, MS-EE, WINDOWS-1250
cp1251:
CP1251, MS-CYRL, WINDOWS-1251
cp1252:
CP1252, MS-ANSI, WINDOWS-1252
cp1253:
CP1253, MS-GREEK, WINDOWS-1253
cp1254:
CP1254, MS-TURK, WINDOWS-1254
cp1255:
CP1255, MS-HEBR, WINDOWS-1255
cp1256:
CP1256, MS-ARAB, WINDOWS-1256
cp1257:
CP1257, WINBALTRIM, WINDOWS-1257
cp1258:
CP1258, WINDOWS-1258
cp437:
437, CP437, IBM437
cp850:
850, CP850, CSPC850MULTILINGUAL, IBM850
cp852:
852, CP852, IBM852
cp855:
855, CP855, IBM855
cp857:
857, CP857, IBM857
cp860:
860, CP860, IBM860
cp861:
861, CP861, IBM861
cp862:
862, CP862, IBM862
cp863:
863, CP863, IBM863
cp864:
864, CP864, IBM864
cp865:
865, CP865, IBM865
cp866:
866, CP866, CSIBM866, IBM866
cp869:
869, CP869, IBM869
cp874:
CP874, WINDOWS-874
euc-kr:
CSEUCKR, EUC-KR, EUCKR
gb2312:
CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58
koi8-r:
CSKOI8R, KOI8-R
koi8-u:
KOI8-U
shift-jis:
CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS
cp367:
ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII
utf8:
UTF-8, UTF8
viscii:
CSVISCII, VISCII, VISCII1.1-1
maccyrillic:
MACCYRILLIC, X-MAC-CYRILLIC
macroman:
MACROMAN, MACINTOSH, CSMACINTOSH, MAC
MacCentralEurope:
MACCENTRALEUROPE, MACCE
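An alias table like the one above is typically used as a lookup from any reported name to one canonical name. The following Python sketch shows the idea with a small subset of the table; the names ALIASES, CANONICAL and canonical_charset are hypothetical, and the case-insensitive matching is an assumption of this example rather than documented mnoGoSearch behaviour.

```python
from typing import Dict, List, Optional

# A small subset of the alias table; keys are the canonical names.
ALIASES: Dict[str, List[str]] = {
    "iso-8859-2": ["CSISOLATIN2", "ISO-8859-2", "ISO-IR-101", "ISO_8859-2",
                   "ISO_8859-2:1987", "L2", "LATIN2"],
    "cp1251": ["CP1251", "MS-CYRL", "WINDOWS-1251"],
    "koi8-r": ["CSKOI8R", "KOI8-R"],
    "utf8": ["UTF-8", "UTF8"],
}

# Invert the table once so that each lookup is a single dictionary access.
CANONICAL: Dict[str, str] = {
    alias.upper(): name for name, aliases in ALIASES.items() for alias in aliases
}


def canonical_charset(name: str) -> Optional[str]:
    """Return the canonical charset for an alias, or None if it is unknown."""
    return CANONICAL.get(name.strip().upper())


print(canonical_charset("latin2"))        # iso-8859-2
print(canonical_charset("Windows-1251"))  # cp1251
print(canonical_charset("x-unknown"))     # None
```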
Document charset detection
indexer detects the document character set in this order:
1. "Content-type: text/html; charset=xxx"
2. <META NAME="Content-Type" CONTENT="text/html; charset=xxx">
This step may be switched off with the command GuesserUseMeta no in your
indexer.conf.
3. Defaults from the "Charset" field in Common Parameters
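The precedence described above can be sketched in Python as follows. This illustrates the ordering only and is not indexer's implementation; the function detect_charset, its parameters, and the simplified regular expressions (which assume the attribute order shown above and also accept HTTP-EQUIV in place of NAME) are all assumptions of this example.

```python
import re
from typing import Optional

# Extracts the value after "charset=" from a Content-Type string.
CHARSET_RE = re.compile(r"charset=([\w.:-]+)", re.IGNORECASE)
# Simplified META matcher: assumes NAME/HTTP-EQUIV comes before CONTENT.
META_RE = re.compile(
    r'<meta\s+(?:name|http-equiv)="content-type"\s+content="([^"]*)"',
    re.IGNORECASE,
)


def detect_charset(http_content_type: Optional[str], html: str,
                   default_charset: str, use_meta: bool = True) -> str:
    """Pick a charset: HTTP header first, then META tag, then the default."""
    # Step 1: the HTTP Content-type header.
    if http_content_type:
        match = CHARSET_RE.search(http_content_type)
        if match:
            return match.group(1)
    # Step 2: the META tag, unless disabled (compare "GuesserUseMeta no").
    if use_meta:
        meta = META_RE.search(html)
        if meta:
            match = CHARSET_RE.search(meta.group(1))
            if match:
                return match.group(1)
    # Step 3: the configured default (the "Charset" command).
    return default_charset


page = '<META NAME="Content-Type" CONTENT="text/html; charset=cp1251">'
print(detect_charset("text/html; charset=koi8-r", page, "iso-8859-1"))  # koi8-r
print(detect_charset("text/html", page, "iso-8859-1"))                  # cp1251
print(detect_charset("text/html", page, "iso-8859-1", use_meta=False))  # iso-8859-1
```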
Automatic charset guesser
Since version 3.2.0, mnoGoSearch has an automatic charset
and language guesser. It currently recognizes more than 100
charsets and languages. Charset and language detection is implemented
using the "N-Gram-Based Text Categorization" technique. There are a number
of so-called "language map" files, one for each language-charset
pair. They are installed under the
/usr/local/mnogosearch/etc/langmap/ directory by
default. Take a look there to check the list of currently provided
charset-language pairs. The guesser works well for texts longer than 500
characters. Shorter texts may not be guessed well.
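To give an idea of how N-gram-based text categorization works in general, here is a small Python sketch of the classic approach: build a ranked list of the most frequent byte n-grams for each language-charset sample, then pick the map whose ranking is closest to the document's. It is a generic illustration under stated assumptions, not mguesser's algorithm or file format; the sample sentences, the profile size of 300, and all function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List

MAX_PROFILE = 300  # assumed profile size; also used as the miss penalty


def ngram_profile(data: bytes, max_n: int = 3) -> List[bytes]:
    """Return the most frequent byte n-grams (n = 1..max_n), most frequent first."""
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(data) - n + 1):
            counts[data[i:i + n]] += 1
    return [gram for gram, _ in counts.most_common(MAX_PROFILE)]


def out_of_place(doc_profile: List[bytes], map_profile: List[bytes]) -> int:
    """Sum of rank differences; n-grams absent from the map cost MAX_PROFILE."""
    rank = {gram: i for i, gram in enumerate(map_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else MAX_PROFILE
        for i, gram in enumerate(doc_profile)
    )


def guess(data: bytes, maps: Dict[str, List[bytes]]) -> str:
    """Return the name of the language-charset map closest to the document."""
    doc_profile = ngram_profile(data)
    return min(maps, key=lambda name: out_of_place(doc_profile, maps[name]))


# Toy samples; real language maps are built from much larger texts.
RUSSIAN = "поисковая система индексирует документы на разных языках и в разных кодировках"
ENGLISH = "the search engine indexes documents in many languages and character sets"

# One map per language-charset pair, as in the langmap directory.
maps = {
    "ru.koi8-r": ngram_profile(RUSSIAN.encode("koi8_r")),
    "ru.cp1251": ngram_profile(RUSSIAN.encode("cp1251")),
    "en.latin1": ngram_profile(ENGLISH.encode("latin_1")),
}

print(guess("документы на русском языке".encode("koi8_r"), maps))
```

Because the same Russian text produces different byte sequences in koi8-r and cp1251, the two maps differ even though the language is the same, which is why one map per language-charset pair is needed. With samples this short the result is fragile, in line with the remark about short texts.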
Build your own language maps
To build your own language map, use the mguesser
utility. In addition, you need to collect a file with language samples in the desired charset. To create a new language map,
use the following command:
mguesser -p -c charset -l language < FILENAME > language.charset.lm
You can also use the mguesser utility to guess a document's language and charset with the existing
language maps. To do this, use the following command:
mguesser [-n maxhits] < FILENAME
Some languages may use several different charsets. To convert from one charset supported by
mnoGoSearch to another, use the mconv
utility.
mconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile
By default, both the mguesser and mconv utilities are installed into the
/usr/local/mnogosearch/sbin/ directory.
Since version 3.2.14, mnoGoSearch can update language and charset maps
automatically while indexing, if the remote server supplies pages with an exactly specified language and charset.
To enable this function, specify the command
LangMapUpdate yes
in your indexer.conf file.
Default charset
Use the Charset indexer.conf command to choose the default charset of indexed servers.
Default Language
You can set the default language for servers with the DefaultLang
indexer.conf variable. This is useful when restricting search by URL language.