Character sets <indexterm><primary>Charsets</primary></indexterm>

Supported character sets

mnoGoSearch supports almost all known 8-bit character sets, as well as some multi-byte character sets, including Korean euc-kr, Chinese big5 and gb2312, Japanese shift-jis, euc-jp and iso-2022-jp, and utf8. Some multi-byte character sets are not supported by default, because their conversion tables are rather large and increase the size of the executable files. See the configure parameters to enable support for these character sets.

mnoGoSearch also supports the following Macintosh character sets: MacCE, MacCroatian, MacGreek, MacRoman, MacTurkish, MacIceland, MacRomania, MacThai, MacArabic, MacHebrew, MacCyrillic, MacGujarati.

Several languages in one database

It is often necessary to deal with several languages simultaneously. The number of supported languages depends on the character set that mnoGoSearch uses to store data. This character set is specified with the LocalCharset command.

UTF-8 mode

When UTF-8 is specified in the LocalCharset command, you may work with any language supported by Unicode, that is, with any number of over 650 languages. However, UTF-8 may consume more disk space (up to twice as much for some languages), which leads to slower indexing and searching.

Non-UTF-8 mode

Since every character set includes Latin characters, every character set supports at least two languages: English and one or more others. US-ASCII is an exception: it supports only Latin characters. When using mnoGoSearch in standard (non-UTF-8) mode, you may use as many languages as you like, provided they all belong to the same language group.

Language groups

Group 1
  Languages: Western Europe: Albanian, Catalan, Danish, Dutch, English, Faeroese, Finnish, French, Galician, German, Icelandic, Italian, Norwegian, Portuguese, Spanish, Swedish
  Character sets: ASCII 8, CP437, CP850, CP860, CP1252, ISO 8859-1, ISO 8859-15, MacRoman, MacIceland

Group 2
  Languages: Eastern Europe: Croatian, Czech, Estonian, Hungarian, Latvian, Lithuanian, Polish, Romanian, Slovak, Slovene
  Character sets: CP852, CP1250, ISO 8859-2, MacCentralEurope, MacRomania, MacCroatian

Group 4
  Languages: Baltic
  Character sets: CP1257, iso-8859-4, iso-8859-13

Group 5
  Languages: Cyrillic: Bulgarian, Byelorussian, Macedonian, Russian, Serbian, Ukrainian
  Character sets: CP855, CP866, CP1251, ISO 8859-5, Koi8-r, Koi8-u, MacCyrillic

Group 6
  Languages: Arabic
  Character sets: CP864, CP1256, ISO 8859-6, MacArabic

Group 7
  Languages: Greek
  Character sets: CP869, CP1253, ISO 8859-7, MacGreek

Group 8
  Languages: Hebrew
  Character sets: CP1255, ISO 8859-8, MacHebrew

Group 9
  Languages: Turkish
  Character sets: CP857, CP1254, ISO 8859-9, MacTurkish

Group 101
  Languages: Japanese
  Character sets: Shift-JIS, EUC-JP, ISO-2022-JP

Group 102
  Languages: Simplified Chinese (PRC)
  Character sets: EUC-GB

Group 103
  Languages: Traditional Chinese (ROC)
  Character sets: Big 5

Group 104
  Languages: Korean
  Character sets: EUC-KR

Group 105
  Languages: Thai
  Character sets: CP874, TIS 620, MacThai

Group 106
  Languages: Vietnamese
  Character sets: CP1258

Group 107
  Languages: Indian
  Character sets: MacGujarati

Group 108
  Languages: Georgian
  Character sets: geostd8

Unicode
  Languages: Over 650 languages
  Character sets: UTF-8 (Unicode)
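
The grouping above follows from which characters a single-byte character set can represent. The following is a minimal sketch of that idea, not mnoGoSearch code: it assumes Python's built-in codec tables as a stand-in for mnoGoSearch's conversion tables, and the sample words and the helper name lost_characters are illustrative only.

```python
# Illustrative sample words per language (assumed, not from the manual).
SAMPLES = {
    "Russian": "поиск",
    "Ukrainian": "пошук їжі",
    "Serbian": "претрага њива",
    "Greek": "αναζήτηση",
    "English": "search",
}


def lost_characters(text, charset):
    """Count the characters of `text` that `charset` cannot represent."""
    lost = 0
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            lost += 1
    return lost


def report(charset):
    """Map each sample language to the number of characters lost in `charset`."""
    return {lang: lost_characters(text, charset) for lang, text in SAMPLES.items()}


if __name__ == "__main__":
    # cp1251 is a group 5 charset, iso-8859-7 a group 7 charset.
    for charset in ("cp1251", "iso-8859-7", "utf-8"):
        print(charset, report(charset))
```

In this sketch, cp1251 keeps every Cyrillic sample and the English one but loses all Greek letters, iso-8859-7 does the opposite, and utf-8 loses nothing.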

For example, if your search engine is configured with a LocalCharset from group 5 (Cyrillic), you may index servers containing documents in Bulgarian, Byelorussian, Macedonian, Russian, Serbian and Ukrainian. Indexing a multi-language document in UTF-8 is possible as well; however, indexer will extract and save only the Cyrillic content from the page. To support over 650 languages, use the LocalCharset utf-8 command.

Recoding

indexer recodes all documents to the character set specified in the LocalCharset indexer.conf command. Internally, recoding is implemented using Unicode. Note that some recodings may lose data. For example, recoding between any Greek and any Russian charset loses all national characters. This does not matter for single-language sites. If you want to build a multilingual search engine, use UTF-8 as the LocalCharset.

Recoding at search time

You may use the BrowserCharset command to choose the charset used to display search results. BrowserCharset may differ from LocalCharset.

Character sets aliases

Each charset is recognized by a number of aliases. Web servers can return the same charset in different notations. For example, iso-8859-2, iso8859-2 and latin2 all name the same charset.

The search engine understands the following charset name aliases:

Charsets aliases

iso-8859-1: CP819, CSISOLATIN, IBM819, ISO-8859-1, ISO-IR-100, ISO_8859-1, ISO_8859-1:1987, L1, LATIN1
iso-8859-10: CSISOLATIN6, ISO-8859-10, ISO-IR-157, ISO_8859-10, ISO_8859-10:1992, L6, LATIN6
iso-8859-11: ISO-8859-11, TIS-620, TIS620, TACTIS
iso-8859-13: ISO-8859-13, ISO-IR-179, ISO_8859-13, L7, LATIN7
iso-8859-14: ISO-8859-14, ISO-IR-199, ISO_8859-14, ISO_8859-14:1998, L8, LATIN8
iso-8859-15: ISO-8859-15, ISO-IR-203, ISO_8859-15, ISO_8859-15:1998
iso-8859-16: ISO-8859-16, ISO-IR-226, ISO_8859-16, ISO_8859-16:2000
iso-8859-2: CSISOLATIN2, ISO-8859-2, ISO-IR-101, ISO_8859-2, ISO_8859-2:1987, L2, LATIN2
iso-8859-3: CSISOLATIN3, ISO-8859-3, ISO-IR-109, ISO_8859-3, ISO_8859-3:1988, L3, LATIN3
iso-8859-4: CSISOLATIN4, ISO-8859-4, ISO-IR-110, ISO_8859-4, ISO_8859-4:1988, L4, LATIN4
iso-8859-5: CSISOLATINCYRILLIC, CYRILLIC, ISO-8859-5, ISO-IR-144, ISO_8859-5, ISO_8859-5:1988
iso-8859-6: ARABIC, ASMO-708, CSISOLATINARABIC, ECMA-114, ISO-8859-6, ISO-IR-127, ISO_8859-6, ISO_8859-6:1987
iso-8859-7: CSISOLATINGREEK, ECMA-118, ELOT_928, GREEK, GREEK8, ISO-8859-7, ISO-IR-126, ISO_8859-7, ISO_8859-7:1987
iso-8859-8: CSISOLATINHEBREW, HEBREW, ISO-8859-8, ISO-IR-138, ISO_8859-8, ISO_8859-8:1988
iso-8859-9: CSISOLATIN5, ISO-8859-9, ISO-IR-148, ISO_8859-9, ISO_8859-9:1989, L5, LATIN5
armscii-8: ARMSCII-8
big5: BIG-5, BIG-FIVE, BIG5, BIGFIVE, CN-BIG5, CSBIG5
cp1250: CP1250, MS-EE, WINDOWS-1250
cp1251: CP1251, MS-CYRL, WINDOWS-1251
cp1252: CP1252, MS-ANSI, WINDOWS-1252
cp1253: CP1253, MS-GREEK, WINDOWS-1253
cp1254: CP1254, MS-TURK, WINDOWS-1254
cp1255: CP1255, MS-HEBR, WINDOWS-1255
cp1256: CP1256, MS-ARAB, WINDOWS-1256
cp1257: CP1257, WINBALTRIM, WINDOWS-1257
cp1258: CP1258, WINDOWS-1258
cp437: 437, CP437, IBM437
cp850: 850, CP850, CSPC850MULTILINGUAL, IBM850
cp852: 852, CP852, IBM852
cp855: 855, CP855, IBM855
cp857: 857, CP857, IBM857
cp860: 860, CP860, IBM860
cp861: 861, CP861, IBM861
cp862: 862, CP862, IBM862
cp863: 863, CP863, IBM863
cp864: 864, CP864, IBM864
cp865: 865, CP865, IBM865
cp866: 866, CP866, CSIBM866, IBM866
cp869: 869, CP869, IBM869
cp874: CP874, WINDOWS-874
euc-kr: CSEUCKR, EUC-KR, EUCKR
gb2312: CHINESE, CSGB2312, CSISO58GB231280, GB2312, GB_2312-80, ISO-IR-58
koi8-r: CSKOI8R, KOI8-R
koi8-u: KOI8-U
shift-jis: CSSHIFTJIS, MS_KANJI, S-JIS, SHIFT-JIS, SHIFT_JIS, SJIS
cp367: ANSI_X3.4-1968, ASCII, CP367, CSASCII, IBM367, ISO-IR-6, ISO646-US, ISO_646.IRV:1991, US, US-ASCII
utf8: UTF-8, UTF8
viscii: CSVISCII, VISCII, VISCII1.1-1
maccyrillic: MACCYRILLIC, X-MAC-CYRILLIC
macroman: MACROMAN, MACINTOSH, CSMACINTOSH, MAC
MacCentralEurope: MACCENTRALEUROPE, MACCE
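
Alias resolution of this kind can be illustrated with a short sketch. It uses Python's own codec registry, which keeps a similar but not identical alias table, so it shows the idea only and says nothing about how mnoGoSearch looks names up. The helper names canonical_name and same_charset are hypothetical.

```python
import codecs


def canonical_name(alias):
    """Return Python's canonical codec name for `alias`, or None if unknown."""
    try:
        return codecs.lookup(alias).name
    except LookupError:
        return None


def same_charset(name_a, name_b):
    """True if both names are known and resolve to the same codec."""
    a = canonical_name(name_a)
    b = canonical_name(name_b)
    return a is not None and a == b


if __name__ == "__main__":
    # The three spellings from the text above resolve to one codec.
    for alias in ("iso-8859-2", "iso8859-2", "latin2"):
        print(alias, "->", canonical_name(alias))
    print(same_charset("windows-1251", "cp1251"))
```

Lookups are case-insensitive in Python's registry, and an unknown name yields None here rather than an exception.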

Document charset detection

indexer detects the document character set in this order:

1. The HTTP header "Content-type: text/html; charset=xxx".
2. The tag <META NAME="Content-Type" CONTENT="text/html; charset=xxx">. This step can be switched off with the command GuesserUseMeta no in your indexer.conf.
3. The default from the "Charset" field in Common Parameters.

Automatic charset guesser

Since version 3.2.0, mnoGoSearch has an automatic charset and language guesser. It currently recognizes more than 100 charsets and languages. Charset and language detection is implemented using the "N-Gram-Based Text Categorization" technique. There are a number of so-called "language map" files, one for each language-charset pair. By default they are installed under the /usr/local/mnogosearch/etc/langmap/ directory. Look there to check the list of currently provided charset-language pairs. The guesser works well for texts longer than 500 characters. Shorter texts may not be guessed well.

Build your own language maps

To build your own language map, use the mguesser utility. You also need to collect a file with language samples in the desired charset. To create a new language map, use the following command:

  mguesser -p -c charset -l language < FILENAME > language.charset.lm

You can also use the mguesser utility to guess a document's language and charset with the existing language maps. To do this, use the following command:

  mguesser [-n maxhits] < FILENAME

Some languages may use several different charsets. To convert from one charset supported by mnoGoSearch to another, use the mconv utility:

  mconv [OPTIONS] -f charset_from -t charset_to [configfile] < infile > outfile

By default, both the mguesser and mconv utilities are installed into the /usr/local/mnogosearch/sbin/ directory.

LangMapUpdate

Since version 3.2.14, mnoGoSearch can update language and charset maps automatically while indexing, if the remote server supplies pages with an explicitly specified language and charset.
To enable this function, specify the command LangMapUpdate yes in your indexer.conf file.

Default charset <indexterm><primary>Command</primary><secondary>Charset</secondary></indexterm>

Use the Charset indexer.conf command to choose the default charset of indexed servers.

Default Language <indexterm><primary>Command</primary><secondary>DefaultLang</secondary></indexterm>

You can set the default language for servers with the DefaultLang indexer.conf variable. This is useful when restricting search by URL language.
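
The N-gram technique named in the "Automatic charset guesser" section can be sketched in a few lines. This is a rough illustration only, assuming the common rank-based "out-of-place" comparison of byte n-gram profiles. It does not reproduce mguesser's algorithm details or its language map file format. The sample texts, the constant MAX_PROFILE and all function names are made up for the example.

```python
from collections import Counter

MAX_PROFILE = 300  # assumed profile size, also used as the penalty for unseen n-grams


def ngram_profile(data, max_n=3):
    """Rank the byte n-grams (length 1..max_n) of `data`, most frequent first."""
    counts = Counter()
    for word in data.split():
        padded = b"_" + word + b"_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then by byte value for a deterministic order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:MAX_PROFILE]]


def out_of_place(doc_profile, map_profile):
    """Sum of rank differences; n-grams missing from the map cost MAX_PROFILE."""
    rank = {gram: i for i, gram in enumerate(map_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else MAX_PROFILE
        for i, gram in enumerate(doc_profile)
    )


def guess(data, maps):
    """Return the (language, charset) key whose profile is closest to `data`."""
    doc = ngram_profile(data)
    return min(maps, key=lambda key: out_of_place(doc, maps[key]))


# Tiny illustrative samples; real language maps need far more text.
RU = "поисковая система индексирует документы и сохраняет слова в базе данных"
EN = "the search engine indexes documents and stores the words in a database"

MAPS = {
    ("ru", "koi8-r"): ngram_profile(RU.encode("koi8-r")),
    ("ru", "cp1251"): ngram_profile(RU.encode("cp1251")),
    ("en", "us-ascii"): ngram_profile(EN.encode("ascii")),
}

if __name__ == "__main__":
    print(guess("документы в базе".encode("koi8-r"), MAPS))
    print(guess(b"the words", MAPS))
```

Because each profile is built from raw bytes, the same Russian sample yields different profiles for koi8-r and cp1251, which is why one map per language-charset pair is needed.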