Eighteenth International Unicode Conference

Issues and Solution in Pan-China Information Retrieval

Thomas Emerson - Basis Technology Corporation

Intended Audience:	Manager, Software Engineer
Session Level:	Advanced

The last ten years have seen a significant effort put into the research and development of information retrieval (IR) systems for Chinese-speaking locales. Internet search engines, digital libraries and full-text retrieval systems require effective and accurate indexing and query-processing technology, and the features of the Chinese language limit the applicability of many techniques and algorithms used with Western languages. A common limitation of all existing Chinese IR systems is their restriction to texts in a single locale.

This paper describes the special issues in Chinese information retrieval, including the trade-offs of indexing using n-gram versus word-based models, the effect these decisions have on the algorithms selected and the way results are presented to the user. This paper also describes the issues in implementing an IR system that works across Chinese locales, taking into account differences in character sets and terminology used in different regions of China. To our knowledge this is the first time such a system has been developed.
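The n-gram versus word-based trade-off can be illustrated with a minimal sketch. It is not the system described in this paper: the lexicon is a toy, and the function names and the greedy forward maximum-matching segmenter are assumptions chosen only for illustration.

```python
def char_bigrams(text):
    """Index terms as overlapping character bigrams; no dictionary is needed."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching against a word list (hypothetical segmenter)."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Toy lexicon, assumed for illustration only.
LEXICON = {"中国", "人民", "银行"}
TEXT = "中国人民银行"

bigram_terms = char_bigrams(TEXT)     # includes cross-boundary terms such as "国人"
word_terms = max_match(TEXT, LEXICON)  # only terms found in the lexicon
```

In this sketch the bigram index contains terms that straddle word boundaries, which can produce spurious matches, while the word-based index depends entirely on the coverage of its lexicon.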

We show that extending a search engine across Chinese locales limits the effective choices you have in indexing, in character representation, and in whether or not you perform query-term expansion: accurate word-based indexing is essential, text must be represented in Unicode, and term expansion is vital when searching documents authored in a locale different from that of the searcher.
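As a rough sketch of what query-term expansion across locales might look like, the following assumes a small hand-built table of regional term variants. The table, its two entries, and the function name are hypothetical; a real system would need a far larger, curated resource.

```python
# Hypothetical variant groups: each set holds terms for the same concept
# as commonly written in different Chinese locales (all stored as Unicode).
VARIANTS = [
    {"软件", "軟體"},    # "software": common Mainland vs. Taiwan usage
    {"计算机", "電腦"},  # "computer": common Mainland vs. Taiwan usage
]


def expand_query(terms):
    """Replace each query term with the set of its known locale variants."""
    expanded = []
    for term in terms:
        # Fall back to the term alone when no variant group contains it.
        group = next((g for g in VARIANTS if term in g), {term})
        expanded.append(set(group))
    return expanded


# A query typed by a Mainland user also matches Taiwan-authored documents.
query = expand_query(["软件", "下载"])
```

Because the variants are whole words rather than single characters, this kind of lookup only works on word-segmented queries, which is one reason accurate word-based indexing matters here.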

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

13 December 2000, Webmaster