Issues and Solution in Pan-China Information Retrieval
Thomas Emerson - Basis Technology Corporation
Intended Audience: Manager, Software Engineer
Session Level: Advanced
The last ten years have seen significant effort put into the
research and development of information retrieval (IR) systems for
Chinese-speaking locales. Internet search engines, digital
libraries and full-text retrieval systems require effective and
accurate indexing and query-processing technology, and the features of
the Chinese language limit the applicability of many techniques and
algorithms used with Western languages. A common limitation of all
existing Chinese IR systems is their restriction to texts in a single
locale.
This paper describes the special issues in Chinese information
retrieval, including the trade-offs between n-gram and word-based
indexing models and the effect these decisions have on the
algorithms selected and the way results are presented to the user. This paper
also describes the issues in implementing an IR system that works
across Chinese locales, taking into account differences in character
sets and terminology used in different regions of China. To our
knowledge this is the first time such a system has been developed.
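
To make the n-gram versus word-based trade-off concrete, the following is a
minimal sketch, not the system described in this paper. It indexes the same
string first as overlapping character bigrams and then as words found by
greedy forward maximum matching. The function names, the three-entry lexicon
and the choice of maximum matching are all assumptions made for illustration.

```python
def char_bigrams(text):
    """Return overlapping character bigrams (one common n-gram index unit)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching against a lexicon.

    Falls back to a single character when no lexicon entry matches.
    """
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Toy lexicon (assumption): a real segmenter needs a far larger dictionary.
LEXICON = {"信息", "检索", "系统"}

text = "信息检索系统"  # "information retrieval system"
bigram_terms = char_bigrams(text)
word_terms = max_match(text, LEXICON)
```

The bigram index needs no dictionary but also stores terms that straddle word
boundaries, while the word-based index stores only lexicon words and so
depends on the quality of the segmenter.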
We show that extending a search engine across Chinese locales limits
the effective choices you have in indexing, in character representation
and in whether or not you perform query-term expansion: accurate
word-based indexing is essential, text should be represented in
Unicode, and term expansion is vital when searching documents authored
in a locale different from that of the searcher.
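
As an illustration of cross-locale query-term expansion, here is a minimal
sketch under stated assumptions. It is not the implementation described in
this paper. The variant table is hypothetical and hand-built, with two
well-known Mainland/Taiwan term pairs, and `expand_query` is an illustrative
name. All strings are ordinary Unicode strings, so terms from both locales
can sit in one table.

```python
# Hypothetical variant table (assumption): each set groups terms that name
# the same concept in different Chinese locales.
VARIANTS = [
    {"软件", "軟體"},  # "software": Mainland vs. Taiwan usage
    {"信息", "資訊"},  # "information": Mainland vs. Taiwan usage
]


def expand_query(terms):
    """Replace each query term with the set of its known locale variants.

    Terms that are not in the table are kept as single-element sets.
    """
    expanded = []
    for term in terms:
        group = next((g for g in VARIANTS if term in g), {term})
        expanded.append(set(group))
    return expanded


# A Mainland-style query for "software retrieval".
query = expand_query(["软件", "检索"])
```

Each set in the result can then be searched as a disjunction, so a document
that uses the Taiwan term still matches a query typed with the Mainland one.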