Contribution and Limits of the Use of Unicode in Approximate Pattern-Matching
Intended Audience: Software Engineers, Systems Analysts, Technical Writers, Translators, NLP People
Session Level: Beginner, Intermediate
In example-based machine translation, knowing that "he is young" is translated
as "wakai" in Japanese typically allows us to translate "he is not young"
(only "not" is added) by relying on the previous sentence to construct
"wakakunai" (only "i" is replaced by "kunai"). Hence the need to retrieve
similar sentences from collections of already translated sentences, for which
approximate pattern-matching is used.
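As a minimal sketch of this retrieval step (assuming Levenshtein edit distance as the similarity measure; the actual matching algorithm may differ), similar sentences can be ranked by their distance to the query:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def most_similar(query, corpus):
    # Retrieve the already-translated sentence closest to the query.
    return min(corpus, key=lambda s: edit_distance(query, s))

corpus = ["he is young", "she was here", "it is cold"]
print(most_similar("he is not young", corpus))  # → he is young
```

The retrieved example then supplies the translation to adapt, as in the "wakai" → "wakakunai" case above.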
Now, machine translation implies different languages, and different languages
may use different character sets, usually encoded in different ways. This
incidentally implies that the set of punctuation marks differs from language
to language. Until now, we had two different implementations of our
approximate pattern-matching algorithm for the two languages of our concern:
English (ASCII) and Japanese (EUC). For Japanese, this had the disadvantage
that the texts we searched had to be encoded consistently in EUC only.
However, when writing in Japanese, people tend to use several character sets,
and several punctuation sets, simultaneously.
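To illustrate the mixed-character-set issue (a sketch, not the authors' implementation): the same Japanese text may mix half-width and full-width forms of the same characters, and Unicode compatibility normalization (NFKC) folds such variants together so a single matcher can treat them as equal:

```python
import unicodedata

# Two renderings of the same text: half-width katakana with full-width
# "Ａ１" versus full-width katakana with ASCII "A1".
variants = ["ﾃｽﾄＡ１", "テストA1"]

# NFKC compatibility normalization maps half-width katakana to full-width
# and full-width Latin/digits to ASCII, unifying the two renderings.
normalized = [unicodedata.normalize("NFKC", s) for s in variants]
print(normalized[0] == normalized[1])  # → True
```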
Shifting to Unicode brought two advantages. First, the problem of mixing
different character sets within one language is eliminated. Second, the
notion of punctuation becomes insensitive to language. Punctuation stands
somewhere between the logical and the physical structure of texts. Still,
while punctuation helps approximate pattern-matching determine word
boundaries in a language like English, this is not the case for a language
like Japanese. The answer to this lies outside the scope of character sets:
it is a problem for natural language processing.
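The limitation can be seen in a small sketch (an illustration under the assumption that splitting is done purely on Unicode punctuation and separator categories): the same language-independent rule segments English into words but leaves a Japanese clause whole, because Japanese writes no separators between words:

```python
import unicodedata

def split_words(text):
    # Split on characters whose Unicode general category is
    # punctuation (P*) or separator (Z*); no language knowledge used.
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "PZ":
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

print(split_words("He is not young."))  # → ['He', 'is', 'not', 'young']
# Japanese lacks inter-word separators, so the whole clause survives
# as a single "word"; real segmentation needs NLP, not character data.
print(split_words("彼は若くない。"))     # → ['彼は若くない']
```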