Twenty-second International Unicode Conference

Developing a Multilingual Text Analysis Engine - Does Using Unicode Solve All the Issues?

Brian O'Donovan - IBM Software Group

Intended Audience:	Software Engineers, Systems Analysts, Content Developers
Session Level:	Beginner, Intermediate

The IBM Dictionary and Linguistic Tools Group produces linguistic analysis tools that support over 30 languages. The previous version of this product supported a wide variety of code pages for each language. This presentation will describe the capabilities and value of the linguistic tools, and then explain how the dictionaries were ported from a mixed code page architecture to one that uses Unicode for all dictionaries.

I will discuss some of the benefits we gain from using Unicode, e.g.:

Dictionary build and maintenance is possible in various locales (i.e. the build machine no longer has to be in the same locale as the dictionary).
It is possible to analyse many varied languages with much less language-specific code.
Our architecture for dealing with multilingual documents is much simpler.
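
As a rough illustration of why a mixed code page architecture is harder to work with, the Python sketch below decodes the same byte under three legacy code pages. It is an illustration only and not code from the IBM tools; the byte value, the choice of code pages, and the names RAW, decode_with, LEGACY and CODE_POINTS are all assumptions made for this example.

```python
# Illustrative sketch only; not taken from the IBM linguistic tools.
# The same byte means a different character in each legacy code page,
# so the code page must travel with every dictionary and document.
RAW = bytes([0xE9])  # hypothetical sample byte


def decode_with(code_page: str, data: bytes = RAW) -> str:
    """Decode the bytes using the named legacy code page."""
    return data.decode(code_page)


# Western European, Cyrillic and Greek code pages (chosen for illustration).
LEGACY = {cp: decode_with(cp) for cp in ("cp1252", "cp1251", "iso8859_7")}

# Once decoded, each character has its own unambiguous Unicode code point.
CODE_POINTS = {cp: f"U+{ord(ch):04X}" for cp, ch in LEGACY.items()}

for cp, ch in LEGACY.items():
    print(cp, ch, CODE_POINTS[cp])
```

After conversion to Unicode, the three characters no longer share a representation, so a single document can mix them without carrying code page information alongside the text.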

I will also describe some internationalization issues that remain even after the Unicode conversion, e.g.:

Ambiguous character representation
Loose and varied orthographic rules for different languages
Problems in defining a word boundary. This is a major issue for languages that are written without spaces, but it is also a problem in languages that do use spaces.
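
Two of the issues above can be sketched in a few lines of Python. This is a hedged illustration, not the approach taken by the IBM tools: it assumes that "ambiguous character representation" includes canonically equivalent sequences (precomposed versus combining forms), and the helper names same_entry and naive_words, along with the sample strings, are hypothetical.

```python
import unicodedata

# Two encodings of the same visible word (sample strings are assumptions).
precomposed = "caf\u00e9"   # "é" as the single code point U+00E9
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT


def same_entry(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def naive_words(text: str) -> list[str]:
    """Split on whitespace only; deliberately too simple for real analysis."""
    return text.split()


# A raw comparison treats the two spellings as different dictionary keys.
print(precomposed == decomposed)
# Normalizing first makes them compare equal.
print(same_entry(precomposed, decomposed))
# A language written without spaces comes back as one undivided token.
print(naive_words("\u6211\u7231\u5317\u4eac"))
# Even with spaces, punctuation and contractions blur the boundaries.
print(naive_words("ice cream, don't"))
```

Normalization handles only canonical equivalence. Deciding whether "ice cream" is one word or two, or where the words fall in unspaced text, still requires language-specific knowledge such as a dictionary.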


International Unicode Conferences are organized by Global Meeting Services, Inc. (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

23 May 2002, Webmaster