Developing a Multilingual Text Analysis Engine - Does Using Unicode Solve All the Issues?
Intended Audience: Software Engineers, Systems Analysts, Content Developers
Session Level: Beginner, Intermediate
The IBM Dictionary and Linguistic Tools Group produces linguistic analysis tools
which support over 30 different languages. The previous version of this product
supported a wide variety of code pages for each of the languages. This
presentation will describe the capabilities and value of the linguistic tools,
and then describe how the dictionaries were ported from a mixed code page
architecture to one that uses Unicode for all dictionaries.
I will discuss some of the benefits we gain from using Unicode, e.g.:
- Dictionary build and maintenance is possible in various locales (i.e. the
build machine no longer has to be in the same locale as the dictionary).
- It is possible to analyse many varied languages with much less language-specific code.
- Our architecture for dealing with multilingual documents is much simpler.
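To illustrate why a mixed code page architecture ties processing to a locale, here is a minimal Python sketch. It is not taken from the IBM tools; the helper name `to_unicode` and the choice of code pages are assumptions made for illustration. It shows that the same byte means a different letter under each legacy code page, while the decoded Unicode text is unambiguous.

```python
# The same raw byte, as it might appear in three legacy dictionary files.
RAW = b"\xe9"

def to_unicode(raw: bytes, code_page: str) -> str:
    """Hypothetical helper: decode legacy bytes once the code page is known."""
    return raw.decode(code_page)

# Without knowing the code page, the byte cannot be interpreted.
decoded = {cp: to_unicode(RAW, cp) for cp in ("cp1252", "cp1251", "iso8859_7")}

for code_page, char in decoded.items():
    # After decoding, each character has its own distinct code point.
    print(f"{code_page}: U+{ord(char):04X}")
```

Once every dictionary is stored in Unicode, this per-file code page bookkeeping is no longer needed at build or analysis time.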
I will also describe some internationalization issues that remain even after the Unicode
conversion, e.g.:
- Ambiguous character representation
- Loose and varied orthographic rules for different languages
- Problems in defining a word boundary. This is a major issue for languages that are
written without spaces, but it also arises in spaced languages.
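Two of the issues above can be sketched in a few lines of Python. This is an illustrative example under stated assumptions, not the approach used in the IBM tools: the helper `dictionary_key` is hypothetical, and NFC normalization and whitespace splitting are stand-ins chosen to make the problems visible.

```python
import unicodedata

# Ambiguous representation: the same visible word encoded two ways.
precomposed = "caf\u00e9"   # "é" as a single code point
decomposed = "cafe\u0301"   # "e" followed by a combining acute accent

def dictionary_key(word: str) -> str:
    """Hypothetical helper: normalize to NFC before a dictionary lookup."""
    return unicodedata.normalize("NFC", word)

same_raw = precomposed == decomposed
same_normalized = dictionary_key(precomposed) == dictionary_key(decomposed)

# Word boundaries: whitespace splitting finds no boundaries in unspaced text.
unspaced_tokens = "我喜欢读书".split()

# In spaced text it always splits, even where a linguist might see one unit.
spaced_tokens = "ice cream".split()

print(same_raw, same_normalized)
print(unspaced_tokens, spaced_tokens)
```

Normalization addresses only one kind of ambiguity, and whether "ice cream" is one word or two remains a linguistic decision that the encoding cannot settle.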