Eighteenth International Unicode Conference

Beyond Text Representation -- Building on Unicode to Implement a Multi-lingual Text Analysis Framework

Thomas Hampp-Bahnmueller - IBM Germany

Intended Audience:	Software Engineer, Software Analyst
Session Level:	Intermediate

Applications dealing with natural language documents in several languages are faced with various text analysis tasks. All those applications will have to solve the basic tasks of code set conversion and text representation. For those tasks Unicode can provide a solid foundation. But most applications will have to deal with more tasks. Such tasks may range from simple tokenization or dictionary lookup to more complex tasks like part-of-speech disambiguation, summarization or even parsing.

We want to present the design and implementation of a flexible, TIPSTER-inspired software library to facilitate those multi-lingual text analysis tasks. It builds on Unicode for its text layer but also provides means for the representation of linguistic entities beyond the text layer. The library focuses on modularity, code exchange/reuse and configurability. It reaches those goals by separating the application from the implementation modules actually performing the analysis tasks. Implementation modules for various text analysis tasks can be combined, and modules for the same task can be exchanged without any change to the application.
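
The separation between application and analysis modules can be illustrated with a minimal sketch. It is not the library's actual API, which this abstract does not describe: the names `Annotation`, `Document`, `analyze`, and the three example modules are hypothetical, and the span-based annotation layer is only assumed to resemble what a TIPSTER-style design would offer.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Annotation:
    """A linguistic entity stored as a typed span over the text layer."""
    kind: str
    start: int  # offset of the first covered character
    end: int    # offset one past the last covered character
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """Unicode text plus the annotations that modules attach to it."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered_text(self, annotation: Annotation) -> str:
        return self.text[annotation.start:annotation.end]


# Hypothetical module contract: a module reads and extends a Document.
Module = Callable[[Document], None]


def whitespace_tokenizer(doc: Document) -> None:
    """Tokenizer variant 1: every run of non-whitespace is a token."""
    for match in re.finditer(r"\S+", doc.text):
        doc.annotations.append(Annotation("token", match.start(), match.end()))


def word_tokenizer(doc: Document) -> None:
    """Tokenizer variant 2: every run of word characters is a token."""
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("token", match.start(), match.end()))


def capitalization_tagger(doc: Document) -> None:
    """A later module that builds on whatever tokens are present."""
    for annotation in doc.annotations:
        if annotation.kind == "token" and doc.covered_text(annotation)[:1].isupper():
            annotation.attributes["case"] = "capitalized"


def analyze(text: str, pipeline: List[Module]) -> Document:
    """Application-side entry point; it never names a concrete module."""
    doc = Document(text)
    for module in pipeline:
        module(doc)
    return doc


def token_texts(doc: Document) -> List[str]:
    return [doc.covered_text(a) for a in doc.annotations if a.kind == "token"]
```

In this sketch, swapping `whitespace_tokenizer` for `word_tokenizer` only changes the list passed to `analyze`; the application code and the downstream `capitalization_tagger` stay the same.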

We want to discuss if and how analysis tasks are influenced by building on Unicode as a text representation. The direct influence of Unicode on a task may range from substantial (e.g. for tokenization) to inconsequential (e.g. for summarization), depending on the task at hand. But regardless of the direct influence of Unicode on an analysis task, we will show that none of them could be achieved without a solid text representation to start with.
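
To make the tokenization case concrete, here is a small sketch of a tokenizer that relies only on Unicode character properties instead of per-code-set tables. It is an illustration under stated assumptions, not the method presented in the talk: the function names are hypothetical, and the rule "letters, marks and numbers form tokens" is a simplification that does not handle scripts written without word separators, such as Thai or Chinese.

```python
import unicodedata
from typing import List, Tuple


def is_word_char(ch: str) -> bool:
    """True for characters whose general category is Letter, Mark or Number."""
    return unicodedata.category(ch)[0] in ("L", "M", "N")


def tokenize(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, token) triples for maximal runs of word characters."""
    tokens = []
    start = None
    for index, ch in enumerate(text):
        if is_word_char(ch):
            if start is None:
                start = index  # a new token begins here
        elif start is not None:
            tokens.append((start, index, text[start:index]))
            start = None
    if start is not None:  # flush a token that runs to the end of the text
        tokens.append((start, len(text), text[start:]))
    return tokens
```

Because the classification comes from the Unicode character database, the same function handles Latin and Cyrillic input alike; for example, `tokenize("Straße 12, Москва")` yields the three tokens `Straße`, `12` and `Москва` with their offsets.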

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

11 December 2000, Webmaster