Beyond Text Representation -- Building on Unicode to Implement a Multi-lingual Text Analysis Framework
Intended Audience: Software Engineer, Software Analyst
Session Level: Intermediate
Applications dealing with natural language documents in several
languages are faced with various text analysis tasks. All those
applications will have to solve the basic tasks of code set conversions
and text representation. For those tasks Unicode can provide a solid
foundation. But most applications will have to deal with more tasks.
Such tasks may range from simple tokenization or dictionary lookup to
more complex tasks such as part-of-speech disambiguation, summarization,
or even parsing.
We present the design and implementation of a flexible, TIPSTER-inspired
software library that facilitates these multi-lingual text analysis
tasks. It builds on Unicode for its text layer but also provides means
for representing linguistic entities beyond the text layer. The library
focuses on modularity, code exchange/reuse and
configurability. It reaches those goals by separating the application
from the implementation modules actually performing the analysis tasks.
Implementation modules for various text analysis tasks can be combined,
and modules for the same task can be exchanged without any change to the
application.
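
The separation described above can be illustrated with a minimal sketch.
It is not the library's actual API: the names Annotation, Document,
REGISTRY, register, analyze and tokens are hypothetical, chosen only to
show TIPSTER-style stand-off annotations (typed spans over an unchanged
text) and modules that are selected by configuration rather than by
application code.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Annotation:
    """A stand-off annotation: a typed span over the text (hypothetical)."""
    kind: str
    start: int  # offset of the first character of the span
    end: int    # offset one past the last character of the span


@dataclass
class Document:
    """Unicode text plus the annotations that modules have attached to it."""
    text: str
    annotations: List[Annotation] = field(default_factory=list)

    def covered(self, ann: Annotation) -> str:
        return self.text[ann.start:ann.end]


# A module reads a document and adds annotations to it.
Module = Callable[[Document], None]

# Hypothetical registry: task name -> implementation name -> module.
REGISTRY: Dict[str, Dict[str, Module]] = {}


def register(task: str, name: str):
    def decorator(fn: Module) -> Module:
        REGISTRY.setdefault(task, {})[name] = fn
        return fn
    return decorator


@register("tokenize", "whitespace")
def whitespace_tokenizer(doc: Document) -> None:
    # Every maximal run of non-whitespace characters becomes a token.
    for m in re.finditer(r"\S+", doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end()))


@register("tokenize", "word")
def word_tokenizer(doc: Document) -> None:
    # Every maximal run of word characters becomes a token.
    for m in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("token", m.start(), m.end()))


def analyze(text: str, config: List[Tuple[str, str]]) -> Document:
    """Run the configured modules in order; the caller never names a function."""
    doc = Document(text)
    for task, name in config:
        REGISTRY[task][name](doc)
    return doc


def tokens(text: str, tokenizer: str) -> List[str]:
    doc = analyze(text, [("tokenize", tokenizer)])
    return [doc.covered(a) for a in doc.annotations if a.kind == "token"]
```

In this sketch, switching from tokens(text, "whitespace") to
tokens(text, "word") changes only a configuration string; the
application code that consumes the annotations stays the same.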
We discuss whether and how analysis tasks are influenced by building on
Unicode as the text representation. The direct influence of Unicode on a
task may range from substantial (e.g. for tokenization) to
inconsequential (e.g. for summarization), depending on the task at hand.
But regardless of the direct influence of Unicode on an analysis task we
will show that none of them could be achieved without a solid text
representation to start with.
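
As an illustration of why tokenization depends so directly on the text
representation, the following sketch tokenizes using only Unicode
character properties. It is an assumption-laden simplification, not the
method of the library described here: the function name unicode_tokens
is hypothetical, and real word segmentation needs more than general
categories (for example, for scripts written without spaces).

```python
import unicodedata
from typing import List


def unicode_tokens(text: str) -> List[str]:
    """Split text into tokens using Unicode general categories only."""
    # Normalize first so that precomposed and decomposed forms tokenize alike.
    text = unicodedata.normalize("NFC", text)
    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        major = unicodedata.category(ch)[0]  # e.g. "L" for Lu, Ll, Lo, ...
        if major in ("L", "M", "N"):
            # Letters, combining marks and numbers extend the current token.
            current.append(ch)
            continue
        if current:
            tokens.append("".join(current))
            current = []
        if major in ("P", "S"):
            # Punctuation and symbols are emitted as single-character tokens.
            tokens.append(ch)
        # Separators and control characters only delimit tokens.
    if current:
        tokens.append("".join(current))
    return tokens
```

Because the decision is made per character property rather than per
code page, the same function handles Latin and Cyrillic input without
any language-specific table.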