| Beyond Text Representation -- Building on Unicode to Implement a Multi-lingual Text Analysis Framework
| Intended Audience: | Software Engineer, Software Analyst |  
| Session Level: | Intermediate |  Applications dealing with natural language documents in several
  languages face a variety of text analysis tasks.  All of these
  applications must solve the basic tasks of code set conversion
  and text representation, for which Unicode provides a solid
  foundation.  But most applications have to handle further tasks as well.
  Such tasks range from simple tokenization or dictionary lookup to
  more complex tasks such as part-of-speech disambiguation, summarization, or
  even parsing.  We present the design and implementation of a flexible,
  TIPSTER-inspired software library to facilitate those multi-lingual text
  analysis tasks.  It builds on Unicode for its text layer but also
  provides means for the representation of linguistic entities beyond the
  text layer.  The library focuses on modularity, code exchange/reuse and
  configurability.  It reaches those goals by separating the application
  from the implementation modules actually performing the analysis tasks.
  Implementation modules for various text analysis tasks can be combined,
  and modules for the same task can be exchanged without any change to the
  application.  We discuss whether and how analysis tasks are influenced by building
  on Unicode as a text representation.  The direct influence of Unicode on a
  task may range from substantial (e.g., for tokenization) to
  inconsequential (e.g., for summarization), depending on the task at hand.
  Regardless of that direct influence, we will show that none of these
  tasks could be achieved without a solid text representation to start
  with.
 |
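
The separation between application and implementation modules described in the abstract can be illustrated with a minimal sketch. It is not the library's actual API: every name here (`Annotation`, `Document`, `Analyzer`, `WhitespaceTokenizer`, `UnicodeWordTokenizer`, `CapitalizedMarker`, `run_pipeline`) is hypothetical, and the stand-off annotation model is only assumed to resemble a TIPSTER-style design. The text layer is a Unicode string, linguistic entities are typed spans over it, and the application sees modules only through a shared interface.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol
import unicodedata


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a typed span over the text layer."""
    kind: str
    start: int  # code point offset, inclusive
    end: int    # code point offset, exclusive


@dataclass
class Document:
    """Unicode text layer plus the annotations layered on top of it."""
    text: str
    annotations: list[Annotation] = field(default_factory=list)

    def spans(self, kind: str) -> list[str]:
        """Return the text covered by each annotation of the given kind."""
        return [self.text[a.start:a.end]
                for a in self.annotations if a.kind == kind]


class Analyzer(Protocol):
    """Interface the application relies on; modules are interchangeable."""
    def process(self, doc: Document) -> None: ...


def _annotate_runs(doc: Document, is_token_char: Callable[[str], bool]) -> None:
    """Add a 'token' annotation for each maximal run of accepted characters."""
    start = None
    for i, ch in enumerate(doc.text):
        if is_token_char(ch):
            if start is None:
                start = i
        elif start is not None:
            doc.annotations.append(Annotation("token", start, i))
            start = None
    if start is not None:
        doc.annotations.append(Annotation("token", start, len(doc.text)))


class WhitespaceTokenizer:
    """Tokens are runs of non-whitespace characters."""
    def process(self, doc: Document) -> None:
        _annotate_runs(doc, lambda ch: not ch.isspace())


class UnicodeWordTokenizer:
    """Tokens are runs of characters in Unicode categories L*, M*, or N*."""
    def process(self, doc: Document) -> None:
        _annotate_runs(
            doc, lambda ch: unicodedata.category(ch)[0] in ("L", "M", "N"))


class CapitalizedMarker:
    """Downstream module: marks tokens whose first character is uppercase."""
    def process(self, doc: Document) -> None:
        for a in list(doc.annotations):
            if a.kind == "token" and doc.text[a.start].isupper():
                doc.annotations.append(
                    Annotation("capitalized", a.start, a.end))


def run_pipeline(text: str, analyzers: list[Analyzer]) -> Document:
    """Application side: combine modules without knowing their internals."""
    doc = Document(text)
    for analyzer in analyzers:
        analyzer.process(doc)
    return doc
```

In this sketch, calling `run_pipeline(text, [WhitespaceTokenizer(), CapitalizedMarker()])` and then replacing the first module with `UnicodeWordTokenizer()` changes the token boundaries while the application code and the downstream module stay the same. The tokenizer that consults Unicode character categories also hints at why tokenization is the kind of task on which Unicode has a substantial direct influence.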