Language Processing Issues with Unicode Data
Richard Youatt - American University of Armenia Corporation
Intended Audience: Manager, Software Engineer, Systems Analyst, Marketer, Academia/Education
Session Level: Intermediate
Unicode/ISO10646 and the associated programming languages that manipulate
the elements of those character sets have opened up a new realm of technical
possibilities. These have had primary application in the worlds of e-commerce,
software globalization, and the organizational and administrative needs of large
multinational organizations. At the same time, a door has been opened to the
world of Information Technology and the World Wide Web for the lesser-known
cultures and languages of the world.
Even among those concerned with minority rights and cultures, computer literacy
and access to the Information Highway have received more attention than the
purer linguistic issues: the benefits of technology-assisted research in
language processing and computer-assisted linguistics. This presentation
addresses some of those issues, drawing upon theoretical and practical work
with the Digital Library of Classical Armenian Literature at the American
University of Armenia, and looks at some of the generic issues of
language processing with Unicode data.
The primary conclusion is that linguistic and historical research has yet to
take full advantage of the "technology boost" that is now available, and that
this requires a multidisciplinary and international approach to project work.
Progress in technology does not necessarily enhance linguistic skills or
promote conceptual and intellectual progress in language studies. Linguists
engaged in technology-assisted research are as concerned with the semantics and
etymology of language as with the ability to manipulate the elements of the
ISO10646/Unicode repertoire, which does not in fact meet their full needs.