A Unified Phonemic Code Based Scheme for Effective Processing
of Indian Languages
R.K. Joshi - National Centre for
Software Technology
Intended Audience: Software Engineers, Font Developers
Session Level: Intermediate, Advanced
Statement of Purpose: The primary purpose is to present a unified
technique, based on phonemic codes, that can be used to support
Indian languages in a variety of applications. The complexity of
processing Indian languages is presented first. This is followed by
a detailed exposition of the phonemic basis, its ancient historical
background, its applicability to Unicode and ISCII, and a unified
technique for Indian language processing tasks, with specific
examples from a shaping/rendering engine using OpenType fonts.
Finally, experience from varied applications is briefly discussed.
Brief Description: The multitude of Indian languages and dialects are written using
9 scripts. While these scripts have been allotted distinct code
pages in the Unicode scheme, applications supporting Indian
languages are yet to be found on a number of standard platforms.
One primary reason may be that rendering Indian languages, and
processing them in general, is complex and demands distinctly
different techniques. Orthography follows a phonetically driven
basis of compositing "phonetic units" to form complex glyphs. While
the character set is compact, authentic rendering implies a
generative mechanism that can produce glyphs corresponding to all
possible character sequences. Complex as it may seem, clear rules
can be defined based on a canonical treatise by Panini, the ancient
grammarian. These rules establish a perfect correspondence between
the phonemes constituting a syllable and its graphical form. Such
rules can be defined for each of the Indic scripts. Decomposing
text on this phonemic basis and then performing phoneme-based
computations provides a single unified technique for rendering
Indic scripts. In fact, it is well suited even for other processing
tasks such as sorting, searching, speech synthesis, speech
recognition, transliteration, etc.
Conclusions: The software complexity of supporting Indian languages in
different applications can be controlled by the use of a unified
technique based on phonemic codes obtained from a well-defined
transformation of Unicode or ISCII encoding. This is amply
illustrated by actual implementation experience.
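
To make the idea of a "well defined transformation" from Unicode to
phonemic codes concrete, the following is a minimal sketch for
Devanagari only. It is not the scheme presented in the session: the
function name to_phonemes and the phoneme labels (taken from Unicode
character names, such as "K" or "AA") are assumptions made for
illustration. It relies only on standard Devanagari encoding rules: a
consonant carries an inherent vowel "a", a virama suppresses that
vowel, and a dependent vowel sign replaces it.

```python
import unicodedata

VIRAMA = "\u094D"  # DEVANAGARI SIGN VIRAMA: suppresses the inherent vowel


def is_consonant(ch):
    # Devanagari consonants KA..HA occupy U+0915..U+0939.
    return "\u0915" <= ch <= "\u0939"


def is_vowel_sign(ch):
    # Dependent vowel signs (matras) AA..AU occupy U+093E..U+094C.
    return "\u093E" <= ch <= "\u094C"


def is_independent_vowel(ch):
    # Independent vowels A..AU occupy U+0905..U+0914.
    return "\u0905" <= ch <= "\u0914"


def _label(ch, marker):
    # Hypothetical labelling: reuse the tail of the Unicode character name,
    # e.g. "DEVANAGARI LETTER KA" -> "KA", "DEVANAGARI VOWEL SIGN AA" -> "AA".
    return unicodedata.name(ch).split(marker, 1)[1]


def to_phonemes(text):
    """Decompose Devanagari text into a flat list of phoneme labels.

    Simplification: nukta, anusvara, digits, and non-Devanagari characters
    are passed through unchanged rather than analysed.
    """
    phonemes = []
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if is_consonant(ch):
            # Consonant names end in the inherent vowel "A"; strip it.
            phonemes.append(_label(ch, "LETTER ")[:-1])
            nxt = text[i + 1] if i + 1 < n else ""
            if nxt == VIRAMA:
                i += 2  # bare consonant, no vowel follows
            elif nxt and is_vowel_sign(nxt):
                phonemes.append(_label(nxt, "VOWEL SIGN "))
                i += 2  # the matra replaces the inherent vowel
            else:
                phonemes.append("A")  # inherent vowel
                i += 1
        elif is_independent_vowel(ch):
            phonemes.append(_label(ch, "LETTER "))
            i += 1
        else:
            phonemes.append(ch)  # pass through anything not handled above
            i += 1
    return phonemes


if __name__ == "__main__":
    # U+0939 U+093F U+0928 U+094D U+0926 U+0940 ("Hindi" in Devanagari)
    print(to_phonemes("\u0939\u093F\u0928\u094D\u0926\u0940"))
```

A phoneme list of this kind could then serve as input to later stages
such as a sort key or a transliteration table, in the spirit of the
single unified technique described above. A real implementation would
need per-script rules and would have to handle the cases this sketch
passes through.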