The Ideographic Composition Scheme and Its Applications in Chinese Text Processing
Qin Lu - The Hong Kong Polytechnic University
Intended Audience: Software Engineer, Systems Analyst, People Supporting East Asian Script
Session Level: Beginner, Intermediate
Ideographic characters are often formed from smaller functional units, which we
call character components. These components can be ideographic radicals, ideographic
characters proper, or pure components that cannot be used alone as ideographic
characters. Unicode 3.0 includes twelve Ideographic Description Characters (IDCs), which
were originally proposed for describing ideographic characters that are not yet encoded.
However, when IDCs are combined with other ideographic characters and components under a
formal method, which we call the ideographic composition scheme, they provide a linear way
to describe a two-dimensional ideographic character in terms of its components.
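As a rough illustration of the linear description idea, the sketch below parses a prefix-notation sequence built from the twelve IDCs (U+2FF0 to U+2FFB) into a nested tuple. It is not the algorithm presented in the paper; the function names `parse_ids` and `parse` are hypothetical, and the sketch assumes only that U+2FF2 and U+2FF3 take three operands while the other ten IDCs take two.

```python
# Minimal sketch of a recursive-descent parser for IDC-based descriptions.
# Assumption: prefix notation, with the twelve IDCs of Unicode 3.0 (U+2FF0..U+2FFB).
TERNARY = {"\u2ff2", "\u2ff3"}  # these two IDCs take three operands
BINARY = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TERNARY  # the other ten take two


def parse_ids(seq, pos=0):
    """Parse one description starting at seq[pos]; return (tree, next_pos)."""
    if pos >= len(seq):
        raise ValueError("unexpected end of sequence")
    ch = seq[pos]
    if ch in BINARY or ch in TERNARY:
        arity = 3 if ch in TERNARY else 2
        pos += 1
        operands = []
        for _ in range(arity):
            node, pos = parse_ids(seq, pos)  # each operand may itself be nested
            operands.append(node)
        return (ch, *operands), pos
    # Any other character is treated as a leaf component.
    return ch, pos + 1


def parse(seq):
    """Parse a whole sequence and reject leftover characters."""
    tree, end = parse_ids(seq)
    if end != len(seq):
        raise ValueError("trailing characters after description")
    return tree


print(parse("⿰木木"))        # a left-right description of 林
print(parse("⿱艹⿰日月"))    # a nested description of 萌
```

The nested tuple keeps the two-dimensional layout explicit: the first element of each tuple is the IDC, and the remaining elements are its operands in order.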
This paper first gives some background on how the IDCs were introduced and on the
relationship between ideographic characters and ideographic components. The formal
composition scheme is then presented, together with the algorithm for parsing an
ideographic composition sequence. The paper also presents an ongoing project in
which an ideographic character component feature database is being built for all Unicode
ideographic characters. Searching for ideographic characters by any of their components,
recursively, and vice versa is also presented.
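The two search directions can be sketched with a toy lookup table. Everything here is an assumption made for illustration: the table `DECOMP` holds four hand-written entries and is not the project's database, and the helper names `components` and `characters_containing` are hypothetical.

```python
# Toy decomposition table: character -> description sequence.
# Hypothetical data for illustration only, not the database described in the paper.
DECOMP = {
    "林": "⿰木木",
    "明": "⿰日月",
    "萌": "⿱艹明",
    "森": "⿱木林",
}


def is_idc(ch):
    """True for the twelve IDCs of Unicode 3.0 (U+2FF0..U+2FFB)."""
    return 0x2FF0 <= ord(ch) <= 0x2FFB


def components(ch, table=DECOMP):
    """Return every component of ch, expanding sub-components recursively.

    Assumes the table contains no cycles.
    """
    found = set()
    for part in table.get(ch, ""):
        if is_idc(part):
            continue  # layout operators are not components
        found.add(part)
        found |= components(part, table)
    return found


def characters_containing(component, table=DECOMP):
    """Reverse lookup: characters that contain the component at any depth."""
    return {ch for ch in table if component in components(ch, table)}


print(sorted(components("萌")))
print(sorted(characters_containing("日")))
```

A real database would likely precompute the reverse index rather than rescan the table on every query; the scan is kept here only to make the recursion visible.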
Theme: Language processing issues with Unicode data