Contribution and Limits of the Use of Unicode in Approximate Pattern-Matching
Intended Audience: Software Engineers, Systems Analysts, Technical Writers, Translators, NLP People
Session Level: Beginner, Intermediate
In example-based machine translation, knowing that "he is young" is translated
as "wakai" in Japanese typically allows us to translate "he is not young"
(only "not" is added) by relying on the previous sentence to construct
"wakakunai" (only "i" is replaced by "kunai"). Hence the need to retrieve
similar sentences from collections of already translated sentences, for which
approximate pattern-matching is used.
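As a minimal sketch of this retrieval step (assuming Levenshtein edit distance as the similarity measure; the actual matching algorithm may differ), similar sentences can be ranked by their distance to the query:

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def most_similar(query, corpus):
    # Retrieve the already-translated sentence closest to the query.
    return min(corpus, key=lambda s: edit_distance(query, s))

corpus = ["he is young", "she was here", "it is cold"]
print(most_similar("he is not young", corpus))  # → he is young
```

The retrieved example then supplies the translation to adapt, as in the "wakai" → "wakakunai" case above.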
Now, machine translation implies different languages, and different languages
may use different character sets, usually encoded in different ways. This
incidentally implies that the set of punctuation marks differs from language
to language. Until now, we had two different implementations of our
approximate pattern-matching algorithm for the two languages of our concern:
English (ASCII) and Japanese (EUC). For Japanese, this had the disadvantage
that the texts we searched had to be encoded consistently in EUC only.
However, when writing in Japanese, people tend to use several character sets,
and several punctuation sets, simultaneously.
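To illustrate the mixed-character-set issue (a sketch, not the authors' implementation): the same Japanese text may mix half-width and full-width forms of the same characters, and Unicode compatibility normalization (NFKC) folds such variants together so a single matcher can treat them as equal:

```python
import unicodedata

# Two renderings of the same text: half-width katakana with full-width
# "Ａ１" versus full-width katakana with ASCII "A1".
variants = ["ﾃｽﾄＡ１", "テストA1"]

# NFKC compatibility normalization maps half-width katakana to full-width
# and full-width Latin/digits to ASCII, unifying the two renderings.
normalized = [unicodedata.normalize("NFKC", s) for s in variants]
print(normalized[0] == normalized[1])  # → True
```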
Shifting to Unicode brought two advantages. First, the problem of mixing
different character sets within one language is eliminated. Second, the
notion of punctuation becomes insensitive to language. Punctuation stands
somewhere between the logical and the physical structure of texts. Still,
while punctuation helps approximate pattern-matching determine word
boundaries in a language like English, this is not the case for a language
like Japanese. The answer to this lies outside the scope of character sets:
it is a problem for natural language processing.
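The limitation can be seen in a small sketch (an illustration under the assumption that splitting is done purely on Unicode punctuation and separator categories): the same language-independent rule segments English into words but leaves a Japanese clause whole, because Japanese writes no separators between words:

```python
import unicodedata

def split_words(text):
    # Split on characters whose Unicode general category is
    # punctuation (P*) or separator (Z*); no language knowledge used.
    words, current = [], []
    for ch in text:
        if unicodedata.category(ch)[0] in "PZ":
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words

print(split_words("He is not young."))  # → ['He', 'is', 'not', 'young']
# Japanese lacks inter-word separators, so the whole clause survives
# as a single "word"; real segmentation needs NLP, not character data.
print(split_words("彼は若くない。"))     # → ['彼は若くない']
```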