A Composite Approach to Language/Encoding Detection
Intended Audience: Manager, Software Engineer
Session Level: Intermediate, Advanced
As the adoption of Unicode spreads, the transition of web pages from various
native encodings to Unicode will most likely occur under the hood, without much
fanfare or notice to general users. One important contribution browser
developers can make to the further adoption of Unicode on web pages is to make
it unnecessary for the user to resort to an encoding menu. As the menu's
importance decreases, changes in the encodings used by documents will go
unnoticed by users. Auto-detection thus plays an important role in this regard.
In a new version of Netscape 6 under development, we will implement an advanced
auto-detection algorithm that combines 3 different methods of charset detection.
This paper presents these 3 types of detection methods, discusses the merits and
demerits of each, and proposes a composite approach in which all 3 methods are
used in such a way that each maximizes its strengths and complements the other
detection methods.
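To make the composite idea concrete, here is a minimal sketch, written in
Python rather than the browser's implementation language, of how several
independent detection methods can be combined: each method acts as a prober
that reports a guess together with a confidence, and the most confident guess
wins. The names detect, Prober, and the 0.5 threshold are illustrative
assumptions, not part of the Netscape implementation.

    from typing import Callable, List, Optional, Tuple

    # A prober maps raw bytes to an (encoding name, confidence) pair,
    # with confidence in the range [0, 1].
    Prober = Callable[[bytes], Tuple[str, float]]

    def detect(data: bytes, probers: List[Prober],
               threshold: float = 0.5) -> Optional[str]:
        """Return the encoding guessed with the highest confidence,
        or None if no prober clears the threshold."""
        best_name: Optional[str] = None
        best_conf = 0.0
        for probe in probers:
            name, conf = probe(data)
            if conf > best_conf:
                best_name, best_conf = name, conf
        return best_name if best_conf >= threshold else None

In this scheme a Code Scheme prober, a Character Distribution prober, and a
2-Char Sequence prober would each fill the Prober role, which is what lets each
method cover the cases the others handle poorly.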
Using code fragments, we will discuss how the Code Scheme method and the
Character Distribution method together produce accurate detection for the
multi-byte encodings used in Asia. We also argue that for single-byte
encodings, the 2-Char Sequence method is effective. We have run a number of
tests with this composite detection algorithm, and the results have been quite
satisfactory. While many people may be familiar with the Code Scheme approach,
the Character Distribution and 2-Char Sequence methods represent new approaches
and should be of interest to i18n developers.
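As a rough illustration of the Character Distribution idea (much simplified
from the statistic the full paper develops): in each East Asian language a
small set of characters accounts for the bulk of running text, so the fraction
of decoded characters that fall within a language's frequent-character set can
serve as a confidence score. FREQUENT_CHARS and distribution_confidence below
are hypothetical names, and the table shown is a truncated sample.

    # Hypothetical per-encoding tables of the most frequent characters;
    # a real table would hold the top-N characters derived from a large
    # corpus, and would consider only the multi-byte characters.
    FREQUENT_CHARS = {
        "shift_jis": set("のにはをたがでてとし"),  # truncated sample
    }

    def distribution_confidence(data: bytes, encoding: str) -> float:
        """Fraction of decoded characters found in the frequent set;
        0.0 if the bytes do not even decode under the candidate encoding."""
        try:
            text = data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            return 0.0
        if not text:
            return 0.0
        frequent = FREQUENT_CHARS[encoding]
        hits = sum(1 for ch in text if ch in frequent)
        return hits / len(text)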
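The 2-Char Sequence method can be sketched in the same spirit: every language
has characteristic adjacent character pairs, so a candidate single-byte
encoding is scored by how many of the pairs it yields are common in the target
language. COMMON_PAIRS and sequence_confidence are again illustrative
assumptions, with a deliberately tiny sample table.

    # Hypothetical table of common character pairs per encoding/language;
    # a real table would be built by frequency analysis of a large corpus
    # (e.g. Russian text for each candidate Cyrillic encoding).
    COMMON_PAIRS = {
        "koi8_r": {("с", "т"), ("н", "о"), ("е", "н")},  # tiny sample
    }

    def sequence_confidence(data: bytes, encoding: str) -> float:
        """Fraction of adjacent character pairs that appear in the
        common-pair table under the candidate encoding."""
        try:
            text = data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            return 0.0
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return 0.0
        common = COMMON_PAIRS[encoding]
        hits = sum(1 for pair in pairs if pair in common)
        return hits / len(pairs)

A function of this shape plugs directly into the Prober slot sketched earlier,
which is how the single-byte and multi-byte methods coexist in one detector.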