| A Composite Approach to Language/Encoding Detection
| Intended Audience: | Manager, Software Engineer |  
| Session Level: | Intermediate, Advanced |  
As the adoption of Unicode spreads, the transition of web pages from various 
native encodings to Unicode will most probably occur under the hood without much 
fanfare or notice to general users. One important contribution browser 
developers can make toward further adoption of Unicode on web pages is to make it 
unnecessary for the user to open the encoding menu. As the menu's importance 
decreases, changes in the encoding used by documents will go 
unnoticed by the user. Auto-detection therefore plays an important role in this 
transition. In a new version of Netscape 6 under development, we will implement an advanced 
auto-detection algorithm that combines 3 different methods of charset detection. 
This paper presents these 3 detection methods, discusses the merits and 
demerits of each, and proposes a composite approach in which all 3 methods are 
used so that each plays to its strengths and compensates for the weaknesses of the 
others. Using code fragments, we will discuss how the Code Scheme method and the 
Character Distribution method together produce accurate detection for multi-byte 
encodings used in Asia. We also argue that the 2-Char Sequence method is 
effective for single-byte encodings. We have run a number of tests with this composite 
detection algorithm, and the results have been quite satisfactory. While readers 
may be familiar with the Code Scheme approach, the Character Distribution and 
the 2-Char Sequence methods represent new approaches and should be of interest 
to i18n developers. |