A Composite Approach to Language/Encoding Detection
Intended Audience: Manager, Software Engineer
Session Level: Intermediate, Advanced
As the adoption of Unicode spreads, the transition of web pages from various
native encodings to Unicode will most likely occur under the hood, without much
fanfare or notice to general users. One important contribution browser
developers can make to the further adoption of Unicode on web pages is to make
it unnecessary for the user to resort to an encoding menu. As the menu's
importance decreases, changes in the encodings used by documents will go
unnoticed by users. Auto-detection thus plays an important role in this regard.
In a new version of Netscape 6 under development, we will implement an advanced
auto-detection algorithm that combines 3 different methods of charset detection.
This paper presents these 3 types of detection methods, discusses the merits and
demerits of each, and proposes a composite approach in which all 3 methods are
used in such a way that each maximizes its strengths and complements the other
detection methods.
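To make the composite idea concrete, here is a minimal sketch, written in
Python rather than the browser's implementation language, of how several
independent detection methods can be combined: each method acts as a prober
that reports a guess together with a confidence, and the most confident guess
wins. The names detect, Prober, and the 0.5 threshold are illustrative
assumptions, not part of the Netscape implementation.

    from typing import Callable, List, Optional, Tuple

    # A prober maps raw bytes to an (encoding name, confidence) pair,
    # with confidence in the range [0, 1].
    Prober = Callable[[bytes], Tuple[str, float]]

    def detect(data: bytes, probers: List[Prober],
               threshold: float = 0.5) -> Optional[str]:
        """Return the encoding guessed with the highest confidence,
        or None if no prober clears the threshold."""
        best_name: Optional[str] = None
        best_conf = 0.0
        for probe in probers:
            name, conf = probe(data)
            if conf > best_conf:
                best_name, best_conf = name, conf
        return best_name if best_conf >= threshold else None

In this scheme a Code Scheme prober, a Character Distribution prober, and a
2-Char Sequence prober would each fill the Prober role, which is what lets each
method cover the cases the others handle poorly.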
Using code fragments, we will discuss how the Code Scheme method and the
Character Distribution method together produce accurate detection for the
multi-byte encodings used in Asia. We also argue that for single-byte
encodings, the 2-Char Sequence method is effective. We have run a number of
tests with this composite detection algorithm, and the results have been quite
satisfactory. While many people may be familiar with the Code Scheme approach,
the Character Distribution and 2-Char Sequence methods represent new approaches
and should be of interest to i18n developers.
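As a rough illustration of the Character Distribution idea (much simplified
from the statistic the full paper develops): in each East Asian language a
small set of characters accounts for the bulk of running text, so the fraction
of decoded characters that fall within a language's frequent-character set can
serve as a confidence score. FREQUENT_CHARS and distribution_confidence below
are hypothetical names, and the table shown is a truncated sample.

    # Hypothetical per-encoding tables of the most frequent characters;
    # a real table would hold the top-N characters derived from a large
    # corpus, and would consider only the multi-byte characters.
    FREQUENT_CHARS = {
        "shift_jis": set("のにはをたがでてとし"),  # truncated sample
    }

    def distribution_confidence(data: bytes, encoding: str) -> float:
        """Fraction of decoded characters found in the frequent set;
        0.0 if the bytes do not even decode under the candidate encoding."""
        try:
            text = data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            return 0.0
        if not text:
            return 0.0
        frequent = FREQUENT_CHARS[encoding]
        hits = sum(1 for ch in text if ch in frequent)
        return hits / len(text)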
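The 2-Char Sequence method can be sketched in the same spirit: every language
has characteristic adjacent character pairs, so a candidate single-byte
encoding is scored by how many of the pairs it yields are common in the target
language. COMMON_PAIRS and sequence_confidence are again illustrative
assumptions, with a deliberately tiny sample table.

    # Hypothetical table of common character pairs per encoding/language;
    # a real table would be built by frequency analysis of a large corpus
    # (e.g. Russian text for each candidate Cyrillic encoding).
    COMMON_PAIRS = {
        "koi8_r": {("с", "т"), ("н", "о"), ("е", "н")},  # tiny sample
    }

    def sequence_confidence(data: bytes, encoding: str) -> float:
        """Fraction of adjacent character pairs that appear in the
        common-pair table under the candidate encoding."""
        try:
            text = data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            return 0.0
        pairs = list(zip(text, text[1:]))
        if not pairs:
            return 0.0
        common = COMMON_PAIRS[encoding]
        hits = sum(1 for pair in pairs if pair in common)
        return hits / len(pairs)

A function of this shape plugs directly into the Prober slot sketched earlier,
which is how the single-byte and multi-byte methods coexist in one detector.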