Twenty-first International Unicode Conference

Folded Trie: Efficient Data Structure for all of Unicode

Vladimir Weinstein - IBM Corporation

Intended Audience:	Managers, Software Engineers, Systems Analysts, Technical Writers
Session Level:	Intermediate, Advanced

Dealing with Unicode characters, especially when supplementary code points are involved, requires storing and processing large amounts of data. To provide good performance and memory usage, ICU has evolved several specialized data structures for mapping a Unicode character to data associated with that character. Such structures include hash tables, inversion lists, and trie structures.

The introduction of supplementary code points into the Unicode standard forces implementations to deal with a larger number of assigned characters. While a dual-index trie structure can handle supplementaries, it degrades the performance for BMP characters to some extent. In response to this, ICU developed a new data structure called a folded-trie, or UTrie. This structure is used throughout the ICU 2.1 library, both in the C/C++ and Java versions.

The design of UTrie rests on two important properties of the code point distribution in the Unicode standard. First, the BMP is densely populated, while the supplemenary space is sparsely populated. Second, in the vast majority of the processes that involve Unicode data, BMP code points are accessed much more frequently supplementary code points. These two properties allow for the construction of the UTrie, which features a fast, single-index lookup for BMP code points combined with the exceptional processing for supplementary code points.

This paper examines UTrie data structure and its usage in detail. Normalization and character properties service are closely examined as examples of how UTrie is used in practice, and what the performance implications are.

When the world wants to talk, it speaks Unicode

International Unicode Conferences are organized by Global Meeting Services, Inc., (GMS). GMS is pleased to be able to offer the International Unicode Conferences under an exclusive license granted by the Unicode Consortium. All responsibility for conference finances and operations is borne by GMS. The independent conference board serves solely at the pleasure of GMS and is composed of volunteers active in Unicode and in international software development. All inquiries regarding International Unicode Conferences should be addressed to info@global-conference.com.

Unicode and the Unicode logo are registered trademarks of Unicode, Inc. Used with permission.

21 February 2002, Webmaster