Creating custom break iterators for ICU (International Components for Unicode)
Edward Batutis - Batutis Internationalization Consulting
Intended Audience: Software Engineers
Session Level: Intermediate, Advanced
This paper will discuss creating custom break iterators for International Components
for Unicode (ICU), a popular internationalization toolkit. ICU for Java and ICU for C/C++
provide break iterators for character, word, and line breaking. These iterators
are useful for parsing text, for example, extracting words for a search engine or
implementing a word-wrap feature in a text editor. The break iterators supplied are
sufficient for many purposes, but some implementors may wish to use their own customized
iterators. This paper will first discuss the default break iterators supplied by ICU for
Java and C/C++ and how they are implemented. Next, the paper will cover how the existing
iterators can be extended or replaced to meet an application-specific requirement.
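As a rough illustration of the word-extraction use case mentioned above, the following is a minimal sketch in Java. It is written against the JDK's `java.text.BreakIterator` so that it runs without extra dependencies. The assumption here is that ICU4J's `com.ibm.icu.text.BreakIterator` can be substituted by changing the import, since it exposes the same factory and iteration methods used below. The class name `WordExtractor`, the helper `extractWords`, and the letter-or-digit filter are hypothetical choices made for this example, not part of ICU.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper class for illustration only.
class WordExtractor {

    // Returns the word-like segments of the text, skipping whitespace and punctuation.
    static List<String> extractWords(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);

        List<String> words = new ArrayList<>();
        int start = it.first();
        // Walk consecutive boundary pairs; each pair delimits one segment.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            // The word iterator also reports spaces and punctuation as segments,
            // so keep only segments that begin with a letter or digit.
            if (Character.isLetterOrDigit(segment.codePointAt(0))) {
                words.add(segment);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        // Prints: [Hello, world, ICU, breaks, text]
        System.out.println(extractWords("Hello, world! ICU breaks text.", Locale.ENGLISH));
    }
}
```

The filter on the first code point is a simplification. An application with its own definition of a "word" (for example, one that keeps hyphenated terms or part numbers together) is the kind of case where a customized iterator becomes attractive.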