Optimizing the Usage of Normalization
Intended Audience: Managers, Software Engineers, Systems Analysts, Technical Writers
Session Level: Intermediate, Advanced
Many processes, such as the Unicode Collation Algorithm (UCA), and standards,
such as the W3C Character Model, require the use of normalization. Although
there are several efficient implementations of the normalization algorithms,
normalization is not free. This paper discusses how carefully preparing the
supporting data and using normalization procedures wisely can substantially
improve the performance of other processes, and illustrates proper usage with
examples from the collation service in the ICU library. In particular, it
discusses checking for pre-existing normalized text, incremental
normalization of text, concatenation of normalized text, and the use of the
FCD format.
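Two of these techniques, checking for pre-existing normalized text and concatenating normalized text, can be sketched briefly. The paper's examples use the ICU collation service; the sketch below instead uses Python's standard `unicodedata` module (its `is_normalized` function performs a quick check without building the normalized string), purely for illustration. The function name `nfc_concat` is this sketch's own, not an ICU API.

```python
import unicodedata

def nfc_concat(a: str, b: str) -> str:
    """Concatenate two NFC strings, keeping the result in NFC.

    Blind concatenation of two NFC strings is not always NFC: a
    combining mark at the start of b may compose with the end of a.
    Checking first avoids the cost of normalizing in the common case.
    """
    s = a + b
    # Quick check: most concatenations of NFC strings are already NFC.
    if unicodedata.is_normalized("NFC", s):
        return s
    # Only the boundary region can actually be wrong; renormalizing the
    # whole string is the simple (not maximally incremental) fallback
    # used in this sketch.
    return unicodedata.normalize("NFC", s)

# "e" + combining acute (U+0301): the pieces are each NFC, but the
# concatenation is not, so it must be recomposed to the single "\u00e9".
print(nfc_concat("caf", "e\u0301"))   # composes at the boundary
print(nfc_concat("abc", "def"))       # fast path, no normalization
```

The same check-before-normalizing pattern is what makes the ICU-style optimizations pay off: the quick check is cheap, and in typical data it succeeds almost every time.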
Text is in the FCD ("fast C or D") format when canonical decomposition alone,
without any canonical reordering, produces correct NFD text. Almost all text
in practice is already in FCD, and testing whether text is in FCD is very
fast. A properly optimized algorithm can check for FCD and skip normalization
when the text is already in that format. To support FCD, however, the data
used by the algorithm must be preprocessed to be what is called 'canonically
closed'.
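The FCD test itself reduces to a pass over combining classes: for each character, take the lead and trail canonical combining class (ccc) of its canonical decomposition, and reject the text if a trail class is ever greater than a nonzero following lead class, since that boundary would need reordering. The following is a minimal sketch using Python's `unicodedata`, assuming (as holds for the cases shown) that `normalize('NFD', ch)` on a single character yields its canonical decomposition in canonical order; a production implementation would use a precomputed lead/trail-ccc table, as ICU does.

```python
import unicodedata

def lead_trail_ccc(ch: str) -> tuple[int, int]:
    # Canonical combining classes of the first and last characters of
    # this character's full canonical decomposition.
    d = unicodedata.normalize("NFD", ch)
    return unicodedata.combining(d[0]), unicodedata.combining(d[-1])

def is_fcd(text: str) -> bool:
    """Return True if canonical decomposition of `text`, with no
    reordering, would already be correct NFD text."""
    prev_trail = 0
    for ch in text:
        lead, trail = lead_trail_ccc(ch)
        # A nonzero lead class smaller than the previous trail class
        # means the decomposed marks would have to be reordered.
        if lead != 0 and prev_trail > lead:
            return False
        prev_trail = trail
    return True

# "q" + dot below (ccc 220) + dot above (ccc 230): classes ascend, FCD.
# "q" + dot above + dot below: 230 followed by 220 needs reordering.
# U+1E0B (d with dot above) decomposes to d + U+0307 (trail ccc 230),
# so following it with U+0323 (lead ccc 220) also breaks FCD.
```

Because most text contains no combining marks at all, the loop usually sees only zero classes and succeeds immediately, which is why the check is cheap enough to run before every normalization-sensitive operation.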