Normalizing strings

The string normalize() method returns the Unicode Normalization Form of the string.

Unicode assigns a unique numerical value, called a code point, to each character. However, more than one code point (or sequence of code points) can represent the same character. For example, the character ñ can be represented in two ways: as the single code point U+00F1, or as the letter n (U+006E) followed by the combining tilde (U+0303).
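A minimal sketch of the two representations, written with Unicode escapes so the difference is visible in the source. The variable names are illustrative only.

```typescript
// Two ways to spell the same character, ñ.
const composedForm: string = "\u00F1";         // one code point: n with tilde
const decomposedForm: string = "\u006E\u0303"; // "n" followed by a combining tilde

console.log(composedForm);   // renders as ñ
console.log(decomposedForm); // also renders as ñ
```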


Because the code points differ, string comparison does not treat the two forms as equal. And because each form contains a different number of code points, the strings even have different lengths!
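The comparison can be sketched as follows. The names are illustrative, and the explicit `string` annotations are there so the compiler does not flag the comparison of two literal types.

```typescript
const name1: string = "\u00F1";       // composed ñ
const name2: string = "\u006E\u0303"; // decomposed ñ

console.log(name1 === name2); // false
console.log(name1.length);    // 1
console.log(name2.length);    // 2
```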


Composed and decomposed

The normalize() method helps solve this problem by converting a string into a normalized form that is common to all sequences of code points that represent the same characters. Pass "NFD" (Normalization Form Canonical Decomposition) or "NFC" (Normalization Form Canonical Composition) to choose between the two canonical forms. If you omit the argument, "NFC" is used.
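As a sketch, normalizing both spellings of ñ with "NFD" makes them compare equal. The variable names are illustrative.

```typescript
const rawComposed: string = "\u00F1";
const rawDecomposed: string = "\u006E\u0303";

// Both spellings are converted to the decomposed canonical form.
const nfdFromComposed = rawComposed.normalize("NFD");
const nfdFromDecomposed = rawDecomposed.normalize("NFD");

console.log(nfdFromComposed === nfdFromDecomposed); // true
console.log(nfdFromComposed.length);                // 2
console.log(nfdFromDecomposed.length);              // 2
```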


Note that the normalized form of "ñ" under "NFD" has length 2, because "NFD" gives the decomposed version of the canonical form, in which a precomposed code point is split into a base character followed by one or more combining marks. The decomposed canonical form for "ñ" is "\u006E\u0303".

You can specify "NFC" to get the composed canonical form, in which multiple code points are replaced with single code points where possible. The composed canonical form for "ñ" is "\u00F1".
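The composed form can be sketched the same way. Here both spellings collapse to the single code point U+00F1. The variable names are illustrative.

```typescript
const inputComposed: string = "\u00F1";
const inputDecomposed: string = "\u006E\u0303";

// Both spellings are converted to the composed canonical form.
const nfcFromComposed = inputComposed.normalize("NFC");
const nfcFromDecomposed = inputDecomposed.normalize("NFC");

console.log(nfcFromComposed === nfcFromDecomposed); // true
console.log(nfcFromComposed.length);                // 1
console.log(nfcFromDecomposed.length);              // 1
```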


Canonical equivalence and compatibility

In Unicode, two sequences of code points have canonical equivalence if they represent the same abstract characters. Sequences that are canonically equivalent should always have the same visual appearance and behavior (for example, they should always be sorted in the same way).

Two sequences of code points are compatible if they represent the same abstract characters, and should be treated alike in some - but not necessarily all - circumstances. All canonically equivalent sequences are also compatible, but not necessarily the other way around.

For example, the code point U+FB00 represents the ligature "ﬀ", and is compatible with two consecutive U+0066 code points ("ff"). Similarly, the code point U+24B9 represents the symbol "Ⓓ", and is compatible with the U+0044 code point ("D"). In some respects (such as sorting) they should be treated as equivalent, while in others (such as visual appearance) they should not, so they are not canonically equivalent.

You can use normalize() with the "NFKD" or "NFKC" arguments to produce a string that is the same for all compatible strings.
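A short sketch using the ligature and the circled letter from the examples above. The variable names are illustrative. Note that canonical normalization ("NFD") leaves the ligature unchanged, while compatibility normalization replaces it with two plain letters.

```typescript
const ligature: string = "\uFB00";         // the "ff" ligature as one code point
const twoLetters: string = "\u0066\u0066"; // two separate "f" letters

console.log(ligature === twoLetters);                  // false
console.log(ligature.normalize("NFD") === twoLetters); // false: not canonically equivalent

// Compatibility forms agree for both inputs.
console.log(ligature.normalize("NFKD") === twoLetters.normalize("NFKD")); // true
console.log(ligature.normalize("NFKC") === twoLetters.normalize("NFKC")); // true
console.log(ligature.normalize("NFKD").length);                           // 2

const circledD: string = "\u24B9";
console.log(circledD.normalize("NFKD")); // "D"
```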


When applying compatibility normalization, it is important to consider what you intend to do with the strings. In the example above, the normalization may be appropriate if building a search index, because it enables a user to find the string by searching for "f". However, it may not be appropriate for display purposes, as the visual representation may be greatly distorted from the original character.
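One way to act on this, sketched under assumptions: keep the original string for display and store a compatibility-normalized copy as the search key. The helper `toSearchKey` and the sample word are hypothetical and not part of any library.

```typescript
// Hypothetical helper: derives the text stored in a search index.
function toSearchKey(text: string): string {
  return text.normalize("NFKD");
}

const displayText: string = "sta\uFB00"; // "sta" followed by the ff ligature
const searchKey = toSearchKey(displayText);

console.log(displayText.includes("f")); // false: the ligature is a single, different code point
console.log(searchKey.includes("f"));   // true
console.log(searchKey);                 // "staff"
```

The display string is left untouched, so the ligature still renders as written while searches run against the normalized key.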
