Internal representation of strings

Strings are represented as immutable sequences of UTF-16 code units.

Immutability means operations like .toUpperCase(), .concat(), and .replace() always return a new string - the original sequence of characters is never modified.
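
As a minimal sketch of this behavior (the variable names are illustrative and not part of the original text), each call below produces a new string and leaves the original untouched:

```typescript
// Illustrative value; any string behaves the same way.
const original: string = "hello";

// Each method returns a brand-new string.
const upper: string = original.toUpperCase();
const joined: string = original.concat(" world");
const replaced: string = original.replace("h", "j");

console.log(upper);    // "HELLO"
console.log(joined);   // "hello world"
console.log(replaced); // "jello"

// The original value is unchanged.
console.log(original); // "hello"
```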

A code unit is simply a number ranging from 0 to 65535. In UTF-16, one or two code units encode a code point, a number ranging from 0 to 1114111 (0x10FFFF) that identifies a particular character (or emoji).

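
The following sketch shows the difference between code units and code points for a single emoji. The variable name is illustrative; the numeric values follow from the UTF-16 encoding of U+1F4A9.

```typescript
// A single emoji that lies outside the Basic Multilingual Plane.
const pileOfPoo: string = "💩";

// length counts UTF-16 code units, not visible characters.
console.log(pileOfPoo.length); // 2

// The two code units (a surrogate pair) that encode the emoji.
console.log(pileOfPoo.charCodeAt(0)); // 55357 (0xD83D)
console.log(pileOfPoo.charCodeAt(1)); // 56489 (0xDCA9)

// The single code point that the pair represents.
console.log(pileOfPoo.codePointAt(0)); // 128169 (0x1F4A9)
```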

Even though 💩 looks like one character, its length is 2!

Code units and code points

A JavaScript string is a sequence of 16-bit unsigned integers called code units. Most common characters (e.g. ASCII and Latin-based characters) fit into a single code unit. Characters outside the Basic Multilingual Plane (BMP), such as most emoji or rare Chinese characters, require two code units, known as a surrogate pair. This lets UTF-16 represent every Unicode code point while keeping the most common characters at a fixed two bytes each.

Since UTF-16 can split some characters into multiple code units, operations like indexing and measuring length can give misleading results.

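
Here is a small sketch of how indexing, length, and slicing can mislead when a string contains a surrogate pair. The sample string is an assumption chosen for illustration.

```typescript
const sample: string = "a💩b";

// Three visible characters, but four code units.
console.log(sample.length); // 4

// Index 1 lands on the first half of the surrogate pair, not on the emoji.
console.log(sample.charCodeAt(1).toString(16)); // "d83d"

// Slicing by code unit can cut the pair in half, leaving a lone surrogate.
const broken: string = sample.slice(0, 2);
console.log(broken.length);    // 2
console.log(broken === "a💩"); // false
```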

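To work with characters as Unicode code points, use for...of or Array.from. Both rely on the string iterator, which steps through a string by code point rather than by code unit.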

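
A minimal sketch of code point iteration follows. It assumes a compile target of ES2015 or later, where for...of and Array.from use the string iterator; the variable names are illustrative.

```typescript
const emojiText: string = "a💩b";

// Array.from uses the string iterator, which yields whole code points.
const byCodePoint: string[] = Array.from(emojiText);
console.log(byCodePoint.length);      // 3
console.log(byCodePoint[1] === "💩"); // true

// for...of uses the same iterator, so the emoji arrives in one piece.
const seen: string[] = [];
for (const ch of emojiText) {
  seen.push(ch);
}
console.log(seen.length); // 3
```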

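You can also use String.prototype.codePointAt() to retrieve a single code point. Keep in mind that a code point is not always what a reader perceives as one character: a grapheme cluster is a sequence of one or more code points that is treated as a single unit.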

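
The sketch below illustrates that codePointAt() takes a code-unit index and returns one code point, and that one visible character can still consist of several code points. The sample values are assumptions chosen for illustration.

```typescript
const letters: string = "a💩";

// codePointAt returns the code point that starts at the given code-unit index.
console.log(letters.codePointAt(0)); // 97
console.log(letters.codePointAt(1)); // 128169

// Pointing into the middle of a surrogate pair yields only the low surrogate.
console.log(letters.codePointAt(2)); // 56489

// String.fromCodePoint goes the other way.
console.log(String.fromCodePoint(128169) === "💩"); // true

// "e" plus U+0301 (combining acute accent) displays as one character
// but consists of two code points: a grapheme cluster.
const accented: string = "e\u0301";
console.log(Array.from(accented).length); // 2
```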

For complex scripts like Devanagari, you may need to use a grapheme-splitting library like grapheme-splitter to reliably retrieve individual characters.
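
As a hedged sketch of grapheme-aware splitting: the example below does not use grapheme-splitter (its API is not shown here). Instead it assumes a runtime that provides the built-in Intl.Segmenter and falls back to code points otherwise. The helper name splitGraphemes and the local type declarations are hypothetical, introduced only for this illustration.

```typescript
// Minimal local typing so the sketch compiles even if the TypeScript lib
// in use does not declare Intl.Segmenter (an assumption about your setup).
interface GraphemeSegmenter {
  segment(input: string): Iterable<{ segment: string }>;
}
type GraphemeSegmenterCtor = new (
  locales?: string,
  options?: { granularity: "grapheme" }
) => GraphemeSegmenter;

// Hypothetical helper: splits a string into user-perceived characters.
function splitGraphemes(input: string): string[] {
  const SegmenterCtor = (Intl as unknown as { Segmenter?: GraphemeSegmenterCtor }).Segmenter;
  if (SegmenterCtor === undefined) {
    // Fallback: code points only, which may split grapheme clusters.
    return Array.from(input);
  }
  const segmenter = new SegmenterCtor(undefined, { granularity: "grapheme" });
  return Array.from(segmenter.segment(input), (part) => part.segment);
}

// "e" followed by a combining acute accent: two code points, one grapheme.
console.log(Array.from("e\u0301").length);     // 2
console.log(splitGraphemes("e\u0301").length); // 1 where Intl.Segmenter is available

// For Devanagari, the exact clusters depend on the runtime's Unicode data.
console.log(splitGraphemes("नमस्ते"));
```

Whichever approach you choose, the segments always join back into the original string, so splitting is safe to use for counting or truncating text.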

JavaScript engines

String immutability simplifies reasoning about string values and allows JavaScript engines (programs that run JavaScript programs) to aggressively optimize how strings are stored and used.

  • V8 is Google's open source high-performance engine, written in C++, which is used in Chrome and other Chromium-based browsers and in Node.js
  • SpiderMonkey is Mozilla's engine, and is written in C++, Rust, and JavaScript
  • There are many more JavaScript engines for specialized use cases

To reduce memory duplication and improve speed, most modern JavaScript engines intern short strings, so that identical strings share a single copy in memory. Modern engines also use ropes, where concatenation is deferred by storing references to the substrings and computing the actual content lazily. A rope may later be converted and reorganized internally (e.g. flattened into one contiguous string) when its contents are needed.
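
To make the rope idea concrete, here is a toy sketch. It is not how V8 or SpiderMonkey implement strings; the Rope type and the ropeLeaf, ropeConcat, ropeLength, and ropeFlatten helpers are hypothetical names invented for this illustration.

```typescript
// Toy rope: a tree whose leaves hold text and whose inner nodes defer concatenation.
type Rope =
  | { kind: "leaf"; text: string }
  | { kind: "concat"; left: Rope; right: Rope; length: number };

function ropeLeaf(text: string): Rope {
  return { kind: "leaf", text };
}

function ropeLength(rope: Rope): number {
  return rope.kind === "leaf" ? rope.text.length : rope.length;
}

// Concatenation only records references to both sides; no characters are copied.
function ropeConcat(left: Rope, right: Rope): Rope {
  return { kind: "concat", left, right, length: ropeLength(left) + ropeLength(right) };
}

// Flattening walks the tree left to right and builds one contiguous string.
function ropeFlatten(rope: Rope): string {
  const parts: string[] = [];
  const stack: Rope[] = [rope];
  while (stack.length > 0) {
    const node = stack.pop()!;
    if (node.kind === "leaf") {
      parts.push(node.text);
    } else {
      // Push right first so the left side is visited first.
      stack.push(node.right);
      stack.push(node.left);
    }
  }
  return parts.join("");
}

const deferred: Rope = ropeConcat(ropeConcat(ropeLeaf("foo"), ropeLeaf("bar")), ropeLeaf("baz"));
console.log(ropeLength(deferred));  // 9
console.log(ropeFlatten(deferred)); // "foobarbaz"
```

The design point is that building the rope is cheap no matter how long the pieces are, and the copying cost is paid once, at the moment the flat content is actually required.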

These optimizations are engine-specific and invisible to developers, but they're important when considering what changes might lead to tangible performance improvements.
