Tokenization

Text Tokenization

The process of splitting text into smaller units (tokens) that a language model processes as discrete inputs.

Tokenization is the preprocessing step that converts raw text into a sequence of tokens, the fundamental units that language models process. A token is typically a word, a subword, or a single character, depending on the tokenization scheme.
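
To make the difference in granularity concrete, here is a minimal sketch that splits the same sentence into word-level and character-level tokens. The function names `word_tokens` and `char_tokens` and the regular expression are illustrative assumptions, not part of any particular tokenizer library.

```python
import re


def word_tokens(text: str) -> list[str]:
    # Word-level: runs of word characters, plus each punctuation mark on its own.
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text: str) -> list[str]:
    # Character-level: every character, including spaces, is its own token.
    return list(text)


text = "Tokenizers split text."
print(word_tokens(text))        # ['Tokenizers', 'split', 'text', '.']
print(len(char_tokens(text)))   # 22
```

Word-level tokens give short sequences but need a very large vocabulary; character-level tokens need a tiny vocabulary but give long sequences. Subword schemes sit between the two.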

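Modern LLMs use subword tokenization algorithms such as Byte Pair Encoding (BPE) or the unigram language model, commonly through toolkits such as SentencePiece. These algorithms split rare words into smaller subword units while keeping common words as single tokens. This keeps the vocabulary at a fixed, manageable size while still allowing any word to be represented as a sequence of known units.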
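
The merging idea behind BPE can be sketched in a few lines. This is a toy version under stated assumptions: it works on characters rather than bytes, takes a hypothetical word-frequency dictionary as its corpus, and ignores details such as word-boundary markers that production tokenizers handle. The names `learn_bpe`, `merge_pair`, and `encode` are made up for this illustration.

```python
from collections import Counter


def merge_pair(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    # Replace every adjacent occurrence of `pair` with one merged symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(word_freqs: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair everywhere it occurs.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    # Apply the learned merges in the order they were learned.
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word -> frequency.
merges = learn_bpe({"hug": 10, "pug": 5, "hugs": 5}, num_merges=2)
print(merges)                  # [('u', 'g'), ('h', 'ug')]
print(encode("hugs", merges))  # ['hug', 's']
print(encode("bug", merges))   # ['b', 'ug']
```

The frequent word "hug" ends up as a single token, while the unseen word "bug" is still representable as the pieces "b" and "ug", which is the trade-off described above.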

Tokens — how LLMs read text

Frequently Asked Questions

What is Text Tokenization?

Text tokenization is the process of splitting text into smaller units (tokens) that a language model processes as discrete inputs. It is the preprocessing step that converts raw text into a sequence of tokens, where each token is typically a word, a subword, or a single character, depending on the tokenization scheme.

How is Text Tokenization used in practice?

Why is Text Tokenization important in AI?

Text tokenization is a foundational concept in AI. Language models process tokens as discrete inputs, so all text must be split into tokens before a model can read it.

See Also