🔍 Click image to zoom

Tokens — how LLMs read text
Share

Frequently Asked Questions

What is Text Tokenization?

The process of splitting text into smaller units (tokens) that a language model processes as discrete inputs. Tokenization is the preprocessing step that converts raw text into a sequence of tokens — the fundamental units that language models process. A token is typically a word, subword, or character, depending on the tokenisation scheme.

How is Text Tokenization used in practice?

Modern LLMs use subword tokenisation algorithms such as Byte Pair Encoding (BPE) or SentencePiece, which split rare words into subword units while keeping common words as single tokens. This balances vocabulary size against the ability to represent any word.

Why is Text Tokenization important in AI?

Text Tokenization is a foundational concept in Core Concept. The process of splitting text into smaller units (tokens) that a language model processes as discrete inputs.

See Also