Transformer

Transformer Architecture

The Transformer is a neural network architecture introduced in "Attention Is All You Need" (Google, 2017) that replaced the recurrence of recurrent neural networks (RNNs) with a parallel self-attention mechanism. Self-attention lets a model directly capture relationships between any two tokens regardless of distance, and because all positions can be processed in parallel during training, it made large-scale GPU training far more practical than it was with RNNs.
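As a rough illustration of the self-attention mechanism described above, the following sketch computes scaled dot-product attention, softmax(QKᵀ / √d_k)·V, in plain Python. It is a minimal single-head version under simplifying assumptions: the function names and the toy token vectors are illustrative, and the same vectors are used as queries, keys, and values, whereas a real model derives them from learned projections.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention for a single head (illustrative sketch)."""
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # One score per key: dot product of the query with that key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, values))
                        for c in range(len(values[0]))])
    return outputs, weights

# Toy token vectors (assumed for illustration only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to 1 and holds one entry per token, which is why every token has a direct path to every other token no matter how far apart they are.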

The original Transformer processes input through stacked encoder and decoder blocks, and later variants often keep only one of the two. Encoder-only variants such as BERT generate contextual embeddings for natural language understanding (NLU), while decoder-only variants power large language models (LLMs) such as GPT and generate text autoregressively, predicting each token from all previous ones.
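Autoregressive generation can be sketched as a simple loop: score the possible next tokens given everything generated so far, append the best one, and repeat. The example below uses greedy selection and a hypothetical `toy_scores` function standing in for a trained model. Note that this stand-in looks only at the last token to keep the example short, while a real decoder-only Transformer conditions on the whole prefix.

```python
def generate(next_token_scores, prompt, max_new_tokens, eos="<eos>"):
    """Greedy autoregressive decoding: append the highest-scoring token each step."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = next_token_scores(tokens)   # the model sees the full prefix
        best = max(scores, key=scores.get)
        tokens.append(best)
        if best == eos:                      # stop once the end-of-sequence token appears
            break
    return tokens

def toy_scores(tokens):
    """Hypothetical stand-in for a model; uses only the last token for brevity."""
    table = {
        "the": {"cat": 0.9, "sat": 0.1},
        "cat": {"sat": 0.8, "<eos>": 0.2},
        "sat": {"<eos>": 1.0},
    }
    return table[tokens[-1]]

result = generate(toy_scores, ["the"], max_new_tokens=5)
print(result)  # ['the', 'cat', 'sat', '<eos>']
```

Production systems usually sample from the score distribution rather than always taking the maximum, but the loop structure is the same.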

The Transformer is the foundation of most modern AI systems, including LLMs, multimodal models, and many diffusion models. Advances such as Flash Attention, KV cache, and mixture-of-experts (MoE) routing reduce the memory, compute, and latency needed for training and inference, which has helped context windows grow beyond one million tokens.
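To make the KV cache idea concrete, here is a minimal single-head sketch. During generation, the keys and values of earlier tokens do not change, so they can be stored and reused instead of being recomputed at every step. The `KVCache` class and the toy vectors are assumptions made for illustration; they are not taken from any particular library.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Attention output for one query over the given keys and values."""
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

class KVCache:
    """Stores keys and values of already-processed tokens (toy, single head)."""
    def __init__(self):
        self.keys, self.values = [], []

    def step(self, query, key, value):
        # Only the new token's key and value are added; earlier ones are reused.
        self.keys.append(key)
        self.values.append(value)
        return attend(query, self.keys, self.values)

# Toy vectors stand in for the learned Q/K/V projections of each new token.
steps = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [cache.step(x, x, x) for x in steps]

# Reference: rebuild the key/value lists for the whole prefix at every step.
recomputed = [attend(steps[t], steps[: t + 1], steps[: t + 1]) for t in range(len(steps))]
```

Both approaches give the same outputs. The cache trades memory, which grows with the number of tokens, for avoiding repeated key and value computation on the prefix.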

Transformer architecture explained

Frequently Asked Questions

Why did Transformers replace RNNs?

Transformers replaced RNNs primarily because of three advantages: parallelisability (RNNs process tokens sequentially, preventing efficient GPU utilisation; Transformers process all tokens of a training sequence simultaneously), better long-range dependency modelling (RNNs struggle with distant context due to vanishing gradients; self-attention has direct paths between any two tokens), and scalability (Transformer performance scales predictably with data and compute, enabling billion-parameter models).

What is the context window of a Transformer?

The context window is the maximum number of tokens a Transformer model can process at once, counting both the input it reads and the output it generates. Early models like GPT-2 had a context window of 1,024 tokens; later models range from roughly 8,000 tokens (the base GPT-4 model, at 8,192) to 1 million tokens or more (Gemini 1.5 Pro). A larger context window allows the model to reason over longer documents, but the cost of standard self-attention grows quadratically with sequence length because it computes relationships between every pair of tokens.
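The quadratic growth can be checked with simple arithmetic. The sketch below estimates the size of the attention score matrix for one layer, assuming the full n × n matrix is materialised and each score takes 2 bytes (16-bit floats). The function name and its defaults are illustrative assumptions, and implementations that avoid materialising the matrix have different memory behaviour.

```python
def attention_score_bytes(n_tokens, n_heads=1, bytes_per_value=2):
    """Bytes needed to hold one layer's n x n attention scores (naive estimate)."""
    return n_tokens * n_tokens * n_heads * bytes_per_value

small = attention_score_bytes(1024)   # 1,024-token context
large = attention_score_bytes(8192)   # 8,192-token context
ratio = large // small

print(small, large, ratio)  # 2097152 134217728 64
```

An 8 times longer context needs 64 times more space for the score matrix under these assumptions, which is why long-context models depend on more memory-efficient attention techniques.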

What is the difference between encoder-only and decoder-only Transformers?

Encoder-only Transformers (such as BERT) process the entire input bidirectionally and produce a representation for each token — ideal for classification, NER, and question answering. Decoder-only Transformers (such as GPT) generate text autoregressively, each token attending only to previous tokens — ideal for generation tasks. Encoder-decoder Transformers (such as T5 and BART) use both halves: the encoder reads the input and the decoder generates the output, making them well-suited for translation and summarisation.
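One way to picture the difference is through the attention mask, which decides which tokens each position may look at. The sketch below builds both kinds of mask for a four-token sequence; the `attention_mask` helper is a hypothetical name used only for illustration.

```python
def attention_mask(n_tokens, causal):
    """mask[i][j] is True when token i may attend to token j."""
    return [[(j <= i) if causal else True for j in range(n_tokens)]
            for i in range(n_tokens)]

encoder_mask = attention_mask(4, causal=False)  # bidirectional: every token sees all tokens
decoder_mask = attention_mask(4, causal=True)   # causal: each token sees itself and earlier tokens

for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
# x...
# xx..
# xxx.
# xxxx
```

The encoder mask is entirely `True`, while the decoder mask is lower-triangular. In an encoder-decoder model, the decoder uses the causal mask over its own output and additionally attends to the encoder's representations of the input.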

See Also