Frequently Asked Questions
Why did Transformers replace RNNs?
Transformers replaced RNNs primarily because of three advantages: parallelisability (RNNs process tokens one after another, which prevents efficient GPU utilisation; during training, Transformers process all tokens of a sequence in parallel), better long-range dependency modelling (RNNs struggle with distant context due to vanishing gradients; self-attention gives every pair of tokens a direct path), and scalability (Transformer performance scales predictably with data and compute, enabling billion-parameter models).
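To make the parallelisability point concrete, here is a minimal sketch in plain Python. It is a toy illustration, not a real model: the scalar recurrence, the weight `w`, and the function names `rnn_states` and `self_attention_output` are all assumptions made up for this example. The RNN loop cannot compute step t before step t-1, whereas each attention output depends only on the raw inputs.

```python
import math

def rnn_states(xs, w=0.5):
    """Toy scalar RNN: h_t = tanh(w * h_{t-1} + x_t)."""
    h, states = 0.0, []
    for x in xs:  # each step needs the previous hidden state
        h = math.tanh(w * h + x)
        states.append(h)
    return states

def self_attention_output(xs, i):
    """Toy scalar self-attention for position i (x serves as query, key and value)."""
    scores = [xs[i] * xj for xj in xs]        # one score per (i, j) pair
    m = max(scores)                           # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e / total * xj for e, xj in zip(exps, xs))

xs = [0.1, -0.4, 0.8, 0.3]  # hypothetical input sequence
hidden = rnn_states(xs)
# No output below depends on another output, so these calls could run
# in any order, or all at once on parallel hardware.
attended = [self_attention_output(xs, i) for i in range(len(xs))]
```

Position 0 also reaches position 3 in a single attention step here, while the RNN has to carry that information through every intermediate hidden state.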
What is the context window of a Transformer?
The context window is the maximum number of tokens a Transformer model can process at once, counting both the input it reads and the output it generates. Early models like GPT-2 had a context window of 1,024 tokens; more recent models range from 8,192 (GPT-4 base) to over 1 million tokens (Gemini 1.5 Pro). A larger context window lets the model reason over longer documents, but the compute and memory cost of standard self-attention grows quadratically with sequence length, because it computes a score for every pair of tokens.
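The quadratic growth can be checked with a little arithmetic. The sketch below assumes a naive implementation that stores the full score matrix, with 2 bytes per value (16-bit floats); the helper name `attention_score_bytes` is hypothetical. Memory-efficient attention kernels avoid storing this matrix, so treat the numbers as an upper-bound illustration rather than a measurement of any particular model.

```python
def attention_score_bytes(seq_len: int, num_heads: int = 1, bytes_per_value: int = 2) -> int:
    """Bytes needed to hold one layer's full seq_len x seq_len score matrix.

    Assumes naive attention that materialises every pairwise score.
    """
    return seq_len * seq_len * num_heads * bytes_per_value

for n in (1_024, 8_192, 1_000_000):
    gib = attention_score_bytes(n) / 2**30
    print(f"{n:>9} tokens -> {gib:,.3f} GiB per head per layer")
```

Doubling the sequence length quadruples the figure, which is why long-context models depend on more economical attention implementations.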
What is the difference between encoder-only and decoder-only Transformers?
Encoder-only Transformers (such as BERT) process the entire input bidirectionally and produce a representation for each token — ideal for classification, NER, and question answering. Decoder-only Transformers (such as GPT) generate text autoregressively, each token attending only to previous tokens — ideal for generation tasks. Encoder-decoder Transformers (such as T5 and BART) use both halves: the encoder reads the input and the decoder generates the output, making them well-suited for translation and summarisation.
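The difference between bidirectional and autoregressive attention comes down to which positions each token is allowed to see. Here is a minimal sketch in plain Python, assuming a square matrix of raw attention scores; the function name `attention_weights` and the uniform scores are illustrative assumptions, not any library's API.

```python
import math

def attention_weights(scores, causal):
    """Row-wise softmax over a square score matrix (list of lists)."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Decoder-style (causal): query i may only see keys 0..i.
        # Encoder-style (bidirectional): query i sees every key.
        visible = range(i + 1) if causal else range(n)
        row_max = max(scores[i][j] for j in visible)
        exps = {j: math.exp(scores[i][j] - row_max) for j in visible}
        total = sum(exps.values())
        # Hidden (future) positions receive exactly zero weight.
        weights.append([exps.get(j, 0.0) / total for j in range(n)])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # uniform scores, for illustration only
encoder_style = attention_weights(scores, causal=False)
decoder_style = attention_weights(scores, causal=True)
print(encoder_style[0])  # first token attends to all four positions
print(decoder_style[0])  # first token can only attend to itself
```

In an encoder-decoder model both patterns typically appear: the encoder uses the bidirectional form over the input, and the decoder uses the causal form over the tokens generated so far.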