GPT — generative pre-trained transformer

Frequently Asked Questions

What is the difference between GPT and BERT?

GPT uses the decoder part of the Transformer and is trained to predict the next token (left-to-right), making it well-suited for text generation. BERT uses the encoder part and is trained to predict masked tokens using bidirectional context (seeing both left and right), making it better suited for classification and understanding tasks. In short, GPT is an autoregressive generative model, while BERT is an encoder whose representations are typically used for discriminative tasks such as classification.
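
The left-to-right versus bidirectional distinction comes down to which positions each token is allowed to attend to. The following is a minimal sketch in plain Python, not taken from either model's implementation; the helper names `causal_mask`, `bidirectional_mask`, and `visible_context` are hypothetical and chosen only for illustration.

```python
def causal_mask(n):
    """GPT-style mask: position i may attend only to positions 0..i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """BERT-style mask: every position may attend to every position."""
    return [[True] * n for _ in range(n)]


def visible_context(tokens, mask, position):
    """Return the tokens that `position` is allowed to attend to."""
    return [tok for tok, allowed in zip(tokens, mask[position]) if allowed]


tokens = ["the", "cat", "sat", "down"]

# What the model can see when processing position 1 ("cat").
gpt_view = visible_context(tokens, causal_mask(len(tokens)), 1)
bert_view = visible_context(tokens, bidirectional_mask(len(tokens)), 1)

print(gpt_view)   # only "the" and "cat": nothing to the right is visible
print(bert_view)  # all four tokens: left and right context are visible
```

In a real BERT training step the token being predicted is replaced by a mask symbol in the input, so the model sees its neighbours on both sides but not the answer itself.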

What does the "pre-trained" in GPT mean?

"Pre-trained" means the model was first trained on a massive general-purpose text corpus (hundreds of billions of tokens from the web, books, and code) before any task-specific training. This pre-training phase teaches the model broad world knowledge, grammar, and reasoning patterns. A separate fine-tuning phase then adapts the pre-trained model for instruction following, helpfulness, and safety using techniques such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF).
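
The pre-training objective itself is next-token prediction: every prefix of a text becomes a training example whose label is the token that follows. The sketch below illustrates this under simplifying assumptions; `next_token_pairs`, `average_nll`, and `uniform_model` are hypothetical names, and the "model" is a stand-in that assigns equal probability to a four-word vocabulary rather than a real neural network.

```python
import math


def next_token_pairs(tokens):
    """Turn one sequence into (context, target) pairs, one per position."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def average_nll(tokens, prob_fn):
    """Average negative log-likelihood of each target given its context."""
    pairs = next_token_pairs(tokens)
    return -sum(math.log(prob_fn(ctx, tgt)) for ctx, tgt in pairs) / len(pairs)


def uniform_model(context, target, vocab_size=4):
    """Stand-in model (assumption): every token is equally likely."""
    return 1.0 / vocab_size


tokens = ["the", "cat", "sat", "down"]
pairs = next_token_pairs(tokens)
loss = average_nll(tokens, uniform_model)

for context, target in pairs:
    print(context, "->", target)
print(loss)  # equals log(4) for the uniform stand-in
```

Pre-training adjusts the model's parameters to push this loss down across the whole corpus; fine-tuning later continues from those parameters with a different data source or reward signal.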

How has GPT evolved from version 1 to GPT-4?

GPT-1 (2018) had 117 million parameters and demonstrated basic transfer learning for NLP tasks. GPT-2 (2019) scaled to 1.5 billion parameters and could generate coherent multi-paragraph text, raising concerns about misuse. GPT-3 (2020) reached 175 billion parameters and exhibited strong few-shot learning. GPT-4 (2023) is multimodal (accepting images and text as input), reasons more reliably, and, by OpenAI's own evaluations, hallucinates less often than GPT-3.5.

See Also