RLHF

Reinforcement Learning from Human Feedback

Reinforcement Learning from Human Feedback (RLHF) is an alignment training technique that turns a raw pre-trained large language model (LLM) into a more helpful, safer assistant. It does this by using human preference rankings as a reward signal that shapes the model's outputs through reinforcement learning.

RLHF operates in three stages. First, supervised fine-tuning (SFT) on human-written demonstrations teaches the LLM the target behaviour. Second, human annotators rank pairs of model outputs, and those rankings are used to train a reward model. Third, the LLM is optimised, typically with proximal policy optimisation (PPO), to maximise the reward model's score while staying close to its SFT baseline. This closeness constraint limits reward hacking, although it does not eliminate it.
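
The trade-off in the third stage can be sketched in a few lines. This is a minimal illustration, not the implementation used by any particular system: it assumes the common formulation in which the reward model's score is reduced by a penalty proportional to how far the policy's log-probabilities have drifted from the SFT model's on the sampled response. The function name, the `beta` coefficient, and its default value are hypothetical.

```python
def kl_penalised_reward(reward_score, logp_policy, logp_sft, beta=0.1):
    """Reward-model score minus a penalty for drifting from the SFT model.

    reward_score: scalar score the reward model gave the sampled response.
    logp_policy:  per-token log-probabilities of that response under the
                  policy currently being trained.
    logp_sft:     per-token log-probabilities of the same response under
                  the frozen SFT model.
    beta:         penalty strength (hypothetical default, chosen for
                  illustration only).
    """
    if len(logp_policy) != len(logp_sft):
        raise ValueError("both models must score the same tokens")
    # Summed log-ratio over the response: a single-sample estimate of the
    # KL divergence between the policy and the SFT model.
    kl_estimate = sum(p - q for p, q in zip(logp_policy, logp_sft))
    return reward_score - beta * kl_estimate
```

When the policy assigns the same probabilities as the SFT model, the penalty is zero and the function returns the raw reward-model score. As the policy becomes more confident than the SFT model on its own samples, the penalty grows and the quantity being maximised shrinks.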

RLHF is the primary alignment technique behind ChatGPT, Claude, Gemini, and most frontier LLMs. It is increasingly complemented by variants. RLAIF (reinforcement learning from AI feedback) and constitutional AI reduce dependence on expensive human annotation by using a model or a written set of principles to generate preference signals. Direct preference optimisation (DPO) takes a different route: it trains on preference pairs directly, without a separate reward model or RL phase.

Figure: RLHF, the full three-stage pipeline

Frequently Asked Questions

What is the difference between RLHF and SFT?

Supervised Fine-Tuning (SFT) trains a model on example input–output pairs to mimic desired behaviour. RLHF goes further: it trains a separate reward model on human comparisons, then uses reinforcement learning to optimise the LLM to maximise that reward. SFT teaches the model what to say; RLHF teaches the model to prefer better responses, even on prompts not seen during SFT.
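
To make the "mimic" objective concrete, here is a minimal sketch of the loss SFT usually minimises: the average negative log-likelihood of the demonstration's tokens under the model. The function name and the plain-list input are assumptions made for illustration; real training code computes this over batches of tensors.

```python
def sft_loss(demo_token_logprobs):
    """Mean negative log-likelihood of a human-written demonstration.

    demo_token_logprobs: log-probability the model assigns to each token
    of the demonstration, given the prompt and the preceding tokens.
    """
    if not demo_token_logprobs:
        raise ValueError("demonstration must contain at least one token")
    # Lower loss means the model finds the demonstration more likely,
    # i.e. it imitates the demonstration more closely.
    return -sum(demo_token_logprobs) / len(demo_token_logprobs)
```

This loss only ever sees responses a human wrote. It has no way to express that one model-generated response is better than another, which is the gap the reward model and RL phase are meant to fill.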

What is a reward model in RLHF?

A reward model in RLHF is a separate neural network trained to predict human preference scores for model outputs. Human annotators compare pairs of model responses and indicate which is better; the reward model learns to replicate this judgement at scale. During the RL phase, the LLM is trained to generate responses that maximise the reward model's score, effectively optimising against a learned proxy for human preference. In the standard pipeline the reward model is used only during training and is not served to users.
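
The text above does not say which loss the reward model is trained with. A common choice, assumed here, is a pairwise (Bradley-Terry style) loss that pushes the score of the preferred response above the score of the rejected one. The function name is hypothetical and the inputs are plain floats for readability.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)) for one comparison.

    score_chosen:   reward-model score for the response the annotator
                    preferred.
    score_rejected: reward-model score for the other response.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals softplus(-m); this form avoids overflow
    # when the margin is large in either direction.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss is small when the reward model already ranks the pair the way the annotator did, and large when it ranks them the other way round. Minimising it over many comparisons is what lets the model "replicate this judgement at scale".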

What are the limitations of RLHF?

RLHF has several known limitations. Reward hacking occurs when the model learns to maximise the reward model's score in ways that do not reflect genuine quality — for example, producing verbose but shallow responses. Human annotator disagreement introduces inconsistency into preference data. RLHF is also expensive: it requires large volumes of human comparisons and significant RL compute. Alternatives such as DPO (Direct Preference Optimisation) attempt to achieve similar alignment without the separate RL phase, reducing complexity and cost.
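
For comparison, the DPO objective mentioned above can be written without any reward model or sampling loop. The sketch below assumes the standard published form of the DPO loss for a single preference pair; the function name, argument names, and the default `beta` are illustrative, not taken from any specific library.

```python
import math


def dpo_loss(logp_policy_chosen, logp_policy_rejected,
             logp_ref_chosen, logp_ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) preference pair.

    Each argument is the total log-probability of a full response under
    either the policy being trained or the frozen reference (SFT) model.
    beta is a hypothetical default controlling how strongly the policy
    is tied to the reference model.
    """
    # How much more likely each response has become relative to the
    # reference model.
    chosen_logratio = logp_policy_chosen - logp_ref_chosen
    rejected_logratio = logp_policy_rejected - logp_ref_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

The loss falls as the policy raises the probability of the preferred response, relative to the reference model, by more than it raises the probability of the rejected one. The preference data still has to come from somewhere, so DPO removes the RL compute cost rather than the annotation cost.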

See Also