
RLHF — the full three-stage pipeline

Frequently Asked Questions

What is the difference between RLHF and SFT?

Supervised Fine-Tuning (SFT) trains a model on example input–output pairs to mimic desired behaviour. RLHF goes further: it trains a separate reward model on human comparisons, then uses reinforcement learning to optimise the LLM to maximise that reward. SFT teaches the model what to say; RLHF teaches the model to prefer better responses, even on prompts not seen during SFT.
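To make the contrast concrete, here is a minimal sketch of the two training signals. It is illustrative only: the function names `sft_loss` and `policy_gradient_surrogate` are hypothetical, the inputs are plain floats rather than model outputs, and the RL part uses a simple REINFORCE-style surrogate as a stand-in. Production RLHF systems typically use more elaborate algorithms such as PPO.

```python
def sft_loss(target_token_logprobs):
    """SFT signal: mean negative log-likelihood of the demonstration's tokens.

    The model is pushed towards reproducing one specific reference answer.
    """
    return -sum(target_token_logprobs) / len(target_token_logprobs)


def policy_gradient_surrogate(sampled_token_logprobs, reward, baseline=0.0):
    """RL signal (simplified REINFORCE-style surrogate, an assumption here).

    The response is sampled from the model itself and scored by a reward
    model. Minimising this value raises the log-probability of responses
    whose reward exceeds the baseline and lowers it for the rest.
    """
    advantage = reward - baseline
    return -advantage * sum(sampled_token_logprobs)


# SFT: log-probabilities the model assigns to the tokens of a human-written answer.
print(sft_loss([-0.5, -1.5]))

# RL: log-probabilities of a sampled answer, plus a scalar reward for the whole answer.
print(policy_gradient_surrogate([-1.0, -2.0], reward=2.0, baseline=1.0))
```

The point of the sketch is the difference in supervision: SFT needs a reference answer for every prompt, while the RL signal only needs a score for whatever the model produced.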

What is a reward model in RLHF?

A reward model in RLHF is a separate neural network trained to assign each model output a score, such that outputs humans prefer receive higher scores. Human annotators compare pairs of model responses and indicate which is better; the reward model learns to replicate this judgement at scale. During the RL phase, the LLM is trained to generate responses that maximise the reward model's score, effectively optimising against a learned proxy for human preference. The reward model is not deployed to users; it is used only during training.
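One common way to learn from pairwise comparisons is a Bradley-Terry-style loss on the difference between the scores of the preferred and the rejected response. The sketch below assumes that formulation; the text above does not say which loss is used, and `pairwise_preference_loss` is a hypothetical name operating on plain floats rather than on a real network.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-probability that the chosen response beats the rejected one.

    Under a Bradley-Terry model, P(chosen > rejected) = sigmoid(margin),
    where margin = score_chosen - score_rejected.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(margin)) equals softplus(-margin); this form avoids overflow.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# The loss is small when the reward model ranks the pair the way the annotator did,
# and large when it ranks the pair the other way round.
agrees = pairwise_preference_loss(score_chosen=2.0, score_rejected=0.0)
disagrees = pairwise_preference_loss(score_chosen=0.0, score_rejected=2.0)
print(agrees < disagrees)
```

Only the score difference matters in this formulation, which is why the reward model can be trained from comparisons alone, without annotators ever assigning absolute scores.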

What are the limitations of RLHF?

RLHF has several known limitations. Reward hacking occurs when the model learns to maximise the reward model's score in ways that do not reflect genuine quality, for example by producing verbose but shallow responses. Human annotator disagreement introduces inconsistency into preference data. RLHF is also expensive: it requires large volumes of human comparisons and significant RL compute. Alternatives such as DPO (Direct Preference Optimisation) attempt to achieve similar alignment by training directly on preference pairs, without a separate reward model or RL phase, reducing complexity and cost.
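As a rough illustration of how DPO avoids the RL phase, the sketch below computes the DPO loss for a single preference pair from sequence log-probabilities under the model being trained and under a frozen reference model. The function name `dpo_loss`, the argument names, and the value `beta=0.1` are assumptions chosen for the example, and the inputs are plain floats rather than real model outputs.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) pair.

    Each argument is the total log-probability of a full response given the
    prompt. beta controls how strongly the policy may drift from the reference.
    """
    # How much more likely each response has become relative to the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# A policy identical to the reference has zero margin; moving probability
# towards the chosen response and away from the rejected one lowers the loss.
at_reference = dpo_loss(-10.0, -12.0, -10.0, -12.0)
after_update = dpo_loss(-8.0, -14.0, -10.0, -12.0)
print(after_update < at_reference)
```

The preference data plays the same role as in reward-model training, but the comparison is applied directly to the language model's own log-probabilities, so no separate scoring network or sampling loop is needed.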

See Also