RLHF

How does AI learn to "speak human"? From pre-training to alignment

Chapter 01

What's Wrong with Pre-trained LLMs?

During pre-training, large language models learn an astonishing ability: predicting the next token. By reading trillions of words from the internet, they acquire grammar, knowledge, and even reasoning capabilities.
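To make "predicting the next token" concrete, here is a minimal sketch of the loss a model is trained to reduce at a single position. The four-word vocabulary and the logit values are made up for illustration; a real model scores tens of thousands of tokens at every position.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy at one position: -log p(correct next token)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical toy vocabulary and model scores for the context "the cat ..."
vocab = ["cat", "sat", "mat", "the"]
logits = [0.5, 2.0, 0.1, -1.0]

# The loss is small when the model already favors the true next token.
loss_if_sat = next_token_loss(logits, vocab.index("sat"))
loss_if_mat = next_token_loss(logits, vocab.index("mat"))
```

Pre-training repeats this over every position in the corpus, nudging the model's scores so that the token that actually came next gets a higher probability.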

But here's the problem: they learn from everything in that data, including toxic content, biased speech, misinformation, and rude language. A raw pre-trained model may generate harmful, inaccurate, or even dangerous responses.

This is why we need alignment — making the model Helpful, Harmless, and Honest (the "3H" principles).

Pre-training is like learning every language (including profanity). Alignment is learning manners — knowing what to say and what not to say.

Chapter 02

The Three Steps of RLHF

RLHF (Reinforcement Learning from Human Feedback) is one of the most widely used alignment methods today. It consists of three core steps:

SFT — Supervised Fine-Tuning

Fine-tune the pre-trained model on high-quality, human-written Q&A data. This step teaches the model the "format of conversation" — how to properly answer questions instead of aimlessly continuing text.

Reward Model — Learning Preferences

Human annotators rank multiple responses to the same question (A is better than B). A separate model is trained to predict these human preferences. This becomes the "reward model" that scores any response.

PPO / RL — Reinforcement Learning

Use the reward model's scores as the "reward signal" and optimize the LLM via PPO (Proximal Policy Optimization). The model learns to generate higher-scoring responses, while a KL divergence penalty keeps it from drifting too far from the SFT model it started from.
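The KL constraint in the third step is often implemented as a penalty subtracted from the reward model's score. The sketch below shows one common form, assuming per-token log-probabilities are available from both the model being trained and a frozen reference copy (typically the SFT model). The function name and the `beta` value are illustrative assumptions, not part of any particular library.

```python
def kl_shaped_reward(reward_score, policy_logprobs, ref_logprobs, beta=0.1):
    """Reward used for the RL update: score minus a KL-style penalty.

    policy_logprobs / ref_logprobs hold the log-probability each model
    assigned to every token of the sampled response.
    """
    # Summed log-ratio over the response: a simple sample-based KL estimate.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, ref_logprobs))
    return reward_score - beta * kl_estimate

# Same reward-model score, but the second policy has moved away from the reference.
unchanged = kl_shaped_reward(1.0, [-1.0, -1.0], [-1.0, -1.0])
drifted = kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0])
```

A larger `beta` keeps the model closer to the reference; a smaller one lets it chase the reward model's score more aggressively.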

SFT is textbook learning, the Reward Model is the teacher's grading rubric, and RL is practicing over and over to improve your score.

Chapter 03

The Reward Model — AI's "Taste"

The reward model is the core component of RLHF. Its training data comes from preference rankings made by human annotators: given two responses A and B to the same question, the annotator picks which one is better.

Through massive amounts of comparison data, the reward model learns to predict human preferences — it can't answer questions itself, but it can judge the quality of any answer.
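A common way to train on such comparisons is a pairwise (Bradley-Terry style) loss, which is small when the reward model scores the preferred response above the rejected one. The sketch below assumes the reward model has already produced one scalar score per response; the function name is illustrative.

```python
import math

def pairwise_preference_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), in a numerically stable form."""
    margin = score_chosen - score_rejected
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# The annotator preferred response A over response B.
agrees = pairwise_preference_loss(score_chosen=2.0, score_rejected=0.0)
undecided = pairwise_preference_loss(score_chosen=0.0, score_rejected=0.0)
disagrees = pairwise_preference_loss(score_chosen=0.0, score_rejected=2.0)
```

Minimizing this loss over many labeled pairs pushes the scores of preferred responses up relative to rejected ones, which is how the comparisons become a scalar "taste" signal.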

However, reward models have a classic problem: reward hacking. The LLM being optimized may find "shortcuts" and generate responses that score high without actually being better. Examples include excessive flattery, repetitive safety disclaimers, and overly verbose answers.
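Reward hacking is easiest to see with a deliberately flawed scorer. The toy `proxy_reward` below is a hypothetical stand-in for a reward model that has picked up two spurious habits, liking longer answers and liking disclaimers. It is not how a real reward model is built, but it shows how a padded answer can outscore a better one.

```python
def proxy_reward(response):
    """Deliberately flawed toy scorer: rewards length and disclaimers."""
    text = response.lower()
    length_bonus = 0.1 * len(text.split())
    disclaimer_bonus = 1.0 * text.count("as an ai")
    return length_bonus + disclaimer_bonus

concise = "Read widely and revise every draft."
padded = (
    "As an AI, I must note that writing is hard. "
    "As an AI, I suggest that you read widely and revise every draft."
)

# The padded answer adds no useful content, yet it scores higher.
hacked = proxy_reward(padded) > proxy_reward(concise)
```

An LLM optimized hard against a scorer like this would learn to pad and disclaim rather than to answer well, which is one reason the KL constraint and careful reward model evaluation matter.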

The reward model is like a food critic — it can't cook, but it can tell the difference between a great dish and a bad one.

Chapter 04

From RLHF to DPO — Simpler Alignment

While RLHF is effective, it's also complex: it requires training a separate reward model and running an RL training loop that can be unstable. Researchers have been searching for simpler alternatives.

DPO (Direct Preference Optimization), proposed in 2023, was a breakthrough. It skips the explicit reward model and optimizes the LLM directly on preference pairs. DPO is derived from the same KL-constrained objective that RLHF optimizes, so in theory it targets the same optimal policy, but the implementation is dramatically simpler.
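The DPO loss for a single preference pair can be sketched in a few lines. It assumes the summed log-probability of each response is available under both the model being trained and a frozen reference model; the argument names and the `beta` value are illustrative.

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one pair: -log(sigmoid(beta * difference of log-ratios))."""
    chosen_logratio = policy_chosen_lp - ref_chosen_lp
    rejected_logratio = policy_rejected_lp - ref_rejected_lp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(logits))
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# Policy identical to the reference: no preference has been learned yet.
at_start = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy has raised the chosen response and lowered the rejected one.
after_training = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The log-ratio against the reference plays the role of an implicit reward, so no separate reward model or sampling loop is needed: the loss is computed directly on the labeled pairs, like an ordinary supervised objective.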

Constitutional AI, introduced by Anthropic, has the model self-critique and revise its own outputs based on a set of "constitutional principles," reducing dependence on human annotators.

RLAIF (RL from AI Feedback) replaces human feedback with AI-generated feedback, dramatically reducing annotation costs.

🏋
RLHF
3 steps / Complex
SFT + Reward Model + RL training loop

🎯
DPO
1 step / Direct
Skip the reward model, optimize preferences directly

🪞
Constitutional AI
Self-loop / Autonomous
Principle-based self-critique and revision

RLHF workflow: First, fine-tune with human-written data (SFT). Then train a reward model on human preference rankings. Finally, use PPO to iteratively optimize the LLM under the reward model's guidance. Pros: proven effectiveness. Cons: complex system, unstable training, expensive human annotation.

DPO workflow: Train the LLM directly on preference data (A is better than B), with no separate reward model or RL loop needed. It is derived from the same objective as RLHF, but is about as simple to implement as standard supervised learning. Increasingly adopted by open-source models.

Constitutional AI workflow: The model generates a response, then critiques and revises it based on preset "constitutional principles" (e.g., "be harmless," "be honest"). The revised responses are then used as data for further training. Anthropic uses this approach in training its Claude models.
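The critique-and-revise loop in the Constitutional AI workflow can be sketched as follows. Everything here is a simplified assumption: `generate` stands for any function that maps a prompt string to model text, the prompt wording is invented for illustration, and `toy_generate` is a stub that only makes the example runnable.

```python
def critique_and_revise(prompt, generate, principles):
    """Draft a response, then critique and revise it once per principle."""
    response = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}':\n{response}"
        )
        response = generate(
            "Revise the response to address the critique.\n"
            f"Response: {response}\nCritique: {critique}"
        )
    # The (prompt, revised response) pair becomes data for further training.
    return {"prompt": prompt, "response": response}

# Stub "model" that records each call; a real system would call an LLM here.
calls = []
def toy_generate(text):
    calls.append(text)
    return f"draft-{len(calls)}"

example = critique_and_revise(
    "How can I improve my writing?", toy_generate, ["be harmless", "be honest"]
)
```

With two principles the stub is called five times: one initial draft, then a critique and a revision for each principle.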

RLHF is hiring a coach to guide your training. DPO is studying game film to improve on your own. Constitutional AI is correcting yourself in front of a mirror.

Chapter 05

Summary

⚠

Pre-trained ≠ Useful

Pre-trained models learn language ability but not judgment. Alignment is the critical step from "can speak" to "speaks well."

👥

Human Preferences = Signal

RLHF's core insight: collect human preference rankings to turn the notion of "good" vs "bad" into trainable data.

⭐

Reward Models Learn Taste

The reward model doesn't generate answers — it judges them. It's the "taste referee" in the RLHF system.

🚀

DPO Simplifies Everything

DPO and Constitutional AI are making alignment simpler and more efficient, accelerating open-source model progress.

The essence of RLHF: teaching AI not just to "speak," but to "speak well" — from language ability to social intelligence.
