How does AI learn to "speak human"? From pre-training to alignment
During pre-training, large language models learn an astonishing ability: predicting the next token. By reading trillions of words from the internet, they acquire grammar, knowledge, and even reasoning capabilities.
But here's the problem: they learn from everything on the internet, including toxic content, biased speech, misinformation, and rude language. As a result, a raw pre-trained model may generate harmful, inaccurate, or even dangerous responses.
This is why we need alignment — making the model Helpful, Harmless, and Honest (the "3H" principles).
Pre-training is like learning every language (including profanity). Alignment is learning manners — knowing what to say and what not to say.
RLHF (Reinforcement Learning from Human Feedback) is one of the most widely used alignment methods today. It consists of three core steps: supervised fine-tuning (SFT), reward model training, and reinforcement learning (RL).
Step 1, supervised fine-tuning (SFT): fine-tune the pre-trained model on high-quality, human-written Q&A data. This step teaches the model the "format of conversation": how to answer a question properly instead of aimlessly continuing the text.
Step 2, reward model training: human annotators rank multiple responses to the same question (for example, A is better than B). A separate model is trained to predict these human preferences. This becomes the "reward model," which can score any response.
Step 3, reinforcement learning: use the reward model's scores as the "reward signal" and optimize the LLM via PPO (Proximal Policy Optimization). The model learns to generate higher-scoring responses, while a KL divergence constraint prevents it from drifting too far from the SFT model it started from.
SFT is textbook learning, the Reward Model is the teacher's grading rubric, and RL is practicing over and over to improve your score.
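The first and third steps can be sketched numerically. This is a minimal illustration, not any particular library's implementation. It assumes two common conventions that the text above does not spell out: the SFT loss is averaged over response tokens only (prompt tokens are masked out), and the KL constraint is applied by subtracting a scaled log-probability gap from the reward model's score. The function names, the mask, and the default `beta=0.1` are hypothetical choices made for this example.

```python
import math


def sft_loss(token_log_probs, is_response_token):
    """Average negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model gave to each reference token.
    is_response_token: True where the token belongs to the human-written answer.
    """
    picked = [-lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return sum(picked) / len(picked)


def kl_shaped_reward(reward_score, policy_log_probs, reference_log_probs, beta=0.1):
    """Reward model score minus a penalty for drifting from the reference model.

    The per-token difference of log-probabilities, summed over the sampled
    response, is a simple sample-based estimate of the KL divergence.
    """
    kl_estimate = sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))
    return reward_score - beta * kl_estimate


# Step 1: two prompt tokens (masked out) followed by two response tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.8), math.log(0.4)]
mask = [False, False, True, True]
print("SFT loss:", sft_loss(log_probs, mask))

# Step 3: the policy is more confident than the reference, so the reward is reduced.
print("Shaped reward:", kl_shaped_reward(1.0, [-0.5, -0.5], [-1.0, -1.0]))
```

A larger `beta` keeps the model closer to its starting point at the cost of slower reward improvement. A smaller `beta` allows more change but also more room for the model to exploit the reward model.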
The reward model is the core component of RLHF. Its training data comes from human annotator preference rankings: given two responses A and B to the same question, the annotator picks which one is better.
Through massive amounts of comparison data, the reward model learns to predict human preferences — it can't answer questions itself, but it can judge the quality of any answer.
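One common way to turn "A is better than B" into a training signal is a pairwise (Bradley-Terry style) loss. The sketch below assumes the reward model outputs a single scalar score per response; the function names are hypothetical and the scores are made up for illustration.

```python
import math


def pairwise_loss(score_chosen, score_rejected):
    """Negative log-probability that the chosen response beats the rejected one.

    Under a Bradley-Terry model, P(chosen wins) = sigmoid(score_chosen - score_rejected).
    The branches below compute -log(sigmoid(diff)) in a numerically stable way.
    """
    diff = score_chosen - score_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# The loss is small when the reward model already agrees with the annotator...
print(pairwise_loss(score_chosen=2.0, score_rejected=0.0))
# ...and large when it ranks the pair the wrong way round.
print(pairwise_loss(score_chosen=0.0, score_rejected=2.0))
```

Only the difference between the two scores matters, which is why a reward model trained this way learns a relative ranking rather than an absolute quality scale.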
However, reward models have a classic problem: reward hacking. The LLM being optimized may find "shortcuts," generating responses that score high without actually being better. Typical examples are excessive flattery, repetitive safety disclaimers, and overly verbose answers.
The reward model is like a food critic — it can't cook, but it can tell the difference between a great dish and a bad one.
While RLHF is effective, it's also complex: it requires training a separate reward model and running an unstable RL training loop. Researchers have been searching for simpler alternatives.
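DPO (Direct Preference Optimization), proposed in 2023, was a breakthrough. It skips the separately trained reward model and optimizes the LLM directly on preference pairs. Mathematically, DPO is derived from the same KL-constrained reward-maximization objective that RLHF optimizes, so under the usual preference-model assumptions the two target the same solution, but the implementation is dramatically simpler: a single supervised-style loss with no RL loop.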
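To make "directly from preference pairs" concrete, here is a minimal sketch of the DPO loss for a single pair. It assumes each input is the summed log-probability of a whole response under either the model being trained (the policy) or a frozen reference copy. The argument names and the default `beta=0.1` are assumptions for illustration.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each margin measures how much more likely the policy makes a response
    compared with the frozen reference model.
    """
    chosen_margin = policy_chosen - ref_chosen
    rejected_margin = policy_rejected - ref_rejected
    return -log_sigmoid(beta * (chosen_margin - rejected_margin))


# Policy identical to the reference: no preference has been learned yet.
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy has raised the chosen response and lowered the rejected one: lower loss.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The reference model plays the role that the KL constraint plays in RLHF: the loss rewards changes relative to the starting point, not raw likelihood.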
Constitutional AI, introduced by Anthropic, has the model self-critique and revise its own outputs based on a set of "constitutional principles," reducing dependence on human annotators.
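The self-critique idea can be sketched as a simple loop. This is an illustrative outline only, not Anthropic's actual pipeline: `generate` is a hypothetical stand-in for any function that sends a prompt to a language model and returns its text, and the prompt wording is invented for this example.

```python
from typing import Callable, List


def critique_and_revise(
    generate: Callable[[str], str],
    question: str,
    principles: List[str],
) -> str:
    """Draft an answer, then critique and rewrite it once per principle."""
    response = generate(question)
    for principle in principles:
        critique = generate(
            f"Critique this response against the principle '{principle}'.\n"
            f"Question: {question}\nResponse: {response}"
        )
        response = generate(
            "Rewrite the response to address the critique.\n"
            f"Question: {question}\nResponse: {response}\nCritique: {critique}"
        )
    return response


# A toy stand-in model so the loop can be run without any real LLM.
def toy_model(prompt: str) -> str:
    if prompt.startswith("Critique"):
        return "The tone is dismissive."
    if prompt.startswith("Rewrite"):
        return "Here is a respectful answer."
    return "Figure it out yourself."


print(critique_and_revise(toy_model, "How do I reset my password?", ["Be respectful"]))
```

The revised responses produced this way can then serve as training data, which is where the reduced dependence on human annotators comes from.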
RLAIF (RL from AI Feedback) replaces human feedback with AI-generated feedback, dramatically reducing annotation costs.
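In practice, "AI-generated feedback" usually means an AI judge fills the annotator's seat and produces the same kind of preference pairs a human would. The sketch below assumes a hypothetical `judge` callable that returns `"A"` or `"B"` for a pair of responses; the record format is likewise an assumption for illustration.

```python
from itertools import combinations
from typing import Callable, Dict, List


def label_preferences(
    judge: Callable[[str, str, str], str],
    prompt: str,
    responses: List[str],
) -> List[Dict[str, str]]:
    """Ask an AI judge to compare every pair of responses to one prompt."""
    pairs = []
    for a, b in combinations(responses, 2):
        winner = judge(prompt, a, b)  # expected to return "A" or "B"
        chosen, rejected = (a, b) if winner == "A" else (b, a)
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs


# A toy judge that simply prefers the shorter response, standing in for a real model.
def toy_judge(prompt: str, a: str, b: str) -> str:
    return "A" if len(a) <= len(b) else "B"


for record in label_preferences(toy_judge, "Explain RLHF.", ["short", "much longer text"]):
    print(record)
```

The resulting records have the same shape as human-labeled comparisons, so they can feed either a reward model or a direct method such as DPO.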
RLHF is hiring a coach to guide your training. DPO is studying game film to improve on your own. Constitutional AI is correcting yourself in front of a mirror.
Pre-trained models learn language ability but not judgment. Alignment is the critical step from "can speak" to "speaks well."
RLHF's core insight: collect human preference rankings to turn the notion of "good" vs "bad" into trainable data.
The reward model doesn't generate answers — it judges them. It's the "taste referee" in the RLHF system.
DPO and Constitutional AI are making alignment simpler and more efficient, accelerating open-source model progress.
The essence of RLHF: teaching AI not just to "speak," but to "speak well" — from language ability to social intelligence.