Reinforcement Learning from Human Feedback (RLHF)

Aligning AI outputs with human preferences.

Overview

Reinforcement Learning from Human Feedback (RLHF) is a technique in which humans give feedback (through comparisons, rankings, or direct ratings) on the quality of an AI system's outputs, most often those of a Large Language Model. The model is then updated via reinforcement learning to maximize a reward signal derived from this feedback, making the system's responses better aligned with human preferences and values.
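
In one common formulation of the RL step, the trained policy (the model being fine-tuned) is written as pi_theta, the learned reward model as r_phi, and the reference model it started from as pi_ref; a KL-divergence penalty with coefficient beta keeps the policy from drifting too far from the reference while it chases higher reward. Roughly:

    \[
    \max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
    \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \,\big]
    \]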

How RLHF Works

Collect Human Feedback: Humans compare, rank, or label outputs sampled from the model.
Train a Reward Model: A secondary model learns to predict which outputs humans prefer (see the reward-model sketch after this list).
Policy Optimization: The main model is optimized with reinforcement learning to produce outputs that score higher under the reward model, i.e., outputs humans are more likely to prefer (see the policy-optimization sketch after this list).
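
For concreteness, here is a minimal PyTorch sketch of steps 1 and 2: human preferences are stored as (chosen, rejected) pairs for the same prompt, and a small reward model is trained with a Bradley-Terry style loss to score the chosen output above the rejected one. The RewardModel class, the random feature vectors standing in for encoded prompt-response pairs, and the hyperparameters are illustrative assumptions, not a production setup (a real system would put the scoring head on a language-model backbone).

    # Toy reward-model training on pairwise human preferences (steps 1-2).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Maps a (prompt, response) feature vector to a scalar reward."""
        def __init__(self, dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return self.net(features).squeeze(-1)

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style objective: the human-preferred response should
        # score higher than the rejected one: -log sigmoid(r_chosen - r_rejected).
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy preference data: feature vectors standing in for encoded
    # (prompt, response) pairs; in practice these come from human labelers.
    torch.manual_seed(0)
    chosen = torch.randn(128, 16) + 0.5    # human-preferred outputs
    rejected = torch.randn(128, 16) - 0.5  # human-dispreferred outputs

    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(200):
        loss = preference_loss(model(chosen), model(rejected))
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"final pairwise loss: {loss.item():.3f}")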
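
A second sketch covers step 3. For brevity it uses a plain REINFORCE-style policy-gradient update instead of the clipped PPO objective commonly used in practice, and it "generates" only a single token per prompt. The stand-in reward_fn, the frozen reference head, and the coefficient beta are assumptions made for the example; the idea shown is maximizing the reward-model score while a KL-style penalty keeps the policy close to its reference.

    # Simplified policy-optimization step (step 3): REINFORCE with a KL penalty.
    import torch
    import torch.nn as nn

    vocab_size, hidden = 32, 16
    policy = nn.Linear(hidden, vocab_size)      # trainable policy head
    reference = nn.Linear(hidden, vocab_size)   # frozen reference (pre-RL) policy
    reference.load_state_dict(policy.state_dict())
    for p in reference.parameters():
        p.requires_grad_(False)

    def reward_fn(tokens: torch.Tensor) -> torch.Tensor:
        # Stand-in for a trained reward model; here it simply prefers low token ids.
        return -tokens.float() / vocab_size

    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    beta = 0.1  # KL-penalty coefficient keeping the policy near the reference

    for step in range(300):
        prompts = torch.randn(64, hidden)       # toy prompt representations
        logits = policy(prompts)
        ref_logits = reference(prompts)

        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                  # "generate" one token per prompt
        logprob = dist.log_prob(tokens)
        ref_logprob = torch.distributions.Categorical(logits=ref_logits).log_prob(tokens)

        # Shaped reward: reward-model score minus a per-sample KL-style penalty
        # (log-probability ratio between the policy and the reference).
        shaped = reward_fn(tokens) + beta * (ref_logprob - logprob).detach()

        # REINFORCE: raise the log-probability of samples with high shaped reward.
        loss = -(shaped * logprob).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()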

Relevance

RLHF has become central to LLM alignment, improving helpfulness, reducing toxic or off-topic outputs, and offering a practical way to fine-tune model behavior toward human preferences.