Reinforcement Learning from Human Feedback (RLHF)

Aligning AI outputs with human preferences.

Overview

Reinforcement Learning from Human Feedback (RLHF) is a technique in which humans give feedback (through comparisons, rankings, or direct ratings) on the quality of an AI system's outputs, most often those of a Large Language Model. The model is then updated via reinforcement learning to maximize a reward signal derived from this feedback, making the system's responses better aligned with human preferences and values.
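
In one common formulation of the RL step, the trained policy (the model being fine-tuned) is written as pi_theta, the learned reward model as r_phi, and the reference model it started from as pi_ref; a KL-divergence penalty with coefficient beta keeps the policy from drifting too far from the reference while it chases higher reward. Roughly:

    \[
    \max_{\theta}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big]
    \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\big[\, \mathrm{KL}\big(\pi_\theta(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\big) \,\big]
    \]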

How RLHF Works

Collect Human Feedback: Humans compare, rank, or label outputs sampled from the model.
Train a Reward Model: A secondary model learns to predict which outputs humans prefer (see the reward-model sketch after this list).
Policy Optimization: The main model is optimized with reinforcement learning to produce outputs that score higher under the reward model, i.e., outputs humans are more likely to prefer (see the policy-optimization sketch after this list).
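
For concreteness, here is a minimal PyTorch sketch of steps 1 and 2: human preferences are stored as (chosen, rejected) pairs for the same prompt, and a small reward model is trained with a Bradley-Terry style loss to score the chosen output above the rejected one. The RewardModel class, the random feature vectors standing in for encoded prompt-response pairs, and the hyperparameters are illustrative assumptions, not a production setup (a real system would put the scoring head on a language-model backbone).

    # Toy reward-model training on pairwise human preferences (steps 1-2).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class RewardModel(nn.Module):
        """Maps a (prompt, response) feature vector to a scalar reward."""
        def __init__(self, dim: int = 16):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return self.net(features).squeeze(-1)

    def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
        # Bradley-Terry style objective: the human-preferred response should
        # score higher than the rejected one: -log sigmoid(r_chosen - r_rejected).
        return -F.logsigmoid(r_chosen - r_rejected).mean()

    # Toy preference data: feature vectors standing in for encoded
    # (prompt, response) pairs; in practice these come from human labelers.
    torch.manual_seed(0)
    chosen = torch.randn(128, 16) + 0.5    # human-preferred outputs
    rejected = torch.randn(128, 16) - 0.5  # human-dispreferred outputs

    model = RewardModel()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    for step in range(200):
        loss = preference_loss(model(chosen), model(rejected))
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(f"final pairwise loss: {loss.item():.3f}")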
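
A second sketch covers step 3. For brevity it uses a plain REINFORCE-style policy-gradient update instead of the clipped PPO objective commonly used in practice, and it "generates" only a single token per prompt. The stand-in reward_fn, the frozen reference head, and the coefficient beta are assumptions made for the example; the idea shown is maximizing the reward-model score while a KL-style penalty keeps the policy close to its reference.

    # Simplified policy-optimization step (step 3): REINFORCE with a KL penalty.
    import torch
    import torch.nn as nn

    vocab_size, hidden = 32, 16
    policy = nn.Linear(hidden, vocab_size)      # trainable policy head
    reference = nn.Linear(hidden, vocab_size)   # frozen reference (pre-RL) policy
    reference.load_state_dict(policy.state_dict())
    for p in reference.parameters():
        p.requires_grad_(False)

    def reward_fn(tokens: torch.Tensor) -> torch.Tensor:
        # Stand-in for a trained reward model; here it simply prefers low token ids.
        return -tokens.float() / vocab_size

    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    beta = 0.1  # KL-penalty coefficient keeping the policy near the reference

    for step in range(300):
        prompts = torch.randn(64, hidden)       # toy prompt representations
        logits = policy(prompts)
        ref_logits = reference(prompts)

        dist = torch.distributions.Categorical(logits=logits)
        tokens = dist.sample()                  # "generate" one token per prompt
        logprob = dist.log_prob(tokens)
        ref_logprob = torch.distributions.Categorical(logits=ref_logits).log_prob(tokens)

        # Shaped reward: reward-model score minus a per-sample KL-style penalty
        # (log-probability ratio between the policy and the reference).
        shaped = reward_fn(tokens) + beta * (ref_logprob - logprob).detach()

        # REINFORCE: raise the log-probability of samples with high shaped reward.
        loss = -(shaped * logprob).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()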

Relevance

RLHF has become central to LLM alignment, improving helpfulness, reducing toxic or off-topic outputs, and offering a practical way to fine-tune model behavior toward human preferences.