Projection

Mathematical technique for removing unwanted components from model vectors

Overview

In AI, and specifically in work on modifying model behavior, projection is a mathematical technique for removing an unwanted component from a [vector] of model activations, typically to change how a model processes harmful prompts. It computes the component of the activation vector that is parallel to a specified [refusal direction] and subtracts it, so that the resulting vector has no component along the refusal direction.

What is Projection?

Mathematically removing the [refusal direction] to modify AI behavior

A technique to eliminate a specific component of a vector
Used to modify how AI models behave and respond to harmful prompts
Identifies and subtracts the part of a vector aligned with an undesirable direction

How Does it Work?

The activation vector is treated as an arrow in a multi-dimensional space that represents how the model is processing its input
A [refusal direction] vector points along the direction in that space associated with the unwanted response
Projection measures how much the activation aligns with the refusal direction by finding its "shadow" on that direction
Subtracting that "shadow" removes the unwanted component, allowing the model to respond in a new way
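
The steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function name remove_direction and the toy three-dimensional vectors are assumptions made for the example, not part of any particular library or model, and real activations have far more dimensions.

```python
import numpy as np

def remove_direction(h: np.ndarray, d: np.ndarray) -> np.ndarray:
    """Return h with its component along d subtracted (hypothetical helper)."""
    # Normalise the direction so its length does not affect the result.
    d_unit = d / np.linalg.norm(d)
    # The "shadow" of h on the direction: a scalar coefficient times d_unit.
    shadow = (h @ d_unit) * d_unit
    # Subtracting the shadow leaves only the part of h orthogonal to d.
    return h - shadow

# Toy values chosen for illustration only.
h = np.array([3.0, 4.0, 0.0])   # stand-in for an activation vector
d = np.array([2.0, 0.0, 0.0])   # stand-in for a refusal direction (not unit length)

h_ablated = remove_direction(h, d)
print(h_ablated)                        # [0. 4. 0.]
print(np.isclose(h_ablated @ d, 0.0))   # True: no component left along d
```

Normalising d inside the function is a design choice made here so that the caller does not have to supply a unit vector.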

Mathematical Formula

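The projection of an activation vector h onto a refusal direction d_refusal, assuming d_refusal is a unit vector (length 1), is calculated as:

proj_(d_refusal)(h) = (h · d_refusal) d_refusal

The modified activation is then h' = h - proj_(d_refusal)(h), which satisfies h' · d_refusal = 0. If d_refusal is not unit length, divide the right-hand side of the projection formula by (d_refusal · d_refusal).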
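
As a small worked example, take a two-dimensional case with values chosen purely for illustration, where the direction already has length 1. The symbol h' for the result after subtraction is notation assumed here.

```latex
\begin{aligned}
h &= (3, 4), \quad d_{\text{refusal}} = (1, 0) \\
h \cdot d_{\text{refusal}} &= 3 \cdot 1 + 4 \cdot 0 = 3 \\
\operatorname{proj}_{d_{\text{refusal}}}(h) &= 3\,(1, 0) = (3, 0) \\
h' &= h - \operatorname{proj}_{d_{\text{refusal}}}(h) = (0, 4) \\
h' \cdot d_{\text{refusal}} &= 0 \cdot 1 + 4 \cdot 0 = 0
\end{aligned}
```

The last line confirms that the result has no remaining component along the refusal direction.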
