Projection

Mathematical technique for removing unwanted components from model vectors

Overview

In AI, and specifically in modifying model behavior, projection is a mathematical technique for removing an unwanted component from a [vector] of model activations, particularly to change how a model processes harmful prompts. It computes the component of the activation vector that is parallel to a specified [refusal direction] and subtracts it, so that the resulting vector has no component along the refusal direction.

What is Projection?

Mathematically removing the [refusal direction] to modify AI behavior

  • A technique to eliminate a specific component of a vector
  • Used to modify how AI models behave and respond to harmful prompts
  • Identifies and subtracts the part of a vector aligned with an undesirable direction

How Does it Work?

  • The activation vector is treated as an arrow in a multi-dimensional space that represents how the model is processing its input
  • A [refusal direction] vector marks the direction in that space associated with the unwanted response
  • Projection measures how much the activation aligns with the refusal direction by finding the activation's "shadow" on that direction
  • Subtracting that "shadow" removes the refusal component, which allows the model to respond in a different way
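
The steps above can be traced on a toy case. The following is a minimal sketch in plain Python, assuming a hypothetical 2-dimensional activation and a refusal direction that already has unit length; the values and variable names are illustrative only, and real activations have far more dimensions.

```python
# Illustrative values only: real activation vectors are much larger.
h = [3.0, 4.0]   # hypothetical activation vector
d = [1.0, 0.0]   # hypothetical refusal direction, assumed unit length

# Step 1: measure how much h aligns with d (dot product).
alignment = sum(hi * di for hi, di in zip(h, d))

# Step 2: the "shadow" is d scaled by that alignment.
shadow = [alignment * di for di in d]

# Step 3: subtract the shadow from the activation.
h_ablated = [hi - si for hi, si in zip(h, shadow)]

# Check: what is left has no component along d.
remaining = sum(hi * di for hi, di in zip(h_ablated, d))

print(alignment, shadow, h_ablated, remaining)
```

In this toy case the activation keeps its second coordinate and loses only the part that pointed along the refusal direction.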

Mathematical Formula

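The projection of an activation vector h onto a refusal direction d_refusal, assuming d_refusal has unit length, is calculated as:

proj_(d_refusal)(h) = (h · d_refusal) d_refusal

If d_refusal does not have unit length, the coefficient (h · d_refusal) is divided by (d_refusal · d_refusal). The modified activation is then h' = h - proj_(d_refusal)(h).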
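
As a hedged illustration of the formula, the sketch below wraps it in two small helper functions. The names `project`, `remove_direction`, and `h_prime` are hypothetical and not taken from any library. It also assumes the direction may not be unit length, so it divides by d · d, which reduces to the formula above when the direction is normalized.

```python
def project(h, d):
    """Return the component of h parallel to d (d need not be unit length)."""
    d_dot_d = sum(x * x for x in d)
    if d_dot_d == 0.0:
        raise ValueError("direction must be non-zero")
    # Coefficient (h . d) / (d . d); equals (h . d) when d is unit length.
    scale = sum(a * b for a, b in zip(h, d)) / d_dot_d
    return [scale * x for x in d]


def remove_direction(h, d):
    """Return h minus its projection onto d."""
    p = project(h, d)
    return [a - b for a, b in zip(h, p)]


# Illustrative values; d_refusal is deliberately not unit length.
h = [2.0, 1.0, -1.0]
d_refusal = [0.0, 2.0, 0.0]

h_prime = remove_direction(h, d_refusal)

# The dot product of the result with d_refusal should be zero.
leftover = sum(a * b for a, b in zip(h_prime, d_refusal))
print(h_prime, leftover)
```

Handling the non-unit case inside `project` is a design choice for the sketch; normalizing the direction once up front and using the simpler formula is equally valid.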