Projection
Mathematical technique for removing unwanted components from model vectors
Overview
In AI, and specifically in modifying model behavior, projection is a mathematical technique for removing an unwanted component from a [vector] of model activations, typically to change how a model processes harmful prompts. It computes the component of the activation vector that is parallel to a specified [refusal direction] and subtracts it, so that the resulting vector has no component along the refusal direction.
What is Projection?
Mathematically removing the [refusal direction] to modify AI behavior
- A technique to eliminate a specific component of a vector
- Used to modify how AI models behave and respond to harmful prompts
- Identifies and subtracts the part of a vector aligned with an undesirable direction
How Does it Work?
- The activation vector is treated as an arrow in a multi-dimensional space, representing how the model is processing its input
- A [refusal direction] vector marks the direction in that space associated with the response to be removed
- Projection measures how much the activation aligns with the refusal direction by finding the activation's "shadow" on that direction
- Subtracting that "shadow" removes the component along the refusal direction, which allows the model to respond in a new way
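The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the helper names `dot` and `remove_direction` are hypothetical, the vectors are plain lists to keep the sketch dependency-free, and the numbers are made up rather than taken from a real model.

```python
import math

def dot(u, v):
    """Dot product of two equal-length sequences of numbers."""
    return sum(a * b for a, b in zip(u, v))

def remove_direction(h, d):
    """Return h with its component along d subtracted.

    d does not have to be unit length; it is normalised first.
    """
    norm = math.sqrt(dot(d, d))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    d_hat = [x / norm for x in d]             # unit vector along d
    alignment = dot(h, d_hat)                 # how far h extends along d
    shadow = [alignment * x for x in d_hat]   # the "shadow" of h on d
    return [a - s for a, s in zip(h, shadow)]

# Toy 3-dimensional example (made-up values, not real activations).
h = [3.0, 4.0, 5.0]
d_refusal = [0.0, 2.0, 0.0]
h_ablated = remove_direction(h, d_refusal)
```

Here `h_ablated` keeps the first and third coordinates of `h` and has a zero second coordinate, so its dot product with `d_refusal` is zero. Normalising `d` inside the function is a design choice of this sketch, so callers do not have to pass a unit vector.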
Mathematical Formula
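The projection of an activation vector h onto a unit-length refusal direction d_refusal is calculated as:
proj_(d_refusal)(h) = (h · d_refusal) d_refusal
If d_refusal is not unit length, the dot product is divided by the squared length of d_refusal:
proj_(d_refusal)(h) = ((h · d_refusal) / (d_refusal · d_refusal)) d_refusal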
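As a short check of why subtracting the projection leaves no component along the refusal direction, the derivation below may help. It assumes d_refusal has length 1 (so d_refusal · d_refusal = 1), and the symbol h' for the modified vector is introduced here for illustration only.

```latex
\begin{aligned}
h' &= h - \operatorname{proj}_{d_{\text{refusal}}}(h)
    = h - (h \cdot d_{\text{refusal}})\, d_{\text{refusal}} \\
h' \cdot d_{\text{refusal}}
   &= h \cdot d_{\text{refusal}}
    - (h \cdot d_{\text{refusal}})\,(d_{\text{refusal}} \cdot d_{\text{refusal}}) \\
   &= h \cdot d_{\text{refusal}} - (h \cdot d_{\text{refusal}}) = 0
\end{aligned}
```

Since h' · d_refusal = 0, the modified vector is orthogonal to the refusal direction.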