Paper Info
arXiv: 2410.16314
Title: Steering Large Language Models using Conceptors: Improving Addition-Based Activation Engineering
Authors: Joris Postmus, Steven Abreu
Code: jorispos/conceptorsteering
Prior Knowledge
Activation engineering steers LLM outputs by directly modifying internal activations at inference time, without changing model parameters. The standard approach extracts a steering vector by averaging the residual stream activations from a set of in-context learning (ICL) prompts that demonstrate a desired input-output function, then adds this vector to the residual stream during inference. This method, formalized by Todd et al. in Function Vectors in Large Language Models, was shown to capture and transfer task-level behavior (e.g., mapping words to their antonyms) across unseen inputs. However, because the steering vector is a single point estimate obtained by averaging, it discards information about the variance and correlation structure of the underlying activation distribution. This limits its ability to represent complex activation patterns.
Conceptors, introduced by Jaeger in Conceptors: an easy introduction, are positive semi-definite matrices that encode the principal directions and variances of a set of neural activation vectors as a high-dimensional ellipsoid. Originally developed for controlling pattern-generating recurrent neural networks, conceptors act as soft projection operators: unlike hard projection matrices whose eigenvalues are strictly zero or one, conceptor eigenvalues lie continuously in $[0, 1)$, allowing graded attenuation of activation components according to how well they align with the captured pattern.
Main Question
Can conceptors, used as steering matrices that softly project activations onto learned ellipsoidal regions, outperform the standard approach of adding a single steering vector to the residual stream for controlling LLM outputs at inference time?
Key Claims
- Conceptor-based steering, which replaces vector addition with matrix-vector multiplication using a soft projection matrix, consistently outperforms additive steering across multiple function-steering tasks on both GPT-J (6B) and GPT-NeoX (20B).
- Mean-centering provides substantial gains for additive steering but only marginal gains for conceptor steering, and conceptors without mean-centering already outperform additive steering with mean-centering.
- Boolean operations on conceptors (specifically AND) provide a principled way to combine multiple steering goals that empirically outperforms the arithmetic mean of individual steering vectors.
Method
The paper builds on the function vector framework from Todd et al., where a set of ICL prompts demonstrating a function (e.g., antonyms) is used to extract the last token's activations from the residual stream at a specified layer $\ell$. In the standard approach, these activations are averaged into a steering vector $\bar{h}$ and added to the residual stream at inference time with a scaling coefficient $\beta$.
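For reference, a minimal numpy sketch of this additive baseline (names and shapes are illustrative, not taken from the paper's codebase):

```python
import numpy as np

# X: cached last-token residual-stream activations at layer l from the ICL
# prompts, stacked as rows -- shape (n_prompts, d_model).
def steering_vector(X: np.ndarray) -> np.ndarray:
    """Additive baseline: collapse the cached activations into one mean vector."""
    return X.mean(axis=0)

def apply_additive(h: np.ndarray, v: np.ndarray, beta: float) -> np.ndarray:
    """At inference, add the scaled steering vector to a residual-stream state h."""
    return h + beta * v
```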
Conceptor Construction
Instead of averaging, the authors stack the $n$ cached activation vectors $x_1, \dots, x_n \in \mathbb{R}^d$ into a matrix $X$ and compute a conceptor matrix $C$ by solving a regularized reconstruction objective:

$$C = \arg\min_{C} \; \frac{1}{n} \sum_{i=1}^{n} \|x_i - C x_i\|^2 + \alpha^{-2} \|C\|_F^2$$

where $\alpha > 0$ is the aperture parameter controlling the trade-off between fidelity to the activation pattern and generalization. The closed-form solution is:

$$C = R \,(R + \alpha^{-2} I)^{-1}$$

where $R = \frac{1}{n} X^\top X$ is the correlation matrix and $n$ is the number of samples. The eigenvalues of $C$ are given by $\lambda_i = \sigma_i / (\sigma_i + \alpha^{-2})$, where $\sigma_i$ are the eigenvalues of $R$. These eigenvalues fall in $[0, 1)$, making the conceptor a soft projection: a large aperture $\alpha$ pushes eigenvalues toward 1 (passing more signal), while a small $\alpha$ pushes them toward 0 (suppressing variability).
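A direct numpy transcription of the closed form above (a sketch; `compute_conceptor` is a hypothetical helper name):

```python
import numpy as np

def compute_conceptor(X: np.ndarray, aperture: float) -> np.ndarray:
    """Closed-form conceptor C = R (R + alpha^{-2} I)^{-1} for activations X.

    X: (n_samples, d_model) matrix of cached activations (rows are samples);
    aperture: alpha, trading fidelity to the pattern against generalization.
    """
    n, d = X.shape
    R = (X.T @ X) / n                        # correlation matrix
    return R @ np.linalg.inv(R + aperture ** -2 * np.eye(d))

# Eigenvalues of C are sigma_i / (sigma_i + alpha^{-2}), hence in [0, 1):
C = compute_conceptor(np.random.randn(100, 16), aperture=10.0)
eigs = np.linalg.eigvalsh((C + C.T) / 2)     # symmetrize for numerical stability
assert eigs.min() >= 0 and eigs.max() < 1
```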
Steering is then performed via matrix-vector multiplication, replacing the residual-stream activation $h$ at the chosen layer:

$$h \leftarrow \beta \,(C h)$$

where $\beta$ is a rescaling coefficient.
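A sketch of how this could be wired into a HuggingFace model with a PyTorch forward hook; the module path and output convention are assumptions about GPT-J's block structure, not code from the paper:

```python
import torch

def conceptor_hook(C: torch.Tensor, beta: float):
    """Returns a forward hook that replaces residual-stream activations h
    with beta * (C h), applied over the last (d_model) dimension."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        steered = beta * hidden @ C.T        # C is symmetric; .T for clarity
        if isinstance(output, tuple):
            return (steered,) + tuple(output[1:])
        return steered
    return hook

# Hypothetical usage, steering GPT-J at layer 9:
# handle = model.transformer.h[9].register_forward_hook(conceptor_hook(C, beta))
# ... run generation ...
# handle.remove()
```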
Boolean Operations for Combined Steering
To compose multiple steering targets, the paper uses Boolean operations on conceptors defined by Jaeger in Controlling Recurrent Neural Networks by Conceptors. For two conceptors $C_1$ and $C_2$ computed from correlation matrices $R_1$ and $R_2$, the AND operation is:

$$C_1 \wedge C_2 = R_\wedge \,(R_\wedge + \alpha^{-2} I)^{-1}, \qquad R_\wedge = \left(R_1^{-1} + R_2^{-1}\right)^{-1}$$
This is derived from de Morgan's law applied to the OR and NOT operations on conceptors. The AND-combined conceptor captures the intersection of the two activation subspaces, providing a principled alternative to averaging steering vectors for composite tasks.
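In code, the AND combination is a harmonic-mean-style merge of the correlation matrices followed by the usual closed form (a sketch; it assumes $R_1$ and $R_2$ are invertible, otherwise a small ridge term would be needed):

```python
import numpy as np

def conceptor_and(R1: np.ndarray, R2: np.ndarray, aperture: float) -> np.ndarray:
    """AND of two conceptors via their correlation matrices (Jaeger's algebra).

    R_and = (R1^{-1} + R2^{-1})^{-1} keeps directions with high variance under
    *both* patterns; the result is then turned into a conceptor as usual.
    """
    d = R1.shape[0]
    R_and = np.linalg.inv(np.linalg.inv(R1) + np.linalg.inv(R2))
    return R_and @ np.linalg.inv(R_and + aperture ** -2 * np.eye(d))
```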
Experimental Setup
Function steering experiments use GPT-J (6B) and GPT-NeoX (20B) on six tasks from the function vector dataset: antonyms, present-past, English-French, singular-plural, country-capital, and capitalize. For each task, 100 ICL prompts with 10 input-output pairs each are compiled to extract activations. Performance is measured as top-1 accuracy on 1000 test input-output pairs, with each experiment repeated five times across different random seeds. Hyperparameters are optimized via grid search: the aperture $\alpha$ and rescaling coefficient $\beta$ for conceptor steering, and the injection coefficient $\beta$ for additive steering. Both methods are also tested with mean-centering, which subtracts the mean activation computed over a general-purpose text dataset to remove anisotropic bias from the activation space.
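A sketch of mean-centering for both methods; the additive variant follows the cited mean-centring paper (task mean minus general-corpus mean), while the conceptor variant shown here is one plausible reading, not verified against the paper's code:

```python
import numpy as np

def mean_centered_objects(task_acts: np.ndarray, general_acts: np.ndarray,
                          aperture: float):
    """Mean-centering: remove the anisotropic bias of the activation space.

    task_acts:    (n_task, d) last-token activations from the ICL prompts
    general_acts: (n_gen, d) activations cached on general-purpose text
    """
    mu = general_acts.mean(axis=0)
    # Additive variant: steering vector = task mean minus general-corpus mean.
    v_mc = task_acts.mean(axis=0) - mu
    # Conceptor variant (assumed): build the conceptor on centered activations,
    # reusing the compute_conceptor helper sketched above.
    C_mc = compute_conceptor(task_acts - mu, aperture)
    return v_mc, C_mc
```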
Composite function experiments combine pairs of individual functions (e.g., English-French AND antonyms) on GPT-J, comparing the AND-combined conceptor against the arithmetic mean of individual steering vectors, with both evaluated against baselines computed directly on the composite function.
Result
On single-function steering, conceptor-based steering outperforms additive steering on all six tasks for both GPT-J and GPT-NeoX. The gains are largest on harder tasks: on GPT-J, antonym accuracy goes from 20.54% (addition) to 52.14% (conceptor), country-capital from 32.04% to 81.62%, and English-French from 18.88% to 59.02%. Easier tasks like capitalize show smaller absolute gains (93.16% to 96.68%). Consistent with prior findings, steering is most effective at middle layers (9-16 for GPT-J, 10-30 for GPT-NeoX).
Mean-centering substantially improves additive steering (up to 2x on country-capital) but provides only marginal improvements for conceptor steering (at most 5% on country-capital). Notably, conceptor steering without mean-centering already exceeds additive steering with mean-centering on every task.
On composite functions, the AND-combined conceptor outperforms the mean-combined steering vectors on all three tested task combinations. On the English-French and antonyms combination, the AND-combined conceptor even surpasses the additive baseline that was computed directly on the composite task.
The hyperparameter sweep reveals that conceptor steering is relatively robust: a single setting of the aperture $\alpha$ and rescaling coefficient $\beta$ is optimal or near-optimal across all tasks on GPT-J, and similar stability holds for GPT-NeoX with minor task-specific variation.
Related
- Function Vectors in Large Language Models - Establishes the function vector framework and dataset that this paper directly builds upon for its steering experiments.
- Activation Addition: Steering Language Models without Optimization - Introduces addition-based activation steering as an inference-time intervention method, the primary baseline in this paper.
- Improving Activation Steering in Language Models with Mean-Centring - Proposes the mean-centering technique evaluated alongside conceptor steering in this paper.
- Controlling Recurrent Neural Networks by Conceptors - The original theoretical framework for conceptors and their Boolean algebra, from which the steering matrices and combination operations are derived.
- Steering Llama 2 via Contrastive Activation Addition - Extends activation addition to behavioral steering via contrastive pairs, representing the kind of higher-level steering target where conceptors could be tested next.