[Literature Review] Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior

This review is intended for my personal learning

Paper Info

arXiv: 2605.05115
Title: Manifold Steering Reveals the Shared Geometry of Neural Network Representation and Behavior
Authors: Daniel Wurgaft, Can Rager, Matthew Kowal, Vasudev Shyam, Sheridan Feucht, Usha Bhalla, Tal Haklay, Eric Bigelow, Raphael Sarfati, Thomas McGrath, Owen Lewis, Jack Merullo, Noah D. Goodman, Thomas Fel, Atticus Geiger, Ekdeep Singh Lubana
Code: goodfire-ai/causalab

This paper is a part of Goodfire's Neural Geometry Series, which is a very inspirational series of works to keep an eye on.

Prior Knowledge

Activation steering refers to a family of inference-time interventions that modify a model's behavior by directly editing its hidden states. The most common protocol is linear steering, which adds a scalar-weighted "steering vector" to activations at a chosen layer. Linear steering is motivated by the linear representation hypothesis (LRH), which posits that concepts are encoded along single approximately orthogonal directions in activation space. In practice, linear steering often produces brittle, incoherent, or off-target outputs, suggesting that the assumed Euclidean geometry of activation space may be wrong.
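As a concrete picture of the linear protocol described above, here is a minimal NumPy sketch. The array shapes, the random "hidden states", and the `linear_steer` helper are illustrative assumptions, not code from the paper or from any particular library.

```python
import numpy as np

def linear_steer(hidden, steering_vector, alpha):
    """Add a scalar-weighted steering vector to every token's hidden state."""
    return hidden + alpha * steering_vector

rng = np.random.default_rng(0)
hidden = rng.normal(size=(5, 16))        # toy (tokens, d_model) activations
direction = rng.normal(size=16)
direction /= np.linalg.norm(direction)   # unit-norm steering direction

steered = linear_steer(hidden, direction, alpha=4.0)

# Every token moves by exactly alpha along the steering direction,
# regardless of where it started in activation space.
shift = (steered - hidden) @ direction
```

The point of the sketch is that the edit is the same straight-line displacement everywhere, which is exactly the Euclidean assumption the paper questions.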

A parallel line of work documents that internal representations of structured concepts often lie on curved low-dimensional manifolds rather than along lines. Days of the week and months of the year arrange themselves into approximately circular structures, sequential concepts like ages into open curves, and graph-structured tokens learned in context into manifolds that mirror the latent graph. This paper sits at the intersection of these two threads, asking whether the curvature observed in representations is causally relevant to behavior, and whether interventions should be performed along the curved manifold rather than along straight lines.

Main Question

Does the geometric structure of activation manifolds causally constrain a model's output behavior, and if so, can steering that respects this geometry produce more natural behavioral trajectories than the standard linear approach?

Key Claims

For tasks with structured concept domains, the activation manifold fit to internal representations and the behavior manifold fit to output probability distributions are approximately related by a scaled isometry. Pairwise geodesic distances along the two manifolds correlate strongly, while Euclidean distances do not.
Steering along geodesics of the activation manifold ("manifold steering") induces behavioral trajectories that closely follow the behavior manifold, with smooth and ordered transitions between adjacent concepts. Linear steering instead cuts through off-manifold regions and produces "teleportation" between non-adjacent concepts.
Optimizing interventions to produce behavioral trajectories along the behavior manifold ("pullback steering") recovers activation paths that trace the curvature of the activation manifold. This bidirectional correspondence supports the view that the two manifolds are alternate images of the same underlying conceptual geometry.
For multi-dimensional conceptual spaces learned in context, manifold steering affords factored control: steering along one intrinsic coordinate of the manifold modulates one dimension of the conceptual space while leaving the others intact.
The same principles transfer to a visual world model, where steering along the activation manifold of a recurrent encoder produces smooth physical movement of a simulated agent, while linear steering produces incoherent superpositions of positional beliefs.

Method

The framework has two halves: fitting manifolds in both activation and behavior space, then comparing different steering strategies that interpolate between two points under different geometric assumptions.

Setup and Manifold Fitting

For each task, the conceptual domain (e.g., days of the week) defines a set of possible outputs. Prompts are templated additions such as "What day is k days after z?". For each input prompt, the authors take the activation at a chosen layer and the output distribution restricted to tokens in the conceptual domain plus an "other" class. Activations and output distributions are averaged across all prompts sharing the same correct answer to yield concept centroids.

The activation manifold is obtained by reducing activations to 64 dimensions via PCA and fitting a cubic spline through the centroids. The behavior manifold is fit on the probability simplex after mapping each centroid into Hellinger space via the elementwise square root $p \mapsto \sqrt{p}$, which sends the simplex onto the positive orthant of the unit sphere so that the Hellinger distance between two distributions, $d_H(p, q) \propto \lVert \sqrt{p} - \sqrt{q} \rVert_2$, is an ordinary Euclidean distance. Splines are then fit on the sphere and decoded points are squared back to recover valid distributions.
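The square-root map and its inverse are easy to check numerically. The sketch below assumes the standard Hellinger convention with a 1/sqrt(2) factor (the paper may drop the constant), and the renormalization in `from_hellinger` is an extra safeguard added here for points that are not exactly on the unit sphere.

```python
import numpy as np

def to_hellinger(p):
    """Elementwise square root: maps a distribution onto the unit sphere."""
    return np.sqrt(p)

def from_hellinger(s):
    """Square back to probabilities; renormalize in case s is off the sphere."""
    q = s ** 2
    return q / q.sum(axis=-1, keepdims=True)

def hellinger_distance(p, q):
    """Standard Hellinger distance (assumed convention: 1/sqrt(2) factor)."""
    return np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.7])

# Square-root coordinates have unit Euclidean norm because sum(p) == 1.
norm_p = np.linalg.norm(to_hellinger(p))

# A chord midpoint in Hellinger space decodes to a valid distribution.
midpoint = 0.5 * (to_hellinger(p) + to_hellinger(q))
decoded = from_hellinger(midpoint)
```

With this convention the distance is 0 for identical distributions and 1 for distributions with disjoint support.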

Isometry Test

For every pair of centroids, the authors compute the geodesic distance along the activation manifold (cumulative Euclidean distance along the spline) and along the behavior manifold (cumulative Hellinger distance). The Pearson correlation between these two pairwise distance matrices quantifies the degree of isometry. As a baseline, they also compare Euclidean (straight-line) activation distances to behavior-space geodesics.
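A toy version of this test can be sketched as follows. Everything here is an assumption made for illustration: the seven "weekday" centroids sit on a perfect circle, the output distributions put fixed mass on the target and its two neighbors, and a closed polyline through the centroids stands in for the fitted spline.

```python
import numpy as np

def cyclic_geodesics(points, dist):
    """Pairwise shortest-arc distances along a closed polyline through points."""
    n = len(points)
    steps = np.array([dist(points[i], points[(i + 1) % n]) for i in range(n)])
    cum = np.concatenate([[0.0], np.cumsum(steps)])  # cum[i]: length from 0 to i
    total = cum[-1]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            arc = abs(cum[j] - cum[i])
            D[i, j] = min(arc, total - arc)          # go around either way
    return D

euclid = lambda a, b: np.linalg.norm(a - b)
hellinger = lambda p, q: np.linalg.norm(np.sqrt(p) - np.sqrt(q)) / np.sqrt(2.0)

n = 7
angles = 2 * np.pi * np.arange(n) / n
acts = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # toy activation centroids

probs = np.zeros((n, n + 1))                               # last column: "other"
for k in range(n):
    probs[k, k] = 0.60
    probs[k, (k - 1) % n] = 0.15
    probs[k, (k + 1) % n] = 0.15
    probs[k, n] = 0.10

D_act = cyclic_geodesics(acts, euclid)
D_beh = cyclic_geodesics(probs, hellinger)
D_euc = np.array([[euclid(a, b) for b in acts] for a in acts])

iu = np.triu_indices(n, k=1)                               # each pair once
r_geodesic = np.corrcoef(D_act[iu], D_beh[iu])[0, 1]
r_euclidean = np.corrcoef(D_euc[iu], D_beh[iu])[0, 1]
```

On such a symmetric toy the Euclidean baseline is also high, since chords of a circle grow monotonically with arc length. The gap the paper reports comes from real manifolds that bend and fold in ways a circle does not.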

Steering as Geodesic Selection

The basic intervention replaces the activation at the chosen layer with a steered value and continues the forward pass. A steering path connects a start activation to an end activation and, through the rest of the forward pass, produces a behavioral trajectory of output distributions. Three strategies are compared: linear steering, which interpolates along the straight line between the two endpoints; manifold steering; and pullback steering.

Manifold steering interpolates in the intrinsic coordinates of the activation manifold and maps the result back through the spline parameterization of that manifold, so the steering path stays on the activation manifold. Pullback steering is defined implicitly: given a target path on the behavior manifold, the corresponding activation path is found by optimization so that its induced behavioral trajectory matches the target.
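The difference between a straight-line path and a path that interpolates in the spline's intrinsic coordinate can be seen on a toy circle. This sketch assumes seven centroids on a unit circle and uses SciPy's periodic `CubicSpline` as a stand-in for the paper's fitted manifold; the helper names are hypothetical, and the intrinsic coordinate is interpolated directly without choosing the shorter way around the circle.

```python
import numpy as np
from scipy.interpolate import CubicSpline

n = 7
angles = 2 * np.pi * np.arange(n) / n
centroids = np.stack([np.cos(angles), np.sin(angles)], axis=1)  # (7, 2)

# Periodic spline: the first point is repeated so the curve closes.
knots = np.arange(n + 1, dtype=float)
closed = np.vstack([centroids, centroids[:1]])
spline = CubicSpline(knots, closed, bc_type="periodic")

def linear_path(h_start, h_end, ts):
    """Straight-line interpolation in activation space."""
    return (1 - ts)[:, None] * h_start + ts[:, None] * h_end

def manifold_path(u_start, u_end, ts):
    """Interpolate the intrinsic coordinate, then map back through the spline."""
    return spline((1 - ts) * u_start + ts * u_end)

ts = np.linspace(0.0, 1.0, 21)
lin = linear_path(centroids[0], centroids[3], ts)  # concept 0 -> concept 3
man = manifold_path(0.0, 3.0, ts)

# Distance from the origin: 1.0 means "on the circle".
lin_radius = np.linalg.norm(lin, axis=1)
man_radius = np.linalg.norm(man, axis=1)
```

In this toy the linear path dips well inside the circle at its midpoint, while the spline path stays close to radius one and passes the intermediate centroids in order.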

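These three strategies are unified by viewing each as a geodesic under a different Riemannian metric on activation space: linear steering under the Euclidean metric, manifold steering under a density metric built from an energy function on activations, and pullback steering under the pullback, through the activation-to-behavior map, of the Hellinger metric on output distributions. The density metric inflates distances in off-manifold regions, so its geodesics hug the activation manifold.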
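For the pullback case, the construction can be written out explicitly. The notation below is chosen for this review and is not taken from the paper: $f$ is the activation-to-behavior map, $s$ its elementwise square root, $J_s$ the Jacobian of $s$, and $\gamma$ a path in activation space. The Hellinger metric is treated as the Euclidean metric in square-root coordinates, with constant factors dropped.

```latex
s(h) = \sqrt{f(h)}, \qquad
g_{\text{pull}}(h) = J_s(h)^{\top} J_s(h), \qquad
L[\gamma]
  = \int_0^1 \sqrt{\dot\gamma(t)^{\top}\, g_{\text{pull}}(\gamma(t))\, \dot\gamma(t)}\; dt
  = \int_0^1 \left\lVert \frac{d}{dt}\, s(\gamma(t)) \right\rVert_2 dt
```

Under these assumptions the length of an activation path is the Hellinger length of the behavioral trajectory it induces, so short paths under the pullback metric are those whose outputs change as little as possible.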

Naturalness Measure

To quantify how natural a steered behavioral trajectory is, the authors define a cumulative energy that accumulates, over the steering path, the Bhattacharyya distance from each steered output distribution to the nearest point on the behavior manifold. Low energy means the trajectory stays close to the natural output manifold throughout.
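A rough sketch of such an energy is below. It is an assumption-laden toy: the energy is taken to be a plain sum over waypoints (the paper's exact weighting is not reproduced here), the behavior manifold is a dense set of samples between four made-up centroids, and the nearest point is found by brute-force search.

```python
import numpy as np

def bhattacharyya(p, q, eps=1e-12):
    """Bhattacharyya distance -ln(sum(sqrt(p * q))); q may be a batch."""
    bc = np.sum(np.sqrt(p * q), axis=-1)
    return -np.log(np.clip(bc, eps, 1.0))

def cumulative_energy(trajectory, manifold_samples):
    """Sum over waypoints of the distance to the nearest manifold sample."""
    return sum(bhattacharyya(p, manifold_samples).min() for p in trajectory)

# Toy centroids: 4 concepts plus an "other" class, 0.8 mass on the target.
centroids = np.full((4, 5), 0.05)
centroids[np.arange(4), np.arange(4)] = 0.8

# Toy behavior manifold: interpolate consecutive centroids in sqrt space.
samples = []
for a, b in zip(centroids[:-1], centroids[1:]):
    for t in np.linspace(0.0, 1.0, 50):
        s = (1 - t) * np.sqrt(a) + t * np.sqrt(b)
        samples.append(s ** 2 / np.sum(s ** 2))
samples = np.array(samples)

on_manifold = samples[::10]                        # trajectory along the manifold
ts = np.linspace(0.0, 1.0, 11)[:, None]
off_manifold = (1 - ts) * centroids[0] + ts * centroids[3]  # jumps 0 -> 3

energy_on = cumulative_energy(on_manifold, samples)
energy_off = cumulative_energy(off_manifold, samples)
```

The on-manifold trajectory scores essentially zero, while the direct mix of the first and last concepts accumulates energy because its intermediate distributions put mass on two non-adjacent concepts at once.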

Tasks and Models

Language model experiments use Llama 3.1 8B with layer-28 activations on four concept tasks, two cyclic (weekdays, months) and two sequential (letters, ages), plus two-dimensional graph structures learned in context following the Park et al. graph-tracing protocol (5x5 grid in the main text, 9x9 cylinder in the appendix). For the in-context tasks, two-dimensional thin plate splines (TPS) are fit through node centroids instead of one-dimensional cubic splines. The visual world model is a recurrent encoder-decoder (a convolutional encoder feeding a GRU, plus a convolutional decoder producing residual frames) trained on the Mountain Car environment, where the conceptual coordinate is the car's continuous position. For Mountain Car, the behavior manifold is constructed by binning the position range, fitting a smooth spline through bin means, and defining a temperature-controlled distribution over bins.

Result

Isometry Between Activation and Behavior Manifolds

The circular structure of activation-space representations for days and months was already known from prior work, but the same structure appearing in the behavior manifold over output distributions is a novel discovery of this paper. It arises from the fact that output distributions place most mass on the target concept with the remainder concentrated on neighboring concepts, so adjacent concepts have similar output distributions and end up adjacent in Hellinger space.

Across all four language tasks (weekdays, months, letters, ages), geodesic distances on the activation and behavior manifolds correlate strongly. The same correlation using Euclidean activation distances is markedly lower. Multidimensional scaling embeddings recover the cyclic or sequential structure for manifold paths but produce warped or scrambled layouts for linear paths.

Manifold Steering Follows the Behavior Manifold

Manifold steering produces smooth output transitions where probability mass shifts steadily across adjacent concepts (Tuesday to Wednesday to Thursday), while linear steering exhibits "teleportation" between non-adjacent concepts and sometimes spikes probability on unrelated tokens. The cumulative energy is substantially lower for manifold steering: 0.34 vs. 0.93 (weekdays), 0.36 vs. 1.09 (months), 2.42 vs. 6.95 (letters), 5.21 vs. 13.49 (ages). On each task the linear energy is roughly 2.6 to 3.0 times the manifold energy, and all comparisons are reported as statistically significant.
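The per-task ratios implied by these energies can be checked with simple arithmetic. The numbers are the ones quoted above; the mean of ratios computed here is just one way to summarize them and is not necessarily the summary statistic the paper reports.

```python
# (manifold energy, linear energy) per task, as quoted in the review.
energies = {
    "weekdays": (0.34, 0.93),
    "months": (0.36, 1.09),
    "letters": (2.42, 6.95),
    "ages": (5.21, 13.49),
}

# How many times larger the linear-steering energy is on each task.
ratios = {task: linear / manifold for task, (manifold, linear) in energies.items()}
mean_ratio = sum(ratios.values()) / len(ratios)

for task, ratio in ratios.items():
    print(f"{task}: linear energy is {ratio:.2f}x the manifold energy")
print(f"mean ratio: {mean_ratio:.2f}")
```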

Pullback Recovers the Activation Manifold

The pullback optimization is performed via L-BFGS within the first 32 dimensions of the 64-dimensional PCA subspace used to fit the activation manifold. To quantify how well the optimized pullback path recovers the manifold-steering path, the authors compute an intrinsic goodness-of-fit score: both paths are projected into the subspace spanned by singular directions explaining 99% of the variance in the manifold-steering path, and the score is measured using orthogonal closest-point residuals in that subspace. Pullback paths recover the activation manifold much better than linear baselines do: scores of 0.77, 0.75, 0.78, 0.47 (weekdays, months, letters, ages) versus 0.42, 0.32, 0.23, 0.24 for linear paths, with all comparisons reported as statistically significant.
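The idea of recovering an activation path from a target behavioral path can be illustrated on a very small problem. This is a toy sketch, not the paper's setup: the "model" is a fixed softmax readout `behavior` from a 2-D activation to three classes, the matrix `W` is made up, the loss is a squared Hellinger-style mismatch, and SciPy's `L-BFGS-B` with finite-difference gradients stands in for the L-BFGS optimizer mentioned above.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical readout: three class directions spaced 120 degrees apart.
W = 2.0 * np.array([[1.0, 0.0], [-0.5, 0.866], [-0.5, -0.866]])

def behavior(h):
    """Toy activation-to-behavior map: softmax over linear logits."""
    logits = h @ W.T
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

# Ground-truth activation path: an arc on the unit circle.
thetas = np.linspace(0.0, 2 * np.pi / 3, 6)
true_path = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
target = behavior(true_path)              # target behavioral trajectory

def loss(flat):
    """Squared distance in sqrt (Hellinger) coordinates, summed over waypoints."""
    path = flat.reshape(-1, 2)
    diff = np.sqrt(behavior(path)) - np.sqrt(target)
    return np.sum(diff ** 2)

# Initialize with the straight line between the endpoints.
ts = np.linspace(0.0, 1.0, len(thetas))[:, None]
init = (1 - ts) * true_path[0] + ts * true_path[-1]

result = minimize(loss, init.ravel(), method="L-BFGS-B")
recovered = result.x.reshape(-1, 2)
max_error = np.abs(recovered - true_path).max()
```

Because this toy readout is invertible, matching the outputs bends the straight-line initialization back onto the arc. In a real model the map is not invertible, which is presumably why the optimization is restricted to a low-dimensional PCA subspace.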

Factored Control in In-Context Learning

For both the 5x5 grid and the 9x9 cylinder graphs learned in context, geodesic distances on the activation and behavior manifolds correlate strongly. Linear-path correlations are substantially lower for both the grid and the cylinder. Steering along one intrinsic coordinate of the fitted two-dimensional activation manifold moves the random-walk position along one dimension of the grid while leaving the other unchanged, recovering factored control. Linear steering continues to exhibit teleportation between non-adjacent grid positions.

Visual World Model

On Mountain Car, the authors steer through waypoints between the encoder states corresponding to two car positions. Manifold steering produces decoded frames showing smooth, coherent movement of the car between the positions. Linear steering produces blurred or doubled car images at intermediate waypoints, reflecting an incoherent superposition of positional beliefs as the path leaves the activation manifold, followed by a "teleportation" to the endpoint. The Pearson correlation between manifold-space activation distances and behavior-space arc length is high, while the linear-distance correlation collapses because the activation manifold folds back on itself in the encoder's ambient space, so points whose underlying positions are far apart can sit arbitrarily close in ambient space.

Notably, both the activation manifold and the behavior manifold trace closed curves rather than open arcs in their respective ambient spaces. This is because the visually distinctive states at the wall and the goal appear similar to the encoder and are mapped to neighboring activations, illustrating that the activation manifold reflects perceptual similarity in the encoder rather than the literal one-dimensional position coordinate.

Related Work

Not All Language Model Features Are One-Dimensionally Linear - Engels et al.'s discovery of circular representations for days and months, which establishes the existence of curved low-dimensional structure in activation space that this paper exploits and intervenes on.
ICLR: In-Context Learning of Representations - Park et al.'s graph-tracing protocol, used here to construct in-context tasks with synthetic two-dimensional concept geometries and demonstrate factored control via manifold steering.
Language Models Use Trigonometry to Do Addition - Kantamneni and Tegmark's helix representation of numbers, cited in the paper as the closest precedent for geometry-respecting intervention; this work generalizes that protocol to arbitrary fitted manifolds and provides a theoretical framework relating it to behavioral geometry.