[Literature Review] Mechanistic Interpretability for Steering Vision-Language-Action Models

This review is intended for my personal learning.

Paper Info

arXiv: 2509.00328
Title: Mechanistic Interpretability for Steering Vision-Language-Action Models
Authors: Bear Häon, Kaylene Stocking, Ian Chuang, Claire Tomlin


Prior Knowledge

Vision-Language-Action (VLA) models are transformer-based policies that take a task description and image observation as input and produce robot actions as output. They are typically built by fine-tuning a pretrained Vision-Language Model (VLM) on expert robot trajectory data, where a set of rarely used VLM tokens are repurposed as "action tokens" to encode control outputs. Representative VLAs include RT-2, OpenVLA, and pi0-FAST.

A key interpretability technique for transformers, introduced by Geva et al. in Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space, treats the rows of the FFN output weight matrix as "value vectors." Because these vectors live in the same space as the model's output embeddings, each one can be projected onto the token vocabulary to reveal which tokens it promotes. This allows semantic interpretation of individual FFN neurons without requiring any additional training data, unlike methods such as sparse autoencoders (SAEs).
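
To make this concrete, here is a minimal sketch of the projection step (my own illustration, not the paper's code; it assumes a HuggingFace-style causal LM whose output embedding matrix is accessible, with value vectors stored as columns of the down_proj weight — the module paths are assumptions):

```python
import torch

def top_promoted_tokens(v, output_embed, tokenizer, k=30):
    """Project one value vector onto the vocabulary and return the k tokens
    it promotes most strongly (the reading from Geva et al.)."""
    scores = output_embed @ v                  # (vocab,) dot product per token
    top_ids = scores.topk(k).indices
    return tokenizer.convert_ids_to_tokens(top_ids.tolist())

# Hypothetical usage (module paths are assumptions):
# E = model.get_output_embeddings().weight                  # (vocab, d_model)
# v = model.model.layers[12].mlp.down_proj.weight[:, 512]   # value vector 512
# print(top_promoted_tokens(v, E, tokenizer))
```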

Main Question

Can mechanistic interpretability techniques developed for LLMs be applied to VLA models to both understand their internal representations and steer their behavior at inference time, without any fine-tuning, reward signals, or environment interaction?

Key Claims

  1. Despite being trained to output only action tokens, VLAs retain rich semantic representations from VLM pretraining throughout their layers, with less than 25% of FFN neurons being rewired for action prediction.
  2. Internal semantic concepts such as "slow" or "up" are causally linked to robot actions that express those concepts, even though the model was never given explicit feedback on the meaning of different action tokens.
  3. Targeted activation-level interventions on semantically meaningful neuron clusters can modulate VLA behavior at inference time in a zero-shot manner, and this is more effective than prompt modification alone.
  4. Task-specific fine-tuning primarily rewires the action token distribution across value vectors rather than destroying semantic structure inherited from VLM pretraining.

Method

The paper follows a two-stage approach: first interpreting the internal representations of VLAs, then leveraging those representations for steering.

Interpreting VLA Value Vectors

The core interpretability tool is the decomposition of the transformer FFN output. For a given FFN layer with input $x$, the output can be written as:

$$\mathrm{FFN}(x) = \sum_{i=1}^{d_{\mathrm{ff}}} m_i(x)\, v_i$$

where $m_i(x)$ are input-dependent activations and $v_i$ are the rows of the output weight matrix (value vectors). Since these value vectors are input-independent and reside in the same linear space as the model's final output, they can be projected onto the token vocabulary to reveal which tokens each neuron promotes. This allows assigning semantic meanings to individual neurons.
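
A toy numerical check of this decomposition (hypothetical dimensions and a plain GELU FFN; gated FFNs like Gemma's decompose the same way, with the gate folded into $m_i(x)$):

```python
import torch
import torch.nn.functional as F

d_model, d_ff = 8, 32
W_in = torch.randn(d_ff, d_model)    # key vectors k_i as rows
W_out = torch.randn(d_ff, d_model)   # value vectors v_i as rows
x = torch.randn(d_model)

m = F.gelu(W_in @ x)                               # activations m_i(x)
ffn_out = W_out.T @ m                              # standard forward pass
decomposed = sum(m[i] * W_out[i] for i in range(d_ff))
assert torch.allclose(ffn_out, decomposed, atol=1e-5)
```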

The authors conduct three interpretability analyses:

  1. They randomly sample 10 value vectors per layer from both the PaliGemma VLM and the pi0 VLA, checking whether at least 4 of the top 30 projected tokens follow a coherent pattern (following the methodology from Geva et al.). This is done to quantify whether VLA training destroys semantic organization.

  2. They measure the proportion of action tokens among the top-100 token projections for value vectors at each layer, to understand where in the network action token reasoning occurs.

  3. They compare value vectors between pi0-FAST and a checkpoint fine-tuned on the DROID dataset using a two-proportion z-test (sketched below), to characterize how task-specific fine-tuning alters the internal representations.
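
For the third analysis, a two-proportion z-test checks whether the fraction of value vectors exhibiting some property (e.g., containing a given action token among their top projections) differs significantly between the two checkpoints. A minimal NumPy/SciPy sketch with hypothetical counts:

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for whether two samples share one underlying
    proportion; returns the z statistic and p-value."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)                 # pooled estimate under H0
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))

# Hypothetical counts: value vectors containing a given action token in the
# base model vs. the DROID-fine-tuned checkpoint.
z, p = two_proportion_ztest(x1=120, n1=2048, x2=310, n2=2048)
print(f"z = {z:.2f}, p = {p:.2e}")
```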

Steering via Activation Override

Given the semantic structure found in value vectors, the authors propose overriding the activations of a selected subset $S$ of neurons with a fixed scalar $\alpha$:

$$\mathrm{FFN}(x) = \sum_{i \notin S} m_i(x)\, v_i \;+\; \alpha \sum_{i \in S} v_i$$

This induces a residual shift that propagates through the transformer and modulates the action token distribution.

The neuron subset is identified by clustering value vectors according to their semantic content. Two approaches are used: manual selection (inspecting top projected tokens for relevant keywords like "fast" or "low") and kNN clustering over semantic embeddings. For the kNN approach, each value vector is assigned a semantic embedding by computing the softmax-weighted average of the output embeddings of its top-5 projected tokens. Cosine-based kNN clustering groups value vectors with similar semantics, and the cluster whose centroid is closest to a target concept embedding (e.g., "up") is selected for intervention.
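
A minimal sketch of the semantic-embedding and cluster-selection steps (a simplified variant of the paper's procedure: it seeds on the neuron closest to the concept and takes its k nearest neighbours rather than running a full clustering pass; tensor shapes and names are assumptions):

```python
import torch
import torch.nn.functional as F

def semantic_embeddings(value_vectors, output_embed, k=5):
    """Each value vector gets the softmax-weighted average of the output
    embeddings of its top-k projected tokens as its semantic embedding."""
    scores = value_vectors @ output_embed.T        # (n_neurons, vocab)
    top_scores, top_ids = scores.topk(k, dim=-1)
    weights = F.softmax(top_scores, dim=-1)        # weight by projection score
    return (weights.unsqueeze(-1) * output_embed[top_ids]).sum(dim=1)

def select_cluster(sem_emb, concept_emb, k_nn=10):
    """Return the k_nn neurons whose semantic embeddings are most cosine-
    similar to the neuron closest to the target concept (e.g., 'up')."""
    sem = F.normalize(sem_emb, dim=-1)
    concept = F.normalize(concept_emb, dim=-1)
    seed = (sem @ concept).argmax()                # neuron nearest the concept
    return (sem @ sem[seed]).topk(k_nn).indices    # its cosine neighbours
```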

In OpenVLA (PyTorch), the intervention is implemented via a forward hook on the FFN's down_proj layer. In pi0 (JAX), the neuron indices and activation coefficient are passed directly into modified FFN code.
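
A minimal PyTorch sketch of such a hook (the module path is an assumption for an OpenVLA-style LLaMA backbone; the paper's exact code may differ). A forward pre-hook on down_proj sees the FFN activations as its input, so overriding selected entries there implements the intervention above:

```python
import torch

def make_override_hook(neuron_idx, alpha):
    """Forward pre-hook that clamps selected FFN activations to alpha
    before they are mixed through down_proj."""
    def hook(module, inputs):
        (acts,) = inputs                   # (batch, seq, d_ff) activations
        acts = acts.clone()
        acts[..., neuron_idx] = alpha      # override the chosen neurons
        return (acts,)
    return hook

# Hypothetical usage:
# mlp = model.language_model.model.layers[20].mlp
# handle = mlp.down_proj.register_forward_pre_hook(
#     make_override_hook(cluster_idx, alpha=10.0))
# ... run inference with steering active ...
# handle.remove()                          # restore unsteered behavior
```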

Experimental Setup

Simulation experiments use the 7B-parameter OpenVLA model fine-tuned on LIBERO-Long, a suite of 10 long-horizon manipulation tasks. Two types of simulation experiments are conducted: steering motion magnitude (fast vs. slow clusters, sweeping over cluster sizes of 10 and 20 and activation strengths from 2 to 20) and temporal localization (injecting up-themed clusters at early, late, or all layers).

Real robot experiments use the 3B-parameter pi0-FAST model on a UR5 arm with two pick-and-place tasks: Low/High Transport (75 episodes, varying transport height) and Slow/Fast Transport (120 episodes, varying speed). The model is fine-tuned with LoRA on small custom datasets. Crucially, the height and speed variations are not labeled in the training prompts, so the model must have internalized these behavioral modes implicitly. Baselines include no intervention, prompt modification (e.g., prepending "low" to the prompt), and random vector interventions.

Result

Interpretability Findings

Value vectors throughout both the VLM and VLA models exhibit identifiable semantic patterns at similar rates, confirming that VLA training does not destroy the semantic organization inherited from pretraining. Action tokens appear in value vectors at every layer of the VLA, not just the final layers, suggesting that the model reasons about control actions continuously rather than in a distinct final stage. After task-specific fine-tuning on DROID, the most significant changes in value vectors are concentrated in action tokens: the fine-tuned model develops a more specialized, uneven distribution of action tokens compared to the relatively flat distribution of the base model, with only a modest (1.2x) increase in tokens related to DROID task instructions.

Simulation Steering

Fast clusters consistently induce larger end-effector displacements compared to slow clusters, with an average improvement of 27.73% and maximum gains of 148.54% in some configurations. All 10 matched comparisons are statistically significant under a paired t-test. For temporal localization, full-layer interventions produce the largest average Y-displacement, followed by late-layer and early-layer interventions. Late-layer interventions can match full-layer effects at higher intensities and larger cluster sizes.

Real Robot Steering

In Low/High Transport, the low intervention produces the lowest overall trajectories. In Slow/Fast Transport, the slow intervention results in the slowest overall movements. However, high and fast interventions resemble the unsteered baseline, which the authors attribute to the baseline trajectory already being implicitly fast and high. Random interventions show minimal difference from the no-intervention baseline, confirming that semantic alignment of the steering vectors matters. Activation steering outperforms prompt modification in both tasks.