[Literature Review] Unveiling Theory of Mind in Large Language Models: A Parallel to Single Neurons in the Human Brain

This review is intended for my personal learning.

Paper Info

arXiv: 2309.01660
Title: Unveiling Theory of Mind in Large Language Models: A Parallel to Single Neurons in the Human Brain
Authors: Mohsen Jamali, Ziv M. Williams, Jing Cai


Prior Knowledge

Theory of Mind (ToM) is the capacity to attribute mental states to others, including the recognition that another agent may hold beliefs that diverge from reality. The standard test is the false-belief task: a subject must understand that a character who missed an environmental change still holds an outdated belief about the world. This capacity typically emerges in children around age four.

At the neural level, ToM is supported by a distributed network including the temporoparietal junction (TPJ), superior temporal sulcus (STS), and dorsomedial prefrontal cortex (dmPFC). A prior study by the same group (Jamali et al., 2021, Nature) recorded single-neuron activity in the human dmPFC and found neurons that selectively modulate their firing rates based on whether the agent in the task holds a true or false belief. This is the direct biological reference point for the present paper.


Main Question

Do hidden embeddings in large language models exhibit belief-type-specific modulations during ToM tasks, and do these modulations parallel single-neuron responses previously documented in the human dmPFC?


Key Claims

  1. A subset of hidden embeddings in larger LLMs responds differentially to true-belief versus false-belief trials, constituting a form of artificial neural specialization for ToM content.
  2. This selectivity scales with model size and correlates with behavioral performance on false-belief tasks.
  3. The spatial distribution of responsive embeddings across layers mirrors the hierarchical organization of ToM brain regions, concentrated in middle and higher layers, absent in early input layers.

Method

Four open-source LLM families were evaluated: Falcon (1B–40B), LLaMA (3B–33B), Pythia (3B–12B), and GPT-2 (medium–XL). Task materials consisted of 76 trials adapted from the same protocol used in the prior human single-neuron recording study. Each trial paired a true-belief and a false-belief scenario with length-matched statements, followed by a fact question and a belief question.

Behavioral Evaluation

For each trial, the statement and question were concatenated and fed into the model. Rather than generating free-form text, the model's output logits over the two candidate answer tokens (e.g., "tree" vs. "ground") were compared directly. The answer with the higher logit was taken as the model's response.
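
A minimal sketch of this readout, assuming a HuggingFace causal LM; the model name, trial text, and candidate answers below are illustrative placeholders, not the paper's exact materials:

```python
# Sketch: logit-comparison behavioral readout (illustrative model and trial).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"  # placeholder; the paper also used Falcon, LLaMA, Pythia
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

statement = "John left the ball in the tree, then the wind blew it to the ground."
question = "Where does John think the ball is? Answer:"
inputs = tokenizer(statement + " " + question, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]  # logits for the next token

# Compare the logits of the two candidate answer tokens and take the larger one.
candidates = [" tree", " ground"]  # leading space so each maps to a single token
cand_ids = [tokenizer(c, add_special_tokens=False).input_ids[0] for c in candidates]
answer = candidates[int(next_token_logits[cand_ids[1]] > next_token_logits[cand_ids[0]])]
print(answer.strip())
```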

Embedding Selectivity

Hidden states were extracted from every transformer layer and restricted to the question tokens, then averaged across token positions. For a model with $L$ layers and embedding dimension $D$, this yields an $N \times D$ matrix per layer, where $N$ is the number of trials.
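
Continuing the sketch above, the extraction step might look like the following; treating the trailing tokens of the prompt as the question span is my assumption about the implementation:

```python
# Sketch: per-layer, question-averaged embeddings for one trial.
import torch

inputs = tokenizer(statement + " " + question, return_tensors="pt")
# Approximate question span; boundary tokenization can shift by a token.
n_question_tokens = len(tokenizer(" " + question, add_special_tokens=False).input_ids)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states: tuple of (1, seq_len, D) tensors, one per layer plus the
# input embedding. Keep only the question tokens, average across positions.
per_layer = [h[0, -n_question_tokens:].mean(dim=0) for h in out.hidden_states[1:]]
trial_embedding = torch.stack(per_layer)  # shape (L, D)
```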

Each embedding dimension was then tested independently using a Mann-Whitney U test, comparing values across true-belief and false-belief trials. The test statistic is

$$U = n_1 n_2 + \frac{n_1(n_1 + 1)}{2} - R_1,$$

where $n_1$ and $n_2$ are the group sizes and $R_1$ is the rank sum for the true-belief group. A dimension was considered selective if $p < 0.05$. The layer with the highest percentage of selective dimensions was taken as the representative value for each model.
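
A sketch of the per-dimension test, assuming `X` holds one layer's question-averaged embeddings ($N$ trials × $D$ dimensions) and `is_false_belief` labels each trial; both names and the threshold placement are mine:

```python
# Sketch: fraction of embedding dimensions selective for trial type.
import numpy as np
from scipy.stats import mannwhitneyu

def selective_fraction(X: np.ndarray, is_false_belief: np.ndarray,
                       alpha: float = 0.05) -> float:
    """Fraction of dimensions whose values differ between true-belief and
    false-belief trials under an independent Mann-Whitney U test."""
    true_vals = X[~is_false_belief]   # (n1, D)
    false_vals = X[is_false_belief]   # (n2, D)
    pvals = np.array([
        mannwhitneyu(true_vals[:, d], false_vals[:, d]).pvalue
        for d in range(X.shape[1])
    ])
    return float((pvals < alpha).mean())
```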

Population Decoding

To test whether trial type could be read out from the full embedding population, a logistic regression classifier with L2 regularization was trained on the embedding vectors from each layer:

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^{\top}\mathbf{x} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}},$$

where $y \in \{0, 1\}$ is the trial type label and $\mathbf{x}$ is the trial's embedding vector; the weights $\mathbf{w}$ carry the L2 penalty. A 75/25 pair-preserving train-test split was repeated 100 times and accuracy was averaged across iterations and layers.
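
A sketch of the decoding loop; the assumption that a trial's true- and false-belief versions occupy adjacent rows is mine, made to keep the split pair-preserving:

```python
# Sketch: pair-preserving 75/25 decoding with L2 logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression

def decode_accuracy(X: np.ndarray, y: np.ndarray,
                    n_iters: int = 100, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    n_pairs = X.shape[0] // 2  # rows 2i and 2i+1 assumed to form one pair
    accs = []
    for _ in range(n_iters):
        # Hold out whole pairs so the two versions of a trial never straddle
        # the train/test boundary.
        test_pairs = rng.choice(n_pairs, size=n_pairs // 4, replace=False)
        test_idx = np.concatenate([2 * test_pairs, 2 * test_pairs + 1])
        train_idx = np.setdiff1d(np.arange(X.shape[0]), test_idx)
        clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000)
        clf.fit(X[train_idx], y[train_idx])
        accs.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(accs))
```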

Two controls were applied throughout: randomly permuted statement words (preserving questions), and questions presented without any preceding statement.
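
The first control is simple to sketch; the shuffling details (word-level, seeded) are my assumption:

```python
# Sketch: word-permutation control that scrambles the statement only.
import random

def permute_statement(statement: str, question: str, seed: int = 0) -> str:
    words = statement.split()
    random.Random(seed).shuffle(words)  # destroy statement content, keep words
    return " ".join(words) + " " + question  # question left intact
```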


Result

On behavioral evaluation, all models performed well on true-belief and factual questions, but false-belief accuracy was strongly size-dependent. Models above 12B parameters averaged 68% accuracy (LLaMA-33B reaching 69%), while smaller models performed at or below chance. Control conditions confirmed the advantage was driven by statement content, not word frequency or question structure.

At the embedding level, large models (>12B) had an average of 3.9% of embedding dimensions showing significant selectivity for trial type, versus 0.6% in small models. This proportion followed an exponential relationship with false-belief accuracy:

$$y = a\,e^{b x},$$

with fitted parameters $a$ and $b$, where $x$ is false-belief accuracy and $y$ is the percentage of selective dimensions. Responsive embeddings clustered in middle and higher layers, with virtually none in the earliest layers, a pattern consistent across model families. Falcon-40B showed the highest concentration, with 6.3% of dimensions selective at layer 25.
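
For concreteness, a fit of this form could be obtained as below; the data arrays are placeholders, not the paper's numbers:

```python
# Sketch: exponential fit of selective-dimension percentage vs. accuracy.
import numpy as np
from scipy.optimize import curve_fit

accuracy = np.array([0.40, 0.50, 0.60, 0.68, 0.69])   # placeholder x values
selective_pct = np.array([0.5, 0.8, 1.5, 3.5, 4.0])   # placeholder y values

def exp_model(x, a, b):
    return a * np.exp(b * x)

(a_hat, b_hat), _ = curve_fit(exp_model, accuracy, selective_pct, p0=(0.1, 5.0))
```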

Trial type was decodable from the full embedding population at an average accuracy of 75% for large models, and 81% for Falcon-40B. Embeddings from correctly decoded trials showed significantly greater between-condition differentiation than those from incorrectly decoded trials (mean z-scored difference 0.60 vs. 0.25), linking individual embedding selectivity to model-level ToM performance. The word-permutation control dropped decoding accuracy to approximately 55% across all models.


Thoughts

The correlation between embedding selectivity and behavioral accuracy is suggestive, but the causal direction remains untested. A natural next step would be activation patching: if selectively suppressing the ToM-responsive embeddings identified here degrades false-belief performance while leaving true-belief and factual performance intact, that would provide much stronger evidence that these embeddings are mechanistically involved rather than merely correlated. This kind of intervention would also help distinguish whether the representations are belief-specific or just proxies for general contextual integration.
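
A hypothetical sketch of that intervention using a PyTorch forward hook on a GPT-2-style model; `layer_idx` and `selective_dims` are illustrative values that would in practice come from the Mann-Whitney analysis, and this is my proposal, not an experiment from the paper:

```python
# Sketch: ablate ToM-selective dimensions at one layer, then re-run evaluation.
import torch

def make_suppression_hook(selective_dims):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[..., selective_dims] = 0.0  # zero out the selective dimensions
        return output
    return hook

layer_idx, selective_dims = 25, [3, 17, 42]  # illustrative values only
handle = model.transformer.h[layer_idx].register_forward_hook(
    make_suppression_hook(selective_dims)
)
# ... re-run the false-belief, true-belief, and factual evaluations here ...
handle.remove()
```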

The task design raises a deeper question worth pursuing separately. Given that Ullman (2023) showed LLMs fail on trivially altered versions of false-belief tasks, it would be informative to run the same embedding analysis on those adversarial variants. If the ToM-selective embeddings disappear or lose their predictive power under surface perturbations that preserve the logical structure of the task, that would suggest the representations are tied to narrative surface patterns rather than genuine belief tracking, which has significant implications for how we interpret the biological parallel the paper draws.


References

  1. Jamali et al. (2021), "Single-neuronal predictions of others' beliefs in humans," Nature — the direct biological reference for this paper; establishes dmPFC single-neuron selectivity for true vs. false beliefs in humans.
  2. Kosinski (2023), "Theory of mind may have spontaneously emerged in large language models" — the behavioral ToM benchmark results that motivate this mechanistic investigation.
  3. Ullman (2023), "Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks" — raises the key concern that LLM ToM performance may be driven by surface patterns rather than genuine belief reasoning.