[Literature Review] Condition-Dependent Brain-Model Alignment between Whisper and the Human Speech Network under Acoustic Degradation

This review is intended for my personal learning.

Paper Info

Venue: NeurIPS 2025 Workshop: CogInterp
OpenReview: S7r9LaYvdT
Title: Acoustic Degradation Reweights Cortical and ASR Processing: A Brain-Model Alignment Study
Authors: Francis Pingfan Chien, Chia-Chun Dan Hsu, Po-Jang Hsieh, Yu Tsao

Venue: NeurIPS 2025 Workshop: UniReps
OpenReview: uTpW9eTBUy
Title: Condition-Dependent Representational Alignment between Whisper and the Human Speech Network
Authors: Chia-Chun Dan Hsu, Francis Pingfan Chien, Rong Chao, Ching Chih Sung, Yu-Te Wang, Po-Jang Hsieh, Yu Tsao

This review covers both papers together, since they report the same fMRI experiment and the same Whisper-Tiny alignment analysis. The CogInterp paper emphasizes the behavioral and processing-level account, while the UniReps paper gives more space to the representational and anatomical interpretation.

Prior Knowledge

Brain score was introduced in the vision domain by "Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like?" and extended to language by "The neural architecture of language: Integrative modeling converges on predictive processing". It quantifies how well a model's internal representations predict measured brain responses. In the form used here, it is the Pearson correlation between predicted and actual BOLD responses on held-out trials, divided by the ROI's noise ceiling, so that a value of 1 indicates the upper bound permitted by measurement reliability.

Listening-effort research, summarized by Peelle in "Listening effort: How the cognitive consequences of acoustic challenge are reflected in brain and behavior", shows that when speech is degraded, comprehension drops and cortical resources are reallocated toward frontal regions associated with attention and control. A related but conceptually distinct account from predictive coding holds that the brain weights its computations by the reliability (precision) of sensory evidence: when input reliability falls, top-down predictions are downweighted and processing relies more on bottom-up acoustic features. The two papers reviewed here invoke both frameworks.

The Fedorenko language network, defined functionally per subject with a language localizer, comprises bilateral anterior temporal, posterior temporal, angular gyrus, inferior frontal gyrus (IFG; triangular and orbital), and middle frontal gyrus (MFG) regions. The MFG overlaps with the multiple-demand network, a domain-general control system. Subject-specific functional ROIs are generally considered more reliable than group-level anatomical regions for language encoding analyses, because the location of language-responsive cortex varies across individuals.

Whisper is an encoder-decoder Transformer ASR model. Its encoder reads a log-Mel spectrogram and produces context-rich acoustic representations at roughly 20 ms resolution. Its decoder is auto-regressive and predicts the next token conditioned on the encoder output and prior tokens, so its pre-terminal hidden states tend to summarize sentence-level information.

Main Question

Does acoustic degradation systematically alter the layer-wise correspondence between a Transformer-based ASR model and the human cortex, and does that shift mirror the precision-weighted reweighting proposed by predictive-coding accounts while also tracking declines in intelligibility, perceived quality, and comprehension?

Key Claims

Under clean speech, Whisper-Tiny alignment peaks in frontal language regions: middle and late encoder layers align maximally with the right IFG, and a middle decoder layer aligns maximally with the left MFG, consistent with hierarchical predictive processing.
Under noisy speech, alignment shifts toward early encoder layers in the right Heschl's gyrus and toward the right IFG pars orbitalis at late encoder layers, while decoder peaks weaken and become spatially diffuse.
The shift is statistically significant in three specific layer-ROI pairs: the right IFG at encoder layers 3 and 4, and the left MFG at decoder layer 2. It co-occurs with significant behavioral declines in intelligibility, perceived quality, and comprehension, linking the neural reweighting to perceptual outcomes.

Method

Participants, Stimuli, and Behavior

Twenty-five healthy native Mandarin speakers with normal hearing listened to 24 clean and 24 noisy Mandarin sentences (10 words each, about 3 seconds) inside the scanner. Noisy trials were created by mixing each sentence with stationary speech-shaped noise at dB SNR. After each sentence, participants reported intelligibility and perceived quality on a 5-point MOS scale and answered a two-option comprehension question.

fMRI Acquisition and ROIs

Functional images were collected on a 3T Siemens Skyra (gradient-echo EPI, TR = 2000 ms, TE = 24 ms, voxel mm). Preprocessing in SPM12 included slice-timing correction, motion correction, normalization to MNI space, and 8 mm FWHM smoothing. First-level GLMs estimated separate boxcar regressors for clean and noisy conditions convolved with the HRF, with six motion parameters as nuisance covariates. Twelve ROIs were used: bilateral subject-specific Fedorenko language fROIs (AntTemp, PostTemp, Angular, IFG triangular, IFG orbital, MFG) plus bilateral Heschl's gyri from the AAL atlas via MarsBaR.
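To make "boxcar regressors convolved with the HRF" concrete, here is a minimal sketch of how one condition regressor could be built on the scan grid. It is not the SPM12 implementation: the double-gamma shape parameters (6 and 16, undershoot ratio 1/6) are the commonly quoted canonical values and are an assumption here, as are the onset at scan 5, the two-scan boxcar standing in for a roughly 3-second sentence, and the helper names.

```python
import math

TR = 2.0  # seconds per scan, from the acquisition parameters above

def double_gamma_hrf(t, peak_shape=6.0, under_shape=16.0, ratio=1 / 6):
    """Double-gamma HRF at time t (seconds); parameter values are assumed."""
    if t <= 0:
        return 0.0
    peak = t ** (peak_shape - 1) * math.exp(-t) / math.gamma(peak_shape)
    under = t ** (under_shape - 1) * math.exp(-t) / math.gamma(under_shape)
    return peak - ratio * under

def convolve(signal, kernel):
    """Causal discrete convolution, truncated to the length of `signal`."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        for k in range(min(n + 1, len(kernel))):
            out[n] += signal[n - k] * kernel[k]
    return out

n_scans = 20
onset_scan = 5                      # hypothetical trial onset
boxcar = [0.0] * n_scans
boxcar[onset_scan] = 1.0            # sentence playing during this scan...
boxcar[onset_scan + 1] = 1.0        # ...and the next (about 3 s at TR = 2 s)

hrf = [double_gamma_hrf(i * TR) for i in range(16)]  # 0 to 30 s
regressor = convolve(boxcar, hrf)

# The predicted BOLD response lags the stimulus by several seconds.
peak_scan = max(range(n_scans), key=lambda i: regressor[i])
print("predicted response peaks", (peak_scan - onset_scan) * TR, "s after onset")
```

In the GLM, one such regressor per condition (clean, noisy) sits alongside the six motion parameters, and the fitted weights are the condition-level response estimates.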

Whisper-Tiny Representations

Layer-wise embeddings were extracted from Whisper-Tiny, including the zeroth pre-block layer in both the encoder and the decoder, yielding 5 encoder and 5 decoder layers. For each encoder layer, the first 200 of 1500 hidden states (covering the roughly 3-second stimulus) were retained to form a 200 × 384 matrix, which was then flattened to a 76,800-dimensional vector. For each decoder layer, the hidden state at the last non-special token position served as the sentence representation, since the next-token objective encourages this position to summarize the entire sentence.
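The numbers in this paragraph can be checked with a few lines of arithmetic. The sketch below assumes the public Whisper configuration (a fixed 30-second input window mapped to 1500 encoder positions, and a hidden width of 384 for the Tiny model); the constant names are mine.

```python
# Whisper front-end constants (assumed from the public model configuration).
WINDOW_SECONDS = 30      # fixed audio window fed to the encoder
ENCODER_FRAMES = 1500    # encoder positions per window
D_MODEL_TINY = 384       # hidden width of Whisper-Tiny

FRAMES_KEPT = 200        # number of leading frames retained in the papers

frame_ms = WINDOW_SECONDS * 1000 / ENCODER_FRAMES   # duration of one frame
covered_s = FRAMES_KEPT * frame_ms / 1000           # audio span of kept frames
flat_dim = FRAMES_KEPT * D_MODEL_TINY               # flattened encoder vector

print(frame_ms, covered_s, flat_dim)  # 20.0 4.0 76800
```

So the 200 retained frames span 4 seconds, enough to cover a roughly 3-second sentence, and flattening them gives the 76,800 dimensions quoted above.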

Linear Encoding and Brain Score

Encoder vectors were projected from 76,800 down to around 2,700 dimensions using sparse random projection, with the target dimensionality set by the Johnson-Lindenstrauss lemma for 24 samples (a distortion tolerance of ε = 0.1 yields a bound of roughly 2,700). Per-voxel ridge regression was fit with RidgeCV using nested leave-one-out cross-validation across 60 logarithmically spaced regularization values.
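As a sanity check on the "around 2,700" figure, the Johnson-Lindenstrauss bound can be evaluated directly. The sketch below uses the same closed form as scikit-learn's `johnson_lindenstrauss_min_dim`, written out with the standard library so it runs on its own. The tolerance of 0.1 is my assumption; it is the value that reproduces the quoted dimensionality for 24 samples.

```python
import math

def jl_min_dim(n_samples: int, eps: float) -> int:
    """Minimum target dimension that preserves pairwise distances among
    n_samples points within a factor of (1 +/- eps)."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * math.log(n_samples) / denominator)

k = jl_min_dim(n_samples=24, eps=0.1)  # eps = 0.1 is an assumed tolerance
print(k)  # 2724
```

Note that the bound depends only on the number of samples and the tolerance, not on the original 76,800 dimensions, which is why so few stimuli allow such an aggressive reduction.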

Alignment was quantified by the brain score: per-participant median voxel-wise Pearson correlation between predicted and actual BOLD on the held-out fold, averaged across 4 cross-validation folds, then normalized by the ROI's noise ceiling.
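The aggregation order described here (correlate per voxel, take the median over voxels, average over folds, divide by the noise ceiling) can be sketched as follows for one participant and one ROI. The function names, the data layout, and the toy numbers are illustrative assumptions, not the authors' code.

```python
from statistics import mean, median

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def brain_score(folds, noise_ceiling):
    """folds: one entry per cross-validation fold; each entry is a list of
    (predicted, actual) pairs, one pair per voxel, over held-out trials."""
    fold_medians = [median(pearson(p, a) for p, a in fold) for fold in folds]
    return mean(fold_medians) / noise_ceiling

# Toy fold with three voxels whose correlations are 1.0, 0.8 and -1.0.
toy_fold = [
    ([1, 2, 3, 4], [2, 4, 6, 8]),
    ([1, 2, 3, 4], [1, 3, 2, 4]),
    ([1, 2, 3, 4], [4, 3, 2, 1]),
]
score = brain_score([toy_fold, toy_fold], noise_ceiling=0.8)
```

Using the median over voxels keeps a handful of very well or very poorly predicted voxels from dominating the ROI-level score.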

Condition Contrast

Brain scores were compared between clean and noisy conditions using paired tests with Benjamini-Hochberg FDR correction across all layer-by-ROI combinations.
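With 10 layers (5 encoder, 5 decoder) and 12 ROIs, the contrast involves 120 tests, so the correction matters. Below is a minimal sketch of the Benjamini-Hochberg step-up procedure. The FDR level `q` is left as a parameter, and the 0.05 and the p-values in the example are illustrative assumptions only.

```python
def benjamini_hochberg(p_values, q):
    """Return a list of booleans: True where the null is rejected at FDR q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank

    # Reject every hypothesis ranked at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

# Illustrative p-values (not from the papers), with an assumed q of 0.05.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, q=0.05)
print(sum(example_rejected), "of", len(example_p), "tests survive")
```

Because the rule is step-up, a p-value that fails its own threshold can still be rejected if a larger p-value passes a later one.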

Result

All three behavioral measures (intelligibility, perceived quality, and comprehension) dropped significantly under noise after Benjamini-Hochberg FDR correction.

Under clean speech, encoder brain scores increased with depth and peaked at encoder layer 3 in the right IFG (brain score 0.76); the right IFG was the best ROI for encoder layers 1 through 4. Decoder alignment peaked at decoder layer 2 in the left MFG (0.80), with the left MFG dominating the middle decoder layers (dec-1 and dec-2) and the left Heschl's gyrus dominating the later decoder layers (dec-3 and dec-4).

Under noisy speech, the layer profile flattened and shifted. An early encoder peak emerged at encoder layer 1 in the right Heschl's gyrus (0.76), alongside a late peak at encoder layer 4 in the right IFG pars orbitalis (0.80). Decoder peaks were weaker and more spatially dispersed, with the right MFG and bilateral anterior temporal cortex appearing as the best decoder ROIs, at typical peak values around 0.71.

In the paired condition contrast, clean speech produced significantly higher alignment than noisy speech at three layer-ROI pairs: the right IFG at encoder layers 3 and 4, and the left MFG at decoder layer 2. No layer-ROI pair survived FDR correction in the opposite direction.

Together, the results yield a compact layer-to-region map: the early encoder aligns with Heschl's gyrus, the middle and late encoder align with the IFG (right IFG under clean, right IFGorb under noise), and the middle decoder aligns with the left MFG under clean input but loses that specificity under noise.

Related Papers

Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? - The original Brain-Score framework in the vision domain, introducing noise-ceiling-normalized neural predictivity as a metric for model-brain alignment.
Many but not all deep neural network audio models capture brain responses and exhibit correspondence between model stages and brain regions - Establishes that audio model stages map onto auditory cortex hierarchically, the methodological precedent for the layer-region mapping here.
Toward a realistic model of speech processing in the brain with self-supervised learning - Earlier evidence that speech models recapitulate cortical processing stages, the precedent for using Whisper layers as candidate brain models.