Reproducibility Report for Study ‘The neural architecture of language: Integrative modeling converges on predictive processing’ by Schrimpf & colleagues (2021, PNAS)

Author

Atlas Kazemian

Introduction

In “The Neural Architecture of Language: Integrative Modeling Converges on Predictive Processing” (Schrimpf et al., 2021), the authors investigated how large language models correspond to neural and behavioral measures of human language processing. Using fMRI and electrocorticography datasets, they compared 43 artificial neural network models and found that transformer-based models—especially GPT-2—best predicted neural activity and behavioral responses to sentences.

The main finding targeted for reproducibility is the set of brain scores reported in Figure 2 of the paper. Schrimpf et al. showed that GPT-2 models achieved the highest neural predictivity, in some cases reaching the neural noise ceiling—meaning they could explain nearly all the variance in the neural data. In contrast, models such as BERT exhibited much lower brain predictivity.

This project aims to reproduce the brain scores for one of the three datasets used in the original study (the Pereira et al., 2018 fMRI dataset) and for a subset of model families: GloVe, BERT, and GPT-2. Ideally, the same r-values as those reported in Figure 2 of the paper should be obtained. The brain score measures the similarity between brain responses and model representations, computed as the Pearson correlation between actual neural responses and predicted neural responses. The predicted responses are obtained by linearly mapping model embeddings to brain responses using a cross-validated L2-regularized (ridge) regression procedure.
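The brain score described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the evaluation code of Schrimpf et al. or of this project's repository: the function names, the closed-form ridge solver, the fixed penalty `alpha`, the five folds, and the averaging across voxels are all choices made here for illustration.

```python
import numpy as np

def ridge_fit_predict(X_train, Y_train, X_test, alpha=1.0):
    """Fit ridge regression in closed form and predict held-out responses."""
    # Center with training statistics only, so no test information leaks in.
    x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)
    Xc, Yc = X_train - x_mean, Y_train - y_mean
    gram = Xc.T @ Xc + alpha * np.eye(Xc.shape[1])
    W = np.linalg.solve(gram, Xc.T @ Yc)
    return (X_test - x_mean) @ W + y_mean

def pearson_per_column(A, B):
    """Pearson r between matching columns (voxels) of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.sqrt((A ** 2).sum(axis=0) * (B ** 2).sum(axis=0))
    return (A * B).sum(axis=0) / denom

def brain_score(X, Y, n_splits=5, alpha=1.0, shuffle=True, seed=0):
    """Cross-validated Pearson r between predicted and actual responses.

    X: (n_sentences, n_features) model embeddings
    Y: (n_sentences, n_voxels) neural responses
    """
    idx = np.arange(X.shape[0])
    if shuffle:
        idx = np.random.default_rng(seed).permutation(X.shape[0])
    fold_scores = []
    for test_idx in np.array_split(idx, n_splits):
        train_idx = np.setdiff1d(idx, test_idx)
        Y_pred = ridge_fit_predict(X[train_idx], Y[train_idx], X[test_idx], alpha)
        fold_scores.append(pearson_per_column(Y[test_idx], Y_pred))
    # Average over folds first, then over voxels (an assumption of this sketch).
    return float(np.mean(fold_scores, axis=0).mean())

# Synthetic stand-ins for embeddings and voxel responses (not real data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 10))
Y_demo = X_demo @ rng.normal(size=(10, 5)) + 0.1 * rng.normal(size=(100, 5))
print(brain_score(X_demo, Y_demo))
```

In practice the penalty would typically be chosen by nested cross-validation rather than fixed, and the raw score may additionally be divided by a noise ceiling estimate.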

If time permits, the study will be extended by incorporating more recent models to explore whether state-of-the-art models exhibit similar or stronger relationships.

Justification for choice of study

Several recent works suggest that the brain similarity scores reported in Schrimpf et al. (2021) may be inflated due to autocorrelation effects during cross-validation in the linear mapping step. These studies recommend performing cross-validation without shuffling to mitigate this issue. My primary goal is to reproduce the original results and examine whether this methodological adjustment affects the outcomes.

Anticipated challenges

The main challenge I anticipate is downloading and setting up the neural data from the GitHub repository provided by Schrimpf et al. Because the study dates back to 2021 and relies on an older Python environment, dependency mismatches with newer Python versions may cause installation and compatibility issues.

Methods

Description of the steps required to reproduce the results

The project can be divided into three main steps:

  1. Dataset download and preprocessing: Download the data from the original paper’s repository and preprocess it to obtain the stimuli × voxels matrices used in the original analysis.

  2. Activation extraction: Extract LLM activations for the same set of stimuli (sentences) used in the fMRI experiment.

  3. Computing correlation scores: Build the model evaluation pipeline consisting of a ridge regression cross-validation framework. Compute the average Pearson correlation for each model layer and select the best-performing layer for each model.
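The layer-selection part of step 3 can be sketched as follows. This is a hypothetical helper, assuming that each layer already has an array of Pearson correlations of shape (n_folds, n_voxels) produced by the cross-validated regression; the layer names and score values below are made up for illustration.

```python
import numpy as np

def select_best_layer(layer_fold_scores):
    """Average Pearson r per layer and return the best-scoring layer.

    layer_fold_scores: dict mapping a layer name to an array of shape
    (n_folds, n_voxels) holding per-fold, per-voxel correlations.
    """
    mean_scores = {
        layer: float(np.mean(scores))
        for layer, scores in layer_fold_scores.items()
    }
    best_layer = max(mean_scores, key=mean_scores.get)
    return best_layer, mean_scores

# Illustrative values only (2 folds x 3 voxels per layer).
demo_scores = {
    "layer_0": np.array([[0.10, 0.12, 0.08], [0.11, 0.09, 0.10]]),
    "layer_1": np.array([[0.20, 0.22, 0.18], [0.21, 0.19, 0.20]]),
    "layer_2": np.array([[0.30, 0.28, 0.32], [0.29, 0.31, 0.30]]),
}
best, means = select_best_layer(demo_scores)
print(best, means[best])
```

Because the best layer is chosen on the same data used to report the score, the selected value is slightly optimistic; a held-out split for layer selection would avoid this.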

Differences from original study

There may be minor differences in the model evaluation pipeline, which could lead to slight variations in the brain scores. However, the overall trend across model families is expected to remain consistent with the original findings.

Project Progress Check 1

Measure of success

A hard measure of success is reproducing the same brain scores for the same models as reported in the original study. A softer measure of success is replicating the relative trend in brain scores shown in Figure 2 for the Pereira dataset and the selected subset of models. If neither is achieved, the reproducibility attempt will be considered unsuccessful.

Pipeline progress

  1. Dataset download and preprocessing: The data has been downloaded from the original repository and preprocessed to match the stimuli × voxels structure used in the paper’s analysis.

  2. Activation extraction: This step has not yet started. Model activations will be extracted using GPUs on the CCN2 nodes. This process is expected to be relatively quick.

  3. Computing correlation scores: The code for the model evaluation pipeline (including ridge regression cross-validation and average Pearson correlation computation) is mostly complete. It will be tested once the model activations are extracted. All scripts can be found in this GitHub repository: https://github.com/akazemian/251-Replication-Project.

Results

Data preparation

Because the project code was implemented in Python on the CCN2 nodes, the data preparation steps shown here are screenshots of the Python code, which is also included in the project repository under src/neural_data.

Import data

Stimuli and neural responses were first downloaded from the web and then loaded using the script src/neural_data/pereira2018.py. Below are screenshots of the TSV stimulus file and the raw neural xarray dataset.

Data exclusion / filtering

The raw neural data was then processed to only include reliable voxels, and only those within the language regions of the brain. This reduced the set of all voxels to ~13000 voxels across 9 subjects.

Prepare data for analysis - create columns etc.

Below is a screenshot of the dataset as a NumPy array after filtering and processing. The rows correspond to the ~600 sentences, and the columns correspond to voxels, giving the neural response to each sentence. The voxels are concatenated across all subjects.
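The filtering and concatenation steps can be sketched as below. This is not the code in src/neural_data; the dictionary keys, the reliability threshold, and the assumption that every subject has responses to the same sentences in the same order are all hypothetical simplifications.

```python
import numpy as np

def filter_voxels(responses, reliability, in_language, min_reliability=0.0):
    """Keep only reliable voxels that lie inside the language regions.

    responses: (n_sentences, n_voxels) array for one subject
    reliability: (n_voxels,) per-voxel reliability estimate
    in_language: (n_voxels,) boolean mask of language-region voxels
    """
    keep = (reliability > min_reliability) & in_language
    return responses[:, keep]

def build_response_matrix(subjects, min_reliability=0.0):
    """Concatenate filtered voxels from all subjects along the voxel axis."""
    filtered = [
        filter_voxels(s["responses"], s["reliability"], s["in_language"], min_reliability)
        for s in subjects
    ]
    return np.concatenate(filtered, axis=1)

# Two synthetic subjects, 4 sentences x 5 voxels each (not real data).
rng = np.random.default_rng(0)
subjects_demo = [
    {
        "responses": rng.normal(size=(4, 5)),
        "reliability": np.array([0.5, -0.1, 0.3, 0.2, 0.0]),
        "in_language": np.array([True, True, False, True, True]),
    },
    {
        "responses": rng.normal(size=(4, 5)),
        "reliability": np.array([0.4, 0.6, 0.1, -0.2, 0.9]),
        "in_language": np.array([True, True, True, True, False]),
    },
]
matrix = build_response_matrix(subjects_demo)
print(matrix.shape)  # sentences x voxels kept across both subjects
```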

Key analysis

The analyses as specified in the analysis plan.

Below is a graph of the original results (left) and the reproduced results (right) for the smallest and largest versions of the BERT-uncased model (small, large) as well as the GPT-2 model (small, extra large).

Exploratory analyses

In addition to the main reproduction, I also examined the case in which the linear mapping (ridge regression) is fit without shuffling the stimuli. The motivation is that several recent studies have argued that shuffling the stimulus set, as was done in Schrimpf et al. (2021), inflates neural predictivity in language models. Below is the main figure with an additional plot (rightmost) in which the stimuli were not shuffled during regression.
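The difference between shuffled and contiguous splits can be illustrated without fitting any model. The sketch below uses hypothetical helper names and treats adjacent sentence indices as a rough proxy for sentences from the same passage; it counts how many test sentences have an immediate neighbor in the training set.

```python
import numpy as np

def make_folds(n_items, n_splits=5, shuffle=False, seed=0):
    """Split item indices into folds, contiguous by default."""
    idx = np.arange(n_items)
    if shuffle:
        idx = np.random.default_rng(seed).permutation(n_items)
    return [np.sort(fold) for fold in np.array_split(idx, n_splits)]

def neighbor_leak_fraction(folds, n_items):
    """Fraction of items whose adjacent item (index +/- 1) is in training
    while the item itself is in the test fold."""
    leaked = 0
    for test_idx in folds:
        test_items = test_idx.tolist()
        test_set = set(test_items)
        for i in test_items:
            neighbors = [j for j in (i - 1, i + 1) if 0 <= j < n_items]
            if any(j not in test_set for j in neighbors):
                leaked += 1
    return leaked / n_items

n_sentences = 600  # roughly the number of sentences mentioned in this report
contiguous = make_folds(n_sentences, shuffle=False)
shuffled = make_folds(n_sentences, shuffle=True)
print("contiguous:", neighbor_leak_fraction(contiguous, n_sentences))
print("shuffled:  ", neighbor_leak_fraction(shuffled, n_sentences))
```

With contiguous folds only the sentences at fold boundaries have a neighbor in the training set, whereas with shuffled folds most test sentences do, which is the route by which autocorrelated stimuli can inflate held-out scores.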

Discussion

Summary of Reproduction Attempt

The overall trends reported in the original paper were successfully reproduced: the encoding scores for the GPT-2 model family were higher than those for BERT, and larger models scored higher than smaller ones. However, the encoding scores themselves were much lower than in the original study. This is primarily due to the ceiling-normalization procedure used in the original study, in which model scores were normalized by an estimate of how much signal is present in the dataset. The details of the noise ceiling calculation were not discussed in the paper and were available only in the code, so performing the noise ceiling normalization proved more challenging than I anticipated. As a result, I consider the results of the original study to be partially reproduced.

In addition to the main replication, I also set out to replicate findings from a newer study claiming that neural predictivity in Schrimpf et al. (2021) is heavily inflated by shuffling the stimuli during linear regression. I successfully replicated this finding: using contiguous cross-validation splits, without shuffling the sentences, results in much lower neural predictivity across all models. In addition, the superiority of GPT-2 as a model family and the effect of scale are no longer observable.

Commentary

The results of the exploratory analysis are very important for the field of language neuroscience. The higher brain scores in Schrimpf et al. (2021) arise from the autocorrelation present in the stimuli, a consequence of the continuous nature of language. When the stimuli are shuffled, neighboring sentences end up split between the training and test sets, and the linear model exploits this autocorrelation to learn a better-fitting mapping rather than relying on useful information. This inflates the neural scores in an unsatisfying way. Therefore, when using linear regression to compute similarity, it is important to run controls (e.g., with and without shuffling) to ensure that the results are reliable and meaningful.