Reproducibility Report for ‘Base Models Know How to Reason, Thinking Models Learn When’ by Venhoff et al. (arXiv preprint: 2510.07364v3, 2025)

Author

Linas Nasvytis (linasmn@stanford.edu)

Published

October 29, 2025

Introduction

The target study proposes a novel hybrid model that combines base language models with minimal, interpretable steering mechanisms to activate latent reasoning capabilities. Venhoff et al. demonstrate that large base models (e.g., Qwen2.5-32B, Yi-34B) already possess strong reasoning skills, and that a “thinking model” (e.g., DeepSeek-R1 or Claude 3.7) can guide when these reasoning circuits should be activated. Their method involves:
1. Discovering interpretable reasoning directions via Sparse Autoencoders trained on hidden states from chain-of-thought prompts;
2. Learning steering vectors aligned with reasoning-relevant behaviors (e.g., arithmetic steps, uncertainty, backtracking) by comparing base and thinking model predictions at the token level;
3. Applying these vectors selectively, using token-level classifications from the thinking model to steer the base model only at key points.

I aim to reproduce the full hybrid steering pipeline (Steps 1–3). Due to computational constraints, I will conduct this reproduction using Llama-3.1-8B as the base model (one of the four models evaluated in the original study). I will reimplement the sparse autoencoder-based direction discovery (sketched below), use open-weight models (e.g., DeepSeek or Mixtral derivatives) for reasoning supervision, and evaluate whether lightweight steering improves performance over the unsteered base model.
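
Below is a minimal sketch of the kind of sparse autoencoder I plan to train on cached Llama-3.1-8B activations. The latent width, L1 coefficient, and training settings are placeholder assumptions of mine, not values taken from the original paper.

    import torch
    import torch.nn as nn

    class SparseAutoencoder(nn.Module):
        """Minimal SAE with an overcomplete ReLU latent and an L1 sparsity penalty.
        Widths and coefficients are placeholders, not the paper's settings."""

        def __init__(self, d_model=4096, d_latent=16384):
            super().__init__()
            self.encoder = nn.Linear(d_model, d_latent)
            self.decoder = nn.Linear(d_latent, d_model)

        def forward(self, h):
            z = torch.relu(self.encoder(h))   # sparse latent code
            return self.decoder(z), z         # reconstruction, code

    def sae_loss(h, h_hat, z, l1_coeff=1e-3):
        recon = ((h - h_hat) ** 2).mean()     # reconstruction error
        return recon + l1_coeff * z.abs().mean()

    # Training-loop sketch; `activations` is an (n_tokens, 4096) tensor of hidden
    # states cached from Llama-3.1-8B on CoT prompts (caching is sketched in Methods).
    def train_sae(activations, epochs=1, lr=1e-4, batch_size=4096):
        sae = SparseAutoencoder(d_model=activations.shape[-1])
        opt = torch.optim.Adam(sae.parameters(), lr=lr)
        for _ in range(epochs):
            for batch in activations.split(batch_size):
                h_hat, z = sae(batch)
                loss = sae_loss(batch, h_hat, z)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return sae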

Justification for choice of study

My PhD research focuses on large-scale analysis of chain-of-thought traces across both humans and language models, an area of research still in its infancy. This study is one of the few works (1) whose authors are at arm's length from me, since some of the most prominent work in this area has been published within Stanford's Psychology Department, (2) that directly addresses questions I am analyzing in my own work, and (3) whose reproduction would teach me a method directly useful for my research: using mechanistic interpretability to analyze chain-of-thought reasoning in language models.

I am aware that, because the paper is relatively new, it is currently available only as a preprint. However, given how well it fits both the scope of the project and my interests, this would be an immensely rewarding reproduction to work on. While it will involve a substantial amount of work, I am confident I can manage the workload and secure the necessary computational resources.

Anticipated challenges

  • I will use Llama-3.1-8B as the base model (included in the original study), which may yield weaker steering effects compared to larger models.
  • I may need to substitute the “thinking model” (e.g., Claude 3.7) with an open-weight alternative for supervision, which could affect the quality of the token-level steering signal.
  • Sparse autoencoder training and direction discovery will require GPU resources, though with sufficient time, I believe this is feasible.
  • Randomness in model outputs and training may lead to variability in discovered directions or evaluation results.
  • Some implementation details (e.g., classifier thresholds, interpolation weights) are not fully specified and may require re-tuning.

Methods

Description of the steps required to reproduce the results

  1. Model setup
    • Use Llama-3.1-8B as the base model (via HuggingFace or vLLM).
    • Select an open-weight “thinking model” (e.g., DeepSeek-MoE, Mixtral, or similar) to act as a reasoning supervisor.
  2. Data collection
    • Generate a dataset of chain-of-thought (CoT) traces by prompting the base and thinking models on arithmetic, symbolic, and reasoning tasks (e.g., GSM8K, Last Letter, or synthetic puzzles); a sketch of this step, together with the hidden-state caching for step 3, appears after this list.
  3. Sparse autoencoder training
    • Collect hidden states (e.g., MLP activations) from the base model on CoT prompts.
    • Train a sparse autoencoder (SAE) to compress these states and extract interpretable latent directions.
    • Identify individual neurons or features correlated with reasoning behaviors (e.g., arithmetic steps, uncertainty).
  4. Steering vector extraction
    • Compare token-level outputs of the base and thinking models to label when steering is needed.
    • Train simple classifiers (e.g., logistic regression) to predict when to apply each steering vector.
    • Extract final steering directions by averaging latent components aligned with reasoning triggers (a simplified sketch of this step appears after this list).
  5. Hybrid inference
    • For new inputs, run the classifier to determine which tokens need steering.
    • Add the selected direction vector(s) to the base model’s hidden state at those positions using scaled vector addition (sketched after this list).
    • Run inference and evaluate accuracy.
  6. Evaluation
    • Compare the performance of the unsteered base model against the hybrid model with selective steering.
    • Report metrics such as task accuracy, token-level intervention rate, and example outputs.
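
To make steps 2–3 above concrete, here is a hedged sketch of how I expect to generate CoT traces on GSM8K and cache hidden states from one layer of the base model. The prompt template, layer index, dataset slice, and generation settings are my own placeholder choices rather than the paper's.

    import torch
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B"            # base model from the study
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )

    gsm8k = load_dataset("gsm8k", "main", split="train[:500]")   # small slice to start
    LAYER = 16                                                   # placeholder layer index
    cached = []

    for ex in gsm8k:
        prompt = f"Question: {ex['question']}\nLet's think step by step.\n"
        inputs = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            trace = model.generate(**inputs, max_new_tokens=256, do_sample=False)
            # Re-run the generated trace once to cache per-token hidden states.
            out = model(trace, output_hidden_states=True)
        cached.append(out.hidden_states[LAYER].squeeze(0).float().cpu())

    torch.save(torch.cat(cached), "cot_hidden_states.pt")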
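
For step 4, the following sketch derives a steering direction as a difference of class means over token hidden states and trains a logistic-regression gate on the same labels. This difference-of-means construction is a simplification of the SAE-based procedure in steps 3–4, and the token labels from the thinking model are assumed to already exist.

    import numpy as np
    import torch
    from sklearn.linear_model import LogisticRegression

    # Assumed inputs: `hidden` holds base-model hidden states for CoT tokens (from the
    # previous sketch); `labels` marks tokens the thinking model flagged as needing a
    # reasoning intervention (1) vs. not (0). The labeling step itself is not shown.
    hidden = torch.load("cot_hidden_states.pt").numpy()
    labels = np.load("steer_labels.npy")              # placeholder path

    # Steering direction as a unit-normalized difference of class means.
    direction = hidden[labels == 1].mean(axis=0) - hidden[labels == 0].mean(axis=0)
    direction = direction / np.linalg.norm(direction)
    torch.save(torch.from_numpy(direction).float(), "steering_vector.pt")

    # Token-level gate: predicts from the hidden state whether to steer at a token.
    gate = LogisticRegression(max_iter=1000).fit(hidden, labels)
    print("gate training accuracy:", gate.score(hidden, labels))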
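
For step 5, this sketch applies the learned direction with a forward hook on one decoder layer of the HuggingFace Llama model. The layer index and steering scale are assumptions to be tuned, and the token-level gate is stubbed out so that every position is steered.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_name = "meta-llama/Llama-3.1-8B"
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name, torch_dtype=torch.bfloat16, device_map="auto"
    )

    steer_vec = torch.load("steering_vector.pt")      # direction from the previous sketch
    LAYER, SCALE = 16, 4.0                            # assumed layer index and strength

    def select_positions(hidden):
        # Stub for the token-level gate; for now, steer every position.
        return torch.ones(hidden.shape[:2], dtype=torch.bool, device=hidden.device)

    def steering_hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        mask = select_positions(hidden).unsqueeze(-1)                 # (batch, seq, 1)
        steered = hidden + SCALE * mask * steer_vec.to(hidden.device, hidden.dtype)
        return (steered,) + output[1:] if isinstance(output, tuple) else steered

    handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
    prompt = "Question: A train travels 180 miles in 3 hours. What is its speed?\nAnswer:"
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    print(tok.decode(out[0], skip_special_tokens=True))
    handle.remove()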

Differences from original study

  • Base model: I will use Llama-3.1-8B, which is smaller than most models used in the original study (e.g., Yi-34B, Qwen2.5-32B). However, this model was included in their experiments, so results should still be comparable.
  • Thinking model: I will substitute the proprietary Claude 3.7 with an open-weight alternative (e.g., DeepSeek-v2 or Mixtral) for reasoning supervision. This may lead to differences in token-level steering decisions.
  • Compute: I will run all experiments on several older-generation GPUs (NVIDIA A40), which may limit batch sizes and training speed for the sparse autoencoder and classifier components.

Project Progress Check 1

Measure of success

Reproduction success will be measured by the improvement in task accuracy (e.g., on GSM8K or a similar benchmark) when applying the learned steering vectors to the base model (Llama-3.1-8B), compared to the unsteered baseline.

Specifically, success will be quantified as:

ΔAccuracy = Accuracy(steered model) − Accuracy(base model)

A successful reproduction will demonstrate a non-trivial gain (e.g., ≥2–3 percentage points) in accuracy from the hybrid steering method, along with qualitative alignment between the discovered directions and reasoning-related behaviors (e.g., arithmetic steps, uncertainty).
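
As a minimal sketch of how I plan to compute this, assuming GSM8K-style items where the final numeric answer can be read off as the last number in both the gold solution and the model's completion (a simplifying assumption of mine):

    import re

    def final_number(text):
        # Simplifying assumption: the last number in the text is the final answer.
        nums = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
        return nums[-1] if nums else None

    def accuracy(completions, gold_answers):
        hits = sum(final_number(c) == final_number(g)
                   for c, g in zip(completions, gold_answers))
        return hits / len(gold_answers)

    # delta_accuracy = accuracy(steered_outputs, gold) - accuracy(base_outputs, gold)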

Pipeline progress

  • Model setup: Completed. The base model (Llama-3.1-8B) is downloaded and ready for use.
  • Compute resources: Confirmed. Multiple GPUs have been allocated for running the sparse autoencoder and steering experiments.
  • Next steps: Prepare chain-of-thought datasets for hidden state extraction and begin training the sparse autoencoder on base model activations.

Results

Data preparation

Key analysis

A side-by-side comparison of my key result with the corresponding figure from the original paper will be presented here.

Exploratory analyses

Discussion

Summary of Reproduction Attempt

Commentary