Replication of ‘Visual sense of number vs. sense of magnitude in humans and machines’ by Testolin, A., Dolfi, S., Rochus, M. et al. (2020, Sci Rep)

Author

Wenqing Cao (cwenqing@ucsd.edu)

Published

October 24, 2025

Introduction

Justification:

It is widely believed that the approximate number system (ANS) supports the formation of non-exact representations of quantity in adults, children, and even animals (Dehaene, 2011). The ANS affords abilities such as quickly discriminating between two numerical quantities or estimating the number of items in a set. Discrimination acuity depends on the ratio between the two quantities and has been found to predict later math learning performance. However, there is debate over whether numerosity is the main perceptual feature driving the ANS, or whether numerosity is instead estimated from non-numerical visual features. The original paper attempted to reconcile this debate by testing deep neural networks and human participants on the same task, using a stimulus space that can disentangle the contributions of numerical and non-numerical features. For this replication project, I will try to replicate the human experiment from the paper. It relates to my side research interest in probing the ANS in vision-language models.

Methods:

The research question is whether visual numerosity is a primary perceptual attribute or whether people estimate numerosity from continuous magnitudes. To test this, the researchers used a dot array comparison task, in which the participant sees two side-by-side dot arrays for a limited time and then indicates which array contains more dots. To quantify the contribution of non-numerical features, the researchers constructed a 3-D orthogonal stimulus space with three dimensions: numerosity, size, and spacing. Specifically, they generated a database of 21970 dot array images by picking 13 levels on each dimension, with levels evenly spaced on a log scale, and generating 10 images per point. They then randomly sampled 300 image pairs from the database as test stimuli, covering numerosity ratios from 0.5 to 0.9 and oversampling the harder ratios. Participants indicated which image in the pair had more dots by pressing the left or right arrow key on the keyboard.

They recruited 40 volunteer college students as human participants. The study session for each participant takes about 30 minutes. The human behavioral effect was strong: mean accuracy was 83%, and the individual-level GLM fits were significant (R² = 0.55, chi-square = 191.14, p < 0.001). The coefficients were significant for numerosity (t(39) = 23.54, p < .001) and spacing (t(39) = 7.21, p < .001), but not for size (t(39) = 1.37, p = .18).

The key analysis I am trying to replicate is the individual-level GLM fit and the coefficient estimates for each dimension. I will use the original stimuli (the 300 sampled pairs). I could recruit fewer participants (n = 25, 300 trials in about 30 minutes each).

What’s the measure of interest? - Accuracy on a dot array numerosity Stroop task

What construct does it map to? - Human adults’ approximate representation of numerosity

Is its estimate reported? - Taking mean - Effect size R^2 = 0.55 - Unit: Number of correct trials?

Any measure of variability? - SE?

Link: https://github.com/AnnaWCao/testolin2020/tree/main

Methods

Power Analysis

Original effect size, power analysis for samples to achieve 80%, 90%, 95% power to detect that effect size. Considerations of feasibility for selecting planned sample size.

A: The original study found a medium effect size, R^2 = 0.55. To achieve 80%, 90%, and 95% power to detect this effect size at a significance level of α = 0.05, 26, 35, and 43 participants are needed, respectively. Given that the study session is long (~30 minutes) and that each participant sees 300 trials, it might be more feasible to aim for a smaller sample size and moderately high power.

Planned Sample

Planned sample size and/or termination rule, sampling frame, known demographics if any, preselection rules if any.

A: I plan to recruit 25 participants. Participants will be undergraduate students in the local San Diego area, so that the sample’s mean age and education level roughly match those of the original sample, which consisted of 40 volunteer students (mean age = 23.7 years, range = 20-28, females = 80%).

Materials

All materials - can quote directly from original article - just put the text in quotations and note that this was followed precisely. Or, quote directly and just point out exceptions to what was described in the original article.

A: “Images of size 200 × 200 pixels were generated by randomly placing white dots on a black background. For the discrimination task there were 13 levels of Numerosity (range 7–28), 13 levels of Size (range 2.6–10.4 pixels × 10^5) and 13 levels of Spacing (range 80–320 pixels × 10^5), evenly spaced on a logarithmic scale. For each selected point in the stimulus space 10 different images were generated by randomly varying dots displacement, resulting in a dataset of 21970 unique images. For the human experiment we randomly selected images from the dataset to create 300 image pairs with different magnitude ratios, oversampling the more difficult numerosity ratios (10% with ratio between 0.5 and 0.6; 20% with ratio between 0.6 and 0.7; 30% with ratio between 0.7 and 0.8; 40% with ratio between 0.8 and 0.9).”
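As a quick check on the numbers in the quoted passage, the sketch below builds 13 log-spaced levels per dimension and confirms the dataset size. Rounding the numerosity levels to whole dots is my assumption, as are the function and variable names; the paper may have chosen its levels differently.

```python
import math

def log_spaced_levels(low, high, n_levels=13):
    """Return n_levels values from low to high, evenly spaced on a log scale."""
    step = math.log(high / low) / (n_levels - 1)
    return [low * math.exp(i * step) for i in range(n_levels)]

# Numerosity must be a whole number of dots, so levels are rounded (assumption).
numerosity_levels = [round(v) for v in log_spaced_levels(7, 28)]
# Size and Spacing stay continuous, in the units quoted from the paper.
size_levels = log_spaced_levels(2.6, 10.4)
spacing_levels = log_spaced_levels(80, 320)

images_per_point = 10
n_images = (len(numerosity_levels) * len(size_levels)
            * len(spacing_levels) * images_per_point)

print(numerosity_levels)  # [7, 8, 9, 10, 11, 12, 14, 16, 18, 20, 22, 25, 28]
print(n_images)           # 21970
```

Each dimension spans a factor of 4 (two octaves), so adjacent levels differ by a constant ratio of 2^(1/6).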

Because the authors did not publish code or data specifying the 300 image pairs used in the study, I will produce my own code to randomly select the 300 image pairs. The authors did publish the exact dimensions and point coordinates for all 21970 unique dot array images, so everything else will be equal.
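The random selection could look something like the minimal sketch below. It assumes rounded numerosity levels, half-open ratio bins, and independent uniform sampling of the size level, spacing level, and exemplar for each image; none of these details are specified in the paper, and all names are hypothetical.

```python
import random
from itertools import combinations

# Hypothetical rounded numerosity levels (13 log-spaced values from 7 to 28).
NUMEROSITIES = [7, 8, 9, 10, 11, 12, 14, 16, 18, 20, 22, 25, 28]
# Share of pairs per numerosity-ratio bin, as quoted from the paper.
BINS = [((0.5, 0.6), 0.10), ((0.6, 0.7), 0.20),
        ((0.7, 0.8), 0.30), ((0.8, 0.9), 0.40)]

def random_image(n_dots, rng):
    """Pick one database image with n_dots (uniform over the other attributes)."""
    return {"n": n_dots, "size_level": rng.randrange(13),
            "spacing_level": rng.randrange(13), "exemplar": rng.randrange(10)}

def sample_pairs(n_pairs=300, seed=0):
    rng = random.Random(seed)
    pairs = []
    for (lo, hi), share in BINS:
        # Numerosity pairs whose smaller/larger ratio falls in [lo, hi).
        candidates = [(a, b) for a, b in combinations(NUMEROSITIES, 2)
                      if lo <= a / b < hi]
        for _ in range(round(n_pairs * share)):
            small, large = rng.choice(candidates)
            # Randomize which side shows the larger array.
            left, right = (small, large) if rng.random() < 0.5 else (large, small)
            pairs.append({"left": random_image(left, rng),
                          "right": random_image(right, rng)})
    rng.shuffle(pairs)  # mix difficulty levels across the session
    return pairs

pairs = sample_pairs()
print(len(pairs))  # 300
```

Fixing the seed makes the selected pairs reproducible, so every participant can see the same 300 pairs.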

Procedure

Can quote directly from original article - just put the text in quotations and note that this was followed precisely. Or, quote directly and just point out exceptions to what was described in the original article.

A: “Stimuli were projected on a 19-inch color screen. Participants sat approximately 70 cm from the screen and placed their head on a chin rest. Participants were verbally instructed to select the stimulus with more dots, responding with the left and right arrows of the keyboard depending on its side of appearance (feedback was given only during few practice trials). The task consisted in 3 blocks of 100 trials each, for a total of 300 trials. Each trial began with a fixation cross at the center of the screen (500 ms), followed by the simultaneous presentation of two stimuli (250 ms), one at the right and one at the left of the cross with eccentricity of ~12 visual degrees, and then by two masks of black and white Gaussian noise in the same positions (150 ms). A black screen was then displayed until response, without time limit. After response, a pseudorandom inter-trial interval between 1250 and 1750 ms occurred.”
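The timing in the quoted procedure can be written down as a small schedule, sketched below. The durations and viewing geometry come from the quote; the uniform draw for the inter-trial interval and all names are my assumptions, and this is not tied to any particular experiment software.

```python
import math
import random

# Phase durations in ms, taken from the quoted procedure.
FIXATION_MS = 500
STIMULUS_MS = 250
MASK_MS = 150
ITI_RANGE_MS = (1250, 1750)

def trial_schedule(rng):
    """Ordered (phase, duration_ms) list for one trial; response is untimed."""
    return [
        ("fixation", FIXATION_MS),
        ("stimuli", STIMULUS_MS),
        ("mask", MASK_MS),
        ("response", None),  # black screen until a key press
        ("iti", rng.randint(*ITI_RANGE_MS)),  # uniform draw is an assumption
    ]

def eccentricity_cm(angle_deg, viewing_distance_cm):
    """On-screen offset from fixation that subtends angle_deg at this distance."""
    return viewing_distance_cm * math.tan(math.radians(angle_deg))

rng = random.Random(1)
schedule = trial_schedule(rng)
print(schedule)
print(round(eccentricity_cm(12, 70), 1))  # 14.9
```

At a 70 cm viewing distance, an eccentricity of about 12 visual degrees corresponds to each array sitting roughly 14.9 cm from the fixation cross.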

Analysis Plan

Can also quote directly, though it is less often spelled out effectively for an analysis strategy section. The key is to report an analysis strategy that is as close to the original - data cleaning rules, data exclusion rules, covariates, etc. - as possible.

Clarify key analysis of interest here. You can also pre-specify additional analyses you plan to do.

A: “All responses below stimulus presentation time were considered outliers, as well as response times over two standard deviations from the participant’s mean response time in equally difficult trials (based on numerosity ratio). A generalized linear model (with probit link function) was then fitted to the choice data of each participant, which was modeled as a function of the three regressors Numerosity, Size and Spacing.”
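To make the per-participant probit model in the quote concrete, here is a minimal sketch that fits it to simulated data using only the Python standard library. Coding each regressor as log2(right/left), predicting "chose right", adding an intercept for side bias, and fitting by Fisher scoring are all my assumptions; the simulated coefficients are made up, and this is not the authors' code.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def fit_probit(X, y, n_iter=25):
    """Maximum-likelihood fit of P(y = 1) = Phi(X @ beta) by Fisher scoring."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(n_iter):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = min(max(STD_NORMAL.cdf(eta), 1e-9), 1 - 1e-9)
            d = STD_NORMAL.pdf(eta)
            for a in range(k):
                score[a] += (yi - p) * d / (p * (1 - p)) * xi[a]
                for c in range(k):
                    info[a][c] += d * d / (p * (1 - p)) * xi[a] * xi[c]
        beta = [b + s for b, s in zip(beta, solve(info, score))]
    return beta

def simulate_trials(n_trials, true_beta, rng):
    """Simulated trials: columns are intercept, log2 ratios of N, Size, Spacing."""
    X, y = [], []
    for _ in range(n_trials):
        xi = [1.0] + [rng.uniform(-1.0, 1.0) for _ in range(3)]
        p_right = STD_NORMAL.cdf(sum(b * v for b, v in zip(true_beta, xi)))
        X.append(xi)
        y.append(1 if rng.random() < p_right else 0)
    return X, y

rng = random.Random(0)
true_beta = [0.0, 2.0, 0.2, 0.5]  # made-up values for illustration only
X, y = simulate_trials(300, true_beta, rng)
beta = fit_probit(X, y)
print([round(b, 2) for b in beta])  # side bias, numerosity, size, spacing
```

In this coding, a numerosity coefficient that is large relative to the size and spacing coefficients means choices are driven mostly by the number of dots.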

Instead of fitting a separate generalized linear model for each individual, I will fit a group-level generalized linear mixed-effects model (GLMM) to all trials across participants, with numerosity, size, and spacing as fixed effects and a random effect for each individual. The dependent measure will be trial-level accuracy (correct/incorrect), modeled as the probability of a correct response. A GLMM is suitable because participants’ responses are binary choices (left/right array).

Differences from Original Study

Explicitly describe known differences in sample, setting, procedure, and analysis plan from original study. The goal, of course, is to minimize those differences, but differences will inevitably occur. Also, note whether such differences are anticipated to make a difference based on claims in the original article or subsequent published research on the conditions for obtaining the effect.

A: In my replication project, the sample size (n = 25) will be smaller than in the original study (n = 40). The setting and procedure will be the same. There will be a slight difference in stimuli, as I will randomly sample the 300 dot image pairs from their stimulus database myself. The analysis plan will be different, as I will fit a group-level GLMM with random intercepts instead of fitting a separate GLM for each participant.

Methods Addendum (Post Data Collection)

You can comment this section out prior to final report with data collection.

Actual Sample

Sample size, demographics, data exclusions based on rules spelled out in analysis plan

Differences from pre-data collection methods plan

Any differences from what was described as the original plan, or “none”.

Results

Data preparation

Data preparation following the analysis plan.

A: First, following the exclusion criteria above, I will remove trials in which the response time is below the stimulus presentation time (250 ms) or more than 2 SD above the participant’s mean response time in equally difficult trials (grouped by numerosity ratio). To do this, I will first merge all participants’ data into a single dataframe. I will then keep only rows where response time > 250 ms, and remove rows where response time > mean(response time) + 2 SD, with the mean and SD computed per participant within each numerosity-ratio bin. The dataframe will include columns for participant number, response time, trial correctness (1/0), numerosity ratio, size ratio, and spacing ratio.
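The two exclusion filters could be sketched as follows, using plain Python rather than a dataframe library. The field names, the four difficulty bins, and computing the mean and SD after dropping the too-fast responses are my assumptions.

```python
from bisect import bisect_right
from collections import defaultdict
from statistics import mean, stdev

STIMULUS_MS = 250
BIN_EDGES = [0.6, 0.7, 0.8]  # numerosity-ratio bins: <0.6, 0.6-0.7, 0.7-0.8, >=0.8

def difficulty_bin(num_ratio):
    """Index of the numerosity-ratio bin used to define 'equally difficult' trials."""
    return bisect_right(BIN_EDGES, num_ratio)

def exclude_outliers(trials, n_sd=2.0):
    """Drop responses faster than the stimulus and slow responses per participant and bin."""
    valid = [t for t in trials if t["rt_ms"] > STIMULUS_MS]
    groups = defaultdict(list)
    for t in valid:
        groups[(t["participant"], difficulty_bin(t["num_ratio"]))].append(t["rt_ms"])
    kept = []
    for t in valid:
        rts = groups[(t["participant"], difficulty_bin(t["num_ratio"]))]
        # With a single trial in a group there is no SD, so nothing is cut.
        cutoff = mean(rts) + n_sd * stdev(rts) if len(rts) > 1 else float("inf")
        if t["rt_ms"] <= cutoff:
            kept.append(t)
    return kept

# Toy data: ten ordinary responses, one very slow, one faster than the stimulus.
trials = [{"participant": 1, "num_ratio": 0.85, "rt_ms": rt}
          for rt in list(range(600, 700, 10)) + [3000, 100]]
kept = exclude_outliers(trials)
print(len(trials), len(kept))  # 12 10
```

The cutoff is one-sided (slow responses only), matching the filter described above; a two-sided rule would need only a second comparison against mean minus 2 SD.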

Confirmatory analysis

I will fit the GLMM to the dataframe. Specifically, the fixed effects will be the log-ratios of numerosity, size, and spacing, and the random effects will be participant-specific differences in slope. From the fitted model, I will compare the effect sizes of numerosity, size, spacing, and individual differences to determine which dimension, if any, is the main predictor of performance (that is, which dimension dominates approximate number discrimination).

Side-by-side graph with original graph is ideal here

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Replication Attempt

Open the discussion section with a paragraph summarizing the primary result from the confirmatory analysis and the assessment of whether it replicated, partially replicated, or failed to replicate the original result.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis, (b) assessment of the meaning of the replication (or not) - e.g., for a failure to replicate, are the differences between original and present study ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the replication attempt. None of these need to be long.