Reproducibility Report for Poli et al. (2024, Developmental Science)

Author

Victoria Hennessy (vhennessy@ucsd.edu)

Published

November 24, 2025

Introduction

My current line of research focuses on individual differences in early information seeking. To study infant learning strategies, Poli and colleagues (2020; 2024) have developed an analytic pipeline that leverages an infant-friendly visual learning (VL) task from which information-theoretic measures (e.g., predictability and information gain) can be derived for each trial in sequences of visual events.

These measures are then incorporated into a hierarchical Bayesian model fitted to real infants’ looking behavior (look-aways, looking time, and saccadic latency) in the VL task to infer the values of latent parameters that represent specific cognitive functions: processing speed, learning performance, curiosity, and sustained attention (see Figure 1).

Figure 1. The research pipeline in Poli et al. 2024 (Figure 2 in the original paper). The current project will focus on the first three boxes outlined in yellow.

For the proposed project, I have four primary aims:

(1) Reproduce the findings using the original data with automatic differentiation variational inference (ADVI).

(2) Simulate infant looking behavior in the visual learning task from the posterior distribution of the original data (e.g., use individual latent parameters from the original data to generate a simulated dataset of infant looking times). Note: this is a posterior predictive simulation.

(3) Fit the original HBM to the simulated dataset to estimate the latent parameter of interest: curiosity (information gain x looking time).

(4) Compare model fit and analyses from simulated data to the original data (Poli et al., 2024) by indexing the number of infants from the original study and the simulated dataset who display a significant coefficient for curiosity.

Working through the described pipeline will help me develop the analytic toolkit and computational understanding needed for my own research program.

Key analysis of interest: Posterior predictive check (PPC) to validate a hierarchical Bayesian model of infant attention during a probabilistic learning task. Specifically, testing whether the model can (1) generate realistic data matching observed distributions of looking time and saccadic latency, and (2) recover known parameters when fitted to simulated data, thereby confirming the model adequately captures infant learning behavior.

Anticipated challenges

Simulating these data and the analytic pipeline involves three high-level stages: input generation, hierarchical modeling, and outcome replication, each of which will pose its own computational challenges.

The code to generate the sequences for the visual learning task is publicly available. I will follow Poli et al. (2020 and 2024) to attempt to derive the information-theoretic measures for each trial in the visual learning task. This may be complex and time-consuming to undertake from scratch given the scope of the current project. In the event that this step hinders the completion of the subsequent step, I will reference the original analysis script, found here, and/or simulate just a portion of the looking behavior (e.g., looking time) to simplify the process.

Links

Project Repository

Original Paper

Original Data Repository

Methods

Reproduction:

Data Preparation (included in `original code`)

Load the raw data for the visual learning task.
- Combine the two datasets, Roris_nostd.csv and Roris_smiley.csv.
Convert to model-friendly format (pytensor variables).
Z-score the dependent variables (looking time and saccadic latency).
Z-score the independent variables (information gain, predictability, and surprise).
Handle missing values.

Verify the following columns are present in the combined dataset:

subj: subject ID
nseq: sequence number
ntrialseq: trial number within sequence
dwell: looking time to sequence
slat: saccadic latency
event: look-away (binary, either 0 or 1)
D: KL divergence or information gain (pre-computed)
H: entropy or predictability (pre-computed)
I: surprise (pre-computed)
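
The preparation steps above can be sketched roughly as follows. This is a minimal illustration rather than the original script: the function name `prepare` and the toy tables are hypothetical stand-ins for the two CSV files, and z-scoring across all trials (rather than within infant) is an assumption. NaNs are skipped when computing the mean and SD and are left in place in the output.

```python
import numpy as np
import pandas as pd

REQUIRED = ["subj", "nseq", "ntrialseq", "dwell", "slat", "event", "D", "H", "I"]
TO_ZSCORE = ["dwell", "slat", "D", "H", "I"]

def prepare(df_a: pd.DataFrame, df_b: pd.DataFrame) -> pd.DataFrame:
    """Combine two trial-level tables, verify columns, z-score while ignoring NaNs."""
    df = pd.concat([df_a, df_b], ignore_index=True)
    missing = [c for c in REQUIRED if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    out = df.copy()
    for col in TO_ZSCORE:
        x = out[col].astype(float)
        # pandas mean/std skip NaN by default; ddof=0 is the population SD
        out[col] = (x - x.mean()) / x.std(ddof=0)
    return out

# Toy stand-ins for the two input files (values are made up for illustration)
toy_a = pd.DataFrame({
    "subj": [1, 1, 1], "nseq": [1, 1, 1], "ntrialseq": [1, 2, 3],
    "dwell": [0.8, np.nan, 1.4], "slat": [0.30, 0.25, 0.40], "event": [0, 0, 1],
    "D": [0.5, 0.2, 0.1], "H": [1.0, 0.9, 0.7], "I": [1.2, 0.4, 0.3],
})
toy_b = pd.DataFrame({
    "subj": [2, 2, 2], "nseq": [1, 1, 1], "ntrialseq": [1, 2, 3],
    "dwell": [1.1, 0.9, 0.5], "slat": [0.35, 0.28, 0.22], "event": [0, 0, 1],
    "D": [0.6, 0.3, 0.2], "H": [1.0, 0.8, 0.6], "I": [1.0, 0.5, 0.2],
})
combined = prepare(toy_a, toy_b)
```

With the real data, the two frames would come from reading Roris_nostd.csv and Roris_smiley.csv; the missing trial stays missing after standardization, so the model still sees the original missingness pattern.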

Analysis Pipeline

Fit the hierarchical Bayesian model (HBM) using variational inference
- Specify relationships of interest (i.e., between looking behavior and latent parameters).
- Fit the model using ADVI.
- Derive individual estimates for:
  - Learning Performance: β₁^SL (correlation between saccadic latency and predictability)
  - Curiosity: β₁^LT (correlation between looking time and information gain)
Model validation
- Check convergence.
  
  R̂ should be below 1.004 if using MCMC. With ADVI, convergence is checked qualitatively.
- (If MCMC) Compare model fit between hierarchical vs. simple (group-only) model
Extract individual differences
- Once the model has run, compute posterior distributions for each parameter.
  
  The HBM produces a probability distribution for each parameter for each infant (e.g., infant X’s “learning performance” isn’t a single number, but a distribution of plausible values summarized by an 89% credible interval). The model represents this as 20,000 samples from the distribution.
```
import numpy as np
import pandas as pd

# `trace` and `markasgood` come from earlier in the modeling script.
# Each column holds the per-infant posterior median, taken across draws (axis=0).
posterior = pd.DataFrame()
posterior["subjnum"] = markasgood.values.reshape(-1)
posterior["LT0"] = np.median(trace["LT0"], axis=0)  # Processing speed (looking time)
posterior["LT1"] = np.median(trace["LT1"], axis=0)  # Curiosity
posterior["SL0"] = np.median(trace["SL0"], axis=0)  # Processing speed (saccadic)
posterior["SL1"] = np.median(trace["SL1"], axis=0)  # Learning performance
posterior["lambda0"] = np.median(trace["lambda0"], axis=0)  # Sustained attention
posterior["beta_LA"] = np.median(trace["beta_LA"], axis=0)  # (not used in main analysis)
posterior.to_csv("posterior_median.csv")
```
- Save to an output file in which there is one row per infant; columns = latent parameters.
gen_summary_rep.csv contains posterior summary statistics for all model parameters including the individual-level parameters of interest, learning performance (SL1) and curiosity (LT1).
- Identify infants with significant individual effects (89% credible interval excludes zero).
- Compare to original findings.
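
The significance criterion above can be illustrated with a short sketch. It assumes an equal-tailed 89% interval computed from percentiles; the original analysis may instead use a highest-density interval, which can give slightly different bounds. The function name `flag_significant` and the fake draws are hypothetical.

```python
import numpy as np

def flag_significant(samples: np.ndarray, mass: float = 0.89):
    """samples has shape (n_draws, n_infants); returns lower, upper, and a flag per infant."""
    tail = (1.0 - mass) / 2.0 * 100.0  # 5.5 for an 89% interval
    lower, upper = np.percentile(samples, [tail, 100.0 - tail], axis=0)
    # "Significant" here means the whole interval lies on one side of zero
    return lower, upper, (lower > 0) | (upper < 0)

rng = np.random.default_rng(0)
# Fake posterior draws for three hypothetical infants:
# clearly positive, centered on zero, clearly negative
draws = rng.normal(loc=[0.5, 0.0, -0.5], scale=0.1, size=(20000, 3))
lower, upper, sig = flag_significant(draws)
n_significant = int(sig.sum())
```

Summing the flag across infants gives the count that is compared against the original findings.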

Simulation and Posterior Predictive Check:

Load posterior means from gen_summary_rep.csv as “true” generating parameters for simulation.
Generate a simulated dataset from the true parameters and the original predictor structure: draw looking time and saccadic latency from a Student’s t distribution (nu=15) and look-away events from a Poisson distribution. Preserve the original missing-data structure (2,253 NaN values) and filter invalid negative values.
Standardize simulated predictors (entropy, surprise, KL-divergence) using z-scoring with nan_policy="omit", matching original preprocessing.
Re-fit the hierarchical Bayesian model to the simulated data using the identical ADVI procedure.
Compare the recovered parameters to true values using distributional fit via histograms, Q-Q plots, and individual-level significance patterns.
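
The generation step can be sketched for looking time alone. This is a simplified illustration under stated assumptions, not the project's simulation script: the linear form `LT0 + LT1 * information gain`, the residual scale `sigma`, and the function name `simulate_looking_time` are assumptions, and look-away events, saccadic latency, and the filtering of negative values are left out.

```python
import numpy as np

def simulate_looking_time(lt0, lt1, subj_idx, info_gain, observed,
                          sigma=1.0, nu=15, seed=0):
    """Draw one looking time per trial from a Student's t around a per-infant linear predictor."""
    rng = np.random.default_rng(seed)
    mu = lt0[subj_idx] + lt1[subj_idx] * info_gain       # per-trial location
    sim = mu + sigma * rng.standard_t(df=nu, size=mu.shape)
    sim[np.isnan(observed)] = np.nan                      # keep the original missingness
    return sim

# Two hypothetical infants with three trials each (all values made up)
lt0 = np.array([0.0, 0.2])        # intercepts ("true" generating values)
lt1 = np.array([0.3, -0.1])       # curiosity slopes ("true" generating values)
subj_idx = np.array([0, 0, 0, 1, 1, 1])
info_gain = np.array([-1.0, 0.0, 1.0, -1.0, 0.0, 1.0])
observed = np.array([0.1, np.nan, 0.4, 0.0, 0.2, np.nan])

sim = simulate_looking_time(lt0, lt1, subj_idx, info_gain, observed)
```

Copying the NaN mask from the observed data means the simulated set has missing trials in exactly the same positions, so the re-fit faces the same amount of information per infant.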

Specifically, I aim to validate that the model recovers both population-level effects and individual differences by achieving comparable distributional fit and parameter estimates with simulated data.

If the model adequately captures infant probabilistic learning mechanisms, it should generate data distributions matching observed patterns and recover known parameters from synthetic data.
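
One way to quantify parameter recovery, alongside the plots, is sketched below. The summary statistics chosen (Pearson correlation, mean bias, and matched quantiles as the numeric basis of a Q-Q plot) are my own illustration rather than the original authors' procedure, and `recovery_summary` and the simulated values are hypothetical.

```python
import numpy as np

def recovery_summary(true_vals, recovered, probs=(0.1, 0.25, 0.5, 0.75, 0.9)):
    """Compare generating values with re-estimated values for one parameter."""
    true_vals = np.asarray(true_vals, dtype=float)
    recovered = np.asarray(recovered, dtype=float)
    r = np.corrcoef(true_vals, recovered)[0, 1]           # rank-order agreement
    bias = float(np.mean(recovered - true_vals))          # systematic over/underestimation
    # Matched quantiles: points near the identity line indicate similar distributions
    qq = np.column_stack([np.quantile(true_vals, probs),
                          np.quantile(recovered, probs)])
    return {"r": float(r), "bias": bias, "qq": qq}

rng = np.random.default_rng(1)
true_lt1 = rng.normal(0.1, 0.05, size=100)                # hypothetical "true" slopes
recovered_lt1 = true_lt1 + rng.normal(0.0, 0.01, size=100)  # hypothetical re-estimates
summary = recovery_summary(true_lt1, recovered_lt1)
```

A high correlation with near-zero bias would suggest that individual differences are preserved, while the quantile pairs show whether the overall distribution is compressed or stretched, which matters for a hierarchical model that shrinks estimates toward the group mean.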

Differences from original study

The computing environments are the same. Visualizations may be carried out in R, but the model and any pre-processing steps will be in Python. ADVI will be used instead of MCMC as outlined above.

Project Progress Check 3

Measure of success

The outcome measure will be a successful reproduction of their findings based on the latent parameters from the Bayesian model. I aim to reproduce the following, as indexed by 89% credible intervals that exclude zero (where zero indicates lack of an effect), with both the original and the simulated data:

Learning Performance (saccadic latency x predictability): approximately 57 infants (40%) with a significant coefficient (Poli et al., 2024).
Curiosity (looking time x information gain): approximately 31 infants (22%) with a significant coefficient (Poli et al., 2024).

Pipeline progress

Modernized the original modeling script:
- Updated pymc3 to pymc, theano to pytensor
- Removed masked arrays since PyMC handles missing values automatically
Changed from true sampling to optimization for feasibility:
1. Switched to ADVI
  - Original paper and findings use MCMC with 500K tuning + 10K sampling.
  - I use ADVI with 30,000 optimization iterations and 50,000 posterior draws.
    - This is primarily for computational feasibility.
2. Added convergence tracking, which records the mean and SD of ADVI’s Gaussian approximation to the posterior at every iteration. Plotting these values shows whether the approximation has stabilized and helps me assess model fit.
3. Commented out WAIC/LOO.
  - ADVI produces approximate posterior samples, not exact MCMC samples. The trace from ADVI is generated from a variational approximation (i.e., a fitted normal distribution), not from the actual posterior distribution. ADVI doesn’t automatically compute and store pointwise log-likelihoods needed for WAIC/LOO.
4. Assessed ADVI convergence. The lines flattened out by iteration 10k, indicating that the model successfully reached an “optimal” approximation of the posteriors:
  
  Figure 2. Convergence plot after fitting model over full dataset.
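
The visual judgment that the tracked lines have flattened can be complemented by a simple numeric check. The sketch below is not part of the modeling script or of PyMC: `has_flattened`, the window size, and the tolerance are hypothetical choices, and the synthetic curve merely stands in for one tracked parameter.

```python
import numpy as np

def has_flattened(track, window=1000, rel_tol=0.01):
    """True if the mean of the last window is within rel_tol of the preceding window's mean."""
    track = np.asarray(track, dtype=float)
    if track.size < 2 * window:
        raise ValueError("need at least two windows of iterations")
    last = track[-window:].mean()
    prev = track[-2 * window:-window].mean()
    return bool(abs(last - prev) <= rel_tol * max(abs(prev), 1e-12))

# Synthetic stand-in for one tracked parameter: exponential approach to a plateau
iters = np.arange(30000)
toy_track = 1.0 - np.exp(-iters / 2000.0)

flat_at_end = has_flattened(toy_track)          # full run: plateau reached
flat_early = has_flattened(toy_track[:3000])    # first 3,000 iterations: still moving
```

Applied to every tracked mean and SD, a check like this gives a reproducible criterion for when the approximation has stabilized, although a stable approximation is not by itself evidence that it matches the true posterior.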
Fit the model with ADVI:

I did this several times with data from just 3 subjects for testing purposes, and then with all subjects in the combined dataset (shown above). Fitting using ADVI took approximately 30-40 minutes for each full run.

Ran a basic analysis script (advi_analysis.py) to evaluate the significance of individual subject coefficients from the model, based on 89% credible intervals that exclude zero.

Preliminary Results w/ ADVI:

Learning Performance (saccadic latency x predictability): approximately 63 infants (43.45%) with a significant coefficient (SL1).

Curiosity (looking time x information gain): approximately 40 infants (27.59%) with a significant coefficient (LT1).
Simulated a new dataset using the true parameters as generating values and evaluated (1) distributional fit (Figures 3 and 4) and (2) individual-level subject significance patterns (just like in Step 4).

Preliminary Results w/ ADVI:

Learning Performance: approximately 65 infants (44.83%) with a significant coefficient (SL1).

Curiosity: approximately 51 infants (35.17%) with a significant coefficient (LT1).

Results

Data preparation

Data preparation following the analysis plan.

Key analysis

Exploratory analyses

Any follow-up analyses desired (not required).

Discussion

Summary of Reproduction Attempt

Open the discussion section with a paragraph summarizing the primary result from the key analysis and assess whether you successfully reproduced it, partially reproduced it, or failed to reproduce it.

Identify one insight your simulation gave you about the strengths or limitations of the original experimental design.

Across more than 9K observations (i.e., all individual trials across infants), there were over 2K missing or NaN values (approximately 24.2%). This initially presented as a limitation, but the simulation revealed that it accurately reflects lapses in infant attention or disengagement that are informative about the cognitive processes of interest. I preserved the missingness structure in the simulation and successfully recovered parameters despite it, demonstrating that the HBM framework can appropriately handle the inevitable gaps in infants’ looking data.

How would this simulation help you design a follow-up experiment with a similar paradigm?

This simulation was highly informative about the hierarchical Bayesian modeling process, which I intend to use to extract individual differences with the same paradigm in a new set of infants. Validating the model with simulated data serves as both a proof of concept and a feasibility check for my in-person replication.

Commentary

Add open-ended commentary (if any) reflecting (a) insights from follow-up exploratory analysis of the dataset, (b) assessment of the meaning of the successful or unsuccessful reproducibility attempt - e.g., for a failure to reproduce the original findings, are the differences between original and present analyses ones that definitely, plausibly, or are unlikely to have been moderators of the result, and (c) discussion of any objections or challenges raised by the current and original authors about the reproducibility attempt (if you contacted them). None of these need to be long.