Replication of “Foundations of intuitive power analyses in children and adults” by Pelz et al. (2022, Nature Human Behaviour)

Author

Sophie Mazor (smazor@ucsd.edu)

Published

December 6, 2025

Introduction

The current project aims to replicate Experiment 1 from Pelz et al. (2022). This study broadly investigated whether children and adults intuitively perform a power analysis (i.e., whether people are sensitive to the amount of evidence needed to make a particular inference or decision). Prior work has suggested that both children and adults use task difficulty as a metric when deciding whether to persist, put in more effort, or give up when solving problems (Chevalier, 2018; Ganesan & Steinbeis, 2021; Kool et al., 2010; Serko et al., 2024; Wang & Bonawitz, 2023), and that they adjust their information search to resolve uncertainty (Bonawitz et al., 2011; Cook et al., 2011; Gottlieb et al., 2013; Lapidow & Bonawitz, 2023). Researchers have also proposed that humans are remarkably sensitive to probabilities, weighing the relative likelihood of multiple possibilities to guide their behavior and inference (Denison & Xu, 2014; Gopnik et al., 2001; Ruggeri et al., 2019; Siegel et al., 2021; Waismeyer et al., 2015). Yet it is unknown whether adults adjust their information search based on the relative difficulty of a particular problem, seeking less information for easier problems and more for harder ones.

To answer this question, Pelz et al. showed participants 10 pairs of boxes across 10 trials, where each pair contained a different distribution of colored balls. Trial difficulty varied from very hard (e.g., boxes containing 51% of one color and 49% of another) to very easy (e.g., 95% of one color and 5% of another). On each trial, participants saw two boxes, each belonging to a different character, and were asked to enter the number of balls that should be taken out of a box in order for a third party to determine which of the two boxes was being sampled. The authors measured whether participants adjusted their sampling behavior as distributions became harder or easier to discriminate. Indeed, they found that as distributions became more difficult (i.e., closer to 50/50), adults requested 0.37 ± 0.04 (standard error (s.e.); 95% CI (0.28, 0.45)) more balls for each decreasing proportion. These findings suggest that adults adjust their information search based on the relative difficulty of a particular problem.

Repository: https://github.com/psyc-201/pelz_2022

Original Paper: https://github.com/sophiemazor/pelz2022/tree/f8836bf68249a13be8bf75fb235c6bac3622d63d/original_paper

Preregistration: https://osf.io/dxbkr/overview

Experiment: https://github.com/psyc-201/pelz_2022/blob/main/index.html

Methods

Power Analysis

The paper reports using G*Power to conduct a power analysis based on pilot data. The authors found that detecting an effect size of 0.4 at a power level of 0.9 (90%) required 30 adult participants. Because this replication study will be conducted on an online platform (i.e., Prolific), this planned sample size is feasible.
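As an illustrative cross-check (not the original G*Power computation), power for this design can also be estimated by simulation. The slope below is the original paper’s reported estimate, and the variance components are placeholder assumptions loosely based on the model fit reported later in this write-up:

# simulation-based power sketch (assumed parameter values; not G*Power)
library(lmerTest)

set.seed(1)
simulate_power <- function(n_subj = 30, slope = 0.37,
                           sd_subj = 11, sd_resid = 12.5, n_sims = 200) {
  difficulty <- c(5, 10, 15, 20, 25, 30, 35, 40, 45, 49)  # the ten task difficulties, scaled as in the analysis
  hits <- replicate(n_sims, {
    d <- expand.grid(subj = factor(1:n_subj), difficulty = difficulty)
    d$y <- rnorm(n_subj, sd = sd_subj)[d$subj] +  # random intercept per participant
      slope * d$difficulty +                      # assumed fixed effect of difficulty
      rnorm(nrow(d), sd = sd_resid)               # residual noise
    fit <- lmer(y ~ difficulty + (1 | subj), data = d, REML = FALSE)
    summary(fit)$coefficients["difficulty", "Pr(>|t|)"] < 0.05
  })
  mean(hits)  # estimated power: share of simulated studies detecting the effect
}
# simulate_power()  # returns the estimated power under these assumed values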

Planned Sample

The planned sample size is 30 adults. As in the original study, participants will be excluded if they fail the attention check question, and they must be at least 18 years old to participate. There are no other eligibility criteria. The final sample contained 33 adults.

Materials

All stimuli were created digitally for the experiment using RStudio and jsPsych. Following the design of Experiment 1 in Pelz et al. (2022), participants were introduced to the task through a training trial in which they saw two boxes with inverse proportions of balls inside (i.e., 72/28 and 28/72). Following the training trial, participants completed ten separate test trials, each showing two boxes, one belonging to the “narrator” and one to another character. The boxes contained distributions of balls that ranged in how difficult they were to discriminate (i.e., 95/5, 90/10, 85/15, 80/20, 75/25, 70/30, 65/35, 60/40, 55/45, 51/49). All stimulus colors were matched to those of the original study, though the positions of individual colored balls differed slightly because the stimuli were recreated.
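For reference, these test-trial distributions map onto the difficulty metric used in the analysis below as follows:

# the ten majority proportions, from easiest to hardest to discriminate
majority <- c(95, 90, 85, 80, 75, 70, 65, 60, 55, 51) / 100
# difficulty is the distance of the majority proportion from chance
difficulty <- 0.5 - abs(majority - 0.5)  # 0.05 (easiest) through 0.49 (hardest)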

Procedure

Training Trial

The procedure for the training trial closely followed that of the original paper: adults first saw a short animated video with two boxes filled with inverse proportions of black and white balls (72/28 and 28/72). Both boxes were hidden behind a single occluder and shuffled around such that the position of each box was unknown to participants. Then, a hand reached into one of the boxes and extracted 11 balls, placing them into an occluded bowl. Participants could not see the color of each ball as it was extracted. After all 11 balls were sampled, the contents of the bowl were revealed (i.e., 8 white balls and 3 black ones). Participants were then asked which of the two boxes had been sampled from. Failure to answer this attention check question correctly was used as an exclusion criterion. This animated video differed slightly from the original paper because the stimuli were recreated; however, the conceptual content remained the same.
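To see why this reveal makes the attention check answerable, note that a mostly white sample is far more likely to have come from the mostly white box. A quick illustrative calculation (my own, not part of the original study), assuming a 50/50 prior over the two boxes:

# posterior probability that the sample came from the 72%-white box,
# given 8 white and 3 black draws and a 50/50 prior over the boxes
lik_white_box <- dbinom(8, 11, 0.72)  # likelihood under the 72/28 box
lik_black_box <- dbinom(8, 11, 0.28)  # likelihood under the 28/72 box
lik_white_box / (lik_white_box + lik_black_box)  # roughly 0.99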

Catch Trial

Following the training trial, participants were administered a catch trial in which two boxes appeared on the screen. Participants were told to select the box on a particular side of the screen if they were paying attention. Failure to answer this question correctly was also used as an exclusion criterion.

Test Trials

Participants were shown another animated video with four characters, each with a pair of boxes, to convey the range of proportions that boxes could contain during the task. As in the original paper, participants were told, ‘This time, you will be deciding how many balls I should put in the bowl. For each friend, think about how tricky it will be to figure out which box I am picking from. For some friends you might need more balls to decide which box they are picked from, and for some friends you might need fewer. Try not to ask for more balls than you need.’ Then, participants completed ten separate trials, each displaying a pair of boxes along with the prompt, ‘How many balls do you think I need to put in the bowl for you to know whether the balls came from my box or (the current character)’s box?’ Participants entered the number of samples they thought were necessary to discriminate between the two boxes into a text box. Difficulty in discriminating between the boxes ranged from very easy (95/5 and 5/95) to very difficult (51/49 and 49/51). The order in which the ten pairs were presented was randomized for each participant. Critically, adults were not told the specific proportions or the number of balls in each box; instead, they had to estimate the relative proportions and contents simply by looking at the two boxes. Participants did not receive feedback on their answers.

Reliability and Validity

The key construct measured in this project is whether adults are capable of computing an “intuitive power analysis.” This is measured by showing participants 10 pairs of boxes, each with different proportions of balls inside, and asking whether participants request more samples for harder distributions and fewer for easier ones. If so, we may conclude that adults are sensitive to the idea that different amounts of data or information are needed to solve problems of different difficulties. The authors do not report any reliability or validity information, and given that this is a novel task with one trial per distribution, there is little opportunity to examine how reliable or valid the measure is (e.g., whether effects would hold if participants were given multiple trials for the same distribution, perhaps with different colored balls). The behavioral task was the only method used in the original paper.
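For intuition about what an ideal “power analysis” looks like on this task, one can compute how many draws a normative observer would need to identify the sampled box at some target accuracy. This is only an illustrative benchmark under simplifying assumptions (sampling with replacement, majority-vote guessing, a 95% accuracy criterion), not an analysis from the original paper:

# probability of naming the correct box after n draws, guessing the box
# whose majority color matches the sample majority (ties broken by a fair coin)
p_correct <- function(n, p) {
  k <- 0:n
  probs <- dbinom(k, n, p)
  sum(probs[k > n / 2]) + 0.5 * sum(probs[k == n / 2])
}

# smallest n reaching 95% accuracy for each majority proportion
props <- c(.95, .90, .85, .80, .75, .70, .65, .60, .55, .51)
sapply(props, function(p) {
  n <- 0
  while (p_correct(n, p) < 0.95) n <- n + 1
  n
})
# easy pairs need only a few draws; 51/49 requires several thousand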

Analysis Plan

I conducted one of the key analyses from Experiment 1. First, I used a linear mixed-effects model to examine participants’ sampling behavior across the ten distributions. As in the original paper, the proportion of colored balls was entered as a fixed effect, with a random intercept for each participant. Then, I fit a null model that left out the effect of proportion. Using a likelihood ratio test (chi-squared), I compared the two models to examine which best fit the data.

Differences from Original Study

The original study was run online via Amazon’s Mechanical Turk; this replication was run on Prolific. The procedure was otherwise nearly identical to the original study, with some modifications to the attention check prompt and the number of samples extracted during the training trial (see the Procedure section for detailed deviations from the original study). Conceptually, all of the information presented to participants was the same as in the original task, so these differences are unlikely to have an impact on the claims made in the original paper.

Methods Addendum (Post Data Collection)

Actual Sample

A total of 47 adults were recruited via Prolific. Thirteen participants were excluded because they failed to correctly answer the attention check question in the training trial, and one additional participant was excluded due to a technical error, resulting in a final sample of 33 adults over the age of 18.

Differences from pre-data collection methods plan

None.

Results

Data preparation

Because each measure was identical to that of the original experiment, data preparation followed similar steps. All participants who failed the attention check after the training trial (N = 13) or the catch trial (N = 0) were excluded. The key variables used in the linear mixed-effects model were the exact numerical answers participants provided on each of the ten test trials (i.e., the number of samples participants thought should be drawn).

# analysis script for the Pelz et al. (2022) replication
# Sophie Mazor

library(tidyverse)  # attaches dplyr, tidyr, ggplot2, stringr, and readr
library(lmerTest)   # loads lme4; adds Satterthwaite p-values to lmer() output

data <- read.csv('../data/Pelz_Replication_Data.csv')

# count distinct participants in the data file (one recruited participant
# was lost to a technical error and never appears in the data)
n_total <- data %>%
  distinct(official_id) %>%
  nrow()

print(n_total)
[1] 46
# keep only participants who passed both the attention check and the catch trial
final_data_set <- data %>%
  filter(attention_check == "72White", catch_trial == "correct")

failed_attention <- n_total - n_distinct(final_data_set$official_id)  # how many participants were excluded by the checks

print(failed_attention)
[1] 13
# extract the relevant columns from the main data frame
filter_model_data <- final_data_set[, c('official_id', 'trial_name', 'samples_requested')]

# treat trial name as a factor
filter_model_data$trial_name <- factor(filter_model_data$trial_name)

# trial names begin with the majority proportion (e.g., "95_..."); extract it as a numeric proportion
filter_model_data$proportion <- as.numeric(sub("_.*", "", filter_model_data$trial_name)) / 100

# create trial_label (the leading number) for visualization
filter_model_data <- filter_model_data %>%
  mutate(trial_label = str_extract(trial_name, "^[0-9]+"))


# visualize
summary_df <- filter_model_data %>%
  group_by(trial_name) %>%
  summarise(
    min_response = min(samples_requested, na.rm = TRUE),
    mean_response = mean(samples_requested, na.rm = TRUE),
    max_response = max(samples_requested, na.rm = TRUE),
    .groups = 'drop'
  )

pastel_colors <- scales::hue_pal(l = 75, c = 60)(length(unique(summary_df$trial_name)))
summary_df$trial_name <- factor(summary_df$trial_name, levels = summary_df$trial_name)

ggplot(filter_model_data, aes(x = trial_label, y = samples_requested, fill = trial_name)) +
  geom_boxplot(
    width = 0.6,
    outlier.shape = 16,
    outlier.size = 2,
    outlier.color = "grey40",
    outlier.alpha = 0.7,
    color = "black") +
  stat_summary(fun = mean, geom = "point", shape = 18, color = "darkred", size = 3) +
  scale_fill_manual(values = pastel_colors) +
  scale_y_continuous(limits = c(0, 100), breaks = seq(0, 100, 25)) +
  theme_minimal(base_size = 14) +
  theme(
    legend.position = "none",
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank(),
    axis.text.x = element_text(hjust = .1),
    axis.ticks = element_line(color = "black", linewidth = 0.5),
    axis.ticks.length = unit(3, "pt"),
    panel.border = element_rect(color = "black", fill = NA, linewidth = 1)) +
  labs(
    x = "Proportion of Colored Balls",
    y = "Number of Balls")

# linear mixed-effects model to assess whether participants' sampling
# differs with difficulty, compared against a null model without that term

# define difficulty as the distance of the majority proportion from chance
# (proportions closest to 0.5 are the hardest to discriminate)
filter_model_data$difficulty <- 0.5 - abs(filter_model_data$proportion - 0.5)
filter_model_data$difficulty <- filter_model_data$difficulty * 100  # scale to match the original study

# Full model: includes difficulty of proportion as fixed effect
mixed_model <- lmer(samples_requested ~ difficulty + (1 | official_id), 
                           data = filter_model_data, REML = FALSE)

# Null model: intercept only (leaves out effect of proportion)
model_null <- lmer(samples_requested ~ 1 + (1 | official_id), 
                   data = filter_model_data, REML = FALSE)

# Chi-squared likelihood ratio test to compare the two models
chi_squared <- -2 * (as.numeric(logLik(model_null)) - as.numeric(logLik(mixed_model)))
p_value <- pchisq(chi_squared, df = 1, lower.tail = FALSE)

print(chi_squared)
[1] 111.6167
print(p_value)
[1] 4.335204e-26
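# note: the built-in comparison anova(model_null, mixed_model) performs the
# same likelihood ratio test, reproducing the chi-squared statistic and p value above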
# calculate confidence intervals (95%)
confidence_interval <- confint(mixed_model, parm = "difficulty", level = 0.95, method = "Wald")

print(confidence_interval)
               2.5 %    97.5 %
difficulty 0.4662813 0.6553542
# summary
summary(mixed_model)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's
  method [lmerModLmerTest]
Formula: samples_requested ~ difficulty + (1 | official_id)
   Data: filter_model_data

      AIC       BIC    logLik -2*log(L)  df.resid 
   2699.7    2714.9   -1345.9    2691.7       328 

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-1.9169 -0.5562 -0.1083  0.3625  4.3481 

Random effects:
 Groups      Name        Variance Std.Dev.
 official_id (Intercept) 121.6    11.03   
 Residual                156.6    12.51   
Number of obs: 332, groups:  official_id, 33

Fixed effects:
             Estimate Std. Error        df t value Pr(>|t|)    
(Intercept)   2.70075    2.42684  64.97299   1.113     0.27    
difficulty    0.56082    0.04823 299.09300  11.627   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Correlation of Fixed Effects:
           (Intr)
difficulty -0.543

Confirmatory analysis

To examine whether participants’ sampling behavior differs as a result of distribution difficulty, I used a linear mixed-effects model with trial difficulty as a fixed effect and a random intercept for each participant. I then used a likelihood ratio test to compare the full model to a null model that left out the effect of difficulty, assessing which model best fit the data. The graph from Experiment 1 was also replicated, as shown below.

My analysis finds that adults were sensitive to the difficulty of the discrimination problem, asking for more balls as the discrimination problems became more difficult (χ²(1) = 111.62, p < 0.001), requesting 0.56 ± 0.05 (standard error (s.e.); 95% CI (0.47, 0.66)) more balls for each decreasing proportion.

Pelz et al., 2022 Graph

Replication Graph

Exploratory analyses

None.

Discussion

Summary of Replication Attempt

The primary result indeed replicated the original finding. The full model, which included a fixed effect of difficulty, fit the data significantly better than the null model, suggesting that participants were using difficulty to guide their sampling behavior. As in the original paper, this difference was highly significant (p < 0.001). Additionally, the original paper found that participants requested 0.37 ± 0.04 (standard error (s.e.); 95% CI (0.28, 0.45)) more balls for each decreasing proportion. I find a similar effect: participants requested 0.56 ± 0.05 (s.e.; 95% CI (0.47, 0.66)) more balls for each decreasing proportion. As seen in the graph visualizations, the relationship between sampling behavior and distribution difficulty is approximately linear in both experiments. Overall, the results replicate the key finding of the original paper.

These findings hold key implications for adults’ information search. Not only are adults sensitive to task difficulty (e.g., Chevalier, 2018; Ganesan & Steinbeis, 2021; Kool et al., 2010; Serko et al., 2024; Wang & Bonawitz, 2023), they use task difficulty as a metric to adapt how much information they seek out to solve a particular problem. Adults do not sample a constant amount of evidence for every problem they encounter; rather, they tailor their information search to solve tasks efficiently. Just as scientists do, people recognize the level of statistical power necessary to answer different questions: when probability distributions overlap more, the amount of power or information needed to distinguish between them increases.

Commentary

This replication reinforces the findings of the original paper: as a group, adults use problem difficulty to adjust their information search. A puzzling finding from both the original paper and this replication is the linearity between participants’ sampling and distribution difficulty. Prior work (e.g., Vul et al., 2014) proposes that this relationship may be U-shaped rather than linear: when information gain is costly, as it often is in the real world, people may seek out less information for the most difficult problems, reflecting a sensitivity to when further effort may be unproductive. This is a potential limitation of the current study design. Even though participants were instructed not to ask for more samples than necessary, they were not explicitly given the option to request zero information, and asking for 100 samples was just as feasible as asking for three. Future work should examine information search in more costly scenarios to explore whether a true cost to sampling influences behavior. These findings also raise further questions about whether individuals vary in the amount of information they prefer before committing to a decision, a question I am currently pursuing in my graduate work.
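To make the cost intuition concrete, here is a toy illustration (my own sketch with an assumed payoff structure, not an analysis from Vul et al., 2014 or the original paper). If each sample carries a small cost and a correct guess earns a fixed reward, the optimal number of samples rises with difficulty only up to a point, then collapses toward zero for near-impossible discriminations:

# toy cost-of-sampling illustration (assumed reward of 1 per correct guess)
p_correct <- function(n, p) {  # accuracy of majority-vote guessing after n draws
  k <- 0:n
  probs <- dbinom(k, n, p)
  sum(probs[k > n / 2]) + 0.5 * sum(probs[k == n / 2])
}

optimal_n <- function(p, cost, n_max = 200) {
  payoff <- sapply(0:n_max, function(n) p_correct(n, p) - cost * n)  # expected net payoff
  (0:n_max)[which.max(payoff)]
}

sapply(c(.95, .75, .55, .51), optimal_n, cost = 0.01)
# moderately hard pairs warrant the most samples; near-chance pairs warrant almost none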