Methods
Participants
Data were collected from 149 adults recruited via Prolific on Monday,
5/5/2025. Participants were required to be in the United States, to be
fluent in English, and to have not participated in the earlier pilot of
this study.
Participants were paid $2.50 for an estimated 8.5-11 minute task. In
practice, the study generally took participants about 14 minutes.
The final sample included 141 adults (n = 46-48 in each of the 3
conditions).
pop | n
short | 48
med | 46
tall | 47
Exclusion criteria
Eight participants (5.4% of all participants) were excluded for meeting
at least 1 of the following exclusion criteria:
- failing the sound check (n = 1 participant)
- failing to select the correct task description (i.e., did not
select “Watching videos about fictional people from an island”) (n = 7
participants)
Because a high percentage of participants failed to check
both “A boat” and “Zarpies”, we did not exclude on that
basis.
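The exclusion arithmetic can be checked directly. The following R sketch uses only the counts reported above; the variable names are illustrative and are not taken from the analysis script, and it assumes that no participant failed both checks.

```r
# Counts reported in the text (variable names are illustrative)
n_recruited <- 149
n_excluded  <- 1 + 7  # sound check + task description (assumes no overlap)

# Final sample size and percentage of recruited participants excluded
n_final      <- n_recruited - n_excluded
pct_excluded <- round(100 * n_excluded / n_recruited, 1)

n_final
pct_excluded
```

Under that assumption, the result should agree with the final sample of 141 and the 5.4% exclusion rate reported above.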
Demographics

    | mean | sd | n
age | 43.80 | 14.16 | 141
- The sample skewed young in age.
gender | n | prop
Male | 71 | 50.4%
Female | 70 | 49.6%
- The sample was nearly evenly split between male and female
participants.
race | n | prop
White, Caucasian, or European American | 90 | 63.8%
Black or African American | 33 | 23.4%
South or Southeast Asian | 5 | 3.5%
White, Caucasian, or European American; Native American, American Indian, or Alaska Native | 3 | 2.1%
East Asian | 2 | 1.4%
Hispanic or Latino/a | 2 | 1.4%
White, Caucasian, or European American; East Asian | 2 | 1.4%
Native American, American Indian, or Alaska Native | 1 | 0.7%
Native Hawaiian or other Pacific Islander | 1 | 0.7%
Prefer not to specify | 1 | 0.7%
White, Caucasian, or European American; Hispanic or Latino/a | 1 | 0.7%
- The sample was also racially diverse.
education | n | prop
High school/GED | 16 | 11.3%
Some college | 30 | 21.3%
Bachelor's (B.A., B.S.) | 63 | 44.7%
Master's (M.A., M.S.) | 28 | 19.9%
Doctoral (Ph.D., J.D., M.D.) | 3 | 2.1%
Prefer not to specify | 1 | 0.7%
- The sample was mostly college-educated.
Procedure
This study was administered as a Qualtrics
survey, and approved by the NYU IRB (IRB-FY2024-9169).
After providing their consent, participants completed a captcha and
sound check, and were asked to watch the videos with sound on.
Participants then watched the following videos in order:
In the prior setting and familiarization phase,
participants saw an actual picture of 5 human adults and then another
picture of a different 5 adults appear on screen against a grid. These
adults were all 10 gridline units tall.
In the boat introduction, all participants saw a
boat that was 7 units tall. The boat height was specified to be
accidental (“When the boat builders were building the boat, they started
building the boat from the bottom, but ran out of the special wood they
needed for the boat! So the boat ended up being this tall. It might be
hard for anyone who is taller than the boat to get on the boat.”), to
avoid any justificatory reasoning about the height of the boat being
informative about the height of Zarpies or vice versa.
In the boat boarding phase, participants saw a
parade of Zarpies attempt to board the boat to visit us, one at a time.
Participants were told that they were all grown-up Zarpies. Unlike the
last pilot, participants saw that Zarpie island had many Zarpies, and
were told that these Zarpies’ names “were drawn out of a hat to try and
visit us”.
Like the last pilot, the sample (the Zarpies who successfully boarded
the boat) was held constant across conditions: the same six Zarpies, of
heights 4, 5, 6, 6, 7, and 8, boarded in every condition.
To validate the paradigm, the population (the parade of Zarpies who
attempted to board the boat) was visible and differed across conditions.
In the short population condition, the parade was Zarpies of
heights (4, 5, 6, 6, 7, 8). All Zarpies who attempted to board
successfully boarded (6 out of 6 successful = 100% successful), with the
last Zarpie (height 8) stooping to board, since they are a bit taller
than the boat ceiling (7 units tall).
In the medium population condition, the parade was Zarpies of
heights (4, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10). Not all Zarpies who
attempted to board were successful in boarding (6 out of 11 successful =
54.5% successful). The second Zarpie of height 8 stooped to board, since
they are a bit taller than the boat ceiling (7 units tall).
In the tall population condition, the parade was Zarpies of
heights (4, 5, 6, 6, 7, 8, 8, 8, 8, 8, 8, 9, 10, 10, 11, 12). Not all
Zarpies who attempted to board were successful in boarding (6 out of 16
successful = 37.5% successful). The second Zarpie of height 8 stooped to
board, since they are a bit taller than the boat ceiling (7 units tall).
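The boarding success rates quoted for the three conditions follow from the parade sizes, since six Zarpies board in every condition. A minimal R sketch of that arithmetic (the object names `parades`, `n_boarded`, and `pct_success` are illustrative, not from the analysis script):

```r
# Parade heights per condition, as listed in the text
parades <- list(
  short = c(4, 5, 6, 6, 7, 8),
  med   = c(4, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10),
  tall  = c(4, 5, 6, 6, 7, 8, 8, 8, 8, 8, 8, 9, 10, 10, 11, 12)
)

# Six Zarpies board successfully in every condition
n_boarded <- 6

# Percentage of each parade that boarded, rounded to one decimal place
pct_success <- sapply(parades, function(h) round(100 * n_boarded / length(h), 1))
pct_success
```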
After the boat boarding phase, participants were asked a
memory check: “Did all of the Zarpies board the boat?”
(yes/no), and received either an affirmation or a correction.
In the sample observation phase, all
participants saw the Zarpies who successfully boarded the boat get off
the boat to visit us. The Zarpies got off one at a time, and each
waved (and, if they had stooped to board, straightened up) as they got
off. The heights of this observed sample (4, 5, 6, 6, 7, 8) were held
constant across conditions.
Sample.
To emphasize the height of the Zarpies relative to the boat,
participants watched Zarpies deboard the boat, wave, reboard the boat
(with any Zarpies taller than the boat stooping down again to board),
and deboard again (with any Zarpies taller than the boat straightening
up again).
Participants were then asked the following DVs in fixed order:
- Participants were asked the average height of the Zarpies who
visited (Sample representation) and the average height of Zarpies on
Zarpie island (Population representation), in counterbalanced order.
- Participants were asked an explicit comparison question asking
them to compare the heights of Zarpies on Zarpie island to those of
Zarpies who visited: shorter, about the same, or taller.
- Participants were shown pairs of Zarpies (6v7, 6v8, 7v8), told that
one Zarpie is from Zarpie island and one is a Zarpie who visited, and
asked to guess which one is the Zarpie on Zarpie island.
Finally, participants were also asked for feedback at the end of the
task: any problems or confusion they had, and what they thought the task
was about (see Participant
feedback).
Primary results
Sample representation
As a check that they could retrieve the mean of the sample they
observed, participants were asked, “Which picture shows the average
height of the Zarpies who visited?” Response options were a Zarpie of
height 4, 5, 6, 7, or 8.
Sample question.
Since all participants saw the same sample (4, 5, 6, 6, 7, 8), all
participants should provide the same response regardless of condition.
This response is expected to be the mean of the sample: 6.

As expected, there was no main effect of population condition
on sample representations (in a simple linear regression),
since all participants observed the same sample (6 Zarpies: 4, 5, 6, 6,
7, 8).
library(dplyr) # provides the %>% pipe
library(car)   # provides Anova()
lm(dv_sample ~ pop,
   data = data) %>%
  Anova()
## Anova Table (Type II tests)
##
## Response: dv_sample
## Sum Sq Df F value Pr(>F)
## pop 0.098 2 0.1645 0.8485
## Residuals 40.895 138
In contrast to the earlier pilots, which used the “how tall” wording,
participants’ responses did not differ from the true mean and mode of
the sample, 6 (t(140) = 0.16, p = .88), suggesting that the change in
wording to “average height” successfully warded off reports of the
“tallest/extreme” height.
t.test(data %>%
select(dv_sample),
mu = mean(observed_sample)) # true mean of observed sample = 6
##
## One Sample t-test
##
## data: data %>% select(dv_sample)
## t = 0.15563, df = 140, p-value = 0.8765
## alternative hypothesis: true mean is not equal to 6
## 95 percent confidence interval:
## 5.916997 6.097187
## sample estimates:
## mean of x
## 6.007092
Just to confirm we warded off the “extreme” reading of the question,
we can look at the precise breakdown of responses, and see that the
responses show a more continuous pattern, rather than the dichotomous
6v8 pattern from earlier pilots.

Population representation
As a check for their representation of the
population, participants were asked: “Which picture shows the
average height of Zarpies on Zarpie island?” Response options were a
Zarpie of height 4, 5, 6, 7, or 8.
Population question.
If this question is a valid measure of participants’ representation
of the average height of Zarpies, and participants remember how tall
Zarpies are in the boarding scene and use that as their representation
of Zarpies on Zarpie island, the expected response in each condition
is:
- pop short: (4, 5, 6, 6, 7, 8) -> population mean = 6
- pop med: (4, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10) -> population mean = 7.27
- pop tall: (4, 5, 6, 6, 7, 8, 8, 8, 8, 8, 8, 9, 10, 10, 11, 12) -> population mean = 8
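The expected responses listed above are the arithmetic means of the three parades. As a quick check, the following R sketch recomputes them along with the expected gap from the sample mean; the object names are illustrative and not taken from the analysis script.

```r
# Parade heights per condition, as listed in the text
parades <- list(
  short = c(4, 5, 6, 6, 7, 8),
  med   = c(4, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10),
  tall  = c(4, 5, 6, 6, 7, 8, 8, 8, 8, 8, 8, 9, 10, 10, 11, 12)
)

# Expected population response in each condition
pop_means <- sapply(parades, mean)
round(pop_means, 2)

# Expected gap between population and sample (the sample equals the short parade)
round(pop_means - mean(parades$short), 2)
```

Note that the response scale only runs from 4 to 8, so a participant in the tall condition who reports the true population mean would pick the top of the scale.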

Indeed, participants’ reports of the population height differed
significantly across conditions (F(2, 138) = 7.77, p < .001).
Post hoc pairwise comparisons (FDR-adjusted) reveal that participants
made taller population inferences in the tall condition than in the
medium condition (t(138) = -2.21, p = .043) and than in the short
condition (t(138) = -3.93, p < .001). The short and medium conditions
did not significantly differ (t(138) = -1.69, p = .094).
lm_pop <- lm(dv_pop ~ pop,
data = data)
lm_pop %>%
Anova()
## Anova Table (Type II tests)
##
## Response: dv_pop
## Sum Sq Df F value Pr(>F)
## pop 9.181 2 7.7672 0.000636 ***
## Residuals 81.557 138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm_pop %>%
cohens_f()
## # Effect Size for ANOVA
##
## Parameter | Cohen's f | 95% CI
## -----------------------------------
## pop | 0.34 | [0.18, Inf]
##
## - One-sided CIs: upper bound fixed at [Inf].
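For readers who want to see how the F statistic and Cohen's f relate to the ANOVA table above, here is a small R sketch that recomputes both from the reported sums of squares. It relies on the standard identity that, for a one-way design, Cohen's f is the square root of SS_effect / SS_residual; the variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table for dv_pop
ss_pop <- 9.181
df_pop <- 2
ss_res <- 81.557
df_res <- 138

# F = mean square of the effect / mean square of the residuals
f_stat <- (ss_pop / df_pop) / (ss_res / df_res)
p_val  <- pf(f_stat, df_pop, df_res, lower.tail = FALSE)

# Cohen's f for a one-way design: sqrt(SS_effect / SS_residual)
cohens_f_val <- sqrt(ss_pop / ss_res)

round(c(F = f_stat, p = p_val, f = cohens_f_val), 4)
```

Because the inputs are rounded table values, the recomputed numbers should match the reported ones only up to rounding.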
lm_pop %>%
emmeans("pop") %>%
pairs(adjust = "FDR") %>%
summary()
## contrast estimate SE df t.ratio p.value
## short - med -0.267 0.159 138 -1.685 0.0943
## short - tall -0.620 0.158 138 -3.931 0.0004
## med - tall -0.353 0.159 138 -2.213 0.0428
##
## P value adjustment: fdr method for 3 tests
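To make the FDR adjustment concrete, the sketch below recovers unadjusted two-sided p-values from the reported t ratios and applies the Benjamini-Hochberg procedure with base R's `p.adjust()`. The t ratios are copied from the table above (so they carry rounding error), and the variable names are illustrative.

```r
# t ratios and residual df copied from the emmeans output above
t_ratio <- c("short - med" = -1.685, "short - tall" = -3.931, "med - tall" = -2.213)
df_resid <- 138

# Unadjusted two-sided p-values from the t distribution
p_raw <- 2 * pt(-abs(t_ratio), df = df_resid)

# Benjamini-Hochberg (FDR) adjustment across the 3 tests
p_fdr <- p.adjust(p_raw, method = "fdr")

round(rbind(p_raw, p_fdr), 4)
```

With three tests, the smallest raw p-value is multiplied by 3, the middle one by 3/2, and the largest is left unchanged, which is why the short vs. medium contrast keeps its raw p-value.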
Just to confirm we warded off the “extreme” reading of the question,
we can look at the precise breakdown of responses, and see that the
responses show a more continuous pattern, rather than the dichotomous 6v8
pattern from earlier pilots.

Explicit comparison
Participants were explicitly asked to compare the population to the
sample: “Do you think the Zarpies on Zarpie island are shorter,
about the same, or taller than the Zarpies who
visited?”

pop | shorter | about the same | taller
short | 12% | 79% | 8%
med | 9% | 46% | 46%
tall | 9% | 38% | 53%
Should we be worried that participants in the medium and tall
conditions were not at ceiling for reporting that the Zarpies on Zarpie
island are “taller” than the Zarpies who visited (46-53%)?
- They might be “about the same”, in the sense that they are all
Zarpies at the end of the day?
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.00001069
## alternative hypothesis: two.sided
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.0001137
## alternative hypothesis: two.sided
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.000003323
## alternative hypothesis: two.sided
##
## Fisher's Exact Test for Count Data
##
## data: .
## p-value = 0.7479
## alternative hypothesis: two.sided
Participants’ explicit comparison responses differed by condition
(p < .001, Fisher’s exact).
Specifically, responses in the short population condition differed from
responses in the medium and tall conditions (ps < .001, Fisher’s exact),
but responses in the medium and tall conditions did not differ from each
other (p = .75, Fisher’s exact).
These results suggest that participants are sensitive to the fact
that the population must be taller if taller Zarpies were cut off from
boarding, as in the medium and tall conditions, but that the difference
between the medium and tall conditions is relatively subtle and not
captured by this explicit measure.
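The Fisher's exact tests can be illustrated with a contingency table of counts. The raw counts are not reported in this write-up, so the sketch below reconstructs them from the rounded percentages and the condition sizes (48, 46, 47); treat these counts, and all object names, as assumptions rather than as the analysis data.

```r
# Counts reconstructed from rounded percentages and condition ns (an assumption)
counts <- matrix(
  c(6, 38,  4,
    4, 21, 21,
    4, 18, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(pop = c("short", "med", "tall"),
                  response = c("shorter", "about the same", "taller"))
)

# Row percentages, for comparison with the table in the text
round(100 * prop.table(counts, margin = 1))

# Omnibus test across all three conditions, then pairwise tests
p_overall    <- fisher.test(counts)$p.value
p_short_med  <- fisher.test(counts[c("short", "med"), ])$p.value
p_short_tall <- fisher.test(counts[c("short", "tall"), ])$p.value
p_med_tall   <- fisher.test(counts[c("med", "tall"), ])$p.value

c(overall = p_overall, short_med = p_short_med,
  short_tall = p_short_tall, med_tall = p_med_tall)
```

If the reconstructed counts are right, the pattern should mirror the one described above: the short condition stands apart, while the medium and tall conditions are not distinguishable on this measure.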
Pairwise forced-choice
A new measure piloted in this study presented participants with pairs
of Zarpies of different heights, one of which was a Zarpie on Zarpie
island and the other a Zarpie who visited, and asked them to guess which
was the Zarpie on Zarpie island.
The pairs tested were 6v8, 7v8, and 6v7, in randomized order.
6v8
Forced choice question.

Figure. Pairwise forced-choice: 6v8 (“Which one is the Zarpie on Zarpie island?”)
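There is a significant main effect of condition (F(2, 138) =
3.24, p = .042). However, no pairwise comparison reached significance
after FDR adjustment: participants in the medium and tall conditions
were only marginally more likely than those in the short condition to
choose “8” (short vs. medium: t(138) = -2.04, p = .066; short vs. tall:
t(138) = -2.33, p = .064), and the medium and tall conditions did not
differ (t(138) = -0.28, p = .78).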
lm_fc_6v8 <- lm(dv_fc_6v8 ~ pop,
data = data)
lm_fc_6v8 %>%
Anova()
## Anova Table (Type II tests)
##
## Response: dv_fc_6v8
## Sum Sq Df F value Pr(>F)
## pop 5.673 2 3.2357 0.04234 *
## Residuals 120.966 138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm_fc_6v8 %>%
emmeans("pop") %>%
pairs(adjust = "FDR") %>%
summary()
## contrast estimate SE df t.ratio p.value
## short - med -0.3931 0.193 138 -2.035 0.0656
## short - tall -0.4477 0.192 138 -2.330 0.0637
## med - tall -0.0546 0.194 138 -0.281 0.7791
##
## P value adjustment: fdr method for 3 tests
6v7
Forced choice question.

Figure. Pairwise forced-choice: 6v7 (“Which one is the Zarpie on Zarpie island?”)
There is a significant main effect of condition (F(2, 138) =
3.74, p = .026). Participants in the medium and tall conditions were
significantly more likely to choose “7” over “6” than those in the short
condition (short vs. medium: t(138) = -2.43; short vs. tall: t(138) =
-2.29; both ps = .036), but the medium and tall conditions did not
significantly differ from each other (t(138) = 0.16, p = .88).
Note that in all three conditions, there were 2 Zarpies of height 6
and 1 Zarpie of height 7 seen in the population, all of whom
successfully boarded into the sample.
The fact that there remains a difference by condition shows that
participants are integrating across all the heights seen, rather than
matching height to height.
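The premise of this argument, that heights 6 and 7 appear equally often in every parade, can be checked by counting them. A minimal R sketch, with illustrative object names:

```r
# Parade heights per condition, as listed in the Procedure
parades <- list(
  short = c(4, 5, 6, 6, 7, 8),
  med   = c(4, 5, 6, 6, 7, 8, 8, 8, 9, 9, 10),
  tall  = c(4, 5, 6, 6, 7, 8, 8, 8, 8, 8, 8, 9, 10, 10, 11, 12)
)

# How many Zarpies of height 6 and of height 7 appear in each parade
n_six   <- sapply(parades, function(h) sum(h == 6))
n_seven <- sapply(parades, function(h) sum(h == 7))

rbind(n_six, n_seven)
```

Since these counts are identical across conditions, a strategy of matching each test height to its observed frequency would predict no condition difference on the 6v7 item.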
lm_fc_6v7 <- lm(dv_fc_6v7 ~ pop,
data = data)
lm_fc_6v7 %>%
Anova()
## Anova Table (Type II tests)
##
## Response: dv_fc_6v7
## Sum Sq Df F value Pr(>F)
## pop 1.6877 2 3.7401 0.02619 *
## Residuals 31.1350 138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm_fc_6v7 %>%
emmeans("pop") %>%
pairs(adjust = "FDR") %>%
summary()
## contrast estimate SE df t.ratio p.value
## short - med -0.2382 0.0980 138 -2.431 0.0355
## short - tall -0.2230 0.0975 138 -2.287 0.0355
## med - tall 0.0153 0.0985 138 0.155 0.8771
##
## P value adjustment: fdr method for 3 tests
7v8
Forced choice question.

Figure. Pairwise forced-choice: 7v8 (“Which one is the Zarpie on Zarpie island?”)
There is a marginal main effect of condition (F(2, 138) = 2.98,
p = .054). The only pairwise difference was itself marginal:
participants in the tall condition were marginally more likely than
those in the short condition to choose “8” (t(138) = -2.42,
p = .051).
lm_fc_7v8 <- lm(dv_fc_7v8 ~ pop,
data = data)
lm_fc_7v8 %>%
Anova()
## Anova Table (Type II tests)
##
## Response: dv_fc_7v8
## Sum Sq Df F value Pr(>F)
## pop 1.432 2 2.9792 0.05411 .
## Residuals 33.177 138
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
lm_fc_7v8 %>%
emmeans("pop") %>%
pairs(adjust = "FDR") %>%
summary()
## contrast estimate SE df t.ratio p.value
## short - med -0.1495 0.101 138 -1.477 0.2128
## short - tall -0.2434 0.101 138 -2.419 0.0507
## med - tall -0.0939 0.102 138 -0.923 0.3575
##
## P value adjustment: fdr method for 3 tests
Secondary results
Sample vs population
As an implicit comparison, we can compare participants’ responses to
the sample question to their responses to the population question using
paired t-tests.

##
## Paired t-test
##
## data: data %>% filter(pop == "short") %>% pull(dv_sample) and data %>% filter(pop == "short") %>% pull(dv_pop)
## t = -1.2188, df = 47, p-value = 0.229
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.27610394 0.06777061
## sample estimates:
## mean difference
## -0.1041667
As expected, participants in the short condition did not give
different responses to the sample and population questions (t(47) =
-1.22, p = .23), since in the short condition the sample and the
population are identical.
##
## Paired t-test
##
## data: data %>% filter(pop == "med") %>% pull(dv_sample) and data %>% filter(pop == "med") %>% pull(dv_pop)
## t = -3.367, df = 45, p-value = 0.001565
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -0.6601253 -0.1659617
## sample estimates:
## mean difference
## -0.4130435
##
## Paired t-test
##
## data: data %>% filter(pop == "tall") %>% pull(dv_sample) and data %>% filter(pop == "tall") %>% pull(dv_pop)
## t = -5.4045, df = 46, p-value = 0.000002236
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
## -1.0804373 -0.4940307
## sample estimates:
## mean difference
## -0.787234
In contrast, participants in the medium condition and
participants in the tall condition each gave taller responses to the
population question than to the sample question (medium: t(45) = -3.37,
p = .0016; tall: t(46) = -5.40, p < .001). This makes sense because, in
those conditions, the taller portion of the population was cut off
from boarding.
This result supports the idea that participants in the medium and
tall conditions know the population differs from the sample, i.e., that
the population is taller than the sample.
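As a consistency check on the paired t-test output, the t statistic can be recovered from the reported mean difference and confidence interval, because the half-width of a 95% CI equals the standard error times the critical t value. The sketch below does this for the medium condition; the numbers are copied from the output above and the variable names are illustrative.

```r
# Values copied from the paired t-test output for the medium condition
mean_diff <- -0.4130435
ci        <- c(-0.6601253, -0.1659617)
df        <- 45

# Half-width of the 95% CI = SE * qt(0.975, df), so solve for SE
se <- diff(ci) / (2 * qt(0.975, df))

# t statistic and two-sided p-value
t_stat <- mean_diff / se
p_val  <- 2 * pt(-abs(t_stat), df)

round(c(se = se, t = t_stat, p = p_val), 4)
```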
Were participants in the tall condition significantly more likely
than participants in the medium condition to give taller responses to
the population than sample questions?
## Anova Table (Type II tests)
##
## Response: response
## Sum Sq Df F value Pr(>F)
## pop 3.725 2 4.1981 0.015995 *
## dv 13.195 1 29.7408 0.0000001097 ***
## pop:dv 5.553 2 6.2582 0.002197 **
## Residuals 122.452 276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Yes, there is a significant interaction between population condition
(short, med, tall) and DV (sample vs. population) (F(2, 276) = 6.26,
p = .0022).
What if we drill down even further to just the medium and tall
conditions? There is a marginal interaction between population condition
(med, tall) and DV (sample vs. population) (F(1, 182) = 3.20,
p = .075).
## Anova Table (Type II tests)
##
## Response: response
## Sum Sq Df F value Pr(>F)
## pop 1.278 1 2.5138 0.11459
## dv 16.860 1 33.1534 0.00000003562 ***
## pop:dv 1.628 1 3.2003 0.07529 .
## Residuals 92.556 182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
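In the medium-versus-tall analysis, the pop:dv interaction has 1 numerator degree of freedom and 182 residual degrees of freedom, as the table above shows. The following R sketch recomputes the F statistic from the sums of squares and the p-value from those degrees of freedom; the variable names are illustrative.

```r
# Values copied from the ANOVA table restricted to the medium and tall conditions
ss_interaction <- 1.628
df_num         <- 1
ss_res         <- 92.556
df_den         <- 182

# F statistic recomputed from the (rounded) sums of squares
f_check <- (ss_interaction / df_num) / (ss_res / df_den)

# p-value for the reported F with 1 and 182 degrees of freedom
p_interaction <- pf(3.2003, df_num, df_den, lower.tail = FALSE)

round(c(F = f_check, p = p_interaction), 4)
```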
Order effects
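Participants saw the two DVs in counterbalanced order:
- pop_sample = population DV first, then sample DV
- sample_pop = sample DV first, then population DV
There was no effect of DV order on sample responses (F(1, 135) = 0.70,
p = .41), nor on population responses (F(1, 135) = 0.86, p = .36), and
DV order did not interact with condition for either measure (ps > .57).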

lm(dv_sample ~ pop * cb_dvorder,
data = data) %>%
Anova()
## Anova Table (Type II tests)
##
## Response: dv_sample
## Sum Sq Df F value Pr(>F)
## pop 0.127 2 0.2126 0.8087
## cb_dvorder 0.208 1 0.6975 0.4051
## pop:cb_dvorder 0.336 2 0.5621 0.5714
## Residuals 40.351 135

lm(dv_pop ~ pop * cb_dvorder,
data = data) %>%
Anova()
## Anova Table (Type II tests)
##
## Response: dv_pop
## Sum Sq Df F value Pr(>F)
## pop 9.078 2 7.5674 0.0007674 ***
## cb_dvorder 0.513 1 0.8557 0.3566043
## pop:cb_dvorder 0.072 2 0.0599 0.9419039
## Residuals 80.972 135
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1