Summary

Goals

The primary goal of this pilot was to establish basic comprehension of the paradigm among adults by addressing any confusion or problems arising from the task.
- Selection of the items mentioned in the task (“A boat” and “Fictional animals”) was surprisingly inaccurate (many selected only “A boat” and not “Fictional animals”), so the exclusion criterion was loosened to checking at least one of the two (see Exclusion criteria). Participants may have thought it was a single-select question.
- Participants generally reported no issues with the task and were naive to the purpose of the task (see Task comprehension).
The secondary goal of this pilot was to calibrate our expectations on a ballpark effect size for powering a subsequent high-powered preregistered study with adults, although the sample size of this pilot was arbitrary and not designed to detect an effect.
- The expected effect of boat height on population inferences was not detected in this pilot. If anything, the effect appears to go slightly in the opposite direction, such that participants think that Zarpies in general are slightly taller as the boat gets taller (see Population inferences).
- Some speculative reasons for this include justificatory reasoning (whatever height the boat is must be the ideal height for boarding Zarpies, so Zarpies are around boat height) or perceptual scaling to the stimuli (things are bigger on screen, so Zarpies must be bigger too).

Possibilities and next steps

Possibility 1: Adults do adjust their inferences about social groups based on sampling processes, but this task paradigm fails to elicit such inferences, or such inferences are masked by details of this particular task.
- If the boat training is too subtle in communicating that the observed Zarpies are the outcome of a selective process, we could pilot again with some changes: explicitly ask adults to explain their population inferences and whether they think the observed Zarpies are a representative sample, and switch to a more sensitive measure (e.g., a continuous slider scale) to better gauge the effect in adults (though sliders would ultimately be challenging to use with children).
- To address justificatory reasoning, we could add the verbal manipulation (brainstormed in an earlier version of this study), explicitly stating that the height of the boat is accidental/unrelated to the height of Zarpies.
- Addressing perceptual scaling might require moving completely away from a height-based paradigm to some other non-perceptual property. This would be significantly more demanding for children though.
Possibility 2: Adults do adjust their inferences about social groups based on sampling processes, but the effect size is quite small and/or this pilot is simply not designed to detect such effects, e.g., the sample is too small to provide adequate power.
- We could run a preregistered study with a bigger sample, powered to detect the smallest effect size of realistic interest.
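As a rough sketch of how such a sample size might be planned, the base-R snippet below uses the Fisher z approximation for detecting a correlation of a given size between boat height and population inferences. The helper `n_for_correlation` and both example effect sizes are illustrative assumptions, not values fixed by this pilot: 0.22 is roughly the square root of the R-squared from the pilot's population regression (which ran opposite to the predicted direction, so only its magnitude is borrowed), and 0.10 is a hypothetical smallest effect size of interest.

```r
# Approximate total N needed to detect a correlation of size r
# (two-sided test) using the Fisher z approximation.
# Hypothetical helper; not part of the original analysis script.
n_for_correlation <- function(r, alpha = 0.05, power = 0.80) {
  z_alpha <- qnorm(1 - alpha / 2)  # critical value for the two-sided test
  z_power <- qnorm(power)          # quantile matching the target power
  ceiling(((z_alpha + z_power) / atanh(r))^2 + 3)
}

# Assumed effect sizes (placeholders, see the paragraph above)
r_pilot <- sqrt(0.0479)  # magnitude implied by the pilot's R-squared
r_sesoi <- 0.10          # hypothetical smallest effect size of interest

n_pilot_sized <- n_for_correlation(r_pilot)
n_sesoi_sized <- n_for_correlation(r_sesoi)
print(c(pilot_sized = n_pilot_sized, sesoi = n_sesoi_sized))
```

Because the outcome has only three response options, a simulation-based power analysis on the planned model would likely be more faithful; this approximation only gives an order of magnitude.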
Possibility 3: Adults simply do not adjust their inferences about social groups based on sampling processes to any significant extent, which could be a basis for stereotype formation. This task is a fair test of their abilities.
- We could run a preregistered study with Bayesian analyses to test hypotheses in both directions (incl. the null hypothesis that adults do not account for sampling processes).
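One lightweight way to quantify evidence for the null, sketched here under strong assumptions, is the BIC approximation to the Bayes factor, which for a one-predictor linear regression can be computed from n and R-squared alone. The helper `bf01_from_r2` is hypothetical, the inputs are the sample size and R-squared values reported in this pilot's regression outputs, and the approximation implicitly assumes a unit-information prior, so a preregistered analysis would want explicitly specified priors instead.

```r
# BIC approximation to the Bayes factor for an intercept-only model (H0)
# vs. a one-predictor linear regression (H1).
# BIC1 - BIC0 = n * log(1 - R^2) + log(n); BF01 ~ exp((BIC1 - BIC0) / 2)
# Hypothetical helper; not part of the original analysis script.
bf01_from_r2 <- function(r2, n) {
  delta_bic <- n * log(1 - r2) + log(n)
  exp(delta_bic / 2)
}

n <- 93  # final sample size in this pilot

bf01_sample <- bf01_from_r2(0.004161, n)  # dv_sample ~ boatheight
bf01_pop    <- bf01_from_r2(0.0479, n)    # dv_pop ~ boatheight

# BF01 > 1 favors the null; BF01 < 1 favors an effect of boat height
print(round(c(sample = bf01_sample, population = bf01_pop), 2))
```

A Bayes factor near 1 would mean the data do not discriminate between the hypotheses, which is the kind of ambiguity a larger preregistered sample would be meant to resolve.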

Methods

Participants

Data was collected from 108 adults recruited via Prolific on 12/18/2024. Participants were required to be in the United States and fluent in English.

Participants were paid $2.00 for an estimated 8-minute task. Prolific was set to recruit 100 participants, but Qualtrics collected 108 complete responses. Due to a misconfiguration of the Qualtrics survey, participants were redirected to a broken link upon completion, so some participants may have completed the survey but then failed to manually turn in their submission, or returned it before payment. All participants who turned in submissions (n = 100) were manually approved for payment.

The final sample included 93 adults (n = 15-20 in each of 5 boat height conditions).

boatheight	n
6	20
7	19
8	15
9	19
10	20

Exclusion criteria

15 participants (13.9% of all participants) were excluded for meeting at least 1 of the following exclusion criteria:

- failing the sound check (n = 2 participants)
- failing to check an item mentioned in the task (i.e., did not select “A boat”, “Fictional animals”, or exactly both of these options) (n = 2 participants)
- failing to select the correct task description (i.e., did not select “Learning about people who live on an island”) (n = 11 participants)

Note: Initially, the plan was to exclude participants who did not check exactly the 2 items mentioned in the task (i.e., did not select both “A boat” and “Fictional animals” and nothing else). Surprisingly, many participants failed this criterion (n = 28 participants, or 25.9% of the entire sample), so it was loosened to checking either one correct item or both correct items. Participants may have thought it was a single-select question.

check_mention	n	prop	pass_both	pass_one
A boat	22	20.4%	FALSE	TRUE
A boat,Fictional animals	80	74.1%	TRUE	TRUE
A boat,Fictional animals,A train	2	1.9%	FALSE	FALSE
Fictional animals	4	3.7%	FALSE	TRUE
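The two versions of this criterion can be sketched in base R as follows, using the response patterns and counts from the table above. The rule for `pass_one` (at least one correct item and no incorrect items) is an assumption inferred from the table, where the pattern including “A train” is marked FALSE; the original analysis script may implement it differently.

```r
# Response patterns and counts copied from the check_mention table
check <- data.frame(
  check_mention = c("A boat",
                    "A boat,Fictional animals",
                    "A boat,Fictional animals,A train",
                    "Fictional animals"),
  n = c(22, 80, 2, 4),
  stringsAsFactors = FALSE
)

correct_items <- c("A boat", "Fictional animals")

# Split each comma-separated response into the set of selected items
selected <- strsplit(check$check_mention, ",", fixed = TRUE)

# Strict criterion: exactly the two correct items and nothing else
check$pass_both <- vapply(
  selected, function(s) setequal(s, correct_items), logical(1)
)

# Loosened criterion (assumed rule): at least one correct item, no incorrect items
check$pass_one <- vapply(
  selected, function(s) length(s) > 0 && all(s %in% correct_items), logical(1)
)

n_fail_both <- sum(check$n[!check$pass_both])  # excluded under the strict rule
n_fail_one  <- sum(check$n[!check$pass_one])   # excluded under the loosened rule
print(check)
```

Under these assumed rules the strict criterion would exclude 28 participants and the loosened one 2, matching the counts reported above.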

Demographics

age group	n	prop
18 to 24	26	28.0%
25 to 34	31	33.3%
35 to 44	20	21.5%
45 to 54	9	9.7%
55 to 64	6	6.5%
Prefer not to specify	1	1.1%

The sample skewed young in age.

gender	n	prop
Female	48	51.6%
Male	41	44.1%
Non-binary	3	3.2%
Prefer not to specify	1	1.1%

The sample was roughly balanced between female and male participants, with a few non-binary participants.

race	n	prop
White, Caucasian, or European American	44	47.3%
Black or African American	24	25.8%
Hispanic or Latino/a	5	5.4%
East Asian	4	4.3%
South or Southeast Asian	4	4.3%
Prefer not to specify	3	3.2%
White, Caucasian, or European American,Black or African American	3	3.2%
Hispanic or Latino/a,Black or African American	1	1.1%
Middle Eastern or North African	1	1.1%
White, Caucasian, or European American,Black or African American,East Asian	1	1.1%
White, Caucasian, or European American,Hispanic or Latino/a	1	1.1%
White, Caucasian, or European American,Native American, American Indian, or Alaska Native	1	1.1%
White, Caucasian, or European American,South or Southeast Asian	1	1.1%

The sample was also racially diverse.

education	n	prop
Less than high school	2	2.2%
High school/GED	13	14.0%
Some college	21	22.6%
Bachelor's (B.A., B.S.)	42	45.2%
Master's (M.A., M.S.)	10	10.8%
Doctoral (Ph.D., J.D., M.D.)	3	3.2%
Prefer not to specify	2	2.2%

The sample was mostly college-educated.

Procedure

This study was administered as a Qualtrics survey and was approved by the NYU IRB (IRB-FY2024-9169).

After providing their consent, participants completed a captcha and sound check, and were asked to watch the videos with sound on. Participants then watched the following videos in order:

In the prior setting and familiarization phase, participants saw a picture of 5 adults and then another picture of a different 5 adults appear on screen against a grid. These adults were all 10 units tall.
In the boat training phase, participants were randomly assigned to see a boat either 6, 7, 8, 9, or 10 units tall (between-subjects).

To communicate how the boat functions to exclude those taller than the boat, participants watched a parade of 20 fictional animals (Quaffas, taken from Foster-Hanson et al., 2019) attempt to board the boat, one at a time, from shortest to tallest.

To maintain the same effect of the boat across boat height conditions, the heights of the animals were scaled to the height of the boat, such that 10 animals were always shorter than the boat and 10 animals were always taller than the boat.

Of the 10 animals shorter than the boat, all boarded the boat successfully (upon the first success: “Yay, the Quaffa boarded the boat!”). Of the 10 animals taller than the boat, all but one were unable to board the boat (upon the first failure: “Uh-oh, this Quaffa couldn’t board the boat! It decided not to go.”). The one that boarded was the third Quaffa in this set of 10; it bent its head below the boat ceiling to get on board (“Uhoh, this Quaffa couldn’t board the boat. Look, it’s stooping its head to get on board!”).

After the boat training phase, participants were asked a memory check: “Did all of the animals board the boat?” (yes/no). Participants who answered “no” received an affirmation (“That’s right, not all of the animals made it onto the boat! The animals taller than the boat couldn’t get on, or had to stoop to get on.”). Those who answered “yes” were given a correction (“Actually, not all of the animals made it onto the boat! The animals taller than the boat couldn’t get on, or had to stoop to get on.”). All participants were included regardless of memory check performance.
In the observed sample phase, participants were told that Zarpies are people who live on a far-away island, and watched the boat from before leave the island. All participants saw the same set of Zarpies get off the boat, fixed across conditions: 6 Zarpies of heights 4, 5, 6, 6, 7, and 8. Participants were told that they were all grown-up Zarpies.

Each Zarpie then waved to the participant in sequence. To maintain consistency with the constraining effect of the boat, any Zarpies taller than the height of the boat were depicted initially stooping and then straightening up when they waved (e.g., if the boat height was 6, the Zarpie of height 7 and the Zarpie of height 8 were initially stooping when they got off the boat, and straightened up when they waved).

To emphasize the height of the Zarpies relative to the boat, participants watched the Zarpies deboard the boat, wave, reboard the boat (“Oh the Zarpies forgot some of their things on the boat.”, with any Zarpies taller than the boat stooping down again to board), and deboard again (with any Zarpies taller than the boat straightening up again).
Participants were asked the 2 primary DVs - about sample representation and about population inferences - in counterbalanced order (see Order effects).
Since this was a pilot, participants were also asked for feedback at the end of the task: any problems or confusion they had, and what they thought the task was about (see Participant feedback).
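To make the stooping rule in the observed sample phase concrete, the sketch below lists which of the six observed Zarpies would be shown stooping in each boat height condition. It assumes that a Zarpie stoops only when strictly taller than the boat, which matches the example given for boat height 6 (only the Zarpies of heights 7 and 8 stoop); the variable names are illustrative.

```r
zarpie_heights <- c(4, 5, 6, 6, 7, 8)  # observed sample, fixed across conditions
boat_heights   <- 6:10                 # between-subjects boat height conditions

# Assumed rule: a Zarpie stoops only if strictly taller than the boat
stoopers <- lapply(boat_heights, function(b) zarpie_heights[zarpie_heights > b])
names(stoopers) <- paste0("boat_", boat_heights)

# Number of stooping Zarpies per condition
n_stooping <- vapply(stoopers, length, integer(1))
print(n_stooping)
```

Under this reading, stooping Zarpies appear only in the boat height 6 and 7 conditions, so the visible cue to the boat's constraint differs across conditions even though the observed sample is the same.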

Task comprehension

Memory check

Almost all participants passed the memory check by saying “no” (n = 1, or 1.1%, failed).

Participant feedback

Participants by and large did not report any problems or confusion with the task. See data file for details on these responses.

- 1 participant mentioned missing the sound check.
- 1 participant mentioned that the noise of the first failed animal boarding was startling to their dog.

Participants also by and large were naive to the specific purpose of the task.

- Most participants guessed it was to test some educational activity for children.
- A few participants commented that the task was strange to experience as an adult.
- Some participants mentioned something vague about social groups, which was the advertised topic in the Prolific ad.

Primary results

Sample representation

As a check for their representation of the observed sample (fixed across conditions: 6 Zarpies of heights 4, 5, 6, 6, 7, 8), participants were asked: “Which picture shows how tall most of the Zarpies who visited are?” Response options were a Zarpie of height 4, 6, or 8.

The expected answer to this question is 6 (indicated by the dashed line on the plot below), since 6 is the mode of the observed sample.
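As a quick check of the expected answer, the mode and mean of the observed sample can be computed directly from the six heights. This is a minimal base-R sketch with illustrative variable names.

```r
observed <- c(4, 5, 6, 6, 7, 8)  # heights of the six observed Zarpies

# Mode: the most frequent height in the sample
counts <- table(observed)
sample_mode <- as.numeric(names(counts)[counts == max(counts)])

# Mean: used as the reference value (mu) in a one-sample t-test
sample_mean <- mean(observed)

print(c(mode = sample_mode, mean = sample_mean))
```

Both the mode and the mean equal 6, so 6 serves as the reference value whether “most” is read as the most common height or as the average height.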

As expected, there was no main effect of boat height on sample representations in a simple linear regression (b = 0.03, p = .539), since all participants observed the same sample.

lm(dv_sample ~ boatheight,
   data = data) %>% 
  summary()

## 
## Call:
## lm(formula = dv_sample ~ boatheight, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1545 -0.2454 -0.2150 -0.1545  1.8456 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  5.97263    0.39964  14.945 <0.0000000000000002 ***
## boatheight   0.03030    0.04914   0.617               0.539    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6915 on 91 degrees of freedom
## Multiple R-squared:  0.004161,   Adjusted R-squared:  -0.006783 
## F-statistic: 0.3802 on 1 and 91 DF,  p-value: 0.539

Slightly surprisingly, participants overall reported the sample to be slightly taller than 6, which is both the mode and the mean of the observed sample (M = 6.22, t(92) = 3.01, p = .003).

t.test(data %>% 
         select(dv_sample), 
       mu = 6) # true mean of observed sample

## 
##  One Sample t-test
## 
## data:  data %>% select(dv_sample)
## t = 3.0092, df = 92, p-value = 0.00338
## alternative hypothesis: true mean is not equal to 6
## 95 percent confidence interval:
##  6.073116 6.356992
## sample estimates:
## mean of x 
##  6.215054
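To put the size of this upward drift in standardized terms, Cohen's d for a one-sample t-test can be recovered from the reported statistic as d = t / sqrt(n). The sketch below uses only the values printed in the output above; the variable names are illustrative.

```r
# Values taken from the one-sample t-test output above
t_value   <- 3.0092
df        <- 92
mean_obs  <- 6.215054
mu        <- 6

n <- df + 1                        # one-sample test: n = df + 1
cohens_d <- t_value / sqrt(n)      # standardized mean difference
implied_sd <- (mean_obs - mu) / cohens_d  # SD of responses implied by t and n

print(round(c(d = cohens_d, sd = implied_sd), 3))
```

This gives d of roughly 0.31, with an implied response SD of about 0.69 height units.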

Population inferences

To assess their inferences about the population, participants were asked: “Which picture shows how tall most Zarpies are on Zarpie island?” Response options were a Zarpie of height 4, 6, or 8.

The use of “most” was intended to be a child-friendly version of eliciting the mean/average of a distribution.

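Unexpectedly, there was a significant positive effect of boat height on inferences of population height (in a simple linear regression). That is, participants who saw a taller boat thought Zarpies in general were also slightly taller. This effect, however, is small: each additional unit of boat height was associated with an estimated 0.12-unit increase in inferred height (b = 0.12, SE = 0.06, p = .035, R² = 0.048).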

lm(dv_pop ~ boatheight,
   data = data) %>% 
  summary()

## 
## Call:
## lm(formula = dv_pop ~ boatheight, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.15836 -0.40078 -0.15836 -0.03715  1.96285 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  5.30987    0.46067   11.53 <0.0000000000000002 ***
## boatheight   0.12121    0.05665    2.14              0.0351 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7971 on 91 degrees of freedom
## Multiple R-squared:  0.0479, Adjusted R-squared:  0.03744 
## F-statistic: 4.578 on 1 and 91 DF,  p-value: 0.03506
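To gauge how small and how uncertain this slope is, a 95% confidence interval and the implied difference between the shortest and tallest boat conditions can be computed from the reported estimate and standard error. This sketch uses only the numbers in the output above; the variable names are illustrative.

```r
# Values taken from the regression output above
estimate <- 0.12121  # slope for boatheight
se       <- 0.05665  # standard error of the slope
df_resid <- 91       # residual degrees of freedom

# 95% confidence interval for the slope
t_crit <- qt(0.975, df_resid)
ci <- estimate + c(-1, 1) * t_crit * se

# Predicted difference in inferred height between boat heights 10 and 6
predicted_gap <- estimate * (10 - 6)

print(round(c(lower = ci[1], upper = ci[2], gap_6_to_10 = predicted_gap), 3))
```

The interval runs from just above 0 to about 0.23, and the predicted gap across the full range of boat heights is under half a height unit, on a response scale whose options are 2 units apart.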

Some speculation about why there is a small but positive effect:

- Participants could be engaging in just-world/justificatory reasoning, where the boat is assumed to be designed to be optimal for Zarpies, and thus signals something about Zarpie height.
- Participants could be engaging in perceptual scaling of the stimuli to everything else on screen, including the boat. The boat is on screen during the population inference DV, as a visual reminder of the boat height. Perceptually, a taller boat might make a taller Zarpie look more “proportional”.

Secondary results

Sample vs population

Sample representation and population inferences are very similar (both centered on 6). By eye, both show a very slight tendency to drift taller (toward 8).

t.test(data$dv_pop, data$dv_sample,
       paired = TRUE)

## 
##  Paired t-test
## 
## data:  data$dv_pop and data$dv_sample
## t = 0.68629, df = 92, p-value = 0.4943
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  -0.1221909  0.2512232
## sample estimates:
## mean difference 
##      0.06451613

cor.test(data$dv_pop, data$dv_sample,
         method = "pearson")

## 
##  Pearson's product-moment correlation
## 
## data:  data$dv_pop and data$dv_sample
## t = 2.779, df = 91, p-value = 0.006624
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.08057489 0.45734321
## sample estimates:
##       cor 
## 0.2796924

Sample representation and population inferences are not statistically different from each other (mean difference = 0.06, p = .494), and are weakly to moderately positively correlated with each other (r = 0.28, p = .007).

Order effects

Participants saw the two DVs in counterbalanced order:

- pop_sample = population DV first, then sample DV
- sample_pop = sample DV first, then population DV

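Although the population inferences seem to vary a bit when they came after sample responses (sample_pop), there weren’t any significant effects of DV order in this data: neither the main effect of DV order nor its interaction with boat height was significant for either DV (all p > .83).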

lm(dv_sample ~ boatheight * cb_dvorder,
   data = data) %>% 
  summary()

## 
## Call:
## lm(formula = dv_sample ~ boatheight * cb_dvorder, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1604 -0.2496 -0.2050 -0.1604  1.8534 
## 
## Coefficients:
##                                 Estimate Std. Error t value            Pr(>|t|)
## (Intercept)                      6.02648    0.54482  11.061 <0.0000000000000002
## boatheight                       0.02231    0.06729   0.332               0.741
## cb_dvordersample_pop            -0.11646    0.81226  -0.143               0.886
## boatheight:cb_dvordersample_pop  0.01712    0.09982   0.172               0.864
##                                    
## (Intercept)                     ***
## boatheight                         
## cb_dvordersample_pop               
## boatheight:cb_dvordersample_pop    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6991 on 89 degrees of freedom
## Multiple R-squared:  0.004714,   Adjusted R-squared:  -0.02883 
## F-statistic: 0.1405 on 3 and 89 DF,  p-value: 0.9355

lm(dv_pop ~ boatheight * cb_dvorder,
   data = data) %>% 
  summary()

## 
## Call:
## lm(formula = dv_pop ~ boatheight * cb_dvorder, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.09302 -0.44856 -0.21407 -0.09302  2.03539 
## 
## Coefficients:
##                                 Estimate Std. Error t value          Pr(>|t|)
## (Intercept)                      5.39334    0.62664   8.607 0.000000000000245
## boatheight                       0.11725    0.07740   1.515             0.133
## cb_dvordersample_pop            -0.19921    0.93424  -0.213             0.832
## boatheight:cb_dvordersample_pop  0.01117    0.11481   0.097             0.923
##                                    
## (Intercept)                     ***
## boatheight                         
## cb_dvordersample_pop               
## boatheight:cb_dvordersample_pop    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.804 on 89 degrees of freedom
## Multiple R-squared:  0.0526, Adjusted R-squared:  0.02067 
## F-statistic: 1.647 on 3 and 89 DF,  p-value: 0.1842

Structural skew: Study 1a adults pilot analysis

Marianna Zhang

2024-12-18