Summary

Study 10 was an exploratory study of how generic versus specific statements about a heterogeneous set of features affect adults’ generalization.

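In this study, adults (n = 300 recruited, approximately 100/condition) were randomly assigned to one of three conditions. In the heterogeneous generic condition, they heard 15 generics about heterogeneous features of Zarpies (i.e., a fixed set of physical, diet, and personality generics, 5 of each). In the heterogeneous specific condition, they heard 15 specific statements about individual Zarpies’ features of the same three types. In the baseline condition, they heard no statements. All adults then completed an inductive potential task: on each trial, they observed a Zarpie with a novel feature and rated that feature’s prevalence among Zarpies (0-100%). Trials involved 5 novel features of each type (physical, diet, and personality), for 15 novel features in total.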

Methods

Participants

condition	n
heterogeneous generic	95
baseline	102
heterogeneous specific	97

Data were collected from 297 adults via Prolific on Monday, December 1, 2025; the analyses reported here include 294 participants (n = 95-102/condition, as tabulated above). Participants were required to be in the United States, to be fluent in English, and not to have participated in prior studies under this protocol. Participants were paid $1.75 for an estimated 5 to 8 minute task, and were asked to participate via desktop.

Exclusion criteria

We recruited 300 participants, of whom 3 participants (1.0% of all participants) were excluded for meeting at least 1 of the following exclusion criteria:

failing the sound check (i.e., did not select “bird” as the sound heard during the sound check video) (n = 2 participants)
failing the attention check (i.e., did not select 100% on the slider when asked to during the induction task) (n = 0 participants)
admitting to use of AI after being explicitly informed use was prohibited (n = 0 participants)
failing the task check (n = 1 participant)

Demographics

We used the Prolific representative sample feature to recruit a sample representative of the US based on Census data on sex, age, and ethnicity (Simplified US Census).

variable	mean	sd	n
age	45.99	15.44	294

gender	n	prop
Female	146	49.7%
Male	142	48.3%
Non-binary	3	1.0%
Dual Gender	1	0.3%
Prefer not to specify	1	0.3%
genderfluid	1	0.3%

race	n	prop
White, Caucasian, or European American	189	64.3%
Black or African American	36	12.2%
Hispanic or Latino/a	20	6.8%
East Asian	12	4.1%
South or Southeast Asian	7	2.4%
White, Caucasian, or European American,Hispanic or Latino/a	6	2.0%
White, Caucasian, or European American,East Asian	4	1.4%
White, Caucasian, or European American,Native American, American Indian, or Alaska Native	3	1.0%
Native American, American Indian, or Alaska Native	2	0.7%
White, Caucasian, or European American,Black or African American	2	0.7%
White, Caucasian, or European American,Middle Eastern or North African	2	0.7%
White, Caucasian, or European American,South or Southeast Asian	2	0.7%
East Asian,Native Hawaiian or other Pacific Islander	1	0.3%
Hispanic or Latino/a,Black or African American	1	0.3%
Indigenous American	1	0.3%
Middle Eastern or North African	1	0.3%
Mixed	1	0.3%
Prefer not to specify	1	0.3%
South or Southeast Asian,East Asian	1	0.3%
White, Caucasian, or European American,Hispanic or Latino/a,Black or African American	1	0.3%
black american	1	0.3%

education	n	prop
Less than high school	2	0.7%
High school/GED	39	13.3%
Some college	84	28.6%
Bachelor's (B.A., B.S.)	127	43.2%
Master's (M.A., M.S.)	25	8.5%
Doctoral (Ph.D., J.D., M.D.)	15	5.1%
Prefer not to specify	2	0.7%

A majority of the sample (167 of 294, 56.8%) held a bachelor’s degree or higher.

Procedure

This study was administered as a Qualtrics survey, and approved by the NYU IRB (IRB-FY2023-6812).

After providing their consent, participants completed a captcha, a pledge not to use AI, and a sound check. Participants then completed:

Training phase: Participants were randomly assigned to one of 3 conditions:

Heterogeneous generic condition - 15 generic statements about Zarpies’ physical, diet, and personality features (5 of each type), in random order.
Heterogeneous specific condition - 15 specific statements about individual Zarpies’ physical, diet, and personality features (5 of each type), in random order.
Baseline condition - no training statements.

Test phase (induction task): Participants completed an induction task where they imagined seeing a Zarpie with a novel feature, and estimated the prevalence of that feature among Zarpies using a slider from 0 to 100 (initialized at 0). All participants completed the same 15 trials, with order of trials randomized:

5 physical features
5 diet features
5 personality features

Test phase (group characterization): Participants were then asked a free-response question: “What do you think characterizes Zarpies as a group?”

Participants then completed a few task completion questions and a demographics questionnaire, and were debriefed.

Data processing

Prevalence judgments were converted to a scale from 0 to 1, with values of 0 and 1 trimmed to 0.01 and 0.99 to support a beta regression, since the beta distribution is defined only on the open interval (0, 1) and so cannot accommodate responses of exactly 0 or 1.
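
As a minimal sketch of this step in R, the rescaling and trimming could look like the following. The helper name `rescale_prevalence` and the input vector are hypothetical and not taken from the analysis script. The sketch clamps every value to [0.01, 0.99], which for integer slider responses is the same as replacing only exact 0s and 100s.

```r
# Hypothetical helper: rescale 0-100 slider responses to proportions and
# clamp them to [0.01, 0.99] so every value lies strictly inside (0, 1).
rescale_prevalence <- function(prevalence_pct, lower = 0.01, upper = 0.99) {
  stopifnot(all(prevalence_pct >= 0 & prevalence_pct <= 100, na.rm = TRUE))
  pmin(pmax(prevalence_pct / 100, lower), upper)
}

# Illustrative responses, not study data
slider <- c(0, 1, 37, 50, 100)
prevalence <- rescale_prevalence(slider)
print(prevalence)
```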

Computational modeling

To get a sense of the feature space, we embed all the features (training and test features), as they appear in generic statements, using a sentence transformer (MiniLM). We then use PCA to reduce that high-dimensional space to a 2-dimensional feature space. The code that does this and generates the plots is written in Python and lives in the project folder under “Model”.
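
The projection step can be illustrated with a short R sketch. This is not the project’s Python code: the embeddings here are random stand-ins rather than MiniLM output, and the number of features (30) and embedding dimension (384) are assumptions chosen only for illustration.

```r
set.seed(1)

# Stand-in for sentence embeddings: 30 features (15 training + 15 test) in a
# 384-dimensional space. Both numbers are illustrative assumptions.
n_features <- 30
n_dims <- 384
embeddings <- matrix(rnorm(n_features * n_dims), nrow = n_features)
rownames(embeddings) <- paste0("feature_", seq_len(n_features))

# PCA on the mean-centered embeddings; keep the first two components as the
# 2D feature space.
pca <- prcomp(embeddings, center = TRUE, scale. = FALSE)
coords_2d <- pca$x[, 1:2]

# Share of total variance retained by the 2D projection
var_explained <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
```

Checking `var_explained` on the real embeddings would indicate how much structure the 2D space discards.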

Training and test features as embedded in 2D feature space by a sentence embedding model.

The generic training statements indicate which features are known to be kind-linked. Based on these kind-linked features, our feature-specific model fits a multivariate Gaussian density (a surface in 3D) over the 2D feature space. This Gaussian can be thought of as a “kind concept”: it is centered on a mean and has a covariance matrix defining its spread. For example, if the Gaussian is centered over physical features, it will generalize more strongly to other physical features and more weakly to features that are more distant in feature space.

We can then use the model to make predictions about novel test features based on their embedding location in feature space and the Gaussian concept:

The Gaussian provides a probability over “kind scores” for each test feature.
A test feature’s “kind score” in turn is entered into a beta distribution which provides a probability over the likelihood of the test feature being kind-linked.
The likelihood of the test feature being kind-linked then sets a Bernoulli distribution, which determines whether the test feature is in fact kind-linked.
Note that we have yet to link this model to prevalence judgments (or embed it in the rational speech acts framework).
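
The chain of steps above (Gaussian, then beta, then Bernoulli) can be sketched in R as follows. This is a toy illustration, not the project’s model: the concept mean and covariance, the definition of the kind score as density relative to the peak, and the concentration parameter `kappa` linking the score to the beta distribution are all assumptions made for this sketch.

```r
set.seed(1)

# Hypothetical "kind concept": a Gaussian over 2D feature space
concept_mean <- c(0, 0)
concept_cov  <- matrix(c(1.0, 0.3,
                         0.3, 0.5), nrow = 2, byrow = TRUE)

# Bivariate Gaussian density at point x
gaussian_density <- function(x, mean, cov) {
  d <- x - mean
  quad <- as.numeric(t(d) %*% solve(cov) %*% d)
  exp(-0.5 * quad) / (2 * pi * sqrt(det(cov)))
}

# Assumption: kind score = density relative to the peak, so it lies in (0, 1]
kind_score <- function(x) {
  gaussian_density(x, concept_mean, concept_cov) /
    gaussian_density(concept_mean, concept_mean, concept_cov)
}

# Assumption: the score sets the beta shape parameters via a concentration
# kappa; the beta draw is the probability of being kind-linked, and a
# Bernoulli draw then decides whether the feature is kind-linked (1) or not (0).
sample_kind_linked <- function(x, kappa = 10) {
  s <- kind_score(x)
  p_linked <- rbeta(1, shape1 = 1 + kappa * s, shape2 = 1 + kappa * (1 - s))
  rbinom(1, size = 1, prob = p_linked)
}

# A test feature near the concept center vs. one far from it
near <- c(0.2, 0.1)
far  <- c(3, 3)
draws_near <- replicate(2000, sample_kind_linked(near))
draws_far  <- replicate(2000, sample_kind_linked(far))
```

Under these assumptions, the feature near the concept center is sampled as kind-linked far more often than the distant one, which is the qualitative pattern the model is meant to capture.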

It’s hard to visualize a 3D Gaussian over a 2D space, so the below visualizations use ellipses to depict cross-sections from sampled Gaussians, to show approximately what the expected Gaussian might look like after each training condition.

Primary results

Induction task

Analyses of the induction task were beta regressions with a logit link unless otherwise specified, predicting prevalence (.01-.99) with participant and test feature as random intercepts. Test feature (“can snap with their toes”, etc.) is technically nested within test feature type (physical, diet, personality), but since each test feature belongs to exactly one test feature type, writing the nesting explicitly yields the same grouping for the test feature intercept as omitting it, so the nesting term was omitted for simplicity of specification.
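
The point about nesting can be checked with a small R sketch. The feature labels and the number of simulated participants are hypothetical. The sketch shows only that the feature-level grouping factor is the same with or without the nesting written out; it does not address whether a separate intercept for test feature type is warranted.

```r
# Hypothetical labels standing in for the 15 test features
features <- data.frame(
  test_feature = paste0("feature_", 1:15),
  test_feature_type = rep(c("physical", "diet", "personality"), each = 5)
)

# Simulate 4 participants who each rate every feature once
ratings <- features[rep(seq_len(nrow(features)), times = 4), ]

# Grouping factor with the nesting written out (type:feature) vs. feature alone
nested <- interaction(ratings$test_feature_type, ratings$test_feature, drop = TRUE)
flat <- factor(ratings$test_feature)

# The two factors define the same groups if their levels map one-to-one
crosstab <- table(nested, flat)
same_grouping <- nlevels(nested) == nlevels(flat) &&
  all(rowSums(crosstab > 0) == 1) &&
  all(colSums(crosstab > 0) == 1)
print(same_grouping)
```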

Overall

We can look at prevalence estimates overall.

# effect of condition on prevalence (beta regression, logit link)
glmm_condition <-
  glmmTMB(prevalence ~ condition + (1|participant) + (1|test_feature_type),
          data = data_tidy, 
          family = beta_family(link = "logit"))

glmm_condition %>% 
  Anova()

## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: prevalence
##            Chisq Df     Pr(>Chisq)    
## condition 40.046  2 0.000000002014 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

glmm_condition %>% 
  emmeans("condition") %>%
  contrast(method = "pairwise") %>%
  summary(adjust = "FDR")

##  contrast                                       estimate    SE  df z.ratio p.value
##  heterogeneous generic - baseline                  0.468 0.102 Inf   4.607  <.0001
##  heterogeneous generic - heterogeneous specific    0.626 0.103 Inf   6.085  <.0001
##  baseline - heterogeneous specific                 0.158 0.101 Inf   1.562  0.1183
## 
## Results are given on the log odds ratio (not the response) scale. 
## P value adjustment: fdr method for 3 tests

Indeed, there is a significant effect of condition on prevalence ($\chi^2$(2) = 40.05, p < .001), based on a Type II Wald chi-square test on the beta regression shown above, which has condition as a fixed effect and random intercepts for participant and test feature type.

The heterogeneous generic condition led to greater generalization than baseline (z = 4.61, p < .001) and than the heterogeneous specific condition (z = 6.09, p < .001).

However, the heterogeneous specific condition did not differ significantly from baseline (z = 1.56, p = .118), although generalization was numerically lower in the heterogeneous specific condition.
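
Because the pairwise contrasts are reported on the log odds ratio scale, it can help to exponentiate them. The R sketch below does this for the estimates and standard errors copied from the emmeans output above. The shortened contrast labels and the intervals are additions for illustration: the intervals are plain Wald 95% intervals and are not adjusted for multiple comparisons.

```r
# Estimates and standard errors copied from the emmeans contrast output
contrasts_tbl <- data.frame(
  contrast = c("generic - baseline", "generic - specific", "baseline - specific"),
  estimate = c(0.468, 0.626, 0.158),
  se = c(0.102, 0.103, 0.101)
)

# Exponentiate log odds ratios to get odds ratios, with unadjusted Wald
# 95% intervals computed on the log-odds scale and then exponentiated
z <- qnorm(0.975)
contrasts_tbl$odds_ratio <- exp(contrasts_tbl$estimate)
contrasts_tbl$or_lower <- exp(contrasts_tbl$estimate - z * contrasts_tbl$se)
contrasts_tbl$or_upper <- exp(contrasts_tbl$estimate + z * contrasts_tbl$se)

print(contrasts_tbl, digits = 3)
```

An interval that includes 1 corresponds to a contrast that is not significant at the unadjusted .05 level.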

By test feature type

We can look at how prevalence judgments vary by condition and test feature type (i.e., physical, diet, or personality).

If the chosen clusters capture some systematicity in how people generalize, the physical condition should yield the highest prevalence estimates for physical test features, the diet condition for diet test features, and the personality condition for personality test features. This appears to be true for the physical and personality conditions, but not for the diet condition.

By test feature

We can look at how prevalence judgments vary by condition and individual test feature.

Group characterization

Participants were asked to describe what characterizes Zarpies as a group. TBD

Secondary results

Straight-lining

Although the induction task plots suggest considerable anchoring around the 50% mark, straight-lining was not pervasive.

4 out of 294 participants (1.36%) answered 50% to all test questions.
5 out of 294 participants (1.70%) answered 48-52% to all test questions, a looser criterion.
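
A minimal sketch of how the two criteria above could be computed in R, using toy data rather than study data. The data frame `toy` and its column names are hypothetical.

```r
# Illustrative long-format data, not study data: 3 participants x 15 trials
toy <- data.frame(
  participant = rep(c("p1", "p2", "p3"), each = 15),
  prevalence_pct = c(rep(50, 15),            # answers 50% on every trial
                     rep(c(48, 50, 52), 5),  # stays within 48-52% throughout
                     seq(0, 98, by = 7))     # varied responses
)

# Strict criterion: every response is exactly 50%
strict <- tapply(toy$prevalence_pct, toy$participant,
                 function(x) all(x == 50))

# Looser criterion: every response falls within 48-52%
loose <- tapply(toy$prevalence_pct, toy$participant,
                function(x) all(x >= 48 & x <= 52))

# Number and percentage of participants meeting each criterion
n_participants <- length(strict)
cat(sum(strict), "of", n_participants, "strict;",
    sum(loose), "of", n_participants, "loose\n")
```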

Test features order effects

All participants rated the prevalence of the same set of 15 test features, in random order. Did the order of test feature/prevalence judgment questions matter for prevalence judgments?

## Analysis of Deviance Table (Type II Wald chisquare tests)
## 
## Response: prevalence
##                                Chisq Df     Pr(>Chisq)    
## condition                    39.9187  2 0.000000002147 ***
## test_feature_order            2.1052  1         0.1468    
## condition:test_feature_order  0.1830  2         0.9126    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##  Family: beta  ( logit )
## Formula:          
## prevalence ~ test_feature_order + (1 | participant) + (1 | test_feature)
## Data: data_tidy
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##   -2234.4   -2202.4    1122.2   -2244.4      4405 
## 
## Random effects:
## 
## Conditional model:
##  Groups       Name        Variance Std.Dev.
##  participant  (Intercept) 0.5685   0.7540  
##  test_feature (Intercept) 0.1457   0.3817  
## Number of obs: 4410, groups:  participant, 294; test_feature, 15
## 
## Dispersion parameter for beta family (): 4.42 
## 
## Conditional model:
##                     Estimate Std. Error z value Pr(>|z|)
## (Intercept)        -0.061960   0.111476  -0.556    0.578
## test_feature_order -0.004467   0.003074  -1.453    0.146

There is no significant effect of test feature order ($\chi^2$(1) = 2.11, p = .147), nor a significant interaction between test feature order and condition ($\chi^2$(2) = 0.18, p = .913).

Session info

## R version 4.5.2 (2025-10-31)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.7.2
## 
## Matrix products: default
## BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/New_York
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] emmeans_2.0.0   car_3.1-3       carData_3.0-5   glmmTMB_1.1.13 
##  [5] lubridate_1.9.4 forcats_1.0.1   stringr_1.6.0   dplyr_1.1.4    
##  [9] purrr_1.2.0     readr_2.1.6     tidyr_1.3.1     tibble_3.3.0   
## [13] ggplot2_4.0.1   tidyverse_2.0.0 gt_1.1.0        scales_1.4.0   
## [17] janitor_2.2.1   here_1.0.2     
## 
## loaded via a namespace (and not attached):
##  [1] Rdpack_2.6.4        gridExtra_2.3       sandwich_3.1-1     
##  [4] rlang_1.1.6         magrittr_2.0.4      multcomp_1.4-29    
##  [7] snakecase_0.11.1    compiler_4.5.2      mgcv_1.9-4         
## [10] systemfonts_1.3.1   vctrs_0.6.5         pkgconfig_2.0.3    
## [13] crayon_1.5.3        fastmap_1.2.0       backports_1.5.0    
## [16] labeling_0.4.3      rmarkdown_2.30      tzdb_0.5.0         
## [19] nloptr_2.2.1        ragg_1.5.0          bit_4.6.0          
## [22] xfun_0.54           cachem_1.1.0        jsonlite_2.0.0     
## [25] parallel_4.5.2      cluster_2.1.8.1     R6_2.6.1           
## [28] bslib_0.9.0         stringi_1.8.7       RColorBrewer_1.1-3 
## [31] boot_1.3-32         rpart_4.1.24        jquerylib_0.1.4    
## [34] numDeriv_2016.8-1.1 estimability_1.5.1  Rcpp_1.1.0         
## [37] knitr_1.50          zoo_1.8-14          base64enc_0.1-3    
## [40] Matrix_1.7-4        splines_4.5.2       nnet_7.3-20        
## [43] timechange_0.3.0    tidyselect_1.2.1    rstudioapi_0.17.1  
## [46] abind_1.4-8         yaml_2.3.10         TMB_1.9.18         
## [49] codetools_0.2-20    lattice_0.22-7      withr_3.0.2        
## [52] S7_0.2.1            coda_0.19-4.1       evaluate_1.0.5     
## [55] foreign_0.8-90      survival_3.8-3      xml2_1.5.0         
## [58] pillar_1.11.1       checkmate_2.3.3     reformulas_0.4.2   
## [61] generics_0.1.4      vroom_1.6.6         rprojroot_2.1.1    
## [64] hms_1.1.4           minqa_1.2.8         xtable_1.8-4       
## [67] glue_1.8.0          Hmisc_5.2-4         tools_4.5.2        
## [70] data.table_1.17.8   lme4_1.1-37         fs_1.6.6           
## [73] mvtnorm_1.3-3       rbibutils_2.4       colorspace_2.1-2   
## [76] nlme_3.1-168        htmlTable_2.4.3     Formula_1.2-5      
## [79] cli_3.6.5           textshaping_1.0.4   ggthemes_5.1.0     
## [82] gtable_0.3.6        sass_0.4.10         digest_0.6.38      
## [85] TH.data_1.1-5       htmlwidgets_1.6.4   farver_2.1.2       
## [88] htmltools_0.5.8.1   lifecycle_1.0.4     bit64_4.6.0-1      
## [91] MASS_7.3-65

Compgenerics study 10 exploratory study

Marianna Zhang

2025-12-02