# Seed for random number generation
set.seed(42)
knitr::opts_chunk$set(cache.extra = knitr::rand_seed, echo = TRUE, message=FALSE)

library(tidyverse)   # attaches ggplot2, stringr, purrr
library(brms)
library(psych)
library(modelr)
library(glue)
library(here)
library(quanteda)
library(tmcn)
library(janitor)
library(GGally)
library(lme4)
library(lmerTest)    # For p-values in lme4 models
library(skimr)       # For quick and nice summaries
library(performance) # For model diagnostics
library(ggvenn)
source(here("helper.R"))
library(papaja)
library(irr)
library(corrr)
library(broom)
library(effectsize)
library(patchwork)
library(ggeffects)

#r_refs("r-references.bib")

Introduction

Study 1A:

Methods

In Study 1, we aimed to replicate Samuelson (1999) by collecting judgment ratings on early-learned nouns along the same three dimensions: solidity, count/mass syntax, and shape-based categorization. Whereas the original study used the first 300 words learned from the US English MCDI, we collected these ratings on the first 300 words learned by children in each of sixteen languages, including English.

english_demo1 <- read_csv(here("data/demographics/first-sample-english", "1.csv"))
english_demo2 <- read_csv(here("data/demographics/first-sample-english", "2.csv"))
english_demo3 <- read_csv(here("data/demographics/first-sample-english", "3.csv")) %>%
  filter(!is.na(Age)) %>% mutate(Age = as.character(Age))
english_demo4 <- read_csv(here("data/demographics/first-sample-english", "4.csv")) %>%
  filter(!is.na(Age)) %>% mutate(Age = as.character(Age))
english_demo5 <- read_csv(here("data/demographics/first-sample-english", "5.csv"))

english_sample1_demo <- bind_rows(english_demo1, english_demo2, english_demo3, english_demo4, english_demo5) %>%
  clean_names() %>%
  mutate(language = "English", sample = "1") %>%
  mutate(age = as.numeric(age)) %>% 
  drop_na(age)

# %>% filter(age >= 18 & age <= 60)

#english_sample1_demo %>% skim()
english_sample1_demo %>% select(age) %>% filter(!is.na(age)) %>%
  summarize(meanage = mean(age), sd = sd(age), max = max(age), min = min(age))

words <- read_csv(here("data/ratings", "alllangs.csv")) %>% 
  filter(!is.na(uni_lemma))  %>% 
  mutate(word = uni_lemma) %>%
  select(word, language, category)

unique(words$language)
##  [1] "English (American)"   "Mandarin (Beijing)"   "Korean"              
##  [4] "English (Australian)" "Cantonese"            "Catalan"             
##  [7] "Japanese"             "Turkish"              "Spanish (Mexican)"   
## [10] "Russian"              "Kiswahili"            "Mandarin (Taiwanese)"
## [13] "Spanish (Peruvian)"   "Spanish (European)"   "Hebrew"              
## [16] "French (French)"
semantic_cat <- words %>% group_by(language) %>% mutate(lang_count = n()) %>% ungroup()

# Percentage of each semantic category within each language's word list
semantic_cat %>%
  group_by(language, category) %>%
  mutate(sem_count = n()) %>%
  distinct() %>%
  select(category, sem_count, lang_count) %>%
  distinct() %>%
  mutate(percent = (sem_count / lang_count) * 100) %>%
  select(-lang_count, -sem_count) %>%
  distinct() %>%
  pivot_wider(names_from = category, values_from = percent, values_fill = 0)

Participants

A total of 377 English-speaking adults recruited online from the Prolific platform (mean age = 41, SD = 14) were asked to rate words along three dimensions: solidity (solid, non-solid, unclear), count/mass noun syntax (count, mass, unclear), and organizing feature (shape, color, material, none of these).

Material

Three hundred nouns were sourced from MCDI data in the Wordbank repository for 16 languages belonging to seven linguistic families: (LIST OF LANGUAGES). Within each language, nouns were chosen by earliest age of acquisition in the dataset, up to a maximum of 300 nouns (CITE). Because we recruited only English-speaking participants, nouns were presented in English and mapped across languages via conceptual mappings (“uni-lemmas”; Frank et al., 2021). A “uni-lemma” is the concept behind a noun, expressed in English: English “dog” and its German equivalent both map to the same underlying concept of “dog”, which serves as the uni-lemma in this case.

The final sample consisted of 23,355 nouns across all languages, reduced to approximately 664 unique items after removing duplicate uni-lemmas shared between languages (Figure to show how much overlap between languages). A breakdown of the semantic categories of the words in each language is provided in the supplementary materials. Most words belonged to the categories of food/drink, clothing, and animals, with fewer words referring to vehicles, toys, body parts, household items, and outside things.
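To illustrate how uni-lemmas collapse translation equivalents, here is a minimal sketch with made-up rows; these are not the actual Wordbank data, but the column names match the words data frame above.

toy <- tibble::tribble(
  ~word,    ~language,            ~uni_lemma,
  "dog",    "English (American)", "dog",
  "perro",  "Spanish (Mexican)",  "dog",
  "chien",  "French (French)",    "dog",
  "milk",   "English (American)", "milk"
)

# Translation equivalents share a uni_lemma, so duplicates collapse to one concept each
toy %>% distinct(uni_lemma)  # two unique concepts: "dog" and "milk"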

Procedure

The experiment started with familiarization trials in which participants were shown examples of count and mass nouns, solid objects, and non-solid objects. They were also familiarized with what a category-organizing (characterizing) feature means, as in Table [2]. The test trials then followed the same wording and structure. Each dimension (solidity, count/mass syntax, and category-organizing feature) constituted a block, and every participant rated 20 words per block, for 60 test trials in total. Familiarization trials, test blocks, and words within a block were all randomized across participants. No information about the semantic category of words was given to participants, nor were the words grouped by semantic category. Participants could select only one response option per question.
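To make the trial structure concrete, the following sketch assembles one participant's randomized trial list (3 blocks × 20 words = 60 test trials). This is an illustration rather than the actual experiment code: the function name is hypothetical, and whether words may repeat across blocks is left open (this version draws 60 distinct words).

# Illustrative sketch only -- not the actual experiment code
make_trial_list <- function(word_pool, n_per_block = 20) {
  blocks <- sample(c("solidity", "count_mass", "category_organization"))  # random block order
  chosen <- sample(word_pool, n_per_block * length(blocks))               # 60 distinct words
  tibble::tibble(
    block = rep(blocks, each = n_per_block),
    word  = sample(chosen)  # shuffle word-to-block assignment
  )
}

# e.g., make_trial_list(unique(words$word)) returns 60 rows, 20 per block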

Data analysis

Results

english_sample1 <- read_csv(here("data/predictors","english_sample1_props.csv")) 
english_sample1 <- english_sample1 %>% filter(!is.na(response)) %>% distinct()
english_sample1_raw <- read_csv(here("data/predictors","english_sample1_raw.csv")) 
# Apply the mapping function in helper.R to the response column
english_sample1$response_numeric <- mapply(map_categories, english_sample1$response, english_sample1$block)
# need 2 more participants -- 21 words have fewer than 8 ratings

correspondence <- english_sample1 |> filter(matching_response %in% c("solid", "count", "shape"))

# Helper: density of response proportions across words, faceted by language
plot_density_by_language <- function(df) {
  ggplot(df, aes(x = proportion, color = response, fill = response)) +
    geom_density(alpha = 0.5) +
    facet_wrap(~ language) +
    labs(
      title = "Density of Proportions by Language",
      x = "Proportion Value",
      y = "Density (words)"
    ) +
    theme_minimal()
}

english_sample1 |>
  filter(block == "category_organization") |>
  plot_density_by_language()

english_sample1 |>
  filter(block == "solidity") |>
  plot_density_by_language()

english_sample1 |>
  filter(block == "count_mass") |>
  plot_density_by_language()

plot_density_by_language(correspondence)

# How much correspondence is there between the three dimensions?
correspondence_data <- english_sample1 %>% select(-language) %>%
  distinct() %>%
  filter(response %in% c("solid", "shape", "count noun")) %>%  # keep only the desired responses
  select(word, response, proportion) %>%  # keep only relevant columns
  pivot_wider(names_from = response, values_from = proportion, values_fill = 0)  # reshape

correspondence_data %>% select(`count noun`, shape, solid) %>% corr.test()
## Call:corr.test(x = .)
## Correlation matrix 
##            count noun shape solid
## count noun       1.00  0.45  0.62
## shape            0.45  1.00  0.38
## solid            0.62  0.38  1.00
## Sample Size 
## [1] 649
## Probability values (Entries above the diagonal are adjusted for multiple tests.) 
##            count noun shape solid
## count noun          0     0     0
## shape               0     0     0
## solid               0     0     0
## 
##  To see confidence intervals of the correlations, print with the short=FALSE option
#Venn Diagram

threshold <- 0.8

category_english_sample1 <- english_sample1 |>
  filter(response %in% c("solid", "count noun", "shape")) |>
  mutate(over_threshold = proportion >= threshold) |>
  select(-block, -totcount, -count, -proportion, -response_numeric,
         -eng_response, -cleaned_response, -matching_response) |>
  pivot_wider(names_from = "response", values_from = "over_threshold", 
              values_fill = FALSE) |>
  rename(count_noun = `count noun`)

# Spot checks on a single item
category_english_sample1 %>% filter(count_noun == TRUE)
category_english_sample1 %>% filter(uni_lemma == "banana") %>% select(count_noun, shape, solid)
english_sample1 %>% filter(proportion >= 0.8, uni_lemma == "banana") %>% select(response, uni_lemma)
english_sample1 |>
  filter(response %in% c("solid", "count noun", "shape"), uni_lemma == "banana") |>
  mutate(over_threshold = proportion >= threshold)
venns <- category_english_sample1 |>
  group_by(language) |>
  nest() |>
  mutate(plot = map(data, ~ggvenn(.x, c("count_noun", "solid", "shape"),
                                  show_percentage = FALSE,
                                  text_size = 2,
                                  set_name_size = 2)
                    + ggtitle(language)
                    + theme(plot.title = element_text(size = 8), ))) |> # Adjust title size here
  pull(plot)

gridExtra::grid.arrange(grobs = venns, ncol = 4)

venns <- category_english_sample1 |>
  group_by(language) |>
  nest() |>
  mutate(plot = map(data, ~ggvenn(.x, c("count_noun","solid","shape"),
                                  show_percentage = FALSE, 
                                  text_size = 3,
                                  set_name_size = 2) + 
                    labs(y = language) + # Use labs() to set the vertical title
                    theme(
                      axis.title.y = element_text(angle = 90, # Rotate text to be vertical
                                                 # size = 8,  # Adjust size as needed
                                                  vjust = 0.5), # Vertically center the title
                      axis.text.y = element_blank(), # Remove y-axis labels
                      axis.ticks.y = element_blank(), # Remove y-axis ticks
                      axis.line.y = element_blank() # Remove y-axis line
                    ))) |>
  pull(plot)

gridExtra::grid.arrange(grobs = venns, ncol = 4)

#permutation test
# Preparation: isolate one response (e.g. “shape”)
english_sample1_shape_df <- english_sample1 %>%
  filter(response == "shape") %>%
  replace_na(list(proportion = 0)) %>%
  select(uni_lemma, language, proportion)

english_sample1_shape_df %>% glimpse()
## Rows: 3,032
## Columns: 3
## $ uni_lemma  <chr> "air conditioner", "air conditioner", "air conditioner", "a…
## $ language   <chr> "Cantonese", "Mandarin (Beijing)", "Mandarin (Taiwanese)", …
## $ proportion <dbl> 0.3750000, 0.3750000, 0.3750000, 0.3000000, 0.3000000, 0.30…
# 2.A. Decide on a grid of proportion‐values between 0 and 1
grid_x <- seq(0, 1, length.out = 100)

# 2.B. A helper function that, given a vector of proportions, returns its density 
#      evaluated at the grid points.
get_density_on_grid <- function(x_values, grid_x) {
  # Remove any NAs (we replaced above)
  x_vals <- x_values[!is.na(x_values)]
  
  # If a language has fewer than 2 unique points, density() might fail—so 
  # we require at least 2 non‐identical values to form a kernel density.
  if(length(unique(x_vals)) < 2) {
    # Return a vector of zeros or NA so that it’s the “degenerate” distribution.
    return(rep(0, length(grid_x)))
  }
  
  d <- density(x_vals, from = 0, to = 1, n = length(grid_x))
  # d$y is the estimated density at each of the grid points; d$x is almost equal to grid_x
  # To be safe, we’ll interpolate:
  approx(d$x, d$y, xout = grid_x, rule = 2)$y
}

# 2.C. Compute the *observed* densities for each language
obs_density_df <- english_sample1_shape_df %>%
  group_by(language) %>%
  summarize(
    dens = list(get_density_on_grid(proportion, grid_x)),
    .groups = "drop"
  ) %>%
  # Turn the list‐column “dens” into one row per (language, grid_x)
  unnest_longer(dens, values_to = "density") %>%
  group_by(language) %>%
  mutate(x = grid_x) %>%
  ungroup()
set.seed(123)  # for reproducibility
n_perm <- 500

# A helper: given a data frame, shuffle language labels and recompute densities
one_permuted_densities <- function(df, grid_x) {
  shuffled <- df %>%
    mutate(language = sample(language))  # randomly shuffle language labels
  
  # now recompute densities by “fake” language grouping
  perm_df <- shuffled %>%
    group_by(language) %>%
    summarize(
      dens = list(get_density_on_grid(proportion, grid_x)),
      .groups = "drop"
    ) %>%
    unnest_longer(dens, values_to = "density") %>%
    group_by(language) %>%
    mutate(x = grid_x) %>%
    ungroup()
  
  # We want a tidy format: (language, x, density_perm_i)
  perm_df
}

# 3.A. Run all permutations, storing results in a single data frame
perm_densities <- map_dfr(seq_len(n_perm), function(i) {
  one_permuted_densities(english_sample1_shape_df, grid_x) %>%
    mutate(perm = i)
})

# perm_densities now has columns: language, x, density, perm
# For each (language, x), we have n_perm density values.
null_envelopes <- perm_densities %>%
  group_by(language, x) %>%
  summarize(
    lower = quantile(density, 0.025),
    upper = quantile(density, 0.975),
    .groups = "drop"
  )
ggplot() +
  # NULL envelope bands (gray ribbons)
  geom_ribbon(
    data = null_envelopes,
    aes(x = x, ymin = lower, ymax = upper),
    fill = "darkgrey", alpha = 0.5 #"gray80"
  ) +
  # OBSERVED density curves (bold lines)
  geom_line(
    data = obs_density_df,
    aes(x = x, y = density),
    color = "blue", size = 1
  ) +
  facet_wrap(~ language, scales = "free_y") +
  labs(
    title = "Observed 'shape' density vs. Null Envelope by Language",
    x = "Proportion of 'shape' choices",
    y = "Density"
  ) +
  theme_minimal()

# Quantify Difference: KS‐Stat on Observed vs. Null: not done yet
# Quantify Difference: Approximate Area Between Observed and Null Densities

# First compute the median permuted density at each (language, x)
median_perm <- perm_densities %>%
  group_by(language, x) %>%
  summarize(median_density = median(density), .groups = "drop")

# Join with observed
diff_df <- obs_density_df %>%
  left_join(median_perm, by = c("language", "x")) %>%
  mutate(diff_from_median = abs(density - median_density))

# Integrate difference over x (approximate area between curves)
area_diff <- diff_df %>%
  group_by(language) %>%
  summarize(
    area = sum(diff_from_median) * (1 / (length(grid_x) - 1)),  # Riemann approx
    .groups = "drop"
  )

# area now is a single “distance” metric for each language.
print(area_diff)
## # A tibble: 16 × 2
##    language               area
##    <chr>                 <dbl>
##  1 Cantonese            0.0771
##  2 Catalan              0.0533
##  3 English (American)   0.0216
##  4 English (Australian) 0.0422
##  5 French (French)      0.0549
##  6 Hebrew               0.225 
##  7 Japanese             0.0395
##  8 Kiswahili            0.0772
##  9 Korean               0.0615
## 10 Mandarin (Beijing)   0.0949
## 11 Mandarin (Taiwanese) 0.0885
## 12 Russian              0.0433
## 13 Spanish (European)   0.0619
## 14 Spanish (Mexican)    0.0742
## 15 Spanish (Peruvian)   0.152 
## 16 Turkish              0.128
# After perm_densities is built
# Join permuted densities with median permuted densities
perm_diff_df <- perm_densities %>%
  left_join(median_perm, by = c("language", "x")) %>%
  mutate(diff_from_median = abs(density - median_density))

# Then compute area under the difference for each perm
perm_area_diff <- perm_diff_df %>%
  group_by(perm, language) %>%
  summarize(
    area_perm = sum(diff_from_median) * (1 / (length(grid_x) - 1)),
    .groups = "drop"
  )

p_values_df <- area_diff %>%
  rename(area_obs = area) %>%
  left_join(perm_area_diff, by = "language") %>%
  group_by(language) %>%
  summarize(
    area_diff = first(area_obs),
    num_extreme_perms = sum(area_perm >= area_diff),
    total_perms = n(),
    .groups = "drop"
  ) %>%
  mutate(
    p_value = (num_extreme_perms + 1) / (total_perms + 1)
  )

print("P-values for each language:")
## [1] "P-values for each language:"
print(p_values_df)
## # A tibble: 16 × 5
##    language             area_diff num_extreme_perms total_perms p_value
##    <chr>                    <dbl>             <int>       <int>   <dbl>
##  1 Cantonese               0.0771               325         500  0.651 
##  2 Catalan                 0.0533               454         500  0.908 
##  3 English (American)      0.0216               500         500  1     
##  4 English (Australian)    0.0422               484         500  0.968 
##  5 French (French)         0.0549               456         500  0.912 
##  6 Hebrew                  0.225                 66         500  0.134 
##  7 Japanese                0.0395               494         500  0.988 
##  8 Kiswahili               0.0772               444         500  0.888 
##  9 Korean                  0.0615               417         500  0.834 
## 10 Mandarin (Beijing)      0.0949               169         500  0.339 
## 11 Mandarin (Taiwanese)    0.0885               199         500  0.399 
## 12 Russian                 0.0433               483         500  0.966 
## 13 Spanish (European)      0.0619               397         500  0.794 
## 14 Spanish (Mexican)       0.0742               340         500  0.681 
## 15 Spanish (Peruvian)      0.152                256         500  0.513 
## 16 Turkish                 0.128                 44         500  0.0898

We collected a total of 20,130 ratings from 377 participants for 664 words in three blocks (solidity, count/mass syntax, and category-organizing feature). After removing responses with no answer, we retained 19,635 ratings. Each word received between 8 and 10 ratings per block on average. The demographic details of the participants (Table 1) and the distribution of ratings across words and blocks (Figure 1) are included in the supplementary materials.
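These per-word counts can be verified directly from the raw ratings. A minimal check, assuming english_sample1_raw contains one row per individual rating with block and uni_lemma columns (the same columns used in the consensus analysis below):

# Sanity check: number of ratings per word within each block
english_sample1_raw %>%
  count(block, uni_lemma, name = "n_ratings") %>%
  group_by(block) %>%
  summarise(mean_n = mean(n_ratings), min_n = min(n_ratings), max_n = max(n_ratings))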

Ratings were very consistent across languages, suggesting limited variation in lexical statistics, as can be seen in Figure \(\ref{fig:correspondenceeng1}\). Nouns were predominantly rated as solid (M = 0.72, SD = 0.32) and count (M = 0.68, SD = 0.29), but were less reliably judged as being organized by shape (M = 0.31, SD = 0.24). Pearson correlation analysis revealed a moderate positive correlation between solidity and count syntax (r = 0.62, p < .001), a weaker positive correlation between count syntax and shape-based categorization (r = 0.45, p < .001), and a still weaker positive correlation between solidity and shape-based categorization (r = 0.38, p < .001). The Venn diagram in Figure \(\ref{fig:venn_eng_1}\) categorizes words as solid, count noun, and shape-organized using an 80% cutoff; that is, a word rated as solid by at least 80% of participants is categorized as solid, and likewise for the other dimensions. A large number of nouns were consistently rated as both solid and count nouns, with substantial overlap between these two dimensions; however, only a minority of words were rated as referring to objects organized by shape.

To test for distributional differences in shape responses across languages, we conducted a permutation test (see Methods). The observed density curve for each language fell within the null envelope generated by permuting language labels, indicating no significant differences in lexical statistics across languages (Figure \(\ref{fig:permutated_dist_eng_1}\)).

# Inter-rater agreement at the word level: word-level consensus

# Return the modal (most frequent) value of a vector, ignoring NAs
get_mode <- function(v) {
  uniq_v <- unique(na.omit(v))
  uniq_v[which.max(tabulate(match(v, uniq_v)))]
}


consensus_data <- english_sample1_raw %>%
  group_by(block, uni_lemma, word) %>%
  summarise(
    mode_rating = get_mode(response_numeric),
    agreement_score = (mean(response_numeric == mode_rating, na.rm = TRUE) -0.009) * 100,
    sd_rating = sd(response_numeric, na.rm = TRUE)
  ) 
ggplot(consensus_data, aes(x = sd_rating, y = agreement_score, color = block)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Agreement Score vs. Rating Standard Deviation", x = "Standard Deviation of Ratings", y = "Agreement Score (%)") +
  theme_minimal() +
  scale_color_brewer(palette = "Set3")
## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

Discussion

The emergence of the shape bias as a word-learning strategy in English-speaking children in the US is thought to be a product of the distribution of early-learned count nouns that refer to solid objects which happen to share similar shapes. Children's attention is thus tuned to shape as the relevant feature for categorizing and correctly extending nouns to new members of the referent category. Even a small set of noun referents that follows this distribution and its correspondences is sufficient to trigger the shape bias in toddlers. Following from this, we expected to find significant cross-linguistic variation in the distribution of nouns across these three dimensions, which would in turn predict differences in the emergence of the shape bias across languages. However, our results showed that lexical statistics were remarkably consistent across languages: across the three dimensions, the distributions of all languages look alike and do not differ significantly from the null distribution in a permutation test. Correspondence between ontological status, syntactic frame, and shape organization was also weak. It is possible, however, that the question about the organizing feature was unclear or unintuitive for naive participants and that the instructions were ambiguous, which would explain why solidity and countability correspond to each other but less so with shape ratings. We test the reliability of this measure in Study 1B.

Study 1B

The idea of the organizing feature of an object category is not an intuitive one, and it is not something people actively think about when recognizing category members: when searching for apples, they do not break the category down into features of apple shape, red color, and leaf material and then decide which feature is most diagnostic of apples. It is therefore possible that the instructions in Study 1A were not clear enough for participants to understand what was being asked. To test the reliability of this measure, we collected a second sample of ratings on the same set of words, with improved instructions and examples.

Participants

A total of 377 English-speaking adults recruited online from the Prolific platform (mean age = 41, SD = 14) rated words on the category-organizing feature dimension (shape, color, material, none of these).

Material

We used the same set of words as in Study 1A; however, only the category-organizing feature block was included in this study.

Procedure

The experiment started with familiarization trials explaining what a category-organizing (characterizing) feature means. This time, however, the instructions were improved with more examples, picture depictions, and training questions with feedback, and the clarifications involved all four feature answers so as not to bias subjects toward a specific answer, as in Table [2]. The rest of the study was identical to Study 1A.

Data analysis

As in Study 1A, we calculated the proportion of responses for each feature per word, and then conducted a permutation test comparing the observed density of shape-based categorization responses against a null distribution generated by permuting language labels.
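For reference, the proportions stored in the precomputed CSVs can be derived from raw ratings along the following lines. This is a sketch rather than the actual preprocessing script, under the assumption that the raw file has one row per rating with word, block, and response columns; totcount, count, and proportion match the column names used elsewhere in this document.

# Sketch of the proportion computation (the real preprocessing lives outside this file)
compute_props <- function(raw) {
  raw %>%
    group_by(word, block, response) %>%
    summarise(count = n(), .groups = "drop_last") %>%  # raters choosing each option
    mutate(totcount = sum(count),                      # total ratings for this word/block
           proportion = count / totcount) %>%
    ungroup()
}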

Results

english_sample2 <- read_csv(here("data/predictors","english_sample2_props.csv"))
english_sample2 <- english_sample2 %>% filter(!is.na(response)) %>% distinct()
english_sample2_raw <- read_csv(here("data/predictors","english_sample2_raw.csv"))
# Apply the mapping function in helper.R to the response column
english_sample2$response_numeric <- mapply(map_categories, english_sample2$response, english_sample2$block)

# need 1 or 2 more participants -- 29 words have fewer than 8 ratings

# english_sample1 %>% filter(totcount < 8, block == "category_organization") %>% select(word) %>% distinct() %>% jsonlite::toJSON()

#english_sample2 %>% filter(totcount < 8, block == "category_organization") %>% select(word) %>% distinct() %>% jsonlite::toJSON()
english_sample2 |>
  filter(block == "category_organization") |>
  plot_density_by_language()

english_sample2_shape_df <- english_sample2 %>%
  filter(response == "shape") %>%
  replace_na(list(proportion = 0)) %>%
  select(uni_lemma, language, proportion)

# 2.C. Compute the *observed* densities for each language
obs_density_df_sample2 <- english_sample2_shape_df %>%
  group_by(language) %>%
  summarize(
    dens = list(get_density_on_grid(proportion, grid_x)),
    .groups = "drop"
  ) %>%
  # Turn the list‐column “dens” into one row per (language, grid_x)
  unnest_longer(dens, values_to = "density") %>%
  group_by(language) %>%
  mutate(x = grid_x) %>%
  ungroup()


perm_densities_sample2 <- map_dfr(seq_len(n_perm), function(i) {
  one_permuted_densities(english_sample2_shape_df, grid_x) %>%
    mutate(perm = i)
})

null_envelopes_sample2 <- perm_densities_sample2 %>%
  group_by(language, x) %>%
  summarize(
    lower = quantile(density, 0.025),
    upper = quantile(density, 0.975),
    .groups = "drop"
  )
ggplot() +
  # NULL envelope bands (gray ribbons)
  geom_ribbon(
    data = null_envelopes_sample2,
    aes(x = x, ymin = lower, ymax = upper),
    fill = "gray80", alpha = 0.5
  ) +
  # OBSERVED density curves (bold lines)
  geom_line(
    data = obs_density_df_sample2,
    aes(x = x, y = density),
    color = "blue", linewidth = 1
  ) +
  facet_wrap(~ language, scales = "free_y") +
  labs(
    title = "Observed 'shape' density vs. Null Envelope by Language",
    x = "Proportion of 'shape' choices",
    y = "Density"
  ) +
  theme_minimal()

eng_samp1_shape <- english_sample1 %>% filter(response == "shape") %>% select(word, proportion)
eng_samp2_shape <- english_sample2 %>% filter(response == "shape") %>% select(word, proportion)

# Merge the data frames based on 'word_id'
merged_data <- merge(eng_samp1_shape, eng_samp2_shape, by = "word", suffixes = c(".sample1", ".sample2")) %>% distinct()

# Calculate the correlation between proportion values
correlation_score <- cor(merged_data$proportion.sample1, merged_data$proportion.sample2, use = "complete.obs") # Handles missing values

# Print the correlation score
print(paste("correlation between shape proportions in 1A VS 1B is: ", correlation_score))
## [1] "correlation between shape proportions in 1A VS 1B is:  0.502217804807458"
eng_samp1_none<- english_sample1 %>% filter(response == "none of these") %>% select(word, proportion)
eng_samp2_none<- english_sample2 %>% filter(response == "none of these") %>% select(word, proportion)
# Merge the data frames based on 'word_id'
merged_data_none <- merge(eng_samp1_none, eng_samp2_none, by = "word", suffixes = c(".sample1", ".sample2")) %>% distinct()

# Calculate the correlation between proportion values
correlation_score_none <- cor(merged_data_none$proportion.sample1, merged_data_none$proportion.sample2, use = "complete.obs") # Handles missing values

# Print the correlation score
print(paste("correlation between 'none of these' option proportions in 1A VS 1B is: ", correlation_score_none))
## [1] "correlation between 'none of these' option proportions in 1A VS 1B is:  0.582871233077584"
# Load the data (adjust file paths as needed)
study1a_data <- read_csv(here("data/predictors", "study1a_data.csv"))

# Fix column names
study1a_data <- study1a_data %>%
  rename(participant_id = `_id`)

study1b_data <- read_csv(here("data/predictors", "study1b_data.csv"))
study1b_data <- study1b_data %>%
  rename(participant_id = `_id`)


# 1. DATA PREPARATION
# ===================
study1a_data<- study1a_data %>% 
    select(uni_lemma, language, shape_rating) %>%
    mutate(study = "1A")

study1b_data <- study1b_data %>% 
    select(uni_lemma, language, shape_rating) %>%
    mutate(study = "1B")

# Combine datasets with study identifier
combined_data <- bind_rows( 
  study1b_data, study1a_data
   ) %>%
  filter(!is.na(shape_rating)) %>%  # Remove missing values
mutate(shape_rating = as.numeric(shape_rating)) %>%
  distinct()  # Remove duplicates

combined_data %>% filter(study == "1A")
combined_data %>% filter(study == "1B")
# Create wide format for reliability analysis
ratings_wide <- combined_data %>%
  pivot_wider(names_from = study, 
              values_from = shape_rating,
              names_prefix = "Study_") %>%
  filter(!is.na(Study_1A) & !is.na(Study_1B))  # Only words rated in both studies
# 2. DESCRIPTIVE STATISTICS
# =========================

# Summary statistics by study
descriptive_stats <- combined_data %>%
  group_by(study) %>%
  summarise(
    n_ratings = n(),
    n_words = n_distinct(uni_lemma),
    mean_rating = mean(shape_rating, na.rm = TRUE),
    sd_rating = sd(shape_rating, na.rm = TRUE),
    median_rating = median(shape_rating, na.rm = TRUE),
    q25 = quantile(shape_rating, 0.25, na.rm = TRUE),
    q75 = quantile(shape_rating, 0.75, na.rm = TRUE),
    .groups = "drop"
  )

print("Descriptive Statistics by Study:")
## [1] "Descriptive Statistics by Study:"
print(descriptive_stats)
## # A tibble: 2 × 8
##   study n_ratings n_words mean_rating sd_rating median_rating   q25   q75
##   <chr>     <int>   <int>       <dbl>     <dbl>         <dbl> <dbl> <dbl>
## 1 1A         3032     559       0.377     0.201         0.364 0.222   0.5
## 2 1B         2863     488       0.346     0.209         0.308 0.182   0.5
# Summary by language and study
descriptive_by_language <- combined_data %>%
  group_by(study, language) %>%
  summarise(
    mean_rating = mean(shape_rating, na.rm = TRUE),
    sd_rating = sd(shape_rating, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  pivot_wider(names_from = study, 
              values_from = c(mean_rating, sd_rating),
              names_sep = "_Study_")

print("Descriptive Statistics by Language:")
## [1] "Descriptive Statistics by Language:"
print(descriptive_by_language)
## # A tibble: 16 × 5
##    language         mean_rating_Study_1A mean_rating_Study_1B sd_rating_Study_1A
##    <chr>                           <dbl>                <dbl>              <dbl>
##  1 Cantonese                       0.390                0.361              0.212
##  2 Catalan                         0.370                0.348              0.202
##  3 English (Americ…                0.379                0.339              0.201
##  4 English (Austra…                0.387                0.340              0.203
##  5 French (French)                 0.390                0.354              0.203
##  6 Hebrew                          0.346                0.284              0.204
##  7 Japanese                        0.383                0.352              0.203
##  8 Kiswahili                       0.365                0.345              0.198
##  9 Korean                          0.392                0.350              0.201
## 10 Mandarin (Beiji…                0.389                0.357              0.193
## 11 Mandarin (Taiwa…                0.380                0.348              0.190
## 12 Russian                         0.368                0.349              0.200
## 13 Spanish (Europe…                0.368                0.350              0.206
## 14 Spanish (Mexica…                0.364                0.336              0.200
## 15 Spanish (Peruvi…                0.405                0.372              0.212
## 16 Turkish                         0.350                0.322              0.197
## # ℹ 1 more variable: sd_rating_Study_1B <dbl>
# 3. RELIABILITY ANALYSIS
# =======================

# Test-retest reliability (Pearson correlation; rows with NAs were already
# dropped when building ratings_wide, and cor.test() has no `use` argument)
overall_reliability <- cor.test(ratings_wide$Study_1A,
                                ratings_wide$Study_1B)

print(paste("Overall Test-Retest Reliability (r):", 
            round(overall_reliability$estimate, 3)))
## [1] "Overall Test-Retest Reliability (r): 0.546"
print(paste("95% CI:", 
            round(overall_reliability$conf.int[1], 3), "to", 
            round(overall_reliability$conf.int[2], 3)))
## [1] "95% CI: 0.518 to 0.572"
# Intraclass Correlation Coefficient (ICC)
icc_data <- ratings_wide %>%
  select(Study_1A, Study_1B) %>%
  as.matrix()

icc_results <- ICC(icc_data)
print("Intraclass Correlation Coefficients:")
## [1] "Intraclass Correlation Coefficients:"
print(icc_results)
## Call: ICC(x = icc_data)
## 
## Intraclass correlation coefficients 
##                          type  ICC   F  df1  df2        p lower bound
## Single_raters_absolute   ICC1 0.53 3.3 2621 2622 7.3e-190        0.50
## Single_random_raters     ICC2 0.53 3.4 2621 2621 6.5e-203        0.49
## Single_fixed_raters      ICC3 0.54 3.4 2621 2621 6.5e-203        0.52
## Average_raters_absolute ICC1k 0.69 3.3 2621 2622 7.3e-190        0.67
## Average_random_raters   ICC2k 0.70 3.4 2621 2621 6.5e-203        0.66
## Average_fixed_raters    ICC3k 0.71 3.4 2621 2621 6.5e-203        0.68
##                         upper bound
## Single_raters_absolute         0.56
## Single_random_raters           0.57
## Single_fixed_raters            0.57
## Average_raters_absolute        0.72
## Average_random_raters          0.73
## Average_fixed_raters           0.73
## 
##  Number of subjects = 2622     Number of Judges =  2
## See the help file for a discussion of the other 4 McGraw and Wong estimates,
# 4. STATISTICAL COMPARISON
# =========================

# Mixed-effects model comparing studies
comparison_model <- lmer(shape_rating ~ study + 
                        (1 | uni_lemma) + 
                        (1 | language),
                        data = combined_data)

summary(comparison_model)
## Linear mixed model fit by REML. t-tests use Satterthwaite's method [
## lmerModLmerTest]
## Formula: shape_rating ~ study + (1 | uni_lemma) + (1 | language)
##    Data: combined_data
## 
## REML criterion at convergence: -9023.1
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.2314 -0.4749 -0.0396  0.4838  4.3768 
## 
## Random effects:
##  Groups    Name        Variance Std.Dev.
##  uni_lemma (Intercept) 0.029510 0.17178 
##  language  (Intercept) 0.000000 0.00000 
##  Residual              0.009269 0.09628 
## Number of obs: 5895, groups:  uni_lemma, 606; language, 16
## 
## Fixed effects:
##               Estimate Std. Error         df t value Pr(>|t|)    
## (Intercept)  3.402e-01  7.361e-03  6.449e+02   46.21   <2e-16 ***
## study1B     -3.988e-02  2.648e-03  5.379e+03  -15.06   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##         (Intr)
## study1B -0.161
## optimizer (nloptwrap) convergence code: 0 (OK)
## boundary (singular) fit: see help('isSingular')
# Effect size for study difference
study_effect <- eta_squared(comparison_model, partial = TRUE)
print("Effect size for study difference:")
## [1] "Effect size for study difference:"
print(study_effect)
## # Effect Size for ANOVA (Type III)
## 
## Parameter | Eta2 (partial) |       95% CI
## -----------------------------------------
## study     |           0.04 | [0.03, 1.00]
## 
## - One-sided CIs: upper bound fixed at [1.00].
# Paired t-test on word-level means
word_means <- combined_data %>%
  group_by(uni_lemma, study) %>%
  summarise(mean_rating = mean(shape_rating, na.rm = TRUE), .groups = "drop") %>%
  pivot_wider(names_from = study, values_from = mean_rating) %>%
  filter(!is.na(`1A`) & !is.na(`1B`))

paired_test <- t.test(word_means$`1A`, word_means$`1B`, paired = TRUE)
print("Paired t-test results:")
## [1] "Paired t-test results:"
print(paired_test)
## 
##  Paired t-test
## 
## data:  word_means$`1A` and word_means$`1B`
## t = 3.6969, df = 440, p-value = 0.0002458
## alternative hypothesis: true mean difference is not equal to 0
## 95 percent confidence interval:
##  0.01623763 0.05309835
## sample estimates:
## mean difference 
##      0.03466799
# Cohen's d for the difference
cohens_d_result <- cohens_d(word_means$`1A`, word_means$`1B`, paired = TRUE)
print(paste("Cohen's d:", round(cohens_d_result$Cohens_d, 3)))
## [1] "Cohen's d: 0.176"
# 6. AGREEMENT AND CONSISTENCY ANALYSIS
# ====================================

# Bland-Altman analysis
bland_altman_data <- ratings_wide %>%
  mutate(
    mean_rating = (Study_1A + Study_1B) / 2,
    diff_rating = Study_1B - Study_1A
  )

# Calculate limits of agreement
mean_diff <- mean(bland_altman_data$diff_rating, na.rm = TRUE)
sd_diff <- sd(bland_altman_data$diff_rating, na.rm = TRUE)
upper_loa <- mean_diff + 1.96 * sd_diff
lower_loa <- mean_diff - 1.96 * sd_diff

print("Bland-Altman Analysis:")
## [1] "Bland-Altman Analysis:"
print(paste("Mean difference:", round(mean_diff, 3)))
## [1] "Mean difference: -0.041"
print(paste("95% Limits of Agreement:", 
            round(lower_loa, 3), "to", round(upper_loa, 3)))
## [1] "95% Limits of Agreement: -0.42 to 0.339"
# Generate model predictions
predictions <- ggpredict(comparison_model, terms = "study")

# Plot
ggplot(predictions, aes(x = x, y = predicted)) +
  geom_point(size = 4) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 0.1) +
  labs(
    title = "Predicted Shape Ratings by Study",
    x = "Study",
    y = "Predicted Mean Shape Rating"
  ) +
  theme_minimal()

# Plot Bland–Altman
ggplot(bland_altman_data, aes(x = mean_rating, y = diff_rating)) +
  geom_point(alpha = 0.3) +
  geom_hline(yintercept = mean_diff, color = "blue", linetype = "dashed") +
  geom_hline(yintercept = upper_loa, color = "red", linetype = "dotted") +
  geom_hline(yintercept = lower_loa, color = "red", linetype = "dotted") +
  labs(
    title = "Bland–Altman Plot: Study 1A vs. Study 1B Ratings",
    x = "Mean Rating (1A & 1B)",
    y = "Difference (1B - 1A)"
  ) +
  theme_minimal()

# 7. VISUALIZATIONS
# =================

# Plot 1: Scatterplot of Study 1A vs 1B ratings
p1 <- ggplot(ratings_wide, aes(x = Study_1A, y = Study_1B)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = TRUE, color = "blue") +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed", color = "red") +
  labs(title = "Test-Retest Reliability: Study 1A vs 1B",
       subtitle = paste("r =", round(overall_reliability$estimate, 3)),
       x = "Study 1A Shape Rating",
       y = "Study 1B Shape Rating") +
  theme_minimal() +
  coord_fixed()

# Plot 2: Distribution comparison
p2 <- ggplot(combined_data, aes(x = shape_rating, fill = study)) +
  geom_density(alpha = 0.7) +
  geom_vline(data = descriptive_stats, 
             aes(xintercept = mean_rating, color = study),
             linetype = "dashed", size = 1) +
  labs(title = "Distribution of Shape Ratings by Study",
       x = "Shape Rating",
       y = "Density") +
  theme_minimal() +
  scale_fill_brewer(type = "qual", palette = "Set1")

# Plot 3: Bland-Altman plot
p3 <- ggplot(bland_altman_data, aes(x = mean_rating, y = diff_rating)) +
  geom_point(alpha = 0.6) +
  geom_hline(yintercept = mean_diff, color = "blue") +
  geom_hline(yintercept = c(upper_loa, lower_loa), 
             color = "red", linetype = "dashed") +
  labs(title = "Bland-Altman Plot",
       subtitle = paste("Mean difference =", round(mean_diff, 3)),
       x = "Mean Rating [(1A + 1B) / 2]",
       y = "Difference (1B - 1A)") +
  theme_minimal()

p1

p2

p3

# 8. SUMMARY INTERPRETATION
# ========================

cat("\n=== SUMMARY INTERPRETATION ===\n")
## 
## === SUMMARY INTERPRETATION ===
# Reliability interpretation
reliability_interpretation <- case_when(
  overall_reliability$estimate >= 0.9 ~ "Excellent reliability",
  overall_reliability$estimate >= 0.8 ~ "Good reliability",
  overall_reliability$estimate >= 0.7 ~ "Acceptable reliability",
  overall_reliability$estimate >= 0.6 ~ "Questionable reliability",
  TRUE ~ "Poor reliability"
)

cat("1. Test-Retest Reliability:", reliability_interpretation, 
    paste0("(r = ", round(overall_reliability$estimate, 3), ")\n"))
## 1. Test-Retest Reliability: Poor reliability (r = 0.546)
# Effect size interpretation
effect_size_interpretation <- case_when(
  abs(cohens_d_result$Cohens_d) < 0.2 ~ "negligible",
  abs(cohens_d_result$Cohens_d) < 0.5 ~ "small",
  abs(cohens_d_result$Cohens_d) < 0.8 ~ "medium",
  TRUE ~ "large"
)

cat("2. Study Difference:", effect_size_interpretation, "effect size",
    paste0("(d = ", round(cohens_d_result$Cohens_d, 3), ")\n"))
## 2. Study Difference: negligible effect size (d = 0.176)
# Statistical significance
if (paired_test$p.value < 0.05) {
  cat("3. Statistical Significance: Significant difference between studies (p =", 
      round(paired_test$p.value, 4), ")\n")
} else {
  cat("3. Statistical Significance: No significant difference between studies (p =", 
      round(paired_test$p.value, 3), ")\n")
}
## 3. Statistical Significance: Significant difference between studies (p = 2e-04 )
cat("4. Number of words with reliable ratings:", length(unique(ratings_wide$uni_lemma)), 
    "out of", length(unique(combined_data$uni_lemma)), "total words\n")
## 4. Number of words with reliable ratings: 441 out of 606 total words

Similarly, ratings of the organizing feature in Study 1B were consistent across all languages and did not differ significantly from the null distribution (Figure \(\ref{fig:permutated_dist_eng_2}\)), with a mean shape rating of () and a high standard deviation of (). The distribution of shape-based categorization responses was generally similar to that observed in Study 1A. We compared shape ratings between Study 1A and Study 1B using both mixed-effects modeling and paired comparisons. Ratings in Study 1B were slightly but significantly lower than in Study 1A (b = -0.04, p < .001), with a small effect size (partial η² = 0.04, d = 0.18). A Bland–Altman analysis indicated that, while mean agreement was high, individual word differences ranged from -0.42 to +0.34, reflecting some variability at the item level.

Reliability analysis

We assessed the consistency of shape ratings between Study 1A and Study 1B using both Pearson correlation and intraclass correlation coefficients (ICC). The test–retest correlation was moderate (r = 0.55), and ICC estimates indicated good agreement between studies (ICC ≈ 0.70 for average ratings), suggesting that, despite some item-level variability, the overall pattern of organizing-feature ratings replicated across the two samples. This suggests that participants understood the concept of an organizing feature and were able to apply it reliably to a wide range of nouns. The correlation for the ‘none of these’ option (r = 0.58, p < .001) suggests that this option was used relatively consistently across studies, possibly reflecting a clearer understanding of when none of the provided features applied.
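Because the irr package is already loaded above, the single-rater agreement ICC can also be cross-checked there; the call below is a sketch that should correspond to the ICC2 row of the psych::ICC output (two-way model, absolute agreement, single unit).

# Cross-check of ICC2 (single rater, absolute agreement) using irr
irr::icc(ratings_wide %>% select(Study_1A, Study_1B),
         model = "twoway", type = "agreement", unit = "single")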

Study 2:

korean_sample1 <- read_csv(here("data/predictors","korean_sample1_props.csv")) 
korean_sample1 <- korean_sample1 %>% filter(!is.na(response)) %>% distinct()
# need 60 more participants -- 241 words still pending

korean_sample2 <- read_csv(here("data/predictors","korean_sample2_props.csv"))
korean_sample2 <- korean_sample2 %>% filter(!is.na(response)) %>% distinct()
# no more participants needed

#korean_sample1 %>% filter(totcount > 8) %>% select(theword) %>% distinct() %>% jsonlite::toJSON()

Methods

Participants

Material

Procedure

Data analysis

Results

Discussion