Introduction

This RPub documents a pipeline for simulation-based Bayesian sample size planning. This planning pipeline uses tools from tidyverse for data wrangling, brms for Bayesian modeling, and tidybayes for posterior analysis.

The Pipeline

The planning pipeline includes two components. The first component is an analysis pipeline. The second component iterates over that analysis pipeline using simulated data. The results of the second component inform one’s choice of sample size.

The first component is the analysis pipeline that one will use to analyze the data from one’s experiment. This component is created before or after collecting pilot data and is used to analyze that data. In turn, the results of the pilot analysis inform the second component of the planning pipeline. In the second component, one simulates a population using parameter values inferred from the pilot data. Then, this simulated population is randomly sampled many times over while systematically increasing the sample size. The analysis pipeline from the first component is then iterated over, once per sample. The results of this pilot data-informed, iterative procedure are the quantities used for sample size planning.

The use of pilot data in this pipeline is intended to decrease the uncertainty about effects associated with power analysis. Both components should be preregistered. Specifically, one should preregister their main inference pipeline (i.e., the first component) and the code and results of their sample size simulation (i.e., the second component).

Importantly, several code chunks in this pipeline may be repurposed with little or no modification. Other chunks require modification to fit one’s experimental design. Repurposing this pipeline requires nontrivial conceptual and R skill. However, the extensive documentation and commenting should ease things.

Skippable

Background

This pipeline collects the statistical practices that I have found useful in my own research. In part, these practices implement my effort to maximize the generalizability of my inferences in light of the sparse data characteristic of my studies. Towards this end, this pipeline assumes that investigators adopt a posture of model expansion via the construction of probabilistic models that account for various sources of uncertainty, rather than of comparison with simpler models; the use of marginal rather than conditional probabilities; and prediction via expected values rather than inference on parameters.

Moreover, this pipeline implements my effort to faithfully translate my scientific hypotheses into statistical hypotheses. As I see it, this is the only way to conduct statistical tests that logically, rather than merely conventionally, speak to the content of the conclusions that I want to advance. Typically, I (and many others) advance directional hypotheses that beget directional predictions. For example, I might hypothesize some psychological mechanism that predicts that observed mean A > observed mean B. Then, I advance directional conclusions based on the results of my directional tests. Specifically, this pipeline measures the evidence for the hypothesized directional effect, compared to the alternative directional effect that was not hypothesized (here is a comparison of Bayesian directional testing to one-tailed p-values; here is a recent revival of one-tailed p-values). In curder terms, this pipeline assumes that the null-is-nil is fake news because it is quasi-always false and so is ignored here.

Specifically, this pipeline implements inference over the odds of competing directional hypotheses. If both directional hypotheses are a priori equiprobable, then their posterior odds is equivalent to a Bayes factor that quantifies the evidence for one directional hypothesis over the other. Notably, the posterior odds of directional hypotheses qua Bayes factor avoids the sensitivity to prior dispersion characteristic of model comparison Bayes factors or Bayes factors computed via the Savage-Dickey density ratio. This is because the posterior odds of direction is an “objective” property of the posterior. Specifically, symmetric changes to prior dispersion about zero cancel each other out, as the prior odds is always 1. In practice, I have found the posterior odds to be essentially unimpacted by prior dispersion until one places assumes theoretically implausibly tight priors, at which point the posterior odds collapses to 1. Therefore, while priors can encode theory, in the present pipeline priors are practical tools for model identifiability, regularization, and robust prediction in rich, hierarchical, probabilistic models. Moreover, in this pipeline the posterior odds are computed from the posterior distribution over the marginalized expected value of the response, as opposed to the parameters. That is, the quantity used for inference is averaged across all the uncertainty of the modeled covariates.

Additional Resources

Quick introductions to Bayesian inference can be found here and here. In addition to Kruschke’s and Gelman’s groups is Wagenmakers’ group (e.g., here is the latter group’s Bayesian sample size planning pipeline for developmental psychologists). Personally, I prefer Kruschke and Gelman. Gelman et al.’s Bayesian Data Analysis has been a key learning tool for me. In my opinion, Wagenmakers group relies too heavily on model comparison Bayes factors, which are overly sensitive to the prior (discussed above) while, nonetheless, still offering important benefits over p-values (summarized here, such as optional stopping. Here’s an introduction to brms by the creator of brms. This resource is particularly useful for the present pipeline, which implements Bayesian inference via brms. Generalist Bayesian workflows for data analysis can be found here and here. Kruschke has a thorough article on Bayesian reporting guidelines. For the deep lore on Bayesian (vs. Frequentist) inference, one might start with Ramsey, Fisher, Savage (this, this, this), Lindley, de Finetti, or Gigerenzer (e.g., this, this, or this). Clayton cloaks a thorough summary of the references, above, in best-practices garb while also introducing Bayesian and frequentist inference in an informal but precise way. Classic statements on the history of the concept of probability are by Daston and Hacking, while Porter provides a history of statistics. Gigerenzer and colleagues’ classic work summarizes the former three works, and describes the influence of the concept of probability on science and society. Note that the preceding is merely a selection of some of the sources from which I have benefited, rather than a representative collection of any area of literature.

Future Directions

Conceivably, this pipeline could develop into a simple web interface or R package, like the Frequentist SuperPower. In the future, I envision a user experience that allows one to easily specify various population parameters, parameters for sampling from the population, model structures and priors, and statistical quantities to be computed from the posterior models to be used for sample size planning. I also hope to publish a tutorial of this pipeline, which may perhaps be seen as bringing together recent advances in simulation-based sample size planning for experimental designs in hierarchical models by Pargent and colleagues with advances in Bayesian sample size planning for developmentalists by Wagenmakers and colleagues.

First Component

In the first component, we develop our analysis pipeline based on an exploratory study. This pipeline is the pipeline that we will use for inference in the target experiment whose sample size we are planning.

To illustrate this, the analysis pipeline below is developed for an exploratory experiment into 4-year-olds’ understanding of generic knowledge as common ground. Here is the preregistration. The data are all valid PANDA cases, although interference and reliability coding has not yet been completed.

In brief, participants behaved as predicted in the warmup and control phases. The purposes of these phases were to ensure that participants appropriately interpreted unambiguous uses of the pronominal forms used in the test phase (warmup) and to ensure that participants’ test phase responses were due to their conceptual learning rather than syntactic or semantic learning.

In the test phase there was, unexpectedly, strong evidence for a crossover interaction between condition and gender. Specifically, whereas boys behaved as expected, girls reliably responded in the opposite direction. This interaction contrasted with our prediction of an unmoderated main effect of condition. After children learned about Zarpies with generic language and Flurps with specific language, we expected that children would choose Zarpies in response to ambiguous generic reference and Flurps in response to ambiguous specific reference. Boys generally followed this pattern, although they equivocated between Zarpies and Flurps in response to ambiguous specific reference. In contrast, girls often chose Zarpies in response to ambiguous specific reference, with some evidence that they did so more often than in the generic condition. So, boys and girls displayed opposite directional effects, although the evidence for effect direction among boys was stronger than it was among girls.

Globals

Session packages and seed.

library(brms)      # Models
library(furrr)     # Parallelization
library(future)    # Parallelization
library(ggtext)    # Plots
library(janitor)   # Cleaning
library(knitr)     # Reporting
library(parallel)  # Parallelization
library(patchwork) # Plots
library(readxl)    # Data import
library(scales)    # Plots
library(tidybayes) # Analysis
library(tidyverse) # Wrangling

seed <- 10003 # RNG seed

Data Import and Wrangling

# Import Qualtrics data
gkcg_qualtrics0 <- read_csv("data/kids_qualtrics_2026_feb14.csv")
gkcg_qualtrics <- gkcg_qualtrics0 %>% 
  row_to_names(1) %>% # first row to column names
  clean_names()

# Import PANDA data
gkcg_panda0 <- read_csv("data/kids_panda_2026_feb14.csv")
gkcg_panda <- gkcg_panda0 %>% 
  clean_names() %>% 
  mutate(participation_date = as.Date(recorded_on), # date of participation
         age_num = as.numeric(participation_date - birthday) / 365.25)  # exact age in years

# Remove invalid PANDA cases
gkcg_panda_nopilot <- gkcg_panda %>% 
  filter(participation_date >= "2025-12-12")  # Remove experimenter and pilot observations

gkcg_panda_valid <- gkcg_panda_nopilot %>% 
  filter(is_valid_case == "Yes") %>% # Only valid participants
  arrange(recorded_on) %>% # arrange participation records by date/time
  mutate(id_new = row_number()) # id's organized by date/time of participation

sum(gkcg_panda_nopilot$is_valid_case == "No") # Count of invalid cases from non-pilot data

id_new <- setNames(gkcg_panda_valid$id_new, gkcg_panda_valid$id)

# Create dataset
gkcg_data0 <- gkcg_qualtrics %>%
  filter(id %in% gkcg_panda_valid$id) %>%  # Include only valid PANDA cases
  select(   # Include only these columns
    id, # participant id
    starts_with("wrm"), # warmup trials
    starts_with("generic"), # test trials
    starts_with("specific"), # test trials
    starts_with("c_")) %>% # control trials
  select(-c(ends_with("_start"), ends_with("_end"))) %>% # Remove timing columns
  pivot_longer(cols = !id, names_to = "variable", values_to = "response") %>% # Long data
  filter(response == "On") %>% # Exclude non-selected items
  left_join(gkcg_panda_valid %>% # Join with covariates from PANDA data (by PANDA ID)
              select(id, gender, age_num), by = "id") %>% 
  separate(variable,
           into = c("variable", "response"),
           sep = "_(?=[^_]+$)") %>%  # Split variable and response
  mutate(gender = case_when( # Relevel gender
    gender == "M" ~ "boy",
    gender == "F" ~ "girl"),
    variable = str_remove(variable, "_q"), # Remove leading "_q" in string
    depicted_gender = case_when( # Create depicted gender
      str_ends(variable, "boy") ~ NA_character_,
      str_ends(variable, "girl") ~ NA_character_,
      str_ends(variable, "_b") ~ "depicted_boy",
      str_ends(variable, "_g") ~ "depicted_girl",
      T ~ "0"),
    trial_type = case_when( # Create trial type
      str_detect(variable, "wrm") ~ "warmup",
      str_detect(variable, "generic") ~ "test",
      str_detect(variable, "specific") ~ "test",
      str_detect(variable, "c_") ~ "control",
      T ~ NA_character_),
    trial_subtype = case_when( # Create trial subtype
      str_detect(variable, "boy") ~ "boy_correct", # Warmup (attention)
      str_detect(variable, "girl") ~ "girl_correct",
      str_detect(variable, "wrm_3pp") ~ "3pp", # Warmup (3p.)
      str_detect(variable, "wrm_3ps") ~ "3ps", 
      str_detect(variable, "generic") ~ "generic", # Test
      str_detect(variable, "specific") ~ "specific",
      str_detect(variable, "num_phrs_match") ~ "number_or_phrasal_matching", # Control
      str_detect(variable, "cnjgtn_match") ~ "conjugation_matching",
      T ~ NA_character_),
    condition = ifelse(trial_subtype %in% c("generic", "specific"), paste0(trial_subtype), NA_character_), # Create condition
    item = case_when( # Create item
      str_detect(variable, "ride") ~ "ride",
      str_detect(variable, "catch") ~ "catch",
      str_detect(variable, "play") ~ "play",
      str_detect(variable, "train") ~ "train",
      str_detect(variable, "wash") ~ "wash",
      str_detect(variable, "rake") ~ "rake",
      str_detect(variable, "demo") ~ "demonstrative_pronoun",
      str_detect(variable, "personal") ~ "personal_pronoun",
      T ~ NA_character_),
    response_correct = case_when(trial_subtype == "3pp" & response == "plural" ~ "Correct", # Correct (warmup trials)
                                 trial_subtype == "3ps" & response == "singular" ~ "Correct",
                                 trial_subtype == "boy_correct" & response == "boy" ~ "Correct",
                                 trial_subtype == "girl_correct" & response == "girl" ~ "Correct",
                                 trial_type %in% c("test", "control") ~ NA_character_,
                                 T ~ "Incorrect"),
    across(where(is.character), ~ str_to_lower(.)), # convert characters to lowercase
    across(where(is.character), ~ as.factor(.)), # convert characters to factors
    id = id_new[id]) %>%
  group_by(id, trial_type) %>% # Group by id, trial type
  mutate(trial = row_number()) %>% # Count trial number
  ungroup() %>% 
  select(-variable) # Remove "variable" column

# Remove participants with zero test trials
trials <- gkcg_data0 %>% count(id, trial_subtype) %>% # count of each trial_subtype by participant
  ungroup() %>% 
  complete(id,
           trial_subtype,
           fill = list(n = 0)) %>% 
  pivot_wider(id_cols = id,
              names_from = trial_subtype,
              values_from = n) %>% 
  filter(generic == 0 & specific == 0) # only participant 33 contributed no trials

age_exclusions <- gkcg_data0 %>% 
  distinct(id, gender, age_num) %>% # distinct id - gender - age
  filter(age_num < 4 | age_num > 5) # include kids < 4-yo or > 5-yo

# Participants included
gkcg_inclusions <- gkcg_data0 %>% 
  filter(!id %in% trials$id, # remove X participants who contributed no trials
         !id %in% age_exclusions$id) %>% 
         # id <= 50 + length(trials$id)) %>% # include X many more participants with id's over 50
  distinct(id)

gkcg_data <- gkcg_data0 %>% 
  filter(id %in% gkcg_inclusions$id) # id's to include

n_distinct(gkcg_data$id) # participant count

Results

Descriptive Statistics

There were 1368 observations from 118 participants (mean age: 4.52 years, range: 4.01 years to 4.99 years).

Of the 1368 observations, 457 were warmup trials, 684 were test trials, and 227 were control trials.

Of the 118 participants, 65 were girls (mean age: 4.53 years, range: 4.02 years to 4.99 years) and 53 were boys (mean age: 4.53 years, range: 4.02 years to 4.99 years).

Warmup Phase

The trials completed, data, predictive accuracy, model, analysis, and posterior distributions are presented for the warmup and test phases of the experiment. The data from the control phase is displayed at the end.

Summary: Participants responded correctly at high rates in the warmup phase. The weakest evidence for systematic, correct responding was found for participants’ comprehension of the third-person plural in its personal (they) and demonstrative (those ones) forms. Nonetheless, correct response rates were high in these trials. (NB: Those ones is, really, a third-person plural construction, unlike they, but both are referred to as third-person plural pronouns due to their common function in this experiment.)

Trials Completed

Participant counts are inset.

Data

Participant counts are inset.

descriptives <- function(trial_type_level){
  gkcg_data %>% 
    filter(trial_type == {{ trial_type_level }}) %>% # Include only this trial type
    count(trial_subtype, response) %>% # Count observations per cell
    group_by(trial_subtype) %>% # Set grouping
    mutate(prop = n / sum(n)) %>% # Compute proportion
    ungroup() # Remove grouping
}

descriptives("warmup") # compute descriptives for warmup trials

# main figure
complete_boy <- descriptives("warmup") %>%
  filter(trial_subtype %in% c("boy_correct", "girl_correct")) %>%
  droplevels() %>%
  complete(trial_subtype, response, fill = list(n = 0, prop = 0))

descriptives_warmup <- descriptives("warmup") %>%
  filter(!trial_subtype %in% c("boy_correct", "girl_correct")) %>%
  rbind(complete_boy) %>%
  mutate(
    trial_subtype = factor(case_when(
      str_detect(trial_subtype, "girl") ~ "Girl",
      str_detect(trial_subtype, "boy") ~ "Boy",
      str_detect(trial_subtype, "3pp") ~ "Plural",
      str_detect(trial_subtype, "3ps") ~ "Singular",
      TRUE ~ NA_character_
    ), levels = c("Girl", "Boy", "Plural", "Singular")),
    response = str_to_title(response),
    vjust_text = ifelse(n <= 4, -0.3, 1.3),
    response = case_when(response == "Singular" ~ "Sg.",
                         response == "Plural" ~ "Pl.",
                         T ~ response),
    response = factor(response, levels = c("Girl", "Boy", "Pl.", "Sg.")))

fig1a <- ggplot(descriptives_warmup, aes(x = response, y = prop)) +
  facet_grid(~ trial_subtype, scales = "free") +
  geom_bar(stat = "identity", fill = "deepskyblue1") +
  geom_text(aes(label = n, vjust = vjust_text),
            color = "black",
            family = "Times New Roman",
            size = 4.25) +
  scale_y_continuous(
    labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
    breaks = c(.25, .5, .75)
  ) +
  labs(
    x = "Response",
    y = "Proportion of Participants"
    ) +
  theme_classic(base_size = base_size, base_family = "Times New Roman") +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3)
  )
# print(fig1a) # print for markdown

Here is the distribution of responses across plural and singular third-person demonstrative and personal pronouns. Participant counts are inset.

# Comprehension figure
comprehension_data <- gkcg_data %>%
  filter(item %in% c("demonstrative_pronoun",
                     "personal_pronoun")) %>% 
  count(trial_subtype, item, response) %>% 
  ungroup() %>% 
  group_by(trial_subtype, item) %>% 
  mutate(response = str_to_title(response),
         levels_for_graph = case_when(trial_subtype == "3pp" & str_detect(item, "dem") ~ "Plural Demonstrative",
                                      trial_subtype == "3pp" & str_detect(item, "per") ~ "Plural Personal",
                                      trial_subtype == "3ps" & str_detect(item, "dem") ~ "Singular Demonstrative",
                                      trial_subtype == "3ps" & str_detect(item, "per") ~ "Singular Personal",
                                      T ~ NA),
         prop = n / sum(n),
         vjust_text = ifelse(n == 0, -0.3, 1.3)) %>% 
  ungroup() %>% 
  mutate(levels_for_graph = str_replace(levels_for_graph, "Plural", "Pl."),
         levels_for_graph = str_replace(levels_for_graph, "Singular", "Sg."),
         levels_for_graph = str_replace(levels_for_graph, "Demonstrative", "Dem."),
         levels_for_graph = str_replace(levels_for_graph, "Personal", "Pers."),
         response = case_when(response == "Plural" ~ "Pl.",
                              response == "Singular" ~ "Sg.",
                              TRUE ~ NA_character_))

func_fig2 <- function(level, x_val, y_val, label) {
  geom_text(
    data = data.frame(
      levels_for_graph = {{level}},
      y = y_val,
      x = x_val,
      label = label
    ),
    aes(x = x, y = y, label = label),
    inherit.aes = FALSE,
    family = "Times New Roman",
    fontface = "bold",
    size = 3.5
  )
}

fig2a <- ggplot(comprehension_data, aes(x = response, y = prop)) +
  facet_grid(~ levels_for_graph, scales = "free") +
  geom_bar(stat = "identity", fill = "deepskyblue1") +
  geom_text(aes(label = n, vjust = vjust_text),
            color = "black",
            family = "Times New Roman",
            size = 4.25) +
  scale_y_continuous(
    labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
    breaks = c(.25, .5, .75)
  ) +
  labs(
    x = "Response",
    y = "Proportion of Participants",
  ) +
  scale_fill_identity() +
  theme_classic(base_size = base_size, base_family = "Times New Roman") +
  theme(
    plot.title = element_text(face = "bold", hjust = 0.5),
    panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3)
  ) +
  func_fig2(level = "Pl. Dem.", y = .9, x = 1.5, label = "Those ones") +
  func_fig2(level = "Pl. Pers.", y = .9, x = 1.5, label = "They") +
  func_fig2(level = "Sg. Dem.", y = .9, x = 1.5, label = "This one") +
  func_fig2(level = "Sg. Pers.", y = .9, x = 1.5, label = "S/He")

Predictive Accuracy

predictive_accuracy_warmup <- gkcg_data %>%  # How often were responses predicted compared to unpredicted?
  filter(trial_type == "warmup") %>% 
  mutate(prediction = case_when(trial_subtype == "girl_correct" & response == "girl" ~ "accurate",
                               trial_subtype == "boy_correct" & response == "boy" ~ "accurate",
                               trial_subtype == "3pp" & response == "plural" ~ "accurate",
                               trial_subtype == "3ps" & response == "singular" ~ "accurate",
                               T ~ "inaccurate")) %>% 
  count(name = "count", prediction) %>% 
  mutate(odds_of_accurate_prediction = count[prediction == "accurate"] / count[prediction == "inaccurate"])

Of the 457 warmup trial observations, 380 were predicted accurately (e.g., participants chose a dyad in the plural comprehension trial or an individual in the singular trial). In contrast, 77 observations were predicted inaccurately. Thus, our post-hoc odds of accurately predicting participants’ response in each warmup trial was 4.94.

Models

Models of the warmup observations.

# Priors
priors <- c(prior(normal(0, 1.00), Intercept),
            prior(normal(0, 0.8), b),
            prior(normal(0, 0.2), sd),
            prior(lkj(2), cor))

# Response family
response_family <- "bernoulli"

# Formula
warmup_formula               <- brmsformula(response_correct ~ trial_subtype +        (trial_subtype | id))
warmup_formula_comprehension <- brmsformula(response_correct ~ trial_subtype + item + (trial_subtype + item | id))

warmup_data_to_be_modeled <- gkcg_data %>% 
  drop_na(response_correct) %>% 
  mutate(trial_subtype = fct_drop(trial_subtype),
         response_correct = fct_relevel(response_correct, "incorrect", "correct"))

mod_warmup <- brm(formula = warmup_formula,
                  data = warmup_data_to_be_modeled,
                  family = response_family,
                  chains = detectCores() - 1,
                  cores = detectCores() - 1,
                  prior = priors,
                  backend = "cmdstanr")

mod_warmup_comprehension <- brm(formula = warmup_formula_comprehension,
                                data = warmup_data_to_be_modeled %>% filter(!is.na(item)),
                                family = response_family,
                                chains = detectCores() - 1,
                                cores = detectCores() - 1,
                                prior = priors,
                                backend = "cmdstanr")

Checks

func_bayesian_p_val <- function(mod, responsevar) {
  family <- mod$family$family                 # Detect family (for bernoulli or poisson -- categorical is computed separately)
  yrep <- posterior_predict(mod)              # Posterior predictive draws
  yobs <- as.numeric(mod$data[[responsevar]]) # Observed data
  if (family == "bernoulli") {                # Adjust for Bernoulli if needed
    yobs <- yobs - 1                          # assumes 1 = success, 0 = failure
  }
  T_obs_mean <- mean(yobs)                    # Compute summary statistics
  T_rep_mean <- apply(yrep, 1, mean)
  pvals = mean(T_rep_mean >= T_obs_mean) # Bayesian p-values
  print(pvals)
}

func_bayesian_p_val(mod_warmup, "response_correct")
func_bayesian_p_val(mod_warmup_comprehension, "response_correct")

# plot(mod_warmup)
# plot(mod_warmup_comprehension)
# pp_check(mod_warmup, type = "dens_overlay_grouped", ndraws = 1000, group = "trial_subtype")
# pp_check(mod_warmup_compr1ehension, type = "dens_overlay_grouped", ndraws = 1000, group = "trial_subtype")
# pp_check(mod_warmup_comprehension, type = "ecdf_overlay_grouped", ndraws = 1000, group = "trial_subtype")

Analysis

Analysis of the posterior models of the warmup phase observations.

# Main model
newdata_warmup <- tibble(trial_subtype = levels(mod_warmup$data$trial_subtype))

epreds_warmup <- local({
  set.seed(10003)
  epreds <- epred_draws(mod_warmup,
                        newdata_warmup,
                        re_formula = NA)
}) %>% 
  ungroup() %>% 
  mutate(trial_subtype = case_when(trial_subtype == "3pp" ~ "Plural",
                                   trial_subtype == "3ps" ~ "Singular",
                                   trial_subtype == "girl_correct" ~ "Girl",
                                   trial_subtype == "boy_correct" ~ "Boy",
                                   T ~ NA_character_),
         trial_subtype = factor(trial_subtype, levels = c("Girl", "Boy", "Plural", "Singular")))

epreds_summarized_warmup <- epreds_warmup %>%
  group_by(trial_subtype) %>% 
  summarize(
    min     = min(.epred),
    hdi_lo  = hdi(.epred)[1],
    mean    = mean(.epred),
    hdi_hi  = hdi(.epred)[2],
    max     = max(.epred),
    prop_gt = mean(.epred > .5),
    prop_gt_char = sprintf("%.0f%%", 100 * prop_gt))

# Comprehension model
newdata_warmup_c <- crossing(trial_subtype = levels(mod_warmup_comprehension$data$trial_subtype),
                             item = levels(mod_warmup_comprehension$data$item))

epreds_warmup_c <- local({
  set.seed(10003)
  epreds <- epred_draws(mod_warmup_comprehension, newdata_warmup_c, re_formula = NA)
}) %>% 
  ungroup() %>% 
  mutate(trial_subtype = case_when(trial_subtype == "3pp" ~ "Plural",
                                   trial_subtype == "3ps" ~ "Singular",
                                   trial_subtype == "girl_correct" ~ "Girl",
                                   trial_subtype == "boy_correct" ~ "Boy",
                                   T ~ NA_character_),
         item = ifelse(item == "demonstrative_pronoun", "Demonstrative", "Personal"),
         trial_subtype = factor(trial_subtype, levels = c("Plural", "Singular", "Boy", "Girl")),
         variable = case_when(trial_subtype == "Plural" & item == "Demonstrative" ~ "Pl. Dem.",
                              trial_subtype == "Plural" & item == "Personal" ~ "Pl. Pers.",
                              trial_subtype == "Singular" & item == "Demonstrative" ~ "Sg. Dem.",
                              trial_subtype == "Singular" & item == "Personal" ~ "Sg. Pers.",
                              TRUE ~ NA_character_))

epreds_summarized_warmup_c <- epreds_warmup_c %>%
  group_by(variable) %>% 
  summarize(
    min     = min(.epred),
    hdi_lo  = hdi(.epred)[1],
    mean    = mean(.epred),
    hdi_hi  = hdi(.epred)[2],
    max     = max(.epred),
    prop_gt = mean(.epred > .5),
    prop_gt_char = sprintf("%.0f%%", 100 * prop_gt))

Posterior Distributions

Posterior distributions of the mean rate of choosing correctly.

fig1b <- ggplot(epreds_warmup, aes(y = .epred, x = trial_subtype)) +
  facet_grid( ~ trial_subtype, scales = "free") +
  stat_halfeye(
    aes(fill = after_stat(y < 0.5)),
    .width = 0.95,
    slab_alpha = 0.8,
    slab_linewidth = 0.5,
    slab_color = "black"
  ) +
  scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"),
                    guide = "none") +
  geom_text(
    data = epreds_summarized_warmup,
    aes(
      y = hdi_hi + .03,
      hjust = -.15,
      label = prop_gt_char
    ),
    size = 4.25,
    family = "Times New Roman",
    fontface = "bold",
    color = "grey35"
  ) +
  geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
  labs(
    y = "Predicted Mean Correct",
    x = NULL
  ) +
  scale_y_continuous(
    labels = function(x)
      sub("^0\\.", ".", sprintf("%.2f", x)),
    limits = c(0.4, 1),
    breaks = c(.5, .75, 1.00)
  ) +
  scale_x_discrete(expand = expansion(mult = c(0.5, NA))) +
  theme_classic(base_size = base_size, base_family = "Times New Roman") +
  theme(
    plot.subtitle = element_text(hjust = .5, size = base_size - 3),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = .5, face = "bold"),
  )

fig1a + fig1b

ggsave("figs/fig1.png", height = 4, width = 9)

fig2b <- ggplot(epreds_warmup_c, aes(y = .epred, x = variable)) +
  facet_grid( ~ variable, scales = "free") +
  stat_halfeye(aes(fill = after_stat(y < 0.5)),
               .width = 0.95,
               slab_alpha = .8,
               slab_linewidth = 0.5,
               slab_color = "black") +
  scale_fill_manual(values = c("TRUE" = "grey85",
                               "FALSE" = "grey35"),
                    guide = "none") +
  geom_text(data = epreds_summarized_warmup_c,
    aes(
      y = hdi_hi + .03, hjust = -.15, label = prop_gt_char
    ), 
    size = 4.25, 
    family = "Times New Roman", 
    fontface = "bold",
    color = "grey35"
  ) +
  geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
  labs(
    y = "Predicted Mean Correct",
    x = NULL
  ) +
  scale_y_continuous(
    labels = function(x)
      sub("^0\\.", ".", sprintf("%.2f", x)),
    limits = c(0.4, 1),
    breaks = c(.5, .75, 1.00)
  ) +
  scale_x_discrete(expand = expansion(mult = c(0.5, NA))) +
  theme_classic(base_size = base_size, base_family = "Times New Roman") +
  theme(
    plot.subtitle = element_text(hjust = .5, size = base_size - 3),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
  ) 

fig2a + fig2b

ggsave("figs/fig2.png", height = 4, width = 9)

Test Phase

Trials Completed

Here is the distribution of the number of test trials completed per trial type. Participant counts are inset.

Data

Predictive Accuracy

# Compute individuals' outcome distribution
individual_dat <- gkcg_data %>% 
  drop_na(condition) %>% 
  group_by(id) %>% 
  count(condition, response) %>% 
  ungroup()

# By participant
individual_dat %>%
  complete(id, condition, response, fill = list(n = 0)) %>%
  group_by(id, condition) %>%
  mutate(
    sum  = sum(n),
    prop = ifelse(sum == 0, 0, n / sum)
  ) %>%
  ungroup() %>% 
  mutate(response_profile = case_when(condition == "generic" & response == "zarpie" & prop > .5 ~ "predicted",
                                      condition == "specific" & response == "flurp" & prop > .5 ~ "predicted",
                                      T ~ "unpredicted")) %>% 
  filter(condition == "generic" & response == "zarpie" |
         condition == "specific" & response == "flurp") %>% 
  count(response_profile) %>% 
  mutate(odds_of_accurate_prediction = n[1]/n[2]) # odds of correctly predicting individual level children's response profiles

# Descriptives
descriptives("test") # compute descriptives for test trials

# figure
fig3a_pos <- position_jitter(width = 0.20, height = 0.01, seed = 10003)
begin_value <- 0.35
end_value <- 0.90

test_results0 <- individual_dat %>%
  group_by(id) %>%
  complete(condition, response, fill = list(n = 0)) %>%
  ungroup() %>%
  group_by(id, condition) %>%
  mutate(sum = sum(n),
         prop = round(n / sum(n), 2),
         condition = str_to_title(condition)) %>%
  ungroup() %>%
  filter(response == "zarpie")

test_results1 <- test_results0 %>%
  pivot_wider(id_cols = id,
              names_from = condition,
              values_from = prop) %>%
  mutate(more_zarpie_responses_in = factor(case_when(
      Generic > Specific ~ "G", # generic condition (shorthand for plotting)
      Generic == Specific ~ "E", # equal between generic and specific conditions (shorthand for plotting)
      Generic < Specific ~ "S", # specific condition (shorthand for plotting)
      T ~ NA),
    levels = c("G", "E", "S")))

df_fig3a0 <- test_results0 %>%
  left_join(test_results1 %>% select(id, more_zarpie_responses_in), by = "id")

df_fig3a00 <- df_fig3a0 %>% 
  left_join(gkcg_data %>% distinct(id, gender) %>% mutate(gender = paste0(gender, "s")), by = "id")

df_fig3a <- rbind(df_fig3a0 %>% mutate(gender = "overall"), df_fig3a00) %>% 
  mutate(gender = str_to_title(gender),
         gender = fct_relevel(gender, "Overall", "Boys", "Girls"),
         condition = case_when(condition == "Generic" ~ "Gen.",
                               condition == "Specific" ~ "Spec.")) %>% 
               drop_na(more_zarpie_responses_in)

fig3a <- ggplot(df_fig3a, 
               aes(x = condition, y = prop)) +
  facet_grid( ~ gender, scales = "free") +
  geom_line(aes(group = id),
                color = "deepskyblue1", #factor(more_zarpie_responses_in)),
            linewidth = 0.5,
            alpha = .2,
            position = fig3a_pos) + 
  geom_jitter(color = "deepskyblue1", #aes(color = factor(more_zarpie_responses_in)),
              size = 1,
              alpha = .4,
              position = fig3a_pos) + 
  stat_summary(
    fun = mean,
    geom = "point",
    size = 5,
    ) +
  theme_classic(base_size = base_size,
                base_family = "Times New Roman") +
  scale_y_continuous(limits = c(-0.05, 1.05),
                     breaks = pretty_breaks(3),
                     labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x))) +
  labs(x = "Condition",
       y = "Proportion Zarpie") +
  theme(plot.subtitle = element_text(size = base_size - 1, hjust = .5),
        panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3),
        legend.position = "none")

df_fig5a0 <- test_results1 %>%
  left_join(gkcg_data %>% distinct(id, gender), by = "id") %>%
  mutate(gender = paste0(gender, "s"))

df_fig5a00 <- rbind(test_results1 %>% mutate(gender = "overall"),
                  df_fig5a0) %>%
  drop_na(more_zarpie_responses_in) %>%
  mutate(gender = str_to_title(gender),
         gender = fct_relevel(gender, "Overall", "Boys", "Girls"))

df_fig5a <- df_fig5a00 %>% 
  group_by(gender, more_zarpie_responses_in) %>% 
  summarize(n = n(),
            .groups = "drop") %>% 
  group_by(gender) %>% 
  mutate(prop = n / sum(n),
         vjust_text = ifelse(n == 5, -0.3, 1.3))

fig5a <- ggplot(df_fig5a,
                aes(x = more_zarpie_responses_in,
                    y = prop)) +
  facet_grid(~ gender, scales = "free") +
  geom_bar(stat = "identity", fill = "deepskyblue1") +
  theme_classic(base_size = base_size, base_family = "Times New Roman") +
  geom_text(aes(label = n, vjust = vjust_text),
            color = "black",
            family = "Times New Roman",
            size = 4.25) +
  labs(
    x = "Response Profile",
    y = "Proportion of Participants") +
  scale_y_continuous(
    labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
    breaks = c(0, 1 / length(unique(df_fig5a$more_zarpie_responses_in))),
  ) +
  theme(
    plot.title = element_text(hjust = .5, face = "bold"),
    legend.position = "none",
    panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3)
  )

Of the 684 test trial observations, 353 were predicted accurately (i.e., participants chose Zarpies in the generic condition and Flurps in the specific condition). In contrast, 331 observations were predicted inaccurately. Thus, our post-hoc odds of accurately predicting participants’ response in each test trial was 1.07. That is, for every 100 times that we incorrectly predicted participants’ behavior, we did so accurately 107 times.)

Models

Models of the test phase observations.

# bernoulli model of group behavior
test_data_to_be_modeled <- gkcg_data %>% 
  drop_na(condition) %>% 
  mutate(response = fct_relevel(response, "flurp", "zarpie"),
         age_num_stand = as.numeric(scale(age_num))) # standardize age

formyouluh <- brmsformula(response ~
    condition * age_num_stand * gender + depicted_gender +
    (condition + depicted_gender | id) + (1 | item))

mod_test <- brm(
  formula = formyouluh,
  data = test_data_to_be_modeled,
  family = response_family,
  chains = detectCores() - 1,
  cores = detectCores() - 1,
  prior = priors,
  backend = "cmdstanr"
)

# categorical model of individuals' behavior
cat_dat <- test_results1 %>%
  drop_na(more_zarpie_responses_in) %>%
  left_join(gkcg_data %>% distinct(id, gender, age_num)) %>%
  mutate(
    cat_response = case_when( # recode for interpretability
      more_zarpie_responses_in == "E" ~ "equal",
      more_zarpie_responses_in == "G" ~ "generic",
      more_zarpie_responses_in == "S" ~ "specific",
      TRUE ~ NA_character_
    ),
    age_num_stand = as.numeric(scale(age_num)
    )
  )

mod_test_cat <- brm(
  formula = cat_response ~ gender * age_num_stand,
  data = cat_dat,
  family = categorical(refcat = "generic"),
  chains = detectCores() - 1,
  cores = detectCores() - 1,
  prior = c(
    prior(normal(0, 1.00), Intercept, dpar = muequal),
    prior(normal(0, 1.00), Intercept, dpar = muspecific),
    prior(normal(0, .8), b, dpar = muequal),
    prior(normal(0, .8), b, dpar = muspecific)
  ),
  backend = "cmdstanr"
)

Checks

# graphical checks
# plot(mod_test)
# plot(mod_test_cat)

# pp_check(mod_test, ndraws = 1000, "dens_overlay_grouped", group = "condition")
# pp_check(mod_test, ndraws = 1000, "dens_overlay_grouped", group = "gender")
# pp_check(mod_test, ndraws = 1000, "ecdf_overlay_grouped", group = "gender")
# pp_check(mod_test_cat, ndraws = 1000, "dens_overlay_grouped", group = "gender")
# pp_check(mod_test_cat, ndraws = 1000, "ecdf_overlay_grouped", group = "gender")

# p val of bernoulli model (mean)
func_bayesian_p_val(mod_test, responsevar = "response")

# p val of categorical model (category means)
yrep <- posterior_predict(mod_test_cat)
yrep_props <- apply(yrep, # matrix to vectorize function over
                    1, # rowwise vectorization
                    function(x) as.data.frame(prop.table(table(factor(x)))) %>% 
                      mutate(Var1 = case_when(Var1 == 1 ~ attr(yrep, "levels")[1], # get the internal level assigned to 1
                                              Var1 == 2 ~ attr(yrep, "levels")[2], # same for 2
                                              Var1 == 3 ~ attr(yrep, "levels")[3], # same for 3
                                              T ~ NA))) 

yrep_props_tibble0 <- bind_rows(yrep_props, .id = "draw") %>% # convert list to tibble
  as_tibble() %>% 
  rename(response = Var1, # rename vars
         predicted_proportion = Freq)

yrep_props_tibble0 %>% # sometimes, posterior_predict will generate zero counts for a category within a simulated dataset. this quantifies the number of times that that happened.
  count(response) %>% 
  mutate(zero_predicted = dim(yrep)[1] - n) # number of simulated datasets without that category represented

yrep_props_tibble <- yrep_props_tibble0 %>% # account for the zeros, above
  complete(draw, response, fill = list(predicted_proportion = 0)) %>% # irrelevant for gkcg because there were zero simulated datasets with zero predicted observations of a category, but retaining for future robust reuse
  left_join(
    tibble(response = mod_test_cat$data[["cat_response"]]) %>% # get observed data
      count(response, name = "observed_n") %>% # count up the observed responses by category
      mutate(observed_proportion = observed_n / sum(observed_n))) %>% # compute observed proportion
  mutate(prediction_exceeds_observation = ifelse(predicted_proportion > observed_proportion, 1, 0)) # compute whether predicted exceeds observed

mod_test_pvals <- yrep_props_tibble %>% group_by(response) %>% summarize(bayes_p_val = mean(prediction_exceeds_observation)) # compute category-wise bayesian p-values
print(mod_test_pvals)

Analysis

Analysis of the posterior models of the test phase observations.

# epreds
n_bins <- 30 # number of bins for prediction grid

newdata <- crossing(
  condition = unique(levels(mod_test$data$condition)),
  depicted_gender = unique(levels(mod_test$data$depicted_gender)),
  gender = unique(levels(mod_test$data$gender)),
  age_num_stand = seq(from = min(mod_test$data$age_num_stand),
                         to = max(mod_test$data$age_num_stand),
                         length.out = n_bins))

epreds_test <- local(
  {
    set.seed(10003)
    epred_draws(mod_test, newdata, re_formula = NA)
  }
) %>%
  ungroup() %>% 
  mutate(age_num = age_num_stand * sd(gkcg_data$age_num) + mean(gkcg_data$age_num))


func_summarize <- function(df, quantity, prop_gt_val) {
  df %>%
    summarize(
      min     = min({{quantity}}),
      hdi_lo  = hdi({{quantity}})[1],
      mean    = mean({{quantity}}),
      hdi_hi  = hdi({{quantity}})[2],
      max     = max({{quantity}}),
      prop_gt = mean({{quantity}} > prop_gt_val),
      prop_gt_char = sprintf("%.0f%%", 100 * prop_gt)
      ) %>% 
    mutate(across(where(is.double), ~round(., 3))) %>% 
    ungroup()
  }

func_epreds_marginal <- function(variable){

  # Capture variable name as string
  var_name <- rlang::as_name(rlang::enquo(variable))

  # Marginalize predictions
  epreds_marginalized <- epreds_test %>%
    group_by(.draw, {{variable}}) %>%
    summarize(mean_epred = mean(.epred), .groups = "drop") %>%
    mutate(
      !!var_name := str_to_title(as.character({{variable}}))  # dynamic rename
    )

  # Summarize marginal predictions
  epreds_marginalized_summary <- epreds_marginalized %>%
    group_by(!!sym(var_name)) %>%  # use dynamic column
    func_summarize(mean_epred, 0.5)

  # Compute contrasts
  epreds_marginalized_contrasts <- epreds_marginalized %>%
    compare_levels(mean_epred, by = var_name) %>%  # dynamic column name
    ungroup()

  epreds_marginalized_contrasts_summary <- epreds_marginalized_contrasts %>%
    func_summarize(mean_epred, 0)

  # Return list
  return(list(
    draws = epreds_marginalized,
    draws_summary = epreds_marginalized_summary,
    contrasts = epreds_marginalized_contrasts,
    contrasts_summary = epreds_marginalized_contrasts_summary
  ))
}

# marginal gender results
results_gender <- func_epreds_marginal(gender) # both above chance, no effect

# marginal depicted gender results
results_depgender <- func_epreds_marginal(depicted_gender) # both above chance, no effect

# marginal condition results
results_condition <- func_epreds_marginal(condition) # only generic above chance, trendng towards predicted effect

# marginal age results
epreds_slope <- epreds_test %>%
  group_by(.draw) %>%
  summarize(age_slope = cov(age_num, .epred)/var(age_num),
            .groups = "drop")
epreds_slope %>% # no marginal age effect (trending towards increasing zarpies w/ age)
  func_summarize(age_slope, 0) %>%
  ungroup()

# interaction (condition * gender)
epreds_intxn <- epreds_test %>% 
  group_by(.draw, condition, gender) %>% 
  summarize(mean_epred = mean(.epred),
            .groups = "drop")

epreds_intxn_summary <- epreds_intxn %>% # marginal means of condition * gender cells
  group_by(gender, condition) %>% 
  func_summarize(mean_epred, .5) # boys do the thing, girls do the opposite!!!! wow!

epreds_intxn_contrasts <- epreds_intxn %>% 
  group_by(gender) %>% 
  compare_levels(mean_epred, by = "condition") %>% 
  ungroup() %>% 
  rename(diff_mean_epred = mean_epred)

epreds_intxn_contrasts %>% 
  group_by(gender) %>% 
  func_summarize(diff_mean_epred, 0) # strong evidence for condition effect in boys, weaker but still notable evidence for the OPPOSITE effect in girls

epreds_intxn_contrasts %>% # marginal distribution over the interaction of gender and condition
  compare_levels(diff_mean_epred, by = "gender") %>% 
  ungroup() %>% 
  rename(diff_diff_mean_epred = diff_mean_epred) %>% 
  func_summarize(diff_diff_mean_epred, 0)

# interaction (condition * age -- plotted next, below)
epreds_slope_condition <- epreds_test %>%
  group_by(.draw, condition) %>%
  summarize(age_slope = cov(age_num, .epred)/var(age_num),
            .groups = "drop")
epreds_slope_condition %>% # participants more likely w/ age to choose zarpies in zarpie condition, some evidence that they become less likely to do that in the specific condition
  group_by(condition) %>% 
  func_summarize(age_slope, 0) %>%
  ungroup()

# interaction (condition * age -- plotted next, below)
epreds_slope_condition_gender <- epreds_test %>%
  group_by(.draw, condition, gender) %>%
  summarize(age_slope = cov(age_num, .epred)/var(age_num),
            .groups = "drop")
epreds_slope_condition_gender %>% # participants more likely w/ age to choose zarpies in zarpie condition, some evidence that they become less likely to do that in the specific condition
  group_by(gender, condition) %>% 
  func_summarize(age_slope, 0) %>%
  ungroup()

# plot of condition * age interaction
epreds_age <- epreds_test %>% 
  group_by(.draw, age_num, condition) %>% 
  summarize(mean_epred = mean(.epred),
            .groups = "drop")

epreds_age_marginal_overall <- epreds_age %>% 
  group_by(age_num, condition) %>% 
  summarize(ci_lo = quantile(mean_epred, .025),
            mean   = mean(mean_epred),
            ci_hi = quantile(mean_epred, .975),
            .groups = "drop")

epreds_age_gen <- epreds_test %>% 
  group_by(.draw, age_num, condition, gender) %>% 
  summarize(mean_epred = mean(.epred),
            .groups = "drop")

epreds_age_marginal_gen <- epreds_age_gen %>% 
  group_by(age_num, condition, gender) %>% 
  summarize(ci_lo = quantile(mean_epred, .025),
            mean   = mean(mean_epred),
            ci_hi = quantile(mean_epred, .975),
            .groups = "drop")

Posterior Distributions

df_fig3b_draws <- rbind(epreds_intxn %>% mutate(gender = paste0(gender, "s")),
                        results_condition$draws %>% mutate(gender = "overall"))

df_fig3b_summary <- rbind(epreds_intxn_summary %>% mutate(gender = paste0(gender, "s")),
                          results_condition$draws_summary %>% mutate(gender = "overall"))

list_dfs_fig3 <- list(draws = df_fig3b_draws, summary = df_fig3b_summary)

df_fig3b <- map(list_dfs_fig3, ~ .x %>%
                  mutate(
                    gender = factor(str_to_title(gender), levels = c("Overall", "Boys", "Girls")),
                    condition = str_to_lower(condition),
                    condition = case_when(
                      condition == "generic" ~ "Gen.",
                      condition == "specific" ~ "Spec.",
                      TRUE ~ NA_character_
                    )
                  ))

fig3b <- ggplot(df_fig3b$draws,
                aes(x = condition, y = mean_epred)) +
  stat_halfeye(aes(fill = after_stat(y < 0.5)), .width = 0.95, slab_alpha = 0.8, slab_linewidth = 0.5, slab_color = "black") +
  scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"), guide = "none") +
  geom_text(data = df_fig3b$summary,
            aes(x = condition, y = hdi_hi + .005, label = prop_gt_char),
            nudge_x = .5,
            size = 4.5,
            vjust = -.6,
            family = "Times New Roman",
            fontface = "bold",
            color = "grey35") +
  geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
  facet_grid( ~ gender) +
  labs(x = "Condition",
       y = "Predicted Mean Zarpies") +
  scale_y_continuous(
    labels = function(x) sub("^0+", "", format(x, trim = TRUE)), # trim the leading zero
    breaks = c(.5, .75)
  ) +
  theme_classic(base_size = base_size, base_family = "Times New Roman") 

fig3a + fig3b

ggsave("figs/fig3.png", height = 4, width = 9)

epreds_age_marginal <- rbind(epreds_age_marginal_overall %>% mutate(gender = "overall"),
                             epreds_age_marginal_gen %>% mutate(gender = paste0(gender, "s"))) %>% 
  mutate(gender = factor(str_to_title(gender), levels = c("Overall", "Boys", "Girls")))

ggplot(epreds_age_marginal,
       aes(x = age_num,
           y = mean,
           fill = condition,
           color = condition)) +
  facet_grid(~ gender) +
  geom_ribbon(aes(ymin = ci_lo,
                  ymax = ci_hi),
              alpha = 0.4, color = NA) +
  geom_line(linewidth = 1.5) +
  theme_classic(base_size = 17,
                base_family = "Times New Roman") +
  labs(x = "Age",
       y = "Predicted Mean Zarpies",
       color = "Condition", fill = "Condition") +
  scale_color_manual(values = c("green4", "deepskyblue4"),
                     labels = c("Generic", "Specific")) +
  scale_fill_manual(values = c("green3", "deepskyblue3"),
                    labels = c("Generic", "Specific")) +
  geom_hline(yintercept = .5, linetype = 2) +
  scale_y_continuous(labels = function(x) sub("^0+", "", format(x, trim = TRUE)),  # remove leading zeros
                     breaks = c(.3, .5, .7)) +
  scale_x_continuous(breaks = c(4, 5))

ggsave("figs/fig4.png", height = 3, width = 6)

# categorical model
newdata_test_cat <- crossing(gender = levels(mod_test_cat$data$gender),
                        age_num_stand = seq(from = min(mod_test_cat$data$age_num_stand),
                                            to = max(mod_test_cat$data$age_num_stand),
                                            length.out = n_bins)) # use crossing, not tibble, for intercepts-only predictions (epred_draws requires 1 row)

epreds_test_cat <- local({
  set.seed(10003)
  epreds <- epred_draws(mod_test_cat, newdata_test_cat)
}) %>% 
  ungroup() 

epreds_cat_draws_overall <- epreds_test_cat %>% 
  group_by(.draw, .category) %>% 
  summarize(mean_epred = mean(.epred),
            .groups = "drop")

epreds_cat_draws <- epreds_test_cat %>% 
  group_by(.draw, gender, .category) %>% 
  summarize(mean_epred = mean(.epred),
            .groups = "drop")

epreds_cat_draws_overall_summary <- epreds_cat_draws_overall %>% # boys clearly only have generic profiles above chance; girls slight tendency for specific profiles but not as pronounced as boys
  group_by(.category) %>% 
  func_summarize(mean_epred, 1/n_distinct(epreds_cat_draws$.category)) # chance = 1 / number of response categories

epreds_cat_draws_summary <- epreds_cat_draws %>% # boys clearly only have generic profiles above chance; girls slight tendency for specific profiles but not as pronounced as boys
  group_by(gender, .category) %>% 
  func_summarize(mean_epred, 1/n_distinct(epreds_cat_draws$.category)) # chance = 1 / number of response categories

epreds_cat_contrasts <- epreds_cat_draws %>% 
  group_by(gender) %>% 
  compare_levels(mean_epred, by = ".category") %>% 
  ungroup() %>% 
  rename(diff_mean_epred = mean_epred)

epreds_cat_contrasts_summary <- epreds_cat_contrasts %>% # boys above chance for generic response profiles; girls not
  group_by(gender, .category) %>% 
  func_summarize(diff_mean_epred, 0) # boys more likely to have generic profiles than specific or equal profiles; girls show no clear pattern

fig5b0 <- rbind(epreds_cat_draws_overall %>% mutate(gender = "overall"), 
                epreds_cat_draws %>% mutate(gender = paste0(gender, "s")))

fig5b00 <- rbind(epreds_cat_draws_overall_summary %>% mutate(gender = "overall"), 
                epreds_cat_draws_summary %>% mutate(gender = paste0(gender, "s")))

fig5b_list <- list(draws = fig5b0, summary = fig5b00)

fig5b_data <- map(fig5b_list, ~ .x %>% 
               mutate(across(c(.category, gender), ~ str_to_title(.)),
                      .category = case_when(.category == "Equal" ~ "E",
                                            .category == "Generic" ~ "G",
                                            .category == "Specific" ~ "S"),
                      .category = factor(.category, levels = c("G", "E", "S")),
                      gender = factor(gender, levels = c("Overall", "Boys", "Girls"))))

fig5b <- ggplot(fig5b_data$draws,
                aes(x = .category, y = mean_epred)) +
  stat_halfeye(aes(fill = after_stat(y < (1 / length(unique(fig5b_data$summary$.category))))), 
               .width = 0.95, 
               slab_alpha = 0.8, 
               slab_linewidth = 0.5, 
               slab_color = "black") +
  scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"), guide = "none") +
  geom_text(data = fig5b_data$summary,
            aes(x = .category, y = hdi_hi + .005, label = prop_gt_char),
            nudge_x = .5,
            size = 3.5,
            vjust = -.6,
            family = "Times New Roman",
            color = "grey35",
            fontface = "bold") +
  geom_hline(yintercept = 1 / length(unique(fig5b_data$summary$.category)), linewidth = .5, linetype = 2) +
  facet_grid( ~ gender) +
  labs(x = "Response Profile",
       y = "Predicted Rate") +
  scale_y_continuous(
    labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
    breaks = c(1 / length(unique(fig5b_data$draws$.category)),
               2 * 1 / length(unique(fig5b_data$draws$.category)))
  ) +
  theme_classic(base_size = base_size, base_family = "Times New Roman")

fig5a + fig5b

ggsave("figs/fig5.png", height = 4, width = 10)

The participant-level results, below (left), suggest that participants chose Zarpies more often in the generic condition than in the specific condition at rates above chance. In contrast, participants chose Zarpies more often in the specific condition, and at equal rates in both conditions, at rates below chance.

The results on the right suggest that participants were more likely to choose Zarpies more often in the generic condition than in the specific condition than the opposite (i.e., generic > specific vs specific > generic; top distribution) and compared to choosing Zarpies at equal rates in both condition (i.e., generic > equal; bottom distribution). Moreover, participants were similarly likely to choose Zarpies at rates similar in both condition compared to choosing Zarpies at higher rates in the specific than in the generic condition (i.e., equal vs. specific; middle distribution).

While all of these directional effects follow our predictions, the evidence for them ought to be stronger.

Control Phase

Here’s the control phase data.

df_fig6 <- descriptives("control") %>%
  mutate(
    trial_subtype = factor(case_when(trial_subtype == "conjugation_matching" ~ "Conjugation",
                                     trial_subtype == "number_or_phrasal_matching" ~ "Number or Phrase",
                                     T ~ NA_character_)),
    response = str_to_title(response),
    vjust_text = ifelse(n == 5, -0.3, 1.3)) 

fig6a <- ggplot(df_fig6, aes(x = response, y = prop)) +
  facet_grid(~ trial_subtype, scales = "free") +
  geom_bar(stat = "identity", fill = "deepskyblue") +
  scale_y_continuous(labels = function(x) sub("^0\\.", ".", sprintf("%.1f", x)),
                     breaks = pretty_breaks(2)) +
    geom_text(aes(label = n, vjust = vjust_text),
            color = "black",
            family = "Times New Roman",
            size = 5) +
  labs(x = "Response",
       y = "Proportion of Participants") +
  theme_classic(base_size = base_size, base_family = "Times New Roman") +
  theme(panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3))

Control phase model.

control_data0 <- gkcg_data %>% 
    filter(trial_type == "control") # Include only this trial type

control_data0_age <- control_data0 %>% 
  distinct(id, age_num) %>% mutate(age_num_stand = as.numeric(scale(age_num)))

control_data <- control_data0 %>% 
  left_join(control_data0_age)

mod_control <- brm(
  formula = response ~ trial_subtype * gender * age_num_stand + depicted_gender +
    (trial_subtype + depicted_gender | id),
  data = control_data,
  family = "bernoulli",
  chains = detectCores() - 1,
  cores = detectCores() - 1,
  prior = priors,
  backend = "cmdstanr"
)

Checks

# plot(mod_control)
# pp_check(mod_control, ndraws = 1000)
func_bayesian_p_val(mod_control, "response")

Analysis

newdata_control <- crossing(trial_subtype = levels(mod_control$data$trial_subtype),
                            gender = levels(mod_control$data$gender),
                            depicted_gender = levels(mod_control$data$depicted_gender),
                            age_num_stand = seq(from = min(mod_control$data$age_num_stand),
                                                to = max(mod_control$data$age_num_stand),
                                                length.out = n_bins))

epreds_control <- local({
  set.seed(10003)
  epreds <- epred_draws(mod_control,
                        newdata_control,
                        re_formula = NA)
}) %>% 
  ungroup()

epreds_control_draws <- epreds_control %>% 
  group_by(.draw, trial_subtype) %>% 
  summarize(mean_epred = mean(.epred))

epreds_control_draws_summary <- epreds_control_draws %>% 
  group_by(trial_subtype) %>% 
  func_summarize(mean_epred, .5)

list_fig6b <- list(draws = epreds_control_draws, summary = epreds_control_draws_summary)

df_fig6b <- map(list_fig6b, ~ .x %>%
  mutate(
    trial_subtype = trial_subtype %>%
      str_replace_all("_", " ") %>%         # replace underscores with spaces
      str_to_title() %>%                    # convert to title case
      str_replace("al", "e") %>%            # replace 'al' with 'e'
      str_remove("Matching") %>%            # remove 'Matching'
      str_replace("Or", "or")               # replace 'Or' with 'or'
  )
)

fig6b <- ggplot(df_fig6b$draws, aes(y = mean_epred, x = trial_subtype)) +
  facet_grid( ~ trial_subtype, scales = "free") +
  stat_halfeye(
    aes(fill = after_stat(y < 0.5)),
    .width = 0.95,
    slab_alpha = 0.8,
    slab_linewidth = 0.5,
    slab_color = "black"
  ) +
  scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"),
                    guide = "none") +
  geom_text(
    data = df_fig6b$summary,
    aes(
      y = hdi_hi + .03,
      hjust = -.5,
      label = prop_gt_char
    ),
    size = 5,
    family = "Times New Roman",
    fontface = "bold",
    color = "grey35"
  ) +
  geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
  labs(
    y = "Predicted Mean Zarpies",
    x = NULL
  ) +
  scale_y_continuous(
    labels = function(x)
      sub("^0\\.", ".", sprintf("%.2f", x)),
    breaks = c(.5, .7)
  ) +
  scale_x_discrete(expand = expansion(mult = c(0.5, NA))) +
  theme_classic(base_size = base_size, base_family = "Times New Roman") +
  theme(
    plot.subtitle = element_text(hjust = .5, size = base_size - 3),
    axis.text.x = element_blank(),
    axis.ticks.x = element_blank(),
    plot.title = element_text(hjust = .5, face = "bold"),
  )

fig6a + fig6b

ggsave("figs/fig6.png", height = 4, width = 9)

Second Component

Now that we have estimates from our pilot data, let’s use them to inform the simulated population in a sample-size planning analysis.

NB: The planning analysis, below, assumes a simpler generative structure than what our exploratory analysis suggested, although the exploratory model structure is retained. In particular, the planning analysis below ignores the gender-condition interaction that was revealed by our exploratory analysis. I am working to integrate this interaction into the simulated population.

The Generative Process

First, we define the generative process, including its design and effects. Then, we simulate the generative process.

Design

We set the design parameters of the generative process. The function of each parameter is commented.

# Population size
n_population <- 15000

# Trial count from pilot data
n_trials <- max(gkcg_data$trial)

# Item count from pilot data (one unique item per trial)
items <- n_trials

# Conditions from pilot data
condition <- unique(gkcg_data %>%
                       filter(trial_type == "test") %>%
                       pull(trial_subtype)) %>%
  droplevels()

# Lower bound of population age
age_lower_bound <- min(gkcg_data$age_num)

# Upper bound of population age
age_upper_bound <- max(gkcg_data$age_num)

# Randomly sample ages
ages <- runif(n_population, age_lower_bound, age_upper_bound)

# Depicted gender from pilot data
depicted_gender <- set_names( # deterministically tie each depicted gender to an item, like in the experiment (otherwise they would be orthogonal to the items, which was not the experimental design)
  rep(
    unique(
      na.omit(
        gkcg_data$depicted_gender
        )
      ),
    3),
  seq_len(n_trials))

# Population genders
genders <- unique(gkcg_data$gender)

Effects

Then, we set the effect sizes of the generative process. These quantities are very important. They are the true effects. When we sample the population, we don’t know these effects. So, we have to infer them. Depending on the size of the effects, their number, grouping structure, and so on, we may need larger sample sizes to be adequately confident that we will find adequate evidence for our predictions. The effect sizes are on the logit scale.

# Observed trial exclusion rate from pilot data
trial_exclusion_rate <- gkcg_data %>% 
  summarise(n_ids = n_distinct(id),
            test_n = sum(trial_type == "test")) %>%
  transmute(exclusion_rate = (n_trials * n_ids - test_n) / test_n) %>%
  pull()

# Estimated Zarpie probability in pilot sample generic condition (logit scale for simulation)
logit_baseline <- qlogis(results_condition$draws_summary %>% # qlogis for prob -> logistic
                           filter(condition == "Generic") %>%
                           pull(mean))

# Estimated condition effect (Estimated change in probability of choosing Zarpies in the specific condition)
logit_effect <- qlogis(results_condition$draws_summary %>% # qlogis for prob -> logistic
                           filter(condition == "Specific") %>%
                           pull(mean)) - logit_baseline

# Estimated Zarpie probability in pilot sample boys
# logit_baseline_gender <- qlogis(results_gender$epreds_marginalized_summary %>% 
#                            filter(gender == "Boy") %>%
#                            pull(mean))

# Estimated gender effect
# logit_effect_gender <- qlogis(results_gender$epreds_marginalized_summary %>% 
#                            filter(gender == "Girl") %>%
#                            pull(mean)) - logit_baseline_gender

# Estimated Zarpie probability in pilot sample depicted boys
# logit_baseline_depgender <- qlogis(results_depgender$epreds_marginalized_summary %>% 
#                            filter(depicted_gender == "Depicted_boy") %>%
#                            pull(mean))

# Estimated depicted gender effect
# logit_effect_depgender <- qlogis(results_depgender$epreds_marginalized_summary %>% 
#                            filter(depicted_gender == "Depicted_girl") %>%
#                            pull(mean)) - logit_baseline_depgender

# Estimated age effect
# Compute slope on the logit scale
# epreds_slope_logit <- epreds %>%
#   group_by(.draw) %>%
#   summarise(
#     age_slope_logit = cov(age_standardized, qlogis(.epred)) / var(age_standardized),
#     .groups = "drop"
#   )

# results_age_logit <- epreds_slope_logit %>%
#   func_summarize(age_slope_logit, 0) %>%
#   ungroup()

# logit_age <- results_age_logit %>% pull(mean)

# SD of participant random intercept
logit_raneff_id_sd <- as_draws_df(mod_test) %>% # NOT using qlogis (cf above) because this is already in logits
  select(sd_id__Intercept) %>% 
  summarize(mean = mean(sd_id__Intercept)) %>% 
  pull()

# SD of condition random slope
logit_raneff_id_condition <- as_draws_df(mod_test) %>% # NOT using qlogis (cf above) because this is already in logits
  select(sd_id__conditionspecific) %>% 
  summarize(mean = mean(sd_id__conditionspecific)) %>% 
  pull()

# SD of item random intercept
logit_raneff_item_sd <- as_draws_df(mod_test) %>% # NOT using qlogis (cf above) because this is already in logits
  select(sd_item__Intercept) %>% 
  summarize(mean = mean(sd_item__Intercept)) %>% 
  pull()

Simulation

Next, we simulate the generative process. This simulates the population from which we will subsequently draw random samples.

# Simulate population of seq_len(n_population) * seq_len(n_trials) observations from seq_len(n_population) children
population_logistic_raw <- crossing(id = seq_len(n_population), # ID
                                    trial = seq_len(n_trials)) %>% # Trial
  group_by(id) %>% # Group
  mutate(condition = factor(rep(condition, length.out = n_trials), # Condition
                            levels = unique(condition)),
         age_num = ages[id], # Age
         item = sample(items), # Items
         gender = genders[id]) %>% # Gender
  ungroup() %>%
  mutate(
    depicted_gender = depicted_gender[item], # Depicted gender tied to specific items (per study design)
    gender = rep(genders, length.out = n_population)[id], # Gender
    logit_raneff_id = rnorm(n_population, 0, logit_raneff_id_sd)[id], # ID random intercept
    logit_raneff_item = rnorm(items, 0, logit_raneff_item_sd)[item], # Item random intercept
    logit_raneff_id_condition = rnorm(n_population, 0, logit_raneff_id_condition)[id], # condition random slope
    # logit_gender = case_when(gender == "boy" ~ logit_baseline_gender, # Logit (boy)
    #                          gender == "girl" ~ logit_baseline_gender + logit_effect_gender, #  Logit (girl)
    #                          T ~ NA),
    # logit_depgender = case_when(depicted_gender == "depicted_boy" ~ logit_baseline_depgender, # Logit (d. boy)
    #                          depicted_gender == "depicted_girl" ~ logit_baseline_depgender + logit_effect_depgender, #  Logit (d. girl)
    #                          T ~ NA),
    logit_condition = case_when(condition == "generic" ~ logit_baseline, # Logit (generic)
                                condition == "specific" ~ logit_baseline + logit_effect, #  Logit (specific)
                                T ~ NA),
    logit_condition_and_raneff = logit_condition + logit_raneff_id_condition * (condition == "specific"),
    # logit_age = logit_age * as.numeric(scale(age_num)),
    logit = # Response logit
      logit_condition_and_raneff +
      # logit_age +
      # logit_gender +
      # logit_depgender +
      logit_raneff_id +
      logit_raneff_item,
    prob = plogis(logit), # Response probability
    response = rbinom(n(), 1, prob), # Simulate responses
    include = rbinom(n(), 1, 1 - trial_exclusion_rate)) # Simulate exclusions

sum(is.na(population_logistic_raw)) # Count NAs (should equal zero)

Remove Generative Parameters

Next, we remove generative parameters from our dataset. That is, we create a dataset that will contain all and only the variables that we will have in our empirical dataset gathered by our experiment.

vars <- all.vars( as.formula( as.character(formyouluh)[1] ) )

population_logistic <- population_logistic_raw %>% # Simulated population with unknown parameters
  select(setdiff(vars, "age_num_stand"), # Columns to retain
         age_num,
         include)

setdiff(names(population_logistic_raw), names(population_logistic)) # Check which parameters have been removed

Samples

Now that we have simulated a population and created a dataframe with unknown parameters, we randomly sample from that population many times over. There is some parallel computation going on here that will enable your machine to draw several random samples simultaneously. This always saves time on computation, but the time savings will become noticeable only when drawing thousands of samples.

Parameters of the Random Sampling Process

First, we set the parameters for our random samples. This includes parameters for the number of random samples per sample size, the minimum and maximum sample sizes, and the “sample size delta.”

The latter, delta parameter controls the step size between samples. For example, if the sample size delta is set to 20 and the minimum and maximum sizes are 20 and 100, respectively, then the algorithm will produce samples of size 20, 40, 60, 80, and 100.

Be conscientious of the values that you set for your these parameters. The stakes aren’t high, but you can mistakenly produce samples of unexpected size. For instance, a delta of 20 with minimum and maximum sample sizes of 10 and 100, respectively, will produce samples of size 10, 30, 50, 70, and 90 (not 100).

# Population sampling parameters
n_samples_per_sample_size <- 10 # Number of random samples per sample size
min_sample_size <- 50 # Minimum sample size
max_sample_size <- 100 # Maximum sample size
granularity <- 10 # Sample size delta

Check Samples to Be Collected

Now that we have set the parameters of our random samples, lets check the unique samples, sample sizes, and number of sample sizes that we will collect.

samples <- rep( # unique samples to be collected
  seq(min_sample_size, # Minimum sample size
      max_sample_size, # Maximum sample size
      by = granularity), # Sample size delta
  each = n_samples_per_sample_size) # Number of samples per sample size

length(samples) # sample sizes that we will collect

unique(samples) # sample sizes that we will collect

n_distinct(samples) # number of sample sizes that we will collect

Parallelization

Next, we parallelize our computation. The following code tells R to utilize your machine’s n - 1 cores. The number of cores is the number of independent processes (minus 1) that your machine can perform simultaneously. For instance, my machine has 8 cores. Thus, the code tells my machine to use 7 cores to draw random samples. For a fixed number of samples, this speeds up the sampling process 7-fold. So that I don’t melt my machine, I usually leave a spare core for other computations, like watching Youtube videos or texting my mom.

plan(multisession, workers = detectCores() - 1) # Parallelization
print(plan(multisession)) # Print parallelization plan

Draw random samples

We have set the parameters of our random sampling algorithm and checked the samples that we will collect from the generative process. We are ready to draw our samples. If you have set a large number of samples to be collected (e.g., thousands), this may take several seconds to several minutes. Your machine’s hardware (e.g., cores) will also play a role in determining the algorithm’s compute time. The following code will produce a list of samples with length equal to {r length(samples)}. That is, each element of the list will be one random sample from your simulated population. Each random sample will contain all and only the variables that you plan to use in your model.

# Draw random samples from our simulated population
list_samples <- local(
  
  { # Do all of the following for each sample
    
  start_time <- Sys.time() # Record start time
  
  list_samples <- future_map(samples, # Sample sampler
                             
                               ~ { # Do
                                 
    # Sample participants
    sampled <- population_logistic %>%
      filter(id %in% sample(unique(id), size = .x, replace = FALSE)) %>%
      ungroup() %>%
      filter(include == 1) %>%
      select(-include)
    
    # Compute participant-level age mean and SD
    age_stats <- sampled %>%
      distinct(id, age_num) %>% # one row per participant
      mutate(age_num_stand = as.numeric(scale(age_num))) %>% # standardize age
      select(id, age_num_stand)
    
    # Standardize age per participant
    sampled %>%
      left_join(age_stats, by = "id")
    
    },
    
  .options = furrr_options( # Import from global environment
    globals = list(population_logistic = population_logistic),
    packages = c("tidyverse"),
    seed = seed),
  .progress = TRUE) # Report progress

    end_time <- Sys.time() # Record end time
  
  print(end_time - start_time) # Print runtime
  
  return(list_samples) # Return list of random samples
}
)

Check Samples

Now that you have generated your random samples, I suggest a cursory check of the object and the structure of each sample, just to ensure that everything checks out aight.

length(list_samples) # Check the number of samples drawn (will equal length(samples))

list_samples[[1]] # Check random sample [[1]]
test_data_to_be_modeled

# The set of pilot variables properly includes the set of simulated variables
setdiff(names(list_samples[[1]]), names(test_data_to_be_modeled)) # Should be an empty vector because all vars in simulated data contained in pilot data
setdiff(names(test_data_to_be_modeled), names(list_samples[[1]])) # Pilot data contains X vars not in simulated data

Iterative Modeling

Now, we iterate a modeling function over each simulated dataset. In the attached code, the name of the modeling function to be iterated is “func_brm.” Running this function on a dataset produces a joint posterior distribution, i.e., one posterior model of that dataset. We go further by constructing a conditional if-then statement, called “func_model_fitting,” that says: for each iteration of func_brm, if the posterior model has more than one divergence, has any r-hat equal to 1.05 or greater, or (inclusive) includes a warning about effective sample size, then refit the model according to the correspondingly modified sampler call (see code); otherwise, return the posterior model. So, technically, this conditional if-then statement is iterated over, once per sample. The output is a list of posterior models, each fitted to one of our simulated datasets.

NB: “func_model_fitting” iterates only once per simulated dataset, so, in principle, a model that is flagged for modified resampling of its posterior may possess the same (or another) pathology in its refitted format. I am working to fix this by iterating the algorithm until the pathology is removed, so that all models are satisfactory. Thus far, I have found this issue to be more technical than practical, as typically zero, or nearly zero, models have any issues whatsoever.

# Model sampler parameters (in effect, "tune" the iterative algo)
nuts_delta <- .85 # Adapt delta
n_iter_val <- 1200 # Iterations per chain

func_brm <- function(formula = formyouluh,
                     datty = .x, # Data
                     fam = response_family, # Response family
                     n_chains = 2, # Sampler chains
                     n_iter = n_iter_val, # Posterior sampler per chain
                     control_specs = list(adapt_delta = nuts_delta), # Control specs
                     priors_in_func = priors){
  # Model
  brm(formula = formula,
      data = datty,
      family = fam,
      chains = n_chains,
      cores = n_chains,
      iter = n_iter,
      control = control_specs,
      prior = priors_in_func,
      backend = "cmdstanr",
      seed = seed)
}

func_model_fitting <- function(.x, n_chains){
  fit <- func_brm(datty = .x) # Fit base model
  if (sum(subset(nuts_params(fit), Parameter == "divergent__")$Value) > 1) { # Check divergences
    fit <- func_brm(datty = .x, # Refit divergent models by decreasing sampler step size
                    control_specs = list(adapt_delta = nuts_delta + .10))}

  if (any(map_dbl(rhat(fit), ~ .x) > 1.05)) { # Check chain mixing
    fit <- func_brm(datty = .x, # Refit unmixed models by increasing iterations
                    control_specs = n_iter_val + (0.5 * n_iter_val))}

  warnings <- capture.output(summary(fit), type = "message")

  if (any(grepl("The ESS has been capped to avoid unstable estimates", warnings))) { # Check stability
    fit <- func_brm(datty = .x, # Refit instable models with more chains, more iterations
                    chains = n_chains + 1,
                    control_specs = n_iter_val + (0.5 * n_iter_val))}

  return(fit) # Return fitted model
  }

n_workers <- 3 # Parallelization (n models)

print(plan(multisession, workers = n_workers)) # Check plan (cores = n_workers * n_chains)

mods_list <- local({
  
  start_time <- Sys.time() # Start time
  
  mods <- future_map(list_samples, ~ func_model_fitting(.x, n_chains), # Model list
                     .options = furrr_options(
                       globals = list(func_brm = func_brm,
                                      func_model_fitting = func_model_fitting,
                                      seed = seed,
                                      nuts_delta = nuts_delta,
                                      n_iter_val = n_iter_val,
                                      formyouluh = formyouluh,
                                      priors = priors,
                                      response_family = response_family),
                       packages = c("brms"),
                       seed = seed),
                     .progress = TRUE)
  
    end_time <- Sys.time() # End time
    
    print(end_time - start_time) # Print runtime
    
    return(mods) # Return model list
    
    }
  )

# prior_summary(mods_list[[1]]) # ensuring that priors were passed correctly to models

Results

Here we iterate the planned analyses over each posterior model. This produces the statistics that we use to inform our decision about sample size. First, we perform a cursory check of the models by checking the number of divergences. The divergences in each model is contained in “divergences.” After that, we analyze our models. The results of our analysis are contained in “results_list” and summarized in “results_summaries.”

# Divergences
divergences <- map_dfr(
  mods_list, # Compute the below for each model in mods_list
  ~ {
    nuts <- nuts_params(.x$fit) # Sampler parameters
    tibble(
      divergences = sum(nuts$Value[nuts$Parameter == "divergent__"]), # Divergences
      postwarm_iter = ndraws(.x), # Post warmup iterations
      prop_divergent = sum(divergences) / sum(postwarm_iter), # Proportion of iterations that were divergences
      sample_size = length(unique(.x$data$id))) # Sample size
    }
  )

sum(divergences %>% pull(divergences)) # N divergences across all models

# Sample the posterior of one model per population sample
results_list <- map(mods_list,
    ~ {

      # Prediction grid
      newdata_planning <- crossing(
        condition = unique(levels(.x$data$condition)),
        depicted_gender = unique(levels(.x$data$depicted_gender)),
        gender = unique(levels(.x$data$gender)),
        age_num_stand = seq(from = min(.x$data$age_num_stand),
                               to = max(.x$data$age_num_stand),
                               length.out = 20))

      # Sample size
      sample_size <- length(unique(.x$data$id))

      # Reproducible subsampling of posterior expected value
      epreds <- local(
        {
          set.seed(10003) # RNG seed

          epred_draws(.x, # draws of the posterior expected value
                      newdata = newdata_planning,
                      re_formula = NA, 
                      ndraws = 500)}) %>%
        ungroup() %>% # Ungroup computations
        mutate(sample_size = sample_size) # Sample size

      # Draws of the posterior expected value (not stored)
      epreds_marginal <- epreds %>%
        group_by(.draw, condition) %>% # Marginalize the posterior expected across gender
        summarize(mean_epred = mean(.epred), # Compute mean marginal posterior expected value
                  .groups = "drop") %>%
        mutate(sample_size = sample_size) # Sample size

      # Summary of the marginalized posterior expected value
      epreds_marginal_summary <- epreds_marginal %>%
        group_by(condition) %>% # Group
        summarize(min = min(mean_epred), # Minimum posterior expected value
                  hdi_lo = hdi(mean_epred)[ncol(hdi(mean_epred)) - 1], # Rightmost lower bound of 95% HDI of posterior expected value
                  mean_mean_epred = mean(mean_epred), # Mean posterior expected value
                  median_mean_epred = median(mean_epred), # Median posterior expected value
                  hdi_hi = hdi(mean_epred)[ncol(hdi(mean_epred))], # Rightmost upper bound of 95% HDI of posterior expected value
                  max = max(mean_epred), # Maximum posterior expected value
                  hdi_diff = hdi_hi - hdi_lo, # Size of the support of the 95% HDI
                  prop_gt = mean(mean_epred > 0.5), # Proportion of posterior draws with values greater than chance
          .groups = "drop") %>% # Ungroup computations
        mutate(sample_size = sample_size) # Sample size

      # Draws of contrasts of the marginalized posterior expected value (not stored)
      epreds_marginal_contrasts <- epreds_marginal %>%
        compare_levels(mean_epred, by = "condition") %>% # Compare draws of posterior expected value across levels
        ungroup() %>% # Ungroup computations
        rename(diff_mean_epred = mean_epred) %>% # Rename computed value
        mutate(sample_size = sample_size) # Sample size

      # Summary of contrasts of the marginalized posterior expected value
      epreds_marginal_contrasts_summary <- epreds_marginal_contrasts %>%
        summarize(min = min(diff_mean_epred), # Minimum posterior expected value
          hdi_lo = hdi(diff_mean_epred)[ncol(hdi(diff_mean_epred)) - 1],  # Rightmost lower bound of 95% HDI of the posterior expected value
          mean_diff_mean_epred = mean(diff_mean_epred), # Mean posterior expected value
          median_diff_mean_epred = median(diff_mean_epred), # Median posterior expected value
          hdi_hi = hdi(diff_mean_epred)[ncol(hdi(diff_mean_epred))], # Rightmost upper bound of 95% HDI of the posterior expected value
          hdi_diff = hdi_hi - hdi_lo, # Size of the support of the 95% HDI
          max = max(diff_mean_epred), # Maximum posterior expected value
          prop_gt = mean(diff_mean_epred > 0), # Proportion of posterior draws with values greater than chance
          .groups = "drop") %>%
        mutate(sample_size = sample_size) # Sample size

      # Collect computed elements
      return(list(epreds = epreds,
                  # epreds_marginal = epreds_marginal, # exclude draws from the list because it'll become astronomically large when scaled up
                  epreds_marginal_summary = epreds_marginal_summary,
                  # epreds_marginal_contrasts = epreds_marginal_contrasts, # exclude draws from the list because it'll become astronomically large when scaled up
                  epreds_marginal_contrasts_summary = epreds_marginal_contrasts_summary))
    }
)

# Collect summaries of each model's distribution of the marginalized expected value
critical_hyfer_value <- 10

results_summaries <- map_dfr(results_list, ~ {
  bind_rows(.x$epreds_marginal_summary,
            .x$epreds_marginal_contrasts_summary %>% mutate(condition = "contrast")) %>%
    mutate(mean = case_when(is.na(mean_mean_epred) ~ mean_diff_mean_epred,
                            is.na(mean_diff_mean_epred) ~ mean_mean_epred,
                            T ~ NA),
           median = case_when(is.na(median_mean_epred) ~ median_diff_mean_epred,
                            is.na(median_diff_mean_epred) ~ median_mean_epred,
                            T ~ NA),
           hdi_exclude_chance = case_when(hdi_lo > 0.5 & condition %in% c("generic", "specific") ~ 1,
                                        hdi_lo > 0 & condition == "contrast" ~ 1,
                                        hdi_hi < 0.5 & condition %in% c("generic", "specific") ~ 1,
                                        hdi_hi < 0 & condition == "contrast" ~ 1,
                                        T ~ 0),
           odds = prop_gt / (1 - prop_gt), # Compute odds
           hyfer = ifelse(odds < 1, 1 / odds, odds), # Compute the hypothesis free evidence ratio
           hyfer = ifelse(hyfer == Inf, 1000, hyfer), # Upper bound on HYFER
           hyfer_gt = ifelse(hyfer > critical_hyfer_value, 1, 0), # Critical HYFER value
           note = ifelse(condition == "contrast",
                         "prop_gt = prop greater than 0",
                         "prop_gt = prop greater than .5")) %>%
    select(-c(mean_mean_epred, mean_diff_mean_epred,
              median_mean_epred, median_diff_mean_epred)) %>%
    relocate(mean, .before = hdi_hi) %>%
    relocate(median, .before = hdi_hi)
},
.id = "model"
) %>%
  mutate(model = as.numeric(model))

# summarize the summaries of each model's distribution of the marginalized expected value
# results_summaries %>%
#   group_by(condition, sample_size) %>%
#   summarize(mean_mean = mean(mean), # average expected value across models
#             mean_prop_gt = mean(prop_gt), # average prop_gt across models
#             median_prop_gt = median(prop_gt), # median prop_gt across models
#             sd_prop_gt = sd(prop_gt), # sd prop_gt across models
#             mean_hdi_diff = mean(hdi_diff)) %>% print(n = Inf)

# results_summaries %>%
#   group_by(condition, sample_size) %>%
#   mutate(prop_gt_indicator = case_when(condition == "contrast" & prop_gt > 0 ~ 1,
#                                        condition %in% c("specific", "generic") & prop_gt > 0.5 ~ 1,
#                                        T ~ 0)) %>%
#   summarize(prop_prop_gt_indicator = mean(prop_gt_indicator))  %>% print(n = Inf)

Here are visualizations of the results summaries.

# mean of posterior marginalized expected value
ggplot(data = results_summaries,
       mapping = aes(x = sample_size, y = median)) +
  stat_slabinterval(aes(fill = condition),
                    slab_alpha = .40, .width = c(0.5, 0.95),
                    point_interval = "median_hdi",
                    position = position_dodge(width = 8)) +
  theme_classic(base_size = 18, base_family = "Times New Roman") +
  theme(panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.3)) +
  geom_hline(yintercept = 0, linetype = 3) +
  geom_hline(yintercept = 0.5, linetype = 3) +
  scale_y_continuous(labels = function(x) sub("^0", "", sprintf("%.1f", x)), breaks = pretty_breaks(10)) +
  labs(title = "Median of Marginal Posterior Expected Value", subtitle = "Mean, 50% and 95% Intervals", x = "Sample Size", y = "Marginal Posterior Expected Value")

# Dispersion of posterior marginal expected value
ggplot(data = results_summaries,
       mapping = aes(x = sample_size, y = hdi_diff)) +
  geom_smooth(data = results_summaries,
              aes(color = condition),
              se = FALSE) +
  stat_slabinterval(aes(fill = condition),
                    slab_alpha = .40, .width = c(0.5, 0.95),
                    point_interval = "median_hdi",
                    show.legend = FALSE,
                    position = position_dodge(width = 8)) +
  theme_classic(base_size = 18, base_family = "Times New Roman") +
  theme(panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.3)) +
  scale_y_continuous(labels = function(x) sub("^0", "", sprintf("%.2f", x))) +
  labs(title = "Posterior Dispersion of Expected Value", subtitle = "Mean, 50% and 95% Intervals", x = "Sample Size", y = "Size of HDI of Model-Averaged Expected Values (Probability Scale)")

plots_df <- results_summaries %>%
  group_by(condition, sample_size) %>%
  summarize(mean_prop_gt = mean(prop_gt),
            sd_prop_gt = sd(prop_gt),
            prop_hdi_exclude_chance = mean(hdi_exclude_chance),
            prop_hyfer_gt = mean(hyfer_gt)) %>%
  ungroup() %>%
  mutate(sd_lo = mean_prop_gt - sd_prop_gt,
         sd_hi = mean_prop_gt + sd_prop_gt) %>% print(n = Inf)

# Proportions
critical_prop_value <- .80 # cutoff value (a la power)

ggplot(data = plots_df %>% pivot_longer(cols = c(prop_hdi_exclude_chance, prop_hyfer_gt)),
       aes(
         x = sample_size,
         y = value,
         color = condition,
         linetype = name,
         group = interaction(condition, name)
         )) +
  geom_line() +
  geom_point(size = 3) +
  theme_classic(base_size = 18, base_family = "Times New Roman") +
    theme(panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.3)) +
  scale_y_continuous(labels = function(x) sub("^0", "", sprintf("%.1f", x))) +
  labs(title = "Binary Statistics",
       x = "Sample Size",
       y = "Proportion of Posterior Models",
       subtitle = paste0(n_samples_per_sample_size, " Samples Per Size Sample"),
       color = "Condition",
       linetype = "Measure"
       ) +
  geom_hline(yintercept = critical_prop_value, linetype = 2) +
  scale_linetype_manual(
    values = c("prop_hdi_exclude_chance" = "solid",
               "prop_hyfer_gt" = "dashed"),
    labels = c("prop_hdi_exclude_chance" = "HDI Excludes Chance",
               "prop_hyfer_gt" = paste0("HYFER > ", critical_hyfer_value)))

SessionInfo

sessionInfo() # print session information

end <- Sys.time() # record endtime
print(end - start) # runtime

Simulation-Based Bayesian Sample Size Planning

Jared Vasil

2026-03-02

Introduction

The Pipeline

Skippable

Background

Additional Resources

Future Directions

First Component

Globals

Data Import and Wrangling

Results

Descriptive Statistics

Warmup Phase

Trials Completed

Data

Predictive Accuracy

Models

Checks

Analysis

Posterior Distributions

Test Phase

Trials Completed

Data

Predictive Accuracy

Models

Checks

Analysis

Posterior Distributions

Control Phase

Checks

Analysis

Second Component

The Generative Process

Design

Effects

Simulation

Remove Generative Parameters

Samples

Parameters of the Random Sampling Process

Check Samples to Be Collected

Parallelization

Draw random samples

Check Samples

Iterative Modeling

Results

SessionInfo