This RPub documents a pipeline for simulation-based Bayesian sample size planning. This planning pipeline uses tools from tidyverse for data wrangling, brms for Bayesian modeling, and tidybayes for posterior analysis.
The planning pipeline includes two components. The first component is an analysis pipeline. The second component iterates over that analysis pipeline using simulated data. The results of the second component inform one’s choice of sample size.
The first component is the analysis pipeline that one will use to analyze the data from one’s experiment. This component is created before or after collecting pilot data and is used to analyze that data. In turn, the results of the pilot analysis inform the second component of the planning pipeline. In the second component, one simulates a population using parameter values inferred from the pilot data. Then, this simulated population is randomly sampled many times over while systematically increasing the sample size. The analysis pipeline from the first component is then iterated over, once per sample. The results of this pilot data-informed, iterative procedure are the quantities used for sample size planning.
The use of pilot data in this pipeline is intended to decrease the uncertainty about effects associated with power analysis. Both components should be preregistered. Specifically, one should preregister their main inference pipeline (i.e., the first component) and the code and results of their sample size simulation (i.e., the second component).
Importantly, several code chunks in this pipeline may be repurposed with little or no modification. Other chunks require modification to fit one’s experimental design. Repurposing this pipeline requires nontrivial conceptual and R skill. However, the extensive documentation and commenting should ease things.
This pipeline collects the statistical practices that I have found useful in my own research. In part, these practices implement my effort to maximize the generalizability of my inferences in light of the sparse data characteristic of my studies. Towards this end, this pipeline assumes that investigators adopt a posture of model expansion via the construction of probabilistic models that account for various sources of uncertainty, rather than of comparison with simpler models; the use of marginal rather than conditional probabilities; and prediction via expected values rather than inference on parameters.
Moreover, this pipeline implements my effort to faithfully translate my scientific hypotheses into statistical hypotheses. As I see it, this is the only way to conduct statistical tests that logically, rather than merely conventionally, speak to the content of the conclusions that I want to advance. Typically, I (and many others) advance directional hypotheses that beget directional predictions. For example, I might hypothesize some psychological mechanism that predicts that observed mean A > observed mean B. Then, I advance directional conclusions based on the results of my directional tests. Specifically, this pipeline measures the evidence for the hypothesized directional effect, compared to the alternative directional effect that was not hypothesized (here is a comparison of Bayesian directional testing to one-tailed p-values; here is a recent revival of one-tailed p-values). In curder terms, this pipeline assumes that the null-is-nil is fake news because it is quasi-always false and so is ignored here.
Specifically, this pipeline implements inference over the odds of competing directional hypotheses. If both directional hypotheses are a priori equiprobable, then their posterior odds is equivalent to a Bayes factor that quantifies the evidence for one directional hypothesis over the other. Notably, the posterior odds of directional hypotheses qua Bayes factor avoids the sensitivity to prior dispersion characteristic of model comparison Bayes factors or Bayes factors computed via the Savage-Dickey density ratio. This is because the posterior odds of direction is an “objective” property of the posterior. Specifically, symmetric changes to prior dispersion about zero cancel each other out, as the prior odds is always 1. In practice, I have found the posterior odds to be essentially unimpacted by prior dispersion until one places assumes theoretically implausibly tight priors, at which point the posterior odds collapses to 1. Therefore, while priors can encode theory, in the present pipeline priors are practical tools for model identifiability, regularization, and robust prediction in rich, hierarchical, probabilistic models. Moreover, in this pipeline the posterior odds are computed from the posterior distribution over the marginalized expected value of the response, as opposed to the parameters. That is, the quantity used for inference is averaged across all the uncertainty of the modeled covariates.
Quick introductions to Bayesian inference can be found here and here. In addition to Kruschke’s and Gelman’s groups is Wagenmakers’ group (e.g., here is the latter group’s Bayesian sample size planning pipeline for developmental psychologists). Personally, I prefer Kruschke and Gelman. Gelman et al.’s Bayesian Data Analysis has been a key learning tool for me. In my opinion, Wagenmakers group relies too heavily on model comparison Bayes factors, which are overly sensitive to the prior (discussed above) while, nonetheless, still offering important benefits over p-values (summarized here, such as optional stopping. Here’s an introduction to brms by the creator of brms. This resource is particularly useful for the present pipeline, which implements Bayesian inference via brms. Generalist Bayesian workflows for data analysis can be found here and here. Kruschke has a thorough article on Bayesian reporting guidelines. For the deep lore on Bayesian (vs. Frequentist) inference, one might start with Ramsey, Fisher, Savage (this, this, this), Lindley, de Finetti, or Gigerenzer (e.g., this, this, or this). Clayton cloaks a thorough summary of the references, above, in best-practices garb while also introducing Bayesian and frequentist inference in an informal but precise way. Classic statements on the history of the concept of probability are by Daston and Hacking, while Porter provides a history of statistics. Gigerenzer and colleagues’ classic work summarizes the former three works, and describes the influence of the concept of probability on science and society. Note that the preceding is merely a selection of some of the sources from which I have benefited, rather than a representative collection of any area of literature.
Conceivably, this pipeline could develop into a simple web interface or R package, like the Frequentist SuperPower. In the future, I envision a user experience that allows one to easily specify various population parameters, parameters for sampling from the population, model structures and priors, and statistical quantities to be computed from the posterior models to be used for sample size planning. I also hope to publish a tutorial of this pipeline, which may perhaps be seen as bringing together recent advances in simulation-based sample size planning for experimental designs in hierarchical models by Pargent and colleagues with advances in Bayesian sample size planning for developmentalists by Wagenmakers and colleagues.
In the first component, we develop our analysis pipeline based on an exploratory study. This pipeline is the pipeline that we will use for inference in the target experiment whose sample size we are planning.
To illustrate this, the analysis pipeline below is developed for an exploratory experiment into 4-year-olds’ understanding of generic knowledge as common ground. Here is the preregistration. The data are all valid PANDA cases, although interference and reliability coding has not yet been completed.
In brief, participants behaved as predicted in the warmup and control phases. The purposes of these phases were to ensure that participants appropriately interpreted unambiguous uses of the pronominal forms used in the test phase (warmup) and to ensure that participants’ test phase responses were due to their conceptual learning rather than syntactic or semantic learning.
In the test phase there was, unexpectedly, strong evidence for a crossover interaction between condition and gender. Specifically, whereas boys behaved as expected, girls reliably responded in the opposite direction. This interaction contrasted with our prediction of an unmoderated main effect of condition. After children learned about Zarpies with generic language and Flurps with specific language, we expected that children would choose Zarpies in response to ambiguous generic reference and Flurps in response to ambiguous specific reference. Boys generally followed this pattern, although they equivocated between Zarpies and Flurps in response to ambiguous specific reference. In contrast, girls often chose Zarpies in response to ambiguous specific reference, with some evidence that they did so more often than in the generic condition. So, boys and girls displayed opposite directional effects, although the evidence for effect direction among boys was stronger than it was among girls.
Session packages and seed.
library(brms) # Models
library(furrr) # Parallelization
library(future) # Parallelization
library(ggtext) # Plots
library(janitor) # Cleaning
library(knitr) # Reporting
library(parallel) # Parallelization
library(patchwork) # Plots
library(readxl) # Data import
library(scales) # Plots
library(tidybayes) # Analysis
library(tidyverse) # Wrangling
seed <- 10003 # RNG seed
# Import Qualtrics data
gkcg_qualtrics0 <- read_csv("data/kids_qualtrics_2026_feb14.csv")
gkcg_qualtrics <- gkcg_qualtrics0 %>%
row_to_names(1) %>% # first row to column names
clean_names()
# Import PANDA data
gkcg_panda0 <- read_csv("data/kids_panda_2026_feb14.csv")
gkcg_panda <- gkcg_panda0 %>%
clean_names() %>%
mutate(participation_date = as.Date(recorded_on), # date of participation
age_num = as.numeric(participation_date - birthday) / 365.25) # exact age in years
# Remove invalid PANDA cases
gkcg_panda_nopilot <- gkcg_panda %>%
filter(participation_date >= "2025-12-12") # Remove experimenter and pilot observations
gkcg_panda_valid <- gkcg_panda_nopilot %>%
filter(is_valid_case == "Yes") %>% # Only valid participants
arrange(recorded_on) %>% # arrange participation records by date/time
mutate(id_new = row_number()) # id's organized by date/time of participation
sum(gkcg_panda_nopilot$is_valid_case == "No") # Count of invalid cases from non-pilot data
id_new <- setNames(gkcg_panda_valid$id_new, gkcg_panda_valid$id)
# Create dataset
gkcg_data0 <- gkcg_qualtrics %>%
filter(id %in% gkcg_panda_valid$id) %>% # Include only valid PANDA cases
select( # Include only these columns
id, # participant id
starts_with("wrm"), # warmup trials
starts_with("generic"), # test trials
starts_with("specific"), # test trials
starts_with("c_")) %>% # control trials
select(-c(ends_with("_start"), ends_with("_end"))) %>% # Remove timing columns
pivot_longer(cols = !id, names_to = "variable", values_to = "response") %>% # Long data
filter(response == "On") %>% # Exclude non-selected items
left_join(gkcg_panda_valid %>% # Join with covariates from PANDA data (by PANDA ID)
select(id, gender, age_num), by = "id") %>%
separate(variable,
into = c("variable", "response"),
sep = "_(?=[^_]+$)") %>% # Split variable and response
mutate(gender = case_when( # Relevel gender
gender == "M" ~ "boy",
gender == "F" ~ "girl"),
variable = str_remove(variable, "_q"), # Remove leading "_q" in string
depicted_gender = case_when( # Create depicted gender
str_ends(variable, "boy") ~ NA_character_,
str_ends(variable, "girl") ~ NA_character_,
str_ends(variable, "_b") ~ "depicted_boy",
str_ends(variable, "_g") ~ "depicted_girl",
T ~ "0"),
trial_type = case_when( # Create trial type
str_detect(variable, "wrm") ~ "warmup",
str_detect(variable, "generic") ~ "test",
str_detect(variable, "specific") ~ "test",
str_detect(variable, "c_") ~ "control",
T ~ NA_character_),
trial_subtype = case_when( # Create trial subtype
str_detect(variable, "boy") ~ "boy_correct", # Warmup (attention)
str_detect(variable, "girl") ~ "girl_correct",
str_detect(variable, "wrm_3pp") ~ "3pp", # Warmup (3p.)
str_detect(variable, "wrm_3ps") ~ "3ps",
str_detect(variable, "generic") ~ "generic", # Test
str_detect(variable, "specific") ~ "specific",
str_detect(variable, "num_phrs_match") ~ "number_or_phrasal_matching", # Control
str_detect(variable, "cnjgtn_match") ~ "conjugation_matching",
T ~ NA_character_),
condition = ifelse(trial_subtype %in% c("generic", "specific"), paste0(trial_subtype), NA_character_), # Create condition
item = case_when( # Create item
str_detect(variable, "ride") ~ "ride",
str_detect(variable, "catch") ~ "catch",
str_detect(variable, "play") ~ "play",
str_detect(variable, "train") ~ "train",
str_detect(variable, "wash") ~ "wash",
str_detect(variable, "rake") ~ "rake",
str_detect(variable, "demo") ~ "demonstrative_pronoun",
str_detect(variable, "personal") ~ "personal_pronoun",
T ~ NA_character_),
response_correct = case_when(trial_subtype == "3pp" & response == "plural" ~ "Correct", # Correct (warmup trials)
trial_subtype == "3ps" & response == "singular" ~ "Correct",
trial_subtype == "boy_correct" & response == "boy" ~ "Correct",
trial_subtype == "girl_correct" & response == "girl" ~ "Correct",
trial_type %in% c("test", "control") ~ NA_character_,
T ~ "Incorrect"),
across(where(is.character), ~ str_to_lower(.)), # convert characters to lowercase
across(where(is.character), ~ as.factor(.)), # convert characters to factors
id = id_new[id]) %>%
group_by(id, trial_type) %>% # Group by id, trial type
mutate(trial = row_number()) %>% # Count trial number
ungroup() %>%
select(-variable) # Remove "variable" column
# Remove participants with zero test trials
trials <- gkcg_data0 %>% count(id, trial_subtype) %>% # count of each trial_subtype by participant
ungroup() %>%
complete(id,
trial_subtype,
fill = list(n = 0)) %>%
pivot_wider(id_cols = id,
names_from = trial_subtype,
values_from = n) %>%
filter(generic == 0 & specific == 0) # only participant 33 contributed no trials
age_exclusions <- gkcg_data0 %>%
distinct(id, gender, age_num) %>% # distinct id - gender - age
filter(age_num < 4 | age_num > 5) # include kids < 4-yo or > 5-yo
# Participants included
gkcg_inclusions <- gkcg_data0 %>%
filter(!id %in% trials$id, # remove X participants who contributed no trials
!id %in% age_exclusions$id) %>%
# id <= 50 + length(trials$id)) %>% # include X many more participants with id's over 50
distinct(id)
gkcg_data <- gkcg_data0 %>%
filter(id %in% gkcg_inclusions$id) # id's to include
n_distinct(gkcg_data$id) # participant count
There were 1368 observations from 118 participants (mean age: 4.52 years, range: 4.01 years to 4.99 years).
Of the 1368 observations, 457 were warmup trials, 684 were test trials, and 227 were control trials.
Of the 118 participants, 65 were girls (mean age: 4.53 years, range: 4.02 years to 4.99 years) and 53 were boys (mean age: 4.53 years, range: 4.02 years to 4.99 years).
The trials completed, data, predictive accuracy, model, analysis, and posterior distributions are presented for the warmup and test phases of the experiment. The data from the control phase is displayed at the end.
Summary: Participants responded correctly at high rates in the warmup phase. The weakest evidence for systematic, correct responding was found for participants’ comprehension of the third-person plural in its personal (they) and demonstrative (those ones) forms. Nonetheless, correct response rates were high in these trials. (NB: Those ones is, really, a third-person plural construction, unlike they, but both are referred to as third-person plural pronouns due to their common function in this experiment.)
Participant counts are inset.
Participant counts are inset.
descriptives <- function(trial_type_level){
gkcg_data %>%
filter(trial_type == {{ trial_type_level }}) %>% # Include only this trial type
count(trial_subtype, response) %>% # Count observations per cell
group_by(trial_subtype) %>% # Set grouping
mutate(prop = n / sum(n)) %>% # Compute proportion
ungroup() # Remove grouping
}
descriptives("warmup") # compute descriptives for warmup trials
# main figure
complete_boy <- descriptives("warmup") %>%
filter(trial_subtype %in% c("boy_correct", "girl_correct")) %>%
droplevels() %>%
complete(trial_subtype, response, fill = list(n = 0, prop = 0))
descriptives_warmup <- descriptives("warmup") %>%
filter(!trial_subtype %in% c("boy_correct", "girl_correct")) %>%
rbind(complete_boy) %>%
mutate(
trial_subtype = factor(case_when(
str_detect(trial_subtype, "girl") ~ "Girl",
str_detect(trial_subtype, "boy") ~ "Boy",
str_detect(trial_subtype, "3pp") ~ "Plural",
str_detect(trial_subtype, "3ps") ~ "Singular",
TRUE ~ NA_character_
), levels = c("Girl", "Boy", "Plural", "Singular")),
response = str_to_title(response),
vjust_text = ifelse(n <= 4, -0.3, 1.3),
response = case_when(response == "Singular" ~ "Sg.",
response == "Plural" ~ "Pl.",
T ~ response),
response = factor(response, levels = c("Girl", "Boy", "Pl.", "Sg.")))
fig1a <- ggplot(descriptives_warmup, aes(x = response, y = prop)) +
facet_grid(~ trial_subtype, scales = "free") +
geom_bar(stat = "identity", fill = "deepskyblue1") +
geom_text(aes(label = n, vjust = vjust_text),
color = "black",
family = "Times New Roman",
size = 4.25) +
scale_y_continuous(
labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
breaks = c(.25, .5, .75)
) +
labs(
x = "Response",
y = "Proportion of Participants"
) +
theme_classic(base_size = base_size, base_family = "Times New Roman") +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3)
)
# print(fig1a) # print for markdown
Here is the distribution of responses across plural and singular third-person demonstrative and personal pronouns. Participant counts are inset.
# Comprehension figure
comprehension_data <- gkcg_data %>%
filter(item %in% c("demonstrative_pronoun",
"personal_pronoun")) %>%
count(trial_subtype, item, response) %>%
ungroup() %>%
group_by(trial_subtype, item) %>%
mutate(response = str_to_title(response),
levels_for_graph = case_when(trial_subtype == "3pp" & str_detect(item, "dem") ~ "Plural Demonstrative",
trial_subtype == "3pp" & str_detect(item, "per") ~ "Plural Personal",
trial_subtype == "3ps" & str_detect(item, "dem") ~ "Singular Demonstrative",
trial_subtype == "3ps" & str_detect(item, "per") ~ "Singular Personal",
T ~ NA),
prop = n / sum(n),
vjust_text = ifelse(n == 0, -0.3, 1.3)) %>%
ungroup() %>%
mutate(levels_for_graph = str_replace(levels_for_graph, "Plural", "Pl."),
levels_for_graph = str_replace(levels_for_graph, "Singular", "Sg."),
levels_for_graph = str_replace(levels_for_graph, "Demonstrative", "Dem."),
levels_for_graph = str_replace(levels_for_graph, "Personal", "Pers."),
response = case_when(response == "Plural" ~ "Pl.",
response == "Singular" ~ "Sg.",
TRUE ~ NA_character_))
func_fig2 <- function(level, x_val, y_val, label) {
geom_text(
data = data.frame(
levels_for_graph = {{level}},
y = y_val,
x = x_val,
label = label
),
aes(x = x, y = y, label = label),
inherit.aes = FALSE,
family = "Times New Roman",
fontface = "bold",
size = 3.5
)
}
fig2a <- ggplot(comprehension_data, aes(x = response, y = prop)) +
facet_grid(~ levels_for_graph, scales = "free") +
geom_bar(stat = "identity", fill = "deepskyblue1") +
geom_text(aes(label = n, vjust = vjust_text),
color = "black",
family = "Times New Roman",
size = 4.25) +
scale_y_continuous(
labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
breaks = c(.25, .5, .75)
) +
labs(
x = "Response",
y = "Proportion of Participants",
) +
scale_fill_identity() +
theme_classic(base_size = base_size, base_family = "Times New Roman") +
theme(
plot.title = element_text(face = "bold", hjust = 0.5),
panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3)
) +
func_fig2(level = "Pl. Dem.", y = .9, x = 1.5, label = "Those ones") +
func_fig2(level = "Pl. Pers.", y = .9, x = 1.5, label = "They") +
func_fig2(level = "Sg. Dem.", y = .9, x = 1.5, label = "This one") +
func_fig2(level = "Sg. Pers.", y = .9, x = 1.5, label = "S/He")
predictive_accuracy_warmup <- gkcg_data %>% # How often were responses predicted compared to unpredicted?
filter(trial_type == "warmup") %>%
mutate(prediction = case_when(trial_subtype == "girl_correct" & response == "girl" ~ "accurate",
trial_subtype == "boy_correct" & response == "boy" ~ "accurate",
trial_subtype == "3pp" & response == "plural" ~ "accurate",
trial_subtype == "3ps" & response == "singular" ~ "accurate",
T ~ "inaccurate")) %>%
count(name = "count", prediction) %>%
mutate(odds_of_accurate_prediction = count[prediction == "accurate"] / count[prediction == "inaccurate"])
Of the 457 warmup trial observations, 380 were predicted accurately (e.g., participants chose a dyad in the plural comprehension trial or an individual in the singular trial). In contrast, 77 observations were predicted inaccurately. Thus, our post-hoc odds of accurately predicting participants’ response in each warmup trial was 4.94.
Models of the warmup observations.
# Priors
priors <- c(prior(normal(0, 1.00), Intercept),
prior(normal(0, 0.8), b),
prior(normal(0, 0.2), sd),
prior(lkj(2), cor))
# Response family
response_family <- "bernoulli"
# Formula
warmup_formula <- brmsformula(response_correct ~ trial_subtype + (trial_subtype | id))
warmup_formula_comprehension <- brmsformula(response_correct ~ trial_subtype + item + (trial_subtype + item | id))
warmup_data_to_be_modeled <- gkcg_data %>%
drop_na(response_correct) %>%
mutate(trial_subtype = fct_drop(trial_subtype),
response_correct = fct_relevel(response_correct, "incorrect", "correct"))
mod_warmup <- brm(formula = warmup_formula,
data = warmup_data_to_be_modeled,
family = response_family,
chains = detectCores() - 1,
cores = detectCores() - 1,
prior = priors,
backend = "cmdstanr")
mod_warmup_comprehension <- brm(formula = warmup_formula_comprehension,
data = warmup_data_to_be_modeled %>% filter(!is.na(item)),
family = response_family,
chains = detectCores() - 1,
cores = detectCores() - 1,
prior = priors,
backend = "cmdstanr")
func_bayesian_p_val <- function(mod, responsevar) {
family <- mod$family$family # Detect family (for bernoulli or poisson -- categorical is computed separately)
yrep <- posterior_predict(mod) # Posterior predictive draws
yobs <- as.numeric(mod$data[[responsevar]]) # Observed data
if (family == "bernoulli") { # Adjust for Bernoulli if needed
yobs <- yobs - 1 # assumes 1 = success, 0 = failure
}
T_obs_mean <- mean(yobs) # Compute summary statistics
T_rep_mean <- apply(yrep, 1, mean)
pvals = mean(T_rep_mean >= T_obs_mean) # Bayesian p-values
print(pvals)
}
func_bayesian_p_val(mod_warmup, "response_correct")
func_bayesian_p_val(mod_warmup_comprehension, "response_correct")
# plot(mod_warmup)
# plot(mod_warmup_comprehension)
# pp_check(mod_warmup, type = "dens_overlay_grouped", ndraws = 1000, group = "trial_subtype")
# pp_check(mod_warmup_compr1ehension, type = "dens_overlay_grouped", ndraws = 1000, group = "trial_subtype")
# pp_check(mod_warmup_comprehension, type = "ecdf_overlay_grouped", ndraws = 1000, group = "trial_subtype")
Analysis of the posterior models of the warmup phase observations.
# Main model
newdata_warmup <- tibble(trial_subtype = levels(mod_warmup$data$trial_subtype))
epreds_warmup <- local({
set.seed(10003)
epreds <- epred_draws(mod_warmup,
newdata_warmup,
re_formula = NA)
}) %>%
ungroup() %>%
mutate(trial_subtype = case_when(trial_subtype == "3pp" ~ "Plural",
trial_subtype == "3ps" ~ "Singular",
trial_subtype == "girl_correct" ~ "Girl",
trial_subtype == "boy_correct" ~ "Boy",
T ~ NA_character_),
trial_subtype = factor(trial_subtype, levels = c("Girl", "Boy", "Plural", "Singular")))
epreds_summarized_warmup <- epreds_warmup %>%
group_by(trial_subtype) %>%
summarize(
min = min(.epred),
hdi_lo = hdi(.epred)[1],
mean = mean(.epred),
hdi_hi = hdi(.epred)[2],
max = max(.epred),
prop_gt = mean(.epred > .5),
prop_gt_char = sprintf("%.0f%%", 100 * prop_gt))
# Comprehension model
newdata_warmup_c <- crossing(trial_subtype = levels(mod_warmup_comprehension$data$trial_subtype),
item = levels(mod_warmup_comprehension$data$item))
epreds_warmup_c <- local({
set.seed(10003)
epreds <- epred_draws(mod_warmup_comprehension, newdata_warmup_c, re_formula = NA)
}) %>%
ungroup() %>%
mutate(trial_subtype = case_when(trial_subtype == "3pp" ~ "Plural",
trial_subtype == "3ps" ~ "Singular",
trial_subtype == "girl_correct" ~ "Girl",
trial_subtype == "boy_correct" ~ "Boy",
T ~ NA_character_),
item = ifelse(item == "demonstrative_pronoun", "Demonstrative", "Personal"),
trial_subtype = factor(trial_subtype, levels = c("Plural", "Singular", "Boy", "Girl")),
variable = case_when(trial_subtype == "Plural" & item == "Demonstrative" ~ "Pl. Dem.",
trial_subtype == "Plural" & item == "Personal" ~ "Pl. Pers.",
trial_subtype == "Singular" & item == "Demonstrative" ~ "Sg. Dem.",
trial_subtype == "Singular" & item == "Personal" ~ "Sg. Pers.",
TRUE ~ NA_character_))
epreds_summarized_warmup_c <- epreds_warmup_c %>%
group_by(variable) %>%
summarize(
min = min(.epred),
hdi_lo = hdi(.epred)[1],
mean = mean(.epred),
hdi_hi = hdi(.epred)[2],
max = max(.epred),
prop_gt = mean(.epred > .5),
prop_gt_char = sprintf("%.0f%%", 100 * prop_gt))
Posterior distributions of the mean rate of choosing correctly.
fig1b <- ggplot(epreds_warmup, aes(y = .epred, x = trial_subtype)) +
facet_grid( ~ trial_subtype, scales = "free") +
stat_halfeye(
aes(fill = after_stat(y < 0.5)),
.width = 0.95,
slab_alpha = 0.8,
slab_linewidth = 0.5,
slab_color = "black"
) +
scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"),
guide = "none") +
geom_text(
data = epreds_summarized_warmup,
aes(
y = hdi_hi + .03,
hjust = -.15,
label = prop_gt_char
),
size = 4.25,
family = "Times New Roman",
fontface = "bold",
color = "grey35"
) +
geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
labs(
y = "Predicted Mean Correct",
x = NULL
) +
scale_y_continuous(
labels = function(x)
sub("^0\\.", ".", sprintf("%.2f", x)),
limits = c(0.4, 1),
breaks = c(.5, .75, 1.00)
) +
scale_x_discrete(expand = expansion(mult = c(0.5, NA))) +
theme_classic(base_size = base_size, base_family = "Times New Roman") +
theme(
plot.subtitle = element_text(hjust = .5, size = base_size - 3),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = .5, face = "bold"),
)
fig1a + fig1b
ggsave("figs/fig1.png", height = 4, width = 9)
fig2b <- ggplot(epreds_warmup_c, aes(y = .epred, x = variable)) +
facet_grid( ~ variable, scales = "free") +
stat_halfeye(aes(fill = after_stat(y < 0.5)),
.width = 0.95,
slab_alpha = .8,
slab_linewidth = 0.5,
slab_color = "black") +
scale_fill_manual(values = c("TRUE" = "grey85",
"FALSE" = "grey35"),
guide = "none") +
geom_text(data = epreds_summarized_warmup_c,
aes(
y = hdi_hi + .03, hjust = -.15, label = prop_gt_char
),
size = 4.25,
family = "Times New Roman",
fontface = "bold",
color = "grey35"
) +
geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
labs(
y = "Predicted Mean Correct",
x = NULL
) +
scale_y_continuous(
labels = function(x)
sub("^0\\.", ".", sprintf("%.2f", x)),
limits = c(0.4, 1),
breaks = c(.5, .75, 1.00)
) +
scale_x_discrete(expand = expansion(mult = c(0.5, NA))) +
theme_classic(base_size = base_size, base_family = "Times New Roman") +
theme(
plot.subtitle = element_text(hjust = .5, size = base_size - 3),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
)
fig2a + fig2b
ggsave("figs/fig2.png", height = 4, width = 9)
Here is the distribution of the number of test trials completed per trial type. Participant counts are inset.
# Compute individuals' outcome distribution
individual_dat <- gkcg_data %>%
drop_na(condition) %>%
group_by(id) %>%
count(condition, response) %>%
ungroup()
# By participant
individual_dat %>%
complete(id, condition, response, fill = list(n = 0)) %>%
group_by(id, condition) %>%
mutate(
sum = sum(n),
prop = ifelse(sum == 0, 0, n / sum)
) %>%
ungroup() %>%
mutate(response_profile = case_when(condition == "generic" & response == "zarpie" & prop > .5 ~ "predicted",
condition == "specific" & response == "flurp" & prop > .5 ~ "predicted",
T ~ "unpredicted")) %>%
filter(condition == "generic" & response == "zarpie" |
condition == "specific" & response == "flurp") %>%
count(response_profile) %>%
mutate(odds_of_accurate_prediction = n[1]/n[2]) # odds of correctly predicting individual level children's response profiles
# Descriptives
descriptives("test") # compute descriptives for test trials
# figure
fig3a_pos <- position_jitter(width = 0.20, height = 0.01, seed = 10003)
begin_value <- 0.35
end_value <- 0.90
test_results0 <- individual_dat %>%
group_by(id) %>%
complete(condition, response, fill = list(n = 0)) %>%
ungroup() %>%
group_by(id, condition) %>%
mutate(sum = sum(n),
prop = round(n / sum(n), 2),
condition = str_to_title(condition)) %>%
ungroup() %>%
filter(response == "zarpie")
test_results1 <- test_results0 %>%
pivot_wider(id_cols = id,
names_from = condition,
values_from = prop) %>%
mutate(more_zarpie_responses_in = factor(case_when(
Generic > Specific ~ "G", # generic condition (shorthand for plotting)
Generic == Specific ~ "E", # equal between generic and specific conditions (shorthand for plotting)
Generic < Specific ~ "S", # specific condition (shorthand for plotting)
T ~ NA),
levels = c("G", "E", "S")))
df_fig3a0 <- test_results0 %>%
left_join(test_results1 %>% select(id, more_zarpie_responses_in), by = "id")
df_fig3a00 <- df_fig3a0 %>%
left_join(gkcg_data %>% distinct(id, gender) %>% mutate(gender = paste0(gender, "s")), by = "id")
df_fig3a <- rbind(df_fig3a0 %>% mutate(gender = "overall"), df_fig3a00) %>%
mutate(gender = str_to_title(gender),
gender = fct_relevel(gender, "Overall", "Boys", "Girls"),
condition = case_when(condition == "Generic" ~ "Gen.",
condition == "Specific" ~ "Spec.")) %>%
drop_na(more_zarpie_responses_in)
fig3a <- ggplot(df_fig3a,
aes(x = condition, y = prop)) +
facet_grid( ~ gender, scales = "free") +
geom_line(aes(group = id),
color = "deepskyblue1", #factor(more_zarpie_responses_in)),
linewidth = 0.5,
alpha = .2,
position = fig3a_pos) +
geom_jitter(color = "deepskyblue1", #aes(color = factor(more_zarpie_responses_in)),
size = 1,
alpha = .4,
position = fig3a_pos) +
stat_summary(
fun = mean,
geom = "point",
size = 5,
) +
theme_classic(base_size = base_size,
base_family = "Times New Roman") +
scale_y_continuous(limits = c(-0.05, 1.05),
breaks = pretty_breaks(3),
labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x))) +
labs(x = "Condition",
y = "Proportion Zarpie") +
theme(plot.subtitle = element_text(size = base_size - 1, hjust = .5),
panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3),
legend.position = "none")
df_fig5a0 <- test_results1 %>%
left_join(gkcg_data %>% distinct(id, gender), by = "id") %>%
mutate(gender = paste0(gender, "s"))
df_fig5a00 <- rbind(test_results1 %>% mutate(gender = "overall"),
df_fig5a0) %>%
drop_na(more_zarpie_responses_in) %>%
mutate(gender = str_to_title(gender),
gender = fct_relevel(gender, "Overall", "Boys", "Girls"))
df_fig5a <- df_fig5a00 %>%
group_by(gender, more_zarpie_responses_in) %>%
summarize(n = n(),
.groups = "drop") %>%
group_by(gender) %>%
mutate(prop = n / sum(n),
vjust_text = ifelse(n == 5, -0.3, 1.3))
fig5a <- ggplot(df_fig5a,
aes(x = more_zarpie_responses_in,
y = prop)) +
facet_grid(~ gender, scales = "free") +
geom_bar(stat = "identity", fill = "deepskyblue1") +
theme_classic(base_size = base_size, base_family = "Times New Roman") +
geom_text(aes(label = n, vjust = vjust_text),
color = "black",
family = "Times New Roman",
size = 4.25) +
labs(
x = "Response Profile",
y = "Proportion of Participants") +
scale_y_continuous(
labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
breaks = c(0, 1 / length(unique(df_fig5a$more_zarpie_responses_in))),
) +
theme(
plot.title = element_text(hjust = .5, face = "bold"),
legend.position = "none",
panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3)
)
Of the 684 test trial observations, 353 were predicted accurately (i.e., participants chose Zarpies in the generic condition and Flurps in the specific condition). In contrast, 331 observations were predicted inaccurately. Thus, our post-hoc odds of accurately predicting participants’ response in each test trial was 1.07. That is, for every 100 times that we incorrectly predicted participants’ behavior, we did so accurately 107 times.)
Models of the test phase observations.
# bernoulli model of group behavior
test_data_to_be_modeled <- gkcg_data %>%
drop_na(condition) %>%
mutate(response = fct_relevel(response, "flurp", "zarpie"),
age_num_stand = as.numeric(scale(age_num))) # standardize age
formyouluh <- brmsformula(response ~
condition * age_num_stand * gender + depicted_gender +
(condition + depicted_gender | id) + (1 | item))
mod_test <- brm(
formula = formyouluh,
data = test_data_to_be_modeled,
family = response_family,
chains = detectCores() - 1,
cores = detectCores() - 1,
prior = priors,
backend = "cmdstanr"
)
# categorical model of individuals' behavior
cat_dat <- test_results1 %>%
drop_na(more_zarpie_responses_in) %>%
left_join(gkcg_data %>% distinct(id, gender, age_num)) %>%
mutate(
cat_response = case_when( # recode for interpretability
more_zarpie_responses_in == "E" ~ "equal",
more_zarpie_responses_in == "G" ~ "generic",
more_zarpie_responses_in == "S" ~ "specific",
TRUE ~ NA_character_
),
age_num_stand = as.numeric(scale(age_num)
)
)
mod_test_cat <- brm(
formula = cat_response ~ gender * age_num_stand,
data = cat_dat,
family = categorical(refcat = "generic"),
chains = detectCores() - 1,
cores = detectCores() - 1,
prior = c(
prior(normal(0, 1.00), Intercept, dpar = muequal),
prior(normal(0, 1.00), Intercept, dpar = muspecific),
prior(normal(0, .8), b, dpar = muequal),
prior(normal(0, .8), b, dpar = muspecific)
),
backend = "cmdstanr"
)
# graphical checks
# plot(mod_test)
# plot(mod_test_cat)
# pp_check(mod_test, ndraws = 1000, "dens_overlay_grouped", group = "condition")
# pp_check(mod_test, ndraws = 1000, "dens_overlay_grouped", group = "gender")
# pp_check(mod_test, ndraws = 1000, "ecdf_overlay_grouped", group = "gender")
# pp_check(mod_test_cat, ndraws = 1000, "dens_overlay_grouped", group = "gender")
# pp_check(mod_test_cat, ndraws = 1000, "ecdf_overlay_grouped", group = "gender")
# p val of bernoulli model (mean)
func_bayesian_p_val(mod_test, responsevar = "response")
# p val of categorical model (category means)
yrep <- posterior_predict(mod_test_cat)
yrep_props <- apply(yrep, # matrix to vectorize function over
1, # rowwise vectorization
function(x) as.data.frame(prop.table(table(factor(x)))) %>%
mutate(Var1 = case_when(Var1 == 1 ~ attr(yrep, "levels")[1], # get the internal level assigned to 1
Var1 == 2 ~ attr(yrep, "levels")[2], # same for 2
Var1 == 3 ~ attr(yrep, "levels")[3], # same for 3
T ~ NA)))
yrep_props_tibble0 <- bind_rows(yrep_props, .id = "draw") %>% # convert list to tibble
as_tibble() %>%
rename(response = Var1, # rename vars
predicted_proportion = Freq)
yrep_props_tibble0 %>% # sometimes, posterior_predict will generate zero counts for a category within a simulated dataset. this quantifies the number of times that that happened.
count(response) %>%
mutate(zero_predicted = dim(yrep)[1] - n) # number of simulated datasets without that category represented
yrep_props_tibble <- yrep_props_tibble0 %>% # account for the zeros, above
complete(draw, response, fill = list(predicted_proportion = 0)) %>% # irrelevant for gkcg because there were zero simulated datasets with zero predicted observations of a category, but retaining for future robust reuse
left_join(
tibble(response = mod_test_cat$data[["cat_response"]]) %>% # get observed data
count(response, name = "observed_n") %>% # count up the observed responses by category
mutate(observed_proportion = observed_n / sum(observed_n))) %>% # compute observed proportion
mutate(prediction_exceeds_observation = ifelse(predicted_proportion > observed_proportion, 1, 0)) # compute whether predicted exceeds observed
mod_test_pvals <- yrep_props_tibble %>% group_by(response) %>% summarize(bayes_p_val = mean(prediction_exceeds_observation)) # compute category-wise bayesian p-values
print(mod_test_pvals)
Analysis of the posterior models of the test phase observations.
# epreds
n_bins <- 30 # number of bins for prediction grid
newdata <- crossing(
condition = unique(levels(mod_test$data$condition)),
depicted_gender = unique(levels(mod_test$data$depicted_gender)),
gender = unique(levels(mod_test$data$gender)),
age_num_stand = seq(from = min(mod_test$data$age_num_stand),
to = max(mod_test$data$age_num_stand),
length.out = n_bins))
epreds_test <- local(
{
set.seed(10003)
epred_draws(mod_test, newdata, re_formula = NA)
}
) %>%
ungroup() %>%
mutate(age_num = age_num_stand * sd(gkcg_data$age_num) + mean(gkcg_data$age_num))
func_summarize <- function(df, quantity, prop_gt_val) {
df %>%
summarize(
min = min({{quantity}}),
hdi_lo = hdi({{quantity}})[1],
mean = mean({{quantity}}),
hdi_hi = hdi({{quantity}})[2],
max = max({{quantity}}),
prop_gt = mean({{quantity}} > prop_gt_val),
prop_gt_char = sprintf("%.0f%%", 100 * prop_gt)
) %>%
mutate(across(where(is.double), ~round(., 3))) %>%
ungroup()
}
func_epreds_marginal <- function(variable){
# Capture variable name as string
var_name <- rlang::as_name(rlang::enquo(variable))
# Marginalize predictions
epreds_marginalized <- epreds_test %>%
group_by(.draw, {{variable}}) %>%
summarize(mean_epred = mean(.epred), .groups = "drop") %>%
mutate(
!!var_name := str_to_title(as.character({{variable}})) # dynamic rename
)
# Summarize marginal predictions
epreds_marginalized_summary <- epreds_marginalized %>%
group_by(!!sym(var_name)) %>% # use dynamic column
func_summarize(mean_epred, 0.5)
# Compute contrasts
epreds_marginalized_contrasts <- epreds_marginalized %>%
compare_levels(mean_epred, by = var_name) %>% # dynamic column name
ungroup()
epreds_marginalized_contrasts_summary <- epreds_marginalized_contrasts %>%
func_summarize(mean_epred, 0)
# Return list
return(list(
draws = epreds_marginalized,
draws_summary = epreds_marginalized_summary,
contrasts = epreds_marginalized_contrasts,
contrasts_summary = epreds_marginalized_contrasts_summary
))
}
# marginal gender results
results_gender <- func_epreds_marginal(gender) # both above chance, no effect
# marginal depicted gender results
results_depgender <- func_epreds_marginal(depicted_gender) # both above chance, no effect
# marginal condition results
results_condition <- func_epreds_marginal(condition) # only generic above chance, trendng towards predicted effect
# marginal age results
epreds_slope <- epreds_test %>%
group_by(.draw) %>%
summarize(age_slope = cov(age_num, .epred)/var(age_num),
.groups = "drop")
epreds_slope %>% # no marginal age effect (trending towards increasing zarpies w/ age)
func_summarize(age_slope, 0) %>%
ungroup()
# interaction (condition * gender)
epreds_intxn <- epreds_test %>%
group_by(.draw, condition, gender) %>%
summarize(mean_epred = mean(.epred),
.groups = "drop")
epreds_intxn_summary <- epreds_intxn %>% # marginal means of condition * gender cells
group_by(gender, condition) %>%
func_summarize(mean_epred, .5) # boys do the thing, girls do the opposite!!!! wow!
epreds_intxn_contrasts <- epreds_intxn %>%
group_by(gender) %>%
compare_levels(mean_epred, by = "condition") %>%
ungroup() %>%
rename(diff_mean_epred = mean_epred)
epreds_intxn_contrasts %>%
group_by(gender) %>%
func_summarize(diff_mean_epred, 0) # strong evidence for condition effect in boys, weaker but still notable evidence for the OPPOSITE effect in girls
epreds_intxn_contrasts %>% # marginal distribution over the interaction of gender and condition
compare_levels(diff_mean_epred, by = "gender") %>%
ungroup() %>%
rename(diff_diff_mean_epred = diff_mean_epred) %>%
func_summarize(diff_diff_mean_epred, 0)
# interaction (condition * age -- plotted next, below)
epreds_slope_condition <- epreds_test %>%
group_by(.draw, condition) %>%
summarize(age_slope = cov(age_num, .epred)/var(age_num),
.groups = "drop")
epreds_slope_condition %>% # participants more likely w/ age to choose zarpies in zarpie condition, some evidence that they become less likely to do that in the specific condition
group_by(condition) %>%
func_summarize(age_slope, 0) %>%
ungroup()
# interaction (condition * age -- plotted next, below)
epreds_slope_condition_gender <- epreds_test %>%
group_by(.draw, condition, gender) %>%
summarize(age_slope = cov(age_num, .epred)/var(age_num),
.groups = "drop")
epreds_slope_condition_gender %>% # participants more likely w/ age to choose zarpies in zarpie condition, some evidence that they become less likely to do that in the specific condition
group_by(gender, condition) %>%
func_summarize(age_slope, 0) %>%
ungroup()
# plot of condition * age interaction
epreds_age <- epreds_test %>%
group_by(.draw, age_num, condition) %>%
summarize(mean_epred = mean(.epred),
.groups = "drop")
epreds_age_marginal_overall <- epreds_age %>%
group_by(age_num, condition) %>%
summarize(ci_lo = quantile(mean_epred, .025),
mean = mean(mean_epred),
ci_hi = quantile(mean_epred, .975),
.groups = "drop")
epreds_age_gen <- epreds_test %>%
group_by(.draw, age_num, condition, gender) %>%
summarize(mean_epred = mean(.epred),
.groups = "drop")
epreds_age_marginal_gen <- epreds_age_gen %>%
group_by(age_num, condition, gender) %>%
summarize(ci_lo = quantile(mean_epred, .025),
mean = mean(mean_epred),
ci_hi = quantile(mean_epred, .975),
.groups = "drop")
df_fig3b_draws <- rbind(epreds_intxn %>% mutate(gender = paste0(gender, "s")),
results_condition$draws %>% mutate(gender = "overall"))
df_fig3b_summary <- rbind(epreds_intxn_summary %>% mutate(gender = paste0(gender, "s")),
results_condition$draws_summary %>% mutate(gender = "overall"))
list_dfs_fig3 <- list(draws = df_fig3b_draws, summary = df_fig3b_summary)
df_fig3b <- map(list_dfs_fig3, ~ .x %>%
mutate(
gender = factor(str_to_title(gender), levels = c("Overall", "Boys", "Girls")),
condition = str_to_lower(condition),
condition = case_when(
condition == "generic" ~ "Gen.",
condition == "specific" ~ "Spec.",
TRUE ~ NA_character_
)
))
fig3b <- ggplot(df_fig3b$draws,
aes(x = condition, y = mean_epred)) +
stat_halfeye(aes(fill = after_stat(y < 0.5)), .width = 0.95, slab_alpha = 0.8, slab_linewidth = 0.5, slab_color = "black") +
scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"), guide = "none") +
geom_text(data = df_fig3b$summary,
aes(x = condition, y = hdi_hi + .005, label = prop_gt_char),
nudge_x = .5,
size = 4.5,
vjust = -.6,
family = "Times New Roman",
fontface = "bold",
color = "grey35") +
geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
facet_grid( ~ gender) +
labs(x = "Condition",
y = "Predicted Mean Zarpies") +
scale_y_continuous(
labels = function(x) sub("^0+", "", format(x, trim = TRUE)), # trim the leading zero
breaks = c(.5, .75)
) +
theme_classic(base_size = base_size, base_family = "Times New Roman")
fig3a + fig3b
ggsave("figs/fig3.png", height = 4, width = 9)
epreds_age_marginal <- rbind(epreds_age_marginal_overall %>% mutate(gender = "overall"),
epreds_age_marginal_gen %>% mutate(gender = paste0(gender, "s"))) %>%
mutate(gender = factor(str_to_title(gender), levels = c("Overall", "Boys", "Girls")))
ggplot(epreds_age_marginal,
aes(x = age_num,
y = mean,
fill = condition,
color = condition)) +
facet_grid(~ gender) +
geom_ribbon(aes(ymin = ci_lo,
ymax = ci_hi),
alpha = 0.4, color = NA) +
geom_line(linewidth = 1.5) +
theme_classic(base_size = 17,
base_family = "Times New Roman") +
labs(x = "Age",
y = "Predicted Mean Zarpies",
color = "Condition", fill = "Condition") +
scale_color_manual(values = c("green4", "deepskyblue4"),
labels = c("Generic", "Specific")) +
scale_fill_manual(values = c("green3", "deepskyblue3"),
labels = c("Generic", "Specific")) +
geom_hline(yintercept = .5, linetype = 2) +
scale_y_continuous(labels = function(x) sub("^0+", "", format(x, trim = TRUE)), # remove leading zeros
breaks = c(.3, .5, .7)) +
scale_x_continuous(breaks = c(4, 5))
ggsave("figs/fig4.png", height = 3, width = 6)
# categorical model
newdata_test_cat <- crossing(gender = levels(mod_test_cat$data$gender),
age_num_stand = seq(from = min(mod_test_cat$data$age_num_stand),
to = max(mod_test_cat$data$age_num_stand),
length.out = n_bins)) # use crossing, not tibble, for intercepts-only predictions (epred_draws requires 1 row)
epreds_test_cat <- local({
set.seed(10003)
epreds <- epred_draws(mod_test_cat, newdata_test_cat)
}) %>%
ungroup()
epreds_cat_draws_overall <- epreds_test_cat %>%
group_by(.draw, .category) %>%
summarize(mean_epred = mean(.epred),
.groups = "drop")
epreds_cat_draws <- epreds_test_cat %>%
group_by(.draw, gender, .category) %>%
summarize(mean_epred = mean(.epred),
.groups = "drop")
epreds_cat_draws_overall_summary <- epreds_cat_draws_overall %>% # boys clearly only have generic profiles above chance; girls slight tendency for specific profiles but not as pronounced as boys
group_by(.category) %>%
func_summarize(mean_epred, 1/n_distinct(epreds_cat_draws$.category)) # chance = 1 / number of response categories
epreds_cat_draws_summary <- epreds_cat_draws %>% # boys clearly only have generic profiles above chance; girls slight tendency for specific profiles but not as pronounced as boys
group_by(gender, .category) %>%
func_summarize(mean_epred, 1/n_distinct(epreds_cat_draws$.category)) # chance = 1 / number of response categories
epreds_cat_contrasts <- epreds_cat_draws %>%
group_by(gender) %>%
compare_levels(mean_epred, by = ".category") %>%
ungroup() %>%
rename(diff_mean_epred = mean_epred)
epreds_cat_contrasts_summary <- epreds_cat_contrasts %>% # boys above chance for generic response profiles; girls not
group_by(gender, .category) %>%
func_summarize(diff_mean_epred, 0) # boys more likely to have generic profiles than specific or equal profiles; girls show no clear pattern
fig5b0 <- rbind(epreds_cat_draws_overall %>% mutate(gender = "overall"),
epreds_cat_draws %>% mutate(gender = paste0(gender, "s")))
fig5b00 <- rbind(epreds_cat_draws_overall_summary %>% mutate(gender = "overall"),
epreds_cat_draws_summary %>% mutate(gender = paste0(gender, "s")))
fig5b_list <- list(draws = fig5b0, summary = fig5b00)
fig5b_data <- map(fig5b_list, ~ .x %>%
mutate(across(c(.category, gender), ~ str_to_title(.)),
.category = case_when(.category == "Equal" ~ "E",
.category == "Generic" ~ "G",
.category == "Specific" ~ "S"),
.category = factor(.category, levels = c("G", "E", "S")),
gender = factor(gender, levels = c("Overall", "Boys", "Girls"))))
fig5b <- ggplot(fig5b_data$draws,
aes(x = .category, y = mean_epred)) +
stat_halfeye(aes(fill = after_stat(y < (1 / length(unique(fig5b_data$summary$.category))))),
.width = 0.95,
slab_alpha = 0.8,
slab_linewidth = 0.5,
slab_color = "black") +
scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"), guide = "none") +
geom_text(data = fig5b_data$summary,
aes(x = .category, y = hdi_hi + .005, label = prop_gt_char),
nudge_x = .5,
size = 3.5,
vjust = -.6,
family = "Times New Roman",
color = "grey35",
fontface = "bold") +
geom_hline(yintercept = 1 / length(unique(fig5b_data$summary$.category)), linewidth = .5, linetype = 2) +
facet_grid( ~ gender) +
labs(x = "Response Profile",
y = "Predicted Rate") +
scale_y_continuous(
labels = function(x) sub("^0\\.", ".", sprintf("%.2f", x)),
breaks = c(1 / length(unique(fig5b_data$draws$.category)),
2 * 1 / length(unique(fig5b_data$draws$.category)))
) +
theme_classic(base_size = base_size, base_family = "Times New Roman")
fig5a + fig5b
ggsave("figs/fig5.png", height = 4, width = 10)
The participant-level results, below (left), suggest that participants chose Zarpies more often in the generic condition than in the specific condition at rates above chance. In contrast, participants chose Zarpies more often in the specific condition, and at equal rates in both conditions, at rates below chance.
The results on the right suggest that participants were more likely to choose Zarpies more often in the generic condition than in the specific condition than the opposite (i.e., generic > specific vs specific > generic; top distribution) and compared to choosing Zarpies at equal rates in both condition (i.e., generic > equal; bottom distribution). Moreover, participants were similarly likely to choose Zarpies at rates similar in both condition compared to choosing Zarpies at higher rates in the specific than in the generic condition (i.e., equal vs. specific; middle distribution).
While all of these directional effects follow our predictions, the evidence for them ought to be stronger.
Here’s the control phase data.
df_fig6 <- descriptives("control") %>%
mutate(
trial_subtype = factor(case_when(trial_subtype == "conjugation_matching" ~ "Conjugation",
trial_subtype == "number_or_phrasal_matching" ~ "Number or Phrase",
T ~ NA_character_)),
response = str_to_title(response),
vjust_text = ifelse(n == 5, -0.3, 1.3))
fig6a <- ggplot(df_fig6, aes(x = response, y = prop)) +
facet_grid(~ trial_subtype, scales = "free") +
geom_bar(stat = "identity", fill = "deepskyblue") +
scale_y_continuous(labels = function(x) sub("^0\\.", ".", sprintf("%.1f", x)),
breaks = pretty_breaks(2)) +
geom_text(aes(label = n, vjust = vjust_text),
color = "black",
family = "Times New Roman",
size = 5) +
labs(x = "Response",
y = "Proportion of Participants") +
theme_classic(base_size = base_size, base_family = "Times New Roman") +
theme(panel.grid.major.y = element_line(color = "grey85", linewidth = 0.3))
Control phase model.
control_data0 <- gkcg_data %>%
filter(trial_type == "control") # Include only this trial type
control_data0_age <- control_data0 %>%
distinct(id, age_num) %>% mutate(age_num_stand = as.numeric(scale(age_num)))
control_data <- control_data0 %>%
left_join(control_data0_age)
mod_control <- brm(
formula = response ~ trial_subtype * gender * age_num_stand + depicted_gender +
(trial_subtype + depicted_gender | id),
data = control_data,
family = "bernoulli",
chains = detectCores() - 1,
cores = detectCores() - 1,
prior = priors,
backend = "cmdstanr"
)
# plot(mod_control)
# pp_check(mod_control, ndraws = 1000)
func_bayesian_p_val(mod_control, "response")
newdata_control <- crossing(trial_subtype = levels(mod_control$data$trial_subtype),
gender = levels(mod_control$data$gender),
depicted_gender = levels(mod_control$data$depicted_gender),
age_num_stand = seq(from = min(mod_control$data$age_num_stand),
to = max(mod_control$data$age_num_stand),
length.out = n_bins))
epreds_control <- local({
set.seed(10003)
epreds <- epred_draws(mod_control,
newdata_control,
re_formula = NA)
}) %>%
ungroup()
epreds_control_draws <- epreds_control %>%
group_by(.draw, trial_subtype) %>%
summarize(mean_epred = mean(.epred))
epreds_control_draws_summary <- epreds_control_draws %>%
group_by(trial_subtype) %>%
func_summarize(mean_epred, .5)
list_fig6b <- list(draws = epreds_control_draws, summary = epreds_control_draws_summary)
df_fig6b <- map(list_fig6b, ~ .x %>%
mutate(
trial_subtype = trial_subtype %>%
str_replace_all("_", " ") %>% # replace underscores with spaces
str_to_title() %>% # convert to title case
str_replace("al", "e") %>% # replace 'al' with 'e'
str_remove("Matching") %>% # remove 'Matching'
str_replace("Or", "or") # replace 'Or' with 'or'
)
)
fig6b <- ggplot(df_fig6b$draws, aes(y = mean_epred, x = trial_subtype)) +
facet_grid( ~ trial_subtype, scales = "free") +
stat_halfeye(
aes(fill = after_stat(y < 0.5)),
.width = 0.95,
slab_alpha = 0.8,
slab_linewidth = 0.5,
slab_color = "black"
) +
scale_fill_manual(values = c("TRUE" = "grey85", "FALSE" = "grey35"),
guide = "none") +
geom_text(
data = df_fig6b$summary,
aes(
y = hdi_hi + .03,
hjust = -.5,
label = prop_gt_char
),
size = 5,
family = "Times New Roman",
fontface = "bold",
color = "grey35"
) +
geom_hline(yintercept = .5, linewidth = .5, linetype = 2) +
labs(
y = "Predicted Mean Zarpies",
x = NULL
) +
scale_y_continuous(
labels = function(x)
sub("^0\\.", ".", sprintf("%.2f", x)),
breaks = c(.5, .7)
) +
scale_x_discrete(expand = expansion(mult = c(0.5, NA))) +
theme_classic(base_size = base_size, base_family = "Times New Roman") +
theme(
plot.subtitle = element_text(hjust = .5, size = base_size - 3),
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
plot.title = element_text(hjust = .5, face = "bold"),
)
fig6a + fig6b
ggsave("figs/fig6.png", height = 4, width = 9)
Now that we have estimates from our pilot data, let’s use them to inform the simulated population in a sample-size planning analysis.
NB: The planning analysis, below, assumes a simpler generative structure than what our exploratory analysis suggested, although the exploratory model structure is retained. In particular, the planning analysis below ignores the gender-condition interaction that was revealed by our exploratory analysis. I am working to integrate this interaction into the simulated population.
First, we define the generative process, including its design and effects. Then, we simulate the generative process.
We set the design parameters of the generative process. The function of each parameter is commented.
# Population size
n_population <- 15000
# Trial count from pilot data
n_trials <- max(gkcg_data$trial)
# Item count from pilot data (one unique item per trial)
items <- n_trials
# Conditions from pilot data
condition <- unique(gkcg_data %>%
filter(trial_type == "test") %>%
pull(trial_subtype)) %>%
droplevels()
# Lower bound of population age
age_lower_bound <- min(gkcg_data$age_num)
# Upper bound of population age
age_upper_bound <- max(gkcg_data$age_num)
# Randomly sample ages
ages <- runif(n_population, age_lower_bound, age_upper_bound)
# Depicted gender from pilot data
depicted_gender <- set_names( # deterministically tie each depicted gender to an item, like in the experiment (otherwise they would be orthogonal to the items, which was not the experimental design)
rep(
unique(
na.omit(
gkcg_data$depicted_gender
)
),
3),
seq_len(n_trials))
# Population genders
genders <- unique(gkcg_data$gender)
Then, we set the effect sizes of the generative process. These quantities are very important. They are the true effects. When we sample the population, we don’t know these effects. So, we have to infer them. Depending on the size of the effects, their number, grouping structure, and so on, we may need larger sample sizes to be adequately confident that we will find adequate evidence for our predictions. The effect sizes are on the logit scale.
# Observed trial exclusion rate from pilot data
trial_exclusion_rate <- gkcg_data %>%
summarise(n_ids = n_distinct(id),
test_n = sum(trial_type == "test")) %>%
transmute(exclusion_rate = (n_trials * n_ids - test_n) / test_n) %>%
pull()
# Estimated Zarpie probability in pilot sample generic condition (logit scale for simulation)
logit_baseline <- qlogis(results_condition$draws_summary %>% # qlogis for prob -> logistic
filter(condition == "Generic") %>%
pull(mean))
# Estimated condition effect (Estimated change in probability of choosing Zarpies in the specific condition)
logit_effect <- qlogis(results_condition$draws_summary %>% # qlogis for prob -> logistic
filter(condition == "Specific") %>%
pull(mean)) - logit_baseline
# Estimated Zarpie probability in pilot sample boys
# logit_baseline_gender <- qlogis(results_gender$epreds_marginalized_summary %>%
# filter(gender == "Boy") %>%
# pull(mean))
# Estimated gender effect
# logit_effect_gender <- qlogis(results_gender$epreds_marginalized_summary %>%
# filter(gender == "Girl") %>%
# pull(mean)) - logit_baseline_gender
# Estimated Zarpie probability in pilot sample depicted boys
# logit_baseline_depgender <- qlogis(results_depgender$epreds_marginalized_summary %>%
# filter(depicted_gender == "Depicted_boy") %>%
# pull(mean))
# Estimated depicted gender effect
# logit_effect_depgender <- qlogis(results_depgender$epreds_marginalized_summary %>%
# filter(depicted_gender == "Depicted_girl") %>%
# pull(mean)) - logit_baseline_depgender
# Estimated age effect
# Compute slope on the logit scale
# epreds_slope_logit <- epreds %>%
# group_by(.draw) %>%
# summarise(
# age_slope_logit = cov(age_standardized, qlogis(.epred)) / var(age_standardized),
# .groups = "drop"
# )
# results_age_logit <- epreds_slope_logit %>%
# func_summarize(age_slope_logit, 0) %>%
# ungroup()
# logit_age <- results_age_logit %>% pull(mean)
# SD of participant random intercept
logit_raneff_id_sd <- as_draws_df(mod_test) %>% # NOT using qlogis (cf above) because this is already in logits
select(sd_id__Intercept) %>%
summarize(mean = mean(sd_id__Intercept)) %>%
pull()
# SD of condition random slope
logit_raneff_id_condition <- as_draws_df(mod_test) %>% # NOT using qlogis (cf above) because this is already in logits
select(sd_id__conditionspecific) %>%
summarize(mean = mean(sd_id__conditionspecific)) %>%
pull()
# SD of item random intercept
logit_raneff_item_sd <- as_draws_df(mod_test) %>% # NOT using qlogis (cf above) because this is already in logits
select(sd_item__Intercept) %>%
summarize(mean = mean(sd_item__Intercept)) %>%
pull()
Next, we simulate the generative process. This simulates the population from which we will subsequently draw random samples.
# Simulate population of seq_len(n_population) * seq_len(n_trials) observations from seq_len(n_population) children
population_logistic_raw <- crossing(id = seq_len(n_population), # ID
trial = seq_len(n_trials)) %>% # Trial
group_by(id) %>% # Group
mutate(condition = factor(rep(condition, length.out = n_trials), # Condition
levels = unique(condition)),
age_num = ages[id], # Age
item = sample(items), # Items
gender = genders[id]) %>% # Gender
ungroup() %>%
mutate(
depicted_gender = depicted_gender[item], # Depicted gender tied to specific items (per study design)
gender = rep(genders, length.out = n_population)[id], # Gender
logit_raneff_id = rnorm(n_population, 0, logit_raneff_id_sd)[id], # ID random intercept
logit_raneff_item = rnorm(items, 0, logit_raneff_item_sd)[item], # Item random intercept
logit_raneff_id_condition = rnorm(n_population, 0, logit_raneff_id_condition)[id], # condition random slope
# logit_gender = case_when(gender == "boy" ~ logit_baseline_gender, # Logit (boy)
# gender == "girl" ~ logit_baseline_gender + logit_effect_gender, # Logit (girl)
# T ~ NA),
# logit_depgender = case_when(depicted_gender == "depicted_boy" ~ logit_baseline_depgender, # Logit (d. boy)
# depicted_gender == "depicted_girl" ~ logit_baseline_depgender + logit_effect_depgender, # Logit (d. girl)
# T ~ NA),
logit_condition = case_when(condition == "generic" ~ logit_baseline, # Logit (generic)
condition == "specific" ~ logit_baseline + logit_effect, # Logit (specific)
T ~ NA),
logit_condition_and_raneff = logit_condition + logit_raneff_id_condition * (condition == "specific"),
# logit_age = logit_age * as.numeric(scale(age_num)),
logit = # Response logit
logit_condition_and_raneff +
# logit_age +
# logit_gender +
# logit_depgender +
logit_raneff_id +
logit_raneff_item,
prob = plogis(logit), # Response probability
response = rbinom(n(), 1, prob), # Simulate responses
include = rbinom(n(), 1, 1 - trial_exclusion_rate)) # Simulate exclusions
sum(is.na(population_logistic_raw)) # Count NAs (should equal zero)
Next, we remove generative parameters from our dataset. That is, we create a dataset that will contain all and only the variables that we will have in our empirical dataset gathered by our experiment.
vars <- all.vars( as.formula( as.character(formyouluh)[1] ) )
population_logistic <- population_logistic_raw %>% # Simulated population with unknown parameters
select(setdiff(vars, "age_num_stand"), # Columns to retain
age_num,
include)
setdiff(names(population_logistic_raw), names(population_logistic)) # Check which parameters have been removed
Now that we have simulated a population and created a dataframe with unknown parameters, we randomly sample from that population many times over. There is some parallel computation going on here that will enable your machine to draw several random samples simultaneously. This always saves time on computation, but the time savings will become noticeable only when drawing thousands of samples.
First, we set the parameters for our random samples. This includes parameters for the number of random samples per sample size, the minimum and maximum sample sizes, and the “sample size delta.”
The latter, delta parameter controls the step size between samples. For example, if the sample size delta is set to 20 and the minimum and maximum sizes are 20 and 100, respectively, then the algorithm will produce samples of size 20, 40, 60, 80, and 100.
Be conscientious of the values that you set for your these parameters. The stakes aren’t high, but you can mistakenly produce samples of unexpected size. For instance, a delta of 20 with minimum and maximum sample sizes of 10 and 100, respectively, will produce samples of size 10, 30, 50, 70, and 90 (not 100).
# Population sampling parameters
n_samples_per_sample_size <- 10 # Number of random samples per sample size
min_sample_size <- 50 # Minimum sample size
max_sample_size <- 100 # Maximum sample size
granularity <- 10 # Sample size delta
Now that we have set the parameters of our random samples, lets check the unique samples, sample sizes, and number of sample sizes that we will collect.
samples <- rep( # unique samples to be collected
seq(min_sample_size, # Minimum sample size
max_sample_size, # Maximum sample size
by = granularity), # Sample size delta
each = n_samples_per_sample_size) # Number of samples per sample size
length(samples) # sample sizes that we will collect
unique(samples) # sample sizes that we will collect
n_distinct(samples) # number of sample sizes that we will collect
Next, we parallelize our computation. The following code tells R to utilize your machine’s n - 1 cores. The number of cores is the number of independent processes (minus 1) that your machine can perform simultaneously. For instance, my machine has 8 cores. Thus, the code tells my machine to use 7 cores to draw random samples. For a fixed number of samples, this speeds up the sampling process 7-fold. So that I don’t melt my machine, I usually leave a spare core for other computations, like watching Youtube videos or texting my mom.
plan(multisession, workers = detectCores() - 1) # Parallelization
print(plan(multisession)) # Print parallelization plan
We have set the parameters of our random sampling algorithm and checked the samples that we will collect from the generative process. We are ready to draw our samples. If you have set a large number of samples to be collected (e.g., thousands), this may take several seconds to several minutes. Your machine’s hardware (e.g., cores) will also play a role in determining the algorithm’s compute time. The following code will produce a list of samples with length equal to {r length(samples)}. That is, each element of the list will be one random sample from your simulated population. Each random sample will contain all and only the variables that you plan to use in your model.
# Draw random samples from our simulated population
list_samples <- local(
{ # Do all of the following for each sample
start_time <- Sys.time() # Record start time
list_samples <- future_map(samples, # Sample sampler
~ { # Do
# Sample participants
sampled <- population_logistic %>%
filter(id %in% sample(unique(id), size = .x, replace = FALSE)) %>%
ungroup() %>%
filter(include == 1) %>%
select(-include)
# Compute participant-level age mean and SD
age_stats <- sampled %>%
distinct(id, age_num) %>% # one row per participant
mutate(age_num_stand = as.numeric(scale(age_num))) %>% # standardize age
select(id, age_num_stand)
# Standardize age per participant
sampled %>%
left_join(age_stats, by = "id")
},
.options = furrr_options( # Import from global environment
globals = list(population_logistic = population_logistic),
packages = c("tidyverse"),
seed = seed),
.progress = TRUE) # Report progress
end_time <- Sys.time() # Record end time
print(end_time - start_time) # Print runtime
return(list_samples) # Return list of random samples
}
)
Now that you have generated your random samples, I suggest a cursory check of the object and the structure of each sample, just to ensure that everything checks out aight.
length(list_samples) # Check the number of samples drawn (will equal length(samples))
list_samples[[1]] # Check random sample [[1]]
test_data_to_be_modeled
# The set of pilot variables properly includes the set of simulated variables
setdiff(names(list_samples[[1]]), names(test_data_to_be_modeled)) # Should be an empty vector because all vars in simulated data contained in pilot data
setdiff(names(test_data_to_be_modeled), names(list_samples[[1]])) # Pilot data contains X vars not in simulated data
Now, we iterate a modeling function over each simulated dataset. In the attached code, the name of the modeling function to be iterated is “func_brm.” Running this function on a dataset produces a joint posterior distribution, i.e., one posterior model of that dataset. We go further by constructing a conditional if-then statement, called “func_model_fitting,” that says: for each iteration of func_brm, if the posterior model has more than one divergence, has any r-hat equal to 1.05 or greater, or (inclusive) includes a warning about effective sample size, then refit the model according to the correspondingly modified sampler call (see code); otherwise, return the posterior model. So, technically, this conditional if-then statement is iterated over, once per sample. The output is a list of posterior models, each fitted to one of our simulated datasets.
NB: “func_model_fitting” iterates only once per simulated dataset, so, in principle, a model that is flagged for modified resampling of its posterior may possess the same (or another) pathology in its refitted format. I am working to fix this by iterating the algorithm until the pathology is removed, so that all models are satisfactory. Thus far, I have found this issue to be more technical than practical, as typically zero, or nearly zero, models have any issues whatsoever.
# Model sampler parameters (in effect, "tune" the iterative algo)
nuts_delta <- .85 # Adapt delta
n_iter_val <- 1200 # Iterations per chain
func_brm <- function(formula = formyouluh,
datty = .x, # Data
fam = response_family, # Response family
n_chains = 2, # Sampler chains
n_iter = n_iter_val, # Posterior sampler per chain
control_specs = list(adapt_delta = nuts_delta), # Control specs
priors_in_func = priors){
# Model
brm(formula = formula,
data = datty,
family = fam,
chains = n_chains,
cores = n_chains,
iter = n_iter,
control = control_specs,
prior = priors_in_func,
backend = "cmdstanr",
seed = seed)
}
func_model_fitting <- function(.x, n_chains){
fit <- func_brm(datty = .x) # Fit base model
if (sum(subset(nuts_params(fit), Parameter == "divergent__")$Value) > 1) { # Check divergences
fit <- func_brm(datty = .x, # Refit divergent models by decreasing sampler step size
control_specs = list(adapt_delta = nuts_delta + .10))}
if (any(map_dbl(rhat(fit), ~ .x) > 1.05)) { # Check chain mixing
fit <- func_brm(datty = .x, # Refit unmixed models by increasing iterations
control_specs = n_iter_val + (0.5 * n_iter_val))}
warnings <- capture.output(summary(fit), type = "message")
if (any(grepl("The ESS has been capped to avoid unstable estimates", warnings))) { # Check stability
fit <- func_brm(datty = .x, # Refit instable models with more chains, more iterations
chains = n_chains + 1,
control_specs = n_iter_val + (0.5 * n_iter_val))}
return(fit) # Return fitted model
}
n_workers <- 3 # Parallelization (n models)
print(plan(multisession, workers = n_workers)) # Check plan (cores = n_workers * n_chains)
mods_list <- local({
start_time <- Sys.time() # Start time
mods <- future_map(list_samples, ~ func_model_fitting(.x, n_chains), # Model list
.options = furrr_options(
globals = list(func_brm = func_brm,
func_model_fitting = func_model_fitting,
seed = seed,
nuts_delta = nuts_delta,
n_iter_val = n_iter_val,
formyouluh = formyouluh,
priors = priors,
response_family = response_family),
packages = c("brms"),
seed = seed),
.progress = TRUE)
end_time <- Sys.time() # End time
print(end_time - start_time) # Print runtime
return(mods) # Return model list
}
)
# prior_summary(mods_list[[1]]) # ensuring that priors were passed correctly to models
Here we iterate the planned analyses over each posterior model. This produces the statistics that we use to inform our decision about sample size. First, we perform a cursory check of the models by checking the number of divergences. The divergences in each model is contained in “divergences.” After that, we analyze our models. The results of our analysis are contained in “results_list” and summarized in “results_summaries.”
# Divergences
divergences <- map_dfr(
mods_list, # Compute the below for each model in mods_list
~ {
nuts <- nuts_params(.x$fit) # Sampler parameters
tibble(
divergences = sum(nuts$Value[nuts$Parameter == "divergent__"]), # Divergences
postwarm_iter = ndraws(.x), # Post warmup iterations
prop_divergent = sum(divergences) / sum(postwarm_iter), # Proportion of iterations that were divergences
sample_size = length(unique(.x$data$id))) # Sample size
}
)
sum(divergences %>% pull(divergences)) # N divergences across all models
# Sample the posterior of one model per population sample
results_list <- map(mods_list,
~ {
# Prediction grid
newdata_planning <- crossing(
condition = unique(levels(.x$data$condition)),
depicted_gender = unique(levels(.x$data$depicted_gender)),
gender = unique(levels(.x$data$gender)),
age_num_stand = seq(from = min(.x$data$age_num_stand),
to = max(.x$data$age_num_stand),
length.out = 20))
# Sample size
sample_size <- length(unique(.x$data$id))
# Reproducible subsampling of posterior expected value
epreds <- local(
{
set.seed(10003) # RNG seed
epred_draws(.x, # draws of the posterior expected value
newdata = newdata_planning,
re_formula = NA,
ndraws = 500)}) %>%
ungroup() %>% # Ungroup computations
mutate(sample_size = sample_size) # Sample size
# Draws of the posterior expected value (not stored)
epreds_marginal <- epreds %>%
group_by(.draw, condition) %>% # Marginalize the posterior expected across gender
summarize(mean_epred = mean(.epred), # Compute mean marginal posterior expected value
.groups = "drop") %>%
mutate(sample_size = sample_size) # Sample size
# Summary of the marginalized posterior expected value
epreds_marginal_summary <- epreds_marginal %>%
group_by(condition) %>% # Group
summarize(min = min(mean_epred), # Minimum posterior expected value
hdi_lo = hdi(mean_epred)[ncol(hdi(mean_epred)) - 1], # Rightmost lower bound of 95% HDI of posterior expected value
mean_mean_epred = mean(mean_epred), # Mean posterior expected value
median_mean_epred = median(mean_epred), # Median posterior expected value
hdi_hi = hdi(mean_epred)[ncol(hdi(mean_epred))], # Rightmost upper bound of 95% HDI of posterior expected value
max = max(mean_epred), # Maximum posterior expected value
hdi_diff = hdi_hi - hdi_lo, # Size of the support of the 95% HDI
prop_gt = mean(mean_epred > 0.5), # Proportion of posterior draws with values greater than chance
.groups = "drop") %>% # Ungroup computations
mutate(sample_size = sample_size) # Sample size
# Draws of contrasts of the marginalized posterior expected value (not stored)
epreds_marginal_contrasts <- epreds_marginal %>%
compare_levels(mean_epred, by = "condition") %>% # Compare draws of posterior expected value across levels
ungroup() %>% # Ungroup computations
rename(diff_mean_epred = mean_epred) %>% # Rename computed value
mutate(sample_size = sample_size) # Sample size
# Summary of contrasts of the marginalized posterior expected value
epreds_marginal_contrasts_summary <- epreds_marginal_contrasts %>%
summarize(min = min(diff_mean_epred), # Minimum posterior expected value
hdi_lo = hdi(diff_mean_epred)[ncol(hdi(diff_mean_epred)) - 1], # Rightmost lower bound of 95% HDI of the posterior expected value
mean_diff_mean_epred = mean(diff_mean_epred), # Mean posterior expected value
median_diff_mean_epred = median(diff_mean_epred), # Median posterior expected value
hdi_hi = hdi(diff_mean_epred)[ncol(hdi(diff_mean_epred))], # Rightmost upper bound of 95% HDI of the posterior expected value
hdi_diff = hdi_hi - hdi_lo, # Size of the support of the 95% HDI
max = max(diff_mean_epred), # Maximum posterior expected value
prop_gt = mean(diff_mean_epred > 0), # Proportion of posterior draws with values greater than chance
.groups = "drop") %>%
mutate(sample_size = sample_size) # Sample size
# Collect computed elements
return(list(epreds = epreds,
# epreds_marginal = epreds_marginal, # exclude draws from the list because it'll become astronomically large when scaled up
epreds_marginal_summary = epreds_marginal_summary,
# epreds_marginal_contrasts = epreds_marginal_contrasts, # exclude draws from the list because it'll become astronomically large when scaled up
epreds_marginal_contrasts_summary = epreds_marginal_contrasts_summary))
}
)
# Collect summaries of each model's distribution of the marginalized expected value
critical_hyfer_value <- 10
results_summaries <- map_dfr(results_list, ~ {
bind_rows(.x$epreds_marginal_summary,
.x$epreds_marginal_contrasts_summary %>% mutate(condition = "contrast")) %>%
mutate(mean = case_when(is.na(mean_mean_epred) ~ mean_diff_mean_epred,
is.na(mean_diff_mean_epred) ~ mean_mean_epred,
T ~ NA),
median = case_when(is.na(median_mean_epred) ~ median_diff_mean_epred,
is.na(median_diff_mean_epred) ~ median_mean_epred,
T ~ NA),
hdi_exclude_chance = case_when(hdi_lo > 0.5 & condition %in% c("generic", "specific") ~ 1,
hdi_lo > 0 & condition == "contrast" ~ 1,
hdi_hi < 0.5 & condition %in% c("generic", "specific") ~ 1,
hdi_hi < 0 & condition == "contrast" ~ 1,
T ~ 0),
odds = prop_gt / (1 - prop_gt), # Compute odds
hyfer = ifelse(odds < 1, 1 / odds, odds), # Compute the hypothesis free evidence ratio
hyfer = ifelse(hyfer == Inf, 1000, hyfer), # Upper bound on HYFER
hyfer_gt = ifelse(hyfer > critical_hyfer_value, 1, 0), # Critical HYFER value
note = ifelse(condition == "contrast",
"prop_gt = prop greater than 0",
"prop_gt = prop greater than .5")) %>%
select(-c(mean_mean_epred, mean_diff_mean_epred,
median_mean_epred, median_diff_mean_epred)) %>%
relocate(mean, .before = hdi_hi) %>%
relocate(median, .before = hdi_hi)
},
.id = "model"
) %>%
mutate(model = as.numeric(model))
# summarize the summaries of each model's distribution of the marginalized expected value
# results_summaries %>%
# group_by(condition, sample_size) %>%
# summarize(mean_mean = mean(mean), # average expected value across models
# mean_prop_gt = mean(prop_gt), # average prop_gt across models
# median_prop_gt = median(prop_gt), # median prop_gt across models
# sd_prop_gt = sd(prop_gt), # sd prop_gt across models
# mean_hdi_diff = mean(hdi_diff)) %>% print(n = Inf)
# results_summaries %>%
# group_by(condition, sample_size) %>%
# mutate(prop_gt_indicator = case_when(condition == "contrast" & prop_gt > 0 ~ 1,
# condition %in% c("specific", "generic") & prop_gt > 0.5 ~ 1,
# T ~ 0)) %>%
# summarize(prop_prop_gt_indicator = mean(prop_gt_indicator)) %>% print(n = Inf)
Here are visualizations of the results summaries.
# mean of posterior marginalized expected value
ggplot(data = results_summaries,
mapping = aes(x = sample_size, y = median)) +
stat_slabinterval(aes(fill = condition),
slab_alpha = .40, .width = c(0.5, 0.95),
point_interval = "median_hdi",
position = position_dodge(width = 8)) +
theme_classic(base_size = 18, base_family = "Times New Roman") +
theme(panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.3)) +
geom_hline(yintercept = 0, linetype = 3) +
geom_hline(yintercept = 0.5, linetype = 3) +
scale_y_continuous(labels = function(x) sub("^0", "", sprintf("%.1f", x)), breaks = pretty_breaks(10)) +
labs(title = "Median of Marginal Posterior Expected Value", subtitle = "Mean, 50% and 95% Intervals", x = "Sample Size", y = "Marginal Posterior Expected Value")
# Dispersion of posterior marginal expected value
ggplot(data = results_summaries,
mapping = aes(x = sample_size, y = hdi_diff)) +
geom_smooth(data = results_summaries,
aes(color = condition),
se = FALSE) +
stat_slabinterval(aes(fill = condition),
slab_alpha = .40, .width = c(0.5, 0.95),
point_interval = "median_hdi",
show.legend = FALSE,
position = position_dodge(width = 8)) +
theme_classic(base_size = 18, base_family = "Times New Roman") +
theme(panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.3)) +
scale_y_continuous(labels = function(x) sub("^0", "", sprintf("%.2f", x))) +
labs(title = "Posterior Dispersion of Expected Value", subtitle = "Mean, 50% and 95% Intervals", x = "Sample Size", y = "Size of HDI of Model-Averaged Expected Values (Probability Scale)")
plots_df <- results_summaries %>%
group_by(condition, sample_size) %>%
summarize(mean_prop_gt = mean(prop_gt),
sd_prop_gt = sd(prop_gt),
prop_hdi_exclude_chance = mean(hdi_exclude_chance),
prop_hyfer_gt = mean(hyfer_gt)) %>%
ungroup() %>%
mutate(sd_lo = mean_prop_gt - sd_prop_gt,
sd_hi = mean_prop_gt + sd_prop_gt) %>% print(n = Inf)
# Proportions
critical_prop_value <- .80 # cutoff value (a la power)
ggplot(data = plots_df %>% pivot_longer(cols = c(prop_hdi_exclude_chance, prop_hyfer_gt)),
aes(
x = sample_size,
y = value,
color = condition,
linetype = name,
group = interaction(condition, name)
)) +
geom_line() +
geom_point(size = 3) +
theme_classic(base_size = 18, base_family = "Times New Roman") +
theme(panel.grid.major.x = element_line(color = "lightgrey", linewidth = 0.3)) +
scale_y_continuous(labels = function(x) sub("^0", "", sprintf("%.1f", x))) +
labs(title = "Binary Statistics",
x = "Sample Size",
y = "Proportion of Posterior Models",
subtitle = paste0(n_samples_per_sample_size, " Samples Per Size Sample"),
color = "Condition",
linetype = "Measure"
) +
geom_hline(yintercept = critical_prop_value, linetype = 2) +
scale_linetype_manual(
values = c("prop_hdi_exclude_chance" = "solid",
"prop_hyfer_gt" = "dashed"),
labels = c("prop_hdi_exclude_chance" = "HDI Excludes Chance",
"prop_hyfer_gt" = paste0("HYFER > ", critical_hyfer_value)))
sessionInfo() # print session information
end <- Sys.time() # record endtime
print(end - start) # runtime