This document reports two preparatory steps for the quantitative analysis of a future experiment called “GKCG.” The first step conducts a sample size planning analysis using simulated data. The second step creates a pipeline for the analysis of the future empirical sample. The results of both steps will inform a preregistration of GKCG. Both steps assume Bernoulli distributed outcomes with hierarchical structure. Specifically, the sample size planning analysis and the analysis pipeline both assume a hierarchical logistic regression, while the analysis pipeline also includes code for a hierarchical Bernoulli mixture model. The hierarchical logistic regression enables inference about group means; the mixture model enables inference about putative psychological mechanisms. Given the ubiquity of Bernoulli distributed data, the code for both steps invites reapplication to other studies, with appropriate modification (e.g., of the simulated population parameters). At present, this document is written for the author's own future reuse rather than as a learning tool. However, the thorough reporting and line comments may be useful for those learning to implement Bayesian models.
The modeling approach adopted below is one of model expansion, rather than model comparison. Model expansion calls for building probability models that incorporate as many sources of uncertainty as one’s data contain information about (i.e., theoretically motivated parameters, demographic parameters, and other covariates). Appropriately informative, neutral priors fill the gap where the data are uninformative. Prior information maintains the fidelity of the sampler while simultaneously protecting against overfitting sparse data. The maximally expanded model is used for inference.
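To preview what an expanded model might look like in brms, the following is a minimal sketch. The formula, prior scales, and the data frame `sample_data` are assumptions for illustration, not the final GKCG model.
# a minimal sketch, assuming an illustrative formula and prior scales;
# `sample_data` is a hypothetical data frame of the form simulated below
fit <- brm(response ~ condition + (condition | id) + (1 | trial), # expanded structure
           family = bernoulli(), # Bernoulli outcomes, logit link
           prior = c(prior(normal(0, 1), class = Intercept), # neutral, weakly informative
                     prior(normal(0, 1), class = b),
                     prior(exponential(2), class = sd)), # shrinks sparse group-level sds
           data = sample_data)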
Inferences rely on the posterior odds of directional hypotheses about the (marginalized) expected value. Good psychological theory provides directional predictions. For example, theory T predicts that the mean rate of response R is greater in condition 1 than in condition 2. Whereas two-tailed p-values do not investigate this kind of prediction, the posterior odds of directional hypotheses does. Below, it is demonstrated that, given uniform priors symmetric about zero, the posterior odds is mathematically equivalent to a Bayes factor for directional hypotheses. However, despite this mathematical equivalence, the posterior odds has the pleasing property of being relatively invariant to prior dispersion, whereas the Bayes factor is not.
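To make the equivalence concrete (a sketch using standard Bayes factor algebra, with $\delta$ denoting the directional quantity of interest), consider $H_{+}: \delta > 0$ and $H_{-}: \delta < 0$:

$$
\mathrm{BF}_{+-} = \frac{p(y \mid H_{+})}{p(y \mid H_{-})} = \frac{p(H_{+} \mid y) \, / \, p(H_{+})}{p(H_{-} \mid y) \, / \, p(H_{-})}.
$$

Under a prior symmetric about zero, $p(H_{+}) = p(H_{-}) = 1/2$, so the prior odds cancel and $\mathrm{BF}_{+-}$ equals the posterior odds $p(H_{+} \mid y) / p(H_{-} \mid y)$. From posterior draws, the posterior odds can be estimated as the ratio of the proportions of draws on each side of zero; for example, assuming a fitted model like the sketch above with a coefficient named `b_conditiongeneric`:
draws <- as_draws_df(fit) # posterior draws as a data frame
mean(draws$b_conditiongeneric > 0) / mean(draws$b_conditiongeneric < 0) # posterior odds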
Import packages, set the seed for the random number generator, and set the cores used for sampling the posterior.
# packages
library(brms) # models
library(furrr) # parallelization
library(future) # parallelization
library(janitor) # cleaning
library(knitr) # reporting
library(parallel) # parallelization
library(scales) # plots
library(tidybayes) # analysis
library(tidyverse) # wrangling
# seed
set.seed(10003) # replicability
# cores
n_cores <- detectCores() - 1 # parallelization
The first preparatory step is sample size planning. The goal of this step is to choose a sample size based on the probability that it will yield a dataset that adequately informs the posterior odds. To do this, we simulate a large population, approximately the size of the target population (e.g., the PANDA database). This simulated population is characterized by known parameter values. After simulating our population, we draw random samples of varying size. Each sample size is drawn many times over, with each draw taken independently of the others and with replacement from the population. For example, we will draw 100 random samples at each of the sizes 20, 40, …, 200, producing 100 * 10 = 1,000 random samples from our simulated population (a sketch of this repeated-sampling step follows the population-simulation code below).
# design
n_population <- 15000 # declare population size
n_trials <- 6 # declare trial count
conditions <- c("specific", "generic") # declare conditions
ages <- runif(n_population, 4, 6) # generate continuous age
genders <- c("boy", "girl") # declare genders
# effects
logit_effect <- 0.4 # condition effect size (logit scale)
logit_raneff_id_sd <- 0.1 # sd of individual response probability
logit_raneff_id_condition_sd <- 0.1 # sd of individual variation in effect size
logit_raneff_trial_sd <- 0.1 # sd of trial response probability
logit_raneff_item_sd <- 0.1 # sd of item response probability (unused in this trial-only design)
# population
population_logistic_raw <- crossing(
  id = seq_len(n_population), # participant index
  trial = seq_len(n_trials)) %>% # trial index within participant
  group_by(id) %>%
  mutate(condition = rep(conditions, length.out = n_trials), # alternate conditions within participant
         age = ages[id]) %>%
  ungroup() %>%
  mutate(gender = rep(genders, length.out = n_population)[id], # alternate genders across participants
         logit_raneff_id = rnorm(n_population, 0, logit_raneff_id_sd)[id], # participant-level intercepts
         logit_raneff_id_condition = rnorm(n_population, 0, logit_raneff_id_condition_sd)[id], # participant-level effect deviations
         logit_raneff_trial = rnorm(n_trials, 0, logit_raneff_trial_sd)[trial], # trial-level intercepts
         logit_condition = case_when(condition == "generic" ~ 0.5 * logit_effect,
                                     condition == "specific" ~ -0.5 * logit_effect,
                                     TRUE ~ NA_real_),
         logit_condition_re = logit_condition + if_else(condition == "generic", # individual-specific effects
                                                        logit_raneff_id_condition,
                                                        -logit_raneff_id_condition),
         logit = logit_raneff_id + logit_raneff_trial + logit_condition_re, # sum effects on the logit scale
         prob = plogis(logit), # map logits to probabilities
         response = rbinom(n(), 1, prob)) # draw Bernoulli responses
sum(is.na(population_logistic_raw)) # ensure 0 NAs
# simulated population with unknown parameters
population_logistic <- population_logistic_raw %>% # cull population parameters
  select(id, trial, condition, age, gender, response) # keep observables only
setdiff(names(population_logistic_raw), names(population_logistic)) # check parameters removed
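With the simulated population in hand, the repeated-sampling step described above might look like the following sketch. The names `sample_sizes`, `n_draws`, and `samples_by_size` are illustrative, and sampling participant ids without replacement within a draw, independently across draws, is one reasonable reading of the design.
# a minimal sketch of the repeated-sampling design, with illustrative names
sample_sizes <- seq(20, 200, by = 20) # 10 candidate sample sizes
n_draws <- 100 # independent draws per sample size
samples_by_size <- crossing(size = sample_sizes, draw = seq_len(n_draws)) %>% # 1,000 draws
  mutate(data = map(size, function(s) { # draw s participants per sample
    sampled_ids <- sample(unique(population_logistic$id), s)
    filter(population_logistic, id %in% sampled_ids)
  }))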