This RPub documents a pipeline for simulation-based Bayesian sample size planning. The pipeline uses tools from the tidyverse for data wrangling, brms for Bayesian modeling, and tidybayes for posterior analysis.
The planning pipeline comprises two components: an analysis pipeline, and a simulation that iterates that analysis pipeline over simulated data. The results of the second component inform one’s choice of sample size.
The first component is the analysis pipeline that one will use to analyze the data from one’s experiment. This component is created before or after collecting pilot data and is used to analyze those data. In turn, the results of the pilot analysis inform the second component of the planning pipeline. In the second component, one simulates a population using parameter values inferred from the pilot data. Then, this simulated population is randomly sampled many times over while systematically increasing the sample size. The analysis pipeline from the first component is then applied once per sample. The results of this pilot-informed, iterative procedure are the quantities used for sample size planning.
The use of pilot data in this pipeline is intended to reduce the uncertainty about effect sizes that complicates power analysis. Both components should be preregistered. Specifically, one should preregister one’s main inference pipeline and the results of one’s sample size simulation.
Importantly, several code chunks in this pipeline may be repurposed with little or no modification; other chunks require modification to fit one’s experimental design. Repurposing this pipeline requires nontrivial R literacy, but the extensive documentation and commenting should lower that bar.
This pipeline collects the statistical practices that I have found useful in my own research. In part, these practices implement my effort to maximize the generalizability of my inferences in light of the sparse data characteristic of my studies and of experimental developmental psychological research generally. Toward this end, this pipeline assumes that investigators adopt a posture of model expansion, building probabilistic models that account for numerous sources of uncertainty rather than comparing them against simpler models; the use of marginal rather than conditional probabilities; and prediction via expected values rather than inference on parameters.
Moreover, this pipeline implements my effort to faithfully translate my scientific hypotheses into statistical hypotheses. As I see it, this is the only way to conduct statistical tests that logically, rather than merely conventionally, speak to the content of the conclusions that I want to advance. Typically, I (and many others) advance directional hypotheses that beget directional predictions, like observed mean A > observed mean B. Then, I advance directional conclusions based on the results of my directional tests. Specifically, this pipeline measures the evidence for the hypothesized directional effect against the alternative directional effect that was not hypothesized (for a comparison of Bayesian directional testing to one-tailed p-values, see this; for a recent revival of one-tailed p-values, see this thang). That is, this pipeline treats the nil null hypothesis as fake news: it is quasi-always false and so is ignored here.
Specifically, this pipeline measures the posterior odds of directional hypotheses. If both directional hypotheses are a priori equiprobable, then their posterior odds are equivalent to a Bayes factor that quantifies the evidence for one or the other directional hypothesis. Notably, the posterior odds of directional hypotheses qua Bayes factor avoid the sensitivity to prior dispersion characteristic of model comparison Bayes factors or Bayes factors computed via the Savage-Dickey density ratio. This is because the posterior odds of direction are an “objective” property of the posterior: symmetric changes to prior dispersion about zero “cancel each other out,” so to speak. In practice, I have found the posterior odds to be essentially unaffected by prior dispersion until one sets unreasonably small dispersions, at which point the posterior odds collapse to 1. Therefore, while priors can encode theory, in the present pipeline priors are tools for identifiability, regularization, and robust estimation in rich, hierarchical, probabilistic models. Moreover, the posterior odds are computed from the posterior distribution over the marginalized expected value of the response, as opposed to the parameters. That is, the quantity used for inference is averaged across all of the other covariates’ modeled uncertainty.
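As a minimal illustration of this quantity, here is a sketch computed on a single coefficient for brevity (the pipeline below uses the marginalized expected value of the response instead; the fitted model object fit and the coefficient name are hypothetical):
draws <- as_draws_df(fit) # Posterior draws (as_draws_df() is re-exported by brms)
p_positive <- mean(draws$b_conditiongeneric > 0) # P(effect > 0 | data)
posterior_odds <- p_positive / (1 - p_positive) # Odds of the hypothesized vs. alternative direction
hypothesis(fit, "conditiongeneric > 0") # brms reports the same ratio as its evidence ratio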
Quick introductions to Bayesian inference can be found here and here. In addition to the Kruschke and Gelman groups, there’s also the Wagenmakers group (e.g., here is that group’s Bayesian sample size planning pipeline for developmental psychologists). Personally, I much prefer the Kruschke and Gelman groups; Gelman et al.’s Bayesian Data Analysis has been a useful learning tool for me. The Wagenmakers group relies too heavily on Bayes factors, which are overly sensitive to the prior (see above), though they nonetheless offer important benefits over p-values (summarized here; e.g., optional stopping). Here’s an introduction to brms by the creator of brms. Bayesian workflows can be found here and here. Kruschke has a baller, hella thorough article on Bayesian reporting guidelines.
Conceivably, this pipeline could develop into a simple web interface or R package, like the Frequentist SuperPower. In the future, I envision a user experience that allows one to easily specify various population parameters, parameters for sampling from the population, model structures and priors, and statistical quantities to be computed from the posterior models to be used for sample size planning.
We start by loading our packages and setting our globals. The general function of each package is commented. The seed tells the random number generator where to begin its process, which makes the simulation reproducible.
library(brms) # Models
library(furrr) # Parallelization
library(future) # Parallelization
library(janitor) # Cleaning
library(knitr) # Reporting
library(parallel) # Parallelization
library(scales) # Plots
library(tidybayes) # Analysis
library(tidyverse) # Wrangling
seed <- 10003 # RNG seed
set.seed(seed) # Seed the global RNG so that the simulation below is reproducible
First, we define the generative process, including its design and effects. Then, we simulate the generative process.
Next, we set the design parameters of the generative process. The function of each parameter is commented.
n_population <- 15000 # Population size
n_trials <- 6 # Trial count
items <- n_trials # Item count
conditions <- c("specific", "generic") # Conditions
age_lower_bound <- 4 # Lower bound of population age
age_upper_bound <- 6 # Upper bound of population age
ages <- runif(n_population, age_lower_bound, age_upper_bound) # Randomly sample ages
genders <- c("boy", "girl") # Population genders
Then, we set the effect sizes of the generative process. These quantities are very important: they are the true effects. When we sample from the population, we do not know these effects, so we have to infer them. Depending on the size of the effects, their number, their grouping structure, and so on, we may need larger sample sizes to be adequately confident that we will find adequate evidence for our predictions. The effect sizes are on the logit scale.
# Effects parameters of the generative process
trial_exclusion_rate <- 0.05 # Population rate of trial exclusion
logit_baseline <- qlogis(.576) # Baseline logit (pilot-estimated probability of .576, specific condition)
logit_effect <- 0.8 # Condition effect size (logit scale)
logit_raneff_id_sd <- 0.1 # SD of participant random intercept
logit_raneff_id_condition_sd <- 0.1 # SD of participant-level condition random slope
logit_raneff_trial_sd <- 0 # SD of trial random intercept
logit_raneff_item_sd <- 0.1 # SD of item random intercept
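To get intuition for these logit-scale values, we can back-transform them with plogis(). With the values above, the implied response probabilities are about .58 in the specific condition and .75 in the generic condition:
plogis(logit_baseline) # Specific condition: .576
plogis(logit_baseline + logit_effect) # Generic condition: ~.751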
Next, we simulate the generative process. This simulates the population from which we will subsequently draw random samples.
population_logistic_raw <- crossing(id = seq_len(n_population), # ID
trial = seq_len(n_trials)) %>% # Trial
group_by(id) %>% # Group
mutate(condition = factor(rep(conditions, length.out = n_trials), # Condition
levels = unique(conditions)),
age = ages[id], # Age
item = sample(items)) %>% # Random item order per participant
ungroup() %>%
mutate(gender = rep(genders, length.out = n_population)[id], # Gender
logit_raneff_id = rnorm(n_population, 0, logit_raneff_id_sd)[id], # ID random intercept
logit_raneff_trial = rnorm(n_trials, 0, logit_raneff_trial_sd)[trial], # Trial random intercept
logit_raneff_item = rnorm(items, 0, logit_raneff_item_sd)[item], # Item random intercept
logit_raneff_id_condition = rnorm(n_population, 0, logit_raneff_id_condition_sd)[id], # Condition random slope
logit_condition = case_when(condition == "generic" ~ logit_baseline + logit_effect, # Condition effect
condition == "specific" ~ logit_baseline,
TRUE ~ NA_real_), # Typed NA keeps the logit column numeric
logit_condition_re = logit_condition + if_else(condition == "generic", # Individual-specific condition effects
logit_raneff_id_condition,
-logit_raneff_id_condition),
logit = # Response logit
logit_raneff_id +
logit_condition_re +
logit_raneff_trial +
logit_raneff_item,
prob = plogis(logit), # Response probability
response = rbinom(n(), 1, prob), # Simulate responses
include = rbinom(n(), 1, 1 - trial_exclusion_rate), # Simulate exclusions
across(where(is.character), ~ as.factor(.))) # Convert characters to factor
sum(is.na(population_logistic_raw)) # Count NAs (should equal zero)
## [1] 0
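As a quick (optional) sanity check, the simulated response rates by condition should land near their generative values, roughly .58 for the specific condition and .75 for the generic condition (the random effects jitter these slightly):
population_logistic_raw %>% # Sanity check: response rates by condition
  group_by(condition) %>%
  summarise(observed_rate = mean(response), .groups = "drop") # Should be near .576 and ~.75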
Next, we remove the generative parameters from our dataset. That is, we create a dataset that contains all and only the variables that we will have in the empirical dataset gathered by our experiment.
population_logistic <- population_logistic_raw %>% # Simulated population with unknown parameters
select(id, trial, condition, item, gender, response, include) # Columns to retain
setdiff(names(population_logistic_raw), names(population_logistic)) # Check which parameters have been removed
## [1] "age" "logit_raneff_id"
## [3] "logit_raneff_trial" "logit_raneff_item"
## [5] "logit_raneff_id_condition" "logit_condition"
## [7] "logit_condition_re" "logit"
## [9] "prob"
Now that we have simulated a population and created a dataframe with unknown parameters, we randomly sample from that population many times over. There is some parallel computation going on here that will enable your machine to draw several random samples simultaneously. Parallelization carries some overhead, so the time savings become noticeable only when drawing thousands of samples.
First, we set the parameters for our random samples. This includes parameters for the number of random samples per sample size, the minimum and maximum sample sizes, and the “sample size delta.”
The latter, delta parameter controls the step size between samples. For example, if the sample size delta is set to 20 and the minimum and maximum sizes are 20 and 100, respectively, then the algorithm will produce samples of size 20, 40, 60, 80, and 100.
Be deliberate about the values that you set for these parameters. The stakes aren’t high, but you can mistakenly produce samples of unexpected size. For instance, a delta of 20 with minimum and maximum sample sizes of 10 and 100, respectively, will produce samples of size 10, 30, 50, 70, and 90 (not 100), because each step is added to the minimum and the maximum is never reached exactly.
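Both cases can be checked directly with seq(), which is exactly how the sampling code below builds the sample sizes:
seq(20, 100, by = 20) # Returns 20 40 60 80 100
seq(10, 100, by = 20) # Returns 10 30 50 70 90; the maximum of 100 is never reached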
# Population sampling parameters
n_samples_per_sample_size <- 200 # Number of random samples per sample size
min_sample_size <- 20 # Minimum sample size
max_sample_size <- 100 # Maximum sample size
granularity <- 10 # Sample size delta
Now that we have set the parameters of our random samples, let’s check the unique samples, sample sizes, and number of sample sizes that we will collect.
samples <- rep( # One sample-size entry per sample to be drawn
seq(min_sample_size, # Minimum sample size
max_sample_size, # Maximum sample size
by = granularity), # Sample size delta
each = n_samples_per_sample_size) # Number of samples per sample size
length(samples) # Total number of samples that we will draw
unique(samples) # Unique sample sizes
n_distinct(samples) # Number of distinct sample sizes
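With the parameters set above, these checks are deterministic: 9 sample sizes (20 through 100 in steps of 10), each repeated 200 times, for 1,800 samples in total:
## [1] 1800
## [1]  20  30  40  50  60  70  80  90 100
## [1] 9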
Next, we parallelize our computation. The following code tells R to utilize all but one of your machine’s cores. The number of cores is the number of independent processes that your machine can run simultaneously. For instance, my machine has 8 cores, so the code tells my machine to use 7 cores to draw random samples. For a fixed number of samples, this speeds up the sampling process roughly 7-fold. So that I don’t melt my machine, I usually leave a spare core for other computations, like watching YouTube videos or texting my mom.
plan(multisession, workers = detectCores() - 1) # Parallelization
print(plan()) # Print the current parallelization plan (plan() with no arguments does not change it)
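One aside: once the parallel work is finished, you can restore R’s default single-process behavior. Run this only after the sampling step below, not now:
# plan(sequential) # Restore sequential (single-process) execution after sampling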
We have set the parameters of our random sampling algorithm and checked the samples that we will collect from the generative process. We are ready to draw our samples. If you have set a large number of samples to be collected (e.g., thousands), this may take several seconds to several minutes. Your machine’s hardware (e.g., core count) will also influence the algorithm’s compute time. The following code will produce a list of samples with length equal to length(samples) (1,800 with the parameters above). That is, each element of the list will be one random sample from your simulated population. Each random sample will contain all and only the variables that you plan to use in your model.
list_samples <- local({ # Draw random samples
  start_time <- Sys.time() # Record start time
  list_samples <- future_map(samples, # One draw per requested sample size
                             ~ population_logistic %>% # Simulated population
                               filter(id %in% sample(unique(id), size = .x, replace = FALSE)) %>% # Randomly sample .x participants
                               ungroup() %>% # Ungroup computations
                               filter(include == 1) %>% # Keep only included trials
                               select(-include), # Remove "include" column
                             .options = furrr_options( # Options passed to the parallel workers
                               globals = list(population_logistic = population_logistic), # Export the population to each worker
                               packages = c("tidyverse"), # Load tidyverse on each worker
                               seed = seed), # Parallel-safe RNG seeding
                             .progress = TRUE) # Report progress
  end_time <- Sys.time() # Record end time
  print(end_time - start_time) # Print runtime
  return(list_samples) # Return list of random samples
})
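As a quick sanity check on the result (assuming the parameters above), the list should contain 1,800 samples, and its first element, drawn at the minimum sample size, should contain 20 participants:
length(list_samples) # Should equal length(samples), i.e., 1800
n_distinct(list_samples[[1]]$id) # Should equal min_sample_size, i.e., 20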