Introduction

This RPub documents a pipeline for simulation-based Bayesian sample size planning. The pipeline uses tools from the tidyverse for data wrangling, brms for Bayesian modeling, and tidybayes for posterior analysis.

The Pipeline

The planning pipeline includes two components. The first component is an analysis pipeline. The second component iterates over that analysis pipeline using simulated data, and its results inform one’s choice of sample size.

The first component is the analysis pipeline that one will use to analyze the data from one’s experiment. This component is created before or after collecting pilot data and is used to analyze those data. In turn, the results of the pilot analysis inform the second component of the planning pipeline. In the second component, one simulates a population using parameter values inferred from the pilot data. Then, this simulated population is randomly sampled many times over while systematically increasing the sample size. The analysis pipeline from the first component is then iterated over, once per sample. The results of this pilot data-informed, iterative procedure are the quantities used for sample size planning.

The use of pilot data in this pipeline is intended to reduce the uncertainty about effect sizes that is typically associated with power analysis. Both components should be preregistered. Specifically, one should preregister one’s main inference pipeline and the results of one’s sample size simulation.

Importantly, several code chunks of this pipeline may be repurposed with little or no modification. Other chunks require modification to fit one’s experimental design. Repurposing this pipeline requires nontrivial R literacy. However, the extensive documentation and commenting should lower that barrier.

Feel Free to Skip This Part Unless You’re Having Trouble Falling Asleep

Background

This pipeline collects the statistical practices that I have found useful in my own research. In part, these practices implement my effort to maximize the generalizability of my inferences in light of the sparse data characteristic of my studies and of experimental developmental psychological research, generally. Towards this end, this pipeline assumes that investigators adopt a posture of model expansion, building probabilistic models that account for numerous sources of uncertainty rather than comparing them against simpler models; the use of marginal rather than conditional probabilities; and prediction via expected values rather than inference on parameters.

Moreover, this pipeline implements my effort to faithfully translate my scientific hypotheses into statistical hypotheses. As I see it, this is the only way to conduct statistical tests that logically, rather than merely conventionally, speak to the content of the conclusions that I want to advance. Typically, I (and many others) advance directional hypotheses that beget directional predictions, like observed mean A > observed mean B. Then, I advance directional conclusions based on the results of my directional tests. Specifically, this pipeline measures the evidence for the hypothesized directional effect, compared to the alternative directional effect that was not hypothesized (for a comparison of Bayesian directional testing to one-tailed p-values, see this; for a recent revival of one-tailed p-values, see this thang). That is, this pipeline treats the nil null hypothesis as fake news: it is quasi-always false, and so it is ignored here.

Specifically, this pipeline measures the posterior odds of directional hypotheses. If both directional hypotheses are a priori equiprobable, then their posterior odds are equivalent to a Bayes factor that quantifies the evidence for one or the other directional hypothesis. Notably, the posterior odds of directional hypotheses qua Bayes factor avoid the sensitivity to prior dispersion characteristic of model comparison Bayes factors and of Bayes factors computed via the Savage-Dickey density ratio. This is because the posterior odds of direction are an “objective” property of the posterior: symmetric changes to prior dispersion about zero “cancel each other out,” so to speak. In practice, I have found the posterior odds to be essentially unaffected by prior dispersion until one imposes unreasonably small dispersions, at which point the posterior odds collapse to 1. Therefore, while priors can encode theory, in the present pipeline priors are tools for identifiability, regularization, and robust estimation in rich, hierarchical, probabilistic models. Moreover, the posterior odds are computed from the posterior distribution over the marginalized expected value of the response, as opposed to the parameters. That is, the quantity used for inference is averaged across all the other covariates’ modeled uncertainty.
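
As a minimal sketch of this computation, suppose draws is a vector of posterior draws of the marginalized expected difference between conditions (draws is a hypothetical object here; in this pipeline such draws would be extracted with tidybayes). The posterior odds of direction then reduce to a ratio of tail probabilities:

p_direction <- mean(draws > 0) # Posterior probability of the hypothesized direction
posterior_odds <- p_direction / (1 - p_direction) # Odds for that direction over its reverse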

Additional Resources

Quick introductions to Bayesian inference can be found here and here. In addition to the Kruschke and Gelman groups, there’s also the Wagenmakers group (e.g., here is that group’s Bayesian sample size planning pipeline for developmental psychologists). Personally, I much prefer the Kruschke and Gelman groups; Gelman et al.’s Bayesian Data Analysis has been a useful learning tool for me. The Wagenmakers group relies too heavily on Bayes factors, which are overly sensitive to the prior (see above) while, nonetheless, offering important benefits over p-values (summarized here; e.g., optional stopping). Here’s an introduction to brms by the creator of brms. Bayesian workflows can be found here and here. Kruschke has a baller, hella thorough article on Bayesian reporting guidelines.

Future Directions

Conceivably, this pipeline could develop into a simple web interface or R package, like the Frequentist SuperPower. In the future, I envision a user experience that allows one to easily specify various population parameters, parameters for sampling from the population, model structures and priors, and statistical quantities to be computed from the posterior models to be used for sample size planning.

Globals

We load our packages and set our globals. The general function of each package is commented. The seed tells the random number generator where to begin its process.

library(brms)      # Models
library(furrr)     # Parallelization
library(future)    # Parallelization
library(janitor)   # Cleaning
library(knitr)     # Reporting
library(parallel)  # Parallelization
library(scales)    # Plots
library(tidybayes) # Analysis
library(tidyverse) # Wrangling

seed <- 10003 # RNG seed

The Generative Process

First, we define the generative process, including its design and effects. Then, we simulate the generative process.

Design

Next, we set the design parameters of the generative process. The function of each parameter is commented.

n_population <- 15000 # Population size
n_trials <- 6 # Trial count
items <- n_trials # Item count
conditions <- c("specific", "generic") # Conditions
age_lower_bound <- 4 # Lower bound of population age
age_upper_bound <- 6 # Upper bound of population age
ages <- runif(n_population, age_lower_bound, age_upper_bound) # Randomly sample ages
genders <- c("boy", "girl") # Population genders

Effects

Then, we set the effect sizes of the generative process. These quantities are important: they are the true effects. When we sample from the population, we do not know these effects, so we must infer them. Depending on the effects’ size, number, grouping structure, and so on, we may need larger sample sizes to be reasonably confident that we will find adequate evidence for our predictions. The effect sizes are on the logit scale.

# Effects parameters of the generative process
trial_exclusion_rate <- 0.05 # Population rate of trial exclusion
logit_baseline <- qlogis(.576) # Estimated probability (pilot sample, specific condition)
logit_effect <- 0.8 # Effect size
logit_raneff_id_sd <- 0.1 # SD of participant random intercept
logit_raneff_id_condition_sd <- 0.1 # SD of condition random slope
logit_raneff_trial_sd <- 0 # SD of trial random intercept
logit_raneff_item_sd <- 0.1 # SD of item random intercept
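
For intuition, it can help to map these logit-scale quantities back to the probability scale. For example, the condition effect of 0.8 logits shifts the baseline response probability from .576 to roughly .75:

plogis(qlogis(.576))       # Baseline response probability: 0.576
plogis(qlogis(.576) + 0.8) # Response probability under the condition effect: ~0.751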

Simulation

Next, we simulate the generative process. This simulates the population from which we will subsequently draw random samples.
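
Because the simulation below draws random numbers, it helps to set the seed defined in Globals before running it; the pipeline passes seed to furrr later, but that option does not govern these draws:

set.seed(seed) # Make the population simulation reproducible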

population_logistic_raw <- crossing(id = seq_len(n_population), # ID
                                    trial = seq_len(n_trials)) %>% # Trial
  group_by(id) %>% # Group
  mutate(condition = factor(rep(conditions, length.out = n_trials), # Condition
                            levels = unique(conditions)),
    age = ages[id], # Age
    item = sample(items)) %>% # Randomly order items across trials within participant
  ungroup() %>%
  mutate(gender = rep(genders, length.out = n_population)[id], # Gender
    logit_raneff_id = rnorm(n_population, 0, logit_raneff_id_sd)[id], # ID random intercept
    logit_raneff_trial = rnorm(n_trials, 0, logit_raneff_trial_sd)[trial], # Trial random intercept
    logit_raneff_item = rnorm(items, 0, logit_raneff_item_sd)[item], # Item random intercept
    logit_raneff_id_condition = rnorm(n_population, 0, logit_raneff_id_condition_sd)[id], # Condition random slope
    logit_condition = case_when(condition == "generic" ~ logit_baseline + logit_effect, # Condition effect
                                condition == "specific" ~ logit_baseline,
                                TRUE ~ NA_real_),
    logit_condition_re = logit_condition + if_else(condition == "generic", # Individual-specific condition effects
                                                   logit_raneff_id_condition,
                                                   -logit_raneff_id_condition),
    logit = # Response logit
      logit_raneff_id +
      logit_condition_re +
      logit_raneff_trial +
      logit_raneff_item,
    prob = plogis(logit), # Response probability
    response = rbinom(n(), 1, prob), # Simulate responses
    include = rbinom(n(), 1, 1 - trial_exclusion_rate), # Simulate exclusions
    across(where(is.character), ~ as.factor(.))) # Convert characters to factor

sum(is.na(population_logistic_raw)) # Count NAs (should equal zero)
## [1] 0

Remove Generative Parameters

Next, we remove generative parameters from our dataset. That is, we create a dataset that will contain all and only the variables that we will have in our empirical dataset gathered by our experiment.

population_logistic <- population_logistic_raw %>% # Simulated population with unknown parameters
  select(id, trial, condition, item, gender, response, include) # Columns to retain

setdiff(names(population_logistic_raw), names(population_logistic)) # Check which parameters have been removed
## [1] "age"                       "logit_raneff_id"          
## [3] "logit_raneff_trial"        "logit_raneff_item"        
## [5] "logit_raneff_id_condition" "logit_condition"          
## [7] "logit_condition_re"        "logit"                    
## [9] "prob"

Samples

Now that we have simulated a population and created a dataframe with unknown parameters, we randomly sample from that population many times over. There is some parallel computation going on here that enables your machine to draw several random samples simultaneously. This can save substantial computation time, but the savings become noticeable only when drawing thousands of samples.

Parameters of the Random Sampling Process

First, we set the parameters for our random samples. This includes parameters for the number of random samples per sample size, the minimum and maximum sample sizes, and the “sample size delta.”

The latter parameter, the sample size delta, controls the step size between sample sizes. For example, if the delta is set to 20 and the minimum and maximum sizes are 20 and 100, respectively, then the algorithm will produce samples of size 20, 40, 60, 80, and 100.

Be conscientious about the values that you set for these parameters. The stakes aren’t high, but you can mistakenly produce samples of unexpected sizes. For instance, a delta of 20 with minimum and maximum sample sizes of 10 and 100, respectively, will produce samples of size 10, 30, 50, 70, and 90 (not 100).
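
You can verify this behavior in the console before committing to your parameters:

seq(10, 100, by = 20) # Returns 10 30 50 70 90; note that 100 is never reached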

# Population sampling parameters
n_samples_per_sample_size <- 200 # Number of random samples per sample size
min_sample_size <- 20 # Minimum sample size
max_sample_size <- 100 # Maximum sample size
granularity <- 10 # Sample size delta

Check Samples to Be Collected

Now that we have set the parameters of our random samples, let’s check the unique samples, sample sizes, and number of sample sizes that we will collect.

samples <- rep( # Unique samples to be collected
  seq(min_sample_size, # Minimum sample size
      max_sample_size, # Maximum sample size
      by = granularity), # Sample size delta
  each = n_samples_per_sample_size) # Number of samples per sample size

length(samples) # Total number of samples that we will collect (here, 9 sample sizes x 200 samples each = 1800)

unique(samples) # Unique sample sizes that we will collect

n_distinct(samples) # Number of distinct sample sizes that we will collect

Parallelization

Next, we parallelize our computation. The following code tells R to utilize your machine’s n - 1 cores. The number of cores is the number of independent processes that your machine can perform simultaneously. For instance, my machine has 8 cores, so the code tells my machine to use 7 cores to draw random samples. For a fixed number of samples, this speeds up the sampling process roughly 7-fold. So that I don’t melt my machine, I usually leave a spare core for other computations, like watching YouTube videos or texting my mom.

plan(multisession, workers = detectCores() - 1) # Parallelize across n - 1 cores
print(plan()) # Print the current plan; calling plan(multisession) again would reset the worker count
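
When the parallel work is finished, you can restore future’s default single-core plan:

plan(sequential) # Return to sequential execution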

Draw random samples

We have set the parameters of our random sampling algorithm and checked the samples that we will collect from the generative process. We are ready to draw our samples. If you have set a large number of samples to be collected (e.g., thousands), this may take several seconds to several minutes. Your machine’s hardware (e.g., core count) will also influence the algorithm’s compute time. The following code will produce a list of 1800 samples (9 sample sizes x 200 samples each). That is, each element of the list will be one random sample from your simulated population, containing all and only the variables that you plan to use in your model.

list_samples <- local({ # Draw random samples
  start_time <- Sys.time() # Record start time
  list_samples <- future_map(samples, # One random sample per requested sample size
                             ~ population_logistic %>% # Simulated population
                               filter(id %in% sample(unique(id), size = .x, replace = FALSE)) %>% # Sample .x participants
                               ungroup() %>% # Ungroup computations
                               filter(include == 1) %>% # Retain only included trials
                               select(-include), # Remove "include" column
                             .options = furrr_options( # Export objects and packages to the workers
                               globals = list(population_logistic = population_logistic),
                               packages = c("tidyverse"),
                               seed = seed), # Reproducible parallel RNG
                             .progress = TRUE) # Report progress
  end_time <- Sys.time() # Record end time
  print(end_time - start_time) # Print runtime
  list_samples # Return list of random samples
})
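
A couple of optional sanity checks confirm that the sampler behaved as expected:

length(list_samples) # Total samples drawn; should equal length(samples) (here, 1800)
n_distinct(list_samples[[1]]$id) # Participants in the first sample (at most 20, since a participant with all trials excluded would be dropped)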
