Introduction

This tutorial illustrates how to use the open-access R package BayesSSD for Bayesian sample size determination (SSD; see Lösener & Moerbeek, 2025 for a description of an earlier version of this software). The starting point is the data of an existing empirical study, which informs the SSD for our (fictional) longitudinal experiment with three treatment conditions and expected attrition. Our goal is to replicate a previous longitudinal experiment using Bayesian hypothesis evaluation and to make an informed choice about the number of individuals in the sample. Note that the model estimation is maximum likelihood, while the hypothesis about the frequentist estimates is evaluated in a Bayesian way. We would like to achieve a Bayesian power level of at least \(\eta = .8\), given that our design and expected attrition is identical to the previous study.

The data come from the study “How Can We Increase Privacy Protection Behavior? A Longitudinal Experiment Testing Three Intervention Strategies” by Boerman, Strycharz and Smit (2023). The data are fully anonymous and openly available on the Open Science Framework (OSF). Note that for illustrative purposes, we only focus on a subset of treatment conditions and outcome variables that were investigated in the original study.

Each participant was measured three times (at baseline, two weeks after, and two months after) in this longitudinal experiment. The participants were randomly allocated to four strategy training conditions: a) no strategy (control), b) addressing the severity of the threat, c) acknowledging and combating privacy fatigue, and d) a combination of b) and c). The dependent variable was perceived severity of the threat of infringement of privacy. Prior research suggests that in adult samples, these types of interventions actually lead to a decrease in the perceived severity (Strycharz et al., 2019 & 2021); an effect we aim to replicate. Thus, our research hypothesis states that all three interventions (b, c and d) lead to a steeper decline in perceived threat compared to the control condition a). Furthermore, we hypothesize that the decline is steepest in condition d), followed by condition c), in turn followed by condition b). Hence, the research hypothesis can be parametrized as: \(\mathscr{H}_1:\beta_a > \beta_b > \beta_c > \beta_d\), where \(\beta\) refers to the coefficient of interaction between time and the respective condition in the multilevel regression model. The research hypothesis will be evaluated against its complement \(\mathscr{H}_c\), which covers all parameter orderings left uncovered by \(\mathscr{H}_1\). Note that the research hypothesis is constructed purely for illustrative purposes and may not reflect the author’s content-related expectations nor be of any conceptual value.

The steps of this tutorial are as follows: First, we load and clean the data. Next, we inspect some plots in order to determine that fitting a multilevel model to this data is indeed sensible. After that, we fit the multilevel model. Next, we extract the estimates of effects sizes along with the (co-)variance components necessary to inform our sample size determination (SSD) procedure. Then, we specify the pattern of attrition we expect in our replication study based on the reported attrition rates in Boerman et al. (2023). Finally, we execute the SSD using the BayesSSD package.

Data preparation

Load data

First, we load the data. These are openly available on OSF and are fully anonymous. Note that to be able to execute the code below, the data must be stored in the same location as the working directory of the R session.

dat_raw <- read.csv("waveall_final.csv")

Subset data, convert from wide to long

Next, we have to convert the data frame into the long format in order to be able to fit a multilevel regression model using the lme4 package. We also convert some variable types to make them more suitable for model fitting. To do this, we load the dplyr and the tidyr package for more convenient data restructuring.

We first create a data frame with only those variables relevant to the analysis. Also, we will only be considering participants in conditions a), b), c) and d) (coded as 0-3). We also need to rename the values in the condition variable.

In the variable condition, we see that always the last value for each participant in missing, so we need to fill those in. As participants did not switch conditions, this will simply be equal to the previous value for condition within that person.

# install packages if needed
if (!require("dplyr")) {install.packages("dplyr")}
if (!require("tidyr")) {install.packages("tidyr")}

# call their libraries
library(dplyr)
library(tidyr)

# select only relevant variables
relevant_vars <- c("severity_1_W1", "severity_2_W1", "severity_3_W1", 
                   "severity_1_W2", "severity_2_W2", "severity_3_W2", 
                   "severity_1_W3", "severity_2_W3", "severity_3_W3", 
                   "conditie_W1", "conditie_W2") 
dat_raw_subset <- dat_raw[,relevant_vars] # subset dataset

# convert from wide to long format
dat_raw_long <- dat_raw_subset %>%
  mutate(id = row_number()) %>%
  pivot_longer(
    cols = -id,
    names_to = c(".value", "wave"),
    names_pattern = "(.*)_W(\\d+)"
  )

# take mean to construct total severity variable
dat_raw_long <- dat_raw_long %>%
  mutate(severity = rowMeans(select(., severity_1, severity_2, severity_3), na.rm = TRUE))


# rename variables for clarity
names(dat_raw_long)[names(dat_raw_long) == "conditie"] <- "condition"

# fill in missing condition values for last wave and only select individuals in conditions 0-3
dat_long <- dat_raw_long %>%
  group_by(id) %>%
  fill(condition, .direction = "updown") %>%
  ungroup() %>%
  filter(condition == 0 |
         condition == 1 |
         condition == 3 |
         condition == 5)

# rename conditions to a), b), c), d)
dat_long$condition[dat_long$condition==0] <- "a"
dat_long$condition[dat_long$condition==1] <- "b"
dat_long$condition[dat_long$condition==3] <- "c"
dat_long$condition[dat_long$condition==5] <- "d"

Add time variable

As the last step in the data cleaning process, we need to recode the wave variable into a new time variable with the actual time points (0, 2, 8). We inspect the first couple of rows of the new data frame to make sure it is formatted correctly.

# if wave is one, recode to 0, if wave is 2, it stays the same, if wave is 3, recode to 8
dat <- dat_long %>% 
          mutate(time = ifelse(wave == 1, 0, ifelse(wave == 3, 8, wave)))
          
dat$time <- as.numeric(dat$time) # make time numeric

head(dat) # inspect first couple of rows
## # A tibble: 6 × 8
##      id wave  severity_1 severity_2 severity_3 condition severity  time
##   <int> <chr>      <int>      <dbl>      <int> <chr>        <dbl> <dbl>
## 1     1 1              7       7             7 d             7        0
## 2     1 2              7       7             7 d             7        2
## 3     1 3              7       7             7 d             7        8
## 4     6 1              7       7             7 d             7        0
## 5     6 2              6       6.33          7 d             6.44     2
## 6     6 3              6       6.17          7 d             6.39     8

Now, our dataset is ready for model fitting. Before we fit the model, however, we first make some plots to show that the data indeed has a multilevel structure.

Plotting multilevel structure

We inspect the data via lineplots using the ggplot2 package. First, we illustrate the multilevel structure of the data on a random subset of 25 participants, as showing all 1000 participant’s lines would not be practical We thus plot the severity of each of these 25 individuals over time separately in a facet plot. The black solid line shows the raw values per point in time while the dotted red line represents the linear trend of that person. Note that some individuals have missing values due to dropping out.

# install if needed
if (!require("ggplot2")) {install.packages("ggplot2")}

# randomly select 25 individuals to showcase
set.seed(123) # for reproducibility
sample_ids <- sample(unique(dat$id), 25)

# filter the data to only include those IDs
dat_sample <- dat %>%
  filter(id %in% sample_ids)

# create the faceted plot
ggplot2::ggplot(dat_sample, aes(x = time, y = severity)) +
                  geom_point() +                   # add points for the actual observations
                  geom_line() +                    # connect the points with lines
                  geom_smooth(method = "lm", se = FALSE, color = "red", linetype = "dashed") + # add individual LM fit
                  facet_wrap(~ id, ncol = 5) +     # create a 4x4 grid of panels
                  labs(title = "Linear trends over time of a random sample of 25 individuals") +
                  theme_minimal()

We can see that there is considerable variability in both the initial scores and the linear trend over time between individuals. Some individuals start with a higher score, others with a lower score. Some individuals show a positive trend, others a negative one, and yet others do not change over time at all with respect to the outcome. Hence, a multilevel regression model with random intercepts and slopes seems like an appropriate choice.

Multilevel regression model

In this step, we fit the multilevel regression model to the data using the lme4 package. We do this in order to obtain estimates necessary to run the SSD algorithm. Alternatively, if no data are available, these estimates can be obtained by consulting experts in the respective field of research.

Fit the model

First, we fit the multilevel regression model to the data. The predictors are time, condition and their interaction time:condition. Intercepts and slopes for time are allowed to vary per person as indicated by the random part of the regression equation(1 + time | id).

# install if needed
if (!require("lme4")) {install.packages("lme4")}
# call the library
library(lme4)

# model for severity with condition and time as well as their interaction as predictors
mlm <- lmer(severity ~ -1 + condition + time:condition + (1 + time | id), dat, REML = F)
summ_mlm <- summary(mlm)

# inspect model output
summ_mlm
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: severity ~ -1 + condition + time:condition + (1 + time | id)
##    Data: dat
## 
##       AIC       BIC    logLik -2*log(L)  df.resid 
##    3200.4    3260.6   -1588.2    3176.4      1102 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -4.2596 -0.4207  0.0877  0.4384  4.0952 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  id       (Intercept) 1.269526 1.12673       
##           time        0.002934 0.05417  -0.54
##  Residual             0.447243 0.66876       
## Number of obs: 1114, groups:  id, 432
## 
## Fixed effects:
##                  Estimate Std. Error t value
## conditiona       5.570233   0.120518  46.219
## conditionb       5.651525   0.119346  47.354
## conditionc       5.378745   0.121639  44.219
## conditiond       5.391863   0.120557  44.725
## conditiona:time  0.015541   0.015198   1.023
## conditionb:time  0.006049   0.014274   0.424
## conditionc:time -0.006859   0.015226  -0.450
## conditiond:time -0.021731   0.015617  -1.391
## 
## Correlation of Fixed Effects:
##             condtn cndtnb cndtnc cndtnd cndtn:t cndtnb: cndtnc:
## conditionb   0.000                                             
## conditionc   0.000  0.000                                      
## conditiond   0.000  0.000  0.000                               
## conditin:tm -0.424  0.000  0.000  0.000                        
## conditnb:tm  0.000 -0.436  0.000  0.000  0.000                 
## conditnc:tm  0.000  0.000 -0.426  0.000  0.000   0.000         
## conditnd:tm  0.000  0.000  0.000 -0.419  0.000   0.000   0.000

From the model output, we gather that in the first condition a), the perceived severity increases slightly over time, as indicated by the estimate for conditiona:time which is equal to 0.016. We also see that in the conditions b), c) and d), the change over time is more negative compared to condition a), as indicated by the coefficients for conditionb:time, conditionc:time and conditiond:time, which are equal to 0.006, -0.007 and -0.022, respectively.

Extract estimates

Next, we extract all the (co-)variance components needed for the SSD. Specifically, the intercept variance, the slope variance, their covariance, and the residual variance.

# intercept variance
var_intercept <- summ_mlm$varcor$id[1,1]

# slope variance
var_slope <- summ_mlm$varcor$id[2,2]

# covariance between random intercept and random slope
cov_int_slope <- summ_mlm$varcor$id[1,2]

# residual variance
var_resid <- sigma(mlm)^2

Calculate effect sizes

The effect size found in the previous study serves as an estimate for the effect size we expect to find. It is defined as the ratio between the regression coefficient and the square root of the slope variance \(\delta = \frac{\beta}{\sqrt{\sigma^2_{u1}}}\) (Raudenbush & Liu, 2001). This means that we can derive the effect sizes from the regression coefficients and the slope variance as follows:

# extract the coefficients of interaction between time and condition
beta_cond_a <- mlm@beta[5] 
beta_cond_b <- mlm@beta[6]
beta_cond_c <- mlm@beta[7]
beta_cond_d <- mlm@beta[8]

# calculate effect sizes for each treatment condition according to Raudenbush & Liu (2001)
(delta_cond_a <- beta_cond_a / sqrt(var_slope))
## [1] 0.2869102
(delta_cond_b <- beta_cond_b / sqrt(var_slope))
## [1] 0.111666
(delta_cond_c <- beta_cond_c / sqrt(var_slope))
## [1] -0.1266238
(delta_cond_d <- beta_cond_d / sqrt(var_slope))
## [1] -0.4011699

We can see that the effect sizes for treatments a), b), c), and d) are equal to \(\delta_a=0.29\), \(\delta_b=0.11\), \(\delta_c=-0.13\), and \(\delta_d=-.40\), respectively. Positive (negative) effect sizes indicate an increase (decline) of perceived severity over time. For an illustrative overview of the influence of the multilevel model parameters on the growth curves, please visit the ShinyApp developed for this purpose.

Time points and attrition

The \(n=3\) timepoints are at baseline (\(t_1=0\)), two weeks later (\(t_2=2\)) and two months later (\(t_3=8\)), with \(t\) denoting the number of weeks since the start of the study. At each of the three measurement occasions, there are \(N_1=1000\), \(N_2=799\) and \(N_3=465\) individuals remaining in the sample (Boerman et al., 2024). This corresponds to a remaining proportion of .799 at \(t_2\) and of .465 at \(t_2\).

The plot below shows the proportion of dropout at each measurement occasion.

# install scales package if needed
if (!require("scales")) {install.packages("scales")}

# create data of timepoints and remaining individuals
timepoints <- c(0, 2, 8)
proportions <- c(1, 0.799, 0.465)

# create a data frame in long format for plotting
df <- data.frame(
  x = rep(factor(timepoints), each = 2),  
  group = rep(c("remaining", "dropped out"), times = length(timepoints)),  
  value = c(rbind(proportions, 1 - proportions))) 

# plot
ggplot(df, aes(x = x, y = value, fill = group)) +
  geom_col(position = "stack") +
  scale_y_continuous(labels = scales::percent, limits = c(0, 1)) +
  labs(
    x = "Time elapsed in weeks",
    y = "Proportion of individuals remaining",
    fill = "Status"
  ) +
  theme_minimal()

For an extensive overview of the available survival functions that can be used for modelling attrition in the BayesSSD package, please visit the ShinyApp developed for this purpose.

Using the BayesSSD package

Package installation

Now that we obtained all the estimates necessary to perform the Bayesian Sample Size Determination, we need to install the package BayesSSD from GitHub using the pak package.

if (!require("pak")) {install.packages("pak")}

pak::pak("ulrichlosener/BayesSSD")
library(BayesSSD)

Arguments to the SSD_longit function

The following table provides an overview of all the arguments to the function SSD_longit and their default values.

Arguments to the SSD_longit function
Argument Type Description Default
eta numeric the desired power level 0.8
attrition string if FALSE, no attrition is simulated. Otherwise, input determines the survival function(s) used. ‘weibull’
params list/vector parameter values passed to the survival function. Single vector if the same attrition pattern applies to all treatment conditions. Otherwise list of vectors for each pattern. First entry is omega, second gamma and third (if applicable) kappa. c(0.5, 1)
BFthres numeric The threshold a BF must exceed in order to be considered convincing evidence 3
PMPthres numeric The threshold a PMP must exceed in order to be considered convincing evidence 0.9
eff.sizes numeric delta, the expected standardized effect size(s) c(0, 0.5, 0.8)
m numeric the number of datasets to be simulated in each iteration 1000
t.points vector the time points of the measurement occasions, equal and unequal spacing possible c(0,1,2,3,4)
log logical use log-linear growth? FALSE
var.u0 numeric the intercept variance 0.03
var.u1 numeric the slope variance 0.1
var.e numeric the residual variance 0.02
cov numeric the covariance between intercept and slope variance 0
sensitivity logical execute sensitivity analysis for different values (1, 2, 3) for fraction? FALSE
hypothesis string Either a single hypothesis or a list of hypotheses with the research hypothesis of interest in first place. Treatment conditions are referred to as ‘a, b, c, …,’ ‘a<b<c’
method string The method of hypothesis evaluating: ‘bfc’ refers to the BF versus the complement hypothesis, ‘bf’ to the BF of the first versus the second hypothesis in ‘hypothesis’ and ‘pmp’ to the posterior model probability in the set of all hypotheses including the complement hypothesis. ‘bfc’
N.min numeric The minimal sample size to be considered 30
N.max numeric The maximal sample size to be considered 1000
group.sizes vector NULL for a balanced design. When group sizes differ, a vector with the proportions of the total sample size in each treatment condition with the first element corresponding to condition ‘a’, the second to ‘b’, etc. NULL
tol numeric tolerance of deviation from desired power. Higher values may speed up performance. 0.001
seed numeric set a seed for reproducibility NULL

Performing Bayesian SSD

Now, we can run the simulation-based Bayesian SSD with the BayesSSD command and pass all the ingredients to the appropriate arguments. Remember that we strive for a power level of \(\eta=.8\) and our research hypothesis is \(\mathscr{H}_1:\beta_a>\beta_b>\beta_c>\beta_d\). We are planning on evaluating this hypothesis against the complement hypothesis \(\mathscr{H}_c\) which covers all parameter orderings not included in our research hypothesis. Because we are planning on running a exploratory rather than a confirmatory study, we set the threshold for a convincing Bayes Factor \(BF_{1c}\) to 3 in order not to miss any small but relevant effects.

We search for the appropriate sample size in a range from \(N_{min}=100\) and \(N_{max}=1500\) because we know that the original study used \(N=1000\) participants. For reliable results, we set m=10000, meaning that 10,000 datasets are simulated and evaluated in each iteration. Increasing m increases the precision as well as the computing time of the algorithm. We also set a seed for reproducibility, as some parts of the simulation rely on stochastic processes. For more information on the arguments of the SSD_longit function, run ?SSD_longit or see the table above.

ssd_result <- SSD_longit(
               eta         = 0.8,
               hypothesis  = "a > b > c > d",
               attrition   = "nonparametric",
               params      = c(1, 0.799, 0.465),
               m           = 10000,
               t.points    = c(0,2,8),
               var.u0      = var_intercept,
               var.u1      = var_slope,
               var.e       = var_resid,
               cov         = cov_int_slope,
               eff.sizes   = c(delta_cond_a, delta_cond_b, delta_cond_c, delta_cond_d),
               BFthres     = 3,
               method      = "bfc",
               sensitivity = F,
               seed        = 123,
               group.sizes = NULL,
               N.min       = 100,
               N.max       = 1500
)
##   |                                                                              |                                                                      |   0%  |                                                                              |=========                                                             |  12%  |                                                                              |==================                                                    |  25%  |                                                                              |==========================                                            |  38%  |                                                                              |===================================                                   |  50%  |                                                                              |======================================================================| 100%
## 
## Final sample size
## ================= 
## Hypotheses:
##     H1: a > b > c > d 
##     Hc: not H1
## Using a total of N = 1062 Participants
## Power = P(BF_1c > 3 | H1) = 0.805

The output informs us that the minimal sample size needed to achieve a power of \(\eta=.80\) when testing \(\mathscr{H}_1\) against its complement \(\mathscr{H}_c\) is \(\mathbf{N_{total}=1062}\), while taking the anticipated attrition pattern into account. So, for our fictitious replication study, we would need to sample 62 more participants than in the original study.

For more information on the SSD procedure, we can inspect the object that the function was assigned to.

ssd_result
## $evaluations
##      N power prop_simplified
## 1  800  0.73               0
## 2 1150  0.82               0
## 3  975  0.78               0
## 4 1062  0.80               0
## 
## $final
## $final$N
## [1] 1062
## 
## $final$power
## [1] 0.805
## 
## $final$threshold_BF
## [1] 3
## 
## $final$threshold_PMP
## [1] 0.9
## 
## $final$sensitivity
## [1] FALSE
## 
## 
## $hypotheses
## $hypotheses$hypothesis
## [1] "a > b > c > d"
## 
## $hypotheses$comparison
## [1] "bfc"
## 
## 
## $iterations
## [1] 5
## 
## $runtime
## [1] 34

This tells us the sample sizes that were evaluated during the binary search (800, 1150, 975, 1062) along with their resulting power, the proportion of models that had to be simplified due to high attrition (0%), the number of iterations (5) and the runtime of the SSD procedure in minutes (35). While this information is not essential for the result of the SSD, it can be useful for gaining more insight into the SSD algorithm.


References

Boerman, S. C., Strycharz, J., & Smit, E. G. (2024). How can we increase privacy protection behavior? A longitudinal experiment testing three intervention strategies. Communication research, 51(2), 115-145.

Lösener, U., & Moerbeek, M. (2025). Bayesian sample size determination for longitudinal intervention studies with linear and log-linear growth. Behavior Research Methods, 57(9), 239.

Raudenbush, S. W., & Liu, X. F. (2001). Effects of study duration, frequency of observation, and sample size on power in studies of group differences in polynomial change. Psychological methods, 6(4), 387.

Strycharz J., Smit E. G., Helberger N., van Noort G. (2021). No to cookies: Empowering impact of technical and legal knowledge on rejecting tracking cookies. Computers in Human Behavior, 120, 106750. https://doi.org/10.1016/j.chb.2021.106750

Strycharz J., Van Noort G., Smit E. G., Helberger N. (2019). Protective behavior against personalized ads: Motivation to turn personalization off. Cyberpsychology: Journal of Psychosocial Research on Cyberspace, 13(2), 1. https://doi.org/10.5817/CP2019-2-1


If you have questions or comments about this Markdown document or about the BayesSSD Package, please contact me under .