This note roughly summarizes how SEM programs such as lavaan generate bootstrapped confidence intervals for estimated parameters.
First let’s simulate data for parallel mediation:
# Set a random seed for reproducibility
set.seed(1984)
# Independent variable
X <- rnorm(100)
# Mediator 1: positively influenced by X
# True a1 = 0.5
M1 <- 0.5 * X + rnorm(100)
# Mediator 2: negatively influenced by X
# True a2 = -0.35
M2 <- -0.35 * X + rnorm(100)
# Outcome: influenced by both mediators
# True b1 (M1->Y) = -0.48, True b2 (M2->Y) = 0.7, no direct X->Y effect
Y <- -0.48 * M1 + 0.7 * M2 + 0.05*X + rnorm(100)
# Combine into a data frame
Data <- data.frame(X = X, Y = Y, M1 = M1, M2 = M2)
The idea of bootstrapping is to treat your sample as a surrogate for the population and resample from it with replacement to create a new sample 1. We do this many times, say 1000 times, to create 1000 new samples. Then, using each sample, we re-run our analysis 1000 times. We then use these bootstrap replicates to create a nonparametric 95% confidence interval using the 0.025 and 0.975 percentiles.
To implement the bootstrap procedure we’ll use the boot package that
comes with base R. The boot packages requires a function that it can use
to run the analysis with each new sample. The boot()
function then performs the resampling for us and runs the analysis
defined in our function. Let’s load the package and define our function,
which we call boot_fn().
library(boot)
#1. Define the bootstrap function
boot_fn <- function(data, indices) {
d <- data[indices, ]
# Fit the the 3 regression models for the mediation paths
m1 <- lm(M1 ~ X, data = d)
m2 <- lm(M2 ~ X, data = d)
y <- lm(Y ~ M1 + M2 + X, data = d)
# Fit the total effect model
total_model <- lm(Y ~ X, data = d)
# Extract path coefficient estimates
a1 <- coef(m1)["X"]
a2 <- coef(m2)["X"]
b1 <- coef(y)["M1"]
b2 <- coef(y)["M2"]
cp <- coef(y)["X"]
c_total <- coef(total_model)["X"]
# Define the indirect effects
ind_m1 <- a1 * b1
ind_m2 <- a2 * b2
tot_ind <- ind_m1 + ind_m2
# Return the effects to be bootstrapped
c(
ind_m1 = ind_m1,
ind_m2 = ind_m2,
total_indirect = tot_ind,
direct_effect = cp,
total_effect = c_total
)
}
The function might look complicated but it’s basically fitting three
simple models with a few extra steps to extract coefficients and
calculate indirect and total effects. The boot_fn()
function takes two arguments: data and
indices. The data argument is simply for our
data frame. The indices argument performs the resampling,
which is performed in the first line of code:
d <- data[indices, ].
Let’s try it out. First we resample the row numbers (or indices) of
our data using the sample() function
with replacement. These numbers are used to resample rows
from our data frame.
i <- sample(x = nrow(Data), replace = TRUE)
head(i)
[1] 59 35 61 45 19 70
Now use boot_fn() to generate one bootstrap
replicate.
boot_fn(data = Data, indices = i)
ind_m1.X ind_m2.X total_indirect.X direct_effect.X
-0.1412019 -0.1071022 -0.2483041 -0.1576868
total_effect.X
-0.4059909
We can try it again with a new sample of our data.
i <- sample(x = nrow(Data), replace = TRUE)
boot_fn(data = Data, indices = i)
ind_m1.X ind_m2.X total_indirect.X direct_effect.X
-0.20865110 -0.06333337 -0.27198447 -0.21559863
total_effect.X
-0.48758310
Notice we get slightly different results. This is the idea of bootstrapping: resample our data, run the analysis, and then summarize the variability in our results using percentile confidence intervals.
The boot() function automates the resampling for us. Be
sure and save the result. Below we save the result as “boot_res”.
Specify the number of resamples using the R argument and
supply your function to the statistic argument. Setting a
seed allows you to replicate the results. The function will take a few
seconds to run.
#2. Run the bootstrap
set.seed(123)
boot_res <- boot(
data = Data,
statistic = boot_fn,
R = 1000
)
The bootstrap replicates are stored in the t element of
the “boot_res” object. We can use head() to look at the
first six rows of the matrix. Column 1 contains the estimates for
ind_m1, column2 contains the estimates of ind_m2, and so on.
head(boot_res$t)
[,1] [,2] [,3] [,4] [,5]
[1,] -0.2918441 -0.099897322 -0.3917414 0.06509423 -0.3266472
[2,] -0.2164570 -0.080051348 -0.2965083 -0.02264087 -0.3191492
[3,] -0.2589680 -0.107395879 -0.3663639 0.01051310 -0.3558508
[4,] -0.3216455 -0.126339726 -0.4479853 0.08787194 -0.3601133
[5,] -0.2310214 -0.009286045 -0.2403075 -0.04579584 -0.2861033
[6,] -0.3358054 -0.068777565 -0.4045829 -0.05674194 -0.4613249
We can calculate a 95% percentile confidence interval by finding the
0.025 and 0.975 percentiles of one of the columns using the
quantile() function. For example, for the ind_m1
column:
quantile(boot_res$t[,1], probs = c(0.025, 0.975))
2.5% 97.5%
-0.4276461 -0.1271311
The boot package provides the bootci() function to do
this for us. Use the type argument to specify “perc” for
percentile intervals and the index argument to specify which column to
calculate the 95% confidence interval for. Below we get the confidence
interval for the first column, ind_m1.
boot.ci(boot_res, type = "perc", conf = 0.95, index = 1)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates
CALL :
boot.ci(boot.out = boot_res, conf = 0.95, type = "perc", index = 1)
Intervals :
Level Percentile
95% (-0.4314, -0.1249 )
Calculations and Intervals on Original Scale
To help compute all five confidence intervals, we define a new
function called get_ci() which saves the result of
boot.ci(), extracts the confidence interval, and rounds it
to four decimal places. We then apply this function to the numbers 1 - 5
to get all five confidence intervals and present the results in a data
frame. Notice we use the t() function to transpose the
result.
#3. Extract percentile bootstrap confidence intervals
get_ci <- function(i) {
ci <- boot.ci(boot_res, type = "perc", index = i)
round(ci$percent[4:5], 4)
}
# Apply the get_ci function to the 5 effects
ci_mat <- t(sapply(1:5, get_ci))
ci_mat
[,1] [,2]
[1,] -0.4314 -0.1249
[2,] -0.1991 0.0303
[3,] -0.5606 -0.1608
[4,] -0.3700 0.1975
[5,] -0.6641 -0.1862
We can then add these results to a data frame with proper labels.
#4. Create a summary table
results_table <- data.frame(
Path = c(
"Indirect via M1 (a1*b1)",
"Indirect via M2 (a2*b2)",
"Total indirect effect",
"Direct effect (c')",
"Total effect (c)"
),
Estimate = round(boot_res$t0, 4), #boot_res$t0 contains the original-sample estimates
CI_lower = ci_mat[, 1],
CI_upper = ci_mat[, 2]
)
#5. Determine significance based on whether the CI includes 0
results_table$Significant <- ifelse(
results_table$CI_lower > 0 | results_table$CI_upper < 0,
"Yes",
"No"
)
rownames(results_table) <- NULL # Remove row names for a cleaner table
results_table
Path Estimate CI_lower CI_upper Significant
1 Indirect via M1 (a1*b1) -0.2714 -0.4314 -0.1249 Yes
2 Indirect via M2 (a2*b2) -0.0770 -0.1991 0.0303 No
3 Total indirect effect -0.3485 -0.5606 -0.1608 Yes
4 Direct effect (c') -0.0699 -0.3700 0.1975 No
5 Total effect (c) -0.4184 -0.6641 -0.1862 Yes
To replicate this procedure in lavaan, set
se = "bootstrap" in the call to sem(). You can
use the optional arguments iseed to make results
reproducible and bootstrap to change the number of
replications. (The default is 1000.)
library(lavaan)
# specify the parallel mediation model syntax
model_parallel <- '
M1 ~ a1 * X
M2 ~ a2 * X
Y ~ b1 * M1 + b2 * M2 + c * X
indirect1 := a1 * b1
indirect2 := a2 * b2
total_indirect := indirect1 + indirect2
total := c + total_indirect
'
# estimate the model using maximum likelihood
fit_parallel <- sem(model_parallel,
data = Data,
se = "bootstrap", # use bootstrapping for SE
iseed = 2026,
bootstrap = 5000) # set number of bootstrap replications
summary(fit_parallel,
ci = TRUE) # include confidence intervals
lavaan 0.6-21 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 8
Number of observations 100
Model Test User Model:
Test statistic 1.706
Degrees of freedom 1
P-value (Chi-square) 0.191
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 5000
Number of successful bootstrap draws 5000
Regressions:
Estimate Std.Err z-value P(>|z|) ci.lower ci.upper
M1 ~
X (a1) 0.592 0.089 6.644 0.000 0.417 0.765
M2 ~
X (a2) -0.125 0.090 -1.384 0.167 -0.307 0.050
Y ~
M1 (b1) -0.459 0.102 -4.481 0.000 -0.654 -0.259
M2 (b2) 0.616 0.100 6.178 0.000 0.427 0.817
X (c) -0.070 0.149 -0.470 0.639 -0.363 0.221
Variances:
Estimate Std.Err z-value P(>|z|) ci.lower ci.upper
.M1 0.884 0.125 7.060 0.000 0.643 1.124
.M2 0.898 0.132 6.806 0.000 0.640 1.157
.Y 0.959 0.127 7.531 0.000 0.687 1.187
Defined Parameters:
Estimate Std.Err z-value P(>|z|) ci.lower ci.upper
indirect1 -0.271 0.078 -3.465 0.001 -0.439 -0.131
indirect2 -0.077 0.060 -1.292 0.196 -0.206 0.029
total_indirect -0.348 0.102 -3.411 0.001 -0.561 -0.161
total -0.418 0.131 -3.185 0.001 -0.681 -0.166
If we sampled without replacement we would just be analyzing the same data over and over.↩︎