This note roughly summarizes how SEM programs such as lavaan generate bootstrapped confidence intervals for estimated parameters.

First let’s simulate data for parallel mediation:

# Set a random seed for reproducibility
set.seed(1984)

# Independent variable 
X <- rnorm(100)

# Mediator 1: positively influenced by X 
# True a1 = 0.5
M1 <- 0.5 * X + rnorm(100)

# Mediator 2: negatively influenced by X
# True a2 = -0.35
M2 <- -0.35 * X + rnorm(100)

# Outcome: influenced by both mediators
# True b1 (M1->Y) = -0.48,  True b2 (M2->Y) = 0.7,  no direct X->Y effect
Y <- -0.48 * M1 + 0.7 * M2 + 0.05*X + rnorm(100)

# Combine into a data frame
Data <- data.frame(X = X, Y = Y, M1 = M1, M2 = M2)

The idea of bootstrapping is to treat your sample as a surrogate for the population and resample from it with replacement to create a new sample 1. We do this many times, say 1000 times, to create 1000 new samples. Then, using each sample, we re-run our analysis 1000 times. We then use these bootstrap replicates to create a nonparametric 95% confidence interval using the 0.025 and 0.975 percentiles.

To implement the bootstrap procedure we’ll use the boot package that comes with base R. The boot packages requires a function that it can use to run the analysis with each new sample. The boot() function then performs the resampling for us and runs the analysis defined in our function. Let’s load the package and define our function, which we call boot_fn().

library(boot) 

#1. Define the bootstrap function 
boot_fn <- function(data, indices) {
  d <- data[indices, ]
 
  # Fit the the 3 regression models for the mediation paths
  m1 <- lm(M1 ~ X, data = d)
  m2 <- lm(M2 ~ X, data = d)
  y  <- lm(Y ~ M1 + M2 + X, data = d)
  # Fit the total effect model
  total_model <- lm(Y ~ X, data = d)
  
  # Extract path coefficient estimates
  a1 <- coef(m1)["X"]
  a2 <- coef(m2)["X"]
  b1 <- coef(y)["M1"]
  b2 <- coef(y)["M2"]
  cp <- coef(y)["X"]
  c_total <- coef(total_model)["X"]
  
  # Define the indirect effects
  ind_m1  <- a1 * b1
  ind_m2  <- a2 * b2
  tot_ind <- ind_m1 + ind_m2
  
  # Return the effects to be bootstrapped
  c(
    ind_m1 = ind_m1,
    ind_m2 = ind_m2,
    total_indirect = tot_ind,
    direct_effect = cp,
    total_effect = c_total
  )
}

The function might look complicated but it’s basically fitting three simple models with a few extra steps to extract coefficients and calculate indirect and total effects. The boot_fn() function takes two arguments: data and indices. The data argument is simply for our data frame. The indices argument performs the resampling, which is performed in the first line of code: d <- data[indices, ].

Let’s try it out. First we resample the row numbers (or indices) of our data using the sample() function with replacement. These numbers are used to resample rows from our data frame.

i <- sample(x = nrow(Data), replace = TRUE)
head(i)
[1] 59 35 61 45 19 70

Now use boot_fn() to generate one bootstrap replicate.

boot_fn(data = Data, indices = i)
        ind_m1.X         ind_m2.X total_indirect.X  direct_effect.X 
      -0.1412019       -0.1071022       -0.2483041       -0.1576868 
  total_effect.X 
      -0.4059909 

We can try it again with a new sample of our data.

i <- sample(x = nrow(Data), replace = TRUE)
boot_fn(data = Data, indices = i)
        ind_m1.X         ind_m2.X total_indirect.X  direct_effect.X 
     -0.20865110      -0.06333337      -0.27198447      -0.21559863 
  total_effect.X 
     -0.48758310 

Notice we get slightly different results. This is the idea of bootstrapping: resample our data, run the analysis, and then summarize the variability in our results using percentile confidence intervals.

The boot() function automates the resampling for us. Be sure and save the result. Below we save the result as “boot_res”. Specify the number of resamples using the R argument and supply your function to the statistic argument. Setting a seed allows you to replicate the results. The function will take a few seconds to run.

#2. Run the bootstrap 
set.seed(123)

boot_res <- boot(
  data = Data,
  statistic = boot_fn,
  R = 1000
)

The bootstrap replicates are stored in the t element of the “boot_res” object. We can use head() to look at the first six rows of the matrix. Column 1 contains the estimates for ind_m1, column2 contains the estimates of ind_m2, and so on.

head(boot_res$t)
           [,1]         [,2]       [,3]        [,4]       [,5]
[1,] -0.2918441 -0.099897322 -0.3917414  0.06509423 -0.3266472
[2,] -0.2164570 -0.080051348 -0.2965083 -0.02264087 -0.3191492
[3,] -0.2589680 -0.107395879 -0.3663639  0.01051310 -0.3558508
[4,] -0.3216455 -0.126339726 -0.4479853  0.08787194 -0.3601133
[5,] -0.2310214 -0.009286045 -0.2403075 -0.04579584 -0.2861033
[6,] -0.3358054 -0.068777565 -0.4045829 -0.05674194 -0.4613249

We can calculate a 95% percentile confidence interval by finding the 0.025 and 0.975 percentiles of one of the columns using the quantile() function. For example, for the ind_m1 column:

quantile(boot_res$t[,1], probs = c(0.025, 0.975))
      2.5%      97.5% 
-0.4276461 -0.1271311 

The boot package provides the bootci() function to do this for us. Use the type argument to specify “perc” for percentile intervals and the index argument to specify which column to calculate the 95% confidence interval for. Below we get the confidence interval for the first column, ind_m1.

boot.ci(boot_res, type = "perc", conf = 0.95, index = 1)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL : 
boot.ci(boot.out = boot_res, conf = 0.95, type = "perc", index = 1)

Intervals : 
Level     Percentile     
95%   (-0.4314, -0.1249 )  
Calculations and Intervals on Original Scale

To help compute all five confidence intervals, we define a new function called get_ci() which saves the result of boot.ci(), extracts the confidence interval, and rounds it to four decimal places. We then apply this function to the numbers 1 - 5 to get all five confidence intervals and present the results in a data frame. Notice we use the t() function to transpose the result.

#3. Extract percentile bootstrap confidence intervals 
get_ci <- function(i) {
  ci <- boot.ci(boot_res, type = "perc", index = i)
  round(ci$percent[4:5], 4)
}

# Apply the get_ci function to the 5 effects
ci_mat <- t(sapply(1:5, get_ci))
ci_mat
        [,1]    [,2]
[1,] -0.4314 -0.1249
[2,] -0.1991  0.0303
[3,] -0.5606 -0.1608
[4,] -0.3700  0.1975
[5,] -0.6641 -0.1862

We can then add these results to a data frame with proper labels.

#4. Create a summary table
results_table <- data.frame(
  Path = c(
    "Indirect via M1 (a1*b1)",
    "Indirect via M2 (a2*b2)",
    "Total indirect effect",
    "Direct effect (c')",
    "Total effect (c)"
  ),
  
  Estimate = round(boot_res$t0, 4), #boot_res$t0 contains the original-sample estimates
  CI_lower = ci_mat[, 1],
  CI_upper = ci_mat[, 2]
)

#5. Determine significance based on whether the CI includes 0  
results_table$Significant <- ifelse(
  results_table$CI_lower > 0 | results_table$CI_upper < 0,
  "Yes",
  "No"
)

rownames(results_table) <- NULL # Remove row names for a cleaner table
results_table
                     Path Estimate CI_lower CI_upper Significant
1 Indirect via M1 (a1*b1)  -0.2714  -0.4314  -0.1249         Yes
2 Indirect via M2 (a2*b2)  -0.0770  -0.1991   0.0303          No
3   Total indirect effect  -0.3485  -0.5606  -0.1608         Yes
4      Direct effect (c')  -0.0699  -0.3700   0.1975          No
5        Total effect (c)  -0.4184  -0.6641  -0.1862         Yes

To replicate this procedure in lavaan, set se = "bootstrap" in the call to sem(). You can use the optional arguments iseed to make results reproducible and bootstrap to change the number of replications. (The default is 1000.)

library(lavaan)
# specify the parallel mediation model syntax

model_parallel <- '
  M1 ~ a1 * X
  M2 ~ a2 * X
  Y  ~ b1 * M1 + b2 * M2 + c * X

  indirect1      := a1 * b1
  indirect2      := a2 * b2
  total_indirect := indirect1 + indirect2
  total          := c + total_indirect
'

# estimate the model using maximum likelihood
fit_parallel <- sem(model_parallel, 
                    data = Data,
                    se = "bootstrap",  # use bootstrapping for SE
                    iseed = 2026,
                    bootstrap = 5000)  # set number of bootstrap replications 

summary(fit_parallel,
        ci = TRUE) # include confidence intervals 
lavaan 0.6-21 ended normally after 1 iteration

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                         8

  Number of observations                           100

Model Test User Model:
                                                      
  Test statistic                                 1.706
  Degrees of freedom                                 1
  P-value (Chi-square)                           0.191

Parameter Estimates:

  Standard errors                            Bootstrap
  Number of requested bootstrap draws             5000
  Number of successful bootstrap draws            5000

Regressions:
                   Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
  M1 ~                                                                  
    X         (a1)    0.592    0.089    6.644    0.000    0.417    0.765
  M2 ~                                                                  
    X         (a2)   -0.125    0.090   -1.384    0.167   -0.307    0.050
  Y ~                                                                   
    M1        (b1)   -0.459    0.102   -4.481    0.000   -0.654   -0.259
    M2        (b2)    0.616    0.100    6.178    0.000    0.427    0.817
    X          (c)   -0.070    0.149   -0.470    0.639   -0.363    0.221

Variances:
                   Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
   .M1                0.884    0.125    7.060    0.000    0.643    1.124
   .M2                0.898    0.132    6.806    0.000    0.640    1.157
   .Y                 0.959    0.127    7.531    0.000    0.687    1.187

Defined Parameters:
                   Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
    indirect1        -0.271    0.078   -3.465    0.001   -0.439   -0.131
    indirect2        -0.077    0.060   -1.292    0.196   -0.206    0.029
    total_indirect   -0.348    0.102   -3.411    0.001   -0.561   -0.161
    total            -0.418    0.131   -3.185    0.001   -0.681   -0.166

  1. If we sampled without replacement we would just be analyzing the same data over and over.↩︎