For SEM Workshop HW #1 Code, skip to Katrina Data: SEM HW 1
This is a short introduction to using Lavaan for mediation and SEM models. I’m using a very basic SEM model with a latent variable and single outcome.
pkgs <- c("tidyverse",
"dplyr",
"haven",
"foreign",
"lme4",
"nlme",
"lsr",
"emmeans",
"afex",
"knitr",
"kableExtra",
"car",
"mediation",
"rockchalk",
"multilevel",
"bda",
"gvlma",
"stargazer",
"QuantPsyc",
"pequod",
"MASS",
"texreg",
"pwr",
"effectsize",
"semPlot",
"lmtest",
"semptools",
"conflicted",
"nnet",
"ordinal",
"DescTools")
packages <- rownames(installed.packages())
p_to_install <- pkgs[!(pkgs %in% packages)]
if(length(p_to_install) > 0){
install.packages(p_to_install)
}
lapply(pkgs, library, character.only = TRUE)
# devtools::install_github("cardiomoon/semMediation")
library(semMediation)
# tell R which package to use for functions that are in multiple packages
these_functions <- c("mutate", "select", "summarize", "filter")
lapply(these_functions, conflict_prefer, "dplyr")
conflict_prefer("mutate", "dplyr")
conflict_prefer("select", "dplyr")
conflict_prefer("summarize", "dplyr")The Baron & Kenny method is among the original methods for testing for mediation but tends to have low statistical power. We’re covering it here because it provides a very clear approach to establishing relationships between variables and is still occasionally requested by reviewers.
Mediation tests whether the effects of X (the independent variable) on Y (the dependent variable) operate through a third variable, M (the mediator). In this way, mediators explain the causal relationship between two variables or “how” the relationship works.
c = the total effect of X on Y with no consideration of mediator variables
c' = the direct effect of X on Y after controlling for M
ab = the indirect effect of X on Y through M
The above shows the standard mediation model. Full mediation occurs when the effect of X on Y disappears with M in the model. Partial mediation occurs when the effect of X on Y decreases by a nontrivial amount (the actual amount is up for debate) with M in the model.
In this example we’ll say we are interested in whether the number of hours since dawn (X) affect the subjective ratings of wakefulness (Y) 100 graduate students through the consumption of coffee (M).
Note that we are intentionally creating a mediation effect here (because statistics is always more fun if we have something to find) and we do so below by creating M so that it is related to X and Y so that it is related to M. This creates the causal chain for our analysis to parse.
set.seed(123) # Standardizes the numbers generated by rnorm
N <- 100 # Number of participants; graduate students
X.hours <- rnorm(N, 175, 7) # IV; hours since dawn
M.coffee <- 0.7*X.hours + rnorm(N, 0, 5) # Suspected mediator; coffee consumption
Y.wake <- 0.4*M.coffee + rnorm(N, 0, 5) # DV; wakefulness
Meddata <- data.frame(X.hours, M.coffee, Y.wake)This is the original 4-step method used to describe a mediation effect. Steps 1 and 2 use basic linear regression while steps 3 and 4 use multiple regression.
Now, let’s apply the Baron & Kenny method to our data:
The Steps:
#1. Total Effect
fit <- lm(Y.wake ~ X.hours, data=Meddata)
#2. Path A (X on M)
fita <- lm(M.coffee ~ X.hours, data=Meddata)
#3. Path B (M on Y, controlling for X)
fitb <- lm(Y.wake ~ M.coffee + X.hours, data=Meddata)
#4. Reversed Path C (Y on X, controlling for M)
fitc <- lm(X.hours ~ Y.wake + M.coffee, data=Meddata)#Summary Table
stargazer::stargazer(fit, fita, fitb, fitc, type="html", digits = 2, font.size = "footnotesize",title = "Baron and Kenny Method")| Dependent variable: | ||||
| Y.wake | M.coffee | Y.wake | X.hours | |
| (1) | (2) | (3) | (4) | |
| Y.wake | -0.11 | |||
| (0.10) | ||||
| M.coffee | 0.42*** | 0.70*** | ||
| (0.10) | (0.08) | |||
| X.hours | 0.17** | 0.66*** | -0.11 | |
| (0.08) | (0.08) | (0.10) | ||
| Constant | 19.88 | 6.04 | 17.32 | 96.11*** |
| (14.26) | (13.42) | (13.16) | (9.28) | |
| Observations | 100 | 100 | 100 | 100 |
| R2 | 0.04 | 0.43 | 0.19 | 0.44 |
| Adjusted R2 | 0.03 | 0.43 | 0.18 | 0.43 |
| Residual Std. Error | 5.16 (df = 98) | 4.85 (df = 98) | 4.76 (df = 97) | 4.82 (df = 97) |
| F Statistic | 4.34** (df = 1; 98) | 75.31*** (df = 1; 98) | 11.72*** (df = 2; 97) | 38.39*** (df = 2; 97) |
| Note: | p<0.1; p<0.05; p<0.01 | |||
Since the relationship between hours since dawn and wakefulness is no longer significant when controlling for coffee consumption, this suggests that coffee consumption does in fact mediate this relationship. However, this method alone does not allow for a formal test of the indirect effect so we don’t know if the change in this relationship is truly meaningful.
library(lavaan)
model <- "
# regressions
M.coffee ~ a*X.hours # Path a: X.hours predicts coffee
Y.wake ~ b*M.coffee # Path b: Coffee predicts wakefulness
Y.wake ~ c*X.hours # Path c (aka direct effect): X.hours predicts wakefulness
# indirect effect
ind := a*b # Path ab: the product of path a (M~X) and path b (Y~X)
# total effect: direct + indirect
total := c + (a*b) # overall effect of X.hours on Y.wake, accounting for both the direct # effect and indirect.
"lavaan 0.6.17 ended normally after 1 iteration
Estimator ML
Optimization method NLMINB
Number of model parameters 5
Number of observations 100
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 78.650
Degrees of freedom 3
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -595.177
Loglikelihood unrestricted model (H1) -595.177
Akaike (AIC) 1200.354
Bayesian (BIC) 1213.380
Sample-size adjusted Bayesian (SABIC) 1197.589
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
M.coffee ~
X.hours (a) 0.663 0.076 8.766 0.000 0.663 0.659
Y.wake ~
M.coffee (b) 0.424 0.097 4.347 0.000 0.424 0.519
X.hours (c) -0.112 0.098 -1.141 0.254 -0.112 -0.136
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.M.coffee 23.086 3.265 7.071 0.000 23.086 0.565
.Y.wake 21.945 3.104 7.071 0.000 21.945 0.805
R-Square:
Estimate
M.coffee 0.435
Y.wake 0.195
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ind 0.281 0.072 3.894 0.000 0.281 0.342
total 0.169 0.080 2.103 0.035 0.169 0.206
Model Fit - The chi-square is 0 with 0 dfs, indicating that the model is fully saturated. The model perfectly fits the data.
Path A: More hours since dawn (X) predicts higher rate of coffee consumption (M).
Path B: More coffee consumption (M) means more wakefulness (Y).
Path C: Direct effect of hours since dawn (X) on wakefulness (Y) is not significant.
Indirect effect: When accounting for coffee consumption (M), hours since dawn (X) on wakefulness (Y) is significant.
Total effect: When accounting for coffee consumption (M), hours since dawn (X) on wakefulness (Y) is significant.
R-squared for M.coffee = 0.435, so 43.5% of the variance in coffee consumption is explained by hours since dawn.
R-squared for Y.wake = 0.195, so 19.5% of the variance in wakefulness is explained by coffee consumption and hours since dawn combined.
SEM Question: How do hours since dawn and coffee consumption (mediator) affect a latent construct of wakefulness, as measured by alertness, focus, and mood?
# Adding variables that represent wakefulness. These will define our new latent variable
Y_alertness <- 0.6*Y.wake + rnorm(N, 0, 2)
Y_focus <- 0.7*Y.wake + rnorm(N, 0, 2)
Y_mood <- 0.5*Y.wake + rnorm(N, 0, 2)
# Combine into the dataset
Meddata_sem <- data.frame(X.hours, M.coffee, Y_alertness, Y_focus, Y_mood)sem_model <- '
# latent variable for wakefulness
Y.wake_latent =~ Y_alertness + Y_focus + Y_mood
# mediation path
M.coffee ~ a*X.hours
Y.wake_latent ~ b*M.coffee
Y.wake_latent ~ c*X.hours
# indirect effect of X.hours on wakefulness through coffee
ab := a*b
total := c + (a*b)
'
fit_sem <- sem(sem_model, data = Meddata_sem)
summary(fit_sem, standardized = T,fit.measures=T,rsquare=T)lavaan 0.6.17 ended normally after 33 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 10
Number of observations 100
Model Test User Model:
Test statistic 5.455
Degrees of freedom 4
P-value (Chi-square) 0.244
Model Test Baseline Model:
Test statistic 244.930
Degrees of freedom 10
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.994
Tucker-Lewis Index (TLI) 0.985
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -1018.423
Loglikelihood unrestricted model (H1) -1015.696
Akaike (AIC) 2056.847
Bayesian (BIC) 2082.899
Sample-size adjusted Bayesian (SABIC) 2051.316
Root Mean Square Error of Approximation:
RMSEA 0.060
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.172
P-value H_0: RMSEA <= 0.050 0.362
P-value H_0: RMSEA >= 0.080 0.475
Standardized Root Mean Square Residual:
SRMR 0.029
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
Y.wake_latent =~
Y_alertness 1.000 2.888 0.784
Y_focus 1.212 0.131 9.287 0.000 3.502 0.896
Y_mood 0.970 0.107 9.064 0.000 2.802 0.857
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
M.coffee ~
X.hours (a) 0.663 0.076 8.766 0.000 0.663 0.659
Y.wake_latent ~
M.coffee (b) 0.233 0.061 3.795 0.000 0.081 0.516
X.hours (c) -0.092 0.059 -1.545 0.122 -0.032 -0.202
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.Y_alertness 5.228 0.926 5.646 0.000 5.228 0.385
.Y_focus 3.003 0.901 3.335 0.001 3.003 0.197
.Y_mood 2.837 0.650 4.362 0.000 2.837 0.265
.M.coffee 23.086 3.265 7.071 0.000 23.086 0.565
.Y.wake_latent 6.929 1.558 4.447 0.000 0.831 0.831
R-Square:
Estimate
Y_alertness 0.615
Y_focus 0.803
Y_mood 0.735
M.coffee 0.435
Y.wake_latent 0.169
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
ab 0.154 0.044 3.483 0.000 0.053 0.340
total 0.063 0.048 1.310 0.190 0.022 0.138
Model Fit: chi-square represents how well the model fits the data. The fact that it’s not significant means there’s no significant difference between the predicted vs actual data.
Latent Variables: these estimates represent the factor loadings. So, how well do these variables indicate wakefulness? Note that “alertness” is fixed at 1.00 to act as a reference for the other loadings and our latent variable (wake) is measured on the same scale as alertness. They’re all significant and strong indicators.
Path A: Hours since dawn (X) predicts coffee consumption (M).
Path B: Coffee consumption (M) predicts wakefulness (Y).
Path C: No significant direct relationship between hours since dawn (X) and wakefulness (Y)
Indirect Effect: Significant indirect effect of hours (X) on wakefulness (Y) through coffee (M)
Total Effect: When we combine the direct and indirect effects, the total effect is basically not significant (p = 0.098). So, it seems that hours (X) has a marginal/not significant effect on wake (Y) and the relationship is primarily driven by coffee (M).
MPlus Code
VARIABLE:
NAMES ARE y1-y23 rc mm pc as;
USEVARIABLES ARE rc mm pc as;
ANALYSIS:
TYPE IS GENERAL;
ESTIMATOR IS ML;
ITERATIONS = 1000;
CONVERGENCE = 0.00005;
BOOTSTRAP = 5000;
MODEL:
mm on rc;
pc on rc;
as on rc mm pc;
mm WITH pc;
Model indirect: as IND mm rc;
as IND pc rc;
Katrina dataset variables
Sample code for multiple mediation
multipleMediation <- ’ store the model
Y ~ b1 * M1 + b2 * M2 + c * X DV ~ Mediator1 + Mediator2 + IV
M1 ~ a1 * X A path 1 (first mediation)
M2 ~ a2 * X A path 2 (second mediation)
M1 ~~ M2 relationship between mediators
indirect1 := a1 * b1 first indirect effect (X and M1)
indirect2 := a2 * b2 second indirect effect (X and M2)
total := c + (a1 * b1) + (a2 * b2) total effect ’
In the results table below, the “Std.all” column matches the MPlus output.
[1] "asds1_1_1" "asds1_2_1" "asds1_3_1" "asds1_4_1" "asds1_5_1" "asds1_6_1"
[7] "asds1_7_1" "asds1_8_1" "asds1_9_1" "asds1_10_1" "asds1_11_1" "asds1_12_1"
[13] "asds1_13_1" "asds1_14_1" "asds1_15_1" "asds1_16_1" "asds1_17_1" "asds1_18_1"
[19] "asds1_19_1" "RACE" "PRC" "PRP" "SC" "RC"
[25] "MM" "PC" "AS"
library(lavaan)
sem_katrina <- '
# mediation path
MM ~ a1*RC
PC ~ a2*RC
AS ~ c*RC + b1*MM + b2*PC
MM~~PC
# indirect effect of X.hours on wakefulness through coffee
indirect1 := a1*b1
indirect2 := a2*b2
total := c + (a1*b2) + (a2*b2)
'
medsem <- sem(model = sem_katrina, data = katrina_data, se="bootstrap")
summary(medsem, standardized = T,fit.measures=T,rsquare=T)lavaan 0.6.17 ended normally after 11 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 9
Number of observations 132
Model Test User Model:
Test statistic 0.000
Degrees of freedom 0
Model Test Baseline Model:
Test statistic 30.956
Degrees of freedom 6
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 1.000
Tucker-Lewis Index (TLI) 1.000
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -999.398
Loglikelihood unrestricted model (H1) -999.398
Akaike (AIC) 2016.796
Bayesian (BIC) 2042.741
Sample-size adjusted Bayesian (SABIC) 2014.274
Root Mean Square Error of Approximation:
RMSEA 0.000
90 Percent confidence interval - lower 0.000
90 Percent confidence interval - upper 0.000
P-value H_0: RMSEA <= 0.050 NA
P-value H_0: RMSEA >= 0.080 NA
Standardized Root Mean Square Residual:
SRMR 0.000
Parameter Estimates:
Standard errors Bootstrap
Number of requested bootstrap draws 1000
Number of successful bootstrap draws 995
Regressions:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
MM ~
RC (a1) -0.035 0.028 -1.280 0.200 -0.035 -0.118
PC ~
RC (a2) 0.114 0.069 1.657 0.097 0.114 0.152
AS ~
RC (c) 0.478 0.524 0.912 0.362 0.478 0.068
MM (b1) 6.631 1.811 3.661 0.000 6.631 0.283
PC (b2) -1.816 0.793 -2.289 0.022 -1.816 -0.194
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.MM ~~
.PC -0.346 0.142 -2.436 0.015 -0.346 -0.216
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.MM 0.643 0.076 8.483 0.000 0.643 0.986
.PC 3.994 0.350 11.419 0.000 3.994 0.977
.AS 308.962 30.681 10.070 0.000 308.962 0.861
R-Square:
Estimate
MM 0.014
PC 0.023
AS 0.139
Defined Parameters:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
indirect1 -0.234 0.198 -1.180 0.238 -0.234 -0.033
indirect2 -0.207 0.163 -1.266 0.206 -0.207 -0.030
total 0.336 0.529 0.634 0.526 0.336 0.062