First, you need to install and load the required package lavaan. Then you need to load the dataset for the first part of the seminar, path.csv, which can be found on Minerva.
# Load the required package
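# install.packages("lavaan")  # run this line once first if lavaan is not yet installed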
library("lavaan")
## This is lavaan 0.6-15
## lavaan is FREE software! Please report any bugs.
# Load data
dat <- read.csv("path.csv")
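A quick optional check that the data loaded as expected:

str(dat)   # should show 200 observations of 4 variables: hs, col, gre, grad
head(dat)  # first few rows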
Structural equation modelling (SEM) is a label for a diverse set of methods that allow the simultaneous analysis of multiple interrelated variables, including both observed and latent variables. As you can see from the figure below, different combinations of single or multiple endogenous variables (latent or observed) lead to different methods within the structural equation modelling family. Linear regression can also be considered a special case of structural equation modelling, where we have one outcome variable and no latent variables. We will start the seminar by showing how to perform a linear regression with lavaan.
*(Figure: overview of the SEM family of methods, showing how combinations of observed/latent variables and single/multiple endogenous variables correspond to different methods.)*
For the purpose of the first part of the seminar, we will examine the relationships among variables predicting success in graduate school, as assessed by measures of grade point average (gpa) in high school and college/university, and the result of the Graduate Record Exam (gre). For this exercise you will use the dataset (path.csv), which contains 200 observations of four variables:

- hs: high school gpa
- col: college gpa
- gre: Graduate Record Exam result
- grad: graduate school gpa
These are all observed variables, and the main outcome variable is graduate school gpa (grad). The first task is to use the lavaan package to fit a simple multiple linear regression in which the outcome variable is directly predicted by the three independent variables. The syntax is first to define the model and then to fit it with the cfa() function; as with the previous methods we have covered, we take the summary of the fitted model.
model <- ' grad ~ 1 + gre + col + hs '
fit <- cfa(model = model, data = dat)
summary(fit, rsquare=TRUE)
## lavaan 0.6.15 ended normally after 16 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 5
##
## Number of observations 200
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## grad ~
## gre 0.369 0.078 4.755 0.000
## col 0.123 0.084 1.465 0.143
## hs 0.372 0.075 4.937 0.000
##
## Intercepts:
## Estimate Std.Err z-value P(>|z|)
## .grad 6.971 3.506 1.989 0.047
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .grad 59.998 6.000 10.000 0.000
##
## R-Square:
## Estimate
## grad 0.477
If you perform the familiar linear regression with the lm() function as below, you will see that the results are essentially identical. The main difference is that cfa() uses an ML estimator.
model1 <- lm(grad ~ gre + col + hs, data = dat)
summary(model1)
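To see this correspondence directly, you can compare the estimates side by side (a quick optional check):

coef(model1)  # OLS estimates from lm()
coef(fit)     # ML estimates from lavaan (slopes, intercept, residual variance)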
In the previous syntax, the inclusion of "1" in the model formula was for the intercept. In most CFA and SEM applications we do not really care about the intercepts, and therefore they will not be included from here on.
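For example, dropping the "1" leaves the slope estimates unchanged but removes the intercept from the output (a minimal sketch; model_noint is an illustrative name):

model_noint <- ' grad ~ gre + col + hs '
fit_noint <- cfa(model = model_noint, data = dat)
summary(fit_noint, rsquare = TRUE)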
The purpose of the previous task was to show you that SEM analysis, in its simplest form, is a linear regression. However, the purpose of using SEM is to model more complex relationships. The following relationships are possible in SEM:
- direct and indirect effects analysis
- latent to observed variables analysis (confirmatory factor analysis)
- latent to latent variables analysis (structural regression)
An extension of the previous linear analysis is to consider not only direct but also indirect relationships among the variables. In that case we will also have more than one endogenous variable (a variable that is influenced by other variables in the model). For example, imagine that we want to examine the relationships depicted in the following path diagram.
Can you understand the model described, and identify which variable(s) are exogenous (not explained within the model) and which are endogenous (explained within the model)?
The above diagram describes a model with more than one endogenous variable. Specifically, the arrows describe regression paths, from which we can read the following relationships: hs predicts col; col predicts gre; and col and gre together predict grad.
Task: Try to define and run the model that was described in the path diagram.
model2 <- 'col ~ hs
gre ~ col
grad ~ col + gre '
fit <- cfa(model = model2, data = dat)
summary(fit, fit.measures=TRUE, standardized= TRUE, rsquare=TRUE)
## lavaan 0.6.15 ended normally after 1 iteration
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 7
##
## Number of observations 200
##
## Model Test User Model:
##
## Test statistic 44.429
## Degrees of freedom 2
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 362.474
## Degrees of freedom 6
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.881
## Tucker-Lewis Index (TLI) 0.643
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -2062.830
## Loglikelihood unrestricted model (H1) -2040.616
##
## Akaike (AIC) 4139.660
## Bayesian (BIC) 4162.748
## Sample-size adjusted Bayesian (SABIC) 4140.571
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.326
## 90 Percent confidence interval - lower 0.247
## 90 Percent confidence interval - upper 0.412
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 1.000
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.102
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## col ~
## hs 0.605 0.048 12.500 0.000 0.605 0.662
## gre ~
## col 0.625 0.056 11.101 0.000 0.625 0.617
## grad ~
## col 0.317 0.079 4.014 0.000 0.317 0.276
## gre 0.492 0.078 6.303 0.000 0.492 0.434
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .col 49.025 4.903 10.000 0.000 49.025 0.561
## .gre 55.313 5.531 10.000 0.000 55.313 0.619
## .grad 67.311 6.731 10.000 0.000 67.311 0.587
##
## R-Square:
## Estimate
## col 0.439
## gre 0.381
## grad 0.413
This time we have included more information in the output. First, we have standardized estimates, requested with the argument standardized = TRUE, and second, we have measures of fit, requested with fit.measures = TRUE. These measures include the CFI, TLI, RMSEA and SRMR. Based on the commonly used thresholds (e.g., CFI and TLI close to 0.95 or above, RMSEA below roughly 0.06-0.08), this model does not fit well.
Task:
1) Compare these results with a smaller model: what happens if you delete the regression path col → gre?
2) Compare with a larger model: what happens if you add the path hs → gre?
3) Check what happens to the fit measures if you include all possible relationships (add the paths hs → gre and hs → grad). Can you explain why? One way to set this up is sketched below.
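This sketch, assuming the object names used above (the model2 fit is stored in fit), fits the fully saturated path model for task 3 and compares it with model2 via a likelihood-ratio test. A model with all possible directed paths has zero degrees of freedom and therefore fits perfectly by construction.

model3 <- ' col ~ hs
gre ~ hs + col
grad ~ hs + col + gre '
fit_sat <- cfa(model = model3, data = dat)
summary(fit_sat, fit.measures = TRUE)  # chi-square = 0, df = 0, CFI = 1
# nested models can be compared with a likelihood-ratio (chi-square) test
anova(fit, fit_sat)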
A mediation effect occurs when the observed relationship between a dependent and an independent variable is explained through a third variable, also known as the mediator.
In mediation, the idea is that the effect we observe from an independent variable (x) on the dependent variable (y) could be partially or fully explained through the effect of x on another variable (m). For example, the level of education may predict spending. However, the true underlying relationship could be that the level of education influences spending through the effect it has on people's income.
In order to test whether a mediated relationship exists, we first need to make sure that there is a significant direct effect, which means that x is correlated with y. If there is no relationship between x and y, there is no need for a mediation analysis. If this relationship is established, we then need to make sure that x and m are also related, and the same should hold for m and y.
If, after the introduction of the mediator into our model, the effect of x on y becomes non-significant, we can say that we have a full mediation, which means that the observed relationship operates only through the mediator. However, we may also have the case where the relationship remains significant. In that case what we observe is a partial mediation.
The way we do this in lavaan is by creating placeholders in the syntax. Each placeholder will store the estimate of the path coefficient of the corresponding equation (a, b, c). Using the notation from the previous graph, the direct effect will be c, the indirect effect will be the product a*b, and the total effect will be the direct effect plus the indirect effect, c + a*b.
So imagine that we had initially assumed that the following was our true model: a simple multiple linear regression where the variables hs and col predict/explain grad.
Task: Define and run the described model in lavaan.
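A minimal sketch of one possible answer (the object names model_direct and fit_direct are illustrative):

model_direct <- ' grad ~ hs + col '
fit_direct <- cfa(model = model_direct, data = dat)
summary(fit_direct, rsquare = TRUE)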
Now we want to introduce the gre variable as a mediator and perform the mediation analysis we discussed, so our new model will have the following form.
Now we will examine the mediating effect of gre on the effect of hs on grad. To do this we will use the placeholders. In the code below the placeholders a, b and c are used for convenience, but you could use any letters.
model <- '
# mediator regression (a path)
gre ~ a*hs
gre ~ col
# outcome regression (c: direct path, b: mediator path)
grad ~ c*hs
grad ~ col
grad ~ b*gre
# indirect effect (a*b)
hsgre := a*b
# total effect
total := c + (a*b)
'
fit <- cfa(model = model, data = dat) # default (delta-method) SEs give the Sobel test for the indirect effect
summary(fit, standardized= TRUE, rsquare=TRUE, ci=TRUE)
## lavaan 0.6.15 ended normally after 1 iteration
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 7
##
## Number of observations 200
##
## Model Test User Model:
##
## Test statistic 0.000
## Degrees of freedom 0
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) ci.lower ci.upper
## gre ~
## hs (a) 0.309 0.065 4.756 0.000 0.182 0.437
## col 0.400 0.071 5.626 0.000 0.261 0.540
## grad ~
## hs (c) 0.372 0.075 4.937 0.000 0.225 0.520
## col 0.123 0.084 1.465 0.143 -0.042 0.288
## gre (b) 0.369 0.078 4.755 0.000 0.217 0.522
## Std.lv Std.all
##
## 0.309 0.335
## 0.400 0.396
##
## 0.372 0.356
## 0.123 0.108
## 0.369 0.326
##
## Variances:
## Estimate Std.Err z-value P(>|z|) ci.lower ci.upper
## .gre 49.694 4.969 10.000 0.000 39.954 59.434
## .grad 59.998 6.000 10.000 0.000 48.238 71.757
## Std.lv Std.all
## 49.694 0.556
## 59.998 0.523
##
## R-Square:
## Estimate
## gre 0.444
## grad 0.477
##
## Defined Parameters:
## Estimate Std.Err z-value P(>|z|) ci.lower ci.upper
## hsgre 0.114 0.034 3.362 0.001 0.048 0.181
## total 0.487 0.075 6.453 0.000 0.339 0.634
## Std.lv Std.all
## 0.114 0.109
## 0.487 0.465
The above results indicate that both the indirect effect (which we defined as hsgre) and the direct effect (hs on grad) are significant; hence, this suggests a partial mediation. Note that the indirect estimate is exactly the product of the two paths: a*b = 0.309 * 0.369 ≈ 0.114.
By default, the test used for the significance of the defined parameters in the mediation is the Sobel test. Its limitation is that it assumes normality of the indirect effect, which may not hold, and therefore a widely used alternative approach is bootstrapping. This can be done with the following syntax:
fit <- cfa(model = model, data = dat, se='bootstrap', bootstrap = 1000)
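After the bootstrapped model has run, percentile confidence intervals for the defined indirect and total effects can be inspected with parameterEstimates(); a minimal sketch:

parameterEstimates(fit, boot.ci.type = "perc", level = 0.95)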
Task: Now it is your turn to perform a mediation analysis, this time examining the mediating effect of gre on the effect of col on grad. Perform the analysis as before, and consider whether the mediation is full or partial and whether the indirect and total effects are significant. One possible setup is sketched below.
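One possible way to set this model up, mirroring the syntax used above (the object names are illustrative):

model_med2 <- '
# mediator regression (a path)
gre ~ a*col
gre ~ hs
# outcome regression (c: direct path, b: mediator path)
grad ~ c*col
grad ~ hs
grad ~ b*gre
# indirect and total effects
colgre := a*b
total := c + (a*b)
'
fit_med2 <- cfa(model = model_med2, data = dat)
summary(fit_med2, rsquare = TRUE, ci = TRUE)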
So far we have examined models with observed variables only. However, the beauty of SEM lies mainly in models with latent variables. In SEM we have two parts: the measurement part, where we test the measurement model (the CFA), and the structural part, where we test the assumed relationships. The syntax is similar to what we have covered so far; what is different is the latent variable definition with the use of the =~ operator. The operators used in lavaan syntax are recapped below.
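As a recap, the core operators of lavaan model syntax are summarised in the snippet below (the example names are illustrative):

# =~   latent variable definition:  factor =~ item1 + item2 + item3
# ~    regression:                  y ~ x1 + x2
# ~~   (residual) covariance:       x1 ~~ x2
# ~ 1  intercept:                   y ~ 1
# :=   defined parameter:           ab := a*b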
The data used in this part are taken from an SEM seminar by the Institute for Digital Research and Education Statistical Consulting, UCLA. This is a very detailed and interesting series of seminars covering EFA, CFA, SEM and more advanced models.
The data contain nine observed variables ascribed to three hypothetical latent variables describing the effects of student background on academic achievement. The nine observed variables are clustered into hypothetical latent constructs based on prior knowledge or theory. In this example, the three hypothesized latent constructs are adjustment, risk, and achievement. Adjustment is defined by the observed variables motivation, harmony, and stability. Similarly, risk is defined by negative parental psychology, socioeconomic status, and verbal IQ. Finally, achievement is composed of reading, arithmetic, and spelling.
Variable abbreviations are:
Motivation = motiv
Harmony = harm
Stability = stabi
Negative parental psychology = ppsych
Socioeconomic status = ses
Verbal IQ = verbal
Reading = read
Arithmetic = arith
Spelling = spell
For this example, we will create a structural model that relates multiple latent variables. First, we create the measurement model by defining each latent variable. Adjustment will be defined/measured by motivation, harmony, and stability. Risk will be defined by verbal IQ, negative parental psychology, and socioeconomic status. Achievement will be defined by reading, arithmetic, and spelling.
Next, we will define the regression path: achievement will be predicted by the combination of adjustment and risk.
We can illustrate this analysis using a path diagram such as the one below.
To sum up, we have three latent variables (circles). Two of them are exogenous (adjustment and risk) and one is endogenous (achievement). The rectangles are the indicators/items, or in simple words, what is actually measured, usually through surveys. The syntax involves first defining the latent variables and then the structural regression:
modelsem <- '
# measurement model / define the latent variables
adjust =~ motiv + harm + stabi
risk =~ verbal + ppsych + ses
achieve =~ read + arith + spell
# regression / structural part
achieve ~ adjust + risk
'
fit3 <- cfa(modelsem, data = dat2) # dat2 is the name given to this second dataset
summary(fit3, standardized= TRUE, fit.measures= TRUE)
## lavaan 0.6.15 ended normally after 130 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 21
##
## Number of observations 500
##
## Model Test User Model:
##
## Test statistic 148.982
## Degrees of freedom 24
## P-value (Chi-square) 0.000
##
## Model Test Baseline Model:
##
## Test statistic 2597.972
## Degrees of freedom 36
## P-value 0.000
##
## User Model versus Baseline Model:
##
## Comparative Fit Index (CFI) 0.951
## Tucker-Lewis Index (TLI) 0.927
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -15517.857
## Loglikelihood unrestricted model (H1) -15443.366
##
## Akaike (AIC) 31077.713
## Bayesian (BIC) 31166.220
## Sample-size adjusted Bayesian (SABIC) 31099.565
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.102
## 90 Percent confidence interval - lower 0.087
## 90 Percent confidence interval - upper 0.118
## P-value H_0: RMSEA <= 0.050 0.000
## P-value H_0: RMSEA >= 0.080 0.990
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.041
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## adjust =~
## motiv 1.000 9.324 0.933
## harm 0.884 0.041 21.774 0.000 8.246 0.825
## stabi 0.695 0.043 15.987 0.000 6.478 0.648
## risk =~
## verbal 1.000 7.319 0.733
## ppsych -0.770 0.075 -10.223 0.000 -5.636 -0.564
## ses 0.807 0.076 10.607 0.000 5.906 0.591
## achieve =~
## read 1.000 9.404 0.941
## arith 0.837 0.034 24.437 0.000 7.873 0.788
## spell 0.976 0.028 34.338 0.000 9.178 0.919
##
## Regressions:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## achieve ~
## adjust 0.375 0.046 8.085 0.000 0.372 0.372
## risk 0.724 0.078 9.253 0.000 0.564 0.564
##
## Covariances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## adjust ~~
## risk 32.098 4.320 7.431 0.000 0.470 0.470
##
## Variances:
## Estimate Std.Err z-value P(>|z|) Std.lv Std.all
## .motiv 12.870 2.852 4.512 0.000 12.870 0.129
## .harm 31.805 2.973 10.698 0.000 31.805 0.319
## .stabi 57.836 3.990 14.494 0.000 57.836 0.580
## .verbal 46.239 4.788 9.658 0.000 46.239 0.463
## .ppsych 68.033 5.068 13.425 0.000 68.033 0.682
## .ses 64.916 4.975 13.048 0.000 64.916 0.650
## .read 11.372 1.608 7.074 0.000 11.372 0.114
## .arith 37.818 2.680 14.109 0.000 37.818 0.379
## .spell 15.560 1.699 9.160 0.000 15.560 0.156
## adjust 86.930 6.830 12.727 0.000 1.000 1.000
## risk 53.561 6.757 7.927 0.000 1.000 1.000
## .achieve 30.685 3.449 8.896 0.000 0.347 0.347
Given this output, we first care about the fit measures, which give an idea of how good our measurement model is, and then, as in any other analysis, we are interested in the coefficients and the statistical significance of the relationships among the latent variables.
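If you only want the headline fit indices rather than the full summary, they can be extracted directly with fitMeasures(); a minimal sketch:

fitMeasures(fit3, c("chisq", "df", "pvalue", "cfi", "tli", "rmsea", "srmr"))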