First you need to install and load the required package lavaan. Then you need to load the dataset for the first part of the seminar (path.csv), which can be found on Minerva.

# Add packages to Library
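# install.packages("lavaan")  # run this once first if lavaan is not yet installed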
library("lavaan")
## This is lavaan 0.6-15
## lavaan is FREE software! Please report any bugs.
# Load data
dat <- read.csv("path.csv")

What is SEM?

Structural equation modeling (SEM) is a label for a diverse set of methods that allow the simultaneous analysis of multiple interrelated variables, including both observed and latent variables. As you can see from the figure below, the combination of single or multiple endogenous variables (latent or observed) leads to different forms of model within the structural equation modelling framework. Linear regression can also be considered a special case of structural equation modeling, where we have one outcome variable and no latent variables. We will start the seminar by showing how to perform a linear regression with lavaan.

[Figure: the different forms of structural equation models]

Dataset Description

For the first part of the seminar we will examine the relationships among variables predicting success in graduate school, as assessed by measures of grade point average (gpa) in high school and college/university, and the result of the Graduate Record Exam (GRE). For this exercise you will use the dataset (path.csv), which contains 200 observations on four variables:

  • the respondent’s high school gpa (hs),
  • college gpa (col),
  • GRE score (gre), and
  • graduate school gpa (grad).

These are observed variables, and the main outcome variable is graduate school gpa (grad). The first task is to use the lavaan package to examine the simple multiple linear regression where the outcome variable is directly predicted by the three independent variables. The syntax for this is first to define the model and then to fit it with the cfa() function; as with the previous methods we covered, we then take the summary of the fitted model.

model <- ' grad ~ 1 + gre + col + hs '  # the "1" requests an intercept
fit <- cfa(model = model, data = dat)
summary(fit,  rsquare=TRUE)
## lavaan 0.6.15 ended normally after 16 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         5
## 
##   Number of observations                           200
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   grad ~                                              
##     gre               0.369    0.078    4.755    0.000
##     col               0.123    0.084    1.465    0.143
##     hs                0.372    0.075    4.937    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .grad              6.971    3.506    1.989    0.047
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .grad             59.998    6.000   10.000    0.000
## 
## R-Square:
##                    Estimate
##     grad              0.477

If you perform the familiar linear regression with the lm() function as below, you will see that the coefficient estimates are identical. The main difference is that cfa() uses a maximum likelihood (ML) estimator, so the residual variance is estimated slightly differently.

model1 <-  lm(grad ~ gre + col + hs,data=dat)
summary(model1)

In the previous syntax the inclusion of “1” in the model was for the intercept. In most CFA and SEM applications we do not really care about the intercepts, and therefore this term will not be included from now on.
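For example, the same regression without the explicit intercept term can be written as follows (a minimal sketch; the object names modelnoint/fitnoint are arbitrary, and the slope estimates are unchanged):

modelnoint <- ' grad ~ gre + col + hs '  # no "1": intercept not modelled
fitnoint <- cfa(model = modelnoint, data = dat)
summary(fitnoint, rsquare = TRUE)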

Direct and Indirect Relationships

The purpose of the previous task was to show you that SEM analysis in its simplest form is a linear regression. However, the purpose of using SEM is to model more complex relationships. The following relationships are possible in SEM:

  • direct and indirect effects analysis,
  • latent to observed variables analysis (confirmatory factor analysis), and
  • latent to latent variables analysis (structural regression).

An extension of the previous linear analysis is to consider not only direct but also indirect relationships among the variables. In that case we will also have more than one endogenous variable (a variable that is influenced by other variables in the model). For example, imagine that we want to examine the relationships that are depicted in the following path diagram.

Can you understand the model described, and which variable(s) are exogenous (not explained within the model) and which are endogenous (explained within the model)?

The above diagram describes a model where we have more than one endogenous variable. Specifically, the arrows describe regression paths. Given that, we can read off the following relationships:

  1. The college gpa (col) is assumed to be affected/predicted by the high school gpa (hs).
  2. The GRE score (gre) is affected by the college gpa (col).
  3. There is no direct relationship from hs to gre, but we can say that there may be an indirect effect through the effect of hs on col.
  4. Finally, graduate school gpa (grad) is affected by both gre and col. In fact, the col variable is assumed in the model to have both an indirect effect (through gre) and a direct effect.

Task: Try to define and run the model that was described in the path diagram.

model2 <- 'col ~ hs            # hs predicts col
           gre ~ col           # col predicts gre
           grad ~ col + gre '  # col and gre predict grad
fit <- cfa(model = model2, data = dat)
summary(fit, fit.measures=TRUE, standardized= TRUE, rsquare=TRUE)
## lavaan 0.6.15 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         7
## 
##   Number of observations                           200
## 
## Model Test User Model:
##                                                       
##   Test statistic                                44.429
##   Degrees of freedom                                 2
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                               362.474
##   Degrees of freedom                                 6
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.881
##   Tucker-Lewis Index (TLI)                       0.643
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)              -2062.830
##   Loglikelihood unrestricted model (H1)      -2040.616
##                                                       
##   Akaike (AIC)                                4139.660
##   Bayesian (BIC)                              4162.748
##   Sample-size adjusted Bayesian (SABIC)       4140.571
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.326
##   90 Percent confidence interval - lower         0.247
##   90 Percent confidence interval - upper         0.412
##   P-value H_0: RMSEA <= 0.050                    0.000
##   P-value H_0: RMSEA >= 0.080                    1.000
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.102
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   col ~                                                                 
##     hs                0.605    0.048   12.500    0.000    0.605    0.662
##   gre ~                                                                 
##     col               0.625    0.056   11.101    0.000    0.625    0.617
##   grad ~                                                                
##     col               0.317    0.079    4.014    0.000    0.317    0.276
##     gre               0.492    0.078    6.303    0.000    0.492    0.434
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .col              49.025    4.903   10.000    0.000   49.025    0.561
##    .gre              55.313    5.531   10.000    0.000   55.313    0.619
##    .grad             67.311    6.731   10.000    0.000   67.311    0.587
## 
## R-Square:
##                    Estimate
##     col               0.439
##     gre               0.381
##     grad              0.413

In the output this time we have included more information. First, we have standardized values, requested with the argument standardized=TRUE, and second, we have measures of fit, requested with the argument fit.measures=TRUE. These measures include the CFI, TLI, RMSEA, etc. Based on commonly used thresholds (roughly, CFI and TLI close to 0.95 or above, RMSEA below 0.06, SRMR below 0.08), this model does not have a very good fit.

Task (one possible setup for each model is sketched after the list):

  1. Compare these results with a smaller model: what happens if you delete the regression path col -> gre?
  2. Compare with a larger model: what happens if you add the path hs -> gre?
  3. Check what happens to the fit measures if you include all possible relationships (add the path hs -> gre and also the effect of hs on grad). Can you explain why?
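A possible setup for the comparison models (a sketch only; the object names model2a, model2b and model2c are arbitrary):

model2a <- 'col ~ hs
            grad ~ col + gre '       # 1) smaller model: col -> gre removed
model2b <- 'col ~ hs
            gre ~ col + hs
            grad ~ col + gre '       # 2) larger model: hs -> gre added
model2c <- 'col ~ hs
            gre ~ col + hs
            grad ~ col + gre + hs '  # 3) all paths included (saturated, 0 df)
fit2a <- cfa(model = model2a, data = dat)
summary(fit2a, fit.measures = TRUE)  # repeat for model2b and model2c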

Mediation Effect

A mediation effect occurs when the observed relationship between a dependent and an independent variable is explained through a third variable, known as the mediator.

In mediation the idea is that the effect we observe from an independent variable (x) on the dependent variable (y) could be partially or fully explained through the effect of x on another variable (the mediator, m). For example, the level of education may predict spending. However, the true underlying relationship could be that the level of education influences spending through the effect it has on people’s income.

In order to test whether a mediated relationship exists, we first need to make sure that there is a significant direct effect, meaning that x is associated with y. If there is no relationship between x and y, then there is no need for a mediation analysis. If this relationship is established, then we need to make sure that x and m are also related, and the same should hold for m and y.

If, after the introduction of the mediator into our model, the effect of x on y becomes non-significant, then we can say that we have full mediation, meaning that the observed relationship runs entirely through the mediator. However, we may also have the case where the relationship is still significant. In that case what we observe is partial mediation.

The way we do this in lavaan is by creating labels (placeholders) in the syntax. Each label will store the estimate of the path coefficient of the corresponding equation (a, b, c). Using the notation from the previous graph, the direct effect will be c, the indirect effect will be the product a*b, and the total effect will be the direct effect (c) plus the indirect effect (a*b).

So imagine that we had initially assumed that this was our true model. This is a simple multiple linear regression where the variables hs and col predict/explain grad.

Task: Define and run the described model in lavaan.
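One possible specification is sketched below (the object names modelpre/fitpre are arbitrary):

modelpre <- ' grad ~ hs + col '  # grad regressed on hs and col
fitpre <- cfa(model = modelpre, data = dat)
summary(fitpre, rsquare = TRUE)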

Now we want to introduce the gre variable as a mediator, and we want to perform the mediation analysis we discussed. So our new model will have the following form.

Now we will examine the mediating effect of gre on the effect of hs on grad. To do this we will use the labels. In the code below I used the labels a, b and c for convenience, but you could use any letter.

model <- '
# mediator model: a stores the effect of hs on gre
gre ~ a*hs
gre ~ col

# outcome model: c stores the direct effect of hs, b the effect of gre
grad ~ c*hs
grad ~ col
grad ~ b*gre

# indirect effect (a*b)
hsgre := a*b
# total effect
total := c + (a*b)
'
fit <- cfa(model = model, data = dat)  # default delta-method SEs (Sobel test)
summary(fit, standardized= TRUE, rsquare=TRUE, ci=TRUE)
## lavaan 0.6.15 ended normally after 1 iteration
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         7
## 
##   Number of observations                           200
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
##   gre ~                                                                 
##     hs         (a)    0.309    0.065    4.756    0.000    0.182    0.437
##     col               0.400    0.071    5.626    0.000    0.261    0.540
##   grad ~                                                                
##     hs         (c)    0.372    0.075    4.937    0.000    0.225    0.520
##     col               0.123    0.084    1.465    0.143   -0.042    0.288
##     gre        (b)    0.369    0.078    4.755    0.000    0.217    0.522
##    Std.lv  Std.all
##                   
##     0.309    0.335
##     0.400    0.396
##                   
##     0.372    0.356
##     0.123    0.108
##     0.369    0.326
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
##    .gre              49.694    4.969   10.000    0.000   39.954   59.434
##    .grad             59.998    6.000   10.000    0.000   48.238   71.757
##    Std.lv  Std.all
##    49.694    0.556
##    59.998    0.523
## 
## R-Square:
##                    Estimate
##     gre               0.444
##     grad              0.477
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|) ci.lower ci.upper
##     hsgre             0.114    0.034    3.362    0.001    0.048    0.181
##     total             0.487    0.075    6.453    0.000    0.339    0.634
##    Std.lv  Std.all
##     0.114    0.109
##     0.487    0.465

The above results indicate that both the indirect effect (which we have defined as hsgre) and the direct effect (hs on grad) are significant; hence, this suggests a partial mediation.

By default, the test used for the significance of the mediation parameters is the Sobel test. This has the limitation that it assumes normality (of the indirect effect), which may not hold true, and therefore a widely used alternative approach is bootstrapping. This can be done with the following syntax:

fit <- cfa(model = model, data = dat, se='bootstrap', bootstrap = 1000)
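With bootstrapped standard errors you can inspect the confidence intervals with summary(fit, ci=TRUE) as before, or request percentile bootstrap intervals through lavaan's parameterEstimates() function:

parameterEstimates(fit, boot.ci.type = "perc", level = 0.95)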

Task: Now it is your turn to perform a mediation analysis, this time to examine the mediating effect of gre on the effect of col on grad. Perform the analysis as before (a possible setup is sketched after the questions) and try to answer the following:

  • What is the direct effect of col on grad before introducing the mediator?
  • What is the direct effect of col on grad after introducing the mediator?
  • What is the indirect effect of col on grad (via gre)?
  • Does gre partially or fully mediate the relationship between col and grad?
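One possible specification for this task (a sketch; the labels a, b, c and the object names modelmed2/fitmed2 are arbitrary):

modelmed2 <- '
# mediator model: a stores the effect of col on gre
gre ~ a*col + hs

# outcome model: c stores the direct effect of col, b the effect of gre
grad ~ c*col + hs + b*gre

# indirect effect of col on grad via gre
colgre := a*b
# total effect of col on grad
total := c + (a*b)
'
fitmed2 <- cfa(model = modelmed2, data = dat)
summary(fitmed2, ci = TRUE)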

CFA and SEM

So far we have examined models with observed variables. However, the beauty of SEM lies mainly in models with latent variables. An SEM has two parts: the measurement part, which is the CFA, where we examine how the observed indicators measure the latent variables, and the structural part, where we test the assumed relationships among them. The syntax is similar to what we have covered so far; what is different is the latent variable definition, which uses the =~ operator. See the figure below for a recap of the operators used in lavaan syntax.
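For quick reference, the main operators can be summarized as follows (a short recap in comment form):

# =~   latent variable definition ("is measured by")
# ~    regression
# ~~   (residual) variance or covariance
# :=   defined parameter (e.g., an indirect effect)
# ~ 1  intercept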

The data used in this part are taken from an SEM seminar by the Institute for Digital Research and Education Statistical Consulting, UCLA. This is a very detailed and interesting series of seminars covering EFA, CFA, SEM and more advanced models.

The data have nine observed variables ascribed to three hypothetical latent variables describing the effects of student background on academic achievement. The nine observed variables are clustered into hypothetical latent constructs based on prior knowledge or theory. In this example, the three hypothesized latent constructs are adjustment, risk, and achievement. Adjustment is defined by the observed variables motivation, harmony, and stability. Similarly, risk is defined by negative parental psychology, socioeconomic status, and verbal IQ. Finally, achievement is composed of reading, arithmetic, and spelling.

Variable abbreviations are:

  • Motivation = motiv
  • Harmony = harm
  • Stability = stabi
  • Negative parental psychology = ppsych
  • Socioeconomic status = ses
  • Verbal IQ = verbal
  • Reading = read
  • Arithmetic = arith
  • Spelling = spell

For this example, we will create a structural model that relates multiple latent variables. First, we create the measurement model by defining each latent variable. Adjustment will be defined/measured by motivation, harmony, and stability. Risk will be verbal, negative parent psychology, and socioeconomic status. Achievement will be reading, arithmetic, and spelling.

Next, we will define the regression path. Achievement will be the combination of adjustment and risk.

We can illustrate this analysis using a path diagram such as the one below.

To sum up, we have three latent variables (circles). Two of them are exogenous (adjustment and risk) and one is endogenous (achievement). The rectangles are the indicators/items or, in simple words, what is actually measured, usually through surveys. The syntax involves initially defining the latent variables.
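Before fitting the model the UCLA dataset needs to be read in; a minimal sketch, assuming the file has been downloaded and saved locally (worland5.csv is the filename used in the UCLA seminar, but adjust it to whatever you named the file):

# Load the UCLA seminar data (the filename is an assumption)
dat2 <- read.csv("worland5.csv")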

modelsem <- '
# measurement model / define the latent variables
adjust =~ motiv + harm + stabi
risk =~ verbal + ppsych + ses
achieve =~ read + arith + spell
# regression / structural part
achieve ~ adjust + risk
'
fit3 <- cfa(modelsem, data = dat2)  # dat2 is the name given to the data
summary(fit3, standardized = TRUE, fit.measures = TRUE)
## lavaan 0.6.15 ended normally after 130 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                        21
## 
##   Number of observations                           500
## 
## Model Test User Model:
##                                                       
##   Test statistic                               148.982
##   Degrees of freedom                                24
##   P-value (Chi-square)                           0.000
## 
## Model Test Baseline Model:
## 
##   Test statistic                              2597.972
##   Degrees of freedom                                36
##   P-value                                        0.000
## 
## User Model versus Baseline Model:
## 
##   Comparative Fit Index (CFI)                    0.951
##   Tucker-Lewis Index (TLI)                       0.927
## 
## Loglikelihood and Information Criteria:
## 
##   Loglikelihood user model (H0)             -15517.857
##   Loglikelihood unrestricted model (H1)     -15443.366
##                                                       
##   Akaike (AIC)                               31077.713
##   Bayesian (BIC)                             31166.220
##   Sample-size adjusted Bayesian (SABIC)      31099.565
## 
## Root Mean Square Error of Approximation:
## 
##   RMSEA                                          0.102
##   90 Percent confidence interval - lower         0.087
##   90 Percent confidence interval - upper         0.118
##   P-value H_0: RMSEA <= 0.050                    0.000
##   P-value H_0: RMSEA >= 0.080                    0.990
## 
## Standardized Root Mean Square Residual:
## 
##   SRMR                                           0.041
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Latent Variables:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   adjust =~                                                             
##     motiv             1.000                               9.324    0.933
##     harm              0.884    0.041   21.774    0.000    8.246    0.825
##     stabi             0.695    0.043   15.987    0.000    6.478    0.648
##   risk =~                                                               
##     verbal            1.000                               7.319    0.733
##     ppsych           -0.770    0.075  -10.223    0.000   -5.636   -0.564
##     ses               0.807    0.076   10.607    0.000    5.906    0.591
##   achieve =~                                                            
##     read              1.000                               9.404    0.941
##     arith             0.837    0.034   24.437    0.000    7.873    0.788
##     spell             0.976    0.028   34.338    0.000    9.178    0.919
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   achieve ~                                                             
##     adjust            0.375    0.046    8.085    0.000    0.372    0.372
##     risk              0.724    0.078    9.253    0.000    0.564    0.564
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##   adjust ~~                                                             
##     risk             32.098    4.320    7.431    0.000    0.470    0.470
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
##    .motiv            12.870    2.852    4.512    0.000   12.870    0.129
##    .harm             31.805    2.973   10.698    0.000   31.805    0.319
##    .stabi            57.836    3.990   14.494    0.000   57.836    0.580
##    .verbal           46.239    4.788    9.658    0.000   46.239    0.463
##    .ppsych           68.033    5.068   13.425    0.000   68.033    0.682
##    .ses              64.916    4.975   13.048    0.000   64.916    0.650
##    .read             11.372    1.608    7.074    0.000   11.372    0.114
##    .arith            37.818    2.680   14.109    0.000   37.818    0.379
##    .spell            15.560    1.699    9.160    0.000   15.560    0.156
##     adjust           86.930    6.830   12.727    0.000    1.000    1.000
##     risk             53.561    6.757    7.927    0.000    1.000    1.000
##    .achieve          30.685    3.449    8.896    0.000    0.347    0.347

Given this output, we first care about the fit measures, which give an idea of how well our measurement model fits, and then, as in any other analysis, we are interested in the coefficients and the statistical significance of the relationships among the latent variables.
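If needed, individual fit indices can also be extracted directly from the fitted object, for example:

fitMeasures(fit3, c("cfi", "tli", "rmsea", "srmr"))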