Structural Equation Modeling (SEM) is a comprehensive and flexible approach for studying, within a hypothesized model, the relationships between variables, whether they are measured or latent. Latent variables are not directly observable, as is the case for most psychological constructs (for example, intelligence, satisfaction, hope, trust). SEM is comprehensive because it is a multivariate method that combines factor analysis with methods based on or derived from multiple regression and canonical analysis. It is flexible because it not only identifies direct and indirect effects between variables but also estimates the parameters of varied and complex models, including latent variable means.
The history of the SEM approach traces back to three traditions: (1) path analysis, originally developed by the geneticist Sewall Wright (Wright 1921) and later picked up in sociology (Duncan 1966), (2) simultaneous-equation models, as developed in economics (Koopmans 1945), and (3) factor analysis from psychology (Anderson and Rubin 1956).
Basically, SEM is a statistical methodology that takes a confirmatory (i.e., hypothesis-testing) approach to the analysis of a structural theory bearing on some phenomenon. Typically, this theory represents “causal” processes that generate observations on multiple variables. The term “structural equation modeling” conveys two important aspects of the procedure: (1) that the causal processes under study are represented by a series of structural (i.e., regression) equations, and (2) that these structural relations can be modeled pictorially to enable a clearer conceptualization of the theory under study. The hypothesized model can then be tested statistically in a simultaneous analysis of the entire system of variables to determine the extent to which it is consistent with the data. If goodness-of-fit is adequate, the model argues for the plausibility of the postulated relations among variables; if it is inadequate, the tenability of such relations is rejected.
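The "series of structural equations" can be written compactly in matrix form. The notation below is one common convention and is an assumption here, not something the text above prescribes: y collects the observed indicators, η the latent variables, Λ the factor loadings, B the regression coefficients among latent variables, and Ψ and Θ the covariance matrices of the disturbances ζ and the measurement errors ε. All variables are taken as centered and I - B as invertible.

```latex
\[
\begin{aligned}
y &= \Lambda \eta + \varepsilon
  && \text{(measurement part)} \\
\eta &= B \eta + \zeta
  && \text{(structural part)} \\
\Sigma(\theta) &= \Lambda (I - B)^{-1} \Psi (I - B)^{-\top} \Lambda^{\top} + \Theta
  && \text{(model-implied covariance matrix)}
\end{aligned}
\]
```

Testing the model then amounts to asking how closely the implied matrix Σ(θ), evaluated at the estimated parameters, reproduces the sample covariance matrix of the observed variables.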
Within the R environment, there are two approaches to estimate structural equation models.
The first approach is to connect R with external commercial SEM programs. This is often useful in simulation studies where fitting a model with SEM software is one part of the simulation pipeline. During one run of the simulation, syntax is written to a file in a format that can be read by the external SEM program (Mplus or EQS); the model is fitted by the external SEM program, and the resulting output is written to a file that must be parsed to extract the information relevant to the study at hand. Depending on the SEM program, the connection protocols can be tedious to set up. Fortunately, two R packages have been developed to ease this process: MplusAutomation (Hallquist 2012) and REQS (Mair, Wu, and Bentler 2010), which communicate with the Mplus and EQS programs, respectively.
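The write, fit, parse cycle described above can be sketched in base R. Everything in this sketch is hypothetical: the file names, the placeholder syntax, the `KEY = value` output layout, and the helper functions `run_external_sem()` and `parse_output()` are illustrations only and do not correspond to the real formats of Mplus or EQS, nor to the interfaces of MplusAutomation or REQS.

```r
# Hypothetical file names for one simulation run.
syntax_file <- tempfile(fileext = ".inp")
output_file <- tempfile(fileext = ".out")

# Step 1: write the model syntax for the external program to a file.
# The text is a placeholder, not valid syntax for any particular program.
writeLines(c("placeholder title", "placeholder model syntax"), syntax_file)

# Step 2: a real pipeline would call the external program here,
# for example through system2(). This stand-in only writes a made-up
# output file so that the sketch runs without any SEM software installed.
run_external_sem <- function(syntax_file, output_file) {
  stopifnot(file.exists(syntax_file))
  writeLines(c("CHISQ = 12.345", "DF = 8", "CFI = 0.987"), output_file)
  invisible(output_file)
}
run_external_sem(syntax_file, output_file)

# Step 3: parse the output file into a named numeric vector.
parse_output <- function(path) {
  lines <- readLines(path)
  parts <- strsplit(lines, "=", fixed = TRUE)
  values <- as.numeric(trimws(vapply(parts, `[`, character(1), 2)))
  names(values) <- trimws(vapply(parts, `[`, character(1), 1))
  values
}
fit <- parse_output(output_file)
print(fit)
```

In a simulation study, these three steps would sit inside the loop over replications, with the parsed values collected into one results table.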
The second approach is to use a dedicated R package for structural equation modeling. Before the appearance of lavaan, there were two alternative packages available. The sem package, developed by John Fox, has been around since 2001 (Fox, Nie, and Byrnes 2012; Fox 2006), and for a long time it was the only package for SEM in the R environment. The second package is OpenMx (Boker, Neale, Maes, Wilde, Spiegel, Brick, Spies, Estabrook, Kenny, Bates, Mehta, and Fox 2011). As the name of the package suggests, OpenMx is a complete rewrite of the Mx program, consisting of three parts: a front end in R, a back end written in C, and a third-party commercial optimizer (NPSOL). All parts of OpenMx are open-source, except of course the NPSOL optimizer, which is closed-source.
As described above, many SEM software packages are available, both free and commercial, including a couple of packages that run in the R environment. Why then is there a need for yet another SEM package? The answer is threefold:
lavaan aims to appeal to a large group of applied researchers who need SEM software to answer their substantive questions. Many applied researchers have not previously used R and are accustomed to commercial SEM programs. Applied researchers often value software that is intuitive and rich in modeling features, and lavaan tries to fulfill both of these objectives.
lavaan aims to appeal to those who teach SEM classes or SEM workshops; ideally, teachers should have access to an easy-to-use, but complete, SEM program that is inexpensive to install in a computer classroom.
lavaan aims to appeal to statisticians working in the field of SEM. To implement a new methodological idea, it is advantageous to have access to an open-source SEM program that enables direct access to the SEM code.
In this section I will present R code for conducting an SEM analysis with involvement.sav, a data set used in Marketing Research with SPSS.
# Load some R packages and import data:
rm(list = ls())
library(haven)
library(tidyverse)
df_invol <- read_sav("C:\\Users\\Zbook\\Desktop\\amos\\0273703846_datasets\\ch08\\involvement.sav")
# Specify our SEM model:
my_SEM <- "prcon =~ prcon2 + prcon3 + prcon4 + prcon5
involv =~ involv1 + involv2 + involv3
tpress =~ tpress1 + tpress2 + tpress3 + tpress4 + tpress5
prcon ~ involv
involv ~ tpress
tpress ~ prcon"
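In lavaan model syntax, `=~` reads "is measured by" and defines a latent variable through its indicators, while a plain `~` reads "is regressed on". The following base R sketch only parses the model string to make that structure visible; it is not how lavaan itself processes the syntax, and the object names (`model_syntax`, `indicators`, `regressions`) are made up for this illustration.

```r
# Model syntax copied from the text above.
model_syntax <- "prcon =~ prcon2 + prcon3 + prcon4 + prcon5
involv =~ involv1 + involv2 + involv3
tpress =~ tpress1 + tpress2 + tpress3 + tpress4 + tpress5
prcon ~ involv
involv ~ tpress
tpress ~ prcon"

# One formula per line.
formulas <- trimws(strsplit(model_syntax, "\n", fixed = TRUE)[[1]])

# "=~" marks a measurement equation; every other line is a regression.
is_measurement <- grepl("=~", formulas, fixed = TRUE)

# Split each formula into its left-hand side and its right-hand side terms.
sides <- strsplit(formulas, "=~|~")
lhs <- trimws(vapply(sides, `[`, character(1), 1))
rhs <- lapply(sides, function(s) trimws(strsplit(s[2], "+", fixed = TRUE)[[1]]))

# Latent variables with their indicators.
indicators <- setNames(rhs[is_measurement], lhs[is_measurement])
n_indicators <- lengths(indicators)

# Structural paths between latent variables.
rhs_reg <- rhs[!is_measurement]
regressions <- data.frame(
  outcome   = rep(lhs[!is_measurement], lengths(rhs_reg)),
  predictor = unlist(rhs_reg)
)

print(n_indicators)
print(regressions)
```

The parse shows three latent variables with 4, 3, and 5 indicators and three structural paths that form a loop: tpress predicts involv, involv predicts prcon, and prcon predicts tpress.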
# Conduct SEM analysis:
library(lavaan)
SEM_model <- sem(my_SEM, data = df_invol)
# Show results:
summary(SEM_model)
## lavaan 0.6-3 ended normally after 42 iterations
##
## Optimization method NLMINB
## Number of free parameters 27
##
## Number of observations 256
##
## Estimator ML
## Model Fit Test Statistic 128.258
## Degrees of freedom 51
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## prcon =~
## prcon2 1.000
## prcon3 1.188 0.076 15.730 0.000
## prcon4 1.120 0.070 15.945 0.000
## prcon5 1.166 0.071 16.323 0.000
## involv =~
## involv1 1.000
## involv2 1.065 0.046 23.235 0.000
## involv3 1.001 0.051 19.725 0.000
## tpress =~
## tpress1 1.000
## tpress2 1.222 0.124 9.870 0.000
## tpress3 1.185 0.122 9.753 0.000
## tpress4 1.437 0.133 10.820 0.000
## tpress5 1.130 0.112 10.096 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## prcon ~
## involv 0.184 0.052 3.506 0.000
## involv ~
## tpress -0.319 0.090 -3.559 0.000
## tpress ~
## prcon -0.046 0.073 -0.630 0.529
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .prcon2 1.015 0.092 11.014 0.000
## .prcon3 0.194 0.022 8.796 0.000
## .prcon4 0.133 0.017 7.865 0.000
## .prcon5 0.068 0.014 4.753 0.000
## .involv1 0.515 0.063 8.156 0.000
## .involv2 0.185 0.052 3.562 0.000
## .involv3 0.626 0.071 8.862 0.000
## .tpress1 1.861 0.178 10.444 0.000
## .tpress2 1.290 0.137 9.385 0.000
## .tpress3 1.310 0.137 9.535 0.000
## .tpress4 0.700 0.109 6.439 0.000
## .tpress5 0.939 0.104 9.028 0.000
## .prcon 1.073 0.160 6.724 0.000
## .involv 1.751 0.198 8.829 0.000
## .tpress 1.183 0.223 5.299 0.000
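Two numbers in this output can be checked by hand: the degrees of freedom and the z-test of a single coefficient. The sketch below uses only base R and values copied from the output above; the variable names are made up for the illustration.

```r
# Degrees of freedom = unique sample moments minus free parameters.
p_observed <- 12                                 # observed indicators
n_moments  <- p_observed * (p_observed + 1) / 2  # unique variances and covariances

# Free parameters, grouped as in the output.
free_loadings    <- (4 - 1) + (3 - 1) + (5 - 1)  # first loading of each factor is fixed to 1
free_regressions <- 3                            # paths among the latent variables
free_resid_var   <- 12                           # residual variances of the indicators
free_latent_var  <- 3                            # (residual) variances of the latent variables
n_free <- free_loadings + free_regressions + free_resid_var + free_latent_var

df_model <- n_moments - n_free
print(c(moments = n_moments, free = n_free, df = df_model))

# Wald z-test for one coefficient, here the path tpress ~ prcon.
est <- -0.046
se  <- 0.073
z   <- est / se
p   <- 2 * pnorm(-abs(z))   # two-sided p-value
print(round(c(z = z, p = p), 3))
```

The count reproduces the 27 free parameters and 51 degrees of freedom reported by lavaan, and the z-test reproduces the non-significant path from prcon to tpress up to the rounding of the printed estimate and standard error.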
# You can show more detailed results with fit.measures = TRUE:
summary(SEM_model, fit.measures = TRUE)
## lavaan 0.6-3 ended normally after 42 iterations
##
## Optimization method NLMINB
## Number of free parameters 27
##
## Number of observations 256
##
## Estimator ML
## Model Fit Test Statistic 128.258
## Degrees of freedom 51
## P-value (Chi-square) 0.000
##
## Model test baseline model:
##
## Minimum Function Test Statistic 2656.129
## Degrees of freedom 66
## P-value 0.000
##
## User model versus baseline model:
##
## Comparative Fit Index (CFI) 0.970
## Tucker-Lewis Index (TLI) 0.961
##
## Loglikelihood and Information Criteria:
##
## Loglikelihood user model (H0) -4414.669
## Loglikelihood unrestricted model (H1) NA
##
## Number of free parameters 27
## Akaike (AIC) 8883.338
## Bayesian (BIC) 8979.058
## Sample-size adjusted Bayesian (BIC) 8893.461
##
## Root Mean Square Error of Approximation:
##
## RMSEA 0.077
## 90 Percent Confidence Interval 0.060 0.094
## P-value RMSEA <= 0.05 0.005
##
## Standardized Root Mean Square Residual:
##
## SRMR 0.043
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## prcon =~
## prcon2 1.000
## prcon3 1.188 0.076 15.730 0.000
## prcon4 1.120 0.070 15.945 0.000
## prcon5 1.166 0.071 16.323 0.000
## involv =~
## involv1 1.000
## involv2 1.065 0.046 23.235 0.000
## involv3 1.001 0.051 19.725 0.000
## tpress =~
## tpress1 1.000
## tpress2 1.222 0.124 9.870 0.000
## tpress3 1.185 0.122 9.753 0.000
## tpress4 1.437 0.133 10.820 0.000
## tpress5 1.130 0.112 10.096 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## prcon ~
## involv 0.184 0.052 3.506 0.000
## involv ~
## tpress -0.319 0.090 -3.559 0.000
## tpress ~
## prcon -0.046 0.073 -0.630 0.529
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .prcon2 1.015 0.092 11.014 0.000
## .prcon3 0.194 0.022 8.796 0.000
## .prcon4 0.133 0.017 7.865 0.000
## .prcon5 0.068 0.014 4.753 0.000
## .involv1 0.515 0.063 8.156 0.000
## .involv2 0.185 0.052 3.562 0.000
## .involv3 0.626 0.071 8.862 0.000
## .tpress1 1.861 0.178 10.444 0.000
## .tpress2 1.290 0.137 9.385 0.000
## .tpress3 1.310 0.137 9.535 0.000
## .tpress4 0.700 0.109 6.439 0.000
## .tpress5 0.939 0.104 9.028 0.000
## .prcon 1.073 0.160 6.724 0.000
## .involv 1.751 0.198 8.829 0.000
## .tpress 1.183 0.223 5.299 0.000
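The fit indices in this output follow from a handful of quantities: the two chi-square statistics, their degrees of freedom, the sample size, the log-likelihood, and the number of free parameters. The base R sketch below recomputes them from the printed values using the textbook formulas. The RMSEA is computed with N in the denominator, which is an assumption about the convention in use (some texts use N - 1; both round to the same value here).

```r
# Values copied from the summary output above.
chisq   <- 128.258    # test statistic of the user model
df      <- 51
chisq_b <- 2656.129   # test statistic of the baseline model
df_b    <- 66
n       <- 256        # number of observations
loglik  <- -4414.669  # log-likelihood of the user model
k       <- 27         # number of free parameters

# P-value of the chi-square test of exact fit.
p_value <- pchisq(chisq, df, lower.tail = FALSE)

# Comparative Fit Index: noncentrality of the model relative to the baseline.
cfi <- 1 - max(chisq - df, 0) / max(chisq_b - df_b, chisq - df, 0)

# Tucker-Lewis Index: based on the chi-square/df ratios of both models.
tli <- (chisq_b / df_b - chisq / df) / (chisq_b / df_b - 1)

# Root Mean Square Error of Approximation.
rmsea <- sqrt(max(chisq - df, 0) / (df * n))

# Information criteria.
aic <- -2 * loglik + 2 * k
bic <- -2 * loglik + k * log(n)

print(round(c(cfi = cfi, tli = tli, rmsea = rmsea, aic = aic, bic = bic), 3))
```

Each value agrees with the lavaan output to the printed precision, which makes clear that CFI, TLI, and RMSEA are all transformations of the same chi-square statistics.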
# Draw the path diagram with standardized estimates:
library(semPlot)
semPaths(SEM_model, what = "std", nCharNodes = 6, sizeMan = 8,
edge.label.cex = 1.1, curvePivot = TRUE, fade = FALSE)
# We can extract selected fit indices:
fitMeasures(SEM_model, fit.measures = c("cfi", "rmsea"))
## cfi rmsea
## 0.970 0.077
# Show the standardized solution:
standardizedSolution(SEM_model)
## lhs op rhs est.std se z pvalue ci.lower ci.upper
## 1 prcon =~ prcon2 0.728 0.030 24.131 0.000 0.669 0.787
## 2 prcon =~ prcon3 0.945 0.008 120.599 0.000 0.930 0.960
## 3 prcon =~ prcon4 0.957 0.007 143.487 0.000 0.944 0.970
## 4 prcon =~ prcon5 0.979 0.005 202.362 0.000 0.969 0.988
## 5 involv =~ involv1 0.886 0.017 52.366 0.000 0.853 0.919
## 6 involv =~ involv2 0.959 0.012 78.955 0.000 0.935 0.983
## 7 involv =~ involv3 0.866 0.019 46.627 0.000 0.830 0.903
## 8 tpress =~ tpress1 0.625 0.042 14.832 0.000 0.542 0.707
## 9 tpress =~ tpress2 0.762 0.031 24.596 0.000 0.701 0.822
## 10 tpress =~ tpress3 0.749 0.032 23.386 0.000 0.686 0.812
## 11 tpress =~ tpress4 0.882 0.021 41.656 0.000 0.841 0.924
## 12 tpress =~ tpress5 0.786 0.029 27.265 0.000 0.730 0.843
## 13 prcon ~ involv 0.236 0.064 3.702 0.000 0.111 0.361
## 14 involv ~ tpress -0.254 0.066 -3.861 0.000 -0.383 -0.125
## 15 tpress ~ prcon -0.045 0.071 -0.632 0.528 -0.184 0.095
## 16 prcon2 ~~ prcon2 0.470 0.044 10.718 0.000 0.384 0.556
## 17 prcon3 ~~ prcon3 0.107 0.015 7.243 0.000 0.078 0.136
## 18 prcon4 ~~ prcon4 0.085 0.013 6.644 0.000 0.060 0.110
## 19 prcon5 ~~ prcon5 0.042 0.009 4.410 0.000 0.023 0.060
## 20 involv1 ~~ involv1 0.215 0.030 7.164 0.000 0.156 0.274
## 21 involv2 ~~ involv2 0.080 0.023 3.423 0.001 0.034 0.125
## 22 involv3 ~~ involv3 0.249 0.032 7.750 0.000 0.186 0.313
## 23 tpress1 ~~ tpress1 0.610 0.053 11.582 0.000 0.506 0.713
## 24 tpress2 ~~ tpress2 0.420 0.047 8.907 0.000 0.328 0.512
## 25 tpress3 ~~ tpress3 0.439 0.048 9.146 0.000 0.345 0.533
## 26 tpress4 ~~ tpress4 0.222 0.037 5.926 0.000 0.148 0.295
## 27 tpress5 ~~ tpress5 0.382 0.045 8.416 0.000 0.293 0.471
## 28 prcon ~~ prcon 0.939 0.030 31.107 0.000 0.880 0.998
## 29 involv ~~ involv 0.930 0.034 27.727 0.000 0.865 0.996
## 30 tpress ~~ tpress 0.993 0.014 69.616 0.000 0.965 1.021
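In the standardized solution, each indicator loads on a single latent variable, so its standardized residual variance should equal one minus the squared standardized loading, and the squared loading is the share of the indicator's variance explained by the factor. The base R sketch below checks this against the values printed above (rows 1 to 12 and 16 to 27); the vector names are made up for the illustration.

```r
# Standardized loadings (est.std, rows 1 to 12 of the table above).
std_loading <- c(
  prcon2 = 0.728, prcon3 = 0.945, prcon4 = 0.957, prcon5 = 0.979,
  involv1 = 0.886, involv2 = 0.959, involv3 = 0.866,
  tpress1 = 0.625, tpress2 = 0.762, tpress3 = 0.749,
  tpress4 = 0.882, tpress5 = 0.786
)

# Standardized residual variances (est.std, rows 16 to 27).
std_resid <- c(
  prcon2 = 0.470, prcon3 = 0.107, prcon4 = 0.085, prcon5 = 0.042,
  involv1 = 0.215, involv2 = 0.080, involv3 = 0.249,
  tpress1 = 0.610, tpress2 = 0.420, tpress3 = 0.439,
  tpress4 = 0.222, tpress5 = 0.382
)

# Share of each indicator's variance explained by its factor.
r_squared <- std_loading^2

# Residual variance implied by the loadings.
implied_resid <- 1 - r_squared

# Largest discrepancy; anything near 0.001 is rounding in the printed values.
max_diff <- max(abs(implied_resid - std_resid))

print(round(cbind(r_squared, implied_resid, reported = std_resid), 3))
print(max_diff)
```

The comparison shows, for example, that prcon5 is almost entirely explained by its factor, while tpress1 shares well under half of its variance with tpress.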
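# read_sav() may import SPSS variables as labelled vectors rather than plain numerics.
# Converting every column to numeric before calling sem() gives identical results:
df_invol %>%
  mutate(across(everything(), as.numeric)) %>%
  sem(my_SEM, data = .) %>%
  summary()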
## lavaan 0.6-3 ended normally after 42 iterations
##
## Optimization method NLMINB
## Number of free parameters 27
##
## Number of observations 256
##
## Estimator ML
## Model Fit Test Statistic 128.258
## Degrees of freedom 51
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Information Expected
## Information saturated (h1) model Structured
## Standard Errors Standard
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## prcon =~
## prcon2 1.000
## prcon3 1.188 0.076 15.730 0.000
## prcon4 1.120 0.070 15.945 0.000
## prcon5 1.166 0.071 16.323 0.000
## involv =~
## involv1 1.000
## involv2 1.065 0.046 23.235 0.000
## involv3 1.001 0.051 19.725 0.000
## tpress =~
## tpress1 1.000
## tpress2 1.222 0.124 9.870 0.000
## tpress3 1.185 0.122 9.753 0.000
## tpress4 1.437 0.133 10.820 0.000
## tpress5 1.130 0.112 10.096 0.000
##
## Regressions:
## Estimate Std.Err z-value P(>|z|)
## prcon ~
## involv 0.184 0.052 3.506 0.000
## involv ~
## tpress -0.319 0.090 -3.559 0.000
## tpress ~
## prcon -0.046 0.073 -0.630 0.529
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .prcon2 1.015 0.092 11.014 0.000
## .prcon3 0.194 0.022 8.796 0.000
## .prcon4 0.133 0.017 7.865 0.000
## .prcon5 0.068 0.014 4.753 0.000
## .involv1 0.515 0.063 8.156 0.000
## .involv2 0.185 0.052 3.562 0.000
## .involv3 0.626 0.071 8.862 0.000
## .tpress1 1.861 0.178 10.444 0.000
## .tpress2 1.290 0.137 9.385 0.000
## .tpress3 1.310 0.137 9.535 0.000
## .tpress4 0.700 0.109 6.439 0.000
## .tpress5 0.939 0.104 9.028 0.000
## .prcon 1.073 0.160 6.724 0.000
## .involv 1.751 0.198 8.829 0.000
## .tpress 1.183 0.223 5.299 0.000
Anderson TW, Rubin H (1956). "Statistical Inference in Factor Analysis." In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, pp. 111-150. University of California Press, Berkeley.
Boker S, Neale M, Maes HH, Wilde M, Spiegel M, Brick T, Spies J, Estabrook R, Kenny S, Bates T, Mehta P, Fox J (2011). "OpenMx: An Open Source Extended Structural Equation Modeling Framework." Psychometrika, 76, 306-317.
Bollen KA (1989). Structural Equations with Latent Variables. Wiley Series in Probability and Mathematical Statistics. Wiley, New York.
Byrne BM (2016). Structural Equation Modeling with AMOS: Basic Concepts, Applications, and Programming. Routledge.
Chapman C, Feit EM (2015). R for Marketing Research and Analytics. Springer, New York.
Duncan OD (1966). "Path Analysis: Sociological Examples." American Journal of Sociology, 72(1).
Fox J, Nie Z, Byrnes J (2012). sem: Structural Equation Models. R package version 3.0-0, URL http://CRAN.R-project.org/package=sem.
Hallquist M (2012). MplusAutomation: Automating Mplus Model Estimation and Interpretation. R package version 0.5-1, URL http://CRAN.R-project.org/package=MplusAutomation.
Howitt D, Cramer D (2017). Understanding Statistics in Psychology with SPSS. Pearson Higher Education.
Kline RB (2015). Principles and Practice of Structural Equation Modeling. Guilford Publications.
Koopmans T (1945). "Statistical Estimation of Simultaneous Economic Relations." Journal of the American Statistical Association, 40, 448-466.
Mair P, Wu E, Bentler PM (2010). "EQS Goes R: Embedding EQS into the R Environment Using the Package REQS." Structural Equation Modeling: A Multidisciplinary Journal, 17, 333-349.
Wright S (1921). "Correlation and Causation." Journal of Agricultural Research, 20, 557.