Structural Equation Modeling with the sem Package in R:

A Demonstration

Will Vincent, PH 251D, Final Project 2

In the simplest terms, structural equation modeling(SEM) is basically like regression, but you can analyze multiple outcomes simultaneously. You can also analyze multiple mediators and moderators at once in the same model. In psychology, this is popular among community-based researchers. However, neuropsychologists and more clinically focused psychologists who focus on randomized controlled trials tend not to use this. Public-health researchers who are analyzing complex models of causal factors or doing survey-based research with large samples multiple outcome variables would find SEM useful.

In order to conduct SEM in R, I am using the sem package available in R. (Another good package is lavaan.) This demonstration is based on John Fox's (2002) Appendix to his book An R and S-Plus Compnaion to Applied Regression. It's a good idea to familiarize yourself with the priniples and best practices of structural equation modeling before attempting to do SEM. A good primer is as follows:

Kline, R. B. (2005). Principles and practice of structural equation modeling. New York, NY: The Guilford Press.

In this model, I am determining whether AIDS-related stigma, religiosity, and adherence to the traditional male role norm of antifemininity lead to perceptions of two types of threat in response to gay men. I am also testing whether these perceived threats, in turn, lead to intoxicated and non-intoxicated aggression toward sexual minorities.

CAVEAT: This is a cross-sectional study. Of course, we cannot determine causation with this design. Nonetheless, terms like “path” and “effect” are often used with SEM even in cross-sectional studies. The understanding is that these are not really causal.

I am activitating the sem package that I have already installed. But I will not import data. As you will subsequently see, I won't need to for sem package.

library("sem")

## Loading required package: MASS
## Loading required package: matrixcalc

Create a correlation matrix of the assocations between all of the observed variables, which are just your study variables that you directly measured. Structural equation modeling analyzes the correlations between the variables. If you want to be fancier, you can create latent variables that represent underlying factors, and you might choose to analyze the covariance matrix plus the variable means and standard deviations in a software program or R package that will let you do so.

You will want to run correlation analyses to find the correlations among your observed variables. You can use R or another program, like Stata.

corr.diss <- matrix(c(1, 0, 0, 0, 0, 0, 0, 0.876, 1, 0, 0, 0, 0, 0, 0.318, 0.32, 
    1, 0, 0, 0, 0, 0.358, 0.333, 0.84, 1, 0, 0, 0, 0.245, 0.264, 0.566, 0.521, 
    1, 0, 0, 0.267, 0.254, 0.549, 0.506, 0.365, 1, 0, 0.058, 0.074, 0.074, 0.301, 
    0.52, 0.149, 1), ncol = 7, byrow = T)

rownames(corr.diss) <- colnames(corr.diss) <- c("IntAgg", "NonIntAgg", "RealThr", 
    "SymbThr", "ARStigma", "MRNaf", "Relig")

corr.diss

##           IntAgg NonIntAgg RealThr SymbThr ARStigma MRNaf Relig
## IntAgg     1.000     0.000   0.000   0.000    0.000 0.000     0
## NonIntAgg  0.876     1.000   0.000   0.000    0.000 0.000     0
## RealThr    0.318     0.320   1.000   0.000    0.000 0.000     0
## SymbThr    0.358     0.333   0.840   1.000    0.000 0.000     0
## ARStigma   0.245     0.264   0.566   0.521    1.000 0.000     0
## MRNaf      0.267     0.254   0.549   0.506    0.365 1.000     0
## Relig      0.058     0.074   0.074   0.301    0.520 0.149     1

So, now is the fun part: MODELING!

ram.diss <- matrix(c(
# path                        parameter   start-value
  "RealThr   ->  IntAgg",     "b11",      NA,
  "SymbThr   ->  IntAgg",     "b12",      NA,
  "RealThr   ->  NonIntAgg",  "b21",      NA,
  "SymbThr   ->  NonIntAgg",  "b22",      NA,
  "ARStigma  ->  RealThr",    "a11",      NA,
  "Relig     ->  RealThr",    "a12",      NA,
  "MRNaf     ->  RealThr",    "a13",      NA,
  "ARStigma  ->  SymbThr",    "a21",      NA,
  "Relig     ->  SymbThr",    "a22",      NA,
  "MRNaf     ->  SymbThr",    "a23",      NA,
  "RealThr  <->  SymbThr",    "c11",      NA,
  "IntAgg <-> NonIntAgg",     "c21",      NA,     
  "RealThr  <->  RealThr",    "d11",      NA,
  "SymbThr  <->  SymbThr",    "d21",      NA,
  "IntAgg   <->  IntAgg",     "d31",      NA,
  "NonIntAgg <-> NonIntAgg",  "d41",      NA),
  ncol=3, byrow=T)

In the path column, I am estimating assocations between variables that will be estimated using regression coefficents. (Think coefficients, or slops, from regression.) Single-headed arrows represent “causal” paths, where as double headed arrows represent “covariances” between variables. These covariances are like correlations that are in their original units of measurement. These covariances in the model are actually associations between the variables' error terms. The parameter column names the parameters that you want to estimate (e.g., regression coefficients), and you don't really need to worry about what the starting values are. If you put a number in this column instead of NA, you will fix the parameter to be that value.

I chose parameter names that are intuitive to me. You can name them what you want, and you can even name them after standard statitical notation, like “beta”. The numbering is up to you.

Now, we want the results of our SEM analyses.

sem.diss <- sem(ram.diss, corr.diss, N = 212, fixed.x = c("ARStigma", "Relig", 
    "MRNaf"))

I am reading into the object, sem.diss, the sem function that includs the modeling we did in ram.diss, the correlation matrix we created in corr.diss, and the sample size (N).

The fixed.x argument tells R what predictor variables are in the model. These predictor variables, which are called exogenous variables, are variables that have no single-headed arrows pointing to them. Variables that do have single-headed arrows pointing toward them are called, endogenous variables.

Prepare yourself for the results!

sem.diss

## 
##  Model Chisquare =  7.293   Df =  6 
## 
##      b11      b12      b21      b22      a11      a12      a13      a21 
##  0.05870  0.30870  0.13682  0.21807  0.57256 -0.28062  0.38183  0.35482 
##      a22      a23      c11      c21      d11      d21      d31      d41 
##  0.06177  0.36729  0.43296  0.75442  0.48708  0.61070  0.87082  0.88360 
## 
##  Iterations =  0

Here are the results of the analyses. As you can see, the (unstandardized) regression coefficients are shown. Unfortunately, standard errors are not included. But, fear not.

summary(sem.diss)

## 
##  Model Chisquare =  7.293   Df =  6 Pr(>Chisq) = 0.2946
##  AIC =  39.29
##  BIC =  -24.85
## 
##  Normalized Residuals
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -0.568   0.000   0.000   0.134   0.000   1.120 
## 
##  R-square for Endogenous Variables
##   RealThr    IntAgg   SymbThr NonIntAgg 
##    0.5129    0.1292    0.3893    0.1164 
## 
##  Parameter Estimates
##     Estimate Std Error z value Pr(>|z|)                          
## b11  0.05870 0.11840    0.4957 6.201e-01 IntAgg <--- RealThr     
## b12  0.30870 0.11840    2.6072 9.128e-03 IntAgg <--- SymbThr     
## b21  0.13682 0.11927    1.1472 2.513e-01 NonIntAgg <--- RealThr  
## b22  0.21807 0.11927    1.8284 6.748e-02 NonIntAgg <--- SymbThr  
## a11  0.57256 0.05982    9.5710 1.058e-21 RealThr <--- ARStigma   
## a12 -0.28062 0.05632   -4.9823 6.282e-07 RealThr <--- Relig      
## a13  0.38183 0.05167    7.3891 1.478e-13 RealThr <--- MRNaf      
## a21  0.35482 0.06698    5.2971 1.177e-07 SymbThr <--- ARStigma   
## a22  0.06177 0.06307    0.9794 3.274e-01 SymbThr <--- Relig      
## a23  0.36729 0.05786    6.3477 2.186e-10 SymbThr <--- MRNaf      
## c11  0.43296 0.04794    9.0315 1.694e-19 SymbThr <--> RealThr    
## c21  0.75442 0.07965    9.4717 2.753e-21 NonIntAgg <--> IntAgg   
## d11  0.48708 0.04742   10.2713 9.490e-25 RealThr <--> RealThr    
## d21  0.61070 0.05946   10.2713 9.490e-25 SymbThr <--> SymbThr    
## d31  0.87082 0.08478   10.2713 9.490e-25 IntAgg <--> IntAgg      
## d41  0.88360 0.08603   10.2713 9.490e-25 NonIntAgg <--> NonIntAgg
## 
##  Iterations =  0

Here are your path coefficients and standard errors, along with p-values if you are interested in hypothesis testing.

It is worth noting that the results also include R-squared values for all variables that have arrows pointing toward them. You also get fit stiatistics, chi-square test of model fit, the Aikike Information Criterion (AIC), and the Bayesian Information Criterion (BIC). Many people use chi-square default, and you should almost always report some version of the chi-square model fit test with your results. A good, commonly used test of model fit is the root mean-squared error (RMSEA), which you can calculate. See Fox's (2002) work, reference above.

Unfortunately, these results do not include indirect effects, which are the effects of the predictor (e.g., ARStigma) on the outcome (e.g., IntAgg) through the mediator (e.g., RealThr). However, you can get these by multiplying a by multiplying the regression coefficent of ARStigma -> RealThr by the regression coefficient. You can do what's called a Sobel test to find out if this particuarly indirect path is statistically significantly differnet from zero, or you can use bootstrapping. These deserve their own demonstrations and are beyond the scope of this project.

In this example of SEM, we used continuous variables. If you have endogenous variables that are not continuous, you would approach this all a bit differently.