Structural Equation Modeling (SEM) allows you to study structural relationships between variables, including observed and latent variables. Here I’ll be summarising the content of Structural Equation Modelling with Lavaan by Broc and Gana.

Chapter 1 - Structural equation modelling

Basic concepts

Covariance and bivariate correlation

Correlation and covariance measure the (linear) relationship between two or more variables.

This figure describes the covariance or correlation between two variables. As discussed in the EFA theory, an arrow connecting a variable to itself represents its variance. Covariance describes how one variable varies in relation to another, and is defined as:

\[cov_{XY}= \frac{\sum(X-M_X)(Y-M_Y)}{N-1}\]

where \(M\) is the mean, and \(N\) is the sample size.

Variance is obtained by:

\[var=\frac{\sum(X-M)^2}{N-1}\]

The values of a covariance have no limits. A positive covariance indicates that as one variable increases, so does the other; a negative covariance indicates the opposite.

In contrast to covariance, correlation measures the relationship between two variables after they have been standardized (centred to have a mean of 0, and standard deviation of 1). This standardization into z-scores is done by:

\[Z_{X}=\frac{X-M}{\sigma}\]

Where \(Z_X\) is the standardized score of the variable \(X\), \(M\) is the mean of \(X\), and \(\sigma\) is the standard deviation (SD) of \(X\) (which is \(\sigma=\sqrt{var} =\sqrt{\frac{\sum(X-M)^2}{N-1}}\), expressed in the same units as the variable).

The correlation between the two standardized variables is calculated as:

\[r_{XY}=\frac{\sum{(Z_X)(Z_Y)}}{N-1}\]

where \(Z_X\) and \(Z_Y\) are the standardized values of the variables \(X\) and \(Y\) respectively. Correlation is easier to interpret than covariance, and is often represented by the Bravais-Pearson coefficient \(r\), which tells us both the direction and strength of the relationship. The correlation matrix is simply a matrix describing the correlations between two or more variables. There are also other types of correlation; tetrachoric and polychoric correlation coefficients are common because they are suitable for non-interval data. Tetrachoric correlation estimates the association between two dichotomous variables, and polychoric correlation is used for ordinal-level data.
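
To make these definitions concrete, here is a minimal sketch (using made-up data, not from the book) that computes the covariance and correlation by hand and checks them against R's built-in functions:

# A quick check of the formulas above against R's built-in functions
set.seed(123)
X <- rnorm(50, mean = 10, sd = 2)
Y <- 0.5 * X + rnorm(50)

sum((X - mean(X)) * (Y - mean(Y))) / (length(X) - 1)  # covariance formula
cov(X, Y)                                             # built-in equivalent

Z.X <- (X - mean(X)) / sd(X)                          # z-scores of X
Z.Y <- (Y - mean(Y)) / sd(Y)                          # z-scores of Y
sum(Z.X * Z.Y) / (length(X) - 1)                      # correlation formula
cor(X, Y)                                             # built-in equivalent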

Partial correlation

What about when we observe a correlation between two variables, but this correlation is actually the consequence of a confounding variable? Let's look at an example where wellbeing and chocolate consumption appear correlated because they are confounded by wealth.

# Create the DF - Z.Wealth is the confounder, X.Well is wellbeing and Y.Choc is chocolate consumption
Z.Wealth <- seq(1, 101, 1)
df <- data.frame("X.Well" = rnorm(101, Z.Wealth, 20), "Y.Choc" = rnorm(101, Z.Wealth, 20), Z.Wealth)

# Correlation (kable() and kable_styling() come from the knitr and kableExtra packages)
round(cor(df), 2) %>%
  kable(format = "html", table.attr = "style='width:50%;'") %>%
  kable_styling()
         X.Well Y.Choc Z.Wealth
X.Well     1.00   0.68     0.88
Y.Choc     0.68   1.00     0.83
Z.Wealth   0.88   0.83     1.00

Here we can use partial correlations to estimate the relationship between X and Y accounting for Z, which will be held constant. This correlation is written as \(r_{XY.Z}\), calculated as:

\[r_{XY.Z}=\frac{r_{XY}-(r_{XZ})(r_{YZ})}{\sqrt{(1-r^2_{XZ})(1-r^2_{YZ})}}\]

In this equation, \(Z\), describing wealth, is held constant; we remove the relationship of \(Z\) with both \(X\) and \(Y\), leaving only the relationship between \(X\) and \(Y\).

Let's apply this to the data presented above.

# Function to calculate partial correlations 
partial_func <- function(X, Y, Z){
  # Pairwise correlations
  r.XY <- cor(X, Y)
  r.XZ <- cor(X, Z)
  r.YZ <- cor(Y, Z)
  
  # Numerator: the XY correlation minus the product of their correlations with Z
  numerator <- r.XY - (r.XZ * r.YZ)
  
  # Denominator: scales by the variance in X and Y not shared with Z
  denominator <- sqrt((1 - r.XZ^2) * (1 - r.YZ^2))
  
  r.XY.Z <- numerator / denominator
  return(r.XY.Z)
}

# Implement the partial correlation
round(partial_func(X = df$X.Well, Y = df$Y.Choc, Z = df$Z.Wealth), 2)
## [1] -0.17
# # Double check the equation is correct (pcor() is from the ppcor package)
# pcor(df)

If we implement the partial correlation, we can now see that the correlation between \(X\) and \(Y\) is no longer strongly positive, since we’ve accounted for \(Z\).

It is unwise to interpret the correlation or partial correlation as indicating causality. Causality may be considered according to three criteria:

  • The association rule: two variables must be statistically associated.
  • There must be a causal order, often one event happening before another.
  • The non-artificiality rule: the association between two variables must not disappear when the effect of another variable higher in the causal order (above the two variables) is removed.

Linear regression analysis

Regression is called “simple” when there is only one predictor variable describing a criterion variable (dependent variable), such as when \(X\) predicts \(Y\). In such a model the regression coefficient can be either non-standardized (\(B\)) or standardized (\(\beta\)), and \(e\) is the residual error. This can be graphically expressed as:

Here the arrow connecting the explanatory variable to the response describes the coefficient, the circular arrow at the explanatory variable is the variance in that variable, and the circular arrow at the response is the residual variance in the response variable. This model is written as:

\[Y=\alpha+\beta X + e\] where \(\alpha\) is the intercept, and \(\beta\) is the coefficient. This model is often estimated using OLS, where estimates are found by minimising the sum of squared differences between the observed and predicted values of the linear function.

This maximises the correlation \(R\) between the observed and predicted values; its square, \(R^2\), describes the proportion of variance in the response variable attributable to the explanatory variable or variables, and is an indicator of model fit.

Remember, non-standardized coefficients reflect the original units of the explanatory variable, but standardized coefficients typically range between -1 and 1. A coefficient \(\beta\) indicates the expected increase in \(Y\), in standard deviation units, for a 1 SD increase in \(X\). If all variables are standardized, then you can compare the \(\beta\) associated with different variables.
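
As a minimal sketch (made-up data, not from the book), the difference between \(B\) and \(\beta\) can be seen by fitting the same regression on raw and on standardized variables:

# B vs beta: the same relationship in original units and in SD units
set.seed(42)
x <- rnorm(200, mean = 50, sd = 10)       # predictor in its original units
y <- 2 * x + rnorm(200, sd = 15)          # response
coef(lm(y ~ x))["x"]                      # B: change in y per one-unit change in x
coef(lm(scale(y) ~ scale(x)))[2]          # beta: change in y (in SDs) per 1 SD change in x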

When the model has at least two explanatory variables, we call it multiple regression. Below we simulate a model where the explanatory variables are uncorrelated.

# Create df
X <- rnorm(100,0,1)
Z <- rnorm(100,0,1)
Z <- Z - lm(Z~X)$fitted # making the two variables orthogonal 
Y = X*1 + Z*1 +  rnorm(100,0,.1)
df1 <- data.frame("Y"=Y,"X"=X,"Z"=Z)

# The model
model_1 <- 'Y ~ X + Z' # regression

# Run the model (sem() comes from the lavaan package)
fit_1 <- sem(model_1, df1)

With uncorrelated variables, we can see that the results of the simple linear regression (only \(Y\) and \(X\)) and multiple regression (\(Y\), \(X\) and \(Z\)) are approximately the same.

# The simple regression
summary(lm(df1$Y~df1$X))
## 
## Call:
## lm(formula = df1$Y ~ df1$X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2426 -0.5985 -0.2032  0.6668  3.2660 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.001888   0.099040   0.019    0.985    
## df1$X       0.986123   0.101203   9.744 4.35e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9858 on 98 degrees of freedom
## Multiple R-squared:  0.4921, Adjusted R-squared:  0.4869 
## F-statistic: 94.95 on 1 and 98 DF,  p-value: 4.35e-16

However, when they are correlated we find that the results of the simple regression differ from those of the multiple regression; the multiple regression approximates the true coefficient, but the results of the simple regression are clearly off.

# Create df
cor <- rnorm(100,0,.5)
X <- cor + rnorm(100,0,1)
Z <- cor + rnorm(100,0,1)
Y = X*1 + Z*1 +  rnorm(100,0,.1)
df2 <- data.frame("Y"=Y,"X"=X,"Z"=Z)

# The model
model_2 <- 'Y ~ X + Z' # regression

# Run the model
fit_2 <- sem(model_2, df2)
# Simple regression with a correlated variable
summary(lm(df2$Y~df2$X))
## 
## Call:
## lm(formula = df2$Y ~ df2$X)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.38507 -0.78822  0.09523  0.65450  2.28799 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.22643    0.09824  -2.305   0.0233 *  
## df2$X        1.13976    0.07687  14.828   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9817 on 98 degrees of freedom
## Multiple R-squared:  0.6917, Adjusted R-squared:  0.6886 
## F-statistic: 219.9 on 1 and 98 DF,  p-value: < 2.2e-16

At a more basic level, we can see this difference when running correlations and partial correlations with uncorrelated and correlated explanatory variables. With uncorrelated explanatory variables, the association between \(X\) and \(Y\) is very similar whether or not we account for \(Z\) (although I’d have thought they would have been even more similar - note that the first value below is actually a covariance, from cov(), rather than a correlation).

# Covariance (note: cov(), not cor())
cov(df1$X,df1$Y)
## [1] 0.9450399
# Partial correlation 
round(partial_func(df1$Y,df1$X,df1$Z),3)
## [1] 0.994

But they differ when the explanatory variables are correlated - the correlation is less than the partial correlation.

# Correlation 
cor(df2$X,df2$Y)
## [1] 0.8316846
# Partial correlation 
round(partial_func(df2$Y,df2$X,df2$Z),3)
## [1] 0.998

Because the true coefficients were set to 1 and the predictors were drawn with mean 0 and SD 1, we expect that increasing \(X\) by 1 SD will result in a 1 SD higher value of \(Y\), holding \(Z\) constant. We also added some noise to the data, but this is approximately the result we found.

We can calculate the \(R^2\) of our model to see how well the results fit the data. The \(R^2\) is calculated by:

\[R^2=\beta_1r_{YX}+\beta_2r_{YZ}\]

We can look at the \(R^2\) of our second model and find it is very high - unsurprisingly, since we’ve accounted for all but the random noise we added to \(Y\).

inspect(fit_2, 'r2') 
##     Y 
## 0.997
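
As a sanity check (not from the book), we can reproduce this value by hand using the formula above, taking the standardized coefficients from an ordinary regression on the standardized data:

# Hand-check of R^2 = beta1*r_YX + beta2*r_YZ using df2
z_df2 <- as.data.frame(scale(df2))                    # standardize all variables
betas <- coef(lm(Y ~ X + Z, data = z_df2))[-1]        # standardized coefficients
r_vec <- c(cor(df2$Y, df2$X), cor(df2$Y, df2$Z))      # correlations of Y with X and Z
sum(betas * r_vec)                                    # should be close to the value above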

If we ran the model with only one explanatory variable, we’d get a much lower fit:

# The model
model_3 <- 'Y ~ X ' # regression

# Run the model
inspect(sem(model_3, df2), 'r2') 
##     Y 
## 0.692

Two points to consider with multiple regression:

  • First, the coefficients of highly collinear variables are challenging to estimate. This is a perennial problem (including within SEM). However, in some cases it might be appropriate to estimate the common cause of variation between highly correlated variables, and use this as an explanatory variable (i.e. using factor analysis within SEM).
  • Second, multiple regression is additive, which precludes modelling both the direct and indirect (through a mediator variable, for instance) effects of an explanatory variable. This second limitation can be addressed through path analysis, and more generally SEM.

Standard error of the estimate

The standard error (SE) of the estimate is a measure of the prediction error, or the accuracy of the estimate (either \(B\) or \(\beta\)), and is used to assess statistical significance. To assess if an estimate is statistically significant at a given confidence level, simply divide the coefficient by its SE (\(\frac{\beta}{SE}\)), which gives a statistic \(z\). If this value is greater than 1.96, the \(\beta\) is significantly different from zero at \(p<0.05\); if greater than 2.58, it is significantly different from zero at \(p<0.01\). We can find the SE of \(\beta\) with the equation:

\[SE_{YX}=\sqrt{\frac{N-1}{N-2}\,var_Y\,(1-r_{YX}^2)}\]

with \(N\) being the sample size, and \(var\) being the variance. The sample size is important for the SE, and will be discussed further in the context of the SEM. Not mentioned in the book, but worth highlighting - \(p\) values have lots of limitations, and should not be taken too seriously.
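
As a quick illustration (assuming fit_2 from the earlier example is still in the workspace), the \(z\) statistics reported by lavaan are just the estimates divided by their standard errors:

# z = estimate / SE, recomputed by hand from lavaan's parameter table
pe <- parameterEstimates(fit_2)               # estimates, SEs, z and p values
pe$z_check <- pe$est / pe$se                  # should match the z column for free parameters
subset(pe, op == "~")[, c("lhs", "op", "rhs", "est", "se", "z", "z_check")]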

Factor analysis

This is already discussed in detail [here](http://rpubs.com/thomas_pienkowski/541005), so I’ll make a couple of points:

  • \(\lambda_{ij}\) is the factor loading of the \(i\)th row of the matrix factor loadings, which is the \(i\)th item in a battery of items, and the \(j\)th common factor. This notation is sometimes displayed in SEM plots.
  • It appears that the package ‘lavaan’, and the book, use oblique rotation by default.
  • As a result, it appears like the package ‘lavaan’ and the book, also treat factor loadings as standardised partial regression coefficients.
  • However, the book appears to make two mistakes. Firstly, it claims that factor loadings should be bounded between -1 and 1, but if they are standardised regression coefficients then they can still take values larger or smaller than this. Relatedly, they say that taking the square of a factor loading provides the proportion of variance explained by the factor. Both contradict what was said in Exploratory Factor Analysis by Fabrigar and Wegener, a book dedicated to the subject.
  • The reproduced (or reduced) correlation matrix \(\Sigma\) (or \(\Lambda\Phi\Lambda^\top\)/\(\Lambda\Lambda^\top\) in the notation of the previous walkthrough) is the correlation of items implied by the estimated factor loadings. The difference between the observed matrix \(S\) (or \(P\) in the previous notation) and the reproduced correlation matrix \(\Sigma\) (or \(\Lambda\Phi\Lambda^\top\)/\(\Lambda\Lambda^\top\)) is the residual matrix \(S-\Sigma\) (or \(\mathrm{D}\psi\)).

Data distribution normality

According to the book, SEM models generally assume that the data are drawn from a multivariate normally distributed population. But is this the same as saying that the residuals of \(Y\) are normally distributed? They recommend evaluating the kurtosis and skewness of the actual data (but do not mention the residuals). They also mention statistical tests for assessing skewness, which I’m a bit sceptical about.

Basic principles of SEM

As in factor analysis, the reproduced (or reduced) correlation matrix is an important part of SEM. In SEM, we compare the covariance/correlation matrix of the observed data (\(S\)/\(P\) if using prior notation) with the covariance matrix reproduced from the model (\(\Sigma\)/\(\Lambda\Phi\Lambda^\top\)). If the reproduced matrix is approximately equal to the observed matrix (\(S=\Sigma\)) then the null hypothesis (\(H_0: S=\Sigma\)) is not rejected, and the model is acceptable.

The residual covariance/correlation matrix is the difference between the observed and reproduced matrix (\(S-\Sigma\)). The similarity between the two is the model fit.

The reproduced covariance/correlation matrix (\(\Sigma\)) is found using the model parameter estimates. Estimating these parameters involves finding a value for each unknown parameter in a specific model. The residual matrix (\(S-\Sigma\)) is minimised by iteratively changing the parameter estimates until an approximately optimal solution is found. Different algorithms do this in different ways.

OLS seeks to minimise the squared differences between the observed and reproduced correlations (\(\sum d^2\)). Different parameter estimates are chosen iteratively until a certain similarity threshold is passed and the model converges on a solution. The figure below shows how our simple multiple regression converges relatively quickly.

Once the model has converged, the remaining difference between the reproduced and observed matrices is the residual matrix. We can see this below, first looking at the observed covariance matrix:

cov(df2)
##          Y         X         Z
## Y 3.094173 1.8777892 1.2033512
## X 1.877789 1.6475240 0.2320993
## Z 1.203351 0.2320993 0.9643534

Then the reproduced (model-implied) covariance matrix (model_fitted is the lavaan model fitted to df2, estimated in a chunk not shown here):

as.matrix(fitted(model_fitted)[[1]])
##   Y     X     Z    
## Y 3.063            
## X 1.859 1.631      
## Z 1.191 0.230 0.955

and finally the residual difference between the two:

cov(df2) - as.matrix(fitted(model_fitted)[[1]])
##   Y     X     Z    
## Y 0.031            
## X 0.019 0.016      
## Z 0.012 0.002 0.010

This residual matrix provides clues to whether the specified model adequately reproduces the original correlation matrix.

The estimator

The estimator is the set of mathematical functions used to find parameter estimates such that the reproduced matrix is close to the original matrix. The estimator should give accurate, unbiased, and consistent solutions. We’ll talk about two estimators:

  • One for when we’re assuming multivariate normality
  • One for use when the data deviate from the normal distribution

In both cases, we use the \(\chi^2\) test to determine whether the null hypothesis (\(H_0: S=\Sigma\)) is admissible (fails to be rejected). This test statistic is defined as:

\[\chi^2=(N)F_{min}\] where \(N\) is the sample size, and \(F_{min}\) is the minimum of the discrepancy function \(F[S,\Sigma]\) obtained by the estimation method used. A significant \(\chi^2\) rejects the null hypothesis that the original correlation matrix is sufficiently close to the reproduced one. When \(\chi^2\) approaches zero (\(\chi^2=(N)F_{min}=0\)) we fail to reject the null hypothesis. The threshold for significance is determined by a significance table, which gives the critical \(\chi^2\) value for a given number of degrees of freedom \(df\) and significance level (e.g. \(<0.05\)). \(df\) is calculated as:

\[df=[k(k+1)/2]-t\] where \(t\) is the number of parameters to be estimated in the model and \(k(k+1)/2\) is the number of variances and covariances available, \(k\) being the number of observed variables. The \(\chi^2\) tells us the statistical significance of the model’s fit, under certain conditions, particularly regarding sample size and data distribution.
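
Rather than looking the critical value up in a table, it can be obtained directly in R; for example, for 2 degrees of freedom at the 0.05 level:

# Critical chi-square value for df = 2 at the 0.05 significance level
qchisq(0.95, df = 2)   # ~5.99; a model chi-square above this rejects H0: S = Sigma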

However, the \(\chi^2\) is sensitive to sample size (the larger the sample, the greater the risk of rejecting an acceptable model by accident). Because of this, other methods for assessing model fit have emerged.

In the case of normally distributed data, two estimators are normally used:

  • Maximum likelihood method (ML, \(F_{ML}\))
  • Generalized least squares (GLS, \(F_{GLS}\))

The Weighted Least Squares (WLS) method is based on a polychoric correlation matrix; it is not recommended for small samples, but has the advantage of not depending on the form of the data distribution.

For non-normally distributed data there are other options. Ordinal and binary outcome measures are common, as is non-normally distributed data more generally. ML is not technically appropriate for ordinal data, and can produce biased estimates and standard errors. However, with sufficiently large sample sizes and without severe violations, the results are approximately OK.

When normality assumptions are more severely violated, there are more robust estimation methods. In the family of robust ML, there is the Satorra-Bentler \(\chi^2\), which gives the same result as normal ML if the assumption is not violated, and the Yuan-Bentler \(\chi^2\), which is suitable for small sample sizes. These are equivalent to the MLM and MLR estimators in lavaan.

Outside of ML, there is the weighted least squares (WLS) method, which analyses a polychoric or polyserial correlation matrix but has two main disadvantages. Firstly, it requires a fairly large sample. Secondly, it often has problems converging (creating Heywood cases etc.), especially for complex models. Alternatively, there is the Diagonally Weighted Least Squares (DWLS) method, which can be useful for small sample sizes and in the case of severe violations. DWLS also uses polychoric or polyserial correlations. In lavaan, there are two robust versions called WLSM and WLSMV. All these methods can encounter problems when the number of variables is more than 20, and need large sample sizes (apart from DWLS, I guess).

A third approach is through bootstrap resampling, which requires neither a normal distribution nor large sample, but is not recommended for binary or ordinal data (with few responses). In this approach, you resample with replacement \(N\) number of times, and the model is re-estimated every time. Then, the average across models is calculated, and confidence intervals determined by the actual spread of the estimates.

The below table summarises these recommendations, according to the book.

Data type | Estimator
Continuous, approximately normal distribution | ML
Continuous, violation of normality assumption | ML (in case of moderate violation), MLM, MLR, Bootstrap
Ordinal/categorical, approximately normal distribution | ML (if at least 6 response categories), MLM, MLR (if at least 4 response categories), WLSMV (binary response or 3 response categories)
Ordinal/categorical, violation of normality assumption | ML (if at least 6 response categories), MLM, MLR (if at least 4 response categories), WLSMV (in case of severe violation)

Ultimately, the choice of estimator depends on the following (a few illustrative lavaan calls are sketched after this list):

  • The data type. The most appropriate estimator for binary data is WLS, and its extension WLSMV.
  • Data distribution properties, as discussed.
  • Whether the data are raw or in the form of a correlation/covariance matrix - all methods apart from ML require the raw data.
  • Sample size - as discussed above.
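
As a purely illustrative sketch of how these choices translate into lavaan calls (using lavaan's built-in HolzingerSwineford1939 data and a one-factor toy model, neither of which is from the book; my_ordinal_data in the commented line is a placeholder):

# The estimator is set via the `estimator` argument; ordinal indicators are declared via `ordered`
library(lavaan)
toy_model <- 'visual =~ x1 + x2 + x3'
fit_ml  <- cfa(toy_model, data = HolzingerSwineford1939, estimator = "ML")   # normal-theory ML
fit_mlr <- cfa(toy_model, data = HolzingerSwineford1939, estimator = "MLR")  # robust (Yuan-Bentler) ML
# For ordinal indicators, declaring them as ordered switches to a categorical estimator such as WLSMV:
# fit_cat <- cfa(toy_model, data = my_ordinal_data, ordered = c("x1", "x2", "x3"), estimator = "WLSMV")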

The sample size and statistical power

SEM requires large sample sizes; a few rules of thumb exist:

  • SEM requires at least 100-200 observations (seems a bit arbitrary).
  • You should have at least five (but probably 10-20) times more observations than free parameters when using ML, and ten (or more) times when using the Asymptotic Distribution-Free (ADF) estimation method.

These rules of thumb look very arbitrary, and must depend on the model complexity.

Sample size and statistical power are closely linked; the former determines the latter. The function ‘findRMSEAsamplesize’ in the package “semTools” offers the option of estimating a required sample size for a specified model, based on a hypothetical RMSEA value (see the EFA theory). [This section is a bit rubbish - not very clear or conclusive.]

Model evaluation

How well does our model fit our data? The book initially suggests \(\chi^2\) tests but these are crap and sensitive to sample sizes. There are two groups of alternative tests: “overall goodness-of-fit” tests (which can be used to accept or reject a model), and local tests (focusing on specific parameter estimates).

Overall goodness of fit

There are a large number of GoF tests, but different sources use different thresholds and interpretations of results. The book classifies them into three categories:

  • absolute fit indices
  • incremental fit indices
  • parsimonious fit indices

The most commonly used for each category will be presented.

Absolute fit indices These are based on a comparison between the observed covariance matrix (\(S\)) and the reproduced one (\(\Sigma\)), which yields a residual matrix (\(S-\Sigma\)) that can be used to calculate the Root Mean Square Residual (RMR) or the Standardized Root Mean Square Residual (SRMR). The standardized version is easier to interpret, since it removes the effect of scale. The smaller the deviation between the data and model (\(S-\Sigma\)), the smaller the value of RMR or SRMR. The values of SRMR range between 0 and 1 - the closer to 0 the better, and values less than 0.08 are considered good.

In the case of estimating a model with categorical data using WLS or DWLS, you should use the Weighted Root Mean Square Residual (WRMR); a value less than 1 indicates a good fit.

Parsimonious fit indices This approach favours more parsimonious models; more complex models are considered less theoretically plausible and more likely to suffer from over-parameterisation. The parsimony ratio is the ratio of the degrees of freedom of the theoretical model \(df_t\) to those of the null model \(df_n\), equal to \(k(k-1)/2\) where \(k\) is the number of measured variables (as before):

\[PRATIO = \frac{df_t}{df_n}\]

We can get a different ratio by dividing \(df_t\) by the maximum possible number of degrees of freedom \(df_{max}\), equal to \(k(k+1)/2\). The values of these ratios vary between 0 and 1 - a model is more parsimonious the closer the ratio is to 1, and a saturated model gets 0. However, the most recommended index is the RMSEA - as calculated in the EFA theory walkthrough, with a 90% CI. Refer back to the EFA framework for an indication of what counts as a good fit.

When comparing models, you might choose to use the AIC (\(AIC=-2\log L+2p\), where \(\log L\) is the log-likelihood and \(p\) is the number of estimated parameters) or the BIC (\(BIC=-2\log L+p\log(N)\), as above but with the penalty scaled by the log of the sample size \(N\)). These values are arbitrary on their own, but when comparing models lower values are preferred. BIC penalises complex models more. The book doesn’t provide recommended \(\Delta AIC\) or \(\Delta BIC\) thresholds, but \(\Delta AIC >2\) normally indicates a meaningful difference in fit.
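
In lavaan, these indices can be pulled out of a fitted model with fitMeasures(); a sketch using fit_2 from earlier (which, being just-identified, will show trivially perfect fit, but the call is the same for any model):

# Extract a selection of the fit indices discussed above
fitMeasures(fit_2, c("chisq", "df", "pvalue",
                     "srmr", "rmsea", "rmsea.ci.lower", "rmsea.ci.upper",
                     "cfi", "tli", "aic", "bic"))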

Incremental fit indices These are for use in the case of “nested models” (not talking about random effects). It’s very unclear what these are, so I won’t waste time looking at them.

Local fit indices (parameter estimates)

These indices refer to the individual parameter estimates of the model, which allow for a more analytic assessment of the solution. It allows you to check that there are no inadmissible or nonsensical values, like standardized estimates greater than 1 or negative variances (Heywood cases). You should not accept models that have not converged. Also, examine parameter estimates to make sure they make theoretical sense. Similarly, it’s also useful to examine the SEs to ensure they are not nonsensically small or large.

Causal diagrams

The best way to formalise a theoretical model is to diagram it. These are referred to as causal models or causal diagrams, even though the statistical analysis they are derived from cannot (normally) be used to infer causality. We also use the term “effect” to describe relationships between variables in a causal diagram.

These causal models provide exact representations of the SEM. There are some general conventions used when creating these diagrams, which contain five types of object.

  • Rectangles are measured variables.
  • Large circles or ovals represent latent variables.
  • Small circles are residual variables (error), which are not directly observed but are derived from the structural model, which is why they are treated as latent variables and drawn inside a circle.
  • Arrows pointing in one direction are used to indicate the direction of effect.
  • Two headed arrows represent stochastic relationships like covariances or correlations.

Variables can be either exogenous or endogenous. Exogenous variables have no single-headed arrows pointing at them (but can have correlations); endogenous variables do. The “ultimate” variable is the last variable (or variables) in the system, and is often the criterion variable.

Chapter 2 - SEM software

The book goes into lots of detail about R. We’ll skip this, and jump to the ‘lavaan’ specific content.

Major operators of lavaan syntax

Command | Operator | Example | Description
Estimate a covariance (correlation) | ~~ | X ~~ Y | X is correlated with Y
Estimate a regression | ~ | Y ~ X | Y is regressed on X
Define a reflective latent variable | =~ | F =~ item1 + item2 + item3 | The factor F is measured by indicators item1, item2 and item3, which it affects
Define a formative latent variable | <~ | F <~ item1 + item2 + item3 | The factor is formed by items 1, 2 and 3
Estimate the intercept | ~1 | item1 ~ 1 | Intercept of item 1
Label/fix a parameter | * | F =~ 1*item1 + b1*item2 + b2*item3 | The loading of item 1 is fixed to 1, item 2's loading is labelled "b1" and item 3's "b2"; labels must begin with a letter
Constrain parameters to equality | == | b1 == b2 | The loading labelled b1 (item 2) is constrained to equal the loading labelled b2 (item 3)
Create a new parameter | := | b1b2 := b1*b2 | Define a parameter that is not in the model (for example, an indirect effect) from existing parameters
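
As a purely illustrative sketch (the variable and item names are made up), several of these operators can be combined in a single syntax string:

# A toy syntax string combining the operators in the table above
model.SPE <- '
  F =~ 1*item1 + b1*item2 + b2*item3   # reflective factor: one fixed and two labelled loadings
  F ~ X                                # the factor regressed on an observed predictor
  item2 ~~ item3                       # residual covariance between two indicators
  item1 ~ 1                            # intercept of item 1
  loadsum := b1 + b2                   # a user-defined parameter built from the labels
'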

Main steps in using lavaan

Let’s refer to the dataset as “BASE”, the specified model as “model.SPE”, and the estimated model as “model.EST”. The analysis typically follows these steps (a compact worked example is sketched after the list).

  1. Import the dataset
  2. Specify the model - e.g. model.SPE <- 'model syntax'
  3. Estimate the model - e.g. model.EST <- sem(model.SPE, estimator = "ML", data = BASE)
  4. Retrieve results - e.g. summary(model.EST, fit.measures = T, standardized = T, modindices = T)
  5. View the output diagram of the model - e.g. semPaths(model.EST, "std")
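
Putting these steps together on the df2 data simulated earlier (a sketch; semPaths() comes from the semPlot package):

# A compact run-through of steps 2-5 using df2 from earlier
model.SPE <- 'Y ~ X + Z'                                       # 2. specify the model
model.EST <- sem(model.SPE, estimator = "ML", data = df2)      # 3. estimate it
summary(model.EST, fit.measures = TRUE, standardized = TRUE)   # 4. retrieve the results
# semPaths(model.EST, "std")                                   # 5. plot (requires semPlot)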

Main fitting functions in lavaan

Function | Example | Description
cfa | model.EST <- cfa(model.SPE, data = BASE) | cfa = confirmatory factor analysis
sem | model.EST <- sem(model.SPE, data = BASE) | sem = structural equation modeling
growth | model.EST <- growth(model.SPE, data = BASE) | growth = latent growth modeling

Chapter 3 - Steps in SEM

Model parameters and identification

A model is a set of parameters that indicate relationships between variables. Parameterizing a model is the process of distinguishing between its elements to communicate them to the software that estimates the model. In a model, a parameter can be free, fixed, or constrained. A free parameter is one that is fully estimated. A fixed parameter is assigned a value a priori. A constrained parameter is estimated freely but must equal one or more other parameters.

The number of parameters to be estimated influences its identification. Model identification occurs before estimation, and is a fundamental operation. Model identification involves two aspects:

  • First, it refers to the unique nature of a solution; an identified model is one that yields the unique solution for each of specified parameters. For instance, there is no unique solution for \(6=X+Y\), because \(X\) and \(Y\) could equal 3 and 3, 2 and 4, or any other combination of numbers that equal 6. In this case, we may constrain \(X\) to 1, which means \(Y\) is 5.

  • Second, identification also refers to the number of free parameters with respect to the size of the data. The data available are the variances and covariances of the measured variables, numbering \(k(k+1)/2\), where \(k\) is the number of observed variables. The problem is knowing whether we have enough information relative to the measured variables to estimate all unknown parameters in the model. Three cases are discussed:

  • “Just-identified” (or saturated), when the number of variances-covariances of the observed variables is equal to the number of free parameters. The \(df\) is the difference between the first and the second, which is 0 for a just-identified model. This type of model can never be statistically rejected, so has limited value.

  • “Under-identified”, when the number of variances-covariances is less than the number of free parameters to be identified. These models do not have enough information to estimate the specified model, which is thus unusable. In this case, you may constrain some parameters, for instance fixing them to 0 (effectively removing them).

  • “Over-identified”, when the number of variances-covariances is greater than the number of free parameters, so the \(df\) becomes positive. An over-identified model is testable.

Models with observed variables (path models)

Path analysis allows for:

  • multivariate analysis
  • decomposing the total effect of one variable on another into direct and indirect effects (mediated effects)
  • but not the inclusion of unmeasured latent variables

Identification of a path model

When considering the number of free parameters to be estimated, count up the number of:

  • causal paths
  • covariances
  • variances

Each endogenous variable also has a residual variable, the error term, which is the part of the variance that cannot be attributed to the other explanatory variables. Since this variable is not directly observed, it is considered a latent variable. Moreover, we’re generally not concerned with the effect these have on the variable, so their coefficients are set to 1, which means they do not have to be estimated as free parameters.

The following model includes three causal paths, one covariance, and four variances (for each variable) totaling eight free parameters to be estimated.

# Create df
cor <- rnorm(1000,0,1)
X <- cor + rnorm(1000,0,1)
W <- X + rnorm(1000,0,1)
Z <- cor + rnorm(1000,0,1)
Y = X*1 + Z*1 +  rnorm(1000,0,.1)
df2 <- data.frame("Y"=Y,"X"=X, "W"= W, "Z"=Z)

# The model
model_5 <- 'Y ~ X + Z
Z ~~ W 
X ~ W' 

# Run the model
fit_5 <- sem(model_5, df2)

We also have four observed variables, which means the variance-covariance matrix contains ten elements:

k = 4
k*(k+1)/2
## [1] 10

So the degrees of freedom are:

k = 4
df = (k*(k+1)/2)-lavInspect(fit_5, what = "npar")
df
## [1] 2

With two degrees of freedom, the model is over-identified which is fine.

Model specification using lavaan

Specifying a model involves translating it from symbols to operators. There are as many equations as there are endogenous variables in the model, and within an equation there are as many terms as there are arrows directed towards that endogenous variable. The relationship between variables is a linear function whose value is the path coefficient (P). For instance, in the model above there are two endogenous variables, so the equations look like:

\[ Y = X + Z + \epsilon_1\] \[X = W + \epsilon_2\] where \(\epsilon_1\) and \(\epsilon_2\) are the residual errors for each endogenous variable (whose coefficients are set to 1, so they do not have to be estimated).

Translated into R this looks like this:

# The model
model <- 'Y ~ X + Z
Z ~~ W 
X ~ W' 

If we want to remove the correlation between \(Z\) and \(W\), we specify:

# The model
model <- 'Y ~ X + Z
Z ~~ 0*W # no correlation 
X ~ W' 

Direct and indirect effects

The total effect of one variable on another can be broken into direct and indirect effects (where appropriate). A direct effect is where a variable directly predicts another. An indirect effect is where a variable predicts another variable, which in turn predicts the variable of interest; in this case the first variable affects the last through its indirect effect. This is the simplest form of mediation.

However, there are also “non-predictive effects”, which exist when there is covariance between two or more exogenous variables. When one variable is considered, its effect is partly dependent on the values of other variables. In the model below, \(Z\) has a direct effect on \(Y\), but also an indirect effect on \(Y\) through \(X\), and vice versa.

The direct effect is simply the coefficient connecting two variables. The indirect effect is the product of the two path coefficients, and the total effect is the sum of all direct and indirect effects of the independent variable on the dependent variable.

In a path model, the decomposition of the correlation (or covariance) between two variables allows you to distinguish between the direct and indirect effects of the independent variable (X) on the dependent variable (Y). The decomposition is done using this formula:

\[r_{YX} = \sum_{q} P_{Yq}r_{qX}\]

where \(q\) are all the variables in a model that have a direct effect on the dependent variable \(Y\) and \(P_{Yq}\) is the path coefficient that links a variable \(q\) to \(Y\).

I wasn’t able to work out their explanation. A simpler approach appears to be:

  • Indirect effects are calculated by multiplying the coefficient connecting the independent variable to the mediator by the coefficient from the mediator to the dependent variable. In the example below, if \(W\) is the independent variable, \(X\) the mediator, and \(Y\) the dependent variable, then we multiply the coefficients denoted by paths 1 and 4 together.
  • Direct effects are simply the coefficient on the path connecting the independent variable directly to the dependent variable. In our example, still using \(W\) as the independent variable, this coefficient is denoted by path 3.
  • The total effect of \(W\) on \(Y\) is the sum of the direct and indirect effects - i.e. path 3 + (path 1 × path 4).

Let’s use an example with real data, below. Here the values of paths 1, 3, and 4 have been replaced with the coefficients estimated from the fitted model (fit_9, fitted in a chunk not shown here).

Here the indirect effect is:

round(fit_9@ParTable$est[4]*fit_9@ParTable$est[1],2)
## [1] 0.67

The direct effect is:

round(fit_9@ParTable$est[3],2)
## [1] 0

And so the total effect is:

round(fit_9@ParTable$est[3] + (fit_9@ParTable$est[4]*fit_9@ParTable$est[1]),2)
## [1] 0.67

Now, ‘lavaan’ does not estimate the indirect and total effects by default, so this has to be done in a slightly roundabout way. We start by naming the different causal paths, and then use the := operator to estimate the indirect effects. In our example, this is done as:

# The model
model_10 <- 'Y ~ p4*Z + p1*W # Direct effects 
X ~ p2*W  # Mediated effect - W to X
Y ~ p3*X # Mediated effect - X to Y

p2p3 := p2*p3 # define new parameter - indirect effect

total :=  p1 + (p2*p3) # define new parameter - total effect
Z~~W
' 

We can examine the results; looking at the defined parameters, we see the expected indirect and total coefficients.

## lavaan 0.6-5 ended normally after 46 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of free parameters                          9
##                                                       
##   Number of observations                          1000
##                                                       
## Model Test User Model:
##                                                       
##   Test statistic                               103.010
##   Degrees of freedom                                 1
##   P-value (Chi-square)                           0.000
## 
## Parameter Estimates:
## 
##   Information                                 Expected
##   Information saturated (h1) model          Structured
##   Standard errors                             Standard
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Y ~                                                 
##     Z         (p4)    0.997    0.002  411.688    0.000
##     W         (p1)    0.001    0.003    0.415    0.678
##   X ~                                                 
##     W         (p2)    0.669    0.015   44.080    0.000
##   Y ~                                                 
##     X         (p3)    1.005    0.004  269.661    0.000
## 
## Covariances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   Z ~~                                                
##     W                 1.011    0.084   12.082    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .Y                 0.010    0.000   22.361    0.000
##    .X                 0.695    0.031   22.361    0.000
##     Z                 1.985    0.089   22.361    0.000
##     W                 3.014    0.135   22.361    0.000
## 
## Defined Parameters:
##                    Estimate  Std.Err  z-value  P(>|z|)
##     p2p3              0.673    0.015   43.503    0.000
##     total             0.674    0.015   43.805    0.000

The statistical significance of indirect effects

How do you calculate confidence intervals around these indirect effects? One common way is through bootstrapping. Here we’ll create a function to sample with replacement and run a standard SEM.

# Function to be repeated x times
# model is the lavaan model syntax
# df_sample is the (resampled) dataframe 
# here we're just extracting the estimates - more can be extracted if desired 
SEM_boot <- function(model, df_sample) {
  output <- sem(model, df_sample)
  key_output <- output@ParTable$est
  return(key_output)
}

Now let’s run the model 1000 times:

n = 1000 # Number of iterations 
boot_list <- list() # create a list to hold the estimates

# Run the SEM n times, each time on a resampled (with replacement) version of df2
for(i in seq_len(n)){
  df_sample <- df2[sample(nrow(df2), nrow(df2), replace = TRUE), ]
  boot_list[[i]] <- SEM_boot(model_10, df_sample)
}

And now we can look at the distribution of the direct, indirect, and total effects:

We can see that the total effect is almost entirely composed of the indirect effect. Moreover, we can see the confidence intervals, which are simply percentiles of the bootstrap distribution of estimates (the 2.5th and 97.5th percentiles give a 95% interval). You can do bootstrapping automatically, but I prefer the manual method as it’s easier to customise the returned results.
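
For reference, the automatic route is a sketch like the following, which asks lavaan for bootstrap standard errors directly and then requests percentile confidence intervals (model_10 and df2 are from above):

# lavaan's built-in bootstrap: resamples internally and bootstraps the := parameters too
fit_boot <- sem(model_10, data = df2, se = "bootstrap", bootstrap = 1000)
parameterEstimates(fit_boot, boot.ci.type = "perc", level = 0.95)   # percentile 95% CIs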

Model evaluation

When running the model, you can ask lavaan:

  • fit.measures = TRUE to provide goodness-of-fit indices
  • standardized = TRUE to provide standardized estimates
  • modindices = TRUE to give modification indices
  • rsq = TRUE for the \(R^2\) of the endogenous variables

Evaluation involves inspecting:

  • Overall model fit
  • areas of misfit
  • local fit indices

Binary, ordinal, and factor data

The book discusses the use of estimators (ML (if at least 6 response categories), MLM, MLR (if at least 4 response categories), WLSMV (binary response or 3 response categories)) and the problems with them.

However, the lavaan documentation suggests that if the variables are exogenous then they can simply be coded as 0/1 for binary variables, or treated as numeric if ordinal (not comfortable with that). If they are endogenous, then the documentation suggests using WLSMV or DWLS, both of which have limitations. When estimating latent variables, categorical indicators are endogenous, so we need to use WLSMV. This paper still proposes using MLR for categorical data, discussing some advantages and disadvantages compared to WLSMV.

Additionally, it’s not clear how coefficients for these variables are interpreted. For example, if it’s an ordinal explanatory variable and we standardize the variable, can we say that a 1 SD change in \(X\) leads to a given SD change in \(Y\)? Maybe it’s a 1 SD change in the underlying latent variable?

Notes

  • If using WLS and WLSMV, you can’t use full-information maximum likelihood to deal with missing data (since they take complete cases). One rubbish solution is to use missing='pairwise', which estimates based on pairwise availability.
  • By including meanstructure=TRUE when calling the SEM, we are provided with the intercept. This will be useful, especially when examining the effect of non-numeric variables.
  • parTable(model) tells you which parameters are in the model and how they are treated. A non-zero value in the ‘free’ column denotes a parameter that is freely estimated by the model.
  • “we can get standardized estimates in lavaan as well. This is a more complicated topic in SEM because we can standardize with respect to the latent variables alone (std.lv) or both the observed and latent variables (std.all). The latter is usually what is reported as standardized estimates in SEM papers.” From here. A short example is sketched below.
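
A minimal sketch of pulling out both flavours of standardized estimates (using fit_2 from Chapter 1, though any lavaan fit works):

# Standardized solutions: latent-only (std.lv) vs observed-and-latent (std.all)
standardizedSolution(fit_2, type = "std.lv")
standardizedSolution(fit_2, type = "std.all")   # what is usually reported in SEM papers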