More complex SEM’s and multiple imputation

So far we’ve just been using complete cases - next we’ll implement the model using multiple imputation. Again, you can find a walkthrough on MI and bootstrapping theory I did here. Here we’re imputing 10 datasets, using a range of models suitable for the data type (e.g. proportional odds model’s for ordinal data, logistic regression for binary variables, and predictive mean matching for continous variables, although they are actually amounts not continous variables).

The following code shows the actual MI process. This code is not actually evaluated, because R Markdown is slow. Instead, it’s run in a seperate bit of code and loaded here.

Multiple imputation

# Number of imputed df 
N.Imp = 10

# Conduct multiple imputation using the mice package
# Here we have to specify the imputation method for each variable: 
str_out <- data.frame(capture.output(str(DF3)))

# Delete the first and last row
str_out <- data.frame(str_output = matrix(str_out[2:(nrow(str_out)-1),]))

# Create a column that contain the appropreate model for each variable - this only works if the variable is of the correct data type in the first place 
str_out$type <- ifelse(grepl("Ord.factor", str_out$str_output, fixed = TRUE)==T, "polr", 
                       ifelse(grepl("num", str_out$str_output, fixed = TRUE)==T, "pmm", 
                              ifelse(grepl("Factor", str_out$str_output, fixed = TRUE)==T, "polyreg",
                                     ifelse(grepl("int", str_out$str_output, fixed = TRUE)==T, "logreg", "ERROR"))))
                       
# Conduct the MI - with the number of datasets specified by N.Imp, and the estimation type specified by str_out$type (derived from the above)
DF3_imp <- mice(DF3, m = N.Imp, method = str_out$type )

# create a df in a list for each imputed df
# and create dummy variable from factors 

# Exract each DF
mice.imp <- list()
for(i in 1:N.Imp) {

  mice.imp[[i]] <- mice::complete(DF3_imp, action= i, inc=FALSE)
  
  # Turn factor into series of binary variables
  mice.imp[[i]] <- dummy_cols(mice.imp[[i]], select_columns = c("gender", "position", "education"))

  # remove spaces and other charecters on colnames
  colnames( mice.imp[[i]]) <- gsub(" ", "", colnames( mice.imp[[i]]), fixed = TRUE)
  colnames( mice.imp[[i]]) <- gsub("/", "", colnames( mice.imp[[i]]), fixed = TRUE)
  colnames( mice.imp[[i]]) <- gsub("-", "_", colnames( mice.imp[[i]]), fixed = TRUE)
}

# Now we have to convert the ordinal exogenous variables to numeric. If we want to include them, we're somewhat forced to assume they are numeric. 
exogenous <- c("SS1" , "SS2" , "SS3", "PS_1" , "PS_2", "PS_3", "health") # the exogeous ordinal variables
for (i in 1:N.Imp) {
  mice.imp[[i]][,exogenous] <- apply(mice.imp[[i]][,exogenous], 2, as.numeric)
  }

Running a simple model on the MI datasets

Now let’s run a simple SEM on the imputed datasets.

# Load the preciously created imputed dataset
load("analy_data/mice.imp.Rdata")

# A simple K10 ~ age SEM
model_MI_test_1 <- '
# PD
PD =~ K10_1 + K10_2 + K10_3 + K10_4 + K10_5 + K10_6 + K10_7 + K10_8 + K10_9 + K10_10

# Correlated error terms 
K10_2 ~~ K10_3
K10_5 ~~ K10_6
K10_7 ~~ K10_8

# Predicting PD
PD ~ age_year
'

# Running the test model on each DF - this does the pooling automatically 
fit_MI_test_1 <- runMI(model_MI_test_1, data = mice.imp, fun="sem", estimator = "WLSMVS", ordered = c("K10_1" , "K10_2" , "K10_3" , "K10_4", "K10_5", "K10_6", "K10_7", "K10_8", "K10_9", "K10_10"))

Here lavaan.mi and semPlot are not integrated. As a result, we have to run a SEM with the same parameters and model structure. Then we subsitute the plotted valued with the pooled estimates and p-values from the lavaan.mi object. This is a point about the visualisation, rather than the statistics.

## lavaan.mi object based on 10 imputed data sets. 
## See class?lavaan.mi help page for available methods. 
## 
## Convergence information:
## The model converged on 10 imputed data sets 
## 
## Rubin's (1987) rules were used to pool point and SE estimates across 10 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

The RMSEA is 0.049 (95% CI 0.044 - 0.055).

A more complex mediated model with the MI dataset

Here we have a slightly more complex model. Additionally, lavaan has a neat feature that allows you to name coefficients in the mode, and calculate joint probabilities. You could do this by hand, but this also calculates p-values etc. for the joint probabilities. However, you still have to correctly describe how the joint probabilities are calculated, but this follows the same logic as when using probability trees.

# There are some warnings, but it does not appear to have affected the results... 

# A More complex SEM 
model_MI_test_2 <- '
### Conservation optimism
SO =~ SO_1 + SO_2 + SO_3 +SO_4 + SO_5+ SO_6 + SO_7 + SO_9

# Correlated error terms 
SO_1 ~~ SO_2
SO_3 ~~ SO_4 
SO_5 ~~ SO_6 
SO_7 ~~ SO_9

###  Dispositional optimism 
DO =~ LOTR_1 + LOTR_2 + LOTR_3 +  LOTR_4 + LOTR_5 + LOTR_6

# The method effect 
method =~  LOTR_1 + LOTR_3 + LOTR_6

### regressing SO and DO 
SO ~ SO.DO*DO

### regressing SO and DO on goal progress
GP_total ~ GP.DO*DO

### PD 
PD =~ K10_1 + K10_2 + K10_3 + K10_4 + K10_5 + K10_6 + K10_7 + K10_8 + K10_9 + K10_10

# Correlated error terms 
K10_2 ~~ K10_3
K10_5 ~~ K10_6
K10_7 ~~ K10_8

### regressing GP and DO on PD 
PD ~ PD.GP*GP_total + PD.DO*DO + PD.SO*SO 

# Setting the correlation between the method effect and DO to 0 
DO ~~ 0*method
SO ~~ 0*method

# Joint probabilities 
PD.GP.DO := GP.DO*PD.GP # DO on GP on PD
PD.SO.DO := SO.DO*PD.SO # DO on SO on PD
DO_total := PD.DO + PD.GP.DO + PD.SO.DO
'

# Running the test model on each DF - this does the pooling automatically 
fit_MI_test_2 <- runMI(model_MI_test_2, data = mice.imp, fun="sem", estimator = "WLSMVS", ordered = c("SO_1" , "SO_2" , "SO_3" , "SO_4", "SO_5", "SO_6", "SO_7",  "SO_9", "LOTR_1" , "LOTR_2" , "LOTR_3" , "LOTR_4", "LOTR_5", "LOTR_6", "K10_1" , "K10_2" , "K10_3" , "K10_4", "K10_5", "K10_6", "K10_7", "K10_8", "K10_9", "K10_10"))

## lavaan.mi object based on 10 imputed data sets. 
## See class?lavaan.mi help page for available methods. 
## 
## Convergence information:
## The model converged on 10 imputed data sets 
## 
## Rubin's (1987) rules were used to pool point and SE estimates across 10 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

## lavaan.mi object based on 10 imputed data sets. 
## See class?lavaan.mi help page for available methods. 
## 
## Convergence information:
## The model converged on 10 imputed data sets 
## 
## Rubin's (1987) rules were used to pool point and SE estimates across 10 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

## lavaan.mi object based on 10 imputed data sets. 
## See class?lavaan.mi help page for available methods. 
## 
## Convergence information:
## The model converged on 10 imputed data sets 
## 
## Rubin's (1987) rules were used to pool point and SE estimates across 10 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

Response variable	Relationship	Explanatory variable	Estimate	p-value
SO	~	DO	0.278	`***`
GP_total	~	DO	0.337	`***`
PD	~	GP_total	-0.100	`***`
PD	~	DO	-0.480	`***`
PD	~	SO	-0.059	`*`

The RMSEA is 0.049 (95% CI 0.044 - 0.055).

Including all the other predictor variables

Now we can combine this analysis with the other predictor variables into our main model. We have several exogenous ordinal variables (“SS1” , “SS2” , “SS3”, “PS_1” , “PS_2”, “PS_3”, “health”). I am unaware of how to include these as ordinal variables within lavaan. Typically, people assume (in many types of statistical model) these are numeric and include them as such. However, this makes the assumption that they are equal interval, and thus have a linear relationship with the response variable. Alternatively, it’s also common to include polynomial terms to account for potential non-linearity. I’m against this, since we don’t really have an a priori justification for the choice in polynomial, so I think if we’re making assumptions it’s better to keep them simple (and linear).

## lavaan.mi object based on 10 imputed data sets. 
## See class?lavaan.mi help page for available methods. 
## 
## Convergence information:
## The model converged on 10 imputed data sets 
## 
## Rubin's (1987) rules were used to pool point and SE estimates across 10 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

The model is actually too big to show as a single figure. Instead, we’ll seperate it into the ‘goal/optimism part’, and the ‘other predictors’ part.

## lavaan.mi object based on 10 imputed data sets. 
## See class?lavaan.mi help page for available methods. 
## 
## Convergence information:
## The model converged on 10 imputed data sets 
## 
## Rubin's (1987) rules were used to pool point and SE estimates across 10 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

## lavaan.mi object based on 10 imputed data sets. 
## See class?lavaan.mi help page for available methods. 
## 
## Convergence information:
## The model converged on 10 imputed data sets 
## 
## Rubin's (1987) rules were used to pool point and SE estimates across 10 imputed data sets, and to calculate degrees of freedom for each parameter's t test and CI.

Table 3. The regression (~) part of the structural equation model. ERI = effort reward imbalance.
Response variable	Relationship	Explanatory variable	Estimate	p-value
Optimism and goal progress
SO	~	DO	0.175	`***`
GP	~	DO	0.190	`***`
PD	~	GP	-0.053	`**`
PD	~	DO	-0.303	`***`
PD	~	SO	-0.039	`.`
Assorted charecteristics
PD	~	ERI (new)	0.302	`***`
PD	~	Years in conservation	-0.115	`**`
PD	~	Age	-0.131	`**`
PD	~	Health	-0.215	`***`
PD	~	National/non-national	-0.021
PD	~	Work hours	0.019
PD	~	CV	-0.007
Gender (reference = female)
PD	~	Male	-0.177	`***`
PD	~	Other than male or female	0.261
PD	~	Prefer not to say	0.179
Position (reference = desk)
PD	~	Non-desk	-0.066
PD	~	Unknown	-0.073
Education (reference = college)
PD	~	University	-0.041
PD	~	Primary	0.496
PD	~	Secoundary	0.116
PD	~	Unknown	-0.120
Social support
PD	~	Social support: personal relationships	-0.114	`***`
PD	~	Social support: support from F&F	-0.081	`***`
PD	~	Social support: time with F&F	-0.062	`***`
Personal security
PD	~	Personal security: dangerous outside	-0.039	`*`
PD	~	Personal security: dangerous work	-0.006
PD	~	Personal security: not feel safe	0.140	`***`