In this last lab we are going to examine structural equation modeling and then return to a principle that we have already harked on in this class, multiple paths to the same answer.
In this model we have one exogenous variable (x1). Exogenous variables are variables that do not have any arrows pointing at them. This model also has two endogenous variables (y1 and y2), which are variables that are predicted by others in the model. Arrows (also called edges) are relationships and the vertices (also called nodes) are the variables themselves. In the above model we are saying that x1 is a predictor of y1 and y2. Additionally, we are saying that y1 is a predictor of y2. This is another key feature of SEMs. We can separate the direct effect of x1 on y1 and the indirect effect of x1 on y2 mediated by y1.
Each line in the diagram represents a statistical model. For instance the model predicting y1 looks like this y1 = x1 + error. Each response variable gets its own error (\(\zeta\)). Let’s get on to working with real data. For this lab we will simulate everything.
library(lavaan)
library(lavaanPlot)
library(piecewiseSEM)
library(tidyverse)
set.seed(2002)
Here we will simulate data of caterpillar richness measured over 60 years. First I simulate that temperature is increasing. Then I simulate that caterpillar richness is declining as a function of both temperature and year. Then I add a precip variable with no trends and no effect on caterpillars.
## years
years <- c(1:60)
#precip variable
precip <- rnorm(60, mean=100, sd=10)
## simulate the increasing temperature anomolies
temperature <- .05 + .05*years + rnorm(length(years),1,.5)
plot(temperature ~ years,pch=19); abline(lm(temperature ~ years),lty=2)
## simulate the caterpillar richness as a function of years and temps
catrich <- 350 -2*years -50*temperature + rnorm(length(years),50,50)
plot(catrich ~ years,pch=19); abline(lm(catrich ~ years),lty=2)
Now lets run an SEM. First we combine the simulated vectors to create one dataset. Then we need to write a model statement. Finally we run the model using sem() and plot it using lavaanPlot(). You can examine the model using summary, however I will not here because the output is overwhelming.
semdata <- cbind(catrich, years, temperature, precip)
model1 <- 'precip ~ years
catrich ~ years
catrich ~ temperature
temperature ~ years'
fit1 <- sem(model1, data = semdata)
## Warning in lav_data_full(data = data, group = group, cluster = cluster, : lavaan
## WARNING: some observed variances are (at least) a factor 1000 times larger than
## others; use varTable(fit) to investigate
#summary(fit1, rsq = T, fit.measures = TRUE)
lavaanPlot(name = "MODEL1", fit1, labels = semdata, coefs = TRUE)
Here we see the relationships between all of the variables. In SEMs, path coefficients are the regressions coefficients standardized by the standard deviation of the predictor variable in the causal pathway.
Before I move on to interpretation, is there something going on here that might impact interpretation of the model?
Let’s scale our variables.
semdata_scaled <- apply(semdata, MARGIN = 2, scale)
model2 <- 'precip ~ years
catrich ~ years
catrich ~ temperature
temperature ~ years'
fit2 <- sem(model2, data = semdata_scaled)
#summary(fit2, rsq = T, fit.measures = TRUE)
lavaanPlot(name = "MODEL2", fit2, labels = semdata_scaled, coefs = TRUE)
SEM is especially powerful when we use them to test different hypotheses and perform model comparison. This next model does not model a trend in caterpillars over time. We can then perform model comparison using AIC or some other metric.
semdata_scaled <- apply(semdata, MARGIN = 2, scale)
model3 <- 'precip ~ years
catrich ~ temperature
temperature ~ years'
fit3 <- sem(model3, data = semdata_scaled)
#summary(fit3, rsq = T, fit.measures = TRUE)
lavaanPlot(name = "MODEL3", fit3, labels = semdata_scaled, coefs = TRUE)
AIC(fit1, fit2, fit3)
## df AIC
## fit1 8 1176.4203
## fit2 8 353.7550
## fit3 7 357.7849
Our second model appears to be the best, so let’s take a closer look. We can see that years has a positive effects on temperature, which is to say that temperature is increasing over time. We can also see that temperature has a direct negative effect on caterpillar richness. There is a also slight negative trend richness over time after accounting for temperature change and a slight negative trend in precip. The precip result should be interpreted with caution, as we did not simulate a relationship between precip and year. In fact if you look at the model output, the is no reason to suspect this is not due to random noise (high p-value).
Now for the indirect effect. We only have on in this model, it is the impact of year on caterpillar richness that is mediated by temperature change. We can calculate the total effect by multiplying the path coefficients. So the total impact of year on caterpillars is (-0.32 + (0.86 * -0.6)), which is -0.836.
Another note in SEMs is the interpretation of the of the chi squared p value. Here the null hypothesis is that our data match the model that we provide, so a higher p-value indicates better model fit. However, if you dive into the regression coefficients, these p-values should be interpreted like any other from a linear model.
Your turn: Try a few other SEMs with the above data. How does the AIC vary between them. Do you think over using this approach might lead to bad inference? Try reversing the causality. What happens then?
In this class we have introduced many techniques and depending on your data and questions some will be more appropriate than others. That said, often some might call the wrong technique will still yield some reasonable interpretations that reflect some of the underlying patterns in your data. This is not to say picking the right technique is not important, it is very important, but this is to say that it can be good to try a bunch of techniques to see where they agree and where they differ. It is super important to remember that you probably will not report all of the things you tried and the results you do report should be based off whether or not the test was appropriate and not which one gives you the “best” results.
For this last section we will look at the same dataset a bunch of different ways. I have simulated relationships between a bunch of different variables. Elevation is a factor with two levels: high and low. At each of these two elevations 100 plots were placed. In each, the leaf area of the plant, the herbivory (an ordinal variable from 0 to 2) was determined, and the survival at the end of the season was observed.
dat <- read.csv("multiple_paths.csv")
Based on our knowledge of the system we expect these relationships between variables (this is how I simulated them).
These data were simulated with SEMs in mind so it seem reasonable to analyze them using SEMs. There are a couple of different ways we can consider treating our variables. The first interesting one we should examine is herbivory. That was measured in an ordinal fashion and has three categories. Since this is the case, treating it like any other continuous variable probably is not the best thing to do, however it can be easier to interpret. Let’s start there.
#scale herbivory
moddat1 <- dat
moddat1$herbivory <- scale(moddat1$herbivory)
moddat1$leaf_area <- scale(moddat1$leaf_area)
model <- 'leaf_area ~ elevation
herbivory ~ leaf_area + elevation
survival ~ herbivory + leaf_area + elevation'
fit <- sem(model, data = moddat1)
#summary(fit, rsq = T, fit.measures = TRUE)
lavaanPlot(name = "MODEL1", fit, coefs = TRUE)
Instead of treating herbivory like a continuous variable, we can treat it as a categorical variable. This is a more conservative approach because we get an estimate for each herbivory level. The down side is the model looks a bit messier and it will be slightly harder to fit. That said, we have more than enough data here. First I dummy code these categorical variables, then I scale leaf area, before finally running the model.
#dummy code herbivory
moddat2 <- dat
moddat2.1 <- moddat2 %>%
mutate(ones = 1) %>%
spread(key = herbivory, value = ones, fill = 0)
colnames(moddat2.1)[5:7] <- c("herb0", "herb1", "herb2")
#Another way to do this with base R
herb_vals <- as.character(unique(moddat2$herbivory))
moddat2.2 <- data.frame(matrix(ncol = (ncol(moddat2)-1) + length(herb_vals)))
colnames(moddat2.2) <- c(colnames(moddat2)[-4], paste0("herb", herb_vals))
for(i in 1:length(herb_vals)) {
temp_dat <- subset(moddat2, herbivory == herb_vals[i])
temp_dat <- subset(temp_dat, select= -herbivory)
new_mat <- matrix(data = 0, nrow = nrow(temp_dat), ncol = length(herb_vals))
colnames(new_mat) <- paste0("herb", herb_vals)
new_mat[,i] <- 1
moddat2.2 <- rbind(moddat2.2, cbind(temp_dat, new_mat))
}
moddat2.2 <- moddat2.2[-1,]
#scale leaf area
moddat2.1$leaf_area <- scale(moddat2$leaf_area)
model <- 'leaf_area ~ elevation
herb1 ~ elevation
herb2 ~ elevation
survival ~ herb1 + herb2 + elevation'
fit <- sem(model, data = moddat2.1)
#summary(fit, rsq = T, fit.measures = TRUE)
lavaanPlot(name = "MODEL1", fit, coefs = TRUE)
You will probably recall from the complex models lab that we do not usually model binomial data as a regular glm, which both of the previous models have done. Lavaan can only run gaussian models. We need to use the piecewiseSEM package for fancier models.
moddat3 <- dat
moddat3$leaf_area <- scale(moddat3$leaf_area)
glmsem <- psem(
lm(leaf_area ~ elevation, moddat3),
lm(herbivory ~ leaf_area + elevation, moddat3),
glm(survival ~ herbivory + leaf_area + elevation, "binomial", moddat3)
)
coefs(glmsem)
## Warning in B * (sd.x/sd.y): longer object length is not a multiple of shorter
## object length
## Warning: Categorical variables detected. Please refer to documentation for
## interpretation of Estimates!
## Response Predictor Estimate Std.Error DF Crit.Value P.Value
## 1 leaf_area elevation - - 1 19.5256 0.0000
## 2 leaf_area elevation = high -0.2989 0.0956 198 -3.1245 0.0020
## 3 leaf_area elevation = low 0.2989 0.0956 198 3.1245 0.0020
## 4 herbivory leaf_area 0.0806 0.0532 197 1.5166 0.1310
## 5 herbivory elevation - - 1 43.7866 0.0000
## 6 herbivory elevation = high 0.7141 0.0733 197 9.7431 0.0000
## 7 herbivory elevation = low 1.4159 0.0733 197 19.3185 0.0000
## 8 survival herbivory -1.3462 0.2766 196 -4.8670 0.0000
## 9 survival leaf_area -0.0766 0.1764 196 -0.4345 0.6639
## 10 survival elevation - - 1 26.9222 0.0000
## 11 survival elevation = high -2.0447 0.3288 Inf -6.2180 0.0000
## 12 survival elevation = low 0.1363 0.251 Inf 0.5431 0.5870
## Std.Estimate
## 1 - ***
## 2 - **
## 3 - **
## 4 0.0997
## 5 - ***
## 6 - ***
## 7 - ***
## 8 -0.51 ***
## 9 -0.0359
## 10 - ***
## 11 - ***
## 12 -
PiecewiseSEM does not have a nice plotting function, but if you look at the coefficient estimates, they match our overall story!
So far we have only looked at different types of SEMs. What if we perform regular linear models? Can we pick up any of the signal in the data?
moddat4 <- dat
mod4.1 <- glm(survival ~ elevation + herbivory + leaf_area + elevation*herbivory, data = moddat4)
summary(mod4.1)
##
## Call:
## glm(formula = survival ~ elevation + herbivory + leaf_area +
## elevation * herbivory, data = moddat4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7940 -0.2750 -0.1565 0.4634 1.0518
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.612963 0.681641 0.899 0.369631
## elevationlow 0.418324 0.115111 3.634 0.000357 ***
## herbivory -0.214975 0.059984 -3.584 0.000428 ***
## leaf_area -0.001088 0.003042 -0.358 0.721008
## elevationlow:herbivory -0.041847 0.084693 -0.494 0.621795
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.182239)
##
## Null deviance: 43.155 on 199 degrees of freedom
## Residual deviance: 35.537 on 195 degrees of freedom
## AIC: 234.02
##
## Number of Fisher Scoring iterations: 2
mod4.2 <- glm(survival ~ elevation + herbivory + leaf_area + elevation*herbivory, family = "binomial", data = moddat4)
summary(mod4.2)
##
## Call:
## glm(formula = survival ~ elevation + herbivory + leaf_area +
## elevation * herbivory, family = "binomial", data = moddat4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7652 -0.8026 -0.4411 1.0979 2.8727
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.04381 3.74622 0.279 0.780530
## elevationlow 1.70865 0.60216 2.838 0.004546 **
## herbivory -1.86889 0.56165 -3.327 0.000876 ***
## leaf_area -0.00656 0.01673 -0.392 0.694908
## elevationlow:herbivory 0.74054 0.64743 1.144 0.252700
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 249.22 on 199 degrees of freedom
## Residual deviance: 208.95 on 195 degrees of freedom
## AIC: 218.95
##
## Number of Fisher Scoring iterations: 5
mod4.3 <- glm(survival ~ elevation, family = "binomial", data = moddat4)
summary(mod4.3)
##
## Call:
## glm(formula = survival ~ elevation, family = "binomial", data = moddat4)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0273 -1.0273 -0.7049 1.3354 1.7402
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.2657 0.2414 -5.243 1.58e-07 ***
## elevationlow 0.9017 0.3156 2.857 0.00428 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 249.22 on 199 degrees of freedom
## Residual deviance: 240.75 on 198 degrees of freedom
## AIC: 244.75
##
## Number of Fisher Scoring iterations: 4
We lose some of the indirect effect interpretation with these, but we still retain the interpretation that herbivory has a negative effect on survival, high elevations have lower survival, and that leaf_area is not very predictive.
We can also model these data as a Bayesian hierarchical model. Since plots are nested in elevation we can estimate both elevation specific and across elevation effects.
moddat5 <- dat
moddat5 <- moddat5 %>%
mutate(ones = 1) %>%
spread(key = herbivory, value = ones, fill = 0)
colnames(moddat5)[5:7] <- c("herb0", "herb1", "herb2")
model_data <- list(surv = moddat5$survival,
herb0 = moddat5$herb0,
herb1 = moddat5$herb1,
herb2 = moddat5$herb2,
area = scale(moddat5$leaf_area)[,1],
elevs = as.numeric(factor(moddat5$elevation)),
N = length(moddat5$plot_num),
Nelev = length(unique(moddat5$elevation)))
#Model for PA trends across sites
sim_model <- "model {
for(i in 1:N) {
surv[i] ~ dbern(p[i])
logit(p[i]) <- beta0[elevs[i]] * herb0[i] +
beta1[elevs[i]] * herb1[i] +
beta2[elevs[i]] * herb2[i] +
beta3[elevs[i]] * area[i]
}
for (j in 1:Nelev) {
beta0[j] ~ dnorm(hbmu0, 0.001)
beta1[j] ~ dnorm(hbmu1, 0.001)
beta2[j] ~ dnorm(hbmu2, 0.001)
beta3[j] ~ dnorm(hbmu3, 0.001)
}
#Uninformative hyperpriors
hbmu0 ~ dnorm(0, 0.001)
hbmu1 ~ dnorm(0, 0.001)
hbmu2 ~ dnorm(0, 0.001)
hbmu3 ~ dnorm(0, 0.001)
}
"
writeLines(sim_model, con="sim_model.txt")
mod <- jagsUI::jags(data = model_data,
n.adapt = 2000,
n.burnin = 1000,
n.iter = 10000,
n.chains = 4,
modules = "glm",
model.file = "sim_model.txt",
parameters.to.save = c("beta0", "beta1", "beta2", "beta3", "hbmu0", "hbmu1", "hbmu2", "hbmu3"),
verbose = F)
plot_dat <- data.frame(label = c("High_Herb0", "High_Herb1", "High_Herb2", "High_Leaf",
"Low_Herb0", "Low_Herb1", "Low_Herb2", "Low_Leaf"),
med = c(mod$q50$beta0[1], mod$q50$beta1[1], mod$q50$beta2[1], mod$q50$beta3[1],
mod$q50$beta0[2], mod$q50$beta1[2], mod$q50$beta2[2], mod$q50$beta3[2]),
lc = c(mod$q2.5$beta0[1], mod$q2.5$beta1[1], mod$q2.5$beta2[1], mod$q2.5$beta3[1],
mod$q2.5$beta0[2], mod$q2.5$beta1[2], mod$q2.5$beta2[2], mod$q2.5$beta3[2]),
uc = c(mod$q97.5$beta0[1], mod$q97.5$beta1[1], mod$q97.5$beta2[1], mod$q97.5$beta3[1],
mod$q97.5$beta0[2], mod$q97.5$beta1[2], mod$q97.5$beta2[2], mod$q97.5$beta3[2]))
ggplot(data = plot_dat) +
geom_vline(aes(xintercept = 0), linetype = "dashed", color = "gray") +
geom_errorbarh(aes(xmin = lc, xmax = uc, y = label), height = 0) +
geom_point(aes(x = med, y = label), pch = 21, fill = "gray65", color = "black") +
labs(x = "", y="") +
theme_classic()
plot_dat1 <- data.frame(label = c("Herb0", "Herb1", "Herb2", "Leaf"),
med = c(mod$q50$hbmu0, mod$q50$hbmu1, mod$q50$hbmu2, mod$q50$hbmu3),
lc = c(mod$q2.5$hbmu0, mod$q2.5$hbmu1, mod$q2.5$hbmu2, mod$q2.5$hbmu3),
uc = c(mod$q97.5$hbmu0, mod$q97.5$hbmu1, mod$q97.5$hbmu2, mod$q97.5$hbmu3))
ggplot(data = plot_dat1) +
geom_vline(aes(xintercept = 0), linetype = "dashed", color = "gray") +
geom_errorbarh(aes(xmin = lc, xmax = uc, y = label), height = 0) +
geom_point(aes(x = med, y = label), pch = 21, fill = "gray65", color = "black") +
labs(x = "", y="") +
theme_classic()
Here we see the same pattern as we saw before. Increasingly negative effects of of high hebrivory. We also see no effect of leaf area and we see greater survival overall at low elevations. There are no cross elevation effects, as we see in the estimated higher levels. This was not part of the original simulation so this also checks out!
YOUR TURN: Go back to the SEM data (caterpillar richness over time) and try some other types of analyses. Even if they are weird. One idea is to break one of the continuous variables into groups (maybe something like early and late years). Is the story the same?