L <- matrix(c(1,.5,.5,
0,1,.2,
0,.2,1),3,3,byrow =T)
qgraph(L,labels =c("f1","income","edu"), shape = c("round","square","square"),XKCD=T,edge.color="black",mar=c(5,5,5,5))Estimating Returns to Education with MCMSEM
MCMSEM
Multi Co-Moment Structural equation models (MCMSEMs) are structural equation models that consider covariance, coskewness and co-kurtosis between variables. This means there is sufficient information to estimate models based on observational data, where we do not know the direction of causation in the relations between two variables, while allowing for the influence of a normally distributed confounder that influences both variables. MCMSEM also comes with an R package which allows the user to specify these models.
We have two pre-prints out on MCMSEM one with he full details of the model (Tamimy et al. (2022)) and a second which illustrates MCMSEM alongside other observational causal inferences techniques applicable to psychology (Nivard, Tamimy, and Dolan (2023)). This document is a very brief illustration of an applied analysis with MCMSEM.
As we discuss in our pre-print MCMSEM isn’t the only higher moment based causal inference tool(for an early technique see: Shimizu et al. (2006)), nor is it the first application of structural equation models in the context of higher order moments (Boudt, Cornilly, and Verdonck (2017)). It is however a well developed generic framework to specify potentially causal SEM models up to the 4th (co)moments that comes with a flexible R package.
Returns to Education
Because showing is better then telling, this document is an applied (but severely abbreviated) economic analysis where we estimate the effects of schooling (in years of education) om income (log total income) based on the US census 2019 American Community Survey (ACS) as downloaded from IPUMS USA. Ruggles and Sobek (2021).
This particular analysis fits an illustration of MCMSEM well for two reasons: first because its not inconceivable the relation between schooling and income is confounded, but second we also have a strong prior idea on the expected direction of causation based on years of prior analysis and temporal ordering. Estimating the causal effects of education based on observational data is an old problem, and one that David Card (Card (1999)) and Joshua Angrist (Angrist and Krueger (1991)) worked on, and shared in the 2021 Nobel for. So we know what to expect: a causal effect fo schooling on income and not vice versa and some level of positive or negative confounding (both have been observed previously).
Returns to education in MCMSEM
I think MCMSEM is fairly well suited for this particular problem. In MCMSEM we estimate the following system of relations between a normally distributed confounded (labeled “f1” in MCMSEM output later), education and (log)income. Income and Education are allowed to be skewed, and to have excess kurtosis. Below a path diagram of the model:
This system of equations is very similar to the DAG of interest to those who study the returns to education.
Important caveats:
- I am not an economist and what follows is an illustration of MCMSEM not an empirical result. I take various shortcuts like omitting people with negative or zero incomes, biasing results toward an average treatment effect among those with an income in this particular year. I then also residual for covariates prior to the analysis to ease computations for MCMSEM and I neglect to weight the ACS based on survey weights.
- This document doesn’t go into the complete set of assumptions and limitations, that is what the pre-prints/papers are for! Read before use!
Analysis
Lets read the US census ACS data into R using the IPUMS package and uapply some basic first data filtering steps for convenience sake.
Recode education into “years of education”, omit ppl with less than 10 years of education, and make specific allowances for various post-graduate degrees.
Omit people with negative and zero incomes, which would later mess up our log transformation.
ddi <- read_ipums_ddi("usa_00005.xml") ### revise for family income
data <- read_ipums_micro(ddi)Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.
# set 0 and negative incomes to 0.1
data$nonzeroincome <- data$INCTOT * as.numeric(data$INCTOT > 0) + .1
data$eduyrs <- NA
data$eduyrs <- as.numeric(data$eduyrs)
# remove educational outliers (so lets do bedfore 10th grade?)
data[data$EDUC > 3,41] <- data[data$EDUC > 3,19] + 6
# code masters as 2 exta years (over B):
data[data$EDUCD == 114,41] <- data[data$EDUCD == 114,41] + 1
# code prof degree as 3 exta years (over B):
data[data$EDUCD == 115,41] <- data[data$EDUCD == 115,41] + 2
# code PhD degree as 3 exta years (over B):
data[data$EDUCD == 116,41] <- data[data$EDUCD == 116,41] + 2
### omit zero income people entirely
data$nonzeroincome2 <- NA
data$nonzeroincome2 <- as.numeric(data$nonzeroincome2 )
data[data$nonzeroincome > .2,42] <- data[data$nonzeroincome > .2,40]Okay lets really quickly inspect the data, as one should, years of education ranges form 10 (as we omitted those with less then 10 years) and most people have around ~12-16 years of schooling (either HS and/or some college)
table(data$eduyrs)
10 11 12 13 14 16 18 19
21753 24553 542782 207220 151653 339916 162854 65794
Lets look at personal income next, among those with a non-zero income and over 10 years of schooling the median income is 4.2^{4} while the mean income is 6.2118^{4}
hist(data$nonzeroincome2,xlim = c(0,300000),breaks=500)Okay lets control for sex, age, race and place of birth by residualizing both income and education. Then we run a simple linear regression, as a references, where we regress log(Income) on years of education, so we can express as % gain in income, per year of evacuation assuming there is no reverse causation, no confounding and no other violations of the OLS assumptions.
# wrangle data in final format:
## Residualize for birthplace, sex age race
data1.1 <- na.omit(data)
data1.1 <- data1.1[data1.1$AGE > 23 & data1.1$AGE < 66,]
data1.1$reseduyrs <- lm(eduyrs ~ SEX * AGE + as.factor(RACE) + as.factor(BPL), data=data1.1)$residuals
data1.1$resloginc <- lm(log(nonzeroincome2) ~ SEX * AGE + as.factor(RACE) + as.factor(BPL) , data=data1.1)$residuals
# Personal income analysis:
data2 <- cbind.data.frame((data1.1$reseduyrs),(data1.1$resloginc))
data2 <- na.omit(data2)
colnames(data2) <- c("EduYrs","Log_Income")
# OLS for comparison:
ols <- lm(Log_Income ~ EduYrs,data=data2)
summary(ols)
Call:
lm(formula = Log_Income ~ EduYrs, data = data2)
Residuals:
Min 1Q Median 3Q Max
-11.0754 -0.4233 0.1732 0.6345 4.0581
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.874e-16 9.411e-04 0.0 1
EduYrs 1.728e-01 4.112e-04 420.3 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.074 on 1301796 degrees of freedom
Multiple R-squared: 0.1195, Adjusted R-squared: 0.1195
F-statistic: 1.766e+05 on 1 and 1301796 DF, p-value: < 2.2e-16
So OLS suggests a 17.3% increase in income per additional year of schooling but as we know what could be confounded! Lets analyze this using MCMSEM.
We first use MCMmodel() to define a model, with 2 observed variables (it guesses this form the input) 1 latent variable (n_latent = 1), we indicate the observed variables have causal effects one each other (causal_observed = T) and we indicate the latent variable has no skewness nor excess kurtosis (skew_latent =T and kurt_latent = T). We then runt he model using the rprop optimizer for 1000 steps and then switch to lbfgs for the lst 25 steps. How do I know this is enough? trial and error where the configuration that gets you the lowest loss win’s. if you at any point get NaN loss, just lower the learning rate.
# MCMSEM Model
model <- MCMmodel(data2, n_latent=1,scale_data = F,
causal_observed = T ,constrained_a = T,skew_latent = F,kurt_latent = F)
results <- MCMfit(model ,data2,
optimizers=c("rprop", "lbfgs"), optim_iters=c(1000, 25),
learning_rate=c(0.0005,.35), monitor_grads = TRUE,debug =F)
summary_results <- summary(results)
summary_results|--------------------------------------|
| MCM Result Summary (MCMSEM v0.26.1) |
|--------------------------------------|
device : cpu
N phenotypes : 2
N latents : 1
N observations : 1301798
N parameters : 9
SE type : asymptotic
Fit statistics
loss : 0.00158572290092707
chisq : 2064.29090098105
BIC : 2191.00421347512
Parameters summary
label lhs edge rhs est se p
1 a1 f1 =~ EduYrs 0.25073361 0.0118801249 7.110041e-99
2 a1 f1 =~ Log_Income 0.25073361 0.0118801249 7.110041e-99
3 b1_2 EduYrs ~> Log_Income 0.15326214 0.0008930617 0.000000e+00
4 b2_1 Log_Income ~> EduYrs 0.01451442 0.0039134459 2.081945e-04
last_gradient
1 -0.00164816994220018
2 -0.00164816994220018
3 0.00131654739379883
4 -0.00247398018836975
Variances summary
label lhs edge rhs est se p last_gradient
1 s1 EduYrs ~~ EduYrs 5.151960 0.006402793 0 0.00074099563062191
2 s2 Log_Income ~~ Log_Income 1.091368 0.005250518 0 -0.000761702656745911
Skewness summary
label edge v1 v2 v3 est se p
1 sk1 ~~~ EduYrs EduYrs EduYrs 5.921098 0.03172945 0
2 sk2 ~~~ Log_Income Log_Income Log_Income -2.178876 0.01983109 0
last_gradient
1 -0.00234508304856718
2 -0.00489098159596324
Kurtosis summary
label edge v1 v2 v3 v4 est se p
1 k1 ~~~~ EduYrs EduYrs EduYrs EduYrs 57.52113 0.1696110 0
2 k2 ~~~~ Log_Income Log_Income Log_Income Log_Income 12.63412 0.1352236 0
last_gradient
1 -0.000389293709304184
2 0.00138089281972498
So MCMSEM produces quite a bit of output, like metrics on the data, optimizer, the final gradient, and various parameters. the parameters are split into categories first loadings and regressions, then variances then skewness and finally kurtosis (I call these Kurtosies in my head). while all are important we are substantively interested in the regressions and confounding, which is captured by the factor loadings going form the latent confounder to the two observed variables. we can grab those quickly:
parameters_df <- as.data.frame(summary_results, estimates='parameters')
parameters_df label lhs edge rhs est se p
1 a1 f1 =~ EduYrs 0.25073361 0.0118801249 7.110041e-99
2 a1 f1 =~ Log_Income 0.25073361 0.0118801249 7.110041e-99
3 b1_2 EduYrs ~> Log_Income 0.15326214 0.0008930617 0.000000e+00
4 b2_1 Log_Income ~> EduYrs 0.01451442 0.0039134459 2.081945e-04
last_gradient
1 -0.00164816994220018
2 -0.00164816994220018
3 0.00131654739379883
4 -0.00247398018836975
the columns “edge” contains info on the nature of the relationships, we are looking for “~>” when we are looking for causal paths. row 3 contains the path from years of education to log income, and implies an effect of 15.2% growth in income per year of education. rows 1 and 2 contain the paths form the latent confouder (f1) to the two variables. The latent variable is sclaed to unit variance here.
L <- matrix(c(1,.251,.251,
0,1,.153,
0,.015,1),3,3,byrow =T)
qgraph(L,labels =c("f1","edu","income"), shape = c("round","square","square"),XKCD=T, diag=F,edge.color="black",mar=c(5,5,5,5),edge.labels=T)Whats rather unsatisfying is that the effects of the confounder and education are on a different scales, so we can standardize the entire problem, to make the effects of education and the confounded comparable.
# MCMSEM Model
model <- MCMmodel(data2, n_latent=1,scale_data = T,
causal_observed = T ,constrained_a = T,skew_latent = F,kurt_latent = F)
results_std <- MCMfit(model ,data2,
optimizers=c("rprop", "lbfgs"), optim_iters=c(1000, 25),
learning_rate=c(0.0005,.35), monitor_grads = TRUE,debug =F)
summary_results_std <- summary(results_std)
parameters_df_std <- as.data.frame(summary_results_std, estimates='parameters')
parameters_df_std label lhs edge rhs est se p
1 a1 f1 =~ EduYrs 0.160730481 0.006550034 5.684954e-133
2 a1 f1 =~ Log_Income 0.160730481 0.006550034 5.684954e-133
3 b1_2 EduYrs ~> Log_Income 0.303993672 0.001742767 0.000000e+00
4 b2_1 Log_Income ~> EduYrs 0.007316236 0.001797704 4.706025e-05
last_gradient
1 -5.18839806318283e-06
2 -5.18839806318283e-06
3 -2.57287174463272e-05
4 1.12075358629227e-05
Based on the standardized results we can generate the path diagram again:
L <- matrix(c(1,.161,.161,
0,1,.007,
0,.304,1),3,3,byrow =T)
qgraph(L,labels =c("f1","income","edu"), shape = c("round","square","square"),XKCD=T, diag=T,edge.color="black",mar=c(5,5,5,5),edge.labels=T)As expected empirical analysis suggests education causes higher income and in this case the relation is confounded somewhat.
Conclusion
This brief tutorial based on empirical data showcases a potential application of MCMSEM. There are various further tweaks that might serve a empirical economists. If you feel comfortable pre-specifying that income doesn’t cause education, you can potentially free up degrees of freedom to consider kurtosis in the confounder. Here we considered a normally distributed confounder, but you could potentially consider a confounder with variance and skewness that are equal (a poisson distributed confounding variable for example) following the instructions found here.