Estimating Returns to Education with MCMSEM

Author

Michel G Nivard

MCMSEM

Multi Co-Moment Structural equation models (MCMSEMs) are structural equation models that consider covariance, coskewness and co-kurtosis between variables. This means there is sufficient information to estimate models based on observational data, where we do not know the direction of causation in the relations between two variables, while allowing for the influence of a normally distributed confounder that influences both variables. MCMSEM also comes with an R package which allows the user to specify these models.

We have two pre-prints out on MCMSEM one with he full details of the model (Tamimy et al. (2022)) and a second which illustrates MCMSEM alongside other observational causal inferences techniques applicable to psychology (Nivard, Tamimy, and Dolan (2023)). This document is a very brief illustration of an applied analysis with MCMSEM.

As we discuss in our pre-print MCMSEM isn’t the only higher moment based causal inference tool(for an early technique see: Shimizu et al. (2006)), nor is it the first application of structural equation models in the context of higher order moments (Boudt, Cornilly, and Verdonck (2017)). It is however a well developed generic framework to specify potentially causal SEM models up to the 4th (co)moments that comes with a flexible R package.

Returns to Education

Because showing is better then telling, this document is an applied (but severely abbreviated) economic analysis where we estimate the effects of schooling (in years of education) om income (log total income) based on the US census 2019 American Community Survey (ACS) as downloaded from IPUMS USA. Ruggles and Sobek (2021).

This particular analysis fits an illustration of MCMSEM well for two reasons: first because its not inconceivable the relation between schooling and income is confounded, but second we also have a strong prior idea on the expected direction of causation based on years of prior analysis and temporal ordering. Estimating the causal effects of education based on observational data is an old problem, and one that David Card (Card (1999)) and Joshua Angrist (Angrist and Krueger (1991)) worked on, and shared in the 2021 Nobel for. So we know what to expect: a causal effect fo schooling on income and not vice versa and some level of positive or negative confounding (both have been observed previously).

Returns to education in MCMSEM

I think MCMSEM is fairly well suited for this particular problem. In MCMSEM we estimate the following system of relations between a normally distributed confounded (labeled “f1” in MCMSEM output later), education and (log)income. Income and Education are allowed to be skewed, and to have excess kurtosis. Below a path diagram of the model:

L <- matrix(c(1,.5,.5,
             0,1,.2,
             0,.2,1),3,3,byrow =T)

qgraph(L,labels =c("f1","income","edu"), shape = c("round","square","square"),XKCD=T,edge.color="black",mar=c(5,5,5,5))

This system of equations is very similar to the DAG of interest to those who study the returns to education.

Important caveats:

I am not an economist and what follows is an illustration of MCMSEM not an empirical result. I take various shortcuts like omitting people with negative or zero incomes, biasing results toward an average treatment effect among those with an income in this particular year. I then also residual for covariates prior to the analysis to ease computations for MCMSEM and I neglect to weight the ACS based on survey weights.
This document doesn’t go into the complete set of assumptions and limitations, that is what the pre-prints/papers are for! Read before use!

Analysis

Lets read the US census ACS data into R using the IPUMS package and uapply some basic first data filtering steps for convenience sake.

Recode education into “years of education”, omit ppl with less than 10 years of education, and make specific allowances for various post-graduate degrees.
Omit people with negative and zero incomes, which would later mess up our log transformation.

ddi <- read_ipums_ddi("usa_00005.xml") ### revise for family income
data <- read_ipums_micro(ddi)

Use of data from IPUMS USA is subject to conditions including that users should cite the data appropriately. Use command `ipums_conditions()` for more details.

# set 0 and negative incomes to 0.1
data$nonzeroincome <- data$INCTOT * as.numeric(data$INCTOT > 0) + .1


data$eduyrs <- NA
data$eduyrs <- as.numeric(data$eduyrs) 
# remove educational outliers (so lets do bedfore 10th grade?)
data[data$EDUC > 3,41] <- data[data$EDUC > 3,19] + 6

# code masters as 2 exta years (over B):
data[data$EDUCD == 114,41] <- data[data$EDUCD == 114,41] + 1

# code prof degree as 3 exta years (over B):
data[data$EDUCD == 115,41] <- data[data$EDUCD == 115,41] + 2

# code PhD degree as 3 exta years (over B):
data[data$EDUCD == 116,41] <- data[data$EDUCD == 116,41] + 2

### omit zero income people entirely

data$nonzeroincome2 <- NA
data$nonzeroincome2  <- as.numeric(data$nonzeroincome2 ) 
data[data$nonzeroincome > .2,42] <- data[data$nonzeroincome > .2,40]

Okay lets really quickly inspect the data, as one should, years of education ranges form 10 (as we omitted those with less then 10 years) and most people have around ~12-16 years of schooling (either HS and/or some college)

table(data$eduyrs)


    10     11     12     13     14     16     18     19 
 21753  24553 542782 207220 151653 339916 162854  65794

Lets look at personal income next, among those with a non-zero income and over 10 years of schooling the median income is 4.2^{4} while the mean income is 6.2118^{4}

hist(data$nonzeroincome2,xlim = c(0,300000),breaks=500)

Okay lets control for sex, age, race and place of birth by residualizing both income and education. Then we run a simple linear regression, as a references, where we regress log(Income) on years of education, so we can express as % gain in income, per year of evacuation assuming there is no reverse causation, no confounding and no other violations of the OLS assumptions.

# wrangle data in final format:
 
## Residualize for birthplace, sex age race

data1.1 <- na.omit(data)
data1.1 <- data1.1[data1.1$AGE > 23 & data1.1$AGE < 66,]
data1.1$reseduyrs <- lm(eduyrs ~ SEX * AGE + as.factor(RACE) + as.factor(BPL), data=data1.1)$residuals
data1.1$resloginc <- lm(log(nonzeroincome2) ~ SEX * AGE + as.factor(RACE) + as.factor(BPL) , data=data1.1)$residuals


# Personal income analysis:
data2 <- cbind.data.frame((data1.1$reseduyrs),(data1.1$resloginc))
data2 <- na.omit(data2)
colnames(data2) <- c("EduYrs","Log_Income")
 

# OLS for comparison:
ols <- lm(Log_Income ~ EduYrs,data=data2)

summary(ols)


Call:
lm(formula = Log_Income ~ EduYrs, data = data2)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.0754  -0.4233   0.1732   0.6345   4.0581 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 6.874e-16  9.411e-04     0.0        1    
EduYrs      1.728e-01  4.112e-04   420.3   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.074 on 1301796 degrees of freedom
Multiple R-squared:  0.1195,    Adjusted R-squared:  0.1195 
F-statistic: 1.766e+05 on 1 and 1301796 DF,  p-value: < 2.2e-16

So OLS suggests a 17.3% increase in income per additional year of schooling but as we know what could be confounded! Lets analyze this using MCMSEM.

We first use MCMmodel() to define a model, with 2 observed variables (it guesses this form the input) 1 latent variable (n_latent = 1), we indicate the observed variables have causal effects one each other (causal_observed = T) and we indicate the latent variable has no skewness nor excess kurtosis (skew_latent =T and kurt_latent = T). We then runt he model using the rprop optimizer for 1000 steps and then switch to lbfgs for the lst 25 steps. How do I know this is enough? trial and error where the configuration that gets you the lowest loss win’s. if you at any point get NaN loss, just lower the learning rate.

# MCMSEM Model

model <- MCMmodel(data2, n_latent=1,scale_data = F,
                  causal_observed = T ,constrained_a = T,skew_latent = F,kurt_latent = F)



results <- MCMfit(model ,data2,
                  optimizers=c("rprop", "lbfgs"), optim_iters=c(1000, 25),
                  learning_rate=c(0.0005,.35), monitor_grads = TRUE,debug =F)

summary_results <- summary(results)
summary_results

|--------------------------------------|
| MCM Result Summary (MCMSEM v0.26.1)  |
|--------------------------------------|
device         : cpu
N phenotypes   : 2
N latents      : 1
N observations : 1301798
N parameters   : 9
SE type        : asymptotic

Fit statistics
loss  : 0.00158572290092707
chisq : 2064.29090098105
BIC   : 2191.00421347512

Parameters summary
  label        lhs edge        rhs        est           se            p
1    a1         f1   =~     EduYrs 0.25073361 0.0118801249 7.110041e-99
2    a1         f1   =~ Log_Income 0.25073361 0.0118801249 7.110041e-99
3  b1_2     EduYrs   ~> Log_Income 0.15326214 0.0008930617 0.000000e+00
4  b2_1 Log_Income   ~>     EduYrs 0.01451442 0.0039134459 2.081945e-04
         last_gradient
1 -0.00164816994220018
2 -0.00164816994220018
3  0.00131654739379883
4 -0.00247398018836975

Variances summary
  label        lhs edge        rhs      est          se p         last_gradient
1    s1     EduYrs   ~~     EduYrs 5.151960 0.006402793 0   0.00074099563062191
2    s2 Log_Income   ~~ Log_Income 1.091368 0.005250518 0 -0.000761702656745911

Skewness summary
  label edge         v1         v2         v3       est         se p
1   sk1  ~~~     EduYrs     EduYrs     EduYrs  5.921098 0.03172945 0
2   sk2  ~~~ Log_Income Log_Income Log_Income -2.178876 0.01983109 0
         last_gradient
1 -0.00234508304856718
2 -0.00489098159596324

Kurtosis summary
  label edge         v1         v2         v3         v4      est        se p
1    k1 ~~~~     EduYrs     EduYrs     EduYrs     EduYrs 57.52113 0.1696110 0
2    k2 ~~~~ Log_Income Log_Income Log_Income Log_Income 12.63412 0.1352236 0
          last_gradient
1 -0.000389293709304184
2   0.00138089281972498

So MCMSEM produces quite a bit of output, like metrics on the data, optimizer, the final gradient, and various parameters. the parameters are split into categories first loadings and regressions, then variances then skewness and finally kurtosis (I call these Kurtosies in my head). while all are important we are substantively interested in the regressions and confounding, which is captured by the factor loadings going form the latent confounder to the two observed variables. we can grab those quickly:

parameters_df <- as.data.frame(summary_results, estimates='parameters')
parameters_df

  label        lhs edge        rhs        est           se            p
1    a1         f1   =~     EduYrs 0.25073361 0.0118801249 7.110041e-99
2    a1         f1   =~ Log_Income 0.25073361 0.0118801249 7.110041e-99
3  b1_2     EduYrs   ~> Log_Income 0.15326214 0.0008930617 0.000000e+00
4  b2_1 Log_Income   ~>     EduYrs 0.01451442 0.0039134459 2.081945e-04
         last_gradient
1 -0.00164816994220018
2 -0.00164816994220018
3  0.00131654739379883
4 -0.00247398018836975

the columns “edge” contains info on the nature of the relationships, we are looking for “~>” when we are looking for causal paths. row 3 contains the path from years of education to log income, and implies an effect of 15.2% growth in income per year of education. rows 1 and 2 contain the paths form the latent confouder (f1) to the two variables. The latent variable is sclaed to unit variance here.

L <- matrix(c(1,.251,.251,
             0,1,.153,
             0,.015,1),3,3,byrow =T)

qgraph(L,labels =c("f1","edu","income"), shape = c("round","square","square"),XKCD=T, diag=F,edge.color="black",mar=c(5,5,5,5),edge.labels=T)

Whats rather unsatisfying is that the effects of the confounder and education are on a different scales, so we can standardize the entire problem, to make the effects of education and the confounded comparable.

# MCMSEM Model

model <- MCMmodel(data2, n_latent=1,scale_data = T,
                  causal_observed = T ,constrained_a = T,skew_latent = F,kurt_latent = F)



results_std <- MCMfit(model ,data2,
                  optimizers=c("rprop", "lbfgs"), optim_iters=c(1000, 25),
                  learning_rate=c(0.0005,.35), monitor_grads = TRUE,debug =F)

summary_results_std <- summary(results_std)
parameters_df_std <- as.data.frame(summary_results_std, estimates='parameters')
parameters_df_std

  label        lhs edge        rhs         est          se             p
1    a1         f1   =~     EduYrs 0.160730481 0.006550034 5.684954e-133
2    a1         f1   =~ Log_Income 0.160730481 0.006550034 5.684954e-133
3  b1_2     EduYrs   ~> Log_Income 0.303993672 0.001742767  0.000000e+00
4  b2_1 Log_Income   ~>     EduYrs 0.007316236 0.001797704  4.706025e-05
          last_gradient
1 -5.18839806318283e-06
2 -5.18839806318283e-06
3 -2.57287174463272e-05
4  1.12075358629227e-05

Based on the standardized results we can generate the path diagram again:

L <- matrix(c(1,.161,.161,
             0,1,.007,
             0,.304,1),3,3,byrow =T)

qgraph(L,labels =c("f1","income","edu"), shape = c("round","square","square"),XKCD=T, diag=T,edge.color="black",mar=c(5,5,5,5),edge.labels=T)

As expected empirical analysis suggests education causes higher income and in this case the relation is confounded somewhat.

Conclusion

This brief tutorial based on empirical data showcases a potential application of MCMSEM. There are various further tweaks that might serve a empirical economists. If you feel comfortable pre-specifying that income doesn’t cause education, you can potentially free up degrees of freedom to consider kurtosis in the confounder. Here we considered a normally distributed confounder, but you could potentially consider a confounder with variance and skewness that are equal (a poisson distributed confounding variable for example) following the instructions found here.

References

Angrist, Joshua D., and Alan B. Krueger. 1991. “Does Compulsory School Attendance Affect Schooling and Earnings?” The Quarterly Journal of Economics 106 (4): 979–1014. https://doi.org/10.2307/2937954.

Boudt, Kris, Dries Cornilly, and Tim Verdonck. 2017. “Nearest Comoment Estimation with Unobserved Factors.” SSRN Electronic Journal. https://doi.org/10.2139/ssrn.3087336.

Card, David. 1999. “The Causal Effect of Education on Earnings.” In, 1801–63. Elsevier. https://doi.org/10.1016/s1573-4463(99)03011-4.

Nivard, Michel Guillaume, Zenab Tamimy, and Conor V. Dolan. 2023. “Three Methods That Advance Tests of Complex Psychological Theory Using Structural Equation Models.” http://dx.doi.org/10.31234/osf.io/j9n76.

Ruggles, Steven, and Matthew Sobek. 2021. “IPUMS.” In, 2767–75. Springer International Publishing. https://doi.org/10.1007/978-3-030-22009-9_980.

Shimizu, Shohei, Aapo Hyvärinen, Patrik O. Hoyer, and Yutaka Kano. 2006. “Finding a Causal Ordering via Independent Component Analysis.” Computational Statistics & Data Analysis 50 (11): 3278–93. https://doi.org/10.1016/j.csda.2005.05.004.

Tamimy, Zenab, Elsje van Bergen, Matthijs D. van der Zee, Conor V. Dolan, and Michel Guillaume Nivard. 2022. “Multi Co-Moment Structural Equation Models: Discovering Direction of Causality in the Presence of Confounding.” http://dx.doi.org/10.31235/osf.io/ynam2.