Data Description

Our data come from High School and Beyond (HS&B), a 1980 longitudinal study of a cohort of high school seniors. This dataset contains 1818 subjects and many variables for each high school senior, ranging from demographic variables to parental education levels to information about their high schools. Our treatment variable is twoyr, a binary variable indicating whether the student attended a two-year junior college. Our response variable is educ86, the number of years of education a student received (for example, 12 years means the student completed no education beyond high school). The covariates included in the dataset are listed below.

Below we isolate the distribution of the response variable alone, and the distribution of the response conditioned on the treatment without taking into account the other variables.

EDA

This initial EDA gives a quick view of the distribution of the response variable. It is quantitative but discrete, ranging only from 12 to 18. There is a strong mode at 16 years, which makes sense because it corresponds to students who completed a bachelor’s degree. A student in the 12 years category completed no education beyond high school; the student may or may not have enrolled in a two-year college, but did not complete that education. Interestingly, we do not see even a slightly higher frequency in the 14 years category, which would indicate completion of an associate’s degree at a junior college.

The second graph shows boxplots of years of education by whether or not the student attended a two-year college. The distribution is lower for those who attended a two-year college at each of the 25th, 50th, and 75th percentiles. We also see a larger spread (greater IQR) for the junior college group. This initial observation makes sense, as students who go to a two-year college will by design attend two fewer years of school than those who go to a four-year college for a bachelor’s degree. The difference in medians is indeed 2 years. However, we conduct further causal inference analyses to see whether there are more meaningful insights to be gained.

Table 1

We also include a Table 1 to show which covariates are present at a basic level. It is divided by treatment and control. For quantitative variables, the mean and standard deviation are shown, while for qualitative variables, the number of people who answered yes to the given criterion and the corresponding percentage are shown. One interesting observation is that for dadcoll (father’s college status), the percentage in the 4 year group is much greater than in the 2 year group. Differences like this give us an idea of which variables to consider when conducting the analysis, and of whether the difference between groups is indeed significant. Note that for the later analysis, we exclude the variables mommiss, dadmiss, and fincmiss, as these are indicator variables that simply denote whether the information was missing. We did not believe these were useful indicators, so we omitted them from our analysis.

Goal

The goal of our research is to determine whether attending a two-year junior college has a causal effect on the number of years of education a student receives, and to estimate the magnitude of that effect. As we saw in the EDA, there will naturally be a difference in years of education between a student who goes to a two-year college and one who goes to a traditional four-year college, simply because of how the values are structured. In theory, under the null hypothesis that the type of college has no further effect on how much education a student completes, that difference should be 2 years. However, prior research has described effects that change the magnitude of this difference. One is the democratization effect: the theory that junior college lowers the barrier to entry for four-year college, so students who otherwise would not have gone to college, or who would have dropped out had they gone straight to a four-year college, can use a two-year college as a stepping stone to further education. An opposing theory is the diversion effect: the belief that junior college draws students away from four-year college, with students who would have attended a traditional four-year college opting for fewer years of schooling at a junior college instead. We therefore use causal inference to ask whether two-year college attendance causes a change in the number of years of education a student completes.

One challenge with this dataset is that we lack information on whether students went on to a four-year college at all. For example, knowing that a student first attended a two-year college does not tell us whether they later attended a four-year college. Likewise, knowing that a student did not attend a two-year college does not mean that the student opted for a four-year college; some subjects attended neither after high school. It would therefore have been helpful to have another binary variable indicating whether a student attended and/or completed a four-year college. Without such information, we can only analyze the larger aggregated groups defined by the twoyr variable. Finally, as noted above, indicator variables show that some parental data were missing for certain subjects, which may affect accuracy.

Methods

Because the dataset is observational, we use matching methods. Matching improves covariate balance between the treatment and control groups, which makes the treatment effect estimate more robust to the choice of model. The best matched dataset is then used to estimate the treatment effect with estimators such as the mean difference or linear regression.

We first try 1:1 optimal matching using the Mahalanobis distance.

Matched pairs

In our study, 1:1 matching pairs subjects based on the Mahalanobis distance, which measures how far apart two points are across all covariates at once. The formula below is shown for reference in its group-level form, where \(N_1\) and \(N_0\) are the sample sizes of the treated and control groups respectively, \(N = N_1 + N_0\), \(\text{Cov}(\mathbf{X})\) is the covariance matrix of the covariates, and \(\bar{\mathbf{X}}_1\) and \(\bar{\mathbf{X}}_0\) are the covariate mean vectors for the treatment and control groups respectively.

\[\text{MD} = \sqrt{\frac{N_1N_0}{N}(\bar{\textbf{X}}_1 - \bar{\textbf{X}}_0)^T[\text{Cov}(\mathbf{X})]^{-1}(\bar{\textbf{X}}_1 - \bar{\textbf{X}}_0)}\]

For matching, the Mahalanobis distance is computed between each treated subject and each control subject, and each treated individual \(i\) is paired with a control individual so that the total distance across all pairs is as small as possible.
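The pairing step described above can be sketched on a toy example. This is a minimal illustration and not the code used for the analysis: the two covariates, their values, and the helper names (`covariance_2d`, `inverse_2x2`, `mahalanobis`, `optimal_pairs`) are all hypothetical, and the brute-force search over assignments is only practical for a handful of subjects.

```python
from itertools import permutations
from math import sqrt

def covariance_2d(rows):
    """Sample covariance matrix (2x2) of a list of (x1, x2) rows."""
    n = len(rows)
    m1 = sum(r[0] for r in rows) / n
    m2 = sum(r[1] for r in rows) / n
    s11 = sum((r[0] - m1) ** 2 for r in rows) / (n - 1)
    s22 = sum((r[1] - m2) ** 2 for r in rows) / (n - 1)
    s12 = sum((r[0] - m1) * (r[1] - m2) for r in rows) / (n - 1)
    return [[s11, s12], [s12, s22]]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def mahalanobis(a, b, inv_cov):
    """Mahalanobis distance between two subjects a and b."""
    d0, d1 = a[0] - b[0], a[1] - b[1]
    q = (d0 * (inv_cov[0][0] * d0 + inv_cov[0][1] * d1)
         + d1 * (inv_cov[1][0] * d0 + inv_cov[1][1] * d1))
    return sqrt(q)

def optimal_pairs(treated, controls, inv_cov):
    """Try every assignment of distinct controls to treated subjects
    and keep the one with the smallest total distance."""
    best_total, best_assignment = float("inf"), None
    for chosen in permutations(range(len(controls)), len(treated)):
        total = sum(mahalanobis(treated[i], controls[j], inv_cov)
                    for i, j in enumerate(chosen))
        if total < best_total:
            best_total, best_assignment = total, chosen
    return list(enumerate(best_assignment)), best_total

# Hypothetical toy data: two covariates per subject.
treated = [(0.0, 0.0), (10.0, 0.0)]
controls = [(1.0, 1.0), (9.0, 1.0), (5.0, 4.0), (5.0, -6.0)]

inv_cov = inverse_2x2(covariance_2d(treated + controls))
pairs, total = optimal_pairs(treated, controls, inv_cov)
print(pairs, round(total, 4))
```

Each entry of `pairs` is a (treated index, control index) tuple. Minimizing the total distance, rather than greedily taking the nearest control for each treated subject in turn, is what makes the matching optimal.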

We see that the standardized covariate mean differences are fairly balanced for the matched dataset for all the variables. The matched dataset is more balanced than the full dataset, meaning we might more reasonably conduct causal inference on the matched dataset. However we do notice that the matched dataset reduces the original dataset by more than 50% from 1818 observations to only 858. So next we try similar 1:2 optimal matching with Mahalanobis distance.
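As a rough illustration of the balance check discussed above, the sketch below computes a standardized mean difference for one covariate and compares it to the 0.1 rule of thumb. The denominator used here (the square root of the average of the two group variances) is an assumption, since several conventions exist, and the covariate values are made up.

```python
from math import sqrt
from statistics import mean, variance

def standardized_mean_difference(treated_values, control_values):
    """Difference in group means divided by a pooled standard deviation.
    Assumption: pooled SD = sqrt((var_treated + var_control) / 2)."""
    pooled_sd = sqrt((variance(treated_values) + variance(control_values)) / 2)
    return (mean(treated_values) - mean(control_values)) / pooled_sd

# Hypothetical covariate values for a few treated and control subjects.
covariates = {
    "bytest": ([50, 52, 54, 56], [49, 52, 55, 58]),
    "urban": ([1, 0, 1, 0], [1, 0, 0, 1]),
}

balance = {name: standardized_mean_difference(t, c)
           for name, (t, c) in covariates.items()}
for name, smd in balance.items():
    status = "balanced" if abs(smd) < 0.1 else "imbalanced"
    print(f"{name}: {smd:.3f} ({status})")
```

A love plot is essentially this quantity plotted for every covariate, before and after matching.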

1:1 Matching vs. 1:2 Matching

We again use the Mahalanobis distance for 1:2 matching; this time, each treated individual \(i\) is matched to two control individuals, chosen optimally to keep the distances small.

Overall, 1:1 matching performs better than 1:2 matching, although the covariate balances are similar except for a few variables. The bytest and perwhite variables do not fall within the 0.1 rule of thumb for standardized covariate mean differences. Next, we check whether coarsened exact matching might provide better balance.

Coarsened Exact Matching (CEM)

Coarsened Exact Matching (CEM) divides each continuous covariate into groups, automatically categorizing the covariates so that subjects can be matched exactly. First, the covariate space is partitioned into categories. Then, a category is kept only if it contains at least one treated subject and at least one control subject. The remaining blocks of treated and control subjects match exactly on all categorized covariates, so we can treat the result much like a blocked randomized experiment.
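The two steps just described, coarsening and then keeping only the strata that contain both groups, can be sketched as follows. This is an illustrative toy version rather than the procedure run for this report: the cutpoints, the subject values, and the function names `coarsen` and `cem` are hypothetical.

```python
from collections import defaultdict

def coarsen(value, cutpoints):
    """Return the index of the bin that value falls into."""
    return sum(value >= c for c in cutpoints)

def cem(subjects, cutpoints_by_covariate):
    """Group subjects by their coarsened covariates and keep only the
    strata that contain at least one treated and one control subject."""
    strata = defaultdict(list)
    for s in subjects:
        key = tuple(coarsen(s[name], cuts)
                    for name, cuts in cutpoints_by_covariate.items())
        strata[key].append(s)
    return {key: members for key, members in strata.items()
            if {m["twoyr"] for m in members} == {0, 1}}

# Hypothetical subjects and cutpoints.
subjects = [
    {"twoyr": 1, "bytest": 50, "fincome": 25000},
    {"twoyr": 0, "bytest": 52, "fincome": 30000},
    {"twoyr": 0, "bytest": 60, "fincome": 50000},
    {"twoyr": 1, "bytest": 40, "fincome": 10000},
    {"twoyr": 0, "bytest": 41, "fincome": 15000},
    {"twoyr": 1, "bytest": 58, "fincome": 10000},
]
cutpoints = {"bytest": [45, 55], "fincome": [20000, 40000]}

kept = cem(subjects, cutpoints)
retained = sum(len(members) for members in kept.values())
print(f"{retained} of {len(subjects)} subjects retained in {len(kept)} strata")
```

Adding more covariates or finer cutpoints multiplies the number of strata, which is why the retained sample can shrink quickly.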

The standardized covariate mean differences are all 0 or very close to 0 in Table 3. This is a good sign for approximating a randomized experiment, but the downside is that only 274 of the original 1818 subjects were retained. Separately, when we conducted CEM on only the categorical variables, we retained 1590 subjects, but that approach does not address the balance of the quantitative variables, so it ignores part of the covariate information. CEM on all covariates therefore leads to too small a dataset. Note also that CEM does not ensure that balance on the coarsened covariates implies balance on the raw covariates, and it ignores the continuous nature of covariates.

Cardinality Matching

Cardinality matching finds the largest matched dataset that satisfies given covariate balance constraints, such as all standardized covariate mean differences being below 0.1.
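One common way to write this idea as an optimization problem is sketched below. The notation is an assumption introduced only for illustration: \(a_i \in \{0,1\}\) indicates whether subject \(i\) is kept, \(Z_i\) is the treatment indicator (twoyr), \(X_{ij}\) is the value of covariate \(j\) for subject \(i\), \(s_j\) is a standard deviation of covariate \(j\), and \(\delta\) is the tolerance (0.1 here). The sketch also assumes equal numbers of treated and control subjects are selected.

\[\max_{a \in \{0,1\}^N} \sum_{i=1}^{N} a_i Z_i \quad \text{subject to} \quad \sum_{i=1}^{N} a_i Z_i = \sum_{i=1}^{N} a_i (1 - Z_i), \quad \left| \frac{\sum_{i} a_i Z_i X_{ij}}{\sum_{i} a_i Z_i} - \frac{\sum_{i} a_i (1 - Z_i) X_{ij}}{\sum_{i} a_i (1 - Z_i)} \right| \le \delta\, s_j \ \text{ for every covariate } j.\]

Unlike distance-based matching, this formulation constrains group-level balance directly and does not by itself pair individual subjects.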

Cardinality matching with tolerances set to 0.1 gives a sample size similar to 1:1 matching, with 854 subjects included. In the love plot, overall covariate balance is well maintained for both the 1:1 matched dataset and the cardinality matched dataset. Although two covariates fall outside the 0.1 rule of thumb in the 1:1 matched dataset, we believe it is still preferable to the cardinality matched dataset, because most covariates in the cardinality matched dataset lie further from the 0 line. The two datasets are also nearly the same size (the 1:1 matched dataset has 858 observations), so cardinality matching does not retain more data. Because we do not see a clear benefit from this method, we keep 1:1 matching as our best option for covariate balance.

We find that 1:1 matching gives the best covariate balance while also retaining a reasonable portion of the original dataset. We therefore proceed with this matching technique and dataset to draw causal conclusions.

Results & Interpretations

Because we estimate the treatment effect on the matched dataset, which has improved covariate balance, our main estimand is the MATE (the average treatment effect on the matched dataset). We use three methods to estimate it: the mean-difference estimator, the IPW estimator, and the linear regression estimator. We show the results and discuss the pros and cons of each method.

Mean difference estimate

The mean-difference estimator is the most commonly used estimator of the average treatment effect (ATE); it is calculated as \(\hat{\tau}=\bar{Y}_1 - \bar{Y}_0\). For reference, the mean difference calculated on the full dataset is -1.3592.

The estimated MATE is -1.2802, with a 95% confidence interval of (-1.4531, -1.0364). Because this interval does not include 0, we conclude that there is a significant difference in total years of education between people who attended a two-year college and those who did not. By this estimate, attending a two-year college decreases the total number of years of education by 1.2802 years on average.
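The mean-difference calculation above can be sketched on toy data. The interval here uses a normal approximation with unpooled group variances, which is an assumption and may not be how the interval reported above was obtained; the outcome values and the function name `mean_difference_with_ci` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def mean_difference_with_ci(y_treated, y_control, z=1.96):
    """Mean-difference estimate with a normal-approximation interval.
    Assumption: standard error uses unpooled sample variances."""
    tau_hat = mean(y_treated) - mean(y_control)
    se = sqrt(variance(y_treated) / len(y_treated)
              + variance(y_control) / len(y_control))
    return tau_hat, (tau_hat - z * se, tau_hat + z * se)

# Hypothetical years of education for matched treated and control subjects.
y_treated = [12, 13, 14, 14, 16]
y_control = [14, 16, 16, 16, 18]

tau_hat, ci = mean_difference_with_ci(y_treated, y_control)
print(f"estimate = {tau_hat:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

An interval that excludes 0, as in this toy case, corresponds to rejecting the hypothesis of no average difference at the 5% level.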

IPW estimate

Inverse Propensity score Weighting (IPW) weights each individual by the inverse of the estimated probability of receiving the treatment they actually received, which accounts for imbalance between the groups. Here, the propensity score is the probability that a subject receives the treatment given the observed covariate values. IPW then incorporates these weights into a linear regression to estimate the treatment effect.

## 
## t test of coefficients:
## 
##              Estimate Std. Error t value  Pr(>|t|)    
## (Intercept) 15.262786   0.041017  372.11 < 2.2e-16 ***
## twoyr       -1.221961   0.107284  -11.39 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the output, the estimate for twoyr is -1.2220 and is statistically significant. By this estimate, attending a two-year college decreases the total number of years of education by 1.222 years on average.
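The weighting idea in this subsection can be sketched as follows. This is a minimal illustration, assuming the propensity scores have already been estimated: it computes the normalized IPW estimate (a difference of weighted group means), which is the same number a weighted regression of the outcome on the treatment indicator alone would give as its slope. The outcomes, propensity scores, and the function name `ipw_estimate` are hypothetical.

```python
def ipw_estimate(outcomes, treatments, propensity_scores):
    """Normalized IPW estimate: weighted mean of treated outcomes minus
    weighted mean of control outcomes. Treated subjects get weight 1/e,
    control subjects get weight 1/(1 - e)."""
    w_treated = w_control = 0.0
    y_treated = y_control = 0.0
    for y, t, e in zip(outcomes, treatments, propensity_scores):
        if t == 1:
            w = 1.0 / e
            w_treated += w
            y_treated += w * y
        else:
            w = 1.0 / (1.0 - e)
            w_control += w
            y_control += w * y
    return y_treated / w_treated - y_control / w_control

# Hypothetical outcomes, treatment indicators, and propensity scores.
outcomes = [14, 12, 16, 16]
treatments = [1, 1, 0, 0]
propensity_scores = [0.5, 0.25, 0.5, 0.2]

estimate = ipw_estimate(outcomes, treatments, propensity_scores)
print(f"IPW estimate = {estimate:.4f}")
```

A treated subject with a propensity score of 0.01 would receive a weight of 100, which is how a few extreme scores can dominate the estimate.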

Linear Regression on Matched Data estimate

Linear regression is another commonly used method. It models the conditional mean \(\mathbb{E}[Y_i|\mathbf{X}_i]\), and the coefficient on the treatment variable serves as the estimate of the treatment effect. The regression equation is \[\text{educ86}=\beta_0+\beta_1\text{twoyr}+\beta_2\text{female}+...+\beta_{15}\text{urban}+\epsilon.\] This yields a coefficient for the treatment variable, twoyr, which targets the MATE. Because a linear regression that includes only twoyr as a covariate is equivalent to using \(\hat{\tau}\) under complete randomization, we ran the linear regression with all of the covariates included. However, because we include all the other covariates, we proceed under the assumptions of consistency and randomization.

## 
## Call:
## lm(formula = educ86 ~ female + black + hispanic + bytest + dadvoc + 
##     dadsome + dadcoll + momvoc + momsome + momcoll + fincome + 
##     ownhome + perwhite + urban + twoyr, data = matchedData)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.914 -1.290  0.163  1.133  3.810 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.152e+01  1.117e+00  10.319  < 2e-16 ***
## female      -7.950e-02  1.093e-01  -0.727  0.46734    
## black       -1.648e-01  2.544e-01  -0.648  0.51724    
## hispanic     1.907e-01  2.055e-01   0.928  0.35374    
## bytest       6.248e-02  1.799e-02   3.473  0.00054 ***
## dadvoc      -1.669e-01  1.950e-01  -0.856  0.39233    
## dadsome      1.132e-01  1.642e-01   0.689  0.49086    
## dadcoll      2.930e-01  1.629e-01   1.798  0.07248 .  
## momvoc       1.436e-01  1.874e-01   0.766  0.44378    
## momsome      9.714e-02  1.615e-01   0.602  0.54755    
## momcoll      2.054e-01  1.774e-01   1.158  0.24739    
## fincome      6.009e-06  3.688e-06   1.629  0.10361    
## ownhome      2.224e-01  1.463e-01   1.520  0.12881    
## perwhite    -5.984e-03  2.803e-03  -2.135  0.03305 *  
## urban       -2.087e-02  1.559e-01  -0.134  0.89353    
## twoyr       -1.247e+00  1.080e-01 -11.543  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.574 on 842 degrees of freedom
## Multiple R-squared:  0.1764, Adjusted R-squared:  0.1617 
## F-statistic: 12.02 on 15 and 842 DF,  p-value: < 2.2e-16

Based on the summary output, the coefficient of twoyr is -1.247 and is statistically significant, which suggests that attending a two-year college, versus not attending one, reduces the total number of years of education by 1.247 years on average. The caveats of using linear regression are discussed below.
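The equivalence noted in this subsection, between a regression on the treatment indicator alone and the mean-difference estimator, can be checked on a toy example. The data and the helper name `simple_ols` are hypothetical; the sketch fits a one-predictor least squares line by hand and compares its slope to the difference in group means.

```python
from statistics import mean

def simple_ols(x, y):
    """Least squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical treatment indicators and years of education.
twoyr = [1, 1, 1, 0, 0, 0]
educ86 = [12, 14, 13, 16, 16, 13]

intercept, slope = simple_ols(twoyr, educ86)
mean_diff = (mean(y for t, y in zip(twoyr, educ86) if t == 1)
             - mean(y for t, y in zip(twoyr, educ86) if t == 0))
print(f"slope = {slope:.3f}, mean difference = {mean_diff:.3f}")
```

The intercept is the control group mean and the slope is the mean difference, so adding the other covariates is what distinguishes the regression estimate from \(\hat{\tau}\).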

Evaluation

Table 4 shows the estimates \(\hat{\tau}\) of the MATE from the different methods. We believe the estimate from the linear regression model is the most credible. We therefore conclude that, on average, students who attended a two-year college completed 1.247 fewer years of education than those who did not.

First, the general advantage of IPW is that it allows us to conduct causal inference on the whole sample. However, IPW is unbiased only if the propensity scores are well estimated. In addition, extreme weights can make the variance of the IPW estimator explode, since the inverse of a very small propensity score is very large. The linear regression estimator on the matched data does not depend on estimated propensity scores, since we matched on the Mahalanobis distance, so it is more robust in this respect.

Second, the mean-difference estimator does measure the average treatment effect, and we suspect that applying it to the matched dataset does a sufficient job of estimating the true ATE. However, because matching has already addressed covariate imbalance, and we therefore assume unconfoundedness given the covariates, we place more weight on the linear regression, as it lets us estimate the treatment effect while adjusting for the other covariates.

Overall, the advantage of using a linear regression model to estimate the relationship between the treatment and the response variable is that the relationship is clearly visible and easily interpreted. Since we have addressed covariate balance by working on the matched dataset, we proceed under the unconfoundedness assumption and conclude that the linear regression estimator is the best of the three methods. One disadvantage of linear regression in this study is that there may be interaction terms affecting the response variable that the model omits. In addition, we assume a super-population, and we should remain cautious about extrapolation and interpolation.

Conclusion

We found what appears to be a clear causal relationship between attending a two-year college and a decrease in the number of years of schooling for the average subject. In particular, a two-year college attendee completes about 1.2 fewer years of schooling than someone who does not attend a junior college. Multiple causal inference techniques gave similar estimates, which suggests the value is robust. The fact that the difference is negative rather than 0 indicates that the diversion effect is present, in that people go to junior college instead of obtaining a four-year degree. At the same time, this value has implications for the democratization effect: the average junior college attendee completes not 2 fewer years of schooling but only about 1.2 fewer. So while junior college students may still attend less school overall, the gap is smaller than we might initially have thought. This could be significant, as it suggests that two-year college can indeed help people further their education, giving opportunity to people who may not otherwise have had it. This offers valuable information about the effectiveness and benefits of junior college.

Future Work

First, we assumed unconfoundedness when applying the different estimation methods to the matched dataset, since matching alleviated the covariate imbalance. However, there may be better ways to achieve covariate balance than the matching methods we explored, and therefore better ways to support unconfoundedness in this study. Furthermore, sensitivity analysis could be done to check how the response variable changes with respect to changes in other covariates. Interactions among covariates that have a significant impact on the response variable could also be identified using backward stepwise regression.

Also, note that a value of 0 for the indicator variable twoyr does not necessarily mean that the individual attended a four-year college. The student may not have attended college at all, or may indeed have attended a four-year college. Because this is unclear, it is harder to say whether the result holds specifically for two-year versus four-year college.

In addition, a binary variable indicating whether the student finished college would help produce more fruitful insights. Because a causal relationship between attending a two-year college and the total number of years of education might be somewhat intuitive, it would be more interesting to see whether attending a two-year college affects the college completion rate, and how that relates to the diversion effect described in this paper.