Introduction

This assignment will take you through a Propensity Score Matching exercise, with a Machine Learning twist at the end. The example data contains test results of school children along with various characteristics. The key question is: Do catholic schools have an effect on student’s performance? The key issus is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely).

Download the data

  • From here: http://www.mfilipski.com/files/ecls.csv (Data is originally form here)
  • For the purpose of questions 1 - 6, make a smaller dataset which only contains the variables: c5r2mtsc_std: Standardized math scores catholic: Whether the school is catholic w3income_1k: Family income in thousands p5numpla: Number of places the student has lived for at least 4 months w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)? [hint: use select()]
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
##   c5r2mtsc_std     catholic  w3income_1k        p5numpla      w3momed_hsb    
##  Min.   :-3.1036   0:4499   Min.   :  5.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:-0.4416   1: 930   1st Qu.: 37.50   1st Qu.:1.000   1st Qu.:0.0000  
##  Median : 0.2256            Median : 62.50   Median :1.000   Median :0.0000  
##  Mean   : 0.1728            Mean   : 68.95   Mean   :1.101   Mean   :0.3603  
##  3rd Qu.: 0.8024            3rd Qu.: 87.50   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   : 3.4337            Max.   :200.00   Max.   :5.000   Max.   :1.0000

Identify the problem

Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average? * Can you conclude that going to a catholic school leads to higher math scores? Why or why not? * Run regressions: 1. A regression of just math scores on the catholic dummy 2. Another regression that includes all the variables listed above 3. Comment on the results * Do the means of the covariates (income, mother’s education etc) differ between catholic vs. non-catholic schools? You can: 1. Compare the means [Hint: summarise_all()] 2. Test the difference in means for statistical significance [Hint: t.test] 3. Discuss what would be the problems with an identification strategy based on simple regressions.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -2.5922 -0.3232  0.2510  0.2197  0.7729  3.0552
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -3.1036 -0.4577  0.2207  0.1631  0.8084  3.4337

## 
##  Welch Two Sample t-test
## 
## data:  c5r2mtsc_std by catholic
## t = -1.7866, df = 1468.1, p-value = 0.07421
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.118653665  0.005539377
## sample estimates:
## mean in group 0 mean in group 1 
##       0.1631279       0.2196851

There is significant difference in the t-test difference means of the scores between the catholic and non-catholic schools distribution in the Math scores. Also if we look closely at the box plot we see a lot of variation between the two as well present.

## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2667 -0.6082  0.0538  0.6292  3.2705 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.16313    0.01424  11.459   <2e-16 ***
## catholic1    0.05656    0.03440   1.644      0.1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9549 on 5427 degrees of freedom
## Multiple R-squared:  0.000498,   Adjusted R-squared:  0.0003138 
## F-statistic: 2.704 on 1 and 5427 DF,  p-value: 0.1002
## 
## Call:
## lm(formula = c5r2mtsc_std ~ ., data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3922 -0.5696  0.0372  0.5986  3.2572 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0733290  0.0491661   1.491 0.135900    
## catholic1   -0.1273899  0.0328888  -3.873 0.000109 ***
## w3income_1k  0.0053247  0.0003026  17.595  < 2e-16 ***
## p5numpla    -0.1009432  0.0355779  -2.837 0.004567 ** 
## w3momed_hsb -0.3740355  0.0271962 -13.753  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8941 on 5424 degrees of freedom
## Multiple R-squared:  0.1242, Adjusted R-squared:  0.1235 
## F-statistic: 192.3 on 4 and 5424 DF,  p-value: < 2.2e-16

If we are to compare the two regression we see that catholic school admits have a reduced standardized test scores and the effect although in the ols first equation is positive but is insignificant. For the second regression with all the variables we can see that mother’s age and better standard of life proxied with greater family income increases the standardized test scores and vice versa.

## # A tibble: 2 x 5
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb
##   <fct>           <dbl>       <dbl>    <dbl>       <dbl>
## 1 0               0.163        65.4     1.11       0.392
## 2 1               0.220        86.2     1.07       0.205
## 
##  Welch Two Sample t-test
## 
## data:  w3income_1k by catholic
## t = -13.238, df = 1314.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -23.86719 -17.70620
## sample estimates:
## mean in group 0 mean in group 1 
##        65.39393        86.18063
## 
##  Welch Two Sample t-test
## 
## data:  p5numpla by catholic
## t = 3.128, df = 1600.4, p-value = 0.001792
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.01235472 0.05390038
## sample estimates:
## mean in group 0 mean in group 1 
##        1.106246        1.073118
## 
##  Welch Two Sample t-test
## 
## data:  w3momed_hsb by catholic
## t = 12.362, df = 1545.1, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1572715 0.2165946
## sample estimates:
## mean in group 0 mean in group 1 
##       0.3923094       0.2053763
  • T-test for covariates : We see that the p-value of the t-test is less than 0.001. Thus the difference in means is statistically different for all the variables.

There is significant difference in the t-test difference means of the scores between the catholic and non-catholic schools distributios. There could be possible parallel trends assumption in DID design getting violated. There could exist selction on observables and unobservables which means existence of endogeneity and ommitted variable bias present in the dataset. There is potential sample selection problems with too less variables being the correct representative of the datasset and there are other factors contributing to attending catholic schools and not attending.

Estimate the propensity to go to catholic school, by hand

  • Run a logit regression to predict catholic based on the covariates we kept (but not the math scores, of course).
  • Make a prediction for who is likely to go to catholic school, using this model
  • Check that there is common support.
  1. (Meaning check both groups have predicted values in overlapping ranges)
  2. You may check this visually. For instance: you can do this with two histograms side-by-side. Or plot densities.
  3. Plot the income variable against the logit prediction (by catholic), and add the lowess smooth densities to the plot. They should look similar, but not perfectly aligned.
## 
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb, 
##     family = "binomial", data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0172  -0.6461  -0.5240  -0.4122   2.3672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.6702950  0.1576164 -10.597  < 2e-16 ***
## w3income_1k  0.0077921  0.0008115   9.602  < 2e-16 ***
## p5numpla    -0.2774346  0.1232930  -2.250   0.0244 *  
## w3momed_hsb -0.6504757  0.0915710  -7.104 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4972.4  on 5428  degrees of freedom
## Residual deviance: 4750.6  on 5425  degrees of freedom
## AIC: 4758.6
## 
## Number of Fisher Scoring iterations: 4
##   c5r2mtsc_std     catholic  w3income_1k        p5numpla      w3momed_hsb    
##  Min.   :-3.1036   0:4499   Min.   :  5.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:-0.4416   1: 930   1st Qu.: 37.50   1st Qu.:1.000   1st Qu.:0.0000  
##  Median : 0.2256            Median : 62.50   Median :1.000   Median :0.0000  
##  Mean   : 0.1728            Mean   : 68.95   Mean   :1.101   Mean   :0.3603  
##  3rd Qu.: 0.8024            3rd Qu.: 87.50   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   : 3.4337            Max.   :200.00   Max.   :5.000   Max.   :1.0000  
##   catholicpred    
##  Min.   :0.04333  
##  1st Qu.:0.10801  
##  Median :0.16839  
##  Mean   :0.17130  
##  3rd Qu.:0.21996  
##  Max.   :0.40389

Run a Matching algorithm to get a balanced dataset

  • Create a matched dataset:
  1. Run a MatchIt command with the same equation as your logit. (which by default runs a logit just like the one you did, then matches each observation of the treated group with another from the control group.)
  2. Follow that by match.data(your_matchit_output) to create your matched dataset
  3. Look at the dimensions of what you just created (say, dta_psm). What can you conclude about the number of observations that matched?
  • Sort the data by distance, and show the 10 first observations. – Notice how observations were matched on distance?
  • In this new dataset, do the means of the covariates (income, etc) differ between catholic vs. non-catholic schools? – Compare the means [Hint: summarise_all] – Test the difference in means for statistical significance [Hint: t.test]
  • Reflect on what PSM did for your identification strategy.
## A matchit object
##  - method: 1:1 nearest neighbor matching without replacement
##  - distance: Propensity score
##              - estimated with logistic regression
##  - number of obs.: 5429 (original), 1860 (matched)
##  - target estimand: ATT
##  - covariates: w3income_1k, p5numpla, w3momed_hsb
##   c5r2mtsc_std     catholic  w3income_1k        p5numpla      w3momed_hsb    
##  Min.   :-2.9539   0:930    Min.   :  7.50   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:-0.2794   1:930    1st Qu.: 62.50   1st Qu.:1.000   1st Qu.:0.0000  
##  Median : 0.3035            Median : 87.50   Median :1.000   Median :0.0000  
##  Mean   : 0.2771            Mean   : 86.18   Mean   :1.073   Mean   :0.2054  
##  3rd Qu.: 0.8766            3rd Qu.: 87.50   3rd Qu.:1.000   3rd Qu.:0.0000  
##  Max.   : 3.1066            Max.   :200.00   Max.   :3.000   Max.   :1.0000  
##                                                                              
##     distance         weights     subclass   
##  Min.   :0.0607   Min.   :1   1      :   2  
##  1st Qu.:0.1604   1st Qu.:1   2      :   2  
##  Median :0.1884   Median :1   3      :   2  
##  Mean   :0.2034   Mean   :1   4      :   2  
##  3rd Qu.:0.2200   3rd Qu.:1   5      :   2  
##  Max.   :0.4039   Max.   :1   6      :   2  
##                               (Other):1848
##   c5r2mtsc_std     catholic  w3income_1k        p5numpla      w3momed_hsb    
##  Min.   :-2.9539   0:930    Min.   :  7.50   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:-0.2794   1:930    1st Qu.: 62.50   1st Qu.:1.000   1st Qu.:0.0000  
##  Median : 0.3035            Median : 87.50   Median :1.000   Median :0.0000  
##  Mean   : 0.2771            Mean   : 86.18   Mean   :1.073   Mean   :0.2054  
##  3rd Qu.: 0.8766            3rd Qu.: 87.50   3rd Qu.:1.000   3rd Qu.:0.0000  
##  Max.   : 3.1066            Max.   :200.00   Max.   :3.000   Max.   :1.0000  
##                                                                              
##     distance         weights     subclass   
##  Min.   :0.0607   Min.   :1   1      :   2  
##  1st Qu.:0.1604   1st Qu.:1   2      :   2  
##  Median :0.1884   Median :1   3      :   2  
##  Mean   :0.2034   Mean   :1   4      :   2  
##  3rd Qu.:0.2200   3rd Qu.:1   5      :   2  
##  Max.   :0.4039   Max.   :1   6      :   2  
##                               (Other):1848
##      c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb   distance weights
## 176    -0.3835843        0     17.5005        2           1 0.06069529       1
## 4274   -2.5922340        1     17.5005        2           1 0.06069529       1
## 41     -2.3690526        0     22.5005        2           1 0.06295488       1
## 2083    0.5271555        1     22.5005        2           1 0.06295488       1
## 561    -0.2195955        0     27.5005        2           1 0.06529274       1
## 2304   -1.6878765        1     27.5005        2           1 0.06529274       1
## 178    -0.2550080        0     32.5005        2           1 0.06771114       1
## 716    -1.2722942        1     32.5005        2           1 0.06771114       1
## 288    -0.7403859        0     37.5005        2           1 0.07021240       1
## 485    -0.7841369        1     37.5005        2           1 0.07021240       1
##      subclass
## 176       598
## 4274      598
## 41        222
## 2083      222
## 561       272
## 2304      272
## 178       859
## 716       859
## 288       728
## 485       728
## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA

## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA
## # A tibble: 2 x 8
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb distance weights
##   <fct>           <dbl>       <dbl>    <dbl>       <dbl>    <dbl>   <dbl>
## 1 0               0.335        86.2     1.07       0.205    0.203       1
## 2 1               0.220        86.2     1.07       0.205    0.203       1
## # … with 1 more variable: subclass <dbl>
## [1] 1
## [1] 1
## [1] 1

We have 1860 observations in all with a sample of Catholics & Non-Catholics, i.e. 930 students in each category. With the Popensity score matching we got equal pairs of treated vs non-treated sampe and weighed them based on distance metric. We still have identification issue with our treatment variable even after PSM as there exists some unobserved factor that is correlated with the decision to go to Catholic School but not correlated with the covariates.

Visually check the balance in the new dataset:

  • Plot the income variable against the propensity score (by catholic) and plot the lowess smooth densities – Verify that the means are nearly identical at each level of the pscore distribution – (You can repeat this for all covariates if you want)

## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA

## Warning in mean.default(subclass): argument is not numeric or logical: returning
## NA
## # A tibble: 2 x 8
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb distance weights
##   <fct>           <dbl>       <dbl>    <dbl>       <dbl>    <dbl>   <dbl>
## 1 0               0.335        86.2     1.07       0.205    0.203       1
## 2 1               0.220        86.2     1.07       0.205    0.203       1
## # … with 1 more variable: subclass <dbl>
## 
##  Welch Two Sample t-test
## 
## data:  c5r2mtsc_std by catholic
## t = 2.8048, df = 1852.1, p-value = 0.005087
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.03455008 0.19520664
## sample estimates:
## mean in group 0 mean in group 1 
##       0.3345634       0.2196851
## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic, data = newmatcheddata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2884 -0.5547  0.0443  0.5931  2.8356 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.33456    0.02896  11.552  < 2e-16 ***
## catholic1   -0.11488    0.04096  -2.805  0.00509 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8832 on 1858 degrees of freedom
## Multiple R-squared:  0.004216,   Adjusted R-squared:  0.00368 
## F-statistic: 7.867 on 1 and 1858 DF,  p-value: 0.005087
## 
## Call:
## lm(formula = c5r2mtsc_std ~ catholic + p5numpla + w3momed_hsb, 
##     data = newmatcheddata)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3884 -0.5685  0.0329  0.5975  2.7356 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.52796    0.08273   6.382 2.21e-10 ***
## catholic1   -0.11488    0.04008  -2.866   0.0042 ** 
## p5numpla    -0.09342    0.07154  -1.306   0.1917    
## w3momed_hsb -0.45350    0.04962  -9.139  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8642 on 1856 degrees of freedom
## Multiple R-squared:  0.04764,    Adjusted R-squared:  0.0461 
## F-statistic: 30.95 on 3 and 1856 DF,  p-value: < 2.2e-16

Split the PSM procedure

We observe that with this new PSM matched data, our treatment effect of being in catholic schools on standardized score is negative and statistically significant.

## 
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb, 
##     family = "binomial", data = dataset)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0172  -0.6461  -0.5240  -0.4122   2.3672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.6702950  0.1576164 -10.597  < 2e-16 ***
## w3income_1k  0.0077921  0.0008115   9.602  < 2e-16 ***
## p5numpla    -0.2774346  0.1232930  -2.250   0.0244 *  
## w3momed_hsb -0.6504757  0.0915710  -7.104 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4972.4  on 5428  degrees of freedom
## Residual deviance: 4750.6  on 5425  degrees of freedom
## AIC: 4758.6
## 
## Number of Fisher Scoring iterations: 4
##   c5r2mtsc_std     catholic  w3income_1k        p5numpla      w3momed_hsb    
##  Min.   :-3.1036   0:4499   Min.   :  5.00   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:-0.4416   1: 930   1st Qu.: 37.50   1st Qu.:1.000   1st Qu.:0.0000  
##  Median : 0.2256            Median : 62.50   Median :1.000   Median :0.0000  
##  Mean   : 0.1728            Mean   : 68.95   Mean   :1.101   Mean   :0.3603  
##  3rd Qu.: 0.8024            3rd Qu.: 87.50   3rd Qu.:1.000   3rd Qu.:1.0000  
##  Max.   : 3.4337            Max.   :200.00   Max.   :5.000   Max.   :1.0000  
##   catholicpred    
##  Min.   :0.04333  
##  1st Qu.:0.10801  
##  Median :0.16839  
##  Mean   :0.17130  
##  3rd Qu.:0.21996  
##  Max.   :0.40389

Use Machine Learning for matching

## 
## The downloaded binary packages are in
##  /var/folders/np/zfkhhhb16cjfjjh96791mx140000gn/T//RtmpthanNQ/downloaded_packages
## Loading required package: Matrix
## Loaded glmnet 4.1-1
##   c5r2mtsc_std        catholic        race_white       race_black     
##  Min.   :-3.1036   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:-0.4416   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median : 0.2256   Median :0.0000   Median :1.0000   Median :0.00000  
##  Mean   : 0.1728   Mean   :0.1713   Mean   :0.6731   Mean   :0.06576  
##  3rd Qu.: 0.8024   3rd Qu.:0.0000   3rd Qu.:1.0000   3rd Qu.:0.00000  
##  Max.   : 3.4337   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000  
##    race_asian      race_hispanic       p5hmage         p5hdage     
##  Min.   :0.00000   Min.   :0.0000   Min.   :19.00   Min.   :22.00  
##  1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:34.00   1st Qu.:36.00  
##  Median :0.00000   Median :0.0000   Median :38.00   Median :40.00  
##  Mean   :0.06299   Mean   :0.1464   Mean   :38.13   Mean   :40.46  
##  3rd Qu.:0.00000   3rd Qu.:0.0000   3rd Qu.:42.00   3rd Qu.:44.00  
##  Max.   :1.00000   Max.   :1.0000   Max.   :71.00   Max.   :69.00  
##   w3daded_hsb        p5numpla      w3momed_hsb      w3income_1k    
##  Min.   :0.0000   Min.   :1.000   Min.   :0.0000   Min.   :  5.00  
##  1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 37.50  
##  Median :0.0000   Median :1.000   Median :0.0000   Median : 62.50  
##  Mean   :0.4332   Mean   :1.101   Mean   :0.3603   Mean   : 68.95  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.: 87.50  
##  Max.   :1.0000   Max.   :5.000   Max.   :1.0000   Max.   :200.00  
##     w3momscr        w3dadscr        w3povrty          p5fstamp      
##  Min.   :29.60   Min.   :29.60   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:35.78   1st Qu.:35.78   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :38.18   Median :39.18   Median :0.00000   Median :0.00000  
##  Mean   :44.55   Mean   :43.16   Mean   :0.08694   Mean   :0.03887  
##  3rd Qu.:53.50   3rd Qu.:53.50   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :77.50   Max.   :77.50   Max.   :1.00000   Max.   :1.00000

## 
## Call:  glmnet(x = x.matrix, y = y, family = "gaussian", alpha = 1) 
## 
##    Df  %Dev   Lambda
## 1   0  0.00 0.288400
## 2   1  1.55 0.262800
## 3   2  3.34 0.239400
## 4   3  4.97 0.218200
## 5   3  6.45 0.198800
## 6   4  7.91 0.181100
## 7   4  9.25 0.165000
## 8   4 10.36 0.150400
## 9   5 11.29 0.137000
## 10  5 12.12 0.124800
## 11  6 12.83 0.113700
## 12  6 13.43 0.103600
## 13  8 14.00 0.094430
## 14  9 14.57 0.086040
## 15  9 15.04 0.078400
## 16  9 15.44 0.071440
## 17  9 15.76 0.065090
## 18 10 16.06 0.059310
## 19 10 16.38 0.054040
## 20 10 16.64 0.049240
## 21 11 16.91 0.044860
## 22 11 17.16 0.040880
## 23 11 17.37 0.037250
## 24 11 17.55 0.033940
## 25 12 17.69 0.030920
## 26 13 17.81 0.028180
## 27 13 17.93 0.025670
## 28 13 18.02 0.023390
## 29 13 18.10 0.021310
## 30 13 18.16 0.019420
## 31 13 18.22 0.017700
## 32 13 18.26 0.016120
## 33 13 18.30 0.014690
## 34 13 18.33 0.013390
## 35 13 18.35 0.012200
## 36 13 18.38 0.011110
## 37 13 18.39 0.010130
## 38 13 18.41 0.009226
## 39 13 18.42 0.008407
## 40 14 18.44 0.007660
## 41 14 18.46 0.006979
## 42 14 18.48 0.006359
## 43 14 18.50 0.005794
## 44 14 18.51 0.005280
## 45 14 18.52 0.004811
## 46 14 18.53 0.004383
## 47 14 18.54 0.003994
## 48 14 18.54 0.003639
## 49 14 18.55 0.003316
## 50 14 18.55 0.003021
## 51 14 18.56 0.002753
## 52 14 18.56 0.002508
## 53 14 18.56 0.002285
## 54 14 18.57 0.002082
## 55 14 18.57 0.001897
## 56 14 18.57 0.001729
## 57 14 18.57 0.001575
## 58 14 18.57 0.001435
## 59 14 18.57 0.001308
## 60 14 18.57 0.001192
## 61 14 18.57 0.001086
## 62 15 18.57 0.000989
## 63 15 18.57 0.000901
## 64 15 18.57 0.000821
## 65 15 18.58 0.000748
## 66 15 18.58 0.000682
## 67 15 18.58 0.000621
## 68 15 18.58 0.000566
## 69 15 18.58 0.000516

## 16 x 1 sparse Matrix of class "dgCMatrix"
##                          1
## (Intercept)   -0.455557786
## catholic      -0.010889736
## race_white     0.210778482
## race_black    -0.128278509
## race_asian     0.085651014
## race_hispanic  .          
## p5hmage        0.005514440
## p5hdage        .          
## w3daded_hsb   -0.199875569
## p5numpla       .          
## w3momed_hsb   -0.161435158
## w3income_1k    0.002446267
## w3momscr       0.003339817
## w3dadscr       0.002685208
## w3povrty      -0.085842306
## p5fstamp       .

For regularization, we choose about 10 out of 15 covariates. And they do differ as we see certain variables which we used previously are getting dropped here suggesting low penalty being imposed on them.

Use Lasso predictions for matching.

## [1] 0.001085757

We can see Overfitting - switch spot (Log(lambda)=3.5) - Underfitting