This assignment will take you through a \(\textbf{Propensity Score Matching}\) exercise. The example data contains test results of school children along with various characteristics.
The key question is: Do catholic schools have an effect on student’s performance?
The key issue is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely).
From here: http://www.mfilipski.com/files/ecls.csv (Data is originally form here)
For the purpose of questions 1 - 6, make a smaller dataset which only contains the variables:
setwd("C:/Users/Akash/Dropbox/UGA/AAEC8610AdvEcotrix_Filipski/FilipskiHW11")
url = 'http://www.mfilipski.com/files/ecls.csv'
hw11data <- read.csv(url,header = TRUE)
hw11data = subset(hw11data, select = c(c5r2mtsc_std,catholic,race_white,p5hmage,
w3income_1k, p5numpla, w3momed_hsb))
hw11data$catholic = as.factor(hw11data$catholic)
summary(hw11data)
c5r2mtsc_std catholic race_white p5hmage w3income_1k
Min. :-3.1036 0:4499 Min. :0.0000 Min. :19.00 Min. : 5.00
1st Qu.:-0.4416 1: 930 1st Qu.:0.0000 1st Qu.:34.00 1st Qu.: 37.50
Median : 0.2256 Median :1.0000 Median :38.00 Median : 62.50
Mean : 0.1728 Mean :0.6731 Mean :38.13 Mean : 68.95
3rd Qu.: 0.8024 3rd Qu.:1.0000 3rd Qu.:42.00 3rd Qu.: 87.50
Max. : 3.4337 Max. :1.0000 Max. :71.00 Max. :200.00
p5numpla w3momed_hsb
Min. :1.000 Min. :0.0000
1st Qu.:1.000 1st Qu.:0.0000
Median :1.000 Median :0.0000
Mean :1.101 Mean :0.3603
3rd Qu.:1.000 3rd Qu.:1.0000
Max. :5.000 Max. :1.0000
attach(hw11data)
# No NAs
QUESTION: Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools?
summary(hw11data %>% filter(catholic == 1) %>% .$c5r2mtsc_std)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.5922 -0.3232 0.2510 0.2197 0.7729 3.0552
summary(hw11data %>% filter(catholic == 0) %>% .$c5r2mtsc_std)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.1036 -0.4577 0.2207 0.1631 0.8084 3.4337
We can see Catholics appears to have higher standardized mean score than Non-Catholics
ggplot(hw11data, aes(catholic, c5r2mtsc_std )) +
geom_boxplot()
t.test(c5r2mtsc_std ~ catholic, data = hw11data)
Welch Two Sample t-test
data: c5r2mtsc_std by catholic
t = -1.7866, df = 1468.1, p-value = 0.07421
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.118653665 0.005539377
sample estimates:
mean in group 0 mean in group 1
0.1631279 0.2196851
We see that the t-test difference in means of scores for catholics and non-catholics is statistically different.
ols.simple = lm(c5r2mtsc_std ~ catholic, data = hw11data)
summary(ols.simple)
Call:
lm(formula = c5r2mtsc_std ~ catholic, data = hw11data)
Residuals:
Min 1Q Median 3Q Max
-3.2667 -0.6082 0.0538 0.6292 3.2705
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.16313 0.01424 11.459 <0.0000000000000002 ***
catholic1 0.05656 0.03440 1.644 0.1
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9549 on 5427 degrees of freedom
Multiple R-squared: 0.000498, Adjusted R-squared: 0.0003138
F-statistic: 2.704 on 1 and 5427 DF, p-value: 0.1002
ols.all = lm(c5r2mtsc_std ~ ., data = hw11data)
summary(ols.all)
Call:
lm(formula = c5r2mtsc_std ~ ., data = hw11data)
Residuals:
Min 1Q Median 3Q Max
-3.5174 -0.5617 0.0329 0.6097 2.9165
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.6580369 0.0994710 -6.615 0.0000000000406 ***
catholic1 -0.1583818 0.0324691 -4.878 0.0000011027244 ***
race_white 0.3086458 0.0263764 11.702 < 0.0000000000000002 ***
p5hmage 0.0145650 0.0022407 6.500 0.0000000000874 ***
w3income_1k 0.0042174 0.0003097 13.616 < 0.0000000000000002 ***
p5numpla -0.0708801 0.0351715 -2.015 0.0439 *
w3momed_hsb -0.3274425 0.0270767 -12.093 < 0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8796 on 5422 degrees of freedom
Multiple R-squared: 0.1526, Adjusted R-squared: 0.1517
F-statistic: 162.7 on 6 and 5422 DF, p-value: < 0.00000000000000022
hw11data %>%
group_by(catholic) %>%
dplyr::summarise(RACE = round(mean(race_white), digits = 3),
MOTHERS_AGE = round(mean(p5hmage), digits = 3),
FAM_INCOME = round(mean(w3income_1k), digits = 3),
Num_Places = round(mean(p5numpla), digits = 3),
Mothers_Educ = round(mean(w3momed_hsb), digits = 3)
)
# A tibble: 2 x 6
catholic RACE MOTHERS_AGE FAM_INCOME Num_Places Mothers_Educ
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.654 37.8 65.4 1.11 0.392
2 1 0.767 39.8 86.2 1.07 0.205
We see that the means among these covariates differ between Catholics and Non-Catholics. But, to see whether these differences are statistically significant we conduct the t-test below.
t.test(race_white ~ catholic, data = hw11data)$p.value
[1] 0.0000000000006814076
t.test(p5hmage ~ catholic, data = hw11data)$p.value
[1] 0.0000000000000000000000000004283437
t.test(w3income_1k ~ catholic, data = hw11data)$p.value
[1] 0.0000000000000000000000000000000000001212805
t.test(p5numpla ~ catholic, data = hw11data)$p.value
[1] 0.001791557
t.test(w3momed_hsb ~ catholic, data = hw11data)$p.value
[1] 0.000000000000000000000000000000001530072
Thus, we see that the means of the covariates differ between CATHOLICS & NON-CATHOLICS students.
What does this mean for IDENTIFICATION: These differences in covariates among the catholics and non-catholics students imply that 1) the parallel trends assumption would be violated in a Difference-In-Differences design ; 2) These differences among the covariates also tell us that there exists selection on observables and unobservables such that there is endogeneity & omitted variable concern with students attending catholic schools and those that are not. There exists inherent sample selection among those who attend catholic schools and those who do not. In simple terms, there are other factors contributing to attending catholic schools and not attending.
Since the parallel trends assumption fails , we can understand that the untreated units (here, non-catholic students) do not provide us with a “good” counterfactual. [See David McKenzie World Bank Blog ]
course). [Hint: you can use the glm command and the option family = binomial() ]
logit.catholic = glm(catholic~ factor(race_white) + p5hmage +
w3income_1k + p5numpla + w3momed_hsb,
data=hw11data, family = "binomial")
summary(logit.catholic)
Call:
glm(formula = catholic ~ factor(race_white) + p5hmage + w3income_1k +
p5numpla + w3momed_hsb, family = "binomial", data = hw11data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.2305 -0.6669 -0.5170 -0.3743 2.4825
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.4092820 0.3263351 -10.447 < 0.0000000000000002 ***
factor(race_white)1 0.3001516 0.0870242 3.449 0.000563 ***
p5hmage 0.0395347 0.0070820 5.582 0.0000000237241155 ***
w3income_1k 0.0063463 0.0008473 7.490 0.0000000000000688 ***
p5numpla -0.2106296 0.1230279 -1.712 0.086888 .
w3momed_hsb -0.5644212 0.0926141 -6.094 0.0000000010989248 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 4972.4 on 5428 degrees of freedom
Residual deviance: 4706.8 on 5423 degrees of freedom
AIC: 4718.8
Number of Fisher Scoring iterations: 5
predict command]hw11data$catholicpred = predict(logit.catholic, newdata = hw11data, type = "response")
summary(hw11data)
c5r2mtsc_std catholic race_white p5hmage w3income_1k
Min. :-3.1036 0:4499 Min. :0.0000 Min. :19.00 Min. : 5.00
1st Qu.:-0.4416 1: 930 1st Qu.:0.0000 1st Qu.:34.00 1st Qu.: 37.50
Median : 0.2256 Median :1.0000 Median :38.00 Median : 62.50
Mean : 0.1728 Mean :0.6731 Mean :38.13 Mean : 68.95
3rd Qu.: 0.8024 3rd Qu.:1.0000 3rd Qu.:42.00 3rd Qu.: 87.50
Max. : 3.4337 Max. :1.0000 Max. :71.00 Max. :200.00
p5numpla w3momed_hsb catholicpred
Min. :1.000 Min. :0.0000 Min. :0.03195
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.10199
Median :1.000 Median :0.0000 Median :0.16003
Mean :1.101 Mean :0.3603 Mean :0.17130
3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:0.22061
Max. :5.000 Max. :1.0000 Max. :0.53094
## first plot - left half of x-axis, right margin set to 0 lines
par(fig = c(0, .5, 0, 1), mar = c(5,4,3,0))
hist(hw11data$catholicpred[hw11data$catholic==0],
ann = FALSE, las = 1, col=rgb(1,0,0,0.5))
## second plot - right half of x-axis, left margin set to 0 lines
par(fig = c(.5, 1, 0, 1), mar = c(5,0,3,2), new = TRUE)
hist(hw11data$catholicpred[hw11data$catholic==1],
ann = FALSE, axes = FALSE, col=rgb(0,0,1,0.5))
axis(1)
axis(2, lwd.ticks = 0, labels = FALSE)
title(main = 'Histogram: \nLeft: Non-Catholic; Right: Catholics',
xlab = 'Predicted Value Of Being Catholic', outer = TRUE, line = -2)
# Histogram Colored (blue and red)
hist(hw11data$catholicpred[hw11data$catholic==0],
col=rgb(1,0,0,0.5) ,xlim=c(0,0.6), main='Overlapping Histogram', xlab='Variable')
hist(hw11data$catholicpred[hw11data$catholic==1],
col=rgb(0,0,1,0.5), add=T)
box()
plot(density(hw11data$catholicpred[hw11data$catholic==1]),
col="red", ann=FALSE)
lines(density(hw11data$catholicpred[hw11data$catholic==0]), col="blue")
title(main = 'Overlay of Density Plots Of Predicted Catholic Score',
xlab = 'Predicted Value Of Being Catholic', outer = TRUE, line = -2)
legend("topright", c("Catholics","non-Catholics"), lty = c(1,1), col = c("red","blue"))
# removing the above generated predicted catholics
hw11data = dplyr::select(hw11data, -catholicpred )
m.out <- matchit(catholic~ race_white + p5hmage +
w3income_1k + p5numpla + w3momed_hsb,
data=hw11data)
m.out
Call:
matchit(formula = catholic ~ race_white + p5hmage + w3income_1k +
p5numpla + w3momed_hsb, data = hw11data)
Sample sizes:
Control Treated
All 4499 930
Matched 930 930
Unmatched 3569 0
Discarded 0 0
m.data <- match.data(m.out)
summary(m.data)
c5r2mtsc_std catholic race_white p5hmage w3income_1k
Min. :-3.0919 0:930 Min. :0.0000 Min. :24.0 Min. : 7.50
1st Qu.:-0.2738 1:930 1st Qu.:1.0000 1st Qu.:37.0 1st Qu.: 62.50
Median : 0.3226 Median :1.0000 Median :40.0 Median : 87.50
Mean : 0.3068 Mean :0.7694 Mean :39.7 Mean : 86.09
3rd Qu.: 0.9461 3rd Qu.:1.0000 3rd Qu.:43.0 3rd Qu.: 87.50
Max. : 3.1066 Max. :1.0000 Max. :55.0 Max. :200.00
p5numpla w3momed_hsb distance weights
Min. :1.000 Min. :0.0000 Min. :0.0459 Min. :1
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.1557 1st Qu.:1
Median :1.000 Median :0.0000 Median :0.2020 Median :1
Mean :1.063 Mean :0.2048 Mean :0.2085 Mean :1
3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.2523 3rd Qu.:1
Max. :5.000 Max. :1.0000 Max. :0.4619 Max. :1
We have 1860 observations.
We have a balanced sample of Catholics & Non-Catholics, i.e. 930 students in each category.
So, in original data set we had 5429 observations and now in the matched data we have 1860 observations. This means that 3569 observations were discarded.
If we compare the summary statistics of this matched data with 1860 observations and our original sample, we see that in the new matched data:
newmatcheddata = m.data[order(m.data$distance),]
summary(newmatcheddata)
c5r2mtsc_std catholic race_white p5hmage w3income_1k
Min. :-3.0919 0:930 Min. :0.0000 Min. :24.0 Min. : 7.50
1st Qu.:-0.2738 1:930 1st Qu.:1.0000 1st Qu.:37.0 1st Qu.: 62.50
Median : 0.3226 Median :1.0000 Median :40.0 Median : 87.50
Mean : 0.3068 Mean :0.7694 Mean :39.7 Mean : 86.09
3rd Qu.: 0.9461 3rd Qu.:1.0000 3rd Qu.:43.0 3rd Qu.: 87.50
Max. : 3.1066 Max. :1.0000 Max. :55.0 Max. :200.00
p5numpla w3momed_hsb distance weights
Min. :1.000 Min. :0.0000 Min. :0.0459 Min. :1
1st Qu.:1.000 1st Qu.:0.0000 1st Qu.:0.1557 1st Qu.:1
Median :1.000 Median :0.0000 Median :0.2020 Median :1
Mean :1.063 Mean :0.2048 Mean :0.2085 Mean :1
3rd Qu.:1.000 3rd Qu.:0.0000 3rd Qu.:0.2523 3rd Qu.:1
Max. :5.000 Max. :1.0000 Max. :0.4619 Max. :1
head(newmatcheddata, 10)
c5r2mtsc_std catholic race_white p5hmage w3income_1k p5numpla w3momed_hsb
2304 -1.6878765 1 0 30 27.5005 2 1
3078 -1.4395771 0 0 30 27.5005 2 1
2500 -2.3585524 0 0 26 27.5005 1 1
4274 -2.5922340 1 0 33 17.5005 2 1
1220 -1.7827903 0 0 35 12.5005 2 1
485 -0.7841369 1 0 31 37.5005 2 1
4751 0.5286997 0 0 30 17.5005 1 1
716 -1.2722942 1 0 33 32.5005 2 1
distance weights
2304 0.04589596 1
3078 0.04589596 1
2500 0.04825008 1
4274 0.04837194 1
1220 0.05059818 1
485 0.05062309 1
4751 0.05278851 1
716 0.05294718 1
[ reached 'max' / getOption("max.print") -- omitted 2 rows ]
We see a pair of catholics and non-catholics matched on distance .
newmatcheddata %>%
group_by(catholic) %>%
dplyr::summarise(RACE = round(mean(race_white), digits = 3),
MOTHERS_AGE = round(mean(p5hmage), digits = 3),
FAM_INCOME = round(mean(w3income_1k), digits = 3),
Num_Places = round(mean(p5numpla), digits = 3),
Mothers_Educ = round(mean(w3momed_hsb), digits = 3)
)
# A tibble: 2 x 6
catholic RACE MOTHERS_AGE FAM_INCOME Num_Places Mothers_Educ
<fct> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.772 39.6 86.0 1.05 0.204
2 1 0.767 39.8 86.2 1.07 0.205
Test the difference in means for statistical significance [Hint: t.test]
T-test for RACE : We see that the p-value of the t-test is greater than 0.100 (statistically insignificant). Thus the difference in means is NOT statistically different
t.test(race_white ~ catholic, data = newmatcheddata)$p.value
[1] 0.7832882
t.test(p5hmage ~ catholic, data = newmatcheddata)$p.value
[1] 0.515512
t.test(w3income_1k ~ catholic, data = newmatcheddata)$p.value
[1] 0.9327609
t.test(p5numpla ~ catholic, data = newmatcheddata)$p.value
[1] 0.09782669
t.test(w3momed_hsb ~ catholic, data = newmatcheddata)$p.value
[1] 0.9542154
Now with this new mathced data our Parallel Trends Assumption would hold for a Diff-In-Diff design since covariates don’t differ between treated and non-treated.
We still face identification issue as we still have an endogeneity/omitted variable concern with the decision to go to Catholic school that is correlated with an unobserved factor.
PSM created a pairs of treated and non-treated sample. Thus, giving us a balanced sample of treated and non-treated units by weigting them based on the distance metric.
Caveat: We are working with a much smaller sample and we have “generated” our new sample of data.
All that PSM did was it made the covariates immaterial for the treatment. We see from our bivariate and multivariate regression that the coefficient of treatment varies very little with additional covariates and that is because of our matched sample.
We still have identification issue with our treatment variable even after PSM as there exists some unobserved factor that is correlated with the decision to go to Catholic School but not correlated with the covariates.
Possible Exogenous Instrument: State Budget cuts on Public Schools thus deteriorating the quality and availability of Public Schools making parents gravitate towards Catholic schools.
lw1 <- loess(w3income_1k ~ distance,data=newmatcheddata)
plot(newmatcheddata$distance, newmatcheddata$w3income_1k,
pch=21, bg=c("red","green3")[unclass(newmatcheddata$catholic)],
ann = FALSE)
j <- order(newmatcheddata$distance)
lines(newmatcheddata$distance[j][newmatcheddata$catholic==0],
lw1$fitted[j][newmatcheddata$catholic==0],
col="red",lwd=6)
lines(newmatcheddata$distance[j][newmatcheddata$catholic==1],
lw1$fitted[j][newmatcheddata$catholic==1],
col="green3",lwd=2)
legend("bottomright", c("Lowess:Non-Catholic","Lowess:Catholic"),
lty = c(1,1), col = c("red","green3"), lwd =c(6,2) )
newmatcheddata %>%
group_by(catholic) %>%
summarise_all(mean)
# A tibble: 2 x 9
catholic c5r2mtsc_std race_white p5hmage w3income_1k p5numpla w3momed_hsb
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0.394 0.772 39.6 86.0 1.05 0.204
2 1 0.220 0.767 39.8 86.2 1.07 0.205
# ... with 2 more variables: distance <dbl>, weights <dbl>
t.test(c5r2mtsc_std ~ catholic, data = newmatcheddata)
Welch Two Sample t-test
data: c5r2mtsc_std by catholic
t = 4.1704, df = 1841.8, p-value = 0.00003182
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
0.09232814 0.25626460
sample estimates:
mean in group 0 mean in group 1
0.3939815 0.2196851
newlm = lm(c5r2mtsc_std ~ catholic, data = newmatcheddata)
summary(newlm)
Call:
lm(formula = c5r2mtsc_std ~ catholic, data = newmatcheddata)
Residuals:
Min 1Q Median 3Q Max
-3.4859 -0.5854 0.0336 0.6089 2.8356
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.39398 0.02955 13.33 < 0.0000000000000002 ***
catholic1 -0.17430 0.04179 -4.17 0.0000318 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9012 on 1858 degrees of freedom
Multiple R-squared: 0.009274, Adjusted R-squared: 0.008741
F-statistic: 17.39 on 1 and 1858 DF, p-value: 0.00003181
newlm.full = lm(c5r2mtsc_std ~ catholic + race_white +
p5hmage + w3income_1k + p5numpla +
w3momed_hsb, data = newmatcheddata)
summary(newlm.full)
Call:
lm(formula = c5r2mtsc_std ~ catholic + race_white + p5hmage +
w3income_1k + p5numpla + w3momed_hsb, data = newmatcheddata)
Residuals:
Min 1Q Median 3Q Max
-3.6475 -0.5389 0.0350 0.6108 2.7688
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.5831495 0.1972757 -2.956 0.00316 **
catholic1 -0.1725187 0.0398312 -4.331 0.00001561994653 ***
race_white 0.2496662 0.0474655 5.260 0.00000016078061 ***
p5hmage 0.0187153 0.0043511 4.301 0.00001786192590 ***
w3income_1k 0.0032596 0.0004708 6.924 0.00000000000604 ***
p5numpla -0.1599154 0.0750249 -2.131 0.03318 *
w3momed_hsb -0.3397023 0.0504062 -6.739 0.00000000002118 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.8581 on 1853 degrees of freedom
Multiple R-squared: 0.1042, Adjusted R-squared: 0.1013
F-statistic: 35.92 on 6 and 1853 DF, p-value: < 0.00000000000000022
We observe that with this new PSM matched data, our treatment effect of being in catholic schools on standardized score is negative and statistically significant.
We observe that the estimate of the treatment effect is approximately close with and without inclusion of covariates.
Our earlier result of using full sample gave a negative and statistically significant estimate of the treatment effect of -0.15 standard devation points.
Keep only numeric variables
Run the lasso on all the numeric covariates
Hint: You can use cv.glmnet followed by glmnet with the best lambda (use lambda.1se, not lambda.min) [lambda.1se the “most regularized model such that error is within one standard error of the minimum”]
Hint: The glmnet command likes to work with matrix object, so make sure you make x into a matrix.
List the variables selected by your LASSO model [hint: use coef() command]
rm(list = ls())
#install.packages("glmnet", repos = "http://cran.us.r-project.org")
library("glmnet")
url = 'http://www.mfilipski.com/files/ecls.csv'
hw11data <- read.csv(url,header = TRUE)
## KEEPING ALL NUMERIC VARIABLES
hw11data = subset(hw11data, select = c(c5r2mtsc_std,catholic,race_white,
race_black, race_asian, race_hispanic,
p5hmage, p5hdage, w3daded_hsb, p5numpla,
w3momed_hsb, w3income_1k, w3momscr, w3dadscr,
w3povrty, p5fstamp)
)
summary(hw11data)
c5r2mtsc_std catholic race_white race_black
Min. :-3.1036 Min. :0.0000 Min. :0.0000 Min. :0.00000
1st Qu.:-0.4416 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.00000
Median : 0.2256 Median :0.0000 Median :1.0000 Median :0.00000
Mean : 0.1728 Mean :0.1713 Mean :0.6731 Mean :0.06576
race_asian race_hispanic p5hmage p5hdage
Min. :0.00000 Min. :0.0000 Min. :19.00 Min. :22.00
1st Qu.:0.00000 1st Qu.:0.0000 1st Qu.:34.00 1st Qu.:36.00
Median :0.00000 Median :0.0000 Median :38.00 Median :40.00
Mean :0.06299 Mean :0.1464 Mean :38.13 Mean :40.46
w3daded_hsb p5numpla w3momed_hsb w3income_1k
Min. :0.0000 Min. :1.000 Min. :0.0000 Min. : 5.00
1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.0000 1st Qu.: 37.50
Median :0.0000 Median :1.000 Median :0.0000 Median : 62.50
Mean :0.4332 Mean :1.101 Mean :0.3603 Mean : 68.95
w3momscr w3dadscr w3povrty p5fstamp
Min. :29.60 Min. :29.60 Min. :0.00000 Min. :0.00000
1st Qu.:35.78 1st Qu.:35.78 1st Qu.:0.00000 1st Qu.:0.00000
Median :38.18 Median :39.18 Median :0.00000 Median :0.00000
Mean :44.55 Mean :43.16 Mean :0.08694 Mean :0.03887
[ reached getOption("max.print") -- omitted 2 rows ]
y = data.matrix(hw11data$c5r2mtsc_std)
x.matrix = data.matrix(hw11data[2:16])
fit = glmnet(x.matrix, y)
lasso = glmnet( x.matrix, y,
family = "gaussian",
alpha = 1)
plot(fit, label = TRUE)
plot(lasso, "lambda")
print(lasso)
Call: glmnet(x = x.matrix, y = y, family = "gaussian", alpha = 1)
Df %Dev Lambda
1 0 0.00000 0.288400
2 1 0.01548 0.262800
3 2 0.03338 0.239400
4 3 0.04970 0.218200
5 3 0.06449 0.198800
6 4 0.07908 0.181100
7 4 0.09247 0.165000
8 4 0.10360 0.150400
9 5 0.11290 0.137000
10 5 0.12120 0.124800
11 6 0.12830 0.113700
12 6 0.13430 0.103600
13 8 0.14000 0.094430
14 9 0.14570 0.086040
15 9 0.15040 0.078400
16 9 0.15440 0.071440
17 9 0.15760 0.065090
18 10 0.16060 0.059310
19 10 0.16380 0.054040
20 10 0.16640 0.049240
21 11 0.16910 0.044860
22 11 0.17160 0.040880
23 11 0.17370 0.037250
24 11 0.17550 0.033940
25 12 0.17690 0.030920
[ reached getOption("max.print") -- omitted 44 rows ]
cvlasso = cv.glmnet(x.matrix, y,
family="gaussian",
alpha = 1
)
plot(cvlasso, cex=0.8, cex.axis=0.8)
coef(cvlasso, s = "lambda.1se")
16 x 1 sparse Matrix of class "dgCMatrix"
1
(Intercept) -0.513235002
catholic -0.039411398
race_white 0.226986243
race_black -0.142197444
race_asian 0.125280662
race_hispanic .
p5hmage 0.006293943
p5hdage .
w3daded_hsb -0.203409054
p5numpla .
w3momed_hsb -0.167092623
w3income_1k 0.002459938
w3momscr 0.003630758
w3dadscr 0.002945626
w3povrty -0.099933789
p5fstamp .
According to
lambda.1sewe will choose 11 out of the 15 covariates. That is : Race, Mother’s Age, Family Income, Mother’s Education, Dad’s Education, Poverty Level.
We would have chosen to use: Race, Family income level, mother’s and father’s education level.
Yes, they differ only in one variable and that is : number of places child has lived. We see that in LASSO this variable is dropped. Thus suggesting non-importance of the variable.
Now, with this brief understanding of LASSO, I think that this might improve matching in creating a proper distance metric to match. Since we know through LASSO the variables that matter for the response variable , we can use those important variables and run the PSM command to get a stronger distance metric based on features that matter.
3.2.3 Comment on the results of those two regressions: