1 Background:

This assignment will take you through a \(\textbf{Propensity Score Matching}\) exercise. The example data contains test results of school children along with various characteristics.

The key question is: Do catholic schools have an effect on student’s performance?

The key issue is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely).

2 Download the data

2.1 Download and summarise data

setwd("C:/Users/Akash/Dropbox/UGA/AAEC8610AdvEcotrix_Filipski/FilipskiHW11")
url = 'http://www.mfilipski.com/files/ecls.csv'
hw11data <- read.csv(url,header = TRUE)
hw11data = subset(hw11data, select = c(c5r2mtsc_std,catholic,race_white,p5hmage,
                                       w3income_1k, p5numpla, w3momed_hsb))

hw11data$catholic = as.factor(hw11data$catholic)
summary(hw11data)
  c5r2mtsc_std     catholic   race_white        p5hmage       w3income_1k    
 Min.   :-3.1036   0:4499   Min.   :0.0000   Min.   :19.00   Min.   :  5.00  
 1st Qu.:-0.4416   1: 930   1st Qu.:0.0000   1st Qu.:34.00   1st Qu.: 37.50  
 Median : 0.2256            Median :1.0000   Median :38.00   Median : 62.50  
 Mean   : 0.1728            Mean   :0.6731   Mean   :38.13   Mean   : 68.95  
 3rd Qu.: 0.8024            3rd Qu.:1.0000   3rd Qu.:42.00   3rd Qu.: 87.50  
 Max.   : 3.4337            Max.   :1.0000   Max.   :71.00   Max.   :200.00  
    p5numpla      w3momed_hsb    
 Min.   :1.000   Min.   :0.0000  
 1st Qu.:1.000   1st Qu.:0.0000  
 Median :1.000   Median :0.0000  
 Mean   :1.101   Mean   :0.3603  
 3rd Qu.:1.000   3rd Qu.:1.0000  
 Max.   :5.000   Max.   :1.0000  
attach(hw11data)

# No NAs

3 Identify the problem

3.1 Do Test Scores Differ Between Catholics versus NonCatholics:

QUESTION: Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools?

  • Catholic Standardized Scores:
summary(hw11data %>% filter(catholic == 1) %>% .$c5r2mtsc_std)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-2.5922 -0.3232  0.2510  0.2197  0.7729  3.0552 
  • Non-Catholic Standardized Scores:
summary(hw11data %>% filter(catholic == 0) %>% .$c5r2mtsc_std)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-3.1036 -0.4577  0.2207  0.1631  0.8084  3.4337 

We can see Catholics appears to have higher standardized mean score than Non-Catholics

ggplot(hw11data, aes(catholic, c5r2mtsc_std )) +
        geom_boxplot()

t.test(c5r2mtsc_std ~ catholic, data = hw11data)

    Welch Two Sample t-test

data:  c5r2mtsc_std by catholic
t = -1.7866, df = 1468.1, p-value = 0.07421
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.118653665  0.005539377
sample estimates:
mean in group 0 mean in group 1 
      0.1631279       0.2196851 

We see that the t-test difference in means of scores for catholics and non-catholics is statistically different.

3.2 Run Regressions:

3.2.1 A regression of just scores on the catholic dummy

ols.simple = lm(c5r2mtsc_std ~ catholic, data = hw11data)
summary(ols.simple)

Call:
lm(formula = c5r2mtsc_std ~ catholic, data = hw11data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2667 -0.6082  0.0538  0.6292  3.2705 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept)  0.16313    0.01424  11.459 <0.0000000000000002 ***
catholic1    0.05656    0.03440   1.644                 0.1    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9549 on 5427 degrees of freedom
Multiple R-squared:  0.000498,  Adjusted R-squared:  0.0003138 
F-statistic: 2.704 on 1 and 5427 DF,  p-value: 0.1002

3.2.2 Another regression that includes all the variables listed above

ols.all = lm(c5r2mtsc_std ~ ., data = hw11data)
summary(ols.all)

Call:
lm(formula = c5r2mtsc_std ~ ., data = hw11data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.5174 -0.5617  0.0329  0.6097  2.9165 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -0.6580369  0.0994710  -6.615      0.0000000000406 ***
catholic1   -0.1583818  0.0324691  -4.878      0.0000011027244 ***
race_white   0.3086458  0.0263764  11.702 < 0.0000000000000002 ***
p5hmage      0.0145650  0.0022407   6.500      0.0000000000874 ***
w3income_1k  0.0042174  0.0003097  13.616 < 0.0000000000000002 ***
p5numpla    -0.0708801  0.0351715  -2.015               0.0439 *  
w3momed_hsb -0.3274425  0.0270767 -12.093 < 0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8796 on 5422 degrees of freedom
Multiple R-squared:  0.1526,    Adjusted R-squared:  0.1517 
F-statistic: 162.7 on 6 and 5422 DF,  p-value: < 0.00000000000000022

3.2.3 Comment on the results of those two regressions:

Comparing the two regressions we see that being in catholic school reduces standardized test scores, ceteris paribus. In the simple bivariate regression, the effect was positive but insignificant.

The regression with the many covariates tells us that being white , mother’s age, and higher family income increases the standardized test scores, ceteris paribus.

While, not having a stable house and a mother with high school or below education reduces the standardized test score, ceteris paribus.

3.3 Do the means of the five covariates (race, age, income, etc) between catholic vs. non-catholic schools?

3.3.1 Compare the means [HINT: summarize_all]

hw11data %>% 
  group_by(catholic) %>% 
  dplyr::summarise(RACE = round(mean(race_white), digits = 3),
                   MOTHERS_AGE = round(mean(p5hmage), digits = 3),
                   FAM_INCOME = round(mean(w3income_1k), digits = 3),
                   Num_Places = round(mean(p5numpla), digits = 3),
                   Mothers_Educ = round(mean(w3momed_hsb), digits = 3)
                  )
# A tibble: 2 x 6
  catholic  RACE MOTHERS_AGE FAM_INCOME Num_Places Mothers_Educ
  <fct>    <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
1 0        0.654        37.8       65.4       1.11        0.392
2 1        0.767        39.8       86.2       1.07        0.205

We see that the means among these covariates differ between Catholics and Non-Catholics. But, to see whether these differences are statistically significant we conduct the t-test below.

3.3.2 Test the difference in means for statistical significance [Hint: t.test]

  • T-test for RACE : We see that the p-value of the t-test is less than 0.001. Thus the difference in means is statistically different
t.test(race_white ~ catholic, data = hw11data)$p.value
[1] 0.0000000000006814076
  • T-test for MOTHERS AGE : We see that the p-value of the t-test is less than 0.001. Thus the difference in means is statistically different
t.test(p5hmage ~ catholic, data = hw11data)$p.value
[1] 0.0000000000000000000000000004283437
  • T-test for FAMILY INCOME : We see that the p-value of the t-test is less than 0.001. Thus the difference in means is statistically different
t.test(w3income_1k ~ catholic, data = hw11data)$p.value
[1] 0.0000000000000000000000000000000000001212805
  • T-test for NUMBER OF PLACES STUDENT HAS LIVED : We see that the p-value of the t-test is less than 0.001 . Thus the difference in means is statistically different
t.test(p5numpla ~ catholic, data = hw11data)$p.value
[1] 0.001791557
  • T-test for MOTHER’s EDUCATION : We see that the p-value of the t-test is less than 0.001. Thus the difference in means is statistically different
t.test(w3momed_hsb ~ catholic, data = hw11data)$p.value
[1] 0.000000000000000000000000000000001530072

3.3.3 Discuss what would be the problems with an identification strategy based on regressions.

Thus, we see that the means of the covariates differ between CATHOLICS & NON-CATHOLICS students.

What does this mean for IDENTIFICATION: These differences in covariates among the catholics and non-catholics students imply that 1) the parallel trends assumption would be violated in a Difference-In-Differences design ; 2) These differences among the covariates also tell us that there exists selection on observables and unobservables such that there is endogeneity & omitted variable concern with students attending catholic schools and those that are not. There exists inherent sample selection among those who attend catholic schools and those who do not. In simple terms, there are other factors contributing to attending catholic schools and not attending.

Since the parallel trends assumption fails , we can understand that the untreated units (here, non-catholic students) do not provide us with a “good” counterfactual. [See David McKenzie World Bank Blog ]

4 Estimate the propensity to go to catholic school, by hand

4.1 Run a logit regression to predict catholic based on the 5 covariates we kept (not the math scores, of

course). [Hint: you can use the glm command and the option family = binomial() ]

logit.catholic = glm(catholic~ factor(race_white) + p5hmage + 
                       w3income_1k + p5numpla + w3momed_hsb,
                     data=hw11data, family = "binomial")


summary(logit.catholic)

Call:
glm(formula = catholic ~ factor(race_white) + p5hmage + w3income_1k + 
    p5numpla + w3momed_hsb, family = "binomial", data = hw11data)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.2305  -0.6669  -0.5170  -0.3743   2.4825  

Coefficients:
                      Estimate Std. Error z value             Pr(>|z|)    
(Intercept)         -3.4092820  0.3263351 -10.447 < 0.0000000000000002 ***
factor(race_white)1  0.3001516  0.0870242   3.449             0.000563 ***
p5hmage              0.0395347  0.0070820   5.582   0.0000000237241155 ***
w3income_1k          0.0063463  0.0008473   7.490   0.0000000000000688 ***
p5numpla            -0.2106296  0.1230279  -1.712             0.086888 .  
w3momed_hsb         -0.5644212  0.0926141  -6.094   0.0000000010989248 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4972.4  on 5428  degrees of freedom
Residual deviance: 4706.8  on 5423  degrees of freedom
AIC: 4718.8

Number of Fisher Scoring iterations: 5

4.2 Make a prediction for who is likely to go to catholic school, using this model [Hint: use the predict command]

hw11data$catholicpred = predict(logit.catholic, newdata = hw11data, type = "response")


summary(hw11data)
  c5r2mtsc_std     catholic   race_white        p5hmage       w3income_1k    
 Min.   :-3.1036   0:4499   Min.   :0.0000   Min.   :19.00   Min.   :  5.00  
 1st Qu.:-0.4416   1: 930   1st Qu.:0.0000   1st Qu.:34.00   1st Qu.: 37.50  
 Median : 0.2256            Median :1.0000   Median :38.00   Median : 62.50  
 Mean   : 0.1728            Mean   :0.6731   Mean   :38.13   Mean   : 68.95  
 3rd Qu.: 0.8024            3rd Qu.:1.0000   3rd Qu.:42.00   3rd Qu.: 87.50  
 Max.   : 3.4337            Max.   :1.0000   Max.   :71.00   Max.   :200.00  
    p5numpla      w3momed_hsb      catholicpred    
 Min.   :1.000   Min.   :0.0000   Min.   :0.03195  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.10199  
 Median :1.000   Median :0.0000   Median :0.16003  
 Mean   :1.101   Mean   :0.3603   Mean   :0.17130  
 3rd Qu.:1.000   3rd Qu.:1.0000   3rd Qu.:0.22061  
 Max.   :5.000   Max.   :1.0000   Max.   :0.53094  

4.3 Check that there is common support.

  • (Meaning check both groups have predicted values in overlapping ranges)
  • You may check this visually. For instance: you can do this with two histograms side-by-side. Or plot densities.
## first plot - left half of x-axis, right margin set to 0 lines
par(fig = c(0, .5, 0, 1), mar = c(5,4,3,0))
hist(hw11data$catholicpred[hw11data$catholic==0], 
     ann = FALSE, las = 1, col=rgb(1,0,0,0.5))

## second plot - right half of x-axis, left margin set to 0 lines
par(fig = c(.5, 1, 0, 1), mar = c(5,0,3,2), new = TRUE)
hist(hw11data$catholicpred[hw11data$catholic==1], 
     ann = FALSE, axes = FALSE, col=rgb(0,0,1,0.5))
axis(1)
axis(2, lwd.ticks = 0, labels = FALSE)

title(main = 'Histogram: \nLeft: Non-Catholic; Right: Catholics', 
      xlab = 'Predicted Value Of Being Catholic', outer = TRUE, line = -2)

  • OVERLAY:
# Histogram Colored (blue and red)
hist(hw11data$catholicpred[hw11data$catholic==0],
     col=rgb(1,0,0,0.5) ,xlim=c(0,0.6), main='Overlapping Histogram', xlab='Variable')
hist(hw11data$catholicpred[hw11data$catholic==1],
     col=rgb(0,0,1,0.5), add=T)
box()

  • DENSITY PLOT:
plot(density(hw11data$catholicpred[hw11data$catholic==1]), 
     col="red", ann=FALSE)
lines(density(hw11data$catholicpred[hw11data$catholic==0]), col="blue")
title(main = 'Overlay of Density Plots Of Predicted Catholic Score', 
      xlab = 'Predicted Value Of Being Catholic', outer = TRUE, line = -2)
legend("topright", c("Catholics","non-Catholics"), lty = c(1,1), col = c("red","blue"))

5 Run a Matching algorithm to get a balanced dataset

5.1 Create a matched dataset:

  • Run a MatchIt command (which runs a logit just like the one you did, then matches each observation of the treated group with another from the control group.)
# removing the above generated predicted catholics
hw11data = dplyr::select(hw11data, -catholicpred )



m.out <- matchit(catholic~ race_white + p5hmage + 
                       w3income_1k + p5numpla + w3momed_hsb,
                     data=hw11data)


m.out

Call: 
matchit(formula = catholic ~ race_white + p5hmage + w3income_1k + 
    p5numpla + w3momed_hsb, data = hw11data)

Sample sizes:
          Control Treated
All          4499     930
Matched       930     930
Unmatched    3569       0
Discarded       0       0
  • Follow that by match.data(your_matchit_output) to create you matched dataset
m.data <- match.data(m.out)
  • Look at the dimensions of dta_psm. What can you conclude about the number of observations that matched?
summary(m.data)
  c5r2mtsc_std     catholic   race_white        p5hmage      w3income_1k    
 Min.   :-3.0919   0:930    Min.   :0.0000   Min.   :24.0   Min.   :  7.50  
 1st Qu.:-0.2738   1:930    1st Qu.:1.0000   1st Qu.:37.0   1st Qu.: 62.50  
 Median : 0.3226            Median :1.0000   Median :40.0   Median : 87.50  
 Mean   : 0.3068            Mean   :0.7694   Mean   :39.7   Mean   : 86.09  
 3rd Qu.: 0.9461            3rd Qu.:1.0000   3rd Qu.:43.0   3rd Qu.: 87.50  
 Max.   : 3.1066            Max.   :1.0000   Max.   :55.0   Max.   :200.00  
    p5numpla      w3momed_hsb        distance         weights 
 Min.   :1.000   Min.   :0.0000   Min.   :0.0459   Min.   :1  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.1557   1st Qu.:1  
 Median :1.000   Median :0.0000   Median :0.2020   Median :1  
 Mean   :1.063   Mean   :0.2048   Mean   :0.2085   Mean   :1  
 3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:0.2523   3rd Qu.:1  
 Max.   :5.000   Max.   :1.0000   Max.   :0.4619   Max.   :1  

We have 1860 observations.

We have a balanced sample of Catholics & Non-Catholics, i.e. 930 students in each category.

So, in original data set we had 5429 observations and now in the matched data we have 1860 observations. This means that 3569 observations were discarded.

If we compare the summary statistics of this matched data with 1860 observations and our original sample, we see that in the new matched data:

  • mean and median of the standardized score has increased
  • More white race
  • higher family income at mean
  • And less proportion of mothers with less than high school degree.

5.2 Sort the data by distance, and show the 10 first observations.

  • Notice how observations were matched on distance?
newmatcheddata = m.data[order(m.data$distance),]

summary(newmatcheddata)
  c5r2mtsc_std     catholic   race_white        p5hmage      w3income_1k    
 Min.   :-3.0919   0:930    Min.   :0.0000   Min.   :24.0   Min.   :  7.50  
 1st Qu.:-0.2738   1:930    1st Qu.:1.0000   1st Qu.:37.0   1st Qu.: 62.50  
 Median : 0.3226            Median :1.0000   Median :40.0   Median : 87.50  
 Mean   : 0.3068            Mean   :0.7694   Mean   :39.7   Mean   : 86.09  
 3rd Qu.: 0.9461            3rd Qu.:1.0000   3rd Qu.:43.0   3rd Qu.: 87.50  
 Max.   : 3.1066            Max.   :1.0000   Max.   :55.0   Max.   :200.00  
    p5numpla      w3momed_hsb        distance         weights 
 Min.   :1.000   Min.   :0.0000   Min.   :0.0459   Min.   :1  
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.1557   1st Qu.:1  
 Median :1.000   Median :0.0000   Median :0.2020   Median :1  
 Mean   :1.063   Mean   :0.2048   Mean   :0.2085   Mean   :1  
 3rd Qu.:1.000   3rd Qu.:0.0000   3rd Qu.:0.2523   3rd Qu.:1  
 Max.   :5.000   Max.   :1.0000   Max.   :0.4619   Max.   :1  
head(newmatcheddata, 10)
     c5r2mtsc_std catholic race_white p5hmage w3income_1k p5numpla w3momed_hsb
2304   -1.6878765        1          0      30     27.5005        2           1
3078   -1.4395771        0          0      30     27.5005        2           1
2500   -2.3585524        0          0      26     27.5005        1           1
4274   -2.5922340        1          0      33     17.5005        2           1
1220   -1.7827903        0          0      35     12.5005        2           1
485    -0.7841369        1          0      31     37.5005        2           1
4751    0.5286997        0          0      30     17.5005        1           1
716    -1.2722942        1          0      33     32.5005        2           1
       distance weights
2304 0.04589596       1
3078 0.04589596       1
2500 0.04825008       1
4274 0.04837194       1
1220 0.05059818       1
485  0.05062309       1
4751 0.05278851       1
716  0.05294718       1
 [ reached 'max' / getOption("max.print") -- omitted 2 rows ]

We see a pair of catholics and non-catholics matched on distance .

5.3 In this new dataset, do the means of the five covariates (race, age, income, etc) between catholic vs. non-catholic schools?

  • Compare the meants [Hint: summarise_all]
newmatcheddata %>% 
  group_by(catholic) %>% 
  dplyr::summarise(RACE = round(mean(race_white), digits = 3),
                   MOTHERS_AGE = round(mean(p5hmage), digits = 3),
                   FAM_INCOME = round(mean(w3income_1k), digits = 3),
                   Num_Places = round(mean(p5numpla), digits = 3),
                   Mothers_Educ = round(mean(w3momed_hsb), digits = 3)
                  )
# A tibble: 2 x 6
  catholic  RACE MOTHERS_AGE FAM_INCOME Num_Places Mothers_Educ
  <fct>    <dbl>       <dbl>      <dbl>      <dbl>        <dbl>
1 0        0.772        39.6       86.0       1.05        0.204
2 1        0.767        39.8       86.2       1.07        0.205
  • Test the difference in means for statistical significance [Hint: t.test]

  • T-test for RACE : We see that the p-value of the t-test is greater than 0.100 (statistically insignificant). Thus the difference in means is NOT statistically different

t.test(race_white ~ catholic, data = newmatcheddata)$p.value
[1] 0.7832882
  • T-test for MOTHERS AGE : We see that the p-value of the t-test is greater than 0.100 (statistically insignificant). Thus the difference in means is NOT statistically different
t.test(p5hmage ~ catholic, data = newmatcheddata)$p.value
[1] 0.515512
  • T-test for FAMILY INCOME : We see that the p-value of the t-test is greater than 0.100 (statistically insignificant). Thus the difference in means is NOT statistically different
t.test(w3income_1k ~ catholic, data = newmatcheddata)$p.value
[1] 0.9327609
  • T-test for NUMBER OF PLACES STUDENT HAS LIVED : We see that the p-value of the t-test is close to 0.100 (statistically insignificant). Thus the difference in means is NOT statistically different
t.test(p5numpla ~ catholic, data = newmatcheddata)$p.value
[1] 0.09782669
  • T-test for MOTHER’s EDUCATION : We see that the p-value of the t-test is greater than 0.100 (statistically insignificant). Thus the difference in means is NOT statistically different
t.test(w3momed_hsb ~ catholic, data = newmatcheddata)$p.value
[1] 0.9542154
  • Discuss what would be the problems with an identification strategy based on regressions.

Now with this new mathced data our Parallel Trends Assumption would hold for a Diff-In-Diff design since covariates don’t differ between treated and non-treated.

We still face identification issue as we still have an endogeneity/omitted variable concern with the decision to go to Catholic school that is correlated with an unobserved factor.

5.4 Reflect on what PSM did for your identification strategy.

PSM created a pairs of treated and non-treated sample. Thus, giving us a balanced sample of treated and non-treated units by weigting them based on the distance metric.

Caveat: We are working with a much smaller sample and we have “generated” our new sample of data.

All that PSM did was it made the covariates immaterial for the treatment. We see from our bivariate and multivariate regression that the coefficient of treatment varies very little with additional covariates and that is because of our matched sample.

We still have identification issue with our treatment variable even after PSM as there exists some unobserved factor that is correlated with the decision to go to Catholic School but not correlated with the covariates.

Possible Exogenous Instrument: State Budget cuts on Public Schools thus deteriorating the quality and availability of Public Schools making parents gravitate towards Catholic schools.

6 Visually check the balance in the new dataset:

6.1 Plot the income variable against the propensity score (by catholic) and plot the lowess smooth densities

  • Verify that the means are nearly identical at each level of the pscore distribution
lw1 <- loess(w3income_1k ~ distance,data=newmatcheddata)
plot(newmatcheddata$distance, newmatcheddata$w3income_1k,
     pch=21,  bg=c("red","green3")[unclass(newmatcheddata$catholic)],
     ann = FALSE)
j <- order(newmatcheddata$distance)
lines(newmatcheddata$distance[j][newmatcheddata$catholic==0],
      lw1$fitted[j][newmatcheddata$catholic==0],
      col="red",lwd=6)
lines(newmatcheddata$distance[j][newmatcheddata$catholic==1],
      lw1$fitted[j][newmatcheddata$catholic==1],
      col="green3",lwd=2)
legend("bottomright", c("Lowess:Non-Catholic","Lowess:Catholic"), 
       lty = c(1,1), col = c("red","green3"), lwd =c(6,2) )

  • (You can repeat this for all covariates if you want)

7 Compute the average treatment effect in your new sample

7.1 Compare means of the two groups.

newmatcheddata %>%
  group_by(catholic) %>%
  summarise_all(mean)
# A tibble: 2 x 9
  catholic c5r2mtsc_std race_white p5hmage w3income_1k p5numpla w3momed_hsb
  <fct>           <dbl>      <dbl>   <dbl>       <dbl>    <dbl>       <dbl>
1 0               0.394      0.772    39.6        86.0     1.05       0.204
2 1               0.220      0.767    39.8        86.2     1.07       0.205
# ... with 2 more variables: distance <dbl>, weights <dbl>
  • Or conduct a t-test
t.test(c5r2mtsc_std ~ catholic, data = newmatcheddata)

    Welch Two Sample t-test

data:  c5r2mtsc_std by catholic
t = 4.1704, df = 1841.8, p-value = 0.00003182
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.09232814 0.25626460
sample estimates:
mean in group 0 mean in group 1 
      0.3939815       0.2196851 

7.2 Run regressions:

  • A regression of just scores on the catholic dummy
newlm = lm(c5r2mtsc_std ~ catholic, data = newmatcheddata)
summary(newlm)

Call:
lm(formula = c5r2mtsc_std ~ catholic, data = newmatcheddata)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4859 -0.5854  0.0336  0.6089  2.8356 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  0.39398    0.02955   13.33 < 0.0000000000000002 ***
catholic1   -0.17430    0.04179   -4.17            0.0000318 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9012 on 1858 degrees of freedom
Multiple R-squared:  0.009274,  Adjusted R-squared:  0.008741 
F-statistic: 17.39 on 1 and 1858 DF,  p-value: 0.00003181
  • Another regression that includes the following variables
newlm.full = lm(c5r2mtsc_std ~ catholic + race_white + 
                 p5hmage + w3income_1k + p5numpla + 
                 w3momed_hsb, data = newmatcheddata)
summary(newlm.full)

Call:
lm(formula = c5r2mtsc_std ~ catholic + race_white + p5hmage + 
    w3income_1k + p5numpla + w3momed_hsb, data = newmatcheddata)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6475 -0.5389  0.0350  0.6108  2.7688 

Coefficients:
              Estimate Std. Error t value         Pr(>|t|)    
(Intercept) -0.5831495  0.1972757  -2.956          0.00316 ** 
catholic1   -0.1725187  0.0398312  -4.331 0.00001561994653 ***
race_white   0.2496662  0.0474655   5.260 0.00000016078061 ***
p5hmage      0.0187153  0.0043511   4.301 0.00001786192590 ***
w3income_1k  0.0032596  0.0004708   6.924 0.00000000000604 ***
p5numpla    -0.1599154  0.0750249  -2.131          0.03318 *  
w3momed_hsb -0.3397023  0.0504062  -6.739 0.00000000002118 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8581 on 1853 degrees of freedom
Multiple R-squared:  0.1042,    Adjusted R-squared:  0.1013 
F-statistic: 35.92 on 6 and 1853 DF,  p-value: < 0.00000000000000022

7.3 Comment on the results of those the PSM procedure

We observe that with this new PSM matched data, our treatment effect of being in catholic schools on standardized score is negative and statistically significant.

We observe that the estimate of the treatment effect is approximately close with and without inclusion of covariates.

Our earlier result of using full sample gave a negative and statistically significant estimate of the treatment effect of -0.15 standard devation points.

8 Some “Machine Learning”

8.1 Run a LASSO procedure on the ORIGINAL dataset (the one with all the variables, not just those from q1)

  • Keep only numeric variables

  • Run the lasso on all the numeric covariates

  • Hint: You can use cv.glmnet followed by glmnet with the best lambda (use lambda.1se, not lambda.min) [lambda.1se the “most regularized model such that error is within one standard error of the minimum”]

  • Hint: The glmnet command likes to work with matrix object, so make sure you make x into a matrix.

  • List the variables selected by your LASSO model [hint: use coef() command]

rm(list = ls())

#install.packages("glmnet", repos = "http://cran.us.r-project.org")
library("glmnet")






url = 'http://www.mfilipski.com/files/ecls.csv'
hw11data <- read.csv(url,header = TRUE)

## KEEPING ALL NUMERIC VARIABLES
hw11data = subset(hw11data, select = c(c5r2mtsc_std,catholic,race_white,
                                       race_black, race_asian, race_hispanic,
                                       p5hmage, p5hdage, w3daded_hsb, p5numpla, 
                                       w3momed_hsb, w3income_1k, w3momscr, w3dadscr, 
                                       w3povrty, p5fstamp)
                  )


summary(hw11data)
  c5r2mtsc_std        catholic        race_white       race_black     
 Min.   :-3.1036   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000  
 1st Qu.:-0.4416   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.00000  
 Median : 0.2256   Median :0.0000   Median :1.0000   Median :0.00000  
 Mean   : 0.1728   Mean   :0.1713   Mean   :0.6731   Mean   :0.06576  
   race_asian      race_hispanic       p5hmage         p5hdage     
 Min.   :0.00000   Min.   :0.0000   Min.   :19.00   Min.   :22.00  
 1st Qu.:0.00000   1st Qu.:0.0000   1st Qu.:34.00   1st Qu.:36.00  
 Median :0.00000   Median :0.0000   Median :38.00   Median :40.00  
 Mean   :0.06299   Mean   :0.1464   Mean   :38.13   Mean   :40.46  
  w3daded_hsb        p5numpla      w3momed_hsb      w3income_1k    
 Min.   :0.0000   Min.   :1.000   Min.   :0.0000   Min.   :  5.00  
 1st Qu.:0.0000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.: 37.50  
 Median :0.0000   Median :1.000   Median :0.0000   Median : 62.50  
 Mean   :0.4332   Mean   :1.101   Mean   :0.3603   Mean   : 68.95  
    w3momscr        w3dadscr        w3povrty          p5fstamp      
 Min.   :29.60   Min.   :29.60   Min.   :0.00000   Min.   :0.00000  
 1st Qu.:35.78   1st Qu.:35.78   1st Qu.:0.00000   1st Qu.:0.00000  
 Median :38.18   Median :39.18   Median :0.00000   Median :0.00000  
 Mean   :44.55   Mean   :43.16   Mean   :0.08694   Mean   :0.03887  
 [ reached getOption("max.print") -- omitted 2 rows ]
y = data.matrix(hw11data$c5r2mtsc_std)
x.matrix = data.matrix(hw11data[2:16])
fit = glmnet(x.matrix, y)


lasso = glmnet( x.matrix, y,
                family = "gaussian",
                alpha = 1)
  • We can visualize the coefficients by executing the plot function:
plot(fit, label = TRUE)

plot(lasso, "lambda")

  • A summary of the glmnet path at each step is displayed if we just enter the object name or use the print function:
print(lasso)

Call:  glmnet(x = x.matrix, y = y, family = "gaussian", alpha = 1) 

   Df    %Dev   Lambda
1   0 0.00000 0.288400
2   1 0.01548 0.262800
3   2 0.03338 0.239400
4   3 0.04970 0.218200
5   3 0.06449 0.198800
6   4 0.07908 0.181100
7   4 0.09247 0.165000
8   4 0.10360 0.150400
9   5 0.11290 0.137000
10  5 0.12120 0.124800
11  6 0.12830 0.113700
12  6 0.13430 0.103600
13  8 0.14000 0.094430
14  9 0.14570 0.086040
15  9 0.15040 0.078400
16  9 0.15440 0.071440
17  9 0.15760 0.065090
18 10 0.16060 0.059310
19 10 0.16380 0.054040
20 10 0.16640 0.049240
21 11 0.16910 0.044860
22 11 0.17160 0.040880
23 11 0.17370 0.037250
24 11 0.17550 0.033940
25 12 0.17690 0.030920
 [ reached getOption("max.print") -- omitted 44 rows ]

8.1.1 Automating Cross-Validation

cvlasso = cv.glmnet(x.matrix, y,
                    family="gaussian",
                    alpha = 1
                    )
  • We can plot the object.
plot(cvlasso, cex=0.8, cex.axis=0.8)

  • We can obtain the actual coefficients at one or more \(\lambda\)’s within the range of the sequence:
coef(cvlasso, s = "lambda.1se")
16 x 1 sparse Matrix of class "dgCMatrix"
                         1
(Intercept)   -0.513235002
catholic      -0.039411398
race_white     0.226986243
race_black    -0.142197444
race_asian     0.125280662
race_hispanic  .          
p5hmage        0.006293943
p5hdage        .          
w3daded_hsb   -0.203409054
p5numpla       .          
w3momed_hsb   -0.167092623
w3income_1k    0.002459938
w3momscr       0.003630758
w3dadscr       0.002945626
w3povrty      -0.099933789
p5fstamp       .          

8.2 Comment on the results

8.2.1 If you used lambda.1se as your regularization parameter, how many variables would you have chosen to include?

According to lambda.1se we will choose 11 out of the 15 covariates. That is : Race, Mother’s Age, Family Income, Mother’s Education, Dad’s Education, Poverty Level.

8.2.2 Which variables would you have chosen to run the PSM command?

We would have chosen to use: Race, Family income level, mother’s and father’s education level.

8.2.3 Do they differ from those we actually used?

Yes, they differ only in one variable and that is : number of places child has lived. We see that in LASSO this variable is dropped. Thus suggesting non-importance of the variable.

8.3 Comment on how LASSO might help you improve a matching method further.

Now, with this brief understanding of LASSO, I think that this might improve matching in creating a proper distance metric to match. Since we know through LASSO the variables that matter for the response variable , we can use those important variables and run the PSM command to get a stronger distance metric based on features that matter.

8.4 Think about how you could use LASSO to compute the p-score and use it in the algorithm. (If you have lots of time on your hands, feel free to read up on this and give it a try!)