HW 9 - PSM Lasso

In this exercise, I am working on the application of a Propensity Score Matching, with a Machine Learning twist at the end. The example data contains test results of school children along with various characteristics. The key question is: DO catholic schools have an effect on student’s performance?

The key issue is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely)

Part A: Understanding and running PSM

1. Download the data

Make a smaller dataset which only contains the variables:
c5r2mtsc_std: Standardized math scores
catholic: Whether the school is catholic
w3income_1k: Family income in thousands
p5numpla: Number of places the student has lived for at least 4 months
w3momed_hsb: Is the mother’s education level high-school or below (1) or some college or more (0)?

df <- read.csv("ecls.csv")
# Subset the data set
df1 <- df %>%
  select(c5r2mtsc_std, catholic, w3income_1k, p5numpla, w3momed_hsb)

2. Identify the problem

Do test scores on c5r2mtsc_std differ for pupils of catholic and non-catholic schools on average?

df1 %>%
  group_by(catholic) %>%
  summarise_at(vars(c5r2mtsc_std), funs(n(),mean))

## # A tibble: 2 x 3
##   catholic     n  mean
##      <int> <int> <dbl>
## 1        0  4499 0.163
## 2        1   930 0.220

On an average, students of catholic tend to have higher math score than of non-catholic.

Run a regression of just math scores on the catholic dummy

reg1 <- lm(c5r2mtsc_std ~ catholic, data = df1)
stargazer(reg1, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic                       0.057           
##                               (0.034)          
##                                                
## Constant                     0.163***          
##                               (0.014)          
##                                                
## -----------------------------------------------
## Observations                   5,429           
## R2                            0.0005           
## Adjusted R2                   0.0003           
## Residual Std. Error      0.955 (df = 5427)     
## F Statistic            2.704 (df = 1; 5427)    
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

The estimate suggests that the effect of catholic school on standardized math score is positive but insignificant.

What about with a regression that includes all the variables from the subset?

reg2 <- lm(c5r2mtsc_std ~ ., data = df1)
stargazer(reg2, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic                     -0.127***         
##                               (0.033)          
##                                                
## w3income_1k                  0.005***          
##                              (0.0003)          
##                                                
## p5numpla                     -0.101***         
##                               (0.036)          
##                                                
## w3momed_hsb                  -0.374***         
##                               (0.027)          
##                                                
## Constant                       0.073           
##                               (0.049)          
##                                                
## -----------------------------------------------
## Observations                   5,429           
## R2                             0.124           
## Adjusted R2                    0.124           
## Residual Std. Error      0.894 (df = 5424)     
## F Statistic          192.272*** (df = 4; 5424) 
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

When all variables are included in the model specification, the effect of catholic school on standardized math score becomes negative and significant at 1% significance level. Results also show significant negative effect of higher education level of mother and students staying in many places over last 4 months. Additionally, there is significant positive effect of higher family income on math score.

Do the means of the covariates (income, mother’s education etc) differ between catholic vs. non-catholic schools?

by_catholic <- df1 %>%
  group_by(catholic) %>%
  summarise_all(funs( mean))
by_catholic

## # A tibble: 2 x 5
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb
##      <int>        <dbl>       <dbl>    <dbl>       <dbl>
## 1        0        0.163        65.4     1.11       0.392
## 2        1        0.220        86.2     1.07       0.205

# test using t-test
t1 <- t.test(w3income_1k ~ catholic, data = df1)$p.value #family income
t2 <- t.test(p5numpla ~ catholic, data = df1)$p.value # number of places
t3 <- t.test(w3momed_hsb ~ catholic, data = df1)$p.value # mother education

t1; t2; t3

## [1] 1.212805e-37

## [1] 0.001791557

## [1] 1.530072e-33

p-values from the t-tests suggest that all variables are significantly different between groups.
Endogeneity and omitted variable bias can be of a major concern to the identification strategy based on simple regression.

3. Estimate the propensity to go to catholic school, by hand

Run a logit regression to predict catholic based on the covariates we kept

mod_logit <- glm(catholic ~ w3income_1k + p5numpla + w3momed_hsb, family = binomial(), data = df1)
summary(mod_logit)

## 
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb, 
##     family = binomial(), data = df1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0172  -0.6461  -0.5240  -0.4122   2.3672  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.6702950  0.1576164 -10.597  < 2e-16 ***
## w3income_1k  0.0077921  0.0008115   9.602  < 2e-16 ***
## p5numpla    -0.2774346  0.1232930  -2.250   0.0244 *  
## w3momed_hsb -0.6504757  0.0915710  -7.104 1.22e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4972.4  on 5428  degrees of freedom
## Residual deviance: 4750.6  on 5425  degrees of freedom
## AIC: 4758.6
## 
## Number of Fisher Scoring iterations: 4

Make a prediction for who is likely to go to catholic school, using this model

df1$prediction <- predict(mod_logit, type = "response")

df1%>%
  select(catholic, prediction) %>%
  head(., 6)

##   catholic prediction
## 1        0  0.1883576
## 2        0  0.1683901
## 3        0  0.1883576
## 4        1  0.2199574
## 5        0  0.3145556
## 6        0  0.1883576

# visual check with plot densities for common support 
ggplot(data = df1, aes(x = prediction, fill = as.factor(catholic)))+
  geom_density(alpha=0.5)

Comment : There is a decent support which means both the groups have predicted values in overlapping ranges.

Visually check income versus predicted value

library(ggplot2)
ggplot(data = df1, aes(x = prediction, y= w3income_1k, color = as.factor(catholic)))+
  geom_point(data = df1, aes(prediction, w3income_1k))+
  geom_smooth(data = df1, aes(x = prediction, y= w3income_1k, color = as.factor(catholic), group = catholic), method = "loess")

## `geom_smooth()` using formula 'y ~ x'

4. Run a Matching algorithm to get a balanced dataset

library(MatchIt)
psm <- matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb, family = binomial(), data = df1)
psm_data <- match.data(psm)
dim(psm_data)

## [1] 1860    9

Now, there are 1860 observations, 930 students each in the catholic category. The original dataset had a total of 5429 observations.

Sort the data by distance, and show the 10 first observations

newpsm_data <- arrange(psm_data, distance)
head(newpsm_data, n = 10)

##    c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb prediction   distance
## 1    -0.3835843        0     17.5005        2           1 0.06069529 0.06069529
## 2    -2.5922340        1     17.5005        2           1 0.06069529 0.06069529
## 3    -2.3690526        0     22.5005        2           1 0.06295488 0.06295488
## 4     0.5271555        1     22.5005        2           1 0.06295488 0.06295488
## 5    -0.2195955        0     27.5005        2           1 0.06529274 0.06529274
## 6    -1.6878765        1     27.5005        2           1 0.06529274 0.06529274
## 7    -0.2550080        0     32.5005        2           1 0.06771114 0.06771114
## 8    -1.2722942        1     32.5005        2           1 0.06771114 0.06771114
## 9    -0.7403859        0     37.5005        2           1 0.07021240 0.07021240
## 10   -0.7841369        1     37.5005        2           1 0.07021240 0.07021240
##    weights subclass
## 1        1      598
## 2        1      598
## 3        1      222
## 4        1      222
## 5        1      272
## 6        1      272
## 7        1      859
## 8        1      859
## 9        1      728
## 10       1      728

# check if the means of the covariates differ between catholic vs. non-catholic schools
by_catholic1 <- newpsm_data %>%
  group_by(catholic) %>%
  summarise_all(funs( mean))
by_catholic1

## # A tibble: 2 x 9
##   catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb prediction distance
##      <int>        <dbl>       <dbl>    <dbl>       <dbl>      <dbl>    <dbl>
## 1        0        0.335        86.2     1.07       0.205      0.203    0.203
## 2        1        0.220        86.2     1.07       0.205      0.203    0.203
## # ... with 2 more variables: weights <dbl>, subclass <dbl>

# t-test
t11 <- t.test(w3income_1k ~ catholic, data = newpsm_data)$p.value #family income
t12 <- t.test(p5numpla ~ catholic, data = newpsm_data)$p.value # number of places
t13 <- t.test(w3momed_hsb ~ catholic, data = newpsm_data)$p.value # mother education

t11; t12; t13

## [1] 1

## [1] 1

## [1] 1

p-values suggest that differences in the mean are statistically insignificant. Endogeneity and omitted variable bias could still be of concern.

5. Visually check the balance in the new dataset

Plot the income variable against the propensity score (by catholic) and plot the lowess smooth densities
Verify that the means are nearly identical at each level of the pscore distribution

ggplot(data = newpsm_data, aes(x = prediction, y= w3income_1k, color = as.factor(catholic)))+
  geom_point(data = newpsm_data, aes(prediction, w3income_1k))+
  geom_smooth(data = newpsm_data, aes(x = prediction, y= w3income_1k, color = as.factor(catholic), group = catholic), method = "loess")

## `geom_smooth()` using formula 'y ~ x'

The ggplot shows that variable means are identical at each level of the p score distribution.
#### 6. Compute the average treatment effect in your new sample

Compare means of the two groups.
Run regressions; just scores on the catholic dummy and another regression that includes all the variables.

# Compare means of the two groups
newpsm_data %>%
  group_by(catholic) %>%
  summarise_at(vars(c5r2mtsc_std), funs(n(),mean))

## # A tibble: 2 x 3
##   catholic     n  mean
##      <int> <int> <dbl>
## 1        0   930 0.335
## 2        1   930 0.220

# Run regression: math scores on the catholic dummy
reg3 <- lm(c5r2mtsc_std ~ catholic, newpsm_data)
reg4 <- lm(c5r2mtsc_std ~ catholic + w3income_1k + p5numpla + w3momed_hsb, newpsm_data)
stargazer(reg3, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic                     -0.115***         
##                               (0.041)          
##                                                
## Constant                     0.335***          
##                               (0.029)          
##                                                
## -----------------------------------------------
## Observations                   1,860           
## R2                             0.004           
## Adjusted R2                    0.004           
## Residual Std. Error      0.883 (df = 1858)     
## F Statistic           7.867*** (df = 1; 1858)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

stargazer(reg4, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic                     -0.115***         
##                               (0.039)          
##                                                
## w3income_1k                  0.004***          
##                              (0.0005)          
##                                                
## p5numpla                      -0.091           
##                               (0.070)          
##                                                
## w3momed_hsb                  -0.366***         
##                               (0.050)          
##                                                
## Constant                      0.151*           
##                               (0.091)          
##                                                
## -----------------------------------------------
## Observations                   1,860           
## R2                             0.088           
## Adjusted R2                    0.086           
## Residual Std. Error      0.846 (df = 1855)     
## F Statistic          44.707*** (df = 4; 1855)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01

While regressing only math score, the coefficient is negative and significant. Similarly, When other variables are included in the model, we still find significant negative effect of catholic variable on math score at 1% significance level. Results also show significant negative effect of higher education level of mother and significant positive effect of higher family income on math score. The effect of students staying in many places over last 4 months was negative and insignificant.

Part B: Deconstructing PSM

7. Split the PSM procedure

Reproduce a PSM ‘by hand’

# Prediction from logit regression
logit_pred <- fitted(mod_logit)
plot(psm$distance,logit_pred)

# Run same matchIt command but with distance = your logit prediction
Psm1 <- matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb, family = binomial(), data = df1, distance = logit_pred )
Psm1_data <- match.data(Psm1)
dim(Psm1_data)

## [1] 1860    9

The dimension is same as that in logit matchIt.

8. Use Machine Learning for matching

Run a LASSO procedure on the ORIGINAL dataset (the one before your select command in q1)

# Use 'df'
# Keep only numeric variables
df2 <- df %>%
  select(where(is.numeric))
head(df2)

##   catholic race_white race_black race_hispanic race_asian p5numpla p5hmage
## 1        0          1          0             0          0        1      47
## 2        0          1          0             0          0        1      41
## 3        0          1          0             0          0        1      43
## 4        1          1          0             0          0        1      38
## 5        0          1          0             0          0        1      47
## 6        0          1          0             0          0        1      41
##   p5hdage w3daded_hsb w3momed_hsb w3momscr w3dadscr w3income w3povrty p5fstamp
## 1      45           0           0    53.50     77.5  62500.5        0        0
## 2      48           0           0    34.95     53.5  45000.5        0        0
## 3      55           0           0    63.43     53.5  62500.5        0        0
## 4      39           0           0    53.50     53.5  87500.5        0        0
## 5      57           0           0    61.56     77.5 150000.5        0        0
## 6      41           0           0    38.18     53.5  62500.5        0        0
##   c5r2mtsc c5r2mtsc_std w3income_1k
## 1   60.043    0.9817533     62.5005
## 2   56.280    0.5943775     45.0005
## 3   55.272    0.4906106     62.5005
## 4   64.604    1.4512779     87.5005
## 5   75.721    2.5956991    150.0005
## 6   54.248    0.3851966     62.5005

# run lasso
library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-4

y <- data.matrix(df2$catholic)
x <- data.matrix(df2[,c(2:18)])

fit <- cv.glmnet(x, y, family = "gaussian", alpha = 1)

lasso <- glmnet(x, y, family = "gaussian", alpha=1, lambda = fit$lambda.1se)
coef(lasso, s= "lambda.lse")

## 18 x 1 sparse Matrix of class "dgCMatrix"
##                          s1
## (Intercept)    7.516499e-02
## race_white     .           
## race_black     .           
## race_hispanic  .           
## race_asian     .           
## p5numpla       .           
## p5hmage        1.700277e-03
## p5hdage        .           
## w3daded_hsb   -2.147176e-02
## w3momed_hsb   -2.195889e-02
## w3momscr       .           
## w3dadscr       .           
## w3income       7.035449e-07
## w3povrty       .           
## p5fstamp       .           
## c5r2mtsc       .           
## c5r2mtsc_std   .           
## w3income_1k    2.168253e-16

plot(fit)

There are only five variables to include using lambda.1se as regularization parameter. For the PSM command, I would choose education level and income. These variables are different from those used before.

9. Use Lasso predictions for matching

# Generate a prediction from the “best lasso” model
pred <- predict(lasso, newx = x, type = "response")

# Use that prediction as your distance in the matchit command
match <- matchit(catholic ~ p5hmage  + w3daded_hsb + w3momed_hsb + w3income, data = df2, family = binomial(), distance = as.numeric(pred))

# Create the match.data dataset based on that matchit
data.match <- match.data(match)

# Plot
ggplot(data = data.match, aes(x = distance, y = w3income_1k, col = as.factor(catholic))) +
geom_point(data = data.match, aes(distance, w3income_1k)) +
geom_smooth(data = data.match, aes(distance, w3income_1k))

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

# Reflect on how PSM methods can be coupled with ML methods
reg5 <- lm(c5r2mtsc_std ~ catholic + w3income_1k + w3momed_hsb + w3daded_hsb , 
                     data = data.match)
stargazer(reg5, type = "text")

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                            c5r2mtsc_std        
## -----------------------------------------------
## catholic                     -0.220***         
##                               (0.039)          
##                                                
## w3income_1k                  0.003***          
##                              (0.0005)          
##                                                
## w3momed_hsb                  -0.268***         
##                               (0.053)          
##                                                
## w3daded_hsb                  -0.274***         
##                               (0.048)          
##                                                
## Constant                     0.279***          
##                               (0.054)          
##                                                
## -----------------------------------------------
## Observations                   1,860           
## R2                             0.110           
## Adjusted R2                    0.108           
## Residual Std. Error      0.836 (df = 1855)     
## F Statistic          57.184*** (df = 4; 1855)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01