In this exercise, I am working on the application of a Propensity Score Matching, with a Machine Learning twist at the end. The example data contains test results of school children along with various characteristics. The key question is: DO catholic schools have an effect on student’s performance?
The key issue is, of course, that students in catholic schools (“Treated”) might be different from those in non-catholic schools (“Control”). The exercise will show how propensity score matching attempts to reduce this issue (Note: it does not solve it completely)
df <- read.csv("ecls.csv")
# Subset the data set
df1 <- df %>%
select(c5r2mtsc_std, catholic, w3income_1k, p5numpla, w3momed_hsb)
df1 %>%
group_by(catholic) %>%
summarise_at(vars(c5r2mtsc_std), funs(n(),mean))
## # A tibble: 2 x 3
## catholic n mean
## <int> <int> <dbl>
## 1 0 4499 0.163
## 2 1 930 0.220
On an average, students of catholic tend to have higher math score than of non-catholic.
reg1 <- lm(c5r2mtsc_std ~ catholic, data = df1)
stargazer(reg1, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic 0.057
## (0.034)
##
## Constant 0.163***
## (0.014)
##
## -----------------------------------------------
## Observations 5,429
## R2 0.0005
## Adjusted R2 0.0003
## Residual Std. Error 0.955 (df = 5427)
## F Statistic 2.704 (df = 1; 5427)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
The estimate suggests that the effect of catholic school on standardized math score is positive but insignificant.
reg2 <- lm(c5r2mtsc_std ~ ., data = df1)
stargazer(reg2, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic -0.127***
## (0.033)
##
## w3income_1k 0.005***
## (0.0003)
##
## p5numpla -0.101***
## (0.036)
##
## w3momed_hsb -0.374***
## (0.027)
##
## Constant 0.073
## (0.049)
##
## -----------------------------------------------
## Observations 5,429
## R2 0.124
## Adjusted R2 0.124
## Residual Std. Error 0.894 (df = 5424)
## F Statistic 192.272*** (df = 4; 5424)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
When all variables are included in the model specification, the effect of catholic school on standardized math score becomes negative and significant at 1% significance level. Results also show significant negative effect of higher education level of mother and students staying in many places over last 4 months. Additionally, there is significant positive effect of higher family income on math score.
by_catholic <- df1 %>%
group_by(catholic) %>%
summarise_all(funs( mean))
by_catholic
## # A tibble: 2 x 5
## catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb
## <int> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.163 65.4 1.11 0.392
## 2 1 0.220 86.2 1.07 0.205
# test using t-test
t1 <- t.test(w3income_1k ~ catholic, data = df1)$p.value #family income
t2 <- t.test(p5numpla ~ catholic, data = df1)$p.value # number of places
t3 <- t.test(w3momed_hsb ~ catholic, data = df1)$p.value # mother education
t1; t2; t3
## [1] 1.212805e-37
## [1] 0.001791557
## [1] 1.530072e-33
p-values from the t-tests suggest that all variables are
significantly different between groups.
Endogeneity and omitted variable bias can be of a major concern to the
identification strategy based on simple regression.
mod_logit <- glm(catholic ~ w3income_1k + p5numpla + w3momed_hsb, family = binomial(), data = df1)
summary(mod_logit)
##
## Call:
## glm(formula = catholic ~ w3income_1k + p5numpla + w3momed_hsb,
## family = binomial(), data = df1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0172 -0.6461 -0.5240 -0.4122 2.3672
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.6702950 0.1576164 -10.597 < 2e-16 ***
## w3income_1k 0.0077921 0.0008115 9.602 < 2e-16 ***
## p5numpla -0.2774346 0.1232930 -2.250 0.0244 *
## w3momed_hsb -0.6504757 0.0915710 -7.104 1.22e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 4972.4 on 5428 degrees of freedom
## Residual deviance: 4750.6 on 5425 degrees of freedom
## AIC: 4758.6
##
## Number of Fisher Scoring iterations: 4
df1$prediction <- predict(mod_logit, type = "response")
df1%>%
select(catholic, prediction) %>%
head(., 6)
## catholic prediction
## 1 0 0.1883576
## 2 0 0.1683901
## 3 0 0.1883576
## 4 1 0.2199574
## 5 0 0.3145556
## 6 0 0.1883576
# visual check with plot densities for common support
ggplot(data = df1, aes(x = prediction, fill = as.factor(catholic)))+
geom_density(alpha=0.5)
Comment : There is a decent support which means both the groups have predicted values in overlapping ranges.
library(ggplot2)
ggplot(data = df1, aes(x = prediction, y= w3income_1k, color = as.factor(catholic)))+
geom_point(data = df1, aes(prediction, w3income_1k))+
geom_smooth(data = df1, aes(x = prediction, y= w3income_1k, color = as.factor(catholic), group = catholic), method = "loess")
## `geom_smooth()` using formula 'y ~ x'
library(MatchIt)
psm <- matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb, family = binomial(), data = df1)
psm_data <- match.data(psm)
dim(psm_data)
## [1] 1860 9
Now, there are 1860 observations, 930 students each in the catholic category. The original dataset had a total of 5429 observations.
newpsm_data <- arrange(psm_data, distance)
head(newpsm_data, n = 10)
## c5r2mtsc_std catholic w3income_1k p5numpla w3momed_hsb prediction distance
## 1 -0.3835843 0 17.5005 2 1 0.06069529 0.06069529
## 2 -2.5922340 1 17.5005 2 1 0.06069529 0.06069529
## 3 -2.3690526 0 22.5005 2 1 0.06295488 0.06295488
## 4 0.5271555 1 22.5005 2 1 0.06295488 0.06295488
## 5 -0.2195955 0 27.5005 2 1 0.06529274 0.06529274
## 6 -1.6878765 1 27.5005 2 1 0.06529274 0.06529274
## 7 -0.2550080 0 32.5005 2 1 0.06771114 0.06771114
## 8 -1.2722942 1 32.5005 2 1 0.06771114 0.06771114
## 9 -0.7403859 0 37.5005 2 1 0.07021240 0.07021240
## 10 -0.7841369 1 37.5005 2 1 0.07021240 0.07021240
## weights subclass
## 1 1 598
## 2 1 598
## 3 1 222
## 4 1 222
## 5 1 272
## 6 1 272
## 7 1 859
## 8 1 859
## 9 1 728
## 10 1 728
# check if the means of the covariates differ between catholic vs. non-catholic schools
by_catholic1 <- newpsm_data %>%
group_by(catholic) %>%
summarise_all(funs( mean))
by_catholic1
## # A tibble: 2 x 9
## catholic c5r2mtsc_std w3income_1k p5numpla w3momed_hsb prediction distance
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0.335 86.2 1.07 0.205 0.203 0.203
## 2 1 0.220 86.2 1.07 0.205 0.203 0.203
## # ... with 2 more variables: weights <dbl>, subclass <dbl>
# t-test
t11 <- t.test(w3income_1k ~ catholic, data = newpsm_data)$p.value #family income
t12 <- t.test(p5numpla ~ catholic, data = newpsm_data)$p.value # number of places
t13 <- t.test(w3momed_hsb ~ catholic, data = newpsm_data)$p.value # mother education
t11; t12; t13
## [1] 1
## [1] 1
## [1] 1
p-values suggest that differences in the mean are statistically insignificant. Endogeneity and omitted variable bias could still be of concern.
ggplot(data = newpsm_data, aes(x = prediction, y= w3income_1k, color = as.factor(catholic)))+
geom_point(data = newpsm_data, aes(prediction, w3income_1k))+
geom_smooth(data = newpsm_data, aes(x = prediction, y= w3income_1k, color = as.factor(catholic), group = catholic), method = "loess")
## `geom_smooth()` using formula 'y ~ x'
The ggplot shows that variable means are identical at each level of
the p score distribution.
#### 6. Compute the average treatment effect
in your new sample
# Compare means of the two groups
newpsm_data %>%
group_by(catholic) %>%
summarise_at(vars(c5r2mtsc_std), funs(n(),mean))
## # A tibble: 2 x 3
## catholic n mean
## <int> <int> <dbl>
## 1 0 930 0.335
## 2 1 930 0.220
# Run regression: math scores on the catholic dummy
reg3 <- lm(c5r2mtsc_std ~ catholic, newpsm_data)
reg4 <- lm(c5r2mtsc_std ~ catholic + w3income_1k + p5numpla + w3momed_hsb, newpsm_data)
stargazer(reg3, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic -0.115***
## (0.041)
##
## Constant 0.335***
## (0.029)
##
## -----------------------------------------------
## Observations 1,860
## R2 0.004
## Adjusted R2 0.004
## Residual Std. Error 0.883 (df = 1858)
## F Statistic 7.867*** (df = 1; 1858)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
stargazer(reg4, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic -0.115***
## (0.039)
##
## w3income_1k 0.004***
## (0.0005)
##
## p5numpla -0.091
## (0.070)
##
## w3momed_hsb -0.366***
## (0.050)
##
## Constant 0.151*
## (0.091)
##
## -----------------------------------------------
## Observations 1,860
## R2 0.088
## Adjusted R2 0.086
## Residual Std. Error 0.846 (df = 1855)
## F Statistic 44.707*** (df = 4; 1855)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
While regressing only math score, the coefficient is negative and significant. Similarly, When other variables are included in the model, we still find significant negative effect of catholic variable on math score at 1% significance level. Results also show significant negative effect of higher education level of mother and significant positive effect of higher family income on math score. The effect of students staying in many places over last 4 months was negative and insignificant.
# Prediction from logit regression
logit_pred <- fitted(mod_logit)
plot(psm$distance,logit_pred)
# Run same matchIt command but with distance = your logit prediction
Psm1 <- matchit(catholic ~ w3income_1k + p5numpla + w3momed_hsb, family = binomial(), data = df1, distance = logit_pred )
Psm1_data <- match.data(Psm1)
dim(Psm1_data)
## [1] 1860 9
The dimension is same as that in logit matchIt.
# Use 'df'
# Keep only numeric variables
df2 <- df %>%
select(where(is.numeric))
head(df2)
## catholic race_white race_black race_hispanic race_asian p5numpla p5hmage
## 1 0 1 0 0 0 1 47
## 2 0 1 0 0 0 1 41
## 3 0 1 0 0 0 1 43
## 4 1 1 0 0 0 1 38
## 5 0 1 0 0 0 1 47
## 6 0 1 0 0 0 1 41
## p5hdage w3daded_hsb w3momed_hsb w3momscr w3dadscr w3income w3povrty p5fstamp
## 1 45 0 0 53.50 77.5 62500.5 0 0
## 2 48 0 0 34.95 53.5 45000.5 0 0
## 3 55 0 0 63.43 53.5 62500.5 0 0
## 4 39 0 0 53.50 53.5 87500.5 0 0
## 5 57 0 0 61.56 77.5 150000.5 0 0
## 6 41 0 0 38.18 53.5 62500.5 0 0
## c5r2mtsc c5r2mtsc_std w3income_1k
## 1 60.043 0.9817533 62.5005
## 2 56.280 0.5943775 45.0005
## 3 55.272 0.4906106 62.5005
## 4 64.604 1.4512779 87.5005
## 5 75.721 2.5956991 150.0005
## 6 54.248 0.3851966 62.5005
# run lasso
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-4
y <- data.matrix(df2$catholic)
x <- data.matrix(df2[,c(2:18)])
fit <- cv.glmnet(x, y, family = "gaussian", alpha = 1)
lasso <- glmnet(x, y, family = "gaussian", alpha=1, lambda = fit$lambda.1se)
coef(lasso, s= "lambda.lse")
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 7.516499e-02
## race_white .
## race_black .
## race_hispanic .
## race_asian .
## p5numpla .
## p5hmage 1.700277e-03
## p5hdage .
## w3daded_hsb -2.147176e-02
## w3momed_hsb -2.195889e-02
## w3momscr .
## w3dadscr .
## w3income 7.035449e-07
## w3povrty .
## p5fstamp .
## c5r2mtsc .
## c5r2mtsc_std .
## w3income_1k 2.168253e-16
plot(fit)
There are only five variables to include using lambda.1se as regularization parameter. For the PSM command, I would choose education level and income. These variables are different from those used before.
# Generate a prediction from the “best lasso” model
pred <- predict(lasso, newx = x, type = "response")
# Use that prediction as your distance in the matchit command
match <- matchit(catholic ~ p5hmage + w3daded_hsb + w3momed_hsb + w3income, data = df2, family = binomial(), distance = as.numeric(pred))
# Create the match.data dataset based on that matchit
data.match <- match.data(match)
# Plot
ggplot(data = data.match, aes(x = distance, y = w3income_1k, col = as.factor(catholic))) +
geom_point(data = data.match, aes(distance, w3income_1k)) +
geom_smooth(data = data.match, aes(distance, w3income_1k))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
# Reflect on how PSM methods can be coupled with ML methods
reg5 <- lm(c5r2mtsc_std ~ catholic + w3income_1k + w3momed_hsb + w3daded_hsb ,
data = data.match)
stargazer(reg5, type = "text")
##
## ===============================================
## Dependent variable:
## ---------------------------
## c5r2mtsc_std
## -----------------------------------------------
## catholic -0.220***
## (0.039)
##
## w3income_1k 0.003***
## (0.0005)
##
## w3momed_hsb -0.268***
## (0.053)
##
## w3daded_hsb -0.274***
## (0.048)
##
## Constant 0.279***
## (0.054)
##
## -----------------------------------------------
## Observations 1,860
## R2 0.110
## Adjusted R2 0.108
## Residual Std. Error 0.836 (df = 1855)
## F Statistic 57.184*** (df = 4; 1855)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01