rm(list = ls())
### Load useful packages
library(WeightIt)
library(survey)
library(cobalt)
library(MatchIt)
library(dplyr)
library(ggplot2)
library(psych)
library(PSW)
library(pastecs)
library(designmatch)
library(stargazer)
library(foreign)
library(Hmisc)
library(exactRankTests)
library(dplyr)
library(ggplot2)
library(optmatch)
library(foreign)
library(Hmisc)
library(xtable)
library(ipw)
library(CausalGAM)
d<-load("C:/Users/okuffuor/Downloads/asn-1-data.RData")
d<-d1
#1. Write the potential outcomes for this analysis.
D is a causal exposure indicator of child bearing (1 = treatment, 0 = control)
Y is observable outcome random variable of level of years of education. Therefore observed potential outcomes are [Y1 | D=1] and [Y0 | D=0] [Y1 | D=1] This model represents the number of years of education for individuals gave birth during their teens. [Y0 | D=0] This model also represents the number of years of education for individuals who did not give birth during their teens. However, what we do not observe is [Y1 | D=0] and [Y0 | D = 1]. Therefore the notations here are the converse of what is stated supra.
#2. Prepare a table of descriptive statistics that follows the conventions of your area of work in terms of which descriptives are included and what format the table takes.
d2<-subset(d,coh==1)
stargazer(d2, type = "text", title="Descriptive statistics", digits=1, out="table1.txt",covariate.labels = c("id","Cohort","Outcome Variable","Treatment Variable","Birth Year","age","Black","South","Urban","Both Parents US Citizens","two-parent household","Parent years of schooling","cat.meas.parent yrs of schooling","pe1-pe3"))
##
## Descriptive statistics
## =====================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------------------------
## id 4,445 2,519.6 1,458.8 1 1,262 3,779 5,083
## Cohort 4,445 1.0 0.0 1 1 1 1
## Outcome Variable 4,445 11.2 2.6 6 10 12 18
## Treatment Variable 4,445 0.3 0.5 0 0 1 1
## Birth Year 4,445 1,929.8 4.4 1,923 1,926 1,934 1,937
## age 4,445 37.2 4.4 30 33 41 44
## Black 4,445 0.3 0.4 0 0 1 1
## South 4,445 0.4 0.5 0 0 1 1
## Urban 4,445 0.7 0.5 0 0 1 1
## Both Parents US Citizens 4,445 0.9 0.4 0 1 1 1
## two-parent household 4,445 0.7 0.5 0 0 1 1
## Parent years of schooling 4,445 8.4 4.3 0 6 12 18
## cat.meas.parent yrs of schooling 4,445 0.4 0.7 0 0 1 2
## pe1-pe3 4,445 0.7 0.5 0 0 1 1
## pe2 4,445 0.2 0.4 0 0 0 1
## pe3 4,445 0.1 0.3 0 0 0 1
## -------------------------------------------------------------------------------------
d4<-subset(d,coh==4)
stargazer(d4, type = "text", title="Descriptive statistics", digits=1, out="table1.txt",covariate.labels = c("id","Cohort","Outcome Variable","Treatment Variable","Birth Year","age","Black","South","Urban","Both Parents US Citizens","two-parent household","Parent years of schooling","cat.meas.parent yrs of schooling","pe1-pe3"))
##
## Descriptive statistics
## =====================================================================================
## Statistic N Mean St. Dev. Min Pctl(25) Pctl(75) Max
## -------------------------------------------------------------------------------------
## id 2,976 4,379.3 2,469.4 1 2,321.8 6,201.2 9,022
## Cohort 2,976 4.0 0.0 4 4 4 4
## Outcome Variable 2,976 14.2 3.0 6 12 16 20
## Treatment Variable 2,976 0.2 0.4 0 0 0 1
## Birth Year 2,976 1,982.6 1.5 1,979 1,981 1,984 1,985
## age 2,976 14.4 1.5 12 13 16 18
## Black 2,976 0.3 0.5 0 0 1 1
## South 2,976 0.4 0.5 0 0 1 1
## Urban 2,976 0.7 0.5 0 0 1 1
## Both Parents US Citizens 2,976 0.9 0.3 0 1 1 1
## two-parent household 2,976 0.5 0.5 0 0 1 1
## Parent years of schooling 2,976 13.4 3.2 0 12 16 20
## cat.meas.parent yrs of schooling 2,976 1.4 0.7 0 1 2 2
## pe1-pe3 2,976 0.1 0.3 0 0 0 1
## pe2 2,976 0.3 0.5 0 0 1 1
## pe3 2,976 0.5 0.5 0 0 1 1
## -------------------------------------------------------------------------------------
#3. Write a brief paragraph to accompany the table of descriptive statistics.
From the descriptive statistics, the control and treatment groups look different based on characteristics such as age, race, two parent household.
#For our analysis we will explore standard regression estimates of the ATE and propensity score estimates using IPW and matching of the ATE and ATT. Note: for the questions involving assessing balance, you can produce a table, create a figure, and/or provide a brief write-up describing the degree of covariate balance.
#(a) Specify a propensity score model and assess overlap and balance for the ATE and ATT for each cohort.
m.coh1.ps <- glm(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d2, family = binomial)
# calculate propensity scores and add to dataframe
coh1.ps.df <- data.frame(coh1.ps = predict(m.coh1.ps, type = "response"), tb = m.coh1.ps$model$tb)
# plot propensity scores by teen birth
labs <- c("teen birth", "no teen birth")
coh1.ps.df %>%
mutate(tb = ifelse(tb == 1, labs[1], labs[2])) %>%
ggplot(aes(x = coh1.ps)) +
geom_density(color = "red") +
facet_wrap(~tb) +
xlab("Probability of teen birth") +
theme_bw()
m.coh4.ps <- glm(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, family = binomial)
# calculate propensity scores and add to dataframe
coh4.ps.df <- data.frame(coh4.ps = predict(m.coh4.ps, type = "response"), tb = m.coh4.ps$model$tb)
# plot propensity scores by teen birth
labs <- c("teen birth", "no teen birth")
coh4.ps.df %>%
mutate(tb = ifelse(tb == 1, labs[1], labs[2])) %>%
ggplot(aes(x = coh4.ps)) +
geom_density(color = "red") +
facet_wrap(~tb) +
xlab("Probability of teen birth") +
theme_bw()
From the two plots above, it is quite clear that the treatment and control groups are not balanced. The plots indicate that there is no overlap for both cohorts.
#(b) Optional: you may notice an unusual pattern in some of the overlap plots. What do you think is the reason for this pattern?
This unusual pattern could be attributed to possible social desirability bias causing individuals who gave birth during their teens to hide their answer and potentially inflate the mean for the control group.
#(c) Once you are satisfied with the propensity score model, obtain estimates for the ATE and ATT using a standard IPW estimator and using one that adjusts for the propensity score covariates. Place the estimates in a table of results noted below.
coh1.ate.w <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, estimand = "ATE")
coh1.ate.w
## A weightit object
## - method: "ps" (propensity score weighting)
## - number of obs.: 7421
## - sampling weights: none
## - treatment: 2-category
## - estimand: ATE
## - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh1.ate.w)
## Call
## weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 +
## pe3, data = d1, estimand = "ATE")
##
## Balance Measures
## Type Diff.Adj
## prop.score Distance 0.0014
## age Contin. 0.0182
## blk Binary 0.0032
## sth Binary -0.0048
## urb Binary 0.0078
## ntp Binary -0.0064
## tph Binary -0.0076
## pe2 Binary -0.0009
## pe3 Binary -0.0004
##
## Effective sample sizes
## Control Treated
## Unadjusted 5382.000 2039.000
## Adjusted 5069.086 1355.444
# use weights to estimate ATE
coh1.ate.dw <- svydesign(ids = ~1, weights = coh1.ate.w$weights, data = d1)
coh1.ate <- svyglm(sch ~ tb, design = coh1.ate.dw)
summary(coh1.ate)
##
## Call:
## svyglm(formula = sch ~ tb, design = coh1.ate.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh1.ate.w$weights, data = d1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.85837 0.04424 290.7 <2e-16 ***
## tb -1.75614 0.08483 -20.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 8.130732)
##
## Number of Fisher Scoring iterations: 2
coh1.ate.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh1.ate.dw)
summary(coh1.ate.ra)
##
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph +
## pe2 + pe3, design = coh1.ate.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh1.ate.w$weights, data = d1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.721462 0.194262 70.634 < 2e-16 ***
## tb -1.737209 0.071184 -24.405 < 2e-16 ***
## age -0.073738 0.003856 -19.122 < 2e-16 ***
## blk -0.016086 0.084423 -0.191 0.849
## sth -0.344995 0.077982 -4.424 9.83e-06 ***
## urb 0.175453 0.073520 2.386 0.017 *
## ntp -0.045982 0.145654 -0.316 0.752
## tph 0.652728 0.077277 8.447 < 2e-16 ***
## pe2 1.103090 0.079227 13.923 < 2e-16 ***
## pe3 2.151981 0.103244 20.844 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 5.63069)
##
## Number of Fisher Scoring iterations: 2
coh1.att.w <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, estimand = "ATT")
coh1.att.w
## A weightit object
## - method: "ps" (propensity score weighting)
## - number of obs.: 7421
## - sampling weights: none
## - treatment: 2-category
## - estimand: ATT (focal: 1)
## - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh1.att.w)
## Call
## weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 +
## pe3, data = d1, estimand = "ATT")
##
## Balance Measures
## Type Diff.Adj
## prop.score Distance -0.0107
## age Contin. 0.0018
## blk Binary -0.0050
## sth Binary 0.0004
## urb Binary -0.0036
## ntp Binary 0.0001
## tph Binary 0.0046
## pe2 Binary -0.0008
## pe3 Binary 0.0007
##
## Effective sample sizes
## Control Treated
## Unadjusted 5382.000 2039
## Adjusted 2966.013 2039
coh1.att.dw <- svydesign(ids = ~1, weights = coh1.att.w$weights, data = d1)
coh1.att <- svyglm(sch ~ tb, design = coh1.att.dw)
summary(coh1.att)
##
## Call:
## svyglm(formula = sch ~ tb, design = coh1.att.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh1.att.w$weights, data = d1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.16072 0.05774 210.60 <2e-16 ***
## tb -1.61389 0.08095 -19.94 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 8.030271)
##
## Number of Fisher Scoring iterations: 2
coh1.att.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh1.att.dw)
summary(coh1.att.ra)
##
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph +
## pe2 + pe3, design = coh1.att.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh1.att.w$weights, data = d1)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.051352 0.195523 71.866 < 2e-16 ***
## tb -1.615266 0.069462 -23.254 < 2e-16 ***
## age -0.082714 0.003978 -20.791 < 2e-16 ***
## blk -0.130351 0.081414 -1.601 0.109402
## sth -0.432775 0.071966 -6.014 1.90e-09 ***
## urb 0.275407 0.077862 3.537 0.000407 ***
## ntp -0.098356 0.128917 -0.763 0.445524
## tph 0.608257 0.074981 8.112 5.77e-16 ***
## pe2 1.050739 0.086954 12.084 < 2e-16 ***
## pe3 2.064264 0.109037 18.932 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 5.78826)
##
## Number of Fisher Scoring iterations: 2
coh4.ate.w <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, estimand = "ATE")
coh4.ate.w
## A weightit object
## - method: "ps" (propensity score weighting)
## - number of obs.: 2976
## - sampling weights: none
## - treatment: 2-category
## - estimand: ATE
## - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh4.ate.w)
## Call
## weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 +
## pe3, data = d4, estimand = "ATE")
##
## Balance Measures
## Type Diff.Adj
## prop.score Distance 0.0012
## age Contin. 0.0361
## blk Binary 0.0095
## sth Binary 0.0033
## urb Binary 0.0126
## ntp Binary -0.0166
## tph Binary -0.0097
## pe2 Binary -0.0018
## pe3 Binary -0.0013
##
## Effective sample sizes
## Control Treated
## Unadjusted 2355.000 621.000
## Adjusted 2266.549 382.984
coh4.ate.dw <- svydesign(ids = ~1, weights = coh4.ate.w$weights, data = d4)
coh4.ate <- svyglm(sch ~ tb, design = coh4.ate.dw)
summary(coh4.ate)
##
## Call:
## svyglm(formula = sch ~ tb, design = coh4.ate.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh4.ate.w$weights, data = d4)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.62097 0.06243 234.18 <2e-16 ***
## tb -2.10082 0.15704 -13.38 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.956605)
##
## Number of Fisher Scoring iterations: 2
coh4.ate.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh4.ate.dw)
summary(coh4.ate.ra)
##
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph +
## pe2 + pe3, design = coh4.ate.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh4.ate.w$weights, data = d4)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.86770 0.82576 14.372 < 2e-16 ***
## tb -2.11140 0.14402 -14.660 < 2e-16 ***
## age 0.09565 0.05107 1.873 0.0612 .
## blk 0.65489 0.15496 4.226 2.45e-05 ***
## sth -0.03431 0.15920 -0.216 0.8294
## urb 0.02926 0.15012 0.195 0.8455
## ntp -0.69916 0.37512 -1.864 0.0624 .
## tph 0.85483 0.15150 5.642 1.84e-08 ***
## pe2 0.86737 0.16176 5.362 8.86e-08 ***
## pe3 2.09966 0.17254 12.169 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.949654)
##
## Number of Fisher Scoring iterations: 2
coh4.att.w <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, estimand = "ATt")
coh4.att.w
## A weightit object
## - method: "ps" (propensity score weighting)
## - number of obs.: 2976
## - sampling weights: none
## - treatment: 2-category
## - estimand: ATT (focal: 1)
## - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh4.att.w)
## Call
## weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 +
## pe3, data = d4, estimand = "ATt")
##
## Balance Measures
## Type Diff.Adj
## prop.score Distance -0.0395
## age Contin. -0.0145
## blk Binary -0.0123
## sth Binary -0.0089
## urb Binary 0.0037
## ntp Binary 0.0004
## tph Binary 0.0054
## pe2 Binary 0.0042
## pe3 Binary 0.0041
##
## Effective sample sizes
## Control Treated
## Unadjusted 2355.000 621
## Adjusted 1249.289 621
coh4.att.dw <- svydesign(ids = ~1, weights = coh4.att.w$weights, data = d4)
coh4.att <- svyglm(sch ~ tb, design = coh4.att.dw)
summary(coh4.att)
##
## Call:
## svyglm(formula = sch ~ tb, design = coh4.att.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh4.att.w$weights, data = d4)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.98895 0.08392 166.69 <2e-16 ***
## tb -1.84885 0.13413 -13.78 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 7.794384)
##
## Number of Fisher Scoring iterations: 2
coh4.att.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh4.att.dw)
summary(coh4.att.ra)
##
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph +
## pe2 + pe3, design = coh4.att.dw)
##
## Survey design:
## svydesign(ids = ~1, weights = coh4.att.w$weights, data = d4)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.64824 0.70111 18.040 < 2e-16 ***
## tb -1.85624 0.12593 -14.740 < 2e-16 ***
## age 0.04664 0.04189 1.113 0.26563
## blk 0.66771 0.13699 4.874 1.15e-06 ***
## sth -0.01313 0.13612 -0.096 0.92318
## urb 0.03392 0.14081 0.241 0.80968
## ntp -0.91379 0.29622 -3.085 0.00206 **
## tph 0.80270 0.14187 5.658 1.68e-08 ***
## pe2 0.86949 0.16494 5.271 1.45e-07 ***
## pe3 2.09652 0.17449 12.015 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 6.89065)
##
## Number of Fisher Scoring iterations: 2
#2. Matching estimators
#(a) Use the same propensity score model from the previous question. Implement a nearest neighbor matching algorithm with a caliper of 0.05 and a minimum of 1 neighbor and then 5 neighbors for both the ATE and ATT for each cohort. Assess balance for each and place the estimates in a table of results noted below
### Nearest neighbor matching min 1 for ATE for cohort 1
# run nearest neighbor matching algorithm
coh1.nn1.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "all",
method = "nearest", ratio = 1, caliper = 0.05, replace = T)
coh1.nn1.ate.d <- match.data(coh1.nn1.ate.m)
coh1.nn1.ate <- lm(sch ~ tb, data = coh1.nn1.ate.d)
summary(coh1.nn1.ate)
##
## Call:
## lm(formula = sch ~ tb, data = coh1.nn1.ate.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3482 -1.5476 -0.3482 1.4524 9.4524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.34820 0.07177 172.05 <2e-16 ***
## tb -1.80061 0.09459 -19.04 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.782 on 3538 degrees of freedom
## Multiple R-squared: 0.0929, Adjusted R-squared: 0.09265
## F-statistic: 362.4 on 1 and 3538 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn1.ate.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
### Nearest neighbor matching for ATT for cohort 1
coh1.nn1.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "treat",
method = "nearest", ratio = 1, caliper = 0.05, replace = T)
coh1.nn1.att.d <- match.data(coh1.nn1.att.m)
coh1.nn1.att <- lm(sch ~ tb, data = coh1.nn1.att.d)
summary(coh1.nn1.att)
##
## Call:
## lm(formula = sch ~ tb, data = coh1.nn1.att.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.3666 -1.5476 -0.3666 1.4524 9.4524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.36664 0.07236 170.90 <2e-16 ***
## tb -1.81905 0.09549 -19.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.813 on 3547 degrees of freedom
## Multiple R-squared: 0.09281, Adjusted R-squared: 0.09256
## F-statistic: 362.9 on 1 and 3547 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn1.att.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
### Nearest neighbor matching for ATT for cohort 1 min 5
coh1.nn5.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "treat",
method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh1.nn5.att.d <- match.data(coh1.nn5.att.m)
coh1.nn5.att <- lm(sch ~ tb, data = coh1.nn5.att.d)
summary(coh1.nn5.att)
##
## Call:
## lm(formula = sch ~ tb, data = coh1.nn5.att.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.6807 -1.5476 -0.6807 1.4524 9.4524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.68069 0.04709 269.27 <2e-16 ***
## tb -2.13309 0.07925 -26.92 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.877 on 5769 degrees of freedom
## Multiple R-squared: 0.1116, Adjusted R-squared: 0.1114
## F-statistic: 724.5 on 1 and 5769 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn5.att.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
### Nearest neighbor matching for ATE for cohort 4
coh4.nn1.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "all",
method = "nearest", ratio = 1, caliper = 0.05, replace = T)
coh4.nn1.ate.d <- match.data(coh4.nn1.ate.m)
coh4.nn1.ate <- lm(sch ~ tb, data = coh4.nn1.ate.d)
summary(coh4.nn1.ate)
##
## Call:
## lm(formula = sch ~ tb, data = coh4.nn1.ate.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2409 -2.1401 -0.1401 1.8599 7.8599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.2409 0.1241 114.72 <2e-16 ***
## tb -2.1008 0.1663 -12.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.759 on 1113 degrees of freedom
## Multiple R-squared: 0.1253, Adjusted R-squared: 0.1246
## F-statistic: 159.5 on 1 and 1113 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn1.ate.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
### Nearest neighbor matching min 5 for ATE for cohort 1
coh1.nn5.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "all",
method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh1.nn5.ate.d <- match.data(coh1.nn5.ate.m)
coh1.nn5.ate <- lm(sch ~ tb, data = coh1.nn5.ate.d)
summary(coh1.nn5.ate)
##
## Call:
## lm(formula = sch ~ tb, data = coh1.nn5.ate.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.713 -1.548 -0.713 1.452 9.452
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.71302 0.04695 270.80 <2e-16 ***
## tb -2.16543 0.07899 -27.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.868 on 5768 degrees of freedom
## Multiple R-squared: 0.1153, Adjusted R-squared: 0.1151
## F-statistic: 751.5 on 1 and 5768 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn5.ate.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
### Nearest neighbor matching min 5 for ATE for cohort 4
coh4.nn5.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "all",
method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh4.nn5.ate.d <- match.data(coh4.nn5.ate.m)
coh4.nn5.ate <- lm(sch ~ tb, data = coh4.nn5.ate.d)
summary(coh4.nn5.ate)
##
## Call:
## lm(formula = sch ~ tb, data = coh4.nn5.ate.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3505 -2.3505 -0.1401 1.8599 7.8599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.35047 0.07684 186.76 <2e-16 ***
## tb -2.21037 0.13796 -16.02 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.855 on 2000 degrees of freedom
## Multiple R-squared: 0.1137, Adjusted R-squared: 0.1133
## F-statistic: 256.7 on 1 and 2000 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn5.ate.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
### Nearest neighbor matching for ATT for cohort 4
coh4.nn1.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "treat",
method = "nearest", ratio = 1, caliper = 0.05, replace = T)
coh4.nn1.att.d <- match.data(coh4.nn1.att.m)
coh4.nn1.att <- lm(sch ~ tb, data = coh4.nn1.att.d)
summary(coh4.nn1.att)
##
## Call:
## lm(formula = sch ~ tb, data = coh4.nn1.att.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1400 -2.1400 -0.1401 1.8600 7.8599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.1400 0.1248 113.32 <2e-16 ***
## tb -1.9999 0.1676 -11.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.79 on 1119 degrees of freedom
## Multiple R-squared: 0.1128, Adjusted R-squared: 0.112
## F-statistic: 142.3 on 1 and 1119 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn1.att.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
### Nearest neighbor matching min 5 for ATT for cohort 4
coh4.nn5.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "all",
method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh4.nn5.att.d <- match.data(coh4.nn5.att.m)
coh4.nn5.att <- lm(sch ~ tb, data = coh4.nn5.att.d)
summary(coh4.nn5.att)
##
## Call:
## lm(formula = sch ~ tb, data = coh4.nn5.att.d)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.3839 -2.3839 -0.1401 1.8599 7.8599
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.3839 0.0764 188.27 <2e-16 ***
## tb -2.2438 0.1375 -16.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.849 on 2010 degrees of freedom
## Multiple R-squared: 0.117, Adjusted R-squared: 0.1165
## F-statistic: 266.2 on 1 and 2010 DF, p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn5.att.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot.
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.
#(b) Using the same propensity score model, implement a kernel matching algorithm for the ATE and ATT. Place the estimates in a table of results noted below
#3. Fit two standard regression models for each cohort. The first model should not include covariates and will represent a naive estimate of the ATE. The second model should include the same variables and functional form as the propensity score models.
##cohort1
reg1_ch1 = lm(d1$sch ~ d1$tb)
summary(reg1_ch1)
##
## Call:
## lm(formula = d1$sch ~ d1$tb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1234 -1.1234 -0.5468 1.8766 9.4532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.12337 0.04008 327.4 <2e-16 ***
## d1$tb -2.57654 0.07646 -33.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.94 on 7419 degrees of freedom
## Multiple R-squared: 0.1327, Adjusted R-squared: 0.1326
## F-statistic: 1136 on 1 and 7419 DF, p-value: < 2.2e-16
reg2_ch1 = lm(d1$sch ~ d1$tb+d1$age+d1$blk+d1$sth+d1$urb+d1$ntp+d1$tph+d1$pec)
summary(reg2_ch1)
##
## Call:
## lm(formula = d1$sch ~ d1$tb + d1$age + d1$blk + d1$sth + d1$urb +
## d1$ntp + d1$tph + d1$pec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6747 -1.5470 0.1564 1.4251 7.7719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.735171 0.149041 92.157 < 2e-16 ***
## d1$tb -1.665744 0.067103 -24.824 < 2e-16 ***
## d1$age -0.079737 0.003034 -26.278 < 2e-16 ***
## d1$blk -0.035736 0.072226 -0.495 0.62077
## d1$sth -0.328416 0.063947 -5.136 2.88e-07 ***
## d1$urb 0.171537 0.062812 2.731 0.00633 **
## d1$ntp 0.062339 0.091221 0.683 0.49438
## d1$tph 0.677783 0.062513 10.842 < 2e-16 ***
## d1$pec 1.156567 0.042014 27.528 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.421 on 7412 degrees of freedom
## Multiple R-squared: 0.4126, Adjusted R-squared: 0.4119
## F-statistic: 650.7 on 8 and 7412 DF, p-value: < 2.2e-16
#3. Fit two standard regression models for each cohort. The first model should not include covariates and will represent a naive estimate of the ATE. The second model should include the same variables and functional form as the propensity score models.
##cohort4
reg1_ch4 = lm(d4$sch ~ d4$tb)
summary(reg1_ch4)
##
## Call:
## lm(formula = d4$sch ~ d4$tb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.789 -2.140 0.211 2.211 7.860
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.78896 0.05873 251.8 <2e-16 ***
## d4$tb -2.64886 0.12856 -20.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.85 on 2974 degrees of freedom
## Multiple R-squared: 0.1249, Adjusted R-squared: 0.1246
## F-statistic: 424.5 on 1 and 2974 DF, p-value: < 2.2e-16
reg2_ch4 = lm(d4$sch ~ d4$tb+d4$age+d4$blk+d4$sth+d4$urb+d4$ntp+d4$tph+d4$pec)
summary(reg2_ch4)
##
## Call:
## lm(formula = d4$sch ~ d4$tb + d4$age + d4$blk + d4$sth + d4$urb +
## d4$ntp + d4$tph + d4$pec)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5113 -1.7074 0.0825 1.8029 7.8386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.39153 0.53079 23.345 < 2e-16 ***
## d4$tb -1.94522 0.12797 -15.200 < 2e-16 ***
## d4$age 0.03571 0.03297 1.083 0.279
## d4$blk 0.47067 0.11981 3.929 8.74e-05 ***
## d4$sth -0.06399 0.10587 -0.604 0.546
## d4$urb 0.03392 0.11023 0.308 0.758
## d4$ntp -0.46095 0.18348 -2.512 0.012 *
## d4$tph 0.91064 0.10745 8.475 < 2e-16 ***
## d4$pec 1.13581 0.07601 14.942 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.682 on 2967 degrees of freedom
## Multiple R-squared: 0.2269, Adjusted R-squared: 0.2248
## F-statistic: 108.9 on 8 and 2967 DF, p-value: < 2.2e-16
#4. Prepare a table with the estimates of the ATEs and ATTs from your analyses.
m1 <- lm(sch ~ tb, data = coh1.nn1.ate.d)
m2 <- lm(sch ~ tb, data = coh1.nn5.ate.d)
m3 <- lm(sch ~ tb, data = coh1.nn1.att.d)
m4 <- lm(sch ~ tb, data = coh1.nn5.att.d)
stargazer(m1,m2,m3,m4, type="text",
dep.var.labels=c(""),
covariate.labels=c("Treatment"), out="models.text")
##
## ===========================================================================================================================
## Dependent variable:
## -------------------------------------------------------------------------------------------------------
##
## (1) (2) (3) (4)
## ---------------------------------------------------------------------------------------------------------------------------
## Treatment -1.801*** -2.165*** -1.819*** -2.133***
## (0.095) (0.079) (0.095) (0.079)
##
## Constant 12.348*** 12.713*** 12.367*** 12.681***
## (0.072) (0.047) (0.072) (0.047)
##
## ---------------------------------------------------------------------------------------------------------------------------
## Observations 3,540 5,770 3,549 5,771
## R2 0.093 0.115 0.093 0.112
## Adjusted R2 0.093 0.115 0.093 0.111
## Residual Std. Error 2.782 (df = 3538) 2.868 (df = 5768) 2.813 (df = 3547) 2.877 (df = 5769)
## F Statistic 362.353*** (df = 1; 3538) 751.475*** (df = 1; 5768) 362.883*** (df = 1; 3547) 724.542*** (df = 1; 5769)
## ===========================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
m1 <- lm(sch ~ tb, data = coh4.nn1.ate.d)
m2 <- lm(sch ~ tb, data = coh4.nn5.ate.d)
m3 <- lm(sch ~ tb, data = coh4.nn1.att.d)
m4 <- lm(sch ~ tb, data = coh4.nn5.att.d)
stargazer(m1,m2,m3,m4, type="text",
dep.var.labels=c(""),
covariate.labels=c("Treatment"), out="models.text")
##
## ===========================================================================================================================
## Dependent variable:
## -------------------------------------------------------------------------------------------------------
##
## (1) (2) (3) (4)
## ---------------------------------------------------------------------------------------------------------------------------
## Treatment -2.101*** -2.210*** -2.000*** -2.244***
## (0.166) (0.138) (0.168) (0.138)
##
## Constant 14.241*** 14.350*** 14.140*** 14.384***
## (0.124) (0.077) (0.125) (0.076)
##
## ---------------------------------------------------------------------------------------------------------------------------
## Observations 1,115 2,002 1,121 2,012
## R2 0.125 0.114 0.113 0.117
## Adjusted R2 0.125 0.113 0.112 0.117
## Residual Std. Error 2.759 (df = 1113) 2.855 (df = 2000) 2.790 (df = 1119) 2.849 (df = 2010)
## F Statistic 159.500*** (df = 1; 1113) 256.687*** (df = 1; 2000) 142.310*** (df = 1; 1119) 266.220*** (df = 1; 2010)
## ===========================================================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
#5. Write a paragraph to accompany the table that (1) interprets and compares the estimates from the standard regression models to the propensity score models, (2) compares the estimates across the different propensity score models, and (3) assesses the presence of effect heterogeneity in each cohort and whether there’s evidence of differences in heterogeneity across cohorts.