rm(list = ls())

### Load useful packages
library(WeightIt)
library(survey)
library(cobalt)
library(MatchIt)
library(dplyr)
library(ggplot2)
library(psych)
library(PSW)
library(pastecs)
library(designmatch)
library(stargazer)
library(foreign)
library(Hmisc)
library(exactRankTests)
library(dplyr)
library(ggplot2)
library(optmatch)
library(foreign)
library(Hmisc)
library(xtable)
library(ipw)
library(CausalGAM)
d<-load("C:/Users/okuffuor/Downloads/asn-1-data.RData")
d<-d1
#1. Write the potential outcomes for this analysis.

D is a causal exposure indicator of child bearing (1 = treatment, 0 = control)

Y is observable outcome random variable of level of years of education. Therefore observed potential outcomes are [Y1 | D=1] and [Y0 | D=0] [Y1 | D=1] This model represents the number of years of education for individuals gave birth during their teens. [Y0 | D=0] This model also represents the number of years of education for individuals who did not give birth during their teens. However, what we do not observe is [Y1 | D=0] and [Y0 | D = 1]. Therefore the notations here are the converse of what is stated supra.

#2. Prepare a table of descriptive statistics that follows the conventions of your area of work in terms of which descriptives are included and what format the table takes.


d2<-subset(d,coh==1)

stargazer(d2, type = "text", title="Descriptive statistics", digits=1, out="table1.txt",covariate.labels = c("id","Cohort","Outcome Variable","Treatment Variable","Birth Year","age","Black","South","Urban","Both Parents US Citizens","two-parent household","Parent years of schooling","cat.meas.parent yrs of schooling","pe1-pe3"))
## 
## Descriptive statistics
## =====================================================================================
## Statistic                          N    Mean   St. Dev.  Min  Pctl(25) Pctl(75)  Max 
## -------------------------------------------------------------------------------------
## id                               4,445 2,519.6 1,458.8    1    1,262    3,779   5,083
## Cohort                           4,445   1.0     0.0      1      1        1       1  
## Outcome Variable                 4,445  11.2     2.6      6      10       12     18  
## Treatment Variable               4,445   0.3     0.5      0      0        1       1  
## Birth Year                       4,445 1,929.8   4.4    1,923  1,926    1,934   1,937
## age                              4,445  37.2     4.4     30      33       41     44  
## Black                            4,445   0.3     0.4      0      0        1       1  
## South                            4,445   0.4     0.5      0      0        1       1  
## Urban                            4,445   0.7     0.5      0      0        1       1  
## Both Parents US Citizens         4,445   0.9     0.4      0      1        1       1  
## two-parent household             4,445   0.7     0.5      0      0        1       1  
## Parent years of schooling        4,445   8.4     4.3      0      6        12     18  
## cat.meas.parent yrs of schooling 4,445   0.4     0.7      0      0        1       2  
## pe1-pe3                          4,445   0.7     0.5      0      0        1       1  
## pe2                              4,445   0.2     0.4      0      0        0       1  
## pe3                              4,445   0.1     0.3      0      0        0       1  
## -------------------------------------------------------------------------------------
d4<-subset(d,coh==4)
stargazer(d4, type = "text", title="Descriptive statistics", digits=1, out="table1.txt",covariate.labels = c("id","Cohort","Outcome Variable","Treatment Variable","Birth Year","age","Black","South","Urban","Both Parents US Citizens","two-parent household","Parent years of schooling","cat.meas.parent yrs of schooling","pe1-pe3"))
## 
## Descriptive statistics
## =====================================================================================
## Statistic                          N    Mean   St. Dev.  Min  Pctl(25) Pctl(75)  Max 
## -------------------------------------------------------------------------------------
## id                               2,976 4,379.3 2,469.4    1   2,321.8  6,201.2  9,022
## Cohort                           2,976   4.0     0.0      4      4        4       4  
## Outcome Variable                 2,976  14.2     3.0      6      12       16     20  
## Treatment Variable               2,976   0.2     0.4      0      0        0       1  
## Birth Year                       2,976 1,982.6   1.5    1,979  1,981    1,984   1,985
## age                              2,976  14.4     1.5     12      13       16     18  
## Black                            2,976   0.3     0.5      0      0        1       1  
## South                            2,976   0.4     0.5      0      0        1       1  
## Urban                            2,976   0.7     0.5      0      0        1       1  
## Both Parents US Citizens         2,976   0.9     0.3      0      1        1       1  
## two-parent household             2,976   0.5     0.5      0      0        1       1  
## Parent years of schooling        2,976  13.4     3.2      0      12       16     20  
## cat.meas.parent yrs of schooling 2,976   1.4     0.7      0      1        2       2  
## pe1-pe3                          2,976   0.1     0.3      0      0        0       1  
## pe2                              2,976   0.3     0.5      0      0        1       1  
## pe3                              2,976   0.5     0.5      0      0        1       1  
## -------------------------------------------------------------------------------------
#3. Write a brief paragraph to accompany the table of descriptive statistics.

From the descriptive statistics, the control and treatment groups look different based on characteristics such as age, race, two parent household.

#For our analysis we will explore standard regression estimates of the ATE and propensity score estimates using IPW and matching of the ATE and ATT. Note: for the questions involving assessing balance, you can produce a table, create a figure, and/or provide a brief write-up describing the degree of covariate balance.

#(a) Specify a propensity score model and assess overlap and balance for the ATE and ATT for each cohort.


m.coh1.ps  <- glm(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d2, family = binomial)
# calculate propensity scores and add to dataframe
coh1.ps.df <- data.frame(coh1.ps = predict(m.coh1.ps, type = "response"), tb = m.coh1.ps$model$tb)

# plot propensity scores by teen birth
labs <- c("teen birth", "no teen birth")
coh1.ps.df %>%
  mutate(tb = ifelse(tb == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = coh1.ps)) +
  geom_density(color = "red") +
  facet_wrap(~tb) +
  xlab("Probability of teen birth") +
  theme_bw()

m.coh4.ps  <- glm(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, family = binomial)

# calculate propensity scores and add to dataframe
coh4.ps.df <- data.frame(coh4.ps = predict(m.coh4.ps, type = "response"), tb = m.coh4.ps$model$tb)

# plot propensity scores by teen birth
labs <- c("teen birth", "no teen birth")
coh4.ps.df %>%
  mutate(tb = ifelse(tb == 1, labs[1], labs[2])) %>%
  ggplot(aes(x = coh4.ps)) +
  geom_density(color = "red") +
  facet_wrap(~tb) +
  xlab("Probability of teen birth") +
  theme_bw()

From the two plots above, it is quite clear that the treatment and control groups are not balanced. The plots indicate that there is no overlap for both cohorts.

#(b) Optional: you may notice an unusual pattern in some of the overlap plots. What do you think is the reason for this pattern?

This unusual pattern could be attributed to possible social desirability bias causing individuals who gave birth during their teens to hide their answer and potentially inflate the mean for the control group.

#(c) Once you are satisfied with the propensity score model, obtain estimates for the ATE and ATT using a standard IPW estimator and using one that adjusts for the propensity score covariates. Place the estimates in a table of results noted below.

coh1.ate.w  <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, estimand = "ATE")
coh1.ate.w
## A weightit object
##  - method: "ps" (propensity score weighting)
##  - number of obs.: 7421
##  - sampling weights: none
##  - treatment: 2-category
##  - estimand: ATE
##  - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh1.ate.w)
## Call
##  weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 + 
##     pe3, data = d1, estimand = "ATE")
## 
## Balance Measures
##                Type Diff.Adj
## prop.score Distance   0.0014
## age         Contin.   0.0182
## blk          Binary   0.0032
## sth          Binary  -0.0048
## urb          Binary   0.0078
## ntp          Binary  -0.0064
## tph          Binary  -0.0076
## pe2          Binary  -0.0009
## pe3          Binary  -0.0004
## 
## Effective sample sizes
##             Control  Treated
## Unadjusted 5382.000 2039.000
## Adjusted   5069.086 1355.444
# use weights to estimate ATE
coh1.ate.dw <- svydesign(ids = ~1, weights = coh1.ate.w$weights, data = d1)
coh1.ate <- svyglm(sch ~ tb, design = coh1.ate.dw)
summary(coh1.ate)
## 
## Call:
## svyglm(formula = sch ~ tb, design = coh1.ate.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh1.ate.w$weights, data = d1)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.85837    0.04424   290.7   <2e-16 ***
## tb          -1.75614    0.08483   -20.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8.130732)
## 
## Number of Fisher Scoring iterations: 2
coh1.ate.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh1.ate.dw)
summary(coh1.ate.ra)
## 
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph + 
##     pe2 + pe3, design = coh1.ate.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh1.ate.w$weights, data = d1)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.721462   0.194262  70.634  < 2e-16 ***
## tb          -1.737209   0.071184 -24.405  < 2e-16 ***
## age         -0.073738   0.003856 -19.122  < 2e-16 ***
## blk         -0.016086   0.084423  -0.191    0.849    
## sth         -0.344995   0.077982  -4.424 9.83e-06 ***
## urb          0.175453   0.073520   2.386    0.017 *  
## ntp         -0.045982   0.145654  -0.316    0.752    
## tph          0.652728   0.077277   8.447  < 2e-16 ***
## pe2          1.103090   0.079227  13.923  < 2e-16 ***
## pe3          2.151981   0.103244  20.844  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5.63069)
## 
## Number of Fisher Scoring iterations: 2
coh1.att.w  <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, estimand = "ATT")
coh1.att.w
## A weightit object
##  - method: "ps" (propensity score weighting)
##  - number of obs.: 7421
##  - sampling weights: none
##  - treatment: 2-category
##  - estimand: ATT (focal: 1)
##  - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh1.att.w)
## Call
##  weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 + 
##     pe3, data = d1, estimand = "ATT")
## 
## Balance Measures
##                Type Diff.Adj
## prop.score Distance  -0.0107
## age         Contin.   0.0018
## blk          Binary  -0.0050
## sth          Binary   0.0004
## urb          Binary  -0.0036
## ntp          Binary   0.0001
## tph          Binary   0.0046
## pe2          Binary  -0.0008
## pe3          Binary   0.0007
## 
## Effective sample sizes
##             Control Treated
## Unadjusted 5382.000    2039
## Adjusted   2966.013    2039
coh1.att.dw <- svydesign(ids = ~1, weights = coh1.att.w$weights, data = d1)
coh1.att <- svyglm(sch ~ tb, design = coh1.att.dw)
summary(coh1.att)
## 
## Call:
## svyglm(formula = sch ~ tb, design = coh1.att.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh1.att.w$weights, data = d1)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.16072    0.05774  210.60   <2e-16 ***
## tb          -1.61389    0.08095  -19.94   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 8.030271)
## 
## Number of Fisher Scoring iterations: 2
coh1.att.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh1.att.dw)
summary(coh1.att.ra)
## 
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph + 
##     pe2 + pe3, design = coh1.att.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh1.att.w$weights, data = d1)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.051352   0.195523  71.866  < 2e-16 ***
## tb          -1.615266   0.069462 -23.254  < 2e-16 ***
## age         -0.082714   0.003978 -20.791  < 2e-16 ***
## blk         -0.130351   0.081414  -1.601 0.109402    
## sth         -0.432775   0.071966  -6.014 1.90e-09 ***
## urb          0.275407   0.077862   3.537 0.000407 ***
## ntp         -0.098356   0.128917  -0.763 0.445524    
## tph          0.608257   0.074981   8.112 5.77e-16 ***
## pe2          1.050739   0.086954  12.084  < 2e-16 ***
## pe3          2.064264   0.109037  18.932  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 5.78826)
## 
## Number of Fisher Scoring iterations: 2
coh4.ate.w  <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, estimand = "ATE")
coh4.ate.w
## A weightit object
##  - method: "ps" (propensity score weighting)
##  - number of obs.: 2976
##  - sampling weights: none
##  - treatment: 2-category
##  - estimand: ATE
##  - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh4.ate.w)
## Call
##  weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 + 
##     pe3, data = d4, estimand = "ATE")
## 
## Balance Measures
##                Type Diff.Adj
## prop.score Distance   0.0012
## age         Contin.   0.0361
## blk          Binary   0.0095
## sth          Binary   0.0033
## urb          Binary   0.0126
## ntp          Binary  -0.0166
## tph          Binary  -0.0097
## pe2          Binary  -0.0018
## pe3          Binary  -0.0013
## 
## Effective sample sizes
##             Control Treated
## Unadjusted 2355.000 621.000
## Adjusted   2266.549 382.984
coh4.ate.dw <- svydesign(ids = ~1, weights = coh4.ate.w$weights, data = d4)
coh4.ate <- svyglm(sch ~ tb, design = coh4.ate.dw)
summary(coh4.ate)
## 
## Call:
## svyglm(formula = sch ~ tb, design = coh4.ate.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh4.ate.w$weights, data = d4)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.62097    0.06243  234.18   <2e-16 ***
## tb          -2.10082    0.15704  -13.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7.956605)
## 
## Number of Fisher Scoring iterations: 2
coh4.ate.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh4.ate.dw)
summary(coh4.ate.ra)
## 
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph + 
##     pe2 + pe3, design = coh4.ate.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh4.ate.w$weights, data = d4)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.86770    0.82576  14.372  < 2e-16 ***
## tb          -2.11140    0.14402 -14.660  < 2e-16 ***
## age          0.09565    0.05107   1.873   0.0612 .  
## blk          0.65489    0.15496   4.226 2.45e-05 ***
## sth         -0.03431    0.15920  -0.216   0.8294    
## urb          0.02926    0.15012   0.195   0.8455    
## ntp         -0.69916    0.37512  -1.864   0.0624 .  
## tph          0.85483    0.15150   5.642 1.84e-08 ***
## pe2          0.86737    0.16176   5.362 8.86e-08 ***
## pe3          2.09966    0.17254  12.169  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 6.949654)
## 
## Number of Fisher Scoring iterations: 2
coh4.att.w  <- weightit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, estimand = "ATt")
coh4.att.w
## A weightit object
##  - method: "ps" (propensity score weighting)
##  - number of obs.: 2976
##  - sampling weights: none
##  - treatment: 2-category
##  - estimand: ATT (focal: 1)
##  - covariates: age, blk, sth, urb, ntp, tph, pe2, pe3
bal.tab(coh4.att.w)
## Call
##  weightit(formula = tb ~ age + blk + sth + urb + ntp + tph + pe2 + 
##     pe3, data = d4, estimand = "ATt")
## 
## Balance Measures
##                Type Diff.Adj
## prop.score Distance  -0.0395
## age         Contin.  -0.0145
## blk          Binary  -0.0123
## sth          Binary  -0.0089
## urb          Binary   0.0037
## ntp          Binary   0.0004
## tph          Binary   0.0054
## pe2          Binary   0.0042
## pe3          Binary   0.0041
## 
## Effective sample sizes
##             Control Treated
## Unadjusted 2355.000     621
## Adjusted   1249.289     621
coh4.att.dw <- svydesign(ids = ~1, weights = coh4.att.w$weights, data = d4)
coh4.att <- svyglm(sch ~ tb, design = coh4.att.dw)
summary(coh4.att)
## 
## Call:
## svyglm(formula = sch ~ tb, design = coh4.att.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh4.att.w$weights, data = d4)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.98895    0.08392  166.69   <2e-16 ***
## tb          -1.84885    0.13413  -13.78   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 7.794384)
## 
## Number of Fisher Scoring iterations: 2
coh4.att.ra <- svyglm(sch ~ tb + age + blk + sth + urb + ntp + tph + pe2 + pe3, design = coh4.att.dw)
summary(coh4.att.ra)
## 
## Call:
## svyglm(formula = sch ~ tb + age + blk + sth + urb + ntp + tph + 
##     pe2 + pe3, design = coh4.att.dw)
## 
## Survey design:
## svydesign(ids = ~1, weights = coh4.att.w$weights, data = d4)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.64824    0.70111  18.040  < 2e-16 ***
## tb          -1.85624    0.12593 -14.740  < 2e-16 ***
## age          0.04664    0.04189   1.113  0.26563    
## blk          0.66771    0.13699   4.874 1.15e-06 ***
## sth         -0.01313    0.13612  -0.096  0.92318    
## urb          0.03392    0.14081   0.241  0.80968    
## ntp         -0.91379    0.29622  -3.085  0.00206 ** 
## tph          0.80270    0.14187   5.658 1.68e-08 ***
## pe2          0.86949    0.16494   5.271 1.45e-07 ***
## pe3          2.09652    0.17449  12.015  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 6.89065)
## 
## Number of Fisher Scoring iterations: 2
#2. Matching estimators
#(a) Use the same propensity score model from the previous question. Implement a nearest neighbor matching algorithm with a caliper of 0.05 and a minimum of 1 neighbor and then 5 neighbors for both the ATE and ATT for each cohort. Assess balance for each and place the estimates in a table of results noted below

### Nearest neighbor matching min 1 for ATE for cohort 1
# run nearest neighbor matching algorithm
coh1.nn1.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "all",
                        method = "nearest", ratio = 1, caliper = 0.05, replace = T)

coh1.nn1.ate.d <- match.data(coh1.nn1.ate.m)

coh1.nn1.ate <- lm(sch ~ tb, data = coh1.nn1.ate.d)
summary(coh1.nn1.ate)
## 
## Call:
## lm(formula = sch ~ tb, data = coh1.nn1.ate.d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3482 -1.5476 -0.3482  1.4524  9.4524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.34820    0.07177  172.05   <2e-16 ***
## tb          -1.80061    0.09459  -19.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.782 on 3538 degrees of freedom
## Multiple R-squared:  0.0929, Adjusted R-squared:  0.09265 
## F-statistic: 362.4 on 1 and 3538 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn1.ate.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

### Nearest neighbor matching for ATT for cohort 1
coh1.nn1.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "treat",
                          method = "nearest", ratio = 1, caliper = 0.05, replace = T)
coh1.nn1.att.d <- match.data(coh1.nn1.att.m)
coh1.nn1.att <- lm(sch ~ tb, data = coh1.nn1.att.d)
summary(coh1.nn1.att)
## 
## Call:
## lm(formula = sch ~ tb, data = coh1.nn1.att.d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3666 -1.5476 -0.3666  1.4524  9.4524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.36664    0.07236  170.90   <2e-16 ***
## tb          -1.81905    0.09549  -19.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.813 on 3547 degrees of freedom
## Multiple R-squared:  0.09281,    Adjusted R-squared:  0.09256 
## F-statistic: 362.9 on 1 and 3547 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn1.att.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

### Nearest neighbor matching for ATT for cohort 1 min 5

coh1.nn5.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "treat",
                          method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh1.nn5.att.d <- match.data(coh1.nn5.att.m)
coh1.nn5.att <- lm(sch ~ tb, data = coh1.nn5.att.d)
summary(coh1.nn5.att)
## 
## Call:
## lm(formula = sch ~ tb, data = coh1.nn5.att.d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.6807 -1.5476 -0.6807  1.4524  9.4524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.68069    0.04709  269.27   <2e-16 ***
## tb          -2.13309    0.07925  -26.92   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.877 on 5769 degrees of freedom
## Multiple R-squared:  0.1116, Adjusted R-squared:  0.1114 
## F-statistic: 724.5 on 1 and 5769 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn5.att.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

### Nearest neighbor matching for ATE for cohort 4
coh4.nn1.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "all",
                          method = "nearest", ratio = 1, caliper = 0.05, replace = T)
coh4.nn1.ate.d <- match.data(coh4.nn1.ate.m)
coh4.nn1.ate <- lm(sch ~ tb, data = coh4.nn1.ate.d)
summary(coh4.nn1.ate)
## 
## Call:
## lm(formula = sch ~ tb, data = coh4.nn1.ate.d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2409 -2.1401 -0.1401  1.8599  7.8599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.2409     0.1241  114.72   <2e-16 ***
## tb           -2.1008     0.1663  -12.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.759 on 1113 degrees of freedom
## Multiple R-squared:  0.1253, Adjusted R-squared:  0.1246 
## F-statistic: 159.5 on 1 and 1113 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn1.ate.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

### Nearest neighbor matching min 5 for ATE for cohort 1
coh1.nn5.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d1, sub.by = "all",
                          method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh1.nn5.ate.d <- match.data(coh1.nn5.ate.m)
coh1.nn5.ate <- lm(sch ~ tb, data = coh1.nn5.ate.d)
summary(coh1.nn5.ate)
## 
## Call:
## lm(formula = sch ~ tb, data = coh1.nn5.ate.d)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -6.713 -1.548 -0.713  1.452  9.452 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.71302    0.04695  270.80   <2e-16 ***
## tb          -2.16543    0.07899  -27.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.868 on 5768 degrees of freedom
## Multiple R-squared:  0.1153, Adjusted R-squared:  0.1151 
## F-statistic: 751.5 on 1 and 5768 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh1.nn5.ate.m), threshold = .1)
## Note: s.d.denom not specified; assuming pooled.
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

### Nearest neighbor matching min 5 for ATE for cohort 4
coh4.nn5.ate.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "all",
                          method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh4.nn5.ate.d <- match.data(coh4.nn5.ate.m)
coh4.nn5.ate <- lm(sch ~ tb, data = coh4.nn5.ate.d)
summary(coh4.nn5.ate)
## 
## Call:
## lm(formula = sch ~ tb, data = coh4.nn5.ate.d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3505 -2.3505 -0.1401  1.8599  7.8599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.35047    0.07684  186.76   <2e-16 ***
## tb          -2.21037    0.13796  -16.02   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.855 on 2000 degrees of freedom
## Multiple R-squared:  0.1137, Adjusted R-squared:  0.1133 
## F-statistic: 256.7 on 1 and 2000 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn5.ate.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

### Nearest neighbor matching for ATT for cohort 4
coh4.nn1.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "treat",
                          method = "nearest", ratio = 1, caliper = 0.05, replace = T)
coh4.nn1.att.d <- match.data(coh4.nn1.att.m)
coh4.nn1.att <- lm(sch ~ tb, data = coh4.nn1.att.d)
summary(coh4.nn1.att)
## 
## Call:
## lm(formula = sch ~ tb, data = coh4.nn1.att.d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1400 -2.1400 -0.1401  1.8600  7.8599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.1400     0.1248  113.32   <2e-16 ***
## tb           -1.9999     0.1676  -11.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.79 on 1119 degrees of freedom
## Multiple R-squared:  0.1128, Adjusted R-squared:  0.112 
## F-statistic: 142.3 on 1 and 1119 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn1.att.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

### Nearest neighbor matching min 5 for ATT for cohort 4
coh4.nn5.att.m <- matchit(tb ~ age + blk + sth + urb + ntp + tph + pe2 + pe3, data = d4, sub.by = "all",
                          method = "nearest", ratio = 5, caliper = 0.05, replace = T)
coh4.nn5.att.d <- match.data(coh4.nn5.att.m)
coh4.nn5.att <- lm(sch ~ tb, data = coh4.nn5.att.d)
summary(coh4.nn5.att)
## 
## Call:
## lm(formula = sch ~ tb, data = coh4.nn5.att.d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3839 -2.3839 -0.1401  1.8599  7.8599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.3839     0.0764  188.27   <2e-16 ***
## tb           -2.2438     0.1375  -16.32   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.849 on 2010 degrees of freedom
## Multiple R-squared:  0.117,  Adjusted R-squared:  0.1165 
## F-statistic: 266.2 on 1 and 2010 DF,  p-value: < 2.2e-16
love.plot(bal.tab(coh4.nn5.att.m), threshold = .1)
## Warning: Standardized mean differences and raw mean differences are present in the same plot. 
## Use the 'stars' argument to distinguish between them and appropriately label the x-axis.

#(b) Using the same propensity score model, implement a kernel matching algorithm for the ATE and ATT. Place the estimates in a table of results noted below
#3. Fit two standard regression models for each cohort. The first model should not include covariates and will represent a naive estimate of the ATE. The second model should include the same variables and functional form as the propensity score models.

##cohort1
reg1_ch1 = lm(d1$sch ~ d1$tb)
summary(reg1_ch1)
## 
## Call:
## lm(formula = d1$sch ~ d1$tb)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1234 -1.1234 -0.5468  1.8766  9.4532 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.12337    0.04008   327.4   <2e-16 ***
## d1$tb       -2.57654    0.07646   -33.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.94 on 7419 degrees of freedom
## Multiple R-squared:  0.1327, Adjusted R-squared:  0.1326 
## F-statistic:  1136 on 1 and 7419 DF,  p-value: < 2.2e-16
reg2_ch1 = lm(d1$sch ~ d1$tb+d1$age+d1$blk+d1$sth+d1$urb+d1$ntp+d1$tph+d1$pec)
summary(reg2_ch1)
## 
## Call:
## lm(formula = d1$sch ~ d1$tb + d1$age + d1$blk + d1$sth + d1$urb + 
##     d1$ntp + d1$tph + d1$pec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6747 -1.5470  0.1564  1.4251  7.7719 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.735171   0.149041  92.157  < 2e-16 ***
## d1$tb       -1.665744   0.067103 -24.824  < 2e-16 ***
## d1$age      -0.079737   0.003034 -26.278  < 2e-16 ***
## d1$blk      -0.035736   0.072226  -0.495  0.62077    
## d1$sth      -0.328416   0.063947  -5.136 2.88e-07 ***
## d1$urb       0.171537   0.062812   2.731  0.00633 ** 
## d1$ntp       0.062339   0.091221   0.683  0.49438    
## d1$tph       0.677783   0.062513  10.842  < 2e-16 ***
## d1$pec       1.156567   0.042014  27.528  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.421 on 7412 degrees of freedom
## Multiple R-squared:  0.4126, Adjusted R-squared:  0.4119 
## F-statistic: 650.7 on 8 and 7412 DF,  p-value: < 2.2e-16
#3. Fit two standard regression models for each cohort. The first model should not include covariates and will represent a naive estimate of the ATE. The second model should include the same variables and functional form as the propensity score models.
##cohort4

reg1_ch4 = lm(d4$sch ~ d4$tb)
summary(reg1_ch4)
## 
## Call:
## lm(formula = d4$sch ~ d4$tb)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.789 -2.140  0.211  2.211  7.860 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.78896    0.05873   251.8   <2e-16 ***
## d4$tb       -2.64886    0.12856   -20.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.85 on 2974 degrees of freedom
## Multiple R-squared:  0.1249, Adjusted R-squared:  0.1246 
## F-statistic: 424.5 on 1 and 2974 DF,  p-value: < 2.2e-16
reg2_ch4 = lm(d4$sch ~ d4$tb+d4$age+d4$blk+d4$sth+d4$urb+d4$ntp+d4$tph+d4$pec)
summary(reg2_ch4)
## 
## Call:
## lm(formula = d4$sch ~ d4$tb + d4$age + d4$blk + d4$sth + d4$urb + 
##     d4$ntp + d4$tph + d4$pec)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5113 -1.7074  0.0825  1.8029  7.8386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.39153    0.53079  23.345  < 2e-16 ***
## d4$tb       -1.94522    0.12797 -15.200  < 2e-16 ***
## d4$age       0.03571    0.03297   1.083    0.279    
## d4$blk       0.47067    0.11981   3.929 8.74e-05 ***
## d4$sth      -0.06399    0.10587  -0.604    0.546    
## d4$urb       0.03392    0.11023   0.308    0.758    
## d4$ntp      -0.46095    0.18348  -2.512    0.012 *  
## d4$tph       0.91064    0.10745   8.475  < 2e-16 ***
## d4$pec       1.13581    0.07601  14.942  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.682 on 2967 degrees of freedom
## Multiple R-squared:  0.2269, Adjusted R-squared:  0.2248 
## F-statistic: 108.9 on 8 and 2967 DF,  p-value: < 2.2e-16
#4. Prepare a table with the estimates of the ATEs and ATTs from your analyses.
m1 <- lm(sch ~ tb, data = coh1.nn1.ate.d)
m2 <- lm(sch ~ tb, data = coh1.nn5.ate.d)
m3 <- lm(sch ~ tb, data = coh1.nn1.att.d)
m4 <- lm(sch ~ tb, data = coh1.nn5.att.d)
stargazer(m1,m2,m3,m4, type="text",
 dep.var.labels=c(""),
 covariate.labels=c("Treatment"), out="models.text")
## 
## ===========================================================================================================================
##                                                               Dependent variable:                                          
##                     -------------------------------------------------------------------------------------------------------
##                                                                                                                            
##                                (1)                       (2)                       (3)                       (4)           
## ---------------------------------------------------------------------------------------------------------------------------
## Treatment                   -1.801***                 -2.165***                 -1.819***                 -2.133***        
##                              (0.095)                   (0.079)                   (0.095)                   (0.079)         
##                                                                                                                            
## Constant                    12.348***                 12.713***                 12.367***                 12.681***        
##                              (0.072)                   (0.047)                   (0.072)                   (0.047)         
##                                                                                                                            
## ---------------------------------------------------------------------------------------------------------------------------
## Observations                  3,540                     5,770                     3,549                     5,771          
## R2                            0.093                     0.115                     0.093                     0.112          
## Adjusted R2                   0.093                     0.115                     0.093                     0.111          
## Residual Std. Error     2.782 (df = 3538)         2.868 (df = 5768)         2.813 (df = 3547)         2.877 (df = 5769)    
## F Statistic         362.353*** (df = 1; 3538) 751.475*** (df = 1; 5768) 362.883*** (df = 1; 3547) 724.542*** (df = 1; 5769)
## ===========================================================================================================================
## Note:                                                                                           *p<0.1; **p<0.05; ***p<0.01
m1 <- lm(sch ~ tb, data = coh4.nn1.ate.d)
m2 <- lm(sch ~ tb, data = coh4.nn5.ate.d)
m3 <- lm(sch ~ tb, data = coh4.nn1.att.d)
m4 <- lm(sch ~ tb, data = coh4.nn5.att.d)
stargazer(m1,m2,m3,m4, type="text",
 dep.var.labels=c(""),
 covariate.labels=c("Treatment"), out="models.text")
## 
## ===========================================================================================================================
##                                                               Dependent variable:                                          
##                     -------------------------------------------------------------------------------------------------------
##                                                                                                                            
##                                (1)                       (2)                       (3)                       (4)           
## ---------------------------------------------------------------------------------------------------------------------------
## Treatment                   -2.101***                 -2.210***                 -2.000***                 -2.244***        
##                              (0.166)                   (0.138)                   (0.168)                   (0.138)         
##                                                                                                                            
## Constant                    14.241***                 14.350***                 14.140***                 14.384***        
##                              (0.124)                   (0.077)                   (0.125)                   (0.076)         
##                                                                                                                            
## ---------------------------------------------------------------------------------------------------------------------------
## Observations                  1,115                     2,002                     1,121                     2,012          
## R2                            0.125                     0.114                     0.113                     0.117          
## Adjusted R2                   0.125                     0.113                     0.112                     0.117          
## Residual Std. Error     2.759 (df = 1113)         2.855 (df = 2000)         2.790 (df = 1119)         2.849 (df = 2010)    
## F Statistic         159.500*** (df = 1; 1113) 256.687*** (df = 1; 2000) 142.310*** (df = 1; 1119) 266.220*** (df = 1; 2010)
## ===========================================================================================================================
## Note:                                                                                           *p<0.1; **p<0.05; ***p<0.01
#5. Write a paragraph to accompany the table that (1) interprets and compares the estimates from the standard regression models to the propensity score models, (2) compares the estimates across the different propensity score models, and (3) assesses the presence of effect heterogeneity in each cohort and whether there’s evidence of differences in heterogeneity across cohorts.