Part III Analysis Implementation
#read in data
library(haven)
EITC <- read_dta("C:/Users/rlutt/Downloads/eitcRR-1.dta")
For annual family income and annual earnings, the summary table reports the median with the first and third quartiles in parentheses. The median age is 34 (first quartile 26, third quartile 44). Education is measured in years; the sample median is 10 years and the maximum is 11, which means that none of the women in this sample finished high school. 51% of the sample was employed in the year prior to the survey. Lastly, unearned income is negligible: its maximum value is just under $135.
#inspect, describe and summarize the data
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
EITC2<-EITC%>%
select(year, children, nonwhite, finc, earn, age, ed, work, unearn)
library(gtsummary)
## #BlackLivesMatter
table1 <-
EITC2 %>%
tbl_summary()
table1
Characteristic | N = 13,746 ¹ |
---|---|
Year [taxyear] | |
1991 | 2,610 (19%) |
1992 | 2,449 (18%) |
1993 | 2,342 (17%) |
1994 | 2,255 (16%) |
1995 | 2,085 (15%) |
1996 | 2,005 (15%) |
Number of Children | 1.00 (0.00, 2.00) |
Dummy=1 if Hispanic/Black | 8,257 (60%) |
Annual Family Income (97$) | 9,637 (5,123, 18,659) |
Annual earnings (97$) | 3,332 (0, 14,321) |
Age of woman | 34 (26, 44) |
Years of education | |
0 | 452 (3.3%) |
3 | 714 (5.2%) |
7 | 3,145 (23%) |
9 | 2,022 (15%) |
10 | 2,850 (21%) |
11 | 4,563 (33%) |
Dummy =1 if Employed last year | 7,052 (51%) |
Unearned Income (97$) | 3.0 (0.0, 6.9) |
¹ n (%); Median (IQR)
summary(EITC$children)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 1.193 2.000 9.000
summary(EITC$finc)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 5123 9637 15255 18659 575617
#2. Calculate the sample means of all variables for (a) single women with no children, (b) single women with 1 child, and (c) single women with 2+ children.
#Note: I do not report sample means for year or state, since these are the levels within which the observations are nested. I also omit the number of children, since the means are grouped by that variable.
#a.)
library(dplyr)
nochildren<-filter(EITC, children==0)
mean(nochildren$finc)
## [1] 18559.86
mean(nochildren$earn)
## [1] 13760.26
mean(nochildren$age)
## [1] 38.49823
mean(nochildren$nonwhite)
## [1] 0.515944
mean(nochildren$ed)
## [1] 8.548676
mean(nochildren$work)
## [1] 0.5744896
mean(nochildren$unearn)
## [1] 4.799607
#b.)
onechild<-filter(EITC, children==1)
mean(onechild$finc)
## [1] 13941.57
mean(onechild$earn)
## [1] 9928.279
mean(onechild$age)
## [1] 33.75899
mean(onechild$nonwhite)
## [1] 0.5964683
mean(onechild$ed)
## [1] 8.992479
mean(onechild$work)
## [1] 0.5376063
mean(onechild$unearn)
## [1] 4.013291
#c.)
morechildren<-filter(EITC, children>=2)
mean(morechildren$finc)
## [1] 11985.3
mean(morechildren$earn)
## [1] 6613.547
mean(morechildren$age)
## [1] 32.04747
mean(morechildren$nonwhite)
## [1] 0.7088847
mean(morechildren$ed)
## [1] 9.006721
mean(morechildren$work)
## [1] 0.4207099
mean(morechildren$unearn)
## [1] 5.371749
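The per-variable `mean()` calls above can also be collapsed into a single grouped summary. The following is a minimal sketch in base R, not the code that produced the reported numbers. The data frame `toy` and the helper column `kids_group` are hypothetical stand-ins for `EITC`, with made-up values, so that the block runs on its own.

```r
# Hypothetical toy data standing in for EITC (values are made up for illustration)
toy <- data.frame(
  children = c(0, 0, 1, 1, 2, 3),
  finc     = c(100, 200, 300, 500, 50, 150),
  work     = c(1, 0, 1, 1, 0, 1)
)

# Assumption: three groups as in parts (a)-(c): 0, 1, and 2+ children
toy$kids_group <- cut(toy$children,
                      breaks = c(-Inf, 0, 1, Inf),
                      labels = c("0", "1", "2+"))

# One row per group, one column per variable mean
group_means <- aggregate(cbind(finc, work) ~ kids_group, data = toy, FUN = mean)
print(group_means)
```

Applied to the real data, the same pattern would list every variable of interest inside `cbind()`, which keeps the three groups side by side and avoids repeating the filter step.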
#Earnings are reported as zero for women who are not employed. Create a new variable with earnings conditional on working (earnings of the non-employed are set to NA) and calculate the mean of the new earnings variable by group (i.e., a. single women with no children, b. single women with 1 child, and c. single women with 2+ children).
EITC3<-EITC %>% mutate(newearn= case_when(work==0 ~ NA_real_, TRUE ~ as.numeric(earn)))
#a.)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ forcats 1.0.0 ✔ readr 2.1.4
## ✔ ggplot2 3.4.2 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
ncempl<-filter(EITC3, children==0)
#drop only rows where newearn is NA, so that missing values in other columns do not remove observations
ncemploy<-drop_na(ncempl, newearn)
mean(ncemploy$newearn)
## [1] 19838.93
#b.)
ocempl<-filter(EITC3, children==1)
ocemploy<-drop_na(ocempl, newearn)
mean(ocemploy$newearn)
## [1] 14963.35
#c.)
tcempl<-filter(EITC3, children>=2)
tcemploy<-drop_na(tcempl, newearn)
mean(tcemploy$newearn)
## [1] 11961.44
#prep data for DiD by creating treatment and timing variables
#treatment is whether a woman has children or not
EITC3$treated <-ifelse(EITC3$children> 0, 1, 0)
#the expansion is coded as taking effect in 1993, so post = 1 from 1993 onward
EITC3<-EITC3%>%
mutate(post = ifelse(year>=1993, 1, 0))
The employment trends of the treatment and control groups prior to the EITC expansion are similar, which supports the parallel trends assumption. The large gap in employment levels between the two groups, however, could be due to compositional differences. The earlier comparison of variable means shows that such differences exist: women with children are younger and more likely to be nonwhite than women without children. To investigate this further, it would be important to compare the composition of the treatment and control arms in the pre-treatment and post-treatment periods.
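The composition check suggested here could be sketched as follows: compute the mean of each covariate within every treatment-by-period cell and compare across cells. This is an illustrative base R example under stated assumptions; `toy` is a hypothetical data frame with made-up values, playing the role that `EITC3` (with its `treated` and `post` columns) plays in the analysis.

```r
# Hypothetical toy data; in the analysis this role is played by EITC3
toy <- data.frame(
  treated  = c(0, 0, 0, 0, 1, 1, 1, 1),
  post     = c(0, 0, 1, 1, 0, 0, 1, 1),
  age      = c(40, 38, 39, 41, 30, 34, 31, 35),
  nonwhite = c(0, 1, 1, 0, 1, 1, 1, 0)
)

# Mean of each covariate within every treated x post cell
composition <- aggregate(cbind(age, nonwhite) ~ treated + post,
                         data = toy, FUN = mean)
print(composition)
```

A large shift in a covariate's mean within one arm between the two periods would suggest that a change in composition, rather than the policy, could account for part of the change in employment.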
#Create a graph which plots mean annual employment rates by year (1991-1996) for single women with children (treatment) and without children (control)
library(dplyr)
EITC4<-EITC3%>%
group_by(year, treated) %>% summarize(Emp = mean(work))
## `summarise()` has grouped output by 'year'. You can override using the
## `.groups` argument.
EITC4$Employment<-EITC4$Emp*100
library(ggplot2)
#treated is converted to a factor so that the two arms get discrete colors; the y-axis is fixed to 0-100
ggplot(data =EITC4, aes(x = year, y = Employment, color = factor(treated), group = treated)) +
geom_line() +
coord_cartesian(ylim = c(0, 100)) +
labs(title="Employment rates by control and treatment arm, 1991-1996", color="Treated") +
xlab(label="Year") +
ylab(label="Employment rate (%)")
6. See the code below for the unconditional DiD analysis. The DiD
coefficient is 0.031 with a p-value of 0.08, so the estimate is only
marginally significant: it is significant at the 10% level but not at the
5% level. Taken at face value, the coefficient means that the EITC
expansion raised the employment rate of women with children by about 3
percentage points relative to women without children.
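The specification behind this estimate can be written out explicitly. The notation is an assumption for illustration: the subscript i indexes women, and the variable names follow the `lm` call in the code.

```latex
\text{work}_i = \beta_0 + \beta_1\,\text{treated}_i + \beta_2\,\text{post}_i
              + \beta_3\,(\text{treated}_i \times \text{post}_i) + \varepsilon_i

\beta_3 = \bigl(E[\text{work} \mid \text{treated}=1,\ \text{post}=1] - E[\text{work} \mid \text{treated}=1,\ \text{post}=0]\bigr)
        - \bigl(E[\text{work} \mid \text{treated}=0,\ \text{post}=1] - E[\text{work} \mid \text{treated}=0,\ \text{post}=0]\bigr)
```

Because `work` is a 0/1 indicator, each expectation is an employment rate, so the DiD coefficient is a difference in rates and is read in percentage points after multiplying by 100.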
#Calculate the unconditional (i.e., without any control variables) difference-in-difference estimates of the effect of the 1993 EITC expansion on employment of single women. Calculate estimates with all women with children as the treatment and single women with no children as the control. Interpret these results. (10 points)
EITC3$did<- EITC3$post * EITC3$treated
#generate unconditional DiD
DiDmodel <-lm(work~ treated + post + did, data = EITC3)
summary(DiDmodel)
##
## Call:
## lm(formula = work ~ treated + post + did, data = EITC3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5775 -0.4763 0.4225 0.5237 0.5502
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.577478 0.010720 53.871 <2e-16 ***
## treated -0.127650 0.014134 -9.031 <2e-16 ***
## post -0.004688 0.013427 -0.349 0.7270
## did 0.031128 0.017761 1.753 0.0797 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4969 on 13742 degrees of freedom
## Multiple R-squared: 0.01184, Adjusted R-squared: 0.01163
## F-statistic: 54.91 on 3 and 13742 DF, p-value: < 2.2e-16
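In a regression with only `treated`, `post`, and their interaction, the interaction coefficient equals the difference-in-differences of the four cell means. The sketch below illustrates this equivalence in base R on a hypothetical data frame `toy` with made-up values; it is not a re-analysis of the EITC data.

```r
# Hypothetical toy data: binary outcome for each treated x post cell
toy <- data.frame(
  treated = rep(c(0, 0, 1, 1), each = 4),
  post    = rep(c(0, 1, 0, 1), each = 4),
  work    = c(1, 1, 0, 0,   # control, pre:  mean 0.50
              1, 0, 0, 0,   # control, post: mean 0.25
              1, 0, 0, 0,   # treated, pre:  mean 0.25
              1, 1, 1, 0)   # treated, post: mean 0.75
)

# Cell means, indexed as [treated, post]
cell <- tapply(toy$work, list(toy$treated, toy$post), mean)

# Difference-in-differences computed directly from the four means
did_means <- (cell["1", "1"] - cell["1", "0"]) - (cell["0", "1"] - cell["0", "0"])

# Same quantity as the interaction coefficient in the saturated regression
did_lm <- coef(lm(work ~ treated * post, data = toy))["treated:post"]

print(c(from_means = did_means, from_lm = unname(did_lm)))
```

The two numbers agree because the model is saturated: it has one parameter for each of the four cells, so the fitted values reproduce the cell means exactly.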
#generate conditional DiD
DiDmodelconditional <-lm(work~ treated + post + did +ed + nonwhite +age, data = EITC3)
summary(DiDmodelconditional)
##
## Call:
## lm(formula = work ~ treated + post + did + ed + nonwhite + age,
## data = EITC3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6809 -0.4897 0.3374 0.4787 0.7539
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3745016 0.0261657 14.313 < 2e-16 ***
## treated -0.1170893 0.0143052 -8.185 2.96e-16 ***
## post -0.0050659 0.0133443 -0.380 0.7042
## did 0.0329061 0.0176495 1.864 0.0623 .
## ed 0.0178870 0.0016322 10.959 < 2e-16 ***
## nonwhite -0.0539581 0.0087631 -6.157 7.60e-10 ***
## age 0.0020299 0.0004376 4.639 3.53e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4937 on 13739 degrees of freedom
## Multiple R-squared: 0.02473, Adjusted R-squared: 0.0243
## F-statistic: 58.06 on 6 and 13739 DF, p-value: < 2.2e-16
Because this program is a workforce training initiative, I assume that it increased employment rates during this period. I also assume that women with children were less likely to participate in it because of their childcare responsibilities. The covariance between this omitted variable and the outcome is therefore positive, and the covariance between the omitted variable and the treatment variable is negative, so the estimated treatment effect is subject to downward bias.
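The sign argument can be made explicit with the textbook omitted-variable-bias formula. The notation is an assumption for illustration: D is the treatment regressor, Z is participation in the training program, and the other regressors are ignored for simplicity.

```latex
\text{True model: } \text{work}_i = \beta_0 + \beta_D D_i + \gamma Z_i + u_i

\text{Regression omitting } Z:\quad
\operatorname{plim}\,\hat{\beta}_D = \beta_D + \gamma\,\frac{\operatorname{Cov}(D, Z)}{\operatorname{Var}(D)}

\gamma > 0,\ \operatorname{Cov}(D, Z) < 0
\;\Longrightarrow\;
\gamma\,\frac{\operatorname{Cov}(D, Z)}{\operatorname{Var}(D)} < 0
```

The bias term is negative under the two assumptions stated above, so the estimate understates the true effect.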
#create dummy variables for different "doses"
#treatment is whether a woman has children or not: 1 child is dose #1 and 2 or more children is dose #2
#first I make an unconditional model
EITC3$doses1 <-ifelse(EITC3$children== 1, 1, 0)
EITC3$doses2 <-ifelse(EITC3$children>= 2, 1, 0)
EITC3$didbydose1<- EITC3$post * EITC3$doses1
EITC3$didbydose2<- EITC3$post * EITC3$doses2
dosesmodel <-lm(work~ doses1 + doses2+ post + didbydose1+ didbydose2, data = EITC3)
summary(dosesmodel)
##
## Call:
## lm(formula = work ~ doses1 + doses2 + post + didbydose1 + didbydose2,
## data = EITC3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.5775 -0.5197 0.4225 0.4517 0.5954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.577478 0.010680 54.070 < 2e-16 ***
## doses1 -0.057793 0.018125 -3.189 0.00143 **
## doses2 -0.172837 0.015899 -10.871 < 2e-16 ***
## post -0.004688 0.013377 -0.350 0.72600
## didbydose1 0.033306 0.022834 1.459 0.14470
## didbydose2 0.030241 0.019989 1.513 0.13032
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4951 on 13740 degrees of freedom
## Multiple R-squared: 0.01926, Adjusted R-squared: 0.01891
## F-statistic: 53.98 on 5 and 13740 DF, p-value: < 2.2e-16
#then I make the conditional model
dosesmodelconditional <-lm(work~ doses1 + doses2+ post + didbydose1+ didbydose2+ ed + nonwhite +age, data = EITC3)
summary(dosesmodelconditional)
##
## Call:
## lm(formula = work ~ doses1 + doses2 + post + didbydose1 + didbydose2 +
## ed + nonwhite + age, data = EITC3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6728 -0.4928 0.3446 0.4562 0.7900
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.3814314 0.0260927 14.618 < 2e-16 ***
## doses1 -0.0548552 0.0181393 -3.024 0.0025 **
## doses2 -0.1614731 0.0161224 -10.015 < 2e-16 ***
## post -0.0051390 0.0133016 -0.386 0.6993
## didbydose1 0.0351553 0.0227033 1.548 0.1215
## didbydose2 0.0313793 0.0198718 1.579 0.1143
## ed 0.0179040 0.0016270 11.004 < 2e-16 ***
## nonwhite -0.0466478 0.0087696 -5.319 1.06e-07 ***
## age 0.0017493 0.0004372 4.002 6.32e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4922 on 13737 degrees of freedom
## Multiple R-squared: 0.03109, Adjusted R-squared: 0.03053
## F-statistic: 55.1 on 8 and 13737 DF, p-value: < 2.2e-16
First, I estimate an unconditional model with no covariates. Second, I estimate a conditional model with years of education, race/ethnicity, and age as covariates. In both models, neither dose-specific DiD coefficient is statistically significant (p-values between 0.11 and 0.15), and the point estimates for one child and for two or more children are nearly identical (about 0.03), so there is no evidence that the effect differs across “doses” of the treatment. I am interested in looking further into the group of women with 2+ children, which is heterogeneous in the number of children. It may be informative to split this group further by number of children, although small sample sizes at higher numbers of children may prevent this analysis.
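Whether the two doses differ can also be tested directly, by testing the difference between the two DiD coefficients instead of comparing their separate p-values. The sketch below shows one way to do this in base R from the coefficient covariance matrix. It runs on simulated data: `toy`, the seed, and the effect sizes are assumptions for illustration, and only the variable names mirror the dose model above.

```r
set.seed(1)  # reproducible hypothetical data
n <- 2000
toy <- data.frame(
  children = sample(0:3, n, replace = TRUE),
  post     = rbinom(n, 1, 0.5)
)
toy$doses1 <- as.numeric(toy$children == 1)
toy$doses2 <- as.numeric(toy$children >= 2)
toy$didbydose1 <- toy$post * toy$doses1
toy$didbydose2 <- toy$post * toy$doses2

# Simulated employment with the same assumed effect (0.03) for both doses
p <- 0.55 - 0.05 * toy$doses1 - 0.15 * toy$doses2 +
  0.03 * (toy$didbydose1 + toy$didbydose2)
toy$work <- rbinom(n, 1, p)

m <- lm(work ~ doses1 + doses2 + post + didbydose1 + didbydose2, data = toy)

# t-test of H0: coefficient on didbydose1 equals coefficient on didbydose2
b <- coef(m)
V <- vcov(m)
diff_est <- b[["didbydose1"]] - b[["didbydose2"]]
diff_se  <- sqrt(V["didbydose1", "didbydose1"] + V["didbydose2", "didbydose2"] -
                   2 * V["didbydose1", "didbydose2"])
t_stat  <- diff_est / diff_se
p_value <- 2 * pt(-abs(t_stat), df = df.residual(m))

print(c(difference = diff_est, se = diff_se, t = t_stat, p = p_value))
```

The covariance term matters because both coefficients are estimated against the same control group, so their estimates are correlated.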