Causal Inference Midterm

Part III Analysis Implementation

#read in data
library(haven)
EITC <- read_dta("C:/Users/rlutt/Downloads/eitcRR-1.dta")

The original import has 13,746 observations of 10 variables. There are multiple observations for each state, varying by tax year. The summary table shows the distribution of tax years, 1991-1996: each year accounts for 15%-19% of the total observations, so the sample is spread fairly evenly across years. The median number of children is 1, with a minimum of 0 and a maximum of 9; the table shows the first and third quartiles. (I used the summary function to get the minimum and maximum.) 60% of the sample is non-white, meaning they are either Hispanic or Black. The median annual family income is 9,637 dollars, while the median annual earnings are 3,332 dollars. The median for earned income is smaller than for family income because it is an individual measure rather than one based on the family unit. This sample only includes non-married women, so it is plausible that part of their family income comes from their parents.

For both annual family income and earned income, the first and third quartiles are presented in parentheses. The median age is 34, with a first quartile of 26 and a third quartile of 44. Education is presented in years; the sample's median education is 10 years and the maximum is 11, which means that the women in this sample did not finish high school. 51% of the sample was employed in the year prior to the survey. Lastly, unearned income is negligible, as the maximum value is just under $135.

#inspect, describe and summarize the data

library(dplyr)

EITC2<-EITC%>%
  select(year, children, nonwhite, finc, earn, age, ed, work, unearn)
library(gtsummary)

table1 <- 
  EITC2 %>%
  tbl_summary()
table1

Characteristic                      N = 13,746¹
Year [taxyear]
    1991                            2,610 (19%)
    1992                            2,449 (18%)
    1993                            2,342 (17%)
    1994                            2,255 (16%)
    1995                            2,085 (15%)
    1996                            2,005 (15%)
Number of Children                  1.00 (0.00, 2.00)
Dummy=1 if Hispanic/Black           8,257 (60%)
Annual Family Income (97$)          9,637 (5,123, 18,659)
Annual earnings (97$)               3,332 (0, 14,321)
Age of woman                        34 (26, 44)
Years of education
    0                               452 (3.3%)
    3                               714 (5.2%)
    7                               3,145 (23%)
    9                               2,022 (15%)
    10                              2,850 (21%)
    11                              4,563 (33%)
Dummy =1 if Employed last year      7,052 (51%)
Unearned Income (97$)               3.0 (0.0, 6.9)
¹ n (%); Median (IQR)

summary(EITC$children)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   1.193   2.000   9.000

summary(EITC$finc)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0    5123    9637   15255   18659  575617

#2. Calculate the sample means of all variables for (a) single women with no children, (b) single women with 1 child, and (c) single women with 2+ children. 

#Note: I do not include year or state in the sample means, since they are the levels within which the observations are nested. I also do not include the number of children, since the means are grouped by that variable.

#a.)
library(dplyr)
nochildren<-filter(EITC, children==0)
mean(nochildren$finc)

## [1] 18559.86

mean(nochildren$earn)

## [1] 13760.26

mean(nochildren$age)

## [1] 38.49823

mean(nochildren$nonwhite)

## [1] 0.515944

mean(nochildren$ed)

## [1] 8.548676

mean(nochildren$work)

## [1] 0.5744896

mean(nochildren$unearn)

## [1] 4.799607

#b.)
onechild<-filter(EITC, children==1)
mean(onechild$finc)

## [1] 13941.57

mean(onechild$earn)

## [1] 9928.279

mean(onechild$age)

## [1] 33.75899

mean(onechild$nonwhite)

## [1] 0.5964683

mean(onechild$ed)

## [1] 8.992479

mean(onechild$work)

## [1] 0.5376063

mean(onechild$unearn)

## [1] 4.013291

#c.) 
morechildren<-filter(EITC, children>=2)
mean(morechildren$finc)

## [1] 11985.3

mean(morechildren$earn)

## [1] 6613.547

mean(morechildren$age)

## [1] 32.04747

mean(morechildren$nonwhite)

## [1] 0.7088847

mean(morechildren$ed)

## [1] 9.006721

mean(morechildren$work)

## [1] 0.4207099

mean(morechildren$unearn)

## [1] 5.371749
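
The three sets of means above can also be produced in one pass with a grouped summary. The following is a minimal sketch on made-up toy data, not the EITC file; the names `toy`, `child_group`, and `group_means` are hypothetical and only used for illustration.

```r
library(dplyr)

# Toy data standing in for the EITC file (values are made up for illustration)
toy <- tibble(
  children = c(0, 0, 1, 1, 2, 3),
  earn     = c(12000, 16000, 9000, 11000, 6000, 8000),
  work     = c(1, 1, 1, 0, 0, 1)
)

group_means <- toy %>%
  # collapse the number of children into the three groups used above
  mutate(child_group = case_when(
    children == 0 ~ "0 children",
    children == 1 ~ "1 child",
    TRUE          ~ "2+ children"
  )) %>%
  group_by(child_group) %>%
  # mean of each listed variable within each group
  summarise(across(c(earn, work), mean), .groups = "drop")

print(group_means)
```

Listing more variables inside `across()` would extend the same table without repeating a `mean()` call per variable and group.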

#Earnings are reported as zero for women who are not employed. Create a new variable with earnings conditional on working (those non-employed are set to be NA) and calculate the means of the new earnings variable by group (i.e., a. single women with no children, b. single women with 1 child, and c. single women with 2+ children)

#NA_real_ keeps every branch of case_when() numeric, which older dplyr versions require
EITC3 <- EITC %>% mutate(newearn = case_when(work == 0 ~ NA_real_, TRUE ~ as.numeric(earn)))
  
#a.) 

library(tidyverse)

ncempl<-filter(EITC3, children==0) 

ncemploy <- drop_na(ncempl, newearn) #drop only rows where newearn is NA

mean(ncemploy$newearn)

## [1] 19838.93

#b.) 

ocempl<-filter(EITC3, children==1) 

ocemploy <- drop_na(ocempl, newearn) #drop only rows where newearn is NA

mean(ocemploy$newearn)

## [1] 14963.35

#c.) 

tcempl<-filter(EITC3, children>=2) 

tcemploy <- drop_na(tcempl, newearn) #drop only rows where newearn is NA

mean(tcemploy$newearn)

## [1] 11961.44
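
An alternative to filtering and dropping rows for each group is to keep the NA values and let `mean()` skip them. This is a small sketch on made-up toy data (the names `toy`, `child_group`, and `cond_means` are hypothetical), not a rerun of the analysis above.

```r
library(dplyr)

# Toy data: earnings are 0 for women who did not work (values are made up)
toy <- tibble(
  children = c(0, 0, 1, 1, 2, 3),
  earn     = c(12000, 0, 9000, 0, 0, 8000),
  work     = c(1, 0, 1, 0, 0, 1)
)

cond_means <- toy %>%
  mutate(
    # earnings conditional on working: NA for the non-employed
    newearn     = if_else(work == 0, NA_real_, as.numeric(earn)),
    child_group = case_when(children == 0 ~ "0", children == 1 ~ "1", TRUE ~ "2+")
  ) %>%
  group_by(child_group) %>%
  summarise(
    mean_newearn = mean(newearn, na.rm = TRUE),  # NAs are skipped, not counted as 0
    n_working    = sum(!is.na(newearn)),         # how many women enter each mean
    .groups = "drop"
  )

print(cond_means)
```

Reporting `n_working` next to each mean makes it visible how many observations the conditional mean is based on.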

If you break up the sample by number of children (0, 1, or 2 or more), you can see differences in earnings and demographic characteristics. There is a gradient in earnings corresponding to the number of children: earnings tend to go down as the number of children goes up. The socio-demographic variables I chose to compare are race/ethnicity and education level. There is also a gradient in race/ethnicity, as the share of the sample that is nonwhite goes up from about 52% among women with no children, to almost 60% for women with one child, and to over 70% for women with two or more children. There is a visible education gradient, too, but it runs in the opposite direction to earnings: women with children have more years of education than women without, although the total difference across groups is only about half a year (8.5 versus 9.0 years).

mean(nochildren$nonwhite)

## [1] 0.515944

mean(nochildren$ed)

## [1] 8.548676

mean(onechild$nonwhite)

## [1] 0.5964683

mean(onechild$ed)

## [1] 8.992479

mean(morechildren$nonwhite)

## [1] 0.7088847

mean(morechildren$ed)

## [1] 9.006721

#prep data for DiD by creating treatment and timing variable


#treatment is whether a woman has children or not

EITC3$treated <- ifelse(EITC3$children > 0, 1, 0)

#time treatment went into place is 1993

#post is 1 from 1993 onward for both arms and 0 before
EITC3 <- EITC3 %>%
  mutate(post = if_else(year >= 1993, 1, 0))

The graph below shows that the employment rates of women with children (treatment group) and without children (control group) differ prior to the expansion of the EITC. The control group has higher employment levels than the treatment group.

The trends in the treatment and control groups prior to the EITC expansion are similar, however, which is consistent with the parallel trends assumption. The large gap in employment between the treatment and control groups could be due to compositional differences. We know there are some compositional differences from the earlier analysis of the variable means. To investigate this further, it would be important to compare the composition of the treatment and control arms in the pre-treatment and post-treatment periods.
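
One simple way to make the parallel-trends reading concrete is to tabulate the treated-minus-control gap by year: a roughly constant gap in the pre-period is what similar trends look like numerically. The sketch below uses hypothetical employment rates (not the midterm's numbers), and the names `rates` and `gaps` are assumptions for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical employment rates (percent) by year and arm; not the midterm's numbers
rates <- tibble(
  year       = rep(1991:1994, each = 2),
  treated    = rep(c(0, 1), times = 4),
  Employment = c(58, 45, 57, 44, 57, 46, 57, 48)
)

gaps <- rates %>%
  # one row per year, one column per arm (arm_0 = control, arm_1 = treated)
  pivot_wider(names_from = treated, values_from = Employment, names_prefix = "arm_") %>%
  # treated minus control; a stable gap before the policy suggests similar trends
  mutate(gap = arm_1 - arm_0)

print(gaps)
```

In this toy example the gap is the same in 1991 and 1992 and narrows afterwards, which is the pattern a difference-in-differences design relies on.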

#Create a graph which plots mean annual employment rates by year (1991-1996) for single women with children (treatment) and without children (control)


library(dplyr)
EITC4 <- EITC3 %>%
  group_by(year, treated) %>%
  #.groups = "drop" returns an ungrouped table and silences the grouping message
  summarize(Emp = mean(work), .groups = "drop")

EITC4$Employment<-EITC4$Emp*100


library(ggplot2)
#treated is converted to a factor so each arm gets its own discrete color
ggplot(data = EITC4, aes(x = year, y = Employment, color = factor(treated), group = treated)) +
  geom_line() +
  #show the full 0-100 range on the y-axis
  coord_cartesian(ylim = c(0, 100)) +
  labs(title = "Employment rates by control and treatment arm, 1991-1996",
       color = "Treated (1 = has children)") +
  xlab(label = "Year") +
  ylab(label = "Employment rate")

6. See the code below for the unconditional DiD analysis. The result is marginally significant: the p-value (0.080) is above 0.05 but below 0.10. The DiD coefficient is 0.031, which can be interpreted as follows: the expansion of the EITC led to a 3.1 percentage point increase in employment among women with children relative to women without children.

The variables I included as controls were years of education, age, and race/ethnicity. These DiD results are also marginally significant (p = 0.062), with a slightly higher coefficient of 0.033, which can be interpreted in the same way as the unconditional estimate.
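
To illustrate what the DiD coefficient measures, the sketch below fits the same kind of model on a tiny made-up data set and checks that the interaction coefficient equals the difference-in-differences of the four cell means. The data and the names `toy`, `cell`, `did_by_hand`, and `did_from_lm` are hypothetical; the formula `treated * post` is shorthand that expands to `treated + post + treated:post`.

```r
# Tiny made-up data set: 4 cells (treated x post) with known employment rates
toy <- data.frame(
  treated = rep(c(0, 0, 1, 1), each = 4),
  post    = rep(c(0, 1, 0, 1), each = 4),
  work    = c(1, 1, 0, 0,   # control, pre:  0.50
              1, 1, 0, 0,   # control, post: 0.50
              1, 0, 0, 0,   # treated, pre:  0.25
              1, 1, 1, 0)   # treated, post: 0.75
)

# Saturated 2x2 model; the interaction term is the DiD estimate
fit <- lm(work ~ treated * post, data = toy)
did_from_lm <- unname(coef(fit)["treated:post"])

# Difference-in-differences computed by hand from the four cell means
cell <- with(toy, tapply(work, list(treated, post), mean))
did_by_hand <- (cell["1", "1"] - cell["1", "0"]) - (cell["0", "1"] - cell["0", "0"])

print(c(lm = did_from_lm, by_hand = did_by_hand))
```

Because the model has one parameter per cell, the regression and the hand calculation agree exactly; adding covariates, as in the conditional model, breaks this exact equality.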

#Calculate the unconditional (i.e., without any control variables) difference-in-difference estimates of the effect of the 1993 EITC expansion on employment of single women.  Calculate estimates with all women with children as the treatment and single women with no children as the control.  Interpret these results.  (10 points)

EITC3$did<- EITC3$post * EITC3$treated

#generate unconditional DiD
  
DiDmodel <-lm(work~ treated + post + did, data = EITC3)

summary(DiDmodel)

## 
## Call:
## lm(formula = work ~ treated + post + did, data = EITC3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5775 -0.4763  0.4225  0.5237  0.5502 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.577478   0.010720  53.871   <2e-16 ***
## treated     -0.127650   0.014134  -9.031   <2e-16 ***
## post        -0.004688   0.013427  -0.349   0.7270    
## did          0.031128   0.017761   1.753   0.0797 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4969 on 13742 degrees of freedom
## Multiple R-squared:  0.01184,    Adjusted R-squared:  0.01163 
## F-statistic: 54.91 on 3 and 13742 DF,  p-value: < 2.2e-16
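
As a worked check of the interpretation, the fitted coefficients above can be turned back into the four cell means. This is arithmetic on the reported estimates only (rounded to four decimals); the notation, with $\bar{Y}$ for a mean employment rate and subscripts $T$ and $C$ for the treated and control arms, is introduced here for illustration.

```latex
\begin{aligned}
\bar{Y}_{C,\text{pre}}  &= \hat\beta_0 = 0.5775 \\
\bar{Y}_{C,\text{post}} &= \hat\beta_0 + \hat\beta_{\text{post}} = 0.5775 - 0.0047 = 0.5728 \\
\bar{Y}_{T,\text{pre}}  &= \hat\beta_0 + \hat\beta_{\text{treated}} = 0.5775 - 0.1277 = 0.4498 \\
\bar{Y}_{T,\text{post}} &= \hat\beta_0 + \hat\beta_{\text{treated}} + \hat\beta_{\text{post}} + \hat\beta_{\text{did}} = 0.4763 \\
\hat\beta_{\text{did}}  &= (0.4763 - 0.4498) - (0.5728 - 0.5775) \approx 0.031
\end{aligned}
```

Relative to the pre-period treated rate of about 0.45, a rise of 0.031 is roughly a 7% relative increase, which is why the estimate is better read in percentage points than in percent.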

#generate conditional DiD

  
DiDmodelconditional <-lm(work~ treated + post + did +ed + nonwhite +age, data = EITC3)

summary(DiDmodelconditional)

## 
## Call:
## lm(formula = work ~ treated + post + did + ed + nonwhite + age, 
##     data = EITC3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6809 -0.4897  0.3374  0.4787  0.7539 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3745016  0.0261657  14.313  < 2e-16 ***
## treated     -0.1170893  0.0143052  -8.185 2.96e-16 ***
## post        -0.0050659  0.0133443  -0.380   0.7042    
## did          0.0329061  0.0176495   1.864   0.0623 .  
## ed           0.0178870  0.0016322  10.959  < 2e-16 ***
## nonwhite    -0.0539581  0.0087631  -6.157 7.60e-10 ***
## age          0.0020299  0.0004376   4.639 3.53e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4937 on 13739 degrees of freedom
## Multiple R-squared:  0.02473,    Adjusted R-squared:  0.0243 
## F-statistic: 58.06 on 6 and 13739 DF,  p-value: < 2.2e-16

If a nationwide workforce training initiative was introduced around the same time as the expansion of the EITC, its effect needs to be considered in our analysis. If we do not control for it, this program could account for some of the changes in employment observed in our analysis. We must consider the relationship between having children and participating in this training initiative, as well as the relationship between participating in this training initiative and becoming employed.

As this program is a workforce training initiative, I am going to assume that it increased employment rates during this period. I am also going to assume that women with children were less likely to participate in this program because they were tasked with childcare. Therefore, as the covariance of this omitted variable and the outcome is positive and the covariance of this omitted variable and our treatment variable is negative, our results are subject to downward bias; that is, they would understate the true effect of the EITC expansion.
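
The sign argument above follows the standard omitted-variable-bias formula. As a sketch, let $\text{train}_i$ be a hypothetical indicator for participating in the training initiative (it is not a variable in the data), $\gamma$ its effect on employment, and $\delta$ the coefficient on the DiD term in an auxiliary regression of $\text{train}_i$ on the included regressors:

```latex
\begin{aligned}
\text{work}_i  &= \beta_0 + \beta_1\,\text{treated}_i + \beta_2\,\text{post}_i + \beta_3\,\text{did}_i + \gamma\,\text{train}_i + \varepsilon_i \\
\text{train}_i &= \pi_0 + \pi_1\,\text{treated}_i + \pi_2\,\text{post}_i + \delta\,\text{did}_i + u_i \\
E\big[\hat\beta_3^{\text{short}}\big] &= \beta_3 + \gamma\,\delta
\end{aligned}
```

Under the two assumptions stated above, $\gamma > 0$ and $\delta < 0$, so $\gamma\,\delta < 0$ and the estimate from the regression that omits training is biased downward. If the program reached both arms equally, $\delta$ would be zero and the `post` term would absorb it.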

(Bonus)

#create dummy variables for different "doses"

#treatment is whether a woman has children or not: 1 child is dosage #1 and 2 or more children is dosage #2

#first I make an unconditional model
EITC3$doses1 <-ifelse(EITC3$children== 1, 1, 0)

EITC3$doses2 <-ifelse(EITC3$children>= 2, 1, 0)

EITC3$didbydose1<- EITC3$post * EITC3$doses1

EITC3$didbydose2<- EITC3$post * EITC3$doses2

dosesmodel <-lm(work~ doses1 + doses2+ post + didbydose1+ didbydose2, data = EITC3)

summary(dosesmodel)

## 
## Call:
## lm(formula = work ~ doses1 + doses2 + post + didbydose1 + didbydose2, 
##     data = EITC3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5775 -0.5197  0.4225  0.4517  0.5954 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.577478   0.010680  54.070  < 2e-16 ***
## doses1      -0.057793   0.018125  -3.189  0.00143 ** 
## doses2      -0.172837   0.015899 -10.871  < 2e-16 ***
## post        -0.004688   0.013377  -0.350  0.72600    
## didbydose1   0.033306   0.022834   1.459  0.14470    
## didbydose2   0.030241   0.019989   1.513  0.13032    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4951 on 13740 degrees of freedom
## Multiple R-squared:  0.01926,    Adjusted R-squared:  0.01891 
## F-statistic: 53.98 on 5 and 13740 DF,  p-value: < 2.2e-16

#then I make the conditional model
dosesmodelconditional <-lm(work~ doses1 + doses2+ post + didbydose1+ didbydose2+ ed + nonwhite +age, data = EITC3)

summary(dosesmodelconditional)

## 
## Call:
## lm(formula = work ~ doses1 + doses2 + post + didbydose1 + didbydose2 + 
##     ed + nonwhite + age, data = EITC3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6728 -0.4928  0.3446  0.4562  0.7900 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.3814314  0.0260927  14.618  < 2e-16 ***
## doses1      -0.0548552  0.0181393  -3.024   0.0025 ** 
## doses2      -0.1614731  0.0161224 -10.015  < 2e-16 ***
## post        -0.0051390  0.0133016  -0.386   0.6993    
## didbydose1   0.0351553  0.0227033   1.548   0.1215    
## didbydose2   0.0313793  0.0198718   1.579   0.1143    
## ed           0.0179040  0.0016270  11.004  < 2e-16 ***
## nonwhite    -0.0466478  0.0087696  -5.319 1.06e-07 ***
## age          0.0017493  0.0004372   4.002 6.32e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4922 on 13737 degrees of freedom
## Multiple R-squared:  0.03109,    Adjusted R-squared:  0.03053 
## F-statistic:  55.1 on 8 and 13737 DF,  p-value: < 2.2e-16

First, I do an unconditional analysis, with no covariates. Second, I do a conditional analysis, with years of education, race/ethnicity, and age as covariates. In both models, neither dose-specific DiD coefficient is significant at conventional levels, and the two estimates are similar in size (about 0.03 each), so there is no evidence that the effect differs between “doses” of the treatment, that is, between having one child and having two or more. I’m interested in looking more into the group of women with 2+ children. This is a heterogeneous group in terms of number of children. It may be interesting to break this group up further by number of children, but small sample sizes at higher numbers of children may prevent this analysis.
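
Whether the two dose-specific effects differ can also be tested directly, rather than judged from two separate p-values. The sketch below does this on simulated data with no true effect, using only base R: it takes the difference between the two interaction coefficients and builds its standard error from the coefficient covariance matrix. The data and the names `toy`, `diff_est`, `diff_se`, `t_stat`, and `p_value` are hypothetical and not part of the midterm analysis.

```r
# Simulated data with three groups (0, 1, 2+ children) and a pre/post indicator
set.seed(1)
n <- 600
toy <- data.frame(
  children = sample(0:3, n, replace = TRUE),
  post     = rep(c(0, 1), each = n / 2)
)
toy$doses1 <- as.numeric(toy$children == 1)
toy$doses2 <- as.numeric(toy$children >= 2)
toy$work   <- rbinom(n, 1, 0.5)  # employment unrelated to anything, by construction

fit <- lm(work ~ doses1 + doses2 + post + doses1:post + doses2:post, data = toy)

b <- coef(fit)
V <- vcov(fit)

# Difference between the two dose-specific DiD coefficients
diff_est <- unname(b["doses1:post"] - b["doses2:post"])

# Var(a - b) = Var(a) + Var(b) - 2 Cov(a, b)
diff_se <- sqrt(V["doses1:post", "doses1:post"] +
                V["doses2:post", "doses2:post"] -
                2 * V["doses1:post", "doses2:post"])

t_stat  <- diff_est / diff_se
p_value <- 2 * pt(abs(t_stat), df = df.residual(fit), lower.tail = FALSE)

print(c(estimate = diff_est, se = diff_se, t = t_stat, p = p_value))
```

A large p-value here would mean the data cannot distinguish the one-child effect from the two-or-more effect, which is a different statement from each effect being individually insignificant.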

Rebecca Luttinen

2024-03-07