#load libraries
library(tidyverse)
library(psych)
library(dplyr)
library(haven)
options(scipen = 999)
Causal Midterm
Drew Schaefer
Problem 1: Correlation and causation are different claims, so researchers need to be clear about which one their findings support. Causation is the much stronger claim and requires far more rigorous methodology.
A classic example of this problem is the sale of ice cream and murders. The two are correlated, but it would be incorrect to make a causal claim that an increase in ice cream sales causes an increase in murders.
Two of the most common problems in trying to make causal claims from a correlation are omitted variables and reverse causality. With omitted variables, a third factor drives both variables; in the ice cream example, hot weather plausibly raises both ice cream sales and murders. With reverse causality, the causal direction may not be what we expect (this can be a particular problem with cross-sectional data, which has no temporal ordering).
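The omitted variable problem can be illustrated with a small simulation. This is only a sketch on made-up data: the variable names, the temperature confounder, and every coefficient below are assumptions chosen for illustration, not estimates from real data.

```r
# Simulated illustration only: every number here is made up.
set.seed(42)
n <- 1000
temperature <- rnorm(n, mean = 20, sd = 8)  # hypothetical confounder

# Both outcomes depend on temperature, but not on each other
ice_cream <- 50 + 3.0 * temperature + rnorm(n, sd = 10)
murders   <-  5 + 0.2 * temperature + rnorm(n, sd = 2)

# The raw correlation is positive even though there is no causal link
raw_cor <- cor(ice_cream, murders)

# Omitting temperature: ice cream sales appear to "predict" murders
naive_fit <- lm(murders ~ ice_cream)
# Controlling for temperature: the ice cream coefficient shrinks toward zero
adjusted_fit <- lm(murders ~ ice_cream + temperature)

naive_coef    <- unname(coef(naive_fit)["ice_cream"])
adjusted_coef <- unname(coef(adjusted_fit)["ice_cream"])
print(c(raw_cor = raw_cor, naive = naive_coef, adjusted = adjusted_coef))
```

Comparing the naive and adjusted coefficients shows how a correlation can appear without any causal link once the common cause is left out of the model.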
Problem 2:
An experiment is useful for causal inference because it requires the fewest assumptions: much of the process is controlled by the experimenter. For example, requirements like consistent treatment amounts and independence of participants are much easier to guarantee in an experiment because the design limits that variation.
Problem 3:
*This is taken verbatim from my homework 3*
Part A:
The Torche paper examines maternal stress and birth weight outcomes. The main endogeneity problem in this paper is that maternal stress may not be evenly distributed across mothers. Torche notes that women with higher levels of maternal stress may be “systematically different,” which could induce a spurious association between maternal stress and birth weight.
Part B:
Torche uses a natural experiment in this paper: an earthquake that struck Chile in 2005. This natural experiment is useful because the stress it induced in mothers should not vary with any characteristics that may have an effect on birth weight. Thus, any differences in birth weight should be attributable to the stress.
Part C:
Z, the earthquake, induces randomness in the treatment, which is maternal stress. In a world without the earthquake there are many endogeneity problems, meaning that variables exist that are associated with both exposure to maternal stress and the birth weight outcome (i.e., maternal age, income, education, etc.).
Part D:
Torche argues that there was little to no “spill over effect” from the earthquake. This means that the only pathway through which the earthquake affects birth weight is maternal stress, which is a key assumption of the design. This seems reasonable to me and I can accept Torche’s argument.
Torche addresses other potential concerns in the sensitivity analysis/discussion sections of the paper. First, the stress of the earthquake could lead to an increase in miscarriages, which could in turn lead to a decrease in low birth weights. However, Torche shows with administrative data that this did not occur.
Another concern is compositional change in the areas hit by the earthquake. The earthquake may have led certain women to leave the area, and Torche notes that healthy women might make up a high proportion of out-migrants. This is a concern because it could lead to an upward bias of the effect. Torche uses national survey data to show that there may be a slight increase in out-migration for healthy women, but that the effect would be very small.
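The two assumptions discussed in Parts C and D can be written compactly. This is a sketch in standard instrumental-variable notation. Only Z appears in the text above; the symbols D (maternal stress), Y (birth weight), and U (unobserved maternal characteristics) are assumptions introduced here and are not taken from the Torche paper.

```latex
\begin{align*}
\text{Independence (Part C):} \quad & Z \perp U \\
\text{Exclusion (Part D):} \quad & Y(z, d) = Y(d) \quad \text{for all } z, d
\end{align*}
```

The first line says that earthquake exposure is unrelated to the characteristics that confound stress and birth weight. The second says that, holding maternal stress fixed, the earthquake has no other effect on birth weight.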
Problem 4:
#load data
eitcRR_1 <- read_dta("eitcRR-1.dta")
Part 1:
head(eitcRR_1)
# A tibble: 6 × 10
  state  year children nonwhite   finc   earn   age    ed  work unearn
  <dbl> <dbl>    <dbl>    <dbl>  <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl>
1    11  1991        0        0  9630      0     53     7     0   9.63
2    11  1991        0        1 18714. 18714.    26    10     1   0
3    11  1991        0        0 31228. 14730.    48    11     1  16.5
4    11  1991        0        0 54331. 17676.    44    11     0  36.7
5    11  1991        0        0  8249.  8249.    20    10     1   0
6    11  1991        1        0  7499.     0     27    10     0   7.50
Looking at the first few rows of the data we can see that there are ten columns and each row represents a unique individual response.
length(unique(eitcRR_1$state))
[1] 51
This tells us that there are 51 unique state codes (presumably the 50 states plus DC) represented in the sample.
mean(eitcRR_1$nonwhite)
[1] 0.6006838
We can also see that approximately 60% of the sample is non-White.
mean(eitcRR_1$age)
[1] 35.20966
The mean age of the sample is 35.2 years.
mean(eitcRR_1$work)
[1] 0.513022
Approximately 51% of the women in the sample were employed in the past year.
range(eitcRR_1$finc)
[1]      0.0 575616.8
range(eitcRR_1$earn)
[1]      0.0 537880.6
Finally, we can see that the range of both family income and individual earnings is quite large (from zero to more than 500,000 dollars a year).
Part 2:
#create a new variable for the three different groups of number of kids
eitcRR_1$kidgroup <- if_else(eitcRR_1$children==0,"No Kids",if_else(eitcRR_1$children==1,"One Kid","Two or more kids"))
#display a table of the means of all the variables for the kid groups
aggregate(eitcRR_1[,1:10], by=list(eitcRR_1$kidgroup), mean, na.rm=TRUE)
           Group.1    state     year children  nonwhite     finc      earn
1          No Kids 53.39666 1993.365 0.000000 0.5159440 18559.86 13760.256
2          One Kid 55.59091 1993.338 1.000000 0.5964683 13941.57  9928.279
3 Two or more kids 55.24386 1993.330 2.801092 0.7088847 11985.30  6613.547
       age       ed      work   unearn
1 38.49823 8.548676 0.5744896 4.799607
2 33.75899 8.992479 0.5376063 4.013291
3 32.04747 9.006721 0.4207099 5.371749
This table shows us the means of the ten original variables for the three different groups (no children, one child, and two or more children).
#make new variable that is earnings conditional on working
#NA_real_ keeps the column numeric regardless of dplyr version
eitcRR_1$condearn <- if_else(eitcRR_1$work==1, eitcRR_1$earn, NA_real_)
#refer to the new column by name rather than by position
setNames(aggregate(eitcRR_1$condearn, by=list(eitcRR_1$kidgroup), mean, na.rm=TRUE), c("Number of Kids", "Earnings Conditional on Employment"))
    Number of Kids Earnings Conditional on Employment
1          No Kids                           19838.93
2          One Kid                           14963.35
3 Two or more kids                           11961.44
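The grouping and conditional-earnings steps above can also be written as a single dplyr pipeline. The sketch below runs on a tiny made-up data frame (the values are invented; only the column names mirror the real data), so it illustrates the pattern rather than reproducing the results above.

```r
library(dplyr)

# Tiny made-up data set; the column names mirror the ones used above,
# but the values are invented for illustration.
toy <- data.frame(
  children = c(0, 0, 1, 1, 2, 3),
  work     = c(1, 0, 1, 1, 0, 1),
  earn     = c(20000, 0, 15000, 10000, 0, 12000)
)

toy_summary <- toy %>%
  mutate(
    kidgroup = case_when(
      children == 0 ~ "No Kids",
      children == 1 ~ "One Kid",
      TRUE          ~ "Two or more kids"
    ),
    # Earnings are kept only for women who worked; everyone else becomes NA
    condearn = if_else(work == 1, earn, NA_real_)
  ) %>%
  group_by(kidgroup) %>%
  summarise(
    mean_earn     = mean(earn),
    mean_condearn = mean(condearn, na.rm = TRUE),
    .groups = "drop"
  )

print(toy_summary)
```

Using case_when avoids the nested if_else calls, and naming each summary column inside summarise removes the need for setNames.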
Part 3:
First, we can look at how the three samples differ by earnings.
setNames(aggregate(eitcRR_1$condearn, by=list(eitcRR_1$kidgroup), mean, na.rm=TRUE), c("Number of Kids", "Mean Earnings if Employed"))
    Number of Kids Mean Earnings if Employed
1          No Kids                  19838.93
2          One Kid                  14963.35
3 Two or more kids                  11961.44
setNames(aggregate(list(eitcRR_1$condearn, eitcRR_1$earn, eitcRR_1$finc, eitcRR_1$unearn), by=list(eitcRR_1$kidgroup), mean, na.rm=TRUE), c("Number of kids", "cond earn", "earn everyone", "family income", "unearned income"))
    Number of kids cond earn earn everyone family income unearned income
1          No Kids  19838.93     13760.256      18559.86        4.799607
2          One Kid  14963.35      9928.279      13941.57        4.013291
3 Two or more kids  11961.44      6613.547      11985.30        5.371749
We can see that for both the conditional earnings and the earnings in the whole sample, single women with no kids have the highest earnings, with a decline for single women with one kid and a further decline for single women with two or more kids. Family income follows the same gradient. Unearned income (which, judging from the first rows of the data, appears to be recorded in thousands of dollars) does not: single women with two or more kids have the highest unearned income (5.37), but single women with one kid, not those with no kids, have the lowest (4.01 versus 4.80).
#display a table for mean age for the different kid groups
setNames(aggregate(eitcRR_1$age, by=list(eitcRR_1$kidgroup), mean, na.rm=TRUE), c("Number of Kids", "Mean Age"))
    Number of Kids Mean Age
1          No Kids 38.49823
2          One Kid 33.75899
3 Two or more kids 32.04747
Next, we can look at how the samples differ by average age. The mean age for single women with no kids was almost 38.5 years, for single women with one kid it was about 33.8 years, and for single women with two or more kids it was just over 32 years. There is a difference of more than six years in mean age between single women with no kids and single women with two or more kids.
#display a table for the mean years of ed for the different kid groups
setNames(aggregate(eitcRR_1$ed, by=list(eitcRR_1$kidgroup), mean, na.rm=TRUE), c("Number of Kids", "Mean Years of Education"))
    Number of Kids Mean Years of Education
1          No Kids                8.548676
2          One Kid                8.992479
3 Two or more kids                9.006721
Finally, we can look at how the three samples differ by years of education. For single women with no kids the mean was about 8.5 years, while for single women with one kid and for single women with two or more kids it was approximately 9 years. Overall, educational attainment (in years of education) does not differ very much within this sample.
Part 4:
#new variable for treatment
eitcRR_1$treatment <- if_else(eitcRR_1$children==0,0,1)
#new variable for having any kids (this is the same as the treatment variable)
eitcRR_1$haskids <- if_else(eitcRR_1$treatment==1,"kids", "no Kids")
#new variable for post expansion indicator
eitcRR_1$post <- if_else(eitcRR_1$year>1993,1,0)
Part 5:
#create a new dataframe that finds the mean workforce participation rate grouped by year and child status
meanemployment <- eitcRR_1 %>% group_by(year, haskids) %>% summarise(employ = mean(work))
#print the first few rows of this new data frame to check it
head(meanemployment)
# A tibble: 6 × 3
# Groups:   year [3]
   year haskids employ
  <dbl> <chr>    <dbl>
1  1991 kids     0.460
2  1991 no Kids  0.583
3  1992 kids     0.439
4  1992 no Kids  0.572
5  1993 kids     0.438
6  1993 no Kids  0.571
#plot the mean employment rate by year and the child status
ggplot(meanemployment, aes(x=year, y=employ, color=haskids)) +
  geom_point() +
  geom_line() +
  ggtitle("Employment by Year and Number of Children") +
  labs(caption = "Data is subset of Current Population Survey: 1991-1996") +
  geom_vline(xintercept = 1993)
This plot allows us to look at the pre-treatment trends and see how well the data satisfy the parallel trends assumption. Looking at the trend from 1991 to 1993, the two groups appear roughly parallel, with a drop from 1991 to 1992 and then not much of a change from 1992 to 1993.
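The visual check can be backed up with a quick calculation of the gap between the two groups in each pre-expansion year. This is a minimal sketch that re-enters the six rounded rates printed by head(meanemployment) above; the object names are hypothetical.

```r
# Pre-expansion employment rates copied from the head(meanemployment) output above
pre <- data.frame(
  year    = c(1991, 1992, 1993),
  kids    = c(0.460, 0.439, 0.438),
  no_kids = c(0.583, 0.572, 0.571)
)

# Gap between the groups in each year; parallel trends implies a roughly constant gap
pre$gap <- pre$kids - pre$no_kids

# Year-over-year change in the gap; values near zero support parallel pre-trends
gap_change <- diff(pre$gap)

print(pre)
print(round(gap_change, 3))
```

If the gap moved by an amount comparable to the estimated treatment effect before 1994, the parallel trends assumption would be harder to defend.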
Part 6:
#reset scientific notation so p values look normal
options(scipen = 1)
#create did variable
eitcRR_1$did <- eitcRR_1$post*eitcRR_1$treatment
#run the first unconditional model
didmod1 <- lm(work~treatment+post+did, data = eitcRR_1)
#print the model summary
summary(didmod1)
Call:
lm(formula = work ~ treatment + post + did, data = eitcRR_1)
Residuals:
Min 1Q Median 3Q Max
-0.5755 -0.4908 0.4245 0.5092 0.5540
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.575460 0.008845 65.060 < 2e-16 ***
treatment -0.129498 0.011676 -11.091 < 2e-16 ***
post -0.002074 0.012931 -0.160 0.87261
did 0.046873 0.017158 2.732 0.00631 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4967 on 13742 degrees of freedom
Multiple R-squared: 0.0126, Adjusted R-squared: 0.01238
F-statistic: 58.45 on 3 and 13742 DF, p-value: < 2.2e-16
This unconditional model shows the effect of the EITC expansion on the employment of single women with no controls. The main coefficient of interest is the did variable, which gives the difference-in-differences estimate. It is significant at the 0.01 level and tells us that the expansion of the EITC led to an increase in employment of nearly 5 percentage points (0.047) for women with children relative to women without children. The post variable indicates whether the response was before or after the expansion; its coefficient is the change in employment over time for women without children. It is not significant and is very close to zero. Finally, the treatment variable indicates whether the respondent was eligible for the treatment (in this case, whether or not the woman had children). It is significant at the 0.001 level and tells us that, before the expansion, women with children had an employment rate about 13 percentage points lower than women without children.
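Because the unconditional model has only the two indicators and their interaction, its coefficients map directly onto the four group-by-period employment rates. The sketch below rebuilds those four cells from the coefficients printed above and recovers the did estimate by hand; the object names are hypothetical.

```r
# Coefficients copied from summary(didmod1) above
b0  <- 0.575460   # intercept: no kids, pre-expansion
b_t <- -0.129498  # treatment
b_p <- -0.002074  # post
b_d <- 0.046873   # did (interaction)

# Implied employment rates in each of the four cells
control_pre  <- b0
control_post <- b0 + b_p
treated_pre  <- b0 + b_t
treated_post <- b0 + b_t + b_p + b_d

# Difference-in-differences: change for the treated minus change for the controls
did_by_hand <- (treated_post - treated_pre) - (control_post - control_pre)

print(round(c(control_pre, control_post, treated_pre, treated_post), 3))
print(round(did_by_hand, 6))
```

The hand calculation returns the did coefficient, which is why that coefficient can be read as the change for women with children net of the change for women without children.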
Part 7:
#run the conditional model
didmod2 <- lm(work~treatment+post+did+age+scale(unearn)+ed, data = eitcRR_1)
#print the model summary
summary(didmod2)
Call:
lm(formula = work ~ treatment + post + did + age + scale(unearn) +
ed, data = eitcRR_1)
Residuals:
Min 1Q Median 3Q Max
-0.7482 -0.4769 0.2769 0.4386 2.5125
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.3319584 0.0241094 13.769 < 2e-16 ***
treatment -0.1219158 0.0114801 -10.620 < 2e-16 ***
post -0.0094519 0.0124418 -0.760 0.4475
did 0.0530821 0.0165056 3.216 0.0013 **
age 0.0029185 0.0004245 6.876 6.43e-12 ***
scale(unearn) -0.1270630 0.0040994 -30.995 < 2e-16 ***
ed 0.0156960 0.0015742 9.971 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4778 on 13739 degrees of freedom
Multiple R-squared: 0.08665, Adjusted R-squared: 0.08626
F-statistic: 217.2 on 6 and 13739 DF, p-value: < 2.2e-16
This conditional model includes three control variables. First, and most importantly, the did coefficient is still significant at the 0.01 level. It tells us that the expansion of the EITC led to an increase in employment of a little more than 5 percentage points (0.053). This is slightly larger than the estimate from the model with no controls (0.047), which suggests that the unconditional model may have slightly underestimated the effect of the EITC on employment. All three of the control variables are significant at the 0.001 level. First, age is positively associated with employment: each additional year of age is associated with an increase of about 0.3 percentage points in the probability of employment. Next, unearned income is negatively associated with employment: a one standard deviation increase in unearned income is associated with a reduction of almost 13 percentage points. Lastly, education is positively associated with employment: each additional year of education is associated with an increase of about 1.6 percentage points.
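The hand-built did variable used in both models is equivalent to R's formula interaction shorthand. The sketch below checks this on simulated data; the data-generating process and all of its numbers are assumptions made up for illustration and are unrelated to the EITC sample.

```r
set.seed(1)
n <- 4000
sim <- data.frame(
  treatment = rbinom(n, 1, 0.5),
  post      = rbinom(n, 1, 0.5)
)
# Hypothetical data-generating process with a true interaction effect of 0.05
p <- 0.55 - 0.12 * sim$treatment + 0.05 * sim$treatment * sim$post
sim$work <- rbinom(n, 1, p)

# Hand-built interaction term, as in the models above
sim$did <- sim$post * sim$treatment
fit_manual <- lm(work ~ treatment + post + did, data = sim)

# Formula shorthand: treatment * post expands to both main effects plus treatment:post
fit_formula <- lm(work ~ treatment * post, data = sim)

coef_manual  <- unname(coef(fit_manual)["did"])
coef_formula <- unname(coef(fit_formula)["treatment:post"])
print(c(manual = coef_manual, formula = coef_formula))
```

Both fits give the same interaction estimate, so the shorthand is only a convenience; control variables can be added to either formula in the same way.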
Part 8:
The large workforce training program is a potential omitted variable that may explain some of the estimated effect of the EITC expansion on employment by opening a backdoor pathway. If we suppose that the workforce training program increased employment rates for those with lower levels of education, then the omitted variable bias would be especially concerning if there are educational differences between those with and without children (i.e., the treatment and control groups).
Further, in Part 7 I found a positive association between years of education and employment. If the workforce training program raised employment among those with low levels of education above what it would otherwise have been, then we may be incorrectly estimating the effect of the EITC on employment. However, the lowest education group is the no kids group (i.e., the control group), which means that their post-expansion employment may be higher than it would be in a counterfactual world without the training program. Therefore, our estimates may understate the true effect of the EITC expansion on employment.
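The direction of this bias can be shown with simple arithmetic. Every number in the sketch below is hypothetical and chosen only to illustrate the argument; none of them is estimated from the data.

```r
# All numbers below are hypothetical and chosen only to show the direction of the bias
true_effect    <- 0.05   # assumed true effect of the EITC expansion on the treated group
common_trend   <- -0.01  # assumed change over time shared by both groups
training_boost <- 0.02   # assumed extra post-period gain for the control group only

treated_change <- common_trend + true_effect
control_change <- common_trend + training_boost

# The DiD estimate subtracts the control change, so the training boost is subtracted too
did_estimate <- treated_change - control_change
bias <- did_estimate - true_effect

print(c(did_estimate = did_estimate, bias = bias))
```

Under these assumptions the estimate falls short of the true effect by exactly the size of the training boost, which matches the conclusion that the estimates may understate the effect.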