Section 1: General regression analysis and the CIA
Question 1 - Reading Data
library(tidyverse)  # provides read_csv(), ggplot(), mutate() and %>%
data <- read_csv("PS2data.csv")
Rows: 710 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, county, gisjoin
dbl (9): pop_enslaved_1860, pop_total_1860, pop_total_2010, pop_black_2010, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Income Gap
legendtitle1 <- "Had Rosenwald School"
ggplot(data, aes(x = income_white_2010 - income_black_2010, fill = factor(had_rosenwald_school))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 50) +
  labs(x = "Income Gap in 2010 (US Dollars)", y = "Count", fill = legendtitle1) +
  scale_fill_manual(values = c("blue", "red"), labels = c("N", "Y"))
# Enslaved Population in 1860
ggplot(data, aes(x = pct_pop_enslaved_1860, fill = factor(had_rosenwald_school))) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 50) +
  labs(x = "Percent of the Population enslaved in 1860", y = "Count", fill = legendtitle1) +
  scale_fill_manual(values = c("blue", "red"), labels = c("N", "Y"))
# Scatter Plot
ggplot(data, aes(x = pct_pop_enslaved_1860, y = income_white_2010 - income_black_2010, color = factor(had_rosenwald_school))) +
  geom_point(alpha = 0.5) +
  labs(x = "Population Enslaved in 1860", y = "County Income Gap in 2010 (US Dollars)", color = legendtitle1) +
  scale_color_manual(values = c("blue", "red"), labels = c("N", "Y")) +
  guides(color = guide_legend(title = legendtitle1))
Question 3 - First Regression
data1.1 <- data %>%
  mutate(Race_Income_Gap = income_white_2010 - income_black_2010)
reg1 <- lm(Race_Income_Gap ~ had_rosenwald_school, data = data1.1)
summary(reg1)
Call:
lm(formula = Race_Income_Gap ~ had_rosenwald_school, data = data1.1)
Residuals:
   Min     1Q Median     3Q    Max
-55160  -4409    911   5719  37520

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
(Intercept)           14537.5      802.8  18.107  < 2e-16 ***
had_rosenwald_school   4323.7      909.7   4.753 2.43e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10060 on 708 degrees of freedom
Multiple R-squared: 0.03092, Adjusted R-squared: 0.02955
F-statistic: 22.59 on 1 and 708 DF, p-value: 2.43e-06
We assume that the location of Rosenwald schools is “as good as random” (that is, whether a given county had a Rosenwald school is as good as randomly assigned), and thus that the CIA is satisfied.
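For reference, the assumption can be written in potential-outcomes notation. The symbols below are my own labels, not taken from the write-up: $Y_{1i}$ and $Y_{0i}$ are county $i$'s income gap with and without a Rosenwald school, $D_i$ is the treatment indicator, and $X_i$ is the set of controls. In the regression above there are no controls, so $X_i$ is empty and the condition reduces to unconditional independence.

```latex
\[
(Y_{1i},\, Y_{0i}) \;\perp\!\!\!\perp\; D_i \mid X_i
\]
```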
Question 4 - State Fixed Effects
library(fixest)  # provides feols(), feglm() and etable()
state_fe_reg1 <- feols(Race_Income_Gap ~ had_rosenwald_school | state, data = data1.1)
etable(state_fe_reg1)
Including the state fixed effects decreased our standard errors by almost 40% and our coefficient on Rosenwald schools by almost 50%. This indicates that, without these controls, the estimate almost certainly suffered from omitted variable bias.
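A small simulation can illustrate the mechanism. This is only a sketch on made-up data, not PS2data.csv: every variable name and number here is hypothetical, and base R `lm()` with state dummies stands in for `feols()`. A state-level factor raises both the chance of treatment and the outcome, so leaving it out inflates the treatment coefficient.

```r
# Simulated data (hypothetical, not the problem-set data)
set.seed(42)
n_states <- 10
n <- 2000
state <- sample(seq_len(n_states), n, replace = TRUE)

# State-level confounder, evenly spaced so the example is stable
state_effect <- seq(-2, 2, length.out = n_states)

# Treatment is more likely in states with a high state effect
d <- rbinom(n, 1, plogis(state_effect[state]))

# The true treatment effect is 1; the state effect also shifts the outcome
y <- 1 * d + 2 * state_effect[state] + rnorm(n)

naive <- lm(y ~ d)                  # omits state, so d picks up the state effect
fe    <- lm(y ~ d + factor(state))  # state dummies absorb the confounder

coef(naive)["d"]  # well above the true effect of 1
coef(fe)["d"]     # close to the true effect of 1
```

The drop in the coefficient once state dummies are added mirrors the pattern described above, although the simulation says nothing about the size of the bias in the actual data.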
Question 7 - DAG
The DAG is hopefully attached on Canvas. The CIA is likely not valid because not all of the relevant controls have been included, which means we likely have omitted variable bias.
Question 8 - Bad Controls?
If Rosenwald schools affected 2010 population, then 2010 population is itself an outcome of our treatment variable, and controlling for it would be a bad control that introduces selection bias. If Rosenwald schools did not affect 2010 population, then it is more than likely fine to control for it.
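The bad-control problem can be sketched with simulated data. Everything below is hypothetical: `m` plays the role of a post-treatment variable such as 2010 population, `u` is an unobserved factor, and the true treatment effect is set to 2. Treatment is randomly assigned here, so the regression without `m` is fine; adding `m` is what breaks it.

```r
# Simulated data (hypothetical, not the problem-set data)
set.seed(1)
n <- 5000
d <- rbinom(n, 1, 0.5)          # treatment, randomly assigned in this sketch
u <- rnorm(n)                   # unobserved factor affecting m and y
m <- d + u + rnorm(n)           # post-treatment variable, affected by d and u
y <- 2 * d + 3 * u + rnorm(n)   # outcome; the true effect of d is 2

good <- lm(y ~ d)       # no bad control: estimate is close to 2
bad  <- lm(y ~ d + m)   # conditioning on m makes d and u correlated

coef(good)["d"]
coef(bad)["d"]          # pulled far away from 2
```

Holding `m` fixed, treated units must have lower `u` on average than untreated units, which is the selection bias the answer refers to.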
Earlier, when we regressed the race income gap on our indicator for whether a county had a Rosenwald school, the estimate was approximately 4324, indicating that counties with a Rosenwald school had a race income gap just over 4300 dollars larger. Our calculated ATE, however, tells us that the difference is only 102 dollars. The ATE is thus roughly 1/42, or about 2.4%, of the regression estimate. Both estimates seem poor because their assumptions are likely violated: the first because we have omitted variable bias, and the second because I’m not sure we have sufficient overlap. Still, I would trust the ATE more, since matching methods are typically more accurate than a fixed effects method.
Question 12 - ATE or not?
This estimates the ATE: we computed the individual treatment effects and then took their average, which yields the ATE.
Section 3: Propensity-Score Methods
Question 13 - Estimating Propensity Scores
saved_reg <- feglm(
  had_rosenwald_school ~ I((pct_pop_enslaved_1860)^2) + I((pop_total_1860)^2) +
    pct_pop_enslaved_1860 + pop_total_1860 + pct_pop_enslaved_1860:pop_total_1860,
  data = data, family = "logit"
)
propensityscore <- saved_reg$fitted.values
data1.2 <- data1.1 %>% mutate(propensityscore)
Question 14 - Regression with controls and estimated propensity scores
Compared to our regression in question 5, our coefficient on Rosenwald schools decreases by about 600, and the coefficient on the percent of the population enslaved in 1860 decreases by about 35.
Question 15 - Overlap Figure
ggplot(data = data1.2, aes(x = propensityscore, fill = factor(had_rosenwald_school))) +
  geom_density(alpha = 0.5, position = "identity") +
  scale_fill_manual(values = c("blue", "red"), labels = c("N", "Y"), name = "Had Rosenwald School")
Question 16 - Percentage of non-supported overlap
6.1% of observations have propensity scores that are not supported by overlap. From question 18, we see that 667 of the 710 observations are supported by overlap, which implies that 43/710 ≈ 6.1% are not.
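One way to compute such a share is sketched below. The write-up does not show its rule for "supported by overlap", so this assumes a common min/max definition: a score is supported if it lies between the larger of the two group minima and the smaller of the two group maxima. The function name `share_unsupported` and the toy scores are hypothetical.

```r
# Share of observations whose propensity score lies outside the common support.
# Assumed rule: support runs from the larger group minimum to the smaller group maximum.
share_unsupported <- function(pscore, treated) {
  lower <- max(min(pscore[treated == 1]), min(pscore[treated == 0]))
  upper <- min(max(pscore[treated == 1]), max(pscore[treated == 0]))
  mean(pscore < lower | pscore > upper)
}

# Toy example with made-up scores (not the problem-set data)
ps <- c(0.05, 0.20, 0.40, 0.60, 0.30, 0.50, 0.80, 0.95)
d  <- c(0,    0,    0,    0,    1,    1,    1,    1)
share_unsupported(ps, d)

# Arithmetic from the answer: 667 of 710 observations are supported
1 - 667 / 710
```

With the real data, the call would presumably be `share_unsupported(data1.2$propensityscore, data1.2$had_rosenwald_school)`, assuming the same support rule was used.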
Question 17 - Why is overlap important?
We care about overlap because our assumptions require that every county has some probability of being treated and some probability of being untreated. Overlap is needed alongside the CIA, and it is what allows us to find matches.
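Stated formally, with notation that is my own rather than the write-up's ($D_i$ is the Rosenwald indicator and $X_i$ the covariates used in the propensity-score model), the overlap condition says that the propensity score is strictly between zero and one for every county:

```latex
\[
0 < \Pr(D_i = 1 \mid X_i) < 1
\]
```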
Enforcing overlap changed our coefficient on Rosenwald schools by only 4. It also changed the number of observations and the coefficient on the propensity score, although the number of observations barely changed. Still, as stated in question 17, enforcing overlap is important.
Block 1 has a treatment effect of -6248.400996, n=25
Block 2 has a treatment effect of 4740.73301, n=29
Block 3 has a treatment effect of -40.670933, n=36
Block 4 has a treatment effect of -5330.115931, n=66
Block 5 has a treatment effect of 4501.954113, n=114
Block 6 has a treatment effect of 179.685890, n=206
Block 7 has a treatment effect of 2780.196318, n=182
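Block-level effects like these are often combined into a single estimate by weighting each block by its number of observations. The write-up does not say how (or whether) the blocks were aggregated, so the sketch below is only one possible approach. The vector names are hypothetical; the numbers are copied from the block results above.

```r
# Block treatment effects and block sizes, copied from the results above
block_te <- c(-6248.400996, 4740.73301, -40.670933, -5330.115931,
              4501.954113, 179.685890, 2780.196318)
block_n  <- c(25, 29, 36, 66, 114, 206, 182)

# Size-weighted average of the block effects: sum(n_b * te_b) / sum(n_b)
ate_blocked <- weighted.mean(block_te, w = block_n)
ate_blocked

sum(block_n)  # total number of observations across the seven blocks
```

Because the two largest blocks have positive effects, the weighted average is positive even though three of the seven block estimates are negative.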