PS2

Author

Andrew Grimoldby

Section 1: General regression analysis and the CIA

Question 1 - Reading Data

data = read_csv("PS2data.csv")
Rows: 710 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): state, county, gisjoin
dbl (9): pop_enslaved_1860, pop_total_1860, pop_total_2010, pop_black_2010, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data
# A tibble: 710 × 12
   state  county gisjoin pop_e…¹ pop_t…² pop_t…³ pop_b…⁴ pop_w…⁵ incom…⁶ incom…⁷
   <chr>  <chr>  <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
 1 Alaba… Autau… G01000…    9607   16739   54571    9643   42855   28082   58592
 2 Alaba… Baldw… G01000…    3714    7530  182265   17105  156153   28579   52455
 3 Alaba… Barbo… G01000…   16150   30812   27457   12875   13180   21101   42570
 4 Alaba… Bibb   G01000…    3842   11894   22915    5047   17381   21153   43874
 5 Alaba… Blount G01000…     666   10865   57322     761   53068   21070   46124
 6 Alaba… Butler G01001…    6818   18122   20947    9095   11399   21309   38202
 7 Alaba… Calho… G01001…    4342   21539  118572   24382   88840   22163   42775
 8 Alaba… Chamb… G01001…   11849   23214   34215   13257   20112   23236   38009
 9 Alaba… Chero… G01001…    3002   18360   25989    1208   24081   19375   41102
10 Alaba… Choct… G01002…    7094   13877   13859    6012    7731   19985   40032
# … with 700 more rows, 2 more variables: pct_pop_enslaved_1860 <dbl>,
#   had_rosenwald_school <dbl>, and abbreviated variable names
#   ¹​pop_enslaved_1860, ²​pop_total_1860, ³​pop_total_2010, ⁴​pop_black_2010,
#   ⁵​pop_white_2010, ⁶​income_black_2010, ⁷​income_white_2010

Question 2 - Histogram and Plots

#Income Gap
legendtitle1 <- "Had Rosenwald School"

ggplot(data, aes(x = income_white_2010 - income_black_2010, fill=factor(had_rosenwald_school)))+
  geom_histogram(position="identity", alpha=0.5,bins=50)+
  labs(x='Income Gap in 2010 (US Dollars)', y='Count', fill = legendtitle1)+
  scale_fill_manual(values=c("blue","red"),
                    labels = c("N","Y"))

#Enslaved Population in 1860
ggplot(data, aes(x = pct_pop_enslaved_1860, fill = factor(had_rosenwald_school))) +
  geom_histogram(position = "identity", alpha= 0.5, bins=50) +
  labs(x='Percent of the Population enslaved in 1860', y='Count', fill = legendtitle1)+
  scale_fill_manual(values=c("blue","red"),
                    labels = c("N","Y"))

#Scatter Plot
ggplot(data, aes(x = pct_pop_enslaved_1860, y = income_white_2010 - income_black_2010, color = factor(had_rosenwald_school))) +
  geom_point(alpha=0.5)+
  labs(x='Population Enslaved in 1860', y='County Income Gap in 2010 (US Dollars)', color = legendtitle1)+
  scale_color_manual(values=c("blue","red"),
                    labels = c("N","Y")) +
  guides(color = guide_legend(title = legendtitle1))

Question 3-First Regression

data1.1 <- data %>% mutate(Race_Income_Gap = income_white_2010 - income_black_2010)
reg1 <- lm(Race_Income_Gap ~ had_rosenwald_school, data = data1.1)

summary(reg1)

Call:
lm(formula = Race_Income_Gap ~ had_rosenwald_school, data = data1.1)

Residuals:
   Min     1Q Median     3Q    Max 
-55160  -4409    911   5719  37520 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)           14537.5      802.8  18.107  < 2e-16 ***
had_rosenwald_school   4323.7      909.7   4.753 2.43e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10060 on 708 degrees of freedom
Multiple R-squared:  0.03092,   Adjusted R-squared:  0.02955 
F-statistic: 22.59 on 1 and 708 DF,  p-value: 2.43e-06

We assume that the schools location within the county is “as good as random” (meaning that where a Rosenwald school is in a county is as good as random), and thus satisfies the CIA.

Question 4 - State Fixed Effects

state_fe_reg1 <- feols(Race_Income_Gap ~ had_rosenwald_school | state, data = data1.1)
etable(state_fe_reg1)
                           state_fe_reg1
Dependent Var.:          Race_Income_Gap
                                        
had_rosenwald_school 4,285.8** (1,087.0)
Fixed-Effects:       -------------------
state                                Yes
____________________ ___________________
S.E.: Clustered                by: state
Observations                         710
R2                               0.09778
Within R2                        0.02938
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After adding State FE, we now need the schools location within the state to be as good as random.

Question 5-Controls

state_fe_controls_reg <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = data1.1)
etable(state_fe_controls_reg)
                      state_fe_contro..
Dependent Var.:         Race_Income_Gap
                                       
had_rosenwald_school  2,252.9** (669.1)
pct_pop_enslaved_1860  129.1*** (20.06)
pop_total_1860         0.1062* (0.0409)
Fixed-Effects:        -----------------
state                               Yes
_____________________ _________________
S.E.: Clustered               by: state
Observations                        710
R2                              0.16186
Within R2                       0.09831
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Question 6 - Bias?

etable(state_fe_reg1,state_fe_controls_reg)
                            state_fe_reg1 state_fe_contro..
Dependent Var.:           Race_Income_Gap   Race_Income_Gap
                                                           
had_rosenwald_school  4,285.8** (1,087.0) 2,252.9** (669.1)
pct_pop_enslaved_1860                      129.1*** (20.06)
pop_total_1860                             0.1062* (0.0409)
Fixed-Effects:        ------------------- -----------------
state                                 Yes               Yes
_____________________ ___________________ _________________
S.E.: Clustered                 by: state         by: state
Observations                          710               710
R2                                0.09778           0.16186
Within R2                         0.02938           0.09831
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

By including these controls our standard errors decreased by almost 40%, furthermore, our coefficient on the Rosenwald schools decreased by almost 50%. Indicating, that by not including the control variables, there was almost certainly omitted variable bias present.

Question 7 - DAG

DAG is hopefully attached on Canvas… The CIA is likely not valid as all the controls have not been specified. Meaning we likely have omitted variable bias.

Question 8 - Bad Controls?

If Rosenwald schools are affecting 2010 population then controlling for it would be bad, and introduces selection bias. Controlling for this would be bad because it would mean that our 2010 population is affected by our treatment variable, Rosenwald schools. If Rosenwald schools did not affect 2010 population, then it is more than likely fine to control for it.

Section 2: Matching

Question 9 - Mahalanobis Distance

treateddata <- filter(data1.1, had_rosenwald_school ==1)

distance <- mahalanobis_dist(had_rosenwald_school ~ pct_pop_enslaved_1860 + pop_total_1860, data = data1.1)


nearest_neighbor_distance <- numeric(nrow(treateddata))
nearest_neighbor_county <- character(nrow(treateddata))
nearest_neighbor_state <- character(nrow(treateddata))

for (i in 1:nrow(treateddata)) {
  treated_dist <- distance[i, ]
  untreated_dist <- treated_dist[data1.1$had_rosenwald_school == 0]
  min_dist <- min(untreated_dist, na.rm = TRUE)
  min_index <- which(untreated_dist == min_dist)
  nearest_neighbor_distance[i] <- min_dist
  nearest_neighbor_county[i] <- data1.1$county[data1.1$had_rosenwald_school == 0][min_index]
  nearest_neighbor_state[i] <- data1.1$state[data1.1$had_rosenwald_school == 0][min_index]
}

nearest_neighbor_data <- data.frame(
  County = treateddata$county,
  State = treateddata$state,
  Distance = nearest_neighbor_distance,
  Nearest_Neighbor_County = nearest_neighbor_county,
  Nearest_Neighbor_State = nearest_neighbor_state
)

Question 10 - Mean Difference

treatment_effect <- numeric(nrow(treateddata))

for(i in 1:nrow(treateddata)) {
  treated_county <- treateddata$county[i]
  treated_state <- treateddata$state[i]
  nearest_control_county <- nearest_neighbor_data$Nearest_Neighbor_County[i]
  nearest_control_state <- nearest_neighbor_data$Nearest_Neighbor_State[i]
  treated_outcome <- data1.1$Race_Income_Gap[data1.1$county == treated_county & data1.1$state == treated_state]
  control_outcome <- data1.1$Race_Income_Gap[data1.1$county == nearest_control_county & data1.1$state == nearest_control_state]
  
  treatment_effect[i] <- treated_outcome - control_outcome
}

treatment_effect_data <- data.frame(
  County = treateddata$county,
  State = treateddata$state,
  Treatment_Effect = treatment_effect
)

ATE <- mean(treatment_effect_data$Treatment_Effect)
print(ATE)
[1] 102.0217

Question 11 - Trusting Estimates, ATE

Earlier, when we regressed the Race Income Gap on our indicator variable, whether or not a Rosenwald school was in the county, that estimate was approximately 4285. Indicating that when a Rosenwald school was present, the race income gap increased by almost 4300 dollars. Although, our calculated ATE tells us that the difference is only 102 dollars. Thus the ATE is nearly 1/43 or 2.3% of what our regression was. Both estimates seem poor due to assumptions likely being violated. The first because we have omitted variable bias, and the latter because I’m not sure if we have sufficient overlap. Although, I would trust the ATE more, due to the fact that matching methods are typically more accurate than a fixed effects method.

Question 12 - ATE or not?

This is estimating the ATE, we took the treatment effect and then took the average of the treatment effect. Thus, yielding the ATE.

Section 3: Propensity-Score Methods

Question 13 Estimating Propensity Scores

saved_reg <-feglm(had_rosenwald_school ~ I((pct_pop_enslaved_1860)^2) + I((pop_total_1860)^2) + pct_pop_enslaved_1860 + pop_total_1860 + pct_pop_enslaved_1860:pop_total_1860, data = data, family = 'logit')

propensityscore<-saved_reg$fitted.values

data1.2 <- data1.1%>% mutate(propensityscore)

Question 14 Regression with controls and estimated propensity scores

state_fe_propensity_controls_reg <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 + propensityscore | state, data = data1.2)
etable(state_fe_propensity_controls_reg,state_fe_controls_reg)
                      state_fe_propen.. state_fe_contro..
Dependent Var.:         Race_Income_Gap   Race_Income_Gap
                                                         
had_rosenwald_school    1,626.8 (928.2) 2,252.9** (669.1)
pct_pop_enslaved_1860   95.69** (23.67)  129.1*** (20.06)
pop_total_1860         0.0619. (0.0317)  0.1062* (0.0409)
propensityscore       7,464.5 (4,561.5)                  
Fixed-Effects:        ----------------- -----------------
state                               Yes               Yes
_____________________ _________________ _________________
S.E.: Clustered               by: state         by: state
Observations                        710               710
R2                              0.16806           0.16186
Within R2                       0.10498           0.09831
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparing to our regression in question 5, our coefficient on Rosenwald school decreases by about 600, and the coefficient on the percent of the population enslaved in 1860 decreases by about 35.

Question 15 - Overlap Figure

ggplot(data = data1.2, aes(x = propensityscore, fill = factor(had_rosenwald_school)))+
         geom_density(alpha=0.5, position = 'identity')+
  scale_fill_manual(values=c("blue","red"), labels=c("N","Y"), name="Had Rosenwald School")

Question 16 - Percentage of non-supported overlap

6.1% of the propensity score is not supported by overlap. From question 18, we see that 667/710 are supported by overlap, which implies 6.1% are not.

Question 17 - Why is overlap important?

We care about overlap because for our assumptions, there must be some likelihood (probability) that counties are either treated or untreated. This enforces our CIA, as well as allows for matches.

Question 18 - Regression with overlap

Highest_control_row <- data1.2 %>% filter(had_rosenwald_school == 0) %>%
  filter(propensityscore ==max(propensityscore, na.rm = TRUE))
Highest_control <- Highest_control_row$propensityscore
Lowest_Treated_row <- data1.2 %>% filter(had_rosenwald_school == 1) %>%
  filter(propensityscore ==min(propensityscore, na.rm = TRUE))
Lowest_treated <- Lowest_Treated_row$propensityscore

data1.3 <- filter(data1.2, data1.2$had_rosenwald_school == 0 &
                  data1.2$propensityscore>Lowest_treated)

data1.4 <- filter(data1.2, data1.2$had_rosenwald_school == 1 &
                    data1.2$propensityscore<Highest_control)

data1.5 <- rbind(data1.3,data1.4)

overlap_enforced <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 + propensityscore | state, data = data1.5)
etable(overlap_enforced,state_fe_propensity_controls_reg)
                       overlap_enforced state_fe_propen..
Dependent Var.:         Race_Income_Gap   Race_Income_Gap
                                                         
had_rosenwald_school    1,622.6 (920.9)   1,626.8 (928.2)
pct_pop_enslaved_1860   95.13** (23.88)   95.69** (23.67)
pop_total_1860          0.0024 (0.1019)  0.0619. (0.0317)
propensityscore       8,365.2 (5,651.1) 7,464.5 (4,561.5)
Fixed-Effects:        ----------------- -----------------
state                               Yes               Yes
_____________________ _________________ _________________
S.E.: Clustered               by: state         by: state
Observations                        667               710
R2                              0.16321           0.16806
Within R2                       0.09459           0.10498
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Enforcing overlap changed our coefficient on Rosenwald Schools by 4, and obviously changed the observations and the coefficient on propensity score. Although, enforcing overlap barely changed our observations. Though, as stated in question 17, enforcing overlap is important.

Question 19 Inverse Propensity Scores

weights <- data1.5$had_rosenwald_school / data1.5$propensityscore + (1-data1.5$had_rosenwald_school) / (1 - data1.5$propensityscore)
data1.6 <- data1.5 %>% mutate(weights)

weights_reg <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, weights = ~ weights, data = data1.6)
summary(weights_reg)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 667 
Weights: weights 
Fixed-effects: state: 10
Standard-errors: Clustered (state) 
                         Estimate Std. Error  t value  Pr(>|t|)    
had_rosenwald_school  1966.393786 941.608353 2.088335 0.0663623 .  
pct_pop_enslaved_1860  147.512138  34.882663 4.228809 0.0022105 ** 
pop_total_1860           0.102588   0.149688 0.685343 0.5103924    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 13,520.4     Adj. R2: 0.160007
                 Within R2: 0.091519

Question 20 - Doubly Robust

Block1 <- filter(data1.6, propensityscore >= 0.3 & propensityscore< 0.4)
Block2 <- filter(data1.6, propensityscore >= 0.4 & propensityscore < 0.5)
Block3 <- filter(data1.6, propensityscore >=0.5 &propensityscore <0.6)
Block4 <- filter(data1.6, propensityscore >=0.6 & propensityscore <0.7)
Block5 <- filter(data1.6, propensityscore >=0.7 & propensityscore <0.8)
Block6 <- filter(data1.6, propensityscore >=0.8 & propensityscore <0.9)
Block7 <- filter(data1.6, propensityscore >=0.9 & propensityscore <1)

Block_reg1 <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = Block1)
Block_reg2 <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = Block2)
Block_reg3 <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = Block3)
Block_reg4 <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = Block4)
Block_reg5 <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = Block5)
Block_reg6 <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = Block6)
Block_reg7 <- feols(Race_Income_Gap ~ had_rosenwald_school + pct_pop_enslaved_1860 + pop_total_1860 | state, data = Block7)

summary(Block_reg1)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 25 
Fixed-effects: state: 6
Standard-errors: Clustered (state) 
                          Estimate  Std. Error   t value Pr(>|t|) 
had_rosenwald_school  -6248.400996 11099.83911 -0.562927  0.59779 
pct_pop_enslaved_1860   687.713490  4104.73103  0.167542  0.87351 
pop_total_1860           -0.888914     6.40672 -0.138747  0.89506 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 14,324.7     Adj. R2: 0.013423
                 Within R2: 0.032047
summary(Block_reg2)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 29 
Fixed-effects: state: 7
Standard-errors: Clustered (state) 
                        Estimate Std. Error  t value Pr(>|t|) 
had_rosenwald_school  4740.73301 6545.11352 0.724316  0.49613 
pct_pop_enslaved_1860  792.19534  851.29469 0.930577  0.38798 
pop_total_1860           1.96354    3.08431 0.636621  0.54787 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 13,581.6     Adj. R2: -0.180637
                 Within R2:  0.039202
summary(Block_reg3)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 36 
Fixed-effects: state: 8
Standard-errors: Clustered (state) 
                        Estimate Std. Error   t value Pr(>|t|) 
had_rosenwald_school  -40.670933 2733.71099 -0.014878  0.98855 
pct_pop_enslaved_1860 386.741272  743.85426  0.519915  0.61915 
pop_total_1860          0.828857    1.51212  0.548143  0.60063 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 7,241.5     Adj. R2: -0.015766
                Within R2:  0.005761
summary(Block_reg4)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 66 
Fixed-effects: state: 9
Standard-errors: Clustered (state) 
                          Estimate  Std. Error   t value Pr(>|t|) 
had_rosenwald_school  -5330.115931 3980.085955 -1.339196  0.21731 
pct_pop_enslaved_1860    58.973149   41.804374  1.410693  0.19602 
pop_total_1860           -0.758247    0.922231 -0.822188  0.43479 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 9,286.6     Adj. R2: 0.135543
                Within R2: 0.078673
summary(Block_reg5)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 114 
Fixed-effects: state: 9
Standard-errors: Clustered (state) 
                         Estimate Std. Error   t value Pr(>|t|)    
had_rosenwald_school  4501.954113 940.199446  4.788297 0.001376 ** 
pct_pop_enslaved_1860   69.912863  48.888208  1.430056 0.190572    
pop_total_1860          -0.204279   0.305835 -0.667939 0.522971    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8,066.4     Adj. R2: 0.159718
                Within R2: 0.083011
summary(Block_reg6)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 206 
Fixed-effects: state: 10
Standard-errors: Clustered (state) 
                        Estimate  Std. Error   t value Pr(>|t|) 
had_rosenwald_school  179.685890 1251.920876  0.143528  0.88904 
pct_pop_enslaved_1860  54.834977   55.612079  0.986026  0.34988 
pop_total_1860         -0.202993    0.158264 -1.282625  0.23167 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 8,099.2     Adj. R2: 0.06064 
                Within R2: 0.013724
summary(Block_reg7)
OLS estimation, Dep. Var.: Race_Income_Gap
Observations: 182 
Fixed-effects: state: 9
Standard-errors: Clustered (state) 
                         Estimate  Std. Error  t value Pr(>|t|)    
had_rosenwald_school  2780.196318 4278.497650 0.649807 0.534027    
pct_pop_enslaved_1860  123.299821   44.684967 2.759313 0.024701 *  
pop_total_1860           0.239697    0.115009 2.084152 0.070658 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 6,069.3     Adj. R2: 0.149358
                Within R2: 0.123077
Coefficients_Blocks <- c(-6248.400996,4740.73301,-40.670933,-5330.115931,4501.954113,179.685890,2780.196318)
weight_blocks <- c(25,29,36,66,114,206,182)
ATE_Blocks <- weighted.mean(Coefficients_Blocks,weight_blocks)
ATE_Blocks
[1] 1039.899

Block 1 has a treatment effect of -6248.400996, n=25 Block 2 has a treatment effect of 4740.73301, n=29 Block 3 has a treatment effect of -40.670933, n=36 Block 4 has a treatment effect of -5330.115931, n=66 Block 5 has a treatment effect of 4501.954113, n=114 Block 6 has a treatment effect of 179.685890, n=206 Block 7 has a treatment effect of 2780.196318, n=182

The average treatment effect is 1039.899