IV + RD, Robust

1.1 Load & Explore

Below I load the four dta files, assuming I’ll be merging them at some stage or will at least need portions of each at some point. Using the view command, the descriptions for the various items of data are very useful. After a skim of the research linked to this dataset, it appears that the hazardous nature of the sites are indicated by the “Mile” dataset’s ‘hrs_82’.

Sitecov is 1982 HRS sites w/ nonmissing house price data, 306 observations of which that have 1982 HRS above 28.5. Mile seems to be a similar dataset but drops 4 observations.

setwd("C:/Users/phili/Desktop/Econometrics III/coremetrics3/disc_sec/coremetrics3_discsec/Problem Sets")

sitecov <- read_dta("sitecovariates.dta")
allsites <- read_dta("allsites.dta")
allcov <- read_dta("allcovariates.dta")
mile <- read_dta("2miledata.dta")

#Found glimpse was a bit too crude to look at.
#head(sitecov)
#head(allsites)
#head(allcov)
#head(mile)

‘hrs_82’ and it’s visualized relationship with various measures available to us is of immediate interest. Introducing these measures into a scatterplot their respective median prices in both 1990(mdvalhs9) and 2000(mdvalhs0), with both looking to have quite a weak relationship.

HRS_HP90 <- ggplot(data = mile, aes(x=hrs_82, y=mdvalhs9)) + 
             geom_point(color="blue", alpha=0.5)

HRS_HP00 <- ggplot(data = mile, aes(x=hrs_82, y=mdvalhs0)) +
             geom_point(color="blue", alpha=0.5)
HRS_HP90

HRS_HP00

#re78order <- sort(nsw$re78)

#df <- data.frame(x = re75order, y =re78order)

#ggplot(data = df, size = large) + 
#  geom_line(aes(x=1:nrow(df), y =x), color = "purple", size = 0.5) + 
#  geom_line(aes(x=1:nrow(df), y =y), color = "red", size = 0.5)+
#  ylab("Earnings") +
#  theme_bw()

It does not appear like HRS is correlated strongly with other Mile variables.

mile_Mat <- as_tibble(mile[,-43])

res <- cor(mile_Mat) 

#install.packages("corrplot")
library(corrplot)

## Warning: package 'corrplot' was built under R version 3.6.3

## corrplot 0.84 loaded

corrplot(res, type = "upper", order = "hclust", 
         tl.col = "black", tl.srt = 90, tl.cex=0.6)

1.2 Simple Regression

Houses listed on the National Priorities List prior to 2000 are associated with a 2 per cent lower value, on average, for the given sample. In terms of significance, the finding is highlight significant, suggesting this dummy variable explains little to none of the variation in our dependent variable, median house prices across tracts. The R-squared of 0.00 further highlights this point. Given that only 2.3 per cent of the overall sample is actually listed on this specific document, we may be comparing against too wide a range of the sample whereas a tighter sub-sample closer to some key threshhold region would be more appropriate.

logmedHval <- lm_robust(data=allsites, lnmdvalhs0 ~ npl2000)

export_summs(logmedHval)

	Model 1
(Intercept)	11.72 ***
	(0.00)
npl2000	-0.02
	(0.02)
nobs	42974
r.squared	0.00
adj.r.squared	0.00
statistic	1.53
p.value	0.22
df.residual	42972.00
N	42974.00
se_type	HC2.00
* p < 0.001; p < 0.01; * p < 0.05.

summary(allsites$npl2000)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00000 0.00000 0.00000 0.02292 0.00000 1.00000

1.3 Clustered Standard Errors

Given that only 3 per cent of the overall population of tracts include NPL listings, that places 985 sites on the list. Clustering by state changes the standard error quite dramatically, which is understandable since state plays an important role in the determination of effect. Ongoing state-specific shocks would be limited to specific regions, hence it is understandable that error terms within states would be correlated with one another, being subject to such shocks. As a result, when allowing the model to account for this interaction, the state error jumps up dramatically from 0.02 to 0.05.

library(lfe)

## Warning: package 'lfe' was built under R version 3.6.3

## Loading required package: Matrix

## 
## Attaching package: 'Matrix'

## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack

#Fit a linear model with multiple group fixed effects

FE_logmed <- felm(data=allsites, lnmdvalhs0 ~ npl2000 | 0 | 0 | statefips)

export_summs(FE_logmed)

	Model 1
(Intercept)	11.72 ***
	(0.09)
npl2000	-0.02
	(0.05)
nobs	42974
r.squared	0.00
adj.r.squared	0.00
sigma	0.63
statistic	1.11
p.value	0.29
df	42972.00
df.residual	42972.00
logLik	-40804.46
AIC	81614.92
BIC	81640.93
* p < 0.001; p < 0.01; * p < 0.05.

1.4 Further Regressions

Controlling for 1980 housing values, lnmeanhs8, successive inclusions of housing condition controls lead to a better identification of values in 2000. This is ran with and without fixed effects, using clustering robust standard errors. The initial regression without fixed effects measures an average effect of being on the NPL list of a 4 per cent higher median value in housing,

Using state fixed effects at this points results in a far lower point estimate of 0.00.

#Standard OLS with housing value control using Clustered Robust Standard Errors 

lm_logm_H1 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 | 0 | 0 | statefips )
#lm_logm_H2 <- lm_robust(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + tothsun8)
#lm_logm_H3 <- lm_robust(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + tothsun8 + occupied80)
#lm_logm_H4 <- lm_robust(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + tothsun8 + occupied80 + ownocc8)

export_summs(lm_logm_H1, model.names=c("LM with cluster rob. SE"))

	LM with cluster rob. SE
(Intercept)	2.40 ***
	(0.44)
npl2000	0.04
	(0.02)
lnmeanhs8	0.86 ***
	(0.04)
nobs	42974
r.squared	0.58
adj.r.squared	0.58
sigma	0.41
statistic	29568.13
p.value	0.00
df	42971.00
df.residual	42971.00
logLik	-22208.05
AIC	44424.11
BIC	44458.78
* p < 0.001; p < 0.01; * p < 0.05.

#Fixed Effects with housing value control using Clustered Robust Standard Errors

FE_logm_H1 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 | statefips | 0 | statefips )

export_summs(FE_logm_H1, model.names=c("Fixed Eff with cluster rob. SE"))

	Fixed Eff with cluster rob. SE
npl2000	-0.00
	(0.02)
lnmeanhs8	0.81 ***
	(0.03)
nobs	42974
r.squared	0.70
adj.r.squared	0.70
sigma	0.34
statistic	1906.11
p.value	0.00
df	42921.00
df.residual	50.00
logLik	-15090.70
AIC	30289.41
BIC	30757.50
* p < 0.001; p < 0.01; * p < 0.05.

Also controlling for economic variables, the standard OLS model reports the NPL coefficient rising from 0.06 to 0.08, and is highly significant. I include controls for average household income, poverty rate, unemployment rate, and the percentage of households on welfare assistance.

Applying fixed effects at this point results in the estimated coefficient falling to 0.04 and becoming insignificiant.

#Standard OLS with controls for Economic Conditions using clustered robust standard errors
lm_logm_E1 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 | 0 | 0 | statefips)
lm_logm_E2 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8 | 0 | 0 | statefips)
lm_logm_E3 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 | 0 | 0 | statefips)
lm_logm_E4 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 | 0 | 0 | statefips)
lm_logm_E5 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 | 0 | 0 | statefips)

export_summs(lm_logm_E1, lm_logm_E2, lm_logm_E3, lm_logm_E4, lm_logm_E5, model.names=c("Basic", "+Avg Inc", "+ Poverty Rate", "+Unemployment", "+Welfare"))

	Basic	+Avg Inc	+ Poverty Rate	+Unemployment	+Welfare
(Intercept)	2.40 ***	3.40 ***	3.03 ***	3.05 ***	3.02 ***
	(0.44)	(0.66)	(0.68)	(0.77)	(0.71)
npl2000	0.04	0.05	0.05 *	0.06 *	0.06 **
	(0.02)	(0.03)	(0.02)	(0.02)	(0.02)
lnmeanhs8	0.86 ***	0.74 ***	0.76 ***	0.76 ***	0.77 ***
	(0.04)	(0.07)	(0.07)	(0.07)	(0.07)
avhhin8		0.00 ***	0.00 ***	0.00 ***	0.00 ***
		(0.00)	(0.00)	(0.00)	(0.00)
povrat8			0.47 *	0.48	-0.13
			(0.21)	(0.31)	(0.34)
unemprt8				-0.08	-0.84
				(0.82)	(0.76)
welfare8					1.31 **
					(0.49)
nobs	42974	42974	42974	42974	42974
r.squared	0.58	0.60	0.60	0.60	0.61
adj.r.squared	0.58	0.60	0.60	0.60	0.61
sigma	0.41	0.40	0.39	0.39	0.39
statistic	29568.13	21292.25	16203.28	12963.25	11122.40
p.value	0.00	0.00	0.00	0.00	0.00
df	42971.00	42970.00	42969.00	42968.00	42967.00
df.residual	42971.00	42970.00	42969.00	42968.00	42967.00
logLik	-22208.05	-21232.65	-21044.87	-21043.94	-20664.61
AIC	44424.11	42475.30	42101.74	42101.88	41345.23
BIC	44458.78	42518.64	42153.75	42162.56	41414.58
* p < 0.001; p < 0.01; * p < 0.05.

#Fixed Effects with controls for Economic Conditions using clustered robust standard errors
FE_logm_E1 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 | statefips | 0 | statefips)
FE_logm_E2 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  | statefips | 0 | statefips)
FE_logm_E3 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 | statefips | 0 | statefips)
FE_logm_E4 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 | statefips | 0 | statefips)
FE_logm_E5 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 | statefips | 0 | statefips)

export_summs(FE_logm_E1, FE_logm_E2, FE_logm_E3, FE_logm_E4, FE_logm_E5, model.names=c("Basic", "+Avg Inc", "+ Poverty Rate", "+Unemployment", "+Welfare"))

	Basic	+Avg Inc	+ Poverty Rate	+Unemployment	+Welfare
npl2000	-0.00	0.01	0.01	0.02	0.02
	(0.02)	(0.03)	(0.03)	(0.03)	(0.02)
lnmeanhs8	0.81 ***	0.64 ***	0.67 ***	0.63 ***	0.63 ***
	(0.03)	(0.05)	(0.05)	(0.05)	(0.06)
avhhin8		0.00 ***	0.00 ***	0.00 ***	0.00 ***
		(0.00)	(0.00)	(0.00)	(0.00)
povrat8			0.35	0.73 ***	0.79 ***
			(0.19)	(0.20)	(0.11)
unemprt8				-1.91 ***	-1.85 ***
				(0.16)	(0.26)
welfare8					-0.13
					(0.38)
nobs	42974	42974	42974	42974	42974
r.squared	0.70	0.72	0.72	0.73	0.73
adj.r.squared	0.70	0.72	0.72	0.73	0.73
sigma	0.34	0.33	0.33	0.32	0.32
statistic	1906.11	2105.30	2084.70	2123.85	2086.46
p.value	0.00	0.00	0.00	0.00	0.00
df	42921.00	42920.00	42919.00	42918.00	42917.00
df.residual	50.00	50.00	50.00	50.00	50.00
logLik	-15090.70	-13283.15	-13145.11	-12567.00	-12562.56
AIC	30289.41	26676.30	26402.21	25248.00	25241.13
BIC	30757.50	27153.06	26887.64	25742.10	25743.89
* p < 0.001; p < 0.01; * p < 0.05.

Without fixed effects, and using housing characteristics, demographics and controls for economic status, the regression reports a estimated average increase of 6 per cent in the median value of housing in 2000 as a result of being on the NPL list. This effect is measured as being highly significant too, Including fixed effects drops the value down to an insignificant 0.02.

I include controls for population density, the share of black and hispanic in the population, percentage of those that are under 18 as well as a control for education through a dummy variable for college degree status. Using current controls plus these added demographic controls, the effect becomes significant, with or without fixed effects. Of note here is that now 75 per cent of the overall variation of our dependent variable, median house value, is explained by this version of the model. Initially it only explained 30 percent of variation, so this at a glance seems like a decent improvement in estimation.

Additional controls are availalbe for whether head of household is female and further education related measures, but concerned about including too many controls arbitrarily as this could simply result in an irrelevant variable problem where the standard error becomes too large to infere significance. With controls and fixed effects, the final final regression suggests a shift to the NPL list results in an average treatment effect of 8 per cent on median property values in 2000.

#Standard OLS with controls for Economic Conditions & Demographics using wClustered Robuster Standard Errors
lm_logm_E5 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8  + avhhin8  + povrat8 + unemprt8 + welfare8)
lm_logm_ED2 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8  + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 | 0| 0 | statefips)
lm_logm_ED3 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8  + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8 | 0| 0 | statefips)
lm_logm_ED4 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8  + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 | 0| 0 | statefips)
lm_logm_ED5 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8  + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 | 0| 0 | statefips)
#lm_logm_ED6 <- lm_robust(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8  + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8)
lm_logm_ED7 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 + ba_or_better8 | 0| 0 | statefips)

export_summs(lm_logm_E5, lm_logm_ED2, lm_logm_ED3, lm_logm_ED4, lm_logm_ED5, lm_logm_ED7, model.names=c("Economic BK", "+Pop Density", "+% Black", "+% Hispanic", "+Under 18", "+% With BA"))

	Economic BK	+Pop Density	+% Black	+% Hispanic	+Under 18	+% With BA
(Intercept)	3.02 ***	3.33 ***	3.42 ***	3.40 ***	3.72 ***	4.01 ***
	(0.05)	(0.77)	(0.74)	(0.76)	(0.68)	(0.72)
npl2000	0.06 ***	0.12 ***	0.11 ***	0.11 ***	0.12 ***	0.13 ***
	(0.01)	(0.02)	(0.02)	(0.02)	(0.02)	(0.02)
lnmeanhs8	0.77 ***	0.73 ***	0.72 ***	0.72 ***	0.71 ***	0.68 ***
	(0.00)	(0.07)	(0.07)	(0.07)	(0.07)	(0.07)
avhhin8	0.00 ***	0.00 ***	0.00 ***	0.00 ***	0.00 ***	0.00 ***
	(0.00)	(0.00)	(0.00)	(0.00)	(0.00)	(0.00)
povrat8	-0.13 ***	-0.20	-0.06	-0.02	-0.16	-0.35
	(0.03)	(0.30)	(0.31)	(0.21)	(0.24)	(0.24)
unemprt8	-0.84 ***	-0.70	-0.67	-0.69	-0.58	-0.44
	(0.06)	(0.66)	(0.65)	(0.60)	(0.59)	(0.57)
welfare8	1.31 ***	0.67	0.96 *	1.00 *	1.22 **	1.38 **
	(0.05)	(0.39)	(0.38)	(0.44)	(0.47)	(0.49)
pop_den8		0.00 ***	0.00 ***	0.00 ***	0.00 ***	0.00 ***
		(0.00)	(0.00)	(0.00)	(0.00)	(0.00)
shrblk8			-0.23 ***	-0.25 **	-0.20 *	-0.21 *
			(0.06)	(0.09)	(0.09)	(0.09)
shrhsp8				-0.08	0.03	0.06
				(0.22)	(0.21)	(0.22)
child8					-0.74 ***	-0.61 **
					(0.19)	(0.19)
ba_or_better8						0.38 **
						(0.14)
nobs	42974	42974	42974	42974	42974	42974
r.squared	0.61	0.65	0.66	0.66	0.66	0.67
adj.r.squared	0.61	0.65	0.66	0.66	0.66	0.67
sigma	0.39	0.37	0.37	0.37	0.36	0.36
statistic	11122.40	11522.38	10273.83	9141.66	8448.16	7760.40
p.value	0.00	0.00	0.00	0.00	0.00	0.00
df	42967.00	42966.00	42965.00	42964.00	42963.00	42962.00
df.residual	42967.00	42966.00	42965.00	42964.00	42963.00	42962.00
logLik	-20664.61	-18097.02	-17831.71	-17816.91	-17441.29	-17292.64
AIC	41345.23	36212.04	35683.42	35655.83	34906.59	34611.28
BIC	41414.58	36290.06	35770.10	35751.18	35010.61	34723.97
* p < 0.001; p < 0.01; * p < 0.05.

#Fixed Effects with controls for Economic Conditions & Demographics using Clustered Robuster Standard Errors
FE_logm_E5 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 | statefips | 0 | statefips)
FE_logm_ED2 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 | statefips | 0 | statefips)
FE_logm_ED3 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8 | statefips | 0 | statefips)
FE_logm_ED4 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 | statefips | 0 | statefips)
FE_logm_ED5 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 | statefips | 0 | statefips)
#FE_logm_ED6 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 | statefips | 0 | statefips)
FE_logm_ED7 <- felm(data=allsites, lnmdvalhs0 ~ npl2000 + lnmeanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 + ba_or_better8 | statefips | 0 | statefips)

  export_summs(FE_logm_E5, FE_logm_ED2, FE_logm_ED3, FE_logm_ED4, FE_logm_ED5, FE_logm_ED7, model.names=c("Economic BK", "+Pop Density", "+% Black", "+% Hispanic", "+Under 18", "+% With BA"))

	Economic BK	+Pop Density	+% Black	+% Hispanic	+Under 18	+% With BA
npl2000	0.02	0.06 **	0.06 **	0.06 **	0.07 **	0.08 **
	(0.02)	(0.02)	(0.02)	(0.02)	(0.02)	(0.02)
lnmeanhs8	0.63 ***	0.60 ***	0.60 ***	0.60 ***	0.59 ***	0.54 ***
	(0.06)	(0.04)	(0.04)	(0.04)	(0.04)	(0.04)
avhhin8	0.00 ***	0.00 ***	0.00 ***	0.00 ***	0.00 ***	0.00 ***
	(0.00)	(0.00)	(0.00)	(0.00)	(0.00)	(0.00)
povrat8	0.79 ***	0.63 ***	0.67 ***	0.67 ***	0.54 ***	0.32 **
	(0.11)	(0.08)	(0.09)	(0.09)	(0.10)	(0.10)
unemprt8	-1.85 ***	-1.64 ***	-1.60 ***	-1.60 ***	-1.49 ***	-1.36 ***
	(0.26)	(0.20)	(0.20)	(0.19)	(0.18)	(0.21)
welfare8	-0.13	-0.42	-0.25	-0.25	-0.05	0.13
	(0.38)	(0.27)	(0.25)	(0.23)	(0.24)	(0.23)
pop_den8		0.00 ***	0.00 ***	0.00 ***	0.00 ***	0.00 ***
		(0.00)	(0.00)	(0.00)	(0.00)	(0.00)
shrblk8			-0.11 *	-0.11 *	-0.07	-0.09
			(0.05)	(0.05)	(0.06)	(0.06)
shrhsp8				-0.00	0.09	0.11
				(0.11)	(0.11)	(0.10)
child8					-0.53 ***	-0.34 **
					(0.13)	(0.12)
ba_or_better8						0.51 ***
						(0.14)
nobs	42974	42974	42974	42974	42974	42974
r.squared	0.73	0.75	0.75	0.75	0.75	0.75
adj.r.squared	0.73	0.75	0.75	0.75	0.75	0.75
sigma	0.32	0.31	0.31	0.31	0.31	0.31
statistic	2086.46	2221.20	2192.82	2155.61	2151.57	2159.91
p.value	0.00	0.00	0.00	0.00	0.00	0.00
df	42917.00	42916.00	42915.00	42914.00	42913.00	42912.00
df.residual	50.00	50.00	50.00	50.00	50.00	50.00
logLik	-12562.56	-11287.21	-11214.03	-11214.00	-10973.29	-10643.12
AIC	25241.13	22692.41	22548.05	22550.00	22070.58	21412.25
BIC	25743.89	23203.85	23068.15	23078.77	22608.02	21958.35
* p < 0.001; p < 0.01; * p < 0.05.

1.5 Conditions of Unbiasedness

This would require that either the assignment of tracts to the NPL list was randomly assigned, which would likely serve to make the point of the list in the first place meaningless, or, more appropriately for this case, we have implemented the Conditional Independence Assumption (CIA).

Under CIA, supposing we have controlled for all the relevant variables that would otherwise contribte towards omitted variable bias, then we can interpret the coefficient of the NPL indicator as an unbiased estimate of the true model’s effect of NPL listings on local property values.

However, if any relevant observable or underobserable variable is omitted, then CIA cannot hold and the estimate suffers from endogeneity (a violation of the third assumption of exogeneity of our regressors, key in the application of ordinary least squares regression).

1.6 & 1.7 Balance Check

As displayed in the code below, the initial comparison of treated and control groups in the AllCov dataset displays an imbalance with NPL tracts represented by a greater percentage of occupied units, a smaller share of black and hispanic communities, more children and less degrees.

Following 1.7, I repeated the same exercise using the ‘SiteCovariates’ dataset using three definitions of treated and control groups.

Using NPL dummies again for this new dataset, correlations subside rather dramatically, suggesting a far more balanced dataset. There is however a very large difference with respect to college degree status.
Using HRS instead, with 28.5 as the defining threshold for deciding the HRS dummy variable values, very similar results emerge.
Narrowing the scope of the dataset to observations between HRS_82 at 16.5 and 40.5, with the same definition of dummies, the dataset is now highly balanced relative to 1.6’s case.

Conclusion: As we converge upon the threshhold, the difference between our treated and control group narrows dramatically. This is a very intuitive result, given that sites right upon the cusp of being added to the NPL list are very likely to be next to identical to those who barely made the cut. Especially so, given the high demand for getting on these lists and the limited resources the EPA had available, this would leave the threshold region highly populated as a result. I would posit that the final subsample using observations neighboring the threshold of 28.5 is a relatively more balanced set, given the lack of significant coefficients.

#OLS using controls for Economic Conditions & Demographics with White Standard Errors

#1.6 requires check for balanced data between NPL=1 and NPL=0 for AllCov Dataset
lm_NPL_allcov <- lm_robust(data=allcov, npl2000 ~ meanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 + ba_or_better8)

#1.7, first bulletpoint requests a repeat of this for SiteCov Dataset
lm_NPL_sitecov <- lm_robust(data=sitecov, npl2000 ~ meanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 + ba_or_better8)

#Creating dummy variable for hrs_82, dummy=1 if hrs_80>=28.5, I repeat this same balance check on the binded dummies.
HRS_High <- ifelse(sitecov$hrs_82>=28.5,1,0)
sitecov_D <- cbind(sitecov,HRS_High)

#Regression using new dummy for critical value of HRS
lm_HRS_sitecov <- lm_robust(data=sitecov_D, HRS_High ~ meanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 + ba_or_better8)

#Filtering out high and low values for HRS to get thinner margin analysis;
sitecov_DD <- filter(sitecov, hrs_82>=16.5 & hrs_82<=40.5)
HRS_High_DD <- ifelse(sitecov_DD$hrs_82>=28.5,1,0)
sitecov_DD_R <- cbind(sitecov_DD,HRS_High_DD)

#Regression using new dummy for more concentrated critical value of HRS
lm_HRS_sitecov_DD <- lm_robust(data=sitecov_DD_R, HRS_High_DD ~ meanhs8 + avhhin8  + povrat8 + unemprt8 + welfare8 + pop_den8 + shrblk8  + shrhsp8 + child8 + ba_or_better8)



export_summs(lm_NPL_allcov, lm_NPL_sitecov, lm_HRS_sitecov, lm_HRS_sitecov_DD, model.names=c("AllCov NPL", "SiteCov NPL", "SiteCov HRS", "Reduced SiteCov HRS"))

	AllCov NPL	SiteCov NPL	SiteCov HRS	Reduced SiteCov HRS
(Intercept)	0.03 ***	0.69 ***	0.51 **	0.42
	(0.00)	(0.16)	(0.17)	(0.27)
meanhs8	-0.00	-0.00	0.00	0.00
	(0.00)	(0.00)	(0.00)	(0.00)
avhhin8	-0.00 **	-0.00	-0.00	-0.00
	(0.00)	(0.00)	(0.00)	(0.00)
povrat8	-0.01	0.15	0.34	0.49
	(0.01)	(0.58)	(0.60)	(0.92)
unemprt8	0.04	-1.17	-1.17	0.18
	(0.02)	(0.63)	(0.62)	(1.06)
welfare8	0.01	0.40	0.32	-0.60
	(0.02)	(0.60)	(0.59)	(0.85)
pop_den8	-0.00 ***	-0.00 **	-0.00 *	-0.00
	(0.00)	(0.00)	(0.00)	(0.00)
shrblk8	-0.01 ***	-0.25	-0.30	0.02
	(0.00)	(0.17)	(0.17)	(0.27)
shrhsp8	-0.02 ***	0.19	0.12	0.06
	(0.00)	(0.18)	(0.20)	(0.52)
child8	0.03 ***	0.27	0.55	0.60
	(0.01)	(0.41)	(0.41)	(0.58)
ba_or_better8	-0.04 ***	1.21 **	1.16 **	1.00
	(0.01)	(0.39)	(0.40)	(0.60)
nobs	48245	487	487	227
r.squared	0.01	0.07	0.07	0.04
adj.r.squared	0.01	0.05	0.05	-0.01
statistic	56.40	4.36	4.45	1.16
p.value	0.00	0.00	0.00	0.32
df.residual	48234.00	476.00	476.00	216.00
N	48245.00	487.00	487.00	227.00
se_type	HC2.00	HC2.00	HC2.00	HC2.00
* p < 0.001; p < 0.01; * p < 0.05.

1.8 HRS as IV

We assume that we can extract part of the good variation in \(\text{D}_{i}\) (using our IV, \(\text{Z}_{i}\)) and then use this good part of \(\text{D}_{i}\) to estimate the effect of \(\text{D}_{i}\) on \(\text{Y}_{i}\). This allows us to discard the bad variation in \(\text{D}_{i}\). The main concern is that \(\text{D}_{i}\) covaries with the disturbance term, hence by finding a variable uncorrelated with both the dependent variable and disturbances, we can project a fitted series of observations that stands in for \(\text{D}_{i}\) and satisfies our assumption of exogeneity for the covariate of interest.

On a basic level, this would require that the covariance between HRS_80 and our dependent variable, median house value, is zero. Additionally, HRS_82 would need to be able to generate a good fit for NPL during the first stage regression while remaining uncorrelated with the disturbance term of the model. Should these three conditions hold, we’d be left with a reasonable instrumental variable.

1.9 RD using HRS as run

A cutoff of 28.5 on HRS_82 suggests a sharp regression discontinuity. It could simply be the case however that the relationship we are trying to evaluate is non-linear and a non-linearity is being mistaken as a sign of regression discontinuity. While this does not mean we assume away the non-linear case, it does imply a more careful approach is required to validate the assumption of a our study being a case of discontinuity.

We therefore assume that there is a neighborhood around the discontinuity where we have that;

\(E[Y_i|x_0-\delta<x_i<x_0] %~~% E[Y_{0i}|x_i=x_0]\)

\(E[Y_i|x_0<x_i<x_0+\delta] %~~% E[Y_{1i}|x_i=x_0]\)

This ensures that linear or non-linear, some discontinuous change in the treatment effect occurs upon this sharp threshold. We are therefore assuming that there is differing slopes and differing intercepts between the pre- and post- threshold regressions using a set of linear appromixations to estimate the sudden change in treatment effects.

\(E[Y_{0i} | X_{i} = x]\) and \(E[Y_{1i} | X_{i} = x]\) are linear in \(x\),\(E[Y_{0i}|X_{i}] = \alpha_0 + \beta_0 X_{i}\) and \(E[Y_{1i} | {X}_{i}] = \alpha_1 + \beta_1 X_{i}\)

Now treatment effects are assumed to vary with \(X_{i}\). This imples that \(E[Y_{1i}-Y_{0i}|X_{i}] = \alpha_1 - \alpha_0 + (\beta_1 - \beta_0)X_i\)

Given treatment, \(D_i\),

\[ \begin{align} E[Y_i|X_i,D_i] = D_{i} E[Y_{1i}|X_{i}] + (1- D_{i}) E[Y_{0i} | X_{i}] \end{align} \]

\[ \begin{align} = \alpha_0 + \beta_0 X_i + (\alpha_1 - \alpha_0)D_i + (\beta_1 - \beta_0)D_i X_i \end{align} \]

\[ \begin{align} =E[Y_{i}|X_{i},D_{i}]= \widetilde{\alpha} + \beta_0 \widetilde{X_{i}} + \tau D_{i} + \widetilde{\beta}D_{i}\widetilde{X_{i}} \end{align} \]

Additionally, it assumes that the discontinuity is triggered sharply in the first place, whereas it may be a specific region making the problem a Fuzzy RD. While considering multiple critical thresholds may also be another point to consider, given the design of this list, this consideration seems far less relevant.

In this case we’d be assuming ;

\[ \begin{align} 0 < \lim_{x\downarrow c} Pr(D_{i} = 1 | X_{i} = x) - \lim_{x\uparrow c} Pr(D_{i} = 1 | X_{i} = x) < 1 \end{align} \]

1.10 Three Facts

The EPA states that the 28.5 cutoff was selected because it produced a manageable number of sites.
None of the individuals involved in identifying the site, testing the level of pollution, or running the 1982 HRS test knew the cutoff/threshold.
EPA documentation emphasizes that the HRS test is an imperfect scoring measure.

The first fact is more so a product of the pressure the EPA was under in terms of resources, but does help identify whyand where this cutoff was implemented. It also signals there is likely no distinguising difference between values slightly below and above 28.5 which would suggests an analysis upon the margin of the threshold would be valid.

The second fact suggests very little risk of selection bias, given that HRS was not an observed value when agents were making decisions. This allows it to function well as an IV, concerning a lack of covariance with the error term. It also is unlikely to have influenced median house values in 2000 for the reason that it was not an observable factor to price into the market. This suggests the covariance of our IV between the disturbance and the dependent variable is possibly zero. This makes HRS very enticing as a potential IV!

Lastly, the EPA documentation emphasizes potential measurement error given that HRS are an imperfect scoring measure for how hazardous any particular site was. This is also of use as it further blends observations upon the threshold together into a balanced set, as reflected by the fourth stage of analysis in question 1.7. This enhances the validity of an RD analysis. Specifically, a fuzzy RD analysis since it seemed to be up to luck who gets on either side of the threshhold, given the inaccuracy of HRS.

1.11 2Mile IV Regression

IV regression suggests a 31 per cent average increase in median property values.Further below I apply this same IV process manually, using the 2SLS process.

p_load(magrittr)

#mile %<>% mutate(lnmdvalhs0=log(mdvalhs0))

lnmdvalhs0 = log(mile$mdvalhs0)
mile_ln = cbind(mile, lnmdvalhs0)

IVreg <- iv_robust(lnmdvalhs0 ~ npl2000 | hrs_82 ,data=mile_ln, clusters=statefips)


#FULL SET OF CONTROLS VERSION
#IVreg <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr +old8_nbr + ba_or_better8_nbr | hrs_82 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr +old8_nbr + ba_or_better8_nbr ,data=mile, clusters=statefips)

#Version with Fixed Effects 
#IVreg <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr +old8_nbr + ba_or_better8_nbr | hrs_82 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr +old8_nbr + ba_or_better8_nbr ,data=mile, clusters=statefips, fixed_effects=statefips)

export_summs(IVreg)

	Model 1
(Intercept)	11.43 ***
	(0.11)
npl2000	0.31 **
	(0.09)
nobs	483
r.squared	0.01
adj.r.squared	0.01
df.residual	481.00
N	483.00
se_type	CR2.00
statistic	13.17
p.value	0.00
statistic.weakinst
p.value.weakinst
statistic.endogeneity
p.value.endogeneity
statistic.overid
p.value.overid
* p < 0.001; p < 0.01; * p < 0.05.

Using the three steps of the process, where;

Estimate the effect of the instrument \(Z_i\) on our endogenous variable \(D_i\) and (predetermined) covariates \(X_i\). Save \(\hat{D}_i\).
The reduced form regresses the outcome \(Y_i\) (LHS of the second stage) on our instrument \(Z_i\) and covariates \(X_i\) (RHS of the first stage).
Estimate model we wanted—but only using the variation in \(D_i\) that correlates with \(Z_i\), i.e. \(\hat{D}_i\).

I find the same final estimated coefficient as the IV_robust approach initally generated, confirming its validity from a coding standpoint. Further below I capture the ratio of the coefficients of \(Z_i\) to demonstrate success!

#First Stage (FULL CONTROLS)
#Reg111_FS <- lm_robust(npl2000 ~ hrs_82 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr +old8_nbr + ba_or_better8_nbr, clusters=statefips, data=mile_ln)

Reg111_FS <- lm_robust(npl2000 ~ hrs_82, clusters=statefips, data=mile_ln)


export_summs(Reg111_FS)

	Model 1
(Intercept)	0.00
	(0.04)
hrs_82	0.02 ***
	(0.00)
nobs	483
r.squared	0.56
adj.r.squared	0.56
statistic	460.21
p.value	0.00
df.residual	14.83
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Reduced Form
Reg111_RF <- lm_robust(lnmdvalhs0 ~ hrs_82, clusters=statefips, data=mile_ln)

export_summs(Reg111_RF)

	Model 1
(Intercept)	11.43 ***
	(0.11)
hrs_82	0.01 **
	(0.00)
nobs	483
r.squared	0.05
adj.r.squared	0.04
statistic	12.70
p.value	0.00
df.residual	14.83
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Second Stage
phat <- Reg111_FS$fitted.values
mile_ln <- cbind(mile_ln, phat)

#mile_ln <- mile_ln[,-53]

Reg111_SS <- lm_robust(lnmdvalhs0 ~ phat, clusters=statefips, data=mile_ln)
export_summs(Reg111_SS)

	Model 1
(Intercept)	11.43 ***
	(0.11)
phat	0.31 **
	(0.09)
nobs	483
r.squared	0.05
adj.r.squared	0.04
statistic	12.70
p.value	0.00
df.residual	14.83
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

Success! This matches the 31 per cent rate the IV_robust and manual process of 2SLS generated.

#Taking the ratio of the two estimated coefficients from the first stage and reduced form, 
est_coeff <- Reg111_RF[["coefficients"]][["hrs_82"]]/Reg111_FS[["coefficients"]][["hrs_82"]]

est_coeff

## [1] 0.3091575

1.12 State Fixed Effect to 2SLS

Applying fixed effects to the previous code, my treatment effect is now reduced by 50 per cent and is insignificant. Adding additional controls would be very important. Treatment was not randomly assigned so we must instead rely upon CIA. In this case, suggesting that CIA is satisfied with no control variables (suggesting X in our standard CIA write-up is an empty set) would likely not hold up in a rigorous acamdemic evaluation. Instead, I would include the use of housing characteristics, economic conditions and demographics more accurately trying to capture the treatment effect through the minimization of selection bias.

IVreg_FE <- iv_robust(lnmdvalhs0 ~ npl2000 | hrs_82 ,data=mile_ln, clusters=statefips, fixed_effects=statefips)

#IVreg_FE <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr +old8_nbr + ba_or_better8_nbr | hrs_82 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr +old8_nbr + ba_or_better8_nbr ,data=mile, clusters=statefips, fixed_effects=statefips)

export_summs(IVreg_FE)

	Model 1
npl2000	0.16
	(0.08)
nobs	483
r.squared	0.40
adj.r.squared	0.34
df.residual	442.00
N	483.00
se_type	CR2.00
statistic
p.value
statistic.weakinst
p.value.weakinst
statistic.endogeneity
p.value.endogeneity
statistic.overid
p.value.overid
* p < 0.001; p < 0.01; * p < 0.05.

Applying 2SLS instead of the direct IV regression above, I find an insignificant result of 0.16 per cent again. The first 2SLS without Fixed Effects found the following values for the estimated effect of NPL listings.

Reduced Form Est 0.0062383 First-Stage Esti 0.0201783 Second-Stage Est 0.3091575

The second 2SLS with fixed effects left the estimates coefficient for the first stage of the process largely unchanged, it appears that the reduction in the reduced form estimate by approximately a half directly led to the second-stage estimate for the coefficient of interest being reduced by a half as well. This would suggest that while on average there may be no relationship between HS_82 and median house values in 2000, there may have been some specific states that inferred some specific relationship between the two items.

Additionally, it could simply be the case that including fixed effects introduced enough unique variation in the data that the spuriously small correlation our IV shares with our dependent variable drifted slightly out of frame. If this is the case, this would suggest our results are extremely sensitive when applying fixed effects to a sample of 487 observations.

Reduced Form Est 0.0032926 First-Stage Esti 0.0199812 Second-Stage Est 0.1647854

#First Stage
Reg111_FS_FE <- lm_robust(npl2000 ~ hrs_82, clusters=statefips, fixed_effects = statefips, data=mile)

export_summs(Reg111_FS_FE)

	Model 1
hrs_82	0.02 ***
	(0.00)
nobs	483
r.squared	0.61
adj.r.squared	0.58
statistic
p.value
df.residual	10.14
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Reduced Form
Reg111_RF_FE <- lm_robust(lnmdvalhs0 ~ hrs_82, clusters=statefips, fixed_effects = statefips, data=mile_ln)

export_summs(Reg111_RF_FE)

	Model 1
hrs_82	0.00
	(0.00)
nobs	483
r.squared	0.40
adj.r.squared	0.35
statistic
p.value
df.residual	10.14
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Second Stage
phat_FE <- Reg111_FS_FE$fitted.values
mile_ln <- cbind(mile_ln, phat_FE)

#mile_ln <- mile_ln[,-53]

Reg111_SS_FE <- lm_robust(lnmdvalhs0 ~ phat_FE, clusters=statefips, fixed_effects = statefips, data=mile_ln)
export_summs(Reg111_SS_FE)

	Model 1
phat_FE	0.16
	(0.07)
nobs	483
r.squared	0.40
adj.r.squared	0.35
statistic
p.value
df.residual	10.14
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

Below is the calculate of these coefficient ratios for the second-stage coefficients.

#Taking the ratio of the two estimated coefficients from the first stage and reduced form, 
est_coeff <- Reg111_RF[["coefficients"]][["hrs_82"]]/Reg111_FS[["coefficients"]][["hrs_82"]]

Reg111_RF[["coefficients"]][["hrs_82"]]

## [1] 0.006238263

Reg111_FS[["coefficients"]][["hrs_82"]]

## [1] 0.02017827

est_coeff

## [1] 0.3091575

Reg111_RF_FE[["coefficients"]][["hrs_82"]]

## [1] 0.003292613

Reg111_FS_FE[["coefficients"]][["hrs_82"]]

## [1] 0.01998122

#Taking the ratio of the two estimated coefficients from the first stage and reduced form, 
est_coeff2 <- Reg111_RF_FE[["coefficients"]][["hrs_82"]]/Reg111_FS_FE[["coefficients"]][["hrs_82"]]

est_coeff2

## [1] 0.1647854

1.13 Change HRS IV to Dummy

Changing to use dummies for HRS instead with no fixed effects, I use the following code for the transformation.

#Similarly to 1.7, I split my data with dummy = 1 including 28.5
HRS_High_M <- ifelse(mile_ln$hrs_82>=28.5,1,0)
mile_ln_D <- cbind(mile_ln,HRS_High_M)

The results change slightly, increasing from 0.31 to 0.22 though less significant now. The IVrobust and 2SLS methods generate matching values, suggesting correct implementation.

IVreg_D <- iv_robust(lnmdvalhs0 ~ npl2000 | HRS_High_M ,data=mile_ln_D, clusters=statefips)


export_summs(IVreg_D)

	Model 1
(Intercept)	11.49 ***
	(0.10)
npl2000	0.22 *
	(0.08)
nobs	483
r.squared	0.03
adj.r.squared	0.02
df.residual	481.00
N	483.00
se_type	CR2.00
statistic	8.14
p.value	0.01
statistic.weakinst
p.value.weakinst
statistic.endogeneity
p.value.endogeneity
statistic.overid
p.value.overid
* p < 0.001; p < 0.01; * p < 0.05.

#First Stage
Reg111_FS_D <- lm_robust(npl2000 ~ HRS_High_M, clusters=statefips, data=mile_ln_D)

export_summs(Reg111_FS_D)

	Model 1
(Intercept)	0.16 ***
	(0.03)
HRS_High_M	0.83 ***
	(0.03)
nobs	483
r.squared	0.74
adj.r.squared	0.74
statistic	598.69
p.value	0.00
df.residual	15.56
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Reduced Form
Reg111_RF_D <- lm_robust(lnmdvalhs0 ~ HRS_High_M, clusters=statefips, data=mile_ln_D)

export_summs(Reg111_RF_D)

	Model 1
(Intercept)	11.53 ***
	(0.09)
HRS_High_M	0.18 *
	(0.06)
nobs	483
r.squared	0.03
adj.r.squared	0.03
statistic	7.76
p.value	0.01
df.residual	15.56
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Second Stage
phat_D <- Reg111_FS_D$fitted.values
mile_ln_D <- cbind(mile_ln_D, phat_D)

#mile_ln_D <- mile_ln_D[,-57]

Reg111_SS_D <- lm_robust(lnmdvalhs0 ~ phat_D, clusters=statefips, data=mile_ln_D)
export_summs(Reg111_SS_D)

	Model 1
(Intercept)	11.49 ***
	(0.10)
phat_D	0.22 *
	(0.08)
nobs	483
r.squared	0.03
adj.r.squared	0.03
statistic	7.76
p.value	0.01
df.residual	15.59
N	483.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

1.14 Sharp or Fuzzy RD

The first-stage regression suggests that there is a distinct difference in slope between observations below and above the 28.5 cutoff, with there being an 84 per cent chance of falling into the NPL listing if the HRS score exceeded the threshhold. This finding suggests a grey area, where some sites deserving of the NPL listing in 1982 may not have made it to the 2000 listing (possibly due to cleanups in that time or other competitors for space appearing in the process).

Given that this was also the EPA’s distinct cut off point for deciding which sites to prioritize under a severely limited budget, I would imagine this cutoff point was stuck to rather rigidly but they have conceeded the potential measurement error for HRS. That would suggest that right upon the threshold region, we face an incredibly balanced dataset, where probability of being assigned either side of the threshold converges as HRS approaches 28.5.

All of these factors combined merit the use of a fuzzy RD.

1.15 Standard Plot of RD

The outcome variable vs. the running variable
The treatment variable vs. the running variable
Covariates vs. the running variable
Bin counts vs. the running variable

Firstly, for this sharp RD the running variable must be separated into an assortment of bins. The min value of HRS is 0.00 and the max value is 74.16 in the mile dataset (483 obs), the larger samples are unavailable due to reasons including missing house price data and no HRS information being available. For the sake of distinct bins, I set a max of 74.1 and split my bins into thirteen bins, each representing a 5.7 interval of HRS scores from 0 to 74.1. To pinpoint each average to its respective bin, I run the midpoint of HRS bins across the x-axis, resulting in a balanced visualization of the potential discontinuities. The red line is held fixed at HRS 28.5 to highlight our crossing point wherein no overlap occurs.

#Data: mile_ln, Dependent: lnmdvalhs0, Regressor: npl2000 RunningVariable: hrs_82
# Bins: 

split <- 5.7
bin <- seq(0, 74.1, split)
midpoint <- seq(2.85, 71.25, split)

#2. Place each observation into a block via its fitted p-score.

bin_HRS = cut(mile_ln$hrs_82, breaks = bin, labels = FALSE)
mile_ln_bin <- cbind(mile_ln, bin_HRS)

avg_Y <- aggregate(mile_ln_bin$lnmdvalhs0_nbr, list(mile_ln_bin$bin_HRS), mean)
avg_Y_bin <- cbind(avg_Y,midpoint)

# The outcome variable vs. the running variable
Y_RD <- ggplot(data=avg_Y_bin, aes(y= x, x=midpoint)) +
  geom_point(color="blue", alpha=0.5)+
geom_hline(yintercept = 10, size = 0.1) +
geom_vline(xintercept = 28.5, color = "red", size = 1)+
    xlab("Hazardous Ranking System Score") + ylab("Average Outcome")
#+
#scale_y_continuous(TeX("$Y_k$")) +
#scale_x_continuous(TeX("$HRS_82$"))

Y_RD

avg_X <- aggregate(mile_ln_bin$npl2000, list(mile_ln_bin$bin_HRS), mean)
avg_X_bin <- cbind(avg_X,midpoint)

# The treatment variable vs. the running variable
X_RD <- ggplot(data=avg_X_bin, aes(y= x, x=midpoint)) +
  geom_point(color="orange", alpha=0.8)+
geom_hline(yintercept = 0, size = 0.1) +
geom_vline(xintercept = 28.5, color = "red", size = 1)+
  xlab("Hazardous Ranking System Score") + ylab("Average Treatment")

X_RD

# Covariates vs. the running variable
#(i)Example of Housing Characteristics: Occupancy Rate, occupied80_nbr
avg_H <- aggregate(mile_ln_bin$occupied80_nbr, list(mile_ln_bin$bin_HRS), mean)
avg_H_bin <- cbind(avg_H,midpoint)

H_RD <- ggplot(data=avg_H_bin, aes(y= x, x=midpoint)) +
  geom_point(color="purple", alpha=0.8)+
geom_hline(yintercept = 0, size = 0.1) +
geom_vline(xintercept = 28.5, color = "red", size = 1)+
  xlab("Hazardous Ranking System Score") + ylab("Average Occupancy Rate")

H_RD

#(ii)Example of Economic Conditions: Average Income, avhhin8_nbr
avg_E <- aggregate(mile_ln_bin$avhhin8_nbr, list(mile_ln_bin$bin_HRS), mean)
avg_E_bin <- cbind(avg_E,midpoint)

E_RD <- ggplot(data=avg_E_bin, aes(y= x, x=midpoint)) +
  geom_point(color="purple", alpha=0.8)+
geom_hline(yintercept = 0, size = 0.1) +
geom_vline(xintercept = 28.5, color = "red", size = 1)+
  xlab("Hazardous Ranking System Score") + ylab("Average Income")

E_RD

#(iii)Example of Demographics: Percentage of Population, Black, shrblk8_nbr
avg_D <- aggregate(mile_ln_bin$shrblk8_nbr, list(mile_ln_bin$bin_HRS), mean)
avg_D_bin <- cbind(avg_D,midpoint)

D_RD <- ggplot(data=avg_D_bin, aes(y= x, x=midpoint)) +
  geom_point(color="purple", alpha=0.8)+
geom_hline(yintercept = 0, size = 0.1) +
geom_vline(xintercept = 28.5, color = "red", size = 1)+
  xlab("Hazardous Ranking System Score") + ylab("Average Share of Black Pop")

D_RD

# Bin counts vs. the running variable

#Counting Observations per Bin
obs_df0 <- mile_ln_bin %>% 
  group_by(bin_HRS) %>%
  summarise(number = n())
#Dropping the 2 NA's above 74.1 score (max was 74.16)
obs_df1 <- obs_df0[-14,]
avg_N_bin <- cbind(obs_df1,midpoint)


N_RD <- ggplot(data=avg_N_bin, aes(y= number, x=midpoint)) +
  geom_point(color="blue", alpha=0.8)+
geom_hline(yintercept = 0, size = 0.1) +
geom_vline(xintercept = 28.5, color = "red", size = 1)+
  xlab("Hazardous Ranking System Score") + ylab("Average Number of Tracts (Bin Counts)")

N_RD

1.16 Plausibility

While there is a smoothness to the covariates and distinct discontinuous jump with respect to the treatment variable, there is no discontinuous jump visible when taking the simple average of outcomes across the thirteen bins prepared. This seems to be supportive of the conclusions the models have reached in terms of the effect being insignificant when controlling for the appropriate factors and including fixed effects. This further supports the possibility that Superfund cleanup effects are economically small and statistically insignificant with respect to residential property values.

1.17 RD Estimates

Switching back to the a restricted subset of the data as in 1.13, I use the code displayed below. Mirroring 1.13, I leave out fixed effects initially and stick to the same set of controls as before.

#Similarly to 1.7, I filter then split my data with dummy = 1 including 28.5
mile_ln_DD <- filter(mile_ln, hrs_82>=16.5 & hrs_82<=40.5)
HRS_Thresh_DD <- ifelse(mile_ln_DD$hrs_82>=28.5,1,0)
mile_ln_DD_R <- cbind(mile_ln_DD,HRS_Thresh_DD)

The results change from previous findings of signficance in the treatment effect. I estimate a 6 per cent increase in housing value but the result is measured as highly insignificant. Below code is displayed for both the IV regression and 2SLS process to highlight the matching results. In the next portion of code, I reintroduce fixed effects and a set of controls for housing characteristics, economic status and demographics to emphasize the importance of controlling for omitted variable bias.

IVreg_DD <- iv_robust(lnmdvalhs0 ~ npl2000 | HRS_Thresh_DD ,data=mile_ln_DD_R, clusters=statefips)


export_summs(IVreg_DD)

	Model 1
(Intercept)	11.56 ***
	(0.09)
npl2000	0.06
	(0.09)
nobs	226
r.squared	-0.00
adj.r.squared	-0.01
df.residual	224.00
N	226.00
se_type	CR2.00
statistic	0.40
p.value	0.53
statistic.weakinst
p.value.weakinst
statistic.endogeneity
p.value.endogeneity
statistic.overid
p.value.overid
* p < 0.001; p < 0.01; * p < 0.05.

#First Stage
Reg111_FS_DD <- lm_robust(npl2000 ~ HRS_Thresh_DD, clusters=statefips, data=mile_ln_DD_R)

export_summs(Reg111_FS_DD)

	Model 1
(Intercept)	0.27 ***
	(0.04)
HRS_Thresh_DD	0.72 ***
	(0.05)
nobs	226
r.squared	0.59
adj.r.squared	0.58
statistic	242.14
p.value	0.00
df.residual	14.98
N	226.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Reduced Form
Reg111_RF_DD <- lm_robust(lnmdvalhs0 ~ HRS_Thresh_DD, clusters=statefips, data=mile_ln_DD_R)

export_summs(Reg111_RF_DD)

	Model 1
(Intercept)	11.57 ***
	(0.08)
HRS_Thresh_DD	0.04
	(0.06)
nobs	226
r.squared	0.00
adj.r.squared	-0.00
statistic	0.39
p.value	0.54
df.residual	14.98
N	226.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Second Stage
phat_DD <- Reg111_FS_DD$fitted.values
mile_ln_DD_R <- cbind(mile_ln_DD_R, phat_DD)

#mile_ln <- mile_ln[,-53]

Reg111_SS_DD <- lm_robust(lnmdvalhs0 ~ phat_DD, clusters=statefips, data=mile_ln_DD_R)
export_summs(Reg111_SS_DD)

	Model 1
(Intercept)	11.56 ***
	(0.09)
phat_DD	0.06
	(0.09)
nobs	226
r.squared	0.00
adj.r.squared	-0.00
statistic	0.39
p.value	0.54
df.residual	15.10
N	226.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

Regarding choice controls, there are a few changes I would suggest making. Firstly, I would elect to keep fixed effects involved. As Greenstone & Gallagher highlight in their paper, the majority of cases of HRS scores above 28.5 are clustered around the northeastern states of the US. The unbalanced distribution of sites across states introduced concerns to them about the presence of localized market shocks, hence by including fixed effects such constant shocks within groups are dismissed. As displayed in the initial portion of the code below, running a simple regression without fixed effects leads to a completely sparse model in terms of explanatory power. Including fixed effects results in the R-squared term jumping from approximately zero to 40 percent.

I must also highlight the fact that I included housing characteristics in the regression throughout the analysis as is the norm for hedonic housing approaches, although it was not specifically requested by the problem set. From my previous work experience, I had an opportunity to review the Irish housing price index and finalized a working paper for the Central Statistics Office (CSO) on the matter. I believe housing characteristics are a vital part of this hedonic regression, when trying to identify the effect of environmental quality on housing values. It allows a far clearer comparison of like-for-like across various properties.

I re-run estimates upon the given threshhold with fixed effects active and additional housing characteristic controls implemented in a manner that avoids perfect multicollinearity (number of units, occpuancy of units, number of bedrooms, percentage of detached & attached homes, housing age, percentage without full kitchens, no heating, and those without a full bathroom). Ultimately the coefficient remains in the single digit region, 7 per cent, but now the standard error has climbed dramatically, furthering the insignificance of the result. This is likely a result of some non-perfect multicollineary, wherein the correlation between our fitted value and the now larger set of controls has contributed towards an increased standard error. This is a common issue of loading in too many variables into a regression, as irrelevant controls not only poorly explain the variation in \(Y_i\) but also introduce multicollinearity upon our coefficient of interest, effectively concealing the significance of our estimated coefficient.

Regardless, even when using far lower counts of coefficients but including fixed effects, the previous work done has demonstrated that the effect of being placed on the NPL list is consistently insignificant.

IVreg_DD_FE_0 <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr | HRS_Thresh_DD + lnmeanhs8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr+ ba_or_better8_nbr ,data=mile_ln_DD_R, clusters=statefips, fixed_effects = statefips)


# + Controls for Number of Units
IVreg_DD_FE_1 <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr | HRS_Thresh_DD + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr ,data=mile_ln_DD_R, clusters=statefips, fixed_effects = statefips)

#+ Controls for Attached/Detached, No Full Kitchen, Fire Stove Heating
IVreg_DD_FE_2 <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr + detach80occ_nbr + attach80occ_nbr | HRS_Thresh_DD + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr + detach80occ_nbr + attach80occ_nbr,data=mile_ln_DD_R, clusters=statefips, fixed_effects = statefips)


# + Controls for Number of Bedrooms
IVreg_DD_FE_3 <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr+bedrms1_80occ_nbr +bedrms2_80occ_nbr + bedrms3_80occ_nbr + bedrms4_80occ_nbr + bedrms5_80occ_nbr + detach80occ_nbr + attach80occ_nbr | HRS_Thresh_DD + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr + bedrms1_80occ_nbr+ bedrms2_80occ_nbr + bedrms3_80occ_nbr + bedrms4_80occ_nbr + bedrms5_80occ_nbr + detach80occ_nbr + attach80occ_nbr,data=mile_ln_DD_R, clusters=statefips, fixed_effects = statefips)

# + Controls for Year Built
IVreg_DD_FE_4 <- iv_robust(lnmdvalhs0 ~ npl2000 + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr+bedrms1_80occ_nbr +bedrms2_80occ_nbr + bedrms3_80occ_nbr + bedrms4_80occ_nbr + bedrms5_80occ_nbr + detach80occ_nbr + attach80occ_nbr  + blt2_5yrs80occ_nbr + blt6_10yrs80occ_nbr + blt10_20yrs80occ_nbr + blt20_30yrs80occ_nbr + blt30_40yrs80occ_nbr + blt40_yrs80occ_nbr | HRS_Thresh_DD + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr + bedrms1_80occ_nbr+ bedrms2_80occ_nbr + bedrms3_80occ_nbr + bedrms4_80occ_nbr + bedrms5_80occ_nbr + detach80occ_nbr + attach80occ_nbr  + blt2_5yrs80occ_nbr + blt6_10yrs80occ_nbr + blt10_20yrs80occ_nbr + blt20_30yrs80occ_nbr + blt30_40yrs80occ_nbr + blt40_yrs80occ_nbr,data=mile_ln_DD_R, clusters=statefips, fixed_effects = statefips)

#IVreg_DD_1 <- iv_robust(lnmdvalhs0 ~ npl2000 | HRS_Thresh_DD ,data=mile_ln_DD_R, clusters=statefips)

#IVreg_DD_2 <- iv_robust(lnmdvalhs0 ~ npl2000 | HRS_Thresh_DD ,data=mile_ln_DD_R, clusters=statefips, fixed_effects = statefips)


#export_summs(IVreg_DD_1,IVreg_DD_2)

export_summs(IVreg_DD, IVreg_DD_FE_0, IVreg_DD_FE_1, IVreg_DD_FE_2, IVreg_DD_FE_3, IVreg_DD_FE_4, model.names = c("NO FE of Housing","+Fixed Effects & Econ/Demo", "+Number of Units", "+Features", "+Number of Bedrooms", "+Years Built"))

	NO FE of Housing	+Fixed Effects & Econ/Demo	+Number of Units	+Features	+Number of Bedrooms	+Years Built
(Intercept)	11.56 ***
	(0.09)
npl2000	0.06	0.09	0.08	0.08	0.08	0.07
	(0.09)	(0.06)	(0.06)	(0.06)	(0.06)	(0.06)
lnmeanhs8_nbr		0.40	0.37 *	0.32 *	0.33	0.30
		(0.16)	(0.14)	(0.12)	(0.14)	(0.13)
avhhin8_nbr		0.00 *	0.00 *	0.00 **	0.00 **	0.00 ***
		(0.00)	(0.00)	(0.00)	(0.00)	(0.00)
povrat8_nbr		2.07	1.37	0.72	0.62	0.87
		(0.96)	(1.05)	(1.22)	(1.10)	(1.06)
unemprt8_nbr		-2.55	-2.99	-3.05	-2.91	-2.97
		(1.63)	(1.58)	(1.68)	(1.62)	(1.39)
welfare8_nbr		-2.54	-2.12	-1.80	-1.71	-1.47
		(1.29)	(1.20)	(1.31)	(1.29)	(1.19)
pop_den8_nbr		-0.00	0.00	0.00	0.00	0.00
		(0.00)	(0.00)	(0.00)	(0.00)	(0.00)
shrblk8_nbr		-0.36	-0.15	-0.04	-0.09	0.02
		(0.23)	(0.30)	(0.29)	(0.29)	(0.26)
shrhsp8_nbr		0.50	0.25	0.37	0.34	0.49
		(0.64)	(0.63)	(0.71)	(0.73)	(0.72)
child8_nbr		1.73	1.25	0.92	1.43	0.69
		(0.77)	(0.62)	(0.46)	(0.77)	(0.75)
ba_or_better8_nbr		-0.20	-0.32	-0.13	0.14	0.08
		(0.42)	(0.34)	(0.32)	(0.29)	(0.36)
tothsun8_nbr			0.00	-0.00	-0.00	-0.00
			(0.00)	(0.00)	(0.00)	(0.00)
occupied80_nbr			-0.09	0.08	0.07	0.92
			(1.15)	(1.25)	(1.33)	(1.08)
ownocc8_nbr			-0.00 **	-0.00	-0.00	-0.00
			(0.00)	(0.00)	(0.00)	(0.00)
firestoveheat80_nbr				-0.24	-0.30	-0.04
				(0.43)	(0.41)	(0.37)
nofullkitchen80_nbr				1.37	1.39	0.92
				(1.25)	(1.19)	(1.08)
detach80occ_nbr				-0.78 *	-0.57	-0.29
				(0.33)	(0.37)	(0.44)
attach80occ_nbr				-1.50 **	-1.31 *	-1.20
				(0.37)	(0.43)	(0.53)
bedrms1_80occ_nbr					5.71	9.92
					(8.13)	(8.66)
bedrms2_80occ_nbr					5.22	9.56
					(7.93)	(8.49)
bedrms3_80occ_nbr					5.03	9.57
					(7.89)	(8.56)
bedrms4_80occ_nbr					4.50	8.88
					(7.87)	(8.41)
bedrms5_80occ_nbr					5.30	9.22
					(7.88)	(8.31)
blt2_5yrs80occ_nbr						-2.00
						(1.93)
blt6_10yrs80occ_nbr						-0.80
						(1.50)
blt10_20yrs80occ_nbr						-1.89
						(1.63)
blt20_30yrs80occ_nbr						-1.90
						(1.42)
blt30_40yrs80occ_nbr						-1.69
						(2.10)
blt40_yrs80occ_nbr						-1.45
						(1.48)
nobs	226	226	226	226	226	226
r.squared	-0.00	0.77	0.79	0.81	0.81	0.82
adj.r.squared	-0.01	0.71	0.73	0.75	0.74	0.75
df.residual	224.00	178.00	175.00	171.00	166.00	160.00
N	226.00	226.00	226.00	226.00	226.00	226.00
se_type	CR2.00	CR2.00	CR2.00	CR2.00	CR2.00	CR2.00
statistic	0.40
p.value	0.53
statistic.weakinst
p.value.weakinst
statistic.endogeneity
p.value.endogeneity
statistic.overid
p.value.overid
* p < 0.001; p < 0.01; * p < 0.05.

#First Stage
Reg111_FS_DD_FE <- lm_robust(npl2000 ~ HRS_Thresh_DD + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr+bedrms1_80occ_nbr +bedrms2_80occ_nbr + bedrms3_80occ_nbr + bedrms4_80occ_nbr + bedrms5_80occ_nbr + detach80occ_nbr + attach80occ_nbr  + blt2_5yrs80occ_nbr + blt6_10yrs80occ_nbr + blt10_20yrs80occ_nbr + blt20_30yrs80occ_nbr + blt30_40yrs80occ_nbr + blt40_yrs80occ_nbr, clusters=statefips, data=mile_ln_DD_R, fixed_effects = statefips)

export_summs(Reg111_FS_DD_FE)

	Model 1
HRS_Thresh_DD	0.69 ***
	(0.06)
lnmeanhs8_nbr	0.02
	(0.14)
tothsun8_nbr	0.00
	(0.00)
occupied80_nbr	-0.94
	(0.96)
ownocc8_nbr	0.00
	(0.00)
avhhin8_nbr	0.00
	(0.00)
povrat8_nbr	-0.14
	(1.06)
unemprt8_nbr	1.63
	(0.86)
welfare8_nbr	1.57
	(1.83)
pop_den8_nbr	-0.00
	(0.00)
shrblk8_nbr	0.22
	(0.35)
shrhsp8_nbr	0.84
	(0.51)
child8_nbr	-1.05
	(1.11)
ba_or_better8_nbr	0.44
	(0.63)
firestoveheat80_nbr	0.23
	(0.72)
nofullkitchen80_nbr	-2.21
	(2.41)
bedrms1_80occ_nbr	16.13
	(11.73)
bedrms2_80occ_nbr	17.47
	(11.61)
bedrms3_80occ_nbr	17.27
	(11.50)
bedrms4_80occ_nbr	16.17
	(11.42)
bedrms5_80occ_nbr	16.05
	(11.31)
detach80occ_nbr	0.57
	(0.55)
attach80occ_nbr	0.16
	(0.62)
blt2_5yrs80occ_nbr	3.89
	(3.11)
blt6_10yrs80occ_nbr	2.38
	(2.56)
blt10_20yrs80occ_nbr	3.20
	(2.58)
blt20_30yrs80occ_nbr	2.21
	(2.27)
blt30_40yrs80occ_nbr	2.44
	(2.58)
blt40_yrs80occ_nbr	2.73
	(2.50)
nobs	226
r.squared	0.68
adj.r.squared	0.55
statistic
p.value
df.residual	12.91
N	226.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Reduced Form
Reg111_RF_DD_FE <- lm_robust(lnmdvalhs0 ~ HRS_Thresh_DD + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr + firestoveheat80_nbr + nofullkitchen80_nbr + bedrms1_80occ_nbr+ bedrms2_80occ_nbr + bedrms3_80occ_nbr + bedrms4_80occ_nbr + bedrms5_80occ_nbr + detach80occ_nbr + attach80occ_nbr  + blt2_5yrs80occ_nbr + blt6_10yrs80occ_nbr + blt10_20yrs80occ_nbr + blt20_30yrs80occ_nbr + blt30_40yrs80occ_nbr + blt40_yrs80occ_nbr, clusters=statefips, data=mile_ln_DD_R, fixed_effects = statefips)

export_summs(Reg111_RF_DD_FE)

	Model 1
HRS_Thresh_DD	0.05
	(0.04)
lnmeanhs8_nbr	0.30
	(0.12)
tothsun8_nbr	-0.00
	(0.00)
occupied80_nbr	0.86
	(1.10)
ownocc8_nbr	-0.00
	(0.00)
avhhin8_nbr	0.00 ***
	(0.00)
povrat8_nbr	0.86
	(1.06)
unemprt8_nbr	-2.85
	(1.32)
welfare8_nbr	-1.36
	(1.16)
pop_den8_nbr	0.00
	(0.00)
shrblk8_nbr	0.04
	(0.24)
shrhsp8_nbr	0.55
	(0.71)
child8_nbr	0.61
	(0.75)
ba_or_better8_nbr	0.11
	(0.38)
firestoveheat80_nbr	-0.02
	(0.37)
nofullkitchen80_nbr	0.77
	(1.04)
bedrms1_80occ_nbr	11.05
	(8.89)
bedrms2_80occ_nbr	10.79
	(8.76)
bedrms3_80occ_nbr	10.78
	(8.82)
bedrms4_80occ_nbr	10.02
	(8.68)
bedrms5_80occ_nbr	10.35
	(8.62)
detach80occ_nbr	-0.25
	(0.44)
attach80occ_nbr	-1.19
	(0.54)
blt2_5yrs80occ_nbr	-1.72
	(1.79)
blt6_10yrs80occ_nbr	-0.64
	(1.41)
blt10_20yrs80occ_nbr	-1.67
	(1.53)
blt20_30yrs80occ_nbr	-1.75
	(1.33)
blt30_40yrs80occ_nbr	-1.52
	(1.96)
blt40_yrs80occ_nbr	-1.26
	(1.38)
nobs	226
r.squared	0.83
adj.r.squared	0.76
statistic
p.value
df.residual	12.91
N	226.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

#Second Stage
phat_DD_FE <- Reg111_FS_DD_FE$fitted.values
mile_ln_DD_R <- cbind(mile_ln_DD_R, phat_DD_FE)

#mile_ln <- mile_ln[,-53]

Reg111_SS_DD_FE <- lm_robust(lnmdvalhs0 ~ phat_DD + lnmeanhs8_nbr + tothsun8_nbr + occupied80_nbr + ownocc8_nbr + avhhin8_nbr  + povrat8_nbr + unemprt8_nbr + welfare8_nbr + pop_den8_nbr + shrblk8_nbr  + shrhsp8_nbr + child8_nbr + ba_or_better8_nbr+ firestoveheat80_nbr + nofullkitchen80_nbr + bedrms1_80occ_nbr+ bedrms2_80occ_nbr + bedrms3_80occ_nbr + bedrms4_80occ_nbr + bedrms5_80occ_nbr + detach80occ_nbr + attach80occ_nbr  + blt2_5yrs80occ_nbr + blt6_10yrs80occ_nbr + blt10_20yrs80occ_nbr + blt20_30yrs80occ_nbr + blt30_40yrs80occ_nbr + blt40_yrs80occ_nbr, clusters=statefips, data=mile_ln_DD_R, fixed_effects = statefips)
export_summs(Reg111_SS_DD_FE)

	Model 1
phat_DD	0.07
	(0.05)
lnmeanhs8_nbr	0.30
	(0.12)
tothsun8_nbr	-0.00
	(0.00)
occupied80_nbr	0.86
	(1.10)
ownocc8_nbr	-0.00
	(0.00)
avhhin8_nbr	0.00 ***
	(0.00)
povrat8_nbr	0.86
	(1.06)
unemprt8_nbr	-2.85
	(1.32)
welfare8_nbr	-1.36
	(1.16)
pop_den8_nbr	0.00
	(0.00)
shrblk8_nbr	0.04
	(0.24)
shrhsp8_nbr	0.55
	(0.71)
child8_nbr	0.61
	(0.75)
ba_or_better8_nbr	0.11
	(0.38)
firestoveheat80_nbr	-0.02
	(0.37)
nofullkitchen80_nbr	0.77
	(1.04)
bedrms1_80occ_nbr	11.05
	(8.89)
bedrms2_80occ_nbr	10.79
	(8.76)
bedrms3_80occ_nbr	10.78
	(8.82)
bedrms4_80occ_nbr	10.02
	(8.68)
bedrms5_80occ_nbr	10.35
	(8.62)
detach80occ_nbr	-0.25
	(0.44)
attach80occ_nbr	-1.19
	(0.54)
blt2_5yrs80occ_nbr	-1.72
	(1.79)
blt6_10yrs80occ_nbr	-0.64
	(1.41)
blt10_20yrs80occ_nbr	-1.67
	(1.53)
blt20_30yrs80occ_nbr	-1.75
	(1.33)
blt30_40yrs80occ_nbr	-1.52
	(1.96)
blt40_yrs80occ_nbr	-1.26
	(1.38)
nobs	226
r.squared	0.83
adj.r.squared	0.76
statistic
p.value
df.residual	12.91
N	226.00
se_type	CR2.00
* p < 0.001; p < 0.01; * p < 0.05.

If I were to enhance my control selection, I would aim to only control for variables that significantly explain the variation in both \(Y_i\) and \(D_i\), as the product of these two coefficients would be a solid proxy for the selection bias introduced into the model. In this specific setting however, upon the threshold region itself, the treated and control groups as extremely similar. Regressing controls on the dependent variable, \(Y_i\), and our estimated treatment term (\(\hat{D}_i\)), no significant coefficients arise. It appears the controls are very balanced in that sense.

I tried incorporating a model selection tool from the leaps package to come up with an alternative combination of regressors that treats for potentially too much multicollinearity. The paper describes them including every single variable they had available to them and this could infact drown out a significant effect entirely through the inclusion of a multitude of irrelevant controls that simply drive up the standard error on our estimate of interest.

p_load(leaps)

glimpse(mile_ln_DD_R$lnmdvalhs0)
glimpse(mile_ln_DD_R$phat_DD_FE)

#At this point in the code below, I am informed by an error code that y and x different lengths even though the glimpse above reveals them both to be columns in the dataframe, both of 226 rows in length. Sad!
regsubsetsObj <- regsubsets(data=mile_ln_DD_R, x=phat_DD_FE ,y=lnmdvalhs0, nbest = 12, really.big = T)
plot(regsubsetsObj, scale = "adjr2")  # regsubsets plot based on R-sq

Beyond desiring a more component model selection process, I feel at this stage I am ill equiped to apply a reliable procedure. I would apply some mean square error approach over the entire set of possible permutations of the model and find a subset of models which provide the lowest mean-square error measures. These would be appropriately balancing between minor degrees of selection bias and minor degrees of multicollinearity in a bid to best optimize identification of the treatment effect.

1.18 Credibility Check

I believe estimates generated in this most recent permuation of the model is most appropriate. It captures the fixed effects which drive a large portion of the explanatory power of the model. It also accounts for a sufficient albeit possibly too high number of controls, but even at the minimal level of controls with fixed effects implemented, our model initial showed no signs of significant effect too. Given that, though the standard error is somewhat bloated by excessive multicollineary, even with these controls absent it appears there is no significant causal relationship to identify in this case.