Question 1

Q1a: Produce a summary statistics table using stargazer. (5 points)


Statistic	Mean	St. Dev.	Min	Pctl(25)	Median	Pctl(75)	Max

X	500.500	288.819	1	250.8	500.5	750.2	1,000
PublicHousing	29.370	8.801	0.000	23.669	29.827	35.221	60.000
HealthStatus	10.414	2.804	1.000	8.651	10.506	12.258	20.000
Supply	5.036	1.584	0.300	3.929	5.063	6.101	10.000
WaitingTime	25.019	6.139	5.000	20.922	25.128	29.400	46.000
HealthBehavior	12.527	4.078	0.000	9.788	12.515	15.318	25.000
Stamp	197.278	27.484	120.000	178.719	195.641	215.845	300.000
ParentsHealthStatus	9.891	3.280	1.000	7.733	9.797	12.111	20.000
Age	44.531	6.003	20.000	40.559	44.536	48.317	63.000
Race	2.500	1.119	1	1.8	2.5	3.2	4
Education	2.500	1.119	1	1.8	2.5	3.2	4
MaritalStatus	2.500	1.119	1	1.8	2.5	3.2	4
Race_4	0.250	0.433	0	0	0	0.2	1
Race_3	0.250	0.433	0	0	0	0.2	1
Race_1	0.250	0.433	0	0	0	0.2	1
Race_2	0.250	0.433	0	0	0	0.2	1
MaritalStatus_3	0.250	0.433	0	0	0	0.2	1
MaritalStatus_2	0.250	0.433	0	0	0	0.2	1
MaritalStatus_4	0.250	0.433	0	0	0	0.2	1
MaritalStatus_1	0.250	0.433	0	0	0	0.2	1
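
A table like the one above can be sketched as follows. The data frame `dat` is simulated and purely hypothetical (it is not the lab dataset), and the `summary.stat` selection is an assumption about how the original table was configured. The base-R helper reproduces the same columns so the sketch still runs if stargazer is not installed.

```r
# Hypothetical stand-in for the lab data; values are simulated, not real.
set.seed(525)
dat <- data.frame(
  PublicHousing = rnorm(1000, mean = 29, sd = 9),
  HealthStatus  = rnorm(1000, mean = 10, sd = 3)
)

# Base-R version of the statistics shown in the table, one row per variable.
summarise_var <- function(x) {
  c(Mean = mean(x), St.Dev = sd(x), Min = min(x),
    Pctl25 = unname(quantile(x, 0.25)), Median = median(x),
    Pctl75 = unname(quantile(x, 0.75)), Max = max(x))
}
summary_table <- t(sapply(dat, summarise_var))
print(round(summary_table, 3))

# The stargazer call itself, run only when the package is available.
if (requireNamespace("stargazer", quietly = TRUE)) {
  stargazer::stargazer(
    dat, type = "text", digits = 3,
    summary.stat = c("mean", "sd", "min", "p25", "median", "p75", "max")
  )
}
```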

Q1b: What are the mean and standard deviation of the dependent variable? (2.5 points)

The mean of the dependent variable (HealthStatus) is 10.414, and the standard deviation is 2.804.

Q1c: What are the mean and standard deviation of the policy variable? (2.5 points)

The mean of the policy variable (PublicHousing) is 29.370, and the standard deviation is 8.801.

Question 2

Q2a: Report the results in a regression table with stargazer. (5 points)


	Dependent variable:

	HealthStatus

Constant	-0.89^***
	(0.19)

PublicHousing	0.30^***
	(0.002)

Race_1	0.17^***
	(0.06)

Race_2	0.14^**
	(0.06)

Race_3	0.11^*
	(0.06)

Race_4


Education	-0.04^**
	(0.02)

Age	0.05^***
	(0.004)

MaritalStatus_1	-0.01
	(0.06)

MaritalStatus_2	0.04
	(0.06)

MaritalStatus_3	-0.09
	(0.06)

MaritalStatus_4



Observations	1,000
R²	0.94
Adjusted R²	0.94
Residual Std. Error	0.67 (df = 990)
F Statistic	1,830.58^*** (df = 9; 990)

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01

Q2b: What is the estimated effect of public housing in this model? Is that statistically significant? How much does an additional month of public housing assistance decrease or increase health status? (5 points)

The estimated effect of public housing in this model is 0.30, which is statistically significant at the 0.01 level. An additional month of public housing assistance is therefore associated with a 0.30-point increase in health status.
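
The reading of the coefficient as "the change per additional month" can be checked with a minimal sketch. The data are simulated and hypothetical, and the model uses only Age as a control for brevity; it is not the lab's full specification.

```r
# Simulated stand-in for the lab data; variable names mirror the report.
set.seed(525)
n <- 1000
Age <- rnorm(n, mean = 44.5, sd = 6)
PublicHousing <- rnorm(n, mean = 29, sd = 9)
HealthStatus <- -1 + 0.3 * PublicHousing + 0.05 * Age + rnorm(n, sd = 0.7)
dat <- data.frame(HealthStatus, PublicHousing, Age)

naive <- lm(HealthStatus ~ PublicHousing + Age, data = dat)

# Estimate and p-value for the policy variable.
ph <- coef(summary(naive))["PublicHousing", c("Estimate", "Pr(>|t|)")]
print(ph)

# Predicted change from one extra month, holding Age fixed.
at_29 <- data.frame(PublicHousing = 29, Age = 45)
at_30 <- data.frame(PublicHousing = 30, Age = 45)
one_month_change <- unname(predict(naive, at_30) - predict(naive, at_29))
print(one_month_change)
```

In a linear model the one-month difference in predictions equals the PublicHousing coefficient, whatever values the controls are held at.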

Question 3

Q3a: What three key characteristics should these variables have to be valid instruments? Briefly list and describe them. (15 points)

Correlated with the policy variable: The instrumental variable should have a strong correlation with the policy variable (in this case, PublicHousing). This ensures that the instrumental variable can help explain the variation in the policy variable and contribute to the identification of the causal effect of the policy variable on the dependent variable.
Correlated with the dependent variable only through the policy variable: The instrumental variable should be related to the dependent variable (in this case, HealthStatus) only through the policy variable. This means that the relationship between the instrumental variable and the dependent variable should be fully explained by the policy variable, with no other direct pathways.
Not correlated with the omitted variable: The instrumental variable should not be correlated with any omitted variables that might be affecting both the policy variable and the dependent variable. This ensures that the instrumental variable captures only the variation in the policy variable and helps address issues arising from the presence of omitted variables.

Question 4

Q4a: Which criteria among the ones described in question Q3a can you test by looking at the correlation matrix? (5 points)

Criterion 1 (the correlation between the instrument and policy variable) can indeed be tested empirically using the correlation matrix.

Criterion 2 (IV correlation with the dependent variable only through the policy variable) cannot be fully tested empirically using the correlation matrix, as it requires theoretical reasoning and background knowledge.

Criterion 3 (the IV should not be correlated with the omitted variable) generally cannot be tested empirically, because omitted variables are by definition unobserved. In this lab, however, the “omitted” variable (HealthBehavior) is included in the dataset, so its correlation with each candidate instrument can be read directly from the matrix.

In conclusion, criterion 1 can be tested empirically using the correlation matrix, criterion 2 requires theoretical reasoning and background knowledge, and criterion 3 is generally not testable but can be checked here because HealthBehavior appears in the matrix.
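
As a rough illustration of which entries of the matrix matter, the sketch below builds a correlation matrix from simulated data. All coefficients in the data-generating lines are assumptions chosen for the example, not estimates from the lab dataset.

```r
# Simulated data: WaitingTime is built to be a valid instrument by construction.
set.seed(525)
n <- 5000
HealthBehavior <- rnorm(n, 12.5, 4)   # "omitted" variable, observed here
WaitingTime <- rnorm(n, 25, 6)        # candidate instrument
PublicHousing <- 5 + 0.8 * WaitingTime + 0.5 * HealthBehavior + rnorm(n, sd = 5)
HealthStatus <- 1 + 0.2 * PublicHousing + 0.3 * HealthBehavior + rnorm(n)

cm <- cor(cbind(HealthStatus, PublicHousing, WaitingTime, HealthBehavior))
print(round(cm, 3))

# Criterion 1: instrument vs. policy variable (should be clearly non-zero).
relevance <- cm["WaitingTime", "PublicHousing"]
# Criterion 3: instrument vs. omitted variable (should be close to zero).
with_omitted <- cm["WaitingTime", "HealthBehavior"]
```

The instrument's correlation with HealthStatus is also in the matrix, but it cannot show on its own whether that link runs only through PublicHousing, which is why criterion 2 remains a theoretical argument.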

Q4b: For each of the three (four?) variables, describe which criteria of a valid instrument they meet and which criteria they do not meet based on the matrix. (15 points)

Supply:

Supply has a positive correlation with PublicHousing (r = 0.106), meeting the first criterion.
Supply has a weak correlation with HealthStatus (r = 0.065) and a positive correlation with PublicHousing (r = 0.106). Theoretically, we can argue that this correlation exists because an increase in supply of public housing leads to a higher number of people having public housing (policy variable), thus improving their health status.
Supply has a very weak correlation with HealthBehavior (r = 0.0136), which suggests that it is not correlated with the omitted variable.

WaitingTime:

WaitingTime has a positive correlation with PublicHousing (r = 0.550), meeting the first criterion.
WaitingTime has a moderate correlation with HealthStatus (r = 0.346) and a positive correlation with PublicHousing (r = 0.550). We can theorize that longer waiting times for public housing may result in individuals experiencing poorer living conditions while waiting, which could impact their health status due to the policy variable.
WaitingTime has a very weak negative correlation with HealthBehavior (r = -0.023), which suggests that it is not correlated with the omitted variable.

Stamp:

Stamp has a weak negative correlation with PublicHousing (r = -0.011), which does not meet the first criterion.
Stamp has a very weak negative correlation with HealthStatus (r = -0.006) and a weak negative correlation with PublicHousing (r = -0.011). The correlation between Stamp and HealthStatus cannot be explained through the policy variable.
Stamp has a very weak correlation with HealthBehavior (reported r = 0.027), which suggests that it is not correlated with the omitted variable.

ParentsHealthStatus:

ParentsHealthStatus has a weak positive correlation with PublicHousing (r = 0.071), meeting the first criterion.
ParentsHealthStatus has a weak positive correlation with HealthStatus (r = 0.083) and a weak positive correlation with PublicHousing (r = 0.071). We can theorize that parents’ health status might impact the need for public housing, and therefore, indirectly affect their children’s health status through the policy variable.
ParentsHealthStatus has a weak positive correlation with HealthBehavior (r = 0.103), which suggests that it is correlated with the omitted variable.

Q4c: As a result, which variable(s) is(are) a valid instrument? (5 points)

Supply and WaitingTime appear to be valid instruments based on this assessment.

Question 5

Q5a: Run a first-stage model using each of the instrumental variables that you selected as valid in question Q4c. Provide result tables using stargazer. (10 points)

**First-stage regression results**

	Dependent variable:

	PublicHousing
	(1)	(2)

Constant	20.861^***	3.009
	(2.459)	(2.154)

Supply	0.604^***
	(0.175)

WaitingTime		0.795^***
		(0.038)

Race_1	1.015	1.753^***
	(0.784)	(0.656)

Race_2	-0.232	0.611
	(0.782)	(0.654)

Race_3	0.954	1.297^**
	(0.785)	(0.656)

Race_4


Education	-0.237	-0.247
	(0.248)	(0.207)

Age	0.119^**	0.132^***
	(0.046)	(0.039)

MaritalStatus_1	-0.057	-0.110
	(0.784)	(0.655)

MaritalStatus_2	0.306	0.439
	(0.785)	(0.655)

MaritalStatus_3	1.052	0.830
	(0.786)	(0.657)

MaritalStatus_4



Observations	1,000	1,000
R²	0.025	0.319
Adjusted R²	0.016	0.313
Residual Std. Error (df = 990)	8.731	7.296
F Statistic (df = 9; 990)	2.802^***	51.526^***

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01

Q5b: Given the results of the first stage, which variable would you pick as instrumental variable? (5 points)

I would choose WaitingTime as the instrumental variable.

Q5c: Why? (5 points)

WaitingTime has a stronger relationship with the policy variable PublicHousing (coefficient = 0.795, p < 0.01) than Supply does (coefficient = 0.604, p < 0.01). The adjusted R² for the model with WaitingTime is much higher (0.313) than for the model with Supply (0.016), indicating a better first-stage fit. Additionally, the F statistic for the Supply model (2.802) is below the rule-of-thumb threshold of 10, which likely indicates a weak instrument, whereas the F statistic for the WaitingTime model is 51.526. Thus, WaitingTime is the more suitable instrumental variable.
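
The F statistics in the first-stage table are for the whole regression (df = 9; 990), controls included. One common complement is the partial F for the instrument alone, obtained by comparing the first stage with and without it. The sketch below uses simulated, hypothetical data with a single control.

```r
# Simulated first-stage data; the coefficients are assumptions for illustration.
set.seed(525)
n <- 1000
Age <- rnorm(n, mean = 44.5, sd = 6)
WaitingTime <- rnorm(n, mean = 25, sd = 6)
PublicHousing <- 3 + 0.8 * WaitingTime + 0.1 * Age + rnorm(n, sd = 7)

restricted  <- lm(PublicHousing ~ Age)                # controls only
first_stage <- lm(PublicHousing ~ WaitingTime + Age)  # controls + instrument

# F test for adding the instrument to the controls-only model.
f_test <- anova(restricted, first_stage)
partial_F <- f_test$F[2]

# With a single instrument, the partial F equals the squared t statistic.
t_value <- coef(summary(first_stage))["WaitingTime", "t value"]
print(c(partial_F = partial_F, t_squared = t_value^2))
```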

Question 6

Q6a: Provide result tables using stargazer. (5 points)

**Naive Model Results and Second-Stage IV Results**

	Dependent variable:

	HealthStatus
	(1)	(2)

Constant	-0.885^***	1.532^*
	(0.185)	(0.783)

PublicHousing	0.304^***
	(0.002)

phat_Waiting		0.203^***
		(0.017)

Race_1	0.170^***	0.272
	(0.060)	(0.231)

Race_2	0.142^**	0.123
	(0.060)	(0.230)

Race_3	0.106^*	0.208
	(0.060)	(0.231)

Race_4


Education	-0.039^**	-0.060
	(0.019)	(0.073)

Age	0.053^***	0.065^***
	(0.004)	(0.014)

MaritalStatus_1	-0.014	-0.025
	(0.060)	(0.231)

MaritalStatus_2	0.044	0.089
	(0.060)	(0.231)

MaritalStatus_3	-0.085	0.022
	(0.060)	(0.232)

MaritalStatus_4



Observations	1,000	1,000
R²	0.943	0.168
Adjusted R²	0.943	0.160
Residual Std. Error (df = 990)	0.671	2.569
F Statistic (df = 9; 990)	1,830.584^***	22.145^***

Note:	^*p<0.1; ^**p<0.05; ^***p<0.01

Q6b: Is the effect of public housing on health status positive or negative after you utilize an instrumental variable? Is it statistically significant? (5 points)

The effect of public housing on health status is positive after using the instrumental variable, with a coefficient of 0.203. It is statistically significant at p<0.01.

Q6c: How much does the health status of an individual increase or decrease for each additional month spent in public housing? (5 points)

For each additional month spent in public housing, an individual’s health status increases by 0.203 units.

Q6d: Compare results with the naive model. Were we under- or over-estimating the effect of public housing on health status? (5 points)

In the naive model, the coefficient for PublicHousing is 0.304, and it’s statistically significant at p<0.01. The IV model, however, estimates a smaller effect of 0.203, which is also statistically significant at p<0.01. This means that we were overestimating the effect of public housing on health status in the naive model. The IV model provides a more accurate estimate by accounting for the issue of the independent variable being correlated with the error term.
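
The direction of the bias can be illustrated with a manual two-stage sketch on simulated data where the true effect is known. Every number in the data-generating lines is an assumption made for the example. Note that standard errors from a hand-run second stage are not the correct 2SLS standard errors; a dedicated routine such as `ivreg()` in the AER package is the usual way to get those.

```r
# Simulated data with a known true effect and an omitted confounder.
set.seed(525)
n <- 5000
true_effect <- 0.2
HealthBehavior <- rnorm(n, 12.5, 4)   # omitted from both models below
WaitingTime <- rnorm(n, 25, 6)        # instrument
PublicHousing <- 5 + 0.8 * WaitingTime + 0.5 * HealthBehavior + rnorm(n, sd = 5)
HealthStatus <- 1 + true_effect * PublicHousing + 0.6 * HealthBehavior + rnorm(n)

# Naive model: PublicHousing is correlated with the error term.
naive <- lm(HealthStatus ~ PublicHousing)

# Stage 1: predict the policy variable from the instrument.
first_stage <- lm(PublicHousing ~ WaitingTime)
phat_Waiting <- fitted(first_stage)

# Stage 2: regress the outcome on the predicted policy variable.
second_stage <- lm(HealthStatus ~ phat_Waiting)

naive_est <- unname(coef(naive)["PublicHousing"])
iv_est <- unname(coef(second_stage)["phat_Waiting"])
print(c(true = true_effect, naive = naive_est, iv = iv_est))
```

Because the simulated confounder raises both PublicHousing and HealthStatus, the naive estimate is pushed above the true effect, which is the same pattern the report finds.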

BONUS QUESTION 1

Imagine that you are explaining your results to a journalist. You want to provide a concrete example of the effect of public housing assistance. (5 points)

BQ1a: From the summary statistics table, retrieve the amount of time spent in a public house for an average individual and an individual in the top 25 percent (use the third quantile).

Summary Statistics for PublicHousing
Statistic	Value
Min.	0.00000
1st Qu.	23.66876
Median	29.82691
Mean	29.37035
3rd Qu.	35.22123
Max.	60.00000

The average individual spends 29.37 months in public housing, while an individual in the top 25 percent (third quantile) spends 35.22 months.

BQ1b: If, for each 1-point increase in the health status index, the government saves $1,200 in medical bills, what is the average amount of savings when we move from an average individual to an individual in the top 25 percent? Consider the individual to be white, with a diploma, 35 years old, and single.

# Coefficients from the IV model
coefficients <- second_stage_model$coefficients

# PublicHousing values
public_housing_avg <- 29.37
public_housing_top_25 <- 35.22

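# Demographics (white, diploma, 35 years old, single).
# The race and marital-status terms are not used below: they are identical
# for both individuals, so they cancel in the difference.
white <- 0
diploma <- 2
age_35 <- 35
single <- 0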

# Health status for average individual
health_status_avg <- coefficients["(Intercept)"] + coefficients["phat_Waiting"] * public_housing_avg + coefficients["Education"] * diploma + coefficients["Age"] * age_35

# Health status for top-25 percent individual
health_status_top_25 <- coefficients["(Intercept)"] + coefficients["phat_Waiting"] * public_housing_top_25 + coefficients["Education"] * diploma + coefficients["Age"] * age_35

# Difference in health status
diff_health_status <- health_status_top_25 - health_status_avg

# Savings
savings <- diff_health_status * 1200

savings

## (Intercept) 
##    1428.045

Savings ≈ $1,428.05
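
Because both individuals share the same demographics, everything except the public-housing term cancels, and the calculation reduces to coefficient × extra months × dollars per point. The sketch below redoes the arithmetic with the rounded coefficient from the regression table (0.203). The small gap from the $1,428.045 output above comes from that rounding.

```r
# Inputs taken from the tables in this report.
beta_iv <- 0.203            # rounded IV coefficient on phat_Waiting
mean_months <- 29.37        # mean of PublicHousing
q3_months <- 35.22          # third quartile of PublicHousing
savings_per_point <- 1200   # dollars saved per health-status point

extra_months <- q3_months - mean_months
savings_rounded <- beta_iv * extra_months * savings_per_point
print(savings_rounded)

# Unrounded coefficient implied by the savings printed by the original code.
implied_beta <- 1428.045 / (extra_months * savings_per_point)
print(implied_beta)
```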

BONUS QUESTION 2

Will the omitted variable be correlated with the residual of stage 1? Will it be correlated with the predicted values of X1? (5 points)

BQ2a: Draw a plot representing the correlation between the omitted variable (the variable is called HealthBehavior in your dataset) and the predicted value of the first stage.

No correlation with the predicted value.

BQ2b: Draw a plot representing the correlation between the omitted variable (the variable is called HealthBehavior in your dataset) and the residuals of the first stage.

Strong correlation with the residuals.
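
A minimal sketch of how the two plots could be drawn, using simulated and hypothetical data rather than the lab dataset. The first stage here omits the controls for brevity, and the strength of the HealthBehavior effect is an assumption chosen to make the pattern visible.

```r
# Simulated data: HealthBehavior affects PublicHousing but not the instrument.
set.seed(525)
n <- 5000
HealthBehavior <- rnorm(n, 12.5, 4)
WaitingTime <- rnorm(n, 25, 6)
PublicHousing <- 5 + 0.8 * WaitingTime + 1.5 * HealthBehavior + rnorm(n, sd = 3)

first_stage <- lm(PublicHousing ~ WaitingTime)
phat <- fitted(first_stage)   # predicted values of the policy variable
res <- resid(first_stage)     # first-stage residuals

cor_fitted <- cor(HealthBehavior, phat)
cor_resid <- cor(HealthBehavior, res)
print(c(fitted = cor_fitted, residuals = cor_resid))

# Side-by-side scatter plots with fitted lines.
op <- par(mfrow = c(1, 2))
plot(HealthBehavior, phat, pch = 20, col = "grey50",
     main = "Omitted variable vs. predicted values",
     xlab = "HealthBehavior", ylab = "Predicted PublicHousing")
abline(lm(phat ~ HealthBehavior), col = "red")
plot(HealthBehavior, res, pch = 20, col = "grey50",
     main = "Omitted variable vs. residuals",
     xlab = "HealthBehavior", ylab = "First-stage residuals")
abline(lm(res ~ HealthBehavior), col = "red")
par(op)
```

The predicted values carry only the variation that comes from the instrument, so the omitted variable ends up in the residuals instead.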

Lab-04, CPP 525, Brett Foster