Question 1

Q1a: Produce a summary statistics table using stargazer. (5 points)

Statistic            Mean      St. Dev.  Min       Pctl(25)  Median    Pctl(75)  Max
X                    500.500   288.819   1         250.8     500.5     750.2     1,000
PublicHousing        29.370    8.801     0.000     23.669    29.827    35.221    60.000
HealthStatus         10.414    2.804     1.000     8.651     10.506    12.258    20.000
Supply               5.036     1.584     0.300     3.929     5.063     6.101     10.000
WaitingTime          25.019    6.139     5.000     20.922    25.128    29.400    46.000
HealthBehavior       12.527    4.078     0.000     9.788     12.515    15.318    25.000
Stamp                197.278   27.484    120.000   178.719   195.641   215.845   300.000
ParentsHealthStatus  9.891     3.280     1.000     7.733     9.797     12.111    20.000
Age                  44.531    6.003     20.000    40.559    44.536    48.317    63.000
Race                 2.500     1.119     1         1.8       2.5       3.2       4
Education            2.500     1.119     1         1.8       2.5       3.2       4
MaritalStatus        2.500     1.119     1         1.8       2.5       3.2       4
Race_4               0.250     0.433     0         0         0         0.2       1
Race_3               0.250     0.433     0         0         0         0.2       1
Race_1               0.250     0.433     0         0         0         0.2       1
Race_2               0.250     0.433     0         0         0         0.2       1
MaritalStatus_3      0.250     0.433     0         0         0         0.2       1
MaritalStatus_2      0.250     0.433     0         0         0         0.2       1
MaritalStatus_4      0.250     0.433     0         0         0         0.2       1
MaritalStatus_1      0.250     0.433     0         0         0         0.2       1
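
As a quick plausibility check on the table, a few rows can be reproduced by hand. The sketch below rests on two assumptions that the assignment does not state: that the four categories are equally sized (250 observations each) and that X is simply a row index running from 1 to 1,000. The names `x_index`, `race`, `race_1` and `check` are illustrative.

```r
# Assumptions (not confirmed by the source): 1,000 rows, four equally
# sized categories, and X being a plain row index.
n <- 1000
x_index <- seq_len(n)
race <- rep(1:4, each = n / 4)
race_1 <- as.integer(race == 1)

# Mean and sample standard deviation for the three reconstructed columns.
check <- data.frame(
  variable = c("X", "Race", "Race_1"),
  mean = c(mean(x_index), mean(race), mean(race_1)),
  sd = round(c(sd(x_index), sd(race), sd(race_1)), 3),
  stringsAsFactors = FALSE
)
print(check)
```

Under those assumptions the standard deviations come out as 288.819, 1.119 and 0.433, matching the table. The means of the categorical codes (Race, Education, MaritalStatus) are reported by stargazer but carry no substantive meaning.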

Q1b: What are the mean and standard deviation of the dependent variable? (2.5 points)

The mean of the dependent variable (HealthStatus) is 10.414, and the standard deviation is 2.804.

Q1c: What are the mean and standard deviation of the policy variable? (2.5 points)

The mean of the policy variable (PublicHousing) is 29.370, and the standard deviation is 8.801.

Question 2

Q2a: Report the results in a regression table with stargazer. (5 points)

                     Dependent variable:
                     HealthStatus
Constant             -0.89***
                     (0.19)
PublicHousing        0.30***
                     (0.002)
Race_1               0.17***
                     (0.06)
Race_2               0.14**
                     (0.06)
Race_3               0.11*
                     (0.06)
Race_4
Education            -0.04**
                     (0.02)
Age                  0.05***
                     (0.004)
MaritalStatus_1      -0.01
                     (0.06)
MaritalStatus_2      0.04
                     (0.06)
MaritalStatus_3      -0.09
                     (0.06)
MaritalStatus_4
Observations         1,000
R2                   0.94
Adjusted R2          0.94
Residual Std. Error  0.67 (df = 990)
F Statistic          1,830.58*** (df = 9; 990)
Note: *p<0.1; **p<0.05; ***p<0.01

Q2b: What is the estimated effect of public housing in this model? Is that statistically significant? How much does an additional month of public housing assistance decrease or increase health status? (5 points)

The estimated effect of public housing in this model is 0.30 (standard error 0.002), which is statistically significant at the 0.01 level. Holding the other covariates constant, an additional month of public housing assistance is associated with a 0.30-point increase in health status.

Question 3

Q3a: What three key characteristics should these variables have to be valid instruments? Briefly list and describe them. (15 points)

  1. Correlated with the policy variable (relevance): The instrumental variable should be strongly correlated with the policy variable (in this case, PublicHousing). Only then does the instrument explain enough of the variation in the policy variable to identify its causal effect on the dependent variable.

  2. Correlated with the dependent variable only through the policy variable (exclusion restriction): The instrumental variable should be related to the dependent variable (in this case, HealthStatus) only through the policy variable. Any relationship between the instrument and the dependent variable should be fully explained by the policy variable, with no other direct pathway.

  3. Not correlated with the omitted variable (exogeneity): The instrumental variable should not be correlated with any omitted variable that affects both the policy variable and the dependent variable. This ensures that the variation in the policy variable picked up by the instrument is free of omitted-variable bias.
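
The three criteria can be made concrete with a small simulation. This is only a sketch on made-up data, not the assignment's dataset: `z` stands for a hypothetical instrument, `u` for the omitted variable, `x` for the policy variable and `y` for the outcome, and the true effect of `x` on `y` is set to 0.2 by construction.

```r
# Simulated data (hypothetical): z is a valid instrument by construction.
set.seed(42)
n <- 5000
z <- rnorm(n)                  # instrument
u <- rnorm(n)                  # omitted variable
x <- 0.8 * z + u + rnorm(n)    # policy variable depends on z and u
y <- 0.2 * x + u + rnorm(n)    # outcome; z enters only through x

# Criterion 1 (relevance): z is clearly correlated with x.
relevance <- cor(z, x)

# Criterion 3 (exogeneity): z is essentially uncorrelated with u.
exogeneity <- cor(z, u)

# Criterion 2 (exclusion): once x and u are controlled for, z has no
# remaining direct effect on y. This check is only possible here because
# u is observed in the simulation.
direct_path <- unname(coef(lm(y ~ x + u + z))["z"])

print(c(relevance = relevance, exogeneity = exogeneity, direct_path = direct_path))
```

With real data only the first quantity is routinely observable; the other two normally have to be argued from theory.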

Question 4

Q4a: Which criteria among the ones described in question Q3a can you test by looking at the correlation matrix? (5 points)

Criterion 1 (the correlation between the instrument and policy variable) can indeed be tested empirically using the correlation matrix.

Criterion 2 (IV correlation with the dependent variable only through the policy variable) cannot be fully tested empirically using the correlation matrix, as it requires theoretical reasoning and background knowledge.

Criterion 3 (the IV should not be correlated with the omitted variable) generally cannot be tested empirically, because omitted variables are by definition unobserved. In this exercise, however, the omitted variable (HealthBehavior) is included in the dataset and in the correlation matrix, so the correlation between each instrument and HealthBehavior can be checked directly.

In conclusion, criterion 1 can be tested empirically using the correlation matrix, criterion 2 requires theoretical reasoning and background knowledge, and criterion 3 is normally untestable but can be checked here because HealthBehavior is observed.

Q4b: For each of the three (four?) variables, describe which criteria of a valid instrument they meet and which criteria they do not meet based on the matrix. (15 points)

Supply:

  1. Supply has a positive correlation with PublicHousing (r = 0.106), meeting the first criterion.

  2. Supply has a weak correlation with HealthStatus (r = 0.065) and a positive correlation with PublicHousing (r = 0.106). Theoretically, we can argue that this correlation exists because an increase in supply of public housing leads to a higher number of people having public housing (policy variable), thus improving their health status.

  3. Supply has a very weak correlation with HealthBehavior (r = 0.0136), which suggests that it is not correlated with the omitted variable.

WaitingTime:

  1. WaitingTime has a positive correlation with PublicHousing (r = 0.550), meeting the first criterion.

  2. WaitingTime has a moderate correlation with HealthStatus (r = 0.346) and a positive correlation with PublicHousing (r = 0.550). We can theorize that waiting time affects living conditions, and hence health status, through access to public housing (the policy variable) rather than through a separate direct pathway.

  3. WaitingTime has a very weak negative correlation with HealthBehavior (r = -0.023), which suggests that it is not correlated with the omitted variable.

Stamp:

  1. Stamp has a weak negative correlation with PublicHousing (r = -0.011), which does not meet the first criterion.

  2. Stamp has a very weak negative correlation with HealthStatus (r = -0.006) and a weak negative correlation with PublicHousing (r = -0.011). The correlation between Stamp and HealthStatus cannot be explained through the policy variable.

  3. Stamp has a very weak correlation with HealthBehavior (|r| = 0.027), which suggests that it is not correlated with the omitted variable.

ParentsHealthStatus:

  1. ParentsHealthStatus has a weak positive correlation with PublicHousing (r = 0.071), meeting the first criterion.

  2. ParentsHealthStatus has a weak positive correlation with HealthStatus (r = 0.083) and a weak positive correlation with PublicHousing (r = 0.071). We can theorize that parents’ health status might impact the need for public housing, and therefore, indirectly affect their children’s health status through the policy variable.

  3. ParentsHealthStatus has a weak positive correlation with HealthBehavior (r = 0.103), which suggests that it is correlated with the omitted variable.

Q4c: As a result, which variable(s) is(are) a valid instrument? (5 points)

Supply and WaitingTime appear to be valid instruments based on this assessment.
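
The screening in Q4b can be summarized in a few lines of code. The sketch below uses the correlations reported above in absolute value; the two cut-offs (`min_relevance` and `max_omitted`) are hypothetical and chosen only for illustration, since the assignment does not prescribe thresholds.

```r
# Absolute correlations as reported in Q4b.
screen <- data.frame(
  instrument = c("Supply", "WaitingTime", "Stamp", "ParentsHealthStatus"),
  r_policy = abs(c(0.106, 0.550, -0.011, 0.071)),   # with PublicHousing
  r_omitted = abs(c(0.0136, -0.023, 0.027, 0.103)), # with HealthBehavior
  stringsAsFactors = FALSE
)

# Hypothetical cut-offs, used only to make the reasoning explicit.
min_relevance <- 0.05
max_omitted <- 0.10

screen$relevant <- screen$r_policy >= min_relevance
screen$exogenous <- screen$r_omitted < max_omitted
screen$valid <- screen$relevant & screen$exogenous
print(screen)

valid_instruments <- screen$instrument[screen$valid]
print(valid_instruments)
```

With these illustrative cut-offs, Stamp fails on relevance and ParentsHealthStatus fails on exogeneity, which leaves Supply and WaitingTime. The exclusion restriction is not part of this screen because it cannot be read off a correlation matrix.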

Question 5

Q5a: Run a first-stage model using each of the instrumental variables that you selected as valid in question Q4c. Provide result tables using stargazer. (10 points)

First-stage regression results
                                Dependent variable:
                                PublicHousing
                                (1)          (2)
Constant                        20.861***    3.009
                                (2.459)      (2.154)
Supply                          0.604***
                                (0.175)
WaitingTime                                  0.795***
                                             (0.038)
Race_1                          1.015        1.753***
                                (0.784)      (0.656)
Race_2                          -0.232       0.611
                                (0.782)      (0.654)
Race_3                          0.954        1.297**
                                (0.785)      (0.656)
Race_4
Education                       -0.237       -0.247
                                (0.248)      (0.207)
Age                             0.119**      0.132***
                                (0.046)      (0.039)
MaritalStatus_1                 -0.057       -0.110
                                (0.784)      (0.655)
MaritalStatus_2                 0.306        0.439
                                (0.785)      (0.655)
MaritalStatus_3                 1.052        0.830
                                (0.786)      (0.657)
MaritalStatus_4
Observations                    1,000        1,000
R2                              0.025        0.319
Adjusted R2                     0.016        0.313
Residual Std. Error (df = 990)  8.731        7.296
F Statistic (df = 9; 990)       2.802***     51.526***
Note: *p<0.1; **p<0.05; ***p<0.01

Q5b: Given the results of the first stage, which variable would you pick as instrumental variable? (5 points)

I would choose WaitingTime as the instrumental variable.

Q5c: Why? (5 points)

WaitingTime has a much stronger first-stage relationship with the policy variable PublicHousing (coefficient = 0.795, standard error = 0.038, p < 0.01) than Supply (coefficient = 0.604, standard error = 0.175, p < 0.01). The adjusted R2 of the WaitingTime model is far higher (0.313) than that of the Supply model (0.016), so WaitingTime explains much more of the variation in PublicHousing. The overall F-statistic of the Supply model (2.802) is also below the conventional threshold of 10, against 51.526 for the WaitingTime model, which points to Supply being a weak instrument. Thus, WaitingTime is the more suitable instrumental variable.
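
The F-statistics in the table test all regressors jointly. Instrument strength is more commonly judged by the F-test on the instrument alone, which for a single instrument with conventional standard errors equals the squared t-statistic. A rough sketch from the rounded values in the first-stage table (so the results are approximate; the data frame `first_stage` is illustrative):

```r
# Coefficients and standard errors copied from the first-stage table (rounded).
first_stage <- data.frame(
  instrument = c("Supply", "WaitingTime"),
  estimate = c(0.604, 0.795),
  std_error = c(0.175, 0.038),
  stringsAsFactors = FALSE
)

# For one instrument, the F-statistic on that instrument is t squared.
first_stage$t_stat <- first_stage$estimate / first_stage$std_error
first_stage$f_instrument <- first_stage$t_stat^2

print(first_stage)
```

On this measure Supply comes out at roughly 11.9 and WaitingTime at well above 400, so the gap between the two instruments remains very large and the choice of WaitingTime is unchanged.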

Question 6

Q6a: Provide result tables using stargazer. (5 points)

Naive Model Results and Second-Stage IV Results
                                Dependent variable:
                                HealthStatus
                                (1)            (2)
Constant                        -0.885***      1.532*
                                (0.185)        (0.783)
PublicHousing                   0.304***
                                (0.002)
phat_Waiting                                   0.203***
                                               (0.017)
Race_1                          0.170***       0.272
                                (0.060)        (0.231)
Race_2                          0.142**        0.123
                                (0.060)        (0.230)
Race_3                          0.106*         0.208
                                (0.060)        (0.231)
Race_4
Education                       -0.039**       -0.060
                                (0.019)        (0.073)
Age                             0.053***       0.065***
                                (0.004)        (0.014)
MaritalStatus_1                 -0.014         -0.025
                                (0.060)        (0.231)
MaritalStatus_2                 0.044          0.089
                                (0.060)        (0.231)
MaritalStatus_3                 -0.085         0.022
                                (0.060)        (0.232)
MaritalStatus_4
Observations                    1,000          1,000
R2                              0.943          0.168
Adjusted R2                     0.943          0.160
Residual Std. Error (df = 990)  0.671          2.569
F Statistic (df = 9; 990)       1,830.584***   22.145***
Note: *p<0.1; **p<0.05; ***p<0.01

Q6b: Is the effect of public housing on health status positive or negative after you utilize an instrumental variable? Is it statistically significant? (5 points)

The effect of public housing on health status is positive after using the instrumental variable, with a coefficient of 0.203. It is statistically significant at p<0.01.

Q6c: How much does the health status of an individual increase or decrease for each additional month spent in public housing? (5 points)

For each additional month spent in public housing, an individual’s health status increases by 0.203 units.

Q6d: Compare results with the naive model. Were we under- or over-estimating the effect of public housing on health status? (5 points)

In the naive model, the coefficient for PublicHousing is 0.304, and it is statistically significant at p<0.01. The IV model estimates a smaller effect of 0.203, which is also statistically significant at p<0.01. This means that the naive model was overestimating the effect of public housing on health status. The IV estimate should be less biased because it uses only the variation in PublicHousing that is driven by WaitingTime, which addresses the correlation between PublicHousing and the error term.
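
The direction of the bias can be reproduced on simulated data. The sketch below is not the assignment's dataset: the variable names only mirror it, the true effect is set to 0.2, and the omitted variable (`behavior`) raises both public housing and health status, so the naive estimate is pushed upward.

```r
# Simulated data (hypothetical values, names chosen to mirror the assignment).
set.seed(1)
n <- 5000
waiting <- rnorm(n)                            # instrument
behavior <- rnorm(n)                           # omitted variable
housing <- 0.8 * waiting + behavior + rnorm(n) # policy variable
health <- 0.2 * housing + behavior + rnorm(n)  # outcome, true effect 0.2

# Naive OLS ignores the omitted variable.
naive <- lm(health ~ housing)

# Manual two-stage least squares: predict housing from the instrument,
# then regress the outcome on the predicted values.
first_stage <- lm(housing ~ waiting)
phat <- fitted(first_stage)
second_stage <- lm(health ~ phat)

naive_est <- unname(coef(naive)["housing"])
iv_est <- unname(coef(second_stage)["phat"])
print(c(naive = naive_est, iv = iv_est))
```

One caveat on the manual procedure: the standard errors reported by the second `lm()` call are not adjusted for the first stage, so a dedicated routine such as `ivreg()` from the AER package is usually preferred when inference matters.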

BONUS QUESTION 1

Imagine that you are explaining your results to a journalist. You want to provide a concrete example of the effect of public housing assistance. (5 points)

BQ1a: From the summary statistics table, retrieve the amount of time spent in public housing for an average individual and an individual in the top 25 percent (use the third quartile).

Summary Statistics for PublicHousing
Statistic Value
Min. 0.00000
1st Qu. 23.66876
Median 29.82691
Mean 29.37035
3rd Qu. 35.22123
Max. 60.00000

The average individual spends 29.37 months in public housing, while an individual in the top 25 percent (third quartile) spends 35.22 months.

BQ1b: If for each 1-point increase of the health status index, the government saves $1,200 in medical bills, what is the average amount of savings when we move from an average individual to an individual in the top 25 percent? Consider the individual to be white, with a diploma, 35 years old and single.

# Coefficients from the IV model
coefficients <- second_stage_model$coefficients

# PublicHousing values
public_housing_avg <- 29.37
public_housing_top_25 <- 35.22

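# Demographics (white, diploma, 35 years old, single).
# These terms enter both predictions identically, so they cancel in the
# difference; white and single are recorded but not used below.
white <- 0
diploma <- 2
age_35 <- 35
single <- 0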

# Health status for average individual
health_status_avg <- coefficients["(Intercept)"] + coefficients["phat_Waiting"] * public_housing_avg + coefficients["Education"] * diploma + coefficients["Age"] * age_35

# Health status for top-25 percent individual
health_status_top_25 <- coefficients["(Intercept)"] + coefficients["phat_Waiting"] * public_housing_top_25 + coefficients["Education"] * diploma + coefficients["Age"] * age_35

# Difference in health status
diff_health_status <- health_status_top_25 - health_status_avg

# Savings
savings <- diff_health_status * 1200

savings
## (Intercept) 
##    1428.045

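Savings ≈ $1,428.05 per individual (the IV coefficient multiplied by the 5.85-month difference and by $1,200 per point).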
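
Because the demographic terms cancel, the result can be checked with a one-line calculation. The sketch below uses the IV coefficient as rounded in the Q6a table (0.203), so it lands slightly below the figure computed from the unrounded model coefficient; the variable names are illustrative.

```r
# Inputs taken from the tables above (coefficient rounded to three decimals).
iv_coef <- 0.203
months_avg <- 29.37
months_q3 <- 35.22
saving_per_point <- 1200

# Extra months in public housing, then implied savings.
extra_months <- months_q3 - months_avg
savings_rounded <- iv_coef * extra_months * saving_per_point
print(round(savings_rounded, 2))
```

This gives about $1,425, against $1,428.05 from the model object; the gap comes only from rounding the coefficient.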

BONUS QUESTION 2

Will the omitted variable be correlated with the residual of stage 1? Will it be correlated with the predicted values of X1? (5 points)

BQ2a: Draw a plot representing the correlation between the omitted variable (the variable is called HealthBehavior in your dataset) and the predicted value of the first stage.

No correlation with the predicted value.

BQ2b: Draw a plot representing the correlation between the omitted variable (the variable is called HealthBehavior in your dataset) and the residuals of the first stage.

Strong correlation with the residuals.
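
The two answers follow from how the first stage splits the policy variable. The sketch below illustrates this on simulated data, not the assignment's dataset: `waiting` is a hypothetical instrument that is independent of the omitted variable `behavior` by construction, and the plots are drawn only in an interactive session.

```r
# Simulated data (hypothetical): the instrument is independent of the
# omitted variable, which still drives part of the policy variable.
set.seed(7)
n <- 5000
waiting <- rnorm(n)
behavior <- rnorm(n)
housing <- 0.8 * waiting + behavior + rnorm(n)

first_stage <- lm(housing ~ waiting)
predicted <- fitted(first_stage)
residual <- resid(first_stage)

# Predicted values inherit only the instrument's variation; the omitted
# variable ends up in the residuals.
cor_fitted <- cor(behavior, predicted)
cor_resid <- cor(behavior, residual)
print(c(predicted = cor_fitted, residuals = cor_resid))

if (interactive()) {
  plot(predicted, behavior, main = "Omitted variable vs. predicted values")
  plot(residual, behavior, main = "Omitted variable vs. residuals")
}
```

In this setup the correlation with the predicted values is close to zero while the correlation with the residuals is large, which is the pattern described in BQ2a and BQ2b.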