Q01 Load your packages. You're probably going to need/want tidyverse and here (among others).
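One way to do this, sketched with pacman's p_load() (which installs any missing packages before loading them); plain library() calls would work just as well:

```r
# p_load() installs missing packages, then loads everything listed
library(pacman)
p_load(tidyverse, here)
```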
Q02 Now load the data. This time, I saved the same data set as a single format: a .csv file. Use a function that reads .csv files—for example, read.csv() or read_csv() (from the readr package in the tidyverse).
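A sketch of the loading step. The actual file name isn't given in this document, so "data.csv" is a placeholder; file_data matches the data-frame name that appears in the output later on:

```r
# "data.csv" is a hypothetical file name -- substitute the real one
file_data <- read_csv(here("data.csv"))
dim(file_data)  # should be 25,000 rows by 12 columns
```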
## [1] 25000 12
Q03. Check your data set. Apply the function summary() to your data set. You should have 12 variables.
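The output below comes from a call like this (again using file_data as the data-frame name):

```r
summary(file_data)
```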
## fips hh_size hh_income cost_housing
## Min. : 1000 Min. : 1.000 Min. : 0.004 Min. : 4
## 1st Qu.:12099 1st Qu.: 2.000 1st Qu.: 4.600 1st Qu.: 700
## Median :27123 Median : 2.000 Median : 8.000 Median :1100
## Mean :27820 Mean : 2.834 Mean : 10.616 Mean :1278
## 3rd Qu.:42000 3rd Qu.: 4.000 3rd Qu.: 13.000 3rd Qu.:1600
## Max. :56000 Max. :17.000 Max. :143.600 Max. :7400
## n_vehicles hh_share_nonwhite i_renter i_moved
## Min. :0.00 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:1.00 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :2.00 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :2.04 Mean :0.2327 Mean :0.3756 Mean :0.1887
## 3rd Qu.:3.00 3rd Qu.:0.4000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :6.00 Max. :1.0000 Max. :1.0000 Max. :1.0000
## i_foodstamp i_smartphone i_internet time_commuting
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. : 0.25
## 1st Qu.:0.00000 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.: 15.00
## Median :0.00000 Median :1.0000 Median :1.0000 Median : 30.00
## Mean :0.08444 Mean :0.9365 Mean :0.9484 Mean : 36.74
## 3rd Qu.:0.00000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 47.50
## Max. :1.00000 Max. :1.0000 Max. :1.0000 Max. :376.00
Q04. Based upon your answer to Q03: What are the mean and median of household size (hh_size)? What does this tell you about the distribution of the variable?
The mean and median of household size (hh_size) are 2.834 and 2.000, respectively.
Because the mean (2.834) is greater than the median (2.000), the distribution is skewed right: a small number of unusually large households pulls the mean above the median.
Q05. Based upon your answer to Q03: What are the minimum, maximum, and mean of the indicator for whether a household moved in the last year (i_moved)? What does the mean of a binary indicator variable (such as i_moved) tell us?
## [1] 1 0 0 1 0 0 0 0
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.0000 0.0000 0.1887 0.0000 1.0000
The minimum is 0, the maximum is 1, and the mean is 0.1887.
A binary indicator variable (such as i_moved) equals 1 if the household moved and 0 if it did not, so its mean gives the share of observations equal to 1: here, about 18.9% of households moved in the last year.
Q06. Suppose we are interested in the relationship between a household's housing costs and its time spent commuting. Plot a scatter plot (e.g., using geom_point() from ggplot2) with housing cost (cost_housing) on the y axis and commute time (time_commuting) on the x axis.
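A sketch of the plotting code, again assuming the data frame is named file_data:

```r
# Scatter plot: commute time on x, housing cost on y
ggplot(file_data, aes(x = time_commuting, y = cost_housing)) +
  geom_point(alpha = 0.2)
```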
Q07. Based on your plot in Q06, if we regress housing costs on commute time, do you think we could have an issue with heteroskedasticity? Explain/justify your answer.
We would very likely have an issue with heteroskedasticity, which by definition means the error variance is not constant. The scatter plot in Q06 shows that the spread of housing costs changes visibly as commute time increases, and a fitted regression line clearly does not capture the data well: the variance of the points around the line differs as you move along the x axis. Both variables are also skewed right, which contributes to the uneven spread.
Q08. What issues can heteroskedasticity cause? (Hint: There are at least two main issues.)
The issues that arise with heteroskedasticity: (1) The OLS estimator is no longer BLUE (the Best Linear Unbiased Estimator), because it is no longer the most efficient linear unbiased estimator. The coefficients remain unbiased and consistent, but their variance is larger than that of a properly weighted alternative. (2) The usual OLS standard errors are biased, so our inference (t statistics, p-values, confidence intervals) is unreliable. A related practical issue is that some tests for heteroskedasticity (such as the Goldfeld-Quandt and Breusch-Pagan tests) can come back negative when heteroskedasticity takes a form they don't look for, which is why we also have the White test.
Q09. Time for a regression. Regress housing cost (cost_housing) on commute time (time_commuting) and household income (hh_income). Report your results—interpreting the intercept and coefficients and commenting on their statistical significance. Reminder: The household income variable is measured in tens of thousands (meaning that a value of 3 tells us the household's income is $30,000).
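A sketch of the estimation and table code (reg_ols is a hypothetical name); the table below is stargazer's text output:

```r
# Regress housing cost on household income and commute time
reg_ols <- lm(cost_housing ~ hh_income + time_commuting, data = file_data)
stargazer::stargazer(reg_ols, type = "text")
```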
##
## ================================================
## Dependent variable:
## ----------------------------
## cost_housing
## ------------------------------------------------
## hh_income 43.352***
## (0.451)
##
## time_commuting 0.473***
## (0.140)
##
## Constant 800.816***
## (8.272)
##
## ------------------------------------------------
## Observations 25,000
## R2 0.271
## Adjusted R2 0.271
## Residual Std. Error 734.352 (df = 24997)
## F Statistic 4,649.063*** (df = 2; 24997)
## ================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Looking at the stargazer table, each estimate carries three stars, meaning its p-value is below 0.01: the intercept and both coefficients are statistically significant at the 1% level.
Interpreting the estimates: the intercept (800.816) is the predicted housing cost (about $801) for a household with zero income and zero commute time. A one-unit increase in commute time is associated with a $0.473 increase in housing cost, holding income constant. Because hh_income is measured in tens of thousands of dollars, a one-unit ($10,000) increase in household income is associated with a $43.35 increase in housing cost, holding commute time constant.
Q10. Use the residuals from your regression in Q09 to conduct a Breusch-Pagan test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Justify your answer. Hints: You can get the residuals from an lm object using the residuals() function, e.g., residuals(my_reg). You can get the R-squared from an estimated regression (e.g., a regression called my_reg) using summary(my_reg)$r.squared.
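A sketch of the BP test done by hand, following the hints (names are hypothetical); the summary of the squared residuals and the LM statistic are shown below:

```r
# Breusch-Pagan: regress squared residuals on the regressors; LM = n * R-squared
e2 <- residuals(reg_ols)^2
summary(e2)
reg_bp <- lm(e2 ~ hh_income + time_commuting, data = file_data)
25000 * summary(reg_bp)$r.squared  # LM statistic, chi-squared with 2 df under H0
```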
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 38007 157084 539208 440254 39374912
## [1] 0
After running the BP test we end up with an LM (chi-squared) statistic of essentially 0, far below any critical value, so we fail to reject the null hypothesis of homoskedasticity: the squared residuals do not vary linearly with our explanatory variables. Given the plot in Q06, this is exactly the kind of heteroskedasticity the BP test can fail to detect.
Q11. Now use your residuals from Q09 to conduct a White test for heteroskedasticity. Does your conclusion about heteroskedasticity change at all? Explain why you think this is. Hints: Recall that in R, lm(y ~ I(x^2)) will regress y on x squared, and lm(y ~ x1:x2) will regress y on the interaction between x1 and x2.
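A sketch of the White test, which adds squares and an interaction to the BP auxiliary regression (names hypothetical); its LM statistic is shown below:

```r
# White test: squared residuals on regressors, their squares, and interaction
reg_white <- lm(
  e2 ~ hh_income + time_commuting +
    I(hh_income^2) + I(time_commuting^2) + hh_income:time_commuting,
  data = file_data
)
25000 * summary(reg_white)$r.squared  # LM statistic, chi-squared with 5 df under H0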
## [1] 2589.835
The White test's LM statistic (2,589.8) is far above any reasonable chi-squared critical value, so we reject the null hypothesis of homoskedasticity. The conclusion changes: by including squared terms and an interaction, the White test detects heteroskedasticity that is a nonlinear function of the regressors, which is exactly what the BP test missed.
Q12. Now conduct a Goldfeld-Quandt test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Explain why this result makes sense.
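A sketch of a by-hand GQ test, assuming we order by commute time and compare the first and last 3/8 of the sample (both the ordering variable and the split fraction are assumptions); the p-value appears below:

```r
# Goldfeld-Quandt: order by a suspect variable, fit separate regressions on the
# low and high groups, and compare their residual sums of squares with an F test
df_gq <- dplyr::arrange(file_data, time_commuting)
n_gq <- floor(3 / 8 * nrow(df_gq))
reg_gq1 <- lm(cost_housing ~ hh_income + time_commuting, data = head(df_gq, n_gq))
reg_gq2 <- lm(cost_housing ~ hh_income + time_commuting, data = tail(df_gq, n_gq))
f_gq <- sum(residuals(reg_gq2)^2) / sum(residuals(reg_gq1)^2)
pf(f_gq, df1 = n_gq - 3, df2 = n_gq - 3, lower.tail = FALSE)  # p-value
```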
## [1] 0.5
The p-value here is 0.5, which is large, so we fail to reject the null hypothesis of homoskedasticity: the GQ test finds no significant evidence of heteroskedasticity. This result makes sense because the GQ test only detects variance that rises or falls monotonically with the variable we order on; like the BP test, it misses the nonlinear pattern that the White test picked up.
Q13. Using the lm_robust() function from the estimatr package, calculate heteroskedasticity-robust standard errors. How do these heteroskedasticity-robust standard errors compare to the plain OLS standard errors you previously found?
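A sketch of the call; the coefficient names in the output below (file_data$hh_income, etc.) suggest the variables were passed with $ rather than a data argument:

```r
# lm_robust() defaults to HC2 heteroskedasticity-robust standard errors
library(estimatr)
reg_robust <- lm_robust(
  file_data$cost_housing ~ file_data$hh_income + file_data$time_commuting
)
```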
## $coefficients
## (Intercept) file_data$hh_income file_data$time_commuting
## 800.8162090 43.3516196 0.4728251
##
## $std.error
## (Intercept) file_data$hh_income file_data$time_commuting
## 10.4454793 1.0026973 0.1457625
##
## $df
## (Intercept) file_data$hh_income file_data$time_commuting
## 24997 24997 24997
##
## $statistic
## (Intercept) file_data$hh_income file_data$time_commuting
## 76.666296 43.235000 3.243804
##
## $p.value
## (Intercept) file_data$hh_income file_data$time_commuting
## 0.000000000 0.000000000 0.001180997
##
## $conf.low
## (Intercept) file_data$hh_income file_data$time_commuting
## 780.342455 41.386274 0.187122
Looking at the $std.error entries: the robust standard error for hh_income is 1.003 (versus 0.451 from plain OLS), and for time_commuting it is 0.146 (versus 0.140). The heteroskedasticity-robust standard errors are larger than the plain OLS standard errors, with the biggest change for hh_income, whose standard error more than doubles.
Q14. Why did your coefficients remain the same in Q13.—even though your standard errors changed?
The coefficients remained the same because heteroskedasticity does not bias the OLS point estimates: lm_robust() uses the same OLS coefficients and only changes the formula used to estimate their variance. Robust standard errors adjust our inference, not the estimates themselves.
Q15. If you run weighted least squares (WLS), which of the following four possibilities would you expect? Explain your answer.
I would say option four: if we used WLS, we would expect both different coefficients and different standard errors relative to OLS. WLS reweights the observations before estimating, so the point estimates themselves change, not just the standard errors. When its weights are correct, WLS is more efficient than OLS; with incorrect weights it remains unbiased (under exogeneity) but loses that efficiency guarantee.
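For illustration, a sketch of running WLS, under the purely hypothetical assumption that the error variance is proportional to commute time:

```r
# WLS: weight each observation by the inverse of its assumed error variance
reg_wls <- lm(cost_housing ~ hh_income + time_commuting,
              data = file_data, weights = 1 / time_commuting)
```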
Q16. Does heteroskedasticity appear to matter in this setting? Explain your answer/reasoning.
I would say that heteroskedasticity matters in this setting. The scatter plot and the White test both show that the error variance is not constant, which is the definition of heteroskedasticity, and switching to robust standard errors more than doubled the standard error on hh_income. That said, every coefficient remains statistically significant even with robust standard errors, so the qualitative conclusions survive; what heteroskedasticity changes is how much confidence we can place in our inference.