Q01 Load your packages. You'll probably need/want tidyverse and here (among others).

Q02 Now load the data. This time, I saved the same data set in a single format: a .csv file. Use a function that reads .csv files—for example, read.csv() or read_csv() (from the readr package in the tidyverse).
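A minimal sketch of the setup, assuming the data frame is named file_data (the name that appears in the Q13 output) and using a placeholder file name and folder:

```r
# Load (and, if needed, install) packages with pacman
library(pacman)
p_load(tidyverse, here)

# Read the .csv file; "data" and "assignment_data.csv" are placeholders
file_data <- read_csv(here("data", "assignment_data.csv"))

# Check the dimensions: should be 25,000 rows and 12 columns
dim(file_data)
```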

## [1] 25000    12

Q03. Check your data set. Apply the function summary() to your data set. You should have 12 variables.
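The check itself is a single call:

```r
summary(file_data)
```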

##       fips          hh_size         hh_income        cost_housing 
##  Min.   : 1000   Min.   : 1.000   Min.   :  0.004   Min.   :   4  
##  1st Qu.:12099   1st Qu.: 2.000   1st Qu.:  4.600   1st Qu.: 700  
##  Median :27123   Median : 2.000   Median :  8.000   Median :1100  
##  Mean   :27820   Mean   : 2.834   Mean   : 10.616   Mean   :1278  
##  3rd Qu.:42000   3rd Qu.: 4.000   3rd Qu.: 13.000   3rd Qu.:1600  
##  Max.   :56000   Max.   :17.000   Max.   :143.600   Max.   :7400  
##    n_vehicles   hh_share_nonwhite    i_renter         i_moved      
##  Min.   :0.00   Min.   :0.0000    Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:1.00   1st Qu.:0.0000    1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :2.00   Median :0.0000    Median :0.0000   Median :0.0000  
##  Mean   :2.04   Mean   :0.2327    Mean   :0.3756   Mean   :0.1887  
##  3rd Qu.:3.00   3rd Qu.:0.4000    3rd Qu.:1.0000   3rd Qu.:0.0000  
##  Max.   :6.00   Max.   :1.0000    Max.   :1.0000   Max.   :1.0000  
##   i_foodstamp       i_smartphone      i_internet     time_commuting  
##  Min.   :0.00000   Min.   :0.0000   Min.   :0.0000   Min.   :  0.25  
##  1st Qu.:0.00000   1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.: 15.00  
##  Median :0.00000   Median :1.0000   Median :1.0000   Median : 30.00  
##  Mean   :0.08444   Mean   :0.9365   Mean   :0.9484   Mean   : 36.74  
##  3rd Qu.:0.00000   3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 47.50  
##  Max.   :1.00000   Max.   :1.0000   Max.   :1.0000   Max.   :376.00

Q04. Based upon your answer to Q03: What are the mean and median of household size (hh_size)? What does this tell you about the distribution of the variable?

The mean of household size (hh_size) is 2.834 and the median is 2.000.

Because the mean (2.834) is greater than the median (2.000), the distribution is skewed to the right: a relatively small number of large households (up to 17 people) pulls the mean above the median.

Q05. Based upon your answer to Q03: What are the minimum, maximum, and mean of the indicator for whether a household moved in the last year (i_moved)? What does the mean of a binary indicator variable (such as i_moved) tell us?
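One way to reproduce the output below: peek at the first few values of i_moved and then summarize it (asking head() for 8 values just matches the printed output):

```r
head(file_data$i_moved, 8)
summary(file_data$i_moved)
```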

## [1] 1 0 0 1 0 0 0 0
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0000  0.0000  0.1887  0.0000  1.0000

The minimum is 0, the maximum is 1, and the mean is 0.1887.

A binary indicator variable such as i_moved takes the value 1 if the household moved in the last year and 0 if it did not. The mean of a binary indicator is the share of observations equal to 1, so a mean of 0.1887 tells us that about 18.9% of households moved in the last year.

Q06. Suppose we are interested in the relationship between a household's housing costs and its time spent commuting. Plot a scatter plot (e.g., using geom_point() from ggplot2) with housing cost (cost_housing) on the y axis and commute time (time_commuting) on the x axis.
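A minimal ggplot2 sketch (the axis labels, including the units, are my assumptions):

```r
ggplot(data = file_data, aes(x = time_commuting, y = cost_housing)) +
  geom_point(alpha = 0.3) +
  labs(x = "Commute time (minutes)", y = "Housing cost ($)")
```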

Q07. Based on your plot in Q06, if we regress housing costs on commute time, do you think we could have an issue with heteroskedasticity? Explain/justify your answer.

We would very likely have an issue with heteroskedasticity. Heteroskedasticity means the errors do not have a constant variance, and in the Q06 plot the spread of housing costs visibly changes as commute time increases: the points fan out rather than sitting in a band of constant width around a fitted line. The data are also strongly right skewed, so a regression line fits the dense low-cost observations far better than the sparse high-cost ones, which is exactly the pattern of non-constant variance we should worry about.

Q08. What issues can heteroskedasticity cause? (Hint: There are at least two main issues.)

Heteroskedasticity causes at least two main issues. First, the OLS estimator is no longer BLUE (the Best Linear Unbiased Estimator): it remains unbiased and consistent, but it is no longer the most efficient linear unbiased estimator. Second, the usual OLS standard-error formulas are biased, so t statistics, confidence intervals, and p-values based on them can be misleading. A related practical problem is that some tests for heteroskedasticity (the Goldfeld-Quandt and Breusch-Pagan tests) can fail to detect it when it does not take the specific form they look for, which is why we also have the more general White test.

Q09. Time for a regression. Regress housing cost (cost_housing) on commute time (time_commuting) and household income (hh_income). Report your results—interpreting the intercept and coefficients and commenting on their statistical significance. Reminder: The household income variable is measured in tens of thousands (meaning that a value of 3 tells us the household's income is $30,000).
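A sketch of the estimation and of the table below (stargazer with type = "text" produces this style of output):

```r
# Regress housing cost on income and commute time
reg_q09 <- lm(cost_housing ~ hh_income + time_commuting, data = file_data)

# Summarize the results in a text table
library(stargazer)
stargazer(reg_q09, type = "text")
```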

## 
## ================================================
##                         Dependent variable:     
##                     ----------------------------
##                             cost_housing        
## ------------------------------------------------
## hh_income                    43.352***          
##                               (0.451)           
##                                                 
## time_commuting                0.473***          
##                               (0.140)           
##                                                 
## Constant                     800.816***         
##                               (8.272)           
##                                                 
## ------------------------------------------------
## Observations                   25,000           
## R2                             0.271            
## Adjusted R2                    0.271            
## Residual Std. Error     734.352 (df = 24997)    
## F Statistic         4,649.063*** (df = 2; 24997)
## ================================================
## Note:                *p<0.1; **p<0.05; ***p<0.01

Looking at the stargazer table, the intercept and both coefficients are statistically significant at the 1% level (the three stars indicate p < 0.01), and their standard errors are small relative to the estimates.

Interpreting the estimates: the intercept of 800.816 is the predicted housing cost for a household with zero income and a zero-minute commute. Holding income constant, a one-minute increase in commute time is associated with a $0.47 increase in housing costs. Holding commute time constant, a one-unit increase in hh_income, which corresponds to an additional $10,000 of household income, is associated with a $43.35 increase in housing costs. (The tens-of-thousands scaling applies to the income variable itself, not to the coefficients, so there is no need to multiply the estimates by 10,000.)

Q10. Use the residuals from your regression in Q09 to conduct a Breusch-Pagan test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Justify your answer. Hints: You can get the residuals from an lm object using the residuals() function, e.g., residuals(my_reg). You can get the R-squared from an estimated regression (e.g., a regression called my_reg) using summary(my_reg)$r.squared.
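A by-hand sketch of the BP test, following the hints (reg_q09 is the regression from Q09):

```r
# Squared residuals from the Q09 regression
e2 <- residuals(reg_q09)^2
summary(e2)

# BP test: regress e^2 on the original regressors...
bp_reg <- lm(e2 ~ hh_income + time_commuting, data = file_data)

# ...and form the LM statistic n * R^2, which is chi-squared with
# 2 degrees of freedom under the null of homoskedasticity
nrow(file_data) * summary(bp_reg)$r.squared
```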

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##        0    38007   157084   539208   440254 39374912
## [1] 0

After running the BP test we end up with a chi-squared statistic of 0, far below any critical value, so we fail to reject the null hypothesis of homoskedasticity. According to the BP test, the squared residuals do not vary systematically, at least not linearly, with our explanatory variables. As Q11 shows, this conclusion turns out to be misleading.

Q11. Now use your residuals from Q09 to conduct a White test for heteroskedasticity. Does your conclusion about heteroskedasticity change at all? Explain why you think this is. Hints: Recall that in R, lm(y ~ I(x^2)) will regress y on x squared, and lm(y ~ x1:x2) will regress y on the interaction between x1 and x2.
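The White test extends the BP regression with squared terms and the interaction (e2 is the vector of squared residuals from Q10):

```r
# White test: add squares and the interaction to the BP regression
white_reg <- lm(
  e2 ~ hh_income + time_commuting +
    I(hh_income^2) + I(time_commuting^2) +
    hh_income:time_commuting,
  data = file_data
)

# LM statistic = n * R^2, chi-squared with 5 degrees of freedom under the null
nrow(file_data) * summary(white_reg)$r.squared
```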

## [1] 2589.835

The White test statistic of 2,589.8 dwarfs the chi-squared critical value (with 5 degrees of freedom, the 5% critical value is about 11.07), so we strongly reject the null hypothesis of homoskedasticity. My conclusion changes: the White test detects heteroskedasticity that the BP test missed. This is because the White test also includes the squared regressors and their interaction, so it can pick up variance that depends nonlinearly on the explanatory variables, which the BP test (with only linear terms) cannot.

Q12. Now conduct a Goldfeld-Quandt test for heteroskedasticity. Do you find significant evidence of heteroskedasticity? Explain why this result makes sense.
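A by-hand sketch of the GQ test. Ordering by time_commuting and keeping the conventional outer 3/8 of observations on each side are both assumptions here, since the original choices are not shown in the output:

```r
# Order by commute time and keep the outer 3/8 of observations on each side
n <- nrow(file_data)
m <- floor(3 * n / 8)
sorted   <- arrange(file_data, time_commuting)
grp_low  <- slice(sorted, 1:m)           # shortest commutes
grp_high <- slice(sorted, (n - m + 1):n) # longest commutes

# Fit the Q09 model separately in each group
reg_low  <- lm(cost_housing ~ hh_income + time_commuting, data = grp_low)
reg_high <- lm(cost_housing ~ hh_income + time_commuting, data = grp_high)

# GQ statistic: ratio of the residual sums of squares; p-value from the F distribution
gq_stat <- sum(residuals(reg_high)^2) / sum(residuals(reg_low)^2)
pf(gq_stat, df1 = reg_high$df.residual, df2 = reg_low$df.residual, lower.tail = FALSE)
```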

## [1] 0.5

The GQ test returns a p-value of 0.5. That p-value is large, so we fail to reject the null hypothesis of homoskedasticity: the GQ test finds no significant evidence of heteroskedasticity. This result makes sense because the GQ test only compares residual variances across two groups after ordering the data by a single variable, so it can miss heteroskedasticity that is nonlinear or tied to a combination of regressors, which is exactly what the White test detected. My overall conclusion is unchanged: the errors are heteroskedastic, but the GQ test, like the BP test, fails to catch it here.

Q13. Using the lm_robust() function from the estimatr package, calculate heteroskedasticity-robust standard errors. How do these heteroskedasticity-robust standard errors compare to the plain OLS standard errors you previously found?
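A sketch that reproduces the list-style output below. The variable names in that output (file_data$hh_income) suggest the original call put file_data$... directly in the formula; the data-argument version here is equivalent and cleaner. By default, lm_robust() uses HC2 standard errors:

```r
library(estimatr)

# Same specification as Q09, but with heteroskedasticity-robust standard errors
reg_robust <- lm_robust(cost_housing ~ hh_income + time_commuting, data = file_data)

# Pull out the components shown below
reg_robust[c("coefficients", "std.error", "df", "statistic", "p.value", "conf.low")]
```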

## $coefficients
##              (Intercept)      file_data$hh_income file_data$time_commuting 
##              800.8162090               43.3516196                0.4728251 
## 
## $std.error
##              (Intercept)      file_data$hh_income file_data$time_commuting 
##               10.4454793                1.0026973                0.1457625 
## 
## $df
##              (Intercept)      file_data$hh_income file_data$time_commuting 
##                    24997                    24997                    24997 
## 
## $statistic
##              (Intercept)      file_data$hh_income file_data$time_commuting 
##                76.666296                43.235000                 3.243804 
## 
## $p.value
##              (Intercept)      file_data$hh_income file_data$time_commuting 
##              0.000000000              0.000000000              0.001180997 
## 
## $conf.low
##              (Intercept)      file_data$hh_income file_data$time_commuting 
##               780.342455                41.386274                 0.187122

Looking at the std.error entries, the heteroskedasticity-robust standard errors are

hh_income: 1.0027 (versus 0.451 with plain OLS)

time_commuting: 0.1458 (versus 0.140 with plain OLS)

The robust standard errors are larger than the plain OLS standard errors, more than twice as large for hh_income, so the plain OLS standard errors overstate the precision of our estimates.

Q14. Why did your coefficients remain the same in Q13.—even though your standard errors changed?

The coefficients remained the same because heteroskedasticity-robust standard errors do not change the estimator itself: lm_robust() computes the same OLS point estimates and only changes how the variance-covariance matrix (and hence the standard errors) is estimated. Heteroskedasticity affects the precision of OLS, not its unbiasedness, so correcting the standard errors leaves the coefficients untouched.

Q15. If you run weighted least squares (WLS), which of the following four possibilities would you expect? Explain your answer.

  1. The same coefficients as OLS but different standard errors.
  2. Different coefficients from OLS but the same standard errors.
  3. The same coefficients as OLS and the same standard errors.
  4. Different coefficients from OLS and different standard errors.

Note: You do not need to run WLS.

I would expect option 4: different coefficients from OLS and different standard errors. WLS re-weights the observations before estimating, which changes the objective function being minimized, so the point estimates themselves change, and with them the standard errors. If the weights are correct, WLS is more efficient than OLS; even if the weights are not exactly right, WLS remains unbiased (though its standard errors can then be wrong).
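There is no need to run it, but a sketch makes the mechanics clear: WLS is just lm() with a weights argument, and changing the weights changes the problem being minimized. The weight choice below is purely illustrative:

```r
# Illustrative only: down-weight observations with long commutes
wls_reg <- lm(
  cost_housing ~ hh_income + time_commuting,
  data = file_data,
  weights = 1 / time_commuting
)
```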

Q16. Does heteroskedasticity appear to matter in this setting? Explain your answer/reasoning.

Yes, heteroskedasticity matters in this setting. The scatter plot shows the spread of housing costs changing with commute time, the White test overwhelmingly rejects homoskedasticity, and the robust standard errors are noticeably larger than the plain OLS ones (more than double for hh_income). Because the error variance is not constant, plain OLS standard errors would understate our uncertainty, so inference should rely on heteroskedasticity-robust standard errors (or a WLS specification).