Discussion Week 13

Author

Nuobing Fan

1. ISSUE SUMMARY

1. What is “heteroskedasticity”, and what econometric issue does it cause (does it affect point estimates or standard errors)? Do not confuse heteroskedasticity with other terms like multicollinearity, serial correlation, et cetera (2-3 sentences in your own words - e.g., do not copy/paste directly from the web.)

Heteroskedasticity refers to the situation in regression models where the variance of the error terms is not constant across levels of the predictor variables. It does not bias the point estimates, but it invalidates the usual OLS standard errors, leading to unreliable hypothesis tests and confidence intervals. In the presence of heteroskedasticity, OLS estimators are still unbiased but no longer efficient (i.e., they do not have minimum variance), so inference based on the conventional standard errors can lead to misleading conclusions.
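
As a quick, hedged illustration (all of the simulated data and object names below are hypothetical, not part of the assignment), the sketch below generates errors whose standard deviation grows with the regressor; the resulting residuals-vs-fitted plot shows the fan shape that is the visual signature of heteroskedasticity.

# Minimal simulated sketch (hypothetical data): error variance grows with x
set.seed(123)
n_sim <- 200
x_sim <- runif(n_sim, 0, 10)
e_sim <- rnorm(n_sim, sd = 0.5 * x_sim)  # error sd increases with x
y_sim <- 1 + 2 * x_sim + e_sim
sim.mod <- lm(y_sim ~ x_sim)

# Residuals fan out as fitted values grow - a classic heteroskedasticity pattern
plot(fitted(sim.mod), resid(sim.mod),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)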

2. CODING

1. Choose a dataset, specify your linear regression, and estimate the regression in R. Please keep at least 3 independent variables in your regression. This is your main regression.

# Clear environment
rm(list = ls())

# Load necessary libraries
library(MASS)
library(tidyverse)
library(psych)
library(stargazer)

# Load the USCrime dataset
data("UScrime")

# Overview of the dataset
str(UScrime)
'data.frame':   47 obs. of  16 variables:
 $ M   : int  151 143 142 136 141 121 127 131 157 140 ...
 $ So  : int  1 0 1 0 0 0 1 1 1 0 ...
 $ Ed  : int  91 113 89 121 121 110 111 109 90 118 ...
 $ Po1 : int  58 103 45 149 109 118 82 115 65 71 ...
 $ Po2 : int  56 95 44 141 101 115 79 109 62 68 ...
 $ LF  : int  510 583 533 577 591 547 519 542 553 632 ...
 $ M.F : int  950 1012 969 994 985 964 982 969 955 1029 ...
 $ Pop : int  33 13 18 157 18 25 4 50 39 7 ...
 $ NW  : int  301 102 219 80 30 44 139 179 286 15 ...
 $ U1  : int  108 96 94 102 91 84 97 79 81 100 ...
 $ U2  : int  41 36 33 39 20 29 38 35 28 24 ...
 $ GDP : int  394 557 318 673 578 689 620 472 421 526 ...
 $ Ineq: int  261 194 250 167 174 126 168 206 239 174 ...
 $ Prob: num  0.0846 0.0296 0.0834 0.0158 0.0414 ...
 $ Time: num  26.2 25.3 24.3 29.9 21.3 ...
 $ y   : int  791 1635 578 1969 1234 682 963 1555 856 705 ...
# Subset the dataset with the dependent variable and at least three independent variables
subset_USCrime <- UScrime[, c("y", "M", "Ed", "Po1")]

# Fit a linear regression model
lm.mod <- lm(formula = y ~ ., data = subset_USCrime)

# Display regression summary using stargazer for better formatting
stargazer(lm.mod, type = "text")

===============================================
                        Dependent variable:    
                    ---------------------------
                                 y             
-----------------------------------------------
M                            12.300***         
                              (3.833)          
                                               
Ed                             4.737           
                              (4.242)          
                                               
Po1                          10.718***         
                              (1.569)          
                                               
Constant                   -2,210.879**        
                             (833.612)         
                                               
-----------------------------------------------
Observations                    47             
R2                             0.575           
Adjusted R2                    0.545           
Residual Std. Error      260.858 (df = 43)     
F Statistic           19.373*** (df = 3; 43)   
===============================================
Note:               *p<0.1; **p<0.05; ***p<0.01
# Residual plot
plot(lm.mod, which = 1)

# Set up multiple plots in the same window
par(mfrow = c(2, 2))

# Create a residual plot with additional plots
plot(lm.mod)

# Reset to one plot per window
par(mfrow = c(1, 1))

# Summary of the linear regression model
summary(lm.mod)

Call:
lm(formula = y ~ ., data = subset_USCrime)

Residuals:
    Min      1Q  Median      3Q     Max 
-559.57 -168.63   -4.43  143.05  565.36 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2210.879    833.612  -2.652  0.01116 *  
M              12.300      3.833   3.209  0.00252 ** 
Ed              4.737      4.242   1.117  0.27031    
Po1            10.718      1.569   6.829 2.27e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 260.9 on 43 degrees of freedom
Multiple R-squared:  0.5748,    Adjusted R-squared:  0.5451 
F-statistic: 19.37 on 3 and 43 DF,  p-value: 4.25e-08

Here y, the crime rate, is the dependent variable, and the independent variables are M (percentage of males aged 14-24), Ed (mean years of schooling), and Po1 (per capita police expenditure).
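
Since heteroskedasticity would invalidate the standard errors reported above, one common remedy is to recompute them with a heteroskedasticity-consistent (robust) covariance estimator. The sketch below assumes the sandwich and lmtest packages are installed (they are not loaded above); the coefficient estimates stay the same and only the standard errors change.

# Hedged sketch: heteroskedasticity-robust (HC1) standard errors for lm.mod.
# Assumes the sandwich and lmtest packages are available.
library(sandwich)
library(lmtest)
coeftest(lm.mod, vcov = vcovHC(lm.mod, type = "HC1"))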

3. Now, run the auxiliary regression and interpret the R-squared value - what does it tell you about heteroskedasticity? To compute the auxiliary regression, first obtain the residuals from the main regression and square them; this vector will be your dependent variable. Your independent variables will be the same independent variables as in the main regression.
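
A hedged sketch of this auxiliary regression, reusing lm.mod and subset_USCrime from above, is given below; the commented bptest line at the end is an optional cross-check that assumes the lmtest package is installed.

# Auxiliary (Breusch-Pagan-style) regression: regress the squared residuals
# from the main regression on the same independent variables
resid_sq <- resid(lm.mod)^2
aux.mod <- lm(resid_sq ~ M + Ed + Po1, data = subset_USCrime)
summary(aux.mod)$r.squared

# LM statistic: n * R^2, compared against a chi-squared with df = 3 (the
# number of regressors); a large p-value is consistent with homoskedasticity
LM <- nobs(lm.mod) * summary(aux.mod)$r.squared
pchisq(LM, df = 3, lower.tail = FALSE)

# Optional cross-check (assumes lmtest is installed):
# library(lmtest); bptest(lm.mod)

Under the null of homoskedasticity, n times the auxiliary R-squared follows a chi-squared distribution with degrees of freedom equal to the number of regressors, so a small R-squared (and a large p-value) suggests the main regression's errors are homoskedastic.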