1 Estimate Multivariate Regression

1.1 Data

I am choosing to explore a linear model from a project I have been working on. The data come from census tracts within the City of Pittsburgh. The goal is to explore which variables explain income.

1.2 Regression Model Specification

\[ \log(Income_c) = \beta_0 + \beta_1 Black_c + \beta_2 OwnerOccupied_c + \beta_3 Unemployment_c + \beta_4 GRAPI_c + \beta_5 Disability_c + \epsilon_c \]

where \(c\) indexes the census tract.

All variables represent population proportions for census tracts:

  • Black = % of the tract that is Black or African American,

  • Owner Occupied = % of the tract that owns their home,

  • Unemployment = unemployment rate of the census tract,

  • GRAPI = % of the tract with gross rent as a percentage of income (GRAPI) above 35%,

  • Disability = % of the tract population with a disability.
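
Before estimating, it helps to glance at these columns. A quick sketch, assuming the column names used in the code below:

# Quick look at the variables that enter the model
summary(data[, c("income", "race_b", "own", "unemp", "grapi", "disab")])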

1.3 Estimation

library(stargazer)  # for the text regression table below

model <- lm(log(income) ~ race_b
               + own
               + unemp
               + grapi
               + disab,
               data = data)
stargazer(model, type = 'text')
## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                             log(income)        
## -----------------------------------------------
## race_b                       -0.004***         
##                               (0.001)          
##                                                
## own                          0.008***          
##                               (0.001)          
##                                                
## unemp                        -0.017***         
##                               (0.006)          
##                                                
## grapi                        -0.008***         
##                               (0.002)          
##                                                
## disab                        -0.021***         
##                               (0.005)          
##                                                
## Constant                     11.221***         
##                               (0.105)          
##                                                
## -----------------------------------------------
## Observations                    94             
## R2                             0.683           
## Adjusted R2                    0.665           
## Residual Std. Error       0.257 (df = 88)      
## F Statistic           37.887*** (df = 5; 88)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
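
Because the dependent variable is log(income), each slope is approximately the proportional change in income from a one-point increase in the predictor. The exact percentage effect can be recovered from the fitted model, for example:

# Exact percentage change in income per one-unit increase in each predictor (log-level model)
(exp(coef(model)[-1]) - 1) * 100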

2 Heteroskedasticity

2.1 Residual Analysis

From a simple eyeball test, residual plots don’t look that bad in terms of heteroskedasticity. However, more formal tests may help give a more definitive answer.

par(mfrow = c(2, 2))
plot(model)

2.2 Formal Tests with Packages

The lmtest package implements the Breusch-Pagan test for heteroskedasticity in linear regression models, and the skedastic package implements White's test (along with its own Breusch-Pagan implementation, used later).
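
Both are ordinary CRAN packages; assuming they are not yet attached, load them in the usual way:

# install.packages(c("lmtest", "skedastic"))  # if needed
library(lmtest)     # bptest()
library(skedastic)  # white(), breusch_pagan()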

2.2.1 White’s Test for Heteroskedasticity

White's Test is a special case of the method of Breusch and Pagan (1979). Under the null hypothesis of homoskedasticity, the test statistic is asymptotically chi-squared, with degrees of freedom equal to the number of regressors in the auxiliary regression (excluding the constant).

skedastic_package_white  <- white(mainlm =  model, 
                                  interactions = TRUE
                                  )         

# View the test results
skedastic_package_white
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      19.9   0.467        20 White's Test greater

The test statistic has a value of 19.86 and follows a chi-squared distribution with 20 degrees of freedom (the number of regressors in the auxiliary regression, excluding the constant). Its p-value is approximately 0.47, so we fail to reject the null of homoskedasticity.
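
The 20 degrees of freedom can be counted directly: with interactions = TRUE the auxiliary regression contains the 5 original regressors, their 5 squares, and the 10 pairwise interactions.

# Degrees of freedom of White's test with interactions: levels + squares + pairwise interactions
k <- 5
k + k + choose(k, 2)  # 5 + 5 + 10 = 20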

2.3 Manually Test with ‘Auxiliary Regression’

Now, we model the squared residuals as a function of the predictors, their squared terms, and their cross terms.

2.3.1 Create the Residual Squared Variable (the dependent y)

We will find the residuals from our model and put them in our data frame.

In heteroskedasticity regression tests, the dependent variable is the squared residual, so we must construct it manually. Note that the residuals come from the model you want to test for heteroskedasticity, and the resid command creates them easily. The y variable is no longer income.

data$residuals <- resid(object = model)
summary(data$residuals) # mean should be zero
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.578430 -0.159852  0.002979  0.000000  0.147901  0.610116
data$squared_residuals <- (data$residuals)^2 

summary(data$squared_residuals) # should not have any negative values
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000069 0.0055144 0.0231314 0.0619545 0.0689465 0.3722417

2.3.2 Create the X variables

The independent variables will be the original predictors, their squared terms, and their cross terms.

  1. The original predictors in our case were race_b, own, unemp, grapi, and disab.

  2. We can create their squared terms with the I function, without adding new variables to our data frame. In a model formula, symbols such as ^ and : have special formula meanings, so I() (the "as-is" operator) tells R to evaluate the enclosed expression arithmetically instead of interpreting it as formula syntax. For example, writing race_b^2 directly in a formula does not square the variable, but I(race_b^2) does.

  3. Cross terms are just interaction terms and can be created with the : operator; a more compact way to write the whole formula is sketched below.
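
As an aside, wrapping the main effects in (...)^2 inside a formula expands to all main effects plus all pairwise interactions, so only the squares need I(). A sketch of this equivalent, more compact call (the fully written-out version used in this document follows):

# Equivalent, more compact auxiliary formula: (x1 + ... + x5)^2 = main effects + all pairwise interactions
white_aux_compact <- lm(squared_residuals ~ (race_b + own + unemp + grapi + disab)^2 +
                          I(race_b^2) + I(own^2) + I(unemp^2) + I(grapi^2) + I(disab^2),
                        data = data)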

white_auxillary_reg <- lm(formula = squared_residuals ~ 
                            race_b +
                            own +
                            unemp +
                            grapi +
                            disab +
                            I(race_b^2) +
                            I(own^2) + 
                            I(unemp^2) +
                            I(grapi^2) +
                            I(disab^2) +
                            race_b:own +
                            race_b:unemp +
                            race_b:grapi +
                            race_b:disab +
                            own:unemp +
                            own:grapi +
                            own:disab +
                            unemp:grapi +
                            unemp:disab +
                            grapi:disab, 
                          data = data)
                   
white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary
## 
## Call:
## lm(formula = squared_residuals ~ race_b + own + unemp + grapi + 
##     disab + I(race_b^2) + I(own^2) + I(unemp^2) + I(grapi^2) + 
##     I(disab^2) + race_b:own + race_b:unemp + race_b:grapi + race_b:disab + 
##     own:unemp + own:grapi + own:disab + unemp:grapi + unemp:disab + 
##     grapi:disab, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.13778 -0.04648 -0.02000  0.02435  0.28809 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)  
## (Intercept)   1.899e-01  1.180e-01   1.609    0.112  
## race_b       -5.186e-04  2.808e-03  -0.185    0.854  
## own          -1.421e-03  2.781e-03  -0.511    0.611  
## unemp         2.182e-02  1.237e-02   1.764    0.082 .
## grapi        -3.774e-03  3.856e-03  -0.979    0.331  
## disab        -9.824e-03  9.181e-03  -1.070    0.288  
## I(race_b^2)  -2.369e-05  2.905e-05  -0.815    0.417  
## I(own^2)      6.720e-06  2.389e-05   0.281    0.779  
## I(unemp^2)   -5.843e-04  6.007e-04  -0.973    0.334  
## I(grapi^2)    3.514e-05  3.810e-05   0.922    0.359  
## I(disab^2)    4.035e-04  3.495e-04   1.155    0.252  
## race_b:own   -1.969e-06  3.233e-05  -0.061    0.952  
## race_b:unemp -1.268e-05  1.274e-04  -0.100    0.921  
## race_b:grapi  2.212e-05  4.235e-05   0.522    0.603  
## race_b:disab  1.043e-04  1.718e-04   0.607    0.546  
## own:unemp    -1.094e-04  1.507e-04  -0.726    0.470  
## own:grapi     4.171e-05  3.352e-05   1.244    0.217  
## own:disab    -2.095e-05  1.169e-04  -0.179    0.858  
## unemp:grapi   4.108e-05  1.629e-04   0.252    0.802  
## unemp:disab  -3.673e-04  5.641e-04  -0.651    0.517  
## grapi:disab  -1.370e-04  1.368e-04  -1.001    0.320  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09423 on 73 degrees of freedom
## Multiple R-squared:  0.2113, Adjusted R-squared:  -0.004803 
## F-statistic: 0.9778 on 20 and 73 DF,  p-value: 0.4978

The low R-squared (0.21) and the insignificant overall F statistic suggest that the regressors explain little of the variation in the squared residuals, i.e. no evidence of heteroskedasticity.
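
The auxiliary regression's overall F statistic tells the same story; its p-value (0.4978 in the printout above) can also be recomputed from the stored summary:

# Recompute the p-value of the auxiliary regression's overall F test
f <- white_auxillary_reg_summary$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)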

2.3.3 Chi-squared

We can continue testing with the chi-squared distribution.

More formally, calculate the chi-squared test statistic and compare it with the chi-squared critical value (or compute its p-value) to determine whether the auxiliary R-squared is significantly large.

Calculate the chi-squared test statistic: \[\chi^2 = n \cdot R^2_{new}\]

  1. \(n\): the total number of observations

  2. \(R^2_{new}\): the R-squared of the new regression model that used the squared residuals as the response values.

chisq_p_value <-  pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg), 
                         df = 20,       # degrees of freedom = number of regressors in the auxiliary regression (parameters estimated minus 1 for the constant)
                         lower.tail = FALSE  )
chisq_p_value
## [1] 0.4666767

We get exactly the same p-value from the package and by hand above.

We can also replicate the test statistic. Recall:

skedastic_package_white
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      19.9   0.467        20 White's Test greater

Now, computing the test statistic by hand:

white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)
## [1] 19.86065

Beautiful!
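
To finish the critical-value comparison mentioned earlier, assume the conventional 5% significance level: the chi-squared critical value with 20 degrees of freedom is about 31.4, well above the observed 19.86, so we again fail to reject homoskedasticity.

# Chi-squared critical value at the 5% level with 20 df; 19.86 falls below it
qchisq(p = 0.95, df = 20)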

2.4 Breusch-Pagan Test

The Breusch-Pagan test fits a linear regression model to the (squared) residuals of the main linear regression model (by default using the same explanatory variables as in the main model) and rejects the null if too much of their variance is explained by those explanatory variables.

Under \(H_0\) the test statistic of the Breusch-Pagan test follows a chi-squared distribution with degrees of freedom equal to the number of regressors in the model (excluding the constant).
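
The studentized (Koenker) version reported below can also be replicated by hand in the same spirit as White's test: regress the squared residuals on the original regressors only (no squares or interactions) and compute n times the R-squared. A sketch reusing the squared_residuals column created earlier; the statistic should match the BP value below up to rounding:

# By-hand Koenker/studentized BP statistic: n * R^2 from regressing squared residuals on the regressors
bp_aux <- lm(squared_residuals ~ race_b + own + unemp + grapi + disab, data = data)
nobs(bp_aux) * summary(bp_aux)$r.squared                                      # test statistic
pchisq(nobs(bp_aux) * summary(bp_aux)$r.squared, df = 5, lower.tail = FALSE)  # p-value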

lmtest_package_bp <- bptest(formula = model,  # fitted "lm" object, or run the regression
                         studentize = TRUE   # by default, so this can be commented out
)
                      
# View the test results
lmtest_package_bp
## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 3.7881, df = 5, p-value = 0.5803
skedastic_package_bp <- skedastic::breusch_pagan(model)

skedastic_package_bp
## # A tibble: 1 × 5
##   statistic p.value parameter method                alternative
##       <dbl>   <dbl>     <dbl> <chr>                 <chr>      
## 1      3.79   0.580         5 Koenker (studentised) greater

lmtest::bptest gives exactly the same output as skedastic::breusch_pagan, as expected.
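
A quick programmatic confirmation that the two implementations agree:

# The two test statistics are numerically identical
all.equal(unname(lmtest_package_bp$statistic), skedastic_package_bp$statistic)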

The test statistic value is 3.7881, it follows a chi-squared distribution with 5 degrees of freedom (the number of regressors in the model, excluding the constant), and its p-value is approximately 0.5803, so we fail to reject the null of homoskedasticity.

3 Which is Better?

Based on Professor Sharma's RPubs and explanation:

Both the Breusch-Pagan (BP) test and the White test are used to test for heteroskedasticity in a linear regression model. However, there is no clear consensus on which test is better, as the choice depends on factors such as the sample size, the distribution of the errors, and the specific objectives of the analysis.

  • In general, the White test is considered more robust than the BP test when the sample size is large or when the errors are not normally distributed.

  • However, the BP test may perform better when the errors are normally distributed and the sample size is small.

Ultimately, the choice of test should be based on the specific characteristics of the data and the research question being investigated.