1 Estimate a Multivariate Regression

Your model should have at least 3 independent variables.

1.1 Data

Choose your data.

I will continue with the data I used for my OLS point estimates in OLS_matrixVSlm in W1.

remove(list=ls())

# install.packages("MASS")
library(MASS)
  help(Boston) 

str(Boston) # 506 rows and 14 columns.
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ✖ dplyr::select() masks MASS::select()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
glimpse(Boston)
## Rows: 506
## Columns: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.08829,…
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5, 1…
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87, 7.…
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.524,…
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.631,…
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9, 9…
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9505…
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,…
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311, 31…
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2, 15…
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396.90…
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17.10…
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9, 15…
# install.packages("psych")
library(psych)
## 
## Attaching package: 'psych'
## 
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
describe(Boston)
##         vars   n   mean     sd median trimmed    mad    min    max  range  skew
## crim       1 506   3.61   8.60   0.26    1.68   0.33   0.01  88.98  88.97  5.19
## zn         2 506  11.36  23.32   0.00    5.08   0.00   0.00 100.00 100.00  2.21
## indus      3 506  11.14   6.86   9.69   10.93   9.37   0.46  27.74  27.28  0.29
## chas       4 506   0.07   0.25   0.00    0.00   0.00   0.00   1.00   1.00  3.39
## nox        5 506   0.55   0.12   0.54    0.55   0.13   0.38   0.87   0.49  0.72
## rm         6 506   6.28   0.70   6.21    6.25   0.51   3.56   8.78   5.22  0.40
## age        7 506  68.57  28.15  77.50   71.20  28.98   2.90 100.00  97.10 -0.60
## dis        8 506   3.80   2.11   3.21    3.54   1.91   1.13  12.13  11.00  1.01
## rad        9 506   9.55   8.71   5.00    8.73   2.97   1.00  24.00  23.00  1.00
## tax       10 506 408.24 168.54 330.00  400.04 108.23 187.00 711.00 524.00  0.67
## ptratio   11 506  18.46   2.16  19.05   18.66   1.70  12.60  22.00   9.40 -0.80
## black     12 506 356.67  91.29 391.44  383.17   8.09   0.32 396.90 396.58 -2.87
## lstat     13 506  12.65   7.14  11.36   11.90   7.11   1.73  37.97  36.24  0.90
## medv      14 506  22.53   9.20  21.20   21.56   5.93   5.00  50.00  45.00  1.10
##         kurtosis   se
## crim       36.60 0.38
## zn          3.95 1.04
## indus      -1.24 0.30
## chas        9.48 0.01
## nox        -0.09 0.01
## rm          1.84 0.03
## age        -0.98 1.25
## dis         0.46 0.09
## rad        -0.88 0.39
## tax        -1.15 7.49
## ptratio    -0.30 0.10
## black       7.10 4.06
## lstat       0.46 0.32
## medv        1.45 0.41
# save the Boston data from the MASS package in a data frame
Boston <- MASS::Boston

1.2 Main Regression Model Specification

Specify your linear regression -

\(\text{Median Home Value}_{i} = \beta_0 + \beta_1 \,\text{Per Capita Crime}_{i} + \beta_2 \,\text{Prop. of Non-retail Business}_{i} + \beta_3 \,\text{Prop. of Black}_{i} + \epsilon_{i}\)

where \(i\) indexes towns. From the subscripts you can already guess that the data are cross-sectional. Type help(Boston) to read more on the exact variable descriptions if you wish.

RECOMMENDATION: Try using equation mode to type out your specification with subscripts. See Typesetting Equations for details. If short on time, just tell us what your dependent variable is and what your independent variables are.
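
For example, the specification above can be typed in the text of your R Markdown file using display-math delimiters and _{} subscripts (a minimal sketch - the variable labels are just illustrative text, not Boston column names):

$$
\text{Median Home Value}_{i} = \beta_0 + \beta_1\,\text{Per Capita Crime}_{i}
  + \beta_2\,\text{Prop. of Non-retail Business}_{i} + \beta_3\,\text{Prop. of Black}_{i} + \epsilon_{i}
$$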

1.3 Estimation

Estimate your model.

subset_Boston <- Boston[ , c(1, 3, 12, 14)] # indexing to choose my subset: crim, indus, black and medv

# Linear regression model
lm.mod <- lm(formula = medv ~ . , 
             data    = subset_Boston
             )

summary(lm.mod) # confirm you have the independent variables you want
## 
## Call:
## lm(formula = medv ~ ., data = subset_Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.830  -4.830  -1.513   3.044  31.887 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 23.904515   1.827291  13.082  < 2e-16 ***
## crim        -0.205053   0.045803  -4.477 9.38e-06 ***
## indus       -0.481012   0.056737  -8.478 2.57e-16 ***
## black        0.013251   0.004221   3.139  0.00179 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.762 on 502 degrees of freedom
## Multiple R-squared:  0.2919, Adjusted R-squared:  0.2877 
## F-statistic: 68.98 on 3 and 502 DF,  p-value: < 2.2e-16

2 Heteroskedasticity

2.1 Residual Analysis

Plot “residuals versus fitted values” chart.

The code comes from the Linear Regression: Residual Analysis file in W3.

Eyeballing the residuals initially suggests to me that heteroskedasticity could be a concern, but I also have a relatively small sample. We can rely on more formal statistical tests, like the Breusch-Pagan or White's test, to confirm this for us.

# Create a residual plot of residuals vs. fitted values
plot(lm.mod, which = 1)

# Set up the plotting region / graphical parameters to have four subplots
par(mfrow = c(2, 2))

# Create a residual plot
plot(lm.mod)

par(mfrow = c(1, 1))

2.2 Formal Tests with packages

You will need to install the lmtest package, which implements the Breusch-Pagan test for heteroskedasticity in linear regression models, and the skedastic package, which implements White's test for heteroskedasticity directly.

# install.packages("skedastic")  # FOR WHITE TEST
library("skedastic")
?skedastic
## No documentation for 'skedastic' in specified packages and libraries:
## you could try '??skedastic'
# install.packages("lmtest")    # FOR BREUSH PAGAN TEST
require("lmtest")
## Loading required package: lmtest
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
?lmtest
## No documentation for 'lmtest' in specified packages and libraries:
## you could try '??lmtest'

2.2.1 White test for heteroskedasticity

Make sure you read up on the syntax of skedastic::white in R.

The R help file tells you that:

  1. White’s Test entails fitting an auxiliary regression model in which the response variable is the vector of squared residuals from the original model and the design matrix includes the original explanatory variables, their squares, and (optionally) their two-way interactions, i.e. cross terms. I ask you to include the cross terms in your auxiliary regression - so do set interactions = TRUE so that you get the same test statistic and p-value (an exact match).

  2. The test statistic (T) is the number of observations (n) multiplied by the coefficient of determination (fancy name for R-squared) from the auxiliary regression model: \(T = n \cdot R_{\mathrm{aux}}^2\). (A compact code sketch of this recipe follows this list.)

  3. White’s Test is a special case of the method of Breusch and Pagan (1979). Under the null hypothesis of homoskedasticity, the test statistic is asymptotically chi-squared, with degrees of freedom equal to the number of regressors (excluding the constant) in the auxiliary model.

  4. The test is right-tailed.
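
As promised above, here is a minimal sketch of that recipe applied to lm.mod, using only objects already defined (section 2.3 below walks through the same steps in more detail; the names aux_preview and T_preview are just illustrative):

# Auxiliary regression: squared residuals on the regressors, their squares and cross terms
aux_preview <- lm(I(resid(lm.mod)^2) ~ (crim + indus + black)^2 +
                    I(crim^2) + I(indus^2) + I(black^2),
                  data = subset_Boston)

T_preview <- nobs(aux_preview) * summary(aux_preview)$r.squared  # T = n * R^2_aux
pchisq(q = T_preview, df = 9, lower.tail = FALSE)                # right-tailed p-value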

?white # This function implements the popular method of White (1980) for testing for heteroskedasticity in a linear regression model. Read the paper in your Dropbox folder for Week 4.


# Conduct White test for heteroscedasticity
skedastic_package_white  <- white(mainlm =  lm.mod, 
                                  interactions = TRUE
                                  )         

# View the test results
skedastic_package_white
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      7.91   0.543         9 White's Test greater

The test statistic has a value of 7.91, follows a chi-squared distribution with degrees of freedom equal to the number of regressors (excluding the constant) in the auxiliary model, and has a p-value of approximately \(54 \%\) - thus we fail to reject the null of homoskedasticity!
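
Equivalently, because the test is right-tailed, you can compare the statistic with the upper 5% chi-squared quantile on 9 degrees of freedom (a small sanity check; the statistic column name comes from the tibble printed above):

# Reject at the 5% level only if the statistic exceeds the critical value
skedastic_package_white$statistic > qchisq(p = 0.95, df = 9)  # FALSE, so we do not reject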

2.3 Manually Test with “Auxiliary regression”

The purpose of this auxiliary regression is to investigate whether there is a systematic pattern or relationship between the squared residuals and the original independent variables.

We will model the squared residuals as a function of the predictors, their squared terms, and their cross terms. This is how you can create those terms -

2.3.1 White test for heteroskedasticity

2.3.1.1 Create the squared residual variable (i.e. the y or dependent variable)

We will find the residuals from our model and put them in our subset_Boston data frame. Make sure the residuals sum to zero as a sanity check.

In heteroskedasticity regression tests, the dependent variable is the squared residual - so construct it manually. Note the residuals come from the model for which you want to test for heteroskedasticity, and you can use the resid command to create them easily. The y variable is not medv (median house value) anymore!

?resid # Extracts model residuals from objects returned by modeling functions.
subset_Boston$residuals <- resid(object = lm.mod)

  summary(subset_Boston$residuals) # mean should be zero
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -11.830  -4.830  -1.513   0.000   3.044  31.887
subset_Boston$squared_residuals <- (subset_Boston$residuals)^2 

  summary(subset_Boston$squared_residuals) # should not have any negative values
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##    0.0001    3.7540   18.1894   59.7785   45.0763 1016.8044
  # subset_Boston$residuals <- NULL  # remove residuals 

2.3.1.2 Create the X variables

The independent variables will be the original predictors, their squared terms, and their cross terms.

  1. The original predictors in our case were crim, indus and black.

  2. We can create their squared terms easily with the I function, without creating new variables in our data frame. In an R model formula, operators such as ^ and : have special formula meanings, so I() - the “as-is” function - tells R to evaluate the enclosed expression arithmetically instead. For example, suppose you want to square a variable x in a regression. Writing x^2 inside a formula is interpreted as a formula operator (crossing x with itself, which collapses back to x), whereas I(x^2) enters the literal square of x as a regressor. (See the short demo after this list.)

  3. Cross terms are just interaction terms, and can be created with the : operator - see here for some examples. We will have three cross terms here - one for each pair of predictors: crim:indus, indus:black and black:crim.
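
Here is a minimal demo of the I() point above, using model.matrix to show which columns each formula actually generates (the two formulas are purely illustrative):

# Without I(), crim^2 collapses to crim, so no squared column appears
colnames(model.matrix(medv ~ crim + crim^2,    data = subset_Boston))
# With I(), the literal square enters as a regressor named I(crim^2)
colnames(model.matrix(medv ~ crim + I(crim^2), data = subset_Boston))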

white_auxillary_reg <- lm(formula = squared_residuals ~ crim     +   indus      +  black  + 
                                                     I(crim^2)  + I(indus^2)   + I(black^2) +
                                                   crim:indus + indus:black + black:crim ,
                             data = subset_Boston
                   ) 

white_auxillary_reg_summary <- summary(white_auxillary_reg)
white_auxillary_reg_summary
## 
## Call:
## lm(formula = squared_residuals ~ crim + indus + black + I(crim^2) + 
##     I(indus^2) + I(black^2) + crim:indus + indus:black + black:crim, 
##     data = subset_Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111.64  -52.93  -39.95   -6.58  946.98 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  6.836e+01  2.502e+02   0.273    0.785
## crim         5.993e+01  4.931e+01   1.215    0.225
## indus       -1.078e+01  1.456e+01  -0.741    0.459
## black        3.541e-01  8.230e-01   0.430    0.667
## I(crim^2)   -2.388e-02  3.005e-02  -0.795    0.427
## I(indus^2)   2.734e-01  1.827e-01   1.496    0.135
## I(black^2)  -8.412e-04  9.395e-04  -0.895    0.371
## crim:indus  -3.156e+00  2.725e+00  -1.158    0.247
## indus:black  1.169e-02  3.502e-02   0.334    0.739
## crim:black  -3.303e-03  5.597e-03  -0.590    0.555
## 
## Residual standard error: 148.6 on 496 degrees of freedom
## Multiple R-squared:  0.01563,    Adjusted R-squared:  -0.002229 
## F-statistic: 0.8752 on 9 and 496 DF,  p-value: 0.5473

Intuitively, the R-squared is low here, suggesting no heteroskedasticity.


Let's test this more formally using the chi-squared distribution.

More formally, calculate the chi-squared test statistic and compare it with the chi-squared critical value to determine whether the R-squared is significantly large.

Calculate the chi-squared test statistic \(\chi^2 = n \cdot R^2_{new}\), where

  1. n: The total number of observations

  2. \(R^2_{new}\): The R-squared of the new regression model that used the squared residuals as the response values

?pchisq

chisq_p_value <-  pchisq(q = white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg), 
                         df = 9,       # degrees of freedom = number of regressors (excluding the constant) in the auxiliary regression
                         lower.tail = FALSE  )
chisq_p_value
## [1] 0.5432264
  • We get the exact same p-value from the package and by hand above -
chisq_p_value
## [1] 0.5432264
  • Let's also ensure that we can replicate the test statistic -
skedastic_package_white
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      7.91   0.543         9 White's Test greater
# TEST STATISTIC (replicated by hand)
white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)
## [1] 7.910143

We can!

# critical value at the 5% level (right-tailed test, so take the upper tail)
qchisq(p = .05, df = 9, lower.tail = FALSE)

# test statistic (compare it with the critical value above)
white_auxillary_reg_summary$r.squared * nobs(white_auxillary_reg)

2.4 Interactions = FALSE

For a stronger understanding, see if you can replicate the White test command from the skedastic package without the interaction terms. We can (again)!

If we do not add the cross terms, the test results are -

# Conduct White test for heteroscedasticity
skedastic_package_white2  <- white(mainlm =  lm.mod, 
                                  interactions = FALSE
                                  )         
skedastic_package_white2
## # A tibble: 1 × 5
##   statistic p.value parameter method       alternative
##       <dbl>   <dbl>     <dbl> <chr>        <chr>      
## 1      5.76   0.450         6 White's Test greater

We can still replicate the test statistic and p-value by hand.

# auxillary regression is shorter
white_auxillary_reg2 <- lm(formula = squared_residuals ~ crim     +   indus      +  black  + 
                                                     I(crim^2)  + I(indus^2)   + I(black^2) ,
                             data = subset_Boston
                   ) 

white_auxillary_reg2_summary <- summary(white_auxillary_reg2)
white_auxillary_reg2_summary
## 
## Call:
## lm(formula = squared_residuals ~ crim + indus + black + I(crim^2) + 
##     I(indus^2) + I(black^2), data = subset_Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -90.57 -54.78 -41.35  -8.34 945.37 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  7.4170384 44.1704183   0.168    0.867
## crim         1.7655481  1.9974721   0.884    0.377
## indus       -5.4383533  4.5146936  -1.205    0.229
## black        0.5607429  0.4066360   1.379    0.169
## I(crim^2)   -0.0221722  0.0299210  -0.741    0.459
## I(indus^2)   0.2059793  0.1740662   1.183    0.237
## I(black^2)  -0.0009352  0.0008927  -1.048    0.295
## 
## Residual standard error: 148.5 on 499 degrees of freedom
## Multiple R-squared:  0.01139,    Adjusted R-squared:  -0.0004941 
## F-statistic: 0.9584 on 6 and 499 DF,  p-value: 0.4529
# test statistic printed above
white_auxillary_reg2_summary$r.squared * nobs(white_auxillary_reg2)
## [1] 5.764818
chisq_p_value2 <-  pchisq(q = white_auxillary_reg2_summary$r.squared * nobs(white_auxillary_reg2), 
                         df = 6,       # equal to the number of regressors in the auxiliary regression
                         lower.tail = FALSE  )

# p value printed above
chisq_p_value2
## [1] 0.4500443

3 Extra: BP test

3.1 Breusch-Pagan test for heteroskedasticity

Same conclusion, which is usually the case.

The Breusch-Pagan test fits a linear regression model to the residuals of a linear regression model (by default the same explanatory variables are taken as in the main regression model) and rejects if too much of the variance is explained by the additional explanatory variables.

Under \(H_0\) the test statistic of the Breusch-Pagan test follows a chi-squared distribution with degrees of freedom equal to the number of regressors (excluding the constant) in the model.

# Perform the Breusch-Pagan test for heteroskedasticity
??bptest

lmtest_package_bp <- bptest(formula = lm.mod,  # fitted "lm" object, or run the regression
                         studentize = TRUE   # by default, so this can be commented out
)
                      
# View the test results
lmtest_package_bp
## 
##  studentized Breusch-Pagan test
## 
## data:  lm.mod
## BP = 2.8281, df = 3, p-value = 0.4189
skedastic_package_bp <- skedastic::breusch_pagan(lm.mod)

lmtest_package_bp
## 
##  studentized Breusch-Pagan test
## 
## data:  lm.mod
## BP = 2.8281, df = 3, p-value = 0.4189
skedastic_package_bp
## # A tibble: 1 × 5
##   statistic p.value parameter method                alternative
##       <dbl>   <dbl>     <dbl> <chr>                 <chr>      
## 1      2.83   0.419         3 Koenker (studentised) greater

lmtest::bptest gives the exact same output as skedastic::breusch_pagan, as expected.

The test statistic has a value of 2.83, follows a chi-squared distribution with degrees of freedom equal to the number of regressors (excluding the constant) in the model, and has a p-value of approximately \(42 \%\) - thus we fail to reject the null of homoskedasticity!

The logic of the BP test can be omitted, as it can get a bit more involved.
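
Still, for the curious, here is a rough by-hand sketch. It rests on my reading (an assumption, not part of the assignment) that the studentized (Koenker) variant reported by both packages reduces to n times the R-squared from regressing the squared residuals on the original regressors only:

# Koenker/studentized BP: auxiliary regression of squared residuals on the original regressors
bp_aux <- lm(squared_residuals ~ crim + indus + black, data = subset_Boston)

nobs(bp_aux) * summary(bp_aux)$r.squared  # should be close to BP = 2.83 above, if that reading is right
pchisq(q = nobs(bp_aux) * summary(bp_aux)$r.squared,
       df = 3, lower.tail = FALSE)        # should be close to the p-value of 0.419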

3.2 Which is better?

Both Breusch-Pagan (BP) test and White test are used to test for heteroskedasticity in a linear regression model. However, there is no clear consensus on which test is better as the choice depends on various factors such as the sample size, the distribution of the errors, and the specific objectives of the analysis.

  • In general, the White test is considered more robust than the BP test when the sample size is large or when the errors are not normally distributed.

  • However, the BP test may perform better when the errors are normally distributed and the sample size is small.

Ultimately, the choice of test should be based on the specific characteristics of the data and the research question being investigated.