For this example, we will continue using the Hald data:

Hald [1952] presented a dataset concerning the heat evolved, in calories per gram of cement (\(Y\)), as a function of the amounts of four ingredients in the mixture (\(x_1\) through \(x_4\)):

hald<-read.csv("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH376LMT/main/data/hald.csv", 
               header=TRUE)

Multicollinearity

Multicollinearity occurs when two or more predictors in a multiple linear regression model are correlated and provide redundant information about the response variable.

Multicollinearity inflates the standard errors of the estimated coefficients, which can distort inference (t-tests and confidence intervals) for individual coefficients in a linear regression model.
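
To see why this matters, here is a minimal simulation sketch (separate from the Hald data; the variable names are made up for illustration) in which two nearly identical predictors are put in the same model and the coefficient standard errors inflate:

# Minimal simulation sketch (not part of the Hald example)
set.seed(376)
n<-100
a<-rnorm(n)
b<-a+rnorm(n, sd=0.05)             # b is nearly a copy of a
resp<-2+a+b+rnorm(n)

# Compare the coefficient standard errors with one vs. both collinear predictors
summary(lm(resp~a))$coefficients
summary(lm(resp~a+b))$coefficients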

Example:

Fit simple linear regression models with \(x_2\) and \(x_4\) separately, and then a multiple regression with both together.

First, look at the individual t-tests.

# SLR with x2
mod1<-lm(y~x2, data=hald)
summary(mod1)
## 
## Call:
## lm(formula = y ~ x2, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.752  -6.008  -1.684   3.794  21.387 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  57.4237     8.4906   6.763  3.1e-05 ***
## x2            0.7891     0.1684   4.686 0.000665 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.077 on 11 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6359 
## F-statistic: 21.96 on 1 and 11 DF,  p-value: 0.0006648
# SLR with x4
mod2<-lm(y~x4, data=hald)
summary(mod2)
## 
## Call:
## lm(formula = y ~ x4, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.589  -8.228   1.495   4.726  17.524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 117.5679     5.2622  22.342 1.62e-10 ***
## x4           -0.7382     0.1546  -4.775 0.000576 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.645 
## F-statistic:  22.8 on 1 and 11 DF,  p-value: 0.0005762
# MLR using x2 and x4
mod3<-lm(y~x2+x4, data=hald)
summary(mod3)
## 
## Call:
## lm(formula = y ~ x2 + x4, data = hald)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.193  -7.260   0.652   4.104  19.008 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  94.1601    56.6271   1.663    0.127
## x2            0.3109     0.7486   0.415    0.687
## x4           -0.4569     0.6960  -0.657    0.526
## 
## Residual standard error: 9.321 on 10 degrees of freedom
## Multiple R-squared:  0.6801, Adjusted R-squared:  0.6161 
## F-statistic: 10.63 on 2 and 10 DF,  p-value: 0.003352
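
As a quick sketch, comparing confidence intervals across the three fits makes the loss of precision concrete: each slope is individually significant, but once both predictors are in the model neither interval excludes zero.

# Confidence intervals for the slopes in each model
confint(mod1)["x2", ]
confint(mod2)["x4", ]
confint(mod3)[c("x2", "x4"), ]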

What did you observe in the individual t-tests for the regression coefficients?

Now compute the global F-test by hand from the sums of squares.

# Store the ANOVA table (avoid the name "anova", which would shadow the function)
aovTab<-anova(mod3)
# Sum of Squares
SS_Tot<-sum(aovTab$`Sum Sq`)
SS_Res<-sum(mod3$residuals^2)
SS_Reg<-SS_Tot-SS_Res

# Degrees of freedom
n<-nrow(hald)
p<-2
df_Res<-n-p-1
df_Reg<-p

MS_Reg<-SS_Reg/df_Reg
MS_Res<-SS_Res/df_Res

# F-test Statistic
F_stat<-MS_Reg/MS_Res

# P-value
pf(F_stat, df1=df_Reg, df2=df_Res, lower.tail = FALSE)
## [1] 0.003352277
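
As a quick cross-check, summary() reports the same F statistic and degrees of freedom, so the p-value can be recomputed directly from its output:

# F statistic, numerator df, and denominator df as reported by summary()
fStat<-summary(mod3)$fstatistic
fStat
pf(fStat["value"], df1=fStat["numdf"], df2=fStat["dendf"], lower.tail=FALSE)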

How do these results compare?

Diagnostic Tools

We will explore multiple diagnostic tools to assess the presence of multicollinearity.

1. Pairs plot

pairs(hald)

2. Correlation Matrix

cor(hald)
##             y         x1         x2         x3         x4
## y   1.0000000  0.7307175  0.8162526 -0.5346707 -0.8213050
## x1  0.7307175  1.0000000  0.2285795 -0.8241338 -0.2454451
## x2  0.8162526  0.2285795  1.0000000 -0.1392424 -0.9729550
## x3 -0.5346707 -0.8241338 -0.1392424  1.0000000  0.0295370
## x4 -0.8213050 -0.2454451 -0.9729550  0.0295370  1.0000000
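
With more predictors it can help to flag strongly correlated pairs programmatically; here is a small sketch (the 0.9 cutoff is an arbitrary choice for illustration):

# Flag predictor pairs with |correlation| above 0.9
predCor<-cor(hald[, c("x1", "x2", "x3", "x4")])
highPairs<-which(abs(predCor)>0.9 & upper.tri(predCor), arr.ind=TRUE)
data.frame(var1=rownames(predCor)[highPairs[, 1]],
           var2=colnames(predCor)[highPairs[, 2]],
           r=predCor[highPairs])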

3. Variance Inflation Factor

A way to quantify the impact of multicollinearity is the variance inflation factor (VIF). For predictor \(x_j\), the VIF is based on the \(R^2\) from regressing \(x_j\) on all of the other predictors. A VIF close to 1 implies that \(x_j\) is essentially uncorrelated with the other predictors; the larger the VIF, the more severe the multicollinearity (a common rule of thumb flags VIFs above 10).

\[VIF(\hat{\beta}_j)=\frac{1}{1-R^2_{X_j|X_{-j}}}\]

Example:

Try this for \(x_1\)

mod_x1<-lm(x1~x2+x3+x4, data=hald)
anova(mod_x1)
## Analysis of Variance Table
## 
## Response: x1
##           Df  Sum Sq Mean Sq F value    Pr(>F)    
## x2         1  21.695  21.695  18.102  0.002128 ** 
## x3         1 265.814 265.814 221.794 1.201e-07 ***
## x4         1 116.935 116.935  97.570 3.964e-06 ***
## Residuals  9  10.786   1.198                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Find the R-squared for each explanatory 
SS_Res_x1<-sum(mod_x1$residuals^2)
SS_Tot_x1<-sum(anova(mod_x1)$`Sum Sq`)
R2_x1<-1-(SS_Res_x1/SS_Tot_x1)
R2_x1
## [1] 0.9740234
# Find the VIF
VIF_x1<-1/(1-R2_x1)
VIF_x1
## [1] 38.49621

We can do this using the `vif()` function from the car package.

library(car)
## Loading required package: carData
vif(lm(y~x1+x2+x3+x4, data=hald))
##        x1        x2        x3        x4 
##  38.49621 254.42317  46.86839 282.51286
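
The same numbers can be reproduced from the definition by regressing each predictor on the remaining ones; this short sketch loops that calculation over all four predictors and should agree with car::vif() above:

# VIF for every predictor from first principles
preds<-c("x1", "x2", "x3", "x4")
sapply(preds, function(xj){
  r2<-summary(lm(reformulate(setdiff(preds, xj), response=xj), data=hald))$r.squared
  1/(1-r2)
})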

4. Eigenvalues

Near-linear dependencies among the columns of the design matrix show up as eigenvalues of \(X^TX\) that are close to zero. The condition number, the ratio of the largest eigenvalue to the smallest, summarizes this: values above about 1000 are typically taken to indicate severe multicollinearity.

# Design matrix
xMat<-matrix(data=c(rep(1, nrow(hald)), 
                    hald$x1, 
                    hald$x2, 
                    hald$x3, 
                    hald$x4), 
             nrow=nrow(hald))

# Eigenvalues
ev<-eigen(t(xMat)%*%xMat)
ev
## eigen() decomposition
## $values
## [1] 4.467621e+04 5.965422e+03 8.099521e+02 1.054187e+02 1.218022e-03
## 
## $vectors
##             [,1]         [,2]          [,3]        [,4]        [,5]
## [1,] -0.01699462  0.003722752  4.299921e-05 -0.01104161  0.99978768
## [2,] -0.12788573 -0.042775672 -6.459039e-01 -0.75134403 -0.01028458
## [3,] -0.83967514 -0.509220987 -1.812061e-02  0.18763027 -0.01030393
## [4,] -0.19841920  0.072108971  7.557151e-01 -0.61985010 -0.01051937
## [5,] -0.48880661  0.856534101 -1.066512e-01  0.12625753 -0.01009922
# Condition number: ratio of the largest eigenvalue to the smallest
lambda_max<-max(ev$values)
lambda_min<-min(ev$values)
kappa<-lambda_max/lambda_min
kappa
## [1] 36679307
# Condition indices: largest eigenvalue divided by each eigenvalue
kappa_j<-lambda_max/ev$values
# How many condition indices exceed 1000 (indicating severe multicollinearity)?
sum(kappa_j>1000)
## [1] 1
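
As a sanity check (a small sketch using the singular value decomposition), the squared ratio of the largest to the smallest singular value of the design matrix gives the same condition number for \(X^TX\):

# Condition number via the singular values of X
sv<-svd(xMat)$d
(max(sv)/min(sv))^2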