For this example, we will continue using the Hald data:
Hald [1952] presented a dataset concerning the heat evolved, in calories per gram of cement (\(Y\)), as a function of the amounts of each of four ingredients in the mixture:
hald<-read.csv("https://raw.githubusercontent.com/kitadasmalley/sp21_MATH376LMT/main/data/hald.csv",
header=TRUE)
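A quick look at the data (the response y and predictors x1 through x4 appear in all of the models and output below):
# Inspect the first few rows
head(hald)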
Multicollinearity occurs when two or more predictors in a multiple linear regression model are highly correlated and therefore provide redundant information about the response variable.
Multicollinearity affects inference for the coefficients in linear regression models: it inflates the standard errors of the estimated coefficients, which can make individually useful predictors appear insignificant.
Let's fit models with \(x_2\) and \(x_4\) separately and then together.
# SLR with x2
mod1<-lm(y~x2, data=hald)
summary(mod1)
##
## Call:
## lm(formula = y ~ x2, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.752 -6.008 -1.684 3.794 21.387
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 57.4237 8.4906 6.763 3.1e-05 ***
## x2 0.7891 0.1684 4.686 0.000665 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.077 on 11 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6359
## F-statistic: 21.96 on 1 and 11 DF, p-value: 0.0006648
# SLR with x4
mod2<-lm(y~x4, data=hald)
summary(mod2)
##
## Call:
## lm(formula = y ~ x4, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.589 -8.228 1.495 4.726 17.524
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 117.5679 5.2622 22.342 1.62e-10 ***
## x4 -0.7382 0.1546 -4.775 0.000576 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared: 0.6745, Adjusted R-squared: 0.645
## F-statistic: 22.8 on 1 and 11 DF, p-value: 0.0005762
# MLR using x2 and x4
mod3<-lm(y~x2+x4, data=hald)
summary(mod3)
##
## Call:
## lm(formula = y ~ x2 + x4, data = hald)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.193 -7.260 0.652 4.104 19.008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 94.1601 56.6271 1.663 0.127
## x2 0.3109 0.7486 0.415 0.687
## x4 -0.4569 0.6960 -0.657 0.526
##
## Residual standard error: 9.321 on 10 degrees of freedom
## Multiple R-squared: 0.6801, Adjusted R-squared: 0.6161
## F-statistic: 10.63 on 2 and 10 DF, p-value: 0.003352
What did you observe in the individual t-tests for the regression coefficients?
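Individually, \(x_2\) and \(x_4\) are each highly significant, but with both in the model neither is, and their standard errors grow by a factor of more than four. A quick side-by-side comparison using the fitted models above:
# Compare coefficient standard errors across the three models
coef(summary(mod1))["x2", "Std. Error"]
coef(summary(mod2))["x4", "Std. Error"]
coef(summary(mod3))[c("x2", "x4"), "Std. Error"]
Even so, the overall F-test computed below shows that the model with both predictors is still useful.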
aov_tab<-anova(mod3)
# Sum of Squares
SS_Tot<-sum(aov_tab$`Sum Sq`)
SS_Res<-sum(mod3$residuals^2)
SS_Reg<-SS_Tot-SS_Res
# Degrees of freedom
n<-nrow(hald)
p<-2
df_Res<-n-p-1
df_Reg<-p
MS_Reg<-SS_Reg/df_Reg
MS_Res<-SS_Res/df_Res
# Overall F-test statistic
F_stat<-MS_Reg/MS_Res
# P-value
pf(F_stat, df1=df_Reg, df2=df_Res, lower.tail = FALSE)
## [1] 0.003352277
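As a check, the same F statistic and its degrees of freedom are stored in the fitted model's summary:
# Overall F statistic as reported by summary(mod3)
summary(mod3)$fstatistic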
How do these results compare? The overall F-test is highly significant even though neither individual t-test is, a classic symptom of multicollinearity.
We will explore multiple diagnostic tools to assess the presence of multicollinearity.
pairs(hald)
cor(hald)
## y x1 x2 x3 x4
## y 1.0000000 0.7307175 0.8162526 -0.5346707 -0.8213050
## x1 0.7307175 1.0000000 0.2285795 -0.8241338 -0.2454451
## x2 0.8162526 0.2285795 1.0000000 -0.1392424 -0.9729550
## x3 -0.5346707 -0.8241338 -0.1392424 1.0000000 0.0295370
## x4 -0.8213050 -0.2454451 -0.9729550 0.0295370 1.0000000
Note in particular that \(x_2\) and \(x_4\) are almost perfectly negatively correlated (\(r=-0.973\)), which explains the behavior above. A way to quantify the impact of multicollinearity is the VIF (variance inflation factor). A VIF close to 1 implies that a predictor is uncorrelated with the others; the larger the VIF, the more severe the multicollinearity.
\[VIF(\hat{\beta}_j)=\frac{1}{1-R^2_{X_j|X_{-j}}}\]
where \(R^2_{X_j|X_{-j}}\) is the \(R^2\) from regressing \(X_j\) on all of the other predictors.
Try this for \(x_1\):
mod_x1<-lm(x1~x2+x3+x4, data=hald)
anova(mod_x1)
## Analysis of Variance Table
##
## Response: x1
## Df Sum Sq Mean Sq F value Pr(>F)
## x2 1 21.695 21.695 18.102 0.002128 **
## x3 1 265.814 265.814 221.794 1.201e-07 ***
## x4 1 116.935 116.935 97.570 3.964e-06 ***
## Residuals 9 10.786 1.198
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Find the R-squared for this auxiliary regression
SS_Res_x1<-sum(mod_x1$residuals^2)
SS_Tot_x1<-sum(anova(mod_x1)$`Sum Sq`)
R2_x1<-1-(SS_Res_x1/SS_Tot_x1)
R2_x1
## [1] 0.9740234
# Find the VIF
VIF_x1<-1/(1-R2_x1)
VIF_x1
## [1] 38.49621
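The same calculation can be repeated for every predictor. Here is a small sketch of a loop using only base R (reformulate builds the formula "x_j ~ all other predictors"):
# Compute the VIF for each predictor by regressing it on the others
preds<-c("x1", "x2", "x3", "x4")
sapply(preds, function(j){
  f<-reformulate(setdiff(preds, j), response=j)
  1/(1-summary(lm(f, data=hald))$r.squared)
})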
The vif function in the car package computes these directly; a common rule of thumb is that a VIF above 10 indicates serious multicollinearity.
library(car)
## Loading required package: carData
vif(lm(y~x1+x2+x3+x4, data=hald))
## x1 x2 x3 x4
## 38.49621 254.42317 46.86839 282.51286
# Design matrix
xMat<-matrix(data=c(rep(1, nrow(hald)),
hald$x1,
hald$x2,
hald$x3,
hald$x4),
nrow=nrow(hald))
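Equivalently, base R's model.matrix builds the same design matrix, intercept column included:
# Same design matrix via model.matrix (column names differ, entries match)
xMat2<-model.matrix(y~x1+x2+x3+x4, data=hald)
max(abs(xMat2-xMat))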
# Eigenvalues
ev<-eigen(t(xMat)%*%xMat)
ev
## eigen() decomposition
## $values
## [1] 4.467621e+04 5.965422e+03 8.099521e+02 1.054187e+02 1.218022e-03
##
## $vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.01699462 0.003722752 4.299921e-05 -0.01104161 0.99978768
## [2,] -0.12788573 -0.042775672 -6.459039e-01 -0.75134403 -0.01028458
## [3,] -0.83967514 -0.509220987 -1.812061e-02 0.18763027 -0.01030393
## [4,] -0.19841920 0.072108971 7.557151e-01 -0.61985010 -0.01051937
## [5,] -0.48880661 0.856534101 -1.066512e-01 0.12625753 -0.01009922
# Condition number
lambda_max<-max(ev$values)
lambda_min<-min(ev$values)
kappa<-lambda_max/lambda_min
kappa
## [1] 36679307
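By the convention used below, a condition number above 1000 signals severe multicollinearity, so this value is extreme. As a cross-check, base R's kappa() computes the condition number of \(X\) itself (the ratio of its singular values), which is the square root of the eigenvalue ratio of \(X^TX\):
# Condition number of X; equals sqrt(lambda_max/lambda_min) of X'X
sqrt(kappa)
base::kappa(xMat, exact=TRUE)  # base:: avoids clashing with the numeric kappa above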
# Condition indices
kappa_j<-lambda_max/ev$values
# How many condition indices exceed 1000?
sum(kappa_j>1000)
## [1] 1
One condition index exceeds 1000, again indicating severe multicollinearity among these predictors.