Week 7: Class 21/03/2023


Exercise on multicollinearity

Using the data “Credit”

  1. Analyze graphically the relationship between balance and age, balance and limit, and balance and rating.
  2. Run three simple regressions:

\[ Balance_i=\beta_0+\beta_1 age_i+u_i \]
\[ Balance_i=\beta_0+\beta_1 limit_i+u_i \]
\[ Balance_i=\beta_0+\beta_1 rating_i+u_i \]

and analyze the significance of each coefficient.

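library(ISLR)    # provides the Credit data set
attach(Credit)   # lets us refer to the columns (Age, Balance, ...) directly

# Scatter plots of Balance against each candidate predictor
par(mfrow=c(2, 2))
plot(Age,Balance)
plot(Limit,Balance)
plot(Rating,Balance)

# One simple regression per predictor
mod1<-lm(Balance~Age)
mod2<-lm(Balance~Limit)
mod3<-lm(Balance~Rating)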

summary(mod1)
## 
## Call:
## lm(formula = Balance ~ Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -521.40 -451.50  -59.94  343.47 1476.91 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 517.29222   77.85153   6.645    1e-10 ***
## Age           0.04891    1.33599   0.037    0.971    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 460.3 on 398 degrees of freedom
## Multiple R-squared:  3.368e-06,  Adjusted R-squared:  -0.002509 
## F-statistic: 0.00134 on 1 and 398 DF,  p-value: 0.9708
summary(mod2)
## 
## Call:
## lm(formula = Balance ~ Limit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -676.95 -141.87  -11.55  134.11  776.44 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.928e+02  2.668e+01  -10.97   <2e-16 ***
## Limit        1.716e-01  5.066e-03   33.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 233.6 on 398 degrees of freedom
## Multiple R-squared:  0.7425, Adjusted R-squared:  0.7419 
## F-statistic:  1148 on 1 and 398 DF,  p-value: < 2.2e-16
summary(mod3)
## 
## Call:
## lm(formula = Balance ~ Rating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -712.28 -135.32   -9.58  125.67  829.04 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -390.84634   29.06851  -13.45   <2e-16 ***
## Rating         2.56624    0.07509   34.18   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 232.1 on 398 degrees of freedom
## Multiple R-squared:  0.7458, Adjusted R-squared:  0.7452 
## F-statistic:  1168 on 1 and 398 DF,  p-value: < 2.2e-16

According to purely statistical criteria, we can see that:

  • Age is not statistically significant (p-value 0.971), and the scatter plot points in the same direction.
  • Limit and Rating are both statistically significant (p-values below 2e-16), and the scatter plots agree.

Now, try to estimate a joint model:

\[ Balance_i=\beta_0+\beta_1 rating_i+ \beta_2 limit_i + \beta_3 age_i+u_i \]

# Joint model with the three predictors (Credit is already attached above)
mod4<-lm(Balance~Age+Limit+Rating)

summary(mod4)
## 
## Call:
## lm(formula = Balance ~ Age + Limit + Rating)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -729.67 -135.82   -8.58  127.29  827.65 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -259.51752   55.88219  -4.644 4.66e-06 ***
## Age           -2.34575    0.66861  -3.508 0.000503 ***
## Limit          0.01901    0.06296   0.302 0.762830    
## Rating         2.31046    0.93953   2.459 0.014352 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 229.1 on 396 degrees of freedom
## Multiple R-squared:  0.7536, Adjusted R-squared:  0.7517 
## F-statistic: 403.7 on 3 and 396 DF,  p-value: < 2.2e-16

What happens?

  • Things are quite different: the sign of the Age coefficient has changed (from 0.049 to -2.35) and, surprisingly, it is now significant.
  • Limit, in contrast, has lost its significance with respect to the simple regression (its p-value goes from below 2e-16 to 0.763).
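
The change described in the two points above is easier to see with the estimates side by side. The following is a minimal sketch, assuming the ISLR package is installed; the names `fits`, `term_row` and `comparison` are illustrative and do not appear in the original notes.

```r
library(ISLR)  # assumed installed; provides the Credit data set

# Fit the three simple models and the joint model, passing the data explicitly
fits <- list(
  age    = lm(Balance ~ Age, data = Credit),
  limit  = lm(Balance ~ Limit, data = Credit),
  rating = lm(Balance ~ Rating, data = Credit),
  joint  = lm(Balance ~ Age + Limit + Rating, data = Credit)
)

# Illustrative helper: estimate, standard error and p-value of one term
term_row <- function(fit, term) {
  coef(summary(fit))[term, c("Estimate", "Std. Error", "Pr(>|t|)")]
}

# One row per (predictor, model) pair
comparison <- rbind(
  Age_simple    = term_row(fits$age, "Age"),
  Age_joint     = term_row(fits$joint, "Age"),
  Limit_simple  = term_row(fits$limit, "Limit"),
  Limit_joint   = term_row(fits$joint, "Limit"),
  Rating_simple = term_row(fits$rating, "Rating"),
  Rating_joint  = term_row(fits$joint, "Rating")
)
print(signif(comparison, 3))
```

Passing `data = Credit` instead of relying on `attach()` keeps the sketch independent of the state of the session.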

Multicollinearity

Collinearity refers to the situation in which two or more predictor variables are closely related to one another. If you analyze the predictors, limit and age appear to have no obvious relationship. In contrast, the predictors limit and rating are very highly correlated with each other, and we say that they are collinear. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. In other words, since limit and rating tend to increase or decrease together, it can be difficult to determine how each one separately is associated with the response, balance.
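
To check which predictors move together, one option is to look at their pairwise correlations and at the variance inflation factor (VIF) of each predictor. The VIF is not used in the notes above; it is added here as a complementary diagnostic, computed as 1 / (1 - R²) from the regression of one predictor on the others. This is a minimal sketch, assuming the ISLR package is installed; `vif_one` is an illustrative helper, not a library function.

```r
library(ISLR)  # assumed installed; provides the Credit data set

predictors <- Credit[, c("Age", "Limit", "Rating")]

# Pairwise correlations between the predictors
print(round(cor(predictors), 3))

# VIF of one predictor: 1 / (1 - R^2), where R^2 comes from regressing
# that predictor on all the other predictors
vif_one <- function(name, data) {
  others <- setdiff(names(data), name)
  aux <- lm(reformulate(others, response = name), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(predictors), vif_one, data = predictors)
print(round(vifs, 2))
```

A common rule of thumb treats a VIF above 5 or 10 as a sign of problematic collinearity; a value close to 1 means the predictor is almost unrelated to the others.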

One way to detect collinearity is to compare the individual t-tests with the joint F-test.

The F test

The joint test (also called the F-test) has the following null hypothesis:

\[ H_0 : \beta_1 = \beta_2 = \beta_3 = 0 \]

that is, all the slopes are jointly equal to zero (jointly insignificant), versus the alternative:

\[ H_1 : \text{at least one of } \beta_1,\beta_2,\beta_3 \neq 0 \]

In this case, the F-test rejects the null hypothesis (F = 403.7, p-value < 2.2e-16), so at least one slope is different from zero. The individual t-tests, however, fail to reject the null for Limit (p-value 0.763), even though Limit was highly significant on its own.
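
The joint test can also be computed explicitly by comparing the full model with an intercept-only model. This is a minimal sketch, assuming the ISLR package is installed; the object names `full`, `null` and `joint_test` are illustrative.

```r
library(ISLR)  # assumed installed; provides the Credit data set

full <- lm(Balance ~ Age + Limit + Rating, data = Credit)
null <- lm(Balance ~ 1, data = Credit)   # intercept only: all slopes are zero

# Joint F-test of H0: all slopes equal zero
joint_test <- anova(null, full)
print(joint_test)

# Individual t-tests, one per coefficient
print(round(coef(summary(full)), 4))

# The F statistic from anova() matches the one reported by summary()
f_stat <- joint_test$F[2]
print(c(anova = f_stat, summary = unname(summary(full)$fstatistic["value"])))
```

Seeing a very large F statistic next to an individual t-test that does not reject is the pattern that points to collinearity.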

So WHAT?

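This is a consequence of collinearity. Because Limit and Rating carry almost the same information, the estimation method struggles to separate their individual coefficients, and the estimates are less precise: the standard errors are inflated. Here the standard error of Limit goes from 0.0051 in the simple regression to 0.063 in the joint model, and that of Rating from 0.075 to 0.94.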
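
The inflation of the standard errors can be reproduced with simulated data. The sketch below does not use the Credit data: it generates two predictors with a chosen correlation and reports the standard error of the first coefficient. The coefficients, sample size and seed are arbitrary assumptions made only for illustration.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Standard error of the x1 coefficient when x2 is correlated with x1
se_of_x1 <- function(rho, n = 400) {
  x1 <- rnorm(n)
  # x2 is built so that its correlation with x1 is close to rho
  x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
  y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
  coef(summary(lm(y ~ x1 + x2)))["x1", "Std. Error"]
}

rhos <- c(0, 0.5, 0.9, 0.99, 0.999)
ses  <- sapply(rhos, se_of_x1)
print(data.frame(correlation = rhos, se_x1 = round(ses, 3)))
```

The noise level and the sample size are the same in every row, so the growth of the standard error comes only from the correlation between the two predictors.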

There is no clear-cut solution to this problem. In our case we will use a practical criterion: since we are interested in prediction, we will use cross-validation to choose among the different models.
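
Before turning to a package, it may help to see what k-fold cross-validation does step by step. The following is a minimal base-R sketch, assuming the ISLR package is installed; `cv_rmse` is an illustrative helper, the seed is arbitrary, and averaging the per-fold RMSEs is one reasonable way to summarise the folds.

```r
library(ISLR)  # assumed installed; provides the Credit data set

# k-fold cross-validated RMSE of a linear model (illustrative helper)
cv_rmse <- function(formula, data, k = 20) {
  # Randomly assign each row to one of k folds of (almost) equal size
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  rmse <- numeric(k)
  for (i in seq_len(k)) {
    train <- data[fold != i, ]   # fit on the other k - 1 folds
    test  <- data[fold == i, ]   # evaluate on the held-out fold
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    rmse[i] <- sqrt(mean((test[[response]] - pred)^2))
  }
  mean(rmse)   # average of the per-fold RMSEs
}

set.seed(123)  # arbitrary seed so the folds are reproducible
rmse_age   <- cv_rmse(Balance ~ Age, Credit)
rmse_joint <- cv_rmse(Balance ~ Age + Limit + Rating, Credit)
print(c(age = rmse_age, joint = rmse_joint))
```

Each observation is predicted by a model that never saw it, so the resulting RMSE measures out-of-sample prediction error rather than in-sample fit.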

library(ISLR)
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice

# 20-fold cross-validation (the p argument only applies to method = "LGOCV", so it is omitted)
train_control<- trainControl(method="cv", number=20, savePredictions = TRUE)

model1_cv<- train(Balance~Age, data=Credit, trControl=train_control, method = "lm" )  
model1_cv
## Linear Regression 
## 
## 400 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 381, 380, 380, 380, 379, 380, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   458.5586  0.0477179  391.008
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
model2_cv<- train(Balance~Rating, data=Credit, trControl=train_control, method = "lm" )  
model2_cv
## Linear Regression 
## 
## 400 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   229.5178  0.7639076  175.6195
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
model3_cv<- train(Balance~Limit, data=Credit, trControl=train_control, method = "lm" )  
model3_cv
## Linear Regression 
## 
## 400 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   229.2543  0.7636566  177.8597
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
model4_cv<- train(Balance~Age+Rating+Limit, data=Credit, trControl=train_control, method = "lm" )  
model4_cv
## Linear Regression 
## 
## 400 samples
##   3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (20 fold) 
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   223.2596  0.7660667  174.3357
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
Errors<-data.frame(model1_cv$results$RMSE,model2_cv$results$RMSE,model3_cv$results$RMSE,model4_cv$results$RMSE)

Errors
##   model1_cv.results.RMSE model2_cv.results.RMSE model3_cv.results.RMSE
## 1               458.5586               229.5178               229.2543
##   model4_cv.results.RMSE
## 1               223.2596
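
To read the comparison, the model with the lowest cross-validated RMSE can be picked programmatically. The sketch below copies the four RMSE values from the output above into a named vector; the labels are illustrative. Since no seed is set before the `train()` calls, the folds differ between models, so small differences between them should be read with some caution.

```r
# Cross-validated RMSE values copied from the caret output above
rmse <- c("Age" = 458.5586,
          "Rating" = 229.5178,
          "Limit" = 229.2543,
          "Age + Rating + Limit" = 223.2596)

# Model with the lowest cross-validated RMSE
best <- names(which.min(rmse))
print(best)

# Relative improvement of the joint model over the best single-predictor model
gain <- unname(1 - rmse["Age + Rating + Limit"] / min(rmse[c("Rating", "Limit")]))
print(round(100 * gain, 1))
```

Despite the collinearity between Limit and Rating, the joint model predicts slightly better than any single-predictor model, which is the criterion chosen here.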