Using the “Credit” data, estimate the following three simple regressions:
\[ Balance_i=\beta_0+\beta_1 age_i+u_i \] \[ Balance_i=\beta_0+\beta_1 limit_i+u_i \] \[ Balance_i=\beta_0+\beta_1 rating_i+u_i \]
and analyze the significance of each coefficient.
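library(ISLR)
attach(Credit)        # make the columns of Credit available by name
par(mfrow=c(2, 2))    # 2 x 2 grid for the three scatter plots
plot(Age,Balance)
plot(Limit,Balance)
plot(Rating,Balance)
# One simple regression per predictor
mod1<-lm(Balance~Age)
mod2<-lm(Balance~Limit)
mod3<-lm(Balance~Rating)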
summary(mod1)
##
## Call:
## lm(formula = Balance ~ Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -521.40 -451.50 -59.94 343.47 1476.91
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 517.29222 77.85153 6.645 1e-10 ***
## Age 0.04891 1.33599 0.037 0.971
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 460.3 on 398 degrees of freedom
## Multiple R-squared: 3.368e-06, Adjusted R-squared: -0.002509
## F-statistic: 0.00134 on 1 and 398 DF, p-value: 0.9708
summary(mod2)
##
## Call:
## lm(formula = Balance ~ Limit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -676.95 -141.87 -11.55 134.11 776.44
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.928e+02 2.668e+01 -10.97 <2e-16 ***
## Limit 1.716e-01 5.066e-03 33.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 233.6 on 398 degrees of freedom
## Multiple R-squared: 0.7425, Adjusted R-squared: 0.7419
## F-statistic: 1148 on 1 and 398 DF, p-value: < 2.2e-16
summary(mod3)
##
## Call:
## lm(formula = Balance ~ Rating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -712.28 -135.32 -9.58 125.67 829.04
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -390.84634 29.06851 -13.45 <2e-16 ***
## Rating 2.56624 0.07509 34.18 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 232.1 on 398 degrees of freedom
## Multiple R-squared: 0.7458, Adjusted R-squared: 0.7452
## F-statistic: 1168 on 1 and 398 DF, p-value: < 2.2e-16
According to purely statistical criteria, we can see that:
- Age is not statistically significant (p-value 0.971), and the scatter plot points in the same direction.
- Limit and Rating are both statistically significant (p-values below 2e-16), and the scatter plots tell the same story.
Now, try to estimate a joint model:
\[ Balance_i=\beta_0+\beta_1 rating_i+ \beta_2 limit_i + \beta_3 age_i+u_i \]
library(ISLR)
attach(Credit)
## The following objects are masked from Credit (pos = 3):
##
## Age, Balance, Cards, Education, Ethnicity, Gender, ID, Income,
## Limit, Married, Rating, Student
mod1<-lm(Balance~Age+Limit+Rating)
summary(mod1)
##
## Call:
## lm(formula = Balance ~ Age + Limit + Rating)
##
## Residuals:
## Min 1Q Median 3Q Max
## -729.67 -135.82 -8.58 127.29 827.65
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -259.51752 55.88219 -4.644 4.66e-06 ***
## Age -2.34575 0.66861 -3.508 0.000503 ***
## Limit 0.01901 0.06296 0.302 0.762830
## Rating 2.31046 0.93953 2.459 0.014352 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 229.1 on 396 degrees of freedom
## Multiple R-squared: 0.7536, Adjusted R-squared: 0.7517
## F-statistic: 403.7 on 3 and 396 DF, p-value: < 2.2e-16
What happens?
- Things are quite different: the sign of the Age coefficient has changed (from 0.049 to -2.346) and, surprisingly, it is now significant (p-value 0.0005).
- Limit, on the other hand, has lost its significance (p-value 0.763), although it was highly significant in the simple regression.
Collinearity refers to the situation in which two or more predictor variables are closely related to one another. If you analyze the predictors, limit and age appear to have no obvious relationship. In contrast, the predictors limit and rating are very highly correlated with each other, and we say that they are collinear. The presence of collinearity can pose problems in the regression context, since it can be difficult to separate out the individual effects of collinear variables on the response. In other words, since limit and rating tend to increase or decrease together, it can be difficult to determine how each one separately is associated with the response, balance.
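The pairwise correlations among the predictors make this relationship visible. The following is a minimal sketch, assuming the ISLR package is installed; the object names `predictors` and `cor_matrix` are illustrative and do not appear in the original code.

```r
library(ISLR)

# Keep only the three predictors used in the joint model
predictors <- Credit[, c("Age", "Limit", "Rating")]

# Pairwise Pearson correlations, rounded for readability
cor_matrix <- round(cor(predictors), 3)
print(cor_matrix)

# Scatter-plot matrix: strongly correlated pairs line up along a narrow band
pairs(predictors)
```

An off-diagonal entry close to 1 or -1 flags a pair of predictors that move almost together.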
One way to detect collinearity is to compare the individual t-tests with the joint F-test.
The joint test (also called the F-test) has the following null hypothesis:
\[ H_0 : \beta_1 = \beta_2 = \beta_3 = 0 \]
that is, all the slopes are jointly equal to zero (the predictors are jointly insignificant), versus:
\[ H_1 : \text{at least one of } \beta_1,\beta_2,\beta_3 \neq 0 \]
In this case the F-statistic (403.7, p-value < 2.2e-16) leads us to reject the null hypothesis, so at least one slope differs from zero. The individual t-tests, by contrast, fail to reject the null for Limit (p-value 0.763).
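The two kinds of test can be put side by side by pulling them out of the model summary. This is a sketch under the assumption that the ISLR package is installed; the names `fit`, `f_pvalue` and `t_pvalues` are illustrative.

```r
library(ISLR)

fit <- lm(Balance ~ Age + Limit + Rating, data = Credit)
s <- summary(fit)

# Joint F-test: statistic plus numerator and denominator degrees of freedom
f <- s$fstatistic
f_pvalue <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

# Individual t-tests: one p-value per slope (intercept dropped)
t_pvalues <- coef(s)[-1, "Pr(>|t|)"]

print(unname(f_pvalue))
print(round(t_pvalues, 4))
```

A very small joint p-value next to a large individual p-value for one of the correlated predictors is the pattern described above.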
This is a consequence of collinearity. Because Limit and Rating carry almost the same information, the estimation method struggles to separate their individual coefficients, and the estimates become less precise: the standard error of Limit rises from 0.005 in the simple regression to 0.063 in the joint model, and that of Rating from 0.075 to 0.940.
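A common way to quantify this loss of precision is the variance inflation factor (VIF), which is not part of the original analysis and is added here only as an illustration. For predictor j it equals 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors. The sketch below uses base R only and assumes the ISLR package is installed; the helper `vif_manual` is a hypothetical name.

```r
library(ISLR)

# Hypothetical helper: VIF of one predictor via its auxiliary regression
vif_manual <- function(target, others, data) {
  aux <- lm(reformulate(others, response = target), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vars <- c("Age", "Limit", "Rating")
vifs <- sapply(vars, function(v) vif_manual(v, setdiff(vars, v), Credit))
print(round(vifs, 2))
```

A VIF near 1 indicates little collinearity; a frequently quoted rule of thumb treats values above 5 or 10 as a sign of trouble.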
There is no clear-cut solution to this problem. Here we adopt one criterion: since we are interested in prediction, we’ll use cross-validation to choose among the candidate models.
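To make the idea concrete, k-fold cross-validation can be written by hand in a few lines of base R. This is a sketch, not the caret implementation: the seed is arbitrary, the helper `cv_rmse` is a hypothetical name, and the ISLR package is assumed to be installed.

```r
library(ISLR)

set.seed(1)                                 # arbitrary seed, for reproducibility only
k <- 20
n <- nrow(Credit)
fold <- sample(rep(1:k, length.out = n))    # random fold label for each row

# Hypothetical helper: cross-validated RMSE of one model formula
cv_rmse <- function(formula) {
  rmse <- numeric(k)
  for (i in 1:k) {
    train <- Credit[fold != i, ]            # fit on all folds except fold i
    test  <- Credit[fold == i, ]            # evaluate on the held-out fold
    fit   <- lm(formula, data = train)
    pred  <- predict(fit, newdata = test)
    rmse[i] <- sqrt(mean((test$Balance - pred)^2))
  }
  mean(rmse)                                # average of the per-fold RMSEs
}

rmse_age  <- cv_rmse(Balance ~ Age)
rmse_full <- cv_rmse(Balance ~ Age + Limit + Rating)
print(c(age = rmse_age, full = rmse_full))
```

Each observation is predicted by a model that never saw it, so the resulting RMSE estimates out-of-sample error rather than in-sample fit.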
library(ISLR)
attach(Credit)
## The following objects are masked from Credit (pos = 3):
##
## Age, Balance, Cards, Education, Ethnicity, Gender, ID, Income,
## Limit, Married, Rating, Student
## The following objects are masked from Credit (pos = 4):
##
## Age, Balance, Cards, Education, Ethnicity, Gender, ID, Income,
## Limit, Married, Rating, Student
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(lattice)
# 20-fold cross-validation; p only applies to method="LGOCV" and is ignored here
train_control<- trainControl(method="cv", number=20,p=0.75, savePredictions = TRUE)
model1_cv<- train(Balance~Age, data=Credit, trControl=train_control, method = "lm" )
model1_cv
## Linear Regression
##
## 400 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 381, 380, 380, 380, 379, 380, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 458.5586 0.0477179 391.008
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model2_cv<- train(Balance~Rating, data=Credit, trControl=train_control, method = "lm" )
model2_cv
## Linear Regression
##
## 400 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 229.5178 0.7639076 175.6195
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model3_cv<- train(Balance~Limit, data=Credit, trControl=train_control, method = "lm" )
model3_cv
## Linear Regression
##
## 400 samples
## 1 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 229.2543 0.7636566 177.8597
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
model4_cv<- train(Balance~Age+Rating+Limit, data=Credit, trControl=train_control, method = "lm" )
model4_cv
## Linear Regression
##
## 400 samples
## 3 predictor
##
## No pre-processing
## Resampling: Cross-Validated (20 fold)
## Summary of sample sizes: 380, 380, 380, 380, 380, 380, ...
## Resampling results:
##
## RMSE Rsquared MAE
## 223.2596 0.7660667 174.3357
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
Errors<-data.frame(model1_cv$results$RMSE,model2_cv$results$RMSE,model3_cv$results$RMSE,model4_cv$results$RMSE)
Errors
## model1_cv.results.RMSE model2_cv.results.RMSE model3_cv.results.RMSE
## 1 458.5586 229.5178 229.2543
## model4_cv.results.RMSE
## 1 223.2596
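In the runs above no seed is set, so each call to `train()` draws its own folds (the sample sizes reported for `model1_cv` differ from those of the other models). One possible refinement, sketched here under the assumption that the ISLR and caret packages are installed, is to fix the folds once and collect the results in a labelled table. The seed and the names `folds`, `ctrl`, `formulas`, `fits` and `cv_table` are illustrative choices.

```r
library(ISLR)
library(caret)

set.seed(1)  # arbitrary seed, for reproducibility only
# Fix the 20 training index sets once so every model is scored on the same folds
folds <- createFolds(Credit$Balance, k = 20, returnTrain = TRUE)
ctrl  <- trainControl(method = "cv", number = 20, index = folds)

formulas <- list(
  Age    = Balance ~ Age,
  Rating = Balance ~ Rating,
  Limit  = Balance ~ Limit,
  All    = Balance ~ Age + Rating + Limit
)

# Fit each candidate model with the shared resampling scheme
fits <- lapply(formulas, function(f) {
  train(f, data = Credit, method = "lm", trControl = ctrl)
})

# One row per model, sorted from lowest to highest cross-validated RMSE
cv_table <- data.frame(
  model = names(fits),
  RMSE  = sapply(fits, function(m) m$results$RMSE)
)
print(cv_table[order(cv_table$RMSE), ], row.names = FALSE)
```

With shared folds, differences in RMSE between models reflect the models themselves rather than the luck of the split.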