PART 1: More Housing in Boston and Extending the Linear Model

In lab last week we looked at the Boston dataset. You will need to load it into this lab session from the MASS library.

library(MASS)
library(ISLR)
#fix(Boston)   # opens an interactive data viewer; leave commented when knitting

names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7
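
Before fitting anything, it can help to confirm the size of the data: Boston has 506 rows and 14 columns.

dim(Boston)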

The model we created in Week 6 looked at median value (medv) as a function of lower status (lstat) and the proportion of older homes (age). When we fit a model with just the additive effects of lstat and age, those terms are known as the main effects.

modInt0<-lm(medv~lstat+age, data=Boston)
summary(modInt0)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

However, there could be a relationship between lstat and age. We will now explore their interaction.

Exercise 1: Interaction

First, create a model with only the lstat and age interaction.

modInt1<-lm(medv~lstat:age, data=Boston)
summary(modInt1)
## 
## Call:
## lm(formula = medv ~ lstat:age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.347  -4.372  -1.534   1.914  27.193 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.1588631  0.4828240   62.46   <2e-16 ***
## lstat:age   -0.0077146  0.0003799  -20.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.827 on 504 degrees of freedom
## Multiple R-squared:  0.4501, Adjusted R-squared:  0.449 
## F-statistic: 412.4 on 1 and 504 DF,  p-value: < 2.2e-16

Exercise 2: Full Model

Now create a model with all possible interactions between lstat and age.

modInt2<-lm(medv~lstat*age, data=Boston)
summary(modInt2)
## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16
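
The * operator is shorthand for the main effects plus the interaction, so modInt2 could have been written out in full. A minimal check (all.equal() should return TRUE):

# lstat*age expands to lstat + age + lstat:age, so these fits are identical
modInt2b<-lm(medv~lstat+age+lstat:age, data=Boston)
all.equal(coef(modInt2), coef(modInt2b))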

Exercise 3: Non-linear Transformations

Let's start by looking at a simple linear regression model of median value as a function of lower status.

# Simple model 
mod1<-lm(medv~lstat, data=Boston)
summary(mod1)
## 
## Call:
## lm(formula = medv ~ lstat, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.168  -3.990  -1.318   2.034  24.500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.55384    0.56263   61.41   <2e-16 ***
## lstat       -0.95005    0.03873  -24.53   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared:  0.5441, Adjusted R-squared:  0.5432 
## F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
# Adjusted R-squared:  0.5432 

Let's check the diagnostics (a scatter plot and a residual plot). What can we say about the direction, shape, and strength of the relationship?

library(tidyverse)

ggplot(data=Boston, aes(x=lstat, y=medv))+
  geom_point()

# Res plot 1
plot(mod1$fitted.values,mod1$residuals, pch=16)
abline(h=0, col="red")
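
To put a number on the strength of the linear association, we can also compute the correlation:

# Correlation between lower status and median value
cor(Boston$lstat, Boston$medv)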

Exercise 4: Quadratic

Since there is clearly some curvature, let's add a quadratic term to our model (now we're dealing with MLR!).

# Try a model with a quadratic term 
mod2<-lm(medv~lstat+I(lstat^2), data=Boston)
summary(mod2)
## 
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.2834  -3.8313  -0.5295   2.3095  25.4148 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 42.862007   0.872084   49.15   <2e-16 ***
## lstat       -2.332821   0.123803  -18.84   <2e-16 ***
## I(lstat^2)   0.043547   0.003745   11.63   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared:  0.6407, Adjusted R-squared:  0.6393 
## F-statistic: 448.5 on 2 and 503 DF,  p-value: < 2.2e-16
# Adjusted R-squared:  0.6393 

Let’s check out the residual plot!

# Res plot 2
plot(mod2$fitted.values,mod2$residuals, pch=16)
abline(h=0, col="red")

# Try this witchcraft: plot() on an lm object gives four built-in diagnostic plots
plot(mod2)
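
To see the quadratic fit in context, we can overlay the fitted curve on the scatterplot (a sketch using geom_smooth() with the same formula as mod2):

# Overlay the fitted quadratic curve on the scatterplot
ggplot(data=Boston, aes(x=lstat, y=medv))+
  geom_point()+
  geom_smooth(method="lm", formula=y~x+I(x^2), se=FALSE, color="red")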

We can also perform a hypothesis test to assess whether it's worth adding the quadratic term.

# anova compare
anova(mod1, mod2)
## Analysis of Variance Table
## 
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
##   Res.Df   RSS Df Sum of Sq     F    Pr(>F)    
## 1    504 19472                                 
## 2    503 15347  1    4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Exercise 5: Polynomial Regression

In general, if we want all polynomial terms up to a given degree, we can use the poly() function.

# Try a model with a poly 5
mod5<-lm(medv~poly(lstat,5), data=Boston)
summary(mod5)
## 
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5433  -3.1039  -0.7052   2.0844  27.1153 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       22.5328     0.2318  97.197  < 2e-16 ***
## poly(lstat, 5)1 -152.4595     5.2148 -29.236  < 2e-16 ***
## poly(lstat, 5)2   64.2272     5.2148  12.316  < 2e-16 ***
## poly(lstat, 5)3  -27.0511     5.2148  -5.187 3.10e-07 ***
## poly(lstat, 5)4   25.4517     5.2148   4.881 1.42e-06 ***
## poly(lstat, 5)5  -19.2524     5.2148  -3.692 0.000247 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared:  0.6817, Adjusted R-squared:  0.6785 
## F-statistic: 214.2 on 5 and 500 DF,  p-value: < 2.2e-16
# Adjusted R-squared:  0.6785

# Res plot 5
plot(mod5$fitted.values,mod5$residuals, pch=16)
abline(h=0, col="red")
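
Note that poly() uses orthogonal polynomials by default, so the coefficients above are not the raw coefficients of lstat, lstat^2, and so on. Adding raw=TRUE switches to the raw-power parameterization; the coefficients change but the fitted values do not (a quick check):

# Raw-power parameterization: different coefficients, identical fitted values
mod5raw<-lm(medv~poly(lstat,5,raw=TRUE), data=Boston)
all.equal(fitted(mod5), fitted(mod5raw))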

PART 2: Sales of Child Car Seats

Exercise 6: Qualitative Predictors

Let's change datasets! Now we'll be using the Carseats data set from the ISLR package. It's a simulated dataset about sales of child car seats at 400 different stores.

data(Carseats)
names(Carseats)
##  [1] "Sales"       "CompPrice"   "Income"      "Advertising" "Population" 
##  [6] "Price"       "ShelveLoc"   "Age"         "Education"   "Urban"      
## [11] "US"
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Observe that a few of the variables are "Factors", which means they are categorical. The str() output also shows how many levels each one has; levels(), shown after the list, retrieves them directly.

  • ShelveLoc : Bad, Good, Medium
  • Urban : No, Yes
  • US : No, Yes
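
For any factor, levels() retrieves the categories directly:

# levels() lists the categories of each factor
levels(Carseats$ShelveLoc)
levels(Carseats$Urban)
levels(Carseats$US)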

First, let's fit a simple model of sales as a function of shelf location:

carMod1<-lm(Sales~ShelveLoc, data=Carseats)
summary(carMod1)
## 
## Call:
## lm(formula = Sales ~ ShelveLoc, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3066 -1.6282 -0.0416  1.5666  6.1471 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       5.5229     0.2388  23.131  < 2e-16 ***
## ShelveLocGood     4.6911     0.3484  13.464  < 2e-16 ***
## ShelveLocMedium   1.7837     0.2864   6.229  1.2e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.339 on 397 degrees of freedom
## Multiple R-squared:  0.3172, Adjusted R-squared:  0.3138 
## F-statistic: 92.23 on 2 and 397 DF,  p-value: < 2.2e-16
# Visualize the model
ggplot(Carseats, aes(y=Sales, x=ShelveLoc, fill=ShelveLoc))+
  geom_boxplot()

# ANOVA output
anova(carMod1)
## Analysis of Variance Table
## 
## Response: Sales
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## ShelveLoc   2 1009.5  504.77   92.23 < 2.2e-16 ***
## Residuals 397 2172.7    5.47                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
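
Because carMod1 has a single categorical predictor, its fitted values are just the group means: the intercept is the mean for Bad, and adding each dummy coefficient recovers the Good and Medium means. We can verify with tapply():

# Group means: Bad = 5.5229, Good = 5.5229+4.6911, Medium = 5.5229+1.7837
tapply(Carseats$Sales, Carseats$ShelveLoc, mean)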

Second, let's include Price in the model:

carMod2<-lm(Sales~ShelveLoc+Price, data=Carseats)
summary(carMod2)
## 
## Call:
## lm(formula = Sales ~ ShelveLoc + Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8229 -1.3930 -0.0179  1.3868  5.0780 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     12.001802   0.503447  23.839  < 2e-16 ***
## ShelveLocGood    4.895848   0.285921  17.123  < 2e-16 ***
## ShelveLocMedium  1.862022   0.234748   7.932 2.23e-14 ***
## Price           -0.056698   0.004059 -13.967  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.917 on 396 degrees of freedom
## Multiple R-squared:  0.5426, Adjusted R-squared:  0.5391 
## F-statistic: 156.6 on 3 and 396 DF,  p-value: < 2.2e-16
# BAD
# intercept = 12.001802
# slope = -0.056698

# GOOD
# intercept = 12.001802+4.895848
# slope = -0.056698

# MEDIUM
# intercept = 12.001802+1.862022
# slope = -0.056698

# VIZ: Shifts in intercept
ggplot(Carseats, aes(x=Price, y=Sales, color=ShelveLoc))+
  geom_point()+
  geom_abline(intercept = carMod2$coefficients[1], slope=carMod2$coefficients[4],
              color="red", lwd=1)+
  geom_abline(intercept = carMod2$coefficients[1]+carMod2$coefficients[2], slope=carMod2$coefficients[4],
              color="forestgreen", lwd=1)+
  geom_abline(intercept = carMod2$coefficients[1]+carMod2$coefficients[3], slope=carMod2$coefficients[4],
              color="blue", lwd=1)

# ANOVA
anova(carMod2)
## Analysis of Variance Table
## 
## Response: Sales
##            Df Sum Sq Mean Sq F value    Pr(>F)    
## ShelveLoc   2 1009.5  504.77  137.32 < 2.2e-16 ***
## Price       1  717.1  717.10  195.08 < 2.2e-16 ***
## Residuals 396 1455.6    3.68                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
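
To see the parallel-lines model in action, we can predict sales at a fixed price for each shelf location (a sketch; the price of 120 is arbitrary):

# Predicted sales at Price = 120 for each shelf location
newdat<-data.frame(Price=120, ShelveLoc=c("Bad","Good","Medium"))
predict(carMod2, newdata=newdat)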

Third, let's include an interaction between ShelveLoc and Price in the model:

carMod3<-lm(Sales~ShelveLoc*Price, data=Carseats)
summary(carMod3)
## 
## Call:
## lm(formula = Sales ~ ShelveLoc * Price, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9037 -1.3461 -0.0595  1.3679  4.9037 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           11.832984   0.965788  12.252  < 2e-16 ***
## ShelveLocGood          6.135880   1.392844   4.405 1.36e-05 ***
## ShelveLocMedium        1.630481   1.171616   1.392    0.165    
## Price                 -0.055220   0.008276  -6.672 8.57e-11 ***
## ShelveLocGood:Price   -0.010564   0.011742  -0.900    0.369    
## ShelveLocMedium:Price  0.001984   0.010007   0.198    0.843    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.918 on 394 degrees of freedom
## Multiple R-squared:  0.5444, Adjusted R-squared:  0.5386 
## F-statistic: 94.17 on 5 and 394 DF,  p-value: < 2.2e-16
# BAD
# intercept = 11.832984
# slope = -0.055220

# GOOD
# intercept = 11.832984+6.135880
# slope = -0.055220-0.010564

# MEDIUM
# intercept = 11.832984+1.630481
# slope = -0.055220+0.001984

# VIZ: Shift intercepts and slopes
ggplot(Carseats, aes(x=Price, y=Sales, color=ShelveLoc))+
  geom_point()+
  geom_abline(intercept = carMod3$coefficients[1], slope=carMod3$coefficients[4],
              color="red", lwd=1)+
  geom_abline(intercept = carMod3$coefficients[1]+carMod3$coefficients[2], slope=carMod3$coefficients[4]+carMod3$coefficients[5],
              color="forestgreen", lwd=1)+
  geom_abline(intercept = carMod3$coefficients[1]+carMod3$coefficients[3], slope=carMod3$coefficients[4]+carMod3$coefficients[6],
              color="blue", lwd=1)

# ANOVA
anova(carMod3)
## Analysis of Variance Table
## 
## Response: Sales
##                  Df  Sum Sq Mean Sq  F value Pr(>F)    
## ShelveLoc         2 1009.53  504.77 137.1807 <2e-16 ***
## Price             1  717.10  717.10 194.8878 <2e-16 ***
## ShelveLoc:Price   2    5.89    2.95   0.8005 0.4498    
## Residuals       394 1449.75    3.68                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
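
The partial F-test comparing the two models gives the same conclusion as the ShelveLoc:Price line above (F = 0.8005, p = 0.4498): the interaction does not significantly improve the fit.

# Compare the no-interaction and interaction models directly
anova(carMod2, carMod3)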

Finally, let's fit a model with all the variables and a couple of interactions (Income:Advertising and Price:Age):

carMod<-lm(Sales~.+Income:Advertising+Price:Age, data=Carseats)
summary(carMod)
## 
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9208 -0.7503  0.0177  0.6754  3.3413 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.5755654  1.0087470   6.519 2.22e-10 ***
## CompPrice           0.0929371  0.0041183  22.567  < 2e-16 ***
## Income              0.0108940  0.0026044   4.183 3.57e-05 ***
## Advertising         0.0702462  0.0226091   3.107 0.002030 ** 
## Population          0.0001592  0.0003679   0.433 0.665330    
## Price              -0.1008064  0.0074399 -13.549  < 2e-16 ***
## ShelveLocGood       4.8486762  0.1528378  31.724  < 2e-16 ***
## ShelveLocMedium     1.9532620  0.1257682  15.531  < 2e-16 ***
## Age                -0.0579466  0.0159506  -3.633 0.000318 ***
## Education          -0.0208525  0.0196131  -1.063 0.288361    
## UrbanYes            0.1401597  0.1124019   1.247 0.213171    
## USYes              -0.1575571  0.1489234  -1.058 0.290729    
## Income:Advertising  0.0007510  0.0002784   2.698 0.007290 ** 
## Price:Age           0.0001068  0.0001333   0.801 0.423812    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared:  0.8761, Adjusted R-squared:  0.8719 
## F-statistic:   210 on 13 and 386 DF,  p-value: < 2.2e-16

Last class we talked about the dummy variables created when we have a categorical variable with more than two levels. We can use the contrasts() function to see how this coding is done.

contrasts(Carseats$ShelveLoc)
##        Good Medium
## Bad       0      0
## Good      1      0
## Medium    0      1
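
The reference (baseline) level is the one with no dummy variable, here Bad. As a sketch, relevel() changes which level serves as the baseline:

# Make Medium the reference level instead of Bad
contrasts(relevel(Carseats$ShelveLoc, ref="Medium"))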

Let's also check out the ANOVA output:

anova(carMod)
## Analysis of Variance Table
## 
## Response: Sales
##                     Df  Sum Sq Mean Sq   F value    Pr(>F)    
## CompPrice            1   13.07   13.07   12.7939 0.0003919 ***
## Income               1   79.07   79.07   77.4228 < 2.2e-16 ***
## Advertising          1  219.35  219.35  214.7726 < 2.2e-16 ***
## Population           1    0.38    0.38    0.3744 0.5409648    
## Price                1 1198.87 1198.87 1173.8422 < 2.2e-16 ***
## ShelveLoc            2 1047.47  523.74  512.8052 < 2.2e-16 ***
## Age                  1  217.39  217.39  212.8503 < 2.2e-16 ***
## Education            1    1.05    1.05    1.0284 0.3111694    
## Urban                1    1.22    1.22    1.1948 0.2750539    
## US                   1    1.57    1.57    1.5344 0.2162066    
## Income:Advertising   1    7.95    7.95    7.7838 0.0055330 ** 
## Price:Age            1    0.65    0.65    0.6411 0.4238116    
## Residuals          386  394.23    1.02                        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

PART 3: Credit Card Balance Data

Exercise 7: Multicollinearity

Let's change datasets again! We'll be using the Credit data set from the ISLR package. This is another simulated dataset, with records for 400 credit card customers. For this exercise we're only interested in four variables: balance, age, limit, and rating.

data(Credit)
names(Credit)
##  [1] "ID"        "Income"    "Limit"     "Rating"    "Cards"    
##  [6] "Age"       "Education" "Gender"    "Student"   "Married"  
## [11] "Ethnicity" "Balance"
creditTrim<-data.frame(balance=Credit$Balance, 
                       limit=Credit$Limit, 
                       rating=Credit$Rating, 
                       age=Credit$Age)

Multicollinearity occurs when there are linear relationships among the explanatory variables. So let's look at a pairs plot and the correlation matrix.

pairs(creditTrim)

cor(creditTrim)
##             balance     limit    rating         age
## balance 1.000000000 0.8616973 0.8636252 0.001835119
## limit   0.861697267 1.0000000 0.9968797 0.100887922
## rating  0.863625161 0.9968797 1.0000000 0.103164996
## age     0.001835119 0.1008879 0.1031650 1.000000000

This inflates the variance (and hence the standard error) of the affected coefficient estimates, which shrinks the corresponding test statistics and produces larger p-values. Notice below how limit is highly significant when paired with age, but loses significance when paired with rating, its nearly collinear partner.

Consider the two models:

credMod1<-lm(balance~age+limit, creditTrim)
summary(credMod1)
## 
## Call:
## lm(formula = balance ~ age + limit, data = creditTrim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -696.84 -150.78  -13.01  126.68  755.56 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.734e+02  4.383e+01  -3.957 9.01e-05 ***
## age         -2.291e+00  6.725e-01  -3.407 0.000723 ***
## limit        1.734e-01  5.026e-03  34.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 230.5 on 397 degrees of freedom
## Multiple R-squared:  0.7498, Adjusted R-squared:  0.7486 
## F-statistic:   595 on 2 and 397 DF,  p-value: < 2.2e-16
credMod2<-lm(balance~rating+limit, creditTrim)
summary(credMod2)
## 
## Call:
## lm(formula = balance ~ rating + limit, data = creditTrim)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -707.8 -135.9   -9.5  124.0  817.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -377.53680   45.25418  -8.343 1.21e-15 ***
## rating         2.20167    0.95229   2.312   0.0213 *  
## limit          0.02451    0.06383   0.384   0.7012    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 232.3 on 397 degrees of freedom
## Multiple R-squared:  0.7459, Adjusted R-squared:  0.7447 
## F-statistic: 582.8 on 2 and 397 DF,  p-value: < 2.2e-16

A way to quantify the impact of multicollinearity is the VIF (variance inflation factor). A VIF close to 1 implies essentially no correlation with the other predictors; the larger the VIF, the greater the multicollinearity. A common rule of thumb is that a VIF above 5 or 10 indicates a problematic amount of collinearity.

\(VIF(\hat{\beta}_j)=\frac{1}{1-R_{X_j|X_{-j}}^2}\).

#install.packages("car")
library(car)

vif(lm(balance~age+rating+limit, creditTrim))
##        age     rating      limit 
##   1.011385 160.668301 160.592880
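
The VIFs confirm what the correlation matrix suggested: age is essentially uncorrelated with the other predictors, while limit and rating are nearly collinear. As a sanity check of the formula, we can compute the VIF for limit by hand; this should reproduce the value of about 160.6 above.

# VIF for limit by hand: regress it on the other predictors,
# then plug its R-squared into 1/(1 - R^2)
r2.limit<-summary(lm(limit~age+rating, data=creditTrim))$r.squared
1/(1-r2.limit)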