Part I: Numeric Interactions

In the past, we have looked at the Boston dataset to explore polynomial regression. We are going to use this dataset for Part I. You will need to load it into this lab session from library MASS.

library(MASS)
library(ISLR)

names(Boston)
##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"    
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"
head(Boston)
##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12
##   lstat medv
## 1  4.98 24.0
## 2  9.14 21.6
## 3  4.03 34.7
## 4  2.94 33.4
## 5  5.33 36.2
## 6  5.21 28.7

The model we created in Week 6 looked at median value as a function of lower status and proportion of older homes. When we fit a model with just the additive effects of lstat and age, these are known as the main effects.

modInt0<-lm(medv~lstat+age, data=Boston)
summary(modInt0)
## 
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.981  -3.978  -1.283   1.968  23.158 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.22276    0.73085  45.458  < 2e-16 ***
## lstat       -1.03207    0.04819 -21.416  < 2e-16 ***
## age          0.03454    0.01223   2.826  0.00491 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared:  0.5513, Adjusted R-squared:  0.5495 
## F-statistic:   309 on 2 and 503 DF,  p-value: < 2.2e-16

However, there could be a relationship between lstat and age. We will now explore the their interaction.

A colon can be use to create a model with only that specified interaction An astrix can be used to create a model with all levels of interactions and main effects

Exercise 1: Interaction

First, create a model with only the lstat and age interaction.

modInt1<-lm(medv~lstat:age, data=Boston)
summary(modInt1)
## 
## Call:
## lm(formula = medv ~ lstat:age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -13.347  -4.372  -1.534   1.914  27.193 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.1588631  0.4828240   62.46   <2e-16 ***
## lstat:age   -0.0077146  0.0003799  -20.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.827 on 504 degrees of freedom
## Multiple R-squared:  0.4501, Adjusted R-squared:  0.449 
## F-statistic: 412.4 on 1 and 504 DF,  p-value: < 2.2e-16

Exercise 2: Full Model

Now create a model with all possible interactions between lstat and age.

modInt2<-lm(medv~lstat*age, data=Boston)
summary(modInt2)
## 
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.806  -4.045  -1.333   2.085  27.552 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 36.0885359  1.4698355  24.553  < 2e-16 ***
## lstat       -1.3921168  0.1674555  -8.313 8.78e-16 ***
## age         -0.0007209  0.0198792  -0.036   0.9711    
## lstat:age    0.0041560  0.0018518   2.244   0.0252 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared:  0.5557, Adjusted R-squared:  0.5531 
## F-statistic: 209.3 on 3 and 502 DF,  p-value: < 2.2e-16

Part II: Multicollinearity

Lets change datasets again! We’ll be using the Credit data set in the ISLR package. This is another simulated dataset about 10000 credit card customers. For this exercise we’re only interested in four variables: balance, age, limit, and rating.

data(Credit)
names(Credit)
##  [1] "ID"        "Income"    "Limit"     "Rating"    "Cards"    
##  [6] "Age"       "Education" "Gender"    "Student"   "Married"  
## [11] "Ethnicity" "Balance"
creditTrim<-data.frame(balance=Credit$Balance, 
                       limit=Credit$Limit, 
                       rating=Credit$Rating, 
                       age=Credit$Age)

Multicollinearity occurs when there is are linear relationships between explanatory variables. So lets looks at a pairs plot and also the correlation matrix.

pairs(creditTrim)

cor(creditTrim)
##             balance     limit    rating         age
## balance 1.000000000 0.8616973 0.8636252 0.001835119
## limit   0.861697267 1.0000000 0.9968797 0.100887922
## rating  0.863625161 0.9968797 1.0000000 0.103164996
## age     0.001835119 0.1008879 0.1031650 1.000000000

This causes our variance to be overestimated thereby decreasing the respective test statistic and causing larger p-values.

Consider the two models :

credMod1<-lm(balance~age+limit, creditTrim)
summary(credMod1)
## 
## Call:
## lm(formula = balance ~ age + limit, data = creditTrim)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -696.84 -150.78  -13.01  126.68  755.56 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.734e+02  4.383e+01  -3.957 9.01e-05 ***
## age         -2.291e+00  6.725e-01  -3.407 0.000723 ***
## limit        1.734e-01  5.026e-03  34.496  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 230.5 on 397 degrees of freedom
## Multiple R-squared:  0.7498, Adjusted R-squared:  0.7486 
## F-statistic:   595 on 2 and 397 DF,  p-value: < 2.2e-16
credMod2<-lm(balance~rating+limit, creditTrim)
summary(credMod2)
## 
## Call:
## lm(formula = balance ~ rating + limit, data = creditTrim)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -707.8 -135.9   -9.5  124.0  817.6 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -377.53680   45.25418  -8.343 1.21e-15 ***
## rating         2.20167    0.95229   2.312   0.0213 *  
## limit          0.02451    0.06383   0.384   0.7012    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 232.3 on 397 degrees of freedom
## Multiple R-squared:  0.7459, Adjusted R-squared:  0.7447 
## F-statistic: 582.8 on 2 and 397 DF,  p-value: < 2.2e-16

A way to quantify the impact of multicollinearity is to look at the VIF (variance inflation factor). A VIF close to 1 would imply no correlation. The larger the VIF the larger the amount of multicollinearity.

\[VIF(\hat{\beta}_j)=\frac{1}{1-R^2_{X_i|X_{-j}}}\]

library(DAAG)

vif(lm(balance~age+rating+limit, creditTrim))
##      age   rating    limit 
##   1.0114 160.6700 160.5900