In the past, we have looked at the Boston dataset to explore polynomial regression. We are going to use this dataset for Part I. You will need to load it into this lab session from library MASS.
library(MASS)
library(ISLR)
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
The model we created in Week 6 looked at median value as a function of lower status and proportion of older homes. When we fit a model with just the additive effects of lstat and age, these are known as the main effects.
modInt0<-lm(medv~lstat+age, data=Boston)
summary(modInt0)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
However, there could be a relationship between lstat and age. We will now explore the their interaction.
A colon can be use to create a model with only that specified interaction An astrix can be used to create a model with all levels of interactions and main effects
First, create a model with only the lstat and age interaction.
modInt1<-lm(medv~lstat:age, data=Boston)
summary(modInt1)
##
## Call:
## lm(formula = medv ~ lstat:age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.347 -4.372 -1.534 1.914 27.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.1588631 0.4828240 62.46 <2e-16 ***
## lstat:age -0.0077146 0.0003799 -20.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.827 on 504 degrees of freedom
## Multiple R-squared: 0.4501, Adjusted R-squared: 0.449
## F-statistic: 412.4 on 1 and 504 DF, p-value: < 2.2e-16
Now create a model with all possible interactions between lstat and age.
modInt2<-lm(medv~lstat*age, data=Boston)
summary(modInt2)
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
Lets change datasets again! We’ll be using the Credit data set in the ISLR package. This is another simulated dataset about 10000 credit card customers. For this exercise we’re only interested in four variables: balance, age, limit, and rating.
data(Credit)
names(Credit)
## [1] "ID" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "Education" "Gender" "Student" "Married"
## [11] "Ethnicity" "Balance"
creditTrim<-data.frame(balance=Credit$Balance,
limit=Credit$Limit,
rating=Credit$Rating,
age=Credit$Age)
Multicollinearity occurs when there is are linear relationships between explanatory variables. So lets looks at a pairs plot and also the correlation matrix.
pairs(creditTrim)
cor(creditTrim)
## balance limit rating age
## balance 1.000000000 0.8616973 0.8636252 0.001835119
## limit 0.861697267 1.0000000 0.9968797 0.100887922
## rating 0.863625161 0.9968797 1.0000000 0.103164996
## age 0.001835119 0.1008879 0.1031650 1.000000000
This causes our variance to be overestimated thereby decreasing the respective test statistic and causing larger p-values.
Consider the two models :
credMod1<-lm(balance~age+limit, creditTrim)
summary(credMod1)
##
## Call:
## lm(formula = balance ~ age + limit, data = creditTrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -696.84 -150.78 -13.01 126.68 755.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.734e+02 4.383e+01 -3.957 9.01e-05 ***
## age -2.291e+00 6.725e-01 -3.407 0.000723 ***
## limit 1.734e-01 5.026e-03 34.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 230.5 on 397 degrees of freedom
## Multiple R-squared: 0.7498, Adjusted R-squared: 0.7486
## F-statistic: 595 on 2 and 397 DF, p-value: < 2.2e-16
credMod2<-lm(balance~rating+limit, creditTrim)
summary(credMod2)
##
## Call:
## lm(formula = balance ~ rating + limit, data = creditTrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -707.8 -135.9 -9.5 124.0 817.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -377.53680 45.25418 -8.343 1.21e-15 ***
## rating 2.20167 0.95229 2.312 0.0213 *
## limit 0.02451 0.06383 0.384 0.7012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 232.3 on 397 degrees of freedom
## Multiple R-squared: 0.7459, Adjusted R-squared: 0.7447
## F-statistic: 582.8 on 2 and 397 DF, p-value: < 2.2e-16
A way to quantify the impact of multicollinearity is to look at the VIF (variance inflation factor). A VIF close to 1 would imply no correlation. The larger the VIF the larger the amount of multicollinearity.
\[VIF(\hat{\beta}_j)=\frac{1}{1-R^2_{X_i|X_{-j}}}\]
library(DAAG)
vif(lm(balance~age+rating+limit, creditTrim))
## age rating limit
## 1.0114 160.6700 160.5900