In lab last week we looked at the Boston dataset. You will need to load it into this lab session from the MASS library.
library(MASS)
library(ISLR)
# fix(Boston) # opens the data in a spreadsheet-style viewer
names(Boston)
## [1] "crim" "zn" "indus" "chas" "nox" "rm" "age"
## [8] "dis" "rad" "tax" "ptratio" "black" "lstat" "medv"
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12
## lstat medv
## 1 4.98 24.0
## 2 9.14 21.6
## 3 4.03 34.7
## 4 2.94 33.4
## 5 5.33 36.2
## 6 5.21 28.7
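If you don't remember what each column measures, the dataset's help page describes every variable:
# Variable descriptions for the Boston data
?Boston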
The model we created in Week 6 looked at median value (medv) as a function of lower status (lstat) and the proportion of older homes (age). When we fit a model with just the additive effects of lstat and age, those terms are known as the main effects.
modInt0<-lm(medv~lstat+age, data=Boston)
summary(modInt0)
##
## Call:
## lm(formula = medv ~ lstat + age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.981 -3.978 -1.283 1.968 23.158
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 33.22276 0.73085 45.458 < 2e-16 ***
## lstat -1.03207 0.04819 -21.416 < 2e-16 ***
## age 0.03454 0.01223 2.826 0.00491 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.173 on 503 degrees of freedom
## Multiple R-squared: 0.5513, Adjusted R-squared: 0.5495
## F-statistic: 309 on 2 and 503 DF, p-value: < 2.2e-16
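Once a model is fit we can use it for prediction. As a quick sketch (the values lstat = 10 and age = 50 are arbitrary, chosen just for illustration):
# Predicted medv for a tract with lstat = 10 and age = 50,
# with a 95% confidence interval for the mean response
predict(modInt0, newdata=data.frame(lstat=10, age=50),
        interval="confidence")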
However, there could be a relationship between lstat and age. We will now explore their interaction.
First, create a model with only the lstat:age interaction term (and no main effects).
modInt1<-lm(medv~lstat:age, data=Boston)
summary(modInt1)
##
## Call:
## lm(formula = medv ~ lstat:age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.347 -4.372 -1.534 1.914 27.193
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.1588631 0.4828240 62.46 <2e-16 ***
## lstat:age -0.0077146 0.0003799 -20.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.827 on 504 degrees of freedom
## Multiple R-squared: 0.4501, Adjusted R-squared: 0.449
## F-statistic: 412.4 on 1 and 504 DF, p-value: < 2.2e-16
Now create a model with the main effects of lstat and age plus their interaction. In R's formula syntax, lstat*age is shorthand for lstat + age + lstat:age.
modInt2<-lm(medv~lstat*age, data=Boston)
summary(modInt2)
##
## Call:
## lm(formula = medv ~ lstat * age, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.806 -4.045 -1.333 2.085 27.552
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 36.0885359 1.4698355 24.553 < 2e-16 ***
## lstat -1.3921168 0.1674555 -8.313 8.78e-16 ***
## age -0.0007209 0.0198792 -0.036 0.9711
## lstat:age 0.0041560 0.0018518 2.244 0.0252 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.149 on 502 degrees of freedom
## Multiple R-squared: 0.5557, Adjusted R-squared: 0.5531
## F-statistic: 209.3 on 3 and 502 DF, p-value: < 2.2e-16
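Since the main-effects model is nested inside this one, we can also run a partial F-test of whether the interaction term is worth keeping, using the same anova() comparison we use later in this lab (output not shown; try it!):
# Partial F-test: does adding lstat:age improve on the main effects?
anova(modInt0, modInt2)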
Let's step back and look at a simple linear regression model of median value against lower status.
# Simple model
mod1<-lm(medv~lstat, data=Boston)
summary(mod1)
##
## Call:
## lm(formula = medv ~ lstat, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.5432
Let's check the diagnostics (scatter plot and residual plot). What can we say about the direction, shape, and strength of the relationship?
library(tidyverse)
ggplot(data=Boston, aes(x=lstat, y=medv))+
geom_point()
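If you want the least-squares line on top of the points, geom_smooth() can overlay it (a quick sketch):
# Scatter plot with the fitted line overlaid
ggplot(data=Boston, aes(x=lstat, y=medv))+
  geom_point()+
  geom_smooth(method="lm", se=FALSE)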
# Res plot 1
plot(mod1$fitted.values,mod1$residuals, pch=16)
abline(h=0, col="red")
Since there is clearly some curvature, let's add a quadratic term to our model (now we're dealing with MLR!).
# Try a model with a quadratic term
mod2<-lm(medv~lstat+I(lstat^2), data=Boston)
summary(mod2)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.2834 -3.8313 -0.5295 2.3095 25.4148
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.862007 0.872084 49.15 <2e-16 ***
## lstat -2.332821 0.123803 -18.84 <2e-16 ***
## I(lstat^2) 0.043547 0.003745 11.63 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.524 on 503 degrees of freedom
## Multiple R-squared: 0.6407, Adjusted R-squared: 0.6393
## F-statistic: 448.5 on 2 and 503 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.6393
Let’s check out the residual plot!
# Res plot 2
plot(mod2$fitted.values,mod2$residuals, pch=16)
abline(h=0, col="red")
# Try this witchcraft!
plot(mod2)
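By default plot() on an lm object steps through four diagnostic plots one at a time; to see them all at once, set up a 2-by-2 plotting grid first:
# All four diagnostic plots in a single 2x2 panel
par(mfrow=c(2,2))
plot(mod2)
par(mfrow=c(1,1)) # reset the layout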
We can also perform a hypothesis test to assess whether it's worth adding the quadratic term.
# anova compare
anova(mod1, mod2)
## Analysis of Variance Table
##
## Model 1: medv ~ lstat
## Model 2: medv ~ lstat + I(lstat^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 504 19472
## 2 503 15347 1 4125.1 135.2 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The tiny p-value confirms the quadratic model is a significant improvement over the simple linear one. In general, if we want all polynomial terms up to a certain degree, we can use the poly() function.
# Try a model with a poly 5
mod5<-lm(medv~poly(lstat,5), data=Boston)
summary(mod5)
##
## Call:
## lm(formula = medv ~ poly(lstat, 5), data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5433 -3.1039 -0.7052 2.0844 27.1153
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.5328 0.2318 97.197 < 2e-16 ***
## poly(lstat, 5)1 -152.4595 5.2148 -29.236 < 2e-16 ***
## poly(lstat, 5)2 64.2272 5.2148 12.316 < 2e-16 ***
## poly(lstat, 5)3 -27.0511 5.2148 -5.187 3.10e-07 ***
## poly(lstat, 5)4 25.4517 5.2148 4.881 1.42e-06 ***
## poly(lstat, 5)5 -19.2524 5.2148 -3.692 0.000247 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.215 on 500 degrees of freedom
## Multiple R-squared: 0.6817, Adjusted R-squared: 0.6785
## F-statistic: 214.2 on 5 and 500 DF, p-value: < 2.2e-16
# Adjusted R-squared: 0.6785
# Res plot 5
plot(mod5$fitted.values,mod5$residuals, pch=16)
abline(h=0, col="red")
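Because the quadratic model is nested inside the degree-5 model, anova() again gives a formal test of whether the extra polynomial terms help (output not shown; try it!):
# Do the cubic, quartic, and quintic terms improve on the quadratic?
anova(mod2, mod5)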
Let's change datasets! Now we'll be using the Carseats data set in the ISLR package. It's a simulated dataset about sales of child car seats at 400 different stores.
data(Carseats)
names(Carseats)
## [1] "Sales" "CompPrice" "Income" "Advertising" "Population"
## [6] "Price" "ShelveLoc" "Age" "Education" "Urban"
## [11] "US"
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Observe that a few of the variables are "Factors"; this means they are categorical. From the str() output we can also see how many levels each one has.
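If you just want the levels of one factor and how many observations fall in each, table() gives a quick tally:
# Store counts for each shelving-location category
table(Carseats$ShelveLoc)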
First, let's fit a simple model of sales as a function of shelving location:
carMod1<-lm(Sales~ShelveLoc, data=Carseats)
summary(carMod1)
##
## Call:
## lm(formula = Sales ~ ShelveLoc, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3066 -1.6282 -0.0416 1.5666 6.1471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.5229 0.2388 23.131 < 2e-16 ***
## ShelveLocGood 4.6911 0.3484 13.464 < 2e-16 ***
## ShelveLocMedium 1.7837 0.2864 6.229 1.2e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.339 on 397 degrees of freedom
## Multiple R-squared: 0.3172, Adjusted R-squared: 0.3138
## F-statistic: 92.23 on 2 and 397 DF, p-value: < 2.2e-16
# Visualize the model
ggplot(Carseats, aes(y=Sales, x=ShelveLoc, fill=ShelveLoc))+
geom_boxplot()
# ANOVA output
anova(carMod1)
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## ShelveLoc 2 1009.5 504.77 92.23 < 2.2e-16 ***
## Residuals 397 2172.7 5.47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
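The overall F-test only says the three locations differ somewhere. If you want to know which pairs differ, one option (beyond what this lab covers) is Tukey's honest significant differences on the equivalent aov() fit:
# Pairwise comparisons of the three shelving locations
TukeyHSD(aov(Sales~ShelveLoc, data=Carseats))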
Second, let's include Price in the model:
carMod2<-lm(Sales~ShelveLoc+Price, data=Carseats)
summary(carMod2)
##
## Call:
## lm(formula = Sales ~ ShelveLoc + Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.8229 -1.3930 -0.0179 1.3868 5.0780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.001802 0.503447 23.839 < 2e-16 ***
## ShelveLocGood 4.895848 0.285921 17.123 < 2e-16 ***
## ShelveLocMedium 1.862022 0.234748 7.932 2.23e-14 ***
## Price -0.056698 0.004059 -13.967 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.917 on 396 degrees of freedom
## Multiple R-squared: 0.5426, Adjusted R-squared: 0.5391
## F-statistic: 156.6 on 3 and 396 DF, p-value: < 2.2e-16
# BAD
# intercept = 12.001802
# slope = -0.056698
# GOOD
# intercept = 12.001802+4.895848
# slope = -0.056698
# MEDIUM
# intercept = 12.001802+1.862022
# slope = -0.056698
# VIZ: Shifts in intercept
ggplot(Carseats, aes(x=Price, y=Sales, color=ShelveLoc))+
geom_point()+
geom_abline(intercept = carMod2$coefficients[1], slope=carMod2$coefficients[4],
color="red", lwd=1)+
geom_abline(intercept = carMod2$coefficients[1]+carMod2$coefficients[2], slope=carMod2$coefficients[4],
color="forestgreen", lwd=1)+
geom_abline(intercept = carMod2$coefficients[1]+carMod2$coefficients[3], slope=carMod2$coefficients[4],
color="blue", lwd=1)
# ANOVA
anova(carMod2)
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## ShelveLoc 2 1009.5 504.77 137.32 < 2.2e-16 ***
## Price 1 717.1 717.10 195.08 < 2.2e-16 ***
## Residuals 396 1455.6 3.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Third, let's include the ShelveLoc-by-Price interaction in the model:
carMod3<-lm(Sales~ShelveLoc*Price, data=Carseats)
summary(carMod3)
##
## Call:
## lm(formula = Sales ~ ShelveLoc * Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9037 -1.3461 -0.0595 1.3679 4.9037
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.832984 0.965788 12.252 < 2e-16 ***
## ShelveLocGood 6.135880 1.392844 4.405 1.36e-05 ***
## ShelveLocMedium 1.630481 1.171616 1.392 0.165
## Price -0.055220 0.008276 -6.672 8.57e-11 ***
## ShelveLocGood:Price -0.010564 0.011742 -0.900 0.369
## ShelveLocMedium:Price 0.001984 0.010007 0.198 0.843
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.918 on 394 degrees of freedom
## Multiple R-squared: 0.5444, Adjusted R-squared: 0.5386
## F-statistic: 94.17 on 5 and 394 DF, p-value: < 2.2e-16
# BAD
# intercept = 11.832984
# slope = -0.055220
# GOOD
# intercept = 11.832984+6.135880
# slope = -0.055220-0.010564
# MEDIUM
# intercept = 11.832984+1.630481
# slope = -0.055220+0.001984
# VIZ: Shift intercepts and slopes
ggplot(Carseats, aes(x=Price, y=Sales, color=ShelveLoc))+
geom_point()+
geom_abline(intercept = carMod3$coefficients[1], slope=carMod3$coefficients[4],
color="red", lwd=1)+
geom_abline(intercept = carMod3$coefficients[1]+carMod3$coefficients[2], slope=carMod3$coefficients[4]+carMod3$coefficients[5],
color="forestgreen", lwd=1)+
geom_abline(intercept = carMod3$coefficients[1]+carMod3$coefficients[3], slope=carMod3$coefficients[4]+carMod3$coefficients[6],
color="blue", lwd=1)
# ANOVA
anova(carMod3)
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## ShelveLoc 2 1009.53 504.77 137.1807 <2e-16 ***
## Price 1 717.10 717.10 194.8878 <2e-16 ***
## ShelveLoc:Price 2 5.89 2.95 0.8005 0.4498
## Residuals 394 1449.75 3.68
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction terms are not significant (p = 0.4498), so the price effect does not appear to differ across shelf locations. Finally, let's fit a model with all the variables and a couple of interactions (Income:Advertising and Price:Age):
carMod<-lm(Sales~.+Income:Advertising+Price:Age, data=Carseats)
summary(carMod)
##
## Call:
## lm(formula = Sales ~ . + Income:Advertising + Price:Age, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.9208 -0.7503 0.0177 0.6754 3.3413
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.5755654 1.0087470 6.519 2.22e-10 ***
## CompPrice 0.0929371 0.0041183 22.567 < 2e-16 ***
## Income 0.0108940 0.0026044 4.183 3.57e-05 ***
## Advertising 0.0702462 0.0226091 3.107 0.002030 **
## Population 0.0001592 0.0003679 0.433 0.665330
## Price -0.1008064 0.0074399 -13.549 < 2e-16 ***
## ShelveLocGood 4.8486762 0.1528378 31.724 < 2e-16 ***
## ShelveLocMedium 1.9532620 0.1257682 15.531 < 2e-16 ***
## Age -0.0579466 0.0159506 -3.633 0.000318 ***
## Education -0.0208525 0.0196131 -1.063 0.288361
## UrbanYes 0.1401597 0.1124019 1.247 0.213171
## USYes -0.1575571 0.1489234 -1.058 0.290729
## Income:Advertising 0.0007510 0.0002784 2.698 0.007290 **
## Price:Age 0.0001068 0.0001333 0.801 0.423812
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.011 on 386 degrees of freedom
## Multiple R-squared: 0.8761, Adjusted R-squared: 0.8719
## F-statistic: 210 on 13 and 386 DF, p-value: < 2.2e-16
Last class we talked about the dummy variables created when we have a categorical variable with more than two levels. We can use the contrasts() function to see how this coding is done.
contrasts(Carseats$ShelveLoc)
## Good Medium
## Bad 0 0
## Good 1 0
## Medium 0 1
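Bad is the baseline simply because it comes first alphabetically. If you ever want a different reference level, relevel() changes it (a sketch; any models would need to be re-fit afterwards):
# Make "Medium" the reference level instead of "Bad"
Carseats$ShelveLoc<-relevel(Carseats$ShelveLoc, ref="Medium")
contrasts(Carseats$ShelveLoc)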
Let's also check out the ANOVA output:
anova(carMod)
## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## CompPrice 1 13.07 13.07 12.7939 0.0003919 ***
## Income 1 79.07 79.07 77.4228 < 2.2e-16 ***
## Advertising 1 219.35 219.35 214.7726 < 2.2e-16 ***
## Population 1 0.38 0.38 0.3744 0.5409648
## Price 1 1198.87 1198.87 1173.8422 < 2.2e-16 ***
## ShelveLoc 2 1047.47 523.74 512.8052 < 2.2e-16 ***
## Age 1 217.39 217.39 212.8503 < 2.2e-16 ***
## Education 1 1.05 1.05 1.0284 0.3111694
## Urban 1 1.22 1.22 1.1948 0.2750539
## US 1 1.57 1.57 1.5344 0.2162066
## Income:Advertising 1 7.95 7.95 7.7838 0.0055330 **
## Price:Age 1 0.65 0.65 0.6411 0.4238116
## Residuals 386 394.23 1.02
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let's change datasets again! We'll be using the Credit data set in the ISLR package. This is another simulated dataset, containing records for 400 credit card customers. For this exercise we're only interested in four variables: balance, age, limit, and rating.
data(Credit)
names(Credit)
## [1] "ID" "Income" "Limit" "Rating" "Cards"
## [6] "Age" "Education" "Gender" "Student" "Married"
## [11] "Ethnicity" "Balance"
creditTrim<-data.frame(balance=Credit$Balance,
limit=Credit$Limit,
rating=Credit$Rating,
age=Credit$Age)
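Since the tidyverse is already loaded, an equivalent way to build this trimmed data frame is dplyr's select(), renaming the columns as we go (namespaced here because MASS also has a select()):
# Same four columns via dplyr, renamed to lowercase
creditTrim<-dplyr::select(Credit, balance=Balance, limit=Limit,
                          rating=Rating, age=Age)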
Multicollinearity occurs when there are linear relationships among the explanatory variables. So let's look at a pairs plot and also the correlation matrix.
pairs(creditTrim)
cor(creditTrim)
## balance limit rating age
## balance 1.000000000 0.8616973 0.8636252 0.001835119
## limit 0.861697267 1.0000000 0.9968797 0.100887922
## rating 0.863625161 0.9968797 1.0000000 0.103164996
## age 0.001835119 0.1008879 0.1031650 1.000000000
Notice that limit and rating are almost perfectly correlated (r = 0.997). Multicollinearity like this inflates the variance of the coefficient estimates, which shrinks the corresponding test statistics and produces larger p-values.
Consider the two models:
credMod1<-lm(balance~age+limit, creditTrim)
summary(credMod1)
##
## Call:
## lm(formula = balance ~ age + limit, data = creditTrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -696.84 -150.78 -13.01 126.68 755.56
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.734e+02 4.383e+01 -3.957 9.01e-05 ***
## age -2.291e+00 6.725e-01 -3.407 0.000723 ***
## limit 1.734e-01 5.026e-03 34.496 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 230.5 on 397 degrees of freedom
## Multiple R-squared: 0.7498, Adjusted R-squared: 0.7486
## F-statistic: 595 on 2 and 397 DF, p-value: < 2.2e-16
credMod2<-lm(balance~rating+limit, creditTrim)
summary(credMod2)
##
## Call:
## lm(formula = balance ~ rating + limit, data = creditTrim)
##
## Residuals:
## Min 1Q Median 3Q Max
## -707.8 -135.9 -9.5 124.0 817.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -377.53680 45.25418 -8.343 1.21e-15 ***
## rating 2.20167 0.95229 2.312 0.0213 *
## limit 0.02451 0.06383 0.384 0.7012
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 232.3 on 397 degrees of freedom
## Multiple R-squared: 0.7459, Adjusted R-squared: 0.7447
## F-statistic: 582.8 on 2 and 397 DF, p-value: < 2.2e-16
A way to quantify the impact of multicollinearity is to look at the VIF (variance inflation factor). A VIF close to 1 implies a predictor is essentially uncorrelated with the others; the larger the VIF, the greater the multicollinearity.
\(VIF(\hat{\beta}_j)=\frac{1}{1-R_{X_j|X_{-j}}^2}\), where \(R_{X_j|X_{-j}}^2\) is the \(R^2\) from regressing \(X_j\) on all of the other predictors.
#install.packages("car")
library(car)
vif(lm(balance~age+rating+limit, creditTrim))
## age rating limit
## 1.011385 160.668301 160.592880
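To see where these numbers come from, we can compute the VIF for limit by hand from the definition above: regress limit on the other two predictors and plug the resulting \(R^2\) into the formula. This should match vif()'s value up to rounding:
# VIF for limit, computed from the definition
r2.limit<-summary(lm(limit~age+rating, data=creditTrim))$r.squared
1/(1-r2.limit)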