In this exercise, you will further analyze the Wage data set considered throughout this chapter.
library(ISLR)
attach(Wage)
str(Wage)
## 'data.frame': 3000 obs. of 11 variables:
## $ year : int 2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
## $ age : int 18 24 45 43 50 54 44 30 41 52 ...
## $ maritl : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
## $ race : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
## $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
## $ region : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ jobclass : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
## $ health : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
## $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
## $ logwage : num 4.32 4.26 4.88 5.04 4.32 ...
## $ wage : num 75 70.5 131 154.7 75 ...
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(boot)
set.seed(1)
cv.error <- rep(NA, 10)
for (i in 1:10) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.error
## [1] 1676.826 1600.763 1598.399 1595.651 1594.977 1596.061 1594.298 1598.134
## [9] 1593.913 1595.950
plot(1:10, cv.error, xlab = "Degree", ylab = "CV Error", type = "l")
d.min <- which.min(cv.error)
points(d.min, cv.error[d.min], col = "darkblue", cex = 2, pch = 20)
Based on the cross-validation results, the minimum CV error (1593.913) occurs at degree 9, but the error curve is essentially flat for every degree above 3, so the gain over lower-degree fits is marginal.
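Since the CV curve is essentially flat beyond degree 3, a simpler degree is defensible. As a rough stand-in for the one-standard-error rule (cv.glm() does not return standard errors directly), the sketch below picks the smallest degree within a small tolerance of the minimum; the tolerance value is an arbitrary choice for illustration.
# Rough parsimony heuristic (a sketch, not the formal one-standard-error rule):
# select the smallest degree whose CV error is within `tol` of the minimum.
tol <- 5  # arbitrary tolerance, for illustration only
d.simple <- min(which(cv.error <= min(cv.error) + tol))
d.simple  # with the CV errors above, this selects degree 3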
Next, I compare the nested polynomial fits using ANOVA F-tests.
fit.1 <- lm(wage~poly(age, 1), data=Wage)
fit.2 <- lm(wage~poly(age, 2), data=Wage)
fit.3 <- lm(wage~poly(age, 3), data=Wage)
fit.4 <- lm(wage~poly(age, 4), data=Wage)
fit.5 <- lm(wage~poly(age, 5), data=Wage)
fit.6 <- lm(wage~poly(age, 6), data=Wage)
fit.7 <- lm(wage~poly(age, 7), data=Wage)
fit.8 <- lm(wage~poly(age, 8), data=Wage)
fit.9 <- lm(wage~poly(age, 9), data=Wage)
fit.10 <- lm(wage~poly(age, 10), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.7638 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.9005 0.001669 **
## 4 2995 4771604 1 6070 3.8143 0.050909 .
## 5 2994 4770322 1 1283 0.8059 0.369398
## 6 2993 4766389 1 3932 2.4709 0.116074
## 7 2992 4763834 1 2555 1.6057 0.205199
## 8 2991 4763707 1 127 0.0796 0.777865
## 9 2990 4756703 1 7004 4.4014 0.035994 *
## 10 2989 4756701 1 3 0.0017 0.967529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA table, the degree-2 and degree-3 terms are highly significant (p < 0.01), while every term above degree 3 is non-significant at the 0.05 level except the degree-9 term (p ≈ 0.036). This suggests that a cubic model captures the relationship adequately.
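As an aside, because poly() constructs orthogonal polynomials, the same sequence of tests can be read off a single fit: the squared t-statistics in the degree-10 coefficient table equal the F-statistics in the ANOVA table above.
coef(summary(fit.10))  # squared t-values reproduce the sequential F-tests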
plot(wage ~ age, data = Wage, col = "darkgrey", main = "Polynomial Degrees 9 and 3")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
lm.fit <- lm(wage ~ poly(age, 3), data = Wage)
lm.pred <- predict(lm.fit, data.frame(age = age.grid))
lines(age.grid, lm.pred, col = "blue", lwd = 2)
lm.fit2 <- lm(wage ~ poly(age, 9), data = Wage)
lm.pred2 <- predict(lm.fit2, data.frame(age = age.grid))
lines(age.grid, lm.pred2, col = "red", lwd = 2)
Comparing the two fitted curves, I have more confidence in the degree-3 model: it captures the trend in the data without the wiggly behavior the degree-9 fit shows near the boundaries, and it is the model supported by the ANOVA results.
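To back this up visually, approximate 95% pointwise confidence bands can be added around the cubic fit; this short sketch reuses lm.fit and age.grid from the code above.
preds <- predict(lm.fit, data.frame(age = age.grid), se.fit = TRUE)
lines(age.grid, preds$fit + 2 * preds$se.fit, col = "blue", lty = 2)  # upper band
lines(age.grid, preds$fit - 2 * preds$se.fit, col = "blue", lty = 2)  # lower band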
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
cvstep <- rep(NA, 10)
# cut() requires at least two intervals, so the loop starts at i = 2
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit3 <- glm(wage ~ age.cut, data = Wage)
  cvstep[i] <- cv.glm(Wage, lm.fit3, K = 10)$delta[2]  # bias-corrected CV estimate
}
cvstep
## [1] NA 1735.017 1682.147 1636.701 1631.334 1625.009 1613.236 1599.865
## [9] 1611.936 1606.397
plot(2:10, cvstep[-1], xlab = "Cuts", ylab = "CV Error", type = "l")
d.min2 <- which.min(cvstep)
points(d.min2, cvstep[d.min2], col = "darkblue", cex = 2, pch = 20)
Based on the cross-validation results and plot, the optimal number of cuts is 8, with a cross-validation error of 1599.865.
lm.fit4 <- glm(wage ~ cut(age, 8), data = Wage)
agelims2 <- range(Wage$age)
age.grid2 <- seq(from = agelims2[1], to = agelims2[2])
# cut() is re-evaluated on the new data; this works here because age.grid2
# spans the full range of age, so the same eight intervals are reproduced
lm.pred3 <- predict(lm.fit4, data.frame(age = age.grid2))
plot(wage ~ age, data = Wage, col = "darkgrey")
lines(age.grid2, lm.pred3, col = "blue", lwd = 2)
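For reference, the eight equal-width age intervals produced by cut(), along with the number of observations falling in each:
table(cut(Wage$age, 8))  # interval labels and counts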
detach(Wage)
This question relates to the College data set.
library(ISLR)
library(leaps)
attach(College)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
set.seed(1)
train <- sample(length(Outstate), length(Outstate)/2)
test <- -train
College.train <- College[train, ]
College.test <- College[test, ]
reg.fit <- regsubsets(Outstate ~ ., data = College.train, nvmax = 17, method = "forward")
reg.summary <- summary(reg.fit)
reg.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College.train, nvmax = 17,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## 9 ( 1 ) "*" " " " " " " "*" " " " "
## 10 ( 1 ) "*" " " " " " " "*" " " " "
## 11 ( 1 ) "*" " " "*" " " "*" " " " "
## 12 ( 1 ) "*" "*" "*" " " "*" " " " "
## 13 ( 1 ) "*" "*" "*" "*" "*" " " " "
## 14 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 15 ( 1 ) "*" "*" "*" "*" "*" "*" " "
## 16 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" " " "*" " "
## 9 ( 1 ) " " "*" " " "*" " " "*" "*"
## 10 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 11 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 12 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 13 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 14 ( 1 ) " " "*" "*" "*" " " "*" "*"
## 15 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 16 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 17 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
## 9 ( 1 ) "*" "*" "*"
## 10 ( 1 ) "*" "*" "*"
## 11 ( 1 ) "*" "*" "*"
## 12 ( 1 ) "*" "*" "*"
## 13 ( 1 ) "*" "*" "*"
## 14 ( 1 ) "*" "*" "*"
## 15 ( 1 ) "*" "*" "*"
## 16 ( 1 ) "*" "*" "*"
## 17 ( 1 ) "*" "*" "*"
reg.summary$adjr2
## [1] 0.4486533 0.6157583 0.6806427 0.7262750 0.7418534 0.7516133 0.7538978
## [8] 0.7559309 0.7577210 0.7587138 0.7587900 0.7646744 0.7690583 0.7706887
## [15] 0.7703905 0.7698240 0.7692238
which.max(reg.summary$adjr2)
## [1] 14
reg.summary$cp
## [1] 538.19147 259.02390 151.39447 76.27854 51.30582 36.07408 33.23602
## [8] 30.83073 28.84098 28.16937 28.99957 20.39253 14.26824 12.63239
## [15] 14.11928 16.03507 18.00000
which.min(reg.summary$cp)
## [1] 14
reg.summary$bic
## [1] -220.0937 -355.2429 -422.0560 -476.9311 -494.7198 -504.7297 -503.3736
## [8] -501.6536 -499.5740 -496.2339 -491.4261 -496.0811 -498.4523 -496.2790
## [15] -490.8555 -484.9828 -479.0586
which.min(reg.summary$bic)
## [1] 6
reg.summary$rss
## [1] 3843936983 2671956475 2214992160 1893552628 1781123214 1709296809
## [7] 1689130494 1670768074 1654137740 1643001768 1638125736 1593912800
## [13] 1560048632 1544893426 1542754678 1542403678 1542257478
which.min(reg.summary$rss)
## [1] 17
par(mfrow = c(1, 3))
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l")
points(14, reg.summary$adjr2[14], col = "red", cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "CP", type = "l")
points(14, reg.summary$cp[14], col = "green", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = "blue", cex = 2, pch = 20)
coef(reg.fit, 14)
## (Intercept) PrivateYes Apps Accept Enroll
## -1105.5419407 2218.1453812 -0.3925751 0.9165783 -0.9704322
## Top10perc Top25perc Room.Board Books Personal
## 61.3795033 -23.7045316 1.0714537 -0.9555993 -0.3123567
## Terminal S.F.Ratio perc.alumni Expend Grad.Rate
## 33.3259647 -68.9039073 47.8594759 0.1471632 26.3490845
Based on the forward stepwise selection results on the training set, adjusted R² and Mallows' Cp are both optimized at 14 variables, while BIC favors a more parsimonious 6-variable model. I proceed with the 14-variable model (PrivateYes, Apps, Accept, Enroll, Top10perc, Top25perc, Room.Board, Books, Personal, Terminal, S.F.Ratio, perc.alumni, Expend, Grad.Rate) as a satisfactory subset of predictors for Outstate.
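As a cross-check, the validation-set MSE can be computed for every model size. regsubsets() has no built-in predict() method, so the sketch below defines one following the approach used in the ISLR labs; the helper name and loop variables are mine.
# Sketch of a predict() method for regsubsets objects
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])   # recover the model formula
  mat <- model.matrix(form, newdata)     # design matrix for the new data
  coefi <- coef(object, id = id)         # coefficients of the id-variable model
  mat[, names(coefi)] %*% coefi
}
val.errors <- rep(NA, 17)
for (i in 1:17) {
  pred <- predict(reg.fit, College.test, id = i)
  val.errors[i] <- mean((College.test$Outstate - pred)^2)
}
which.min(val.errors)  # model size with the lowest validation MSE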
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
gam.fit <- gam(Outstate ~ Private + s(Apps, df = 2) + s(Accept, df = 2) +
                 s(Enroll, df = 2) + s(Top10perc, df = 2) + s(Top25perc, df = 2) +
                 s(Room.Board, df = 2) + s(Books, df = 2) + s(Personal, df = 2) +
                 s(Terminal, df = 2) + s(S.F.Ratio, df = 2) + s(perc.alumni, df = 2) +
                 s(Expend, df = 2) + s(Grad.Rate, df = 2), data = College.train)
par(mfrow = c(2, 3))
plot(gam.fit, se = TRUE, col = "blue")
The GAM plots suggest potentially non-linear relationships between the response and Books, Personal, and S.F.Ratio.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
gam.pred <- predict(gam.fit, College.test)
gam.err <- mean((College.test$Outstate - gam.pred)^2)
gam.err
## [1] 3327954
gam.tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
test.r2 <- 1 - gam.err / gam.tss  # this is the test-set R-squared, not an RSS
test.r2
## [1] 0.7674922
Using the GAM on the test set, we obtain a test R-squared of about 0.77 and a test MSE of 3327954.
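For a baseline comparison (a sketch; lm.fit5 and lm.pred5 are names introduced here), the same 14 predictors can be fit with ordinary least squares and evaluated on the same test set:
lm.fit5 <- lm(Outstate ~ Private + Apps + Accept + Enroll + Top10perc +
                Top25perc + Room.Board + Books + Personal + Terminal +
                S.F.Ratio + perc.alumni + Expend + Grad.Rate,
              data = College.train)
lm.pred5 <- predict(lm.fit5, College.test)
mean((College.test$Outstate - lm.pred5)^2)  # linear-model test MSE, to compare with gam.err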
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(Apps, df = 2) + s(Accept,
## df = 2) + s(Enroll, df = 2) + s(Top10perc, df = 2) + s(Top25perc,
## df = 2) + s(Room.Board, df = 2) + s(Books, df = 2) + s(Personal,
## df = 2) + s(Terminal, df = 2) + s(S.F.Ratio, df = 2) + s(perc.alumni,
## df = 2) + s(Expend, df = 2) + s(Grad.Rate, df = 2), data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6098.91 -1241.11 -10.78 1222.05 8680.80
##
## (Dispersion Parameter for gaussian family taken to be 3613005)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1300682882 on 360.0003 degrees of freedom
## AIC: 6988.854
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1929816146 1929816146 534.1305 < 2.2e-16 ***
## s(Apps, df = 2) 1 691457239 691457239 191.3801 < 2.2e-16 ***
## s(Accept, df = 2) 1 77177822 77177822 21.3611 5.308e-06 ***
## s(Enroll, df = 2) 1 10541100 10541100 2.9175 0.0884832 .
## s(Top10perc, df = 2) 1 1157444142 1157444142 320.3550 < 2.2e-16 ***
## s(Top25perc, df = 2) 1 25267529 25267529 6.9935 0.0085388 **
## s(Room.Board, df = 2) 1 573575250 573575250 158.7530 < 2.2e-16 ***
## s(Books, df = 2) 1 40871174 40871174 11.3122 0.0008528 ***
## s(Personal, df = 2) 1 42455695 42455695 11.7508 0.0006786 ***
## s(Terminal, df = 2) 1 76318619 76318619 21.1233 5.968e-06 ***
## s(S.F.Ratio, df = 2) 1 75223534 75223534 20.8202 6.930e-06 ***
## s(perc.alumni, df = 2) 1 115818908 115818908 32.0561 3.065e-08 ***
## s(Expend, df = 2) 1 214622358 214622358 59.4027 1.264e-13 ***
## s(Grad.Rate, df = 2) 1 47800735 47800735 13.2302 0.0003157 ***
## Residuals 360 1300682882 3613005
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Apps, df = 2) 1 3.339 0.06849 .
## s(Accept, df = 2) 1 3.526 0.06122 .
## s(Enroll, df = 2) 1 0.355 0.55144
## s(Top10perc, df = 2) 1 0.624 0.42993
## s(Top25perc, df = 2) 1 1.569 0.21119
## s(Room.Board, df = 2) 1 1.627 0.20289
## s(Books, df = 2) 1 2.693 0.10168
## s(Personal, df = 2) 1 2.152 0.14326
## s(Terminal, df = 2) 1 0.741 0.38976
## s(S.F.Ratio, df = 2) 1 4.864 0.02806 *
## s(perc.alumni, df = 2) 1 0.504 0.47817
## s(Expend, df = 2) 1 39.333 1.025e-09 ***
## s(Grad.Rate, df = 2) 1 0.253 0.61526
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Based on the ANOVA for nonparametric effects, there is strong evidence of a non-linear relationship between Outstate and Expend (p ≈ 1.0e-9) and moderate evidence for S.F.Ratio (p ≈ 0.028); none of the other smooth terms show significant non-linearity at the 0.05 level.
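A natural follow-up (a sketch, not run here) is to refit the GAM keeping smooth terms only for Expend and S.F.Ratio, entering the remaining predictors linearly, and then check whether the test MSE changes appreciably:
gam.fit2 <- gam(Outstate ~ Private + Apps + Accept + Enroll + Top10perc +
                  Top25perc + Room.Board + Books + Personal + Terminal +
                  s(S.F.Ratio, df = 2) + perc.alumni + s(Expend, df = 2) +
                  Grad.Rate, data = College.train)
mean((College.test$Outstate - predict(gam.fit2, College.test))^2)  # compare with gam.err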
detach(College)