6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
library(ISLR)
library(boot)
attach(Wage)
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen,and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
set.seed(10)
cv.error = rep(0,5)
for(i in 1:5){
glm.fit = glm(wage ~ poly(age,i), data = Wage)
cv.error[i] = cv.glm(Wage,glm.fit,K=10)$delta[1]
}
cv.error
## [1] 1676.058 1599.290 1596.142 1594.627 1594.628
plot(cv.error, type = "b")
points(which.min(cv.error), cv.error[4], col="Orange", pch=20)
With the plot above we see that the best degree by cross-validation is 4
Optimal value using ANOVA
fit.1 = lm(wage ~ age, data=Wage)
fit.2 = lm(wage ~ poly(age, 2), data=Wage)
fit.3 = lm(wage ~ poly(age, 3), data=Wage)
fit.4 = lm(wage ~ poly(age, 4), data=Wage)
fit.5 = lm(wage ~ poly(age, 5), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
agelims = range(age)
age.grid=seq(from=agelims[1], to=agelims[2])
preds = predict(fit.4, newdata=list(age=age.grid), se=TRUE)
se.bands= cbind(preds$fit+2*preds$se.fit, preds$fit-2*preds$se.fit)
plot(age, wage,xlim=agelims, cex=.5, col="grey")
lines(age.grid, preds$fit, lwd=2, col="blue")
matlines(age.grid, se.bands, lwd=1, col="blue", lty=3)
(b) Fit a step function to predict wage using age, and perform cross- validation to choose the optimal number of cuts. Make a plot of the fit obtained
set.seed(5)
cv.errors = rep(NA, 10)
for(i in 2:10) {
Wage$age.cut = cut(Wage$age, i)
glm.fit = glm(wage ~ age.cut, data = Wage)
cv.errors[i] = cv.glm(Wage, glm.fit)$delta[1]
}
cv.errors
## [1] NA 1734.064 1682.763 1635.894 1631.450 1623.291 1611.604 1601.006
## [9] 1610.213 1604.804
plot(2:10, cv.errors[-1], type="b")
points(which.min(cv.errors), cv.errors[which.min(cv.errors)], col="red", pch=20, cex=2)
The best number of cuts is 8
fit.step = glm(wage ~ cut(age, 8), data=Wage)
preds = predict(fit.step, data.frame(age = age.grid))
plot(age, wage, col="darkgray")
lines(age.grid, preds, col="red", lwd=2)
10. This question relates to the College data set.
library(ISLR)
library(leaps)
attach(College)
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
set.seed(1)
train = sample(1: nrow(College), nrow(College)/2)
test = -train
fwd.fit = regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
summary(fwd.fit)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " "*" " " " "
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " " " " " " "
## 6 ( 1 ) " " "*" " " " " " " "*" " "
## 7 ( 1 ) " " "*" " " "*" " " "*" " "
## 8 ( 1 ) " " "*" " " "*" " " "*" " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) "*" " " " "
## 3 ( 1 ) "*" "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" "*"
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
coef(fwd.fit, id = 6)
## (Intercept) PrivateYes Room.Board Terminal perc.alumni
## -4726.8810613 2717.7019276 1.1032433 36.9990286 59.0863753
## Expend Grad.Rate
## 0.1930814 33.8303314
using forward selection our best model includes 6 predictors
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
library(splines)
gam.fit = gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2,3))
plot(gam.fit, se = TRUE, col = 'red')
With our plots above we see that Expend~Grad rate are non linear with outstate
(c) Evaluate the model obtained on the test set, and explain the results obtained.
preds = predict(gam.fit, College[test, ])
RSS = sum((College[test, ]$Outstate - preds)^2)
TSS = sum((College[test, ]$Outstate - mean(College[test, ]$Outstate)) ^ 2)
1 - (RSS / TSS)
## [1] 0.7652114
The \(R^2\) test is 76.52%
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal,
## 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
## data = College, subset = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7270.32 -1115.76 -58.54 1231.45 7013.47
##
## (Dispersion Parameter for gaussian family taken to be 3666101)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1323460787 on 360.9995 degrees of freedom
## AIC: 6993.591
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1780121359 1780121359 485.562 < 2.2e-16 ***
## s(Room.Board, 5) 1 1635038271 1635038271 445.988 < 2.2e-16 ***
## s(Terminal, 5) 1 281113708 281113708 76.679 < 2.2e-16 ***
## s(perc.alumni, 5) 1 351003562 351003562 95.743 < 2.2e-16 ***
## s(Expend, 5) 1 607193060 607193060 165.624 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 89036829 89036829 24.287 1.267e-06 ***
## Residuals 361 1323460787 3666101
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 5) 4 2.0626 0.0852 .
## s(Terminal, 5) 4 1.5754 0.1803
## s(perc.alumni, 5) 4 0.4126 0.7996
## s(Expend, 5) 4 21.2640 8.882e-16 ***
## s(Grad.Rate, 5) 4 0.8399 0.5006
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ANOVA shows that Expend and Grad.rate have a non linear relationship with Outstate