In this exercise, you will further analyze the Wage data set considered throughout this chapter.
library(ISLR)
attach(Wage)
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
First, let's run 10-fold cross-validation (K = 10) over polynomial degrees 1 to 15.
set.seed(1)
library(boot)
cv.error <- rep(NA, 15)  # 10-fold CV estimate of test MSE for degrees 1-15
for (i in 1:15) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.error
## [1] 1676.826 1600.763 1598.399 1595.651 1594.977 1596.061 1594.298 1598.134
## [9] 1593.913 1595.950 1598.368 1597.399 1598.879 1597.397 1598.424
Next, let's plot the CV errors and mark the minimum:
plot(x = 1:15, y = cv.error, xlab = "Polynomial degree", ylab = "CV MSE",
     type = "l",
     ylim = c(min(cv.error) - sd(cv.error), max(cv.error) + sd(cv.error)))
# Dotted line: one standard deviation above the minimum CV error
abline(h = min(cv.error) + sd(cv.error), lty = "dotted")
points(x = which.min(cv.error), y = min(cv.error), col = "red", pch = 20, cex = 2)
Next, let's compare nested polynomial fits using ANOVA:
fit.1 = lm(wage ~ age, data=Wage)
fit.2 = lm(wage ~ poly(age,2), data=Wage)
fit.3 = lm(wage ~ poly(age,3), data=Wage)
fit.4 = lm(wage ~ poly(age,4), data=Wage)
fit.5 = lm(wage ~ poly(age,5), data=Wage)
fit.6 = lm(wage ~ poly(age,6), data=Wage)
fit.7 = lm(wage ~ poly(age,7), data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6,fit.7)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.6926 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8956 0.001673 **
## 4 2995 4771604 1 6070 3.8125 0.050966 .
## 5 2994 4770322 1 1283 0.8055 0.369516
## 6 2993 4766389 1 3932 2.4697 0.116165
## 7 2992 4763834 1 2555 1.6049 0.205311
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA indicates that terms beyond degree 3 add little (degree 4 is borderline at p ≈ 0.05), whereas cross-validation attains its minimum at degree 9. The CV curve is nearly flat from degree 3 onward, however, so the two approaches broadly agree that a low-degree polynomial is adequate.
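To make the dotted-line threshold in the CV plot concrete, we can apply a one-standard-error-style rule: choose the smallest degree whose CV error lies within one standard deviation of the minimum. This heuristic is added here for illustration; for the errors printed above it selects degree 2.
# Smallest degree whose CV error falls below the dotted line
# (minimum CV error + one SD); for the errors above this returns 2
min(which(cv.error <= min(cv.error) + sd(cv.error)))
We therefore plot the parsimonious degree-2 fit below.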
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
lm.fit <- lm(wage ~ poly(age, 2), data = Wage)
lm.pred <- predict(lm.fit, data.frame(age = age.grid), se = TRUE)
lines(x = age.grid, y = lm.pred$fit, col = "blue", lwd = 2)
# Dashed lines: approximate 95% confidence band (fit +/- 2 SE)
matlines(x = age.grid,
         y = cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                   lm.pred$fit - 2 * lm.pred$se.fit),
         lty = "dashed", col = "blue")
The plot shows the degree-2 polynomial fit, the parsimonious choice identified above, together with its approximate 95% confidence band.
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
cv.error <- rep(NA, 10)  # 10-fold CV error for 2 to 10 cuts
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.error[i] <- cv.glm(Wage, lm.fit, K = 10)$delta[1]
}
cv.error
## [1] NA 1734.350 1681.835 1637.685 1632.480 1623.502 1611.735 1606.892
## [9] 1608.471 1606.970
Let's plot this:
plot(2:10, cv.error[-1], xlab = "Number of cuts", ylab = "CV MSE", type = "l")
# Dotted line: one standard deviation above the minimum CV error
abline(h = min(cv.error, na.rm = TRUE) + sd(cv.error, na.rm = TRUE), lty = "dotted")
points(x = which.min(cv.error), y = min(cv.error, na.rm = TRUE),
       col = "red", pch = 20, cex = 1.5)
Cross-validation attains its minimum at 8 cuts. The parsimonious model within one standard deviation of that minimum (the dotted line) uses 4 cuts, so let us plot the fit with 4 cuts.
lm.fit <- glm(wage ~ cut(age, 4), data = Wage)
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(lm.fit, data.frame(age = age.grid), type = "response", se = TRUE)
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
lines(age.grid, preds$fit, col = "#2c7fb8", lwd = 2)
# Dashed lines: approximate 95% confidence band (fit +/- 2 SE)
matlines(age.grid,
         cbind(preds$fit + 2 * preds$se.fit,
               preds$fit - 2 * preds$se.fit),
         col = "#7fcdbb", lty = "dashed")
This plots the step-function fit with 4 cuts over the data.
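For reference, cut() with an integer second argument splits the range of age into that many equal-width intervals; the bins and their counts can be inspected directly (interval endpoints depend on the observed age range):
# The four equal-width age intervals used above, with observation counts
table(cut(Wage$age, 4))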
This question relates to the College data set.
library(ISLR)
attach(College)
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(leaps)
train.set <- sample(1:nrow(College), nrow(College) / 2)
test.set <- -train.set
gam1 = regsubsets(Outstate ~ ., data = College, subset = train.set, method = 'forward')
gam1.summary = summary(gam1)
gam1.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train.set,
## method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " " " " " "*"
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " "*" " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " "*" "*" " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " "
## 8 ( 1 ) " " "*" " " "*" "*" " " " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " " " " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) " " "*" "*"
## 5 ( 1 ) " " "*" "*"
## 6 ( 1 ) " " "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
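Rather than eyeballing the selection table, we can consult the criteria stored in the regsubsets summary (a quick sketch using the standard fields of a summary.regsubsets object; only subset sizes up to 8 were fit here):
# Subset sizes favored by the usual criteria, over sizes 1-8
which.min(gam1.summary$cp)     # Mallows' Cp
which.min(gam1.summary$bic)    # BIC
which.max(gam1.summary$adjr2)  # adjusted R^2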
A model with six predictors appears satisfactory; its coefficients are:
coef(gam1, id = 6)
## (Intercept) PrivateYes Room.Board Personal PhD
## -3218.0203965 3057.2061852 0.9755799 -0.4388594 39.8182967
## Expend Grad.Rate
## 0.2286988 41.1942393
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
# Note: Terminal and perc.alumni are smoothed here where forward selection
# picked Personal and PhD; either six-predictor set is defensible.
gam.m3 <- gam(Outstate ~ s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) +
                s(Expend, 5) + s(Grad.Rate, 5) + Private,
              data = College, subset = train.set)
par(mfrow = c(2, 3))
plot(gam.m3, se = TRUE, col = "#1c9099")
The plots suggest that Expend, and to a lesser extent Terminal, have non-linear relationships with Outstate.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
preds <- predict(gam.m3, College[test.set, ])
RSS <- sum((College[test.set, ]$Outstate - preds)^2)
TSS <- sum((College[test.set, ]$Outstate - mean(College[test.set, ]$Outstate))^2)
1 - (RSS / TSS)
## [1] 0.7930589
The test \(R^2\) is about 0.79, i.e. the GAM explains roughly 79% of the variance in out-of-state tuition on held-out data.
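As an optional check (not part of the original analysis), we can compare the GAM's test-set MSE with that of a purely linear model on the same six predictors; if the fitted non-linearities are real, the GAM should do at least as well. The names ols.fit, err.gam, and err.ols are introduced here for illustration.
# Test-set MSE: GAM vs. an ordinary linear fit on the same predictors
err.gam <- mean((College[test.set, ]$Outstate - preds)^2)
ols.fit <- lm(Outstate ~ Room.Board + Terminal + perc.alumni + Expend +
                Grad.Rate + Private, data = College, subset = train.set)
err.ols <- mean((College[test.set, ]$Outstate -
                   predict(ols.fit, College[test.set, ]))^2)
c(gam = err.gam, ols = err.ols)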
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.m3)
##
## Call: gam(formula = Outstate ~ s(Room.Board, 5) + s(Terminal, 5) +
## s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5) + Private,
## data = College, subset = train.set)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6953.0 -1172.5 167.7 1365.8 4796.0
##
## (Dispersion Parameter for gaussian family taken to be 3925775)
##
## Null Deviance: 6450567118 on 387 degrees of freedom
## Residual Deviance: 1417204245 on 360.9999 degrees of freedom
## AIC: 7020.144
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## s(Room.Board, 5) 1 2619992261 2619992261 667.382 < 2.2e-16 ***
## s(Terminal, 5) 1 70869313 70869313 18.052 2.739e-05 ***
## s(perc.alumni, 5) 1 669406699 669406699 170.516 < 2.2e-16 ***
## s(Expend, 5) 1 487516883 487516883 124.184 < 2.2e-16 ***
## s(Grad.Rate, 5) 1 160020196 160020196 40.761 5.291e-10 ***
## Private 1 372772837 372772837 94.955 < 2.2e-16 ***
## Residuals 361 1417204245 3925775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## s(Room.Board, 5) 4 3.9833 0.003551 **
## s(Terminal, 5) 4 1.4086 0.230527
## s(perc.alumni, 5) 4 0.9276 0.447894
## s(Expend, 5) 4 9.5263 2.482e-07 ***
## s(Grad.Rate, 5) 4 2.7212 0.029478 *
## Private
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA for non-parametric effects shows strong evidence of a non-linear relationship with Outstate for Expend (p ≈ 2.5e-07), more modest evidence for Room.Board (p ≈ 0.004) and Grad.Rate (p ≈ 0.03), and no evidence for Terminal or perc.alumni.
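To back up this reading with a formal comparison (a follow-up sketch, not required by the exercise), we can refit the GAM with Expend entering linearly and compare the two models with an F-test; a small p-value favors keeping the smooth term.
# Same GAM but with a linear Expend term, compared against gam.m3
gam.lin <- gam(Outstate ~ s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) +
                 Expend + s(Grad.Rate, 5) + Private,
               data = College, subset = train.set)
anova(gam.lin, gam.m3, test = "F")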