Problem 6.

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

library(ISLR)
attach(Wage)

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

First, let's run 10-fold cross-validation (K = 10) over polynomial degrees 1 through 15,

rm(list = ls())  # start from a clean workspace
set.seed(1)
library(boot)

cv.error <- rep(NA, 15)

# 10-fold CV estimate of the test MSE for polynomial degrees 1 to 15
for (i in 1:15) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.error[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
cv.error
##  [1] 1676.826 1600.763 1598.399 1595.651 1594.977 1596.061 1594.298 1598.134
##  [9] 1593.913 1595.950 1598.368 1597.399 1598.879 1597.397 1598.424
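
As a quick check, the minimizing degree can be read off directly:

which.min(cv.error)
## [1] 9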

Next, let's plot the CV errors and mark the minimum,

plot(x = 1:15, y = cv.error, xlab = "degree of age", ylab = "CV MSE",
     type = "l", ylim = c(min(cv.error) - sd(cv.error), max(cv.error) + sd(cv.error)))

# dotted line: one standard deviation above the minimum CV error
abline(h = min(cv.error) + sd(cv.error), lty = "dotted")

# mark the minimum
points(x = which.min(cv.error), y = min(cv.error), col = "red", pch = 20, cex = 2)

Next, let's fit the same sequence of polynomials and compare them with ANOVA

fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6), data = Wage)
fit.7 <- lm(wage ~ poly(age, 7), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6926 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8956  0.001673 ** 
## 4   2995 4771604  1      6070   3.8125  0.050966 .  
## 5   2994 4770322  1      1283   0.8055  0.369516    
## 6   2993 4766389  1      3932   2.4697  0.116165    
## 7   2992 4763834  1      2555   1.6049  0.205311    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA indicates that terms beyond degree 3 are insignificant (degree 4 is borderline, with p ≈ 0.051), whereas cross-validation attains its minimum at degree 9. The CV curve is essentially flat after the first few degrees, however, so the two approaches broadly agree that a low-degree polynomial suffices.
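
The dotted line in the CV plot corresponds to the one-standard-deviation rule: among the degrees whose CV error falls within one standard deviation of the minimum, choose the smallest. A quick check using the cv.error vector computed above:

min(which(cv.error <= min(cv.error) + sd(cv.error)))
## [1] 2

This rule selects degree 2, which is the fit plotted below.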

plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
lm.fit <- lm(wage ~ poly(age, 2), data = Wage)
lm.pred <- predict(lm.fit, newdata = data.frame(age = age.grid), se = TRUE)
lines(x = age.grid, y = lm.pred$fit, col = "blue", lwd = 2)
# dashed lines: fit plus/minus two standard errors
matlines(x = age.grid, y = cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                                 lm.pred$fit - 2 * lm.pred$se.fit),
         lty = "dashed", col = "blue")

This shows the degree-2 polynomial fit, the parsimonious choice under the one-standard-deviation rule, with ±2 standard-error bands.

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cv.error <- rep(NA, 10)
# cut() needs at least two intervals, so start at i = 2
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit <- glm(wage ~ age.cut, data = Wage)
  cv.error[i] <- cv.glm(Wage, lm.fit, K = 10)$delta[1]
}
cv.error
##  [1]       NA 1734.350 1681.835 1637.685 1632.480 1623.502 1611.735 1606.892
##  [9] 1608.471 1606.970

Let’s plot this

plot(2:10, cv.error[-1], xlab = "Number of cuts", ylab = "CV MSE", 
     type = "l")
abline(h = min(cv.error, na.rm = TRUE) + sd(cv.error, na.rm = TRUE) , lty = "dotted")
points( x = which.min(cv.error), y = min(cv.error, na.rm = TRUE), col = "red", pch = 20, cex = 1.5 )

We can see that the minimum CV error occurs at 8 intervals. The most parsimonious model within one standard deviation of that minimum uses 4 intervals, so let us plot the fit with 4 cuts.
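
As before, we can verify the one-standard-deviation choice directly (comparisons against the NA entry are dropped by which()):

min(which(cv.error <= min(cv.error, na.rm = TRUE) + sd(cv.error, na.rm = TRUE)))
## [1] 4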

lm.fit <- glm(wage ~ cut(age, 4), data = Wage)
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(lm.fit, newdata = data.frame(age = age.grid), type = "response", se = TRUE)
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
lines(age.grid, preds$fit, col = "#2c7fb8", lwd = 2)
matlines(age.grid, cbind(preds$fit + 2 * preds$se.fit,
                         preds$fit - 2 * preds$se.fit),
         col = "#7fcdbb", lty = "dashed")

This plot shows the step-function fit with 4 cuts, along with ±2 standard-error bands.

Problem 10.

This question relates to the College data set.

library(ISLR)
attach(College)

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(leaps)

# split the observations into training and test halves
# (no seed is set here, so the exact split varies between runs)
train.set <- sample(1:nrow(College), nrow(College) / 2)
test.set <- -train.set

# forward stepwise selection on the training set
gam1 <- regsubsets(Outstate ~ ., data = College, subset = train.set, method = "forward")
gam1.summary <- summary(gam1)
gam1.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train.set, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  " "    " "    " "       " "       "*"        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 ) " "         "*"        " "   "*"      "*" " "      " "      
## 7  ( 1 ) " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 ) " "         "*"        " "   "*"      "*" " "      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         " "    " "      
## 2  ( 1 ) " "         "*"    " "      
## 3  ( 1 ) " "         "*"    " "      
## 4  ( 1 ) " "         "*"    "*"      
## 5  ( 1 ) " "         "*"    "*"      
## 6  ( 1 ) " "         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"
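
One way to justify the choice of model size is to inspect the selection criteria stored in the regsubsets summary. A quick sketch using the objects defined above (output omitted, since it depends on the random split):

which.max(gam1.summary$adjr2)  # size maximizing adjusted R-squared
which.min(gam1.summary$cp)     # size minimizing Mallows' Cp
which.min(gam1.summary$bic)    # size minimizing BIC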

A model with 6 predictors appears satisfactory. Its coefficients are:

coef(gam1, id = 6)
##   (Intercept)    PrivateYes    Room.Board      Personal           PhD 
## -3218.0203965  3057.2061852     0.9755799    -0.4388594    39.8182967 
##        Expend     Grad.Rate 
##     0.2286988    41.1942393

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
gam.m3 <- gam(Outstate ~ s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) +
                s(Expend, 5) + s(Grad.Rate, 5) + Private,
              data = College, subset = train.set)
par(mfrow = c(2, 3))
plot(gam.m3, se = TRUE, col = "#1c9099")

The plots suggest that Expend has a strongly non-linear relationship with Outstate, and Terminal may be mildly non-linear as well. (Note that this fit smooths Terminal and perc.alumni in place of the Personal and PhD terms from the six-variable model above.)

(c) Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(gam.m3, College[test.set, ])
# test-set R^2: 1 - RSS/TSS
RSS <- sum((College[test.set, ]$Outstate - preds)^2)
TSS <- sum((College[test.set, ]$Outstate - mean(College[test.set, ]$Outstate))^2)
1 - (RSS / TSS)
## [1] 0.7930589

The test \(R^2\) is approximately 0.79: the GAM explains about 79% of the variance in out-of-state tuition on the held-out data.
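
For a complementary error measure, we can also compute the test-set MSE and RMSE. A small sketch using the objects above (output omitted, since it depends on the random split):

test.mse <- mean((College[test.set, ]$Outstate - preds)^2)
sqrt(test.mse)  # RMSE, in the same units as Outstate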

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.m3)
## 
## Call: gam(formula = Outstate ~ s(Room.Board, 5) + s(Terminal, 5) + 
##     s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5) + Private, 
##     data = College, subset = train.set)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6953.0 -1172.5   167.7  1365.8  4796.0 
## 
## (Dispersion Parameter for gaussian family taken to be 3925775)
## 
##     Null Deviance: 6450567118 on 387 degrees of freedom
## Residual Deviance: 1417204245 on 360.9999 degrees of freedom
## AIC: 7020.144 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## s(Room.Board, 5)    1 2619992261 2619992261 667.382 < 2.2e-16 ***
## s(Terminal, 5)      1   70869313   70869313  18.052 2.739e-05 ***
## s(perc.alumni, 5)   1  669406699  669406699 170.516 < 2.2e-16 ***
## s(Expend, 5)        1  487516883  487516883 124.184 < 2.2e-16 ***
## s(Grad.Rate, 5)     1  160020196  160020196  40.761 5.291e-10 ***
## Private             1  372772837  372772837  94.955 < 2.2e-16 ***
## Residuals         361 1417204245    3925775                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar F     Pr(F)    
## (Intercept)                                   
## s(Room.Board, 5)        4 3.9833  0.003551 ** 
## s(Terminal, 5)          4 1.4086  0.230527    
## s(perc.alumni, 5)       4 0.9276  0.447894    
## s(Expend, 5)            4 9.5263 2.482e-07 ***
## s(Grad.Rate, 5)         4 2.7212  0.029478 *  
## Private                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The "Anova for Nonparametric Effects" table shows strong evidence of a non-linear relationship between Expend and Outstate (p ≈ 2.5e-07), with milder evidence for Room.Board (p ≈ 0.004) and Grad.Rate (p ≈ 0.03). Terminal and perc.alumni show no significant non-linearity.
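
Given this, a natural follow-up would be to refit the GAM with linear terms where no non-linearity was detected. A sketch (gam.m4 is a hypothetical refit, not evaluated here):

# keep smooth terms only where the non-parametric ANOVA found evidence of non-linearity
gam.m4 <- gam(Outstate ~ s(Room.Board, 5) + Terminal + perc.alumni +
                s(Expend, 5) + s(Grad.Rate, 5) + Private,
              data = College, subset = train.set)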