Chapter 07 (page 297): 6, 10

Problem 6

library(ISLR2)
attach(Wage)
sum(is.na(Wage))
## [1] 0

In this exercise, you will further analyze the Wage data set considered throughout this chapter.
Q6(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
A6(a) The cross validation using a K-fold of 10 shows that to the fifth power is the best fit with the lowest error. The results of the seventh show that cubed is the most statistically significant fit.
Cross-validation

library(boot)
set.seed(1)

cv.delta=rep(NA,8)

for (i in 1:8){
 glm.fit=glm(wage~poly(age ,i),data=Wage) 
 cv.delta[i]=cv.glm(Wage,glm.fit,K=10)$delta [2]
}

cv.delta
## [1] 1676.681 1600.607 1598.089 1595.381 1594.716 1595.676 1593.962 1597.595
which.min(cv.delta)
## [1] 7

Poly Regression Plot

plot(1:8, cv.delta, xlab="Degree", ylab="CV error", type="l", pch=20, lwd=2, ylim=c(1590, 1650))
min.point = min(cv.delta)
sd.points = sd(cv.delta)
abline(h=min.point + 0.2 * sd.points, col="blue", lty="dashed")
abline(h=min.point - 0.2 * sd.points, col="blue", lty="dashed")
legend("topright", "0.2-standard deviation lines", lty="dashed", col="blue")

ANOVA

set.seed(1)
fit1=lm(wage~poly(age,1),data=Wage)
fit2=lm(wage~poly(age,2),data=Wage)
fit3=lm(wage~poly(age,3),data=Wage)
fit4=lm(wage~poly(age,4),data=Wage)
fit5=lm(wage~poly(age,5),data=Wage)
fit6=lm(wage~poly(age,6),data=Wage)
anova(fit1, fit2, fit3, fit4, fit5, fit6)
## Analysis of Variance Table
## 
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6636 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8936  0.001675 ** 
## 4   2995 4771604  1      6070   3.8117  0.050989 .  
## 5   2994 4770322  1      1283   0.8054  0.369565    
## 6   2993 4766389  1      3932   2.4692  0.116201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

ANOVA Plots

set.seed(1)
agelims=range(age)
age.grid=seq(from=agelims[1],to=agelims[2])
pred=predict(fit3,newdata = list(age=age.grid), se=TRUE)
se.bands=cbind(pred$fit + 2*pred$se.fit, pred$fit + 2*pred$se.fit)
plot(age, wage, xlim=agelims, cex=.5, col = "grey")
title("Wage by Age: Degree-3 Polynomial")
lines(age.grid, pred$fit, lwd=2, col="darkblue")
matlines(age.grid, se.bands, lwd=1, col="darkblue", lty=3)

Q6(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
A6(b) I tested cutting age into up to 10 intervals, finding that the optimal number of cuts based off cross validation is 8 (based on error rate). These 8 cuts occur at 25.8, 35.5, 41.2, 49, 58.6, 64.5, and 72.2.

set.seed(1)
cv.error.10 = rep(NA,10)

for (i in 2:10) {
  Wage$age.cut = cut(Wage$age, i)
  glm.fit = glm(wage ~ age.cut, data = Wage)
  cv.error.10[i] = cv.glm(Wage, glm.fit, K = 10)$delta[1]
}

cv.error.10
##  [1]       NA 1734.489 1684.271 1635.552 1632.080 1623.415 1614.996 1601.318
##  [9] 1613.954 1606.331
which.min(cv.error.10)
## [1] 8
plot(cv.error.10, xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cv.error.10)
points(which.min(cv.error.10), cv.error.10[which.min(cv.error.10)], col = "red", cex = 2, pch = 20) 

fit=lm(wage~cut(age,8),data=Wage)
coef(summary(fit))
##                        Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)            76.28175   2.629812 29.006542 3.110596e-163
## cut(age, 8)(25.8,33.5] 25.83329   3.161343  8.171618  4.440913e-16
## cut(age, 8)(33.5,41.2] 40.22568   3.049065 13.192791  1.136044e-38
## cut(age, 8)(41.2,49]   43.50112   3.018341 14.412262  1.406253e-45
## cut(age, 8)(49,56.8]   40.13583   3.176792 12.634076  1.098741e-35
## cut(age, 8)(56.8,64.5] 44.10243   3.564299 12.373380  2.481643e-34
## cut(age, 8)(64.5,72.2] 28.94825   6.041576  4.791505  1.736008e-06
## cut(age, 8)(72.2,80.1] 15.22418   9.781110  1.556488  1.196978e-01
set.seed(1)
step.fit <- glm(wage~cut(age, 8), data=Wage)
agelims <- range(Wage$age)
age.grid <- seq(from=agelims[1], to=agelims[2])
step.pred <- predict(step.fit, data.frame(age=age.grid))
plot(wage~age, data=Wage, col="grey")
lines(age.grid, step.pred, col="darkblue", lwd=2)

detach(Wage)

Problem 10

This question relates to the College data set.

library(ISLR)

Q10(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
10(a) The satisfactory model created using the caret package leapForward method would use 6 predictors, those predictors being: PrivateYes, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate.

library(caret)
attach(College)
library(leaps)

set.seed(2)
train = sample(1:(nrow(College)*.7),)
test = (-train)
y.test = College$Outstate[test]
ctrl = trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 1, 
                     selectionFunction = "oneSE")

set.seed(2)

regfit.fwd = train(Outstate ~ .,
                       data = College[train,],
                       method = "leapForward",
                       metric = "MSE",
                       maximize = F,
                       trControl = ctrl,
                       tuneGrid = data.frame(nvmax = 1:17))

regfit.fwd
## Linear Regression with Forward Selection 
## 
## 543 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 488, 490, 487, 489, 488, 488, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     2887.834  0.5283618  2250.335
##    2     2596.589  0.5918783  1997.631
##    3     2373.364  0.6600780  1816.364
##    4     2265.918  0.6889796  1731.873
##    5     2164.105  0.7140326  1657.511
##    6     2099.697  0.7323984  1610.292
##    7     2053.399  0.7437871  1584.451
##    8     2059.134  0.7423178  1587.181
##    9     2120.405  0.7262376  1600.790
##   10     2093.225  0.7346601  1573.285
##   11     2073.188  0.7402819  1577.003
##   12     2052.706  0.7454172  1561.297
##   13     2048.839  0.7466662  1561.308
##   14     2055.490  0.7449685  1568.859
##   15     2056.691  0.7446181  1568.809
##   16     2057.302  0.7444717  1569.219
##   17     2057.446  0.7444756  1569.828
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 6.
summary(regfit.fwd)
## Subset selection object
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 6
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         "*"    " "      
## 2  ( 1 ) " "         "*"    " "      
## 3  ( 1 ) " "         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    " "      
## 6  ( 1 ) "*"         "*"    "*"
coef(regfit.fwd$finalModel, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3769.0587788  2748.6944010     0.8999634    38.5143460    44.4889713 
##        Expend     Grad.Rate 
##     0.2543900    31.2043096
par(mfrow=c(1,3))
plot(summary(regfit.fwd)$cp) # best: 6 variable model
x = which.min(summary(regfit.fwd)$cp)
y = summary(regfit.fwd)$cp[x]
points(x,y, col='red', cex = 2.5, pch = 20)
x
## [1] 6
plot(summary(regfit.fwd)$bic) # best: 6 variable model
x = which.min(summary(regfit.fwd)$bic)
y = summary(regfit.fwd)$bic[x]
points(x,y, col='red', cex = 2.5, pch = 20)
x
## [1] 6
plot(summary(regfit.fwd)$adjr2) # best: 6 variable model
x = which.max(summary(regfit.fwd)$adjr2)
y = summary(regfit.fwd)$adjr2[x]
points(x,y, col='red', cex = 2.5, pch = 20)

x
## [1] 6

Q10(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
A10(b) When fitting GAM along with a smoothing spline on the training data PhD and Expend clearly have a non-linear relationship.

set.seed(2)
library(gam)
gam.fit = gam(Outstate ~ Private +  s(Room.Board) +  s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = College, subset=train)
summary(gam.fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend) + s(Grad.Rate), data = College, subset = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6741.75 -1044.33    37.93  1159.33  7139.55 
## 
## (Dispersion Parameter for gaussian family taken to be 3253406)
## 
##     Null Deviance: 8614032615 on 542 degrees of freedom
## Residual Deviance: 1695024814 on 521 degrees of freedom
## AIC: 9706.91 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 1810946422 1810946422 556.631 < 2.2e-16 ***
## s(Room.Board)    1 1703398779 1703398779 523.574 < 2.2e-16 ***
## s(PhD)           1  667322951  667322951 205.115 < 2.2e-16 ***
## s(perc.alumni)   1  343712478  343712478 105.647 < 2.2e-16 ***
## s(Expend)        1  802314787  802314787 246.608 < 2.2e-16 ***
## s(Grad.Rate)     1  112120479  112120479  34.462 7.742e-09 ***
## Residuals      521 1695024814    3253406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F     Pr(F)    
## (Intercept)                                
## Private                                    
## s(Room.Board)        3  3.629  0.012948 *  
## s(PhD)               3  3.846  0.009642 ** 
## s(perc.alumni)       3  1.027  0.380413    
## s(Expend)            3 36.475 < 2.2e-16 ***
## s(Grad.Rate)         3  2.797  0.039645 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(2, 3))
plot(gam.fit, se = TRUE, col = "blue")

Q10(c) Evaluate the model obtained on the test set, and explain the results obtained.
A10(c) The r-squared for this model is .7513565 and the MSE is 4104102

set.seed(1)
preds = predict(gam.fit, newdata=College[test, ])
RSS = sum((College[test, ]$Outstate - preds)^2) 
TSS = sum((College[test, ]$Outstate - mean(College[test, ]$Outstate)) ^ 2)
1 - (RSS / TSS)  
## [1] 0.7513565
MSE = mean((preds- College[test, ]$Outstate)^2)

Q10(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
A10(d) According to the summary, any significant p-value under the Nonparametric Effects corresponds to evidence of a non-linear relationship. Based of significant p-values under < .05 Expend, PhD, Room.Board, and Grad.Rate all have significant, non-parametric relationships with Wage.

summary(gam.fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend) + s(Grad.Rate), data = College, subset = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6741.75 -1044.33    37.93  1159.33  7139.55 
## 
## (Dispersion Parameter for gaussian family taken to be 3253406)
## 
##     Null Deviance: 8614032615 on 542 degrees of freedom
## Residual Deviance: 1695024814 on 521 degrees of freedom
## AIC: 9706.91 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 1810946422 1810946422 556.631 < 2.2e-16 ***
## s(Room.Board)    1 1703398779 1703398779 523.574 < 2.2e-16 ***
## s(PhD)           1  667322951  667322951 205.115 < 2.2e-16 ***
## s(perc.alumni)   1  343712478  343712478 105.647 < 2.2e-16 ***
## s(Expend)        1  802314787  802314787 246.608 < 2.2e-16 ***
## s(Grad.Rate)     1  112120479  112120479  34.462 7.742e-09 ***
## Residuals      521 1695024814    3253406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F     Pr(F)    
## (Intercept)                                
## Private                                    
## s(Room.Board)        3  3.629  0.012948 *  
## s(PhD)               3  3.846  0.009642 ** 
## s(perc.alumni)       3  1.027  0.380413    
## s(Expend)            3 36.475 < 2.2e-16 ***
## s(Grad.Rate)         3  2.797  0.039645 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1