In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(MASS)
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.0.5
library(boot)
attach(Wage)
str(Wage)
## 'data.frame':    3000 obs. of  11 variables:
##  $ year      : int  2006 2004 2003 2003 2005 2008 2009 2008 2006 2004 ...
##  $ age       : int  18 24 45 43 50 54 44 30 41 52 ...
##  $ maritl    : Factor w/ 5 levels "1. Never Married",..: 1 1 2 2 4 2 2 1 1 2 ...
##  $ race      : Factor w/ 4 levels "1. White","2. Black",..: 1 1 1 3 1 1 4 3 2 1 ...
##  $ education : Factor w/ 5 levels "1. < HS Grad",..: 1 4 3 4 2 4 3 3 3 2 ...
##  $ region    : Factor w/ 9 levels "1. New England",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ jobclass  : Factor w/ 2 levels "1. Industrial",..: 1 2 1 2 2 2 1 2 2 2 ...
##  $ health    : Factor w/ 2 levels "1. <=Good","2. >=Very Good": 1 2 1 2 1 2 2 1 2 2 ...
##  $ health_ins: Factor w/ 2 levels "1. Yes","2. No": 2 2 1 1 1 1 1 1 1 1 ...
##  $ logwage   : num  4.32 4.26 4.88 5.04 4.32 ...
##  $ wage      : num  75 70.5 131 154.7 75 ...
set.seed(1)
delta <- rep(NA,10)
for (i in 1:10){
  fit <- glm(wage ~ poly(age, i), data = Wage)
  delta[i] <- cv.glm(Wage, fit, K=10)$delta[1]
}
plot(1:10, delta, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(delta)
points(d.min, delta[d.min], col = "red", cex = 2, pch = 20)

From the plot above, cross-validation chooses degree 9 as the optimal polynomial degree: the test MSE reaches its minimum there. However, the test MSE at degree 4 is already nearly as small. The ANOVA comparison (anova(fit.1, fit.2, fit.3, fit.4, fit.5), page 290, Section 7.8.1) likewise suggests that degree 4 is sufficient.
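To make "small enough" concrete, we can check which degrees come within a small tolerance of the minimum CV error; a minimal sketch, using a hypothetical 1% tolerance:

round(delta - min(delta), 2)        # CV error relative to the best degree
which(delta <= min(delta) * 1.01)   # degrees within 1% of the minimum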

set.seed(1)
fit.1 = lm(wage~poly(age,1), data = Wage)
fit.2 = lm(wage~poly(age,2), data = Wage)
fit.3 = lm(wage~poly(age,3), data = Wage)
fit.4 = lm(wage~poly(age,4), data = Wage)
fit.5 = lm(wage~poly(age,5), data = Wage)
fit.6 = lm(wage~poly(age,6), data = Wage)
fit.7 = lm(wage~poly(age,7), data = Wage)
fit.8 = lm(wage~poly(age,8), data = Wage)
fit.9 = lm(wage~poly(age,9), data = Wage)
fit.10 = lm(wage~poly(age,10), data = Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5,fit.6,fit.7,fit.8,fit.9,fit.10)
## Analysis of Variance Table
## 
## Model  1: wage ~ poly(age, 1)
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the ANOVA results, the degree-2, degree-3, and degree-9 terms have significant p-values, while the degree-4 term is only marginal (p ≈ 0.051). Hence a cubic polynomial could be a good fit, with degree 9 as a more flexible alternative.
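Because poly() generates orthogonal polynomials, the same tests can be read from the coefficient t-statistics of the largest fit: their squares equal the F-statistics above (Section 7.8.1). A quick check:

round(coef(summary(fit.10)), 4)  # squared t-statistics match the ANOVA F-statistics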

Let us plot the data using the 3rd order polynomial first.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

Now let’s plot the data using the 9th order polynomial.

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ poly(age, 9), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
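As in Section 7.8.1, approximate 95% confidence bands can be overlaid by requesting standard errors from predict(); a short sketch for the degree-9 fit:

preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
matlines(age.grid, se.bands, col = "red", lwd = 1, lty = 3)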

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(1)
cvs <- rep(NA, 10)  # cvs[1] stays NA: a single interval has no cut points
for (i in 2:10) {
    Wage$age.cut <- cut(Wage$age, i)
    fit <- glm(wage ~ age.cut, data = Wage)
    cvs[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "Cuts", ylab = "CV Error", type = "l")
d.min <- which.min(cvs)  # which.min() ignores the NA entry
points(d.min, cvs[d.min], col = "red", cex = 2, pch = 20)

The plot shows that the optimal number of cuts is 8.

Fit and plot the step function with 8 cuts:

plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- lm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

#### Understand the cut() function:

res <- cut(c(1,5,2,3,8), 2)
res
## [1] (0.993,4.5] (4.5,8.01]  (0.993,4.5] (0.993,4.5] (4.5,8.01] 
## Levels: (0.993,4.5] (4.5,8.01]
length(res)
## [1] 5
class(res[1])
## [1] "factor"

cut(x, k) bins a continuous quantitative variable into a discrete qualitative one by dividing the range of x evenly into k intervals. Each interval becomes a factor level. The output of cut(x, k) is a factor vector of the same length as x, where each element is the level (interval) into which the corresponding input element falls.
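cut() can also take explicit breakpoints instead of a number of intervals, which is useful when the bin boundaries matter:

cut(c(1, 5, 2, 3, 8), breaks = c(0, 2, 4, 10))
## [1] (0,2]  (4,10] (0,2]  (2,4]  (4,10]
## Levels: (0,2] (2,4] (4,10]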

Exercise 10

This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

detach(Wage)
attach(College)
library(ISLR)
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.5
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
train <- sample(1:nrow(College), nrow(College) / 2)
test <- -train
fit <- regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.summary <- summary(fit)
fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  "*"    " "    " "       " "       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 ) " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 ) " "         "*"        " "   "*"      "*" " "      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         "*"    " "      
## 2  ( 1 ) " "         "*"    " "      
## 3  ( 1 ) " "         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    " "      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"
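Before settling on a subset size, it helps to check where the usual criteria level off as variables are added; a minimal sketch using fit.summary (summary.regsubsets returns cp, bic, and adjr2 components):

par(mfrow = c(1, 3))
plot(fit.summary$cp, xlab = "Number of variables", ylab = "Cp", type = "l")
plot(fit.summary$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
plot(fit.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R2", type = "l")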

List the six selected predictors and their coefficients:

coef(fit, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3191.6057181  2950.2717035     0.8732379    37.0498937    51.0144153 
##        Expend     Grad.Rate 
##     0.1848395    29.9264164

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

Based on the shapes of the fitted curves below, Expend and Grad.Rate have strongly non-linear relationships with Outstate.

library(foreach)
## Warning: package 'foreach' was built under R version 4.0.5
library(splines)
library(gam)
## Warning: package 'gam' was built under R version 4.0.5
## Loaded gam 1.20
gam.model <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
par(mfrow = c(2,3))
plot(gam.model, se = TRUE, col = 'blue')

(c) Evaluate the model obtained on the test set, and explain the results obtained.

The R2 statistic on the test set is about 0.78, i.e., the GAM explains roughly 78% of the variance in out-of-state tuition for the held-out colleges.

preds <- predict(gam.model, College[test, ])
RSS <- sum((College[test, ]$Outstate - preds)^2)  # residual sum of squares, equation (3.16)
TSS <- sum((College[test, ]$Outstate - mean(College[test, ]$Outstate))^2)  # total sum of squares
1 - (RSS / TSS)  # R2 = 1 - RSS/TSS
## [1] 0.7834314
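For reference, an ordinary linear model on the same six predictors gives a baseline test R2; a sketch (not part of the original output), reusing TSS from above:

lm.fit <- lm(Outstate ~ Private + Room.Board + Terminal + perc.alumni + Expend + Grad.Rate, data = College, subset = train)
lm.preds <- predict(lm.fit, College[test, ])
1 - sum((College[test, ]$Outstate - lm.preds)^2) / TSS  # baseline test R2 for the linear fit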

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.model)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 
##     5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
##     data = College, subset = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7159.85 -1162.81    47.96  1095.18  7648.37 
## 
## (Dispersion Parameter for gaussian family taken to be 3334500)
## 
##     Null Deviance: 5860806972 on 387 degrees of freedom
## Residual Deviance: 1203754008 on 360.9998 degrees of freedom
## AIC: 6956.806 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 1624052654 1624052654 487.045 < 2.2e-16 ***
## s(Room.Board, 5)    1 1068282762 1068282762 320.373 < 2.2e-16 ***
## s(Terminal, 5)      1  345093780  345093780 103.492 < 2.2e-16 ***
## s(perc.alumni, 5)   1  280343396  280343396  84.074 < 2.2e-16 ***
## s(Expend, 5)        1  507867707  507867707 152.307 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   69135717   69135717  20.733 7.227e-06 ***
## Residuals         361 1203754008    3334500                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Room.Board, 5)        4  1.5341    0.1917    
## s(Terminal, 5)          4  1.0288    0.3922    
## s(perc.alumni, 5)       4  1.8927    0.1111    
## s(Expend, 5)            4 16.7207 1.387e-12 ***
## s(Grad.Rate, 5)         4  0.7858    0.5350    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Examining the ANOVA results for the GAM: the ANOVA for parametric effects shows that every predictor is significant, so each has at least a strong linear component. The ANOVA for nonparametric effects shows that only Expend has a significant p-value; hence there is clear evidence of a non-linear relationship between Expend and Outstate, consistent with the shape of its fitted curve in (b). For the remaining smooth terms (Room.Board, Terminal, perc.alumni, Grad.Rate) the tests find no significant non-linearity, so linear relationships appear adequate; although Grad.Rate looked somewhat non-linear in the plots from (b), the formal test does not support that impression.
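A more direct test of the Expend non-linearity is to compare the fitted GAM against one where Expend enters linearly, mirroring the anova() comparisons of Section 7.8.3; a sketch:

gam.lin <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + Expend + s(Grad.Rate, 5), data = College, subset = train)
anova(gam.lin, gam.model, test = "F")  # a significant F favors the smooth term for Expend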