Problem 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR2)
library(boot)
library(pander)
library(caret)
library(gam)
library(leaps)
attach(Wage)
set.seed(1)
degree <- 10
cv.errs <- rep(NA, degree)
for (i in 1:degree){
  fit <- glm(wage~poly(age,i),data=Wage)
  cv.errs[i] <- cv.glm(Wage,fit,K=degree)$delta[1]
}
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

The lowest test MSE is at degree 9 but the MSE at degree 4, 5, and 7 are all very similar. The ANOVA model should provide some comparison for which degree would be best.

# ANOVA model
set.seed(1)

fit1=lm(wage~poly(age,1),data=Wage)
fit2=lm(wage~poly(age,2),data=Wage)
fit3=lm(wage~poly(age,3),data=Wage)
fit4=lm(wage~poly(age,4),data=Wage)
fit5=lm(wage~poly(age,5),data=Wage)
fit6=lm(wage~poly(age,6),data=Wage)

pander(anova(fit1, fit2, fit3, fit4, fit5, fit6))
Analysis of Variance Table
Res.Df RSS Df Sum of Sq F Pr(>F)
2998 5022216 NA NA NA NA
2997 4793430 1 228786 143.7 2.29e-32
2996 4777674 1 15756 9.894 0.001675
2995 4771604 1 6070 3.812 0.05099
2994 4770322 1 1283 0.8054 0.3696
2993 4766389 1 3932 2.469 0.1162
plot(wage ~ age, data = Wage, col = "darkgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

The ANOVA suggests that a degree of 4 provides the optimal model for this data. Although the lowest MSE is at a degree of 9, the MSE is low enough at a degree of 4, so that would be the best choice for both the polynomial and ANOVA models.

(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(1)
cv.errs <- rep(NA, degree)
for (i in 2:degree) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

The optimal number of cuts for this model is 8.

plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

detach(Wage)
attach(College)

Problem 10

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

set.seed(1)

in_train <- sample(nrow(College) * 0.75)
train <- College[in_train, ]
test <- College[-in_train, ]
library(leaps)
fit <- regsubsets(Outstate ~ ., data = train, method = 'forward')
fit.summary <- summary(fit)
fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 ) " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         "*"    " "      
## 2  ( 1 ) " "         "*"    " "      
## 3  ( 1 ) " "         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    " "      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"
# Identify the top 6 variables
coef(fit, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3750.9573664  2932.0977865     0.8904835    35.2357778    45.7352685 
##        Expend     Grad.Rate 
##     0.2611519    30.7269995

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

gam.mod <- gam(Outstate ~ Private + s(Room.Board, 4) + s(PhD, 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), data = train)
par(mfrow = c(2,3))
plot(gam.mod, se = TRUE, col = 'darkred')

Looking at the plots, there are not a lot of linear relationships between the dependent variable Outstate and the independent variables. There is a slight linear relationship between Outstate and Room.Board as well as Outstate and perc.alumni.

(c) Evaluate the model obtained on the test set, and explain the results obtained.

gam.pred <-  predict(gam.mod, test)
gam.err <-  mean((test$Outstate - gam.pred)^2)
gam.tss <-  mean((test$Outstate - mean(test$Outstate))^2)
gam.r2 <-  1 - gam.err / gam.tss

gam.r2
## [1] 0.7588494

The r2 for the model obtained on the test set is 75.88%. This means that just under 76% of the variance in Outstate can be explained by the variables Private, Room.Board, PhD, perc.alumni, expend, and Grad.Rate.

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.mod)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 4) + s(PhD, 
##     4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), 
##     data = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6855.38 -1064.53    16.12  1242.89  7131.43 
## 
## (Dispersion Parameter for gaussian family taken to be 3279494)
## 
##     Null Deviance: 9261203461 on 581 degrees of freedom
## Residual Deviance: 1836516009 on 559.9998 degrees of freedom
## AIC: 10407.08 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 2167242895 2167242895 660.847 < 2.2e-16 ***
## s(Room.Board, 4)    1 1783298982 1783298982 543.773 < 2.2e-16 ***
## s(PhD, 4)           1  660457576  660457576 201.390 < 2.2e-16 ***
## s(perc.alumni, 4)   1  382401587  382401587 116.604 < 2.2e-16 ***
## s(Expend, 4)        1  852532186  852532186 259.959 < 2.2e-16 ***
## s(Grad.Rate, 4)     1  119779828  119779828  36.524 2.754e-09 ***
## Residuals         560 1836516009    3279494                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar F   Pr(F)    
## (Intercept)                                 
## Private                                     
## s(Room.Board, 4)        3  3.121 0.02564 *  
## s(PhD, 4)               3  2.947 0.03235 *  
## s(perc.alumni, 4)       3  1.256 0.28865    
## s(Expend, 4)            3 39.022 < 2e-16 ***
## s(Grad.Rate, 4)         3  3.080 0.02708 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the ANOVA for Nonparametric Effects, there is a significant non-linear relationship between the dependent variable Outstate and the independent variables Room.Board, PhD, Expend, and Grad.Rate.