Question 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

  1. Perform polynomial regression to predict Wage using Age. Use cross-validation to select the optimal degree for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
library(caret)
library(dplyr)
attach(Wage)
set.seed(1)
RMSE <- c()
# 10-fold CV error for polynomial fits of degree 1 through 10
for(i in 1:10){
  Wage$age.degree <- poly(Wage$age, i)
  lm.fit <- train(wage ~ age.degree, data = Wage, method = "lm", 
                  metric = "RMSE", trControl = trainControl(method = "cv", number = 10))
  RMSE[i] <- lm.fit$results$RMSE
}

data.frame(degree = 1:10, RMSE = RMSE) %>% 
  ggplot(aes(x = degree, y = RMSE)) + 
  geom_line(size = 1, color = "#1c9099") +
  geom_point(color = "#1c9099") +
  scale_x_continuous(breaks = 1:10, minor_breaks = NULL) +
  labs(x = "\n Degree", y = "Cross-Validation RMSE\n")
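
The minimizing degree can also be read off programmatically from the vector of cross-validation errors computed above:

# Degree with the lowest cross-validated RMSE
which.min(RMSE)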

A polynomial of degree 6 gives the lowest error in the cross-validation plot. However, the ANOVA results below show that the quartic term is only marginally significant (p ≈ 0.051) and the higher-order terms are not significant, so a degree-3 polynomial is sufficient. Below is a plot of the cubic polynomial fit of Age used to predict Wage.

fit.1 <- lm(wage ~ poly(age, 1, raw = TRUE), data = Wage)
fit.2 <- lm(wage ~ poly(age, 2, raw = TRUE), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4, raw = TRUE), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5, raw = TRUE), data = Wage)
fit.6 <- lm(wage ~ poly(age, 6, raw = TRUE), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6)
## Analysis of Variance Table
## 
## Model 1: wage ~ poly(age, 1, raw = TRUE)
## Model 2: wage ~ poly(age, 2, raw = TRUE)
## Model 3: wage ~ poly(age, 3, raw = TRUE)
## Model 4: wage ~ poly(age, 4, raw = TRUE)
## Model 5: wage ~ poly(age, 5, raw = TRUE)
## Model 6: wage ~ poly(age, 6, raw = TRUE)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.6636 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8936  0.001675 ** 
## 4   2995 4771604  1      6070   3.8117  0.050989 .  
## 5   2994 4770322  1      1283   0.8054  0.369565    
## 6   2993 4766389  1      3932   2.4692  0.116201    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(Wage, aes(age, wage)) +
  geom_point(alpha = 0.5) +
  stat_smooth(method = lm, formula = y ~ poly(x, 3, raw = TRUE)) +
  labs(title = "Predicting Wage with a Cubic Polynomial Function of Age",
    x = "Age", y = "Wage")

  2. Fit a step function to predict Wage using Age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
RMSE <- c()
# 10-fold CV error for step functions of age with 2 through 10 intervals
for(i in 2:10){
  Wage$age.cut <- cut(Wage$age, i)
  lm.fit <- train(wage ~ age.cut, data = Wage, method = "lm", 
                  metric = "RMSE", trControl = trainControl(method = "cv", number = 10))
  RMSE[i] <- lm.fit$results$RMSE
}

data.frame(cuts = 2:10, RMSE = RMSE[2:10]) %>% 
  ggplot(aes(x = cuts, y = RMSE)) + 
  geom_line(size = 1, color = "#1c9099") +
  geom_point(color = "#1c9099") +
  scale_x_continuous(breaks = 2:10, minor_breaks = NULL) +
  labs(x = "\n Number of Cuts", y = "Cross-Validation RMSE \n")

Using 10-fold cross-validation, cutting Age into 7 intervals gives the lowest CV error. Below is the fit of the resulting 7-step function of Age.

ggplot(Wage, aes(age, wage)) + 
  geom_point(alpha = 0.5) + 
  stat_smooth(method = lm, formula = y ~ cut(x, 7)) + 
  labs(title = "Predicting Wage with a 7-Step Function of Age", x = "Age", y = "Wage")

detach(Wage)

Question 10

This question relates to the College data set.

  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(scales)
library(knitr)
library(kableExtra)
attach(College)
set.seed(2)
train <- College$Outstate %>%
  createDataPartition(p = 0.6, list = FALSE)
college.train <- College[train, ]
college.test <- College[-train, ]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
lm.forward <- train(Outstate ~., data = college.train, 
                    method = "leapForward", metric = "RMSE",
                    tuneGrid = data.frame(nvmax = 1:17), 
                    trControl = ctrl)
plot(lm.forward)

A subset of six variables minimizes the cross-validation error: Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate.

best.nvmax <- lm.forward$bestTune$nvmax
coef(lm.forward$finalModel, best.nvmax)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3426.1279395  2625.0467512     1.0544391    31.6186696    62.2164054 
##        Expend     Grad.Rate 
##     0.1835718    27.5736003
  2. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
gam.fit <- train(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate, 
                 data = college.train, method = "gamSpline", metric = "RMSE",
                 trControl = ctrl, tuneGrid = data.frame(df = seq(2, 10, by = 1)))
plot(gam.fit)

Based on the cross-validation plot, smoothing splines with three degrees of freedom give the lowest CV error. The GAM plots below suggest non-linearity in PhD, Grad.Rate, and Expend; we confirm this with ANOVA tests for nonparametric effects in the final part.

par(mfrow = c(2,3))
plot.Gam(gam.fit$finalModel, se = TRUE, col = "#1c9099")
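
For reference, caret's "gamSpline" method fits models from the gam package; a roughly equivalent direct fit, assuming the CV-selected df = 3 for every smooth term, is sketched below:

gam.direct <- gam(Outstate ~ Private + s(Room.Board, df = 3) + s(PhD, df = 3) +
                    s(perc.alumni, df = 3) + s(Expend, df = 3) + s(Grad.Rate, df = 3),
                  data = college.train)
par(mfrow = c(2, 3))
plot(gam.direct, se = TRUE, col = "#1c9099")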

  3. Evaluate the model obtained on the test set, and explain the results obtained.

    The Generalized Additive Model (GAM) achieves a lower test RMSE and a higher test \(R^2\) than the linear model, indicating that the smoothing splines provide a better fit for predicting out-of-state tuition.

gam.preds <- gam.fit %>% 
  predict(college.test)
gam.rmse <- RMSE(college.test$Outstate, gam.preds)
gam.r2 <- R2(college.test$Outstate, gam.preds)

lm.preds <- lm.forward %>% 
  predict(college.test)
lm.rmse <- RMSE(college.test$Outstate, lm.preds)
lm.r2 <- R2(college.test$Outstate, lm.preds)

data.frame(Model = c("GAM", "Linear Model"),
  RMSE = c(gam.rmse, lm.rmse),
  RSquared = percent(c(gam.r2, lm.r2), accuracy = 0.1)) %>% 
  arrange(RMSE) %>% 
  kable(format.args = list(big.mark = ","), align = c("l", "c", "c"), digits = 0) %>% 
  kable_styling(bootstrap_options = c("striped"))
| Model        | RMSE  | RSquared |
|:-------------|:-----:|:--------:|
| GAM          | 1,973 |  75.9%   |
| Linear Model | 2,138 |  71.6%   |
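
Based on the values in the table, the GAM reduces test RMSE by roughly 8% relative to the linear model; this can be computed directly from the objects above:

# Proportional reduction in test RMSE of the GAM relative to the linear model
(lm.rmse - gam.rmse) / lm.rmse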
  4. For which variables, if any, is there evidence of a non-linear relationship with the response?

    Based on the nonparametric ANOVA tests, there is strong evidence of a non-linear relationship between Outstate and Expend (p < 0.001); none of the other smooth terms shows a significant nonparametric effect.

summary.Gam(gam.fit$finalModel)$anova
## Anova for Nonparametric Effects
##                        Npar Df  Npar F   Pr(F)    
## (Intercept)                                       
## PrivateYes                                        
## s(perc.alumni, df = 3)       2  0.4687  0.6261    
## s(PhD, df = 3)               2  2.0337  0.1320    
## s(Grad.Rate, df = 3)         2  1.5558  0.2121    
## s(Room.Board, df = 3)        2  2.0234  0.1334    
## s(Expend, df = 3)            2 26.6981 1.1e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
detach(College)