In this exercise, you will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR2)
library(boot)
library(pander)
library(caret)
library(gam)
library(leaps)
attach(Wage)
set.seed(1)
degree <- 10
cv.errs <- rep(NA, degree)
for (i in 1:degree){
fit <- glm(wage~poly(age,i),data=Wage)
cv.errs[i] <- cv.glm(Wage,fit,K=degree)$delta[1]
}
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)
The lowest test MSE is at degree 9 but the MSE at degree 4, 5, and 7 are all very similar. The ANOVA model should provide some comparison for which degree would be best.
# ANOVA model
set.seed(1)
fit1=lm(wage~poly(age,1),data=Wage)
fit2=lm(wage~poly(age,2),data=Wage)
fit3=lm(wage~poly(age,3),data=Wage)
fit4=lm(wage~poly(age,4),data=Wage)
fit5=lm(wage~poly(age,5),data=Wage)
fit6=lm(wage~poly(age,6),data=Wage)
pander(anova(fit1, fit2, fit3, fit4, fit5, fit6))
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 2998 | 5022216 | NA | NA | NA | NA |
| 2997 | 4793430 | 1 | 228786 | 143.7 | 2.29e-32 |
| 2996 | 4777674 | 1 | 15756 | 9.894 | 0.001675 |
| 2995 | 4771604 | 1 | 6070 | 3.812 | 0.05099 |
| 2994 | 4770322 | 1 | 1283 | 0.8054 | 0.3696 |
| 2993 | 4766389 | 1 | 3932 | 2.469 | 0.1162 |
plot(wage ~ age, data = Wage, col = "darkgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
The ANOVA suggests that a degree of 4 provides the optimal model for this data. Although the lowest MSE is at a degree of 9, the MSE is low enough at a degree of 4, so that would be the best choice for both the polynomial and ANOVA models.
(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.
set.seed(1)
cv.errs <- rep(NA, degree)
for (i in 2:degree) {
Wage$age.cut <- cut(Wage$age, i)
fit <- glm(wage ~ age.cut, data = Wage)
cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)
The optimal number of cuts for this model is 8.
plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
detach(Wage)
attach(College)
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
set.seed(1)
in_train <- sample(nrow(College) * 0.75)
train <- College[in_train, ]
test <- College[-in_train, ]
library(leaps)
fit <- regsubsets(Outstate ~ ., data = train, method = 'forward')
fit.summary <- summary(fit)
fit.summary
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " " " " " " "
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " "
## 8 ( 1 ) " " "*" " " "*" "*" "*" " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
# Identify the top 6 variables
coef(fit, id = 6)
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3750.9573664 2932.0977865 0.8904835 35.2357778 45.7352685
## Expend Grad.Rate
## 0.2611519 30.7269995
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
gam.mod <- gam(Outstate ~ Private + s(Room.Board, 4) + s(PhD, 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), data = train)
par(mfrow = c(2,3))
plot(gam.mod, se = TRUE, col = 'darkred')
Looking at the plots, there are not a lot of linear relationships
between the dependent variable Outstate and the independent
variables. There is a slight linear relationship between
Outstate and Room.Board as well as
Outstate and perc.alumni.
(c) Evaluate the model obtained on the test set, and explain the results obtained.
gam.pred <- predict(gam.mod, test)
gam.err <- mean((test$Outstate - gam.pred)^2)
gam.tss <- mean((test$Outstate - mean(test$Outstate))^2)
gam.r2 <- 1 - gam.err / gam.tss
gam.r2
## [1] 0.7588494
The r2 for the model obtained on the test set is 75.88%. This means
that just under 76% of the variance in Outstate can be
explained by the variables Private,
Room.Board, PhD, perc.alumni,
expend, and Grad.Rate.
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.mod)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 4) + s(PhD,
## 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
## data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6855.38 -1064.53 16.12 1242.89 7131.43
##
## (Dispersion Parameter for gaussian family taken to be 3279494)
##
## Null Deviance: 9261203461 on 581 degrees of freedom
## Residual Deviance: 1836516009 on 559.9998 degrees of freedom
## AIC: 10407.08
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2167242895 2167242895 660.847 < 2.2e-16 ***
## s(Room.Board, 4) 1 1783298982 1783298982 543.773 < 2.2e-16 ***
## s(PhD, 4) 1 660457576 660457576 201.390 < 2.2e-16 ***
## s(perc.alumni, 4) 1 382401587 382401587 116.604 < 2.2e-16 ***
## s(Expend, 4) 1 852532186 852532186 259.959 < 2.2e-16 ***
## s(Grad.Rate, 4) 1 119779828 119779828 36.524 2.754e-09 ***
## Residuals 560 1836516009 3279494
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 4) 3 3.121 0.02564 *
## s(PhD, 4) 3 2.947 0.03235 *
## s(perc.alumni, 4) 3 1.256 0.28865
## s(Expend, 4) 3 39.022 < 2e-16 ***
## s(Grad.Rate, 4) 3 3.080 0.02708 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the ANOVA for Nonparametric Effects, there is a
significant non-linear relationship between the dependent variable
Outstate and the independent variables
Room.Board, PhD, Expend, and
Grad.Rate.