In this exercise, you will further analyze the Wage data set considered throughout this chapter.
a. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
attach(Wage)
fit <- lm(wage ~ poly(age, 3, raw = TRUE), data = Wage)
coef(summary(fit))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.524391e+01 2.218373e+01 -3.391851 7.032236e-04
## poly(age, 3, raw = TRUE)1 1.018999e+01 1.605228e+00 6.348002 2.511372e-10
## poly(age, 3, raw = TRUE)2 -1.680286e-01 3.686021e-02 -4.558535 5.357433e-06
## poly(age, 3, raw = TRUE)3 8.494522e-04 2.702449e-04 3.143268 1.687063e-03
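The prompt asks for cross-validation to pick \(d\); here is a minimal sketch using cv.glm() from the boot package (10-fold CV and the seed value are assumptions, and the selected degree can vary slightly with both):
library(boot)
set.seed(1)
cv.error <- rep(NA, 10)
for (d in 1:10) {
  # glm() with the default gaussian family fits the same model as lm(),
  # which lets cv.glm() run K-fold cross-validation on it
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  cv.error[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
which.min(cv.error)  # degree minimizing the estimated test MSE
plot(1:10, cv.error, type = "b", xlab = "Degree", ylab = "CV estimate of test MSE")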
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The optimal degree chosen by cross-validation is \(d = 3\). The ANOVA agrees: the cubic term is clearly significant (p ≈ 0.0017), the quartic term is only borderline (p ≈ 0.051), and the quintic term adds nothing (p ≈ 0.37), so a cubic polynomial is a sensible fit.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(
  preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit
)
plot(age, wage, xlim=agelims, cex=.5, col='darkgrey')
title('Degree-3 Polynomial')
lines(age.grid, preds$fit, lwd=2, col='red')
matlines(age.grid, se.bands, lwd=1, col='black', lty=3)
b. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
library(boot)
crossval <- rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  crossval[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
plot(
  2:10, crossval[-1], xlab = "Cuts", ylab = "Test MSE",
  type = "l"
)
d.min <- which.min(crossval)
points(d.min, crossval[d.min], col = "red", cex = 2, pch = 20)
Cross-validation suggests that 8 cuts is the optimal number for the step function.
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))
plot(wage ~ age, data = Wage, col = 'darkgrey')
title('Step Function: 8 Cuts')
lines(age.grid, preds, col = 'red', lwd = 2)
This question relates to the College data set.
a. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(leaps)
set.seed(1)
attach(College)
train <- sample(length(Outstate), length(Outstate) / 2)
College.train <- College[train, ]
College.test <- College[-train, ]
fit <- regsubsets(
Outstate ~ ., data = College.train, nvmax = 17,
method = "forward"
)
fit.summary <- summary(fit)
par(mfrow = c(1, 3))
plot(
fit.summary$cp, xlab = "Number of variables", ylab = "Cp",
type = "l"
)
min.cp <- min(fit.summary$cp)
std.cp <- sd(fit.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(
fit.summary$bic, xlab = "Number of variables", ylab = "BIC",
type='l'
)
min.bic <- min(fit.summary$bic)
std.bic <- sd(fit.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(
fit.summary$adjr2, xlab = "Number of variables",
ylab = "Adjusted R2", type = "l", ylim = c(0.4, 0.84)
)
max.adjr2 <- max(fit.summary$adjr2)
std.adjr2 <- sd(fit.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)
Judging by the penalized metrics (Cp, BIC, and adjusted \(R^2\) all level off with little further improvement beyond six variables), a model with 6 predictors offers a good balance of accuracy and simplicity.
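The band plots can also be read programmatically. A minimal sketch of one such rule (an assumption, not part of the original): take the smallest model whose score falls inside the 0.2-standard-deviation band around the best score, mirroring the dashed lines drawn above.
min(which(fit.summary$cp < min.cp + 0.2 * std.cp))           # smallest size within the Cp band
min(which(fit.summary$bic < min.bic + 0.2 * std.bic))        # smallest size within the BIC band
min(which(fit.summary$adjr2 > max.adjr2 - 0.2 * std.adjr2))  # smallest size within the adjusted R2 band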
b. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
coefs <- coef(fit, id = 6)
names(coefs)
## [1] "(Intercept)" "PrivateYes" "Room.Board" "Terminal" "perc.alumni"
## [6] "Expend" "Grad.Rate"
Note that forward selection chose Terminal, while the GAM below uses the closely related PhD (both measure the share of faculty with advanced degrees) in its place.
fit <- gam(
  Outstate ~ Private + s(Room.Board, 4) + s(PhD, 4) +
    s(perc.alumni, 4) + s(Expend, 5) + s(Grad.Rate, 4),
  data = College.train
)
par(mfrow = c(2, 3))
plot(fit, se = TRUE, col = "red")
The plots suggest non-linearity in several of the variables, most strongly Expend. The trend is less pronounced for perc.alumni, whose relationship with Outstate may well be linear.
c. Evaluate the model obtained on the test set, and explain the results obtained.
preds <- predict(fit, College.test)
rss <- mean((College.test$Outstate - preds)^2)  # test MSE
tss <- mean(
  (College.test$Outstate - mean(College.test$Outstate))^2
)  # total variance of the response
1 - rss / tss  # test R^2
## [1] 0.7676881
The test-set \(R^2\) is about 0.77, meaning roughly 77% of the variation in out-of-state tuition is explained by the six chosen predictors. That is reasonably high, so the GAM generalizes well to unseen data.
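As a point of comparison (an extension, not required by the exercise), one might fit a plain linear model on the same six predictors and compute its test \(R^2\) the same way; a noticeably lower value would indicate that the GAM's smooth terms add real predictive power. The names lm.fit, lm.preds, and lm.rss below are new to this sketch.
lm.fit <- lm(
  Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate,
  data = College.train
)
lm.preds <- predict(lm.fit, College.test)
lm.rss <- mean((College.test$Outstate - lm.preds)^2)  # linear model's test MSE
1 - lm.rss / tss  # linear model's test R^2, to compare with the GAM's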
d. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 4) + s(PhD,
## 4) + s(perc.alumni, 4) + s(Expend, 5) + s(Grad.Rate, 4),
## data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7337.44 -1087.21 -65.61 1227.83 7294.53
##
## (Dispersion Parameter for gaussian family taken to be 3705721)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1352586452 on 364.9995 degrees of freedom
## AIC: 6994.038
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1780598219 1780598219 480.500 < 2.2e-16 ***
## s(Room.Board, 4) 1 1573559093 1573559093 424.630 < 2.2e-16 ***
## s(PhD, 4) 1 327060211 327060211 88.258 < 2.2e-16 ***
## s(perc.alumni, 4) 1 329334364 329334364 88.872 < 2.2e-16 ***
## s(Expend, 5) 1 528926955 528926955 142.732 < 2.2e-16 ***
## s(Grad.Rate, 4) 1 87976099 87976099 23.741 1.646e-06 ***
## Residuals 365 1352586452 3705721
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 4) 3 2.0107 0.1120
## s(PhD, 4) 3 0.7920 0.4989
## s(perc.alumni, 4) 3 0.3604 0.7816
## s(Expend, 5) 4 19.5536 1.354e-14 ***
## s(Grad.Rate, 4) 3 0.9858 0.3995
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA for nonparametric effects shows strong evidence of a non-linear relationship between Outstate and Expend (p < 0.001). None of the other smooth terms approach significance, so there is no clear evidence of non-linearity for the remaining variables.