This exercise further analyzes the Wage data set considered throughout this chapter.
library(ISLR2)
library(boot)
attach(Wage)
set.seed(1)  # for reproducible cross-validation folds
a) Perform polynomial regression to predict wage
using age. Use cross-validation to select the optimal
degree \(d\) for the polynomial. What
degree was chosen and how does this compare to the results of hypothesis
testing using ANOVA? Make a plot of the resulting polynomial fit to the
data.
# 10-fold CV error for polynomial degrees 1 through 10
all_deltas <- rep(NA, 10)
for (i in 1:10) {
  glm_fit <- glm(wage ~ poly(age, i), data = Wage)
  all_deltas[i] <- cv.glm(Wage, glm_fit, K = 10)$delta[2]  # bias-corrected CV estimate
}
plot(1:10, all_deltas, xlab = "Degree", ylab = "CV Error", type = "l", pch = 20, lwd = 2, ylim = c(1590, 1700))
min_points <- min(all_deltas)
sd_points <- sd(all_deltas)
abline(h = min_points + 0.2 * sd_points, col = "blue", lty = "dashed")
abline(h = min_points - 0.2 * sd_points, col = "blue", lty = "dashed")
legend("topright", "0.2 - standard deviation lines", lty = "dashed", col = "blue")
According to the CV plot with 0.2 standard deviation lines, d = 3 is the smallest degree that gives a reasonably small cross-validation error.
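As a quick check (a minimal sketch using only the objects computed above), the degree that minimizes the estimated CV error can also be printed directly; the exact minimizer varies with the random fold assignment, which is why the 0.2 standard deviation rule is used to choose the smallest adequate degree.
which.min(all_deltas)  # degree with the lowest CV error for this seed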
Next, find the best degree using ANOVA:
fit1 <- lm(wage ~ poly(age, 1), data = Wage)
fit2 <- lm(wage ~ poly(age, 2), data = Wage)
fit3 <- lm(wage ~ poly(age, 3), data = Wage)
fit4 <- lm(wage ~ poly(age, 4), data = Wage)
fit5 <- lm(wage ~ poly(age, 5), data = Wage)
fit6 <- lm(wage ~ poly(age, 6), data = Wage)
fit7 <- lm(wage ~ poly(age, 7), data = Wage)
fit8 <- lm(wage ~ poly(age, 8), data = Wage)
fit9 <- lm(wage ~ poly(age, 9), data = Wage)
fit10 <- lm(wage ~ poly(age, 10), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5, fit6, fit7, fit8, fit9, fit10)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.7638 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.9005 0.001669 **
## 4 2995 4771604 1 6070 3.8143 0.050909 .
## 5 2994 4770322 1 1283 0.8059 0.369398
## 6 2993 4766389 1 3932 2.4709 0.116074
## 7 2992 4763834 1 2555 1.6057 0.205199
## 8 2991 4763707 1 127 0.0796 0.777865
## 9 2990 4756703 1 7004 4.4014 0.035994 *
## 10 2989 4756701 1 3 0.0017 0.967529
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA shows that the quadratic and cubic terms are statistically significant at the 0.01 level, while no term above degree three is significant at that level (the degree-9 term is only marginally significant at the 0.05 level). This agrees with the cross-validation result that d = 3 is a reasonable choice.
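Because poly() generates orthogonal polynomial terms and age is the only predictor, the same sequential p-values can be read off the coefficient table of the largest fit: the squared t-statistics equal the F-statistics in the ANOVA table above. A quick check:
coef(summary(fit10))  # e.g. the p-value for the cubic term matches the Model 3 row, up to rounding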
plot(wage ~ age, data = Wage, col = "darkgrey")
agelims <- range(Wage$age)
age_grid <- seq(agelims[1], agelims[2])
lm_fit <- lm(wage ~ poly(age, 3), data = Wage)
lm_pred <- predict(lm_fit, data.frame(age = age_grid))
lines(age_grid, lm_pred, col = "blue", lwd = 2)
b) Fit a step function to predict wage using
age, and perform cross-validation to choose the optimal
number of cuts. Make a plot of the fit obtained.
# 10-fold CV error for step functions with 2 to 10 intervals of age
all_cvs <- rep(NA, 10)
for (i in 2:10) {
  Wage$age_cut <- cut(Wage$age, i)
  step_fit <- glm(wage ~ age_cut, data = Wage)
  all_cvs[i] <- cv.glm(Wage, step_fit, K = 10)$delta[2]
}
plot(2:10, all_cvs[-1], xlab = "Number of cuts", ylab = "CV error", type = "l", pch = 20, lwd = 2)
According to the CV error plot, cross-validation indicates that the lowest estimated test error occurs with 8 cuts.
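For reference, the eight intervals that cut() produces on age, together with the number of observations falling in each, can be inspected directly:
table(cut(Wage$age, 8))  # interval labels and counts for the chosen step function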
lm_fit2 <- glm(wage ~ cut(age, 8), data = Wage)
agelim <- range(Wage$age)
age_grid2 <- seq(agelim[1], agelim[2])
lm_pred2 <- predict(lm_fit2, data.frame(age = age_grid2))
plot(wage ~ age, data = Wage, col = "darkgrey")
lines(age_grid2, lm_pred2, col = "red", lwd = 2)
This question relates to the College data set.
a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
set.seed(1)
library(leaps)
attach(College)
train <- sample(length(Outstate), length(Outstate) / 2)
test <- -train
train_college <- College[train, ]
test_college <- College[test, ]
reg_fit <- regsubsets(Outstate ~., data = train_college, nvmax = 17, method = "forward")
reg_summary <- summary(reg_fit)
par(mfrow = c(1, 3))
plot(reg_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min_cp <- min(reg_summary$cp)
std_cp <- sd(reg_summary$cp)
abline(h = min_cp + 0.2 * std_cp, col = "blue", lty = 2)
abline(h = min_cp - 0.2 * std_cp, col = "blue", lty = 2)
plot(reg_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min_bic <- min(reg_summary$bic)
std_bic <- sd(reg_summary$bic)
abline(h = min_bic + 0.2 * std_bic, col = "blue", lty = 2)
abline(h = min_bic - 0.2 * std_bic, col = "blue", lty = 2)
plot(reg_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", type = "l", ylim = c(0.4, 0.84))
max_adjr2 <- max(reg_summary$adjr2)
std_adjr2 <- sd(reg_summary$adjr2)
abline(h = max_adjr2 + 0.2 * std_adjr2, col = "blue", lty = 2)
abline(h = max_adjr2 - 0.2 * std_adjr2, col = "blue", lty = 2)
According to the plots, the smallest subset size for which the Cp, BIC, and adjusted \(R^2\) scores are all within 0.2 standard deviations of their optimum is 6. Therefore 6 is chosen as the best subset size, and we find the best 6 variables using the entire data set.
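The same 0.2 standard deviation rule can be applied programmatically; a minimal sketch using the summary object computed above:
min(which(reg_summary$cp <= min(reg_summary$cp) + 0.2 * sd(reg_summary$cp)))            # Cp
min(which(reg_summary$bic <= min(reg_summary$bic) + 0.2 * sd(reg_summary$bic)))         # BIC
min(which(reg_summary$adjr2 >= max(reg_summary$adjr2) - 0.2 * sd(reg_summary$adjr2)))   # adjusted R^2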
reg_fit2 <- regsubsets(Outstate ~., data = College, method = "forward")
coefi <- coef(reg_fit2, id = 6)
names(coefi)
## [1] "(Intercept)" "PrivateYes" "Room.Board" "PhD" "perc.alumni"
## [6] "Expend" "Grad.Rate"
b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) +
  s(Expend, df = 5) + s(Grad.Rate, df = 2), data = train_college)
par(mfrow = c(2, 3))
plot(gam_fit, se = T, col = "blue")
c) Evaluate the model obtained on the test set, and explain the results obtained.
gam_pred <- predict(gam_fit, test_college)
gam_err <- mean((test_college$Outstate - gam_pred) ^ 2)
gam_err
## [1] 3349290
gam_tss <- mean((test_college$Outstate - mean(test_college$Outstate)) ^ 2)
test_r2 <- 1 - gam_err / gam_tss  # test R^2, not RSS
test_r2
## [1] 0.7660016
The GAM with 6 predictors obtains a test \(R^2\) of 0.766, a slight improvement over the test \(R^2\) of 0.74 obtained using ordinary least squares.
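For reference, a minimal sketch of one way to compute such a linear (OLS) benchmark on the same split, using the six selected predictors; the resulting test \(R^2\) depends on the split and on how the benchmark is specified, so it may not match 0.74 exactly.
ols_fit <- lm(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate, data = train_college)
ols_pred <- predict(ols_fit, test_college)
1 - mean((test_college$Outstate - ols_pred)^2) / gam_tss  # test R^2 for the linear model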
d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam_fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
## df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
## df = 2), data = train_college)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7402.89 -1114.45 -12.67 1282.69 7470.60
##
## (Dispersion Parameter for gaussian family taken to be 3711182)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1384271126 on 373 degrees of freedom
## AIC: 6987.021
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1778718277 1778718277 479.286 < 2.2e-16 ***
## s(Room.Board, df = 2) 1 1577115244 1577115244 424.963 < 2.2e-16 ***
## s(PhD, df = 2) 1 322431195 322431195 86.881 < 2.2e-16 ***
## s(perc.alumni, df = 2) 1 336869281 336869281 90.771 < 2.2e-16 ***
## s(Expend, df = 5) 1 530538753 530538753 142.957 < 2.2e-16 ***
## s(Grad.Rate, df = 2) 1 86504998 86504998 23.309 2.016e-06 ***
## Residuals 373 1384271126 3711182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2) 1 1.9157 0.1672
## s(PhD, df = 2) 1 0.9699 0.3253
## s(perc.alumni, df = 2) 1 0.1859 0.6666
## s(Expend, df = 5) 4 20.5075 2.665e-15 ***
## s(Grad.Rate, df = 2) 1 0.5702 0.4506
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
According to the "Anova for Nonparametric Effects" table, there is strong evidence of a non-linear relationship between the response Outstate and the variable Expend, while there is no evidence of non-linearity for Room.Board, PhD, perc.alumni, or Grad.Rate.
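To double-check this, a minimal sketch comparing the fitted GAM against the same model with Expend entering linearly, using an F-test as in the ISLR GAM lab:
gam_lin_expend <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) +
  Expend + s(Grad.Rate, df = 2), data = train_college)
anova(gam_lin_expend, gam_fit, test = "F")  # a small p-value favours the non-linear Expend term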