6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

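library(ISLR)
attach(Wage)
# Degree-10 fit on an orthogonal polynomial basis, so each coefficient's
# t-test checks whether that degree improves on the lower-degree terms
fit=lm(wage~poly(age,10),data=Wage)
coef(summary(fit))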
##                   Estimate Std. Error      t value     Pr(>|t|)
## (Intercept)      111.70361  0.7283319 153.36910451 0.000000e+00
## poly(age, 10)1   447.06785 39.8923800  11.20684835 1.390296e-28
## poly(age, 10)2  -478.31581 39.8923800 -11.99015466 2.187330e-32
## poly(age, 10)3   125.52169 39.8923800   3.14650783 1.668583e-03
## poly(age, 10)4   -77.91118 39.8923800  -1.95303416 5.090874e-02
## poly(age, 10)5   -35.81289 39.8923800  -0.89773759 3.693978e-01
## poly(age, 10)6    62.70772 39.8923800   1.57192216 1.160744e-01
## poly(age, 10)7    50.54979 39.8923800   1.26715403 2.051989e-01
## poly(age, 10)8   -11.25473 39.8923800  -0.28212736 7.778654e-01
## poly(age, 10)9   -83.69180 39.8923800  -2.09793947 3.599425e-02
## poly(age, 10)10    1.62405 39.8923800   0.04071077 9.675292e-01
# Grid of integer ages spanning the observed range
agelims=range(age)
age.grid=seq(from=agelims[1],to=agelims[2])
preds=predict(fit,newdata=list(age=age.grid),se=TRUE)
# Approximate 95% confidence band: fit +/- 2 standard errors
se.bands=cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
par(mfrow=c(1,1),mar=c(4.5,4.5,1,1),oma=c(0,0,2,0))
plot(age,wage,xlim=agelims,cex=.5,col="darkgrey")
title("Degree-10 Polynomial",outer=TRUE)
lines(age.grid,preds$fit,lwd=2,col="darkblue")
matlines(age.grid,se.bands,lwd=1,col="lightblue",lty=3)

fit.1=lm(wage~age,data=Wage)
fit.2=lm(wage~poly(age,2),data=Wage)
fit.3=lm(wage~poly(age,3),data=Wage)
fit.4=lm(wage~poly(age,4),data=Wage)
fit.5=lm(wage~poly(age,5),data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
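
The ANOVA table points to a cubic or quartic fit: the cubic term is clearly significant (p = 0.0017), the quartic term is borderline (p = 0.051), and the quintic term is not (p = 0.37). Part (a) also asks for a cross-validated choice of degree, which the code above does not compute. A minimal sketch of that step follows, assuming the `boot` package is installed; the seed, the 10 folds, the maximum degree of 10, and the names `cv_error` and `best_degree` are illustrative choices, not part of the original solution.

```r
library(ISLR)   # Wage data
library(boot)   # cv.glm()

set.seed(1)                        # assumed seed; fold assignment is random
max_degree <- 10                   # assumed upper limit on the degree
cv_error <- rep(NA_real_, max_degree)

for (d in 1:max_degree) {
  # glm() with the default gaussian family fits the same model as lm(),
  # but returns an object that cv.glm() accepts
  glm_fit <- glm(wage ~ poly(age, d), data = Wage)
  # delta[1] is the raw K-fold cross-validation estimate of the test MSE
  cv_error[d] <- cv.glm(Wage, glm_fit, K = 10)$delta[1]
}

best_degree <- which.min(cv_error)

# CV error against degree, with the minimum highlighted
plot(1:max_degree, cv_error, type = "b",
     xlab = "Degree", ylab = "10-fold CV MSE")
points(best_degree, cv_error[best_degree], col = "red", pch = 19)
```

The CV curve is typically flat once the degree is large enough, so the exact minimum can change with the seed. A reasonable way to compare with ANOVA is to check whether the smallest degree whose CV error is close to the minimum agrees with the cubic or quartic model suggested above.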

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
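
No solution is given for this part. One possible approach is sketched below, assuming the `boot` package is installed: bin `age` with `cut()`, estimate the test MSE for each number of intervals by 10-fold cross-validation, then refit and plot the best step function. The range of 2 to 10 intervals, the seed, and the names `wage_df`, `cv_step` and `best_k` are assumptions made for illustration.

```r
library(ISLR)   # Wage data
library(boot)   # cv.glm()

set.seed(1)                 # assumed seed; fold assignment is random
wage_df <- Wage             # work on a copy so the binned column can be added
cut_counts <- 2:10          # assumed candidate numbers of intervals
cv_step <- rep(NA_real_, length(cut_counts))

for (i in seq_along(cut_counts)) {
  # Bin age on the full data so every fold shares the same factor levels
  wage_df$age_bin <- cut(wage_df$age, cut_counts[i])
  step_fit <- glm(wage ~ age_bin, data = wage_df)
  cv_step[i] <- cv.glm(wage_df, step_fit, K = 10)$delta[1]
}

best_k <- cut_counts[which.min(cv_step)]

# Refit with the selected number of intervals and plot the step function
wage_df$age_bin <- cut(wage_df$age, best_k)
final_fit <- lm(wage ~ age_bin, data = wage_df)
ord <- order(wage_df$age)
plot(wage_df$age, wage_df$wage, cex = 0.5, col = "darkgrey",
     xlab = "age", ylab = "wage")
lines(wage_df$age[ord], fitted(final_fit)[ord], lwd = 2, col = "darkblue")
```

Note that `cut(x, k)` with a single number `k` produces `k` equal-width intervals, so the number of interior breakpoints is `k - 1`.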

10. This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(caret)
library(leaps)
set.seed(1)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))
train <- College[train_index, ]
test <- College[-train_index, ]
# 10-fold CV; "oneSE" selects the simplest model within one standard error of the best RMSE
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 1, selectionFunction = "oneSE")
model_forward <- train(Outstate ~ ., data = train, method = "leapForward", metric = "RMSE", maximize = FALSE, trControl = ctrl, tuneGrid = data.frame(nvmax = 1:17))
model_forward
## Linear Regression with Forward Selection 
## 
## 544 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 489, 489, 489, 490, 489, 489, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE     
##    1     3264.437  0.4543993  2584.769
##    2     2770.180  0.5772085  2139.418
##    3     2408.152  0.6678757  1874.577
##    4     2230.315  0.7107524  1743.983
##    5     2175.728  0.7246605  1700.876
##    6     2140.273  0.7331822  1673.365
##    7     2155.826  0.7287980  1692.371
##    8     2175.808  0.7234398  1692.957
##    9     2168.176  0.7253524  1686.569
##   10     2131.400  0.7342856  1676.669
##   11     2120.075  0.7366834  1661.793
##   12     2106.659  0.7398460  1654.583
##   13     2089.662  0.7442338  1643.490
##   14     2091.543  0.7440925  1644.694
##   15     2096.765  0.7428962  1648.278
##   16     2098.375  0.7425564  1651.121
##   17     2098.915  0.7424432  1652.009
## 
## RMSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 6.

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

coef(model_forward$finalModel, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3764.3413062  2793.2069104     0.9703210    38.2157650    59.0358377 
##        Expend     Grad.Rate 
##     0.2031532    28.6548780
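
The coefficient listing identifies the six selected predictors (Private, Room.Board, PhD, perc.alumni, Expend, Grad.Rate), but the GAM itself is not fitted. A minimal sketch of the fit follows, assuming the `gam` package is installed. The smoothing splines with 4 degrees of freedom per quantitative term are an assumed starting point rather than a tuned choice, and the training split is recreated with the same seed and proportion as in part (a) so that the block runs on its own.

```r
library(ISLR)   # College data
library(gam)    # gam() and s()

# Recreate the 70/30 split from part (a)
set.seed(1)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))
train <- College[train_index, ]

# Private is a factor and enters linearly; each quantitative predictor
# gets a smoothing spline with an assumed 4 degrees of freedom
gam_fit <- gam(Outstate ~ Private + s(Room.Board, df = 4) + s(PhD, df = 4) +
                 s(perc.alumni, df = 4) + s(Expend, df = 4) +
                 s(Grad.Rate, df = 4),
               data = train)

# One panel per term, with standard-error bands
par(mfrow = c(2, 3))
plot(gam_fit, se = TRUE, col = "blue")

# The "Anova for Nonparametric Effects" table tests each smooth term
# against a linear effect
summary(gam_fit)
```

When reading the panels, a term whose fitted curve stays close to a straight line within its error band is a candidate for a plain linear effect, while visible curvature supports keeping the smooth term.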