(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
attach(Wage)
fit=lm(wage~poly(age,10),data=Wage)
coef(summary(fit))
##                   Estimate Std. Error      t value     Pr(>|t|)
## (Intercept)      111.70361  0.7283319 153.36910451 0.000000e+00
## poly(age, 10)1   447.06785 39.8923800  11.20684835 1.390296e-28
## poly(age, 10)2  -478.31581 39.8923800 -11.99015466 2.187330e-32
## poly(age, 10)3   125.52169 39.8923800   3.14650783 1.668583e-03
## poly(age, 10)4   -77.91118 39.8923800  -1.95303416 5.090874e-02
## poly(age, 10)5   -35.81289 39.8923800  -0.89773759 3.693978e-01
## poly(age, 10)6    62.70772 39.8923800   1.57192216 1.160744e-01
## poly(age, 10)7    50.54979 39.8923800   1.26715403 2.051989e-01
## poly(age, 10)8   -11.25473 39.8923800  -0.28212736 7.778654e-01
## poly(age, 10)9   -83.69180 39.8923800  -2.09793947 3.599425e-02
## poly(age, 10)10    1.62405 39.8923800   0.04071077 9.675292e-01
agelims=range(age)
age.grid=seq(from=agelims[1],to=agelims[2])
preds=predict(fit,newdata=list(age=age.grid),se=TRUE)
se.bands=cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
par(mfrow=c(1,1),mar=c(4.5,4.5,1,1),oma=c(0,0,2,0))
plot(age,wage,xlim=agelims,cex =.5,col="darkgrey")
title("Degree-10 Polynomial",outer=T)  # the plotted object `fit` is the degree-10 model
lines(age.grid,preds$fit,lwd=2,col="darkblue")
matlines(age.grid,se.bands,lwd=1,col="lightblue",lty=3)
fit.1=lm(wage~age,data=Wage)
fit.2=lm(wage~poly(age,2),data=Wage)
fit.3=lm(wage~poly(age,3),data=Wage)
fit.4=lm(wage~poly(age,4),data=Wage)
fit.5=lm(wage~poly(age,5),data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)
## 1   2998 5022216
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 **
## 4   2995 4771604  1      6070   3.8098  0.051046 .
## 5   2994 4770322  1      1283   0.8050  0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
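The code above fits and tests the polynomials but does not run the cross-validation step that the question asks for. A minimal sketch of that step follows. It assumes 10-fold CV through cv.glm() from the boot package, a maximum degree of 10, and a seed of 1; none of these choices are stated in the original. For comparison, the ANOVA table above finds the cubic term clearly significant (p = 0.0017) and the quartic term borderline (p = 0.051).

```r
library(ISLR)   # Wage data
library(boot)   # cv.glm()

set.seed(1)                        # assumed seed; fold assignment is random
max.degree <- 10                   # assumed upper bound on the degree
cv.error <- rep(NA, max.degree)

for (d in 1:max.degree) {
  # glm() with the default gaussian family fits the same model as lm()
  glm.fit <- glm(wage ~ poly(age, d), data = Wage)
  # delta[1] is the raw K-fold CV estimate of the test MSE
  cv.error[d] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}

best.degree <- which.min(cv.error)

# CV error against degree, with the minimum highlighted
plot(1:max.degree, cv.error, type = "b",
     xlab = "Degree", ylab = "10-fold CV MSE")
points(best.degree, cv.error[best.degree], col = "red", pch = 19)
```

The CV curve is typically flat past the first few degrees, so the exact minimizer can change with the seed; comparing best.degree with the ANOVA result is more informative than reading off a single number.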
(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
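No code is given for this part. One possible sketch, assuming 10-fold CV with cv.glm(), between 2 and 10 intervals, and a seed of 1 (all assumptions, as are the names wage.df and age.cut):

```r
library(ISLR)   # Wage data
library(boot)   # cv.glm()

set.seed(1)                        # assumed seed
wage.df <- Wage                    # work on a copy so Wage itself is not modified
max.cuts <- 10                     # assumed upper bound on the number of intervals
cv.error <- rep(NA, max.cuts)      # position 1 stays NA: one interval is not a step function

for (k in 2:max.cuts) {
  # Bin age on the full data first so every fold sees the same factor levels
  wage.df$age.cut <- cut(wage.df$age, k)
  glm.fit <- glm(wage ~ age.cut, data = wage.df)
  cv.error[k] <- cv.glm(wage.df, glm.fit, K = 10)$delta[1]
}

best.cuts <- which.min(cv.error)   # which.min() skips the NA entry

# Refit with the selected number of intervals and plot the fitted steps
wage.df$age.cut <- cut(wage.df$age, best.cuts)
step.fit <- lm(wage ~ age.cut, data = wage.df)
ord <- order(wage.df$age)
plot(wage.df$age, wage.df$wage, cex = 0.5, col = "darkgrey",
     xlab = "age", ylab = "wage")
lines(wage.df$age[ord], fitted(step.fit)[ord], lwd = 2, col = "darkblue", type = "s")
```

Here cut(age, k) splits the age range into k equal-width intervals, so k counts intervals rather than cut points.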
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(leaps)
set.seed(1)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))
train <- College[train_index, ]
test <- College[-train_index, ]
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 1, selectionFunction = "oneSE")
# caret reports RMSE (not MSE) for regression, so request it directly
model_forward <- train(Outstate ~ ., data = train, method = "leapForward",
                       metric = "RMSE", maximize = FALSE, trControl = ctrl,
                       tuneGrid = data.frame(nvmax = 1:17))
model_forward
## Linear Regression with Forward Selection
##
## 544 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 489, 489, 489, 490, 489, 489, ...
## Resampling results across tuning parameters:
##
##   nvmax  RMSE      Rsquared   MAE
##    1     3264.437  0.4543993  2584.769
##    2     2770.180  0.5772085  2139.418
##    3     2408.152  0.6678757  1874.577
##    4     2230.315  0.7107524  1743.983
##    5     2175.728  0.7246605  1700.876
##    6     2140.273  0.7331822  1673.365
##    7     2155.826  0.7287980  1692.371
##    8     2175.808  0.7234398  1692.957
##    9     2168.176  0.7253524  1686.569
##   10     2131.400  0.7342856  1676.669
##   11     2120.075  0.7366834  1661.793
##   12     2106.659  0.7398460  1654.583
##   13     2089.662  0.7442338  1643.490
##   14     2091.543  0.7440925  1644.694
##   15     2096.765  0.7428962  1648.278
##   16     2098.375  0.7425564  1651.121
##   17     2098.915  0.7424432  1652.009
##
## RMSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 6.
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
coef(model_forward$finalModel, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni
## -3764.3413062  2793.2069104     0.9703210    38.2157650    59.0358377
##        Expend     Grad.Rate
##     0.2031532    28.6548780
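The coefficients above identify the six selected predictors (Private, Room.Board, PhD, perc.alumni, Expend, Grad.Rate), but no GAM is fitted. A hedged sketch of the GAM step follows. It rebuilds the same 70/30 split and, as an assumption, uses gam() from the mgcv package with penalized smooths and REML smoothing selection, rather than the gam package used in the ISLR labs; the smooth specification is a choice made here, not taken from the original.

```r
library(ISLR)   # College data
library(mgcv)   # gam() with penalized regression splines (assumed choice)

# Recreate the training/test split used for forward selection
set.seed(1)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))
train <- College[train_index, ]
test <- College[-train_index, ]

# Private is a factor and enters linearly; the five numeric predictors get smooths
gam.fit <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) +
                 s(Expend) + s(Grad.Rate),
               data = train, method = "REML")

# One panel per smooth term, with shaded standard-error bands
plot(gam.fit, pages = 1, se = TRUE, shade = TRUE)

# Approximate significance and effective degrees of freedom of each smooth
summary(gam.fit)
```

In the summary, a smooth whose effective degrees of freedom is close to 1 is behaving almost linearly, while a larger value points to a nonlinear relationship with out-of-state tuition; the plots show the shape of each such effect.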