6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
The optimal degree that was chosen by the cross validation is d=5 with an MSE of 1594.977. The results of the hypothesis testing using ANOVA suggests that model 2 with a cubic fit and model 3 with a quadratic fit is sufficient. In comparison to the MSE of the cross validation, the MSE of the cubic fit is 1599.409 and the MSE of the quadratic fit is 1594.684. The MSE results are not too different from each model.
library(ISLR)
attach(Wage)
require(boot)
#fitting and cross validation
set.seed(1)
cv.poly <- rep(0,5)
for (i in 1:5){
glm.fit.poly <- glm(wage ~ poly(age,i),data=Wage)
cv.poly[i]<- cv.glm(Wage,glm.fit.poly,K=10)$delta[1]
}
cv.poly
## [1] 1676.826 1600.763 1598.399 1595.651 1594.977
range <- range(cv.poly)
min.cv <- min(cv.poly)
range
## [1] 1594.977 1676.826
min.cv
## [1] 1594.977
plot(cv.poly, type="b")
points(which.min(cv.poly), cv.poly[5], col="red", pch=20, cex=2)
set.seed(1)
fit.1=lm(wage~age,data=Wage)
fit.2=lm(wage~poly(age,2),data=Wage)
fit.3=lm(wage~poly(age,3),data=Wage)
fit.4=lm(wage~poly(age,4),data=Wage)
fit.5=lm(wage~poly(age,5),data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5)
library(dvmisc)
get_mse(fit.1)
## [1] 1675.189
get_mse(fit.2)
## [1] 1599.409
get_mse(fit.3)
## [1] 1594.684
get_mse(fit.4)
## [1] 1593.19
get_mse(fit.5)
## [1] 1593.294
#plotting the result of the polynomial fit to the data
agelims <- range(age)
age.grid <- seq(from=agelims[1], to=agelims[2])
preds <- predict(fit.4,newdata=list(age=age.grid),se=TRUE)
se.bands <- cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
plot(age, wage, xlim=agelims, cex=.5, col="darkgrey")
lines(age.grid, preds$fit, lwd=2, col="darkblue")
matlines(age.grid, se.bands, lwd=1, col="red", lty=3)
(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.
set.seed(1)
cv.step <- rep(NA,5)
for (i in 2:5){
Wage$age.cut <- cut(Wage$age,i)
glm.fit.step <- glm(wage ~ age.cut, data = Wage)
cv.step[i]<- cv.glm(Wage,glm.fit.step,K=10)$delta[1]
}
cv.step
## [1] NA 1734.489 1684.271 1635.552 1632.080
range <- range(cv.step)
plot(cv.step, type="b")
points(which.min(cv.step), cv.step[5], col="red", pch=20, cex=2)
plot(wage ~ age, data = Wage, col = "darkgrey")
glm.fit.step <- glm(wage ~ cut(age, 5), data = Wage)
preds <- predict(glm.fit.step, list(age = age.grid))
lines(age.grid, preds, col = "blue", lwd = 2)
10. This question relates to the College data set.
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
The best subset may be a model that includes the 6 predictors PrivateYes, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate.
#split data into train and test set
attach(College)
College=na.omit(College)
# 70% of the sample size
smp_size <- floor(.7 * nrow(College))
# set the seed to make partition reproducible
set.seed(1)
train_ind <- sample(seq_len(nrow(College)), size = smp_size)
train <- College[train_ind, ]
test <- College[-train_ind, ]
set.seed(1)
library(leaps)
forward.fit <- regsubsets(Outstate ~ ., data = train, method = 'forward')
summary(forward.fit)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, method = "forward")
## 17 Variables (and intercept)
## Forced in Forced out
## PrivateYes FALSE FALSE
## Apps FALSE FALSE
## Accept FALSE FALSE
## Enroll FALSE FALSE
## Top10perc FALSE FALSE
## Top25perc FALSE FALSE
## F.Undergrad FALSE FALSE
## P.Undergrad FALSE FALSE
## Room.Board FALSE FALSE
## Books FALSE FALSE
## Personal FALSE FALSE
## PhD FALSE FALSE
## Terminal FALSE FALSE
## S.F.Ratio FALSE FALSE
## perc.alumni FALSE FALSE
## Expend FALSE FALSE
## Grad.Rate FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
## PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " "
## 3 ( 1 ) "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" " " " " " " " " " " " "
## 7 ( 1 ) "*" " " " " " " " " " " " "
## 8 ( 1 ) "*" " " " " " " " " " " " "
## P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1 ( 1 ) " " " " " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " "
## 3 ( 1 ) " " "*" " " " " " " " " " "
## 4 ( 1 ) " " "*" " " " " " " " " " "
## 5 ( 1 ) " " "*" " " " " "*" " " " "
## 6 ( 1 ) " " "*" " " " " "*" " " " "
## 7 ( 1 ) " " "*" " " "*" "*" " " " "
## 8 ( 1 ) " " "*" " " "*" "*" "*" " "
## perc.alumni Expend Grad.Rate
## 1 ( 1 ) " " "*" " "
## 2 ( 1 ) " " "*" " "
## 3 ( 1 ) " " "*" " "
## 4 ( 1 ) "*" "*" " "
## 5 ( 1 ) "*" "*" " "
## 6 ( 1 ) "*" "*" "*"
## 7 ( 1 ) "*" "*" "*"
## 8 ( 1 ) "*" "*" "*"
coef(forward.fit, id = 6)
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3782.6544026 2808.0522815 0.9722009 38.0615210 59.1918276
## Expend Grad.Rate
## 0.2032837 28.6598253
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
Based on the plots, it seems that out of state tuition is higher for private school. There is a somewhat of a positive linear relationship for the variables Room.Board, PhD, perc.alumni, and grad.rate. There is a non-linear relationship for variable Expend.
library(gam)
set.seed(1)
gam.fit <- gam(Outstate ~ Private + s(Room.Board, 5) + s(PhD, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = train)
par(mfrow = c(2,3))
plot(gam.fit, se=TRUE,col ="#1c9099")
(c) Evaluate the model obtained on the test set, and explain the results obtained.
The linear regression mse’s are lower than the gam model
set.seed(1)
preds=predict(gam.fit,newdata=test)
mse <- mean((preds-test$Outstate)^2)
mse
## [1] 3187953
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
Based on the plots, there is a non-linear relationship for variable Expend.