R Markdown

  1. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data. 324 7. Moving Beyond Linearity
library(ISLR)
library(boot)
set.seed(1)
degree <- 10
cv.errs <- rep(NA, degree)
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
  1. Fit a step function to predict wage using age, and perform cross validation to choose the optimal number of cuts. Make a plot of the fit obtained
cv.errs <- rep(NA, degree)
for (i in 2:degree) {
  Wage$age.cut <- cut(Wage$age, i)
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)
lines(age.grid, preds, col = "red", lwd = 2)
  1. This question relates to the College data set.
  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
library(ISLR)
library(leaps)
test <- -train
fit <- regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.summary <- summary(fit)
## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## 1  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
coef(fit, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3815.6574509  2880.3858979     0.9861841    43.6735045    40.4602197 
  1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
gam.mod <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = train)
  1. Evaluate the model obtained on the test set, and explain the results obtained.
preds <- predict(gam.mod, College[test, ])
1 - (RSS / TSS)
  1. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.mod)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 
##     5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
##     data = College, subset = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -7289.5 -1004.3    18.3  1123.6  4218.8 
## (Intercept)                                    
## Private                                        
## s(Room.Board, 5)        4  3.6201  0.006576 ** 
## s(Terminal, 5)          4  2.3018  0.058243 .  
## s(perc.alumni, 5)       4  0.8690  0.482600    
## s(Expend, 5)            4 28.0768 < 2.2e-16 ***