library(ISLR)
library(boot)
library(leaps)
library(gam)
data("Wage")
data("College")

Problem 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

set.seed(1)
split <- sample(1:nrow(College), nrow(College)/2)
collegeTrain <- College[split,]
collegeTest <- College[-split,]
collegeFWD <- regsubsets(Outstate~., data=collegeTrain, nvmax=17, method="forward")
collegeFWD.summary = summary(collegeFWD)
par(mfrow=c(2,2))

plot(collegeFWD.summary$adjr2, type="b", xlab="Predictors", ylab="Adjusted R^2", main="Adjusted R^2")
adjrMax <- which.max(collegeFWD.summary$adjr2)
points(adjrMax, collegeFWD.summary$adjr[adjrMax], col="red", lwd=4)

plot(collegeFWD.summary$bic, type="b", xlab="Predictors", ylab="BIC", main="BIC")
bicMin <- which.min(collegeFWD.summary$bic)
points(bicMin, collegeFWD.summary$bic[bicMin], col="red", lwd=4)

plot(collegeFWD.summary$cp, type="b", xlab="Predictors", ylab="CP", main="CP")
cpMin <- which.min(collegeFWD.summary$cp)
points(cpMin, collegeFWD.summary$cp[cpMin], col="red", lwd=4)
coef(collegeFWD, 6)
##   (Intercept)    PrivateYes    Room.Board      Terminal   perc.alumni 
## -4726.8810613  2717.7019276     1.1032433    36.9990286    59.0863753 
##        Expend     Grad.Rate 
##     0.1930814    33.8303314

For the different predictors, Expend doesnโ€™t have a linear relationship with Outstate compared to the other 5. (b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

collegeGAM <- gam(Outstate~Private + s(Room.Board, df=3) + s(Terminal, df=3) + s(perc.alumni, df=3) + s(Expend, df=3) + s(Grad.Rate, df=3), data=collegeTrain)
par(mfrow=c(2,3))
plot(collegeGAM, se=TRUE, col="red")

(c) Evaluate the model obtained on the test set, and explain the results obtained.

collegeGAM.pred <- predict(collegeGAM, collegeTest)
(collegeGAM.mse <- mean((collegeTest$Outstate - collegeGAM.pred)^2))
## [1] 3378416
collegeGAM.tss <- mean((collegeTest$Outstate - mean(collegeTest$Outstate))^2)
collegeGAM.rss <- 1-collegeGAM.mse/collegeGAM.tss
collegeGAM.rss
## [1] 0.7639667

The MSE here is 3841676 and our r-squared is .7637752. This tells us that 77% of the variance can be explained.

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(collegeGAM)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 3) + s(Terminal, 
##     df = 3) + s(perc.alumni, df = 3) + s(Expend, df = 3) + s(Grad.Rate, 
##     df = 3), data = collegeTrain)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -6851.9 -1144.9   -99.9  1275.7  7845.3 
## 
## (Dispersion Parameter for gaussian family taken to be 3811388)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1414024650 on 370.9999 degrees of freedom
## AIC: 6999.272 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 1769107166 1769107166 464.163 < 2.2e-16 ***
## s(Room.Board, df = 3)    1 1625808440 1625808440 426.566 < 2.2e-16 ***
## s(Terminal, df = 3)      1  305564773  305564773  80.171 < 2.2e-16 ***
## s(perc.alumni, df = 3)   1  358311620  358311620  94.011 < 2.2e-16 ***
## s(Expend, df = 3)        1  569459943  569459943 149.410 < 2.2e-16 ***
## s(Grad.Rate, df = 3)     1   91109927   91109927  23.905  1.51e-06 ***
## Residuals              371 1414024650    3811388                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df Npar F     Pr(F)    
## (Intercept)                                        
## Private                                            
## s(Room.Board, df = 3)        2  1.598    0.2037    
## s(Terminal, df = 3)          2  1.137    0.3220    
## s(perc.alumni, df = 3)       2  0.278    0.7575    
## s(Expend, df = 3)            2 34.287 2.176e-14 ***
## s(Grad.Rate, df = 3)         2  0.909    0.4039    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Because of the low p-value, Expend might share a nonlinear relationship with Outstate.For the other variables they seem to have linear relationships because of the high p-value.