library(ISLR)
library(boot)
library(leaps)
library(gam)
data("Wage")
data("College")
In this exercise, you will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
set.seed(1)
split <- sample(1:nrow(College), nrow(College)/2)
collegeTrain <- College[split,]
collegeTest <- College[-split,]
collegeFWD <- regsubsets(Outstate~., data=collegeTrain, nvmax=17, method="forward")
collegeFWD.summary = summary(collegeFWD)
par(mfrow=c(2,2))
plot(collegeFWD.summary$adjr2, type="b", xlab="Predictors", ylab="Adjusted R^2", main="Adjusted R^2")
adjrMax <- which.max(collegeFWD.summary$adjr2)
points(adjrMax, collegeFWD.summary$adjr[adjrMax], col="red", lwd=4)
plot(collegeFWD.summary$bic, type="b", xlab="Predictors", ylab="BIC", main="BIC")
bicMin <- which.min(collegeFWD.summary$bic)
points(bicMin, collegeFWD.summary$bic[bicMin], col="red", lwd=4)
plot(collegeFWD.summary$cp, type="b", xlab="Predictors", ylab="CP", main="CP")
cpMin <- which.min(collegeFWD.summary$cp)
points(cpMin, collegeFWD.summary$cp[cpMin], col="red", lwd=4)
coef(collegeFWD, 6)
## (Intercept) PrivateYes Room.Board Terminal perc.alumni
## -4726.8810613 2717.7019276 1.1032433 36.9990286 59.0863753
## Expend Grad.Rate
## 0.1930814 33.8303314
For the different predictors, Expend doesnโt have a linear relationship with Outstate compared to the other 5. (b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
collegeGAM <- gam(Outstate~Private + s(Room.Board, df=3) + s(Terminal, df=3) + s(perc.alumni, df=3) + s(Expend, df=3) + s(Grad.Rate, df=3), data=collegeTrain)
par(mfrow=c(2,3))
plot(collegeGAM, se=TRUE, col="red")
(c) Evaluate the model obtained on the test set, and explain the results obtained.
collegeGAM.pred <- predict(collegeGAM, collegeTest)
(collegeGAM.mse <- mean((collegeTest$Outstate - collegeGAM.pred)^2))
## [1] 3378416
collegeGAM.tss <- mean((collegeTest$Outstate - mean(collegeTest$Outstate))^2)
collegeGAM.rss <- 1-collegeGAM.mse/collegeGAM.tss
collegeGAM.rss
## [1] 0.7639667
The MSE here is 3841676 and our r-squared is .7637752. This tells us that 77% of the variance can be explained.
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(collegeGAM)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 3) + s(Terminal,
## df = 3) + s(perc.alumni, df = 3) + s(Expend, df = 3) + s(Grad.Rate,
## df = 3), data = collegeTrain)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -6851.9 -1144.9 -99.9 1275.7 7845.3
##
## (Dispersion Parameter for gaussian family taken to be 3811388)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1414024650 on 370.9999 degrees of freedom
## AIC: 6999.272
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1769107166 1769107166 464.163 < 2.2e-16 ***
## s(Room.Board, df = 3) 1 1625808440 1625808440 426.566 < 2.2e-16 ***
## s(Terminal, df = 3) 1 305564773 305564773 80.171 < 2.2e-16 ***
## s(perc.alumni, df = 3) 1 358311620 358311620 94.011 < 2.2e-16 ***
## s(Expend, df = 3) 1 569459943 569459943 149.410 < 2.2e-16 ***
## s(Grad.Rate, df = 3) 1 91109927 91109927 23.905 1.51e-06 ***
## Residuals 371 1414024650 3811388
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 3) 2 1.598 0.2037
## s(Terminal, df = 3) 2 1.137 0.3220
## s(perc.alumni, df = 3) 2 0.278 0.7575
## s(Expend, df = 3) 2 34.287 2.176e-14 ***
## s(Grad.Rate, df = 3) 2 0.909 0.4039
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Because of the low p-value, Expend might share a nonlinear relationship with Outstate.For the other variables they seem to have linear relationships because of the high p-value.