6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
10. This question relates to the College data set.
(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.16.1
gam.fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
                 s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2),
               data = College.train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow = c(2, 3))
plot(gam.fit, se = TRUE, col = "blue")

(c) Evaluate the model obtained on the test set, and explain the results obtained.
gam.pred <- predict(gam.fit, College.test)
gam.err <- mean((College.test$Outstate - gam.pred)^2)
gam.err
## [1] 3349290
gam.tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
# Test R-squared: 1 - test MSE / mean total sum of squares
test.r2 <- 1 - gam.err / gam.tss
test.r2
## [1] 0.7660016
Our test R-squared for the GAM with 6 predictors is 0.77, an improvement over the test R-squared of 0.74 from the OLS fit.
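The calculation in part (c) can be wrapped in a small helper, sketched below. The function name `test_r2` and the toy vectors are illustrative assumptions, not part of the original solution.

```r
# Hypothetical helper: test R-squared = 1 - MSE / mean total sum of squares
test_r2 <- function(actual, predicted) {
  mse <- mean((actual - predicted)^2)
  tss <- mean((actual - mean(actual))^2)
  1 - mse / tss
}

# Toy values for illustration only (not taken from the College data)
actual <- c(1, 2, 3, 4)
predicted <- c(1.1, 1.9, 3.2, 3.8)
r2 <- test_r2(actual, predicted)
print(r2)
```

With the objects from part (c), `test_r2(College.test$Outstate, gam.pred)` should reproduce the 0.766 reported above.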
(d) For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
## df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
## df = 2), data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7402.89 -1114.45 -12.67 1282.69 7470.60
##
## (Dispersion Parameter for gaussian family taken to be 3711182)
##
## Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1384271126 on 373 degrees of freedom
## AIC: 6987.021
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1778718277 1778718277 479.286 < 2.2e-16 ***
## s(Room.Board, df = 2) 1 1577115244 1577115244 424.963 < 2.2e-16 ***
## s(PhD, df = 2) 1 322431195 322431195 86.881 < 2.2e-16 ***
## s(perc.alumni, df = 2) 1 336869281 336869281 90.771 < 2.2e-16 ***
## s(Expend, df = 5) 1 530538753 530538753 142.957 < 2.2e-16 ***
## s(Grad.Rate, df = 2) 1 86504998 86504998 23.309 2.016e-06 ***
## Residuals 373 1384271126 3711182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2) 1 1.9157 0.1672
## s(PhD, df = 2) 1 0.9699 0.3253
## s(perc.alumni, df = 2) 1 0.1859 0.6666
## s(Expend, df = 5) 4 20.5075 2.665e-15 ***
## s(Grad.Rate, df = 2) 1 0.5702 0.4506
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
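The ANOVA for nonparametric effects shows strong evidence of a non-linear relationship between the response and Expend (p = 2.665e-15). At the 0.05 level there is no evidence of non-linearity for Room.Board, PhD, perc.alumni, or Grad.Rate (p-values between 0.17 and 0.67), so a linear term appears adequate for each of those predictors.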
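As a rough check, the p-values in the nonparametric table can be recomputed in base R from the printed F statistics. This is a sketch that assumes the denominator degrees of freedom equal the 373 residual degrees of freedom reported in the summary. The data frame `npar` and its column names are illustrative.

```r
# Assumption: denominator df is the residual df (373) from the summary above
resid_df <- 373

# Npar Df and Npar F copied from the "Anova for Nonparametric Effects" table
npar <- data.frame(
  term = c("Room.Board", "PhD", "perc.alumni", "Expend", "Grad.Rate"),
  df   = c(1, 1, 1, 4, 1),
  f    = c(1.9157, 0.9699, 0.1859, 20.5075, 0.5702)
)

# Upper-tail F probability for each smooth term
npar$p <- pf(npar$f, npar$df, resid_df, lower.tail = FALSE)

# Flag terms whose non-linear component is significant at the 0.05 level
npar$significant <- npar$p < 0.05
print(npar)
```

Because the F statistics are rounded to four decimals in the printed table, the recomputed p-values should agree with the summary only approximately.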