Chapter 07 (page 297): 6, 10
Question #6:
library(ISLR)
library(boot)
data("Wage")
summary(Wage)
## year age maritl race
## Min. :2003 Min. :18.00 1. Never Married: 648 1. White:2480
## 1st Qu.:2004 1st Qu.:33.75 2. Married :2074 2. Black: 293
## Median :2006 Median :42.00 3. Widowed : 19 3. Asian: 190
## Mean :2006 Mean :42.41 4. Divorced : 204 4. Other: 37
## 3rd Qu.:2008 3rd Qu.:51.00 5. Separated : 55
## Max. :2009 Max. :80.00
##
## education region jobclass
## 1. < HS Grad :268 2. Middle Atlantic :3000 1. Industrial :1544
## 2. HS Grad :971 1. New England : 0 2. Information:1456
## 3. Some College :650 3. East North Central: 0
## 4. College Grad :685 4. West North Central: 0
## 5. Advanced Degree:426 5. South Atlantic : 0
## 6. East South Central: 0
## (Other) : 0
## health health_ins logwage wage
## 1. <=Good : 858 1. Yes:2083 Min. :3.000 Min. : 20.09
## 2. >=Very Good:2142 2. No : 917 1st Qu.:4.447 1st Qu.: 85.38
## Median :4.653 Median :104.92
## Mean :4.654 Mean :111.70
## 3rd Qu.:4.857 3rd Qu.:128.68
## Max. :5.763 Max. :318.34
##
set.seed(123)
train <- sample(nrow(Wage), nrow(Wage)/2)
test <- (-train)
Wage_train <- Wage[train, ]
Wage_test <- Wage[test, ]
library(boot)
cv_err <- rep(0, 10)
for (d in 1:10) {
fit <- glm(wage ~ poly(age, d), data = Wage_train)
cv_err[d] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
opt.d <- which.min(cv_err)
cat("The optimal degree is", opt.d, "\n")
## The optimal degree is
The optimal degree determined through Cross-Validation is 4. The comparison by ANOVA suggests that a degree of 4 is enough.
plot(wage ~ age, data = Wage, col = "orange")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "blue", lwd = 2)
b. Fit a step function to predict wage using age, and perform cross-
validation to choose the optimal number of cuts. Make a plot of the fit
obtained.
plot(Wage_train$age, Wage_train$wage, col = "gray", xlab = "Age", ylab = "Wage")
points(Wage_test$age, Wage_test$wage, col = "gray", pch = 20)
age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
matlines(age.grid, se.bands, col = "blue", lty = 2)
lines(age.grid, preds$fit, lwd = 2, col = "red")
It looks like the number of cuts that minimizes the error is 8.
data(College)
set.seed(5)
train <- sample(nrow(College), nrow(College)/4)
test <- (-train)
College.train <- College[train, ]
College.test <- College[test, ]
coef(fit, 6)
## (Intercept) poly(age, 3)1 poly(age, 3)2 poly(age, 3)3
## 111.7036 447.0679 -478.3158 125.5217
b.Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-3
gam.mod <- gam(Outstate ~ Private + s(Room.Board, 4) + s(Terminal, 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), data = College, subset = train)
plot(gam.mod, se = TRUE)
Based on the shape of the fit curves, it appears that Expend and
Grad.Rate exhibit strong non-linear relationships with Outstate.
c.Evaluate the model obtained on the test set, and explain the results obtained.
pred.gam <- predict(gam.mod, College[test,])
(mse.gam <- mean((College[test,'Outstate']-pred.gam)^2))
## [1] 3807477
tss.gam <- mean((College[test,'Outstate'] - mean(College[test,'Outstate']))^2)
(rss.test = 1 - mse.gam/tss.gam)
## [1] 0.7655204
d.For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.mod)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 4) + s(Terminal,
## 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
## data = College, subset = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5093.1 -931.3 73.7 1183.0 3755.3
##
## (Dispersion Parameter for gaussian family taken to be 3053873)
##
## Null Deviance: 3081045758 on 193 degrees of freedom
## Residual Deviance: 525265433 on 171.9998 degrees of freedom
## AIC: 3469.99
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 800522655 800522655 262.1336 < 2.2e-16 ***
## s(Room.Board, 4) 1 732080071 732080071 239.7219 < 2.2e-16 ***
## s(Terminal, 4) 1 300944143 300944143 98.5451 < 2.2e-16 ***
## s(perc.alumni, 4) 1 199410159 199410159 65.2975 1.091e-13 ***
## s(Expend, 4) 1 204161196 204161196 66.8532 6.166e-14 ***
## s(Grad.Rate, 4) 1 15087815 15087815 4.9406 0.02754 *
## Residuals 172 525265433 3053873
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 4) 3 2.2509 0.0842 .
## s(Terminal, 4) 3 2.0745 0.1054
## s(perc.alumni, 4) 3 1.4466 0.2310
## s(Expend, 4) 3 8.7914 1.865e-05 ***
## s(Grad.Rate, 4) 3 0.8806 0.4524
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary statistics show that there is no apparent strong non-linear relationship observed for certain variables, such as Expend, Grad.Rate, and Personal.