Chapter 07 (page 297): 6, 10

Question #6:

  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polyno- mial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
library(ISLR)
library(boot)
data("Wage")
summary(Wage)
##       year           age                     maritl           race     
##  Min.   :2003   Min.   :18.00   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00   3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41   4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00   5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                          
##                                                                        
##               education                     region               jobclass   
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456  
##  3. Some College   :650   3. East North Central:   0                        
##  4. College Grad   :685   4. West North Central:   0                        
##  5. Advanced Degree:426   5. South Atlantic    :   0                        
##                           6. East South Central:   0                        
##                           (Other)              :   0                        
##             health      health_ins      logwage           wage       
##  1. <=Good     : 858   1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. >=Very Good:2142   2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                                      Median :4.653   Median :104.92  
##                                      Mean   :4.654   Mean   :111.70  
##                                      3rd Qu.:4.857   3rd Qu.:128.68  
##                                      Max.   :5.763   Max.   :318.34  
## 
set.seed(123)
train <- sample(nrow(Wage), nrow(Wage)/2)
test <- (-train)
Wage_train <- Wage[train, ]
Wage_test <- Wage[test, ]
library(boot)
cv_err <- rep(0, 10)
for (d in 1:10) {
  fit <- glm(wage ~ poly(age, d), data = Wage_train)
  cv_err[d] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
opt.d <- which.min(cv_err)
cat("The optimal degree is", opt.d, "\n")
## The optimal degree is

The optimal degree determined through Cross-Validation is 4. The comparison by ANOVA suggests that a degree of 4 is enough.

plot(wage ~ age, data = Wage, col = "orange")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "blue", lwd = 2)

b. Fit a step function to predict wage using age, and perform cross- validation to choose the optimal number of cuts. Make a plot of the fit obtained.

plot(Wage_train$age, Wage_train$wage, col = "gray", xlab = "Age", ylab = "Wage")
points(Wage_test$age, Wage_test$wage, col = "gray", pch = 20)
age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
preds <- predict(fit, newdata = list(age = age.grid), se = TRUE)
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
matlines(age.grid, se.bands, col = "blue", lty = 2)
lines(age.grid, preds$fit, lwd = 2, col = "red")

It looks like the number of cuts that minimizes the error is 8.

  1. Split the data into a training set and a test set.Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
data(College)
set.seed(5)
train <- sample(nrow(College), nrow(College)/4)
test <- (-train)
College.train <- College[train, ]
College.test <- College[test, ]
coef(fit, 6)
##   (Intercept) poly(age, 3)1 poly(age, 3)2 poly(age, 3)3 
##      111.7036      447.0679     -478.3158      125.5217

b.Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.22-3
gam.mod <- gam(Outstate ~ Private + s(Room.Board, 4) + s(Terminal, 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), data = College, subset = train)
plot(gam.mod, se = TRUE)

Based on the shape of the fit curves, it appears that Expend and Grad.Rate exhibit strong non-linear relationships with Outstate.

c.Evaluate the model obtained on the test set, and explain the results obtained.

pred.gam <- predict(gam.mod, College[test,])
(mse.gam <- mean((College[test,'Outstate']-pred.gam)^2))
## [1] 3807477
tss.gam <- mean((College[test,'Outstate'] - mean(College[test,'Outstate']))^2)
(rss.test = 1 - mse.gam/tss.gam)
## [1] 0.7655204

d.For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.mod)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 4) + s(Terminal, 
##     4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), 
##     data = College, subset = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -5093.1  -931.3    73.7  1183.0  3755.3 
## 
## (Dispersion Parameter for gaussian family taken to be 3053873)
## 
##     Null Deviance: 3081045758 on 193 degrees of freedom
## Residual Deviance: 525265433 on 171.9998 degrees of freedom
## AIC: 3469.99 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df    Sum Sq   Mean Sq  F value    Pr(>F)    
## Private             1 800522655 800522655 262.1336 < 2.2e-16 ***
## s(Room.Board, 4)    1 732080071 732080071 239.7219 < 2.2e-16 ***
## s(Terminal, 4)      1 300944143 300944143  98.5451 < 2.2e-16 ***
## s(perc.alumni, 4)   1 199410159 199410159  65.2975 1.091e-13 ***
## s(Expend, 4)        1 204161196 204161196  66.8532 6.166e-14 ***
## s(Grad.Rate, 4)     1  15087815  15087815   4.9406   0.02754 *  
## Residuals         172 525265433   3053873                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar F     Pr(F)    
## (Intercept)                                   
## Private                                       
## s(Room.Board, 4)        3 2.2509    0.0842 .  
## s(Terminal, 4)          3 2.0745    0.1054    
## s(perc.alumni, 4)       3 1.4466    0.2310    
## s(Expend, 4)            3 8.7914 1.865e-05 ***
## s(Grad.Rate, 4)         3 0.8806    0.4524    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary statistics show that there is no apparent strong non-linear relationship observed for certain variables, such as Expend, Grad.Rate, and Personal.