Assignment6

Chapter 07 (page 297): 6, 10

Question 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

library(ISLR)
library(boot)

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.
The 9th degree was chosen as the optimal degree for the polynomial regression. The ANOVA says that the 4th degree is sufficient. The polynomial also shows that the 4th degree is pretty close to the 9th degree.

set.seed(1)

cv.errors <- NA
for (i in 1:10) {
  fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.errors[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}

cv.errors

##  [1] 1676.826 1600.763 1598.399 1595.651 1594.977 1596.061 1594.298 1598.134
##  [9] 1593.913 1595.950

plot(cv.errors, type = "b", xlab = "Degree", ylab = "CV Error")
opt.degree <- which.min(cv.errors)
points(opt.degree, cv.errors[opt.degree], col = "red")

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.
The optimal number of cuts is 8.

set.seed(1)

cv.errors <- NA
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.errors[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}

cv.errors

##  [1]       NA 1734.489 1684.271 1635.552 1632.080 1623.415 1614.996 1601.318
##  [9] 1613.954 1606.331

plot(cv.errors, type = "b", xlab = "Degree", ylab = "CV Error")
opt.degree <- which.min(cv.errors)
points(opt.degree, cv.errors[opt.degree], col = "red")

Question 10.

This question relates to the College data set.

library(leaps)

attach(College)

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
According to the forward stepwise selection, the best model selected uses 12 or 13 variables. However, the plots show that using 5 or 6 variables will likely yield a sufficient model.

set.seed(1)
train <- sample(nrow(College) * 0.7)
Ctrain <- College[train, ]
Ctest <- College[-train, ]

regfit.fwd=regsubsets(Outstate~., data=Ctrain, nvmax=17, method="forward")

regfit.summary <- summary(regfit.fwd)
par(mfrow=c(2,2))
plot(regfit.summary$rss,xlab="Number of Variables",ylab="RSS",type="l")
plot(regfit.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l")
plot(regfit.summary$cp,xlab="Number of Variables",ylab="Adjusted Cp",type="l")
plot(regfit.summary$bic,xlab="Number of Variables",ylab="Adjusted BIC",type="l")

which.max(regfit.summary$adjr2)

## [1] 13

which.min(regfit.summary$cp)

## [1] 12

which.min(regfit.summary$bic)

## [1] 12

regfit.summary

## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = Ctrain, nvmax = 17, method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 17
## Selection Algorithm: forward
##           PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 )  " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 )  "*"        " "  " "    " "    " "       " "       " "        
## 9  ( 1 )  "*"        " "  "*"    " "    " "       " "       " "        
## 10  ( 1 ) "*"        " "  "*"    "*"    " "       " "       " "        
## 11  ( 1 ) "*"        "*"  "*"    "*"    " "       " "       " "        
## 12  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       " "        
## 13  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       " "        
## 14  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 15  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 16  ( 1 ) "*"        "*"  "*"    "*"    "*"       " "       "*"        
## 17  ( 1 ) "*"        "*"  "*"    "*"    "*"       "*"       "*"        
##           P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 )  " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 )  " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 )  " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 )  " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 )  " "         "*"        " "   "*"      "*" "*"      " "      
## 9  ( 1 )  " "         "*"        " "   "*"      "*" "*"      " "      
## 10  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 11  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 12  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
## 13  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
## 14  ( 1 ) " "         "*"        " "   "*"      "*" "*"      "*"      
## 15  ( 1 ) " "         "*"        "*"   "*"      "*" "*"      "*"      
## 16  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
## 17  ( 1 ) "*"         "*"        "*"   "*"      "*" "*"      "*"      
##           perc.alumni Expend Grad.Rate
## 1  ( 1 )  " "         "*"    " "      
## 2  ( 1 )  " "         "*"    " "      
## 3  ( 1 )  " "         "*"    " "      
## 4  ( 1 )  "*"         "*"    " "      
## 5  ( 1 )  "*"         "*"    " "      
## 6  ( 1 )  "*"         "*"    "*"      
## 7  ( 1 )  "*"         "*"    "*"      
## 8  ( 1 )  "*"         "*"    "*"      
## 9  ( 1 )  "*"         "*"    "*"      
## 10  ( 1 ) "*"         "*"    "*"      
## 11  ( 1 ) "*"         "*"    "*"      
## 12  ( 1 ) "*"         "*"    "*"      
## 13  ( 1 ) "*"         "*"    "*"      
## 14  ( 1 ) "*"         "*"    "*"      
## 15  ( 1 ) "*"         "*"    "*"      
## 16  ( 1 ) "*"         "*"    "*"      
## 17  ( 1 ) "*"         "*"    "*"

test_mat <- model.matrix(Outstate ~ ., data = Ctest)

val.errors <- rep(NA, 17)
for(i in 1:17) {
  coef <- coef(regfit.fwd, id=i)
  pred.fwd <- test_mat[, names(coef)]%*%coef
  val.errors[i] <- mean((Ctest$Outstate-pred.fwd)^2)
}

which.min(val.errors)

## [1] 13

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
Private, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate were the 6 variables. Expend, PhD, Room.board, and Grad.Rate have non-linear relationships with Outstate.

library(gam)

## Loading required package: splines

## Loading required package: foreach

## Loaded gam 1.20

library(splines)

gam_fit <- gam(Outstate ~ Private + s(Room.Board) + s(PhD, 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), data = Ctrain)

par(mfrow=c(2, 3))
plot(gam_fit, se=TRUE, col="green")

(c) Evaluate the model obtained on the test set, and explain the results obtained.
The MSE for the GAM using the above 6 variables is 4104102.

summary(gam_fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD, 4) + 
##     s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), data = Ctrain)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6741.75 -1044.33    37.93  1159.33  7139.55 
## 
## (Dispersion Parameter for gaussian family taken to be 3253406)
## 
##     Null Deviance: 8614032615 on 542 degrees of freedom
## Residual Deviance: 1695024814 on 521 degrees of freedom
## AIC: 9706.91 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 1810946422 1810946422 556.631 < 2.2e-16 ***
## s(Room.Board)       1 1703398779 1703398779 523.574 < 2.2e-16 ***
## s(PhD, 4)           1  667322951  667322951 205.115 < 2.2e-16 ***
## s(perc.alumni, 4)   1  343712478  343712478 105.647 < 2.2e-16 ***
## s(Expend, 4)        1  802314787  802314787 246.608 < 2.2e-16 ***
## s(Grad.Rate, 4)     1  112120479  112120479  34.462 7.742e-09 ***
## Residuals         521 1695024814    3253406                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df Npar F     Pr(F)    
## (Intercept)                                   
## Private                                       
## s(Room.Board)           3  3.629  0.012948 *  
## s(PhD, 4)               3  3.846  0.009642 ** 
## s(perc.alumni, 4)       3  1.027  0.380413    
## s(Expend, 4)            3 36.475 < 2.2e-16 ***
## s(Grad.Rate, 4)         3  2.797  0.039645 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

pred.gam <- predict(gam_fit, newdata = Ctest)
error.gam <- mean((Ctest$Outstate - pred.gam)^2)
error.gam

## [1] 4104102

val.errors[6] - error.gam

## [1] 96175.82

For which variables, if any, is there evidence of a non-linear relationship with the response?
According to the Anova for nonparametric effects, Expend, PhD, Room.board, and Grad.Rate have non-linear relationships with Outstate.