Assignment 6, Chapter 7

Problem 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.
(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR2)
library(boot)
attach(Wage)
set.seed(1)
poly.mse=c()
for(degree in 1:10){
  fit=glm(wage~poly(age,degree,raw=T),data=Wage)
  mse=cv.glm(fit,data=Wage,K=10)$delta[1]
  poly.mse=c(poly.mse,mse)
}
plot(poly.mse,xlab='Degree',ylab='Test MSE',type='l')
x=which.min(poly.mse)
points(x,poly.mse[x],pch=20,cex=2,col='red')

Though degree 9 is the minimum test MSE, degree 4 appears to be sufficient and will help avoid overfitting. This aligns well with the results of the ANOVA test.

plot(wage~age, data=Wage)
age.range=range(Wage$age)
age.grid=seq(from=age.range[1], to=age.range[2])
fit=lm(wage~poly(age,4), data=Wage)
preds=predict(fit, newdata=list(age=age.grid))
lines(age.grid, preds, col="red", lwd=2)

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(1)
step.mse=c()
for(degree in 2:10){
  Wage.model=model.frame(wage~cut(age,degree),data=Wage)
  names(Wage.model)=c('wage','age')
  step.fit=glm(wage~age,data=Wage.model)
  mse=cv.glm(step.fit,data=Wage.model,K=10)$delta[1]
  step.mse=c(step.mse,mse)
}
plot(step.mse,xlab='Cuts',ylab='Test MSE',type='l')
x=which.min(step.mse)
points(x,step.mse[x],pch=20,cex=2,col='red')

7 cuts is optimal.

plot(wage~age,data=Wage)
fit=glm(wage~cut(age,7), data=Wage)
preds=predict(fit, list(age=age.grid))
lines(age.grid, preds, col="red", lwd=2)

detach(Wage)

Problem 10

This question relates to the College data set.
(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

attach(College)
set.seed(1)
train=sample(dim(College)[1], dim(College)[1]/2)
test=-train

library(leaps)
fit=regsubsets(Outstate~., data=College, subset=train, method='forward')
summary(fit)

## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  " "    " "    "*"       " "       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      " " "*"      " "      
## 7  ( 1 ) " "         "*"        " "   "*"      " " "*"      " "      
## 8  ( 1 ) " "         "*"        " "   "*"      " " "*"      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         " "    " "      
## 2  ( 1 ) "*"         " "    " "      
## 3  ( 1 ) "*"         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    "*"      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"

coef(fit,id=6)

##   (Intercept)    PrivateYes    Room.Board      Terminal   perc.alumni 
## -4726.8810613  2717.7019276     1.1032433    36.9990286    59.0863753 
##        Expend     Grad.Rate 
##     0.1930814    33.8303314

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
gam.fit=gam(Outstate~Private+s(Room.Board)+s(Terminal)+s(perc.alumni)+s(Expend)+s(Grad.Rate),data=College[train,])
par(mfrow=c(2, 3))
plot(gam.fit,se=T,col='blue')

Outstate appears to be strongly non-linear with Expend.

(c) Evaluate the model obtained on the test set, and explain the results obtained.

gam.pred=predict(gam.fit, College[test,])
mean((College[test,"Outstate"]-gam.pred)^2)

## [1] 3353802

1-mean(abs(College[test,"Outstate"]-gam.pred))/mean(College[test,"Outstate"])

## [1] 0.8640176

We can predict Out of state tuition with 86% accuracy.

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(Terminal) + 
##     s(perc.alumni) + s(Expend) + s(Grad.Rate), data = College[train, 
##     ])
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7128.62 -1133.86   -74.25  1231.50  7369.50 
## 
## (Dispersion Parameter for gaussian family taken to be 3724586)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1363197370 on 365.9997 degrees of freedom
## AIC: 6995.069 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 1764398916 1764398916 473.717 < 2.2e-16 ***
## s(Room.Board)    1 1616561254 1616561254 434.024 < 2.2e-16 ***
## s(Terminal)      1  287918343  287918343  77.302 < 2.2e-16 ***
## s(perc.alumni)   1  354690429  354690429  95.230 < 2.2e-16 ***
## s(Expend)        1  601731164  601731164 161.556 < 2.2e-16 ***
## s(Grad.Rate)     1   90312393   90312393  24.248 1.284e-06 ***
## Residuals      366 1363197370    3724586                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df  Npar F     Pr(F)    
## (Intercept)                                 
## Private                                     
## s(Room.Board)        3  1.9107    0.1274    
## s(Terminal)          3  1.4636    0.2241    
## s(perc.alumni)       3  0.3498    0.7893    
## s(Expend)            3 26.1184 2.442e-15 ***
## s(Grad.Rate)         3  0.9075    0.4375    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Expend has a strong non-linear relationship with Outstate based on the Anova for nonparametric effects.

Assignment 6, Chapter 7

Adam Ramsey

2023-04-03

Problem 6

Problem 10