Problem 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

library(ISLR2)
attach(Wage)

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

Hypothesis Testing Using ANOVA to Choose the Polynomial Degree

Hypothesis testing using ANOVA with alpha = 0.05 suggests that the degree-3 (cubic) polynomial model is optimal. In each comparison below, the null hypothesis is that the simpler model is sufficient, and we reject it when the p-value is below 0.05: the quadratic-versus-cubic comparison has a p-value of about 0.0017, while the cubic-versus-quartic comparison has a p-value of about 0.051, so we stop at degree 3. The coefficient summary of the degree-5 fit, shown after the ANOVA table, tells the same story: because poly() uses orthogonal polynomials, the p-value for each term matches the corresponding ANOVA comparison (each squared t-statistic equals the F-statistic).

fit.1<-lm(wage~age,data=Wage)
fit.2<-lm(wage~poly(age,2),data=Wage)
fit.3<-lm(wage~poly(age,3),data=Wage)
fit.4<-lm(wage~poly(age,4),data=Wage)
fit.5<-lm(wage~poly(age,5),data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coef(summary(fit.5))
##                 Estimate Std. Error     t value     Pr(>|t|)
## (Intercept)    111.70361  0.7287647 153.2780243 0.000000e+00
## poly(age, 5)1  447.06785 39.9160847  11.2001930 1.491111e-28
## poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
## poly(age, 5)3  125.52169 39.9160847   3.1446392 1.679213e-03
## poly(age, 5)4  -77.91118 39.9160847  -1.9518743 5.104623e-02
## poly(age, 5)5  -35.81289 39.9160847  -0.8972045 3.696820e-01

Cross-Validation

Cross-validation identifies the degree-4 polynomial model as optimal. This is broadly consistent with the ANOVA results: although the quartic model would not have been selected by the hypothesis tests, the relevant p-value of 0.051046 is only marginally above the 0.05 threshold.

library(boot)
set.seed(1)
degree<-5
cv.errs<-rep(NA, degree)
for(i in 1:degree) {
  fit<-glm(wage~poly(age,i),data=Wage)
  cv.errs[i]<-cv.glm(Wage,fit)$delta[1]
}

The plot below shows that the degree-4 polynomial model has the lowest cross-validation estimate of the test MSE.

plot(1:degree,cv.errs,xlab ='Degree',ylab='Test MSE',type='l')
deg.min<- which.min(cv.errs)
points(deg.min,cv.errs[deg.min],col='red',cex=2,pch=19)
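
As an aside, cv.glm() without a K argument performs leave-one-out cross-validation, which can be slow on larger data sets. A k-fold version is a quick sanity check; the sketch below assumes 10 folds (our choice, not part of the analysis above) and simply reports which degree it would select.

set.seed(1)
cv.errs.k10<-rep(NA,degree)
for(i in 1:degree) {
  fit<-glm(wage~poly(age,i),data=Wage)
  cv.errs.k10[i]<-cv.glm(Wage,fit,K=10)$delta[1]  # 10-fold CV instead of LOOCV
}
which.min(cv.errs.k10)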

Degree - 4 Polynomial Model

First we use the lm() function to fit a degree-4 polynomial predicting wage from age.

fit<-lm(wage~poly(age,4,raw=TRUE),data=Wage)
coef(summary(fit))
##                                Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)               -1.841542e+02 6.004038e+01 -3.067172 0.0021802539
## poly(age, 4, raw = TRUE)1  2.124552e+01 5.886748e+00  3.609042 0.0003123618
## poly(age, 4, raw = TRUE)2 -5.638593e-01 2.061083e-01 -2.735743 0.0062606446
## poly(age, 4, raw = TRUE)3  6.810688e-03 3.065931e-03  2.221409 0.0263977518
## poly(age, 4, raw = TRUE)4 -3.203830e-05 1.641359e-05 -1.951938 0.0510386498

We then create a grid of values for age at which we want predictions.

agelims<-range(age)
range(age)
## [1] 18 80
age_grid<-seq(from=agelims[1],to=agelims[2])

Below is a plot of the degree-4 polynomial fit to the data, predicting wage from age, with approximate standard-error bands.

preds<-predict(fit,newdata=list(age=age_grid),se=TRUE)
se_bands<-cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
plot(age,wage,xlim=agelims,cex =.5,col="darkgrey")
title("Degree - 4 Polynomial")
lines(age_grid,preds$fit,lwd=2,col='darkblue')
matlines(age_grid,se_bands,lwd=1,col='red',lty=3)

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(2)
cv.errs2<-rep(NA,10)
for(i in 2:10) {
  Wage$age.cut<-cut(Wage$age,i)
  fit2<-glm(wage~age.cut,data=Wage)
  cv.errs2[i]<-cv.glm(Wage,fit2)$delta[1]
}

The loop above uses cross-validation to identify the optimal number of intervals for the step function. According to the plot below, cutting age into eight intervals yields the lowest test MSE.

plot(2:10,cv.errs2[-1],xlab ='Cuts',ylab ='Test MSE',type ='l')
deg.min2<-which.min(cv.errs2)
points(deg.min2,cv.errs2[deg.min2],col='red',cex=2,pch=19)

Below is a plot of the step function with eight cuts, fit to the full data set to predict wage using age.

step_fit<-lm(wage~cut(age,8),data=Wage)
step_preds<-predict(step_fit,data.frame(age=age_grid))
plot(age,wage,col="darkgray")
lines(age_grid,step_preds,col="darkblue",lwd=2)
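
If we also want approximate standard-error bands around the step fit, mirroring the polynomial plot in part (a), predict() with se.fit=TRUE supplies them. This is a small optional extension of the plot above, not something the exercise requires.

step_preds_se<-predict(step_fit,data.frame(age=age_grid),se.fit=TRUE)
se_bands_step<-cbind(step_preds_se$fit+2*step_preds_se$se.fit,
                     step_preds_se$fit-2*step_preds_se$se.fit)
matlines(age_grid,se_bands_step,lwd=1,col='red',lty=3)  # adds bands to the existing plot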

detach(Wage)

Problem 10

This question relates to the College data set.

library(ISLR)
library(leaps)
library(gam)
library(splines)
attach(College)
str(College)
## 'data.frame':    777 obs. of  18 variables:
##  $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Apps       : num  1660 2186 1428 417 193 ...
##  $ Accept     : num  1232 1924 1097 349 146 ...
##  $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
##  $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
##  $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
##  $ F.Undergrad: num  2885 2683 1036 510 249 ...
##  $ P.Undergrad: num  537 1227 99 63 869 ...
##  $ Outstate   : num  7440 12280 11250 12960 7560 ...
##  $ Room.Board : num  3300 6450 3750 5450 4120 ...
##  $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
##  $ Personal   : num  2200 1500 1165 875 1500 ...
##  $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
##  $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
##  $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
##  $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
##  $ Expend     : num  7041 10527 8735 19016 10922 ...
##  $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward step-wise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

We begin by splitting the data into training and test sets.

set.seed(3)
train<-sample(c(TRUE,FALSE),nrow(College),rep=TRUE)
test<-(!train)

Now, we apply regsubsets() to the training set in order to perform forward step-wise selection.

regfit_fwd=regsubsets(Outstate~.,data=College[train,],nvmax=17,method="forward")

We then make a model matrix from the test data.

test_mat=model.matrix(Outstate~.,data=College[test,])

We now run the loop below: for each size i, we extract the coefficients from regfit_fwd for the best model of that size, multiply them into the appropriate columns of the test model matrix to form the predictions, and compute the test MSE.

set.seed(5)
val.errors<-rep(NA,17)
for(i in 1:17){
 coefi<-coef(regfit_fwd,id=i)
 pred<-test_mat[,names(coefi)]%*%coefi
 val.errors[i]<-mean((College$Outstate[test]-pred)^2)
}
which.min(val.errors)
## [1] 12

From the plot below, we can see that a 12-variable model yields the lowest test MSE. However, a simpler model with only 6 variables could be sufficient, given that the test MSE does not appear to decrease substantially beyond that point.

plot(val.errors,xlab ='Variables',ylab ='Test MSE',type='b')
points(which.min(val.errors),val.errors[12],col='red',pch=20,cex=2)
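
The manual loop above is needed because leaps does not supply a predict() method for regsubsets objects. The same steps can be wrapped in a small helper, following the pattern used in the ISLR labs; the function below is a convenience sketch of our own, not part of the original analysis, which makes predictions such as those of the six-variable model a one-liner.

predict.regsubsets<-function(object,newdata,id,...){
  form<-as.formula(object$call[[2]])   # formula from the original regsubsets() call
  mat<-model.matrix(form,newdata)      # design matrix for the new data
  coefi<-coef(object,id=id)            # coefficients of the best id-variable model
  mat[,names(coefi)]%*%coefi
}
pred6<-predict(regfit_fwd,College[test,],id=6)
mean((College$Outstate[test]-pred6)^2)  # should reproduce val.errors[6]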

Best 6 Variable Model

Performing forward stepwise selection on the entire data set gives the following coefficient estimates for our six-variable model: PrivateYes (2768.6347), Room.Board (0.96791), PhD (35.52834), perc.alumni (48.4221), Expend (0.22103), and Grad.Rate (29.71191).

regfit_fwd2<-regsubsets(Outstate~.,data=College,nvmax=17,method='forward')
coef(regfit_fwd2,6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3553.2345268  2768.6347025     0.9679086    35.5283359    48.4221031 
##        Expend     Grad.Rate 
##     0.2210255    29.7119093

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

Below, a GAM is fit on the training data using smoothing splines, each with four degrees of freedom, for Room.Board, PhD, perc.alumni, Expend, and Grad.Rate. Private is qualitative, so it is left as-is and enters the model as a dummy variable rather than being smoothed.

From the plots produced below, Expend and Grad.Rate appear to have the strongest non-linear relationships with Outstate. It is also clear that private schools charge higher out-of-state tuition.

gam_fit<-gam(Outstate~Private+s(Room.Board,4)+s(PhD,4)+s(perc.alumni,4)+s(Expend,4)+s(Grad.Rate,4),data=College, subset=train)
par(mfrow=c(2,3))
plot(gam_fit,se=TRUE,col='blue')

(c) Evaluate the model obtained on the test set, and explain the results obtained.

The GAM yields an R-squared of 0.7766 (77.66%) when evaluated on the test set.

gam_pred<-predict(gam_fit,College[test, ])
RSS<-sum((College[test, ]$Outstate-gam_pred)^2)
TSS<-sum((College[test, ]$Outstate-mean(College[test, ]$Outstate))^2)
1-(RSS/TSS)
## [1] 0.7765878

We can also see that the GAM appears to outperform the six-variable linear model, since it has a smaller test MSE.

error<-mean((College[test, ]$Outstate-gam_pred)^2)
val.errors[6]
## [1] 4589374
error
## [1] 3568642
val.errors[6]-error
## [1] 1020732
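
To put this difference in more interpretable units, we can compare root MSEs, i.e. the typical prediction error in dollars of tuition (the values in the comments follow from the MSEs printed above).

sqrt(val.errors[6])  # test RMSE of the six-variable linear model, roughly 2142
sqrt(error)          # test RMSE of the GAM, roughly 1889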

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

The "Anova for Nonparametric Effects" table in the summary below suggests non-linear relationships between Outstate and each of Room.Board, Expend, and Grad.Rate, since these smooth terms have p-values below 0.05 (PhD is borderline at 0.068). The strongest evidence of non-linearity is for Expend, with a p-value of 1.449e-06.

summary(gam_fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 4) + s(PhD, 
##     4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4), 
##     data = College, subset = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7637.25 -1005.41    34.28  1245.70  7735.28 
## 
## (Dispersion Parameter for gaussian family taken to be 3506750)
## 
##     Null Deviance: 6239637206 on 381 degrees of freedom
## Residual Deviance: 1262427697 on 359.9993 degrees of freedom
## AIC: 6864.227 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 1900928038 1900928038 542.077 < 2.2e-16 ***
## s(Room.Board, 4)    1 1301977133 1301977133 371.277 < 2.2e-16 ***
## s(PhD, 4)           1  574192710  574192710 163.739 < 2.2e-16 ***
## s(perc.alumni, 4)   1  138758167  138758167  39.569 9.189e-10 ***
## s(Expend, 4)        1  507942485  507942485 144.847 < 2.2e-16 ***
## s(Grad.Rate, 4)     1   65212388   65212388  18.596 2.089e-05 ***
## Residuals         360 1262427697    3506750                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Room.Board, 4)        3  3.5467   0.01476 *  
## s(PhD, 4)               3  2.3985   0.06773 .  
## s(perc.alumni, 4)       3  1.7447   0.15747    
## s(Expend, 4)            3 10.3765 1.449e-06 ***
## s(Grad.Rate, 4)         3  3.1707   0.02440 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
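
As a further check on Expend specifically, the nested-model comparison used in the ISLR GAM lab can be applied: a GAM without Expend, one with a linear Expend term, and the model above with a smoothing spline. The two extra fits below (gam_no_expend and gam_lin_expend are our own names) are a suggested extra step rather than part of the output above; a significant F-test for the third model over the second would reinforce the evidence that a non-linear function of Expend is needed.

gam_no_expend<-gam(Outstate~Private+s(Room.Board,4)+s(PhD,4)+s(perc.alumni,4)+s(Grad.Rate,4),data=College,subset=train)
gam_lin_expend<-gam(Outstate~Private+s(Room.Board,4)+s(PhD,4)+s(perc.alumni,4)+Expend+s(Grad.Rate,4),data=College,subset=train)
anova(gam_no_expend,gam_lin_expend,gam_fit,test='F')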