library(ISLR)
Let’s run 10-fold cross-validation in a for loop and save the resulting MSE estimates.
library(boot)
## Warning: package 'boot' was built under R version 3.6.2
# set seed
set.seed(1)
# 10-fold cross-validation
# Try polynomial degrees 1 through 7 to find the best degree
# Use delta[2], the adjusted cross-validation estimate; the adjustment
# compensates for the bias introduced by not using leave-one-out CV
cv.error = rep(0, 7)
for (d in 1:7) {
  poly.fit = glm(wage ~ poly(age, d), data = Wage)
  cv.error[d] = cv.glm(Wage, poly.fit, K = 10)$delta[2]
}
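As an aside, here is a minimal sketch of what delta contains (demo.fit is just an illustrative name): cv.glm() returns two values, the raw K-fold estimate and the bias-corrected one we store above.
# Illustration: delta[1] is the raw K-fold CV estimate, delta[2] the corrected one
demo.fit = glm(wage ~ poly(age, 2), data = Wage)
cv.glm(Wage, demo.fit, K = 10)$delta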
Let’s find the optimal degree d by locating the minimum cross-validation error.
mse = min(cv.error)
d = which.min(cv.error)
cat("The optimal degree d is ",d, " and the corresponding lowest 10-fold CV error is ", mse)
## The optimal degree d is 7 and the corresponding lowest 10-fold CV error is 1593.962
The optimal degree comes out as 7 when we search up to degree 7, but such a high degree risks overfitting, since the data set has only 3,000 observations.
Let’s plot the cross-validation errors.
plot(cv.error,xlab='Degree of Polynomial',ylab='Cross Validation Error',type='l')
points(d,cv.error[d],pch=20,cex=2,col='red')
points(5,cv.error[5],pch=20,cex=2,col='green')
title("Cross Validation Errors at different degrees of Polynomial")
The curve shows that the cross-validation error drops sharply up to degree 4, reaches a low value at degree 5, rises again at degree 6, and then drops back; this fluctuation at higher degrees suggests overfitting.
So we conclude that degree = 5 is the optimal degree.
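To see how flat the CV curve is near its minimum, a quick sketch: the percentage by which each degree’s error exceeds the best one (small values at degrees 3–5 would further support preferring a simpler model).
# Percent by which each degree's CV error exceeds the minimum
round(100 * (cv.error - mse) / mse, 2)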
Let’s fit models of different degrees separately and run an ANOVA to compare them.
poly.fit.1 = lm(wage~poly(age, 1), data = Wage)
poly.fit.2 = lm(wage~poly(age, 2), data = Wage)
poly.fit.3 = lm(wage~poly(age, 3), data = Wage)
poly.fit.4 = lm(wage~poly(age, 4), data = Wage)
poly.fit.5 = lm(wage~poly(age, 5), data = Wage)
anova(poly.fit.1, poly.fit.2, poly.fit.3, poly.fit.4, poly.fit.5)
## Analysis of Variance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The 10-fold CV approach shows degree 5 having the lowest CV error and therefore the best fit. However, the ANOVA shows that the degree-3 polynomial (at a 0.05 p-value cutoff) suffices. That is also the simpler model; as we increase the degree, the chance of overfitting the data grows.
However, let’s confirm which model is best by checking the coefficient summary of the degree-5 polynomial fit.
coef(summary(poly.fit.5))
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 111.70361 0.7287647 153.2780243 0.000000e+00
## poly(age, 5)1 447.06785 39.9160847 11.2001930 1.491111e-28
## poly(age, 5)2 -478.31581 39.9160847 -11.9830341 2.367734e-32
## poly(age, 5)3 125.52169 39.9160847 3.1446392 1.679213e-03
## poly(age, 5)4 -77.91118 39.9160847 -1.9518743 5.104623e-02
## poly(age, 5)5 -35.81289 39.9160847 -0.8972045 3.696820e-01
In fact, because poly() produces orthogonal polynomials, the p-values for the degree-3, degree-4, and degree-5 coefficients are identical to the corresponding ANOVA p-values, so this summary likewise supports accepting the simpler cubic polynomial (i.e. degree = 3).
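As a sanity check (this relies on poly() producing orthogonal polynomials), squaring the t-statistics from the summary should reproduce the F column of the ANOVA table:
# Squared t-statistics; compare with the F column of the ANOVA table above
coef(summary(poly.fit.5))[-1, "t value"]^2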
Let’s make a plot of the resulting polynomial fit, i.e. the polynomial with degree = 3.
agelimits=range(Wage$age)
age.grid=seq(from=agelimits[1],to=agelimits[2])
preds = predict(poly.fit.3, newdata = list(age = age.grid), se = TRUE)
# Approximate 95% confidence bands: fit +/- 2 standard errors
se.bands = cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
par(mfrow=c(1,1),mar=c(4.5,4.5,1,1),oma=c(0,0,2,0))
plot(Wage$age,Wage$wage,xlim=agelimits,cex =.5,col="darkgrey",xlab='Age',ylab='Wage')
lines(age.grid,preds$fit,lwd=2,col="blue")
matlines(age.grid,se.bands,lwd=1,col="darkgreen",lty=3)
title("Degree-3 Polynomial")
Let’s identify the optimal number of cuts for a step function by performing cross-validation.
# set the seed
set.seed(2)
# Initialize with NA because we iterate from 2 (assuming a minimum of 2 cuts);
# we also don't want a leading 0, or which.min() would return 1
cv.error.cut <- rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.error.cut[i] <- cv.glm(Wage, fit, K = 10)$delta[2]
}
d.cut <- which.min(cv.error.cut)
plot(2:10, cv.error.cut[-1], xlab = "Cuts", ylab = "MSE", type = "l")
points(d.cut, cv.error.cut[d.cut], col = 'red', cex = 2, pch = 20)
title("Cross Validation Errors at different cuts")
10-fold cross-validation identifies 8 as the optimal number of cuts.
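Purely for illustration, we can inspect the age intervals that cut() creates with 8 breaks and how many observations fall into each:
# The 8 age intervals and their observation counts
table(cut(Wage$age, d.cut))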
Let’s fit the step function with 8 cuts and plot it.
step.fit = glm(wage~cut(age, d.cut), data = Wage)
step.preds = predict(step.fit , newdata=list(age = age.grid), se = TRUE)
plot(Wage$age, Wage$wage, xlim = agelimits, xlab = 'Age', ylab = 'Wage', cex = 0.5, col = 'darkgrey')
title('Step-function with 8 cuts')
lines(age.grid, step.preds$fit, lwd = 3, col = 'blue')
Now we turn to the College data set, where the goal is to predict out-of-state tuition (Outstate).
nrow(College)
## [1] 777
# Total number of predictors: total columns minus the response variable (Outstate)
p=dim(College)[2]-1
p
## [1] 17
# Split the data into a training and test data set
# set the seed
set.seed(3)
# Create an 80%/20% train/test split
split = 0.80
train <- sample(1:nrow(College), split*nrow(College))
College.train <- College[ train,]
College.test <- College[-train,]
nrow(College.test)
## [1] 156
# load leaps library
library(leaps)
# Perform forward stepwise selection on training set
# Fit the model with all predictors on training dataset and out-of-state tuition as response variable
fit.fwdstep = regsubsets(Outstate~.,data=College.train,method = 'forward',nvmax = p)
# Get the summary
fit.summary<-summary(fit.fwdstep)
# plot for RSS with minimum RSS value highlighted
plot(fit.summary$rss ,xlab="Number of Variables ",ylab="RSS",type="l")
points(which.min(fit.summary$rss),fit.summary$rss[which.min(fit.summary$rss)], col="red",cex=2,pch=20)
points(6,fit.summary$rss[6], col="Blue",cex=2,pch=20)
points(11,fit.summary$rss[11], col="Green",cex=2,pch=20)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('Residual Sum of Squares(RSS)')
The RSS plot above shows that the training RSS decreases monotonically, although the improvement is considerably smaller from the 6th predictor onward (blue dot) and minimal beyond the 11th predictor (green dot). This suggests we may need a validation approach on the test data set to decide. First, however, let’s use R-squared, adjusted R-squared, Mallows’ Cp, and BIC on the training data to determine the best number of predictors.
Let’s take the predictor count at which each statistic first comes within 0.2 standard deviations of its optimum.
# Let's create plots of R-squared, adjusted R-squared, Mallows' Cp, and BIC to pick the best subset size
par(mfrow = c(2, 2))
# plot for Rsquared with maximum R2 value highlighted and within .2 standard deviation limits
sd.r2 = sd(fit.summary$rsq)
max.r2 = max(fit.summary$rsq)
p.r2 = which.max(fit.summary$rsq)
plot(fit.summary$rsq ,xlab="Number of Variables ", ylab="RSquared",type="l", ylim = c(min(fit.summary$rsq),1))
points(p.r2,fit.summary$rsq[p.r2], col="red",cex=2,pch=20)
points(11,fit.summary$rsq[11],col="green",cex=2,pch=20)
abline(h = max.r2 - 0.2 * sd.r2, col = "red", lty = 2)
abline(h = max.r2 + 0.2 * sd.r2, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('RSquared(R2)')
# plot for Adjusted Rsquared with maximum adjR2 value highlighted and within .2 standard deviation
sd.adjr2 = sd(fit.summary$adjr2)
max.adjr2 = max(fit.summary$adjr2)
p.adjr2 = which.max(fit.summary$adjr2)
plot(fit.summary$adjr2,xlab="Number of Variables",ylab="Adjusted RSq",type="l",ylim=c(min(fit.summary$adjr2),1))
points(p.adjr2,fit.summary$adjr2[p.adjr2], col="red",cex=2,pch=20)
abline(h = max.adjr2 - 0.2 * sd.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 + 0.2 * sd.adjr2, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('Adjusted RSquare(adjR2)')
# plot for cp with minimum cp value highlighted and within .2 standard deviation
sd.cp = sd(fit.summary$cp)
min.cp = min(fit.summary$cp)
p.cp = which.min(fit.summary$cp)
plot(fit.summary$cp ,xlab="Number of Variables ",ylab="Cp", type='l')
points(p.cp,fit.summary$cp[p.cp],col="red",cex=2,pch=20)
abline(h = min.cp - 0.2 * sd.cp, col = "red", lty = 2)
abline(h = min.cp + 0.2 * sd.cp, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('Mallows Cp(Cp)')
# plot for BIC with minimum BIC value highlighted and within .2 standard deviation
sd.bic = sd(fit.summary$bic)
min.bic = min(fit.summary$bic)
p.bic = which.min(fit.summary$bic)
plot(fit.summary$bic ,xlab="Number of Variables ",ylab="BIC",type='l')
points(p.bic,fit.summary$bic[p.bic],col="red",cex=2,pch=20)
abline(h = min.bic - 0.2 * sd.bic, col = "red", lty = 2)
abline(h = min.bic + 0.2 * sd.bic, col = "red", lty = 2)
axis(side=1,at=seq(1,p,2),labels = seq(1,p,2))
title('BIC(BIC)')
The charts above indicate that once we add the 6th predictor, R-squared, adjusted R-squared, Mallows’ Cp, and BIC are all within 0.2 standard deviations of their optima, so 6 predictors could serve as the best subset. However, judged purely on adjusted R-squared, Cp, and BIC from the training data, the best subset size appears to lie between 11 and 14.
Without running the validation approach on the test data set, and bearing in mind that training-set error metrics improve almost monotonically with model size, we proceed with 11 predictors as a satisfactory model for now.
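For reference, the heuristic described above can be computed directly; a minimal sketch for adjusted R-squared (the same pattern applies to the other statistics):
# Smallest subset size whose adjusted R^2 is within 0.2 SD of the maximum
min(which(fit.summary$adjr2 >= max.adjr2 - 0.2 * sd.adjr2))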
Let’s find the 11 predictors and their coefficients.
coef(fit.fwdstep,11)
## (Intercept) PrivateYes Apps Accept Top10perc
## -2170.1566337 2280.5778509 -0.1980874 0.7020644 28.3906163
## F.Undergrad Room.Board Personal Terminal perc.alumni
## -0.2199194 0.8278636 -0.1724561 30.8253124 43.6722984
## Expend Grad.Rate
## 0.1951104 22.7274480
# Load the library gam to use the smoothing spline
library(gam)
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.6.2
## Loaded gam 1.16.1
# Fit a GAM on the training data with 11 predictors (largely those selected in 10.a)
# Private is a qualitative variable, so we leave it as is (no spline)
set.seed(4)
gam.fit <- gam(Outstate ~ Private + s(Accept, 4) + s(F.Undergrad, 4) +
                 s(Room.Board, 3) + s(Personal, 3) + s(PhD, 3) +
                 s(Terminal, 3) + s(S.F.Ratio, 3) + s(perc.alumni, 4) +
                 s(Expend, 4) + s(Grad.Rate, 4), data = College.train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow=c(3,4))
plot.Gam(gam.fit, se=TRUE, col='blue')
We ran the GAM with smoothing splines on every predictor except Private, which is qualitative. The plots indicate that several of the features are close to linear and may not need spline functions.
Personal, PhD, and Expend appear to be the ones that genuinely benefit from a spline.
The rest look more linear; in fact, the fitted functions for Terminal and Room.Board look essentially linear even though we smoothed them.
We could run an ANOVA to find a better fit by comparing models with linear terms versus spline terms (a sketch of this idea appears after the final summary below).
Also, cross-validating which variables are really required could improve interpretability and reduce overfitting, since the number of observations is limited.
# set the seed
set.seed(5)
# Write a predict method for regsubsets objects (the package provides none)
predict.regsubsets = function(object, newdata, id, ...) {
  # Recover the formula from the original regsubsets() call
  form = as.formula(object$call[[2]])
  # Build the design matrix for the new data
  mat = model.matrix(form, newdata)
  # Multiply the columns matching the id-th model's coefficients
  coefi = coef(object, id = id)
  mat[, names(coefi)] %*% coefi
}
# Let's find the test-set MSE of the original forward stepwise model (10.a) with 11 predictors
fwd.preds<-predict.regsubsets(fit.fwdstep,newdata=College.test,id=11)
fwd.mse <- mean((College.test$Outstate - fwd.preds)^2)
fwd.mse
## [1] 3736806
# Let's find the test-set MSE of the GAM model with 11 predictors (10.b)
gam.preds<-predict(gam.fit,College.test)
gam.mse <- mean((College.test$Outstate - gam.preds)^2)
gam.mse
## [1] 3266643
Now, let’s look for the best model with a validation approach, evaluating models of every subset size on the test data set to identify the optimal number of predictors.
# Build the test-set design matrix and find the subset size with the lowest validation error
test.mat = model.matrix(Outstate ~ ., data = College.test)
val.errors = rep(NA, p)
for (i in 1:p) {
  coefi = coef(fit.fwdstep, id = i)
  pred = test.mat[, names(coefi)] %*% coefi
  val.errors[i] = mean((College.test$Outstate - pred)^2)
}
which.min(val.errors)
## [1] 6
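A quick visual check of the validation errors, mirroring the earlier CV plots:
plot(val.errors, xlab = "Number of Predictors", ylab = "Test MSE", type = "b")
points(which.min(val.errors), min(val.errors), col = "red", cex = 2, pch = 20)
title("Validation-set MSE at different subset sizes")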
Let’s identify the 6 predictors.
# Identify the predictors and the coefficients by running the model on the entire College set
fit.6 <- regsubsets(Outstate ~ ., data = College, method = "forward")
coeffs <- coef(fit.6, id = 6)
coeffs
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3553.2345268 2768.6347025 0.9679086 35.5283359 48.4221031
## Expend Grad.Rate
## 0.2210255 29.7119093
Refitting on the full data set changes the selected predictors: PhD enters, while Apps, Accept, Top10perc, F.Undergrad, Personal, and Terminal drop out. Let’s fit a GAM on this 6-predictor list and check linearity.
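A small sketch to compare the two coefficient sets programmatically:
# In the 11-variable model but not the 6-variable one, and vice versa
setdiff(names(coef(fit.fwdstep, 11)), names(coeffs))
setdiff(names(coeffs), names(coef(fit.fwdstep, 11)))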
# GAM with 6 predictors; leave Private as is (qualitative) and don't apply a spline to it
gam.fit.6 <- gam(Outstate ~ Private + s(Room.Board, 3) + s(PhD, 3) +
                   s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
                 data = College.train)
## Warning in model.matrix.default(mt, mf, contrasts): non-list contrasts argument
## ignored
par(mfrow=c(2,3))
plot.Gam(gam.fit.6, se=TRUE, col='blue')
The chart above clearly indicates that a spline is needed for ‘Expend’ and ‘Grad.Rate’; all the other predictors look close to linear.
Let’s find the MSE on the test data set.
# Let's find the test-set MSE of the GAM model with 6 predictors (10.c)
gam.preds.6<-predict(gam.fit.6,College.test)
gam.mse.6 <- mean((College.test$Outstate - gam.preds.6)^2)
gam.mse.6
## [1] 3188957
# Get all the MSEs into a matrix and print it
MSE_all <- matrix(c(fwd.mse,gam.mse,gam.mse.6))
dimnames(MSE_all) = list(c("MSE-Fwd step - 10a","MSE-GAM-10b","MSE-GAM with 6 predictors -10c"))
MSE_all
## [,1]
## MSE-Fwd step - 10a 3736806
## MSE-GAM-10b 3266643
## MSE-GAM with 6 predictors -10c 3188957
The forward stepwise fit on the training set has monotonically decreasing RSS, so we used a validation approach, running models of every subset size on the test set, to identify the best subset of predictors. The GAM plots showed that linearity does not hold for several predictors, so we fit GAMs with spline functions on all predictors except Private, and that assumption held up: the GAM with 6 predictors achieved the lowest test-set MSE, fitting the test data best.
Let’s confirm the models’ explained variability by evaluating R-squared on the test set, using the miscTools library.
library(miscTools)
# Calculate R-squared for each model's predictions on the test data set
R2_fwd <- rSquared(College.test$Outstate, resid = as.matrix(College.test$Outstate - fwd.preds))
R2_gam <- rSquared(College.test$Outstate, resid = as.matrix(College.test$Outstate - gam.preds))
R2_gam.6 <- rSquared(College.test$Outstate, resid = as.matrix(College.test$Outstate - gam.preds.6))
# Get all the RSquared into a matrix and print it
R2_all <- matrix(c(R2_fwd,R2_gam,R2_gam.6))
dimnames(R2_all) = list(c("R2-Fwd step - 10a","R2-GAM-10b","R2-GAM with 6 predictors -10c"))
R2_all
## [,1]
## R2-Fwd step - 10a 0.7829694
## R2-GAM-10b 0.8102761
## R2-GAM with 6 predictors -10c 0.8147880
Finally, we conclude that the 6-predictor GAM performed best on the test data set, and it is clear that linearity does not hold for some of the predictors. About 81.5% of the variability in Outstate is explained by this GAM fit with 6 predictors.
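For reference, the same test-set R-squared can be computed by hand as 1 - RSS/TSS, which should reproduce the rSquared() value above:
# Manual test-set R^2 for the 6-predictor GAM
1 - sum((College.test$Outstate - gam.preds.6)^2) /
  sum((College.test$Outstate - mean(College.test$Outstate))^2)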
# Let's print the summary
summary(gam.fit.6)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 3) + s(PhD,
## 3) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
## data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7358.0067 -1129.6487 0.8298 1236.7461 7924.4294
##
## (Dispersion Parameter for gaussian family taken to be 3566999)
##
## Null Deviance: 9873303076 on 620 degrees of freedom
## Residual Deviance: 2143763985 on 600.9993 degrees of freedom
## AIC: 11153.17
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2605271147 2605271147 730.382 < 2.2e-16 ***
## s(Room.Board, 3) 1 1957860876 1957860876 548.882 < 2.2e-16 ***
## s(PhD, 3) 1 637208618 637208618 178.640 < 2.2e-16 ***
## s(perc.alumni, 4) 1 398355862 398355862 111.678 < 2.2e-16 ***
## s(Expend, 4) 1 783088579 783088579 219.537 < 2.2e-16 ***
## s(Grad.Rate, 4) 1 140623891 140623891 39.424 6.541e-10 ***
## Residuals 601 2143763985 3566999
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, 3) 2 2.632 0.07279 .
## s(PhD, 3) 2 1.568 0.20938
## s(perc.alumni, 4) 3 1.416 0.23702
## s(Expend, 4) 3 33.973 < 2e-16 ***
## s(Grad.Rate, 4) 3 2.179 0.08942 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA for nonparametric effects shows strong evidence of a non-linear relationship between ‘Outstate’ and ‘Expend’, which is also evident in the GAM plot. ‘Room.Board’ and ‘Grad.Rate’ are the next closest (at a 0.1 p-value cutoff).
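As a follow-up sketch implementing the ANOVA comparison suggested earlier (gam.lin.expend is an illustrative name), we could refit with Expend entering linearly and test it against the spline version; a small p-value would support keeping the spline:
# Same 6-predictor GAM, but with a linear term for Expend
gam.lin.expend <- gam(Outstate ~ Private + s(Room.Board, 3) + s(PhD, 3) +
                        s(perc.alumni, 4) + Expend + s(Grad.Rate, 4),
                      data = College.train)
anova(gam.lin.expend, gam.fit.6, test = "F")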