Problem 6. In this exercise, you will further analyze the [Wage](https://rdrr.io/cran/ISLR/man/Wage.html) data set considered throughout this chapter.

library(ISLR)
library(caret)
library(boot)
attach(Wage)

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

set.seed(5)
# 10-fold cross-validation over polynomial degrees 1 through 5
deltas <- rep(NA, 5)
for (i in 1:5) {
    poly.fit <- glm(wage ~ poly(age, i), data = Wage)
    deltas[i] <- cv.glm(Wage, poly.fit, K = 10)$delta[1]
}
plot(1:5, deltas, xlab = "Degree", ylab = "Test MSE", type = "l")
d.min <- which.min(deltas)
points(d.min, deltas[d.min], col = "blue", cex = 2, pch = 20)

# Use ANOVA for hypothesis testing. The null hypothesis is that the simpler
# model is sufficient to explain the data relative to the more complex model.
# The models must be nested for this F-test to be valid.
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)
  

anova(fit.1, fit.2, fit.3, fit.4, fit.5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Each row tests the null hypothesis that the simpler model is sufficient. The quadratic term is highly significant and the cubic term is significant at the 1% level, so ANOVA points to the cubic model. The fourth-degree term is only borderline (p ≈ 0.051) and does not improve the fit enough to warrant the added complexity.
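
Because poly() generates orthogonal polynomials, the same p-values can also be read directly from the coefficient t-tests of the largest model: the squared t-statistics reproduce the F-statistics in the ANOVA table above. A quick check:

# t-statistics for the degree-5 fit; squaring each one recovers the
# corresponding F-statistic from the ANOVA table (orthogonal polynomials only)
coef(summary(fit.5))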

# Refit at the CV-chosen degree (poly.fit currently holds the degree-5 fit
# from the last loop iteration)
poly.fit <- glm(wage ~ poly(age, d.min), data = Wage)
# Create a grid spanning the observed ages
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
preds <- predict(poly.fit, newdata = list(age = age.grid), se = TRUE)
# Approximate 95% confidence bands: fit plus/minus two standard errors
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot the data, the fitted polynomial, and the confidence bands
plot(age, wage, xlim = agelims, cex = .5, col = 'darkgrey')
title('Degree-4 Polynomial')
lines(age.grid, preds$fit, lwd = 2, col = 'darkblue')
matlines(age.grid, se.bands, lwd = 1, col = 'blue', lty = 3)

Cross-validation selects a 4th-degree polynomial, broadly consistent with ANOVA, although relying on ANOVA alone I would likely have gone with a cubic polynomial.
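
Since the differences among the higher degrees are usually small, it can help to print the estimated CV errors themselves before committing to a degree (the exact values depend on the seed):

# Tabulate the CV error by degree to see how close degrees 3 and 4 are
data.frame(degree = 1:5, cv.mse = round(deltas, 1))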

(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

# 10-fold cross-validation over 2 to 10 cuts (cvs[1] is left NA, since a
# one-interval step function is just the intercept)
cvs <- rep(NA, 10)
for (i in 2:10) {
    Wage$age.cut <- cut(Wage$age, i)
    step.fit <- glm(wage ~ age.cut, data = Wage)
    cvs[i] <- cv.glm(Wage, step.fit, K = 10)$delta[1]
}
plot(2:10, cvs[-1], xlab = "Cuts", ylab = "Test MSE", type = "l")
d.min <- which.min(cvs)
points(d.min, cvs[d.min], col = "blue", cex = 2, pch = 20)

# Create a grid spanning the observed ages
agelims <- range(age)
age.grid <- seq(from = agelims[1], to = agelims[2])
# Refit with the CV-chosen number of cuts; cut() is re-evaluated on the
# prediction grid, which is safe here because age.grid spans the same range
# as age, so the break points are identical
step.fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(step.fit, newdata = list(age = age.grid), se = TRUE)
# Approximate 95% confidence bands: fit plus/minus two standard errors
se.bands <- cbind(preds$fit + 2 * preds$se.fit, preds$fit - 2 * preds$se.fit)
# Plot the data, the fitted step function, and the confidence bands
plot(age, wage, xlim = agelims, cex = .5, col = 'darkgrey')
title('Step Function with 8 Cuts')
lines(age.grid, preds$fit, lwd = 2, col = 'darkblue')
matlines(age.grid, se.bands, lwd = 1, col = 'blue', lty = 3)
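
To see where the eight intervals fall, the break points chosen by cut() can be inspected directly:

# The fitted intervals and the number of observations in each
table(cut(age, 8))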

Problem 10. This question relates to the [College](https://rdrr.io/cran/ISLR/man/College.html) data set.

attach(College)

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

set.seed(1)
# 50/50 train/test split
inTrain <- sample(1:nrow(College), nrow(College) / 2)
trainset <- College[inTrain, ]
testset <- College[-inTrain, ]
library(leaps)
fs <- regsubsets(Outstate ~ ., data = trainset, method = 'forward', nvmax = 17)
reg.summary <- summary(fs)
reg.summary$adjr2
##  [1] 0.4486533 0.6157583 0.6806427 0.7262750 0.7418534 0.7516133 0.7538978
##  [8] 0.7559309 0.7577210 0.7587138 0.7587900 0.7646744 0.7690583 0.7706887
## [15] 0.7703905 0.7698240 0.7692238
which.min(reg.summary$rss)
## [1] 17
which.max(reg.summary$adjr2)
## [1] 14
which.min(reg.summary$cp)
## [1] 14
which.min(reg.summary$bic)
## [1] 6
par(mfrow = c(2, 2))
plot(reg.summary$rss, xlab = "Number of Variables", ylab = "RSS", type = "l")
mtext("Based on Training Errors", side = 3, line = -2, outer = TRUE)
points(17, reg.summary$rss[17], col = 'blue', cex = 2, pch = 20)
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(14, reg.summary$cp[14], col = 'blue', cex = 2, pch = 20)

plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R^2", type = "l")
points(14, reg.summary$adjr2[14], col = 'blue', cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(6, reg.summary$bic[6], col = 'blue', cex = 2, pch = 20)
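
The criteria disagree (BIC favors 6 variables, while Cp and adjusted R^2 favor 14), so some judgment is required. One illustrative tie-breaking rule, sketched below, picks the smallest model whose adjusted R^2 comes within a small tolerance of the maximum; the 0.2-standard-deviation threshold is an arbitrary choice for illustration, not part of the exercise.

# Parsimony rule (sketch, assumed threshold): smallest model within
# 0.2 standard deviations of the best adjusted R^2
std <- sd(reg.summary$adjr2)
min(which(reg.summary$adjr2 > max(reg.summary$adjr2) - 0.2 * std))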

# Coefficients of the 13-variable model chosen by forward stepwise selection;
# its adjusted R^2 is nearly as high as the 14-variable optimum
coef(fs, 13)
##   (Intercept)    PrivateYes          Apps        Accept        Enroll 
## -1490.8137457  2269.6712944    -0.3648672     0.8457219    -0.9071488 
##     Top10perc    Room.Board         Books      Personal      Terminal 
##    36.8227416     1.0722170    -1.0117675    -0.3041390    29.3315128 
##     S.F.Ratio   perc.alumni        Expend     Grad.Rate 
##   -67.5678909    46.9746618     0.1584946    25.4121347
# Adjusted R^2 for the chosen 13-variable model
reg.summary$adjr2[13]
## [1] 0.7690583

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(splines)
# Natural cubic splines with 4 degrees of freedom for each predictor
gam1 <- lm(Outstate ~ ns(Apps, 4) + ns(Accept, 4) + ns(Enroll, 4) + ns(Top10perc, 4) + ns(Room.Board, 4) + ns(Books, 4) + ns(Personal, 4) + ns(Terminal, 4) + ns(S.F.Ratio, 4) + ns(Expend, 4) + ns(Grad.Rate, 4) + Private, data = trainset)
library(gam)
# s() specifies smoothing splines; gam() fits the model by backfitting
gam.m3 <- gam(Outstate ~ s(Apps, 4) + s(Accept, 4) + s(Enroll, 4) + s(Top10perc, 4) + s(Room.Board, 4) + s(Books, 4) + s(Personal, 4) + s(Terminal, 4) + s(S.F.Ratio, 4) + s(Expend, 4) + s(Grad.Rate, 4) + Private, data = trainset)

Plots of the smoothing spline fits:

par(mfrow=c(3, 3))
plot(gam.m3, se=TRUE, col='blue')

Plots of the natural spline fits:

par(mfrow=c(3, 3))
plot.Gam(gam1, se=TRUE, col='blue')

The smoothing spline plots suggest potentially non-linear relationships between Outstate and S.F.Ratio, Books, and Terminal.

(c) Evaluate the model obtained on the test set, and explain the results obtained.

gam.preds <- predict(gam.m3, newdata = testset)
# Test mean squared error
gam.mse <- mean((testset$Outstate - gam.preds)^2)
gam.mse
## [1] 3233367

The test MSE of about 3.23 million (an RMSE of roughly $1,800 in out-of-state tuition) is only slightly above the training MSE of roughly 3.09 million (the residual deviance in the summary below divided by the 388 training observations), so the GAM generalizes well rather than overfitting the training set.
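
For context, the GAM's test error can be compared against an ordinary linear model on the same predictors; this is a sketch, and the numeric outcome depends on the seed and split:

# Baseline: plain linear fit on the same predictors (illustrative comparison)
lm.fit <- lm(Outstate ~ Apps + Accept + Enroll + Top10perc + Room.Board +
             Books + Personal + Terminal + S.F.Ratio + Expend + Grad.Rate +
             Private, data = trainset)
lm.preds <- predict(lm.fit, newdata = testset)
mean((testset$Outstate - lm.preds)^2)  # compare against gam.mse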

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.m3)
## 
## Call: gam(formula = Outstate ~ s(Apps, 4) + s(Accept, 4) + s(Enroll, 
##     4) + s(Top10perc, 4) + s(Room.Board, 4) + s(Books, 4) + s(Personal, 
##     4) + s(Terminal, 4) + s(S.F.Ratio, 4) + s(Expend, 4) + s(Grad.Rate, 
##     4) + Private, data = trainset)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6115.99 -1113.56   -21.26  1106.84  7500.00 
## 
## (Dispersion Parameter for gaussian family taken to be 3500659)
## 
##     Null Deviance: 6989966760 on 387 degrees of freedom
## Residual Deviance: 1197225139 on 341.9999 degrees of freedom
## AIC: 6992.696 
## 
## Number of Local Scoring Iterations: 4 
## 
## Anova for Parametric Effects
##                   Df     Sum Sq    Mean Sq F value    Pr(>F)    
## s(Apps, 4)         1   75186000   75186000  21.478 5.098e-06 ***
## s(Accept, 4)       1  415874022  415874022 118.799 < 2.2e-16 ***
## s(Enroll, 4)       1  458093985  458093985 130.859 < 2.2e-16 ***
## s(Top10perc, 4)    1 1946218444 1946218444 555.958 < 2.2e-16 ***
## s(Room.Board, 4)   1  770127462  770127462 219.995 < 2.2e-16 ***
## s(Books, 4)        1   49412082   49412082  14.115 0.0002021 ***
## s(Personal, 4)     1   38670606   38670606  11.047 0.0009847 ***
## s(Terminal, 4)     1    1277757    1277757   0.365 0.5461404    
## s(S.F.Ratio, 4)    1  210211035  210211035  60.049 1.065e-13 ***
## s(Expend, 4)       1  427754379  427754379 122.192 < 2.2e-16 ***
## s(Grad.Rate, 4)    1  143519446  143519446  40.998 5.032e-10 ***
## Private            1  207886683  207886683  59.385 1.419e-13 ***
## Residuals        342 1197225139    3500659                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                  Npar Df  Npar F     Pr(F)    
## (Intercept)                                   
## s(Apps, 4)             3  2.1545  0.093182 .  
## s(Accept, 4)           3  4.2873  0.005473 ** 
## s(Enroll, 4)           3  2.2215  0.085427 .  
## s(Top10perc, 4)        3  0.6579  0.578456    
## s(Room.Board, 4)       3  1.5065  0.212609    
## s(Books, 4)            3  1.5833  0.193151    
## s(Personal, 4)         3  1.2084  0.306611    
## s(Terminal, 4)         3  1.0843  0.355772    
## s(S.F.Ratio, 4)        3  3.7856  0.010751 *  
## s(Expend, 4)           3 19.9913 5.807e-12 ***
## s(Grad.Rate, 4)        3  0.5308  0.661429    
## Private                                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the p-values in the "Anova for Nonparametric Effects" table, Accept, S.F.Ratio, and Expend show evidence of non-linear relationships with Outstate, with Expend the strongest.
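
As a cross-check on the strongest case, Expend, one could compare a GAM in which Expend enters linearly against the smoothing-spline version via an F-test; this is a sketch, and a small p-value would support keeping the non-linear term:

# GAM identical to gam.m3 except that Expend enters linearly
gam.lin <- gam(Outstate ~ s(Apps, 4) + s(Accept, 4) + s(Enroll, 4) +
               s(Top10perc, 4) + s(Room.Board, 4) + s(Books, 4) +
               s(Personal, 4) + s(Terminal, 4) + s(S.F.Ratio, 4) +
               Expend + s(Grad.Rate, 4) + Private, data = trainset)
anova(gam.lin, gam.m3, test = "F")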