Assignment 6

6. In this exercise, you will further analyze the Wage dataset coming with the ISLR package.

a).Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

data("Wage")
set.seed(123)
k=10
folds = sample(1:k,nrow(Wage),replace=TRUE)
cvErr = matrix(nrow=k,ncol=5)
for(d in 1:5){
   for(l in 1:k){
      fit = lm(wage~poly(age,d),data=Wage,subset=(folds!=l))
      preds = predict(fit,newdata=Wage[(folds==l),])
      cvErr[l,d] = mean((preds-Wage[folds==l,c("wage")])^2)
   }   
}

x = apply(cvErr, 2, mean)
plot(x) #average of folds per model
points.default(which.min(x), x[which.min(x)], col = 'red', pch = 20)

fit1 = lm(wage ~ age, data = Wage)
fit2 = lm(wage ~ poly(age, 2), data = Wage)
fit3 = lm(wage ~ poly(age, 3), data = Wage)
fit4 = lm(wage ~ poly(age, 4), data = Wage)
fit5 = lm(wage ~ poly(age, 5), data = Wage)
anova(fit1, fit2, fit3, fit4, fit5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
agelims = range(Wage$age)
age.grid = seq(from=agelims[1],to=agelims[2])
fitpoly = lm(wage ~ poly(age, 4), data = Wage)
preds = predict(fitpoly,newdata=list(age=age.grid),se=TRUE)
se.bands = cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)
plot(wage ~ age, data = Wage, col = "gray")
title("Degree-4 Polynomial",outer=T)
lines(age.grid,preds$fit,lwd=2,col="darkblue")
matlines(age.grid,se.bands,lwd=1,col="lightblue",lty=3)

The model with a polynomial of 4 has the lost average MSE across 10 folds. Comparing the polynomial models on the ANOVA table, we can see that cubed polynomial is the last to have a value significance less than 0.05 and the 4th power polynomial model is only significant if our cut off was 0.10. We can see why this is when plotting our polynomial, as the outliers at higher ages seem to be where the majority of the curves show in our plotted model.

(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.

cvErr2 = rep(NA, 10)
for (i in 2:10) {
  Wage$age.cut =  cut(Wage$age, i)
  lm.fit = glm(wage ~ age.cut, data = Wage)
  cvErr2[i] = cv.glm(Wage, lm.fit, K = 10)$delta[1]
}

plot(2:10, cvErr2[-1], xlab = "Cuts", ylab = "Test MSE")
d.min <- which.min(cvErr2)
points(which.min(cvErr2), cvErr2[which.min(cvErr2)], col = "red", pch = 20)

fitcuts = glm(wage ~ cut(age, 8), data = Wage)
preds = predict(fitcuts, data.frame(age = age.grid))

plot(wage ~ age, data = Wage, col = "gray")
lines(age.grid, preds, col = "red", lwd = 2)

10. This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

rm(list=ls())
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.3
set.seed(321)
data("College")
trainrow = createDataPartition(College$Apps, p = .8, list = FALSE)
ctrain = College[trainrow,]
ctest = College[-trainrow,]
fitsub = regsubsets(Outstate ~ ., data = College, subset = trainrow, method = 'forward', nvmax=17)
regsum = summary(fitsub)
regsum$rsq
##  [1] 0.4927125 0.6298241 0.6956560 0.7250009 0.7409728 0.7508795 0.7545699
##  [8] 0.7569189 0.7634278 0.7704561 0.7731312 0.7737121 0.7741284 0.7743658
## [15] 0.7745213 0.7745319 0.7745359
par(mfrow=c(2,2))
plot(regsum$rss, xlab="Number of Variables ", ylab="RSS", type="l")
plot(regsum$adjr2, xlab="Number of Variables ", ylab="Adjusted RSq",type="l")
which.max(regsum$adjr2)
## [1] 13
points(13,regsum$adjr2[13], col="red", cex=2, pch =20)
plot(regsum$cp,xlab="Number of Variables ",ylab="Cp",type='l')
which.min(regsum$cp)
## [1] 11
points(11,regsum$cp[11], col ="red", cex=2, pch =20)
which.min(regsum$bic)
## [1] 11
plot(regsum$bic ,xlab="Number of Variables ", ylab="BIC", type='l')
points(11,regsum$bic[11],col="red",cex=2,pch =20)

coef(fitsub, id = 11)
##   (Intercept)    PrivateYes          Apps        Accept        Enroll 
## -2408.4572437  2308.4736529    -0.3857758     0.9670271    -1.0795261 
##     Top10perc    Room.Board      Personal      Terminal   perc.alumni 
##    19.4150375     0.7832799    -0.3302781    28.7931569    38.3252122 
##        Expend     Grad.Rate 
##     0.2938918    25.8598714

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

fitgam <- gam(Outstate ~ Private + s(Apps, 5) + s(Accept, 5) + s(Enroll, 5) + s(Top10perc, 5) + s(Room.Board, 5) + s(Personal, 5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data = College, subset = trainrow)
par(mfrow = c(2,3))
plot(fitgam, se = TRUE, col = 'blue')

While most variables have a near linear relationship with out of state tuition, Personal, Expend, and Grad Rate seem to have significant non linear relationships with out of state tuition.

(c) Evaluate the model obtained on the test set, and explain the results obtained.

preds = predict(fitgam, ctest)
gamErr =  mean((ctest$Outstate - preds)^2)
gamErr
## [1] 3574513
RSS = sum((ctest$Outstate - preds)^2) 
TSS = sum((ctest$Outstate - mean(ctest$Outstate)) ^ 2)
1 - (RSS/TSS)
## [1] 0.7979348

The model returned a mean squared error rate of 3574513 and an r-squared of 0.7979348. This means that our model explains a significant portion of the variance on the test data set, enough to be consider satisfactory for predicting out of state tuition.

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(fitgam)
## 
## Call: gam(formula = Outstate ~ Private + s(Apps, 5) + s(Accept, 5) + 
##     s(Enroll, 5) + s(Top10perc, 5) + s(Room.Board, 5) + s(Personal, 
##     5) + s(Terminal, 5) + s(perc.alumni, 5) + s(Expend, 5) + 
##     s(Grad.Rate, 5), data = College, subset = trainrow)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -6596.44 -1039.50    96.49  1117.30  7588.37 
## 
## (Dispersion Parameter for gaussian family taken to be 2999996)
## 
##     Null Deviance: 9805340913 on 623 degrees of freedom
## Residual Deviance: 1715994535 on 571.9989 degrees of freedom
## AIC: 11128.95 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private             1 2947008796 2947008796 982.337 < 2.2e-16 ***
## s(Apps, 5)          1 1046088964 1046088964 348.697 < 2.2e-16 ***
## s(Accept, 5)        1  100251687  100251687  33.417 1.223e-08 ***
## s(Enroll, 5)        1  177674274  177674274  59.225 6.201e-14 ***
## s(Top10perc, 5)     1 1128823231 1128823231 376.275 < 2.2e-16 ***
## s(Room.Board, 5)    1  730670580  730670580 243.557 < 2.2e-16 ***
## s(Personal, 5)      1   36724053   36724053  12.241 0.0005038 ***
## s(Terminal, 5)      1  130694980  130694980  43.565 9.367e-11 ***
## s(perc.alumni, 5)   1  187481492  187481492  62.494 1.383e-14 ***
## s(Expend, 5)        1  565828704  565828704 188.610 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   62685772   62685772  20.895 5.948e-06 ***
## Residuals         572 1715994535    2999996                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Apps, 5)              4  1.8978 0.1093212    
## s(Accept, 5)            4 16.7803 5.127e-13 ***
## s(Enroll, 5)            4  5.8503 0.0001281 ***
## s(Top10perc, 5)         4  0.8450 0.4970018    
## s(Room.Board, 5)        4  1.5543 0.1850930    
## s(Personal, 5)          4  2.8798 0.0221822 *  
## s(Terminal, 5)          4  1.2199 0.3011918    
## s(perc.alumni, 5)       4  1.6815 0.1527158    
## s(Expend, 5)            4 16.4491 9.126e-13 ***
## s(Grad.Rate, 5)         4  2.5615 0.0376134 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Accept, Enroll, Personal, Expend, and Grad Rate all tested significant for nonparametric effects, suggesting a potential non-linear relationship with the data. This reinforces our earlier observations of the individual variables against out of state tuition plots.