Assignment 6 - Chapter 7

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.5.0 
## ✔ readr   2.1.3      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(readxl)
library(tinytex)
library(ISLR)
library(ISLR2)

## 
## Attaching package: 'ISLR2'
## 
## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit

library(MASS)

## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:ISLR2':
## 
##     Boston
## 
## The following object is masked from 'package:dplyr':
## 
##     select

library(class)
library(e1071)
library(boot)
library(caret)

## Loading required package: lattice
## 
## Attaching package: 'lattice'
## 
## The following object is masked from 'package:boot':
## 
##     melanoma
## 
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

library(glmnet)

## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## 
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## 
## Loaded glmnet 4.1-7

library(pls)

## 
## Attaching package: 'pls'
## 
## The following object is masked from 'package:caret':
## 
##     R2
## 
## The following object is masked from 'package:stats':
## 
##     loadings

library(leaps)
library(gam)

## Loading required package: splines
## Loading required package: foreach
## 
## Attaching package: 'foreach'
## 
## The following objects are masked from 'package:purrr':
## 
##     accumulate, when
## 
## Loaded gam 1.22-2

PROBLEM 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

set.seed(1)
combos = rep(NA, 10)
for (i in 1:10) {
  glm.fit = glm(wage ~ poly(age, i), data = Wage)
  # delta[2] is the bias-adjusted 10-fold CV estimate of the test MSE
  combos[i] = cv.glm(Wage, glm.fit, K = 10)$delta[2]
}

plot(1:10, combos, xlab = "Degree", ylab  = "Cross Validation Error", type = "l", pch = 20, lwd = 2, ylim =c(1590, 1700))
min.point = min(combos)
sd.points = sd(combos)
deg.min = which.min(combos)
abline(h = min.point + .2 * sd.points, col = "blue", lty = "dashed")
abline(h = min.point - .2 * sd.points, col = "blue", lty = "dashed")
legend("topright", "0.2 SD deviation line", lty = "dashed", col = "blue")
points(deg.min, combos[deg.min], col = "red", cex =2, pch = 19)
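To make clear what the cross-validation loop above estimates, here is a minimal base-R sketch of K-fold CV for a polynomial fit. The helper `kfold_mse`, the column names `x` and `y`, and the simulated data are all assumptions made for illustration; this computes the raw fold average rather than the bias-adjusted `delta[2]` that `cv.glm` returns.

```r
# Hypothetical helper: raw K-fold CV estimate of the test MSE for a
# degree-`degree` polynomial regression of y on x (column names assumed).
kfold_mse <- function(data, degree, k = 10) {
  # Randomly assign each row to one of k folds of (nearly) equal size
  folds <- sample(rep(1:k, length.out = nrow(data)))
  fold.mse <- numeric(k)
  for (j in 1:k) {
    held.out <- folds == j
    fit <- lm(y ~ poly(x, degree), data = data[!held.out, ])
    pred <- predict(fit, newdata = data[held.out, ])
    fold.mse[j] <- mean((data$y[held.out] - pred)^2)
  }
  mean(fold.mse)
}

# Simulated data with a quadratic trend (illustration only, not the Wage data)
set.seed(1)
sim <- data.frame(x = runif(300, 18, 80))
sim$y <- 20 + 4 * sim$x - 0.04 * sim$x^2 + rnorm(300, sd = 5)
cv.by.degree <- sapply(1:4, function(d) kfold_mse(sim, d))
which.min(cv.by.degree)
```

Each fold is held out once, the model is refit on the remaining rows, and the held-out squared errors are averaged.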

fit.1 = lm(wage~poly(age, 1), data=Wage)
fit.2 = lm(wage~poly(age, 2), data=Wage)
fit.3 = lm(wage~poly(age, 3), data=Wage)
fit.4 = lm(wage~poly(age, 4), data=Wage)
fit.5 = lm(wage~poly(age, 5), data=Wage)
fit.6 = lm(wage~poly(age, 6), data=Wage)
fit.7 = lm(wage~poly(age, 7), data=Wage)
fit.8 = lm(wage~poly(age, 8), data=Wage)
fit.9 = lm(wage~poly(age, 9), data=Wage)
fit.10 = lm(wage~poly(age, 10), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)

## Analysis of Variance Table
## 
## Model  1: wage ~ poly(age, 1)
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
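Because poly() builds orthogonal polynomials, the sequential F-tests in a table like the one above can also be read from the coefficient table of the largest model: each squared t-statistic equals the corresponding F-statistic. The sketch below checks this on simulated data; the names `sim`, `fit.a`, `fit.b`, and `fit.c` are assumptions, not part of the original analysis.

```r
# Simulated data (illustration only, not the Wage data)
set.seed(1)
x <- runif(200, 18, 80)
y <- 50 + 2 * x - 0.02 * x^2 + rnorm(200, sd = 10)
sim <- data.frame(x = x, y = y)

fit.a <- lm(y ~ poly(x, 1), data = sim)
fit.b <- lm(y ~ poly(x, 2), data = sim)
fit.c <- lm(y ~ poly(x, 3), data = sim)

# Sequential F-statistics for adding the degree-2 and degree-3 terms
f.stats <- anova(fit.a, fit.b, fit.c)$F[-1]
# t-statistics of the degree-2 and degree-3 terms in the largest model
t.stats <- coef(summary(fit.c))[3:4, "t value"]

all.equal(unname(t.stats^2), f.stats)
```

For the fits above, the analogous shortcut would be coef(summary(fit.10)).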

The plot above shows that the minimum CV MSE occurs at degree 9. However, the ANOVA table shows that terms beyond degree 3 add little: the degree-4 term is only borderline (p = 0.051), and apart from the degree-9 term (p = 0.036) no higher-order term is significant at the 0.05 level. This agrees with the 0.2 SD line in the plot, which the CV error curve crosses at around degree 3. Thus, we choose d = 3.
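The "smallest degree within the band" reasoning can be written as a short rule. This is a sketch under assumptions: the helper `smallest_within_band` and the toy error values are hypothetical and only illustrate the selection logic, they are not results from the Wage data.

```r
# Hypothetical helper: smallest index whose CV error lies within
# `width` standard deviations of the minimum CV error.
smallest_within_band <- function(cv_errors, width = 0.2) {
  threshold <- min(cv_errors, na.rm = TRUE) +
    width * sd(cv_errors, na.rm = TRUE)
  which(cv_errors <= threshold)[1]
}

# Made-up CV errors for degrees 1 to 6 (illustration only)
toy_errors <- c(1676, 1600, 1596, 1594, 1595, 1593)
chosen <- smallest_within_band(toy_errors)
chosen
```

Applied to the analysis above, the call would be smallest_within_band(combos).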

plot(wage~age, data=Wage, col="darkgrey")
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.fit = lm(wage~poly(age, 3), data=Wage)
lm.pred = predict(lm.fit, data.frame(age=age.grid))
lines(age.grid, lm.pred, col="red", lwd=2)

Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cuts = rep(NA, 10)
for( i in 2:10){
  Wage$age.cut = cut(Wage$age, i)
  lm.fit = glm(wage ~ age.cut, data = Wage)
  cuts[i] = cv.glm(Wage, lm.fit, K = 10)$delta[2]
}

plot(2:10, cuts[-1], xlab="Cuts", ylab="CV error", type="l", pch=20, lwd=2)
deg.min = which.min(cuts)
points(deg.min, cuts[deg.min], col = "red", cex =2, pch = 19)

The plot shows the lowest CV error at 8 cuts, so we use 8 as the cut value for the step-function fit below.
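As a reminder of what the step function does, regressing on cut() simply fits the mean response within each interval. The sketch below checks this on simulated data; all names and values in it are assumptions for illustration.

```r
# Simulated data (illustration only, not the Wage data)
set.seed(1)
x <- runif(100, 18, 80)
y <- 80 + 0.5 * x + rnorm(100, sd = 5)

bins <- cut(x, 4)          # 4 equal-width intervals over the range of x
step.fit <- lm(y ~ bins)   # piecewise-constant (step function) fit

# The fitted value in each interval equals the mean of y in that interval
bin.means <- tapply(y, bins, mean)
fitted.by.bin <- tapply(fitted(step.fit), bins, mean)
all.equal(as.numeric(fitted.by.bin), as.numeric(bin.means))
```

More cuts give a more flexible fit, which is why the number of cuts is chosen by cross-validation.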

lm.fit = glm(wage ~cut(age, 8), data = Wage)
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
lm.pred = predict(lm.fit, data.frame(age = age.grid))
plot(wage ~age, data = Wage, col = "darkgrey")
lines(age.grid, lm.pred, col = "Red", lwd =2)

PROBLEM 10

This question relates to the College data set.

Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

set.seed(1)


college.df = as.data.frame(College)
college.df$id <- 1:nrow(college.df)

# Use 75% of the data as the training set and the remaining 25% as the test set
train <- college.df %>% dplyr::sample_frac(0.75)
test  <- dplyr::anti_join(college.df, train, by = 'id')

# Exclude the helper id column so it is not treated as a predictor
reg.fit = regsubsets(Outstate ~ . - id, data = train, nvmax = 17, method = "forward")
reg.summary = summary(reg.fit)

par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min.cp = min(reg.summary$cp)
std.cp = sd(reg.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "red", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "red", lty = 2)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min.bic = min(reg.summary$bic)
std.bic = sd(reg.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "red", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "red", lty = 2)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", 
    type = "l", ylim = c(0.4, 0.84))
max.adjr2 = max(reg.summary$adjr2)
std.adjr2 = sd(reg.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "red", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "red", lty = 2)

Keeping the number of variables low helps reduce variance and the risk of overfitting. Judging from the plots, 5 looks like a good subset size.

# Refit on the full College data to list the variables in the 5-variable model
reg.fit = regsubsets(Outstate ~ ., data = College, method = "forward")
regnames = coef(reg.fit, id = 5)
names(regnames)

## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "PhD"         "perc.alumni"
## [6] "Expend"

Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

gam.fit = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5), data = train)

par(mfrow = c(2, 3))
plot(gam.fit, se = TRUE, col = "red")

The fitted functions for Room.Board, PhD, and perc.alumni have positive slopes, whereas the slope for Expend levels off to about 0 and then turns negative.

Evaluate the model obtained on the test set, and explain the results obtained.

gam.pred = predict(gam.fit, test)
gam.err = mean((test$Outstate - gam.pred)^2)
gam.err

## [1] 3344739

# Test R-squared: 1 - (test MSE) / (variance of the test response)
gam.tss = mean((test$Outstate - mean(test$Outstate))^2)
test.r2 = 1 - gam.err/gam.tss
test.r2

## [1] 0.751605

With the 5 selected predictors, the test R-squared is 0.7516, which means the model explains roughly three quarters of the variation in out-of-state tuition on the test set.
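The test R-squared used here is one minus the ratio of the residual sum of squares to the total sum of squares on the test set. A minimal sketch, with a hypothetical helper `test_r2` and toy vectors that are not from the College data:

```r
# Hypothetical helper: R-squared of predictions on held-out data
test_r2 <- function(actual, predicted) {
  rss <- sum((actual - predicted)^2)       # residual sum of squares
  tss <- sum((actual - mean(actual))^2)    # total sum of squares
  1 - rss / tss
}

# Toy values (illustration only)
actual <- c(1, 2, 3, 4)
r2.perfect <- test_r2(actual, actual)                  # perfect predictions
r2.mean    <- test_r2(actual, rep(mean(actual), 4))    # predicting the mean
c(r2.perfect, r2.mean)
```

Perfect predictions give 1 and always predicting the mean gives 0, so a value near 0.75 sits three quarters of the way between the two.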

summary(gam.fit)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5), data = train)
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max 
## -8562.2 -1208.3   127.3  1256.1  7515.8 
## 
## (Dispersion Parameter for gaussian family taken to be 3777450)
## 
##     Null Deviance: 9873788135 on 582 degrees of freedom
## Residual Deviance: 2153146707 on 570 degrees of freedom
## AIC: 10498.61 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 2528336162 2528336162  669.32 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1933168407 1933168407  511.77 < 2.2e-16 ***
## s(PhD, df = 2)           1  673366665  673366665  178.26 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  520405137  520405137  137.77 < 2.2e-16 ***
## s(Expend, df = 5)        1  769076255  769076255  203.60 < 2.2e-16 ***
## Residuals              570 2153146707    3777450                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F   Pr(F)    
## (Intercept)                                       
## Private                                           
## s(Room.Board, df = 2)        1  3.8553 0.05007 .  
## s(PhD, df = 2)               1  1.1672 0.28043    
## s(perc.alumni, df = 2)       1  1.7320 0.18868    
## s(Expend, df = 5)            4 26.6307 < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

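In the Anova for Nonparametric Effects, only Expend is significant at the .05 level (p < 2e-16), which is strong evidence of a nonlinear relationship with Outstate. Room.Board is borderline (p = 0.0501), and PhD and perc.alumni show no evidence of nonlinearity.

In the Anova for Parametric Effects, Private, Room.Board, PhD, perc.alumni, and Expend are all significant at the .05 level.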
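The idea behind a nonparametric-effects test is to ask whether a flexible term fits significantly better than a linear one. The sketch below is an analogy rather than a reproduction of the gam output: it uses natural splines from the base splines package instead of the s() smoothers, on simulated data, and every name in it is an assumption.

```r
library(splines)  # ships with base R

# Simulated data with a clearly nonlinear trend (illustration only)
set.seed(1)
sim <- data.frame(x = runif(200, 1, 100))
sim$y <- 10 * log(sim$x) + rnorm(200, sd = 1)

lin.fit <- lm(y ~ x, data = sim)               # linear term only
spl.fit <- lm(y ~ ns(x, df = 4), data = sim)   # natural cubic spline, 4 df

# F-test of the linear fit against the more flexible spline fit
nonlin.p <- anova(lin.fit, spl.fit)[2, "Pr(>F)"]
nonlin.p
```

A small p-value here plays the same role as the Expend row above: the linear term alone is not enough.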

Assignment 6 - Chapter 7 - Data Mining

Nicholas Homonko

2023-04-12
