Chapter 07

Assignment 6: 06, 10

6) Analyze the Wage data set considered throughout this chapter:

a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR)
## Warning: package 'ISLR' was built under R version 4.1.3
library(boot)

set.seed(1)
possible_combos = rep(NA, 10)
for (i in 1:10) { # polynomial degrees 1 through 10
  glm.fit = glm(wage ~ poly(age, i), data = Wage)
  possible_combos[i] = cv.glm(Wage, glm.fit, K = 10)$delta[2] # adjusted 10-fold CV error
}
plot(1:10, possible_combos, xlab = "Degree", ylab  = "Cross Validation Error", type = "l", pch = 20, lwd = 2, ylim =c(1590, 1700))
min.point = min(possible_combos)
sd.points = sd(possible_combos)
deg.min = which.min(possible_combos)
abline(h = min.point + .2 * sd.points, col = "blue", lty = "dashed")
abline(h = min.point - .2 * sd.points, col = "blue", lty = "dashed")
legend("topright", ".2 SD deveation line", lty = "dashed", col = "red")
points(deg.min, possible_combos[deg.min], col = "blue", cex =2, pch = 19)

We obtain the minimum test MSE at degree 9. However, the CV plot with the SD lines shows that d = 3 is approximately the smallest degree that still gives a reasonably small CV error.
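A small sketch of that rule, reusing the 0.2*SD band drawn on the plot: pick the smallest degree whose CV error falls inside the band around the minimum.

tolerance = min.point + 0.2 * sd.points  # same band as the dashed lines above
min(which(possible_combos <= tolerance)) # smallest degree whose CV error falls inside the band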

Let's now run the ANOVA hypothesis tests:

fit.1 = lm(wage~poly(age, 1), data=Wage)
fit.2 = lm(wage~poly(age, 2), data=Wage)
fit.3 = lm(wage~poly(age, 3), data=Wage)
fit.4 = lm(wage~poly(age, 4), data=Wage)
fit.5 = lm(wage~poly(age, 5), data=Wage)
fit.6 = lm(wage~poly(age, 6), data=Wage)
fit.7 = lm(wage~poly(age, 7), data=Wage)
fit.8 = lm(wage~poly(age, 8), data=Wage)
fit.9 = lm(wage~poly(age, 9), data=Wage)
fit.10 = lm(wage~poly(age, 10), data=Wage)
anova(fit.1, fit.2, fit.3, fit.4, fit.5, fit.6, fit.7, fit.8, fit.9, fit.10)
## Analysis of Variance Table
## 
## Model  1: wage ~ poly(age, 1)
## Model  2: wage ~ poly(age, 2)
## Model  3: wage ~ poly(age, 3)
## Model  4: wage ~ poly(age, 4)
## Model  5: wage ~ poly(age, 5)
## Model  6: wage ~ poly(age, 6)
## Model  7: wage ~ poly(age, 7)
## Model  8: wage ~ poly(age, 8)
## Model  9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
##    Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1    2998 5022216                                    
## 2    2997 4793430  1    228786 143.7638 < 2.2e-16 ***
## 3    2996 4777674  1     15756   9.9005  0.001669 ** 
## 4    2995 4771604  1      6070   3.8143  0.050909 .  
## 5    2994 4770322  1      1283   0.8059  0.369398    
## 6    2993 4766389  1      3932   2.4709  0.116074    
## 7    2992 4763834  1      2555   1.6057  0.205199    
## 8    2991 4763707  1       127   0.0796  0.777865    
## 9    2990 4756703  1      7004   4.4014  0.035994 *  
## 10   2989 4756701  1         3   0.0017  0.967529    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on this information, all polynomial terms above degree 3 are insignificant at the 1% significance level.
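Since poly() fits orthogonal polynomials, the same conclusions can be read off the coefficient table of the largest fit: each squared t-statistic equals the corresponding F-statistic in the ANOVA table above. A quick cross-check, not required by the exercise:

coef(summary(fit.10)) # squaring each t value reproduces the F values above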

Make a plot of the resulting polynomial fit to the data:

plot(wage~age, data=Wage, col="darkgrey")
agelims = range(Wage$age)
age.grid = seq(from=agelims[1], to=agelims[2])
lm.fit = lm(wage~poly(age, 3), data=Wage)
lm.pred = predict(lm.fit, data.frame(age=age.grid))
lines(age.grid, lm.pred, col="blue", lwd=2)

b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

We search over 2 to 10 cuts, using 10-fold cross-validation to choose among them.
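For reference, cut() bins age into intervals and returns a factor whose levels are those intervals; these levels become the regions of the step function. A quick illustration, not part of the CV loop:

table(cut(Wage$age, 4)) # e.g. four equal-width age bins and their counts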

cuts = rep(NA, 10) # index i = number of cuts; i = 1 is left NA (one bin is just an intercept)
for (i in 2:10) {
  Wage$age.cut = cut(Wage$age, i)
  lm.fit = glm(wage ~ age.cut, data = Wage)
  cuts[i] = cv.glm(Wage, lm.fit, K = 10)$delta[2] # adjusted 10-fold CV error
}

plot(2:10, cuts[-1], xlab="Number of cuts", ylab="CV error", type="l", pch=20, lwd=2)
deg.min = which.min(cuts)
points(deg.min, cuts[deg.min], col = "blue", cex =2, pch = 19)

The CV error appears to be lowest at 8 cuts.

Now let's fit the step function with 8 cuts and plot it:

lm.fit = glm(wage ~cut(age, 8), data = Wage)
agelims = range(Wage$age)
age.grid = seq(from = agelims[1], to = agelims[2])
lm.pred = predict(lm.fit, data.frame(age = age.grid))
plot(wage ~age, data = Wage, col = "darkgrey")
lines(age.grid, lm.pred, col = "Blue", lwd =2)

Slay.

10) This exercise relates to the College data set.

a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
# Train Test Split
set.seed(1)
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(leaps)
## Warning: package 'leaps' was built under R version 4.1.3
# create an ID column so anti_join can identify the held-out rows
college = as.data.frame(College)
college$id <- 1:nrow(college)

# use 70% of the data set as the training set and 30% as the test set
train <- college %>% dplyr::sample_frac(0.70)
test  <- dplyr::anti_join(college, train, by = 'id')
# fitting
reg.fit = regsubsets(Outstate ~ . - id, data = train, nvmax = 17, method = "forward") # exclude the id column
reg.summary = summary(reg.fit)
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
min.cp = min(reg.summary$cp)
std.cp = sd(reg.summary$cp)
abline(h = min.cp + 0.2 * std.cp, col = "blue", lty = 2)
abline(h = min.cp - 0.2 * std.cp, col = "blue", lty = 2)
plot(reg.summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
min.bic = min(reg.summary$bic)
std.bic = sd(reg.summary$bic)
abline(h = min.bic + 0.2 * std.bic, col = "blue", lty = 2)
abline(h = min.bic - 0.2 * std.bic, col = "blue", lty = 2)
plot(reg.summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R2", 
    type = "l", ylim = c(0.4, 0.84))
max.adjr2 = max(reg.summary$adjr2)
std.adjr2 = sd(reg.summary$adjr2)
abline(h = max.adjr2 + 0.2 * std.adjr2, col = "blue", lty = 2)
abline(h = max.adjr2 - 0.2 * std.adjr2, col = "blue", lty = 2)

The Cp, BIC, and adjusted R^2 criteria yield slightly different optima. The blue lines mark 0.2 standard deviations from the optimum of each criterion. Across all three plots, a subset size of 6 is the smallest size that falls within 0.2 SD of the optimum.
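The same comparison can be made programmatically, a sketch using the 0.2 SD tolerance from the plots:

min(which(reg.summary$cp <= min.cp + 0.2 * std.cp))          # smallest size within tolerance for Cp
min(which(reg.summary$bic <= min.bic + 0.2 * std.bic))       # for BIC
min(which(reg.summary$adjr2 >= max.adjr2 - 0.2 * std.adjr2)) # for adjusted R^2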

Thus, 6 is the best subset size.

# refit forward selection on the full data set and report the chosen 6-variable model
reg.fit = regsubsets(Outstate ~ ., data = College, method = "forward")
regnames = coef(reg.fit, id = 6)
names(regnames)
## [1] "(Intercept)" "PrivateYes"  "Room.Board"  "PhD"         "perc.alumni"
## [6] "Expend"      "Grad.Rate"

b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

library(gam)
## Warning: package 'gam' was built under R version 4.1.3
## Loading required package: splines
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.1.3
## Loaded gam 1.22-1
gam.fit = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, df = 2), data = train)

# plotting the 6 additive models
par(mfrow = c(2, 3))
plot(gam.fit, se = T, col = "blue")

c) Evaluate the model obtained on the test set, and explain the results.

gam.pred = predict(gam.fit, test)
gam.err = mean((test$Outstate - gam.pred)^2) # test MSE
gam.err
## [1] 3176107
gam.tss = mean((test$Outstate - mean(test$Outstate))^2)
test.r2 = 1 - gam.err/gam.tss # test R^2
test.r2
## [1] 0.7707912

We obtain a test R^2 of about 0.77 using the GAM with 6 predictors; roughly 77% of the variance in out-of-state tuition is explained by the model.
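For context, a quick comparison against an ordinary linear model on the same six predictors shows how much the non-linear terms help. A sketch, not required by the exercise; the exact numbers depend on the split:

ols.fit  = lm(Outstate ~ Private + Room.Board + PhD + perc.alumni + Expend + Grad.Rate, data = train)
ols.pred = predict(ols.fit, test)
1 - mean((test$Outstate - ols.pred)^2) / gam.tss # test R^2 of the linear fit, for comparison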

d) For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.fit)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, 
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate, 
##     df = 2), data = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7568.91 -1135.11    72.18  1263.47  7875.88 
## 
## (Dispersion Parameter for gaussian family taken to be 3625458)
## 
##     Null Deviance: 9263945675 on 543 degrees of freedom
## Residual Deviance: 1917866878 on 528.9999 degrees of freedom
## AIC: 9776.894 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private                  1 2414263354 2414263354 665.919 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1720068079 1720068079 474.442 < 2.2e-16 ***
## s(PhD, df = 2)           1  617485163  617485163 170.319 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  464768186  464768186 128.196 < 2.2e-16 ***
## s(Expend, df = 5)        1  752411278  752411278 207.536 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1  100320529  100320529  27.671 2.092e-07 ***
## Residuals              529 1917866878    3625458                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                        Npar Df  Npar F   Pr(F)    
## (Intercept)                                       
## Private                                           
## s(Room.Board, df = 2)        1  3.3157 0.06918 .  
## s(PhD, df = 2)               1  0.6590 0.41728    
## s(perc.alumni, df = 2)       1  0.9819 0.32220    
## s(Expend, df = 5)            4 27.1965 < 2e-16 ***
## s(Grad.Rate, df = 2)         1  1.6590 0.19830    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the ANOVA for nonparametric effects, there is strong evidence of a non-linear relationship between out-of-state tuition and Expend (p < 2e-16). Room.Board shows only marginal evidence of non-linearity (p = 0.07), while PhD, perc.alumni, and Grad.Rate show no significant evidence of a non-linear relationship with the response.
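One way to probe the Expend finding directly is to compare the fitted GAM with a model that enters Expend linearly; a sketch using anova() on the two nested GAMs:

gam.lin = gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) + s(perc.alumni, df = 2) + Expend + s(Grad.Rate, df = 2), data = train)
anova(gam.lin, gam.fit, test = "F") # a small p-value supports a non-linear Expend effect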