hw 6

In this exercise, you will further analyze the Wage data set considered throughout this chapter.

Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR)

## Warning: package 'ISLR' was built under R version 4.0.3

library(boot)
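set.seed(1)
degree <- 5                      # highest polynomial degree to try
cv.errs <- rep(NA, degree)
for (i in 1:degree) {
  fit <- glm(wage ~ poly(age, i), data = Wage)
  # cv.glm() defaults to leave-one-out CV; delta[1] is the raw CV estimate of the MSE
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}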

# Plot the CV error against the degree and mark the minimum in red
plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'CV MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)
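
The loop above relies on the default of cv.glm(), which is leave-one-out CV and refits the model once per observation. A cheaper alternative is K-fold CV through the K argument of cv.glm(). The following is a minimal sketch, assuming 10 folds are acceptable; max.degree and cv.errs.10 are names introduced here for illustration and are not part of the original solution.

```r
library(ISLR)   # Wage data
library(boot)   # cv.glm()

set.seed(1)     # fold assignment is random, so the seed matters here
max.degree <- 5
cv.errs.10 <- rep(NA, max.degree)
for (i in 1:max.degree) {
  fit <- glm(wage ~ poly(age, i), data = Wage)
  # K = 10 requests 10-fold CV instead of the leave-one-out default
  cv.errs.10[i] <- cv.glm(Wage, fit, K = 10)$delta[1]
}
# Degree with the smallest 10-fold CV error
which.min(cv.errs.10)
```

Because the folds are random, the selected degree may change with the seed when the CV curve is nearly flat.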

# Scatter the data and overlay the degree-3 fit evaluated on an integer grid of ages
plot(wage ~ age, data = Wage, col = "darkgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 3), data = Wage)
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
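
The exercise also asks how the CV choice compares with hypothesis testing using ANOVA. One way to make that comparison is to fit the nested polynomial models of degree 1 to 5 and pass them to anova(), which tests each model against the next smaller one with an F-test. This is a sketch under the assumption that degrees 1 to 5 are the candidates of interest; the names fit.1 to fit.5 and anova.tab are introduced here for illustration.

```r
library(ISLR)   # Wage data

# Nested polynomial fits of increasing degree
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)

# Each row tests whether the extra degree improves on the previous model
anova.tab <- anova(fit.1, fit.2, fit.3, fit.4, fit.5)
anova.tab
```

A small p-value in a row suggests that the extra polynomial term is needed; the lowest degree after which the p-values stop being small can then be compared with deg.min from cross-validation.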

Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

degree <- 10                     # largest number of intervals to try
cv.errs <- rep(NA, degree)
# cut() needs at least 2 intervals, so cv.errs[1] stays NA
for (i in 2:degree) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'CV MSE', type = 'l')
# which.min() skips the NA, so deg.min is the selected number of intervals
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

# Refit with 8 intervals and overlay the step function on the data
plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)

res <- cut(c(1,5,2,3,8), 2)
res

## [1] (0.993,4.5] (4.5,8.01]  (0.993,4.5] (0.993,4.5] (4.5,8.01] 
## Levels: (0.993,4.5] (4.5,8.01]

length(res)

## [1] 5

class(res[1])

## [1] "factor"
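
The output above shows that cut(x, 2) splits the range of x into two equal-width intervals and pads the outer limits slightly (0.993 and 8.01 rather than 1 and 8). Because the break points are derived from whatever vector is passed in, calling cut() on a different vector can produce different intervals. The sketch below, with the illustrative names brks, new.x and new.cut, shows one way to fix the break points once and reuse them on new values.

```r
x <- c(1, 5, 2, 3, 8)

# Two equal-width intervals computed from the range of x
res <- cut(x, 2)
table(res)

# Explicit break points that can be reused on other data
brks <- seq(min(x), max(x), length.out = 3)   # 1.0, 4.5, 8.0
new.x <- c(2, 7)
new.cut <- cut(new.x, breaks = brks, include.lowest = TRUE)
new.cut
```

Passing breaks explicitly keeps the factor levels identical between the data used for fitting and the data used for prediction.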

This question relates to the College data set.

Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

library(gam)

## Warning: package 'gam' was built under R version 4.0.5

## Loading required package: splines

## Loading required package: foreach

## Warning: package 'foreach' was built under R version 4.0.4

## Loaded gam 1.20

library(Metrics)

## Warning: package 'Metrics' was built under R version 4.0.5

library(ISLR)
library(leaps)

## Warning: package 'leaps' was built under R version 4.0.4

# Random half of the rows for training; the negated indices select the test rows
train <- sample(1:nrow(College), nrow(College) / 2)
test <- -train
fit <- regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.summary <- summary(fit)
fit.summary

## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = College, subset = train, 
##     method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  "*"    " "    " "       " "       " "        
## 7  ( 1 ) "*"        "*"  "*"    " "    " "       " "       " "        
## 8  ( 1 ) "*"        "*"  "*"    " "    " "       " "       "*"        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 8  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         " "    " "      
## 2  ( 1 ) "*"         " "    " "      
## 3  ( 1 ) "*"         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    " "      
## 6  ( 1 ) "*"         "*"    " "      
## 7  ( 1 ) "*"         "*"    " "      
## 8  ( 1 ) "*"         "*"    " "

coef(fit, id = 6)

##   (Intercept)    PrivateYes        Accept    Room.Board           PhD 
## -3268.8517597  3442.7753501     0.1284518     0.9910055    44.7129044 
##   perc.alumni        Expend 
##    68.3972798     0.1835779
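
The six-predictor model above is picked by inspection. The object returned by summary() on a regsubsets fit also carries BIC, Cp and adjusted R-squared for each model size, which can guide that choice. This is a minimal sketch, assuming a seed of 1 for the split (the original split does not set one at this point); best.bic, best.cp and best.adjr2 are names introduced here.

```r
library(ISLR)    # College data
library(leaps)   # regsubsets()

set.seed(1)      # assumption: fixed seed so the split is reproducible
train <- sample(1:nrow(College), nrow(College) / 2)
fit <- regsubsets(Outstate ~ ., data = College, subset = train, method = 'forward')
fit.summary <- summary(fit)

# Model size preferred by each criterion (1 to 8 with the default nvmax)
best.bic <- which.min(fit.summary$bic)
best.cp <- which.min(fit.summary$cp)
best.adjr2 <- which.max(fit.summary$adjr2)
c(bic = best.bic, cp = best.cp, adjr2 = best.adjr2)
```

The criteria need not agree; BIC penalises model size more heavily and tends to favour smaller models.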

Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

gam.mod <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) +
                 s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
               data = College, subset = train)
par(mfrow = c(2,3))
plot(gam.mod, se = TRUE, col = 'blue')

Evaluate the model obtained on the test set, and explain the results obtained.

# Test-set R-squared: 1 - RSS/TSS computed on the held-out rows
preds <- predict(gam.mod, College[test, ])
RSS <- sum((College[test, ]$Outstate - preds)^2)
TSS <- sum((College[test, ]$Outstate - mean(College[test, ]$Outstate))^2)
1 - (RSS / TSS)

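## [1] 0.7583146

The test R-squared is about 0.76, so the GAM explains roughly 76% of the variance in out-of-state tuition on the held-out half of the data.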
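
The quantity computed above is 1 - RSS/TSS on the test rows. As a sanity check on that formula, the sketch below wraps it in a small helper and evaluates it on toy numbers; r_squared is a hypothetical helper introduced here, not a function from any of the loaded packages.

```r
# Hypothetical helper: proportion of variance explained by the predictions
r_squared <- function(actual, predicted) {
  rss <- sum((actual - predicted)^2)        # residual sum of squares
  tss <- sum((actual - mean(actual))^2)     # total sum of squares
  1 - rss / tss
}

actual <- c(10, 12, 14, 16)
perfect <- r_squared(actual, actual)                  # perfect predictions give 1
baseline <- r_squared(actual, rep(mean(actual), 4))   # predicting the mean gives 0
c(perfect = perfect, baseline = baseline)
```

A value of 0 corresponds to always predicting the test-set mean, so a negative value would indicate a model that does worse than that baseline.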

For which variables, if any, is there evidence of a non-linear relationship with the response?

summary(gam.mod)

## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 
##     5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), 
##     data = College, subset = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7246.55 -1077.75    64.38  1155.21  4453.84 
## 
## (Dispersion Parameter for gaussian family taken to be 3162268)
## 
##     Null Deviance: 6261948859 on 387 degrees of freedom
## Residual Deviance: 1141577322 on 360.9996 degrees of freedom
## AIC: 6936.23 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                    Df     Sum Sq    Mean Sq  F value    Pr(>F)    
## Private             1 1885216382 1885216382 596.1596 < 2.2e-16 ***
## s(Room.Board, 5)    1 1299441721 1299441721 410.9208 < 2.2e-16 ***
## s(Terminal, 5)      1  460777320  460777320 145.7110 < 2.2e-16 ***
## s(perc.alumni, 5)   1  211389027  211389027  66.8473 5.037e-15 ***
## s(Expend, 5)        1  442634710  442634710 139.9738 < 2.2e-16 ***
## s(Grad.Rate, 5)     1   30729204   30729204   9.7175  0.001972 ** 
## Residuals         361 1141577322    3162268                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                   Npar Df  Npar F     Pr(F)    
## (Intercept)                                    
## Private                                        
## s(Room.Board, 5)        4  1.0837   0.36434    
## s(Terminal, 5)          4  2.4721   0.04426 *  
## s(perc.alumni, 5)       4  0.6529   0.62514    
## s(Expend, 5)            4 16.5784 1.748e-12 ***
## s(Grad.Rate, 5)         4  2.1604   0.07299 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

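The ANOVA for nonparametric effects gives strong evidence of a non-linear relationship between Expend and Outstate (p = 1.7e-12). The evidence is weak for Terminal (p = 0.044) and marginal for Grad.Rate (p = 0.073), and there is none for Room.Board (p = 0.36) or perc.alumni (p = 0.63).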
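
A complementary check for a single predictor is to fit two GAMs that differ only in whether that predictor enters linearly or through a smoothing spline, and compare them with an F-test via anova(). The sketch below does this for Expend. It assumes a seed of 1 for the split, so the numbers will not match the output above; gam.lin, gam.smooth and cmp are names introduced here.

```r
library(ISLR)   # College data
library(gam)    # gam(), s()

set.seed(1)     # assumption: fixed seed so the split is reproducible
train <- sample(1:nrow(College), nrow(College) / 2)

# Expend enters linearly
gam.lin <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) +
                 s(perc.alumni, 5) + Expend + s(Grad.Rate, 5),
               data = College, subset = train)
# Expend enters as a smoothing spline with 5 degrees of freedom
gam.smooth <- gam(Outstate ~ Private + s(Room.Board, 5) + s(Terminal, 5) +
                    s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5),
                  data = College, subset = train)

# F-test of the linear model against the smooth one
cmp <- anova(gam.lin, gam.smooth, test = 'F')
cmp
```

A small p-value in the second row would support keeping the smooth term for Expend.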

Sudeep Jacob

4/23/2021