Renee Brewington

6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.

(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

The optimal degree that was chosen by the cross validation is d=5 with an MSE of 1594.977. The results of the hypothesis testing using ANOVA suggests that model 2 with a cubic fit and model 3 with a quadratic fit is sufficient. In comparison to the MSE of the cross validation, the MSE of the cubic fit is 1599.409 and the MSE of the quadratic fit is 1594.684. The MSE results are not too different from each model.

library(ISLR) 
attach(Wage)
require(boot)

#fitting and cross validation
set.seed(1)
cv.poly <- rep(0,5)

for (i in 1:5){
glm.fit.poly <- glm(wage ~ poly(age,i),data=Wage)
cv.poly[i]<- cv.glm(Wage,glm.fit.poly,K=10)$delta[1]
}

cv.poly

## [1] 1676.826 1600.763 1598.399 1595.651 1594.977

range <- range(cv.poly)
min.cv <- min(cv.poly)
range

## [1] 1594.977 1676.826

min.cv

## [1] 1594.977

plot(cv.poly, type="b")
points(which.min(cv.poly), cv.poly[5], col="red", pch=20, cex=2)

set.seed(1)
fit.1=lm(wage~age,data=Wage)
fit.2=lm(wage~poly(age,2),data=Wage)
fit.3=lm(wage~poly(age,3),data=Wage)
fit.4=lm(wage~poly(age,4),data=Wage)
fit.5=lm(wage~poly(age,5),data=Wage)
anova(fit.1,fit.2,fit.3,fit.4,fit.5)

library(dvmisc)
get_mse(fit.1)

## [1] 1675.189

get_mse(fit.2)

## [1] 1599.409

get_mse(fit.3)

## [1] 1594.684

get_mse(fit.4)

## [1] 1593.19

get_mse(fit.5)

## [1] 1593.294

#plotting the result of the polynomial fit to the data
agelims <- range(age)
age.grid <- seq(from=agelims[1], to=agelims[2])
preds <- predict(fit.4,newdata=list(age=age.grid),se=TRUE)
se.bands <- cbind(preds$fit+2*preds$se.fit,preds$fit-2*preds$se.fit)

plot(age, wage, xlim=agelims, cex=.5, col="darkgrey")
lines(age.grid, preds$fit, lwd=2, col="darkblue")
matlines(age.grid, se.bands, lwd=1, col="red", lty=3)

(b) Fit a step function to predict wage using age, and perform crossvalidation to choose the optimal number of cuts. Make a plot of the fit obtained.

set.seed(1)
cv.step <- rep(NA,5)

for (i in 2:5){
Wage$age.cut <- cut(Wage$age,i)
glm.fit.step <- glm(wage ~ age.cut, data = Wage)
cv.step[i]<- cv.glm(Wage,glm.fit.step,K=10)$delta[1]
}

cv.step

## [1]       NA 1734.489 1684.271 1635.552 1632.080

range <- range(cv.step)

plot(cv.step, type="b")
points(which.min(cv.step), cv.step[5], col="red", pch=20, cex=2)

plot(wage ~ age, data = Wage, col = "darkgrey")
glm.fit.step <- glm(wage ~ cut(age, 5), data = Wage)
preds <- predict(glm.fit.step, list(age = age.grid))
lines(age.grid, preds, col = "blue", lwd = 2)

10. This question relates to the College data set.

(a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

The best subset may be a model that includes the 6 predictors PrivateYes, Room.Board, PhD, perc.alumni, Expend, and Grad.Rate.

#split data into train and test set 

attach(College)
College=na.omit(College)

# 70% of the sample size
smp_size <- floor(.7 * nrow(College))

# set the seed to make partition reproducible
set.seed(1)
train_ind <- sample(seq_len(nrow(College)), size = smp_size)

train <- College[train_ind, ]
test <- College[-train_ind, ]

set.seed(1)
library(leaps)
forward.fit <- regsubsets(Outstate ~ ., data = train, method = 'forward')
summary(forward.fit)

## Subset selection object
## Call: regsubsets.formula(Outstate ~ ., data = train, method = "forward")
## 17 Variables  (and intercept)
##             Forced in Forced out
## PrivateYes      FALSE      FALSE
## Apps            FALSE      FALSE
## Accept          FALSE      FALSE
## Enroll          FALSE      FALSE
## Top10perc       FALSE      FALSE
## Top25perc       FALSE      FALSE
## F.Undergrad     FALSE      FALSE
## P.Undergrad     FALSE      FALSE
## Room.Board      FALSE      FALSE
## Books           FALSE      FALSE
## Personal        FALSE      FALSE
## PhD             FALSE      FALSE
## Terminal        FALSE      FALSE
## S.F.Ratio       FALSE      FALSE
## perc.alumni     FALSE      FALSE
## Expend          FALSE      FALSE
## Grad.Rate       FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: forward
##          PrivateYes Apps Accept Enroll Top10perc Top25perc F.Undergrad
## 1  ( 1 ) " "        " "  " "    " "    " "       " "       " "        
## 2  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 3  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 4  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 5  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 6  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 7  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
## 8  ( 1 ) "*"        " "  " "    " "    " "       " "       " "        
##          P.Undergrad Room.Board Books Personal PhD Terminal S.F.Ratio
## 1  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 2  ( 1 ) " "         " "        " "   " "      " " " "      " "      
## 3  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 4  ( 1 ) " "         "*"        " "   " "      " " " "      " "      
## 5  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 6  ( 1 ) " "         "*"        " "   " "      "*" " "      " "      
## 7  ( 1 ) " "         "*"        " "   "*"      "*" " "      " "      
## 8  ( 1 ) " "         "*"        " "   "*"      "*" "*"      " "      
##          perc.alumni Expend Grad.Rate
## 1  ( 1 ) " "         "*"    " "      
## 2  ( 1 ) " "         "*"    " "      
## 3  ( 1 ) " "         "*"    " "      
## 4  ( 1 ) "*"         "*"    " "      
## 5  ( 1 ) "*"         "*"    " "      
## 6  ( 1 ) "*"         "*"    "*"      
## 7  ( 1 ) "*"         "*"    "*"      
## 8  ( 1 ) "*"         "*"    "*"

coef(forward.fit, id = 6)

##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3782.6544026  2808.0522815     0.9722009    38.0615210    59.1918276 
##        Expend     Grad.Rate 
##     0.2032837    28.6598253

(b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

Based on the plots, it seems that out of state tuition is higher for private school. There is a somewhat of a positive linear relationship for the variables Room.Board, PhD, perc.alumni, and grad.rate. There is a non-linear relationship for variable Expend.

library(gam)
set.seed(1)
gam.fit <- gam(Outstate ~ Private + s(Room.Board, 5) + s(PhD, 5) + s(perc.alumni, 5) + s(Expend, 5) + s(Grad.Rate, 5), data  = train)

par(mfrow = c(2,3))
plot(gam.fit, se=TRUE,col ="#1c9099")

(c) Evaluate the model obtained on the test set, and explain the results obtained.

The linear regression mse’s are lower than the gam model

set.seed(1)
preds=predict(gam.fit,newdata=test)
mse <- mean((preds-test$Outstate)^2)

mse

## [1] 3187953

(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

Based on the plots, there is a non-linear relationship for variable Expend.

Renee Brewington

Assignment #6 Chapter 07 (page 297): 6, 10 - STA 6543 Data Analytics II - Spring 2021