Assignment #6

6. In this exercise, you will further analyze the Wage data set considered throughout this chapter.

##(a) Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

library(ISLR2)
library(boot)
set.seed(1)
degree <- 10
cv.errs <- rep(NA, degree)
for (i in 1:degree) {
  fit <- glm(wage ~ poly(age, i), data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}

plot(1:degree, cv.errs, xlab = 'Degree', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

##The minimum test MSE occurs at degree 9, but the test MSE at degree 4 is already nearly as small, so the simpler degree-4 polynomial is adequate.

plot(wage ~ age, data = Wage, col = "darkgrey")
age.range <- range(Wage$age)
age.grid <- seq(from = age.range[1], to = age.range[2])
fit <- lm(wage ~ poly(age, 4), data = Wage)  # degree 4, the degree judged adequate above
preds <- predict(fit, newdata = list(age = age.grid))
lines(age.grid, preds, col = "red", lwd = 2)
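
The question also asks how the cross-validated degree compares with hypothesis testing using ANOVA. A minimal sketch of that comparison follows, assuming nested fits of degree 1 through 5 (the range is an assumption; the names `fit.1` to `fit.5` and `anova.tab` are illustrative and not part of the original answer):

```r
library(ISLR2)

# Nested polynomial fits of increasing degree (1 through 5 is an assumed range).
fit.1 <- lm(wage ~ age, data = Wage)
fit.2 <- lm(wage ~ poly(age, 2), data = Wage)
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
fit.4 <- lm(wage ~ poly(age, 4), data = Wage)
fit.5 <- lm(wage ~ poly(age, 5), data = Wage)

# Each row tests whether adding the next degree significantly reduces the RSS.
anova.tab <- anova(fit.1, fit.2, fit.3, fit.4, fit.5)
print(anova.tab)
```

The highest degree whose row still has a small p-value is the one ANOVA supports, and it can be set against the degree picked by cross-validation.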

##(b) Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

cv.errs <- rep(NA, degree)
for (i in 2:degree) {
  Wage$age.cut <- cut(Wage$age, i)
  fit <- glm(wage ~ age.cut, data = Wage)
  cv.errs[i] <- cv.glm(Wage, fit)$delta[1]
}
plot(2:degree, cv.errs[-1], xlab = 'Cuts', ylab = 'Test MSE', type = 'l')
deg.min <- which.min(cv.errs)
points(deg.min, cv.errs[deg.min], col = 'red', cex = 2, pch = 19)

##So 8 cuts produce the minimum test MSE.

plot(wage ~ age, data = Wage, col = "darkgrey")
fit <- glm(wage ~ cut(age, 8), data = Wage)
preds <- predict(fit, data.frame(age = age.grid))  # both `data.frame` and `list` work
lines(age.grid, preds, col = "red", lwd = 2)
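
Leave-one-out CV refits the model once per observation, which can be slow. The sketch below runs the same search over the number of cuts with 10-fold CV, using the `K` argument of `cv.glm()`. The fold count, the seed, and the names `wage.df`, `cv.errs.k10`, `fit.k` and `best.cuts` are assumptions made for illustration.

```r
library(ISLR2)
library(boot)

set.seed(1)  # K-fold CV assigns folds at random, so fix a seed (value assumed)
max.cuts <- 10
cv.errs.k10 <- rep(NA, max.cuts)
wage.df <- Wage  # work on a copy so the original data frame is left untouched

for (i in 2:max.cuts) {
  # Cut on the full data so every fold shares the same interval boundaries.
  wage.df$age.cut <- cut(wage.df$age, i)
  fit.k <- glm(wage ~ age.cut, data = wage.df)
  cv.errs.k10[i] <- cv.glm(wage.df, fit.k, K = 10)$delta[1]
}

best.cuts <- which.min(cv.errs.k10)  # which.min() skips the NA in position 1
print(best.cuts)
```

Because the folds are random, the selected number of cuts may differ from the leave-one-out result and from one seed to the next.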

10. This question relates to the College data set.

## (a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

rm(list = ls())
library(ISLR2)

set.seed(1)  # seed value is arbitrary; it makes the random train/test split reproducible
train <- sample(nrow(College), floor(nrow(College) * 2 / 3))
DF <- College
DFTrain <- DF[train, ]
DFTest <- DF[-train, ]

library(pander)

pander(names(DF))

Private, Apps, Accept, Enroll, Top10perc, Top25perc, F.Undergrad, P.Undergrad, Outstate, Room.Board, Books, Personal, PhD, Terminal, S.F.Ratio, perc.alumni, Expend and Grad.Rate

library(leaps)

regfit.full <- regsubsets(Outstate ~ ., data = DFTrain, method = "forward", nvmax = 18)
reg.summary <- summary(regfit.full)
mse_v <- reg.summary$rss / nrow(DFTrain)  # the model was fit on the training set only
plot(mse_v)
title(c("Training MSE versus model size", "Forward stepwise selection"))

plot(regfit.full, scale = "bic")
title("BIC, forward stepwise selection")

model_fss8 <- coef(regfit.full,8)
pander(names(model_fss8))

(Intercept), PrivateYes, Apps, Accept, Room.Board, PhD, perc.alumni, Expend and Grad.Rate
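
One way to make the choice of model size explicit is to read it off the BIC values stored in the `regsubsets` summary. The sketch below is self-contained and rests on assumptions: the seed, the margin of 2 BIC units used to call models "close to the best", and the names `train.idx`, `fwd.fit`, `fwd.summary`, `best.size` and `near.best` are all illustrative.

```r
library(ISLR2)
library(leaps)

set.seed(1)  # assumed seed; the train/test split is random
train.idx <- sample(nrow(College), floor(nrow(College) * 2 / 3))
fwd.fit <- regsubsets(Outstate ~ ., data = College[train.idx, ],
                      method = "forward", nvmax = 17)
fwd.summary <- summary(fwd.fit)

# Model size with the lowest BIC on the training set.
best.size <- which.min(fwd.summary$bic)

# Sizes whose BIC is within 2 units of the minimum (the margin is an assumption).
near.best <- which(fwd.summary$bic - min(fwd.summary$bic) < 2)

print(best.size)
print(names(coef(fwd.fit, min(near.best))))  # smallest model close to the best
```

Taking the smallest model among those close to the minimum favors parsimony when the BIC curve is flat near its lowest point.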

## (b) Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.
library(gam)

gam.fit <- gam(Outstate ~ Private + s(Room.Board, 4) + s(Personal, 4) + s(Terminal, 4) +
                 s(S.F.Ratio, 4) + s(perc.alumni, 4) + s(Expend, 4) + s(Grad.Rate, 4),
               data = DFTrain)
plot(gam.fit, se = TRUE, col = "blue")

## The plots show the univariate fits. One of the predictors selected by forward stepwise selection, Private, is a factor and so is not fit with a smoothing spline. There is strong evidence of non-linear relationships in the data; the variables Personal, S.F.Ratio, perc.alumni, and Expend show this most clearly.

## (c) Evaluate the model obtained on the test set, and explain the results obtained.

preds <- predict(gam.fit, newdata = DFTest)
RSS <- sum((preds - DFTest$Outstate)^2)
TSS <- sum((DFTest$Outstate - mean(DFTest$Outstate))^2)
RS2_Test <- 1 - (RSS / TSS)
plot(DFTest$Outstate, DFTest$Outstate - preds)
title(c("Residual plot for test set", sprintf("R-Squared = %f", RS2_Test)))

preds <- predict(gam.fit, newdata = DFTrain)
RSS <- sum((preds - DFTrain$Outstate)^2)
TSS <- sum((DFTrain$Outstate - mean(DFTrain$Outstate))^2)
RS2_Train <- 1 - (RSS / TSS)

## We see no significant trend in the residual plot, indicating that there are no unaccounted-for non-linear relationships in the model. We also see that the training and test set $R^2$ statistics indicate a reasonable fit. As expected, the test $R^2$ statistic is below the training $R^2$ value.
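
The $R^2$ computation is repeated for the test and training sets, so it can be wrapped in a small function. The helper below is hypothetical (the name `r.squared` is not from the original answer) and is checked here on made-up numbers rather than the College data:

```r
# Hypothetical helper: R^2 of predictions against observed values.
r.squared <- function(observed, predicted) {
  rss <- sum((observed - predicted)^2)        # residual sum of squares
  tss <- sum((observed - mean(observed))^2)   # total sum of squares
  1 - rss / tss
}

# Toy check on made-up numbers (not the College data).
obs <- c(1, 2, 3, 4)
print(r.squared(obs, obs))                        # perfect predictions: 1
print(r.squared(obs, rep(mean(obs), length(obs))))  # predicting the mean: 0
```

With the objects defined above, `r.squared(DFTest$Outstate, predict(gam.fit, newdata = DFTest))` would give the same value as `RS2_Test`.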

##(d) For which variables, if any, is there evidence of a non-linear relationship with the response?

##The variables Personal, S.F.Ratio, perc.alumni, and Expend in particular show a non-linear relationship with the response.
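
The visual impression of non-linearity can be backed by a formal comparison. The sketch below refits the GAM twice, once with Expend as a smooth term and once with Expend entering linearly, and compares the two with an F-test. The seed, the choice of Expend as the term under test, and the names `train.idx`, `gam.smooth`, `gam.linear` and `cmp` are assumptions for illustration; the same pattern applies to any other smooth term.

```r
library(ISLR2)
library(gam)

set.seed(1)  # assumed seed; the train/test split is random
train.idx <- sample(nrow(College), floor(nrow(College) * 2 / 3))
college.train <- College[train.idx, ]

# Model with a smooth term for Expend.
gam.smooth <- gam(Outstate ~ Private + s(Room.Board, 4) + s(Personal, 4) +
                    s(Terminal, 4) + s(S.F.Ratio, 4) + s(perc.alumni, 4) +
                    s(Expend, 4) + s(Grad.Rate, 4),
                  data = college.train)

# Same model, but Expend enters linearly.
gam.linear <- gam(Outstate ~ Private + s(Room.Board, 4) + s(Personal, 4) +
                    s(Terminal, 4) + s(S.F.Ratio, 4) + s(perc.alumni, 4) +
                    Expend + s(Grad.Rate, 4),
                  data = college.train)

# A small p-value favors the smooth (non-linear) term for Expend.
cmp <- anova(gam.linear, gam.smooth, test = "F")
print(cmp)
```

`summary(gam.smooth)` also prints an "Anova for Nonparametric Effects" table that tests every smooth term at once.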