1. In this exercise, you will further analyze the Wage data set considered throughout this chapter.
  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree d for the polynomial. What degree was chosen, and how does this compare to the results of hypothesis testing using ANOVA? Make a plot of the resulting polynomial fit to the data.

ANSWER 6a: Using 10-fold cross-validation to select the best model, I found that the third-order polynomial of age gives the lowest CV RMSE, as shown below. The ANOVA-based hypothesis tests point to the third- or possibly fourth-degree polynomial (the quartic term is only marginally significant, p ≈ 0.051), so the two approaches broadly agree. The third-order polynomial fit, plotted below, produces a smooth curve through the data.

require(ISLR)
## Loading required package: ISLR
require(dplyr)
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
#require(glmnet)
#require(leaps)
#require(MASS)
#require(stats)

ctrl <- trainControl(method = "cv", number = 10)

CV_RMSE <- c()

set.seed(123)

# Fit polynomials of degree 1 to 10 and record the 10-fold CV RMSE for each
for (i in 1:10) {
  model_temp <- train(y = Wage$wage,
                      x = poly(Wage$age, i, raw = T, simple = T),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i] <- model_temp$results$RMSE
}

data.frame(CV_RMSE = CV_RMSE,degree = 1:10) %>% mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>% 
  ggplot(aes(x = degree, y = CV_RMSE)) +
  geom_line(col = "blue") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Selecting the 'age' polynomial degree with cross-validation",
       x = "Degree",
       y = "CV RMSE")

fit_1 <- lm(wage ~ age, data = Wage)
fit_2 <- lm(wage ~ poly(age, 2, raw = T), data = Wage)
fit_3 <- lm(wage ~ poly(age, 3, raw = T), data = Wage)
fit_4 <- lm(wage ~ poly(age, 4, raw = T), data = Wage)
fit_5 <- lm(wage ~ poly(age, 5, raw = T), data = Wage)

anova(fit_1, fit_2, fit_3, fit_4, fit_5)
## Analysis of Variance Table
## 
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2, raw = T)
## Model 3: wage ~ poly(age, 3, raw = T)
## Model 4: wage ~ poly(age, 4, raw = T)
## Model 5: wage ~ poly(age, 5, raw = T)
##   Res.Df     RSS Df Sum of Sq        F    Pr(>F)    
## 1   2998 5022216                                    
## 2   2997 4793430  1    228786 143.5931 < 2.2e-16 ***
## 3   2996 4777674  1     15756   9.8888  0.001679 ** 
## 4   2995 4771604  1      6070   3.8098  0.051046 .  
## 5   2994 4770322  1      1283   0.8050  0.369682    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ poly(x, 3, raw = T)") + 
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Predicting 'wage' with a cubic polynomial of 'age'")

  1. Fit a step function to predict wage using age, and perform cross-validation to choose the optimal number of cuts. Make a plot of the fit obtained.

ANSWER 6b:

I fit step functions with between 2 and 20 intervals of age and used cross-validation to select the best model. The selected model has 12 intervals of age, as shown below.

require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)

CV_RMSE <- c()

set.seed(124)

# Fit step functions with 2 to 20 intervals of age and record the 10-fold CV RMSE
# (reusing the trainControl object `ctrl` defined in part (a))
for (i in 2:20) {
  model_temp <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i-1] <- model_temp$results$RMSE
}


data.frame(cuts = 2:20, CV_RMSE = CV_RMSE) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = cuts, y = CV_RMSE)) +
  geom_line(col = "blue") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(2, 20), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Selecting number of 'age' cut-points with cross-validation",
       x = "Intervals",
       y = "CV RMSE")

ggplot(Wage, aes(x = age, y = wage)) + 
  geom_point(alpha = 0.3) + 
  geom_smooth(method = "lm", formula = "y ~ cut(x, 12)") + 
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Predicting 'wage' with a 12-interval step function of 'age'")

10. This question relates to the College data set.
  1. Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.

ANSWER 10a:

I start by randomly sampling 70% of the observations for the training set; the remaining 30% form the test set, as shown below. With selectionFunction = "oneSE" in trainControl, cross-validation identifies the model that minimizes the cross-validation error and then selects the simplest model within one standard error of it. As shown in the model_forward results, nvmax = 6, meaning the 6-variable model was chosen; the model with the minimum cross-validation error itself had 14 predictors, as shown below.

require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)


set.seed(1)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))

train <- College[train_index, ]
nrow(train) / nrow(College)
## [1] 0.7001287
test <- College[-train_index, ]
nrow(test) / nrow(College)
## [1] 0.2998713
# Custom summary function so caret reports MSE and RSS in addition to RMSE, R-squared and MAE
custom_regression_metrics <- function (data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs-data$pred)^2)),
    Rsquared = summary(lm(pred ~ obs, data))$r.squared,
    MAE = mean(abs(data$obs-data$pred)), 
    MSE = mean((data$obs-data$pred)^2),
    RSS = sum((data$obs-data$pred)^2))
}



ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 1, 
                     summaryFunction = custom_regression_metrics, 
                     selectionFunction = "oneSE")

set.seed(11)

model_forward <- train(Outstate ~ .,
                       data = train,
                       method = "leapForward",
                       metric = "MSE",
                       maximize = F,
                       trControl = ctrl,
                       tuneGrid = data.frame(nvmax = 1:17))

model_forward
## Linear Regression with Forward Selection 
## 
## 544 samples
##  17 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 489, 490, 492, 491, 489, 488, ... 
## Resampling results across tuning parameters:
## 
##   nvmax  RMSE      Rsquared   MAE       MSE      RSS      
##    1     3116.676  0.5047899  2480.277  9924143  537869955
##    2     2670.511  0.6141343  2018.854  7253757  393129955
##    3     2394.728  0.6806436  1847.491  5787625  314365711
##    4     2223.172  0.7179308  1740.158  4974734  270218937
##    5     2160.560  0.7329846  1701.678  4701270  255744224
##    6     2117.879  0.7416528  1665.929  4519142  245832663
##    7     2123.836  0.7408057  1670.549  4548786  247459481
##    8     2119.735  0.7406173  1674.291  4533400  246643018
##    9     2155.722  0.7293720  1691.149  4710988  256372597
##   10     2157.944  0.7291680  1680.271  4726621  257176853
##   11     2113.974  0.7416482  1663.049  4513600  245493174
##   12     2096.665  0.7459114  1653.184  4443008  241535828
##   13     2086.080  0.7482922  1647.345  4399171  239126796
##   14     2082.560  0.7487630  1644.411  4383112  238253554
##   15     2090.142  0.7471492  1651.776  4414671  239941188
##   16     2088.295  0.7476358  1651.515  4404923  239413821
##   17     2088.644  0.7475671  1651.796  4406352  239491261
## 
## MSE was used to select the optimal model using  the one SE rule.
## The final value used for the model was nvmax = 6.
model_forward_SE <- model_forward$results %>%
  mutate(MSE_SE_low = MSE - (MSESD / sqrt(model_forward$control$number * model_forward$control$repeats)),
         MSE_SE_high = MSE + (MSESD / sqrt(model_forward$control$number * model_forward$control$repeats)), 
         min_CV_MSE = as.numeric(min(MSE) == MSE))

ggplot(model_forward_SE, aes(x = nvmax, y = MSE)) +
  geom_line(col = "blue") +
  geom_hline(aes(yintercept = model_forward_SE$MSE_SE_high[model_forward_SE$min_CV_MSE == 1], col = "1 s.e. model")) +
  geom_vline(aes(xintercept = model_forward$bestTune$nvmax, col = "red")) +
  geom_errorbar(aes(ymin = MSE_SE_low, ymax = MSE_SE_high), 
                colour = "grey55",
                width = 0.4) +
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
  scale_color_manual(values = c("deepskyblue3", "orange", "red", "red")) +
  scale_x_continuous(breaks = 1:17, minor_breaks = NULL) +
  theme(legend.position = "none") +
  labs(title = "College Dataset - Linear Model",
       subtitle = "Selecting number of predictors with forward selection (cross-validation MSE, 1 s.e. rule)",
       x = "Number of Predictors", 
       y = "Cross-Validation MSE")

  1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results, and explain your findings.

ANSWER 10b:

By looking at the plots below, there is evidence of a non-linear effect for Expend, Room.Board, PhD, and Grad.Rate, and an approximately linear relationship for perc.alumni.

require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
require(gam)
## Loading required package: gam
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
coef(model_forward$finalModel, id = 6)
##   (Intercept)    PrivateYes    Room.Board           PhD   perc.alumni 
## -3764.3413062  2793.2069104     0.9703210    38.2157650    59.0358377 
##        Expend     Grad.Rate 
##     0.2031532    28.6548780
model_gam <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = train)

par(mfrow = c(2, 3))
plot(model_gam, se = T, col = "blue")

  1. Evaluate the model obtained on the test set, and explain the results obtained.

ANSWER 10c:

The GAM has better test performance than the linear model (lower test MSE and higher test R-squared), as shown below.

require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
require(gam)

#GAM test MSE and R-squared:
mean((predict(model_gam, newdata = test) - test$Outstate)^2)
## [1] 3170227
test_TSS <- sum((test$Outstate - mean(test$Outstate))^2)
test_RSS <- sum((predict(model_gam, newdata = test) - test$Outstate)^2)

1 - test_RSS/test_TSS
## [1] 0.7712156
#Linear Model test MSE and R-squared:
mean((predict(model_forward, newdata = test) - test$Outstate)^2)
## [1] 3616631
test_RSS <- sum((predict(model_forward, newdata = test) - test$Outstate)^2)

1 - test_RSS/test_TSS
## [1] 0.7390001
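
To compare the two fits on the same scale as the response, the test RMSE (in dollars of out-of-state tuition) can be tabulated side by side. A small sketch, assuming the model_gam, model_forward, and test objects created above:

# Test RMSE for the GAM and the forward-selection linear model
data.frame(model = c("GAM", "Forward-selection LM"),
           test_RMSE = c(sqrt(mean((predict(model_gam, newdata = test) - test$Outstate)^2)),
                         sqrt(mean((predict(model_forward, newdata = test) - test$Outstate)^2))))
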
  1. For which variables, if any, is there evidence of a non-linear relationship with the response?

ANSWER 10d:

The p-values in the "Anova for Nonparametric Effects" table at the bottom of the summary below test whether the smooth (non-linear) component of each term is needed; p < 0.05 indicates evidence of a non-linear effect. Expend and Room.Board have p-values below 0.05, so they show evidence of a non-linear relationship with the response. PhD, perc.alumni, and Grad.Rate do not, which is consistent with linear effects for those variables. A further check for Expend is sketched after the summary output below.

require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
require(gam)

summary(model_gam)
## 
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + 
##     s(Expend) + s(Grad.Rate), data = train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max 
## -7511.91 -1172.59    38.71  1267.44  7827.85 
## 
## (Dispersion Parameter for gaussian family taken to be 3617918)
## 
##     Null Deviance: 9263945675 on 543 degrees of freedom
## Residual Deviance: 1888551843 on 521.9997 degrees of freedom
## AIC: 9782.515 
## 
## Number of Local Scoring Iterations: NA 
## 
## Anova for Parametric Effects
##                 Df     Sum Sq    Mean Sq F value    Pr(>F)    
## Private          1 2402224634 2402224634 663.980 < 2.2e-16 ***
## s(Room.Board)    1 1721929895 1721929895 475.945 < 2.2e-16 ***
## s(PhD)           1  620590394  620590394 171.532 < 2.2e-16 ***
## s(perc.alumni)   1  456743095  456743095 126.245 < 2.2e-16 ***
## s(Expend)        1  746743434  746743434 206.401 < 2.2e-16 ***
## s(Grad.Rate)     1   99786036   99786036  27.581 2.197e-07 ***
## Residuals      522 1888551843    3617918                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Anova for Nonparametric Effects
##                Npar Df Npar F   Pr(F)    
## (Intercept)                              
## Private                                  
## s(Room.Board)        3  2.680 0.04629 *  
## s(PhD)               3  1.179 0.31718    
## s(perc.alumni)       3  0.937 0.42233    
## s(Expend)            3 32.503 < 2e-16 ***
## s(Grad.Rate)         3  1.995 0.11380    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
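
As a further cross-check of the Expend result (a sketch, not part of the original output): refit the GAM with Expend entered as a purely linear term and compare it to the fit with the smooth term using an F test, as in the ISLR GAM lab. The object name model_gam_linear_expend is introduced here only for this check; a small p-value would again point to a non-linear effect of Expend.

# Refit with Expend as a linear term and test it against the smooth-Expend GAM
model_gam_linear_expend <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) +
                                 s(perc.alumni) + Expend + s(Grad.Rate),
                               data = train)
anova(model_gam_linear_expend, model_gam, test = "F")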