ANSWER 6a: Using 10-fold cross-validation to select the best model, I found the third-order polynomial, as shown. Using ANOVA instead of cross-validation could shift the choice between the third- and fourth-degree polynomials of age, since the fourth-degree term is only marginally significant (p ≈ 0.051) in the ANOVA table below. I plotted the third-order polynomial fit below; it has a very smooth curvature.
require(ISLR)
## Loading required package: ISLR
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
require(ggplot2)
#require(glmnet)
#require(leaps)
#require(MASS)
#require(stats)
ctrl <- trainControl(method = "cv", number = 10)
CV_RMSE <- c()
set.seed(123)
# Fit polynomial regressions of 'age' of degree 1 to 10 and record the 10-fold CV RMSE
for (i in 1:10) {
  model_temp <- train(y = Wage$wage,
                      x = poly(Wage$age, i, raw = T, simple = T),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i] <- model_temp$results$RMSE
}
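To confirm which degree attains the lowest cross-validation RMSE directly, a quick check reusing the CV_RMSE vector above (with this seed it matches the third-degree polynomial reported in the answer):
# Degree whose 10-fold CV RMSE is smallest
which.min(CV_RMSE)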
data.frame(CV_RMSE = CV_RMSE, degree = 1:10) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = degree, y = CV_RMSE)) +
  geom_line(col = "blue") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(1, 10), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Selecting the 'age' polynomial degree with cross-validation",
       x = "Degree",
       y = "CV RMSE")
fit_1 <- lm(wage ~ age, data = Wage)
fit_2 <- lm(wage ~ poly(age, 2, raw = T), data = Wage)
fit_3 <- lm(wage ~ poly(age, 3, raw = T), data = Wage)
fit_4 <- lm(wage ~ poly(age, 4, raw = T), data = Wage)
fit_5 <- lm(wage ~ poly(age, 5, raw = T), data = Wage)
anova(fit_1, fit_2, fit_3, fit_4, fit_5)
## Analysis of Variance Table
##
## Model 1: wage ~ age
## Model 2: wage ~ poly(age, 2, raw = T)
## Model 3: wage ~ poly(age, 3, raw = T)
## Model 4: wage ~ poly(age, 4, raw = T)
## Model 5: wage ~ poly(age, 5, raw = T)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5931 < 2.2e-16 ***
## 3 2996 4777674 1 15756 9.8888 0.001679 **
## 4 2995 4771604 1 6070 3.8098 0.051046 .
## 5 2994 4770322 1 1283 0.8050 0.369682
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ poly(x, 3, raw = T)) +
  labs(title = "Wage Dataset - Polynomial Regression",
       subtitle = "Predicting 'wage' with a cubic polynomial of 'age'")
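The same cubic fit can also be inspected with explicit predictions and approximate 95% bands over a grid of ages (a small base-R sketch, equivalent in spirit to the geom_smooth call above; it reuses fit_3 defined further down):
age_grid <- seq(min(Wage$age), max(Wage$age))
preds <- predict(fit_3, newdata = data.frame(age = age_grid), se.fit = TRUE)
plot(Wage$age, Wage$wage, col = "grey", xlab = "age", ylab = "wage")
lines(age_grid, preds$fit, col = "blue", lwd = 2)
# Dotted lines: fit +/- 2 standard errors
matlines(age_grid, cbind(preds$fit - 2 * preds$se.fit, preds$fit + 2 * preds$se.fit),
         col = "blue", lty = 3)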
ANSWER 6b:
I fit step functions of 'age' with between 2 and 20 intervals and used cross-validation to select the best model. The selected model had 12 intervals of age, as shown below.
require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
CV_RMSE <- c()
set.seed(124)
# Fit step functions of 'age' with 2 to 20 intervals and record the 10-fold CV RMSE
for (i in 2:20) {
  model_temp <- train(y = Wage$wage,
                      x = data.frame(cut(Wage$age, i)),
                      method = "lm",
                      metric = "RMSE",
                      trControl = ctrl)
  CV_RMSE[i - 1] <- model_temp$results$RMSE
}
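To read off the chosen number of intervals directly, a quick check (the loop starts at 2 cuts, so the index is offset by one; with this seed it matches the 12 intervals reported in the answer):
# Number of intervals with the smallest 10-fold CV RMSE
which.min(CV_RMSE) + 1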
data.frame(cuts = 2:20, CV_RMSE = CV_RMSE) %>%
  mutate(min_CV_RMSE = as.numeric(min(CV_RMSE) == CV_RMSE)) %>%
  ggplot(aes(x = cuts, y = CV_RMSE)) +
  geom_line(col = "blue") +
  geom_point(size = 2, aes(col = factor(min_CV_RMSE))) +
  scale_x_continuous(breaks = seq(2, 20), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) +
  theme(legend.position = "none") +
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Selecting number of 'age' cut-points with cross-validation",
       x = "Intervals",
       y = "CV RMSE")
ggplot(Wage, aes(x = age, y = wage)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "lm", formula = y ~ cut(x, 12)) +
  labs(title = "Wage Dataset - Step Function",
       subtitle = "Predicting 'wage' with a 12-interval step function of 'age'")
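For reference, these are the 12 equal-width intervals that cut() produces for age, together with how many observations fall in each (a quick sketch):
table(cut(Wage$age, 12))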
10. This question relates to the College data set. (a) Split the data into a training set and a test set. Using out-of-state tuition as the response and the other variables as the predictors, perform forward stepwise selection on the training set in order to identify a satisfactory model that uses just a subset of the predictors.
ANSWER 10a:
I start by randomly sampling 70% of the observations for the training set; the remaining observations form the test set, as shown below. By using selectionFunction = "oneSE" for cross-validation, we identify the model that minimizes the cross-validation error and then select the simplest model within one standard error of it. As shown in the model_forward results, nvmax = 6, meaning the 6-variable model was selected. The model with the minimum cross-validation error had 14 predictors, as shown below.
require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
set.seed(1)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))
train <- College[train_index, ]
nrow(train) / nrow(College)
## [1] 0.7001287
test <- College[-train_index, ]
nrow(test) / nrow(College)
## [1] 0.2998713
# Custom summary function: report RMSE, R-squared, MAE, MSE and RSS for each resample
custom_regression_metrics <- function(data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)),
    Rsquared = summary(lm(pred ~ obs, data))$r.squared,
    MAE = mean(abs(data$obs - data$pred)),
    MSE = mean((data$obs - data$pred)^2),
    RSS = sum((data$obs - data$pred)^2))
}
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 1,
                     summaryFunction = custom_regression_metrics,
                     selectionFunction = "oneSE")
set.seed(11)
model_forward <- train(Outstate ~ .,
                       data = train,
                       method = "leapForward",
                       metric = "MSE",
                       maximize = F,
                       trControl = ctrl,
                       tuneGrid = data.frame(nvmax = 1:17))
model_forward
## Linear Regression with Forward Selection
##
## 544 samples
## 17 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times)
## Summary of sample sizes: 489, 490, 492, 491, 489, 488, ...
## Resampling results across tuning parameters:
##
## nvmax RMSE Rsquared MAE MSE RSS
## 1 3116.676 0.5047899 2480.277 9924143 537869955
## 2 2670.511 0.6141343 2018.854 7253757 393129955
## 3 2394.728 0.6806436 1847.491 5787625 314365711
## 4 2223.172 0.7179308 1740.158 4974734 270218937
## 5 2160.560 0.7329846 1701.678 4701270 255744224
## 6 2117.879 0.7416528 1665.929 4519142 245832663
## 7 2123.836 0.7408057 1670.549 4548786 247459481
## 8 2119.735 0.7406173 1674.291 4533400 246643018
## 9 2155.722 0.7293720 1691.149 4710988 256372597
## 10 2157.944 0.7291680 1680.271 4726621 257176853
## 11 2113.974 0.7416482 1663.049 4513600 245493174
## 12 2096.665 0.7459114 1653.184 4443008 241535828
## 13 2086.080 0.7482922 1647.345 4399171 239126796
## 14 2082.560 0.7487630 1644.411 4383112 238253554
## 15 2090.142 0.7471492 1651.776 4414671 239941188
## 16 2088.295 0.7476358 1651.515 4404923 239413821
## 17 2088.644 0.7475671 1651.796 4406352 239491261
##
## MSE was used to select the optimal model using the one SE rule.
## The final value used for the model was nvmax = 6.
model_forward_SE <- model_forward$results %>%
  mutate(MSE_SE_low = MSE - (MSESD / sqrt(model_forward$control$number * model_forward$control$repeats)),
         MSE_SE_high = MSE + (MSESD / sqrt(model_forward$control$number * model_forward$control$repeats)),
         min_CV_MSE = as.numeric(min(MSE) == MSE))
ggplot(model_forward_SE, aes(x = nvmax, y = MSE)) +
  geom_line(col = "blue") +
  geom_hline(aes(yintercept = model_forward_SE$MSE_SE_high[model_forward_SE$min_CV_MSE == 1], col = "1 s.e. model")) +
  geom_vline(aes(xintercept = model_forward$bestTune$nvmax, col = "red")) +
  geom_errorbar(aes(ymin = MSE_SE_low, ymax = MSE_SE_high),
                colour = "grey55",
                width = 0.4) +
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
  scale_color_manual(values = c("deepskyblue3", "orange", "red", "red")) +
  scale_x_continuous(breaks = 1:17, minor_breaks = NULL) +
  theme(legend.position = "none") +
  labs(title = "College Dataset - Linear Model",
       subtitle = "Selecting number of predictors with forward selection (cross-validation MSE, 1 s.e. rule)",
       x = "Number of Predictors",
       y = "Cross-Validation MSE")
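The two selected model sizes can also be read directly from the objects created above (a small sketch):
# nvmax chosen by the one-standard-error rule (6 in the output above)
model_forward$bestTune$nvmax
# nvmax with the minimum cross-validation MSE (14 in the output above)
model_forward_SE$nvmax[model_forward_SE$min_CV_MSE == 1]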
ANSWER 10b:
The plots below suggest a non-linear effect of Expend, Room.Board, PhD, and Grad.Rate on Outstate, and an approximately linear relationship for perc.alumni.
require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
require(gam)
## Loading required package: gam
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.20
coef(model_forward$finalModel, id = 6)
## (Intercept) PrivateYes Room.Board PhD perc.alumni
## -3764.3413062 2793.2069104 0.9703210 38.2157650 59.0358377
## Expend Grad.Rate
## 0.2031532 28.6548780
model_gam <- gam(Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) + s(Expend) + s(Grad.Rate), data = train)
par(mfrow = c(2, 3))
plot(model_gam, se = T, col = "blue")
ANSWER 10c:
The GAM has better test performance than the linear model, as shown below.
require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
require(gam)
#GAM test MSE and R-squared:
mean((predict(model_gam, newdata = test) - test$Outstate)^2)
## [1] 3170227
test_TSS <- sum((test$Outstate - mean(test$Outstate))^2)
test_RSS <- sum((predict(model_gam, newdata = test) - test$Outstate)^2)
1 - test_RSS/test_TSS
## [1] 0.7712156
#Linear Model test MSE and R-squared:
mean((predict(model_forward, newdata = test) - test$Outstate)^2)
## [1] 3616631
test_RSS <- sum((predict(model_forward, newdata = test) - test$Outstate)^2)
1 - test_RSS/test_TSS
## [1] 0.7390001
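For a side-by-side view, the same test metrics can be collected in a single data frame (a minimal sketch reusing the predictions and test_TSS computed above):
pred_gam <- predict(model_gam, newdata = test)
pred_lm <- predict(model_forward, newdata = test)
data.frame(model = c("GAM", "Forward-selection LM"),
           test_MSE = c(mean((pred_gam - test$Outstate)^2),
                        mean((pred_lm - test$Outstate)^2)),
           test_R2 = c(1 - sum((pred_gam - test$Outstate)^2) / test_TSS,
                       1 - sum((pred_lm - test$Outstate)^2) / test_TSS))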
ANSWER 10d:
The p-values in the "Anova for Nonparametric Effects" table at the bottom of the output below indicate whether a smooth (non-linear) effect is present: p < 0.05 suggests a non-linear effect. Expend and Room.Board have p-values below 0.05, so they show evidence of a non-linear effect. PhD, perc.alumni, and Grad.Rate show no such evidence, so a linear effect appears sufficient for them.
require(ISLR)
require(dplyr)
require(caret)
require(ggplot2)
require(gam)
summary(model_gam)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board) + s(PhD) + s(perc.alumni) +
## s(Expend) + s(Grad.Rate), data = train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -7511.91 -1172.59 38.71 1267.44 7827.85
##
## (Dispersion Parameter for gaussian family taken to be 3617918)
##
## Null Deviance: 9263945675 on 543 degrees of freedom
## Residual Deviance: 1888551843 on 521.9997 degrees of freedom
## AIC: 9782.515
##
## Number of Local Scoring Iterations: NA
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 2402224634 2402224634 663.980 < 2.2e-16 ***
## s(Room.Board) 1 1721929895 1721929895 475.945 < 2.2e-16 ***
## s(PhD) 1 620590394 620590394 171.532 < 2.2e-16 ***
## s(perc.alumni) 1 456743095 456743095 126.245 < 2.2e-16 ***
## s(Expend) 1 746743434 746743434 206.401 < 2.2e-16 ***
## s(Grad.Rate) 1 99786036 99786036 27.581 2.197e-07 ***
## Residuals 522 1888551843 3617918
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board) 3 2.680 0.04629 *
## s(PhD) 3 1.179 0.31718
## s(perc.alumni) 3 0.937 0.42233
## s(Expend) 3 32.503 < 2e-16 ***
## s(Grad.Rate) 3 1.995 0.11380
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
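As an additional check (not part of the output above), the fitted GAM can be compared against a GAM in which PhD, perc.alumni and Grad.Rate enter linearly, using an F-test; model_gam_linear is a new object introduced only for this comparison:
# Keep smooths only where the nonparametric ANOVA suggested non-linearity
model_gam_linear <- gam(Outstate ~ Private + s(Room.Board) + PhD + perc.alumni +
                          s(Expend) + Grad.Rate, data = train)
anova(model_gam_linear, model_gam, test = "F")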