i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
ANSWER a: The correct answer is iii: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: When the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions (pg. 231).
ANSWER b: The correct answer is iii: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Explanation: Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias (pg. 225).
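As a quick illustration of this trade-off (a minimal sketch on simulated data, not part of the graded answer), ridge estimates shrink toward zero as λ grows:
# Sketch: ridge coefficients shrink as lambda increases (toy data).
require(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 5), ncol = 5)       # 5 toy predictors
y <- x %*% c(3, -2, 1, 0, 0) + rnorm(100)   # toy response
fit <- glmnet(x, y, alpha = 0, lambda = c(0.01, 1, 100))
coef(fit)  # one column per lambda; the larger the lambda, the smaller the coefficients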
ANSWER c: The correct answer is ii: More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Explanation: Non-linear methods are more flexible than least squares, so they reduce bias at the cost of higher variance; when the decrease in bias outweighs the increase in variance, prediction accuracy improves.
ANSWER 9-a:
require(ISLR)
## Loading required package: ISLR
set.seed(3)
# Randomly assign ~70% of the rows to the training set:
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))
train <- College[train_index, ]
nrow(train) / nrow(College)
## [1] 0.7001287
test <- College[-train_index, ]
nrow(test) / nrow(College)
## [1] 0.2998713
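(Aside: caret's createDataPartition() would give a split stratified on the outcome, which plain sample() does not; a sketch, not used in the answers below:)
# Hypothetical stratified alternative to sample():
require(caret)
set.seed(3)
part_index <- createDataPartition(College$Apps, p = 0.7, list = FALSE)
train_strat <- College[part_index, ]
test_strat <- College[-part_index, ]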
ANSWER 9-b:
The test error, measured as MSE, is 1413287, as shown below:
require(ISLR)
require(caret)
## Loading required package: caret
## Loading required package: lattice
## Loading required package: ggplot2
model_ln <- lm(Apps ~ ., data = train)
summary(model_ln)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3371.9 -437.6 -63.3 350.2 6233.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -605.81244 493.25798 -1.228 0.219927
## PrivateYes -584.35163 164.18527 -3.559 0.000406 ***
## Accept 1.32861 0.06124 21.696 < 2e-16 ***
## Enroll -0.24166 0.21699 -1.114 0.265908
## Top10perc 52.92817 6.99182 7.570 1.68e-13 ***
## Top25perc -17.86210 5.44140 -3.283 0.001097 **
## F.Undergrad 0.04199 0.03837 1.094 0.274342
## P.Undergrad 0.04110 0.03435 1.197 0.231996
## Outstate -0.07531 0.02325 -3.239 0.001274 **
## Room.Board 0.17973 0.05972 3.010 0.002742 **
## Books -0.09909 0.26773 -0.370 0.711436
## Personal 0.02642 0.07653 0.345 0.730076
## PhD -9.93047 5.51335 -1.801 0.072249 .
## Terminal -5.68887 6.08867 -0.934 0.350559
## S.F.Ratio 23.09251 15.20844 1.518 0.129514
## perc.alumni -6.52534 5.04819 -1.293 0.196713
## Expend 0.12521 0.01797 6.966 9.77e-12 ***
## Grad.Rate 10.65431 3.52288 3.024 0.002614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1042 on 526 degrees of freedom
## Multiple R-squared: 0.9215, Adjusted R-squared: 0.9189
## F-statistic: 363 on 17 and 526 DF, p-value: < 2.2e-16
ols_pred <- predict(model_ln, test)
(ols_mse <- mean((ols_pred - test$Apps)^2))
## [1] 1413287
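For reference (not required by the question), caret's postResample() summarizes the same predictions as RMSE, R-squared, and MAE; its RMSE squared equals the MSE above:
# RMSE / Rsquared / MAE on the test predictions:
postResample(pred = ols_pred, obs = test$Apps)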
ANSWER 9-c:
The test error obtained is 1545921 as shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loaded glmnet 4.1-2
require(tidyr)
## Loading required package: tidyr
##
## Attaching package: 'tidyr'
## The following objects are masked from 'package:Matrix':
##
## expand, pack, unpack
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Build numeric design matrices (dummy-encoding Private) for glmnet:
train_mat <- dummyVars(Apps ~ ., data = train, fullRank = F) %>% predict(newdata = train) %>% as.matrix()
test_mat <- dummyVars(Apps ~ ., data = test, fullRank = F) %>% predict(newdata = test) %>% as.matrix()
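Note that fullRank = F keeps both levels of Private in the matrix; glmnet tolerates the redundancy because of the penalty. The ISLR labs build the design matrix with base R's model.matrix() instead, a full-rank sketch of which is:
# Full-rank alternative via model.matrix (drop the intercept column for glmnet):
train_mat_fr <- model.matrix(Apps ~ ., data = train)[, -1]
test_mat_fr <- model.matrix(Apps ~ ., data = test)[, -1]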
set.seed(3)
model_ridge <- cv.glmnet(y = train$Apps,
x = train_mat,
alpha = 0,
lambda = 10^seq(2,-2, length = 100),
standardize = TRUE,
nfolds = 5)
data.frame(lambda = model_ridge$lambda,
cv_mse = model_ridge$cvm) %>%
ggplot(aes(x = lambda, y = cv_mse)) +
geom_point() +
geom_line(col="blue") +
geom_vline(xintercept = model_ridge$lambda.min, col = "red") +
geom_hline(yintercept = min(model_ridge$cvm), col = "red") +
scale_x_continuous(trans = 'log10', breaks = c(0.01, 0.1, 1, 10, 100), labels = c(0.01, 0.1, 1, 10, 100)) +
scale_y_continuous(labels = scales::comma_format()) +
theme(legend.position = "bottom") +
labs(x = "Lambda",
y = "Cross-Validation MSE",
col = "Non-Zero Coefficients:",
title = "Ridge Regression - Lambda Selection (5-Fold Cross-Validation)")
model_ridge_best <- glmnet(y = train$Apps,
x = train_mat,
alpha = 0,
lambda = 10^seq(2,-2, length = 100))
ridge_pred <- predict(model_ridge_best, s = model_ridge$lambda.min, newx = test_mat)
(ridge_mse <- mean((ridge_pred - test$Apps)^2))
## [1] 1545921
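cv.glmnet() also records lambda.1se, the largest λ whose CV error is within one standard error of the minimum; predicting with it trades a little accuracy for extra shrinkage (a sketch whose result is not part of the answer):
# More heavily regularized alternative at lambda.1se:
ridge_pred_1se <- predict(model_ridge, s = model_ridge$lambda.1se, newx = test_mat)
mean((ridge_pred_1se - test$Apps)^2)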
ANSWER 9-d:
The test error obtained is 1444668, and the non-zero coefficient estimates are shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(tidyr)
require(dplyr)
set.seed(4)
model_lasso <- cv.glmnet(y = train$Apps,
x = train_mat,
alpha = 1,
lambda = 10^seq(2, -2, length = 100),
standardize = TRUE,
nfolds = 5,
thresh = 1e-12)
data.frame(lambda = model_lasso$lambda,
cv_mse = model_lasso$cvm,
nonzero_coeff = model_lasso$nzero) %>%
ggplot(aes(x = lambda, y = cv_mse, col = nonzero_coeff)) +
geom_point() +
geom_line() +
geom_vline(xintercept = model_lasso$lambda.min, col = "red") +
geom_hline(yintercept = min(model_lasso$cvm), col = "red") +
scale_x_continuous(trans = 'log10', breaks = c(0.01, 0.1, 1, 10, 100), labels = c(0.01, 0.1, 1, 10, 100)) +
scale_y_continuous(labels = scales::comma_format()) +
theme(legend.position = "bottom") +
scale_color_gradient(low = "red", high = "green") +
labs(x = "Lambda",
y = "Cross-Validation MSE",
col = "Non-Zero Coefficients:",
title = "Lasso - Lambda Selection (Using 5-Fold Cross-Validation)")
model_lasso_best <- glmnet(y = train$Apps,
x = train_mat,
alpha = 1,
lambda = 10^seq(2,-5, length = 100))
lasso_pred <- predict(model_lasso_best, s = model_lasso$lambda.min, newx = test_mat)
(lasso_mse <- mean((lasso_pred - test$Apps)^2))
## [1] 1444668
lasso_coef <- predict(model_lasso_best, type = "coefficients", s = model_lasso$lambda.min)
round(lasso_coef, 3)
## 19 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -1302.682
## Private.No 568.744
## Private.Yes .
## Accept 1.281
## Enroll .
## Top10perc 46.089
## Top25perc -12.664
## F.Undergrad 0.018
## P.Undergrad 0.035
## Outstate -0.063
## Room.Board 0.164
## Books -0.004
## Personal 0.010
## PhD -8.092
## Terminal -5.653
## S.F.Ratio 18.699
## perc.alumni -7.071
## Expend 0.120
## Grad.Rate 9.274
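The non-zero coefficients can also be counted explicitly (intercept excluded):
# On this run, 16 of the 18 columns are non-zero; Private.Yes and Enroll drop out.
sum(lasso_coef[-1, ] != 0)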
ANSWER 9-e:
The test error obtained is 1550794 at M = 16, as shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(tidyr)
require(dplyr)
require(pls)
## Loading required package: pls
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
set.seed(5)
model_pcr <- pcr(Apps ~ .,
data = train,
scale = T,
validation = "CV")
model_pcr_mse <- MSEP(model_pcr, estimate = "CV")$val %>%
  reshape2::melt() %>%
  mutate(M = 0:(nrow(.) - 1)) %>%
  select(M, value) %>%
  rename(CV_MSE = value)
model_pcr_mse
## M CV_MSE
## 1 0 13409401
## 2 1 13221496
## 3 2 3031993
## 4 3 3073801
## 5 4 2444471
## 6 5 2078195
## 7 6 1836587
## 8 7 1671864
## 9 8 1593721
## 10 9 1607070
## 11 10 1598481
## 12 11 1608062
## 13 12 1587359
## 14 13 1600032
## 15 14 1598151
## 16 15 1613936
## 17 16 1203941
## 18 17 1207079
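The minimizing M can also be read off programmatically rather than by eye (the M column already starts at 0 for the intercept-only model):
# M with the lowest cross-validation MSE (16 on this run):
model_pcr_mse$M[which.min(model_pcr_mse$CV_MSE)]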
model_pcr_mse %>%
mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
ggplot(aes(x = M, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_y_continuous(labels = scales::comma_format()) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(x = "M",
y = "Cross-Validation MSE",
col = "Non-Zero Coefficients:",
title = "PCR - M Selection (Using 10-Fold Cross-Validation)")
pcr_pred <- predict(model_pcr, test, ncomp = 16)
(pcr_mse <- mean((pcr_pred - test$Apps)^2))
## [1] 1550794
ANSWER 9-f:
The test error obtained is 1451594 at M = 8, as shown below:
require(ISLR)
require(pls)
set.seed(6)
model_pls <- plsr(Apps ~ .,
data = train,
scale = T,
validation = "CV")
model_pls_mse <- MSEP(model_pls, estimate = "CV")$val %>%
  reshape2::melt() %>%
  mutate(M = 0:(nrow(.) - 1)) %>%
  select(M, value) %>%
  rename(CV_MSE = value)
model_pls_mse
## M CV_MSE
## 1 0 13409401
## 2 1 2550956
## 3 2 1823854
## 4 3 1502580
## 5 4 1443224
## 6 5 1373482
## 7 6 1269080
## 8 7 1236585
## 9 8 1229345
## 10 9 1229525
## 11 10 1232714
## 12 11 1232950
## 13 12 1234544
## 14 13 1238947
## 15 14 1240060
## 16 15 1239458
## 17 16 1239412
## 18 17 1239557
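As with PCR, the minimizing M can be extracted programmatically:
# M with the lowest cross-validation MSE (8 on this run):
model_pls_mse$M[which.min(model_pls_mse$CV_MSE)]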
model_pls_mse %>%
mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
ggplot(aes(x = M, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_y_continuous(labels = scales::comma_format()) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(x = "M",
y = "Cross-Validation MSE",
title = "PLS - M Selection (Using 10-Fold Cross-Validation)")
pls_pred <- predict(model_pls, test, ncomp = 8)
(pls_mse <- mean((pls_pred - test$Apps)^2))
## [1] 1451594
ANSWER 9-g:
As shown below, there is minimal difference among the test errors (MSE) obtained with each of the five approaches:
require(ISLR)
require(pls)
# R2 = 1 - SS_res / SS_tot
SS_tot <- sum((test$Apps - mean(test$Apps))^2)
data.frame(method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
test_MSE = c(ols_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse),
test_R2 = c(1 - sum((test$Apps - ols_pred)^2) / SS_tot,
1 - sum((test$Apps - ridge_pred)^2) / SS_tot,
1 - sum((test$Apps - lasso_pred)^2) / SS_tot,
1 - sum((test$Apps - pcr_pred)^2) / SS_tot,
1 - sum((test$Apps - pls_pred)^2) / SS_tot)) %>% arrange(test_MSE)
## method test_MSE test_R2
## 1 OLS 1413287 0.9241061
## 2 Lasso 1444668 0.9224210
## 3 PLS 1451594 0.9220490
## 4 Ridge 1545921 0.9169836
## 5 PCR 1550794 0.9167220
ANSWER 11-a:
Used Best Subset Selection and found the minimum CV MSE = 42.46049, as shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
## Loading required package: leaps
require(MASS)
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
# Custom caret summaryFunction that adds MSE and RSS to the usual metrics:
custom_regression_metrics <- function (data, lev = NULL, model = NULL) {
c(RMSE = sqrt(mean((data$obs-data$pred)^2)),
Rsquared = summary(lm(pred ~ obs, data))$r.squared,
MAE = mean(abs(data$obs-data$pred)),
MSE = mean((data$obs-data$pred)^2),
RSS = sum((data$obs-data$pred)^2))
}
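A quick sanity check of the summary function on made-up values (illustrative only; caret passes a data frame with obs and pred columns):
# Illustrative check with hypothetical obs/pred values:
custom_regression_metrics(data.frame(obs = c(1, 2, 3), pred = c(1.1, 1.9, 3.2)))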
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, summaryFunction = custom_regression_metrics)
# best models per subset size:
model <- regsubsets(crim ~ .,
data = Boston,
nvmax = ncol(Boston) - 1,
method = "exhaustive")
# cross-validating to compare MSE:
CV_MSE <- c()
set.seed(10101)
for (i in 1:(ncol(Boston)-1)) {
Boston_temp <- Boston[ ,c("crim", names(coef(model, id = i)[-1]))]
model_temp <- train(crim ~ .,
data = Boston_temp,
method = "lm",
trControl = ctrl)
CV_MSE[i] <- model_temp$results$MSE
}
data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(Boston)-1)) %>% mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
ggplot(aes(x = subset_size, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Best Subset Selection",
subtitle = "Selecting parameter subset size with cross-validation",
x = "Subset Size",
y = "CV MSE")
min(CV_MSE)
## [1] 42.46049
Used Lasso Regression and found the minimum CV MSE = 42.91081, as shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)
set.seed(159)
model_lasso <- train(crim ~ .,
data = Boston,
method = "glmnet",
preProcess = c("center", "scale"),
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(alpha = 1,
lambda = seq(0, 1, length = 100)))
model_lasso$results %>%
rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(lambda == model_lasso$bestTune$lambda)) %>%
ggplot(aes(x = lambda, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Lasso Regression",
subtitle = "Selecting shrinkage parameter with cross-validation",
y = "CV MSE")
min(model_lasso$results$MSE)
## [1] 42.91081
Used Ridge Regression and found the minimum CV MSE = 43.04141, as shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)
set.seed(159)
model_ridge <- train(crim ~ .,
data = Boston,
method = "glmnet",
preProcess = c("center", "scale"),
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(alpha = 0,
lambda = seq(0, 1, length = 100)))
model_ridge$results %>%
rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(lambda == model_ridge$bestTune$lambda)) %>%
ggplot(aes(x = lambda, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Ridge Regression",
subtitle = "Selecting shrinkage parameter with cross-validation",
y = "CV MSE")
min(model_ridge$results$MSE)
## [1] 43.04141
Used Principal Components Regression (PCR) and found the minimum CV MSE = 42.97377, as shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)
set.seed(339)
model_pcr <- train(crim ~ .,
data = Boston,
method = "pcr",
preProcess = c("center", "scale"),
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))
model_pcr$results %>%
rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(ncomp == model_pcr$bestTune$ncomp)) %>%
ggplot(aes(x = ncomp, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Principal Components Regression",
subtitle = "Selecting number of principal components with cross-validation",
x = "Principal Components",
y = "CV MSE")
min(model_pcr$results$MSE)
## [1] 42.97377
Used Partial Least Squares and found the minimum CV MSE = 43.07288, as shown below:
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)
require(pls)
set.seed(840)
model_pls <- train(crim ~ .,
data = Boston,
method = "pls",
preProcess = c("center", "scale"),
metric = "MSE",
maximize = F,
trControl = ctrl,
tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))
model_pls$results %>%
rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(ncomp == model_pls$bestTune$ncomp)) %>%
ggplot(aes(x = ncomp, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(title = "Boston Dataset - Partial Least Squares",
subtitle = "Selecting number of principal components with cross-validation",
x = "Principal Components",
y = "CV MSE")
min(model_pls$results$MSE)
## [1] 43.07288
ANSWER 11-b:
It seems that the model with the best cross-validation MSE was Best Subset Selection, at MSE = 42.46049. To probe this result, I will make a small change to the dataset and add a quadratic term for medv.
require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)
Boston$medv_sq <- Boston$medv^2
glimpse(Boston)
## Rows: 506
## Columns: 15
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...
## $ medv_sq <dbl> 576.00, 466.56, 1204.09, 1115.56, 1310.44, 823.69, 524.41, ...
# best models per subset size:
model <- regsubsets(crim ~ .,
data = Boston,
nvmax = ncol(Boston) - 1,
method = "exhaustive")
# cross-validating to compare MSE:
CV_MSE <- c()
set.seed(20202)
for (i in 1:(ncol(Boston)-1)) {
Boston_temp <- Boston[ ,c("crim", names(coef(model, id = i)[-1]))]
model_temp <- train(crim ~ .,
data = Boston_temp,
method = "lm",
trControl = ctrl)
CV_MSE[i] <- model_temp$results$MSE
}
data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(Boston)-1)) %>% mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
ggplot(aes(x = subset_size, y = CV_MSE)) +
geom_line(col = "blue") +
geom_point(size = 2, aes(col = factor(min_CV_MSE))) +
scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
scale_color_manual(values = c("deepskyblue3", "orange")) +
theme(legend.position = "none") +
labs(title = "Modified Boston Dataset - Best Subset Selection",
subtitle = "Selecting parameter subset size with cross-validation",
x = "Subset Size",
y = "CV MSE")
As shown above, there is a clear improvement over the linear methods fit previously: the minimum cross-validation MSE drops to around 37.5.
ANSWER 11-c:
No. The chosen model uses only 7 of the 14 available predictors; adding all of the features causes the cross-validation MSE to increase.
require(ISLR)
coef(model, id = 7)
## (Intercept) nox dis rad tax
## 42.218638401 -16.438325440 -0.572670128 0.612925666 -0.006939313
## ptratio medv medv_sq
## -0.518520916 -1.468671062 0.020821049