2. For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

  (a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

ANSWER a: The correct answer is iii: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Explanation: When the least squares estimates have excessively high variance, the lasso solution can yield a reduction in variance at the expense of a small increase in bias, and consequently can generate more accurate predictions (pg. 231).
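As an illustrative sketch (not part of the textbook answer), this shrinkage can be seen on simulated data: as λ grows, the lasso sets more and more coefficients exactly to zero, trading a little bias for lower variance. This assumes glmnet is available.

require(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)  # 20 predictors, only 3 truly matter
y <- 3 * x[, 1] + 2 * x[, 2] + x[, 3] + rnorm(100)

fit_lasso <- glmnet(x, y, alpha = 1)   # alpha = 1 -> lasso
plot(fit_lasso, xvar = "lambda")       # coefficient paths hit exactly zero as lambda grows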

  (b) Repeat (a) for ridge regression relative to least squares.

ANSWER b: The correct answer is iii: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Explanation: Ridge regression’s advantage over least squares is rooted in the bias-variance trade-off. As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias (pg. 225).
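A companion sketch for ridge (same simulated setup, glmnet assumed): the coefficient paths shrink smoothly toward zero as λ increases, reducing variance, but none are set exactly to zero.

require(glmnet)

set.seed(1)
x <- matrix(rnorm(100 * 20), 100, 20)
y <- 3 * x[, 1] + 2 * x[, 2] + x[, 3] + rnorm(100)

fit_ridge <- glmnet(x, y, alpha = 0)   # alpha = 0 -> ridge
plot(fit_ridge, xvar = "lambda")       # paths shrink toward, but never reach, zero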

  (c) Repeat (a) for non-linear methods relative to least squares.

ANSWER c: The correct answer is ii: More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Explanation: Non-linear methods are more flexible than least squares. Their added flexibility reduces bias at the cost of higher variance, so they improve prediction accuracy precisely when the increase in variance is smaller than the decrease in bias.
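A small simulated example (base R only, illustrative rather than definitive) makes this concrete: when the truth is non-linear, a smoothing spline beats a straight-line fit on held-out data because the variance it adds is smaller than the bias it removes.

set.seed(2)
x <- runif(200, 0, 2 * pi)
y <- sin(x) + rnorm(200, sd = 0.3)     # the true relationship is non-linear
train_id <- sample(200, 100)

fit_lm <- lm(y ~ x, subset = train_id)             # inflexible: straight line
fit_ss <- smooth.spline(x[train_id], y[train_id])  # flexible: smoothing spline

mse_lm <- mean((predict(fit_lm, data.frame(x = x[-train_id])) - y[-train_id])^2)
mse_ss <- mean((predict(fit_ss, x[-train_id])$y - y[-train_id])^2)
c(linear = mse_lm, spline = mse_ss)    # the spline should show the lower test MSE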

9. In this exercise, we will predict the number of applications received using the other variables in the College data set.

  (a) Split the data set into a training set and a test set.

ANSWER 9-a:

require(ISLR)
set.seed(3)
train_index <- sample(1:nrow(College), round(nrow(College) * 0.7))

train <- College[train_index, ]
nrow(train) / nrow(College)
## [1] 0.7001287
test <- College[-train_index, ]
nrow(test) / nrow(College)
## [1] 0.2998713
  (b) Fit a linear model using least squares on the training set, and report the test error obtained.

ANSWER 9-b:

Using MSE as the test error metric, the test error is 1413287, as shown below:

require(ISLR)
require(caret)
model_ln <- lm(Apps ~ ., data = train)
summary(model_ln)
## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3371.9  -437.6   -63.3   350.2  6233.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -605.81244  493.25798  -1.228 0.219927    
## PrivateYes  -584.35163  164.18527  -3.559 0.000406 ***
## Accept         1.32861    0.06124  21.696  < 2e-16 ***
## Enroll        -0.24166    0.21699  -1.114 0.265908    
## Top10perc     52.92817    6.99182   7.570 1.68e-13 ***
## Top25perc    -17.86210    5.44140  -3.283 0.001097 ** 
## F.Undergrad    0.04199    0.03837   1.094 0.274342    
## P.Undergrad    0.04110    0.03435   1.197 0.231996    
## Outstate      -0.07531    0.02325  -3.239 0.001274 ** 
## Room.Board     0.17973    0.05972   3.010 0.002742 ** 
## Books         -0.09909    0.26773  -0.370 0.711436    
## Personal       0.02642    0.07653   0.345 0.730076    
## PhD           -9.93047    5.51335  -1.801 0.072249 .  
## Terminal      -5.68887    6.08867  -0.934 0.350559    
## S.F.Ratio     23.09251   15.20844   1.518 0.129514    
## perc.alumni   -6.52534    5.04819  -1.293 0.196713    
## Expend         0.12521    0.01797   6.966 9.77e-12 ***
## Grad.Rate     10.65431    3.52288   3.024 0.002614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1042 on 526 degrees of freedom
## Multiple R-squared:  0.9215, Adjusted R-squared:  0.9189 
## F-statistic:   363 on 17 and 526 DF,  p-value: < 2.2e-16
ols_pred <- predict(model_ln, test)
(ols_mse <- mean((ols_pred - test$Apps)^2))
## [1] 1413287
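Since caret is already loaded, postResample() offers a quick cross-check of the same predictions (its RMSE squared should match the MSE above):

postResample(pred = ols_pred, obs = test$Apps)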
  (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

ANSWER 9-c:

The test MSE obtained is 1545921, as shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(tidyr)
require(dplyr)
train_mat <- dummyVars(Apps ~ ., data = train, fullRank = F) %>% 
  predict(newdata = train) %>% 
  as.matrix()
test_mat <- dummyVars(Apps ~ ., data = test, fullRank = F) %>% 
  predict(newdata = test) %>% 
  as.matrix()

set.seed(3)
model_ridge <- cv.glmnet(y = train$Apps, 
                         x = train_mat, 
                         alpha = 0, 
                         lambda = 10^seq(2,-2, length = 100), 
                         standardize = TRUE, 
                         nfolds = 5)

data.frame(lambda = model_ridge$lambda, 
           cv_mse = model_ridge$cvm) %>%
  ggplot(aes(x = lambda, y = cv_mse)) + 
  geom_point() + 
  geom_line(col="blue") + 
  geom_vline(xintercept = model_ridge$lambda.min, col = "red") +
  geom_hline(yintercept = min(model_ridge$cvm), col = "red") +
  scale_x_continuous(trans = 'log10', breaks = c(0.01, 0.1, 1, 10, 100), labels = c(0.01, 0.1, 1, 10, 100)) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  theme(legend.position = "bottom") + 
  labs(x = "Lambda", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "Ridge Regression - Lambda Selection (5-Fold Cross-Validation)")

model_ridge_best <- glmnet(y = train$Apps,
                           x = train_mat,
                           alpha = 0, 
                           lambda = 10^seq(2,-2, length = 100))

ridge_pred <- predict(model_ridge_best, s = model_ridge$lambda.min, newx = test_mat)
(ridge_mse <- mean((ridge_pred - test$Apps)^2))
## [1] 1545921
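As a side note, cv.glmnet() already contains a fitted glmnet object, so the refit above is optional; predicting directly from model_ridge with s = "lambda.min" should give essentially the same test MSE:

ridge_pred_cv <- predict(model_ridge, s = "lambda.min", newx = test_mat)
mean((ridge_pred_cv - test$Apps)^2)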
  (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

ANSWER 9-d:

The test MSE obtained is 1444668, and the non-zero coefficient estimates are shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(tidyr)
require(dplyr)

set.seed(4)

model_lasso <- cv.glmnet(y = train$Apps, 
                         x = train_mat, 
                         alpha = 1, 
                         lambda = 10^seq(2, -2, length = 100), 
                         standardize = TRUE, 
                         nfolds = 5, 
                         thresh = 1e-12)

data.frame(lambda = model_lasso$lambda, 
           cv_mse = model_lasso$cvm, 
           nonzero_coeff = model_lasso$nzero) %>%
  ggplot(aes(x = lambda, y = cv_mse, col = nonzero_coeff)) + 
  geom_point() + 
  geom_line() + 
  geom_vline(xintercept = model_lasso$lambda.min, col = "red") +
  geom_hline(yintercept = min(model_lasso$cvm), col = "red") +
  scale_x_continuous(trans = 'log10', breaks = c(0.01, 0.1, 1, 10, 100), labels = c(0.01, 0.1, 1, 10, 100)) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  theme(legend.position = "bottom") + 
  scale_color_gradient(low = "red", high = "green") +
  labs(x = "Lambda", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "Lasso - Lambda Selection (Using 5-Fold Cross-Validation)")

model_lasso_best <- glmnet(y = train$Apps,
                           x = train_mat,
                           alpha = 1, 
                           lambda = 10^seq(2,-5, length = 100))

lasso_pred <- predict(model_lasso_best, s = model_lasso$lambda.min, newx = test_mat)
(lasso_mse <- mean((lasso_pred - test$Apps)^2))
## [1] 1444668
lasso_coef <- predict(model_lasso_best, type = "coefficients", s = model_lasso$lambda.min)
round(lasso_coef, 3)
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                    s1
## (Intercept) -1302.682
## Private.No    568.744
## Private.Yes     .    
## Accept          1.281
## Enroll          .    
## Top10perc      46.089
## Top25perc     -12.664
## F.Undergrad     0.018
## P.Undergrad     0.035
## Outstate       -0.063
## Room.Board      0.164
## Books          -0.004
## Personal        0.010
## PhD            -8.092
## Terminal       -5.653
## S.F.Ratio      18.699
## perc.alumni    -7.071
## Expend          0.120
## Grad.Rate       9.274
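The number of non-zero coefficient estimates asked for in the question can also be counted directly from the sparse coefficient matrix (intercept excluded):

sum(lasso_coef != 0) - 1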
  (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

ANSWER 9-e:

The test MSE obtained is 1550794 at M = 16, as shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(tidyr)
require(dplyr)
require(pls)
set.seed(5)

model_pcr <- pcr(Apps ~ .,
                 data = train, 
                 scale = T, 
                 validation = "CV")

model_pcr_mse <- MSEP(model_pcr, estimate = "CV")$val %>% 
  reshape2::melt() %>% 
  mutate(M = 0:(nrow(.) - 1)) %>% 
  select(M, value) %>% 
  rename(CV_MSE = value)
model_pcr_mse
##     M   CV_MSE
## 1   0 13409401
## 2   1 13221496
## 3   2  3031993
## 4   3  3073801
## 5   4  2444471
## 6   5  2078195
## 7   6  1836587
## 8   7  1671864
## 9   8  1593721
## 10  9  1607070
## 11 10  1598481
## 12 11  1608062
## 13 12  1587359
## 14 13  1600032
## 15 14  1598151
## 16 15  1613936
## 17 16  1203941
## 18 17  1207079
model_pcr_mse %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = M, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(x = "M", 
       y = "Cross-Validation MSE", 
       col = "Non-Zero Coefficients:", 
       title = "PCR - M Selection (Using 10-Fold Cross-Validation)")

pcr_pred <- predict(model_pcr, test, ncomp = 16)
(pcr_mse <- mean((pcr_pred - test$Apps)^2))
## [1] 1550794
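Rather than reading M = 16 off the table, the selected value can also be extracted programmatically from the CV results:

(best_M <- model_pcr_mse$M[which.min(model_pcr_mse$CV_MSE)])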
  (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

ANSWER 9-f:

The test MSE obtained is 1451594 at M = 8, as shown below:

require(ISLR)
require(pls)

set.seed(6)

model_pls <- plsr(Apps ~ .,
                  data = train, 
                  scale = T, 
                  validation = "CV")

model_pls_mse <- MSEP(model_pls, estimate = "CV")$val %>% 
  reshape2::melt() %>% 
  mutate(M = 0:(nrow(.) - 1)) %>% 
  select(M, value) %>% 
  rename(CV_MSE = value)
model_pls_mse
##     M   CV_MSE
## 1   0 13409401
## 2   1  2550956
## 3   2  1823854
## 4   3  1502580
## 5   4  1443224
## 6   5  1373482
## 7   6  1269080
## 8   7  1236585
## 9   8  1229345
## 10  9  1229525
## 11 10  1232714
## 12 11  1232950
## 13 12  1234544
## 14 13  1238947
## 15 14  1240060
## 16 15  1239458
## 17 16  1239412
## 18 17  1239557
model_pls_mse %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = M, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_y_continuous(labels = scales::comma_format()) + 
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(x = "M", 
       y = "Cross-Validation MSE", 
       title = "PLS - M Selection (Using 10-Fold Cross-Validation)")

pls_pred <- predict(model_pls, test, ncomp = 8)
(pls_mse <- mean((pls_pred - test$Apps)^2))
## [1] 1451594
  (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

ANSWER 9-g:

As shown below, there is minimal difference among the test errors (MSE) of the five approaches. All five predict the number of applications quite accurately, with test R² values of roughly 0.92:

require(ISLR)
require(pls)

# test R-squared: R2 = 1 - SS_res / SS_tot

SS_tot <- sum((test$Apps - mean(test$Apps))^2)

data.frame(method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"), 
           test_MSE = c(ols_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse), 
           test_R2 = c(1 - sum((test$Apps - ols_pred)^2) / SS_tot,
                       1 - sum((test$Apps - ridge_pred)^2) / SS_tot, 
                       1 - sum((test$Apps - lasso_pred)^2) / SS_tot, 
                       1 - sum((test$Apps - pcr_pred)^2) / SS_tot, 
                       1 - sum((test$Apps - pls_pred)^2) / SS_tot)) %>% arrange(test_MSE)
##   method test_MSE   test_R2
## 1    OLS  1413287 0.9241061
## 2  Lasso  1444668 0.9224210
## 3    PLS  1451594 0.9220490
## 4  Ridge  1545921 0.9169836
## 5    PCR  1550794 0.9167220
11. We will now try to predict per capita crime rate in the Boston data set.

  (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

ANSWER 11-a:

Best subset selection gives a minimum CV MSE of 42.46049, as shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)

# Custom caret summary function: returns several error metrics per resample.
custom_regression_metrics <- function (data, lev = NULL, model = NULL) {
  c(RMSE = sqrt(mean((data$obs - data$pred)^2)),        # root mean squared error
    Rsquared = summary(lm(pred ~ obs, data))$r.squared, # R-squared of pred vs obs
    MAE = mean(abs(data$obs - data$pred)),              # mean absolute error
    MSE = mean((data$obs - data$pred)^2),               # mean squared error
    RSS = sum((data$obs - data$pred)^2))                # residual sum of squares
}

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 10, summaryFunction = custom_regression_metrics)

# best models per subset size:
model <- regsubsets(crim ~ ., 
                    data = Boston, 
                    nvmax = ncol(Boston) - 1, 
                    method = "exhaustive")

# cross-validating to compare MSE:

CV_MSE <- c()

set.seed(10101)

for (i in 1:(ncol(Boston)-1)) {
  Boston_temp <- Boston[ ,c("crim", names(coef(model, id = i)[-1]))]
  model_temp <- train(crim ~ ., 
                      data = Boston_temp, 
                      method = "lm", 
                      trControl = ctrl)
  CV_MSE[i] <- model_temp$results$MSE
}

data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(Boston)-1)) %>% mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = subset_size, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Best Subset Selection", 
       subtitle = "Selecting parameter subset size with cross-validation",
       x = "Subset Size", 
       y = "CV MSE")

min(CV_MSE)
## [1] 42.46049
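The subset size attaining this minimum can be reported alongside it:

which.min(CV_MSE)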

The lasso gives a minimum CV MSE of 42.91081, as shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)

set.seed(159)

model_lasso <- train(crim ~ ., 
                     data = Boston, 
                     method = "glmnet", 
                     preProcess = c("center", "scale"), 
                     metric = "MSE",
                     maximize = F,
                     trControl = ctrl, 
                     tuneGrid = expand.grid(alpha = 1,
                                            lambda = seq(0, 1, length = 100)))



model_lasso$results %>%
  rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(lambda == model_lasso$bestTune$lambda)) %>%
  ggplot(aes(x = lambda, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Lasso Regression", 
       subtitle = "Selecting shrinkage parameter with cross-validation",
       y = "CV MSE")

min(model_lasso$results$MSE)
## [1] 42.91081
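The value of the shrinkage parameter selected by caret can be reported as well:

model_lasso$bestTune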

Ridge regression gives a minimum CV MSE of 43.04141, as shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)

set.seed(159)

model_ridge <- train(crim ~ ., 
                     data = Boston, 
                     method = "glmnet", 
                     preProcess = c("center", "scale"), 
                     metric = "MSE",
                     maximize = F,
                     trControl = ctrl, 
                     tuneGrid = expand.grid(alpha = 0,
                                            lambda = seq(0, 1, length = 100)))


model_ridge$results %>%
  rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(lambda == model_ridge$bestTune$lambda)) %>%
  ggplot(aes(x = lambda, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Ridge Regression", 
       subtitle = "Selecting shrinkage parameter with cross-validation",
       y = "CV MSE")

min(model_ridge$results$MSE)
## [1] 43.04141

Principal components regression (PCR) gives a minimum CV MSE of 42.97377, as shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)

set.seed(339)

model_pcr <- train(crim ~ ., 
                   data = Boston, 
                   method = "pcr", 
                   preProcess = c("center", "scale"), 
                   metric = "MSE", 
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))


model_pcr$results %>%
  rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(ncomp == model_pcr$bestTune$ncomp)) %>%
  ggplot(aes(x = ncomp, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Principal Components Regression", 
       subtitle = "Selecting number of principal components with cross-validation",
       x = "Principal Components", 
       y = "CV MSE")

min(model_pcr$results$MSE)
## [1] 42.97377

Partial least squares (PLS) gives a minimum CV MSE of 43.07288, as shown below:

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)
require(pls)
set.seed(840)

model_pls <- train(crim ~ ., 
                   data = Boston, 
                   method = "pls", 
                   preProcess = c("center", "scale"), 
                   metric = "MSE", 
                   maximize = F,
                   trControl = ctrl, 
                   tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))


model_pls$results %>%
  rename(CV_MSE = MSE) %>% mutate(min_CV_MSE = as.numeric(ncomp == model_pls$bestTune$ncomp)) %>%
  ggplot(aes(x = ncomp, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Partial Least Squares", 
       subtitle = "Selecting number of principal components with cross-validation",
       x = "Principal Components", 
       y = "CV MSE")

min(model_pls$results$MSE)
## [1] 43.07288
  (b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

ANSWER 11-b:

The model with the best cross-validation MSE was best subset selection, at 42.46049. To justify this answer, I will make a small change to the dataset and add a quadratic term for medv.

require(ISLR)
require(caret)
require(ggplot2)
require(glmnet)
require(leaps)
require(MASS)

Boston$medv_sq <- Boston$medv^2
glimpse(Boston)
## Rows: 506
## Columns: 15
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, 0.088...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5, 12.5...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, 7.87,...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524, 0.5...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172, 5.6...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0, 85.9...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605, 5.9...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4,...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, 15.2,...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60, 396...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.93, 17...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9,...
## $ medv_sq <dbl> 576.00, 466.56, 1204.09, 1115.56, 1310.44, 823.69, 524.41, ...
# best models per subset size:
model <- regsubsets(crim ~ ., 
                    data = Boston, 
                    nvmax = ncol(Boston) - 1, 
                    method = "exhaustive")


# cross-validating to compare MSE:
CV_MSE <- c()

set.seed(20202)

for (i in 1:(ncol(Boston)-1)) {
  Boston_temp <- Boston[ ,c("crim", names(coef(model, id = i)[-1]))]
  model_temp <- train(crim ~ ., 
                      data = Boston_temp, 
                      method = "lm", 
                      trControl = ctrl)
  CV_MSE[i] <- model_temp$results$MSE
}

data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(Boston)-1)) %>% mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  ggplot(aes(x = subset_size, y = CV_MSE)) + 
  geom_line(col = "blue") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("deepskyblue3", "orange")) + 
  theme(legend.position = "none") + 
  labs(title = "Modified Boston Dataset - Best Subset Selection", 
       subtitle = "Selecting parameter subset size with cross-validation",
       x = "Subset Size", 
       y = "CV MSE")

As shown above, adding the quadratic term yields a significant improvement over the linear methods used previously: the minimum cross-validation MSE drops to around 37.5.
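To report the exact figure rather than reading it off the plot (the value depends on the CV seed above), the minimum and its subset size can be printed:

min(CV_MSE)
which.min(CV_MSE)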

  (c) Does your chosen model involve all of the features in the data set? Why or why not?

ANSWER 11-c:

No. Adding all of the features causes the cross-validation MSE to increase; the selected model uses only 7 of them, with the following coefficients:

require(ISLR)

coef(model, id = 7)
##   (Intercept)           nox           dis           rad           tax 
##  42.218638401 -16.438325440  -0.572670128   0.612925666  -0.006939313 
##       ptratio          medv       medv_sq 
##  -0.518520916  -1.468671062   0.020821049