Exercise 6.6.2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  (a) The lasso, relative to least squares, is:
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The lasso's flexibility depends on its penalty: when lambda = 0 it is identical to least squares, and as lambda increases the L1 penalty shrinks coefficients (some to exactly zero), making the model less flexible. Reduced flexibility increases bias but decreases variance, so the lasso improves prediction accuracy only when the increase in bias is smaller than the decrease in variance.
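
As a minimal illustration of this shrinkage (not part of the exercise), here is a sketch using glmnet on a small simulated data set; the objects X, y and the coefficient values are invented for the example:

library(glmnet)

set.seed(1)
n <- 100; p <- 5
X <- matrix(rnorm(n * p), n, p)                  # simulated predictors
y <- drop(X %*% c(3, -2, 1, 0, 0)) + rnorm(n)    # only three predictors matter

ols_fit   <- lm(y ~ X)
lasso_fit <- glmnet(X, y, alpha = 1)             # alpha = 1 is the lasso

# near the smallest lambda on the path the lasso essentially reproduces OLS;
# at the largest lambda every coefficient is shrunk to zero
cbind(ols    = coef(ols_fit),
      lam_lo = as.numeric(coef(lasso_fit, s = min(lasso_fit$lambda))),
      lam_hi = as.numeric(coef(lasso_fit, s = max(lasso_fit$lambda))))
plot(lasso_fit, xvar = "lambda")                 # coefficient paths vs log(lambda)

At the large-lambda end the fit has maximal bias and minimal variance; at the small-lambda end it is essentially least squares.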

  (b) Repeat (a) for ridge regression relative to least squares.
  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Ridge regression also coincides with least squares when lambda = 0. As lambda increases, the L2 penalty shrinks the coefficients toward zero, reducing flexibility: bias increases but variance drops substantially. Ridge regression therefore gives better prediction accuracy when the increase in bias is outweighed by the decrease in variance.

  (c) Repeat (a) for non-linear methods relative to least squares.
  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Non-linear methods are generally more flexible than linear models, so they have lower bias but higher variance. They improve prediction accuracy only when the increase in variance is smaller than the decrease in bias.

Exercise 6.6.9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

# packages assumed for the code below: ISLR (College), MASS (Boston),
# caret, glmnet, pls, leaps, and olsrr
library(ISLR)    # College data set
library(MASS)    # Boston data set
library(caret)   # createDataPartition, createFolds
library(glmnet)  # ridge regression and the lasso
library(pls)     # pcr and plsr
library(leaps)   # regsubsets
library(olsrr)   # ols_step_both_p

head(College)
##                              Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University     Yes 1660   1232    721        23        52
## Adelphi University               Yes 2186   1924    512        16        29
## Adrian College                   Yes 1428   1097    336        22        50
## Agnes Scott College              Yes  417    349    137        60        89
## Alaska Pacific University        Yes  193    146     55        16        44
## Albertson College                Yes  587    479    158        38        62
##                              F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University        2885         537     7440       3300   450
## Adelphi University                  2683        1227    12280       6450   750
## Adrian College                      1036          99    11250       3750   400
## Agnes Scott College                  510          63    12960       5450   450
## Alaska Pacific University            249         869     7560       4120   800
## Albertson College                    678          41    13500       3335   500
##                              Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University     2200  70       78      18.1          12   7041
## Adelphi University               1500  29       30      12.2          16  10527
## Adrian College                   1165  53       66      12.9          30   8735
## Agnes Scott College               875  92       97       7.7          37  19016
## Alaska Pacific University        1500  76       72      11.9           2  10922
## Albertson College                 675  67       73       9.4          11   9727
##                              Grad.Rate
## Abilene Christian University        60
## Adelphi University                  56
## Adrian College                      54
## Agnes Scott College                 59
## Alaska Pacific University           15
## Albertson College                   55
  (a) Split the data set into a training set and a test set.

We can create an 80-20 train test split using caret’s createDataPartition:

train_indices <- createDataPartition(College$Apps, times = 1, p = 0.80, list = FALSE)
train <- College[train_indices, ]
test <- College[-train_indices, ]
  (b) Fit a linear model using least squares on the training set, and report the test error obtained.
lm_fit <- lm(Apps ~ ., data = train)
lm_predict <- predict(lm_fit, newdata = test)
lm_err <- mean((test$Apps - lm_predict)^2)
lm_err
## [1] 1224247

We get an MSE of 1224247 using least squares.

  (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
x_train <- data.matrix(train[, -2])  # predictors only; Apps is column 2 (data.matrix codes Private as 1/2)
y_train <- train$Apps
x_test <- data.matrix(test[, -2])    # predictors only
y_test <- test$Apps

cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)

best_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = cv_ridge$lambda.min)
ridge_predict <- predict(best_ridge, newx=x_test)
ridge_err <- mean((test$Apps - ridge_predict)^2)
ridge_err
## [1] 1016243

We get an MSE of 1016243 with Ridge Regression.

  (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)

best_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = cv_lasso$lambda.min)
lasso_predict <- predict(best_lasso, newx=x_test)
lasso_err <- mean((test$Apps - lasso_predict)^2)
lasso_err
## [1] 1209610

We get an MSE of 1209610 with Lasso.
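
Part (d) also asks for the number of non-zero coefficient estimates, which was not recorded above; it can be read off the fitted object (a sketch using the best_lasso fit from above):

lasso_coefs <- as.matrix(coef(best_lasso))  # (p + 1) x 1 coefficient matrix
sum(lasso_coefs[-1, ] != 0)                 # non-zero coefficients, excluding the intercept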

  (e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pcr_fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr_fit)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4091     4066     2163     2140     1961     1702     1681
## adjCV         4091     4069     2161     2141     1936     1696     1676
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1677     1631     1609      1611      1617      1619      1625
## adjCV     1672     1623     1604      1607      1613      1614      1620
##        14 comps  15 comps  16 comps  17 comps
## CV         1624      1585      1209      1152
## adjCV      1619      1568      1201      1144
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      30.935    57.01    63.91    69.64    75.28    80.13    83.85    87.23
## Apps    1.308    72.53    73.26    78.63    83.97    84.37    84.45    85.34
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.40     92.81     94.93     96.77     97.87     98.73     99.35
## Apps    85.78     85.87     85.87     85.95     85.95     86.03     90.16
##       16 comps  17 comps
## X        99.85    100.00
## Apps     93.19     93.78

We see that the CV error decreases sharply up to about 9 components, rises slightly through 13 components, and then drops again at 16 and 17 components. The minimum CV error is at 17 components, but that is simply the full least squares fit, which defeats the purpose of dimension reduction. We therefore take M = 9 principal components for our PCR model, since the additional components buy very little.
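
The same pattern can be seen graphically with the pls package's validation plot (a quick sketch using the fit above):

validationplot(pcr_fit, val.type = "MSEP")  # CV error versus number of components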

pcr_pred <- predict(pcr_fit, test, ncomp = 9)
pcr_err <- mean((test$Apps - pcr_pred)^2)
pcr_err
## [1] 1126717

We get an MSE of 1126717 using PCR.

  (f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
pls_fit <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls_fit)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            4091     1971     1589     1511     1437     1274     1173
## adjCV         4091     1967     1581     1509     1420     1258     1164
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1156     1142     1126      1123      1119      1118      1119
## adjCV     1148     1135     1120      1117      1113      1112      1113
##        14 comps  15 comps  16 comps  17 comps
## CV         1120      1121      1121      1121
## adjCV      1114      1115      1115      1115
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       26.03    35.43    62.38    65.34    69.32    73.43    76.89    79.79
## Apps    77.86    86.50    87.77    91.12    92.92    93.46    93.54    93.61
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       81.96     85.91     88.56     90.42     92.68     95.09     96.79
## Apps    93.70     93.73     93.75     93.77     93.78     93.78     93.78
##       16 comps  17 comps
## X        98.90    100.00
## Apps     93.78     93.78

The CV error reaches its minimum at 12 components, but it is essentially flat from about 9 components onward (CV RMSEP 1126 at 9 components versus 1118 at 12), so we fit the final PLS model with M = 9 components.

pls_pred <- predict(pls_fit, test, ncomp = 9)
pls_err <- mean((test$Apps - pls_pred)^2)
pls_err
## [1] 1198885

We get an MSE of 1198885 using PLS.

  (g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
cat('MSE for linear regression: ', lm_err, '\n')
## MSE for linear regression:  1224247
cat('MSE for Ridge Regression: ', ridge_err, '\n')
## MSE for Ridge Regression:  1016243
cat('MSE for Lasso: ', lasso_err, '\n')
## MSE for Lasso:  1209610
cat('MSE for PCR: ', pcr_err, '\n')
## MSE for PCR:  1126717
cat('MSE for PLS: ', pls_err, '\n')
## MSE for PLS:  1198885

Ridge regression gives the best test performance, with PCR second at a slightly higher MSE. Least squares, the lasso, and PLS all land in a similar range, so overall there is not a large difference among the five approaches.
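
To put these MSEs on a more interpretable scale, we could also compute a test-set R^2 for each method, i.e. 1 - MSE / Var(Apps in the test set); a sketch using the error objects above:

test_var <- mean((test$Apps - mean(test$Apps))^2)  # variance of Apps in the test set
r2 <- 1 - c(lm = lm_err, ridge = ridge_err, lasso = lasso_err,
            pcr = pcr_err, pls = pls_err) / test_var
round(r2, 3)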

Exercise 6.6.11

We will now try to predict per capita crime rate in the Boston data set.

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
  (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
x.boston <- model.matrix(crim ~., data=Boston)[,-1]
y.boston <- Boston[,'crim']
# best subset selection
subsets <- regsubsets(crim ~ ., data = Boston, nvmax = 13) 
subset.summary <- summary(subsets)

print(subset.summary$outmat)
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
# plots
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(subset.summary$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp")
lines(1:length(subset.summary$cp), 1:length(subset.summary$cp) + 1, col = "red", lty = 2) # p + 1
plot(subset.summary$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(subset.summary$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")

# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(subset.summary$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 8
cat("Best score for BIC:", which.min(subset.summary$bic), "\n")
## Best score for BIC: 3
cat("Best score for Adjusted R2:", which.max(subset.summary$adjr2), "\n")
## Best score for Adjusted R2: 9

Using regsubsets, we see mixed results: Cp is minimized at 8 predictors (with near-optimal values from about 7), BIC at 3 predictors, and adjusted R2 at 9 predictors. Since Cp and adjusted R2 roughly agree around 8 predictors, we take the 8-predictor model chosen by regsubsets, which uses:

zn + nox + dis + rad + ptratio + black + lstat + medv
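
As a check, the coefficient estimates for this 8-variable model can be pulled from the regsubsets fit; coef() on a regsubsets object takes the model size as its second argument (sketch):

coef(subsets, 8)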

# stepwise (both directions) selection using olsrr's ols_step_both_p
full.model <- lm(crim~., data=Boston)

both <- ols_step_both_p(full.model, p_val = 0.05)

# both
print('Both:', quote=F)
## [1] Both:
summary(both$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.023  -1.713  -0.281   0.873  77.716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.372585   1.641557  -0.227  0.82054    
## rad          0.488172   0.040422  12.077  < 2e-16 ***
## lstat        0.213596   0.047447   4.502 8.39e-06 ***
## black       -0.009472   0.003615  -2.620  0.00905 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.521 on 502 degrees of freedom
## Multiple R-squared:  0.4286, Adjusted R-squared:  0.4252 
## F-statistic: 125.5 on 3 and 502 DF,  p-value: < 2.2e-16
plot(both)

Using stepwise regression, which combines forward and backward selection, we arrive at three important variables. The resulting formula is: rad + lstat + black.

These three variables all appear in the best subset results, and they are exactly the predictors in the 3-variable model that BIC favours in subset selection.

# with n >> p here, we do not expect shrinkage methods to improve much over least squares
grid = 10^(seq(2,-5,length = 100)) # lambda grid from 10^2 down to 10^-5

# lasso
lasso.boston = glmnet(x.boston,(y.boston),alpha = 1)
plot(lasso.boston,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)

# do cv for best lambda
cv.lasso.boston = cv.glmnet(x.boston,(y.boston),alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.boston)

# displaying best lambda values
c(log(cv.lasso.boston$lambda.min), log(cv.lasso.boston$lambda.1se))
## [1] -3.046855  1.186180

Using the lasso, cross-validation gives log(lambda.min) of about -3.0 and log(lambda.1se) of about 1.2. Interestingly, the two models differ substantially: the lambda.min fit retains 11 predictors while the 1se fit retains only one. We can see this by inspecting the coefficients:

lasso.best <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.min)
coef(lasso.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 12.921616863
## zn           0.036630369
## indus       -0.071088384
## chas        -0.592709165
## nox         -7.203117383
## rm           0.243372202
## age          .          
## dis         -0.800965160
## rad          0.515774432
## tax          .          
## ptratio     -0.194514895
## black       -0.007544322
## lstat        0.125137815
## medv        -0.160384184
lasso.1se <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.1se)
coef(lasso.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept) 1.3076511
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2414676
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .

Both fits agree that rad is an important feature, and rad is also the single predictor chosen by regsubsets for the one-variable model. That the 1se fit drops the other ten predictors suggests they contribute comparatively little, since the L1 penalty eliminates them within one standard error of the minimum CV error.
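
The counts quoted above can be verified directly from the two fits (sketch):

# non-zero coefficients (intercept excluded) at lambda.min and lambda.1se
c(min    = sum(as.matrix(coef(lasso.best))[-1, ] != 0),
  one_se = sum(as.matrix(coef(lasso.1se))[-1, ] != 0))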

# ridge regression
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5

ridge.boston = glmnet(x.boston,(y.boston),alpha = 0, lambda = grid)

plot(ridge.boston,"lambda",ylab = "Coefficients", xlab = expression(paste("log", lambda)), main = "Predictor coefficients against log lambda")
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)

cv.ridge.boston = cv.glmnet(x.boston,(y.boston),alpha = 0, lambda = grid, nfolds = 10)
plot(cv.ridge.boston,xlab = expression(paste("log", lambda)))

# displaying best lambda values:
c(log(cv.ridge.boston$lambda.min), log(cv.ridge.boston$lambda.1se))
## [1] -1.581574  4.279552

Using ridge regression, cross-validation gives log(lambda.min) of about -1.6 and log(lambda.1se) of about 4.3. We can see the following coefficients for each of these:

rr.best <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.min)
coef(rr.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## (Intercept) 12.9775905382
## zn           0.0387899913
## indus       -0.0804025387
## chas        -0.7323937516
## nox         -7.8754864737
## rm           0.3862135899
## age          0.0015110242
## dis         -0.8509646232
## rad          0.5007945695
## tax          0.0002970566
## ptratio     -0.2044293139
## black       -0.0080295010
## lstat        0.1371320594
## medv        -0.1683590233
rr.1se <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.1se)
coef(rr.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  1.359573930
## zn          -0.002953356
## indus        0.031060606
## chas        -0.184519032
## nox          2.001228568
## rm          -0.150147623
## age          0.006597720
## dis         -0.101274834
## rad          0.050252638
## tax          0.002268044
## ptratio      0.076314412
## black       -0.002837701
## lstat        0.038566984
## medv        -0.025307453

Comparing these with the lasso fits, the lambda.min ridge coefficients are very close to the corresponding lasso coefficients. The notable difference is at the 1se lambda: there, ridge leaves nox as the only coefficient far from zero (ridge never sets coefficients exactly to zero), whereas the 1se lasso keeps only rad. At that penalty level ridge therefore also disagrees with regsubsets, which, like the lasso, picks rad as the single most important predictor.
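
That comparison is easier to see with the coefficients placed side by side (a sketch using the four fits above):

coef_tab <- sapply(list(lasso.min = lasso.best, lasso.1se = lasso.1se,
                        ridge.min = rr.best,   ridge.1se = rr.1se),
                   function(fit) as.numeric(as.matrix(coef(fit))))
rownames(coef_tab) <- rownames(coef(lasso.best))
round(coef_tab, 4)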

# PCR
pcr.boston = pcr(crim ~ ., data = Boston, scale = T, validation = "CV")
summary(pcr.boston)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.193    7.193    6.761    6.757    6.758    6.757
## adjCV         8.61    7.191    7.191    6.756    6.749    6.754    6.752
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.743    6.623    6.629     6.627     6.621     6.592     6.517
## adjCV    6.738    6.617    6.623     6.620     6.614     6.583     6.508
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pcr.boston, val.type = 'MSEP')

Using PCR, the CV error drops sharply over the first 3 principal components and then largely plateaus, with only minor improvement as further components are added. We therefore take 3 principal components as our PCR model.

  (b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

The simplest models with good performance were stepwise regression and principal components regression, both of which showed low CV error while using only 3 predictors or components. The lasso and ridge regression also performed well, although noticeably worse at the 1-standard-error lambda. We will therefore carry stepwise regression, PCR, the lasso, and ridge regression forward, evaluate each with 10-fold cross-validation, and compare their MSEs to choose a final model.

Cross-validation was already performed in part (a) for all of these methods except stepwise regression, which we do now:

set.seed(123)

folds <- createFolds(Boston$crim, k = 10, list = TRUE)
mse_values <- numeric(10)
selected_models <- vector("list", length = 10)

for (i in seq_along(folds)) {
  # training and validation sets
  train_data <- Boston[-folds[[i]], ]
  test_data <- Boston[folds[[i]], ]
  
  # fit
  full_model <- lm(crim ~ ., data = train_data)
  stepwise_model <- ols_step_both_p(full_model, pent = 0.05, prem = 0.05)
  
  # store best model of fold
  selected_models[[i]] <- formula(stepwise_model$model)
  
  # get mse
  predictions <- predict(stepwise_model$model, newdata = test_data)
  mse_values[i] <- mean((test_data$crim - predictions)^2)
}

# average mse
mean_mse <- mean(mse_values)

# Output results
# cat("MSE for each fold:", mse_values, "\n")
cat("Mean MSE across all folds:", mean_mse, "\n")
## Mean MSE across all folds: 43.83496
# Print the selected models for each fold
cat("\nSelected models for each fold:\n")
## 
## Selected models for each fold:
for (i in seq_along(selected_models)) {
  cat(sprintf("Fold %d: %s\n", i, deparse(selected_models[[i]])))
}
## Fold 1: crim ~ rad + lstat + black
## Fold 2: crim ~ rad + lstat + medv + ptratio
## Fold 3: crim ~ rad + lstat + black
## Fold 4: crim ~ rad + lstat + black
## Fold 5: crim ~ rad + lstat + black
## Fold 6: crim ~ rad + lstat + black + zn + dis + nox + medv
## Fold 7: crim ~ rad + lstat
## Fold 8: crim ~ rad + lstat + black + medv
## Fold 9: crim ~ rad + lstat + black
## Fold 10: crim ~ rad + lstat + black

After performing cross-validated stepwise regression, we arrive at a mean MSE of 43.8, similar to the performance observed in part (a). The fold-by-fold selections also largely agree on the same three predictors as before: rad, lstat, and black.
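
A quick tally of how often each predictor appears across the ten fold-specific formulas confirms this (sketch):

selected_terms <- unlist(lapply(selected_models, function(f) attr(terms(f), "term.labels")))
sort(table(selected_terms), decreasing = TRUE)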

Now we can compare all of the results to see the optimal model:

## Ridge Regression:
## Minimum MSE: 43.16917
## MSE at 1-se Lambda: 57.52966
## Lasso Regression:
## Minimum MSE: 43.67005
## MSE at 1-se Lambda: 56.61822
## Principal Component Regression (PCR):
## MSE with 3 components: 45.70604
## MSE with 8 components: 43.86955
## Minimum MSE, with 13 components: 42.46525
## Stepwise Regression:
## [1] 43.83496

Looking at the results, the best CV MSEs sit at around 43: ridge regression, lasso regression, PCR with 13 components, and stepwise regression are all within roughly one point of each other.
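
For reference, the ridge and lasso cross-validation errors quoted above can be read straight off the cv.glmnet objects from part (a); a minimal sketch (the PCR and stepwise numbers come from the earlier CV output):

c(ridge.min = cv.ridge.boston$cvm[cv.ridge.boston$lambda == cv.ridge.boston$lambda.min],
  ridge.1se = cv.ridge.boston$cvm[cv.ridge.boston$lambda == cv.ridge.boston$lambda.1se],
  lasso.min = cv.lasso.boston$cvm[cv.lasso.boston$lambda == cv.lasso.boston$lambda.min],
  lasso.1se = cv.lasso.boston$cvm[cv.lasso.boston$lambda == cv.lasso.boston$lambda.1se])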

  (c) Does your chosen model involve all of the features in the data set? Why or why not?

With the exception of the fits at the 1-standard-error lambda, all of the models in the previous part performed at essentially the same level. Given that, we prefer the lasso or ridge regression for their lower variance: the penalty term shrinks the coefficients, but the reduced variance should pay off if the model is applied to new data in a real-world setting. Between the two, we propose lasso regression as the final model for predicting crime rates, because the lasso additionally performs feature selection. Returning to the coefficients at the minimum lambda, we can see which features the model treats as most important, something ridge regression cannot do:

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 12.921616863
## zn           0.036630369
## indus       -0.071088384
## chas        -0.592709165
## nox         -7.203117383
## rm           0.243372202
## age          .          
## dis         -0.800965160
## rad          0.515774432
## tax          .          
## ptratio     -0.194514895
## black       -0.007544322
## lstat        0.125137815
## medv        -0.160384184

We can see that age and tax are removed from the model by the L1 penalty and will not be used. While this does not shrink the predictor set as much as we might like, it does provide some feature selection without compromising performance.
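
The retained predictors can also be listed programmatically from the same fit (sketch):

nz <- as.matrix(coef(lasso.best))[, 1]            # named coefficient vector at lambda.min
names(nz)[nz != 0 & names(nz) != "(Intercept)"]   # predictors retained by the lasso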