Question 2

(a)

The lasso, relative to least squares, is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

(b)

Ridge regression, relative to least squares, is less flexible and will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

(c)

Non-linear methods, relative to least squares, are more flexible and will give improved prediction accuracy when their increase in variance is less than their decrease in bias.
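To make the bias-variance tradeoff above concrete, here is a small simulated illustration (my own sketch, not part of the exercise): with many predictors but only a few truly non-zero coefficients, the shrinkage methods tend to beat least squares on test MSE.

library(glmnet)

set.seed(1)
n = 50; p = 40
X = matrix(rnorm(n * p), n, p)
beta = c(rep(2, 5), rep(0, p - 5)) # only 5 coefficients are truly non-zero
y = as.numeric(X %*% beta + rnorm(n, sd = 3))
X.test = matrix(rnorm(1000 * p), 1000, p) # large test set for a stable MSE estimate
y.test = as.numeric(X.test %*% beta + rnorm(1000, sd = 3))

ols = lm(y ~ X)
lasso = cv.glmnet(X, y, alpha = 1)
ridge = cv.glmnet(X, y, alpha = 0)

test.mse = function(pred) mean((pred - y.test)^2)
c(ols = test.mse(cbind(1, X.test) %*% coef(ols)),
  lasso = test.mse(predict(lasso, newx = X.test, s = "lambda.min")),
  ridge = test.mse(predict(ridge, newx = X.test, s = "lambda.min")))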

Question 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

(a)

Split the data set into a training set and a test set.

library(ISLR)   # College data
library(glmnet) # ridge regression and the lasso
library(pls)    # pcr() and plsr()

dat = College
set.seed(123)

smp_sze = floor(.70*nrow(dat)) # 70% of the observations for training

samp = sample(seq_len(nrow(dat)), size = smp_sze)

train = dat[samp, ]

test = dat[-samp, ]
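As an aside (a sketch, not required by the exercise), caret, which Question 11 below relies on, offers a stratified alternative to this manual split; different object names are used here so the split above is left untouched:

set.seed(123)
idx = caret::createDataPartition(dat$Apps, p = 0.70, list = FALSE) # stratifies on the response
train.alt = dat[idx, ]
test.alt = dat[-idx, ]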

(b)

Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fit = lm(Apps ~ ., train)
lm.pred = predict(lm.fit, test)
mse = mean((lm.pred - test$Apps)^2)
mse
## [1] 1734841

The test MSE is 1,734,841 (about 1.73 × 10^6), which corresponds to an RMSE of roughly 1,317 applications.

(c)

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

set.seed(123) # set seed for reproducibility

train.mat = model.matrix(Apps ~., train)
test.mat = model.matrix(Apps ~., test)

grid = 10^seq(10, -2, length = 100) # lambda grid from a near-null model (heavy shrinkage) down to near the least squares fit

ridge.fit = cv.glmnet(train.mat, train$Apps, alpha=0, lambda = grid) # 10-fold cross-validation (the default) to find the best lambda

bestlam.ridge = ridge.fit$lambda.min

ridge.pred = predict(ridge.fit, s = bestlam.ridge, newx = test.mat)

mse = mean((ridge.pred - test$Apps)^2)
mse
## [1] 1939282

The test MSE increased relative to the least squares model and is now 1,939,282 (about 1.94 × 10^6).

out = glmnet(train.mat, train$Apps, alpha = 0)
predict(out, type="coefficients", s = bestlam.ridge) # ridge regression does not perform variable selection
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -1.108384e+03
## (Intercept)  .           
## PrivateYes  -6.344812e+02
## Accept       7.799468e-01
## Enroll       7.025220e-01
## Top10perc    2.888869e+01
## Top25perc   -2.480330e+00
## F.Undergrad  1.000467e-01
## P.Undergrad  7.509154e-03
## Outstate    -1.062903e-02
## Room.Board   2.160271e-01
## Books        2.845690e-01
## Personal    -6.359640e-02
## PhD         -2.824524e+00
## Terminal    -5.297962e+00
## S.F.Ratio   -2.570072e+00
## perc.alumni -1.259409e+01
## Expend       7.644084e-02
## Grad.Rate    1.090898e+01
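A side note on the duplicated "(Intercept)" row above: model.matrix() already includes an intercept column, and glmnet fits its own intercept, so the redundant column is shrunk to exactly zero. Rebuilding the matrices without that column avoids the artifact (alternate names are used here so the outputs below are unchanged):

train.mat2 = model.matrix(Apps ~ ., train)[, -1] # drop the intercept column
test.mat2 = model.matrix(Apps ~ ., test)[, -1]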

(d)

Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(123)
lasso.fit = glmnet(train.mat, train$Apps, alpha = 1, lambda = grid)
cv.lasso = cv.glmnet(train.mat, train$Apps, alpha = 1, lambda = grid)
bestlam.lasso = cv.lasso$lambda.min
bestlam.lasso
## [1] 6.135907
lasso.pred = predict(lasso.fit, s = bestlam.lasso, newx = test.mat)
mse = mean((lasso.pred-test$Apps)^2)
mse
## [1] 1729545

The test MSE decreased relative to both the least squares and the ridge models. This model has the lowest test MSE so far, at 1,729,545 (about 1.73 × 10^6).

predict(lasso.fit, type="coefficients", s = bestlam.lasso) # unlike ridge, the lasso performs variable selection
## 19 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -437.25898480
## (Intercept)    .         
## PrivateYes  -660.98080564
## Accept         1.21931066
## Enroll         0.08323063
## Top10perc     45.77913227
## Top25perc    -13.13933555
## F.Undergrad    0.02274101
## P.Undergrad    0.02964089
## Outstate      -0.04845178
## Room.Board     0.17834817
## Books          0.16979458
## Personal      -0.01898038
## PhD           -5.30560631
## Terminal      -4.85092509
## S.F.Ratio      .         
## perc.alumni   -7.57316944
## Expend         0.07559669
## Grad.Rate      9.69692642
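The question also asks for the number of non-zero coefficient estimates: in the output above, only S.F.Ratio is shrunk exactly to zero, leaving 16 non-zero predictor coefficients. A quick programmatic count (a small sketch using the objects already defined):

lasso.coef = predict(lasso.fit, type = "coefficients", s = bestlam.lasso)
sum(lasso.coef[-1, ] != 0) # non-zero coefficients, excluding the intercept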

(e)

Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(123)

pcr.fit = pcr(Apps ~., data = train, scale = T, validation = "CV")

validationplot(pcr.fit, val.type = "MSEP")

pcr.pred = predict(pcr.fit, test, ncomp = 17)

mean((pcr.pred - test$Apps)^2)
## [1] 1734841

The value of M selected through cross-validation is 17, i.e. all of the components, giving the lowest test MSE at 1,734,841 (about 1.73 × 10^6). With M = 17, PCR performs no dimension reduction and reproduces the least squares fit, which is why this model performs identically to the linear model in (b).
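As a convenience (my own sketch, not part of the original answer), the CV-minimizing M can also be read off programmatically from the pls package's RMSEP() summary instead of eyeballing the validation plot; the "- 1" accounts for the intercept-only model occupying the first slot:

which.min(RMSEP(pcr.fit)$val["adjCV", 1, ]) - 1 # number of components with the lowest CV error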

(f)

Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(123)

pls.fit = plsr(Apps ~., data = train, scale = T, validation = "CV")

validationplot(pls.fit, val.type = "MSEP")

pls.pred = predict(pls.fit, test, ncomp = 17)

mean((pls.pred - test$Apps)^2)
## [1] 1734841

In this PLS model, M = 17 again retains all of the components, so PLS also reproduces the least squares fit and the test MSE (1,734,841) is identical to, not lower than, the least squares model. A smaller M chosen from the validation plot would trade a small increase in bias for reduced variance.

(g)

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

avg.actual = mean(test$Apps)

lm.r2 = 1-mean((lm.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # least squares model
ridge.r2 = 1-mean((ridge.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # ridge model
lasso.r2 = 1-mean((lasso.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # lasso model
pcr.r2 = 1-mean((pcr.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # pcr model
pls.r2 = 1-mean((pls.pred - test$Apps)^2) / mean((avg.actual - test$Apps)^2) # pls model

To compare the models we can calculate R^2 as one minus the residual sum of squares divided by the total sum of squares. The lasso model explained the most variance, at around 92.43%. The ridge model explained the least, at about 91%. The least squares, PCR, and PLS models explained the same amount of variance, 92.40%, agreeing to roughly 15 decimal places, which is unsurprising since with all 17 components both PCR and PLS reproduce the least squares fit. We can predict the number of applications with fairly high accuracy under every model, and there is little difference among the five approaches; the lasso performs best and ridge worst.
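A compact alternative (my own sketch) that computes and prints all five test R^2 values with a single helper, using the prediction objects defined above:

test_r2 = function(pred, actual) 1 - sum((pred - actual)^2) / sum((actual - mean(actual))^2)
sapply(list(ols = lm.pred, ridge = ridge.pred, lasso = lasso.pred,
            pcr = pcr.pred, pls = pls.pred),
       test_r2, actual = test$Apps)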

Question 11

We will now try to predict per capita crime rate in the Boston data set.

(a)

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(MASS)      # Boston data
library(caret)     # train(), trainControl()
library(leaps)     # regsubsets()
library(tidyverse) # dplyr and ggplot2

b = Boston

Using caret to get a consistent model object and resampling scheme across the models below:

ctrl = trainControl(method = "repeatedcv", number = 10, repeats = 10) # 10-fold CV, repeated 10 times

Best Subset Model

model = regsubsets(crim ~.,
                   data = b,
                   nvmax = ncol(b) - 1,
                   method = "exhaustive")

# cross-validation to compare RMSE across subset sizes
# (note: the subsets were chosen on the full data set, so these CV estimates are mildly optimistic)
CV_MSE = c() # despite the name, this stores the CV RMSE reported by caret

set.seed(123)

for (i in 1:(ncol(b)-1)) {
  # refit the best i-variable subset with caret to get a repeated-CV RMSE estimate
  b_temp = b[ , c("crim", names(coef(model, id = i)[-1]))]
  model_temp = train(crim ~ .,
                     data = b_temp,
                     method = "lm",
                     trControl = ctrl)
  CV_MSE[i] = model_temp$results$RMSE
}

Plotting Best Subset

data.frame(CV_MSE = CV_MSE, subset_size = 1:(ncol(b)-1)) %>%
  mutate(min_CV_MSE = as.numeric(min(CV_MSE) == CV_MSE)) %>%
  
  ggplot(aes(x = subset_size, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(b)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Best Subset Selection", 
       subtitle = "Selecting parameter subset size with cross-validation",
       x = "Subset Size", 
       y = "CV RMSE")

Minimum CV RMSE:

min(CV_MSE)
## [1] 5.777091
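To read off the winning subset size directly (a one-line sketch; the plot above highlights the same point in red):

which.min(CV_MSE) # subset size with the lowest CV RMSE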

Lasso Model

set.seed(123)

model_lasso = train(crim ~.,
                    data = b,
                    method = "glmnet",
                    preProcess = c("center", "scale"),
                    metric = "RMSE",
                    maximize = F,
                    trControl = ctrl,
                    tuneGrid = expand.grid(alpha = 1,
                                           lambda = seq(0, 1, length = 100)))
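The shrinkage parameter chosen by cross-validation can be inspected directly (a one-line sketch):

model_lasso$bestTune # alpha and lambda selected by cross-validation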

Plotting Lasso Model

mlr = model_lasso$results %>% 
  rename(CV_MSE = RMSE) %>% 
  mutate(min_CV_MSE = as.numeric(lambda == model_lasso$bestTune$lambda)) %>% 
  as.data.frame()


ggplot(mlr, aes(x = lambda, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Lasso Regression", 
       subtitle = "Selecting shrinkage parameter with cross-validation",
       y = "CV MSE")

Minimum CV RMSE:

min(mlr$CV_MSE)
## [1] 5.831888

Ridge Model

set.seed(123)

model_ridge = train(crim ~ .,
                    data = b,
                    method = "glmnet",
                    preProcess = c("center", "scale"),
                    metric = "RMSE",
                    maximize = F, 
                    trControl = ctrl, # the argument must be spelled trControl; a misspelling is silently ignored via ...
                    tuneGrid = expand.grid(alpha = 0,
                                           lambda = seq(0, 1, length = 100)))

mrr = model_ridge$results %>% 
  rename(CV_MSE = RMSE) %>% 
  mutate(min_CV_MSE = as.numeric(lambda == model_ridge$bestTune$lambda)) %>% 
  as.data.frame()

Plotting Ridge Model

ggplot(mrr, aes(x = lambda, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Ridge Regression", 
       subtitle = "Selecting shrinkage parameter with cross-validation",
       y = "CV MSE")

Minimum CV RMSE:

min(mrr$CV_MSE)
## [1] 6.376497

Principal Components Regression Model

set.seed(123)
model_pcr = train(crim~.,
                  data = b, 
                  method = "pcr", 
                  preProcess = c("center", "scale"),
                  metric = "RMSE",
                  maximize = F,
                  trControl = ctrl,
                  tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))


mpr = model_pcr$results %>% 
  rename(CV_MSE = RMSE) %>% 
  mutate(min_CV_MSE = as.numeric(ncomp == model_pcr$bestTune$ncomp)) %>% 
  as.data.frame()

Plotting PCR Model

ggplot(mpr, aes(x = ncomp, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Principal Components Regression", 
       subtitle = "Selecting number of principal components with cross-validation",
       x = "Principal Components", 
       y = "CV MSE")

Minimum CV RMSE:

min(mpr$CV_MSE)
## [1] 5.869671
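The number of components behind this minimum can be reported directly (a one-line sketch):

model_pcr$bestTune # ncomp selected by cross-validation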

Partial Least Squares Model

set.seed(123)
model_pls = train(crim~.,
                  data = b, 
                  method = "pls", 
                  preProcess = c("center", "scale"),
                  metric = "RMSE",
                  maximize = F,
                  trControl = ctrl,
                  tuneGrid = expand.grid(ncomp = 1:(ncol(Boston)-1)))


mplr = model_pls$results %>% 
  rename(CV_MSE = RMSE) %>% 
  mutate(min_CV_MSE = as.numeric(ncomp == model_pls$bestTune$ncomp)) %>% 
  as.data.frame()

Plotting PLS Model

ggplot(mplr, aes(x = ncomp, y = CV_MSE)) + 
  geom_line(col = "grey55") + 
  geom_point(size = 2, aes(col = factor(min_CV_MSE))) + 
  scale_x_continuous(breaks = seq(1, ncol(Boston)-1), minor_breaks = NULL) +
  scale_color_manual(values = c("blue", "red")) + 
  theme(legend.position = "none") + 
  labs(title = "Boston Dataset - Partial Least Squares", 
       subtitle = "Selecting number of principal components with cross-validation",
       x = "Principal Components", 
       y = "CV MSE")

Minimum CV RMSE:

min(mplr$CV_MSE)
## [1] 5.869343
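And likewise for PLS (a one-line sketch):

model_pls$bestTune # ncomp selected by cross-validation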

(b)

Q: Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross- validation, or some other reasonable alternative, as opposed to using training error.

A:

Using cross-validated RMSE as the yardstick, the best-performing model is best subset selection, with a minimum CV RMSE of 5.777091. Because this metric estimates out-of-sample error, the model that minimizes it, here the 9-variable best subset (see part (c)), is the best choice among these five approaches; the side-by-side comparison below makes this concrete.
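Collecting the minimum CV RMSE of each approach (a small sketch using the objects computed above):

c(best_subset = min(CV_MSE),
  lasso = min(mlr$CV_MSE),
  ridge = min(mrr$CV_MSE),
  pcr = min(mpr$CV_MSE),
  pls = min(mplr$CV_MSE))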

(c)

Q: Does your chosen model involve all of the features in the data set? Why or why not?

A:
Our final model does not include all of the features. It uses only 9 of the 13 predictors, as shown below:

coef(model, id=9)
##   (Intercept)            zn         indus           nox           dis 
##  19.124636156   0.042788127  -0.099385948 -10.466490364  -1.002597606 
##           rad       ptratio         black         lstat          medv 
##   0.539503547  -0.270835584  -0.008003761   0.117805932  -0.180593877

Per the cross-validation curve above, including features beyond these nine increases the estimated test error, so the 9-variable model is preferred.