2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer. (a) The lasso, relative to least squares, is:

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The lasso improves prediction accuracy and reduces variance through the penalty term λ, which penalizes large coefficients and makes the model simpler. The penalty λ controls how much the coefficients are shrunk: the closer λ is to 0, the more flexible the model becomes and the more it acts like least squares regression. Choosing a good value of λ decreases variance while adding some bias; when the increase in bias is smaller than the decrease in variance, prediction accuracy improves.
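As a small illustration (a minimal sketch on simulated data, separate from the College data used below), we can watch the lasso coefficients shrink, and eventually hit exactly zero, as λ grows:

library(glmnet)
set.seed(1)
x_sim = matrix(rnorm(100 * 10), 100, 10)   # 10 predictors; only the first 3 matter
y_sim = 3 * x_sim[, 1] - 2 * x_sim[, 2] + x_sim[, 3] + rnorm(100)
lasso_sim = glmnet(x_sim, y_sim, alpha = 1)   # alpha = 1 fits the lasso
round(coef(lasso_sim, s = 0.01), 2)   # small lambda: close to least squares
round(coef(lasso_sim, s = 1), 2)      # large lambda: heavy shrinkage; many coefficients are exactly 0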

(b) Repeat (a) for ridge regression relative to least squares.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

We get the same answer as for the lasso, since the two methods are similar. Ridge keeps all of the predictors in the final model, although their coefficients may be shrunk close to 0, whereas the lasso can set a predictor's coefficient exactly to 0. Both use λ to shrink the coefficients, decreasing variance while increasing bias.
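To see the difference concretely, here is a sketch (reusing the simulated x_sim, y_sim, and lasso_sim from the sketch above) comparing ridge and lasso coefficients at the same λ:

ridge_sim = glmnet(x_sim, y_sim, alpha = 0)   # alpha = 0 fits ridge
# At the same lambda, ridge keeps all coefficients non-zero (just small),
# while the lasso sets several exactly to 0
cbind(ridge = as.numeric(coef(ridge_sim, s = 1)),
      lasso = as.numeric(coef(lasso_sim, s = 1)))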

(c) Repeat (a) for non-linear methods relative to least squares.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Non-linear methods are more flexible because they make fewer assumptions than linear / parametric models and can take on more complex shapes than a simpler model can. That extra flexibility increases variance under the bias-variance trade-off, so prediction accuracy improves only when the decrease in bias outweighs the increase in variance.
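As a quick sketch (simulated data again, not part of the exercises), a smoothing spline can capture curvature that a straight-line least squares fit cannot:

set.seed(1)
x_nl = seq(0, 2 * pi, length.out = 100)
y_nl = sin(x_nl) + rnorm(100, sd = 0.3)
lin_fit = lm(y_nl ~ x_nl)               # rigid linear fit: high bias on curved data
spline_fit = smooth.spline(x_nl, y_nl)  # flexible non-linear fit: lower bias, more variance
mean(residuals(lin_fit)^2)                    # larger training MSE
mean((y_nl - predict(spline_fit, x_nl)$y)^2)  # much smaller training MSE

Whether the spline also wins on held-out data depends on exactly the bias-variance trade-off described above.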

9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

college = ISLR2::College
(a) Split the data set into a training set and a test set.
set.seed(1)
index = sample(1:nrow(college), 0.8*nrow(college))
college_train = college[index,]
college_test = college[-index,]

We split the data 80/20 into training and test sets.

(b) Fit a linear model using least squares on the training set, and report the test error obtained.
college_lm = lm(Apps ~ ., data = college_train)
summary(college_lm)
## 
## Call:
## lm(formula = Apps ~ ., data = college_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5555.2  -404.6    19.9   310.3  7577.7 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -630.58238  435.56266  -1.448 0.148209    
## PrivateYes  -388.97393  148.87623  -2.613 0.009206 ** 
## Accept         1.69123    0.04433  38.153  < 2e-16 ***
## Enroll        -1.21543    0.20873  -5.823 9.41e-09 ***
## Top10perc     50.45622    5.88174   8.578  < 2e-16 ***
## Top25perc    -13.62655    4.67321  -2.916 0.003679 ** 
## F.Undergrad    0.08271    0.03632   2.277 0.023111 *  
## P.Undergrad    0.06555    0.03367   1.947 0.052008 .  
## Outstate      -0.07562    0.01987  -3.805 0.000156 ***
## Room.Board     0.14161    0.05130   2.760 0.005947 ** 
## Books          0.21161    0.25184   0.840 0.401102    
## Personal       0.01873    0.06604   0.284 0.776803    
## PhD           -9.72551    4.91228  -1.980 0.048176 *  
## Terminal      -0.48690    5.43302  -0.090 0.928620    
## S.F.Ratio     18.26146   13.83984   1.319 0.187508    
## perc.alumni    1.39008    4.39572   0.316 0.751934    
## Expend         0.05764    0.01254   4.595 5.26e-06 ***
## Grad.Rate      5.89480    3.11185   1.894 0.058662 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 993.8 on 603 degrees of freedom
## Multiple R-squared:  0.9347, Adjusted R-squared:  0.9328 
## F-statistic: 507.5 on 17 and 603 DF,  p-value: < 2.2e-16

The overall model is significant at an alpha level of 0.05 (F-test p-value < 2.2e-16), although several individual terms are not significant.

lm_pred = predict(college_lm,college_test)
lm_mse = mean((lm_pred - college_test$Apps)^2)
lm_mse
## [1] 1567324

We get a test mean squared error (MSE) of 1567324 with this linear model.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.3.3
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
## Loaded glmnet 4.1-8
set.seed(1)
x_train = model.matrix(Apps~.,college_train)[,-1]
y_train = college_train$Apps
x_test = model.matrix(Apps~.,college_test)[,-1]
y_test = college_test$Apps
grid <- 10^seq(10, -2, length = 100)
set.seed(1)
college_ridge_cv = cv.glmnet(x_train,y_train,alpha=0, lambda = grid,
                             standardize = T, nfolds = 10)
best_lambda = college_ridge_cv$lambda.min
print(best_lambda)
## [1] 0.01
ridge.mod = glmnet(x_train, y_train, alpha = 0, lambda = grid, thresh = 1e-12)
ridge.pred = predict(ridge.mod, s=best_lambda, newx=x_test)
ridge_mse = mean((ridge.pred-y_test)^2)
print(ridge_mse)
## [1] 1567278

We get a test MSE of 1567278, slightly lower than the linear model's. Note that the selected λ (0.01) is the smallest value in the grid, so cross-validation favors almost no shrinkage here.
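Because lambda.min landed on the grid boundary, the CV curve may still be decreasing there. A quick check (a sketch reusing x_train and y_train from above) is to extend the grid lower and re-run cross-validation:

set.seed(1)
grid_fine = 10^seq(2, -4, length = 100)   # extends two decades below the old minimum of 0.01
ridge_cv_fine = cv.glmnet(x_train, y_train, alpha = 0, lambda = grid_fine, nfolds = 10)
ridge_cv_fine$lambda.min   # if this again hits the boundary, near-zero shrinkage really is preferred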

(d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(1)
college_lasso_cv = cv.glmnet(x_train,y_train,alpha=1, lambda = grid,
                             standardize = T, nfolds = 10)
best_lambda = college_lasso_cv$lambda.min
print(best_lambda)
## [1] 0.01
plot(college_lasso_cv)

lasso.mod = glmnet(x_train, y_train, alpha = 1, lambda = grid)
lasso.pred = predict(lasso.mod, s=best_lambda, newx=x_test)
lasso_mse = mean((lasso.pred-y_test)^2)
print(lasso_mse)
## [1] 1565789

This is roughly 1500 lower than the ridge and linear-model MSEs.

lasso.coef = predict(lasso.mod, type = 'coefficients',s=best_lambda)[1:18,]
length(lasso.coef[lasso.coef != 0])
## [1] 18
lasso.coef[lasso.coef != 0]
##   (Intercept)    PrivateYes        Accept        Enroll     Top10perc 
## -632.95602448 -388.88540852    1.68973478   -1.20585828   50.38010502 
##     Top25perc   F.Undergrad   P.Undergrad      Outstate    Room.Board 
##  -13.57568416    0.08163106    0.06550365   -0.07553507    0.14188193 
##         Books      Personal           PhD      Terminal     S.F.Ratio 
##    0.21093272    0.01873856   -9.74128656   -0.46973434   18.25011259 
##   perc.alumni        Expend     Grad.Rate 
##    1.36217567    0.05765739    5.89241221

All 18 coefficients, including the intercept, are non-zero, so the lasso kept all 17 predictors, which is consistent with the very small λ chosen.
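Because the chosen λ is so small, the lasso behaves almost like least squares here and drops nothing. As a sketch of a more conservative choice (using the cv.glmnet fit from above), the one-standard-error rule typically yields a sparser model:

lambda_1se = college_lasso_cv$lambda.1se
lasso_1se_coef = predict(lasso.mod, type = 'coefficients', s = lambda_1se)
sum(lasso_1se_coef != 0)   # count of non-zero coefficients (including the intercept)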

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
paste('Linear Model Test MSE: ', round(lm_mse))
## [1] "Linear Model Test MSE:  1567324"
paste('Ridge Model Test MSE: ', round(ridge_mse))
## [1] "Ridge Model Test MSE:  1567278"
paste('Lasso Model Test MSE: ', round(lasso_mse))
## [1] "Lasso Model Test MSE:  1565789"

The test MSEs are all rather close to each other, but the lasso had the lowest, roughly 1500 below the ridge and linear models. We can also calculate R² on the test set to see how much variance each model explains.

SStotal = sum((college_test$Apps - mean(college_test$Apps))^2)
lm_r2 = 1 - sum((college_test$Apps - lm_pred)^2) / SStotal
ridge_r2 = 1 - sum((college_test$Apps - ridge.pred)^2) / SStotal
lasso_r2 = 1 - sum((college_test$Apps - lasso.pred)^2) / SStotal
paste('Linear Model R2: ', round(lm_r2,3))
## [1] "Linear Model R2:  0.901"
paste('Ridge Model R2: ', round(ridge_r2,3))
## [1] "Ridge Model R2:  0.901"
paste('Lasso Model R2: ', round(lasso_r2,3))
## [1] "Lasso Model R2:  0.901"

All three models explain about 90% of the variance in the test data, and the lasso had the lowest test MSE, so it should be slightly better at making predictions than the linear and ridge models, though the differences are minor.

11 (Optional)

We will now try to predict per capita crime rate in the Boston data set. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

boston = ISLR2::Boston
library(leaps)
## Warning: package 'leaps' was built under R version 4.3.2
boston_regfit.full = regsubsets(crim ~ ., data=boston)
summary(boston_regfit.full)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = boston)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 ) " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 ) " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 ) " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*" 
## 4  ( 1 ) "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 ) "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 ) "*" " "   " "  "*" " " " " "*" "*" " " "*"     " "   "*" 
## 7  ( 1 ) "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 8  ( 1 ) "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"

I just wanted to see the best subset selection output. The smallest subsets pick up rad first, then lstat and medv, suggesting those predictors carry most of the signal.
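To go a bit further (a sketch using in-sample criteria, not the validation-based evaluation that part (b) asks for), the regsubsets summary statistics suggest a subset size:

reg_summary = summary(boston_regfit.full)
which.max(reg_summary$adjr2)   # subset size favored by adjusted R^2
which.min(reg_summary$cp)      # subset size favored by Mallows' Cp
which.min(reg_summary$bic)     # subset size favored by BIC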

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
(c) Does your chosen model involve all of the features in the data set? Why or why not?