Question 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

(a) The lasso, relative to least squares, is:

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Statement iii. is correct. The lasso is less flexible than least squares because its penalty shrinks the coefficient estimates and can set some of them exactly to zero, performing variable selection. This loss of flexibility increases bias, but as long as that increase is smaller than the resulting decrease in variance, prediction accuracy improves.

(b) Repeat (a) for ridge regression relative to least squares.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Statement iii. is again correct for ridge regression relative to least squares. Like the lasso, ridge regression shrinks the coefficient estimates toward zero; the difference is that it never sets any of them exactly to zero. The same trade-off applies: the shrinkage increases bias but decreases variance, and prediction accuracy improves when the increase in bias is smaller than the decrease in variance.
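
As a quick illustration of this difference (not part of the original question), the sketch below fits both penalties to simulated data with glmnet; the object names (x_sim, y_sim, lasso_fit, ridge_fit) are made up for the demo. The lasso fit typically has several coefficients that are exactly zero, while the ridge fit keeps every predictor.

library(glmnet)
set.seed(1)
x_sim <- matrix(rnorm(100 * 10), 100, 10)                    # 10 predictors, only 3 matter
y_sim <- x_sim[, 1] - 2 * x_sim[, 2] + 0.5 * x_sim[, 3] + rnorm(100)
lasso_fit <- glmnet(x_sim, y_sim, alpha = 1, lambda = 0.5)   # L1 penalty
ridge_fit <- glmnet(x_sim, y_sim, alpha = 0, lambda = 0.5)   # L2 penalty
sum(coef(lasso_fit)[-1] == 0)   # several coefficients are exactly zero
sum(coef(ridge_fit)[-1] == 0)   # ridge shrinks coefficients but keeps all of them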

(c) Repeat (a) for non-linear methods relative to least squares.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Statement ii. is correct for non-linear methods relative to least squares. Non-linear methods such as decision trees and random forests are more flexible and can capture relationships beyond a linear form, so they tend to have lower bias but higher variance. They improve prediction accuracy when the increase in variance is smaller than the decrease in bias.

Question 9

(a) Split the data set into a training set and a test set.

college <- read.csv('College.csv')
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(42)

train_index <- createDataPartition(college$Apps, p = 0.8, list = FALSE)
train_data <- college[train_index, ]
test_data <- college[-train_index, ]
train_data <- subset(train_data, select = -X)
test_data <- subset(test_data, select = -X)

(b) Fit a linear model using least squares on the training set, and report the test error obtained.

linear_model <- lm(Apps ~ ., data = train_data)
summary(linear_model)
## 
## Call:
## lm(formula = Apps ~ ., data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3881.0  -437.1   -19.5   316.7  6967.8 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -561.34089  461.02508  -1.218  0.22385    
## PrivateYes  -409.16975  158.55701  -2.581  0.01010 *  
## Accept         1.62834    0.04476  36.382  < 2e-16 ***
## Enroll        -0.85903    0.20563  -4.177 3.38e-05 ***
## Top10perc     51.31276    6.42525   7.986 7.03e-15 ***
## Top25perc    -14.67648    5.16650  -2.841  0.00465 ** 
## F.Undergrad    0.05127    0.03625   1.415  0.15769    
## P.Undergrad    0.04916    0.03535   1.391  0.16483    
## Outstate      -0.09202    0.02262  -4.067 5.38e-05 ***
## Room.Board     0.13611    0.05628   2.418  0.01588 *  
## Books          0.10960    0.27217   0.403  0.68732    
## Personal       0.01803    0.07008   0.257  0.79711    
## PhD           -8.96660    5.28199  -1.698  0.09010 .  
## Terminal      -4.40975    5.78320  -0.763  0.44605    
## S.F.Ratio     21.41479   14.51439   1.475  0.14062    
## perc.alumni    3.87434    4.69013   0.826  0.40909    
## Expend         0.08916    0.01484   6.009 3.22e-09 ***
## Grad.Rate      7.23133    3.32272   2.176  0.02992 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1075 on 606 degrees of freedom
## Multiple R-squared:  0.9268, Adjusted R-squared:  0.9247 
## F-statistic: 451.1 on 17 and 606 DF,  p-value: < 2.2e-16
predictions <- predict(linear_model, newdata = test_data)  # Get predictions
actuals <- test_data$Apps  # Actual values
rmse <- sqrt(mean((actuals - predictions)^2))  # Compute RMSE
print(rmse)
## [1] 928.8176

The test RMSE from the least squares linear regression model is 928.8176.

(c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
x_train <- model.matrix(Apps ~ ., train_data)[, -1] 
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., test_data)[, -1] 
y_test <- test_data$Apps
set.seed(1)
#grid <- 10^seq(10, -2, length = 100)
cv.out <- cv.glmnet(x_train, y_train, alpha = 0)
plot(cv.out)

bestlam <- cv.out$lambda.min
bestlam
## [1] 367.6145
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = bestlam)
ridge_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 0, lambda = bestlam) 
## 
##   Df  %Dev Lambda
## 1 17 90.66  367.6
ridge.pred <- predict(ridge_model, s = bestlam, newx = x_test)
sqrt(mean((ridge.pred - y_test)^2))
## [1] 879.621

The ridge regression model gives a test RMSE of 879.621, an improvement over the least squares model from part (b).
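
One judgment call here is using lambda.min. A possible variation (a sketch, not required by the question; the names lam_1se and ridge.pred.1se are introduced just for this demo) is glmnet's lambda.1se, the largest λ whose cross-validation error is within one standard error of the minimum; it gives a more heavily shrunk model, and its test RMSE will generally differ from the value above.

lam_1se <- cv.out$lambda.1se                      # more conservative choice than lambda.min
ridge.pred.1se <- predict(cv.out, s = lam_1se, newx = x_test)
sqrt(mean((ridge.pred.1se - y_test)^2))           # test RMSE under the stronger penalty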

(d) Fit a lasso model on the training set, with λ chosen by cross- validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

set.seed(1)

cv.out <- cv.glmnet(x_train, y_train, alpha = 1)
plot(cv.out)

bestlam <- cv.out$lambda.min
bestlam
## [1] 13.84045
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = bestlam)
lasso_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 1, lambda = bestlam) 
## 
##   Df  %Dev Lambda
## 1 14 92.55  13.84
lasso.pred <- predict(lasso_model, s = bestlam, newx = x_test)
sqrt(mean((lasso.pred - y_test)^2))
## [1] 922.3743

The lasso model gives a test RMSE of 922.3743, with 14 non-zero coefficient estimates (the Df column in the glmnet output). This is worse than the ridge model but still slightly better than the least squares model.
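
To list the non-zero coefficients explicitly rather than reading the Df column, the coefficient vector can be extracted at the chosen λ (a small sketch using the fitted objects above; lasso.coef is a name introduced here):

lasso.coef <- predict(lasso_model, type = "coefficients", s = bestlam)
lasso.coef                      # coefficients at the chosen lambda; zeros are dropped predictors
sum(lasso.coef[-1] != 0)        # count of non-zero coefficients, excluding the intercept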

(e) Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)
## 
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
## 
##     R2
## The following object is masked from 'package:stats':
## 
##     loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3922     3896     2109     2112     1992     1715     1692
## adjCV         3922     3895     2106     2110     1946     1704     1686
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1682     1651     1631      1633      1639      1635      1644
## adjCV     1682     1642     1625      1627      1633      1629      1638
##        14 comps  15 comps  16 comps  17 comps
## CV         1644      1494      1246      1206
## adjCV      1638      1471      1236      1197
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X      32.199    57.51    64.56    70.11    75.57    80.57    84.18    87.63
## Apps    2.383    72.02    72.04    79.00    82.65    83.03    83.22    84.07
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       90.55     92.94     95.07     96.79     97.85     98.69     99.36
## Apps    84.56     84.82     84.83     84.95     84.97     84.99     90.89
##       16 comps  17 comps
## X        99.82    100.00
## Apps     92.31     92.68

Looking at the cross-validated PCR results, the percentage of variance explained in both the predictors and the response exceeds 80% once 6 components are included. We will use M = 6: using all 17 components would simply reproduce the least squares fit and defeat the purpose of dimension reduction, while 6 components already capture most of the structure.
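
An alternative to the 80%-variance rule of thumb (a sketch, not what was done above; cv_rmsep is a name introduced for the demo) is to read the choice of M off the cross-validation curve directly, either by taking the minimum CV error or by using pls::selectNcomp()'s one-standard-error rule:

cv_rmsep <- RMSEP(pcr.fit, estimate = "CV")
which.min(cv_rmsep$val) - 1                 # M with the lowest CV RMSEP (minus 1 for the intercept-only entry)
selectNcomp(pcr.fit, method = "onesigma")   # smallest M within one standard error of that minimum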

validationplot(pcr.fit, val.type = "MSEP")

pcr_model <- pcr(Apps ~ ., data = train_data, scale = TRUE, ncomp = 6)
summary(pcr_model)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 6
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## X      32.199    57.51    64.56    70.11    75.57    80.57
## Apps    2.383    72.02    72.04    79.00    82.65    83.03
pcr.pred <- predict(pcr.fit, test_data, ncomp = 6) 
sqrt(mean((pcr.pred - test_data$Apps)^2))
## [1] 1161.317

Setting the number of components to 6, as chosen in the previous step, and predicting on the test set gives an RMSE of 1161.317, which is noticeably worse than the previous models.

(f) Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

set.seed(1)
pls.fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 624 17 
##  Y dimension: 624 1
## Fit method: kernelpls
## Number of components considered: 17
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            3922     1941     1673     1543     1462     1283     1234
## adjCV         3922     1936     1672     1535     1441     1266     1224
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV        1218     1214     1211      1209      1209      1207      1207
## adjCV     1209     1205     1202      1201      1200      1199      1198
##        14 comps  15 comps  16 comps  17 comps
## CV         1206      1206      1206      1206
## adjCV      1198      1197      1197      1197
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       25.52    42.51    62.72    64.87    67.51    73.11    76.72    80.11
## Apps    76.94    83.71    87.00    90.75    92.28    92.44    92.52    92.57
##       9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
## X       82.40     85.13     88.99     90.91     92.68     95.49     97.00
## Apps    92.63     92.65     92.66     92.67     92.68     92.68     92.68
##       16 comps  17 comps
## X        98.87    100.00
## Apps     92.68     92.68

As with PCR, we first fit a PLS model with cross-validation to see how many components to keep. At 8 components, at least 80% of the variance is explained in both the predictors and the response, so we will use M = 8 for the test-set predictions.

validationplot(pls.fit, val.type = "MSEP")

pls.pred <- predict(pls.fit, test_data, ncomp = 8) 
sqrt(mean((pls.pred - test_data$Apps)^2))
## [1] 924.1843

Using the PLS fit with ncomp set to 8, predictions on the test set give an RMSE of 924.1843, which is much better than the PCR model.

(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

The test RMSEs of all the models are as follows:

Linear Model (Least Square): 928.8176

Ridge Regression: 879.621

Lasso: 922.3743

PCR: 1161.317

PLS: 924.1843

From these results it can be seen that the ridge regression model performed best, while the PCR model performed worst by a clear margin. The least squares, lasso, and PLS models give very similar test errors (all around 920 to 930), with ridge about 5% better, so apart from PCR there is not much difference among the five approaches, and an RMSE on the order of 900 applications is reasonably accurate relative to the scale of Apps.
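
To put that accuracy claim on firmer footing (a sketch, using the test split from part (a)), the best RMSE can be compared to the scale of the response:

mean(test_data$Apps)               # average number of applications in the test set
sd(test_data$Apps)                 # spread of the response
879.621 / mean(test_data$Apps)     # ridge test RMSE as a fraction of the mean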

Question 11

(a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

library(ISLR2)
library(leaps)
Boston = Boston
dim(Boston)
## [1] 506  13
set.seed(42)
train_index <- createDataPartition(Boston$crim, p = 0.7, list = FALSE)
boston_train <- Boston[train_index, ]
boston_test <- Boston[-train_index, ]

Best Subset Selection

regfit.full <- regsubsets(crim ~ ., data = boston_train, nvmax = 12)
reg.summary = summary(regfit.full)
reg.summary
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = boston_train, nvmax = 12)
## 12 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*" 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 5  ( 1 )  "*" "*"   " "  " " " " " " "*" "*" " " " "     " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*" 
## 8  ( 1 )  "*" " "   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 9  ( 1 )  "*" " "   "*"  "*" "*" " " "*" "*" " " "*"     "*"   "*" 
## 10  ( 1 ) "*" " "   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*" 
## 11  ( 1 ) "*" " "   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"
reg.summary$adjr2
##  [1] 0.3834393 0.4034368 0.4048850 0.4099491 0.4119745 0.4140309 0.4147092
##  [8] 0.4156044 0.4152502 0.4149416 0.4137607 0.4122893
which.max(reg.summary$adjr2)
## [1] 8
plot(reg.summary$adjr2, xlab = "Number of Variables",
ylab = "Adjusted RSq", type = "l")
points(8, reg.summary$adjr2[8], col = "red", cex = 2, pch = 20)

coef(regfit.full, 8)
##  (Intercept)           zn          nox           rm          dis          rad 
##  10.64234133   0.03865907 -11.15120476   0.84593198  -0.78348686   0.55150069 
##      ptratio        lstat         medv 
##  -0.32765249   0.12170597  -0.19666350
plot(regfit.full, scale = "adjr2") 

I ran regsubsets() on the full set of predictors and examined the adjusted R-squared values. With 12 predictors there are 12 candidate model sizes, and the adjusted R-squared is maximized when 8 variables are included. The plot above and the coef() output show that the 8 selected variables are zn, nox, rm, dis, rad, ptratio, lstat, and medv.
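
Adjusted R-squared is computed on the training data. As a cross-check (a sketch, not part of the original workflow), the test RMSE can be computed for every subset size; regsubsets() has no predict() method, so predict_regsubsets() below is a helper defined only for this illustration, and val_errors is likewise a name introduced here.

predict_regsubsets <- function(object, newdata, id) {
  form <- as.formula(object$call[[2]])        # recover the formula used in regsubsets()
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  mat[, names(coefi)] %*% coefi               # predictions for the model of size id
}
val_errors <- sapply(1:12, function(i) {
  pred <- predict_regsubsets(regfit.full, boston_test, id = i)
  sqrt(mean((boston_test$crim - pred)^2))
})
which.min(val_errors)                          # subset size with the lowest test RMSE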

linear_model <- lm(crim ~ zn  + nox + rm + dis + rad + ptratio + lstat + medv, data = boston_train)
summary(linear_model)
## 
## Call:
## lm(formula = crim ~ zn + nox + rm + dis + rad + ptratio + lstat + 
##     medv, data = boston_train)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.210 -1.909 -0.262  0.891 74.365 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.64234    8.18336   1.300  0.19430    
## zn            0.03866    0.02160   1.789  0.07442 .  
## nox         -11.15120    5.69954  -1.957  0.05121 .  
## rm            0.84593    0.68322   1.238  0.21649    
## dis          -0.78349    0.31377  -2.497  0.01299 *  
## rad           0.55150    0.05721   9.640  < 2e-16 ***
## ptratio      -0.32765    0.21984  -1.490  0.13703    
## lstat         0.12171    0.08466   1.438  0.15144    
## medv         -0.19666    0.06737  -2.919  0.00374 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.446 on 347 degrees of freedom
## Multiple R-squared:  0.4288, Adjusted R-squared:  0.4156 
## F-statistic: 32.56 on 8 and 347 DF,  p-value: < 2.2e-16
predictions <- predict(linear_model, newdata = boston_test)  # Get predictions
actuals <- boston_test$crim  # Actual values
rmse <- sqrt(mean((actuals - predictions)^2))  # Compute RMSE
print(rmse)
## [1] 6.493153

After refitting a linear regression model using the variables chosen from the regsubsets() function, I used the model to make predictions on a test set and got an RMSE of 6.493153.

Ridge Model

x_train <- model.matrix(crim ~ ., boston_train)[, -1] 
y_train <- boston_train$crim
x_test <- model.matrix(crim ~ ., boston_test)[, -1] 
y_test <- boston_test$crim
set.seed(1)
cv.out <- cv.glmnet(x_train, y_train, alpha = 0)
plot(cv.out)

bestlam <- cv.out$lambda.min
bestlam
## [1] 0.5225794
ridge_model <- glmnet(x_train, y_train, alpha = 0, lambda = bestlam)
ridge_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 0, lambda = bestlam) 
## 
##   Df  %Dev Lambda
## 1 12 42.63 0.5226
ridge.pred <- predict(ridge_model, s = bestlam, newx = x_test)
sqrt(mean((ridge.pred - y_test)^2))
## [1] 6.553395

I fit a ridge regression model on the training split with λ chosen by cross-validation, then made predictions on the test split and obtained an RMSE of 6.553395.

Lasso Model

set.seed(1)

cv.out <- cv.glmnet(x_train, y_train, alpha = 1)
plot(cv.out)

bestlam <- cv.out$lambda.min
bestlam
## [1] 0.03132781
lasso_model <- glmnet(x_train, y_train, alpha = 1, lambda = bestlam)
lasso_model
## 
## Call:  glmnet(x = x_train, y = y_train, alpha = 1, lambda = bestlam) 
## 
##   Df  %Dev  Lambda
## 1 12 43.13 0.03133
lasso.pred <- predict(lasso_model, s = bestlam, newx = x_test)
sqrt(mean((lasso.pred - y_test)^2))
## [1] 6.512083

I fit a lasso model on the training split with λ chosen by cross-validation, then made predictions on the test split and obtained an RMSE of 6.512083. The glmnet output shows Df = 12, so at this small λ the lasso retained all twelve predictors.

PCR Model

set.seed(1)
pcr.fit <- pcr(crim ~ ., data = boston_train, scale = TRUE, validation = "CV")
summary(pcr.fit)
## Data:    X dimension: 356 12 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.444    7.220    7.230    6.800    6.807    6.765    6.753
## adjCV        8.444    7.218    7.227    6.792    6.806    6.761    6.748
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.632    6.665    6.653     6.667     6.641     6.548
## adjCV    6.625    6.658    6.645     6.659     6.631     6.537
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.52    63.44    72.52    79.80    86.34    89.86    92.69    94.83
## crim    27.33    27.36    36.31    36.32    37.18    37.63    40.07    40.21
##       9 comps  10 comps  11 comps  12 comps
## X       96.71     98.27     99.42    100.00
## crim    40.49     40.49     41.60     43.22
validationplot(pcr.fit, val.type = "MSEP")

As in problem 9, I began by fitting a PCR model with cross-validation to see how many components to keep. Looking at the summary, the cross-validation error has a first local minimum at 3 components (CV = 6.800), which led me to choose 3 components for the final model.

pcr_model <- pcr(crim ~ ., data = boston_train, scale = TRUE, ncomp = 3)
summary(pcr_model)
## Data:    X dimension: 356 12 
##  Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 3
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps
## X       49.52    63.44    72.52
## crim    27.33    27.36    36.31
pcr.pred <- predict(pcr.fit, boston_test, ncomp = 3) 
sqrt(mean((pcr.pred - boston_test$crim)^2))
## [1] 6.976105

Setting the number of components to 3, as chosen in the previous step, and predicting on the test set gives an RMSE of 6.976105, which is again noticeably worse than the previous models.

PLS Model

set.seed(1)
pls.fit <- plsr(crim ~ ., data = boston_train, scale = TRUE, validation = "CV")
summary(pls.fit)
## Data:    X dimension: 356 12 
##  Y dimension: 356 1
## Fit method: kernelpls
## Number of components considered: 12
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           8.444    7.050    6.665    6.578    6.606    6.580    6.571
## adjCV        8.444    7.048    6.659    6.569    6.596    6.569    6.559
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps
## CV       6.551    6.546    6.546     6.547     6.548     6.548
## adjCV    6.540    6.535    6.535     6.537     6.537     6.537
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       49.04    58.22    62.61    73.39    79.60    82.91    85.46    89.73
## crim    31.09    39.69    41.87    42.31    42.74    43.10    43.18    43.21
##       9 comps  10 comps  11 comps  12 comps
## X       94.34     96.37     98.42    100.00
## crim    43.21     43.22     43.22     43.22
validationplot(pls.fit, val.type = "MSEP")

Looking at the summary of the cross-validated PLS model, the cross-validation error again shows a local minimum at 3 components (CV = 6.578), so I again chose 3 components for the final predictions.

pls.pred <- predict(pls.fit, boston_test, ncomp = 3) 
sqrt(mean((pls.pred - boston_test$crim)^2))
## [1] 6.600582

Using the PLS fit with only 3 components, I ran predictions on the test set and got an RMSE of 6.600582.

(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross- validation, or some other reasonable alternative, as opposed to using training error.

The test RMSEs of all the models are as follows:

Linear Model (regsubsets()): 6.493153

Ridge Regression: 6.553395

Lasso: 6.512083

PCR: 6.976105

PLS: 6.600582

This time the linear regression model built on the variables chosen through regsubsets() gave the best predictions on the test data, with the lasso and ridge models close behind. As in question 9, the PCR model performed noticeably worse than the other approaches.

(c) Does your chosen model involve all of the features in the data set? Why or why not?

The chosen linear regression model does not involve all of the features in the data set because, before fitting it, I ran the data through the regsubsets() function. Using adjusted R-squared as the criterion, the best combination of predictors was zn, nox, rm, dis, rad, ptratio, lstat, and medv, so the remaining four predictors were dropped.
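
For comparison (a small sketch), the lasso fit from part (a) can be inspected the same way; its Df of 12 at the chosen λ indicates it retained every predictor, so the feature reduction in the chosen model comes from best subset selection rather than from shrinkage:

coef(lasso_model)   # lasso coefficients at bestlam; none are zero here, so all 12 features remain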