Assignment 5

Author

Jonathan McCanlas

Problem 2

Part A

The lasso, relative to least squares, is:

  1. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Part B

Repeat (a) for ridge regression relative to least squares:

  1. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Part C

Repeat (a) for non-linear methods relative to least squares:

  1. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
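
For reference, lasso and ridge both solve a penalized version of least squares; it is the penalty term that constrains the coefficients and makes these fits less flexible than ordinary least squares. In the standard notation:

\[
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta}\ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \lvert\beta_j\rvert,
\qquad
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\ \mathrm{RSS}(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2}.
\]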

Problem 9

Part A

library(ISLR2)
data(College)
str(College)
'data.frame':   777 obs. of  18 variables:
 $ Private    : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 2 2 2 2 2 ...
 $ Apps       : num  1660 2186 1428 417 193 ...
 $ Accept     : num  1232 1924 1097 349 146 ...
 $ Enroll     : num  721 512 336 137 55 158 103 489 227 172 ...
 $ Top10perc  : num  23 16 22 60 16 38 17 37 30 21 ...
 $ Top25perc  : num  52 29 50 89 44 62 45 68 63 44 ...
 $ F.Undergrad: num  2885 2683 1036 510 249 ...
 $ P.Undergrad: num  537 1227 99 63 869 ...
 $ Outstate   : num  7440 12280 11250 12960 7560 ...
 $ Room.Board : num  3300 6450 3750 5450 4120 ...
 $ Books      : num  450 750 400 450 800 500 500 450 300 660 ...
 $ Personal   : num  2200 1500 1165 875 1500 ...
 $ PhD        : num  70 29 53 92 76 67 90 89 79 40 ...
 $ Terminal   : num  78 30 66 97 72 73 93 100 84 41 ...
 $ S.F.Ratio  : num  18.1 12.2 12.9 7.7 11.9 9.4 11.5 13.7 11.3 11.5 ...
 $ perc.alumni: num  12 16 30 37 2 11 26 37 23 15 ...
 $ Expend     : num  7041 10527 8735 19016 10922 ...
 $ Grad.Rate  : num  60 56 54 59 15 55 63 73 80 52 ...
set.seed(1)  # For reproducibility
train_index <- sample(1:nrow(College), nrow(College) * 0.7)  # 70% training
train <- College[train_index, ]
test <- College[-train_index, ]

nrow(train)
[1] 543
nrow(test)
[1] 234

Part B

lm.fit <- lm(Apps ~ ., data = train)
summary(lm.fit)

Call:
lm(formula = Apps ~ ., data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-5816.1  -451.6    -1.0   327.2  7445.9 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -5.377e+02  5.076e+02  -1.059  0.28995    
PrivateYes  -5.045e+02  1.648e+02  -3.061  0.00232 ** 
Accept       1.722e+00  4.763e-02  36.159  < 2e-16 ***
Enroll      -1.055e+00  2.437e-01  -4.329 1.80e-05 ***
Top10perc    5.358e+01  6.440e+00   8.320 7.64e-16 ***
Top25perc   -1.614e+01  5.092e+00  -3.170  0.00161 ** 
F.Undergrad  2.970e-02  4.399e-02   0.675  0.49976    
P.Undergrad  7.162e-02  3.649e-02   1.963  0.05019 .  
Outstate    -8.841e-02  2.176e-02  -4.064 5.57e-05 ***
Room.Board   1.630e-01  5.577e-02   2.923  0.00362 ** 
Books        2.727e-01  2.723e-01   1.001  0.31715    
Personal    -7.316e-03  7.283e-02  -0.100  0.92002    
PhD         -9.676e+00  5.360e+00  -1.805  0.07161 .  
Terminal    -3.781e-01  6.015e+00  -0.063  0.94990    
S.F.Ratio    1.627e+01  1.608e+01   1.012  0.31214    
perc.alumni  2.358e+00  4.853e+00   0.486  0.62722    
Expend       5.986e-02  1.337e-02   4.476 9.34e-06 ***
Grad.Rate    7.158e+00  3.520e+00   2.034  0.04248 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1022 on 525 degrees of freedom
Multiple R-squared:  0.9332,    Adjusted R-squared:  0.9311 
F-statistic: 431.6 on 17 and 525 DF,  p-value: < 2.2e-16
preds <- predict(lm.fit, newdata = test)

# Calculate MSE
mse <- mean((test$Apps - preds)^2)
mse
[1] 1261630
rmse <- sqrt(mse)
rmse
[1] 1123.223

Part C

library(glmnet)
Loading required package: Matrix
Loaded glmnet 4.1-10
# Create matrix of predictors and response
x_train <- model.matrix(Apps ~ ., data = train)[, -1]  # Remove intercept column
y_train <- train$Apps

x_test <- model.matrix(Apps ~ ., data = test)[, -1]
y_test <- test$Apps

set.seed(1)
cv.ridge <- cv.glmnet(x_train, y_train, alpha = 0)
plot(cv.ridge)  # Optional: visualize cross-validation curve

best_lambda <- cv.ridge$lambda.min
best_lambda
[1] 367.5286
ridge.pred <- predict(cv.ridge, s = best_lambda, newx = x_test)

ridge.mse <- mean((y_test - ridge.pred)^2)
ridge.rmse <- sqrt(ridge.mse)

ridge.mse
[1] 1121034
ridge.rmse
[1] 1058.789

Part D

set.seed(1)
cv.lasso <- cv.glmnet(x_train, y_train, alpha = 1)
plot(cv.lasso)  # Optional: visualize CV error vs. log(lambda)

best_lambda_lasso <- cv.lasso$lambda.min
best_lambda_lasso
[1] 8.690175
lasso.pred <- predict(cv.lasso, s = best_lambda_lasso, newx = x_test)

lasso.mse <- mean((y_test - lasso.pred)^2)
lasso.rmse <- sqrt(lasso.mse)

lasso.mse
[1] 1233246
lasso.rmse
[1] 1110.516
lasso.coef <- predict(cv.lasso, s = best_lambda_lasso, type = "coefficients")
nonzero_count <- sum(lasso.coef != 0) - 1  # Subtract 1 to exclude the intercept
nonzero_count
[1] 14
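
To see which predictors the lasso kept, the names of the nonzero coefficients can be pulled out of lasso.coef (a small sketch, assuming the objects created above are still in the workspace):

# List the predictors with nonzero lasso coefficients (intercept excluded)
selected <- names(which(lasso.coef[, 1] != 0))
setdiff(selected, "(Intercept)")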

Part E

library(pls)

Attaching package: 'pls'
The following object is masked from 'package:stats':

    loadings
set.seed(1)
pcr.fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)
Data:   X dimension: 543 17 
    Y dimension: 543 1
Fit method: svdpc
Number of components considered: 17

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV            3895     3813     2129     2139     1798     1715     1709
adjCV         3895     3814     2125     2136     1785     1703     1703
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV        1711     1656     1620      1615      1621      1621      1618
adjCV     1705     1645     1614      1608      1615      1615      1612
       14 comps  15 comps  16 comps  17 comps
CV         1618      1546      1177      1138
adjCV      1612      1529      1168      1130

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X      32.051    57.00    64.42    70.27    75.65    80.65    84.26    87.61
Apps    5.788    71.69    71.70    80.97    82.60    82.60    82.69    84.06
      9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
X       90.58     92.84     94.93     96.74     97.82     98.72     99.39
Apps    84.55     84.82     84.86     84.86     85.01     85.05     89.81
      16 comps  17 comps
X        99.85    100.00
Apps     93.03     93.32
validationplot(pcr.fit, val.type = "MSEP")  # Mean Squared Error of Prediction

cv.error <- RMSEP(pcr.fit)
best.M <- which.min(cv.error$val[1, , -1])  # Skip intercept
best.M
17 comps 
      17 
pcr.pred <- predict(pcr.fit, newdata = test, ncomp = best.M)

pcr.mse <- mean((test$Apps - pcr.pred)^2)
pcr.rmse <- sqrt(pcr.mse)

pcr.mse
[1] 1261630
pcr.rmse
[1] 1123.223

Part F

set.seed(1)
pls.fit <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls.fit)
Data:   X dimension: 543 17 
    Y dimension: 543 1
Fit method: kernelpls
Number of components considered: 17

VALIDATION: RMSEP
Cross-validated using 10 random segments.
       (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
CV            3895     1952     1743     1540     1451     1258     1171
adjCV         3895     1947     1737     1532     1430     1231     1161
       7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
CV        1153     1147     1144      1141      1142      1141      1139
adjCV     1145     1139     1136      1133      1134      1133      1131
       14 comps  15 comps  16 comps  17 comps
CV         1138      1138      1138      1138
adjCV      1130      1130      1130      1130

TRAINING: % variance explained
      1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
X       25.68    47.43    62.46    64.88    67.34    72.68    77.20    80.92
Apps    76.62    82.39    86.93    90.76    92.82    93.05    93.13    93.20
      9 comps  10 comps  11 comps  12 comps  13 comps  14 comps  15 comps
X       82.69     85.16     87.35     90.73     92.49     95.10     97.09
Apps    93.26     93.28     93.30     93.31     93.32     93.32     93.32
      16 comps  17 comps
X        98.40    100.00
Apps     93.32     93.32
validationplot(pls.fit, val.type = "MSEP")  # Use for visual inspection

cv.error.pls <- RMSEP(pls.fit)
best.M.pls <- which.min(cv.error.pls$val[1, , -1])
best.M.pls
17 comps 
      17 
pls.pred <- predict(pls.fit, newdata = test, ncomp = best.M.pls)

pls.mse <- mean((test$Apps - pls.pred)^2)
pls.rmse <- sqrt(pls.mse)

pls.mse
[1] 1261630
pls.rmse
[1] 1123.223

Part G

Overall accuracy is reasonable: test RMSEs range from about 1,060 (ridge) to about 1,120 (least squares, PCR, and PLS), with lasso at about 1,110, so predictions are off by roughly 1,100 applications on average, which is acceptable given the scale of Apps.

There is no meaningful advantage to the more complex models (ridge, lasso, PCR, PLS) over basic linear regression in this case.

Lasso selected fewer variables, which may be useful if model simplicity or interpretability is important, even when accuracy is similar.

PCR and PLS both kept all 17 components under cross-validation, so their test errors are identical to those of least squares; dimension reduction offered no benefit here.

All five approaches perform similarly in terms of test error. This suggests that the data have a strong linear structure and that regularization or dimension reduction does not substantially improve performance in this case. Given that, ordinary least squares regression remains a competitive choice.
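
For a side-by-side view, the test errors computed in Parts B through F can be collected into one table (a small sketch, assuming the mse and rmse objects created above are still in the workspace):

# Gather the test MSE and RMSE of all five fits into one comparison table
results <- data.frame(
  Model = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
  MSE   = c(mse, ridge.mse, lasso.mse, pcr.mse, pls.mse),
  RMSE  = c(rmse, ridge.rmse, lasso.rmse, pcr.rmse, pls.rmse)
)
results[order(results$MSE), ]  # Sort from lowest to highest test MSE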

Problem 11

Part A

library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:ISLR2':

    Boston
data(Boston)
str(Boston)
'data.frame':   506 obs. of  14 variables:
 $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
 $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
 $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
 $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
 $ rm     : num  6.58 6.42 7.18 7 7.15 ...
 $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
 $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
 $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
 $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
 $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
 $ black  : num  397 397 393 395 397 ...
 $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
 $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
set.seed(1)
train_idx <- sample(1:nrow(Boston), nrow(Boston)*0.7)
train <- Boston[train_idx, ]
test <- Boston[-train_idx, ]

1. Ridge Regression (glmnet)

library(glmnet)

x_train <- model.matrix(crim ~ ., train)[, -1]
y_train <- train$crim
x_test <- model.matrix(crim ~ ., test)[, -1]
y_test <- test$crim

cv.ridge <- cv.glmnet(x_train, y_train, alpha = 0)
ridge.pred <- predict(cv.ridge, s = cv.ridge$lambda.min, newx = x_test)
mean((y_test - ridge.pred)^2)
[1] 60.15076

2. Lasso Regression

cv.lasso <- cv.glmnet(x_train, y_train, alpha = 1)
lasso.pred <- predict(cv.lasso, s = cv.lasso$lambda.min, newx = x_test)
mean((y_test - lasso.pred)^2)
[1] 59.51757
# Non-zero coefficients
coef(cv.lasso, s = cv.lasso$lambda.min)
14 x 1 sparse Matrix of class "dgCMatrix"
            s=0.03747917
(Intercept)  11.80700649
zn            0.02933273
indus        -0.07745017
chas         -0.56091593
nox          -4.88447302
rm            .         
age           .         
dis          -0.53312020
rad           0.47460893
tax           .         
ptratio      -0.21649861
black        -0.01185557
lstat         0.21252213
medv         -0.08561418

3. PCR (pls)

library(pls)

pcr.fit <- pcr(crim ~ ., data = train, scale = TRUE, validation = "CV")
best.M.pcr <- which.min(RMSEP(pcr.fit)$val[1, , -1])
pcr.pred <- predict(pcr.fit, newdata = test, ncomp = best.M.pcr)
mean((test$crim - pcr.pred)^2)
[1] 58.97718

We used ridge, lasso, and PCR to predict per capita crime rate (crim) in the Boston dataset, evaluating each model with test MSE after a train/test split.

Ridge shrank all of the coefficients toward zero, which guards against instability when predictors are correlated.

Lasso provided a simpler, more interpretable model by setting some coefficients exactly to zero.

PCR summarized the predictors with principal components, which reduces dimensionality but makes the fit harder to interpret.

All three methods had similar test errors, indicating the data’s linear structure was well captured. Lasso stands out for balancing accuracy and interpretability.
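
As a quick check on the point about similar test errors, the three hold-out MSEs can be printed side by side (a sketch, assuming ridge.pred, lasso.pred, pcr.pred, y_test, and test from the code above are still in the workspace):

# Test MSE of each method on the Boston hold-out set
c(Ridge = mean((y_test - ridge.pred)^2),
  Lasso = mean((y_test - lasso.pred)^2),
  PCR   = mean((test$crim - pcr.pred)^2))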

Part B

Proposed Model: Lasso Regression

Based on our analysis, we recommend using Lasso regression to model the per capita crime rate (crim) in the Boston dataset. This decision is supported by the following:

Validation-based performance: Using a train/test split and evaluating models by test mean squared error (MSE), lasso's test MSE (about 59.5) was essentially as low as the best result obtained (about 59.0 for PCR, versus 60.2 for ridge), suggesting it generalizes well to unseen data.

Model simplicity and interpretability: Lasso eliminated unnecessary predictors by shrinking some coefficients exactly to zero, yielding a sparse, interpretable model. This is especially helpful when we want to understand which variables drive crime rates.

Comparison with other methods: While ridge regression and PCR had comparable test MSEs, they:

Retained all predictors (ridge), or

Used transformed variables (PCR), which are less interpretable

Cross-validation support: The optimal penalty parameter (λ) for lasso was selected by cross-validation, so the chosen model reflects performance on held-out folds rather than on the training fit alone (see the short check below).
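
A short check of the cross-validated penalty (assuming cv.lasso from Part A is still in the workspace):

# Penalty values chosen by cross-validation for the Boston lasso fit
cv.lasso$lambda.min  # lambda that minimizes CV error (used for the predictions above)
cv.lasso$lambda.1se  # largest lambda within one standard error of the minimum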

Conclusion

Lasso regression strikes the best balance between prediction accuracy, interpretability, and robustness to overfitting, based on its strong performance on the test set and the sparsity of the resulting model. It is therefore the preferred choice for modeling crime rate in this dataset.

Part C

Does your chosen model involve all of the features in the data set? Why or why not?

No, the lasso model does not involve all of the features in the dataset.

When fitting lasso regression with cross-validation to select the optimal penalty parameter λ, the model retained 10 predictors, setting the remaining coefficients to exactly zero.

This is expected behavior for lasso, which performs automatic variable selection by shrinking less relevant predictors entirely out of the model.

This results in a model that is:

Simpler and easier to interpret

Potentially more robust to overfitting

Still competitive in predictive performance

Therefore, lasso’s ability to eliminate unimportant predictors is a key reason it was chosen, and in this case, it identified a subset of 10 useful predictors for modeling crim.

lasso.coef <- coef(cv.lasso, s = cv.lasso$lambda.min)
sum(lasso.coef != 0) - 1  # Exclude intercept
[1] 10

Out of 13 predictors, lasso selected only 10 for the final model.