Problem # 2
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
Part a:
The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: iii.
Justification: Lasso regression is less flexible than OLS because its L₁ penalty shrinks coefficients toward zero, reducing variance but increasing bias. Improved prediction accuracy is achieved when the reduction in variance compensates for the increase in bias.
Part b:
Repeat (a) for ridge regression relative to least squares.
Answer: iii.
Justification: Ridge regression, like lasso, penalizes coefficient size (but with an L2 norm). This shrinkage reduces variance (compared to OLS) while increasing bias, so it is less flexible than OLS. Prediction accuracy improves precisely when the reduction in variance outweighs the increase in bias.
Part c:
Repeat (a) for non-linear methods relative to least squares.
Answer: ii.
Justification: Most non-linear methods are more flexible than OLS: they reduce bias by better capturing complex patterns but increase variance due to more parameters or complexity. They improve prediction accuracy precisely if the reduction in bias outweighs the increase in variance.
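As a quick illustration of the flexibility argument in (a) and (b), the following minimal sketch on simulated data (not part of the assignment; the sample size and coefficients are arbitrary) shows how the lasso's penalty pulls coefficient estimates toward zero as λ grows, which is exactly the loss of flexibility that buys the reduction in variance.
# Minimal simulated illustration: as lambda grows, the lasso's L1 penalty
# shrinks coefficient estimates toward zero, so the fit is less flexible
# than least squares. (Output not shown; values depend on the seed.)
library(glmnet)
set.seed(1)
n <- 100; p <- 10
x <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, rep(0, p - 2)) # only the first two predictors matter
y <- as.vector(x %*% beta + rnorm(n))
ols_coef <- coef(lm(y ~ x)) # least squares: no shrinkage
lasso_path <- glmnet(x, y, alpha = 1) # lasso over a grid of lambdas
cbind(OLS = ols_coef,
      lasso_small_lambda = coef(lasso_path, s = 0.01)[, 1],
      lasso_large_lambda = coef(lasso_path, s = 1)[, 1])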
Problem # 9:
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Part a:
Split the data set into a training set and a test set.
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
data("College")
set.seed(1) # for reproducibility
train_idx <- sample(seq_len(nrow(College)), size = 0.7 * nrow(College))
train_data <- College[train_idx, ]
test_data <- College[-train_idx, ]
Part b:
Fit a linear model using least squares on the training set, and report the test error obtained.
# Fit model on the training set
lm_fit <- lm(Apps ~ ., data = train_data)
# Predict on the test set
lm_pred <- predict(lm_fit, newdata = test_data)
# Calculate test MSE
lm_mse <- mean((test_data$Apps - lm_pred)^2)
lm_mse
## [1] 1261630
Part c:
Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.3
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Convert data frames to matrices for glmnet
x_train <- model.matrix(Apps ~ ., data = train_data)[, -1] # drop the intercept column
y_train <- train_data$Apps
x_test <- model.matrix(Apps ~ ., data = test_data)[, -1]
y_test <- test_data$Apps
# Perform cross-validation for ridge (alpha = 0)
set.seed(1)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_lambda_ridge <- cv_ridge$lambda.min
# Predict on test set using best lambda
ridge_pred <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
ridge_mse
## [1] 1121034
Part d:
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# Cross-validation for lasso (alpha = 1)
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min
# Predict on test set
lasso_pred <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_mse
## [1] 1233246
# Number of nonzero coefficients (this count includes the unpenalized intercept)
lasso_coef <- coef(cv_lasso, s = best_lambda_lasso)
nonzero_count <- sum(lasso_coef != 0)
nonzero_count
## [1] 15
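For reference, one can also list which predictors the lasso keeps at the selected λ (a quick follow-up; the exact set depends on the seed and the CV folds, and the output is not shown above):
# Names of the predictors with non-zero lasso coefficients at lambda.min
rownames(lasso_coef)[as.vector(lasso_coef != 0)]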
Part e
Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
## Warning: package 'pls' was built under R version 4.4.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
# Fit PCR on training data with cross-validation
pcr_fit <- pcr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Check CV errors to choose M (number of components)
validationplot(pcr_fit, val.type = "MSEP") # visual inspection
summary(pcr_fit) # shows CV metrics for different components
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 3807 2130 2134 1826 1692 1696
## adjCV 3895 3807 2126 2132 1806 1682 1690
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1695 1638 1604 1593 1601 1603 1606
## adjCV 1694 1627 1598 1587 1595 1598 1600
## 14 comps 15 comps 16 comps 17 comps
## CV 1607 1544 1145 1113
## adjCV 1602 1524 1137 1106
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 32.051 57.00 64.42 70.27 75.65 80.65 84.26 87.61
## Apps 5.788 71.69 71.70 80.97 82.60 82.60 82.69 84.06
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.58 92.84 94.93 96.74 97.82 98.72 99.39
## Apps 84.55 84.82 84.86 84.86 85.01 85.05 89.81
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.03 93.32
# Choose M (number of components) as the one with the lowest cross-validated PRESS:
best_M_pcr <- which.min(pcr_fit$validation$PRESS) # or chosen from the plot
# Predict on test set using best number of components
pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = best_M_pcr)
pcr_mse <- mean((test_data$Apps - pcr_pred)^2)
pcr_mse
## [1] 1261630
Interpretation:
The PCR CV results show that the prediction error (RMSEP) drops sharply as the first few principal components are added, indicating that these components capture much of the relevant signal, and then plateaus for intermediate values of M. However, the error falls again once nearly all components are included, and the cross-validated minimum occurs at M = 17. With all 17 components retained, PCR is equivalent to the full least-squares fit, which is why its test MSE (1,261,630) matches the one from part (b). A genuinely parsimonious PCR model (say, around 10 components) would be simpler but would give a noticeably higher CV error here.
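Since the raw PRESS minimum selects M = 17, a hedged alternative worth noting is the one-standard-error rule implemented in the pls package's selectNcomp(), which typically picks a more parsimonious M at a small cost in CV error (a sketch; output not shown):
# One-standard-error rule for choosing the number of components (pls::selectNcomp)
M_onesigma <- selectNcomp(pcr_fit, method = "onesigma", plot = FALSE)
M_onesigma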
Part f:
Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
# Fit PLS on training data with cross-validation
pls_fit <- plsr(Apps ~ ., data = train_data, scale = TRUE, validation = "CV")
# Check CV errors
validationplot(pls_fit, val.type = "MSEP") # visual inspection
summary(pls_fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3895 1969 1762 1566 1493 1320 1236
## adjCV 3895 1963 1757 1556 1469 1291 1222
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1225 1213 1212 1207 1213 1207 1206
## adjCV 1212 1201 1199 1195 1200 1195 1194
## 14 comps 15 comps 16 comps 17 comps
## CV 1205 1205 1205 1205
## adjCV 1193 1193 1193 1193
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.68 47.43 62.46 64.88 67.34 72.68 77.20 80.92
## Apps 76.62 82.39 86.93 90.76 92.82 93.05 93.13 93.20
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.69 85.16 87.35 90.73 92.49 95.10 97.09
## Apps 93.26 93.28 93.30 93.31 93.32 93.32 93.32
## 16 comps 17 comps
## X 98.40 100.00
## Apps 93.32 93.32
# Choose M with the lowest cross-validated PRESS, as before:
best_M_pls <- which.min(pls_fit$validation$PRESS)
# Predict on test set
pls_pred <- predict(pls_fit, newdata = test_data, ncomp = best_M_pls)
pls_mse <- mean((test_data$Apps - pls_pred)^2)
pls_mse
## [1] 1261630
Interpretation:
In the PLS results, the first few components sharply reduce the RMSEP, indicating that they capture the main predictive signal, and additional components lower the error only marginally: by roughly 7–10 components the curve has flattened, and by 17 components nearly 100% of the predictor variance and about 93% of the variance in Apps is explained. The raw CV minimum again occurs at M = 17, so the selected model coincides with the least-squares fit and its test MSE (1,261,630) matches part (b); choosing M around 10 would give essentially the same CV error with a far more parsimonious model.
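Because the CV curve is essentially flat after about 10 components, a quick check is to compute the test MSE at that smaller M (illustrative; 10 is simply read off the table above, and the resulting value is not shown):
# Test MSE for a more parsimonious PLS fit with 10 components
pls_pred_small <- predict(pls_fit, newdata = test_data, ncomp = 10)
mean((test_data$Apps - pls_pred_small)^2)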
Part g:
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Prediction Accuracy: All five methods give test MSEs roughly between 1.1 million and 1.3 million. This means that while our models capture a fair amount of the signal behind the number of applications, there’s still a notable amount of variability or noise that remains unexplained.
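To put MSEs on the order of a million in context, a quick sketch is to convert the linear model's test MSE into a test R² by comparing it with the variance of Apps on the test set (output not shown; the other methods can be checked the same way):
# Test R^2 for the least squares fit: 1 - MSE / Var(Apps) on the test set
1 - lm_mse / mean((test_data$Apps - mean(test_data$Apps))^2)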
Method Comparisons:
The basic OLS model has an error of about 1.26 million and serves as our baseline.
Ridge Regression shows improvement with an error around 1.12 million—its shrinkage penalty helps reduce overfitting by controlling the magnitude of the coefficients.
Lasso Regression results in a test error near 1.23 million. Although a bit higher than ridge, lasso has the advantage of setting some coefficients exactly to zero (here 15 of the 18 coefficient estimates, including the intercept, remain non-zero), which can make the model easier to interpret.
For PCR and PLS, we see the usual pattern: a sharp drop in error with the addition of the first few components, after which improvements level off. Their final errors are close to those of ridge and lasso.
Summary:
No single method dramatically outperforms the others: the largest and smallest test MSEs differ by only about 12%, so any of these approaches predicts the number of college applications about equally well. The choice among them can therefore rest on other considerations, such as interpretability (lasso's sparser model) versus a small edge in prediction accuracy (ridge regression on this split).
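For convenience, the five test errors can be tabulated side by side, in the same style as the comparison data frame used in Problem 11 below (output not shown; the values are the ones reported above):
# Collect the five test MSEs from the chunks above into one table
data.frame(
  Method = c("OLS", "Ridge", "Lasso", "PCR", "PLS"),
  Test_MSE = c(lm_mse, ridge_mse, lasso_mse, pcr_mse, pls_mse)
)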
Problem # 11
We will now try to predict per capita crime rate in the Boston data set.
Part a:
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)
## Warning: package 'MASS' was built under R version 4.4.2
##
## Attaching package: 'MASS'
## The following object is masked from 'package:ISLR2':
##
## Boston
library(leaps)
## Warning: package 'leaps' was built under R version 4.4.2
library(glmnet)
library(pls)
data("Boston")
#Train/Test Split
set.seed(1)
train_idx <- sample(seq_len(nrow(Boston)), size = 0.7 * nrow(Boston))
train_data <- Boston[train_idx, ]
test_data <- Boston[-train_idx, ]
# Separate predictors and response for later (glmnet/pcr needs matrices)
x_train <- model.matrix(crim ~ ., data = train_data)[, -1]
y_train <- train_data$crim
x_test <- model.matrix(crim ~ ., data = test_data)[, -1]
y_test <- test_data$crim
# Best subset selection uses 'regsubsets'
best_subset_fit <- regsubsets(crim ~ ., data = train_data, nvmax = 13)
best_subset_summary <- summary(best_subset_fit)
# Inspect metrics (adjr2, Cp, BIC, etc.) for each model size
best_subset_summary$adjr2
## [1] 0.4128277 0.4558228 0.4720392 0.4745639 0.4749360 0.4769643 0.4785831
## [8] 0.4790469 0.4797650 0.4786691 0.4773511 0.4758577 0.4743257
best_subset_summary$cp
## [1] 43.180047 15.354654 5.522343 4.841842 5.595966 5.258249 5.197729
## [8] 5.901439 6.440537 8.165938 10.031650 12.006191 14.000000
best_subset_summary$bic
## [1] -177.7483 -199.8055 -205.6558 -202.4962 -197.8935 -194.4130 -190.6627
## [8] -186.1330 -181.7796 -176.1959 -170.4664 -164.6236 -158.7608
# Choose the model size by BIC (smallest value)
best_size <- which.min(best_subset_summary$bic)
# Extract final coefficients for that model
coef(best_subset_fit, best_size)
## (Intercept) rad black lstat
## 0.95384519 0.43521730 -0.01310151 0.24416589
# Refit that model
form_best_subset <- as.formula(
paste("crim ~", paste(names(coef(best_subset_fit, best_size))[-1], collapse=" + "))
)
best_subset_lm <- lm(form_best_subset, data = train_data)
# Test MSE
best_subset_pred <- predict(best_subset_lm, newdata = test_data)
best_subset_mse <- mean((y_test - best_subset_pred)^2)
best_subset_mse
## [1] 61.86531
#Ridge Regression
set.seed(1)
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0) # 10-fold CV by default
best_lambda_ridge <- cv_ridge$lambda.min
# Predict on test set
ridge_pred <- predict(cv_ridge, s = best_lambda_ridge, newx = x_test)
ridge_mse <- mean((y_test - ridge_pred)^2)
ridge_mse
## [1] 60.15076
#Lasso Regression
set.seed(1)
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lambda_lasso <- cv_lasso$lambda.min
# Predict on test set
lasso_pred <- predict(cv_lasso, s = best_lambda_lasso, newx = x_test)
lasso_mse <- mean((y_test - lasso_pred)^2)
lasso_mse
## [1] 59.55616
# Number of nonzero coefficients (again, this count includes the intercept)
lasso_coef <- coef(cv_lasso, s = best_lambda_lasso)
sum(lasso_coef != 0)
## [1] 11
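As in Problem 9, it can be useful to see which predictors the lasso actually retains at the selected λ (a quick follow-up; output not shown, and the set depends on the seed):
# Names of the predictors with non-zero lasso coefficients at lambda.min
rownames(lasso_coef)[as.vector(lasso_coef != 0)]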
# Principal Components Regression
set.seed(1)
pcr_fit <- pcr(crim ~ ., data = train_data, scale = TRUE, validation = "CV")
# Inspect CV errors for each # of components
summary(pcr_fit)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.085 6.627 6.619 6.149 6.107 6.117 6.184
## adjCV 8.085 6.624 6.616 6.142 6.099 6.111 6.176
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.159 6.005 6.004 6.00 6.012 6.018 5.974
## adjCV 6.151 5.995 5.987 5.99 6.001 6.006 5.962
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.34 61.18 70.44 77.16 83.41 88.28 91.51 93.81
## crim 33.34 33.76 43.13 44.14 44.15 44.20 44.63 47.57
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.52 97.14 98.48 99.50 100.00
## crim 48.08 48.12 48.13 48.46 49.37
# Choose # of components that gives minimum CV error
best_comp <- which.min(pcr_fit$validation$PRESS)
# Evaluate test MSE
pcr_pred <- predict(pcr_fit, newdata = test_data, ncomp = best_comp)
pcr_mse <- mean((y_test - pcr_pred)^2)
pcr_mse
## [1] 58.97718
# Comparison of the results
results <- data.frame(
Method = c("Best Subset", "Ridge", "Lasso", "PCR"),
Test_MSE = c(best_subset_mse, ridge_mse, lasso_mse, pcr_mse)
)
results
## Method Test_MSE
## 1 Best Subset 61.86531
## 2 Ridge 60.15076
## 3 Lasso 59.55616
## 4 PCR 58.97718
Interpretation
The numbers tell us that all of the methods do a reasonable job, but there are some small differences. The PCR model comes out on top with the lowest test error (about 59.0), meaning it did the best job of predicting the per capita crime rate. Lasso (≈59.6) and Ridge (≈60.2) follow close behind, while the Best Subset model, at about 61.9, does not do quite as well.
PCR: By turning the predictors into principal components, PCR seems to capture the essential information really well. Even if it’s a bit more abstract (since you lose the original predictor names), it gives the best predictions.
Lasso & Ridge: Both these methods use regularization to prevent overfitting. Lasso has an added bonus of simplifying the model by cutting out some predictors entirely, and it performs almost as well as PCR.
Best Subset: Although it picks out a smaller set of predictors (which can be easier to interpret), it seems to miss some important information compared to the other methods, which shows up as a slightly higher error on unseen data.
The Bottom Line:
All models are fairly close in performance, but if we’re aiming solely for prediction accuracy, PCR has a slight edge. However, if interpretability and simplicity are important for our application, using Lasso or even Best Subset might still be worthwhile.
Part b:
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
PCR (Principal Components Regression) has the lowest test MSE (≈58.98), so it offers the best out-of-sample predictive performance among these four approaches on this split. One caveat: cross-validation selected M = 13, i.e. all of the components, so the fitted PCR model is essentially a full least-squares fit on standardized predictors rather than a reduced-dimension model, and its edge over ridge (≈60.2) and lasso (≈59.6) is modest.
In short, PCR is a reasonable choice for pure predictive accuracy based on the validation/test error here. If interpretability also matters, Lasso is nearly as accurate while zeroing out some coefficients, and thus yields a simpler model.
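Because this ranking rests on a single 70/30 split, a hedged robustness check is to repeat the split a few times and recompute the errors; the sketch below (illustrative only, five repeats of lasso and PCR, output not shown) indicates whether PCR's small edge persists across splits:
# Repeat the train/test split several times and recompute lasso and PCR test MSEs
set.seed(2)
res <- t(replicate(5, {
  idx <- sample(seq_len(nrow(Boston)), size = floor(0.7 * nrow(Boston)))
  xtr <- model.matrix(crim ~ ., Boston[idx, ])[, -1]
  ytr <- Boston$crim[idx]
  xte <- model.matrix(crim ~ ., Boston[-idx, ])[, -1]
  yte <- Boston$crim[-idx]
  lasso_cv <- cv.glmnet(xtr, ytr, alpha = 1)
  lasso_mse_i <- mean((yte - predict(lasso_cv, newx = xte, s = "lambda.min"))^2)
  pcr_fit_i <- pcr(crim ~ ., data = Boston[idx, ], scale = TRUE, validation = "CV")
  pcr_mse_i <- mean((yte - predict(pcr_fit_i, newdata = Boston[-idx, ],
                                   ncomp = which.min(pcr_fit_i$validation$PRESS)))^2)
  c(lasso = lasso_mse_i, pcr = pcr_mse_i)
}))
res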
Part c:
Does your chosen model involve all of the features in the data set? Why or why not?
Yes, in two senses. First, cross-validation selected M = 13 here, so every principal component is retained and the model uses all of the features. Second, even when only a subset of components is kept, each component is a linear combination of every original predictor, so PCR never drops specific features the way lasso can by setting coefficients exactly to zero; it repackages all of them into a smaller number of derived variables.
We are therefore using every original variable, just not in its original, individual form. This reduces interpretability (you cannot easily say "feature X has coefficient Y"), but it often helps predictive accuracy when the predictors are highly correlated.
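To see how the original predictors enter the retained components, one can inspect the PCR loadings from the fit above (a sketch; output not shown, and since best_comp is all 13 components here this is the full loading matrix):
# Loadings: each column shows how every original predictor contributes to a component
round(pcr_fit$loadings[, 1:best_comp], 2)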