For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
Repeat (a) for ridge regression relative to least squares. Answer: 3
Repeat (a) for non-linear methods relative to least squares. Answer: 1
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Split the data set into a training set and a test set.
library(ISLR)
summary(College)
## Private Apps Accept Enroll Top10perc
## No :212 Min. : 81 Min. : 72 Min. : 35 Min. : 1.00
## Yes:565 1st Qu.: 776 1st Qu.: 604 1st Qu.: 242 1st Qu.:15.00
## Median : 1558 Median : 1110 Median : 434 Median :23.00
## Mean : 3002 Mean : 2019 Mean : 780 Mean :27.56
## 3rd Qu.: 3624 3rd Qu.: 2424 3rd Qu.: 902 3rd Qu.:35.00
## Max. :48094 Max. :26330 Max. :6392 Max. :96.00
## Top25perc F.Undergrad P.Undergrad Outstate
## Min. : 9.0 Min. : 139 Min. : 1.0 Min. : 2340
## 1st Qu.: 41.0 1st Qu.: 992 1st Qu.: 95.0 1st Qu.: 7320
## Median : 54.0 Median : 1707 Median : 353.0 Median : 9990
## Mean : 55.8 Mean : 3700 Mean : 855.3 Mean :10441
## 3rd Qu.: 69.0 3rd Qu.: 4005 3rd Qu.: 967.0 3rd Qu.:12925
## Max. :100.0 Max. :31643 Max. :21836.0 Max. :21700
## Room.Board Books Personal PhD
## Min. :1780 Min. : 96.0 Min. : 250 Min. : 8.00
## 1st Qu.:3597 1st Qu.: 470.0 1st Qu.: 850 1st Qu.: 62.00
## Median :4200 Median : 500.0 Median :1200 Median : 75.00
## Mean :4358 Mean : 549.4 Mean :1341 Mean : 72.66
## 3rd Qu.:5050 3rd Qu.: 600.0 3rd Qu.:1700 3rd Qu.: 85.00
## Max. :8124 Max. :2340.0 Max. :6800 Max. :103.00
## Terminal S.F.Ratio perc.alumni Expend
## Min. : 24.0 Min. : 2.50 Min. : 0.00 Min. : 3186
## 1st Qu.: 71.0 1st Qu.:11.50 1st Qu.:13.00 1st Qu.: 6751
## Median : 82.0 Median :13.60 Median :21.00 Median : 8377
## Mean : 79.7 Mean :14.09 Mean :22.74 Mean : 9660
## 3rd Qu.: 92.0 3rd Qu.:16.50 3rd Qu.:31.00 3rd Qu.:10830
## Max. :100.0 Max. :39.80 Max. :64.00 Max. :56233
## Grad.Rate
## Min. : 10.00
## 1st Qu.: 53.00
## Median : 65.00
## Mean : 65.46
## 3rd Qu.: 78.00
## Max. :118.00
set.seed(1)
NumSample <- nrow(College) # Num of Total Sample
Select_idx <- sample(1:NumSample, size = NumSample * 0.7) # Sampling 70% for Training
Training_DB <- College[Select_idx, ]
Test_DB <- College[-Select_idx, ]
lm.fits <- lm(Apps ~ . , data = Training_DB) # Fit model
lm.preds <- predict(lm.fits, newdata = Test_DB) # Prediction
summary(lm.preds)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -239.3 649.9 1868.5 3166.2 3780.0 20593.5
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-9
# For Training
x_train <- model.matrix(Apps ~ ., data = Training_DB)[, -1]
y_train <- Training_DB$Apps
# For Testing
x_test <- model.matrix(Apps ~ ., data = Test_DB)[, -1]
y_test <- Test_DB$Apps
set.seed(1)
CrossValidation_Ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_lambda <- CrossValidation_Ridge$lambda.min
# Prediction
Ridge.Pred <- predict(CrossValidation_Ridge, s = best_lambda, newx = x_test)
# Calc MSE
Ridge_MSE <- mean((y_test - Ridge.Pred)^2)
print(paste("Ridge Regression Test MSE:", Ridge_MSE))
## [1] "Ridge Regression Test MSE: 1121034.26769553"
Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
# For Training
X_Train <- model.matrix(Apps ~ ., data = Training_DB)[, -1]
Y_Train <- Training_DB$Apps
# For Testing
X_Test <- model.matrix(Apps ~ ., data = Test_DB)[, -1]
Y_Test <- Test_DB$Apps
# Cross-Validation for Lasso
set.seed(1)
CrossValidation_Lasso <- cv.glmnet(X_Train, Y_Train, alpha = 1)
Best_Lambda <- CrossValidation_Lasso$lambda.min
# Prediction
Lasso_Pred <- predict(CrossValidation_Lasso, s = Best_Lambda, newx = X_Test)
# Test MSE
Lasso_MSE <- mean((Y_Test - Lasso_Pred)^2)
print(paste("Lasso MOdel Test MSE:", Lasso_MSE))
## [1] "Lasso MOdel Test MSE: 1233245.71960883"
# Number of non-zero coefficients
Lasso_Coefficients <- predict(CrossValidation_Lasso, s = Best_Lambda, type = "coefficients")
Num_Nonzero <- sum(Lasso_Coefficients != 0) - 1
print(paste("Number of Non-Zero Coefficients:", Num_Nonzero))
## [1] "Number of Non-Zero Coefficients: 14"
Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
set.seed(1)
Pcr_Model <- pcr(Apps ~ ., data = Training_DB, scale = TRUE, validation = "CV")
# Select Best M
Best_M <- which.min(Pcr_Model$validation$PRESS) # PRESS가 최소인 주성분 개수
# Prediction
Pcr_Pred <- predict(Pcr_Model, newdata = Test_DB, ncomp = Best_M)
# Test
Pcr_MSE <- mean((Test_DB$Apps - Pcr_Pred)^2)
print(paste("PCR Test MSE:", Pcr_MSE))
## [1] "PCR Test MSE: 1261630.41216042"
print(paste("Number of M:", Best_M))
## [1] "Number of M: 17"
Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.
# PLS model
set.seed(1)
Pls_Model <- plsr(Apps ~ ., data = Training_DB, scale = TRUE, validation = "CV")
# Selecet Best M
Best_M <- which.min(Pls_Model$validation$PRESS)
# Prediction
Pls_Pred <- predict(Pls_Model, newdata = Test_DB, ncomp = Best_M)
# Calc MSE
Pls_MSE <- mean((Test_DB$Apps - Pls_Pred)^2)
print(paste("PLS Test MSE:", Pls_MSE))
## [1] "PLS Test MSE: 1261630.41216042"
print(paste("Number of M:", Best_M))
## [1] "Number of M: 17"
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
Answer: The performance of All approaches have similar performance. The LM has best performance among them. Lasso only uses 14 predictors. But, It has good performance.
We will now try to predict per capita crime rate in the Boston data set.
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Answer: I have tested best subset selection, the lasso, ridge regression and PCR, respectively. According to my observation, the LASSO has best performance among those approaches. LASSO is using 10 coefficients of toal 14 coefficients. Discarding 4 coefficient could be heplful to improve the accuracy.
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
set.seed(1)
Sample_Index <- sample(1:nrow(Boston), nrow(Boston) * 0.7) # Training 70%, Test 30%
Training_DB <- Boston[Sample_Index, ]
Test_DB <- Boston[-Sample_Index, ]
Best SubSet Selection
library(leaps)
Regfit_Full <- regsubsets(crim ~ ., data = Training_DB, nvmax = 13)
Summary_Fit <- summary(Regfit_Full)
Best_Vars <- which.min(Summary_Fit$bic)
# Extract Coeff
Coeffs <- coef(Regfit_Full, id = Best_Vars)
# Predict Function
Predict_Regsubsets <- function(object, newdata, id) {
Coeffs <- coef(object, id = id)
Vars <- names(Coeffs)
X <- model.matrix(crim ~ ., newdata)
X[, Vars] %*% Coeffs
}
Pred_Best <- Predict_Regsubsets(Regfit_Full, Test_DB, Best_Vars)
MSE_Best <- mean((Test_DB$crim - Pred_Best)^2)
Ridge Regression
library(glmnet)
X_Train <- model.matrix(crim ~ ., Training_DB)[, -1]
Y_Train <- Training_DB$crim
X_Test <- model.matrix(crim ~ ., Test_DB)[, -1]
Y_Test <- Test_DB$crim
# Cross-validation
Ridge_CV <- cv.glmnet(X_Train, Y_Train, alpha = 0)
Ridge_Pred <- predict(Ridge_CV, s = Ridge_CV$lambda.min, newx = X_Test)
MSE_Ridge <- mean((Y_Test - Ridge_Pred)^2)
LASSO
Lasso_CV <- cv.glmnet(X_Train, Y_Train, alpha = 1)
Lasso_Pred <- predict(Lasso_CV, s = Lasso_CV$lambda.min, newx = X_Test)
MSE_Lasso <- mean((Y_Test - Lasso_Pred)^2)
#Coeff
Lasso_Coeff <- predict(Lasso_CV, s = Lasso_CV$lambda.min, type = "coefficients")
Nonzero_Count <- sum(Lasso_Coeff != 0) - 1
print(paste("Non-Zero Coefficient:", Nonzero_Count))
## [1] "Non-Zero Coefficient: 10"
print(Lasso_Coeff)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 7.48573849
## zn 0.03186478
## indus -0.07526620
## chas -0.65050037
## nox -4.92307928
## rm 0.32095116
## age .
## dis -0.58591403
## rad 0.52152144
## tax .
## ptratio -0.29108701
## lstat 0.22588987
## medv -0.12777539
PCR
Pcr_Model <- pcr(crim ~ ., data = Training_DB, scale = TRUE, validation = "CV")
Best_M_PCR <- which.min(Pcr_Model$validation$PRESS)
PCR_Pred <- predict(Pcr_Model, newdata = Test_DB, ncomp = Best_M_PCR)
MSE_PCR <- mean((Test_DB$crim - PCR_Pred)^2)
Analysis
Test_MSEs <- c(MSE_Best, MSE_Ridge, MSE_Lasso, MSE_PCR)
names(Test_MSEs) <- c("Best Subset", "Ridge", "Lasso", "PCR")
barplot(Test_MSEs,
main = "Test MSE: Boston crim Prediction",
ylab = "Test MSE",
col = "skyblue",
ylim = c(0, max(Test_MSEs) * 1.1))
text(x = seq_along(Test_MSEs),
y = Test_MSEs,
labels = round(Test_MSEs, 2),
pos = 3, cex = 0.8)
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.
Answer: Among the models tested, LASSO regression performed best, showing the lowest test MSE based on cross-validation. It selected 10 predictors, offering both high accuracy and model simplicity. This balance of performance and interpretability makes LASSO the most suitable choice for predicting crime rate in the Boston data set.
Does your chosen model involve all of the features in the data set? Why or why not? Answer: NO, LASSO only uses 10 predictors. rm, age and tax were not used for prediction.