pnt130_Assignment5

Exercise 2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

Exercise 2.(a)

The lasso, relative to least squares, is: Answer: 3

More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Repeat (a) for ridge regression relative to least squares. Answer: 3
Repeat (a) for non-linear methods relative to least squares. Answer: 1

Exercise 9

In this exercise, we will predict the number of applications received using the other variables in the College data set.

Exercise 9.(a)

Split the data set into a training set and a test set.

library(ISLR)

summary(College)

##  Private        Apps           Accept          Enroll       Top10perc    
##  No :212   Min.   :   81   Min.   :   72   Min.   :  35   Min.   : 1.00  
##  Yes:565   1st Qu.:  776   1st Qu.:  604   1st Qu.: 242   1st Qu.:15.00  
##            Median : 1558   Median : 1110   Median : 434   Median :23.00  
##            Mean   : 3002   Mean   : 2019   Mean   : 780   Mean   :27.56  
##            3rd Qu.: 3624   3rd Qu.: 2424   3rd Qu.: 902   3rd Qu.:35.00  
##            Max.   :48094   Max.   :26330   Max.   :6392   Max.   :96.00  
##    Top25perc      F.Undergrad     P.Undergrad         Outstate    
##  Min.   :  9.0   Min.   :  139   Min.   :    1.0   Min.   : 2340  
##  1st Qu.: 41.0   1st Qu.:  992   1st Qu.:   95.0   1st Qu.: 7320  
##  Median : 54.0   Median : 1707   Median :  353.0   Median : 9990  
##  Mean   : 55.8   Mean   : 3700   Mean   :  855.3   Mean   :10441  
##  3rd Qu.: 69.0   3rd Qu.: 4005   3rd Qu.:  967.0   3rd Qu.:12925  
##  Max.   :100.0   Max.   :31643   Max.   :21836.0   Max.   :21700  
##    Room.Board       Books           Personal         PhD        
##  Min.   :1780   Min.   :  96.0   Min.   : 250   Min.   :  8.00  
##  1st Qu.:3597   1st Qu.: 470.0   1st Qu.: 850   1st Qu.: 62.00  
##  Median :4200   Median : 500.0   Median :1200   Median : 75.00  
##  Mean   :4358   Mean   : 549.4   Mean   :1341   Mean   : 72.66  
##  3rd Qu.:5050   3rd Qu.: 600.0   3rd Qu.:1700   3rd Qu.: 85.00  
##  Max.   :8124   Max.   :2340.0   Max.   :6800   Max.   :103.00  
##     Terminal       S.F.Ratio      perc.alumni        Expend     
##  Min.   : 24.0   Min.   : 2.50   Min.   : 0.00   Min.   : 3186  
##  1st Qu.: 71.0   1st Qu.:11.50   1st Qu.:13.00   1st Qu.: 6751  
##  Median : 82.0   Median :13.60   Median :21.00   Median : 8377  
##  Mean   : 79.7   Mean   :14.09   Mean   :22.74   Mean   : 9660  
##  3rd Qu.: 92.0   3rd Qu.:16.50   3rd Qu.:31.00   3rd Qu.:10830  
##  Max.   :100.0   Max.   :39.80   Max.   :64.00   Max.   :56233  
##    Grad.Rate     
##  Min.   : 10.00  
##  1st Qu.: 53.00  
##  Median : 65.00  
##  Mean   : 65.46  
##  3rd Qu.: 78.00  
##  Max.   :118.00

set.seed(1)

NumSample <- nrow(College)  # Num of Total Sample

Select_idx <- sample(1:NumSample, size = NumSample * 0.7)  # Sampling 70% for Training

Training_DB <- College[Select_idx, ]
Test_DB <- College[-Select_idx, ]

Exercise 9.(b)

Fit a linear model using least squares on the training set, and report the test error obtained.

lm.fits <- lm(Apps ~ . , data = Training_DB)  # Fit model

lm.preds <- predict(lm.fits, newdata = Test_DB)  # Prediction

summary(lm.preds)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  -239.3   649.9  1868.5  3166.2  3780.0 20593.5

Exercise 9.(c)

Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

library(glmnet)

## Loading required package: Matrix

## Loaded glmnet 4.1-9

# For Training
x_train <- model.matrix(Apps ~ ., data = Training_DB)[, -1]
y_train <- Training_DB$Apps

# For Testing
x_test <- model.matrix(Apps ~ ., data = Test_DB)[, -1]
y_test <- Test_DB$Apps

set.seed(1)
CrossValidation_Ridge <- cv.glmnet(x_train, y_train, alpha = 0)

best_lambda <- CrossValidation_Ridge$lambda.min

# Prediction
Ridge.Pred <- predict(CrossValidation_Ridge, s = best_lambda, newx = x_test)

# Calc MSE
Ridge_MSE <- mean((y_test - Ridge.Pred)^2)

print(paste("Ridge Regression Test MSE:", Ridge_MSE))

## [1] "Ridge Regression Test MSE: 1121034.26769553"

Exercise 9.(d)

Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coeﬀicient estimates.

# For Training
X_Train <- model.matrix(Apps ~ ., data = Training_DB)[, -1]
Y_Train <- Training_DB$Apps

# For Testing
X_Test <- model.matrix(Apps ~ ., data = Test_DB)[, -1]
Y_Test <- Test_DB$Apps

# Cross-Validation for Lasso
set.seed(1)
CrossValidation_Lasso <- cv.glmnet(X_Train, Y_Train, alpha = 1)

Best_Lambda <- CrossValidation_Lasso$lambda.min

# Prediction
Lasso_Pred <- predict(CrossValidation_Lasso, s = Best_Lambda, newx = X_Test)

# Test MSE
Lasso_MSE <- mean((Y_Test - Lasso_Pred)^2)
print(paste("Lasso MOdel Test MSE:", Lasso_MSE))

## [1] "Lasso MOdel Test MSE: 1233245.71960883"

# Number of non-zero coefficients
Lasso_Coefficients <- predict(CrossValidation_Lasso, s = Best_Lambda, type = "coefficients")
Num_Nonzero <- sum(Lasso_Coefficients != 0) - 1 

print(paste("Number of Non-Zero Coefficients:", Num_Nonzero))

## [1] "Number of Non-Zero Coefficients: 14"

Exercise 9.(e)

Fit a PCR model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

library(pls)

## 
## Attaching package: 'pls'

## The following object is masked from 'package:stats':
## 
##     loadings

set.seed(1)

Pcr_Model <- pcr(Apps ~ ., data = Training_DB, scale = TRUE, validation = "CV")


# Select Best M
Best_M <- which.min(Pcr_Model$validation$PRESS)  # PRESS가 최소인 주성분 개수

# Prediction
Pcr_Pred <- predict(Pcr_Model, newdata = Test_DB, ncomp = Best_M)

# Test
Pcr_MSE <- mean((Test_DB$Apps - Pcr_Pred)^2)
print(paste("PCR Test MSE:", Pcr_MSE))

## [1] "PCR Test MSE: 1261630.41216042"

print(paste("Number of M:", Best_M))

## [1] "Number of M: 17"

Exercise 9.(f)

Fit a PLS model on the training set, with M chosen by cross-validation. Report the test error obtained, along with the value of M selected by cross-validation.

# PLS model
set.seed(1)
Pls_Model <- plsr(Apps ~ ., data = Training_DB, scale = TRUE, validation = "CV")

# Selecet Best M
Best_M <- which.min(Pls_Model$validation$PRESS)

# Prediction
Pls_Pred <- predict(Pls_Model, newdata = Test_DB, ncomp = Best_M)

# Calc MSE
Pls_MSE <- mean((Test_DB$Apps - Pls_Pred)^2)

print(paste("PLS Test MSE:", Pls_MSE))

## [1] "PLS Test MSE: 1261630.41216042"

print(paste("Number of M:", Best_M))

## [1] "Number of M: 17"

Exercise 9.(g)

Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?

Answer: The performance of All approaches have similar performance. The LM has best performance among them. Lasso only uses 14 predictors. But, It has good performance.

Exercise 11

We will now try to predict per capita crime rate in the Boston data set.

Exercise 11.(a)

Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.

Answer: I have tested best subset selection, the lasso, ridge regression and PCR, respectively. According to my observation, the LASSO has best performance among those approaches. LASSO is using 10 coefficients of toal 14 coefficients. Discarding 4 coefficient could be heplful to improve the accuracy.

library(ISLR2)

## 
## Attaching package: 'ISLR2'

## The following objects are masked from 'package:ISLR':
## 
##     Auto, Credit

set.seed(1)

Sample_Index <- sample(1:nrow(Boston), nrow(Boston) * 0.7) # Training 70%, Test 30%

Training_DB <- Boston[Sample_Index, ]
Test_DB <- Boston[-Sample_Index, ]

Best SubSet Selection

library(leaps)

Regfit_Full <- regsubsets(crim ~ ., data = Training_DB, nvmax = 13)
Summary_Fit <- summary(Regfit_Full)


Best_Vars <- which.min(Summary_Fit$bic)

# Extract Coeff
Coeffs <- coef(Regfit_Full, id = Best_Vars)

# Predict Function
Predict_Regsubsets <- function(object, newdata, id) {
  Coeffs <- coef(object, id = id)
  Vars <- names(Coeffs)
  X <- model.matrix(crim ~ ., newdata)
  X[, Vars] %*% Coeffs
}

Pred_Best <- Predict_Regsubsets(Regfit_Full, Test_DB, Best_Vars)
MSE_Best <- mean((Test_DB$crim - Pred_Best)^2)

Ridge Regression

library(glmnet)
X_Train <- model.matrix(crim ~ ., Training_DB)[, -1]
Y_Train <- Training_DB$crim
X_Test  <- model.matrix(crim ~ ., Test_DB)[, -1]
Y_Test  <- Test_DB$crim

# Cross-validation
Ridge_CV <- cv.glmnet(X_Train, Y_Train, alpha = 0)
Ridge_Pred <- predict(Ridge_CV, s = Ridge_CV$lambda.min, newx = X_Test)
MSE_Ridge <- mean((Y_Test - Ridge_Pred)^2)

LASSO

Lasso_CV <- cv.glmnet(X_Train, Y_Train, alpha = 1)

Lasso_Pred <- predict(Lasso_CV, s = Lasso_CV$lambda.min, newx = X_Test)

MSE_Lasso <- mean((Y_Test - Lasso_Pred)^2)

#Coeff
Lasso_Coeff <- predict(Lasso_CV, s = Lasso_CV$lambda.min, type = "coefficients")
Nonzero_Count <- sum(Lasso_Coeff != 0) - 1
print(paste("Non-Zero Coefficient:", Nonzero_Count))

## [1] "Non-Zero Coefficient: 10"

print(Lasso_Coeff)

## 13 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept)  7.48573849
## zn           0.03186478
## indus       -0.07526620
## chas        -0.65050037
## nox         -4.92307928
## rm           0.32095116
## age          .         
## dis         -0.58591403
## rad          0.52152144
## tax          .         
## ptratio     -0.29108701
## lstat        0.22588987
## medv        -0.12777539

PCR

Pcr_Model <- pcr(crim ~ ., data = Training_DB, scale = TRUE, validation = "CV")

Best_M_PCR <- which.min(Pcr_Model$validation$PRESS)

PCR_Pred <- predict(Pcr_Model, newdata = Test_DB, ncomp = Best_M_PCR)

MSE_PCR <- mean((Test_DB$crim - PCR_Pred)^2)

Analysis

Test_MSEs <- c(MSE_Best, MSE_Ridge, MSE_Lasso, MSE_PCR)
names(Test_MSEs) <- c("Best Subset", "Ridge", "Lasso", "PCR")

barplot(Test_MSEs,
        main = "Test MSE: Boston crim Prediction",
        ylab = "Test MSE",
        col = "skyblue",
        ylim = c(0, max(Test_MSEs) * 1.1))
text(x = seq_along(Test_MSEs),
     y = Test_MSEs,
     labels = round(Test_MSEs, 2),
     pos = 3, cex = 0.8)

Exercise 11.(b)

Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

Answer: Among the models tested, LASSO regression performed best, showing the lowest test MSE based on cross-validation. It selected 10 predictors, offering both high accuracy and model simplicity. This balance of performance and interpretability makes LASSO the most suitable choice for predicting crime rate in the Boston data set.

Exercise 11.(c)

Does your chosen model involve all of the features in the data set? Why or why not? Answer: NO, LASSO only uses 10 predictors. rm, age and tax were not used for prediction.