library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.4.2
library(caret)
## Warning: package 'caret' was built under R version 4.4.2
## Loading required package: ggplot2
## Loading required package: lattice
library(glmnet)
## Warning: package 'glmnet' was built under R version 4.4.2
## Loading required package: Matrix
## Loaded glmnet 4.1-8
library(pls)
## Warning: package 'pls' was built under R version 4.4.3
##
## Attaching package: 'pls'
## The following object is masked from 'package:caret':
##
## R2
## The following object is masked from 'package:stats':
##
## loadings
library(leaps)
## Warning: package 'leaps' was built under R version 4.4.3
attach(College)
attach(Boston)
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Answer: iii
The lasso is less flexible than least squares because the penalty constrains the coefficients, shrinking some of them exactly to zero. It outperforms least squares when the resulting decrease in variance is larger than the increase in bias.
Answer: iii
Ridge regression is also less flexible than least squares because it shrinks all coefficients toward zero. Like lasso, ridge regression performs better when its decrease in variance outweighs its increase in bias.
Answer: ii
Non-linear methods are more flexible than least squares because they can model curved relationships. When non-linear methods work better, it’s because their decrease in bias exceeds their increase in variance.
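To make the shrinkage intuition concrete, here is a small illustrative sketch (not part of the exercise) using the College data loaded above: as λ grows, the lasso drives coefficients exactly to zero while ridge only shrinks them toward zero, so both fits become progressively less flexible.

# Illustration: coefficient paths for lasso (alpha = 1) and ridge (alpha = 0).
# Moving along the log-lambda axis, lasso coefficients hit exactly zero,
# while ridge coefficients shrink but stay non-zero.
x_demo <- model.matrix(Apps ~ ., data = College)[, -1]
y_demo <- College$Apps
par(mfrow = c(1, 2))
plot(glmnet(x_demo, y_demo, alpha = 1), xvar = "lambda", main = "Lasso")
plot(glmnet(x_demo, y_demo, alpha = 0), xvar = "lambda", main = "Ridge")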
set.seed(20)
train_index <- createDataPartition(College$Apps, p = 0.7, list = FALSE)
col_train <- College[train_index, ]
col_test <- College[-train_index, ]
col_fit <- lm(Apps ~ ., data = col_train)
summary(col_fit)
##
## Call:
## lm(formula = Apps ~ ., data = col_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3621.7 -451.5 1.4 351.2 7033.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -332.20811 495.74137 -0.670 0.503072
## PrivateYes -344.96641 164.64642 -2.095 0.036630 *
## Accept 1.63836 0.04609 35.548 < 2e-16 ***
## Enroll -0.85727 0.24060 -3.563 0.000400 ***
## Top10perc 57.69812 6.82745 8.451 2.84e-16 ***
## Top25perc -22.00444 5.58659 -3.939 9.29e-05 ***
## F.Undergrad 0.06140 0.04132 1.486 0.137862
## P.Undergrad 0.05219 0.03639 1.434 0.152084
## Outstate -0.08962 0.02374 -3.776 0.000178 ***
## Room.Board 0.14151 0.05789 2.445 0.014831 *
## Books -0.17329 0.27292 -0.635 0.525732
## Personal 0.02192 0.07426 0.295 0.767957
## PhD -7.56687 5.55053 -1.363 0.173380
## Terminal -3.38850 6.10177 -0.555 0.578904
## S.F.Ratio 15.49485 15.90679 0.974 0.330453
## perc.alumni 2.19067 5.04900 0.434 0.664552
## Expend 0.07599 0.01433 5.304 1.67e-07 ***
## Grad.Rate 8.33272 3.64965 2.283 0.022819 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1067 on 527 degrees of freedom
## Multiple R-squared: 0.9307, Adjusted R-squared: 0.9284
## F-statistic: 416.1 on 17 and 527 DF, p-value: < 2.2e-16
col_preds <- predict(col_fit, newdata = col_test)
col_er <- mean((col_preds - col_test$Apps)^2)
col_er
## [1] 1043245
After fitting a linear model using least squares on the training set, the test MSE was 1043245.
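Because the MSE is in squared units, its square root is easier to interpret: the test RMSE works out to roughly 1000 applications.

# Test RMSE in units of applications (square root of the test MSE above).
sqrt(col_er)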
x_train <- model.matrix(Apps ~ ., data = col_train)[, -1]
y_train <- col_train$Apps
x_test <- model.matrix(Apps ~ ., data = col_test)[, -1]
grid <- 10^seq(10, -2, length = 100)
col_grid <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 0)
grid_lambda <- col_grid$lambda.min
col_grid
##
## Call: cv.glmnet(x = x_train, y = y_train, lambda = grid, alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.01 100 1361534 356253 17
## 1se 231.01 64 1680177 628505 17
col_grid_pred <- predict(col_grid, s = grid_lambda, newx = x_test)
col_grid_er <- mean((col_grid_pred - col_test$Apps)^2)
col_grid_er
## [1] 1043219
After fitting a ridge regression model on the training set with λ chosen by cross-validation (λ = 0.01), the test MSE was 1043219.
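One caveat: the selected λ (0.01) sits at the lower boundary of the hand-built grid, so the cross-validation optimum may lie below it. A quick alternative (a sketch, not rerun above) is to let cv.glmnet construct its own λ sequence:

# Sketch: let cv.glmnet pick its own lambda sequence rather than the manual grid,
# in case the optimum lies below the grid's lower bound.
col_ridge_auto <- cv.glmnet(x_train, y_train, alpha = 0)
col_ridge_auto$lambda.min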
col_lasso <- cv.glmnet(x_train, y_train, lambda = grid, alpha = 1)
lasso_lambda <- col_lasso$lambda.min
col_lasso
##
## Call: cv.glmnet(x = x_train, y = y_train, lambda = grid, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.01 100 1382267 318738 17
## 1se 305.39 63 1610045 401474 3
col_lasso_pred <- predict(col_lasso, s = lasso_lambda, newx = x_test)
col_lasso_er <- mean((col_lasso_pred - col_test$Apps)^2)
col_lasso_er
## [1] 1043249
After fitting a lasso model on the training set with λ chosen by cross-validation (λ = 0.01), the test MSE was 1043249. At this small λ essentially no shrinkage occurs: all 17 coefficient estimates remain non-zero, as the listing below shows.
col_lasso_coef <- predict(col_lasso, type = "coefficients", s = lasso_lambda)
col_lasso_coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) -334.98659413
## PrivateYes -345.34671278
## Accept 1.63683324
## Enroll -0.84628954
## Top10perc 57.60008684
## Top25perc -21.92675371
## F.Undergrad 0.06009396
## P.Undergrad 0.05220000
## Outstate -0.08944182
## Room.Board 0.14177374
## Books -0.17319275
## Personal 0.02180997
## PhD -7.55776847
## Terminal -3.40610008
## S.F.Ratio 15.50406762
## perc.alumni 2.15679557
## Expend 0.07597679
## Grad.Rate 8.32502809
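To confirm the count of non-zero estimates reported by cv.glmnet above, we can tally them directly from the coefficient vector (intercept excluded):

# Number of non-zero lasso coefficient estimates at lambda.min.
sum(col_lasso_coef[-1, ] != 0)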
col_pcr <- pcr(Apps ~., data = col_train, scale = TRUE, validation = 'CV')
summary(col_pcr)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3992 3839 2155 2152 1986 1746 1737
## adjCV 3992 3839 2153 2151 2006 1726 1733
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1725 1650 1612 1621 1633 1629 1636
## adjCV 1750 1635 1607 1617 1629 1624 1631
## 14 comps 15 comps 16 comps 17 comps
## CV 1636 1401 1210 1172
## adjCV 1632 1381 1202 1165
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.924 57.01 64.07 69.75 75.14 80.15 83.75 87.29
## Apps 7.847 71.42 71.64 75.66 82.28 82.52 82.62 84.75
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.34 92.70 94.94 96.71 97.79 98.66 99.38
## Apps 85.07 85.09 85.09 85.19 85.33 85.36 91.46
## 16 comps 17 comps
## X 99.84 100.00
## Apps 92.72 93.07
validationplot(col_pcr, val.type = 'MSEP')
col_pcr_m <- which.min(col_pcr$validation$PRESS)
col_pcr_pred <- predict(col_pcr, newdata = col_test, ncomp = col_pcr_m)
col_pcr_er <- mean((col_pcr_pred - col_test$Apps)^2)
list(col_pcr_m, col_pcr_er)
## [[1]]
## [1] 17
##
## [[2]]
## [1] 1043245
After fitting a PCR model on the training set, with M chosen by cross-validation, the test MSE was 1043245 and the number of components selected was M = 17. Because M = 17 uses all of the principal components, this fit is equivalent to the full least-squares model, which is why the two test errors are identical.
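As a quick sanity check (a sketch using objects already created above), PCR with all 17 components should reproduce the least-squares predictions up to numerical precision:

# With M = 17, PCR spans the full predictor space, so its test predictions
# should match the lm() predictions from the least-squares fit above.
all.equal(as.numeric(col_pcr_pred), as.numeric(col_preds))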
col_pls <- plsr(Apps ~., data = col_train, scale = TRUE, validation = 'CV')
summary(col_pls)
## Data: X dimension: 545 17
## Y dimension: 545 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3992 2004 1773 1537 1417 1274 1200
## adjCV 3992 2000 1771 1530 1395 1257 1192
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1187 1177 1176 1176 1175 1173 1172
## adjCV 1179 1170 1169 1169 1168 1166 1165
## 14 comps 15 comps 16 comps 17 comps
## CV 1172 1172 1172 1172
## adjCV 1165 1164 1164 1164
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.06 48.38 62.18 64.92 68.60 73.63 77.16 80.29
## Apps 76.27 82.28 87.49 90.90 92.47 92.80 92.88 92.96
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.47 85.19 88.71 90.23 91.79 93.65 96.86
## Apps 93.01 93.03 93.04 93.06 93.07 93.07 93.07
## 16 comps 17 comps
## X 97.90 100.00
## Apps 93.07 93.07
validationplot(col_pls, val.type = 'MSEP')
col_pls_m <- which.min(col_pls$validation$PRESS)
col_pls_pred <- predict(col_pls, newdata = col_test, ncomp = col_pls_m)
col_pls_er <- mean((col_pls_pred - col_test$Apps)^2)
list(col_pls_m, col_pls_er)
## [[1]]
## [1] 16
##
## [[2]]
## [1] 1043226
After fitting a PLS model on the training set, with M chosen by cross-validation, the test MSE was 1043226 and the number of components selected was M = 16.
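The CV curve is essentially flat beyond about 7–8 components, so a smaller M would likely do just as well. Recent versions of the pls package provide selectNcomp() for picking a more parsimonious M; a sketch, assuming that function is available in the installed version:

# Sketch: choose the smallest M whose CV error is within one standard error
# of the minimum ("one-sigma" rule).
selectNcomp(col_pls, method = "onesigma", plot = TRUE)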
We fit five models to predict the number of college applications, and their test MSEs were nearly identical: ridge (1043219) and PLS (1043226) were marginally best, least squares and PCR tied at 1043245 (identical because PCR kept all 17 components), and the lasso (1043249) was marginally worst. The differences are negligible, so none of the shrinkage or dimension-reduction methods predicts Apps much more accurately than ordinary least squares, and all of the models leave substantial error (an RMSE of roughly 1000 applications).
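For reference, the five test MSEs computed above can be collected and sorted side by side:

# Side-by-side comparison of the test MSEs computed above, smallest first.
sort(c(OLS = col_er, Ridge = col_grid_er, Lasso = col_lasso_er,
       PCR = col_pcr_er, PLS = col_pls_er))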
set.seed(28)
Bostontrain_index <- createDataPartition(Boston$crim, p = 0.7, list = FALSE)
Boston_train <- Boston[Bostontrain_index, ]
Boston_test <- Boston[-Bostontrain_index, ]
Best Subset Selection
bos_regfit <- regsubsets(crim ~ ., data = Boston_train, nvmax=13)
summary(bos_regfit)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston_train, nvmax = 13)
## 12 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 12
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " "*" " " "*"
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 5 ( 1 ) " " " " " " " " "*" " " " " "*" " " "*" "*" "*"
## 6 ( 1 ) "*" " " " " " " "*" " " "*" "*" " " " " "*" "*"
## 7 ( 1 ) "*" "*" " " " " "*" " " "*" "*" " " " " "*" "*"
## 8 ( 1 ) "*" " " " " "*" "*" " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) "*" " " " " "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
bos_summary = summary(bos_regfit)
par(mfrow = c(2, 2))
plot(bos_summary$rss, type = "l", xlab = "Number of Variables", ylab = "RSS")
points(which.min(bos_summary$rss), min(bos_summary$rss), cex = 2, pch = 20)
plot(bos_summary$adjr2, type = "l", xlab = "Number of Variables", ylab = "Adjusted R²")
points(which.max(bos_summary$adjr2), max(bos_summary$adjr2), cex = 2, pch = 20)
plot(bos_summary$cp, type = "l", xlab = "Number of Variables", ylab = "Cp")
points(which.min(bos_summary$cp), min(bos_summary$cp), cex = 2, pch = 20)
plot(bos_summary$bic, type = "l", xlab = "Number of Variables", ylab = "BIC")
points(which.min(bos_summary$bic), min(bos_summary$bic), cex = 2, pch = 20)
The best model according to BIC uses only 2 variables, suggesting a simpler model is preferred. Adjusted R² peaks at around 10 variables, indicating that adding predictors keeps improving the fit up to that point. Cp is minimized at about 6 variables, and RSS decreases steadily as variables are added, as expected, since RSS measures training error and can only decrease when predictors are included.
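The exact model size favored by each criterion can be read off the summary object directly rather than eyeballed from the plots:

# Number of variables selected by each criterion.
c(AdjR2 = which.max(bos_summary$adjr2),
  Cp    = which.min(bos_summary$cp),
  BIC   = which.min(bos_summary$bic))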
# Get Coefficients of Best Model by BIC
best_model_bic <- which.min(bos_summary$bic)
coef(bos_regfit, id = best_model_bic)
## (Intercept) rad lstat
## -4.3605627 0.4966580 0.2497854
The BIC-optimal model includes only rad and lstat as predictors.
# Get coefficients of the best 8-variable model
coef(bos_regfit, id = 8)
## (Intercept) zn nox rm dis rad
## 8.89846618 0.04237313 -11.43416265 1.56209304 -0.79124903 0.52990145
## ptratio lstat medv
## -0.41637837 0.17496592 -0.26327815
#lm Model 1
Bos_lm1 <- lm(crim ~ rad + lstat, data = Boston_train)
summary(Bos_lm1)
##
## Call:
## lm(formula = crim ~ rad + lstat, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.513 -1.908 -0.264 0.945 77.118
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.36056 0.74545 -5.850 1.12e-08 ***
## rad 0.49666 0.04898 10.140 < 2e-16 ***
## lstat 0.24979 0.05901 4.233 2.94e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.931 on 353 degrees of freedom
## Multiple R-squared: 0.389, Adjusted R-squared: 0.3856
## F-statistic: 112.4 on 2 and 353 DF, p-value: < 2.2e-16
#lm model 2
Bos_lm2 <- lm(crim ~ zn + nox + rm + dis + rad + ptratio + lstat + medv, data = Boston_train)
summary(Bos_lm2)
##
## Call:
## lm(formula = crim ~ zn + nox + rm + dis + rad + ptratio + lstat +
## medv, data = Boston_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.889 -2.230 -0.303 1.018 73.406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.89847 8.96450 0.993 0.321580
## zn 0.04237 0.02337 1.813 0.070652 .
## nox -11.43416 5.91213 -1.934 0.053924 .
## rm 1.56209 0.79141 1.974 0.049195 *
## dis -0.79125 0.32705 -2.419 0.016062 *
## rad 0.52990 0.06304 8.406 1.11e-15 ***
## ptratio -0.41638 0.23500 -1.772 0.077300 .
## lstat 0.17497 0.09231 1.895 0.058876 .
## medv -0.26328 0.07413 -3.552 0.000436 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.814 on 347 degrees of freedom
## Multiple R-squared: 0.4196, Adjusted R-squared: 0.4062
## F-statistic: 31.36 on 8 and 347 DF, p-value: < 2.2e-16
Boston_pred1 <- predict(Bos_lm1, newdata = Boston_test)
Boston_pred2 <- predict(Bos_lm2, newdata = Boston_test)
mean((Boston_test$crim - Boston_pred1)^2)
## [1] 31.3732
mean((Boston_test$crim - Boston_pred2)^2)
## [1] 31.39418
The mean squared errors for both models are nearly identical (31.37 vs. 31.39), meaning the simpler model with just rad and lstat performs just as well as the more complex one. This suggests that adding more variables didn't meaningfully improve predictive accuracy.
Ridge regression
x_train1 <- model.matrix(crim ~ ., data = Boston_train)[, -1]
y_train1 <- Boston_train$crim
x_test1 <- model.matrix(crim ~ ., data = Boston_test)[, -1]
grid1 <- 10^seq(10, -2, length = 100)
boston_ridge <- cv.glmnet(x_train1, y_train1, alpha = 0, lambda = grid1)
grid1_lambda <- boston_ridge$lambda.min
boston_ridge
##
## Call: cv.glmnet(x = x_train1, y = y_train1, lambda = grid1, alpha = 0)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.1 91 48.36 26.38 12
## 1se 403.7 62 74.10 34.88 12
boston_ridge_pred <- predict(boston_ridge, s = grid1_lambda, newx = x_test1)
boston_ridge_er <- mean((boston_ridge_pred - Boston_test$crim)^2)
boston_ridge_er
## [1] 31.35174
Lasso
Boston_lasso <- cv.glmnet(x_train1, y_train1, lambda = grid1, alpha = 1)
Boston_lasso_lambda <- Boston_lasso$lambda.min
Boston_lasso
##
## Call: cv.glmnet(x = x_train1, y = y_train1, lambda = grid1, alpha = 1)
##
## Measure: Mean-Squared Error
##
## Lambda Index Measure SE Nonzero
## min 0.017 98 47.71 19.15 12
## 1se 3.511 79 63.71 25.63 1
Boston_lasso_pred <- predict(Boston_lasso, s = Boston_lasso_lambda, newx = x_test1)
Boston_lasso_er <- mean((Boston_lasso_pred - Boston_test$crim)^2)
Boston_lasso_er
## [1] 31.26229
Boston_lasso_coef <- predict(Boston_lasso, type = "coefficients", s = Boston_lasso_lambda)
Boston_lasso_coef
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 7.898131720
## zn 0.041751269
## indus -0.054305063
## chas -0.323709225
## nox -7.494961620
## rm 1.445043223
## age -0.006782595
## dis -0.825629960
## rad 0.572939199
## tax -0.003426776
## ptratio -0.338110920
## lstat 0.186432250
## medv -0.255356964
PCR
Boston_pcr <- pcr(crim ~ ., data = Boston_train, scale = TRUE, validation = 'CV')
summary(Boston_pcr)
## Data: X dimension: 356 12
## Y dimension: 356 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.855 7.592 7.598 7.176 7.183 7.126 7.096
## adjCV 8.855 7.591 7.596 7.171 7.182 7.123 7.093
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.968 7.004 6.969 6.913 6.945 6.880
## adjCV 6.964 7.001 6.964 6.905 6.936 6.871
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 50.56 64.38 73.73 80.93 87.3 90.59 93.09 95.18
## crim 26.52 26.54 34.79 34.79 35.9 36.49 38.55 38.55
## 9 comps 10 comps 11 comps 12 comps
## X 96.88 98.32 99.43 100.00
## crim 39.24 40.39 40.72 42.25
validationplot(Boston_pcr, val.type = "MSEP")
Boston_pcr_m <- which.min(Boston_pcr$validation$PRESS)
Boston_pcr_pred <- predict(Boston_pcr, newdata = Boston_test, ncomp = Boston_pcr_m)
Boston_pcr_er <- mean((Boston_pcr_pred - Boston_test$crim)^2)
list(Boston_pcr_m, Boston_pcr_er)
## [[1]]
## [1] 12
##
## [[2]]
## [1] 31.42049
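Collecting the Boston test MSEs computed above for a side-by-side comparison:

# Test MSEs for the Boston models fit above, sorted from best to worst.
sort(c(BestSubset_rad_lstat = mean((Boston_test$crim - Boston_pred1)^2),
       BestSubset_8vars     = mean((Boston_test$crim - Boston_pred2)^2),
       Ridge = boston_ridge_er,
       Lasso = Boston_lasso_er,
       PCR   = Boston_pcr_er))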
Lasso regression performs best on this data set, with the lowest test MSE of about 31.26, just below ridge (31.35), the two-variable best-subset model (31.37), and PCR (31.42), so it makes the most accurate predictions on the held-out data. Unlike ordinary linear models, the lasso can also shrink unimportant coefficients exactly to zero, which helps guard against overfitting and keeps the model simpler, although at the λ chosen here all 12 predictors were retained. Ridge and PCR had slightly higher errors, so the lasso is the best choice here, though the differences among the methods are small.
No, the chosen model does not involve all of the features in the data set. Based on BIC, the best subset model uses only two variables, rad and lstat. BIC penalizes models with many variables, so it favors simpler models that still explain the data well and are less prone to overfitting.