Statement 1: More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Statement 2: More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Statement 3: Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Statement 4: Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
The lasso is a less flexible method compared to least squares, as it includes an L1 penalty. This reduces the likelihood of overfitting and results in lower variance. However, due to the bias-variance tradeoff, there is a relative increase in bias.
Similar to the lasso, ridge regression is also less flexible than least squares, as it includes an L2 penalty. This reduces the likelihood of overfitting and results in lower variance. However, due to the bias-variance tradeoff, there is a relative increase in bias.
Non-linear methods are more flexible than least squares, so they can better fit complex patterns in data regardless of noise, contributing to their lower bias. However, due to the bias-variance tradeoff, there is a relative increase in variance as slight changes in training data can cause significant changes in the model.
set.seed(1234)
X <- rnorm(100)
epsilon <- rnorm(100)
beta0 <- 1
beta1 <- -2
beta2 <- 3
beta3 <- -4
Y <- beta0 + beta1*X + beta2*X^2 + beta3*X^3 + epsilon
The simulated data is y = 1 - 2x + 3x^2 - 4x^3.
# Load the leaps package
library(leaps)
# Create a data frame with predictors X, X^2, ..., X^10 and response Y
data_df <- data.frame(y = Y, x = X, x2 = X^2, x3 = X^3, x4 = X^4, x5 = X^5,
x6 = X^6, x7 = X^7, x8 = X^8, x9 = X^9, x10 = X^10)
# Perform best subset selection
subsets <- regsubsets(y ~ ., data = data_df, nvmax = 10)
# Get the summary of the results
subsets_summary <- summary(subsets)
# Find the best model based on Cp, BIC, and adjusted R-squared
best_cp <- which.min(subsets_summary$cp)
best_bic <- which.min(subsets_summary$bic)
best_adjr2 <- which.max(subsets_summary$adjr2)
# Print the best models
cat("Best model based on Cp:", best_cp)
## Best model based on Cp: 3
cat("Best model based on BIC:", best_bic)
## Best model based on BIC: 3
cat("Best model based on adjusted R-squared:", best_adjr2)
## Best model based on adjusted R-squared: 3
# Plot Cp, BIC, and adjusted R-squared
par(mfrow = c(2, 2))
plot(subsets_summary$cp, xlab = "Number of Variables", ylab = "Cp", type = "l")
points(best_cp, subsets_summary$cp[best_cp], col = "red", pch = 20)
plot(subsets_summary$bic, xlab = "Number of Variables", ylab = "BIC", type = "l")
points(best_bic, subsets_summary$bic[best_bic], col = "red", pch = 20)
plot(subsets_summary$adjr2, xlab = "Number of Variables", ylab = "Adjusted R-squared", type = "l")
points(best_adjr2, subsets_summary$adjr2[best_adjr2], col = "red", pch = 20)
# Get the coefficients of the best model
coef(subsets, best_cp)
## (Intercept) x x2 x3
## 1.132470 -2.087414 2.893627 -3.967695
coef(subsets, best_bic)
## (Intercept) x x2 x3
## 1.132470 -2.087414 2.893627 -3.967695
coef(subsets, best_adjr2)
## (Intercept) x x2 x3
## 1.132470 -2.087414 2.893627 -3.967695
The best model according to all 3 criterion is y = 1.13 - 2.09x + 2.89x^2 - 3.97x^3. This is very similar to the original.
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
df.lm <- lm(y ~ ., data = data_df)
df.f <- ols_step_forward_p(df.lm, penter = 0.05)
df.b <- ols_step_backward_p(df.lm, prem = 0.1)
summary(df.f$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.95953 -0.66151 0.03895 0.53538 2.91598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.13247 0.13610 8.321 5.92e-13 ***
## x3 -3.96769 0.05661 -70.088 < 2e-16 ***
## x2 2.89363 0.08222 35.193 < 2e-16 ***
## x -2.08741 0.18789 -11.110 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.038 on 96 degrees of freedom
## Multiple R-squared: 0.9952, Adjusted R-squared: 0.995
## F-statistic: 6588 on 3 and 96 DF, p-value: < 2.2e-16
summary(df.b$model)
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.84330 -0.51507 -0.03746 0.46811 2.80231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.937870 0.203768 4.603 1.32e-05 ***
## x -2.029522 0.202791 -10.008 < 2e-16 ***
## x2 3.920355 0.694328 5.646 1.77e-07 ***
## x3 -3.971703 0.065209 -60.908 < 2e-16 ***
## x4 -0.765037 0.518945 -1.474 0.144
## x6 0.145173 0.105581 1.375 0.172
## x10 -0.001227 0.001058 -1.159 0.249
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.042 on 93 degrees of freedom
## Multiple R-squared: 0.9953, Adjusted R-squared: 0.995
## F-statistic: 3271 on 6 and 93 DF, p-value: < 2.2e-16
In both subset selection methods and sequential forward model selection, the methods select the right predictors and coefficients. However, backward selection chose 5 predictors with incorrect coeff of ~4 instead of 3 for x2. At the very least, x4, x6, and x10 are not considered significant according to the model results.
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 4.1-8
# Create the predictor matrix and response vector
x_matrix <- as.matrix(data_df[, -1]) # Exclude the response variable (y)
y_vector <- data_df$y
# Perform cross-validation to find the optimal lambda
cv_lasso <- cv.glmnet(x_matrix, y_vector, alpha = 1) # alpha = 1 for lasso, default nfolds=10
# Plot the cross-validation results
plot(cv_lasso)
# 2 possible cutoffs of lambda
c(log(cv_lasso$lambda.min), log(cv_lasso$lambda.1se))
## [1] -2.845343 -1.635904
# Get the optimal lambda value, we opt for 1se
best_lambda <- cv_lasso$lambda.1se
# Fit the lasso model with the optimal lambda
lasso_model <- glmnet(x_matrix, y_vector, alpha = 1, lambda = best_lambda)
# Report the coefficient estimates
coef(lasso_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.348018
## x -2.018755
## x2 2.685827
## x3 -3.902978
## x4 .
## x5 .
## x6 .
## x7 .
## x8 .
## x9 .
## x10 .
As in class, we choose the lambda that gives us the most regularized model such that error is within one standard error of the lambda.min. The resulting model is y = 3.36 + 6.75x^7 + 0.0003x^9. The coefficient for x^9 is small, nonzero number. Overall, this lasso model is the poorest fit to the original so far.
beta7 <- 7
Y <- beta0 + beta7*X^7 + epsilon
data_df <- data.frame(y = Y, x = X, x2 = X^2, x3 = X^3, x4 = X^4, x5 = X^5,
x6 = X^6, x7 = X^7, x8 = X^8, x9 = X^9, x10 = X^10)
subsets <- regsubsets(y ~ ., data = data_df, nvmax = 10)
subsets_summary <- summary(subsets)
best_cp <- which.min(subsets_summary$cp)
best_bic <- which.min(subsets_summary$bic)
best_adjr2 <- which.max(subsets_summary$adjr2)
cat("Best model based on Cp:", best_cp)
## Best model based on Cp: 1
cat("Best model based on BIC:", best_bic)
## Best model based on BIC: 1
cat("Best model based on adjusted R-squared:", best_adjr2)
## Best model based on adjusted R-squared: 2
coef(subsets, best_cp)
## (Intercept) x7
## 1.042105 6.999908
coef(subsets, best_bic)
## (Intercept) x7
## 1.042105 6.999908
coef(subsets, best_adjr2)
## (Intercept) x2 x7
## 1.1471879 -0.1074417 7.0004274
df.lm <- lm(y ~ ., data = data_df)
df.f <- ols_step_forward_p(df.lm, penter = 0.05)
df.b <- ols_step_backward_p(df.lm, prem = 0.1)
summary(df.f$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.99595 -0.65283 0.01794 0.56276 2.95288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.147188 0.132249 8.674 9.7e-14 ***
## x7 7.000427 0.001113 6289.428 < 2e-16 ***
## x2 -0.107442 0.083717 -1.283 0.202
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.034 on 97 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 2.279e+07 on 2 and 97 DF, p-value: < 2.2e-16
summary(df.b$model)
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.85434 -0.50383 -0.04033 0.46323 2.81306
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.942629 0.202373 4.658 1.05e-05 ***
## x2 0.907952 0.677624 1.340 0.184
## x4 -0.758738 0.514381 -1.475 0.144
## x6 0.145042 0.106775 1.358 0.178
## x7 7.000710 0.001549 4518.276 < 2e-16 ***
## x10 -0.001249 0.001114 -1.121 0.265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.037 on 94 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 9.059e+06 on 5 and 94 DF, p-value: < 2.2e-16
x_matrix <- as.matrix(data_df[, -1])
y_vector <- data_df$y
cv_lasso <- cv.glmnet(x_matrix, y_vector, alpha = 1)
plot(cv_lasso)
c(log(cv_lasso$lambda.min), log(cv_lasso$lambda.1se))
## [1] 3.013022 3.199090
best_lambda <- cv_lasso$lambda.1se
lasso_model <- glmnet(x_matrix, y_vector, alpha = 1, lambda = best_lambda)
coef(lasso_model)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3.3608113059
## x .
## x2 .
## x3 .
## x4 .
## x5 .
## x6 .
## x7 6.7502552058
## x8 .
## x9 0.0003334034
## x10 .
Overall, Cp and BIC were the best criterion selection methods giving the correct predictors and coefficients. Adjr2 included X2 as a predictor, which makes sense given that it often overestimates number of predictors. Stepwise selection methods like forward and backward performed poorly due to forward including x2 and backward including x2,x4,x6,x10. However, in both models, only the x7 term was considered significant and had the correct coefficient. Lastly, lasso gave 3.36 + 6.75x^7 + 0.0003x^9, which is the same model given in part e) but a much better fit for this data. This is due to the L1 penalty forcing some of the coefficient estimates to be exactly equal to zero, slightly increasing bias over OLS, but decreasing variance for improved performance.
library(ISLR2)
data(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio lstat
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 1.73
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.: 6.95
## Median : 5.000 Median :330.0 Median :19.05 Median :11.36
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :12.65
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:16.95
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :37.97
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.53
## 3rd Qu.:25.00
## Max. :50.00
mod.sel <- regsubsets(Boston$crim ~ ., data = Boston, nvmax = 12)
reg.summary <- summary(mod.sel)
names(reg.summary)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
reg.summary$outmat
## zn indus chas nox rm age dis rad tax ptratio lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*"
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " "*"
## 5 ( 1 ) "*" "*" " " " " " " " " "*" "*" " " " " " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*"
## 10 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
cat(which.max(reg.summary$rsq), reg.summary$rsq[which.max(reg.summary$rsq)]) #12
## 12 0.4493378
cat(which.min(reg.summary$rss), reg.summary$rss[which.min(reg.summary$rss)]) #12
## 12 20574.51
cat(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)]) #9
## 9 0.4382783
cat(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)]) #7
## 7 6.748016
cat(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)]) #2
## 2 -257.6477
lm.Boston <- lm(Boston$crim ~ ., data = Boston)
summary(lm.Boston) # zn, dis, rad, medv
##
## Call:
## lm(formula = Boston$crim ~ ., data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.534 -2.248 -0.348 1.087 73.923
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.7783938 7.0818258 1.946 0.052271 .
## zn 0.0457100 0.0187903 2.433 0.015344 *
## indus -0.0583501 0.0836351 -0.698 0.485709
## chas -0.8253776 1.1833963 -0.697 0.485841
## nox -9.9575865 5.2898242 -1.882 0.060370 .
## rm 0.6289107 0.6070924 1.036 0.300738
## age -0.0008483 0.0179482 -0.047 0.962323
## dis -1.0122467 0.2824676 -3.584 0.000373 ***
## rad 0.6124653 0.0875358 6.997 8.59e-12 ***
## tax -0.0037756 0.0051723 -0.730 0.465757
## ptratio -0.3040728 0.1863598 -1.632 0.103393
## lstat 0.1388006 0.0757213 1.833 0.067398 .
## medv -0.2200564 0.0598240 -3.678 0.000261 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.46 on 493 degrees of freedom
## Multiple R-squared: 0.4493, Adjusted R-squared: 0.4359
## F-statistic: 33.52 on 12 and 493 DF, p-value: < 2.2e-16
Boston.f <- ols_step_forward_p(lm.Boston, penter = 0.05) #9
Boston.b <- ols_step_backward_p(lm.Boston, prem = 0.1) #9
Boston.s <- ols_step_both_p(lm.Boston, pent = 0.05, prem = 0.1) #4
summary(Boston.f$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.547 -2.294 -0.263 1.108 73.970
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.18247 6.96989 1.891 0.059160 .
## rad 0.56097 0.04890 11.471 < 2e-16 ***
## lstat 0.14041 0.07168 1.959 0.050696 .
## medv -0.22013 0.05820 -3.782 0.000175 ***
## ptratio -0.30424 0.18514 -1.643 0.100949
## rm 0.63511 0.59541 1.067 0.286641
## nox -10.46823 5.07652 -2.062 0.039720 *
## dis -1.00637 0.27056 -3.720 0.000222 ***
## zn 0.04306 0.01807 2.383 0.017543 *
## indus -0.08821 0.07489 -1.178 0.239465
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.447 on 496 degrees of freedom
## Multiple R-squared: 0.4483, Adjusted R-squared: 0.4383
## F-statistic: 44.78 on 9 and 496 DF, p-value: < 2.2e-16
summary(Boston.b$model)
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.625 -2.201 -0.349 1.128 73.893
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.285117 7.031364 2.032 0.042724 *
## zn 0.047681 0.018491 2.579 0.010206 *
## nox -11.431518 4.870169 -2.347 0.019305 *
## rm 0.675191 0.592967 1.139 0.255392
## dis -0.970703 0.265259 -3.659 0.000280 ***
## rad 0.628695 0.083038 7.571 1.82e-13 ***
## tax -0.005150 0.004632 -1.112 0.266789
## ptratio -0.318088 0.183795 -1.731 0.084134 .
## lstat 0.132263 0.071654 1.846 0.065508 .
## medv -0.227946 0.058967 -3.866 0.000126 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.448 on 496 degrees of freedom
## Multiple R-squared: 0.4481, Adjusted R-squared: 0.4381
## F-statistic: 44.75 on 9 and 496 DF, p-value: < 2.2e-16
summary(Boston.s$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.942 -1.896 -0.346 0.974 75.878
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.93612 3.88687 1.270 0.2047
## rad 0.54758 0.04104 13.343 <2e-16 ***
## lstat 0.14350 0.06442 2.228 0.0263 *
## medv -0.11983 0.05066 -2.365 0.0184 *
## ptratio -0.30708 0.16706 -1.838 0.0666 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.529 on 501 degrees of freedom
## Multiple R-squared: 0.4283, Adjusted R-squared: 0.4238
## F-statistic: 93.85 on 4 and 501 DF, p-value: < 2.2e-16
x.Boston <- model.matrix(crim ~ ., Boston)[,-1]
y.Boston <- Boston[,"crim"]
cv_lasso.Boston <- cv.glmnet(x.Boston,y.Boston,alpha = 1)
best_lambda.Boston <- cv_lasso.Boston$lambda.1se
lasso_model.Boston <- glmnet(x.Boston, y.Boston, alpha=1, lambda=best_lambda.Boston)
coef(lasso_model.Boston) #1 (lambda.1se) vs 10 (lambda.min)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.4186414
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2298449
## tax .
## ptratio .
## lstat .
## medv .
cv_ridge.Boston <- cv.glmnet(x.Boston,y.Boston,alpha = 0)
best_lambda.Boston <- cv_ridge.Boston$lambda.1se
ridge_model.Boston <- glmnet(x.Boston, y.Boston, alpha=0, lambda=best_lambda.Boston)
coef(ridge_model.Boston) #12
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.830241387
## zn -0.002323439
## indus 0.018788484
## chas -0.087676364
## nox 1.175797658
## rm -0.094724473
## age 0.003960747
## dis -0.058498956
## rad 0.025819211
## tax 0.001207665
## ptratio 0.043787293
## lstat 0.021432608
## medv -0.014151535
cv_enet.Boston <- cv.glmnet(x.Boston,y.Boston,alpha = 0.5)
best_lambda.Boston <- cv_enet.Boston$lambda.1se
enet_model.Boston <- glmnet(x.Boston, y.Boston, alpha=0.5, lambda=best_lambda.Boston)
coef(enet_model.Boston) #2 (lambda.1se) vs 10 (lambda.min)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 0.872722755
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.154480237
## tax 0.003100174
## ptratio .
## lstat .
## medv .
library(pls)
## Warning: package 'pls' was built under R version 4.4.2
##
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
##
## loadings
pcr.Boston <- pcr(crim ~ . , data = Boston, scale=T, validation = "CV")
summary(pcr.Boston)
## Data: X dimension: 506 12
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.295 7.289 6.899 6.878 6.823 6.812
## adjCV 8.61 7.291 7.285 6.893 6.873 6.817 6.806
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.682 6.703 6.683 6.677 6.638 6.56
## adjCV 6.671 6.696 6.675 6.669 6.628 6.55
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.93 63.64 72.94 80.21 86.83 90.26 92.79 94.99
## crim 29.39 29.55 37.39 37.85 38.85 39.23 41.73 41.82
## 9 comps 10 comps 11 comps 12 comps
## X 96.78 98.33 99.48 100.00
## crim 42.12 42.43 43.58 44.93
validationplot(pcr.Boston, val.type = "MSEP")
press.Boston <- sqrt(pcr.Boston$validation[["PRESS"]])
which.min(press.Boston)
## [1] 12
press.Boston.eff <- press.Boston - press.Boston[which.min(press.Boston)]
print(press.Boston.eff) # smallest CV error occurs when M=12 components used
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## crim 16.53362 16.40192 7.633271 7.166196 5.908498 5.676738 2.757023 3.223763
## 9 comps 10 comps 11 comps 12 comps
## crim 2.771947 2.627039 1.750864 0
pls.Boston <- plsr(crim ~ . , data = Boston, scale=T, validation = "CV")
summary(pls.Boston)
## Data: X dimension: 506 12
## Y dimension: 506 1
## Fit method: kernelpls
## Number of components considered: 12
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.070 6.665 6.592 6.582 6.571 6.554
## adjCV 8.61 7.068 6.661 6.586 6.574 6.562 6.544
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps
## CV 6.547 6.537 6.536 6.537 6.537 6.537
## adjCV 6.538 6.528 6.528 6.528 6.528 6.528
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 49.49 58.85 64.82 74.54 80.88 83.57 87.80 92.13
## crim 33.00 41.20 43.32 44.04 44.40 44.78 44.87 44.91
## 9 comps 10 comps 11 comps 12 comps
## X 94.56 96.22 98.11 100.00
## crim 44.93 44.93 44.93 44.93
validationplot(pls.Boston, val.type = "MSEP")
which.min(pls.Boston$validation[["PRESS"]]) # smallest when n.comp=9 components used
## [1] 9
I tried criterion selection methods and sequential selection methods for OLS, the lasso, ridge regression, elastic net, and PCR. Rsq and Rss selected all 12 predictors while AdjR2 selected 9, Cp 7, BIC 2, forward/backward 9, stepwise regression 4. The downside with these models is that we cannot conclude that all variables included are important or all excluded variables are unimportant. The selected models for the subset selection method may all equally good, and it is difficult to conclude which is optimal. Moreover, they are more flexible methods, such that they may overfit. In this chapter, we were introduced shrinkage methods (lasso regression, ridge regression, elastic net) and dimensionality reduction methods (PCA regression, PLS regression). Ridge uses all predictors; however, lasso interestingly chooses 10 predictors at lambda.min and only 1 predictor at lambda.1se. Enet is similar to lasso in that lambbda.min chooses 10 predictors and lambda.1se chooses only 2. The PCA model performs best with all 12 principal components while the PLS does best with 9 components.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:pls':
##
## R2
set.seed(1234)
# Splitting the dataset
tr.ind <- sample(seq_len(nrow(x.Boston)), floor(0.7 * nrow(x.Boston)), replace = FALSE)
x.tr <- x.Boston[tr.ind, ]
x.te <- x.Boston[-tr.ind, ]
y.tr <- y.Boston[tr.ind]
y.te <- y.Boston[-tr.ind]
# Full OLS model
lm.Boston <- lm(crim ~ ., data = Boston[tr.ind,])
pred.lm <- predict(lm.Boston, newdata = Boston[-tr.ind,])
mse.lm <- mean((y.te - pred.lm)^2)
# Subset selection models
mod.sel <- regsubsets(crim ~ ., data = Boston[tr.ind,], nvmax = 12)
reg.summary <- summary(mod.sel)
# Adjusted R^2
best.adjr2 <- which.max(reg.summary$adjr2)
selected.vars <- names(coef(mod.sel, best.adjr2))[-1] # Exclude intercept
formula.adjr2 <- as.formula(paste("crim ~", paste(selected.vars, collapse = " + ")))
lm.adjr2 <- lm(formula.adjr2, data = Boston[tr.ind,])
pred.adjr2 <- predict(lm.adjr2, newdata = Boston[-tr.ind,])
mse.adjr2 <- mean((y.te - pred.adjr2)^2)
# Cp
best.cp <- which.max(reg.summary$cp)
selected.vars <- names(coef(mod.sel, best.cp))[-1]
formula.cp <- as.formula(paste("crim ~", paste(selected.vars, collapse = " + ")))
lm.cp <- lm(formula.cp, data = Boston[tr.ind,])
pred.cp <- predict(lm.cp, newdata = Boston[-tr.ind,])
mse.cp <- mean((y.te - pred.cp)^2)
# BIC
best.bic <- which.max(reg.summary$bic)
selected.vars <- names(coef(mod.sel, best.bic))[-1]
formula.bic <- as.formula(paste("crim ~", paste(selected.vars, collapse = " + ")))
lm.bic <- lm(formula.bic, data = Boston[tr.ind,])
pred.bic <- predict(lm.bic, newdata = Boston[-tr.ind,])
mse.bic <- mean((y.te - pred.bic)^2)
# Forward Selection
Boston.f <- ols_step_forward_p(lm.Boston, penter = 0.05)
pred.forward <- predict(Boston.f$model, newdata = Boston[-tr.ind,])
mse.forward <- mean((y.te - pred.forward)^2)
# Backward Selection
Boston.b <- ols_step_backward_p(lm.Boston, prem = 0.1)
pred.backward <- predict(Boston.b$model, newdata = Boston[-tr.ind,])
mse.backward <- mean((y.te - pred.backward)^2)
# Stepwise (Both Directions)
Boston.s <- ols_step_both_p(lm.Boston, pent = 0.05, prem = 0.1)
pred.stepwise <- predict(Boston.s$model, newdata = Boston[-tr.ind,])
mse.stepwise <- mean((y.te - pred.stepwise)^2)
# Lasso regression
lasso.cv <- cv.glmnet(x.tr, y.tr, alpha = 1)
lasso.fit <- glmnet(x.tr, y.tr, alpha = 1)
coef(lasso.fit, s = lasso.cv$lambda.1se)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.5160139
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2467164
## tax .
## ptratio .
## lstat .
## medv .
lasso.predict <- predict(lasso.fit, s = lasso.cv$lambda.1se, newx = x.te)
mse.lasso <- mean((y.te - lasso.predict)^2)
# Ridge regression
ridge.cv <- cv.glmnet(x.tr, y.tr, alpha = 0)
ridge.fit <- glmnet(x.tr, y.tr, alpha = 0)
coef(ridge.fit, s = ridge.cv$lambda.1se)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.415322414
## zn -0.002894736
## indus 0.024681808
## chas -0.142582896
## nox 1.589732481
## rm -0.133972959
## age 0.005513600
## dis -0.081479027
## rad 0.037730343
## tax 0.001699419
## ptratio 0.061789713
## lstat 0.031522083
## medv -0.021769896
ridge.predict <- predict(ridge.fit, s = ridge.cv$lambda.1se, newx = x.te)
mse.ridge <- mean((y.te - ridge.predict)^2)
# Elastic Net regression
enet.cv <- cv.glmnet(x.tr, y.tr, alpha = 0.5)
enet.fit <- glmnet(x.tr, y.tr, alpha = 0.5)
coef(enet.fit, s = enet.cv$lambda.1se)
## 13 x 1 sparse Matrix of class "dgCMatrix"
## s1
## (Intercept) 1.002704193
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.169922764
## tax 0.002874765
## ptratio .
## lstat 0.005833654
## medv .
enet.predict <- predict(enet.fit, s = enet.cv$lambda.1se, newx = x.te)
mse.enet <- mean((y.te - enet.predict)^2)
# PCA regression
pcr.Boston <- pcr(crim ~ . , data = Boston[tr.ind,], scale=T, validation = "CV")
press.Boston <- sqrt(pcr.Boston$validation[["PRESS"]])
best_ncomp_pcr <- which.min(press.Boston)
pcr.pred <- predict(pcr.Boston, within(Boston,rm("crim"))[-tr.ind,], ncomp = best_ncomp_pcr)
mse.pcr <- mean((y.te - pcr.pred)^2)
# PLS regression
pls.Boston <- plsr(crim ~ . , data = Boston[tr.ind,], scale=T, validation = "CV")
best_ncomp_pls <- which.min(pls.Boston$validation[["PRESS"]])
pls.pred <- predict(pls.Boston, within(Boston,rm("crim"))[-tr.ind,], ncomp = best_ncomp_pls)
mse.pls <- mean((y.te - pls.pred)^2)
# Combine results into a data frame
mse_comparison <- data.frame(
Model = c("Full OLS", "Best Subset (Adj R^2)", "Forward Selection", "Backward Elimination", "Stepwise Regression", "Lasso", "Ridge", "Elastic Net", "PCA Regression", "PLS Regression"),
MSE = c(mse.lm, mse.adjr2, mse.forward, mse.backward, mse.stepwise, mse.lasso, mse.ridge, mse.enet, mse.pcr, mse.pls)
)
print(mse_comparison)
## Model MSE
## 1 Full OLS 24.80042
## 2 Best Subset (Adj R^2) 24.85268
## 3 Forward Selection 24.85268
## 4 Backward Elimination 24.85268
## 5 Stepwise Regression 24.67800
## 6 Lasso 29.52554
## 7 Ridge 33.75263
## 8 Elastic Net 30.38215
## 9 PCA Regression 24.80042
## 10 PLS Regression 24.73771
I chose a 70/30 train test split. I use OLS with the full set of predictors as my baseline MSE (24.8) to compare my other models with. PCA regression performs the same while PLS regression improves upon that with (24.73). The best model is actually stepwise regression (24.68). It seems as though Lasso (29.53), Ridge (33.75), and Elastic Net (30.38) performed the worst. This can be due to the increase in bias being greater than the decrease in variance for this Boston dataset.
For this seed(1234), the best model of stepwise regression did not include all of the predictors. Subset selection methods in general seem to perform best. The next best performing models are the dimensionality reduction models. Even though they do not necessarily use all the components, they still involve all of the features in the data set since each PC is a combination of all 12 features. Finally, the poorest performing models are the shrinkage models, with Ridge, the only one that uses all the predictors, performing the worst.
The best explanation is that there are irrelevant or redundant predictors in the Boston dataset. While PCR and PLS did not remove such predictors, they project the original features into a new set of uncorrelated, orthogonal components, which helps deal with multicollinearity. Finally, Ridge does not eliminate predictors, only shrinking their coefficients, and thus being unable to deal with multicollinearity.