For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso: iii. is correct. The lasso's flexibility varies with the penalty: when lambda = 0 it is identical to OLS, and as lambda increases the penalty term shrinks the coefficients and the fit becomes less flexible. Decreasing flexibility increases the bias but decreases the variance, so the lasso gives improved prediction accuracy when its increase in bias is less than its decrease in variance.
(b) Ridge regression: iii. is correct. Ridge regression likewise coincides with OLS when lambda is 0. As lambda increases, flexibility decreases, which raises the bias but substantially reduces the variance of the predictions. It therefore gives better prediction accuracy when the increase in bias is smaller than the resulting decrease in variance.
(c) Non-linear methods: ii. is correct. Non-linear methods are more flexible than linear least squares, so they have lower bias but higher variance. They give improved prediction accuracy when the increase in variance is less than the decrease in bias.
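As a quick illustration of the flexibility argument for (a) and (b), the sketch below (on simulated data of our own, not taken from the exercise) shows that both the lasso and ridge coefficient paths shrink toward zero as lambda grows, i.e. the penalized fits become progressively less flexible:
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), nrow = 100, ncol = 10)   # 100 observations, 10 predictors
y <- drop(X %*% rnorm(10)) + rnorm(100)               # linear signal plus noise
par(mfrow = c(1, 2))
plot(glmnet(X, y, alpha = 1), xvar = "lambda", main = "Lasso paths")   # coefficients shrink to 0
plot(glmnet(X, y, alpha = 0), xvar = "lambda", main = "Ridge paths")   # coefficients shrink toward 0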
In this exercise, we will predict the number of applications received using the other variables in the College data set.
head(College)
## Private Apps Accept Enroll Top10perc Top25perc
## Abilene Christian University Yes 1660 1232 721 23 52
## Adelphi University Yes 2186 1924 512 16 29
## Adrian College Yes 1428 1097 336 22 50
## Agnes Scott College Yes 417 349 137 60 89
## Alaska Pacific University Yes 193 146 55 16 44
## Albertson College Yes 587 479 158 38 62
## F.Undergrad P.Undergrad Outstate Room.Board Books
## Abilene Christian University 2885 537 7440 3300 450
## Adelphi University 2683 1227 12280 6450 750
## Adrian College 1036 99 11250 3750 400
## Agnes Scott College 510 63 12960 5450 450
## Alaska Pacific University 249 869 7560 4120 800
## Albertson College 678 41 13500 3335 500
## Personal PhD Terminal S.F.Ratio perc.alumni Expend
## Abilene Christian University 2200 70 78 18.1 12 7041
## Adelphi University 1500 29 30 12.2 16 10527
## Adrian College 1165 53 66 12.9 30 8735
## Agnes Scott College 875 92 97 7.7 37 19016
## Alaska Pacific University 1500 76 72 11.9 2 10922
## Albertson College 675 67 73 9.4 11 9727
## Grad.Rate
## Abilene Christian University 60
## Adelphi University 56
## Adrian College 54
## Agnes Scott College 59
## Alaska Pacific University 15
## Albertson College 55
We can create an 80-20 train test split using caret’s createDataPartition:
library(ISLR)    # College data set
library(caret)   # createDataPartition for the train/test split
train_indices <- createDataPartition(College$Apps, times = 1, p = 0.80, list = FALSE)
train <- College[train_indices,]
test <- College[-train_indices,]
lm_fit <- lm(Apps ~ ., data = train)
lm_predict <- predict(lm_fit, newdata = test)
lm_err <- mean((test$Apps - lm_predict)^2)
lm_err
## [1] 1224247
We get an MSE of 1224247 using least squares.
x_train <- data.matrix(train[,-2]) # without apps
y_train <- train$Apps
x_test <- data.matrix(test[,-2]) # without apps
y_test <- test$Apps
library(glmnet)   # ridge regression and the lasso
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0)
best_ridge <- glmnet(x_train, y_train, alpha = 0, lambda = cv_ridge$lambda.min)
ridge_predict <- predict(best_ridge, newx=x_test)
ridge_err <- mean((test$Apps - ridge_predict)^2)
ridge_err
## [1] 1016243
We get an MSE of 1016243 with Ridge Regression.
cv_lasso <- cv.glmnet(x_train, y_train, alpha = 1)
best_lasso <- glmnet(x_train, y_train, alpha = 1, lambda = cv_lasso$lambda.min)
lasso_predict <- predict(best_lasso, newx=x_test)
lasso_err <- mean((test$Apps - lasso_predict)^2)
lasso_err
## [1] 1209610
We get an MSE of 1209610 with Lasso.
library(pls)      # pcr() and plsr()
pcr_fit <- pcr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr_fit)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4091 4066 2163 2140 1961 1702 1681
## adjCV 4091 4069 2161 2141 1936 1696 1676
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1677 1631 1609 1611 1617 1619 1625
## adjCV 1672 1623 1604 1607 1613 1614 1620
## 14 comps 15 comps 16 comps 17 comps
## CV 1624 1585 1209 1152
## adjCV 1619 1568 1201 1144
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 30.935 57.01 63.91 69.64 75.28 80.13 83.85 87.23
## Apps 1.308 72.53 73.26 78.63 83.97 84.37 84.45 85.34
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 90.40 92.81 94.93 96.77 97.87 98.73 99.35
## Apps 85.78 85.87 85.87 85.95 85.95 86.03 90.16
## 16 comps 17 comps
## X 99.85 100.00
## Apps 93.19 93.78
We see that the cross-validated PCR error decreases steadily up to 9 components, stays roughly flat through 15 components, and actually reaches its minimum with all 17 components. However, using all 17 components is equivalent to the full least-squares fit and offers no dimension reduction, so we take 9 principal components as our PCR model.
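The pattern described above can also be checked visually with the pls package's validation plot (a brief sketch; the resulting figure is not shown here):
# CV error (MSEP) against the number of components for the PCR fit
validationplot(pcr_fit, val.type = "MSEP")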
pcr_pred <- predict(pcr_fit, test, ncomp = 9)
pcr_err <- mean((test$Apps - pcr_pred)^2)
pcr_err
## [1] 1126717
We get an MSE of 1126717 using PCR.
pls_fit <- plsr(Apps ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls_fit)
## Data: X dimension: 624 17
## Y dimension: 624 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 4091 1971 1589 1511 1437 1274 1173
## adjCV 4091 1967 1581 1509 1420 1258 1164
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1156 1142 1126 1123 1119 1118 1119
## adjCV 1148 1135 1120 1117 1113 1112 1113
## 14 comps 15 comps 16 comps 17 comps
## CV 1120 1121 1121 1121
## adjCV 1114 1115 1115 1115
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.03 35.43 62.38 65.34 69.32 73.43 76.89 79.79
## Apps 77.86 86.50 87.77 91.12 92.92 93.46 93.54 93.61
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 81.96 85.91 88.56 90.42 92.68 95.09 96.79
## Apps 93.70 93.73 93.75 93.77 93.78 93.78 93.78
## 16 comps 17 comps
## X 98.90 100.00
## Apps 93.78 93.78
The cross-validated error reaches its minimum at 12 components, but the curve is essentially flat beyond about 9 components, so we use 9 components for the final PLS model.
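As an aside, the pls package provides selectNcomp() to pick the number of components by a one-sigma rule rather than by eye; a minimal sketch (its choice may differ slightly from ours above):
# smallest number of components whose CV error is within one SE of the global minimum
selectNcomp(pls_fit, method = "onesigma", plot = FALSE)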
pls_pred <- predict(pls_fit, test, ncomp = 9)
pls_err <- mean((test$Apps - pls_pred)^2)
pls_err
## [1] 1198885
We get an MSE of 1198885 using PLS.
cat('MSE for linear regression: ', lm_err, '\n')
## MSE for linear regression: 1224247
cat('MSE for Ridge Regression: ', ridge_err, '\n')
## MSE for Ridge Regression: 1016243
cat('MSE for Lasso: ', lasso_err, '\n')
## MSE for Lasso: 1209610
cat('MSE for PCR: ', pcr_err, '\n')
## MSE for PCR: 1126717
cat('MSE for PLS: ', pls_err, '\n')
## MSE for PLS: 1198885
Ridge regression gives the best performance, with PCR second at a slightly higher MSE; least squares, the lasso, and PLS all perform in a similar range, without much difference among them.
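For convenience, the five test errors can be collected into a single sorted vector (a small sketch using the error values computed above):
# test MSEs from lowest to highest
sort(c(OLS = lm_err, Ridge = ridge_err, Lasso = lasso_err, PCR = pcr_err, PLS = pls_err))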
We will now try to predict per capita crime rate in the Boston data set.
library(MASS)    # Boston data set
head(Boston)
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
x.boston <- model.matrix(crim ~., data=Boston)[,-1]
y.boston <- Boston[,'crim']
# best subset selection
library(leaps)   # regsubsets
subsets <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
subset.summary <- summary(subsets)
print(subset.summary$outmat)
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# plots
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(subset.summary$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp")
lines(1:length(subset.summary$cp), 1:length(subset.summary$cp) + 1, col = "red", lty = 2) # p + 1
plot(subset.summary$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(subset.summary$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")
# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(subset.summary$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 8
cat("Best score for BIC:", which.min(subset.summary$bic), "\n")
## Best score for BIC: 3
cat("Best score for Adjusted R2:", which.max(subset.summary$adjr2), "\n")
## Best score for Adjusted R2: 9
Using regsubsets, the criteria give mixed results: Cp is minimized at 8 predictors (and is nearly as good at 7), BIC at 3 (nearly as good at 2), and adjusted R2 is maximized at 9 (nearly as good at 8). Since Cp and adjusted R2 roughly agree around 8 predictors, we will say that best subset selection chooses the following 8 predictors:
zn + nox + dis + rad + ptratio + black + lstat + medv
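The coefficient estimates for this 8-predictor model can be pulled straight from the regsubsets object (a brief sketch; output omitted):
# coefficients of the best 8-variable subset
coef(subsets, id = 8)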
# stepwise selection (both directions), based on p-values via olsrr rather than stepAIC
library(olsrr)   # ols_step_both_p
null.model <- lm(crim ~ 1, data = Boston)   # intercept-only model (not required by ols_step_both_p)
full.model <- lm(crim ~ ., data = Boston)
both <- ols_step_both_p(full.model, p_val = 0.05)
# both
print('Both:', quote=F)
## [1] Both:
summary(both$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.023 -1.713 -0.281 0.873 77.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.372585 1.641557 -0.227 0.82054
## rad 0.488172 0.040422 12.077 < 2e-16 ***
## lstat 0.213596 0.047447 4.502 8.39e-06 ***
## black -0.009472 0.003615 -2.620 0.00905 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.521 on 502 degrees of freedom
## Multiple R-squared: 0.4286, Adjusted R-squared: 0.4252
## F-statistic: 125.5 on 3 and 502 DF, p-value: < 2.2e-16
plot(both)
If we use stepwise regression, combining forward and backward selection, we arrive at three important variables; the resulting model is crim ~ rad + lstat + black.
These three variables all appear in the best subset results; indeed, they are exactly the three-variable model favoured by BIC in the subset selection above.
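This can be checked directly against the regsubsets fit (a brief sketch); based on the selection matrix above, the call should return the intercept plus rad, black, and lstat.
coef(subsets, id = 3)   # best 3-variable subset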
# we do not expect large gains from shrinkage here, since n >> p in this case
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5
# lasso
lasso.boston = glmnet(x.boston,(y.boston),alpha = 1)
plot(lasso.boston,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)
# do cv for best lambda
cv.lasso.boston = cv.glmnet(x.boston,(y.boston),alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.boston)
# displaying best lambda values
c(log(cv.lasso.boston$lambda.min), log(cv.lasso.boston$lambda.1se))
## [1] -3.046855 1.186180
Using the lasso, cross-validation gives an optimal log(lambda) of about -3.0 and a 1-se value of about 1.2. Interestingly, the two models differ substantially: the lambda.min fit retains 11 predictors, while the lambda.1se fit retains only 1. We can analyze this further by inspecting the coefficients:
lasso.best <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.min)
coef(lasso.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 12.921616863
## zn 0.036630369
## indus -0.071088384
## chas -0.592709165
## nox -7.203117383
## rm 0.243372202
## age .
## dis -0.800965160
## rad 0.515774432
## tax .
## ptratio -0.194514895
## black -0.007544322
## lstat 0.125137815
## medv -0.160384184
lasso.1se <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.1se)
coef(lasso.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.3076511
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2414676
## tax .
## ptratio .
## black .
## lstat .
## medv .
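A quicker way to confirm the 11-versus-1 count, without printing the full coefficient vectors, is to read the nonzero counts that cv.glmnet stores for each lambda (a small sketch):
# number of nonzero predictors at lambda.min and lambda.1se
cv.lasso.boston$nzero[cv.lasso.boston$lambda == cv.lasso.boston$lambda.min]
cv.lasso.boston$nzero[cv.lasso.boston$lambda == cv.lasso.boston$lambda.1se]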
Both fits agree that rad is an important feature, and rad is also the variable regsubsets selects when restricted to a single predictor. The fact that the 1-se model drops the other ten predictors that are nonzero at lambda.min suggests those features are of much less importance, since the L1 penalty eliminates them while staying within one standard error of the minimum cross-validation error.
# ridge regression
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5
ridge.boston = glmnet(x.boston,(y.boston),alpha = 0, lambda = grid)
plot(ridge.boston,"lambda",ylab = "Coefficients", xlab = expression(paste("log", lambda)), main = "Predictor coefficients against log lambda")
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)
cv.ridge.boston = cv.glmnet(x.boston,(y.boston),alpha = 0, lambda = grid, nfolds = 10)
plot(cv.ridge.boston,xlab = expression(paste("log", lambda)))
# displaying best lambda values:
c(log(cv.ridge.boston$lambda.min), log(cv.ridge.boston$lambda.1se))
## [1] -1.581574 4.279552
Using ridge regression, we find a best log(lambda) of about -1.6 and a 1-standard-error value of about 4.3. We can see the following coefficients for each of these:
rr.best <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.min)
coef(rr.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 12.9775905382
## zn 0.0387899913
## indus -0.0804025387
## chas -0.7323937516
## nox -7.8754864737
## rm 0.3862135899
## age 0.0015110242
## dis -0.8509646232
## rad 0.5007945695
## tax 0.0002970566
## ptratio -0.2044293139
## black -0.0080295010
## lstat 0.1371320594
## medv -0.1683590233
rr.1se <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.1se)
coef(rr.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.359573930
## zn -0.002953356
## indus 0.031060606
## chas -0.184519032
## nox 2.001228568
## rm -0.150147623
## age 0.006597720
## dis -0.101274834
## rad 0.050252638
## tax 0.002268044
## ptratio 0.076314412
## black -0.002837701
## lstat 0.038566984
## medv -0.025307453
Comparing these side by side with the lasso fits, the ridge coefficients at lambda.min are very close to the corresponding lasso coefficients. The notable difference is at the 1-standard-error lambda: there, ridge regression keeps all coefficients nonzero but only one is far from zero, much as the lasso keeps only one nonzero coefficient; however, ridge puts its weight on nox while the lasso keeps rad as the most important predictor. Ridge therefore also disagrees with regsubsets, which likewise chose rad as the single most important predictor.
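The side-by-side comparison described above can be assembled into a single table (a sketch using the four fitted objects; the column names are ours):
# combine the lasso and ridge coefficient vectors into one matrix for comparison
comparison <- cbind(as.matrix(coef(lasso.best)), as.matrix(coef(rr.best)),
                    as.matrix(coef(lasso.1se)), as.matrix(coef(rr.1se)))
colnames(comparison) <- c("lasso.min", "ridge.min", "lasso.1se", "ridge.1se")
round(comparison, 4)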
# PCR
pcr.boston = pcr(crim ~ ., data = Boston, scale = T, validation = "CV")
summary(pcr.boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.193 7.193 6.761 6.757 6.758 6.757
## adjCV 8.61 7.191 7.191 6.756 6.749 6.754 6.752
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.743 6.623 6.629 6.627 6.621 6.592 6.517
## adjCV 6.738 6.617 6.623 6.620 6.614 6.583 6.508
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
validationplot(pcr.boston, val.type = 'MSEP')
Using PCR, the cross-validated error levels off after around 3 principal components, with only minor improvements as more components are added, so we will take 3 principal components as the PCR model.
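For reference, the cross-validated error at the chosen size can be read off numerically rather than from the plot, using the pls package's MSEP accessor (a sketch; slot 1 of the model dimension is the intercept-only fit):
# CV-estimated MSEP of the 3-component PCR model
MSEP(pcr.boston, estimate = "CV")$val[1, 1, 4]   # index 4 = 3 components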
The best performances with the simplest models were achieved by stepwise regression and principal components regression: both showed low errors in the plots while using only about 3 predictors or components. The lasso and ridge regression also performed well, although noticeably worse when using the 1-standard-error lambda. We will therefore carry stepwise regression, PCR, the lasso, and ridge regression forward, evaluate each with 10-fold cross-validation, and compare their MSEs to decide on a final model.
Cross-validation has already been performed above for all of these methods except stepwise regression (via cv.glmnet and the CV option of pcr), so we only need to cross-validate the stepwise procedure, which we do now:
set.seed(123)
folds <- createFolds(Boston$crim, k = 10, list = TRUE)
mse_values <- numeric(10)
selected_models <- vector("list", length = 10)
for (i in seq_along(folds)) {
# training and validation sets
train_data <- Boston[-folds[[i]], ]
test_data <- Boston[folds[[i]], ]
# fit
full_model <- lm(crim ~ ., data = train_data)
stepwise_model <- ols_step_both_p(full_model, pent = 0.05, prem = 0.05)
# store best model of fold
selected_models[[i]] <- formula(stepwise_model$model)
# get mse
predictions <- predict(stepwise_model$model, newdata = test_data)
mse_values[i] <- mean((test_data$crim - predictions)^2)
}
# average mse
mean_mse <- mean(mse_values)
# Output results
# cat("MSE for each fold:", mse_values, "\n")
cat("Mean MSE across all folds:", mean_mse, "\n")
## Mean MSE across all folds: 43.83496
# Print the selected models for each fold
cat("\nSelected models for each fold:\n")
##
## Selected models for each fold:
for (i in seq_along(selected_models)) {
cat(sprintf("Fold %d: %s\n", i, deparse(selected_models[[i]])))
}
## Fold 1: crim ~ rad + lstat + black
## Fold 2: crim ~ rad + lstat + medv + ptratio
## Fold 3: crim ~ rad + lstat + black
## Fold 4: crim ~ rad + lstat + black
## Fold 5: crim ~ rad + lstat + black
## Fold 6: crim ~ rad + lstat + black + zn + dis + nox + medv
## Fold 7: crim ~ rad + lstat
## Fold 8: crim ~ rad + lstat + black + medv
## Fold 9: crim ~ rad + lstat + black
## Fold 10: crim ~ rad + lstat + black
After performing cross-validated stepwise regression, we arrive at a mean MSE of 43.8, similar to the performance observed in part (a) of this question. The fold-by-fold selections also largely agree on the same 3 predictors as before: rad, lstat, and black.
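To summarize the fold-by-fold selections, we can tabulate how often each variable appears across the ten fitted formulas (a small sketch; note that all.vars also counts the response crim once per fold):
# selection frequency of each variable across the 10 folds
table(unlist(lapply(selected_models, all.vars)))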
Now we can compare all of the results to see the optimal model:
## Ridge Regression:
## Minimum MSE: 43.16917
## MSE at 1-se Lambda: 57.52966
## Lasso Regression:
## Minimum MSE: 43.67005
## MSE at 1-se Lambda: 56.61822
## Principal Component Regression (PCR):
## MSE with 3 components: 45.70604
## MSE with 8 components: 43.86955
## Minimum MSE, with 13 components: 42.46525
## Stepwise Regression:
## [1] 43.83496
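One way to recover the figures above directly from the objects fitted earlier (a sketch; the exact code used for the summary may differ):
min(cv.ridge.boston$cvm)                                                    # ridge, lambda.min
cv.ridge.boston$cvm[cv.ridge.boston$lambda == cv.ridge.boston$lambda.1se]   # ridge, lambda.1se
min(cv.lasso.boston$cvm)                                                    # lasso, lambda.min
cv.lasso.boston$cvm[cv.lasso.boston$lambda == cv.lasso.boston$lambda.1se]   # lasso, lambda.1se
MSEP(pcr.boston, estimate = "CV")$val[1, 1, c(4, 9, 14)]                    # PCR with 3, 8, 13 comps
mean_mse                                                                    # stepwise CV, from above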
Looking at the performance, the best MSEs are all around 43: ridge regression, lasso regression, PCR with 13 components, and stepwise regression are the best performers and all fall within about one point of each other.
With the exception of the 1-standard-error lambda fits, all of the models above performed at essentially the same level. We will therefore favour the lasso or ridge regression, because the penalty term reduces the variance of the fitted model; although the penalty shrinks the coefficients, the lower variance should be beneficial if we take the model to an outside dataset in a real-world scenario. Between the two, our final proposed model for predicting crime rates is the lasso, because, unlike ridge regression, it also performs feature selection. Returning to the coefficients at the minimum lambda shows which features the model considers most important:
coef(lasso.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 12.921616863
## zn 0.036630369
## indus -0.071088384
## chas -0.592709165
## nox -7.203117383
## rm 0.243372202
## age .
## dis -0.800965160
## rad 0.515774432
## tax .
## ptratio -0.194514895
## black -0.007544322
## lstat 0.125137815
## medv -0.160384184
We can see that age and tax are dropped from the model due to the L1-norm penalty. While this does not reduce the predictor set as much as we might like, it does provide some feature selection without compromising performance.