For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
(a) The lasso, relative to least squares, is iii. The lasso has a varying degree of flexibility in comparison to OLS: when lambda = 0 it is identical to OLS, but as lambda increases the penalty term reduces the flexibility. As flexibility decreases the bias increases while the variance decreases. Thus, prediction accuracy improves when the increase in bias is less than the decrease in variance.
(b) Ridge regression, relative to least squares, is also iii. Ridge regression likewise coincides with OLS when lambda is 0. As lambda increases, flexibility decreases, which increases the bias but greatly reduces the variance of the predictions. It therefore gives better prediction accuracy when the increase in bias is more than offset by the resulting decrease in variance.
(c) Non-linear methods, relative to least squares, are ii. Non-linear models are generally more flexible than linear models, which means lower bias but higher variance. They give improved prediction accuracy when the increase in variance is less than the decrease in bias.
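To make the shrinkage idea concrete, here is a small illustrative sketch (with made-up data and hypothetical variable names, not part of the exercise) showing how the lasso and ridge coefficient paths shrink toward zero as lambda grows, i.e. how flexibility decreases:
# Illustrative sketch only: shrinkage of lasso and ridge coefficients as lambda grows
library(glmnet)
set.seed(1)
X.demo <- matrix(rnorm(100 * 10), nrow = 100)   # 10 hypothetical predictors
beta.demo <- c(3, -2, 1.5, rep(0, 7))           # only the first three matter
y.demo <- drop(X.demo %*% beta.demo) + rnorm(100)
par(mfrow = c(1, 2))
plot(glmnet(X.demo, y.demo, alpha = 1), xvar = "lambda", main = "Lasso")  # coefficients reach exactly zero
plot(glmnet(X.demo, y.demo, alpha = 0), xvar = "lambda", main = "Ridge")  # coefficients shrink but stay nonzero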
In this exercise, we will generate simulated data and then use it to perform best subset selection.
# packages used throughout this exercise
library(leaps)   # regsubsets
library(olsrr)   # ols_step_forward_p / ols_step_backward_p / ols_step_both_p
library(glmnet)  # lasso and ridge regression
library(pls)     # principal components regression
library(caret)   # createFolds
library(MASS)    # Boston data, stepAIC

# simulate 10 predictors, each with n = 100 observations
set.seed(123)
x <- vector("list", 10)
for (i in 1:10){
  x[[i]] <- rnorm(100)
}
x <- as.data.frame(do.call(cbind, x))
colnames(x) <- paste("x", 1:10, sep="")

# response: y = 10 + 5*x1 - 30*x2 + 25*x3 + eps
eps <- rnorm(100)
y <- numeric(100)
b0 <- 10
b1 <- 5
b2 <- -30
b3 <- 25
for (i in seq_along(y)){
  y[i] <- b0 + b1 * x$x1[i] + b2 * x$x2[i] + b3 * x$x3[i] + eps[i]
}
xy <- data.frame(x,y)
# best subset selection over all 10 predictors (regsubsets keeps models up to nvmax = 8 by default)
x.subsets <- regsubsets(x = xy[,-11], y = xy[,11])
best.x <- summary(x.subsets)
best.x$outmat
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
## 1 ( 1 ) " " "*" " " " " " " " " " " " " " " " "
## 2 ( 1 ) " " "*" "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" "*" " " " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" "*" "*" " " " " " " " " " " "*"
## 6 ( 1 ) "*" "*" "*" "*" " " " " "*" " " " " "*"
## 7 ( 1 ) "*" "*" "*" "*" " " " " "*" "*" " " "*"
## 8 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*"
# plot
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(best.x$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp", ylim=c(0,25))
# line for p + 1, as Cp should be less than p + 1
lines(1:length(best.x$cp), 1:length(best.x$cp) + 1, col = "red", lty = 2)
plot(best.x$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(best.x$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")
# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(best.x$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 4
cat("Best score for BIC:", which.min(best.x$bic), "\n")
## Best score for BIC: 3
cat("Best score for Adjusted R2:", which.max(best.x$adjr2), "\n")
## Best score for Adjusted R2: 6
Using the graphs of Cp, BIC, and adjusted R2 from regsubsets, we can see the best model has around 3 to 4 predictors. Cp selects 4 predictors, BIC selects 3, and although adjusted R2 is maximized at 6 predictors, it shows no meaningful increase beyond 3. Since these criteria all point to 3 or 4 predictors, we will go with 3, as it is the simpler model.
The 3-predictor model uses x1, x2, and x3. This lines up with the predictors we originally used to define the response y, so it is to be expected. We can fit a linear regression to examine the estimated coefficients:
predictor3.model <- lm(y ~ x1 + x2 + x3, data=xy)
summary(predictor3.model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = xy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77639 -0.75392 0.03634 0.56478 2.72857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.1326 0.1065 95.11 <2e-16 ***
## x1 5.0631 0.1160 43.65 <2e-16 ***
## x2 -30.0176 0.1086 -276.31 <2e-16 ***
## x3 24.8321 0.1114 222.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.044 on 96 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
## F-statistic: 4.131e+04 on 3 and 96 DF, p-value: < 2.2e-16
We originally used b0 = 10, b1 = 5, b2 = -30, and b3 = 25. The linear model comes very close to the true model: the intercept and all of the coefficients are within 0.2 of the true values. Given how we created the data this is to be expected, and it confirms that these tools do indeed work.
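As a quick sanity check (a minimal sketch reusing the objects defined above), the fitted coefficients can be placed directly next to the true values:
# fitted coefficients next to the true values used to simulate y
round(cbind(estimate = coef(predictor3.model), truth = c(b0, b1, b2, b3)), 3)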
# forward and backward selection using olsrr's p-value based stepwise functions
# (AIC-based alternatives with stepAIC are left commented out)
null.model <- lm(y ~ 1, data = xy)
full.model <- lm(y ~ ., data = xy)
# forward <- stepAIC(null.model, direction = 'forward')
# backward <- stepAIC(full.model, direction = 'backward')
forward <- ols_step_forward_p(full.model, p_val = 0.05)
backward <- ols_step_backward_p(full.model, p_val = 0.05)
# forward
print('Forward selection:', quote=F)
## [1] Forward selection:
summary(forward$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77639 -0.75392 0.03634 0.56478 2.72857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.1326 0.1065 95.11 <2e-16 ***
## x2 -30.0176 0.1086 -276.31 <2e-16 ***
## x3 24.8321 0.1114 222.93 <2e-16 ***
## x1 5.0631 0.1160 43.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.044 on 96 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
## F-statistic: 4.131e+04 on 3 and 96 DF, p-value: < 2.2e-16
plot(forward)
# backward
print('Backward selection:', quote=F)
## [1] Backward selection:
summary(backward$model)
##
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.77639 -0.75392 0.03634 0.56478 2.72857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.1326 0.1065 95.11 <2e-16 ***
## x1 5.0631 0.1160 43.65 <2e-16 ***
## x2 -30.0176 0.1086 -276.31 <2e-16 ***
## x3 24.8321 0.1114 222.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.044 on 96 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
## F-statistic: 4.131e+04 on 3 and 96 DF, p-value: < 2.2e-16
plot(backward)
Looking at the plots, we can see a steady and clear increase in performance as the significant predictors are added in forward selection, while removing variables in backward selection makes very little difference. This provides evidence that the significant predictors (x1, x2, and x3) are important while all other predictors are unimportant.
Both forward and backward selection choose x1, x2, and x3 as their predictors. This agrees with regsubsets that the best model uses those three predictors, which is to be expected given how the data was constructed.
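For comparison, the AIC-based stepwise search left commented out above could be run as a sketch like the following (the forward search needs a scope argument so it knows which predictors to consider; results are not shown here):
# sketch: AIC-based stepwise selection as an alternative to the p-value criterion
library(MASS)
forward.aic  <- stepAIC(null.model, direction = "forward", trace = FALSE,
                        scope = list(lower = formula(null.model), upper = formula(full.model)))
backward.aic <- stepAIC(full.model, direction = "backward", trace = FALSE)
formula(forward.aic)
formula(backward.aic)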
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5
# lasso
lasso.x = glmnet(xy[,-11],xy[,11],alpha = 1, lambda=grid)
plot(lasso.x,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("topright", legend = colnames(xy[,-11]), col = 1:ncol(xy[,-11]), lty = 1, cex = 0.6)
# do cv for best lambda
xy.matrix <- model.matrix(y ~ ., data = xy)[, -1]
cv.lasso.x = cv.glmnet(xy.matrix,xy$y,alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.x)
# displaying minimum and 1 standard error lambda values
c(log(cv.lasso.x$lambda.min), log(cv.lasso.x$lambda.1se))
## [1] -2.395619 -1.418765
We see the best log(lambda) value to be about -2.4, with the 1 standard error log(lambda) at about -1.4. The error for a high log(lambda) is significantly higher than for a low log(lambda), where the error appears to be close to zero. We can take a closer look at the minimum:
plot(cv.lasso.x, ylim = c(0,3))
From this we see a small error near 1, with little difference between the 1se lambda and the true minimum, as we would expect. Matching this value of log(lambda) against the earlier coefficient plot, we see that the lasso correctly identifies the significant variables and gives accurate approximations of their coefficients, much like stepwise regression and best subset selection. It also correctly shows that performance suffers at the large values of lambda that shrink these predictors away entirely.
lasso.best <- glmnet(xy[,-11],xy[,11],alpha = 1, lambda = cv.lasso.x$lambda.min)
coef(lasso.best)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 10.16694033
## x1 4.95052754
## x2 -29.92012807
## x3 24.72671989
## x4 0.06951759
## x5 .
## x6 .
## x7 .
## x8 0.02725292
## x9 .
## x10 0.07209471
lasso.1se <- glmnet(xy[,-11],xy[,11],alpha = 1, lambda = cv.lasso.x$lambda.1se)
coef(lasso.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 10.221850
## x1 4.769227
## x2 -29.770786
## x3 24.531854
## x4 .
## x5 .
## x6 .
## x7 .
## x8 .
## x9 .
## x10 .
Looking at the 1 standard error lambda, we see that it selects exactly the correct features. There is some slight shrinkage relative to the true values, but that is to be expected with the lasso.
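For completeness, the 1se estimates for the active terms can be lined up against the true coefficients (a small sketch reusing the objects above):
# 1se lasso estimates for the intercept and x1-x3 next to the true values
round(cbind(lasso.1se = as.matrix(coef(lasso.1se))[1:4, 1],
            truth = c(b0, b1, b2, b3)), 3)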
# new response: y2 = 25 - 10*x7 + eps (reusing the same noise vector)
y2 <- numeric(100)
b0 <- 25
b7 <- -10
for (i in seq_along(y2)){
  y2[i] <- b0 + b7 * xy$x7[i] + eps[i]
}
xy2 <- data.frame(x,y2)
# best subset selection
x2.subsets <- regsubsets(y2 ~ ., data = xy2, nvmax = 10)
x2.subset.summary <- summary(x2.subsets)
print(x2.subset.summary$outmat)
## x1 x2 x3 x4 x5 x6 x7 x8 x9 x10
## 1 ( 1 ) " " " " " " " " " " " " "*" " " " " " "
## 2 ( 1 ) " " " " " " "*" " " " " "*" " " " " " "
## 3 ( 1 ) " " " " "*" "*" " " " " "*" " " " " " "
## 4 ( 1 ) " " " " "*" "*" " " " " "*" " " " " "*"
## 5 ( 1 ) " " " " "*" "*" " " " " "*" "*" " " "*"
## 6 ( 1 ) " " " " "*" "*" " " " " "*" "*" "*" "*"
## 7 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " "*" "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# plot
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(x2.subset.summary$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp", ylim=c(0,25))
# line for p + 1, as Cp should be less than p + 1
lines(1:length(x2.subset.summary$cp), 1:length(x2.subset.summary$cp) + 1, col = "red", lty = 2)
plot(x2.subset.summary$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(x2.subset.summary$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")
# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(x2.subset.summary$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 3
cat("Best score for BIC:", which.min(x2.subset.summary$bic), "\n")
## Best score for BIC: 1
cat("Best score for Adjusted R2:", which.max(x2.subset.summary$adjr2), "\n")
## Best score for Adjusted R2: 5
Using regsubsets for this new response, we get mixed results: Cp is minimized at 3 predictors (with 2 nearly as good), BIC is minimized at 1, and adjusted R2 peaks at 5. Given that adjusted R2 differs by only about 0.005 across these models, it effectively shows no real difference between them. For Cp, 2 predictors perform similarly to the minimum at 3, but only mildly better than 1 predictor. It is therefore reasonable to follow BIC and use the one-predictor model: all of these criteria show similar performance, and the simplest model is preferred for interpretability. It is also worth noting that BIC identifies the true model here, which is consistent with its heavier penalty on model size.
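To confirm that the one-predictor model picked by BIC is indeed the true model, its coefficients can be extracted from the regsubsets fit (a brief sketch reusing the object above):
# coefficients of the one-predictor model chosen by BIC
coef(x2.subsets, id = 1)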
# lasso
legend.labels <- colnames(xy2[,-11])
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5
lasso.x2 = glmnet(xy2[,-11],xy2[,11],alpha = 1, lambda = grid)
plot(lasso.x2,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("topright", legend = legend.labels, col = 1:ncol(xy2[,-11]), lty = 1, cex = 0.6)
# do cv for best lambda
# xy2 <- model.matrix(y2 ~., data=xy2)[,-1]
xy2.matrix <- model.matrix(y2 ~ ., data = xy2)[, -1]
cv.lasso.x2 = cv.glmnet(xy2.matrix,xy2$y2,alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.x2)
# displaying best lambda values
c(log(cv.lasso.x2$lambda.min), log(cv.lasso.x2$lambda.1se))
## [1] -2.0700007 -0.7675284
Note: the legend appears to be incorrect, as it suggests x9 is the selected variable; the coefficients below confirm that the correct variable (x7) was in fact selected. The mismatch is likely because the colours in the manually built legend do not line up with the colours plot.glmnet assigns to the curves (the colour palette recycles with 10 predictors), and I was not able to get the colours to match.
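One way to sidestep the legend mismatch (a sketch of an alternative, not a verified fix) is to let plot.glmnet label each curve with its column index rather than relying on a manually built legend:
# label each coefficient path with its variable index instead of using a legend
plot(lasso.x2, "lambda", label = TRUE,
     xlab = expression(paste("log(", lambda, ")")))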
Using the lasso we find the best log(lambda) to be -2.07, with the 1 standard error value at -0.77. As in the previous part, the error for a high log(lambda) is significantly higher than for a low log(lambda), where the error appears to be close to zero. We can again take a closer look at the minimum:
plot(cv.lasso.x2, ylim = c(0,3))
From this we again see a small error near 1, with little difference between the 1se lambda and the true minimum, as we would expect. Matching this value of log(lambda) against the earlier coefficient plot, the lasso once again correctly identifies the significant variable and gives an accurate approximation of its coefficient, much like best subset selection. It also correctly shows that performance suffers at the large values of lambda that shrink this predictor away entirely.
lasso.best <- glmnet(xy2[,-11],xy2[,11],alpha = 1, lambda = cv.lasso.x2$lambda.min)
coef(lasso.best)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 25.15973358
## x1 .
## x2 .
## x3 -0.03967533
## x4 0.07656780
## x5 .
## x6 .
## x7 -9.78454826
## x8 0.00661949
## x9 .
## x10 0.03645885
lasso.1se <- glmnet(xy2[,-11],xy2[,11],alpha = 1, lambda = cv.lasso.x2$lambda.1se)
coef(lasso.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 25.200720
## x1 .
## x2 .
## x3 .
## x4 .
## x5 .
## x6 .
## x7 -9.460191
## x8 .
## x9 .
## x10 .
Looking at the coefficients, we see that the 1 standard error model correctly picks out the single useful predictor and recovers the expected coefficient, with some slight shrinkage.
Overall, we see very similar results to the previous section of this problem. This is expected given the similar nature of the data, and it shows that these methods adapt appropriately to such small changes.
We will now try to predict per capita crime rate in the Boston data set.
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
## 2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
## 3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
## 4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
## 5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33
## 6 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21
## medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
# build the design matrix and response (no additional preprocessing is applied)
x.boston <- model.matrix(crim ~ ., data = Boston)[,-1]
y.boston <- Boston[,'crim']
# best subset selection
subsets <- regsubsets(crim ~ ., data = Boston, nvmax = 13)
subset.summary <- summary(subsets)
print(subset.summary$outmat)
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# plots
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(subset.summary$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp")
lines(1:length(subset.summary$cp), 1:length(subset.summary$cp) + 1, col = "red", lty = 2) # p + 1
plot(subset.summary$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(subset.summary$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")
# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(subset.summary$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 8
cat("Best score for BIC:", which.min(subset.summary$bic), "\n")
## Best score for BIC: 3
cat("Best score for Adjusted R2:", which.max(subset.summary$adjr2), "\n")
## Best score for Adjusted R2: 9
Using regsubsets, we see mixed results: Cp is minimized at 8 predictors (with 7 nearly as good), BIC at 3 (with 2 close behind), and adjusted R2 peaks at 9 (with 8 nearly identical). Given that Cp and adjusted R2 roughly agree around 8 predictors, we will say that regsubsets chooses the following 8 predictors:
zn + nox + dis + rad + ptratio + black + lstat + medv
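The coefficient estimates for this 8-predictor model can be pulled straight from the regsubsets fit (a short sketch reusing the object above):
# coefficients of the 8-predictor model selected above
round(coef(subsets, id = 8), 4)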
# stepwise (both-direction) selection using olsrr; single-direction and AIC-based alternatives are left commented out
null.model <- lm(crim~1, data=Boston)
full.model <- lm(crim~., data=Boston)
# forward <- stepAIC(null.model, direction = 'forward')
# backward <- stepAIC(full.model, direction = 'backward')
# forward <- ols_step_forward_p(full.model, p_val =0.05)
# backward <- ols_step_backward_p(full.model, p_val =0.05)
both <- ols_step_both_p(full.model, p_val = 0.05)
# # forward
# print('Forward selection:', quote=F)
# summary(forward$model)
# plot(forward)
#
# # backward
# print('Backward selection:', quote=F)
# summary(backward$model)
# plot(backward)
# both
print('Both:', quote=F)
## [1] Both:
summary(both$model)
##
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")),
## data = l)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.023 -1.713 -0.281 0.873 77.716
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.372585 1.641557 -0.227 0.82054
## rad 0.488172 0.040422 12.077 < 2e-16 ***
## lstat 0.213596 0.047447 4.502 8.39e-06 ***
## black -0.009472 0.003615 -2.620 0.00905 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.521 on 502 degrees of freedom
## Multiple R-squared: 0.4286, Adjusted R-squared: 0.4252
## F-statistic: 125.5 on 3 and 502 DF, p-value: < 2.2e-16
plot(both)
If we use stepwise regression, combining both forward and backward selection, we arrive at three important variables. The resulting formula is: rad + lstat + black
These three variables appear in the best subset results as well, and they match the three-predictor model that BIC selects in the subset search above.
# with n >> p here, we do not expect the lasso to offer a large advantage
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5
# lasso
lasso.boston = glmnet(x.boston,(y.boston),alpha = 1)
plot(lasso.boston,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)
# do cv for best lambda
cv.lasso.boston = cv.glmnet(x.boston,(y.boston),alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.boston)
# displaying best lambda values
c(log(cv.lasso.boston$lambda.min), log(cv.lasso.boston$lambda.1se))
## [1] -3.046855 1.186180
Using the lasso, we see an optimal log(lambda) at about -3.0 and a 1se value at about 1.2. Interestingly, there is a significant difference between these two models: the minimum retains 11 predictors while the 1se retains only 1. We can analyze this further by inspecting the coefficients:
lasso.best <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.min)
coef(lasso.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 12.921616863
## zn 0.036630369
## indus -0.071088384
## chas -0.592709165
## nox -7.203117383
## rm 0.243372202
## age .
## dis -0.800965160
## rad 0.515774432
## tax .
## ptratio -0.194514895
## black -0.007544322
## lstat 0.125137815
## medv -0.160384184
lasso.1se <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.1se)
coef(lasso.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.3076511
## zn .
## indus .
## chas .
## nox .
## rm .
## age .
## dis .
## rad 0.2414676
## tax .
## ptratio .
## black .
## lstat .
## medv .
We see that both agree that rad is an important feature, and it is also the variable regsubsets picks when restricted to a single predictor. The fact that the other 10 retained features are eliminated by the L1 penalty within 1 standard error suggests they are of much less importance.
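The number of retained predictors along the lambda path is stored in the cv.glmnet object, so the counts at the two candidate lambdas can be checked directly (a minimal sketch reusing the fit above):
# number of nonzero coefficients at lambda.min and lambda.1se
idx.min <- which(cv.lasso.boston$lambda == cv.lasso.boston$lambda.min)
idx.1se <- which(cv.lasso.boston$lambda == cv.lasso.boston$lambda.1se)
cv.lasso.boston$nzero[c(idx.min, idx.1se)]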
# ridge regression
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5
ridge.boston = glmnet(x.boston,(y.boston),alpha = 0, lambda = grid)
plot(ridge.boston,"lambda",ylab = "Coefficients", xlab = expression(paste("log", lambda)), main = "Predictor coefficients against log lambda")
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)
cv.ridge.boston = cv.glmnet(x.boston,(y.boston),alpha = 0, lambda = grid, nfolds = 10)
plot(cv.ridge.boston,xlab = expression(paste("log", lambda)))
# displaying best lambda values:
c(log(cv.ridge.boston$lambda.min), log(cv.ridge.boston$lambda.1se))
## [1] -1.255956 4.116743
Using ridge regression, we find a best log(lambda) of -1.26 and a 1 standard error value of 4.12. We can see the following coefficients for each of these:
rr.best <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.min)
coef(rr.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 11.836387893
## zn 0.037105939
## indus -0.082352998
## chas -0.732319890
## nox -7.162927585
## rm 0.372009282
## age 0.001612269
## dis -0.809063722
## rad 0.477612534
## tax 0.001283535
## ptratio -0.184989001
## black -0.008164558
## lstat 0.139250075
## medv -0.159866428
rr.1se <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.1se)
coef(rr.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 1.136778807
## zn -0.002881647
## indus 0.033300535
## chas -0.210522818
## nox 2.165731519
## rm -0.158704737
## age 0.007094660
## dis -0.110224494
## rad 0.056383381
## tax 0.002520427
## ptratio 0.082968051
## black -0.003162664
## lstat 0.042415472
## medv -0.027788872
If we compare these side by side with the lasso results, the ridge coefficients at the minimum lambda are very close to the corresponding lasso coefficients. The significant difference is at the 1 standard error lambda: there, ridge regression has only one coefficient that is not close to zero, much like the lasso having only one nonzero coefficient, but ridge emphasizes nox while the lasso keeps rad as the most important predictor. Ridge therefore also differs from regsubsets, which likewise chose rad as the single most important predictor.
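To make the side-by-side comparison explicit, the four coefficient vectors can be bound into one table (a sketch reusing the fits above):
# lasso and ridge coefficients side by side at both lambda choices
round(cbind(lasso.min = as.matrix(coef(lasso.best))[, 1],
            lasso.1se = as.matrix(coef(lasso.1se))[, 1],
            ridge.min = as.matrix(coef(rr.best))[, 1],
            ridge.1se = as.matrix(coef(rr.1se))[, 1]), 4)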
# PCR
pcr.boston = pcr(crim ~ ., data = Boston, scale = T, validation = "CV")
summary(pcr.boston)
## Data: X dimension: 506 13
## Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 8.61 7.212 7.210 6.775 6.779 6.782 6.796
## adjCV 8.61 7.209 7.206 6.769 6.772 6.777 6.789
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 6.782 6.655 6.672 6.662 6.656 6.623 6.571
## adjCV 6.775 6.648 6.663 6.653 6.648 6.612 6.560
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 47.70 60.36 69.67 76.45 82.99 88.00 91.14 93.45
## crim 30.69 30.87 39.27 39.61 39.61 39.86 40.14 42.47
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 95.40 97.04 98.46 99.52 100.0
## crim 42.55 42.78 43.04 44.13 45.4
validationplot(pcr.boston, val.type = 'MSEP')
Using PCR, we see that after around 3 principal components the error begins to plateau, with only minor improvements as the number of components increases. We will therefore take 3 principal components as the best PCR model.
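The pls package can also suggest a component count more formally; as a sketch (assuming a reasonably recent pls version that provides selectNcomp), the one-sigma heuristic applied to the CV results may or may not agree with the 3 components read off the plot:
# formal component selection from the CV results via the one-sigma rule
selectNcomp(pcr.boston, method = "onesigma", plot = TRUE)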
The best performance with the simplest models was achieved by stepwise regression and principal components regression, as both showed low errors in the graphs while using only 3 predictors or components. Lasso and ridge regression also performed well, although both were noticeably worse at the 1 standard error lambda. We will thus carry stepwise regression, PCR, lasso, and ridge regression forward, evaluate each with 10-fold cross-validation, and compare their MSEs to choose a final model.
Conveniently, cross-validation was already performed above for all of these methods except stepwise regression, which we do now:
set.seed(123)
folds <- createFolds(Boston$crim, k = 10, list = TRUE)
mse_values <- numeric(10)
selected_models <- vector("list", length = 10)
for (i in seq_along(folds)) {
  # training and validation sets
  train_data <- Boston[-folds[[i]], ]
  test_data <- Boston[folds[[i]], ]
  # fit
  full_model <- lm(crim ~ ., data = train_data)
  stepwise_model <- ols_step_both_p(full_model, pent = 0.05, prem = 0.05)
  # store best model of fold
  selected_models[[i]] <- formula(stepwise_model$model)
  # get mse
  predictions <- predict(stepwise_model$model, newdata = test_data)
  mse_values[i] <- mean((test_data$crim - predictions)^2)
}
# average mse
mean_mse <- mean(mse_values)
# Output results
# cat("MSE for each fold:", mse_values, "\n")
cat("Mean MSE across all folds:", mean_mse, "\n")
## Mean MSE across all folds: 43.83496
# Print the selected models for each fold
cat("\nSelected models for each fold:\n")
##
## Selected models for each fold:
for (i in seq_along(selected_models)) {
  cat(sprintf("Fold %d: %s\n", i, deparse(selected_models[[i]])))
}
## Fold 1: crim ~ rad + lstat + black
## Fold 2: crim ~ rad + lstat + medv + ptratio
## Fold 3: crim ~ rad + lstat + black
## Fold 4: crim ~ rad + lstat + black
## Fold 5: crim ~ rad + lstat + black
## Fold 6: crim ~ rad + lstat + black + zn + dis + nox + medv
## Fold 7: crim ~ rad + lstat
## Fold 8: crim ~ rad + lstat + black + medv
## Fold 9: crim ~ rad + lstat + black
## Fold 10: crim ~ rad + lstat + black
We see that after performing cross-validated stepwise regression, we arrive at a mean MSE of 43.8, similar to the performance observed in part a) of this question. The selected models also largely agree on the same 3 predictors as before: rad, lstat, and black.
Now we can compare all of the results to see the optimal model:
## Ridge Regression:
## Minimum MSE: 43.43256
## MSE at 1-se Lambda: 56.33671
## Lasso Regression:
## Minimum MSE: 43.62434
## MSE at 1-se Lambda: 56.43248
## Principal Component Regression (PCR):
## MSE with 3 components: 45.90038
## MSE with 8 components: 44.28734
## Minimum MSE, with 13 components: 43.18376
## Stepwise Regression:
## [1] 43.83496
Looking at the performance, we see the best MSEs all sit around 43: ridge regression, lasso regression, PCR with 13 components, and stepwise regression were all within about a single point of each other.
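For reference, numbers like the ones above can be pulled from the cross-validation objects fitted earlier; a minimal sketch (the exact values depend on the random CV folds):
# sketch: extracting CV MSEs from the earlier fits
ridge.mse <- c(min = min(cv.ridge.boston$cvm),
               se1 = cv.ridge.boston$cvm[cv.ridge.boston$lambda == cv.ridge.boston$lambda.1se])
lasso.mse <- c(min = min(cv.lasso.boston$cvm),
               se1 = cv.lasso.boston$cvm[cv.lasso.boston$lambda == cv.lasso.boston$lambda.1se])
pcr.mse <- drop(MSEP(pcr.boston, estimate = "CV")$val)  # CV MSE by number of components
list(ridge = ridge.mse, lasso = lasso.mse, pcr = pcr.mse, stepwise = mean_mse)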
With the exception of the 1 standard error lambda values, all of the models above performed at essentially the same level. We will therefore continue with lasso or ridge regression, because the penalty term reduces variance; while it shrinks the coefficients, the lower variance may prove beneficial if the model is applied to an outside dataset in a real-world scenario. Between the two, the final model we propose for predicting crime rates is lasso regression, because it also performs feature selection, which ridge regression cannot. Returning to the coefficients at the minimum lambda, we can see which features the lasso deems most important:
## 14 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 12.921616863
## zn 0.036630369
## indus -0.071088384
## chas -0.592709165
## nox -7.203117383
## rm 0.243372202
## age .
## dis -0.800965160
## rad 0.515774432
## tax .
## ptratio -0.194514895
## black -0.007544322
## lstat 0.125137815
## medv -0.160384184
We can see that age and tax are dropped from the model by the L1-norm penalty and will not be used. While this does not reduce the predictor set as far as we would like, it does provide some feature selection without compromising performance.
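Finally, if this lasso model were carried forward, predictions for new observations would come from predict() on the fitted glmnet object; a brief sketch (re-predicting the training rows here purely for illustration):
# sketch: predicting crime rates with the final lasso model (training rows reused for illustration)
crim.pred <- predict(lasso.best, newx = x.boston, s = cv.lasso.boston$lambda.min)
head(cbind(actual = y.boston, predicted = drop(crim.pred)))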