Exercise 6.6.2

For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.

  i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

  iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

  a. The lasso, relative to least squares, is:
     Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

The lasso has a varying degree of flexibility relative to OLS: when lambda = 0 it is identical to OLS, but as lambda increases the penalty term causes the flexibility to decrease. As flexibility decreases the bias increases while the variance decreases. Prediction accuracy therefore improves only when the increase in bias is smaller than the decrease in variance.

  b. Repeat (a) for ridge regression relative to least squares.
     Answer: iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.

Ridge regression likewise reduces to OLS when lambda = 0. As lambda increases, flexibility decreases, which increases the bias but also reduces the variance of the model's predictions. It therefore gives better prediction accuracy when the increase in bias is outweighed by the resulting decrease in variance.

  c. Repeat (a) for non-linear methods relative to least squares.
     Answer: ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.

Since non-linear models are generally more flexible than linear models, they have lower bias but higher variance. The increase in variance must therefore be smaller than the decrease in bias for prediction performance to improve.
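
The tradeoff described in (a) and (b) can be illustrated with a small simulation. The following is a minimal sketch, not part of the original answers: it compares lasso test MSE across lambda with OLS on sparse simulated data (setting alpha = 0 would give the corresponding ridge curve for (b)). All object names here are illustrative.

# bias-variance tradeoff sketch: lasso test MSE across lambda vs. OLS
library(glmnet)
set.seed(1)
n <- 50; p <- 20
x.tr <- matrix(rnorm(n * p), n, p)
x.te <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 3), rep(0, p - 3))          # only 3 predictors matter
y.tr <- drop(x.tr %*% beta) + rnorm(n)
y.te <- drop(x.te %*% beta) + rnorm(n)

fit <- glmnet(x.tr, y.tr, alpha = 1)         # lasso path (alpha = 0 for ridge)
test.mse <- colMeans((predict(fit, newx = x.te) - y.te)^2)
ols.mse <- mean((cbind(1, x.te) %*% coef(lm(y.tr ~ x.tr)) - y.te)^2)

plot(log(fit$lambda), test.mse, type = "l",
     xlab = "log(lambda)", ylab = "Test MSE")
abline(h = ols.mse, lty = 2)                 # OLS reference line

On data like this, moderate lambda values often beat the OLS reference line: the penalty's added bias is outweighed by the reduction in variance, exactly the condition stated in answers (a) and (b).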

Exercise 6.6.8

In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.

  a. Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector epsilon of length n = 100.
set.seed(123)

# generate 10 independent predictors, each of length n = 100
x <- vector("list", 10)
for (i in 1:10){
  x[[i]] <- rnorm(100)
}

x <- as.data.frame(do.call(cbind, x))
colnames(x) <- paste("x", 1:10, sep="")

# noise vector
eps <- rnorm(100)
  b. Generate a response vector Y of length n = 100 according to the model Y = B0 + B1 X + B2 X^2 + B3 X^3 + epsilon, where B0, B1, B2, and B3 are constants of your choice.
# true coefficients
b0 <- 10
b1 <- 5
b2 <- -30
b3 <- 25

# response: y = b0 + b1*x1 + b2*x2 + b3*x3 + eps
y <- b0 + b1 * x$x1 + b2 * x$x2 + b3 * x$x3 + eps
  c. Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors X, X^2, . . . , X^10. What is the best model obtained according to Cp, BIC, and adjusted R2? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
xy <- data.frame(x,y)
# note: nvmax defaults to 8, so subsets of up to 8 predictors are considered
x.subsets <- regsubsets(x=xy[,-11], y=xy[,11])
best.x <- summary(x.subsets)
best.x$outmat
##          x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
## 1  ( 1 ) " " "*" " " " " " " " " " " " " " " " "
## 2  ( 1 ) " " "*" "*" " " " " " " " " " " " " " "
## 3  ( 1 ) "*" "*" "*" " " " " " " " " " " " " " "
## 4  ( 1 ) "*" "*" "*" "*" " " " " " " " " " " " "
## 5  ( 1 ) "*" "*" "*" "*" " " " " " " " " " " "*"
## 6  ( 1 ) "*" "*" "*" "*" " " " " "*" " " " " "*"
## 7  ( 1 ) "*" "*" "*" "*" " " " " "*" "*" " " "*"
## 8  ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" " " "*"
# plot
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(best.x$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp", ylim=c(0,25))
# line for p + 1, as Cp should be less than p + 1
lines(1:length(best.x$cp), 1:length(best.x$cp) + 1, col = "red", lty = 2) 
plot(best.x$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(best.x$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")

# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(best.x$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 4
cat("Best score for BIC:", which.min(best.x$bic), "\n")
## Best score for BIC: 3
cat("Best score for Adjusted R2:", which.max(best.x$adjr2), "\n")
## Best score for Adjusted R2: 6

Using the graphs of Cp, BIC, and adjusted R2 from regsubsets, we see the best model has around 3 to 4 predictors: Cp selects 4 predictors, BIC selects 3, and adjusted R2 shows no meaningful increase beyond 3 (its maximum at 6 is only marginally higher). As these criteria all point to 3 or 4 predictors, we go with 3 predictors, the simpler model.

For the 3 predictor model, we see that it uses the predictors x1, x2, and x3. This lines up with what we originally used to define the response y, so this is to be expected. We can use linear regression to see what the coefficients of each are:

predictor3.model <- lm(y ~ x1 + x2 + x3, data=xy)
summary(predictor3.model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = xy)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77639 -0.75392  0.03634  0.56478  2.72857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.1326     0.1065   95.11   <2e-16 ***
## x1            5.0631     0.1160   43.65   <2e-16 ***
## x2          -30.0176     0.1086 -276.31   <2e-16 ***
## x3           24.8321     0.1114  222.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.044 on 96 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 4.131e+04 on 3 and 96 DF,  p-value: < 2.2e-16

We originally used b0 = 10, b1 = 5, b2 = -30, and b3 = 25. The linear model gets very close to the true model, as the intercept and all of the coefficients are within 0.2 of the true values. Given how we created the data, this is to be expected; it fortunately shows that these tools do indeed work.
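
As an aside (not in the original solution), the coefficients of the chosen model can also be read directly from the regsubsets fit via leaps' coef() method, without refitting with lm():

# coefficients of the 3-variable model, straight from the regsubsets object
coef(x.subsets, 3)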

  d. Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
# stepwise selection using olsrr's p-value based approach
# (stepAIC from MASS, commented out below, is an alternative)
null.model <- lm(y~1, data=xy)
full.model <- lm(y~., data=xy)
# forward <- stepAIC(null.model, scope = formula(full.model), direction = 'forward')
# backward <- stepAIC(full.model, direction = 'backward')

forward <- ols_step_forward_p(full.model, p_val =0.05)
backward <- ols_step_backward_p(full.model, p_val =0.05)
# forward
print('Forward selection:', quote=F)
## [1] Forward selection:
summary(forward$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77639 -0.75392  0.03634  0.56478  2.72857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.1326     0.1065   95.11   <2e-16 ***
## x2          -30.0176     0.1086 -276.31   <2e-16 ***
## x3           24.8321     0.1114  222.93   <2e-16 ***
## x1            5.0631     0.1160   43.65   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.044 on 96 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 4.131e+04 on 3 and 96 DF,  p-value: < 2.2e-16
plot(forward)

# backward
print('Backward selection:', quote=F)
## [1] Backward selection:
summary(backward$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(c(include, cterms), collapse = " + ")), 
##     data = l)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.77639 -0.75392  0.03634  0.56478  2.72857 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  10.1326     0.1065   95.11   <2e-16 ***
## x1            5.0631     0.1160   43.65   <2e-16 ***
## x2          -30.0176     0.1086 -276.31   <2e-16 ***
## x3           24.8321     0.1114  222.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.044 on 96 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 4.131e+04 on 3 and 96 DF,  p-value: < 2.2e-16
plot(backward)

Looking at the plots, we see a steady, clear increase in performance as the significant predictors are added in forward selection, and very little change as the remaining variables are removed in backward selection. This provides evidence that the significant predictors (x1, x2, and x3) are important while all other predictors are unimportant.

Both forward and backward selection choose x1, x2, and x3 as their predictors. This agrees with regsubsets that the best model contains those three predictors. Given that we know how our data was constructed, this is to be expected.
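
As a cross-check (not part of the original solution), regsubsets() itself can perform forward and backward stepwise searches via its method argument; on this data both should again point to the x1, x2, x3 model.

# forward and backward stepwise selection with regsubsets
fwd <- regsubsets(y ~ ., data = xy, nvmax = 10, method = "forward")
bwd <- regsubsets(y ~ ., data = xy, nvmax = 10, method = "backward")
which.min(summary(fwd)$bic)   # expected to be 3 on this data
coef(fwd, 3)
coef(bwd, 3)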

  e. Now fit a lasso model to the simulated data, again using X, X^2, . . . , X^10 as predictors. Use cross-validation to select the optimal value of lambda. Create plots of the cross-validation error as a function of lambda. Report the resulting coefficient estimates, and discuss the results obtained.
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5

# lasso
lasso.x = glmnet(xy[,-11],xy[,11],alpha = 1, lambda=grid)
plot(lasso.x,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("topright", legend = colnames(xy[,-11]), col = 1:ncol(xy[,-11]), lty = 1, cex = 0.6)

# do cv for best lambda
xy.matrix <- model.matrix(y ~ ., data = xy)[, -1]
cv.lasso.x = cv.glmnet(xy.matrix,xy$y,alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.x)

# displaying minimum and 1 standard error lambda values
c(log(cv.lasso.x$lambda.min), log(cv.lasso.x$lambda.1se))
## [1] -2.395619 -1.418765

We see the best log(lambda) value to be -2.39, with the 1 standard error log(lambda) at -1.42. The error for large log(lambda) is significantly higher than for small log(lambda), where the error appears close to zero on this scale. We can take a closer look at the minimum:

plot(cv.lasso.x, ylim = c(0,3))

From this we see a small error near 1, with little difference between the 1se lambda and the true minimum, as we would expect. Lining this value of log(lambda) up with the previous coefficient plot, we see that the lasso was able to correctly identify the significant variables and produce accurate approximations of their coefficients, much like stepwise regression and best subset selection. It also correctly shows that performance suffers for large values of lambda, where these predictors are shrunk away.
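
As a quick numeric check (not in the original), the cross-validation error at the two chosen lambda values can be read from the cv.glmnet object; both should be close to the noise variance of 1 on this run.

# CV error (MSE) at lambda.min and lambda.1se
c(min = min(cv.lasso.x$cvm),
  one.se = cv.lasso.x$cvm[cv.lasso.x$lambda == cv.lasso.x$lambda.1se])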

lasso.best <- glmnet(xy[,-11],xy[,11],alpha = 1, lambda = cv.lasso.x$lambda.min)
coef(lasso.best)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  10.16694033
## x1            4.95052754
## x2          -29.92012807
## x3           24.72671989
## x4            0.06951759
## x5            .         
## x6            .         
## x7            .         
## x8            0.02725292
## x9            .         
## x10           0.07209471
lasso.1se <- glmnet(xy[,-11],xy[,11],alpha = 1, lambda = cv.lasso.x$lambda.1se)
coef(lasso.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                     s0
## (Intercept)  10.221850
## x1            4.769227
## x2          -29.770786
## x3           24.531854
## x4            .       
## x5            .       
## x6            .       
## x7            .       
## x8            .       
## x9            .       
## x10           .

Looking at the 1 standard error lambda, we can see that it properly selects the correct features. There is some slight shrinkage compared to the true values, but that is to be expected for lasso.
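
As an aside (not part of the original solution), the coefficients at the CV-chosen lambda values can also be read directly from the cv.glmnet object, without refitting glmnet at a single lambda; the values may differ slightly from the single-lambda refits above.

# coefficients at the two CV-chosen lambdas, straight from the cv.glmnet fit
coef(cv.lasso.x, s = "lambda.min")
coef(cv.lasso.x, s = "lambda.1se")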

  f. Now generate a response vector Y according to the model Y = B0 + B7 X^7 + epsilon, and perform best subset selection and the lasso. Discuss the results obtained.
# new true coefficients
b0 <- 25
b7 <- -10

# response: y2 = b0 + b7*x7 + eps (same noise vector as before)
y2 <- b0 + b7 * x$x7 + eps

xy2 <- data.frame(x, y2)


# best subset selection
x2.subsets <- regsubsets(y2 ~ ., data = xy2, nvmax = 10) 
x2.subset.summary <- summary(x2.subsets)

print(x2.subset.summary$outmat)
##           x1  x2  x3  x4  x5  x6  x7  x8  x9  x10
## 1  ( 1 )  " " " " " " " " " " " " "*" " " " " " "
## 2  ( 1 )  " " " " " " "*" " " " " "*" " " " " " "
## 3  ( 1 )  " " " " "*" "*" " " " " "*" " " " " " "
## 4  ( 1 )  " " " " "*" "*" " " " " "*" " " " " "*"
## 5  ( 1 )  " " " " "*" "*" " " " " "*" "*" " " "*"
## 6  ( 1 )  " " " " "*" "*" " " " " "*" "*" "*" "*"
## 7  ( 1 )  "*" " " "*" "*" "*" " " "*" "*" " " "*"
## 8  ( 1 )  "*" " " "*" "*" "*" " " "*" "*" "*" "*"
## 9  ( 1 )  "*" " " "*" "*" "*" "*" "*" "*" "*" "*"
## 10  ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# plot
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(x2.subset.summary$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp", ylim=c(0,25))
# line for p + 1, as Cp should be less than p + 1
lines(1:length(x2.subset.summary$cp), 1:length(x2.subset.summary$cp) + 1, col = "red", lty = 2) 
plot(x2.subset.summary$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(x2.subset.summary$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")

# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(x2.subset.summary$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 3
cat("Best score for BIC:", which.min(x2.subset.summary$bic), "\n")
## Best score for BIC: 1
cat("Best score for Adjusted R2:", which.max(x2.subset.summary$adjr2), "\n")
## Best score for Adjusted R2: 5

Using regsubsets for this new response Y, we get mixed results. Cp is minimized at 3 predictors (with 2 nearly as good), BIC is minimized at 1 predictor, and adjusted R2 at 5. Given that adjusted R2 differs by only about 0.005 between these models, it effectively shows no real difference between them. For Cp, a 2-predictor model is nearly as good as the 3-predictor minimum, and only mildly better than 1 predictor. It is therefore reasonable to follow BIC here and use the one-predictor model: all of these fits perform similarly, and the simplest model is preferred for interpretability. It is also worth noting that BIC identified the true model in this case, which is consistent with its heavier penalty on model size.
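
For completeness (not in the original), the coefficients of the one-variable model picked by BIC can be read straight from the regsubsets fit; they should be close to the true values b0 = 25 and b7 = -10.

# coefficients of the 1-predictor model chosen by BIC
coef(x2.subsets, 1)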

# lasso
legend.labels <- colnames(xy2[,-11])
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5

lasso.x2 = glmnet(xy2[,-11],xy2[,11],alpha = 1, lambda = grid)
plot(lasso.x2,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("topright", legend = legend.labels, col = 1:ncol(xy2[,-11]), lty = 1, cex = 0.6)

# do cv for best lambda
xy2.matrix <- model.matrix(y2 ~ ., data = xy2)[, -1]
cv.lasso.x2 = cv.glmnet(xy2.matrix,xy2$y2,alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.x2)

# displaying best lambda values
c(log(cv.lasso.x2$lambda.min), log(cv.lasso.x2$lambda.1se))
## [1] -2.0700007 -0.7675284

Note: the legend appears to be incorrect, as it suggests x9 is the selected variable; the coefficients below show that the correct variable (x7) was in fact selected. The mismatch most likely comes from the legend assigning colors 1 through 10 in column order, which does not match the colors that plot.glmnet assigns to the curves.

Using the lasso, we find the best log(lambda) to be -2.07, with the 1 standard error value at -0.77. As in the previous part, the error for large log(lambda) is significantly higher than for small log(lambda), where the error appears close to zero on this scale. We can again take a closer look at the minimum:

plot(cv.lasso.x2, ylim = c(0,3))

From this we again see a small error near 1, with little difference between the 1se lambda and the true minimum, as we would expect. Lining this value of log(lambda) up with the previous coefficient plot, we see that the lasso was once again able to correctly identify the significant variable and produce an accurate approximation of its coefficient, much like best subset selection. It also correctly shows that performance suffers for large values of lambda, where this predictor is shrunk away.

lasso.best <- glmnet(xy2[,-11],xy2[,11],alpha = 1, lambda = cv.lasso.x2$lambda.min)
coef(lasso.best)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                      s0
## (Intercept) 25.15973358
## x1           .         
## x2           .         
## x3          -0.03967533
## x4           0.07656780
## x5           .         
## x6           .         
## x7          -9.78454826
## x8           0.00661949
## x9           .         
## x10          0.03645885
lasso.1se <- glmnet(xy2[,-11],xy2[,11],alpha = 1, lambda = cv.lasso.x2$lambda.1se)
coef(lasso.1se)
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept) 25.200720
## x1           .       
## x2           .       
## x3           .       
## x4           .       
## x5           .       
## x6           .       
## x7          -9.460191
## x8           .       
## x9           .       
## x10          .

Looking at the coefficients, we see that the 1 standard error model correctly picks out the single useful predictor and comes close to the true coefficient, with some slight shrinkage.

Overall, we see very similar results to the previous parts of this problem. This is expected given the similar way the data were generated, and it shows that these methods adapt appropriately to such changes.

Exercise 6.6.11

We will now try to predict per capita crime rate in the Boston data set.

##      crim zn indus chas   nox    rm  age    dis rad tax ptratio  black lstat
## 1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98
## 2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14
## 3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03
## 4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94
## 5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33
## 6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21
##   medv
## 1 24.0
## 2 21.6
## 3 34.7
## 4 33.4
## 5 36.2
## 6 28.7
  a. Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
# build the design matrix and response vector for glmnet
x.boston <- model.matrix(crim ~., data=Boston)[,-1]
y.boston <- Boston[,'crim']
# best subset selection
subsets <- regsubsets(crim ~ ., data = Boston, nvmax = 13) 
subset.summary <- summary(subsets)

print(subset.summary$outmat)
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"
# plots
par(mfrow = c(1,3))
par(mar = c(5, 4, 2, 1)) # adjusting margins for better look
plot(subset.summary$cp, xlab = "Predictors", ylab = "Cp", type = "l", main = "Cp")
lines(1:length(subset.summary$cp), 1:length(subset.summary$cp) + 1, col = "red", lty = 2) # p + 1
plot(subset.summary$bic, xlab = "Predictors", ylab = "BIC", type = "l", main = "BIC")
plot(subset.summary$adjr2, xlab = "Predictors", ylab = "adjR2", type = "l", main = "adjR2")

# optimal models
cat("Best score for Cp (where Cp is less than p + 1):", which.min(subset.summary$cp), "\n")
## Best score for Cp (where Cp is less than p + 1): 8
cat("Best score for BIC:", which.min(subset.summary$bic), "\n")
## Best score for BIC: 3
cat("Best score for Adjusted R2:", which.max(subset.summary$adjr2), "\n")
## Best score for Adjusted R2: 9

Using regsubsets, we see mixed results. Cp shows optimal performance with 7 to 8 predictors, BIC shows 2 to 3 predictors, while adjR2 shows the best at 8 to 9 predictors. Given that Cp and adjR2 seem to agree around 8 predictors, we will say that regsubsets chooses the following 8 predictors:

zn + nox + dis + rad + ptratio + black + lstat + medv
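
The coefficients of this 8-predictor model can be pulled directly from the regsubsets fit (not shown in the original):

# coefficients of the 8-predictor model identified above
coef(subsets, 8)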

# stepwise selection (both directions) using olsrr; stepAIC is an alternative
null.model <- lm(crim~1, data=Boston)
full.model <- lm(crim~., data=Boston)
# forward <- stepAIC(null.model, scope = formula(full.model), direction = 'forward')
# backward <- stepAIC(full.model, direction = 'backward')

# forward <- ols_step_forward_p(full.model, p_val =0.05)
# backward <- ols_step_backward_p(full.model, p_val =0.05)
both <- ols_step_both_p(full.model, p_val = 0.05)

# # forward
# print('Forward selection:', quote=F)
# summary(forward$model)
# plot(forward)
# 
# # backward
# print('Backward selection:', quote=F)
# summary(backward$model)
# plot(backward)

# both
print('Both:', quote=F)
## [1] Both:
summary(both$model)
## 
## Call:
## lm(formula = paste(response, "~", paste(preds, collapse = " + ")), 
##     data = l)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.023  -1.713  -0.281   0.873  77.716 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.372585   1.641557  -0.227  0.82054    
## rad          0.488172   0.040422  12.077  < 2e-16 ***
## lstat        0.213596   0.047447   4.502 8.39e-06 ***
## black       -0.009472   0.003615  -2.620  0.00905 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.521 on 502 degrees of freedom
## Multiple R-squared:  0.4286, Adjusted R-squared:  0.4252 
## F-statistic: 125.5 on 3 and 502 DF,  p-value: < 2.2e-16
plot(both)

If we use stepwise regression, combining both forward and backward selection, we arrive at three important variables. The resulting formula is: rad + lstat + black

These variables appear in the best subset results, and they are exactly the three variables that the BIC-optimal subset model selects.

# we do not expect the best performance here; n >> p in this case
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5

# lasso
lasso.boston = glmnet(x.boston,(y.boston),alpha = 1)
plot(lasso.boston,"lambda",xlab = expression(paste("log(",lambda,")")))
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)

# do cv for best lambda
cv.lasso.boston = cv.glmnet(x.boston,(y.boston),alpha = 1, lambda = grid, nfolds = 10)
plot(cv.lasso.boston)

# displaying best lambda values
c(log(cv.lasso.boston$lambda.min), log(cv.lasso.boston$lambda.1se))
## [1] -3.046855  1.186180

Using the lasso, we see an optimal log(lambda) of about -3.05 and a 1 standard error value of about 1.19. Interestingly, there is a significant difference between these two models: the minimum retains 11 predictors while the 1se model retains only 1. We can analyze this further by inspecting the coefficients:

lasso.best <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.min)
coef(lasso.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 12.921616863
## zn           0.036630369
## indus       -0.071088384
## chas        -0.592709165
## nox         -7.203117383
## rm           0.243372202
## age          .          
## dis         -0.800965160
## rad          0.515774432
## tax          .          
## ptratio     -0.194514895
## black       -0.007544322
## lstat        0.125137815
## medv        -0.160384184
lasso.1se <- glmnet(x.boston,(y.boston),alpha = 1, lambda = cv.lasso.boston$lambda.1se)
coef(lasso.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                    s0
## (Intercept) 1.3076511
## zn          .        
## indus       .        
## chas        .        
## nox         .        
## rm          .        
## age         .        
## dis         .        
## rad         0.2414676
## tax         .        
## ptratio     .        
## black       .        
## lstat       .        
## medv        .

We see that both agree that rad is an important feature, which is also the variable regsubsets chooses for a one-predictor model. The fact that the other features are eliminated by the L1 penalty within 1 standard error of the minimum suggests they are of much less importance.
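
A quick way to confirm this sparsity difference (not in the original) is to count the nonzero coefficients at each lambda choice; on this run this should give 11 and 1.

# number of nonzero coefficients (excluding the intercept) at each lambda
sum(coef(cv.lasso.boston, s = "lambda.min")[-1] != 0)
sum(coef(cv.lasso.boston, s = "lambda.1se")[-1] != 0)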

# ridge regression
grid = 10^(seq(2,-5,length = 100)) # defining range of lambda, from 10^2 to 10^-5

ridge.boston = glmnet(x.boston,(y.boston),alpha = 0, lambda = grid)

plot(ridge.boston,"lambda",ylab = "Coefficients", xlab = expression(paste("log", lambda)), main = "Predictor coefficients against log lambda")
legend("bottomright", legend = colnames(x.boston), col = 1:ncol(x.boston), lty = 1, cex = 0.6)

cv.ridge.boston = cv.glmnet(x.boston,(y.boston),alpha = 0, lambda = grid, nfolds = 10)
plot(cv.ridge.boston,xlab = expression(paste("log", lambda)))

# displaying best lambda values:
c(log(cv.ridge.boston$lambda.min), log(cv.ridge.boston$lambda.1se))
## [1] -1.255956  4.116743

Using ridge regression, we find a best log(lambda) of -1.26 and a 1 standard error value of 4.12. We can see the following coefficients for each of these:

rr.best <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.min)
coef(rr.best)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 11.836387893
## zn           0.037105939
## indus       -0.082352998
## chas        -0.732319890
## nox         -7.162927585
## rm           0.372009282
## age          0.001612269
## dis         -0.809063722
## rad          0.477612534
## tax          0.001283535
## ptratio     -0.184989001
## black       -0.008164558
## lstat        0.139250075
## medv        -0.159866428
rr.1se <- glmnet(x.boston,(y.boston),alpha = 0, lambda = cv.ridge.boston$lambda.1se)
coef(rr.1se)
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept)  1.136778807
## zn          -0.002881647
## indus        0.033300535
## chas        -0.210522818
## nox          2.165731519
## rm          -0.158704737
## age          0.007094660
## dis         -0.110224494
## rad          0.056383381
## tax          0.002520427
## ptratio      0.082968051
## black       -0.003162664
## lstat        0.042415472
## medv        -0.027788872

If we compare the coefficients at the minimum lambda with those from the lasso, they are very close. The significant difference is at the 1 standard error lambda: there, ridge regression keeps all predictors but only nox has a coefficient far from zero, much like the lasso keeping only a single nonzero coefficient; however, ridge highlights nox while the lasso highlights rad as the most important predictor. Ridge therefore also differs from regsubsets, which chose rad as the single most important predictor.

# PCR
pcr.boston = pcr(crim ~ ., data = Boston, scale = T, validation = "CV")
summary(pcr.boston)
## Data:    X dimension: 506 13 
##  Y dimension: 506 1
## Fit method: svdpc
## Number of components considered: 13
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV            8.61    7.212    7.210    6.775    6.779    6.782    6.796
## adjCV         8.61    7.209    7.206    6.769    6.772    6.777    6.789
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       6.782    6.655    6.672     6.662     6.656     6.623     6.571
## adjCV    6.775    6.648    6.663     6.653     6.648     6.612     6.560
## 
## TRAINING: % variance explained
##       1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps  8 comps
## X       47.70    60.36    69.67    76.45    82.99    88.00    91.14    93.45
## crim    30.69    30.87    39.27    39.61    39.61    39.86    40.14    42.47
##       9 comps  10 comps  11 comps  12 comps  13 comps
## X       95.40     97.04     98.46     99.52     100.0
## crim    42.55     42.78     43.04     44.13      45.4
validationplot(pcr.boston, val.type = 'MSEP')

Using PCR, we see that after around 3 principal components the error begins to plateau, with only minor improvements as the number of components increases. We will therefore take 3 principal components as the best PCR model.
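
The cross-validated MSEs behind this plot can also be extracted numerically with the pls package's MSEP() accessor (a small sketch, not part of the original):

# cross-validated MSE by number of components
# (the first entry is the intercept-only model)
round(drop(MSEP(pcr.boston, estimate = "CV")$val), 2)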

  b. Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, cross-validation, or some other reasonable alternative, as opposed to using training error.

The best performance from the simplest models came from stepwise regression and principal components regression: both showed low errors in the plots while using only 3 predictors or components. Lasso and ridge regression also performed well, although they appear noticeably worse when the 1 standard error lambda is used. We will therefore carry stepwise regression, PCR, lasso, and ridge regression forward, evaluate each with 10-fold cross-validation, and compare their MSEs to choose a final model.

Fortunately, cross-validation was already performed in the first part for all of these methods except stepwise regression, which we do now:

set.seed(123)

folds <- createFolds(Boston$crim, k = 10, list = TRUE)
mse_values <- numeric(10)
selected_models <- vector("list", length = 10)

for (i in seq_along(folds)) {
  # training and validation sets
  train_data <- Boston[-folds[[i]], ]
  test_data <- Boston[folds[[i]], ]
  
  # fit
  full_model <- lm(crim ~ ., data = train_data)
  stepwise_model <- ols_step_both_p(full_model, pent = 0.05, prem = 0.05)
  
  # store best model of fold
  selected_models[[i]] <- formula(stepwise_model$model)
  
  # get mse
  predictions <- predict(stepwise_model$model, newdata = test_data)
  mse_values[i] <- mean((test_data$crim - predictions)^2)
}

# average mse
mean_mse <- mean(mse_values)

# Output results
# cat("MSE for each fold:", mse_values, "\n")
cat("Mean MSE across all folds:", mean_mse, "\n")
## Mean MSE across all folds: 43.83496
# Print the selected models for each fold
cat("\nSelected models for each fold:\n")
## 
## Selected models for each fold:
for (i in seq_along(selected_models)) {
  cat(sprintf("Fold %d: %s\n", i, deparse(selected_models[[i]])))
}
## Fold 1: crim ~ rad + lstat + black
## Fold 2: crim ~ rad + lstat + medv + ptratio
## Fold 3: crim ~ rad + lstat + black
## Fold 4: crim ~ rad + lstat + black
## Fold 5: crim ~ rad + lstat + black
## Fold 6: crim ~ rad + lstat + black + zn + dis + nox + medv
## Fold 7: crim ~ rad + lstat
## Fold 8: crim ~ rad + lstat + black + medv
## Fold 9: crim ~ rad + lstat + black
## Fold 10: crim ~ rad + lstat + black

After performing cross-validated stepwise regression, we arrive at a mean MSE of 43.8, which is similar to the performance observed in part (a) of this question. The selected models also largely agree on the same 3 predictors as before: rad, lstat, and black.

Now we can compare all of the results to see the optimal model:

## Ridge Regression:
## Minimum MSE: 43.43256
## MSE at 1-se Lambda: 56.33671
## Lasso Regression:
## Minimum MSE: 43.62434
## MSE at 1-se Lambda: 56.43248
## Principal Component Regression (PCR):
## MSE with 3 components: 45.90038
## MSE with 8 components: 44.28734
## Minumum MSE, with 13 components: 43.18376
## Stepwise Regression:
## [1] 43.83496

Looking at the performance, the best MSEs are all around 43. Ridge regression, lasso regression, PCR with 13 components, and stepwise regression all perform within about one MSE point of each other.
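
For reference (a sketch, not the code used to produce the table above), these numbers can be pulled from the cross-validation objects fitted earlier, assuming they are still in scope:

# extract the cross-validated MSEs from the earlier fits
cat("Ridge  min:", min(cv.ridge.boston$cvm),
    " 1se:", cv.ridge.boston$cvm[cv.ridge.boston$lambda == cv.ridge.boston$lambda.1se], "\n")
cat("Lasso  min:", min(cv.lasso.boston$cvm),
    " 1se:", cv.lasso.boston$cvm[cv.lasso.boston$lambda == cv.lasso.boston$lambda.1se], "\n")
pcr.cv <- drop(MSEP(pcr.boston, estimate = "CV")$val)
cat("PCR    3 comps:", pcr.cv["3 comps"], " 13 comps:", pcr.cv["13 comps"], "\n")
cat("Stepwise (10-fold):", mean_mse, "\n")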

  c. Does your chosen model involve all of the features in the data set? Why or why not?

With the exception of the 1 standard error lambda fits, all of the models in the previous part performed at essentially the same level. We will continue with lasso or ridge because of the reduced variance their penalty terms provide. While the penalty shrinks the coefficients, the decrease in variance may prove beneficial if we apply the model to an outside dataset in a real-world scenario. Between the two, the final model we propose for predicting crime rates is the lasso, because unlike ridge regression it also performs feature selection. Returning to the coefficients at the minimum lambda, we can see which features the model considers most important, which ridge regression cannot show us:

## 14 x 1 sparse Matrix of class "dgCMatrix"
##                       s0
## (Intercept) 12.921616863
## zn           0.036630369
## indus       -0.071088384
## chas        -0.592709165
## nox         -7.203117383
## rm           0.243372202
## age          .          
## dis         -0.800965160
## rad          0.515774432
## tax          .          
## ptratio     -0.194514895
## black       -0.007544322
## lstat        0.125137815
## medv        -0.160384184

We can see that age and tax are dropped from the model due to the L1 penalty. While this does not reduce the predictor set as much as we might like, it does provide some feature selection without compromising performance.
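
As a final check (not in the original), the dropped predictors can be listed programmatically from the cross-validated lasso fit; on this run it should return age and tax.

# predictors whose lasso coefficient is exactly zero at lambda.min
b <- coef(cv.lasso.boston, s = "lambda.min")
rownames(b)[as.vector(b) == 0]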