Problem 1

  (a) In this data set, what is n and what is p? Write out the model used to generate the data in equation form.
  set.seed(1)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)

The sample size of y (n) is

n <- 100

The number of predictors in the model (p) is

p <- 2

In equation form, the model used to generate the data is Y = X - 2X^2 + ε, where ε ~ N(0, 1); the rnorm(100) term in y plays the role of the error with mean 0 and standard deviation 1. There are two predictors, X and X^2, so p = 2.

  (b) Create a scatterplot of X against Y. Comment on what you find.
plot(x,y, pch = 19)

The scatterplot shows a clear non-linear, roughly quadratic (concave) relationship between X and Y.
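
As an optional illustration (a sketch, not required by the question), the true mean function f(x) = x - 2x^2 can be overlaid on the scatterplot:

plot(x, y, pch = 19)
curve(x - 2*x^2, from = min(x), to = max(x), add = TRUE, col = "red", lwd = 2)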

  (c) Set a random seed, and then compute the LOOCV errors that result from fitting the following three models using least squares (centering the predictor for the second-, third-, and fourth-order models).
library(boot)
## Warning: package 'boot' was built under R version 3.6.2
set.seed(100)
X <- rnorm(100)
X.cent <- scale(X, scale=F)
Y = X - 2*X^2 + rnorm(100)
dat = data.frame(X.cent,Y)

# Calculate MSE for all models with centered predictors for the second, third, and fourth order
cv.err.c = rep(NA,3)
for (i in 1:3) {
  glm.fit <- glm(Y~poly(X.cent,i+1),data = dat)
  cv.err.c[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.c
## [1] 0.6511909 0.6665339 0.6671261
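
As an aside, for a least-squares fit the LOOCV error can also be obtained without refitting, via the leverage (hat) values. Below is a minimal sketch using the objects defined above (loocv.lm is just an illustrative helper); it should reproduce cv.err.c[1]:

# LOOCV shortcut for linear models: CV = mean( (resid / (1 - hat))^2 )
loocv.lm <- function(fit) mean((residuals(fit) / (1 - hatvalues(fit)))^2)
loocv.lm(glm(Y ~ poly(X.cent, 2), data = dat))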
  (d) Repeat (c) using another random seed, and report your results. Are your results the same as what you got in (c)? Why?
set.seed(50)
X <- rnorm(100)
X.cent <- scale(X, scale=F)
Y = X - 2*X^2 + rnorm(100)
dat <- data.frame(X.cent,Y)

# Calculate MSE for all models with centered predictors for the second, third, and fourth order
cv.err.d = rep(NA,3)
for (i in 1:3) {
  glm.fit <- glm(Y~poly(X.cent,i+1),data = dat)
  cv.err.d[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.d
## [1] 0.990204 1.000547 1.013875

No, the results are not the same as in (c). LOOCV itself involves no randomness, but a new data set (new X values and new errors) was simulated under the second seed, so the LOOCV error estimates change.
  (e) Which of the models in (c) had the smallest LOOCV error? Is this what you expected? Explain your answer.
min.err.fold <- which.min(cv.err.c)
min.err <- cv.err.c[min.err.fold]
min.err.fold     # The model number with the smallest LOOCV
## [1] 1
round(min.err,4) # MSE corresponding to the model with the smallest LOOCV
## [1] 0.6512

The second-order (quadratic) model has the smallest LOOCV error. This is what I expected, since the true relationship between Y and X is quadratic to begin with.

  (f) Comment on the statistical significance of the coefficient estimates that result from fitting each of the models in (c) using least squares. Do these results agree with the conclusions drawn based on the cross-validation results?
glm.fit.2 <- glm(Y~poly(X.cent,2),data = dat)
summary(glm.fit.2)
## 
## Call:
## glm(formula = Y ~ poly(X.cent, 2), data = dat)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.55901  -0.67948   0.02914   0.59356   2.65331  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.19133    0.09814  -22.33   <2e-16 ***
## poly(X.cent, 2)1  15.43428    0.98140   15.73   <2e-16 ***
## poly(X.cent, 2)2 -27.28047    0.98140  -27.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9631416)
## 
##     Null deviance: 1075.866  on 99  degrees of freedom
## Residual deviance:   93.425  on 97  degrees of freedom
## AIC: 284.99
## 
## Number of Fisher Scoring iterations: 2
glm.fit.3 <- glm(Y~poly(X.cent,3),data = dat)
summary(glm.fit.3)
glm.fit.4 <- glm(Y~poly(X.cent,4),data = dat)
summary(glm.fit.4)
## 
## Call:
## glm(formula = Y ~ poly(X.cent, 4), data = dat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.6894  -0.6778   0.1201   0.5571   2.7050  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -2.19133    0.09866 -22.212   <2e-16 ***
## poly(X.cent, 4)1  15.43428    0.98656  15.645   <2e-16 ***
## poly(X.cent, 4)2 -27.28047    0.98656 -27.652   <2e-16 ***
## poly(X.cent, 4)3   0.56089    0.98656   0.569    0.571    
## poly(X.cent, 4)4  -0.80433    0.98656  -0.815    0.417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9732968)
## 
##     Null deviance: 1075.866  on 99  degrees of freedom
## Residual deviance:   92.463  on 95  degrees of freedom
## AIC: 287.95
## 
## Number of Fisher Scoring iterations: 2

Since the true relationship is quadratic, the first- and second-order terms are highly significant in every model, while the third- and fourth-order terms are not. This agrees with the conclusions drawn from cross-validation, which also favored the quadratic model.
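
As a quick cross-check (a sketch on the current simulated data, not part of the original answer), a nested-model F test comparing the quadratic and quartic fits points to the same conclusion, i.e. the higher-order terms add nothing significant:

anova(glm(Y ~ poly(X.cent, 2), data = dat),
      glm(Y ~ poly(X.cent, 4), data = dat), test = "F")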

Problem 2

  (a) Split the data set into a training set and a test set. (Use 5-fold cross-validation)
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.6.2
## Loaded glmnet 3.0-2
require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.6.2
require(faraway)
## Loading required package: faraway
## 
## Attaching package: 'faraway'
## The following objects are masked from 'package:boot':
## 
##     logit, melanoma
require(ISLR)
## Loading required package: ISLR
## Warning: package 'ISLR' was built under R version 3.6.2
require(caret)
## Loading required package: caret
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
## 
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
## 
##     melanoma
## The following object is masked from 'package:boot':
## 
##     melanoma
## Loading required package: ggplot2
require(leaps)
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 3.6.2
data(College)
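# The split requested in (a) can also be created explicitly; a sketch
# (assumption: an 80/20 partition via caret, mirroring parts (c) and (d);
# train.idx, college.train and college.test are illustrative names):
set.seed(123)
train.idx <- createDataPartition(College$Apps, p = 0.8, list = FALSE)
college.train <- College[train.idx, ]
college.test  <- College[-train.idx, ]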
  (b) Fit a linear model using least squares on the training set, and report the test error obtained. (Use 5-fold cross-validation)
k=5  # set number of folds
set.seed(123)
train.control <- trainControl(method = "cv", number = k)
# Train the model
model <- train(Apps ~., data = College, method = "lm",
               trControl = train.control)
# Summarize the results
summary(model)
## 
## Call:
## lm(formula = .outcome ~ ., data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4908.8  -430.2   -29.5   322.3  7852.5 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -445.08413  408.32855  -1.090 0.276053    
## PrivateYes  -494.14897  137.81191  -3.586 0.000358 ***
## Accept         1.58581    0.04074  38.924  < 2e-16 ***
## Enroll        -0.88069    0.18596  -4.736 2.60e-06 ***
## Top10perc     49.92628    5.57824   8.950  < 2e-16 ***
## Top25perc    -14.23448    4.47914  -3.178 0.001543 ** 
## F.Undergrad    0.05739    0.03271   1.754 0.079785 .  
## P.Undergrad    0.04445    0.03214   1.383 0.167114    
## Outstate      -0.08587    0.01906  -4.506 7.64e-06 ***
## Room.Board     0.15103    0.04829   3.127 0.001832 ** 
## Books          0.02090    0.23841   0.088 0.930175    
## Personal       0.03110    0.06308   0.493 0.622060    
## PhD           -8.67850    4.63814  -1.871 0.061714 .  
## Terminal      -3.33066    5.09494  -0.654 0.513492    
## S.F.Ratio     15.38961   13.00622   1.183 0.237081    
## perc.alumni    0.17867    4.10230   0.044 0.965273    
## Expend         0.07790    0.01235   6.308 4.79e-10 ***
## Grad.Rate      8.66763    2.94893   2.939 0.003390 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1041 on 759 degrees of freedom
## Multiple R-squared:  0.9292, Adjusted R-squared:  0.9276 
## F-statistic: 585.9 on 17 and 759 DF,  p-value: < 2.2e-16

To summarize, the residual standard error is about 1041 on 759 degrees of freedom, the R-squared is 0.9292, and the overall F-statistic is large and highly significant.
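
The 5-fold CV estimate of the test error is stored in the caret fit itself. A minimal sketch of how to report it (model$results holds the resampled RMSE, R-squared and MAE):

model$results$RMSE     # cross-validated RMSE, an estimate of the test error
model$results$RMSE^2   # squared, if an estimate on the MSE scale is preferred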

  (c) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.
require(ISLR)
require(caret)
require(tidyverse)
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages --------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble  2.1.3     v dplyr   0.8.4
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## v purrr   0.3.3
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## -- Conflicts ------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x purrr::lift()   masks caret::lift()
## x tidyr::pack()   masks Matrix::pack()
## x dplyr::select() masks MASS::select()
## x tidyr::unpack() masks Matrix::unpack()
set.seed(123)
data(College)
#  Standardize data before ridge regression
College[,-1] = apply(College[,-1], 2, scale)
# Partition the rows: val indexes 80% of the observations (used below as the test set); the remaining 20% form the training set
val = createDataPartition(College$Apps, p = 0.8, list = FALSE)
test.college = data.matrix(College[val,])
training.college = data.matrix(College[-val,])
# create grid for lambda, fit model using all lambdas
grid=10^seq(-5,3,0.01) 
ridge.mod = glmnet(training.college[,-2],training.college[,2],alpha=0,lambda=grid)
# plot coefficient values as we change lambda
plot(ridge.mod,xlab="L2 Norm")

Optimize lambda using cross-validation

set.seed(123)
cv.ridge=cv.glmnet(training.college[,-2],training.college[,2],alpha=0,lambda=grid,nfolds = 5)
plot(cv.ridge)

Find the optimal lambda, i.e. the value with the lowest CV MSE

bestlam.r=cv.ridge$lambda.min
mse.r=min(cv.ridge$cvm)
bestlam.r
## [1] 0.07585776
mse.r
## [1] 0.07107457

Compute the test MSE for ridge regression

fit.ridge=predict(ridge.mod,s=bestlam.r,test.college[,-2])
mse.ridge = mean((fit.ridge-test.college[,2])^2) 
mse.ridge
## [1] 0.1586797

The test error for ridge regression is 0.1586797 (on the standardized scale of Apps).
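
Since Apps was standardized before fitting, this MSE is on the standardized scale. A sketch of how it could be expressed in the original units, assuming the unscaled data are reloaded as College.raw:

College.raw <- ISLR::College              # unscaled copy of the data
mse.ridge * var(College.raw$Apps)         # test MSE in (number of applications)^2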

  (d) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
set.seed(123)
data(College)
#  Standardize data before LASSO
College[,-1] = apply(College[,-1], 2, scale)

# Partition the rows: val indexes 80% of the observations (used below as the test set); the remaining 20% form the training set
val = createDataPartition(College$Apps, p = 0.8, list = FALSE)
test.college = data.matrix(College[val,])
training.college = data.matrix(College[-val,])

# Create a grid for lambda and plot lasso
grid=10^seq(-5,3,0.01) 
lasso.mod=glmnet(training.college[,-2],training.college[,2],alpha=1,lambda=grid)  
# check coefficient values for each value of lambda
plot(lasso.mod)  # x-axis is the L1 norm, sum(|beta|)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

Optimize lambda using cross-validation

cv.lasso=cv.glmnet(training.college[,-2],training.college[,2],alpha=1,lambda=grid, nfolds = 5)
plot(cv.lasso)

bestlam.l=cv.lasso$lambda.min
mse.l=min(cv.lasso$cvm)
bestlam.l
## [1] 0.002290868
mse.l
## [1] 0.06601061

Get the coefficient estimates at the best lambda

lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam.l)
lasso.coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
##                        1
## (Intercept)  0.222981468
## Private     -0.137717276
## Accept       0.486882310
## Enroll       0.245539805
## Top10perc    0.097261759
## Top25perc    .          
## F.Undergrad  0.163163543
## P.Undergrad -0.120246056
## Outstate    -0.035947883
## Room.Board   0.113723972
## Books       -0.027292368
## Personal     0.009369827
## PhD         -0.034290713
## Terminal    -0.037057513
## S.F.Ratio    0.072757989
## perc.alumni  0.012932329
## Expend       0.163967013
## Grad.Rate    0.033178752

Compute the test MSE for the lasso

fit.lasso=predict(lasso.mod,s=bestlam.l,test.college[,-2])
mse.lasso = mean((fit.lasso-test.college[,2])^2) 
mse.lasso
## [1] 0.1402846

The test error for the lasso is 0.1402846 (standardized scale). The lasso retains 16 of the 17 predictors with non-zero coefficient estimates; only Top25perc is shrunk exactly to zero.
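
A sketch of how to obtain that count directly from the sparse coefficient matrix:

sum(lasso.coef[-1, ] != 0)   # non-zero coefficient estimates, excluding the intercept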

Problem 3

  (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, and ridge regression. Present and discuss results for the approaches that you consider.
library(MASS)
library(dplyr)
library(ISLR)
library(glmnet)
data("Boston")
  i. Subset selection
best.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="exhaustive")
best.sum=summary(best.mods)
best.sum
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston, nvmax = 13, method = "exhaustive")
## 13 Variables  (and intercept)
##         Forced in Forced out
## zn          FALSE      FALSE
## indus       FALSE      FALSE
## chas        FALSE      FALSE
## nox         FALSE      FALSE
## rm          FALSE      FALSE
## age         FALSE      FALSE
## dis         FALSE      FALSE
## rad         FALSE      FALSE
## tax         FALSE      FALSE
## ptratio     FALSE      FALSE
## black       FALSE      FALSE
## lstat       FALSE      FALSE
## medv        FALSE      FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
##           zn  indus chas nox rm  age dis rad tax ptratio black lstat medv
## 1  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   " "   " " 
## 2  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     " "   "*"   " " 
## 3  ( 1 )  " " " "   " "  " " " " " " " " "*" " " " "     "*"   "*"   " " 
## 4  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     " "   " "   "*" 
## 5  ( 1 )  "*" " "   " "  " " " " " " "*" "*" " " " "     "*"   " "   "*" 
## 6  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " " "     "*"   " "   "*" 
## 7  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   " "   "*" 
## 8  ( 1 )  "*" " "   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 9  ( 1 )  "*" "*"   " "  "*" " " " " "*" "*" " " "*"     "*"   "*"   "*" 
## 10  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" " " "*"     "*"   "*"   "*" 
## 11  ( 1 ) "*" "*"   " "  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 12  ( 1 ) "*" "*"   "*"  "*" "*" " " "*" "*" "*" "*"     "*"   "*"   "*" 
## 13  ( 1 ) "*" "*"   "*"  "*" "*" "*" "*" "*" "*" "*"     "*"   "*"   "*"

Use criterion-based methods (AIC, BIC, adjusted R-squared) to select the best model size

names(best.sum)
## [1] "which"  "rsq"    "rss"    "adjr2"  "cp"     "bic"    "outmat" "obj"
p=1:13
# regsubsets reports BIC; convert to AIC (up to an additive constant):
# AIC = BIC + 2p - p*log(n)
aic=best.sum$bic+2*p-log(dim(Boston)[1])*p
which.min(aic)
## [1] 8
which.min(best.sum$bic)
## [1] 3
which.max(best.sum$adjr2)
## [1] 9

Plot each criterion to get visual confirmation

plot(p,aic,pch=19,type="b",main="AIC")
x.AIC <- which.min(aic)
y.AIC <- aic[which.min(aic)]
points(x.AIC,y.AIC,cex=1.5,col="red",lwd=2)

plot(p,best.sum$bic,pch=19,type="b",main="BIC")
x.BIC <- which.min(best.sum$bic)
y.BIC <- best.sum$bic[which.min(best.sum$bic)]
points(x.BIC,y.BIC, cex=1.5,col="red",lwd=2)

plot(p,best.sum$adjr2,pch=19,type="b",main="adj-R2")
x.adjr2 <- which.max(best.sum$adjr2)
y.adjr2 <- best.sum$adjr2[which.max(best.sum$adjr2)]
points(x.adjr2, y.adjr2, cex = 1.5, col = "red", lwd = 2)

I conclude that the best subset size is p = 8, based on AIC (BIC favors 3 predictors and adjusted R-squared favors 9, but the 8-predictor model balances fit and parsimony). The 8 predictors are: zn, nox, dis, rad, ptratio, black, lstat, and medv.

xvarm=names(coef(best.mods,id=8))[-1]
Boston.best=Boston[,c("crim",xvarm)]
lmod=lm(crim~.,data=Boston.best)
par(mfrow=c(2,2))
plot(lmod)

summary(lmod)
## 
## Call:
## lm(formula = crim ~ ., data = Boston.best)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.860 -2.102 -0.363  0.895 75.702 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.683128   6.086010   3.234 0.001301 ** 
## zn            0.043293   0.017977   2.408 0.016394 *  
## nox         -12.753708   4.760157  -2.679 0.007623 ** 
## dis          -0.918318   0.261932  -3.506 0.000496 ***
## rad           0.532617   0.049727  10.711  < 2e-16 ***
## ptratio      -0.310541   0.182941  -1.697 0.090229 .  
## black        -0.007922   0.003615  -2.191 0.028897 *  
## lstat         0.110173   0.069219   1.592 0.112097    
## medv         -0.174207   0.053988  -3.227 0.001334 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.428 on 497 degrees of freedom
## Multiple R-squared:  0.4505, Adjusted R-squared:  0.4416 
## F-statistic: 50.92 on 8 and 497 DF,  p-value: < 2.2e-16
vif(lmod)
##       zn      nox      dis      rad  ptratio    black    lstat     medv 
## 2.148871 3.719176 3.718604 2.291669 1.917428 1.331764 2.986626 3.013693

Since all VIFs are below 4, multicollinearity is not a concern and the model with 8 predictors is reasonable.
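
For reference, vif() computes VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors. A quick sketch of that check for nox (it should match the nox entry above):

1 / (1 - summary(lm(nox ~ . - crim, data = Boston.best))$r.squared)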

Forward stepwise selection: start with the intercept-only model, add the predictor that produces the lowest RSS, then from the remaining predictors repeatedly add the one that most reduces RSS, continuing until all 13 predictors have been added.

fwd.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="forward")
fwd.sum=summary(fwd.mods)

Use the same criteria to select the best model size

p = 1:13
aic=fwd.sum$bic+2*p-log(dim(Boston)[1])*p
which.max(fwd.sum$adjr2)
## [1] 9
which.min(fwd.sum$bic)
## [1] 3
which.min(aic)
## [1] 8

Plot criteria to get visual confirmation

plot(p,aic,pch=19,type="b",main="AIC")
points(which.min(aic),aic[which.min(aic)],cex=1.5,col="red",lwd=2)

plot(p,fwd.sum$bic,pch=19,type="b",main="BIC")
points(which.min(fwd.sum$bic),fwd.sum$bic[which.min(fwd.sum$bic)],cex=1.5,col="red",lwd=2)

plot(p,fwd.sum$adjr2,pch=19,type="b",main="adj-R2")
points(which.max(fwd.sum$adjr2),fwd.sum$adjr2[which.max(fwd.sum$adjr2)],cex=1.5,col="red",lwd=2)

xvarm=names(coef(fwd.mods,id=8))[-1]
Boston.best.fwd=Boston[,c("crim",xvarm)]
lmod=lm(crim~.,data=Boston.best.fwd)
par(mfrow=c(2,2))
plot(lmod)

summary(lmod)
## 
## Call:
## lm(formula = crim ~ ., data = Boston.best.fwd)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.860 -2.102 -0.363  0.895 75.702 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.683128   6.086010   3.234 0.001301 ** 
## zn            0.043293   0.017977   2.408 0.016394 *  
## nox         -12.753708   4.760157  -2.679 0.007623 ** 
## dis          -0.918318   0.261932  -3.506 0.000496 ***
## rad           0.532617   0.049727  10.711  < 2e-16 ***
## ptratio      -0.310541   0.182941  -1.697 0.090229 .  
## black        -0.007922   0.003615  -2.191 0.028897 *  
## lstat         0.110173   0.069219   1.592 0.112097    
## medv         -0.174207   0.053988  -3.227 0.001334 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.428 on 497 degrees of freedom
## Multiple R-squared:  0.4505, Adjusted R-squared:  0.4416 
## F-statistic: 50.92 on 8 and 497 DF,  p-value: < 2.2e-16
vif(lmod)
##       zn      nox      dis      rad  ptratio    black    lstat     medv 
## 2.148871 3.719176 3.718604 2.291669 1.917428 1.331764 2.986626 3.013693

By the forward stepwise method, the best model by AIC again has 8 predictors.

Backward stepwise selection: start with the full model, remove the predictor whose removal produces the lowest RSS, then from that 12-predictor model keep removing the predictor that yields the lowest RSS, continuing until only one predictor remains.

bkwd.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="backward")
bkwd.sum=summary(bkwd.mods)
p = 1:13
aic=bkwd.sum$bic+2*p-log(dim(Boston)[1])*p
which.max(bkwd.sum$adjr2)
## [1] 9
which.min(bkwd.sum$bic)
## [1] 4
which.min(aic)
## [1] 8

Plot criteria to get visual confirmation

plot(p,aic,pch=19,type="b",main="AIC")
points(which.min(aic),aic[which.min(aic)],cex=1.5,col="red",lwd=2)

plot(p,bkwd.sum$bic,pch=19,type="b",main="BIC")
points(which.min(bkwd.sum$bic),bkwd.sum$bic[which.min(bkwd.sum$bic)],
       cex=1.5,col="red",lwd=2)

plot(p,bkwd.sum$adjr2,pch=19,type="b",main="adj-R2")
points(which.max(bkwd.sum$adjr2),bkwd.sum$adjr2[which.max(bkwd.sum$adjr2)],
       cex=1.5,col="red",lwd=2)

The best subset chosen by the backward stepwise method (by AIC) also has 8 predictors: zn, nox, dis, rad, ptratio, black, lstat, and medv.

Fit best model with this subset.

xvarm=names(coef(bkwd.mods,id=8))[-1]
Boston.best.bkwd=Boston[,c("crim",xvarm)]
lmod=lm(crim~.,data=Boston.best.bkwd)
par(mfrow=c(2,2))
plot(lmod)

summary(lmod)
## 
## Call:
## lm(formula = crim ~ ., data = Boston.best.bkwd)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.860 -2.102 -0.363  0.895 75.702 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  19.683128   6.086010   3.234 0.001301 ** 
## zn            0.043293   0.017977   2.408 0.016394 *  
## nox         -12.753708   4.760157  -2.679 0.007623 ** 
## dis          -0.918318   0.261932  -3.506 0.000496 ***
## rad           0.532617   0.049727  10.711  < 2e-16 ***
## ptratio      -0.310541   0.182941  -1.697 0.090229 .  
## black        -0.007922   0.003615  -2.191 0.028897 *  
## lstat         0.110173   0.069219   1.592 0.112097    
## medv         -0.174207   0.053988  -3.227 0.001334 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.428 on 497 degrees of freedom
## Multiple R-squared:  0.4505, Adjusted R-squared:  0.4416 
## F-statistic: 50.92 on 8 and 497 DF,  p-value: < 2.2e-16
vif(lmod)
##       zn      nox      dis      rad  ptratio    black    lstat     medv 
## 2.148871 3.719176 3.718604 2.291669 1.917428 1.331764 2.986626 3.013693

Compare the predictors selected by the three methods.

table <- rbind(names(coef(best.mods,id=8)[-1]),
               names(coef(fwd.mods,id=8)[-1]),
               names(coef(bkwd.mods,id=8)[-1]))
rownames(table) <- c("Exhaustive", "Forward", "backward")
table
##            [,1] [,2]  [,3]  [,4]  [,5]      [,6]    [,7]    [,8]  
## Exhaustive "zn" "nox" "dis" "rad" "ptratio" "black" "lstat" "medv"
## Forward    "zn" "nox" "dis" "rad" "ptratio" "black" "lstat" "medv"
## backward   "zn" "nox" "dis" "rad" "ptratio" "black" "lstat" "medv"
Boston = data.frame(Boston)
best.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="exhaustive")
# use 5-fold CV to determine which best model for each number of predictors
# has the lowest estimated test error

# create function to predict from reg subsets object
pred.sbs=function(obj,new,id){
  form=as.formula(obj$call[[2]])
  mat=model.matrix(form,new)
  coefi=coef(obj,id=id)
  xvars=names(coefi)
  return(mat[,xvars]%*%coefi)
}
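
# Example usage of pred.sbs (a quick sketch): predictions from the
# 8-predictor exhaustive-search model for the first few observations
head(pred.sbs(best.mods, Boston, id = 8))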

# set up for cross validation
k=5  # set number of folds
set.seed(123)
# create an index with id 1-5 to assign observations to folds
folds=sample(1:k,nrow(Boston),replace=T) 
# create dummy matrix to store CV error estimates
cv.err=matrix(NA,k,13,dimnames=list(NULL,paste(1:13)))

# perform CV
for (j in 1:k){
  # pick the best models (lowest RSS) with 1-13 predictors, fit without the kth fold
  best.mods=regsubsets(crim~.,data=Boston[folds!=j,],
                       nvmax=13,method="exhaustive")
  # estimate test error for all 13 models by predicting the kth fold
  for (i in 1:13){
    pred=pred.sbs(best.mods,Boston[folds==j,],id=i)
    cv.err[j,i]=mean((Boston$crim[folds==j]-pred)^2)  # save error est
  }
}

mse.cv=apply(cv.err,2,mean) # compute mean MSE for each number of predictors
min=which.min(mse.cv)  # find minimum mean MSE

# plot and put a red circle around lowest MSE
par(mfrow=c(1,1))
plot(1:13,mse.cv,type="b",xlab="no. of predictors",ylab="est. test MSE")
points(min,mse.cv[min],cex=2,col="red",lwd=2)

The model with the smallest estimated test MSE has 12 predictors: zn, indus, chas, nox, rm, dis, rad, tax, ptratio, black, lstat, and medv (only age is excluded).

Compute mean MSE for each number of predictors

mse.cv=apply(cv.err,2,mean)
mse.cv
##        1        2        3        4        5        6        7        8 
## 47.05419 45.09358 45.12318 44.71803 44.65900 44.55829 43.93026 44.08478 
##        9       10       11       12       13 
## 43.88882 43.86616 43.62821 43.54070 43.59277

Find which subset has the minimum mean MSE

min=which.min(mse.cv)  
min
## 12 
## 12
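
The CV curve is quite flat from about seven predictors onward, so the minimum at 12 is not strongly preferred. As an aside, a one-standard-error-style check (a sketch that uses the fold-to-fold variability as a rough standard error):

se.cv <- apply(cv.err, 2, sd) / sqrt(k)   # rough SE of the CV estimate for each size
# smallest model whose CV MSE is within one SE of the minimum
which(mse.cv <= mse.cv[which.min(mse.cv)] + se.cv[which.min(mse.cv)])[1]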

Refit the model with the best subset predictors.

lmod=lm(crim~zn + indus+ chas +nox +rm +dis +rad +tax +ptratio +black +lstat +medv,data=Boston)
summary(lmod)
## 
## Call:
## lm(formula = crim ~ zn + indus + chas + nox + rm + dis + rad + 
##     tax + ptratio + black + lstat + medv, data = Boston)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.923 -2.116 -0.354  1.031 75.042 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  16.985714   7.203806   2.358 0.018769 *  
## zn            0.044673   0.018580   2.404 0.016569 *  
## indus        -0.063848   0.083323  -0.766 0.443879    
## chas         -0.744368   1.177490  -0.632 0.527572    
## nox         -10.202169   5.088019  -2.005 0.045495 *  
## rm            0.439588   0.600994   0.731 0.464861    
## dis          -0.993557   0.270306  -3.676 0.000263 ***
## rad           0.587660   0.087700   6.701 5.68e-11 ***
## tax          -0.003768   0.005148  -0.732 0.464618    
## ptratio      -0.269949   0.185739  -1.453 0.146754    
## black        -0.007519   0.003662  -2.053 0.040602 *  
## lstat         0.128120   0.071890   1.782 0.075337 .  
## medv         -0.198878   0.060455  -3.290 0.001075 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.433 on 493 degrees of freedom
## Multiple R-squared:  0.454,  Adjusted R-squared:  0.4407 
## F-statistic: 34.16 on 12 and 493 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod)

vif(lmod)
##       zn    indus     chas      nox       rm      dis      rad      tax 
## 2.291646 3.987749 1.091605 4.242295 2.176112 3.953771 7.116489 9.187292 
##  ptratio    black    lstat     medv 
## 1.973342 1.364385 3.216388 3.772843

Compare the 5-fold CV test-error estimates for the selected models

mse.cv=data.frame(matrix(NA,nrow=4,ncol=2)) # create data frame for MSE estimates
colnames(mse.cv)=c("model","MSE")
R2 = data.frame(matrix(NA,nrow=4,ncol=2))
colnames(R2)=c("model","R2")
mse.cv[,1]=c("Exhaustive", "Forward", "backward","CV.Subset")
R2[,1]=c("Exhaustive", "Forward", "backward","CV.Subset")

# Fit model for Exhaustive subsets using p=8
lmod.cv=train(crim~.,data=Boston.best,method="lm", 
              trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
##       RMSE
## 1 6.353976
mse.cv[1,2]=lmod.cv$results[2]^2
R2[1,2] =lmod.cv$results[3]
# Fit model from forward subsets using p=8
lmod.cv=train(crim~.,data=Boston.best.fwd,method="lm", 
              trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
##       RMSE
## 1 6.104062
mse.cv[2,2]=lmod.cv$results[2]^2
R2[2,2] =lmod.cv$results[3]

# Fit model from backward subsets using p=8
lmod.cv=train(crim~.,data=Boston.best.bkwd,method="lm", 
              trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
##       RMSE
## 1 6.416298
mse.cv[3,2]=lmod.cv$results[2]^2
R2[3,2] =lmod.cv$results[3]

# Fit model from subset selection with 5-fold cross-validation using p=12
lmod.cv=train(crim~zn + indus+ chas +nox +rm +dis +rad +tax +ptratio +black +lstat +medv,data=Boston,method="lm", 
              trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
##       RMSE
## 1 6.125756
mse.cv[4,2]=lmod.cv$results[2]^2
R2[4,2] =lmod.cv$results[3]
# Barplot
par(mfrow=c(1,1))
mse.cv
##        model      MSE
## 1 Exhaustive 40.37301
## 2    Forward 37.25957
## 3   backward 41.16888
## 4  CV.Subset 37.52489
barplot(MSE~model,data=mse.cv,col="blue",ylab="test MSE estimate")

# Barplot
par(mfrow=c(1,1))
R2
##        model        R2
## 1 Exhaustive 0.4752034
## 2    Forward 0.4940464
## 3   backward 0.4663372
## 4  CV.Subset 0.4954353
barplot(R2~model,data=R2,col="light blue",ylab="R-squared")

  ii. LASSO
#  Standardize dataset
Boston = apply(Boston, 2, scale)

# create grid for lambda, fit model using all lambdas
grid=10^seq(-5,1,0.01)  
lasso.mod=glmnet(Boston[,-1],Boston[,1],alpha=1,lambda=grid)  

# check coefficient values for each value of lambda
plot(lasso.mod)  
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values

Optimize lambda using cross-validation

set.seed(123)
cv.lasso=cv.glmnet(Boston[,-1],Boston[,1],alpha=1,lambda=grid,nfolds = 5)
plot(cv.lasso)

bestlam.l=cv.lasso$lambda.min
mse.l=min(cv.lasso$cvm)
bestlam.l
## [1] 0.005754399
mse.l
## [1] 0.5760587

Get the coefficient estimates at the best lambda

lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam.l)
lasso.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -6.701677e-16
## zn           9.863206e-02
## indus       -5.686633e-02
## chas        -1.741376e-02
## nox         -9.478202e-02
## rm           1.910888e-02
## age          .           
## dis         -1.940549e-01
## rad          5.212591e-01
## tax          .           
## ptratio     -4.788684e-02
## black       -8.016399e-02
## lstat        1.040778e-01
## medv        -1.697260e-01
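
A sketch of how to count the predictors the lasso keeps (age and tax are shrunk exactly to zero, leaving 11 of the 13 predictors):

sum(lasso.coef[-1, ] != 0)   # non-zero lasso coefficients, excluding the intercept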

Compute R-squared for the lasso fit (for comparison across methods)

# Predict the response on the full (standardized) data set
fit.lasso=predict(lasso.mod,s=bestlam.l,Boston[,-1])
R2.lasso=cor(fit.lasso,Boston[,1])^2
R2.lasso
##        [,1]
## 1 0.4522557
  iii. Ridge regression
#  Standardize dataset
Boston = apply(Boston, 2, scale)

# create grid for lambda, fit model using all lambdas
grid=10^seq(-5,2,0.01) # lambda ranges from 1e-05 to 100
ridge.mod=glmnet(Boston[,-1],Boston[,1],alpha=0,lambda=grid)  

# plot coefficient values as we change lambda
plot(ridge.mod,xlab="L2 Norm")  

Optimize lambda using cross-validation

set.seed(123)
cv.ridge=cv.glmnet(Boston[,-1],Boston[,1],alpha=0,lambda=grid)
plot(cv.ridge)

bestlam.r=cv.ridge$lambda.min
mse.r=min(cv.ridge$cvm)
bestlam.r
## [1] 0.01862087
mse.r
## [1] 0.5758632

Get the coefficient estimates at the best lambda

ridge.coef=predict(ridge.mod,type="coefficients",s=bestlam.r)
ridge.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
##                         1
## (Intercept) -5.915110e-16
## zn           1.077566e-01
## indus       -6.269857e-02
## chas        -2.158857e-02
## nox         -1.120676e-01
## rm           3.264238e-02
## age          4.400132e-03
## dis         -2.145556e-01
## rad          5.209229e-01
## tax         -6.132351e-03
## ptratio     -5.448785e-02
## black       -8.432539e-02
## lstat        1.129551e-01
## medv        -1.858328e-01

Compute R-squared for the ridge fit (for comparison across methods)

fit.ridge=predict(ridge.mod,s=bestlam.r,Boston[,-1])
R2.ridge=cor(fit.ridge,Boston[,1])^2
R2.ridge
##        [,1]
## 1 0.4531314
  (b) Discuss trade-offs related to prediction accuracy vs. interpretability, evaluate the various models, and give your reasoning for choosing the best model (or set of models if you propose more than one).
##   model       MSE
## 1 Ridge 0.5758632
## 2 LASSO 0.5760587
##   model        R2
## 1 Ridge 0.4531314
## 2 LASSO 0.4522557
##        model      MSE        R2
## 1 Exhaustive 40.37301 0.4752034
## 2    Forward 37.25957 0.4940464
## 3   backward 41.16888 0.4663372
## 4  CV.Subset 37.52489 0.4954353
##        model        MSE        R2
## 1 Exhaustive 40.3730085 0.4752034
## 2    Forward 37.2595679 0.4940464
## 3   backward 41.1688755 0.4663372
## 4  CV.Subset 37.5248877 0.4954353
## 5      Ridge  0.5758632 0.4531314
## 6      LASSO  0.5760587 0.4522557

I would choose the LASSO model with the 11 retained predictors (zn, indus, chas, nox, rm, dis, rad, ptratio, black, lstat, and medv) for two reasons:

  1. Accuracy: the LASSO's cross-validated MSE (0.5761, on the standardized response) is essentially the same as ridge regression's (0.5759), so little prediction accuracy is given up; note that the subset-selection MSEs are on the original scale of crim and are not directly comparable. The R-squared values of all the methods are similar (roughly 0.45-0.50).

  2. Reduced variance and handling of correlated predictors: subset selection does not account for correlation among predictors, which makes it hard to isolate each variable's effect. The LASSO's shrinkage reduces variance and tends to drop redundant, correlated predictors, while still performing variable selection like subset selection. Compared with ridge regression, which shrinks coefficients but never sets them exactly to zero, the LASSO yields a sparser, more interpretable model at essentially the same accuracy.

To conclude, it is reasonable to choose the LASSO as the best model: it achieves prediction accuracy comparable to ridge regression while performing variable selection, giving a more interpretable model with reduced variance.