Problem 1
set.seed(1)
x = rnorm(100)
y = x - 2*x^2 + rnorm(100)
The sample size of y, n, is
n <- 100
The number of predictors in the model, p, is
p <- 2
Since the rnorm(100) term in y serves as the error term with mean 0 and sd 1, there are two predictors: x and x^2.
plot(x,y, pch = 19)
There is a non-linear (quadratic) relationship between x and y.
library(boot)
## Warning: package 'boot' was built under R version 3.6.2
set.seed(100)
X <- rnorm(100)
X.cent <- scale(X, scale=F)
Y = X - 2*X^2 + rnorm(100)
dat = data.frame(X.cent,Y)
# Calculate MSE for all models with centered predictors for the second, third, and fourth order
cv.err.c = rep(NA,3)
for (i in 1:3) {
glm.fit <- glm(Y~poly(X.cent,i+1),data = dat)
cv.err.c[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.c
## [1] 0.6511909 0.6665339 0.6671261
set.seed(50)
X <- rnorm(100)
X.cent <- scale(X, scale=F)
Y = X - 2*X^2 + rnorm(100)
dat <- data.frame(X.cent,Y)
# Calculate MSE for all models with centered predictors for the second, third, and fourth order
cv.err.d = rep(NA,3)
for (i in 1:3) {
glm.fit <- glm(Y~poly(X.cent,i+1),data = dat)
cv.err.d[i] <- cv.glm(dat, glm.fit)$delta[1]
}
cv.err.d
## [1] 0.990204 1.000547 1.013875
min.err.fold <- which.min(cv.err.c)
min.err <- cv.err.c[min.err.fold]
min.err.fold # index of the model with the smallest LOOCV error (1 = second-order model)
## [1] 1
round(min.err,4) # MSE corresponding to the model with the smallest LOOCV error
## [1] 0.6512
The second-order model has the smallest LOOCV error, which is what I expected since the true y is quadratic to begin with.
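As a sanity check, the LOOCV error of a least-squares fit can be computed directly from the leverages instead of refitting n times; a minimal sketch using the objects above (lm.fit.2, h, and loocv.manual are hypothetical names):
lm.fit.2 <- lm(Y ~ poly(X.cent, 2), data = dat)
h <- hatvalues(lm.fit.2) # leverage of each observation
loocv.manual <- mean((residuals(lm.fit.2) / (1 - h))^2)
loocv.manual # equals cv.glm's delta[1] for the same model and data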
glm.fit.2 <- glm(Y~poly(X.cent,2),data = dat)
summary(glm.fit.2)
##
## Call:
## glm(formula = Y ~ poly(X.cent, 2), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.55901 -0.67948 0.02914 0.59356 2.65331
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.19133 0.09814 -22.33 <2e-16 ***
## poly(X.cent, 2)1 15.43428 0.98140 15.73 <2e-16 ***
## poly(X.cent, 2)2 -27.28047 0.98140 -27.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9631416)
##
## Null deviance: 1075.866 on 99 degrees of freedom
## Residual deviance: 93.425 on 97 degrees of freedom
## AIC: 284.99
##
## Number of Fisher Scoring iterations: 2
glm.fit.3 <- glm(Y~poly(X.cent,4),data = dat)
summary(glm.fit.3)
##
## Call:
## glm(formula = Y ~ poly(X.cent, 4), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6894 -0.6778 0.1201 0.5571 2.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.19133 0.09866 -22.212 <2e-16 ***
## poly(X.cent, 4)1 15.43428 0.98656 15.645 <2e-16 ***
## poly(X.cent, 4)2 -27.28047 0.98656 -27.652 <2e-16 ***
## poly(X.cent, 4)3 0.56089 0.98656 0.569 0.571
## poly(X.cent, 4)4 -0.80433 0.98656 -0.815 0.417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9732968)
##
## Null deviance: 1075.866 on 99 degrees of freedom
## Residual deviance: 92.463 on 95 degrees of freedom
## AIC: 287.95
##
## Number of Fisher Scoring iterations: 2
glm.fit.4 <- glm(Y~poly(X.cent,4),data = dat)
summary(glm.fit.4)
##
## Call:
## glm(formula = Y ~ poly(X.cent, 4), data = dat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6894 -0.6778 0.1201 0.5571 2.7050
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.19133 0.09866 -22.212 <2e-16 ***
## poly(X.cent, 4)1 15.43428 0.98656 15.645 <2e-16 ***
## poly(X.cent, 4)2 -27.28047 0.98656 -27.652 <2e-16 ***
## poly(X.cent, 4)3 0.56089 0.98656 0.569 0.571
## poly(X.cent, 4)4 -0.80433 0.98656 -0.815 0.417
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.9732968)
##
## Null deviance: 1075.866 on 99 degrees of freedom
## Residual deviance: 92.463 on 95 degrees of freedom
## AIC: 287.95
##
## Number of Fisher Scoring iterations: 2
Since the true y is quadratic to begin with, it is no surprise that the first- and second-order terms are highly significant in every fitted model, while the third- and fourth-order terms are not.
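The same conclusion can be corroborated with F-tests between the nested fits; a minimal sketch using the centered data above (fit2, fit3, and fit4 are hypothetical names, and the exact p-values depend on the seed):
fit2 <- lm(Y ~ poly(X.cent, 2), data = dat)
fit3 <- lm(Y ~ poly(X.cent, 3), data = dat)
fit4 <- lm(Y ~ poly(X.cent, 4), data = dat)
anova(fit2, fit3, fit4) # the cubic and quartic terms should add no significant explanatory power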
Problem 2
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.2
## Loading required package: Matrix
## Warning: package 'Matrix' was built under R version 3.6.2
## Loaded glmnet 3.0-2
require(MASS)
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.6.2
require(faraway)
## Loading required package: faraway
##
## Attaching package: 'faraway'
## The following objects are masked from 'package:boot':
##
## logit, melanoma
require(ISLR)
## Loading required package: ISLR
## Warning: package 'ISLR' was built under R version 3.6.2
require(caret)
## Loading required package: caret
## Warning: package 'caret' was built under R version 3.6.2
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:faraway':
##
## melanoma
## The following object is masked from 'package:boot':
##
## melanoma
## Loading required package: ggplot2
require(leaps)
## Loading required package: leaps
## Warning: package 'leaps' was built under R version 3.6.2
data(College)
k=5 # set number of folds
set.seed(123)
train.control <- trainControl(method = "cv", number = k)
# Train the model
model <- train(Apps ~., data = College, method = "lm",
trControl = train.control)
# Summarize the results
summary(model)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4908.8 -430.2 -29.5 322.3 7852.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -445.08413 408.32855 -1.090 0.276053
## PrivateYes -494.14897 137.81191 -3.586 0.000358 ***
## Accept 1.58581 0.04074 38.924 < 2e-16 ***
## Enroll -0.88069 0.18596 -4.736 2.60e-06 ***
## Top10perc 49.92628 5.57824 8.950 < 2e-16 ***
## Top25perc -14.23448 4.47914 -3.178 0.001543 **
## F.Undergrad 0.05739 0.03271 1.754 0.079785 .
## P.Undergrad 0.04445 0.03214 1.383 0.167114
## Outstate -0.08587 0.01906 -4.506 7.64e-06 ***
## Room.Board 0.15103 0.04829 3.127 0.001832 **
## Books 0.02090 0.23841 0.088 0.930175
## Personal 0.03110 0.06308 0.493 0.622060
## PhD -8.67850 4.63814 -1.871 0.061714 .
## Terminal -3.33066 5.09494 -0.654 0.513492
## S.F.Ratio 15.38961 13.00622 1.183 0.237081
## perc.alumni 0.17867 4.10230 0.044 0.965273
## Expend 0.07790 0.01235 6.308 4.79e-10 ***
## Grad.Rate 8.66763 2.94893 2.939 0.003390 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1041 on 759 degrees of freedom
## Multiple R-squared: 0.9292, Adjusted R-squared: 0.9276
## F-statistic: 585.9 on 17 and 759 DF, p-value: < 2.2e-16
To summarize, the residual standard error is about 1041 on 759 degrees of freedom, and the R-squared is 0.9292 with a large F-statistic.
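The cross-validated error itself is stored in the train object; a minimal sketch for extracting it (cv.mse.lm is a hypothetical name; caret reports RMSE, Rsquared, and MAE averaged over the folds):
model$results # fold-averaged RMSE, Rsquared, and MAE
cv.mse.lm <- model$results$RMSE^2 # convert RMSE to an MSE estimate
cv.mse.lm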
require(ISLR)
require(caret)
require(tidyverse)
## Loading required package: tidyverse
## Warning: package 'tidyverse' was built under R version 3.6.2
## -- Attaching packages --------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v tibble 2.1.3 v dplyr 0.8.4
## v tidyr 1.0.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## v purrr 0.3.3
## Warning: package 'tidyr' was built under R version 3.6.2
## Warning: package 'dplyr' was built under R version 3.6.2
## -- Conflicts ------------------------------------------------------------------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x purrr::lift() masks caret::lift()
## x tidyr::pack() masks Matrix::pack()
## x dplyr::select() masks MASS::select()
## x tidyr::unpack() masks Matrix::unpack()
set.seed(123)
data(College)
# Standardize data before ridge regression
College[,-1] = apply(College[,-1], 2, scale)
# Partition the data: val indexes 80% of the rows (used as the test set below), leaving 20% for training
val = createDataPartition(College$Apps, p = 0.8, list = FALSE)
test.college = data.matrix(College[val,])
training.college = data.matrix(College[-val,])
# create grid for lambda, fit model using all lambdas
grid=10^seq(-5,3,0.01)
ridge.mod = glmnet(training.college[,-2],training.college[,2],alpha=0,lambda=grid)
# plot coefficent values as we change lambda
plot(ridge.mod,xlab="L2 Norm")
Optimize lambda using cross-validation
set.seed(123)
cv.ridge=cv.glmnet(training.college[,-2],training.college[,2],alpha=0,lambda=grid,nfolds = 5)
plot(cv.ridge)
Find the value of lambda with the lowest CV MSE
bestlam.r=cv.ridge$lambda.min
mse.r=min(cv.ridge$cvm)
bestlam.r
## [1] 0.07585776
mse.r
## [1] 0.07107457
Test MSE for ridge regression
fit.ridge=predict(ridge.mod,s=bestlam.r,test.college[,-2])
mse.ridge = mean((fit.ridge-test.college[,2])^2)
mse.ridge
## [1] 0.1586797
Test error for ridge regression is 0.1586797.
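For completeness, a minimal sketch of the corresponding test R-squared, computed from the same predictions (R2.ridge.test is a hypothetical name):
R2.ridge.test <- cor(fit.ridge, test.college[,2])^2
R2.ridge.test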
set.seed(123)
data(College)
# Standardize data before LASSO
College[,-1] = apply(College[,-1], 2, scale)
# Partition the data: val indexes 80% of the rows (used as the test set below), leaving 20% for training
val = createDataPartition(College$Apps, p = 0.8, list = FALSE)
test.college = data.matrix(College[val,])
training.college = data.matrix(College[-val,])
# Create a grid for lambda and plot lasso
grid=10^seq(-5,3,0.01)
lasso.mod=glmnet(training.college[,-2],training.college[,2],alpha=1,lambda=grid)
# check coefficent values for each value of lambda
plot(lasso.mod) # x-axis is the L1 norm, sum(|beta|)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values
Optimize lambda using cross-validation
cv.lasso=cv.glmnet(training.college[,-2],training.college[,2],alpha=1,lambda=grid, nfolds = 5)
plot(cv.lasso)
bestlam.l=cv.lasso$lambda.min
mse.l=min(cv.lasso$cvm)
bestlam.l
## [1] 0.002290868
mse.l
## [1] 0.06601061
Get coefficients for the best model and compare to OLS
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam.l)
lasso.coef
## 18 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.222981468
## Private -0.137717276
## Accept 0.486882310
## Enroll 0.245539805
## Top10perc 0.097261759
## Top25perc .
## F.Undergrad 0.163163543
## P.Undergrad -0.120246056
## Outstate -0.035947883
## Room.Board 0.113723972
## Books -0.027292368
## Personal 0.009369827
## PhD -0.034290713
## Terminal -0.037057513
## S.F.Ratio 0.072757989
## perc.alumni 0.012932329
## Expend 0.163967013
## Grad.Rate 0.033178752
Test MSE for the LASSO
fit.lasso=predict(lasso.mod,s=bestlam.l,test.college[,-2])
mse.lasso = mean((fit.lasso-test.college[,2])^2)
mse.lasso
## [1] 0.1402846
Test error for the LASSO is 0.1402846.
Problem 3
library(MASS)
library(dplyr)
library(ISLR)
library(glmnet)
data("Boston")
best.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="exhaustive")
best.sum=summary(best.mods)
best.sum
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston, nvmax = 13, method = "exhaustive")
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
Criterion-based methods to select best model
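For reference, the aic vector below is obtained from the reported BIC using the standard definitions, with n = 506 observations and p the number of predictors: AIC = -2 log L + 2p and BIC = -2 log L + p*log(n), so AIC = BIC + 2p - p*log(n). regsubsets reports BIC only up to an additive constant, and counting the intercept would shift every model by the same amount, so neither affects which model minimizes the converted AIC.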
names(best.sum)
## [1] "which" "rsq" "rss" "adjr2" "cp" "bic" "outmat" "obj"
p=1:13
aic=best.sum$bic+2*p-log(dim(Boston)[1])*p
which.min(aic)
## [1] 8
which.min(best.sum$bic)
## [1] 3
which.max(best.sum$adjr2)
## [1] 9
Plot criteria to get visual confirmation
plot(p,aic,pch=19,type="b",main="AIC")
x.AIC <- which.min(aic)
y.AIC <- aic[which.min(aic)]
points(x.AIC,y.AIC,cex=1.5,col="red",lwd=2)
plot(p,best.sum$bic,pch=19,type="b",main="BIC")
x.BIC <- which.min(best.sum$bic)
y.BIC <- best.sum$bic[which.min(best.sum$bic)]
points(x.BIC,y.BIC, cex=1.5,col="red",lwd=2)
plot(p,best.sum$adjr2,pch=19,type="b",main="adj-R2")
x.adjr2 <- which.max(best.sum$adjr2)
y.adjr2 <- best.sum$adjr2[which.max(best.sum$adjr2)]
points(x.adjr2, y.adjr2, cex = 1.5, col = "red", lwd = 2)
I conclude that the best p = 8, since AIC gives a more reasonable balance between fit and model size here than adjusted R-squared. The 8 predictors are: zn, nox, dis, rad, ptratio, black, lstat, and medv.
xvarm=names(coef(best.mods,id=8))[-1]
Boston.best=Boston[,c("crim",xvarm)]
lmod=lm(crim~.,data=Boston.best)
par(mfrow=c(2,2))
plot(lmod)
summary(lmod)
##
## Call:
## lm(formula = crim ~ ., data = Boston.best)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.860 -2.102 -0.363 0.895 75.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.683128 6.086010 3.234 0.001301 **
## zn 0.043293 0.017977 2.408 0.016394 *
## nox -12.753708 4.760157 -2.679 0.007623 **
## dis -0.918318 0.261932 -3.506 0.000496 ***
## rad 0.532617 0.049727 10.711 < 2e-16 ***
## ptratio -0.310541 0.182941 -1.697 0.090229 .
## black -0.007922 0.003615 -2.191 0.028897 *
## lstat 0.110173 0.069219 1.592 0.112097
## medv -0.174207 0.053988 -3.227 0.001334 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.428 on 497 degrees of freedom
## Multiple R-squared: 0.4505, Adjusted R-squared: 0.4416
## F-statistic: 50.92 on 8 and 497 DF, p-value: < 2.2e-16
vif(lmod)
## zn nox dis rad ptratio black lstat medv
## 2.148871 3.719176 3.718604 2.291669 1.917428 1.331764 2.986626 3.013693
Since all VIFs are acceptable, the model with 8 predictors is reasonable.
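For reference, VIF_j = 1/(1 - R_j^2), where R_j^2 is the R-squared from regressing predictor j on the remaining predictors; values below roughly 5 (or, more leniently, 10) are commonly considered acceptable, and all values here are under 4.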
Forward stepwise selection: start with the intercept-only model, add the predictor that produces the lowest RSS, then from the remaining predictors repeatedly add the one that produces the lowest RSS, continuing until all 13 predictors have been added.
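A hand-rolled sketch of this greedy search, for illustration only (preds and selected are hypothetical helpers; regsubsets below performs the same search far more efficiently):
preds <- setdiff(names(Boston), "crim")
selected <- character(0)
while (length(preds) > 0) {
  rss <- sapply(preds, function(v) {
    f <- reformulate(c(selected, v), response = "crim")
    deviance(lm(f, data = Boston)) # RSS of the candidate model
  })
  best <- names(which.min(rss)) # predictor giving the lowest RSS
  selected <- c(selected, best)
  preds <- setdiff(preds, best)
}
selected # order in which predictors enter the model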
fwd.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="forward")
fwd.sum=summary(fwd.mods)
Criterion to select best model
p = 1:13
aic=fwd.sum$bic+2*p-log(dim(Boston)[1])*p
which.max(fwd.sum$adjr2)
## [1] 9
which.min(fwd.sum$bic)
## [1] 3
which.min(aic)
## [1] 8
Plot criteria to get visual confirmation
plot(p,aic,pch=19,type="b",main="AIC")
points(which.min(aic),aic[which.min(aic)],cex=1.5,col="red",lwd=2)
plot(p,fwd.sum$bic,pch=19,type="b",main="BIC")
points(which.min(fwd.sum$bic),fwd.sum$bic[which.min(fwd.sum$bic)],cex=1.5,col="red",lwd=2)
plot(p,fwd.sum$adjr2,pch=19,type="b",main="adj-R2")
points(which.max(fwd.sum$adjr2),fwd.sum$adjr2[which.max(fwd.sum$adjr2)],cex=1.5,col="red",lwd=2)
xvarm=names(coef(fwd.mods,id=8))[-1]
Boston.best.fwd=Boston[,c("crim",xvarm)]
lmod=lm(crim~.,data=Boston.best.fwd)
par(mfrow=c(2,2))
plot(lmod)
summary(lmod)
##
## Call:
## lm(formula = crim ~ ., data = Boston.best.fwd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.860 -2.102 -0.363 0.895 75.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.683128 6.086010 3.234 0.001301 **
## zn 0.043293 0.017977 2.408 0.016394 *
## nox -12.753708 4.760157 -2.679 0.007623 **
## dis -0.918318 0.261932 -3.506 0.000496 ***
## rad 0.532617 0.049727 10.711 < 2e-16 ***
## ptratio -0.310541 0.182941 -1.697 0.090229 .
## black -0.007922 0.003615 -2.191 0.028897 *
## lstat 0.110173 0.069219 1.592 0.112097
## medv -0.174207 0.053988 -3.227 0.001334 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.428 on 497 degrees of freedom
## Multiple R-squared: 0.4505, Adjusted R-squared: 0.4416
## F-statistic: 50.92 on 8 and 497 DF, p-value: < 2.2e-16
vif(lmod)
## zn nox dis rad ptratio black lstat medv
## 2.148871 3.719176 3.718604 2.291669 1.917428 1.331764 2.986626 3.013693
By the forward stepwise method, the best subset model has 8 predictors, with the lowest AIC.
Backward stepwise selection: start with the full model, remove the predictor whose removal yields the lowest RSS, then from the resulting 12-predictor model repeatedly remove the predictor whose removal yields the lowest RSS, continuing until only one predictor remains.
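As a cross-check, base R's step() runs a comparable backward search driven by AIC; a minimal sketch (full.mod and back.mod are hypothetical names, and because step() uses AIC rather than RSS at each step it may not retain exactly the same subset):
full.mod <- lm(crim ~ ., data = Boston)
back.mod <- step(full.mod, direction = "backward", trace = 0)
names(coef(back.mod))[-1] # predictors retained by the AIC-driven search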
bkwd.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="backward")
bkwd.sum=summary(bkwd.mods)
p = 1:13
aic=bkwd.sum$bic+2*p-log(dim(Boston)[1])*p
which.max(bkwd.sum$adjr2)
## [1] 9
which.min(bkwd.sum$bic)
## [1] 4
which.min(aic)
## [1] 8
Plot criteria to get visual confirmation
plot(p,aic,pch=19,type="b",main="AIC")
points(which.min(aic),aic[which.min(aic)],cex=1.5,col="red",lwd=2)
plot(p,bkwd.sum$bic,pch=19,type="b",main="BIC")
points(which.min(bkwd.sum$bic),bkwd.sum$bic[which.min(bkwd.sum$bic)],
cex=1.5,col="red",lwd=2)
plot(p,bkwd.sum$adjr2,pch=19,type="b",main="adj-R2")
points(which.max(bkwd.sum$adjr2),bkwd.sum$adjr2[which.max(bkwd.sum$adjr2)],
cex=1.5,col="red",lwd=2)
The number of predictors in the best subset chosen by the backward stepwise method is 8: zn, nox, dis, rad, ptratio, black, lstat, and medv.
Fit best model with this subset.
xvarm=names(coef(bkwd.mods,id=8))[-1]
Boston.best.bkwd=Boston[,c("crim",xvarm)]
lmod=lm(crim~.,data=Boston.best.bkwd)
par(mfrow=c(2,2))
plot(lmod)
summary(lmod)
##
## Call:
## lm(formula = crim ~ ., data = Boston.best.bkwd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.860 -2.102 -0.363 0.895 75.702
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.683128 6.086010 3.234 0.001301 **
## zn 0.043293 0.017977 2.408 0.016394 *
## nox -12.753708 4.760157 -2.679 0.007623 **
## dis -0.918318 0.261932 -3.506 0.000496 ***
## rad 0.532617 0.049727 10.711 < 2e-16 ***
## ptratio -0.310541 0.182941 -1.697 0.090229 .
## black -0.007922 0.003615 -2.191 0.028897 *
## lstat 0.110173 0.069219 1.592 0.112097
## medv -0.174207 0.053988 -3.227 0.001334 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.428 on 497 degrees of freedom
## Multiple R-squared: 0.4505, Adjusted R-squared: 0.4416
## F-statistic: 50.92 on 8 and 497 DF, p-value: < 2.2e-16
vif(lmod)
## zn nox dis rad ptratio black lstat medv
## 2.148871 3.719176 3.718604 2.291669 1.917428 1.331764 2.986626 3.013693
Compare three different methods.
table <- rbind(names(coef(best.mods,id=8)[-1]),
names(coef(fwd.mods,id=8)[-1]),
names(coef(bkwd.mods,id=8)[-1]))
rownames(table) <- c("Exhaustive", "Forward", "backward")
table
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
## Exhaustive "zn" "nox" "dis" "rad" "ptratio" "black" "lstat" "medv"
## Forward "zn" "nox" "dis" "rad" "ptratio" "black" "lstat" "medv"
## backward "zn" "nox" "dis" "rad" "ptratio" "black" "lstat" "medv"
Boston = data.frame(Boston)
best.mods=regsubsets(crim~.,data=Boston,nvmax=13,method="exhaustive")
summary(best.mods)
## Subset selection object
## Call: regsubsets.formula(crim ~ ., data = Boston, nvmax = 13, method = "exhaustive")
## 13 Variables (and intercept)
## Forced in Forced out
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## medv FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " " " " "
## 2 ( 1 ) " " " " " " " " " " " " " " "*" " " " " " " "*" " "
## 3 ( 1 ) " " " " " " " " " " " " " " "*" " " " " "*" "*" " "
## 4 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " " " " " "*"
## 5 ( 1 ) "*" " " " " " " " " " " "*" "*" " " " " "*" " " "*"
## 6 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " " " "*" " " "*"
## 7 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" " " "*"
## 8 ( 1 ) "*" " " " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 9 ( 1 ) "*" "*" " " "*" " " " " "*" "*" " " "*" "*" "*" "*"
## 10 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" " " "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" " " "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
# use 5-fold CV to determine which best model for each number of predictors
# has the lowest estimated test error
# create function to predict from reg subsets object
pred.sbs=function(obj,new,id){
form=as.formula(obj$call[[2]])
mat=model.matrix(form,new)
coefi=coef(obj,id=id)
xvars=names(coefi)
return(mat[,xvars]%*%coefi)
}
# set up for cross validation
k=5 # set number of folds
set.seed(123)
# create an index with id 1-5 to assign observations to folds
folds=sample(1:k,nrow(Boston),replace=T)
# create dummy matrix to store CV error estimates
cv.err=matrix(NA,k,13,dimnames=list(NULL,paste(1:13)))
# perform CV
for (j in 1:k){
# pick models with lowest RSS with 1-13 predictors fit without kth fold
best.mods=regsubsets(crim~.,data=Boston[folds!=j,],
nvmax=13,method="exhaustive")
# estimate test error for all thirteen models by predicting kth fold
for (i in 1:13){
pred=pred.sbs(best.mods,Boston[folds==j,],id=i)
cv.err[j,i]=mean((Boston$crim[folds==j]-pred)^2) # save error est
}
}
mse.cv=apply(cv.err,2,mean) # compute mean MSE for each number of predictors
min=which.min(mse.cv) # find minimum mean MSE
# plot and put a red circle around lowest MSE
par(mfrow=c(1,1))
plot(1:13,mse.cv,type="b",xlab="no. of predictors",ylab="est. test MSE")
points(min,mse.cv[min],cex=2,col="red",lwd=2)
The best model, with 12 predictors, includes zn, indus, chas, nox, rm, dis, rad, tax, ptratio, black, lstat, and medv.
Compute mean MSE for each number of predictors
mse.cv=apply(cv.err,2,mean)
mse.cv
## 1 2 3 4 5 6 7 8
## 47.05419 45.09358 45.12318 44.71803 44.65900 44.55829 43.93026 44.08478
## 9 10 11 12 13
## 43.88882 43.86616 43.62821 43.54070 43.59277
Find which subset has the minimum mean MSE
min=which.min(mse.cv)
min
## 12
## 12
Refit the model with the best subset predictors.
lmod=lm(crim~zn + indus+ chas +nox +rm +dis +rad +tax +ptratio +black +lstat +medv,data=Boston)
summary(lmod)
##
## Call:
## lm(formula = crim ~ zn + indus + chas + nox + rm + dis + rad +
## tax + ptratio + black + lstat + medv, data = Boston)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.923 -2.116 -0.354 1.031 75.042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.985714 7.203806 2.358 0.018769 *
## zn 0.044673 0.018580 2.404 0.016569 *
## indus -0.063848 0.083323 -0.766 0.443879
## chas -0.744368 1.177490 -0.632 0.527572
## nox -10.202169 5.088019 -2.005 0.045495 *
## rm 0.439588 0.600994 0.731 0.464861
## dis -0.993557 0.270306 -3.676 0.000263 ***
## rad 0.587660 0.087700 6.701 5.68e-11 ***
## tax -0.003768 0.005148 -0.732 0.464618
## ptratio -0.269949 0.185739 -1.453 0.146754
## black -0.007519 0.003662 -2.053 0.040602 *
## lstat 0.128120 0.071890 1.782 0.075337 .
## medv -0.198878 0.060455 -3.290 0.001075 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.433 on 493 degrees of freedom
## Multiple R-squared: 0.454, Adjusted R-squared: 0.4407
## F-statistic: 34.16 on 12 and 493 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lmod)
vif(lmod)
## zn indus chas nox rm dis rad tax
## 2.291646 3.987749 1.091605 4.242295 2.176112 3.953771 7.116489 9.187292
## ptratio black lstat medv
## 1.973342 1.364385 3.216388 3.772843
Compare the 5-fold CV test error estimates across the selected models
mse.cv=data.frame(matrix(NA,nrow=4,ncol=2)) # create data frame for MSE estimates
colnames(mse.cv)=c("model","MSE")
R2 = data.frame(matrix(NA,nrow=4,ncol=2))
colnames(R2)=c("model","R2")
mse.cv[,1]=c("Exhaustive", "Forward", "backward","CV.Subset")
R2[,1]=c("Exhaustive", "Forward", "backward","CV.Subset")
# Fit model for Exhaustive subsets using p=8
lmod.cv=train(crim~.,data=Boston.best,method="lm",
trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
## RMSE
## 1 6.353976
mse.cv[1,2]=lmod.cv$results[2]^2
R2[1,2] =lmod.cv$results[3]
# Fit model from forward subsets using p=8
lmod.cv=train(crim~.,data=Boston.best.fwd,method="lm",
trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
## RMSE
## 1 6.104062
mse.cv[2,2]=lmod.cv$results[2]^2
R2[2,2] =lmod.cv$results[3]
# Fit model from backward subsets using p=8
lmod.cv=train(crim~.,data=Boston.best.bkwd,method="lm",
trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
## RMSE
## 1 6.416298
mse.cv[3,2]=lmod.cv$results[2]^2
R2[3,2] =lmod.cv$results[3]
# Fit model from subset selection with 5 folds cross Validation using p=12
lmod.cv=train(crim~zn + indus+ chas +nox +rm +dis +rad +tax +ptratio +black +lstat +medv,data=Boston,method="lm",
trControl=trainControl(method="cv",number=5))
lmod.cv$results[2]
## RMSE
## 1 6.125756
mse.cv[4,2]=lmod.cv$results[2]^2
R2[4,2] =lmod.cv$results[3]
# Barplot
par(mfrow=c(1,1))
mse.cv
## model MSE
## 1 Exhaustive 40.37301
## 2 Forward 37.25957
## 3 backward 41.16888
## 4 CV.Subset 37.52489
barplot(MSE~model,data=mse.cv,col="blue",ylab="test MSE estimate")
# Barplot
par(mfrow=c(1,1))
R2
## model R2
## 1 Exhaustive 0.4752034
## 2 Forward 0.4940464
## 3 backward 0.4663372
## 4 CV.Subset 0.4954353
barplot(R2~model,data=R2,col="light blue",ylab="R-squared")
# Standardize dataset
Boston = apply(Boston, 2, scale)
# create grid for lambda, fit model using all lambdas
grid=10^seq(-5,1,0.01)
lasso.mod=glmnet(Boston[,-1],Boston[,1],alpha=1,lambda=grid)
# check coefficent values for each value of lambda
plot(lasso.mod)
## Warning in regularize.values(x, y, ties, missing(ties)): collapsing to unique
## 'x' values
Optimize lambda using cross-validation
set.seed(123)
cv.lasso=cv.glmnet(Boston[,-1],Boston[,1],alpha=1,lambda=grid,nfolds = 5)
plot(cv.lasso)
bestlam.l=cv.lasso$lambda.min
mse.l=min(cv.lasso$cvm)
bestlam.l
## [1] 0.005754399
mse.l
## [1] 0.5760587
Get coefficients for the best model and compare to OLS
lasso.coef=predict(lasso.mod,type="coefficients",s=bestlam.l)
lasso.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -6.701677e-16
## zn 9.863206e-02
## indus -5.686633e-02
## chas -1.741376e-02
## nox -9.478202e-02
## rm 1.910888e-02
## age .
## dis -1.940549e-01
## rad 5.212591e-01
## tax .
## ptratio -4.788684e-02
## black -8.016399e-02
## lstat 1.040778e-01
## medv -1.697260e-01
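For the comparison with OLS mentioned above, a minimal sketch placing the two coefficient vectors side by side (ols.fit is a hypothetical name; the OLS fit uses the same standardized Boston data):
ols.fit <- lm(crim ~ ., data = as.data.frame(Boston))
cbind(OLS = coef(ols.fit), LASSO = as.numeric(lasso.coef)) # same variable order as lasso.coef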
Compare R2 for each fit
# Predict a response with test dataset
fit.lasso=predict(lasso.mod,s=bestlam.l,Boston[,-1])
R2.lasso=cor(fit.lasso,Boston[,1])^2
R2.lasso
## [,1]
## 1 0.4522557
# Standardize dataset
Boston = apply(Boston, 2, scale)
# create grid for lambda, fit model using all lambdas
grid=10^seq(-5,2,0.01) # lambda ranges from 1e-5 to 100
ridge.mod=glmnet(Boston[,-1],Boston[,1],alpha=0,lambda=grid)
# plot coefficent values as we change lambda
plot(ridge.mod,xlab="L2 Norm")
Optimize lambda using cross-validation
set.seed(123)
cv.ridge=cv.glmnet(Boston[,-1],Boston[,1],alpha=0,lambda=grid)
plot(cv.ridge)
bestlam.r=cv.ridge$lambda.min
mse.r=min(cv.ridge$cvm)
bestlam.r
## [1] 0.01862087
mse.r
## [1] 0.5758632
Get coefficients for the best model and compare to OLS
ridge.coef=predict(ridge.mod,type="coefficients",s=bestlam.r)
ridge.coef
## 14 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -5.915110e-16
## zn 1.077566e-01
## indus -6.269857e-02
## chas -2.158857e-02
## nox -1.120676e-01
## rm 3.264238e-02
## age 4.400132e-03
## dis -2.145556e-01
## rad 5.209229e-01
## tax -6.132351e-03
## ptratio -5.448785e-02
## black -8.432539e-02
## lstat 1.129551e-01
## medv -1.858328e-01
Compare R2 for each fit
fit.ridge=predict(ridge.mod,s=bestlam.r,Boston[,-1])
R2.ridge=cor(fit.ridge,Boston[,1])^2
R2.ridge
## [,1]
## 1 0.4531314
## model MSE
## 1 Ridge 0.5758632
## 2 LASSO 0.5760587
## model R2
## 1 Ridge 0.4531314
## 2 LASSO 0.4522557
## model MSE R2
## 1 Exhaustive 40.37301 0.4752034
## 2 Forward 37.25957 0.4940464
## 3 backward 41.16888 0.4663372
## 4 CV.Subset 37.52489 0.4954353
## model MSE R2
## 1 Exhaustive 40.3730085 0.4752034
## 2 Forward 37.2595679 0.4940464
## 3 backward 41.1688755 0.4663372
## 4 CV.Subset 37.5248877 0.4954353
## 5 Ridge 0.5758632 0.4531314
## 6 LASSO 0.5760587 0.4522557
I would choose the LASSO model with 11 predictors: zn, indus, chas, nox, rm, dis, rad, ptratio, black, lstat, and medv, for two reasons.
Accuracy: the LASSO model has essentially the smallest CV MSE among the methods compared (it is nearly tied with ridge), in spite of a relatively low R-squared.
Reduced variance and correlation: subset selection does not account for correlation among the predictors, which makes it difficult to isolate each variable's effect on the model. The LASSO tends to drop redundant members of a set of correlated predictors, giving more stable estimates, and like subset selection it performs variable selection. Compared with ridge regression, the LASSO is still the better choice here, since the two behave similarly except that only the LASSO performs variable selection.
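A quick illustration of the variable-selection difference using the fitted objects above (the counts depend on the chosen lambdas):
sum(as.numeric(lasso.coef)[-1] != 0) # LASSO zeroes out some coefficients (age and tax here)
sum(as.numeric(ridge.coef)[-1] != 0) # ridge shrinks but keeps all 13 predictors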
To conclude, it is reasonable to choose the LASSO as the best model because of its competitive MSE estimate and its ability to reduce variance among correlated predictors while performing variable selection.