Conceptual (1-7) & Applied (8-11)
We perform best subset, forward stepwise, and backward stepwise selection on a single data set. For each approach, we obtain p + 1 models, containing 0, 1, 2, . . . , p predictors. Explain your answers:
Which of the three models with k predictors has the smallest training RSS?
Best subset selection gives the smallest training RSS. It considers every possible model with k predictors and picks the one with the lowest training RSS, whereas forward and backward stepwise only consider models reachable from their previous step, so they can miss the training-RSS-optimal model.
Which of the three models with k predictors has the smallest test RSS?
Can't say. Best subset selects the model with the lowest training RSS among all models with k predictors, and the stepwise methods select from their restricted candidate sets, but selection by training RSS says nothing definitive about test RSS: any of the three could happen to have the smallest test RSS.
True or False:
i
The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by forward stepwise selection.
TRUE. Forward stepwise builds the (k+1)-variable model by adding one predictor to the k-variable model, so the k-variable model's predictors are always retained.
ii
The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by backward stepwise selection.
TRUE. Backward stepwise obtains the k-variable model by deleting one predictor from the (k+1)-variable model, so its predictors are a subset of the (k+1)-variable model's.
iii
The predictors in the k-variable model identified by backward stepwise are a subset of the predictors in the (k + 1)- variable model identified by forward stepwise selection.
FALSE. Forward and backward stepwise follow different search paths, so there is no guarantee of any containment between their models.
iv
The predictors in the k-variable model identified by forward stepwise are a subset of the predictors in the (k+1)-variable model identified by backward stepwise selection.
FALSE. Same reason as (iii): the two procedures' models need not be related.
v
The predictors in the k-variable model identified by best subset are a subset of the predictors in the (k + 1)-variable model identified by best subset selection.
FALSE. Best subset performs an independent exhaustive search at each size, so the best k-variable model need not be contained in the best (k+1)-variable model.
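As a quick empirical check of (i), the sketch below (simulated data; the seed and dimensions are purely illustrative) verifies that each forward stepwise model is contained in the next larger one:

# Simulated check that forward stepwise models are nested (illustrative data)
library(leaps)
set.seed(1)
X <- matrix(rnorm(100 * 6), 100, 6)
colnames(X) <- paste0("x", 1:6)
y <- as.numeric(X %*% rnorm(6) + rnorm(100))
fwd <- regsubsets(y ~ ., data = data.frame(y = y, X), method = "forward")
for (k in 1:5) {
cat("model", k, "subset of model", k + 1, ":",
all(names(coef(fwd, k)) %in% names(coef(fwd, k + 1))), "\n")
}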
For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
The lasso, relative to least squares, is:
More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. The lasso shrinks (and can zero out) coefficients, reducing variance at the cost of some bias; prediction improves when the variance reduction dominates.
Repeat (a) for ridge regression relative to least squares.
iii. Same as the lasso.
Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance. Ridge likewise shrinks the coefficients, trading a small increase in bias for a larger decrease in variance.
Repeat (a) for non-linear methods relative to least squares.
ii
More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias. Non-linear methods relax the linearity assumption, lowering bias at the cost of higher variance; they win when the bias reduction dominates.
Suppose we estimate the regression coefficients in a linear regression model by minimizing \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j})^2\] subject to
\[ \sum_{j=1}^{p} |\beta_j| \le s\]
for a particular value of s. For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
As we increase s from 0, the training RSS will:
Increase initially, and then eventually start decreasing in an inverted U shape.
Decrease initially, and then eventually start increasing in a U shape.
Steadily increase.
Steadily decrease.
Remain constant.
iv. Steadily decrease.
Increasing s relaxes the constraint on \(\beta\), making the model progressively more flexible, so the training RSS decreases monotonically: at s = 0 all coefficients are zero, and once s exceeds the \(\ell_1\) norm of the least squares solution the RSS reaches the least squares minimum.
Repeat (a) for test RSS.
ii. Decrease initially, and then eventually start increasing in a U shape.
As s grows the model becomes more flexible: test RSS falls at first and then rises once the model starts to overfit.
Repeat (a) for variance.
iii. Steadily increase.
Variance increases steadily as model flexibility increases with s.
Repeat (a) for (squared) bias.
iv. Steadily decrease.
Squared bias decreases steadily as model flexibility increases with s.
Repeat (a) for the irreducible error.
v. Remain constant.
The irreducible error does not depend on the model, and hence does not depend on s.
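A small empirical illustration of part (a) (simulated data; the seed and dimensions are arbitrary): on the lasso path, increasing the budget s corresponds to decreasing \(\lambda\), and the training RSS falls monotonically along the path.

# Training RSS along the lasso path; glmnet's lambda sequence is decreasing,
# so successive columns correspond to a growing budget s
library(glmnet)
set.seed(1)
X <- matrix(rnorm(100 * 10), 100, 10)
y <- as.numeric(X %*% rnorm(10) + rnorm(100))
fit <- glmnet(X, y, alpha = 1)
train.rss <- apply(predict(fit, newx = X), 2, function(p) sum((y - p)^2))
plot(train.rss, type = "l", xlab = "Increasing s (decreasing lambda)", ylab = "Training RSS")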
Suppose we estimate the regression coefficients in a linear regression model by minimizing \[\sum_{i=1}^{n} (y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{i,j})^2 + \lambda \sum_{j=1}^{p} \beta_j^2\] for a particular value of \(\lambda\). For parts (a) through (e), indicate which of i. through v. is correct. Justify your answer.
As we increase \(\lambda\) from 0, the training RSS will:
Increase initially, and then eventually start decreasing in an inverted U shape.
Decrease initially, and then eventually start increasing in a U shape.
Steadily increase.
Steadily decrease.
Remain constant.
iii. Steadily increase.
Increasing \(\lambda\) penalizes the coefficients more heavily, making the model less and less flexible, so the training RSS increases steadily.
Repeat (a) for test RSS.
ii. Decrease initially, and then eventually start increasing in a U shape.
Shrinking the coefficients initially curbs overfitting, so test RSS falls; past some point the model becomes too inflexible (underfits) and test RSS rises again.
Repeat (a) for variance.
iv. Steadily decrease.
Variance decreases steadily as the model becomes less flexible.
Repeat (a) for (squared) bias.
iii. Steadily increase.
Squared bias increases steadily as the model becomes less flexible.
Repeat (a) for the irreducible error.
v. Remain constant.
The irreducible error does not depend on the model, and hence does not depend on \(\lambda\).
It is well-known that ridge regression tends to give similar coefficient values to correlated variables, whereas the lasso may give quite different coefficient values to correlated variables. We will now explore this property in a very simple setting. Suppose that n = 2, p = 2, \(x_{11} = x_{12}\), \(x_{21} = x_{22}\). Furthermore, suppose that \(y_1 + y_2 = 0\), \(x_{11} + x_{21} = 0\), and \(x_{12} + x_{22} = 0\), so that the estimate for the intercept in a least squares, ridge regression, or lasso model is zero: \(\hat{\beta}_0 = 0\).
Write out the ridge regression optimization problem in this setting.
Given the setting, \(x_{11} = x_{12} = x_1\) and \(x_{21} = x_{22} = x_2\). The ridge regression problem therefore reduces to minimizing \[(y_1 - \beta_1 x_1 - \beta_2 x_1)^2 + (y_2 - \beta_1 x_2 - \beta_2 x_2)^2 + \lambda (\beta_1^2 + \beta_2^2).\]
Argue that in this setting, the ridge coefficient estimates satisfy \(\hat{\beta}_1 = \hat{\beta}_2\).
Differentiating the ridge objective with respect to \(\beta_1\) and \(\beta_2\) and setting both derivatives to zero gives two equations that are identical except for the penalty term; subtracting one from the other forces \(\hat{\beta}_1 = \hat{\beta}_2\).
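Explicitly, the two stationarity conditions are
\[ -2x_1(y_1 - \beta_1 x_1 - \beta_2 x_1) - 2x_2(y_2 - \beta_1 x_2 - \beta_2 x_2) + 2\lambda\beta_1 = 0 \]
\[ -2x_1(y_1 - \beta_1 x_1 - \beta_2 x_1) - 2x_2(y_2 - \beta_1 x_2 - \beta_2 x_2) + 2\lambda\beta_2 = 0 \]
Subtracting the second from the first leaves \(2\lambda(\beta_1 - \beta_2) = 0\), so for \(\lambda > 0\) we must have \(\hat{\beta}_1 = \hat{\beta}_2\).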
Write out the lasso optimization problem in this setting.
By the same reduction, the lasso problem is to minimize \[(y_1 - \beta_1 x_1 - \beta_2 x_1)^2 + (y_2 - \beta_1 x_2 - \beta_2 x_2)^2 + \lambda (|\beta_1| + |\beta_2|).\]
We will now explore (6.12) and (6.13) further.
Consider (6.12) \[\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} \beta_j^2\] with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.12) as a function of \(\beta_1\).
Your plot should confirm that (6.12) is solved by (6.14): \[\hat{\beta}_{j}^{R} = y_j / (1+\lambda)\]
# Ridge objective (6.12) with p = 1 for a chosen y and lambda
y <- 5
lambda <- 5
beta <- seq(-10, 10, 0.1)
plot(beta, (y - beta)^2 + lambda * beta^2, pch = 20, xlab = "beta", ylab = "Ridge objective")
# Closed-form minimizer (6.14): y / (1 + lambda)
beta.est <- y / (1 + lambda)
points(beta.est, (y - beta.est)^2 + lambda * beta.est^2, col = "red", pch = 4, lwd = 5)
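As a sanity check, the grid minimizer should coincide with the grid point nearest the closed-form solution \(y/(1+\lambda) = 5/6\):

# Grid minimizer vs. closed-form ridge solution (grid step is 0.1)
beta[which.min((y - beta)^2 + lambda * beta^2)]
beta.est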
Consider (6.13) \[\sum_{j=1}^{p} (y_j - \beta_j)^2 + \lambda \sum_{j=1}^{p} |\beta_j|\] with p = 1. For some choice of \(y_1\) and \(\lambda > 0\), plot (6.13) as a function of \(\beta_1\). Your plot should confirm that (6.13) is solved by (6.15).
# Lasso objective (6.13) with p = 1 for a chosen y and lambda
y <- 5
lambda <- 5
beta <- seq(-10, 10, 0.1)
plot(beta, (y - beta)^2 + lambda * abs(beta), pch = 20, xlab = "beta", ylab = "Lasso objective")
# (6.15): since y > lambda / 2 here, the solution is y - lambda / 2
beta.est <- y - lambda / 2
points(beta.est, (y - beta.est)^2 + lambda * abs(beta.est), col = "red", pch = 4, lwd = 5)
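For completeness, (6.15) is a soft-thresholding rule with three cases; a small helper sketch covering all of them (here \(y = 5 > \lambda/2\), so the first case applies):

# Soft-thresholding solution (6.15) for the one-variable lasso
soft.threshold <- function(y, lambda) {
if (y > lambda / 2) y - lambda / 2
else if (y < -lambda / 2) y + lambda / 2
else 0
}
soft.threshold(5, 5) # 2.5, matching beta.est above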
In this exercise, we will generate simulated data, and will then use this data to perform best subset selection.
Use the rnorm() function to generate a predictor X of length n = 100, as well as a noise vector error of length n = 100.
set.seed(1)
x<-rnorm(100)
error<-rnorm(100)
Generate a response vector Y of length n = 100 according to the model \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \epsilon\), where \(\beta_0\), \(\beta_1\), \(\beta_2\), and \(\beta_3\) are constants of your choice.
b0<-2
b1<-3
b2<-(-4)
b3<-0.5
y<-b0+b1*x+b2*x^2+b3*x^3+error
Use the regsubsets() function to perform best subset selection in order to choose the best model containing the predictors \(X, X^2, \ldots, X^{10}\). What is the best model obtained according to \(C_p\), BIC, and adjusted \(R^2\)? Show some plots to provide evidence for your answer, and report the coefficients of the best model obtained. Note you will need to use the data.frame() function to create a single data set containing both X and Y.
#install.packages("leaps")
library(leaps)
data.full<-data.frame(y=y,x=x)
regfit.full<-regsubsets(y~x+I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10,really.big = T)
reg.summary<-summary(regfit.full)
par(mfrow=c(1,3))
#plot C_p for each model size; the smallest C_p identifies the best model
plot(reg.summary$cp,xlab="Number of Variables",ylab="C_p",type="l")
points(which.min(reg.summary$cp),reg.summary$cp[which.min(reg.summary$cp)],col="blue",cex=2,pch=20)
#plot model BIC value
plot(reg.summary$bic,xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary$bic),reg.summary$bic[which.min(reg.summary$bic)],col="blue",cex=2,pch=20)
#plot adjusted R^2 for each model size; the largest value identifies the best model
plot(reg.summary$adjr2,xlab="Number of Variables",ylab="Adj R^2",type="l")
points(which.max(reg.summary$adjr2),reg.summary$adjr2[which.max(reg.summary$adjr2)],col="blue",cex=2,pch=20)
#Model
coef(regfit.full,which.max(reg.summary$adjr2))
## (Intercept) x I(x^2) I(x^5)
## 2.07219472 3.44514720 -4.15676236 0.09022577
The plots show that best subset selection picks the model with 3 predictor variables when \(C_p\), BIC, and adjusted \(R^2\) are used as the selection criteria.
The selected model uses \(X\), \(X^2\), and \(X^5\) as predictors, whereas the true relationship was generated from \(X\), \(X^2\), and \(X^3\).
Repeat (c), using forward stepwise selection and also using backwards stepwise selection. How does your answer compare to the results in (c)?
Forward Selection Approach
library(leaps)
data.full<-data.frame(y=y,x=x)
regfit.fwd<-regsubsets(y~x+I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10,really.big = T,method = "forward")
reg.summary.fwd<-summary(regfit.fwd)
par(mfrow=c(1,3))
#plot C_p for each model size; the smallest C_p identifies the best model
plot(reg.summary.fwd$cp,xlab="Number of Variables",ylab="C_p",type="l")
points(which.min(reg.summary.fwd$cp),reg.summary.fwd$cp[which.min(reg.summary.fwd$cp)],col="blue",cex=2,pch=20)
#plot model BIC value
plot(reg.summary.fwd$bic,xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary.fwd$bic),reg.summary.fwd$bic[which.min(reg.summary.fwd$bic)],col="blue",cex=2,pch=20)
#plot adjusted R^2 for each model size; the largest value identifies the best model
plot(reg.summary.fwd$adjr2,xlab="Number of Variables",ylab="Adj R^2",type="l")
points(which.max(reg.summary.fwd$adjr2),reg.summary.fwd$adjr2[which.max(reg.summary.fwd$adjr2)],col="blue",cex=2,pch=20)
#Model
coef(regfit.fwd,which.max(reg.summary.fwd$adjr2))
## (Intercept) x I(x^2) I(x^5)
## 2.07219472 3.44514720 -4.15676236 0.09022577
Forward stepwise also picks the 3-predictor model, again selecting \(X\), \(X^2\), and \(X^5\) as predictors.
Backward Selection Approach
library(leaps)
data.full<-data.frame(y=y,x=x)
regfit.bwd<-regsubsets(y~x+I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10,really.big = T,method = "backward")
reg.summary.bwd<-summary(regfit.bwd)
par(mfrow=c(1,3))
#plot C_p for each model size; the smallest C_p identifies the best model
plot(reg.summary.bwd$cp,xlab="Number of Variables",ylab="C_p",type="l")
points(which.min(reg.summary.bwd$cp),reg.summary.bwd$cp[which.min(reg.summary.bwd$cp)],col="blue",cex=2,pch=20)
#plot model BIC value
plot(reg.summary.bwd$bic,xlab="Number of Variables",ylab="BIC",type="l")
points(which.min(reg.summary.bwd$bic),reg.summary.bwd$bic[which.min(reg.summary.bwd$bic)],col="blue",cex=2,pch=20)
#plot adjusted R^2 for each model size; the largest value identifies the best model
plot(reg.summary.bwd$adjr2,xlab="Number of Variables",ylab="Adj R^2",type="l")
points(which.max(reg.summary.bwd$adjr2),reg.summary.bwd$adjr2[which.max(reg.summary.bwd$adjr2)],col="blue",cex=2,pch=20)
#Model
coef(regfit.bwd,which.max(reg.summary.bwd$adjr2))
## (Intercept) x I(x^2) I(x^5)
## 2.07219472 3.44514720 -4.15676236 0.09022577
Backward stepwise likewise selects the 3-predictor model with \(X\), \(X^2\), and \(X^5\), matching the best subset and forward stepwise results.
Now fit a lasso model to the simulated data, again using \(X, X^2, \ldots, X^{10}\) as predictors. Use cross-validation to select the optimal value of \(\lambda\). Create plots of the cross-validation error as a function of \(\lambda\). Report the resulting coefficient estimates, and discuss the results obtained.
library(glmnet)
xmat<-model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[,-1]
set.seed(1)
cv.lasso<-cv.glmnet(xmat,y,alpha=1)
plot(cv.lasso)
bestlam<-cv.lasso$lambda.min
bestlam
## [1] 0.02684502
fit.lasso<-glmnet(xmat,y,alpha=1)
predict(fit.lasso,s=bestlam,type="coefficients")[1:11,]
## (Intercept) x I(x^2) I(x^3) I(x^4)
## 2.050652e+00 3.301160e+00 -4.121223e+00 1.298972e-01 0.000000e+00
## I(x^5) I(x^6) I(x^7) I(x^8) I(x^9)
## 6.621380e-02 0.000000e+00 7.160792e-05 0.000000e+00 0.000000e+00
## I(x^10)
## 2.919318e-06
Cross-validation selects \(\lambda \approx 0.0268\) at the minimum cross-validated MSE.
Refitting the lasso with this \(\lambda\) gives the coefficient estimates above: \(X\), \(X^2\), \(X^3\), and \(X^5\) receive non-negligible weights, \(X^7\) and \(X^{10}\) are nearly zero, and the remaining coefficients are shrunk exactly to 0.
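If a sparser model were preferred, cv.glmnet also reports lambda.1se, the largest \(\lambda\) whose cross-validation error lies within one standard error of the minimum; a quick look (results depend on the seed):

# Coefficients under the one-standard-error rule; typically sparser than lambda.min
predict(fit.lasso, s = cv.lasso$lambda.1se, type = "coefficients")[1:11, ]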
Now generate a response vector Y according to the model \[Y = \beta_0 + \beta_7 X^7 + \epsilon\] and perform best subset selection and the lasso. Discuss the results obtained.
Best Subset Method
b7<-7
y <- b0 + b7 * x^7 + error
data.full <- data.frame(y = y, x = x)
regfit.full <- regsubsets(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full, nvmax = 10)
reg.summary <- summary(regfit.full)
par(mfrow = c(1, 3))
plot(reg.summary$cp, xlab = "Number of variables", ylab = "C_p", type = "l")
points(which.min(reg.summary$cp), reg.summary$cp[which.min(reg.summary$cp)], col = "red", cex = 2, pch = 20)
plot(reg.summary$bic, xlab = "Number of variables", ylab = "BIC", type = "l")
points(which.min(reg.summary$bic), reg.summary$bic[which.min(reg.summary$bic)], col = "red", cex = 2, pch = 20)
plot(reg.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R^2", type = "l")
points(which.max(reg.summary$adjr2), reg.summary$adjr2[which.max(reg.summary$adjr2)], col = "red", cex = 2, pch = 20)
coef(regfit.full,2)
## (Intercept) I(x^2) I(x^7)
## 2.0704904 -0.1417084 7.0015552
coef(regfit.full,1)
## (Intercept) I(x^7)
## 1.95894 7.00077
coef(regfit.full,4)
## (Intercept) x I(x^2) I(x^3) I(x^7)
## 2.0762524 0.2914016 -0.1617671 -0.2526527 7.0091338
We see that \(C_p\) chooses the 2-variable model, BIC chooses the 1-variable model, and adjusted \(R^2\) chooses the 4-variable model.
Lasso Model
xmat <- model.matrix(y ~ x + I(x^2) + I(x^3) + I(x^4) + I(x^5) + I(x^6) + I(x^7) + I(x^8) + I(x^9) + I(x^10), data = data.full)[, -1]
set.seed(1)
cv.lasso <- cv.glmnet(xmat, y, alpha = 1)
bestlam <- cv.lasso$lambda.min
bestlam
## [1] 13.57478
fit.lasso <- glmnet(xmat, y, alpha = 1)
predict(fit.lasso, s = bestlam, type = "coefficients")[1:11, ]
## (Intercept) x I(x^2) I(x^3) I(x^4) I(x^5)
## 2.904188 0.000000 0.000000 0.000000 0.000000 0.000000
## I(x^6) I(x^7) I(x^8) I(x^9) I(x^10)
## 0.000000 6.776797 0.000000 0.000000 0.000000
The lasso also chooses a one-variable model, keeping only \(X^7\).
Since y was simulated from an intercept and \(X^7\) alone, the lasso cleanly recovers the true model structure, although its coefficient estimates are shrunk slightly relative to the best subset fit.
In this exercise, we will predict the number of applications received using the other variables in the College data set.
Split the data set into a training set and a test set.
library(ISLR)
attach(College)
set.seed(11)
#Randomly splitting the data into training and test sets in a 7:3 ratio
subset<-sample(nrow(College),nrow(College)*0.7)
train<-College[subset,]
test<-College[-subset,]
Fit a linear model using least squares on the training set, and report the test error obtained.
ls.full<-lm(Apps~.,data=train)
summary(ls.full)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5023.2 -408.6 -46.6 336.7 7242.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -140.27742 518.76903 -0.270 0.786955
## PrivateYes -511.17144 176.18600 -2.901 0.003872 **
## Accept 1.63646 0.05066 32.302 < 2e-16 ***
## Enroll -1.08353 0.24583 -4.408 1.27e-05 ***
## Top10perc 58.88924 7.33783 8.025 6.66e-15 ***
## Top25perc -21.37240 6.10623 -3.500 0.000505 ***
## F.Undergrad 0.07107 0.04330 1.641 0.101333
## P.Undergrad 0.05855 0.03783 1.548 0.122318
## Outstate -0.08871 0.02447 -3.625 0.000317 ***
## Room.Board 0.14479 0.06418 2.256 0.024493 *
## Books -0.19750 0.29231 -0.676 0.499547
## Personal 0.03265 0.07920 0.412 0.680391
## PhD -8.78197 6.16418 -1.425 0.154844
## Terminal -4.79463 6.69874 -0.716 0.474463
## S.F.Ratio 17.93368 16.68176 1.075 0.282848
## perc.alumni 2.29991 5.51102 0.417 0.676608
## Expend 0.07830 0.01574 4.974 8.92e-07 ***
## Grad.Rate 9.14788 3.83929 2.383 0.017541 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1113 on 525 degrees of freedom
## Multiple R-squared: 0.9308, Adjusted R-squared: 0.9286
## F-statistic: 415.4 on 17 and 525 DF, p-value: < 2.2e-16
predicted.apps<-predict(ls.full,test)
testerror<-mean((test$Apps-predicted.apps)^2)
testerror
## [1] 769127.1
The mean squared error on the test set is approximately \(7.691 \times 10^5\).
Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
#create matrix for training set and test set
train.mat<-model.matrix(Apps~.,data=train)
test.mat<-model.matrix(Apps~.,data=test)
#defining a grid covering a wide range of lambda; this will be used to find the best value of lambda
grid<-10^seq(4,-2,length=100)
#fitting the ridge regression model
library(glmnet)
ridge<-glmnet(train.mat,train$Apps,alpha=0,lambda=grid,thresh = 1e-12)
#doing cross validation on model
cv.ridge<-cv.glmnet(train.mat,train$Apps,alpha=0,lambda=grid,thresh=1e-12)
#finding the lambda for which cv error is minimum on training data
bestlam.ridge<-cv.ridge$lambda.min
bestlam.ridge
## [1] 0.01
#using the lambda value obtained from cross validation for the ridge model directly on test data set to get the predicted values
pred.newridge<-predict(ridge,s=bestlam.ridge,newx =test.mat)
#Mean Square Error calculation
mean((test$Apps-pred.newridge)^2)
## [1] 769103.1
The test-set mean squared error for the ridge regression model is approximately \(7.691 \times 10^5\), slightly lower than that of the least squares model.
Fit a lasso model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
lasso<-glmnet(train.mat,train$Apps,alpha=1,lambda=grid,thresh = 1e-12)
#doing cross validation on model
cv.lasso<-cv.glmnet(train.mat,train$Apps,alpha=1,lambda=grid,thresh=1e-12)
#finding the lambda for which cv error is minimum on training data
bestlam.lasso<-cv.lasso$lambda.min
bestlam.lasso
## [1] 0.01
#using the lambda value obtained from cross validation for the lasso model directly on test data set to get the predicted values
pred.newlasso<-predict(lasso,s=bestlam.lasso,newx =test.mat)
#Mean Square Error calculation
mean((test$Apps-pred.newlasso)^2)
## [1] 769058.3
#Non-zero coefficient estimates (using the lasso's own best lambda)
predict(lasso,s=bestlam.lasso,type="coefficients")
## 19 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -326.56814587
## (Intercept) .
## PrivateYes -500.65803166
## Accept 1.54767937
## Enroll -0.46674168
## Top10perc 47.34473048
## Top25perc -12.45525428
## F.Undergrad .
## P.Undergrad 0.04640836
## Outstate -0.06891103
## Room.Board 0.12707018
## Books -0.05928521
## Personal 0.01197316
## PhD -7.03010228
## Terminal -4.56073825
## S.F.Ratio 12.44794269
## perc.alumni .
## Expend 0.07348693
## Grad.Rate 6.60642007
The test-set mean squared error for the lasso model is approximately \(7.691 \times 10^5\), slightly lower again than the ridge regression error. The lasso fit shown above retains 15 non-zero coefficient estimates (excluding the intercept).
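The count of non-zero estimates can also be computed programmatically rather than read off by eye:

# Number of non-zero slope estimates at the chosen lambda
lasso.coef <- predict(lasso, s = bestlam.lasso, type = "coefficients")
sum(lasso.coef != 0) - 1 # subtract 1 for the intercept row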
Fit a PCR model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
library(pls)
pcrmodel<-pcr(Apps~.,data=train,scale=TRUE,validation="CV")
#plotting Mean square error (MSEP) for different number of components
validationplot(pcrmodel,val.type="MSEP")
predict.pcr<-predict(pcrmodel,test,ncomp=17)
mean((test$Apps-predict.pcr)^2)
## [1] 769127.1
We see that the cross-validation error is minimized at M = 17. Using all 17 components gives a test MSE of approximately \(7.691 \times 10^5\).
This matches the least squares result, as expected: with M = 17 components PCR uses the full predictor space and is equivalent to the least squares fit.
Choosing a smaller value of M, say 5 or 7, gives a higher test MSE.
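For example (a sketch; the exact number depends on the train/test split):

# Test MSE with only 5 components, for comparison with M = 17
predict.pcr5 <- predict(pcrmodel, test, ncomp = 5)
mean((test$Apps - predict.pcr5)^2)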
Fit a PLS model on the training set, with M chosen by crossvalidation. Report the test error obtained, along with the value of M selected by cross-validation.
plsrmodel<-plsr(Apps~.,data=train,scale=TRUE,validation="CV")
#plotting Mean square error (MSEP) for different number of components
validationplot(plsrmodel,val.type="MSEP")
predict.plsr<-predict(plsrmodel,test,ncomp=10)
mean((test$Apps-predict.plsr)^2)
## [1] 775233.6
We see that the cross-validation error is minimized around M = 10. Using 10 components gives a test MSE of approximately \(7.752 \times 10^5\), somewhat higher than the least squares result.
Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
We use the test \(R^2\) statistic to gauge how accurately the number of college applications is predicted; a higher test \(R^2\) indicates higher accuracy.
#Least Square model
test.avg <- mean(test$Apps)
lm.r2 <- 1 - mean((predicted.apps - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#Ridge model
ridge.r2 <- 1 - mean((pred.newridge - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#Lasso model
lasso.r2 <- 1 - mean((pred.newlasso - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#PCR model
pcr.r2 <- 1 - mean((predict.pcr - test$Apps)^2) / mean((test.avg - test$Apps)^2)
#PLS model
pls.r2 <- 1 - mean((predict.plsr - test$Apps)^2) / mean((test.avg - test$Apps)^2)
Least Square Test R-Square: 0.918807
Ridge Model Test R-Square: 0.9188096
Lasso Model Test R-Square: 0.9188143
PCR Model Test R-Square: 0.918807
PLS Model Test R-Square: 0.9181624
The lasso model attains the highest test \(R^2\), consistent with it having the lowest test MSE among the five approaches, though all of the models perform very similarly.
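The five test \(R^2\) values can be displayed side by side using the objects already computed:

# Side-by-side comparison of the test R-squared values
r2 <- c(OLS = lm.r2, Ridge = ridge.r2, Lasso = lasso.r2, PCR = pcr.r2, PLS = pls.r2)
barplot(sort(r2), ylab = "Test R-squared", las = 2)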
We have seen that as the number of features used in a model increases, the training error will necessarily decrease, but the test error may not. We will now explore this in a simulated data set.
Generate a data set with p = 20 features, n = 1,000 observations, and an associated quantitative response vector generated according to the model \(Y = X\beta + \epsilon\), where \(\beta\) has some elements that are exactly equal to zero.
set.seed(1)
#Create a matrix of 1000*20
x<-matrix(rnorm(1000*20),1000,20)
#Create a coefficient vector beta and set several entries exactly to zero
b<-rnorm(20)
b[1]<-0
b[3]<-0
b[5]<-0
b[7]<-0
b[9]<-0
#generating error term
error<-rnorm(1000)
#multiplying x and beta matrix to get y
y<-x%*%b+error
Split your data set into a training set containing 100 observations and a test set containing 900 observations.
#Create a vector train of length 100 and assign them random index number chosen from 1 to 1000
train<-sample(seq(1000),100,replace=FALSE)
#create a vector test which contains rest of the indices
test<- (-train)
#subset x matrix for the given sequence in train and test
x.train<-x[train,]
x.test<-x[test,]
y.train<-y[train]
y.test<-y[test]
Perform best subset selection on the training set, and plot the training set MSE associated with the best model of each size.
library(leaps)
data.train <- data.frame(y = y.train, x = x.train)
regfit.full <- regsubsets(y ~ ., data = data.train, nvmax = 20)
train.mat <- model.matrix(y ~ ., data = data.train)
val.errors <- rep(NA, 20)
for (i in 1:20) {
coefi <- coef(regfit.full, id = i)
pred <- train.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((pred - y.train)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Training MSE", pch = 19, type = "b",col="orange")
Plot the test set MSE associated with the best model of each size.
data.test <- data.frame(y = y.test, x = x.test)
test.mat <- model.matrix(y ~ ., data = data.test)
val.errors <- rep(NA, 20)
for (i in 1:20) {
coefi <- coef(regfit.full, id = i)
pred <- test.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((pred - y.test)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")
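leaps provides no predict() method for regsubsets objects, which is why the predictions above are assembled by hand; a reusable helper in the usual ISLR-lab style (a sketch) avoids repeating the matrix multiplication:

# Reusable predict method for regsubsets objects (leaps does not supply one)
predict.regsubsets <- function(object, newdata, id, ...) {
form <- as.formula(object$call[[2]]) # recover the formula from the original call
mat <- model.matrix(form, newdata)
coefi <- coef(object, id = id)
mat[, names(coefi)] %*% coefi
}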
For which model size does the test set MSE take on its minimum value? Comment on your results. If it takes on its minimum value for a model containing only an intercept or a model containing all of the features, then play around with the way that you are generating the data in (a) until you come up with a scenario in which the test set MSE is minimized for an intermediate model size.
which.min(val.errors)
## [1] 14
The test set MSE is minimized for the model with 14 variables.
How does the model at which the test set MSE is minimized compare to the true model used to generate the data? Comment on the coefficient values.
coef(regfit.full,which.min(val.errors))
## (Intercept) x.2 x.4 x.8 x.10 x.11
## 0.1686140 0.3310800 -1.9043844 0.8537035 0.7885128 0.7023013
## x.12 x.13 x.14 x.15 x.16 x.17
## 0.5486439 -0.3692904 -0.8258862 -0.5603339 -0.2197001 0.3084519
## x.18 x.19 x.20
## 1.5824914 0.9099814 -0.7816566
Several coefficients in the true model were set exactly to zero (indices 1, 3, 5, 7, and 9).
The best subset model minimizing the test MSE correctly excludes all five of those variables, though its coefficient estimates for the remaining variables only approximate the true values.
Create a plot displaying \[\sqrt{\sum_{j=1}^{p} \left(\beta_j - \hat{\beta}_j^{r}\right)^2}\] for a range of values of r, where \(\hat{\beta}_j^{r}\) is the jth coefficient estimate for the best model containing r coefficients. Comment on what you observe. How does this compare to the test MSE plot from (d)?
val.errors <- rep(NA, 20)
x_cols <- colnames(x, do.NULL = FALSE, prefix = "x.")
for (i in 1:20) {
coefi <- coef(regfit.full, id = i)
# predictors absent from the model have implicit estimates of zero, so their true coefficients enter the sum squared
val.errors[i] <- sqrt(sum((b[x_cols %in% names(coefi)] - coefi[names(coefi) %in% x_cols])^2) + sum(b[!(x_cols %in% names(coefi))]^2))
}
plot(val.errors, xlab = "Number of Predictors", ylab = "Coefficient estimation error", pch = 19, type = "b", col = "orange")
The coefficient error dips sharply at three variables and decreases again for models with 14 to 20 variables, whereas the test MSE is minimized at 14 variables. Thus the model whose coefficient estimates are closest to the true parameters need not give the lowest test MSE, i.e. it need not be the best model for prediction.
We will now try to predict per capita crime rate in the Boston data set.
Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
Exploring Data
library(MASS)
attach(Boston)
dim(Boston)
## [1] 506 14
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
#Splitting Data into training and test data
set.seed(10743959)
subset<-sample(nrow(Boston),nrow(Boston)*0.70)
boston.train<-Boston[subset,]
boston.test<-Boston[-subset,]
Following is the list of approaches tried for variable selection and model building:
Full Model
Best Subset
Forward Selection
Backward Selection
Stepwise Selection
Ridge Regression
Lasso Regression
Principal Component Regression (PCR)
Partial Least Square (PLS)
Full Model
#Simple Least Square FUll model
slr.full<-lm(medv~.,data=boston.train)
slr.predict<-predict(slr.full,boston.test)
slr.MSE<-mean((slr.predict-boston.test$medv)^2)
slr.MSE
## [1] 17.96323
Best Subset
#Best Subset Model
library(leaps)
bsm<-regsubsets(medv~.,data=boston.train,nbest=1,nvmax=13)
summary(bsm)
## Subset selection object
## Call: regsubsets.formula(medv ~ ., data = boston.train, nbest = 1,
## nvmax = 13)
## 13 Variables (and intercept)
## Forced in Forced out
## crim FALSE FALSE
## zn FALSE FALSE
## indus FALSE FALSE
## chas FALSE FALSE
## nox FALSE FALSE
## rm FALSE FALSE
## age FALSE FALSE
## dis FALSE FALSE
## rad FALSE FALSE
## tax FALSE FALSE
## ptratio FALSE FALSE
## black FALSE FALSE
## lstat FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## crim zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 ( 1 ) " " " " " " " " " " " " " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*" " " " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " " " "*" " " " " " " " " "*" " " "*"
## 4 ( 1 ) " " " " " " " " " " "*" " " "*" " " " " "*" " " "*"
## 5 ( 1 ) " " " " " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 6 ( 1 ) " " "*" " " " " "*" "*" " " "*" " " " " "*" " " "*"
## 7 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" " " "*"
## 8 ( 1 ) " " "*" " " "*" "*" "*" " " "*" " " " " "*" "*" "*"
## 9 ( 1 ) " " "*" " " " " "*" "*" " " "*" "*" "*" "*" "*" "*"
## 10 ( 1 ) " " "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 11 ( 1 ) "*" "*" " " "*" "*" "*" " " "*" "*" "*" "*" "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*" "*"
#Of the 13 candidate models, the best is chosen by minimum validation error on the test set
boston.test.mat <- model.matrix(medv~ ., data = boston.test)
val.errors <- rep(NA, 13)
for (i in 1:13) {
coefi <- coef(bsm, id = i)
pred <- boston.test.mat[, names(coefi)] %*% coefi
val.errors[i] <- mean((pred - boston.test$medv)^2)
}
plot(val.errors, xlab = "Number of predictors", ylab = "Test MSE", pch = 19, type = "b",col="orange")
which.min(val.errors)
## [1] 11
coef(bsm,which.min(val.errors))
## (Intercept) crim zn chas nox
## 42.504262409 -0.051977242 0.058240750 2.579296005 -16.716321472
## rm dis rad tax ptratio
## 3.017526315 -1.598507331 0.330049548 -0.013725911 -0.904781765
## black lstat
## 0.008099312 -0.611990159
bsm.MSE<-val.errors[11]
bsm.MSE
## [1] 17.81448
Best subset selection picks one model for each predictor count (the best 1-predictor model, the best 2-predictor model, and so on) based on minimum training RSS.
To choose among these 13 candidate models, the test-set MSE is computed for each; the model with the least test MSE is selected as the final model.
We therefore select the final model with 11 predictor variables.
Forward Selection
#forward selection
nullmodel<-lm(medv ~ 1, data = boston.train)
fullmodel<-lm(medv ~ ., data = boston.train)
fwd.model<-step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel),
direction = "forward")
#Forward model coefficients
summary(fwd.model)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + zn + chas +
## black + rad + tax, data = boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5511 -2.7981 -0.5087 1.6696 24.1049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.553676 6.098471 6.814 4.28e-11 ***
## lstat -0.620575 0.058679 -10.576 < 2e-16 ***
## rm 3.068242 0.484744 6.330 7.69e-10 ***
## ptratio -0.899612 0.160989 -5.588 4.69e-08 ***
## dis -1.568736 0.229077 -6.848 3.47e-11 ***
## nox -16.280488 4.309207 -3.778 0.000186 ***
## zn 0.056446 0.016975 3.325 0.000979 ***
## chas 2.628518 1.055356 2.491 0.013224 *
## black 0.008998 0.003350 2.686 0.007590 **
## rad 0.305467 0.083179 3.672 0.000279 ***
## tax -0.013603 0.004534 -3.000 0.002898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.021 on 343 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.7138
## F-statistic: 89.05 on 10 and 343 DF, p-value: < 2.2e-16
#Forward test MSE
fwd.predict<-predict(fwd.model,boston.test)
fwd.mse<-mean((fwd.predict-boston.test$medv)^2)
fwd.mse
## [1] 18.84049
Backward Selection
#backward selection
bwd.model<-step(fullmodel,direction = "backward")
summary(bwd.model)
##
## Call:
## lm(formula = medv ~ zn + chas + nox + rm + dis + rad + tax +
## ptratio + black + lstat, data = boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5511 -2.7981 -0.5087 1.6696 24.1049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.553676 6.098471 6.814 4.28e-11 ***
## zn 0.056446 0.016975 3.325 0.000979 ***
## chas 2.628518 1.055356 2.491 0.013224 *
## nox -16.280488 4.309207 -3.778 0.000186 ***
## rm 3.068242 0.484744 6.330 7.69e-10 ***
## dis -1.568736 0.229077 -6.848 3.47e-11 ***
## rad 0.305467 0.083179 3.672 0.000279 ***
## tax -0.013603 0.004534 -3.000 0.002898 **
## ptratio -0.899612 0.160989 -5.588 4.69e-08 ***
## black 0.008998 0.003350 2.686 0.007590 **
## lstat -0.620575 0.058679 -10.576 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.021 on 343 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.7138
## F-statistic: 89.05 on 10 and 343 DF, p-value: < 2.2e-16
#Backward test MSE
bwd.predict<-predict(bwd.model,boston.test)
bwd.mse<-mean((bwd.predict-boston.test$medv)^2)
bwd.mse
## [1] 18.84049
Stepwise Selection
#stepwise selection
step.model<-step(nullmodel, scope = list(lower = nullmodel, upper = fullmodel),
direction = "both")
summary(step.model)
##
## Call:
## lm(formula = medv ~ lstat + rm + ptratio + dis + nox + zn + chas +
## black + rad + tax, data = boston.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.5511 -2.7981 -0.5087 1.6696 24.1049
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.553676 6.098471 6.814 4.28e-11 ***
## lstat -0.620575 0.058679 -10.576 < 2e-16 ***
## rm 3.068242 0.484744 6.330 7.69e-10 ***
## ptratio -0.899612 0.160989 -5.588 4.69e-08 ***
## dis -1.568736 0.229077 -6.848 3.47e-11 ***
## nox -16.280488 4.309207 -3.778 0.000186 ***
## zn 0.056446 0.016975 3.325 0.000979 ***
## chas 2.628518 1.055356 2.491 0.013224 *
## black 0.008998 0.003350 2.686 0.007590 **
## rad 0.305467 0.083179 3.672 0.000279 ***
## tax -0.013603 0.004534 -3.000 0.002898 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.021 on 343 degrees of freedom
## Multiple R-squared: 0.7219, Adjusted R-squared: 0.7138
## F-statistic: 89.05 on 10 and 343 DF, p-value: < 2.2e-16
#Stepwise test MSE
step.predict<-predict(step.model,boston.test)
step.mse<-mean((step.predict-boston.test$medv)^2)
step.mse
## [1] 18.84049
Ridge Regression
#Ridge Regression
#creating matrix for Ridge Regression
bostontrain.mat<-model.matrix(medv~.,data=boston.train)
bostontest.mat<-model.matrix(medv~.,data=boston.test)
#Defining grid that covers the wide range of lambda
grid<-10^seq(4,-2,length=100)
#Fitting the ridge regression model
library(glmnet)
ridge.model<-glmnet(bostontrain.mat,boston.train$medv,alpha=0,lambda=grid)
#Finding lambda that gives minimum MSE by cross validation on training set
boston.cv.ridge<-cv.glmnet(bostontrain.mat,boston.train$medv,alpha=0,lambda=grid)
#Visualizing MSE vs Log(lambda)
plot(boston.cv.ridge)
#finding the lambda for which MSE is minimum on training data
boston.bestlam.ridge<-boston.cv.ridge$lambda.min
boston.bestlam.ridge
## [1] 0.1417474
#fitting the Ridge model with the value of lambda obtained and finding the coefficients
ridge.model.1<-glmnet(bostontrain.mat,boston.train$medv,alpha=0,lambda=boston.bestlam.ridge)
coef(ridge.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 39.726176503
## (Intercept) .
## crim -0.043885158
## zn 0.052687856
## indus -0.005063095
## chas 2.692011518
## nox -15.182853995
## rm 3.177687894
## age -0.008481379
## dis -1.529508653
## rad 0.269963547
## tax -0.010966192
## ptratio -0.881138929
## black 0.008225794
## lstat -0.588333809
#using the lambda value obtained from cross validation for the ridge model directly on test data set to get the predicted values
boston.pred.newridge<-predict(ridge.model,s=boston.bestlam.ridge,newx=bostontest.mat)
#Mean Square Error calculation
ridge.MSE<-mean((boston.test$medv-boston.pred.newridge)^2)
ridge.MSE
## [1] 17.74425
We fit ridge regression over a grid of \(\lambda\) values and use cross-validation on the training set to choose \(\lambda\). The value minimizing the cross-validated MSE, \(\lambda \approx 0.142\), is used to build the final model and for prediction.
Lasso Regression
#Lasso Regression
lasso.model<-glmnet(bostontrain.mat,boston.train$medv,alpha=1,lambda=grid)
#finding lambda by cross validation and then finding MSE on test set using that lambda
boston.cv.lasso<-cv.glmnet(bostontrain.mat,boston.train$medv,alpha=1,lambda=grid)
#Visualizing MSE vs Log(lambda)
plot(boston.cv.lasso)
#finding the lambda for which cv error is minimum on training data
boston.bestlam.lasso<-boston.cv.lasso$lambda.min
boston.bestlam.lasso
## [1] 0.1232847
#fitting the Lasso model with the value of lambda obtained and finding the coefficients
lasso.model.1<-glmnet(bostontrain.mat,boston.train$medv,alpha=1,lambda=boston.bestlam.lasso)
coef(lasso.model.1)
## 15 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 40.033394715
## (Intercept) .
## crim -0.044466644
## zn 0.053252258
## indus -0.002505365
## chas 2.680304197
## nox -15.340063745
## rm 3.166768913
## age -0.008441933
## dis -1.541098565
## rad 0.276257251
## tax -0.011281348
## ptratio -0.883796393
## black 0.008225935
## lstat -0.590324265
#using the lambda value obtained from cross validation for the lasso model directly on test data set to get the predicted values
boston.pred.newlasso<-predict(lasso.model,s=boston.bestlam.lasso,newx=bostontest.mat)
#Mean Square Error calculation
lasso.MSE<-mean((boston.test$medv-boston.pred.newlasso)^2)
lasso.MSE
## [1] 17.76477
We fit the lasso over a grid of \(\lambda\) values and use cross-validation on the training set to choose \(\lambda\). The value minimizing the cross-validated MSE, \(\lambda \approx 0.123\), is used to build the final model and for prediction.
Principal Component Regression (PCR)
#Principal Component Regression
library(pls)
pcr.model<-pcr(medv~.,data=boston.train,scale=TRUE,validation="CV")
#13 component summary
summary(pcr.model)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.398 7.599 7.437 5.937 5.957 5.576 5.598
## adjCV 9.398 7.597 7.434 5.932 5.952 5.567 5.591
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 5.575 5.477 5.447 5.428 5.452 5.253 5.186
## adjCV 5.565 5.467 5.444 5.418 5.445 5.240 5.173
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps
## X 47.68 59.12 68.52 75.68 82.00 86.70 90.14
## medv 34.95 39.63 60.77 60.94 65.86 65.96 66.42
## 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## X 93.15 95.05 96.84 98.28 99.58 100.00
## medv 67.91 68.10 69.03 69.15 71.49 72.29
#plotting Mean square error (MSEP) for different number of components
validationplot(pcr.model,val.type="MSEP")
#using 5 components to build the final model
pcr.model.5comp<-pcr(medv~.,data=boston.train,scale=TRUE,ncomp=5)
summary(pcr.model.5comp)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: svdpc
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 47.68 59.12 68.52 75.68 82.00
## medv 34.95 39.63 60.77 60.94 65.86
#Regression coefficients (in terms of the standardized predictors) for models with 1 to 5 components
pcr.model.5comp$coefficients
## , , 1 comps
##
## medv
## crim -0.6063915501
## zn 0.5653092042
## indus -0.7708923694
## chas -0.0006292165
## nox -0.7499003656
## rm 0.4304083600
## age -0.6915127075
## dis 0.7132887312
## rad -0.7111004067
## tax -0.7475540656
## ptratio -0.4139434089
## black 0.4877422872
## lstat -0.6904890541
##
## , , 2 comps
##
## medv
## crim -1.16076039
## zn 0.04235492
## indus -0.56193240
## chas 0.63284060
## nox -0.34381707
## rm 0.58457968
## age -0.17407242
## dis 0.12326631
## rad -1.17923658
## tax -1.15681336
## ptratio -0.97561493
## black 0.97449535
## lstat -0.73255090
##
## , , 3 comps
##
## medv
## crim -0.40885303
## zn 1.14506849
## indus -0.57361682
## chas 2.11629628
## nox 0.07154186
## rm 2.92971918
## age -0.12664889
## dis -0.12661619
## rad -0.06747743
## tax -0.27611708
## ptratio -2.13668603
## black 0.02081913
## lstat -1.84931819
##
## , , 4 comps
##
## medv
## crim -0.446493901
## zn 1.082229500
## indus -0.560066308
## chas 2.339542181
## nox 0.002820896
## rm 2.929573203
## age -0.173545861
## dis -0.095552865
## rad 0.012642142
## tax -0.218882977
## ptratio -1.892652735
## black 0.159432982
## lstat -1.922300636
##
## , , 5 comps
##
## medv
## crim -0.79069410
## zn 0.31377021
## indus -0.53227588
## chas 0.97980565
## nox -0.07258062
## rm 4.11921228
## age 0.08861939
## dis -0.53232412
## rad 0.24973101
## tax -0.07788551
## ptratio -1.33947559
## black 0.49875496
## lstat -2.65801809
#Using PCR model with 5 components (explaining 80 % of variation) to predict on test data
boston.predict.pcr<-predict(pcr.model,boston.test,ncomp=5)
pcr.MSE<-mean((boston.test$medv-boston.predict.pcr)^2)
pcr.MSE
## [1] 15.96318
We perform principal component regression and examine the summary for all 13 components.
The plot of MSE versus the number of components shows that the cross-validated error keeps decreasing up to 13 components, meaning PCR is not actually reducing the dimension from p + 1 to m + 1 here.
We nevertheless select 5 components for the final model, since they explain about 80% of the variance in the predictors.
The per-component regression coefficients are also summarized above; they express each fitted model in terms of the original standardized predictors.
Partial Least Square (PLS)
The same steps as above are followed for the partial least squares method.
library(pls)
#Partial Least Square
plsr.model<-plsr(medv~.,data=boston.train,scale=TRUE,validation="CV")
#plotting Mean square error (MSEP) for different number of components
validationplot(plsr.model,val.type="MSEP")
#using 5 components to build the final model
plsr.model.5comp<-plsr(medv~.,data=boston.train,scale=TRUE,ncomp=5)
summary(plsr.model.5comp)
## Data: X dimension: 354 13
## Y dimension: 354 1
## Fit method: kernelpls
## Number of components considered: 5
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps
## X 46.42 57.12 64.14 70.68 76.34
## medv 46.90 67.90 70.21 71.38 71.87
#Regression coefficients for models with 1 to 5 components
plsr.model.5comp$coefficients
## , , 1 comps
##
## medv
## crim -0.6629002
## zn 0.6346088
## indus -0.8125719
## chas 0.3234698
## nox -0.7289083
## rm 1.1522209
## age -0.6400400
## dis 0.4186334
## rad -0.5963464
## tax -0.7481862
## ptratio -0.8083894
## black 0.5970368
## lstat -1.2711419
##
## , , 2 comps
##
## medv
## crim -0.42482542
## zn 0.52607570
## indus -0.47112526
## chas 1.40382834
## nox -0.24208308
## rm 3.48196710
## age -0.04543861
## dis -1.18593277
## rad 0.35129618
## tax -0.25239724
## ptratio -1.93617630
## black 0.69452051
## lstat -3.03566596
##
## , , 3 comps
##
## medv
## crim 0.0162841
## zn 0.5614550
## indus -0.4975755
## chas 0.5868014
## nox -0.9170569
## rm 3.3403655
## age -0.3678324
## dis -1.9690256
## rad 0.9934994
## tax -0.1655670
## ptratio -1.6703181
## black 0.8684079
## lstat -4.0099664
##
## , , 4 comps
##
## medv
## crim -0.0137188
## zn 0.5497794
## indus -0.3653050
## chas 0.5499696
## nox -1.3785074
## rm 2.5871245
## age -0.3460459
## dis -3.0540460
## rad 1.1149044
## tax -0.6409048
## ptratio -1.9283887
## black 1.1417590
## lstat -4.5137243
##
## , , 5 comps
##
## medv
## crim 0.04611646
## zn 1.10369104
## indus -0.31643319
## chas 0.79412630
## nox -1.76929120
## rm 2.19271829
## age -0.40901868
## dis -3.49536648
## rad 1.46882099
## tax -0.87414555
## ptratio -2.08221751
## black 0.81717250
## lstat -4.63089016
boston.predict.plsr<-predict(plsr.model,boston.test,ncomp=5)
plsr.MSE<-mean((boston.test$medv-boston.predict.plsr)^2)
plsr.MSE
## [1] 19.42082
Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.
Validation Set MSE is used for evaluating model performance
Below is the list of approaches with the MSE obtained on the validation set:
Full Model: 17.96
Best Subset: 17.81
Forward Selection: 18.84
Backward Selection: 18.84
Stepwise Selection: 18.84
Ridge Regression: 17.74
Lasso Regression: 17.76
Principal Component Regression (PCR): 15.96
Partial Least Square (PLS): 19.42
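These numbers can be gathered from the objects computed above and sorted for comparison:

# Collect and sort the validation-set MSEs computed above
mse <- c(Full = slr.MSE, BestSubset = bsm.MSE, Forward = fwd.mse,
Backward = bwd.mse, Stepwise = step.mse, Ridge = ridge.MSE,
Lasso = lasso.MSE, PCR = pcr.MSE, PLS = plsr.MSE)
sort(mse)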
The minimum validation MSE is obtained for the principal component regression model with 5 principal components.
The validation MSE is identical for forward, backward, and stepwise selection because all three step() searches converged to the same 10-predictor model.
The second-lowest validation MSE is obtained by ridge regression.
Does your chosen model involve all of the features in the data set? Why or why not?
The chosen PCR model uses 5 components, which are linear combinations of all 13 predictor variables, so every predictor contributes, with weights given by the component loadings.
The ridge regression model likewise retains all of the predictor variables.
This suggests that all of the predictors contribute some information for predicting the response.