For parts (a) through (c), indicate which of i. through iv. is correct. Justify your answer.
(a) The lasso, relative to least squares, is:
i. More flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
ii. More flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
iii. Less flexible and hence will give improved prediction accuracy when its increase in bias is less than its decrease in variance.
iv. Less flexible and hence will give improved prediction accuracy when its increase in variance is less than its decrease in bias.
A) (iii)
Justification:
Relative to least squares, the lasso is an alternative fitting procedure that uses shrinkage: the coefficients of some predictors are set exactly to zero, with the goal of reducing variance and improving prediction accuracy at the cost of a slight increase in bias. Compared to least squares, the lasso adds a penalty with tuning parameter \(\lambda\) that shrinks the coefficients, some of them exactly to zero. Because the fitted model effectively uses only the predictors that influence the response, this is a less flexible approach with lower variance.
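For reference, the lasso coefficient estimates minimize the penalized residual sum of squares
\[
\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} \;+\; \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert ,
\]
where the \(\ell_1\) penalty is what forces some coefficients to be exactly zero as \(\lambda\) grows.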
(b) Repeat (a) for ridge regression relative to least squares. A) (iii)
Justification:
Relative to least squares, ridge regression is an alternative fitting procedure that uses shrinkage: the coefficients are shrunk toward zero (but not set exactly to zero), with the goal of reducing variance and improving prediction accuracy at the cost of a slight increase in bias. Compared to least squares, ridge regression adds a penalty with tuning parameter \(\lambda\) that shrinks the coefficient estimates toward zero. Because the coefficient estimates are constrained in this way, ridge regression is a less flexible approach with lower variance.
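For reference, ridge regression instead minimizes
\[
\sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^{2} \;+\; \lambda\sum_{j=1}^{p}\beta_j^{2},
\]
where the \(\ell_2\) penalty shrinks coefficients toward zero but, unlike the lasso, does not set any of them exactly to zero.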
(c) Repeat (a) for non-linear methods relative to least squares. A) (ii)
Justification:
Non-linear methods, compared to least squares, make fewer assumptions: they do not impose a linear functional form on the relationship between the response and the predictors, which reduces bias and makes them more flexible. A non-linear fit will therefore improve prediction accuracy when the decrease in bias outweighs the accompanying increase in variance, i.e. when the bias-variance trade-off works in its favor.
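This trade-off can be written explicitly: for a test point \(x_0\), the expected test MSE decomposes as
\[
E\bigl(y_0 - \hat f(x_0)\bigr)^{2} \;=\; \mathrm{Var}\bigl(\hat f(x_0)\bigr) + \bigl[\mathrm{Bias}\bigl(\hat f(x_0)\bigr)\bigr]^{2} + \mathrm{Var}(\varepsilon),
\]
so a more flexible non-linear fit improves prediction accuracy only when the reduction in squared bias exceeds the increase in variance.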
In this exercise, we will predict the number of applications received using the other variables in the College data set.
(a) Split the data set into a training set and a test set.
library(ISLR)     # College data set
library(glmnet)   # ridge regression and the lasso
library(pls)      # PCR and PLS
mycollege <- College
# Convert the predictors to a matrix and the response to a vector
set.seed(12)
x = model.matrix(Apps ~ ., mycollege)[, -1]
y = mycollege$Apps
# Split the data into training and validation sets (70-30 split)
coltrain = sample(1:nrow(x), nrow(x)/1.43)  # row indexes for the training set
coltest = (-coltrain)                       # row indexes for the test set
ytest = y[coltest]
The data are split into training and test/validation sets in a 70:30 ratio.
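The divisor 1.43 above simply yields roughly 70% of the rows. A minimal sketch of an equivalent, arguably clearer way to express the same split (train_idx and test_idx are names introduced here for illustration only):
train_idx = sample(1:nrow(x), floor(0.7 * nrow(x)))  # ~70% of rows for training
test_idx  = setdiff(1:nrow(x), train_idx)            # remaining ~30% for the test set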
(b) Fit a linear model using least squares on the training set, and report the test error obtained.
col_lm <- lm(Apps ~ ., data = mycollege[coltrain,])
summary(col_lm)
##
## Call:
## lm(formula = Apps ~ ., data = mycollege[coltrain, ])
##
## Residuals:
## Min 1Q Median 3Q Max
## -2859.5 -427.9 -33.5 339.5 6074.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 86.95605 460.53700 0.189 0.85031
## PrivateYes -670.11120 156.80692 -4.273 2.29e-05 ***
## Accept 1.43861 0.05969 24.103 < 2e-16 ***
## Enroll -0.43665 0.21050 -2.074 0.03854 *
## Top10perc 45.82557 6.17884 7.417 4.86e-13 ***
## Top25perc -12.03688 4.83039 -2.492 0.01301 *
## F.Undergrad 0.04186 0.03689 1.135 0.25698
## P.Undergrad 0.04527 0.03373 1.342 0.18011
## Outstate -0.06886 0.02151 -3.201 0.00145 **
## Room.Board 0.09760 0.05198 1.878 0.06099 .
## Books -0.01949 0.28463 -0.068 0.94544
## Personal -0.08695 0.06873 -1.265 0.20641
## PhD -9.30044 5.21205 -1.784 0.07493 .
## Terminal -4.76292 5.56306 -0.856 0.39230
## S.F.Ratio 6.62384 14.38187 0.461 0.64530
## perc.alumni -6.66355 4.58755 -1.453 0.14695
## Expend 0.10279 0.01791 5.740 1.60e-08 ***
## Grad.Rate 9.64039 3.20526 3.008 0.00276 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 946.1 on 525 degrees of freedom
## Multiple R-squared: 0.929, Adjusted R-squared: 0.9267
## F-statistic: 404.3 on 17 and 525 DF, p-value: < 2.2e-16
#prediction error(MSE)
col_lm_test <- predict(col_lm,mycollege[coltest,])
lm_MSE <- mean((col_lm_test-ytest)^2)
lm_MSE
## [1] 1637411
The test error for the linear model fit by least squares is \(MSE = 1637411\).
(c) Fit a ridge regression model on the training set, with \(\lambda\) chosen by cross-validation. Report the test error obtained.
#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)
ridge_cv <- cv.glmnet(x[coltrain,],y[coltrain],alpha=0,lambda = grid )
plot(ridge_cv)
bestlam=ridge_cv$lambda.min
#predicting the MSE using the best lambda
ridge_pred=predict(ridge_cv,s=bestlam,newx=x[coltest,])
ridge_MSE <- mean((ridge_pred -ytest)^2)
ridge_MSE
## [1] 1737203
# Check whether ridge regression at lambda = 0 gives the same result as least squares
ridge_pred=predict(ridge_cv,s=0,newx=x[coltest,])
mean((ridge_pred -ytest)^2)
## [1] 1637852
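Note that the prediction at s = 0 above is interpolated from the fitted \(\lambda\) grid. To force an exact refit at \(\lambda = 0\), glmnet's exact = TRUE option could be used; the sketch below assumes a recent glmnet version, in which the original training data must be resupplied, and introduces ridge_mod and ridge_exact purely for illustration.
ridge_mod = glmnet(x[coltrain, ], y[coltrain], alpha = 0, lambda = grid)
ridge_exact = predict(ridge_mod, s = 0, newx = x[coltest, ], exact = TRUE,
                      x = x[coltrain, ], y = y[coltrain])
mean((ridge_exact - ytest)^2)  # should essentially match the least-squares test MSE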
A) Ridge regression with \(\lambda\) selected by cross-validation has a higher test MSE than the linear model fit by least squares: test \(MSE = 1737203\).
(d) Fit a lasso model on the training set, with \(\lambda\) chosen by cross validation. Report the test error obtained, along with the number of non-zero coefficient estimates.
#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)
lasso_cv <- cv.glmnet(x[coltrain,], y[coltrain], alpha = 1, lambda = grid, standardize = TRUE, thresh = 1e-12)
plot(lasso_cv)
lasso_best=lasso_cv$lambda.min
#predicting the MSE using the best lambda
lasso_pred=predict(lasso_cv,s=lasso_best,newx=x[coltest,])
lasso_MSE <- mean((lasso_pred -ytest)^2)
lasso_MSE
## [1] 1716777
# Refit the lasso on the full data set at the selected lambda and extract the coefficients
lasso_best = glmnet(x, y, alpha = 1, lambda = lasso_cv$lambda.min)
lasso_coef = predict(lasso_best, type = "coefficients", s = lasso_cv$lambda.min)[1:18, ]
length(lasso_coef[lasso_coef!=0])
## [1] 16
# length(lasso_coef[lasso_coef==0])
lasso_coef
## (Intercept) PrivateYes Accept Enroll Top10perc
## -553.21870806 -458.45853912 1.49894031 -0.34176381 39.26867154
## Top25perc F.Undergrad P.Undergrad Outstate Room.Board
## -6.32553495 0.00000000 0.03387459 -0.06843036 0.13491214
## Books Personal PhD Terminal S.F.Ratio
## 0.00000000 0.01149799 -6.73104428 -3.10748806 8.57911039
## perc.alumni Expend Grad.Rate
## -0.72236835 0.07249811 6.31855075
A) The lasso with \(\lambda\) selected by cross-validation has a higher test MSE than the linear model fit by least squares, but a lower one than ridge regression: test \(MSE = 1716777\).
The number of non-zero coefficients (\(\beta_j \neq 0\)) is 16 out of 18, including the intercept.
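As a quick cross-check, the non-zero coefficients can also be counted directly from the cross-validated fit. This is a sketch using lasso_cv (which was fit on the training set only), so the count may differ slightly from the full-data refit above; cv_coef is a name introduced here.
cv_coef = coef(lasso_cv, s = "lambda.min")  # sparse 18 x 1 coefficient matrix
sum(cv_coef != 0)                           # number of non-zero coefficients, including the intercept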
(e) Fit a PCR model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(12)
pcr_fit=pcr(Apps~., data=mycollege, subset=coltrain, scale=TRUE, validation ="CV")
validationplot(pcr_fit, val.type="MSEP")
From the validation plot, the cross-validated MSE drops sharply up to about \(M = 5\) components and changes little after that, with another noticeable drop around \(M = 16\). The strict minimum occurs at \(M = 17\), which is simply the full least-squares fit, so \(M = 16\) is used below.
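Rather than reading M off the plot, the cross-validated MSEP can be inspected numerically. A sketch (msep_cv is a name introduced here; the first entry of the returned array is the intercept-only model, hence the subtraction of 1):
msep_cv = MSEP(pcr_fit, estimate = "CV")$val  # CV MSEP for 0, 1, ..., 17 components
which.min(msep_cv) - 1                        # component count with the lowest CV MSEP (here the full 17-component model)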
pcr_pred=predict(pcr_fit,x[coltest,],ncomp =16)
pcr_MSE <- mean((pcr_pred - ytest)^2)
pcr_MSE
## [1] 1762233
PCR test \(MSE = 1762233\), which is higher than the least-squares linear fit and also higher than the lasso and ridge test MSEs.
# Refit PCR on the full data set with the selected number of components
pcr_fit_f = pcr(Apps ~ ., data = mycollege, scale = TRUE, ncomp = 16)
summary(pcr_fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: svdpc
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3499 3400 1719 1683 1336 1307 1257
## adjCV 3499 3399 1717 1679 1325 1301 1254
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 1226 1224 1172 1173 1178 1170 1171
## adjCV 1220 1220 1169 1172 1176 1169 1169
## 14 comps 15 comps 16 comps 17 comps
## CV 1172 1165 980.9 970.8
## adjCV 1170 1164 979.0 968.7
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 31.939 58.46 64.99 70.84 76.26 81.3 84.84 88.15
## Apps 5.969 76.38 77.63 85.99 86.65 87.6 88.41 88.44
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 91.06 93.47 95.49 97.14 98.16 98.99 99.47
## Apps 89.16 89.23 89.23 89.40 89.45 89.45 89.63
## 16 comps 17 comps
## X 99.84 100.0
## Apps 92.67 92.9
(f) Fit a PLS model on the training set, with M chosen by cross validation. Report the test error obtained, along with the value of M selected by cross-validation.
set.seed(12)
pls_fit=plsr(Apps~.,data=mycollege,subset=coltrain,scale=TRUE,validation ="CV")
summary(pls_fit)
## Data: X dimension: 543 17
## Y dimension: 543 1
## Fit method: kernelpls
## Number of components considered: 17
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 3499 1542 1288 1143 1097 1052 992.9
## adjCV 3499 1540 1289 1142 1094 1047 989.7
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 977 975.3 974.6 975.2 973.6 972.3 971.6
## adjCV 975 973.1 972.4 972.9 971.3 970.0 969.4
## 14 comps 15 comps 16 comps 17 comps
## CV 971.2 970.8 970.8 970.8
## adjCV 969.0 968.7 968.6 968.7
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 26.96 44.86 64.10 67.53 70.89 74.27 78.00 81.31
## Apps 81.01 86.93 89.73 90.84 91.97 92.68 92.81 92.85
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 84.10 86.75 90.09 91.62 93.39 95.17 97.15
## Apps 92.87 92.88 92.89 92.90 92.90 92.90 92.90
## 16 comps 17 comps
## X 99.06 100.0
## Apps 92.90 92.9
validationplot(pls_fit,val.type="MSEP")
pls.pred=predict(pls_fit,x[coltest,],ncomp =13)
pls_MSE <- mean((pls.pred -ytest)^2)
\(M = 13\) is selected because the cross-validated RMSEP is essentially flat beyond that point (971.6 at 13 components versus 970.8 at 15-17 components), so adding further components buys almost nothing while most of the variance in Apps is already explained.
# Refit PLS on the full data set with the selected number of components
pls_fit_f = plsr(Apps ~ ., data = mycollege, scale = TRUE, ncomp = 13)
summary(pls_fit_f)
## Data: X dimension: 777 17
## Y dimension: 777 1
## Fit method: kernelpls
## Number of components considered: 17
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 25.76 40.33 62.59 64.97 66.87 71.33 75.39 79.37
## Apps 78.01 85.14 87.67 90.73 92.63 92.72 92.77 92.82
## 9 comps 10 comps 11 comps 12 comps 13 comps 14 comps 15 comps
## X 82.36 85.04 87.92 90.65 92.69 95.50 96.87
## Apps 92.87 92.89 92.90 92.91 92.92 92.92 92.92
## 16 comps 17 comps
## X 98.65 100.00
## Apps 92.92 92.92
PLS test \(MSE = 1635841\), which is the lowest among all of the models above.
(g) Comment on the results obtained. How accurately can we predict the number of college applications received? Is there much difference among the test errors resulting from these five approaches?
# MSE analysis
library(dplyr)        # for the pipe operator
library(highcharter)  # interactive column charts
df_MSE <- data.frame(model = c("Lm", "Ridge", "Lasso", "PCR", "PLS"),
                     MSE = c(lm_MSE, ridge_MSE, lasso_MSE, pcr_MSE, pls_MSE))
hc <- df_MSE %>%
  hchart('column', hcaes(x = model, y = MSE))
hc %>%
hc_add_theme(hc_theme_economist()) %>%
hc_title(text = "test MSE across various models")
# R^2 analysis: 1 - MSE / Var(y_test), equivalent to 1 - RSS/TSS on the test set
avg_test = mean(ytest)
sum_of_sq_tot = mean((avg_test - ytest)^2)  # mean squared deviation of the test responses
lm_r2 = round(1 - (lm_MSE/sum_of_sq_tot),4)*100
ridge_r2 = round(1 - (ridge_MSE/sum_of_sq_tot),4)*100
lasso_r2 = round(1 - (lasso_MSE/sum_of_sq_tot),4)*100
pcr_r2 = round(1 - (pcr_MSE/sum_of_sq_tot),4)*100
pls_r2 = round(1 - (pls_MSE/sum_of_sq_tot),4)*100
df_r2 <- data.frame(model=c("Lm(92.33%)","Ridge(91.86%)","Lasso(91.96%)","PCR(91.75%)","PLS(92.34%)"),
r2 = c(lm_r2,ridge_r2,lasso_r2,pcr_r2,pls_r2))
hc <- df_r2 %>%
hchart('column', hcaes(x = model, y = r2))
hc %>%
hc_add_theme(hc_theme_economist()) %>%
hc_title(text = "% Variation explained across various models")
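If highcharter is not available, the same comparison can be drawn with base R; a minimal sketch using the data frames built above:
barplot(df_MSE$MSE, names.arg = df_MSE$model,
        main = "Test MSE across various models", ylab = "Test MSE")
barplot(df_r2$r2, names.arg = df_r2$model,
        main = "% variation explained across various models", ylab = "Test R^2 (%)")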
A) Analysis: Of all the models, PLS has the lowest test \(MSE\), and each model explains over 90% of the variation in the number of college applications received on the test set. The least-squares linear model and PLS give the most accurate predictions and the lowest test errors, so there is not much practical difference among the test errors of these five approaches.
We will now try to predict per capita crime rate in the Boston dataset. (a) Try out some of the regression methods explored in this chapter, such as best subset selection, the lasso, ridge regression, and PCR. Present and discuss results for the approaches that you consider.
library(MASS)    # Boston data set
library(leaps)   # regsubsets() for best subset selection
myboston <- Boston
set.seed(452)
#Predict function for the subset regularization
predict.regsubsets <- function(object, newdata, id, ...) {
form = as.formula(object$call[[2]])
mat = model.matrix(form, newdata)
coefi = coef(object, id = id)
xvars = names(coefi)
mat[, xvars] %*% coefi
}
# 10-fold cross-validation: allocate each observation to one of k = 10 folds
# and create a matrix in which to store the test error for each fold and model size
k=10
folds=sample (1:k,nrow(myboston),replace=TRUE)
cv.errors=matrix(NA,k,13, dimnames=list(NULL,paste(1:13)))
# Run best subset selection on each of the 10 training folds and collect the MSEs on the held-out folds
for (j in 1:k) {
best.fit = regsubsets(crim ~ ., data = myboston[folds != j, ], nvmax = 13)
for (i in 1:13) {
pred = predict(best.fit, myboston[folds == j, ], id = i)
cv.errors[j, i] = mean((myboston$crim[folds == j] - pred)^2)
}
}
mean_errors = apply(cv.errors, 2, mean)
mean_errors
## 1 2 3 4 5 6 7 8
## 45.74325 44.85431 44.63879 44.83412 44.55958 44.50076 44.27417 43.77496
## 9 10 11 12 13
## 43.46779 43.46229 43.34575 43.44321 43.44798
plot(mean_errors, type = "b", xlab = "Number of variables", ylab = "mean CV error")
The cross-validation error is lowest for the model with 11 variables: mean CV error = 43.346.
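Having chosen the 11-variable model by cross-validation, the usual next step is to refit best subset selection on the full data set and report that model's coefficients; a minimal sketch (reg_best is a name introduced here):
reg_best = regsubsets(crim ~ ., data = myboston, nvmax = 13)
coef(reg_best, 11)  # coefficients of the best 11-variable model fit on all observations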
set.seed(12)
x = model.matrix(crim~.,myboston)[,-1]
y = myboston$crim
#split the data into training and validation sets(70- 30 split)
bostrain = sample(1:nrow(x), nrow(x)/1.43) #Row indexes
bostest = (-bostrain)#Row indexes for test set
ytest = y[bostest]
#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)
lasso_cv <- cv.glmnet(x[bostrain,], y[bostrain], alpha = 1, lambda = grid, standardize = TRUE, thresh = 1e-12)
plot(lasso_cv)
lasso_best=lasso_cv$lambda.min
#predicting the MSE using the best lambda
lasso_pred=predict(lasso_cv,s=lasso_best,newx=x[bostest,])
lasso_MSE <- mean((lasso_pred -ytest)^2)
lasso_MSE
## [1] 12.90178
# Refit the lasso on the full data set at the selected lambda and extract the coefficients
lasso_best = glmnet(x, y, alpha = 1, lambda = lasso_cv$lambda.min)
# Note: [1:13, ] keeps only the first 13 of the 14 coefficients (intercept + 13 predictors), so medv is not displayed
lasso_coef = predict(lasso_best, type = "coefficients", s = lasso_cv$lambda.min)[1:13, ]
length(lasso_coef[lasso_coef!=0])
## [1] 11
# length(lasso_coef[lasso_coef==0])
lasso_coef
## (Intercept) zn indus chas nox rm
## 11.368044740 0.034360583 -0.062583150 -0.555543117 -5.684442646 0.148783449
## age dis rad tax ptratio black
## 0.000000000 -0.714300559 0.506525182 0.000000000 -0.156258741 -0.007558842
## lstat
## 0.121609089
The lasso has shrunk the coefficients of tax and age exactly to zero, leaving 10 of the 12 displayed predictors in the model (medv is omitted by the [1:13, ] indexing above). Lasso test \(MSE = 12.90178\). The selected \(\lambda = 0.07\) is small, indicating that the fit is close to the least-squares model.
#selecting lambda using 10-fold cross validation
set.seed(12)
grid=10^seq(10,-2,length=100)
ridge_cv <- cv.glmnet(x[bostrain,],y[bostrain],alpha=0,lambda = grid )
plot(ridge_cv)
bestlam=ridge_cv$lambda.min
#predicting the MSE using the best lambda
ridge_pred=predict(ridge_cv,s=bestlam,newx=x[bostest,])
ridge_MSE <- mean((ridge_pred -ytest)^2)
ridge_MSE
## [1] 13.42227
Ridge regression test \(MSE = 13.42227\). The selected \(\lambda = 0.163\) is small, so the ridge fit, and hence its test MSE, is close to that of least squares.
set.seed(12)
pcr_fit=pcr(crim~., data=myboston, subset=bostrain, scale=TRUE, validation ="CV")
validationplot(pcr_fit, val.type="MSEP")
pcr_pred=predict(pcr_fit,x[bostest,],ncomp =8)
pcr_MSE <- mean((pcr_pred - ytest)^2)
pcr_MSE
## [1] 13.4287
PCR test \(MSE = 13.4287\) with the dimension reduced to \(M = 8\) components, which gives its best MSE but still falls short of the lasso's performance.
pls_fit=plsr(crim~.,data=myboston,subset=bostrain,scale=TRUE,validation ="CV")
summary(pls_fit)
## Data: X dimension: 353 13
## Y dimension: 353 1
## Fit method: kernelpls
## Number of components considered: 13
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
## (Intercept) 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
## CV 9.653 7.962 7.672 7.618 7.628 7.591 7.587
## adjCV 9.653 7.959 7.663 7.602 7.613 7.578 7.570
## 7 comps 8 comps 9 comps 10 comps 11 comps 12 comps 13 comps
## CV 7.576 7.542 7.551 7.545 7.549 7.549 7.549
## adjCV 7.557 7.525 7.533 7.529 7.532 7.532 7.532
##
## TRAINING: % variance explained
## 1 comps 2 comps 3 comps 4 comps 5 comps 6 comps 7 comps 8 comps
## X 48.89 57.99 61.63 68.66 76.33 81.06 84.96 87.03
## crim 32.96 39.66 42.06 42.68 43.06 43.41 43.58 43.68
## 9 comps 10 comps 11 comps 12 comps 13 comps
## X 89.64 94.13 96.60 98.21 100.00
## crim 43.70 43.70 43.71 43.71 43.71
validationplot(pls_fit,val.type="MSEP")
pls.pred=predict(pls_fit,x[bostest,],ncomp =8)
pls_MSE <- mean((pls.pred -ytest)^2)
pls_MSE
## [1] 14.56908
PLS test \(MSE = 14.56908\) with \(M = 8\) components, which likewise falls short of the lasso's performance.
(b) Propose a model (or set of models) that seem to perform well on this data set, and justify your answer. Make sure that you are evaluating model performance using validation set error, crossvalidation, or some other reasonable alternative, as opposed to using training error.
A) After comparing the cross-validation and test MSEs of the ridge, lasso, PCR, and PLS models, the model that predicts per capita crime rate best on this data set is the lasso with 10 predictors.
(c) Does your chosen model involve all of the features in the data set? Why or why not? A) No. Not all features are included, because a shrinkage method was used to improve prediction: in the final lasso model the coefficients \(\beta_{tax}\) and \(\beta_{age}\) are shrunk exactly to zero, so those predictors drop out of the model.