Q1. We will predict the number of applications received using the other variables in the College data set (available at Webcourse@UCF or http://www-bcf.usc.edu/~gareth/ISL/College.csv). (a) Split the data set into a training set and a test set. (b) Fit a linear regression model using least squares on the training set, and report the test error obtained. (c) Fit a ridge regression model on the training set with lambda chosen by cross-validation, and report the test error obtained. (d) Fit a lasso regression model on the training set with lambda chosen by cross-validation, and report the test error obtained, along with the number of non-zero coefficient estimates. (e) Fit a linear regression model using the SCAD method on the training set with lambda chosen by cross-validation, and report the test error obtained, along with the number of non-zero coefficient estimates. (f) Fit a linear regression model using the Elastic method on the training set with lambda 1 and lambda 2 (or lambda and alpha) chosen by cross-validation, and report the test error obtained, along with the number of non-zero coefficient estimates. (h) Discuss and compare results for the approaches in (b)-(f).
Setting working directory and importing data file.
#Importing data file
library(ISLR)
college <- College
attach(college)
head(college,10)
# Seeting seed
set.seed(123)
#Sample Indexes
indexes=sample(1:nrow(college),size=0.3*nrow(college))
# Split data, 70% training & 30% test
#Setting up training set
train=college[-indexes,]
#Setting up test set
test=college[indexes,]
#Removing unused data
rm(college)
#Including library necessary for predict function
library(caret)
#Creating training model using logistic regression
model=lm(Apps~.,data=train)
#Printing out the logistic model
summary(model)
##
## Call:
## lm(formula = Apps ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2818.7 -498.6 -44.0 340.5 6464.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -233.97009 491.79329 -0.476 0.634451
## PrivateYes -859.99870 169.01676 -5.088 5.04e-07 ***
## Accept 1.22978 0.06047 20.336 < 2e-16 ***
## Enroll -0.06919 0.25206 -0.275 0.783806
## Top10perc 55.57605 6.35642 8.743 < 2e-16 ***
## Top25perc -17.86627 5.04564 -3.541 0.000434 ***
## F.Undergrad 0.03436 0.04453 0.772 0.440698
## P.Undergrad 0.03569 0.03563 1.002 0.316872
## Outstate -0.06382 0.02238 -2.852 0.004522 **
## Room.Board 0.21330 0.05598 3.810 0.000155 ***
## Books 0.16288 0.32559 0.500 0.617107
## Personal -0.05011 0.07316 -0.685 0.493741
## PhD -6.14646 5.25532 -1.170 0.242704
## Terminal -6.10063 5.76071 -1.059 0.290083
## S.F.Ratio 5.63435 16.10093 0.350 0.726524
## perc.alumni -6.52568 4.70008 -1.388 0.165597
## Expend 0.06970 0.01410 4.944 1.03e-06 ***
## Grad.Rate 11.85922 3.42604 3.461 0.000581 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1003 on 526 degrees of freedom
## Multiple R-squared: 0.9179, Adjusted R-squared: 0.9152
## F-statistic: 345.8 on 17 and 526 DF, p-value: < 2.2e-16
Fitting model to the test set and checking accuracy. Below is Mean Squared Error.
#Fitting trainning model on test set
pred=predict(model,newdata=test)
#Calculating Accuracy
MSE=mean((test$Apps-pred)^2)
#Printing MSE
print(MSE)
## [1] 1707004
Aside from linear regression, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the shrinkage penalty, is small when ??1, . . . , ??p are close to zero, and so it has the effect of shrinking the estimates of ??j towards zero. The tuning parameter ?? serves to control the relative impact of these two terms on the regression coefficient estimates. When ?? = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as ????????, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates for each value of ??. Selecting a good value for ?? is critical.
#Including library necessary for ridge regression
library(glmnet)
#For ridge regression, we must pass in an x matrix as well as a y vector for train and test sets
xtrain=model.matrix (Apps~.,train)[,-1]
ytrain=train$Apps
xtest=model.matrix (Apps~.,test)[,-1]
ytest=test$Apps
#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.glmnet (xtrain,ytrain,alpha =0)
plot(cv.out)
bestlam=cv.out$lambda.min
#Creating training model using ridge regression
model =glmnet(xtrain,ytrain,alpha=0,lambda=bestlam)
#Printing out the logistic model
model$beta
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s0
## PrivateYes -7.465860e+02
## Accept 7.407759e-01
## Enroll 7.417674e-01
## Top10perc 3.075749e+01
## Top25perc -3.648377e+00
## F.Undergrad 1.052121e-01
## P.Undergrad 9.585709e-03
## Outstate -1.094181e-02
## Room.Board 2.316965e-01
## Books 2.904316e-01
## Personal -6.439403e-02
## PhD -2.290546e+00
## Terminal -5.903794e+00
## S.F.Ratio 8.204850e+00
## perc.alumni -1.143736e+01
## Expend 6.936717e-02
## Grad.Rate 1.306691e+01
Fitting model to the test set and checking accuracy. Below is Mean Squared Error.
#Fitting trainning model on test set
pred=predict(model,s=bestlam ,newx=xtest)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)
## [1] 3046565
Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless ?? = ???). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large. LASSO is a penalized regression method that improves OLS and Ridge regression. LASSO does shrinkage and variable selection simultaneously for better prediction and model interpretation. The limitations of the Lasso are: 1) If p > n, the lasso selects at most n variables. The number of n selected genes is bounded by the number of samples. 2) Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.
#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.glmnet (xtrain,ytrain,alpha =1)
plot(cv.out)
bestlam=cv.out$lambda.min
#Creating training model using lasso regression
model =glmnet(xtrain,ytrain,alpha=1,lambda=bestlam)
#Printing out the logistic model
model$beta
## 17 x 1 sparse Matrix of class "dgCMatrix"
## s0
## PrivateYes -836.00348100
## Accept 1.21672144
## Enroll .
## Top10perc 50.04152032
## Top25perc -13.62068335
## F.Undergrad 0.02777029
## P.Undergrad 0.02800411
## Outstate -0.05380832
## Room.Board 0.19835446
## Books 0.08101676
## Personal -0.02103283
## PhD -4.95468364
## Terminal -5.82867418
## S.F.Ratio 1.79449875
## perc.alumni -6.02698585
## Expend 0.06548702
## Grad.Rate 10.46983604
Fitting model to the test set and checking accuracy. Below is Mean Squared Error.
#Fitting trainning model on test set
pred=predict(model,s=bestlam ,newx=xtest)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)
## [1] 1713200
Lasso Non Zero Coefficients
#Retrieving the lasso coefficients
lasscoef=predict(model,type="coefficients",s=bestlam)[1:length(model$beta),]
#Printing non zero coefficients
lasscoef[lasscoef!=0]
## (Intercept) PrivateYes Accept Top10perc Top25perc
## -301.11192443 -836.00348100 1.21672144 50.04152032 -13.62068335
## F.Undergrad P.Undergrad Outstate Room.Board Books
## 0.02777029 0.02800411 -0.05380832 0.19835446 0.08101676
## Personal PhD Terminal S.F.Ratio perc.alumni
## -0.02103283 -4.95468364 -5.82867418 1.79449875 -6.02698585
## Expend
## 0.06548702
We consider the problem of simultaneous variable selection and estimation in partially linear models with a divergent number of covariates in the linear part, under the assumption that the vector of regression coefficients is sparse. We apply the SCAD penalty to achieve sparsity in the linear part and use polynomial splines to estimate the nonparametric component. Under reasonable conditions, it is shown that consistency in terms of variable selection and estimation can be achieved simultaneously for the linear and nonparametric components. Furthermore, the SCAD-penalized estimators of the nonzero coefficients are shown to have the asymptotic oracle property, in the sense that it is asymptotically normal with the same means and covariances that they would have if the zero coefficients were known in advance. The finite sample behavior of the SCAD-penalized estimators is evaluated with simulation and illustrated with a data set.
#Including library necessary for SCAD regression
library(ncvreg)
#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.ncvreg (xtrain,ytrain,alpha =1,penalty=c("SCAD"))
plot(cv.out)
bestlam=cv.out$lambda.min
#Creating training model using lasso regression
model =ncvreg(xtrain,ytrain,alpha=1,lambda=bestlam)
#Printing out the logistic model
model$beta
## 36.7079
## (Intercept) -101.53023308
## PrivateYes -906.07936234
## Accept 1.26703142
## Enroll 0.00000000
## Top10perc 54.51308719
## Top25perc -17.52148817
## F.Undergrad 0.00000000
## P.Undergrad 0.01817963
## Outstate -0.07313304
## Room.Board 0.23108034
## Books 0.00000000
## Personal 0.00000000
## PhD 0.00000000
## Terminal -10.79448917
## S.F.Ratio 0.00000000
## perc.alumni -3.10840996
## Expend 0.06698921
## Grad.Rate 10.65111066
Fitting model to the test set and checking accuracy. Below is Mean Squared Error.
#Fitting trainning model on test set
pred=predict(model,xtest,s=bestlam)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)
## [1] 1661451
The main idea of Elastic net regression is to combine the Ridge and Lasso. It combines variable selection, with the capacity of selecting groups of correlated variables. Elastic Net does not perform satisfactorily because there are two shrinkage procedures (Ridge and LASSO) in it. Double shrinkage introduces unnecessary bias.
for (i in seq(0, 1, .1))
{
#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.glmnet (xtrain,ytrain,alpha =i)
plot(cv.out)
bestlam=cv.out$lambda.min
#Creating training model using lasso regression
model =glmnet(xtrain,ytrain,alpha=1,lambda=bestlam)
#Printing out the logistic model
model$beta
}
Fitting model to the test set and checking accuracy. Below is Mean Squared Error.
#Fitting trainning model on test set
pred=predict(model,s=bestlam ,newx=xtest)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)
## [1] 1713200
Using a 70/30 split, and randomly selecting the training and testing data, the best results were reported by the SCAD method.
https://arxiv.org/abs/0903.5474 ISLR Textbook *http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf