CAP 5703 Assignment#3 Danilo Martinez

Q1. We will predict the number of applications received using the other variables in the College data set (available at Webcourse@UCF or http://www-bcf.usc.edu/~gareth/ISL/College.csv). (a) Split the data set into a training set and a test set. (b) Fit a linear regression model using least squares on the training set, and report the test error obtained. (c) Fit a ridge regression model on the training set with lambda chosen by cross-validation, and report the test error obtained. (d) Fit a lasso regression model on the training set with lambda chosen by cross-validation, and report the test error obtained, along with the number of non-zero coefficient estimates. (e) Fit a linear regression model using the SCAD method on the training set with lambda chosen by cross-validation, and report the test error obtained, along with the number of non-zero coefficient estimates. (f) Fit a linear regression model using the Elastic method on the training set with lambda 1 and lambda 2 (or lambda and alpha) chosen by cross-validation, and report the test error obtained, along with the number of non-zero coefficient estimates. (h) Discuss and compare results for the approaches in (b)-(f).

Setting working directory and importing data file.

#Importing data file
library(ISLR)
college <- College
attach(college)
head(college,10)

Dividing data into 70% training and 30% test set.

# Seeting seed
set.seed(123)
#Sample Indexes
indexes=sample(1:nrow(college),size=0.3*nrow(college))
# Split data, 70% training & 30% test
#Setting up training set
train=college[-indexes,]
#Setting up test set
test=college[indexes,]
#Removing unused data
rm(college)

Creating model for linear regression using training set

#Including library necessary for predict function
library(caret)
#Creating training model using logistic regression
model=lm(Apps~.,data=train)
#Printing out the logistic model
summary(model)

## 
## Call:
## lm(formula = Apps ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2818.7  -498.6   -44.0   340.5  6464.0 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -233.97009  491.79329  -0.476 0.634451    
## PrivateYes  -859.99870  169.01676  -5.088 5.04e-07 ***
## Accept         1.22978    0.06047  20.336  < 2e-16 ***
## Enroll        -0.06919    0.25206  -0.275 0.783806    
## Top10perc     55.57605    6.35642   8.743  < 2e-16 ***
## Top25perc    -17.86627    5.04564  -3.541 0.000434 ***
## F.Undergrad    0.03436    0.04453   0.772 0.440698    
## P.Undergrad    0.03569    0.03563   1.002 0.316872    
## Outstate      -0.06382    0.02238  -2.852 0.004522 ** 
## Room.Board     0.21330    0.05598   3.810 0.000155 ***
## Books          0.16288    0.32559   0.500 0.617107    
## Personal      -0.05011    0.07316  -0.685 0.493741    
## PhD           -6.14646    5.25532  -1.170 0.242704    
## Terminal      -6.10063    5.76071  -1.059 0.290083    
## S.F.Ratio      5.63435   16.10093   0.350 0.726524    
## perc.alumni   -6.52568    4.70008  -1.388 0.165597    
## Expend         0.06970    0.01410   4.944 1.03e-06 ***
## Grad.Rate     11.85922    3.42604   3.461 0.000581 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1003 on 526 degrees of freedom
## Multiple R-squared:  0.9179, Adjusted R-squared:  0.9152 
## F-statistic: 345.8 on 17 and 526 DF,  p-value: < 2.2e-16

Fitting model to the test set and checking accuracy. Below is Mean Squared Error.

#Fitting trainning model on test set
pred=predict(model,newdata=test)
#Calculating Accuracy
MSE=mean((test$Apps-pred)^2)
#Printing MSE
print(MSE)

## [1] 1707004

Aside from linear regression, we can fit a model containing all p predictors using a technique that constrains or regularizes the coefficient estimates, or equivalently, that shrinks the coefficient estimates towards zero. It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance. The two best-known techniques for shrinking the regression coefficients towards zero are ridge regression and the lasso. As with least squares, ridge regression seeks coefficient estimates that fit the data well, by making the RSS small. However, the shrinkage penalty, is small when ??1, . . . , ??p are close to zero, and so it has the effect of shrinking the estimates of ??j towards zero. The tuning parameter ?? serves to control the relative impact of these two terms on the regression coefficient estimates. When ?? = 0, the penalty term has no effect, and ridge regression will produce the least squares estimates. However, as ????????, the impact of the shrinkage penalty grows, and the ridge regression coefficient estimates will approach zero. Unlike least squares, which generates only one set of coefficient estimates, ridge regression will produce a different set of coefficient estimates for each value of ??. Selecting a good value for ?? is critical.

Creating model for ridge regression using training set with gamma chosen by cross-validation.

#Including library necessary for ridge regression
library(glmnet)
#For ridge regression, we must pass in an x matrix as well as a y vector for train and test sets
xtrain=model.matrix (Apps~.,train)[,-1]
ytrain=train$Apps
xtest=model.matrix (Apps~.,test)[,-1]
ytest=test$Apps
#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.glmnet (xtrain,ytrain,alpha =0)
plot(cv.out)

bestlam=cv.out$lambda.min
#Creating training model using ridge regression
model =glmnet(xtrain,ytrain,alpha=0,lambda=bestlam)
#Printing out the logistic model
model$beta

## 17 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## PrivateYes  -7.465860e+02
## Accept       7.407759e-01
## Enroll       7.417674e-01
## Top10perc    3.075749e+01
## Top25perc   -3.648377e+00
## F.Undergrad  1.052121e-01
## P.Undergrad  9.585709e-03
## Outstate    -1.094181e-02
## Room.Board   2.316965e-01
## Books        2.904316e-01
## Personal    -6.439403e-02
## PhD         -2.290546e+00
## Terminal    -5.903794e+00
## S.F.Ratio    8.204850e+00
## perc.alumni -1.143736e+01
## Expend       6.936717e-02
## Grad.Rate    1.306691e+01

Fitting model to the test set and checking accuracy. Below is Mean Squared Error.

#Fitting trainning model on test set
pred=predict(model,s=bestlam ,newx=xtest)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)

## [1] 3046565

Ridge regression does have one obvious disadvantage. Unlike best subset, forward stepwise, and backward stepwise selection, which will generally select models that involve just a subset of the variables, ridge regression will include all p predictors in the final model. The penalty will shrink all of the coefficients towards zero, but it will not set any of them exactly to zero (unless ?? = ???). This may not be a problem for prediction accuracy, but it can create a challenge in model interpretation in settings in which the number of variables p is quite large. LASSO is a penalized regression method that improves OLS and Ridge regression. LASSO does shrinkage and variable selection simultaneously for better prediction and model interpretation. The limitations of the Lasso are: 1) If p > n, the lasso selects at most n variables. The number of n selected genes is bounded by the number of samples. 2) Grouped variables: the lasso fails to do grouped selection. It tends to select one variable from a group and ignore the others.

Creating model for lasso regression using training set with gamma chosen by cross-validation.

#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.glmnet (xtrain,ytrain,alpha =1)
plot(cv.out)

bestlam=cv.out$lambda.min
#Creating training model using lasso regression
model =glmnet(xtrain,ytrain,alpha=1,lambda=bestlam)
#Printing out the logistic model
model$beta

## 17 x 1 sparse Matrix of class "dgCMatrix"
##                        s0
## PrivateYes  -836.00348100
## Accept         1.21672144
## Enroll         .         
## Top10perc     50.04152032
## Top25perc    -13.62068335
## F.Undergrad    0.02777029
## P.Undergrad    0.02800411
## Outstate      -0.05380832
## Room.Board     0.19835446
## Books          0.08101676
## Personal      -0.02103283
## PhD           -4.95468364
## Terminal      -5.82867418
## S.F.Ratio      1.79449875
## perc.alumni   -6.02698585
## Expend         0.06548702
## Grad.Rate     10.46983604

Fitting model to the test set and checking accuracy. Below is Mean Squared Error.

#Fitting trainning model on test set
pred=predict(model,s=bestlam ,newx=xtest)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)

## [1] 1713200

Lasso Non Zero Coefficients

#Retrieving the lasso coefficients
lasscoef=predict(model,type="coefficients",s=bestlam)[1:length(model$beta),]
#Printing non zero coefficients
lasscoef[lasscoef!=0]

##   (Intercept)    PrivateYes        Accept     Top10perc     Top25perc 
## -301.11192443 -836.00348100    1.21672144   50.04152032  -13.62068335 
##   F.Undergrad   P.Undergrad      Outstate    Room.Board         Books 
##    0.02777029    0.02800411   -0.05380832    0.19835446    0.08101676 
##      Personal           PhD      Terminal     S.F.Ratio   perc.alumni 
##   -0.02103283   -4.95468364   -5.82867418    1.79449875   -6.02698585 
##        Expend 
##    0.06548702

We consider the problem of simultaneous variable selection and estimation in partially linear models with a divergent number of covariates in the linear part, under the assumption that the vector of regression coefficients is sparse. We apply the SCAD penalty to achieve sparsity in the linear part and use polynomial splines to estimate the nonparametric component. Under reasonable conditions, it is shown that consistency in terms of variable selection and estimation can be achieved simultaneously for the linear and nonparametric components. Furthermore, the SCAD-penalized estimators of the nonzero coefficients are shown to have the asymptotic oracle property, in the sense that it is asymptotically normal with the same means and covariances that they would have if the zero coefficients were known in advance. The finite sample behavior of the SCAD-penalized estimators is evaluated with simulation and illustrated with a data set.

Creating model for SCAD regression using training set with gamma chosen by cross-validation.

#Including library necessary for SCAD regression
library(ncvreg)
#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.ncvreg (xtrain,ytrain,alpha =1,penalty=c("SCAD"))
plot(cv.out)

bestlam=cv.out$lambda.min
#Creating training model using lasso regression
model =ncvreg(xtrain,ytrain,alpha=1,lambda=bestlam)
#Printing out the logistic model
model$beta

##                   36.7079
## (Intercept) -101.53023308
## PrivateYes  -906.07936234
## Accept         1.26703142
## Enroll         0.00000000
## Top10perc     54.51308719
## Top25perc    -17.52148817
## F.Undergrad    0.00000000
## P.Undergrad    0.01817963
## Outstate      -0.07313304
## Room.Board     0.23108034
## Books          0.00000000
## Personal       0.00000000
## PhD            0.00000000
## Terminal     -10.79448917
## S.F.Ratio      0.00000000
## perc.alumni   -3.10840996
## Expend         0.06698921
## Grad.Rate     10.65111066

Fitting model to the test set and checking accuracy. Below is Mean Squared Error.

#Fitting trainning model on test set
pred=predict(model,xtest,s=bestlam)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)

## [1] 1661451

The main idea of Elastic net regression is to combine the Ridge and Lasso. It combines variable selection, with the capacity of selecting groups of correlated variables. Elastic Net does not perform satisfactorily because there are two shrinkage procedures (Ridge and LASSO) in it. Double shrinkage introduces unnecessary bias.

Creating model for Elastic Net regression using training set with gamma chosen by cross-validation.

for (i in seq(0, 1, .1))
{
#Using cross-validation to choose the tuning parameter ??
set.seed (123)
cv.out=cv.glmnet (xtrain,ytrain,alpha =i)
plot(cv.out)
bestlam=cv.out$lambda.min
#Creating training model using lasso regression
model =glmnet(xtrain,ytrain,alpha=1,lambda=bestlam)
#Printing out the logistic model
model$beta
}

Fitting model to the test set and checking accuracy. Below is Mean Squared Error.

#Fitting trainning model on test set
pred=predict(model,s=bestlam ,newx=xtest)
#Calculating Accuracy
MSE=mean((pred-ytest)^2)
#Printing MSE
print(MSE)

## [1] 1713200

Using a 70/30 split, and randomly selecting the training and testing data, the best results were reported by the SCAD method.

https://arxiv.org/abs/0903.5474 ISLR Textbook *http://web.stanford.edu/~hastie/TALKS/enet_talk.pdf