Variable Selection in GLM

Antonio Montalvo, Chris Huszar, Cyrus Safaie

March 9, 2016

Why variable selection?

Variable selection is intended to select the “best” subset of predictors

Models for Wide Data

Wide data (many predictors relative to observations) introduce a level of complexity that most classical variable-selection algorithms cannot handle, either because of computational cost or because of dimensionality limits.

Algorithms and models for selection

Regularization/Shrinkage

Regularization

Process to achieve optimum model complexity

Ridge Regression

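For reference, ridge regression penalizes the residual sum of squares with the squared L2 norm of the coefficients:

$$\hat\beta^{\mathrm{ridge}} = \arg\min_{\beta}\;\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\beta_j^2$$

The penalty shrinks the coefficients toward zero but does not set any of them exactly to zero.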

Lasso Regression

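For reference, the lasso replaces the squared L2 penalty with an L1 penalty:

$$\hat\beta^{\mathrm{lasso}} = \arg\min_{\beta}\;\sum_{i=1}^{n}\left(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\sum_{j=1}^{p}\lvert\beta_j\rvert$$

The L1 penalty sets some coefficients exactly to zero, which is what makes the lasso useful for variable selection.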

Ridge vs. Lasso visualization

The limitations of Lasso

Glmnet Package

The glmnet package provides extremely efficient procedures for fitting the entire lasso or elastic-net regularization path for linear regression, logistic and multinomial regression models, Poisson regression, and the Cox model.

Authors: Jerome Friedman, Trevor Hastie, Rob Tibshirani, and Noah Simon

Elastic Net
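For reference, the elastic-net penalty as implemented in glmnet blends the ridge and lasso penalties through the mixing parameter α (α = 1 gives the lasso, α = 0 gives ridge):

$$\min_{\beta_0,\beta}\;\frac{1}{N}\sum_{i=1}^{N} l\!\left(y_i,\;\beta_0 + x_i^{T}\beta\right) \;+\; \lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2 + \alpha\lVert\beta\rVert_1\right]$$

where l is the log-likelihood contribution of observation i.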

Regularization Path Trace

Elastic Net Advantages:

Beyond Theory: Example

Let's first simulate the data: four covariates and a count response generated from a Poisson distribution.

library(glmnet)  # loads glmnet() and cv.glmnet() used throughout the example

set.seed(60134)
N <- 100 # number of exposures
n <- 10 # size of each exposure
n <- matrix(rep(n,N),ncol=1) # size of each exposure (vector)
k <- 4 # number of covariates
b <- c(0.5,0.10,0,-0.2) # value of covariate parameter vector
# simulated covariates
X <- matrix(c(rep(1,N),rnorm(N,0,1),rnorm(N,0,1),rnorm(N,0,1))
            ,nrow=N,ncol=k)
colnames(X)<-c("True_X1","True_X2","True_X3","True_X4")
theta <- exp(X %*% b)
mu <- n * theta
y <- matrix(-99,nrow=N)
for (i in 1:N){y[i] <- rpois(1,mu[i])}

Example (continued)

Now we add 10 random noise covariates.

k2 <- 10  # number of noise covariates
XX <- matrix(rnorm(N*k2,0,1),nrow=N,ncol=k2)

colnames(XX)<-c("X1","X2","X3","X4","X5","X6","X7","X8","X9","X10")

And three features that are correlated with, or exact functions of, the true covariates.

x_cor2<-X[,2]*1.5          # exact multiple of True_X2 (perfectly collinear)
x_cor3<-X[,3]/2            # exact multiple of True_X3 (perfectly collinear)
x_cor_2and4<-X[,2]*X[,4]   # interaction of True_X2 and True_X4
x_cor<-cbind(x_cor2,x_cor3,x_cor_2and4)
XXX<-cbind(X,XX,x_cor)
colnames(XXX)=c("True_X1","True_X2","True_X3","True_X4","X1","X2","X3","X4","X5","X6","X7","X8","X9","X10","x_cor2","x_cor3","x_cor_2and4")
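As a quick sanity check (not part of the original slides), the sample correlations confirm the built-in collinearity; x_cor2 and x_cor3 are exact multiples of True_X2 and True_X3:

round(cor(XXX[, c("True_X2", "x_cor2", "True_X3", "x_cor3")]), 2)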

Example (continued)

Here is the final simulated data set:

yX <- data.frame(n,y,XXX)
names(yX)
##  [1] "n"           "y"           "True_X1"     "True_X2"     "True_X3"    
##  [6] "True_X4"     "X1"          "X2"          "X3"          "X4"         
## [11] "X5"          "X6"          "X7"          "X8"          "X9"         
## [16] "X10"         "x_cor2"      "x_cor3"      "x_cor_2and4"

Example: Lasso

Setting alpha=1 in glmnet's elastic net gives the lasso. Lasso coefficients shrink exactly to zero as the penalty grows, so the lasso can be used for variable selection.

# drop the constant column True_X1: glmnet fits its own intercept
model.lasso<-glmnet(XXX[,-1], y, family="poisson", offset=log(n), alpha=1, nlambda = 100)
plot(model.lasso,label=T)

Example: Ridge

Setting alpha=0 in glmnet's elastic net gives ridge regression. As expected, the coefficients are shrunk toward zero but remain non-zero, so ridge does not perform variable selection by itself.

model.ridge<-glmnet(XXX[,-1], y, family="poisson", offset=log(n), alpha=0, nlambda = 100)
plot(model.ridge,label=T)

Example: Lasso with cross-validation

glmnet fits up to 100 models along a decreasing grid of lambda values. Cross-validation can be used to find the optimal lambda.

set.seed(60054)
cv.model.lasso<-cv.glmnet(XXX[,-1], y, family="poisson", offset=log(n), alpha=1, nfolds=7) 
plot(cv.model.lasso)

Example: Lasso with cross-validation (continued)

The fitted cv.glmnet object stores the cross-validation results; lambda.min and lambda.1se are the two standard choices of the optimal lambda.

attributes(cv.model.lasso)
## $names
##  [1] "lambda"     "cvm"        "cvsd"       "cvup"       "cvlo"      
##  [6] "nzero"      "name"       "glmnet.fit" "lambda.min" "lambda.1se"
## 
## $class
## [1] "cv.glmnet"
opt.lam = c(cv.model.lasso$lambda.min, cv.model.lasso$lambda.1se)
opt.lam
## [1] 0.2934462 1.1846473

Example: Lasso with cross-validation (continued)

# b <- c(0.5,0.10,0,-0.2) = True coefficients
coef(cv.model.lasso, s=opt.lam)
## 17 x 2 sparse Matrix of class "dgCMatrix"
##                         1             2
## (Intercept)  0.5375460324  5.474378e-01
## True_X2      0.0637800000  5.026494e-03
## True_X3      .             .           
## True_X4     -0.2037099191 -1.612753e-01
## X1           .             .           
## X2          -0.0187948758  .           
## X3          -0.0242132135  .           
## X4           .             .           
## X5           0.0072053663  .           
## X6           .             .           
## X7           .             .           
## X8          -0.0098765465  .           
## X9           0.0033926210  .           
## X10         -0.0304074434  .           
## x_cor2       0.0006425355  1.572106e-07
## x_cor3       .             .           
## x_cor_2and4 -0.0126961053  .

glm on the selected features from lasso

Refitting an ordinary Poisson glm on the features the lasso selected (True_X2, True_X4 and x_cor2) recovers estimates close to the true values. Because x_cor2 is an exact multiple of True_X2, glm cannot estimate both coefficients and reports NA for x_cor2.
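The slide shows only the printed model; the call that produced it can be reconstructed from the Call: line in the output (the object name model.glm is illustrative):

model.glm <- glm(y ~ offset(log(n)) + True_X2 + True_X4 + x_cor2,
                 family = poisson(link = "log"), data = yX)
model.glm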

## 
## Call:  glm(formula = y ~ offset(log(n)) + True_X2 + True_X4 + x_cor2, 
##     family = poisson(link = "log"), data = yX)
## 
## Coefficients:
## (Intercept)      True_X2      True_X4       x_cor2  
##     0.53091      0.08018     -0.22706           NA  
## 
## Degrees of Freedom: 99 Total (i.e. Null);  97 Residual
## Null Deviance:       209.7 
## Residual Deviance: 110.2     AIC: 579

Example: Ridge with cross-validation

The same cross-validation procedure can be used to select the optimal lambda for the ridge fit (alpha=0).

set.seed(60054)
cv.model.ridge<-cv.glmnet(XXX[,-1], y, family="poisson", offset=log(n), alpha=0, nfolds=7) 
opt.lam1 = c(cv.model.ridge$lambda.min, cv.model.ridge$lambda.1se)
plot(cv.model.ridge)

Example: Ridge with cross-validation (continued)

coef(cv.model.ridge, s=opt.lam1)
## 17 x 2 sparse Matrix of class "dgCMatrix"
##                         1             2
## (Intercept)  0.5383173632  0.5438883866
## True_X2      0.0405253270  0.0287478116
## True_X3     -0.0019828581 -0.0007945813
## True_X4     -0.1802579873 -0.1071036523
## X1          -0.0059800394 -0.0037653027
## X2          -0.0327161959 -0.0250856432
## X3          -0.0337986020 -0.0198842119
## X4          -0.0076619084 -0.0046624991
## X5           0.0155441789  0.0032589406
## X6          -0.0026323367  0.0035591116
## X7           0.0009740535  0.0036545662
## X8          -0.0257743786 -0.0078117871
## X9           0.0166185167  0.0090307820
## X10         -0.0426544236 -0.0261060553
## x_cor2       0.0269529154  0.0191656430
## x_cor3      -0.0039940405 -0.0015890217
## x_cor_2and4 -0.0338305946 -0.0307988999

Example: Elastic-net with cross-validation

α can also be tuned by k-fold cross-validation; cv.glmnet only cross-validates λ for a fixed α. The R package caret can wrap glmnet and tune over both α and λ, or one can simply loop over a grid of α values with cv.glmnet, as sketched after the code below. Here we fix α = 0.5.

set.seed(60054)
cv.model.alpha05<-cv.glmnet(XXX[,-1], y, family="poisson", offset=log(n), alpha=.5, nfolds=7) 
opt.lam2 = c(cv.model.alpha05$lambda.min, cv.model.alpha05$lambda.1se)
opt.lam2
## [1] 0.5347545 2.1588130
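As noted above, α itself can be tuned by cross-validation. A minimal sketch (not from the original slides; foldid, alphas, and cv.err are illustrative names) compares cv.glmnet fits over a grid of α values, holding the fold assignment fixed so the cross-validated deviances are comparable:

set.seed(60054)
foldid <- sample(rep(1:7, length.out = N))   # one fixed fold assignment for all alphas
alphas <- seq(0, 1, by = 0.1)                # candidate mixing parameters
cv.err <- sapply(alphas, function(a) {
  fit <- cv.glmnet(XXX[,-1], y, family = "poisson", offset = log(n),
                   alpha = a, foldid = foldid)
  min(fit$cvm)                               # CV deviance at lambda.min for this alpha
})
alphas[which.min(cv.err)]                    # alpha with the smallest CV deviance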

Example: Elastic-net with cross-validation (continued)

coef(cv.model.alpha05, s=opt.lam2)
## 17 x 2 sparse Matrix of class "dgCMatrix"
##                        1            2
## (Intercept)  0.537656862  0.546786444
## True_X2      0.035583782  0.006182489
## True_X3      .            .          
## True_X4     -0.201515129 -0.157592867
## X1           .            .          
## X2          -0.020212992  .          
## X3          -0.025215212  .          
## X4           .            .          
## X5           0.008129189  .          
## X6           .            .          
## X7           .            .          
## X8          -0.011556780  .          
## X9           0.004730270  .          
## X10         -0.031839364  .          
## x_cor2       0.020623849  0.003518974
## x_cor3       .            .          
## x_cor_2and4 -0.015055044  .

References:
