Another tool that comes in handy while analyzing datasets is Lasso Regression.
Why Lasso Regression? Because it is a great tool for analyzing data in R.
LASSO stands for Least Absolute Shrinkage and Selection Operator.
It is a regression method suited to large datasets with a huge number of features, where models have a strong tendency to overfit and estimation can become computationally challenging.
It was introduced by Robert Tibshirani in 1996 to avoid many of the problems with overfitting that arise when we have a large number of independent variables. It can be seen as an alternative to the least squares estimate.
It can be used to model both categorical and continuous outcomes with a mix of predictor types.
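For instance, with glmnet (the package used in the example below), the outcome type is chosen through the family argument. Here is a minimal sketch with simulated data; all object names are illustrative, not part of the diabetes example that follows:
library(glmnet)
set.seed(1)
x_toy  <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)   # 100 observations, 5 predictors
y_cont <- rnorm(100)                                     # continuous outcome
y_bin  <- rbinom(100, size = 1, prob = 0.5)              # binary outcome
fit_gauss <- glmnet(x_toy, y_cont, family = "gaussian")  # lasso for a continuous response
fit_binom <- glmnet(x_toy, y_bin, family = "binomial")   # lasso logistic regression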
The main objective of this method is to minimize the prediction error for a quantitative response variable. It shrinks the regression coefficients of some variables towards zero by imposing a constraint on the model parameters.
It uses the L1 regularization technique, in which a tuning parameter controls the amount of shrinkage: as the tuning parameter increases, bias increases, and as it decreases, variance increases.
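Concretely, the lasso solves a penalized least squares problem (the standard formulation, with the tuning parameter written as λ):

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\,\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
$$

The L1 penalty is what pushes some coefficients exactly to zero, which is how the lasso performs variable selection.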
You can use the lasso when your main goal is prediction rather than interpretation of the coefficients of your model.
Example of Lasso Regression
# load the required libraries
library(lars)
## Loaded lars 1.2
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-1
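If either package is missing (an assumption about the reader's setup), it can be installed first:
install.packages(c("lars", "glmnet"))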
# load the diabetes dataset that ships with the lars package
data(diabetes)
# attach() exposes its components: x (a 442 x 10 matrix of predictors) and y
attach(diabetes)
# plot each of the 10 predictors against the response, with a simple linear fit
par(mfrow = c(2, 5))
for (i in 1:10) {
  plot(x[, i], y)
  abline(lm(y ~ x[, i]))
}
# fit an ordinary least squares model on all predictors, for comparison
model_ols <- lm(y ~ x)
summary(model_ols)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -155.829  -38.534   -0.227   37.806  151.355
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   152.133      2.576  59.061  < 2e-16 ***
## xage          -10.012     59.749  -0.168 0.867000
## xsex         -239.819     61.222  -3.917 0.000104 ***
## xbmi          519.840     66.534   7.813 4.30e-14 ***
## xmap          324.390     65.422   4.958 1.02e-06 ***
## xtc          -792.184    416.684  -1.901 0.057947 .
## xldl          476.746    339.035   1.406 0.160389
## xhdl          101.045    212.533   0.475 0.634721
## xtch          177.064    161.476   1.097 0.273456
## xltg          751.279    171.902   4.370 1.56e-05 ***
## xglu           67.625     65.984   1.025 0.305998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.15 on 431 degrees of freedom
## Multiple R-squared: 0.5177, Adjusted R-squared: 0.5066
## F-statistic: 46.27 on 10 and 431 DF, p-value: < 2.2e-16
# fit the lasso path with glmnet (alpha = 1, the default, gives the lasso);
# unlike OLS above, some coefficients will be shrunk exactly to zero
fit <- glmnet(x, y)
# plot the coefficient paths against the L1 norm of the coefficients
plot(fit, xvar = "norm", label = TRUE)
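To inspect the path numerically rather than graphically (a step not shown in the original output), glmnet provides print and coef methods; the value of s below is purely illustrative:
print(fit)          # df, % deviance explained and lambda at each step of the path
coef(fit, s = 0.1)  # coefficients at lambda = 0.1 (an illustrative value)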
# choose lambda by cross-validation (cv.glmnet uses 10-fold CV by default)
cvfit <- cv.glmnet(x, y)
# plot cross-validated error against log(lambda)
plot(cvfit)
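A natural follow-up, not shown in the original output, is to extract coefficients and predictions at the cross-validated choices of lambda; cv.glmnet stores both the error-minimizing value and the sparser one-standard-error choice:
coef(cvfit, s = "lambda.min")               # coefficients at the lambda minimizing CV error
coef(cvfit, s = "lambda.1se")               # sparser model within one SE of the minimum
predict(cvfit, newx = x, s = "lambda.min")  # fitted values at lambda.min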