Another tool that comes in handy while analyzing datasets is Lasso Regression.
Why Lasso Regression? Because it is a great tool for analyzing data in R.
LASSO stands for Least Absolute Shrinkage and Selection Operator.
It is a regression method suited to large datasets with a huge number of features, where models have a strong tendency to overfit and estimation can become computationally challenging.
It was introduced by Robert Tibshirani in 1996 to avoid many of the problems with overfitting that arise when we have a large number of independent variables. It can be seen as an alternative to the least squares estimate.
It can be used to model both categorical and continuous outcomes with a mix of predictor types.
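For instance, with glmnet (the package used in the example below), the outcome type is chosen through the family argument. Here is a minimal sketch with simulated data; all object names are illustrative, not part of the diabetes example that follows:
library(glmnet)
set.seed(1)
x_toy  <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)   # 100 observations, 5 predictors
y_cont <- rnorm(100)                                     # continuous outcome
y_bin  <- rbinom(100, size = 1, prob = 0.5)              # binary outcome
fit_gauss <- glmnet(x_toy, y_cont, family = "gaussian")  # lasso for a continuous response
fit_binom <- glmnet(x_toy, y_bin, family = "binomial")   # lasso logistic regression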
The main objective of this method is to minimize the prediction error for a quantitative response variable. It shrinks the regression coefficients of some variables towards zero by imposing a constraint on the model parameters.
It uses the L1 regularization technique, in which a tuning parameter controls the amount of shrinkage: as the tuning parameter increases, bias increases, and as it decreases, variance increases.
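Concretely, the lasso solves a penalized least squares problem (the standard formulation, with the tuning parameter written as λ):

$$
\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\,\beta} \left\{ \sum_{i=1}^{n} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\}
$$

The L1 penalty is what pushes some coefficients exactly to zero, which is how the lasso performs variable selection.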
You can use the lasso when your main goal is prediction rather than interpretation of the coefficients of your model.
Example of Lasso Regression
# load the required libraries
library(lars)
## Loaded lars 1.2
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-1
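If either package is missing (an assumption about the reader's setup), it can be installed first:
install.packages(c("lars", "glmnet"))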
# load the diabetes dataset that ships with the lars package
data(diabetes)
# attach() exposes its components: x (a 442 x 10 matrix of predictors) and y
attach(diabetes)
# plot each of the 10 predictors against the response, with a simple linear fit
par(mfrow = c(2, 5))
for (i in 1:10) {
  plot(x[, i], y)
  abline(lm(y ~ x[, i]))
}
# fit an ordinary least squares model on all predictors, for comparison
model_ols <- lm(y ~ x)
summary(model_ols)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -155.829  -38.534   -0.227   37.806  151.355
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)   152.133      2.576  59.061  < 2e-16 ***
## xage          -10.012     59.749  -0.168 0.867000
## xsex         -239.819     61.222  -3.917 0.000104 ***
## xbmi          519.840     66.534   7.813 4.30e-14 ***
## xmap          324.390     65.422   4.958 1.02e-06 ***
## xtc          -792.184    416.684  -1.901 0.057947 .
## xldl          476.746    339.035   1.406 0.160389
## xhdl          101.045    212.533   0.475 0.634721
## xtch          177.064    161.476   1.097 0.273456
## xltg          751.279    171.902   4.370 1.56e-05 ***
## xglu           67.625     65.984   1.025 0.305998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.15 on 431 degrees of freedom
## Multiple R-squared: 0.5177, Adjusted R-squared: 0.5066
## F-statistic: 46.27 on 10 and 431 DF, p-value: < 2.2e-16
# fit the lasso path with glmnet (alpha = 1, the default, gives the lasso);
# unlike OLS above, some coefficients will be shrunk exactly to zero
fit <- glmnet(x, y)
# plot the coefficient paths against the L1 norm of the coefficients
plot(fit, xvar = "norm", label = TRUE)
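To inspect the path numerically rather than graphically (a step not shown in the original output), glmnet provides print and coef methods; the value of s below is purely illustrative:
print(fit)          # df, % deviance explained and lambda at each step of the path
coef(fit, s = 0.1)  # coefficients at lambda = 0.1 (an illustrative value)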
# choose lambda by cross-validation (cv.glmnet uses 10-fold CV by default)
cvfit <- cv.glmnet(x, y)
# plot cross-validated error against log(lambda)
plot(cvfit)
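A natural follow-up, not shown in the original output, is to extract coefficients and predictions at the cross-validated choices of lambda; cv.glmnet stores both the error-minimizing value and the sparser one-standard-error choice:
coef(cvfit, s = "lambda.min")               # coefficients at the lambda minimizing CV error
coef(cvfit, s = "lambda.1se")               # sparser model within one SE of the minimum
predict(cvfit, newx = x, s = "lambda.min")  # fitted values at lambda.min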