Lasso Regression

1. Load the lars package and the diabetes dataset (Efron, Hastie, Johnstone and Tibshirani (2004), “Least Angle Regression”, Annals of Statistics), which contains patient-level data on the progression of diabetes. Next, load the glmnet package, which will be used to fit the LASSO.

#install.packages("glmnet", repos = "http://cran.us.r-project.org")
#install.packages("lars", repos = "http://cran.us.r-project.org")
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.3
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library(lars)
## Loaded lars 1.2
data(diabetes)

#head(diabetes)

#View(diabetes)
#attach(diabetes)

#summary(diabetes)

2. The dataset contains three objects: x, x2 and y. The matrix x holds a basic set of ten predictors, while x2 contains the full set, including quadratic and interaction terms. y is the dependent variable, a quantitative measure of the progression of diabetes.
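A quick look at the dimensions confirms this structure (a minimal sketch; the counts can also be read off the regression outputs later in this exercise):

#x is 442 x 10, x2 is 442 x 64, and y is a numeric vector of length 442
dim(diabetes$x)
dim(diabetes$x2)
length(diabetes$y)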

It is a good idea to visually inspect the relationship of each of the predictors with the dependent variable. Generate separate scatterplots with the line of best fit for all the predictors in x with y on the vertical axis. Use a loop to automate the process.

summary(diabetes$x)
##       age                 sex                bmi                 map           
##  Min.   :-0.107226   Min.   :-0.04464   Min.   :-0.090275   Min.   :-0.112400  
##  1st Qu.:-0.037299   1st Qu.:-0.04464   1st Qu.:-0.034229   1st Qu.:-0.036656  
##  Median : 0.005383   Median :-0.04464   Median :-0.007284   Median :-0.005671  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.: 0.038076   3rd Qu.: 0.05068   3rd Qu.: 0.031248   3rd Qu.: 0.035644  
##  Max.   : 0.110727   Max.   : 0.05068   Max.   : 0.170555   Max.   : 0.132044  
##        tc                 ldl                 hdl           
##  Min.   :-0.126781   Min.   :-0.115613   Min.   :-0.102307  
##  1st Qu.:-0.034248   1st Qu.:-0.030358   1st Qu.:-0.035117  
##  Median :-0.004321   Median :-0.003819   Median :-0.006584  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.: 0.028358   3rd Qu.: 0.029844   3rd Qu.: 0.029312  
##  Max.   : 0.153914   Max.   : 0.198788   Max.   : 0.181179  
##       tch                 ltg                 glu           
##  Min.   :-0.076395   Min.   :-0.126097   Min.   :-0.137767  
##  1st Qu.:-0.039493   1st Qu.:-0.033249   1st Qu.:-0.033179  
##  Median :-0.002592   Median :-0.001948   Median :-0.001078  
##  Mean   : 0.000000   Mean   : 0.000000   Mean   : 0.000000  
##  3rd Qu.: 0.034309   3rd Qu.: 0.032433   3rd Qu.: 0.027917  
##  Max.   : 0.185234   Max.   : 0.133599   Max.   : 0.135612
par(mfrow = c(2, 5))
for (i in 1:10) {
  plot(diabetes$x[, i], diabetes$y, xlab = colnames(diabetes$x)[i], ylab = "y")
  abline(lm(diabetes$y ~ diabetes$x[, i]), col = "red")
}

From the above scatterplots, we can see a positive relationship between most of the predictors in ‘x’ and the dependent variable ‘y’, with hdl showing a negative correlation with ‘y’. Note that ‘sex’ is a binary predictor, so its scatterplot is hard to read; any relationship between ‘y’ and ‘sex’ is better judged from the regression below.
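To back the visual impression with numbers, we can compute the marginal correlation of each predictor with y. A quick sketch; unclass() strips the “AsIs” class that lars attaches to the x matrix before it is passed to cor():

#marginal correlation of each predictor with y;
#the signs should match the scatterplots (negative for hdl)
round(cor(unclass(diabetes$x), diabetes$y), 2)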

3. Regress y on the predictors in x using OLS. We will use this result as a benchmark for comparison.

OLS (ordinary least squares) linear regression predicts the value of the dependent variable for varying inputs of the predictor variables, given the intercept and slope coefficients of the line of best fit.

#applying OLS (Ordinary Least Squares) linear regression

model_ols <- lm(diabetes$y ~ diabetes$x)
summary(model_ols)
## 
## Call:
## lm(formula = diabetes$y ~ diabetes$x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -155.829  -38.534   -0.227   37.806  151.355 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    152.133      2.576  59.061  < 2e-16 ***
## diabetes$xage  -10.012     59.749  -0.168 0.867000    
## diabetes$xsex -239.819     61.222  -3.917 0.000104 ***
## diabetes$xbmi  519.840     66.534   7.813 4.30e-14 ***
## diabetes$xmap  324.390     65.422   4.958 1.02e-06 ***
## diabetes$xtc  -792.184    416.684  -1.901 0.057947 .  
## diabetes$xldl  476.746    339.035   1.406 0.160389    
## diabetes$xhdl  101.045    212.533   0.475 0.634721    
## diabetes$xtch  177.064    161.476   1.097 0.273456    
## diabetes$xltg  751.279    171.902   4.370 1.56e-05 ***
## diabetes$xglu   67.625     65.984   1.025 0.305998    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 54.15 on 431 degrees of freedom
## Multiple R-squared:  0.5177, Adjusted R-squared:  0.5066 
## F-statistic: 46.27 on 10 and 431 DF,  p-value: < 2.2e-16

Note: The coefficient for each explanatory variable reflects both the strength and the direction of its relationship with the dependent variable. When the sign of the coefficient is negative, the relationship is negative; when the sign is positive, the relationship is positive.

Interpretation: The adjusted R-squared of 0.51 indicates that the model explains approximately 51% of the variation in the dependent variable.

y shows a positive estimated relationship with all the predictors except age, sex and tc, whose coefficients are -10.01, -239.82 and -792.18 respectively. The coefficient of 519.84 for bmi means that a one-unit increase in (standardized) bmi is associated with an average increase of about 519.84 in y, holding the other predictors constant; the remaining coefficients are interpreted analogously.
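As a quick check on this reading, the signs can be pulled straight from the fitted model. A small sketch using coef() on the lm object:

#extract the OLS coefficients and list the negatively-signed predictors
ols_coefs <- coef(model_ols)
names(ols_coefs)[ols_coefs < 0]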

4. Use the glmnet function to plot the path of each of x’s variable coefficients against the L1 norm of the beta vector.

This graph indicates at which stage each coefficient shrinks to zero.

model_lasso <- glmnet(diabetes$x, diabetes$y)

plot(model_lasso, xvar = "norm", label = TRUE)

In the above plot, each curve corresponds to one variable and traces its coefficient against the L1-norm of the whole coefficient vector as λ varies. The axis above the plot indicates the number of nonzero coefficients at the current lambda (λ), which is the effective degrees of freedom (df) of the lasso.
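The same information is available numerically from the fitted object. A sketch using the lambda and df components of the glmnet object, which hold the λ sequence and the number of nonzero coefficients at each step of the path:

#number of nonzero coefficients (df) at the first few lambdas on the path
head(cbind(lambda = model_lasso$lambda, df = model_lasso$df))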

5. Use the cv.glmnet function to get the cross validation curve and the value of lambda that minimizes the mean cross validation error.

cv_fit <- cv.glmnet(diabetes$x, diabetes$y, alpha = 1, nlambda = 1000)

plot(cv_fit)

The above graph shows the cross-validation curve (red dotted line), with upper and lower standard deviation curves along the lambda sequence (error bars). Two selected values of lambda are indicated by the vertical dotted lines (see below).

The left dashed vertical line marks the value of lambda that minimizes the mean cross-validated prediction error; with lambda.min of about 0.88 (below), it sits at roughly -0.13 on the log(lambda) axis. This value of lambda gives the most accurate model by this criterion. The exact value can be viewed as follows:

#this will give minimum mean cross-validated error
cv_fit$lambda.min
## [1] 0.8811392
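Because the cross-validation plot is drawn on the log(lambda) scale, taking the log of lambda.min locates the left dashed line exactly. Note that cv.glmnet assigns folds at random, so this value will vary slightly from run to run unless a seed is set beforehand:

#position of the left dashed line on the plot's log(lambda) axis;
#call set.seed() before cv.glmnet for a reproducible value
log(cv_fit$lambda.min)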

6. Using the minimum value of lambda from the previous exercise, get the estimated beta matrix.

Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y.

Using lambda.min as the best lambda, we get the following regression coefficients:

fit <- glmnet(diabetes$x, diabetes$y, alpha = 1, lambda=cv_fit$lambda.min)
fit$beta
## 10 x 1 sparse Matrix of class "dgCMatrix"
##              s0
## age    .       
## sex -200.735691
## bmi  522.542673
## map  298.892997
## tc  -110.472867
## ldl    .       
## hdl -220.091964
## tch    7.226631
## ltg  515.834123
## glu   55.715575

The coefficients of the variables age and ldl have been set to zero by the lasso algorithm, reducing the complexity of the model.
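Equivalently, the coefficients at lambda.min can be extracted directly from the cross-validation object with coef(), without refitting. A sketch (the first row of the result is the intercept, which fit$beta above omits):

#same coefficients, taken straight from the cv.glmnet object
coef(cv_fit, s = "lambda.min")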

7. To get a more parsimonious model we can use a higher value of lambda that is within one standard error of the minimum. Use this value of lambda to get the beta coefficients. Note that more coefficients are now shrunk to zero.

Generally, the purpose of regularization is to balance accuracy and simplicity. The function cv.glmnet() also reports the largest value of lambda whose cross-validated error lies within one standard error of the minimum, giving the simplest model that is still nearly optimal. This value is called lambda.1se.
cv_fit$lambda.1se
## [1] 8.670368
fit <- glmnet(diabetes$x, diabetes$y, alpha = 1, lambda=cv_fit$lambda.1se)
fit$beta
## 10 x 1 sparse Matrix of class "dgCMatrix"
##            s0
## age   .      
## sex   .      
## bmi 485.80177
## map 159.83203
## tc    .      
## ldl   .      
## hdl -82.07465
## tch   .      
## ltg 421.50654
## glu   .

Using lambda.1se as the best lambda gives the above regression coefficients. With lambda.1se, only 4 variables (bmi, map, hdl and ltg) have nonzero coefficients. The coefficients of all the other variables, namely age, sex, tc, ldl, tch and glu, have been set to zero by the lasso algorithm, further reducing the complexity of the model.
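One way to see the accuracy/simplicity trade-off is to compare the fit of the two models. The sketch below computes the in-sample mean squared error at each lambda; a proper test-set comparison would need a held-out split, which this exercise does not use:

#in-sample MSE at lambda.min vs lambda.1se (expect the sparser
#lambda.1se model to fit the training data slightly worse)
pred_min <- predict(cv_fit, newx = diabetes$x, s = "lambda.min")
pred_1se <- predict(cv_fit, newx = diabetes$x, s = "lambda.1se")
c(mse_min = mean((diabetes$y - pred_min)^2),
  mse_1se = mean((diabetes$y - pred_1se)^2))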

8. As mentioned earlier, x2 contains a wider variety of predictors. Using OLS, regress y on x2 and evaluate results.

model_ols2 <- lm(diabetes$y~diabetes$x2)
summary(model_ols2)
## 
## Call:
## lm(formula = diabetes$y ~ diabetes$x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -158.216  -30.809   -3.857   31.348  153.946 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          152.133      2.532  60.086  < 2e-16 ***
## diabetes$x2age        50.721     65.513   0.774   0.4393    
## diabetes$x2sex      -267.344     65.270  -4.096 5.15e-05 ***
## diabetes$x2bmi       460.721     84.601   5.446 9.32e-08 ***
## diabetes$x2map       342.933     72.447   4.734 3.13e-06 ***
## diabetes$x2tc      -3599.542  60575.187  -0.059   0.9526    
## diabetes$x2ldl      3028.281  53238.699   0.057   0.9547    
## diabetes$x2hdl      1103.047  22636.179   0.049   0.9612    
## diabetes$x2tch        74.937    275.807   0.272   0.7860    
## diabetes$x2ltg      1828.210  19914.504   0.092   0.9269    
## diabetes$x2glu        62.754     70.398   0.891   0.3733    
## diabetes$x2age^2      67.691     69.470   0.974   0.3305    
## diabetes$x2bmi^2      45.849     83.288   0.550   0.5823    
## diabetes$x2map^2      -8.460     71.652  -0.118   0.9061    
## diabetes$x2tc^2     6668.449   7059.159   0.945   0.3454    
## diabetes$x2ldl^2    3583.174   5326.148   0.673   0.5015    
## diabetes$x2hdl^2    1731.821   1590.574   1.089   0.2769    
## diabetes$x2tch^2     773.374    606.967   1.274   0.2034    
## diabetes$x2ltg^2    1451.581   1730.103   0.839   0.4020    
## diabetes$x2glu^2     114.149     94.122   1.213   0.2260    
## diabetes$x2age:sex   148.678     73.407   2.025   0.0435 *  
## diabetes$x2age:bmi   -18.052     79.620  -0.227   0.8208    
## diabetes$x2age:map    18.534     76.303   0.243   0.8082    
## diabetes$x2age:tc   -158.891    617.109  -0.257   0.7970    
## diabetes$x2age:ldl   -67.285    494.527  -0.136   0.8918    
## diabetes$x2age:hdl   209.245    280.614   0.746   0.4563    
## diabetes$x2age:tch   184.960    210.330   0.879   0.3798    
## diabetes$x2age:ltg   124.667    223.765   0.557   0.5778    
## diabetes$x2age:glu    62.575     80.377   0.779   0.4367    
## diabetes$x2sex:bmi    64.612     77.902   0.829   0.4074    
## diabetes$x2sex:map    88.472     74.744   1.184   0.2373    
## diabetes$x2sex:tc    433.598    590.709   0.734   0.4634    
## diabetes$x2sex:ldl  -352.823    468.951  -0.752   0.4523    
## diabetes$x2sex:hdl  -124.731    273.870  -0.455   0.6491    
## diabetes$x2sex:tch  -131.223    199.714  -0.657   0.5115    
## diabetes$x2sex:ltg  -118.995    226.493  -0.525   0.5996    
## diabetes$x2sex:glu    45.758     73.650   0.621   0.5348    
## diabetes$x2bmi:map   154.720     86.340   1.792   0.0739 .  
## diabetes$x2bmi:tc   -302.045    667.930  -0.452   0.6514    
## diabetes$x2bmi:ldl   241.540    561.026   0.431   0.6671    
## diabetes$x2bmi:hdl   121.942    329.884   0.370   0.7118    
## diabetes$x2bmi:tch   -33.445    230.836  -0.145   0.8849    
## diabetes$x2bmi:ltg   114.673    255.987   0.448   0.6544    
## diabetes$x2bmi:glu    23.377     91.037   0.257   0.7975    
## diabetes$x2map:tc    478.303    682.264   0.701   0.4837    
## diabetes$x2map:ldl  -326.740    574.317  -0.569   0.5697    
## diabetes$x2map:hdl  -187.305    309.589  -0.605   0.5455    
## diabetes$x2map:tch   -58.294    198.601  -0.294   0.7693    
## diabetes$x2map:ltg  -154.795    271.966  -0.569   0.5696    
## diabetes$x2map:glu  -133.476     91.314  -1.462   0.1447    
## diabetes$x2tc:ldl  -9313.775  11771.220  -0.791   0.4293    
## diabetes$x2tc:hdl  -3932.025   3816.572  -1.030   0.3036    
## diabetes$x2tc:tch  -2205.910   1761.843  -1.252   0.2113    
## diabetes$x2tc:ltg  -3801.442  13166.091  -0.289   0.7729    
## diabetes$x2tc:glu   -176.295    595.459  -0.296   0.7673    
## diabetes$x2ldl:hdl  2642.645   3165.926   0.835   0.4044    
## diabetes$x2ldl:tch  1206.822   1470.512   0.821   0.4123    
## diabetes$x2ldl:ltg  2773.697  10960.214   0.253   0.8004    
## diabetes$x2ldl:glu    85.626    505.102   0.170   0.8655    
## diabetes$x2hdl:tch  1188.406   1002.242   1.186   0.2365    
## diabetes$x2hdl:ltg  1467.845   4609.793   0.318   0.7503    
## diabetes$x2hdl:glu   217.541    296.749   0.733   0.4640    
## diabetes$x2tch:ltg   389.805    624.671   0.624   0.5330    
## diabetes$x2tch:glu   235.693    235.064   1.003   0.3167    
## diabetes$x2ltg:glu    83.525    264.726   0.316   0.7525    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.23 on 377 degrees of freedom
## Multiple R-squared:  0.5924, Adjusted R-squared:  0.5233 
## F-statistic: 8.563 on 64 and 377 DF,  p-value: < 2.2e-16

Interpretation: The multiple R-squared of 0.59 (adjusted R-squared 0.52) indicates that the model explains approximately 59% of the in-sample variation in the dependent variable. sex, bmi and map remain the most significant variables, with p-values close to 0.

Their coefficients (-267.34 for sex, 460.72 for bmi and 342.93 for map) give the estimated average change in y per unit change in each standardized predictor, holding the others constant.
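With 64 mostly correlated predictors, the OLS estimates are imprecise; note the enormous standard errors on the terms involving tc, ldl, hdl and ltg. A quick sketch counting the terms that are individually significant at the 5% level:

#p-values of all terms except the intercept; list those below 0.05
p_vals <- summary(model_ols2)$coefficients[-1, "Pr(>|t|)"]
names(p_vals)[p_vals < 0.05]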

9. Use the glmnet function to plot the path of each of x2’s variable coefficients against the L1 norm of the beta vector. This graph indicates at which stage each coefficient shrinks to zero.

model_lasso1 <- glmnet(diabetes$x2, diabetes$y)
plot(model_lasso1, xvar = "norm", label = TRUE)

As before, each curve traces one variable’s coefficient against the L1-norm of the whole coefficient vector as λ varies, and the axis above the plot gives the number of nonzero coefficients at the current λ.

Using the cv.glmnet function to get the cross validation curve and the value of lambda that minimizes the mean cross validation error.

This is an effective way to narrow down on important predictors when there are many candidates.

cv_fit1 <- cv.glmnet(diabetes$x2, diabetes$y, alpha = 1, nlambda = 1000)
plot(cv_fit1)

cv_fit1$lambda.min
## [1] 2.867854

As before, the graph shows the cross-validation curve with its error bars, and the two selected values of lambda are marked by vertical dotted lines. The left dashed vertical line indicates that the log of the optimal lambda is approximately 1, consistent with the printed lambda.min of about 2.87; this is the value that minimizes the prediction error.

Using the minimum value of lambda from the cv_fit1 to get the estimated beta matrix.

Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y.

fit1 <- glmnet(diabetes$x2, diabetes$y, alpha = 1, lambda=cv_fit1$lambda.min)
fit1$beta
## 64 x 1 sparse Matrix of class "dgCMatrix"
##                  s0
## age        .       
## sex     -118.523835
## bmi      501.383490
## map      255.335272
## tc         .       
## ldl        .       
## hdl     -191.870846
## tch        .       
## ltg      468.455446
## glu       20.444426
## age^2     10.773639
## bmi^2     40.019997
## map^2      .       
## tc^2       .       
## ldl^2      .       
## hdl^2      .       
## tch^2      .       
## ltg^2      .       
## glu^2     71.929497
## age:sex  110.181166
## age:bmi    .       
## age:map   30.247639
## age:tc     .       
## age:ldl    .       
## age:hdl    .       
## age:tch    .       
## age:ltg    9.884011
## age:glu   10.886423
## sex:bmi    .       
## sex:map    2.452862
## sex:tc     .       
## sex:ldl    .       
## sex:hdl    .       
## sex:tch    .       
## sex:ltg    .       
## sex:glu    .       
## bmi:map   87.260298
## bmi:tc     .       
## bmi:ldl    .       
## bmi:hdl    .       
## bmi:tch    .       
## bmi:ltg    .       
## bmi:glu    .       
## map:tc     .       
## map:ldl    .       
## map:hdl    .       
## map:tch    .       
## map:ltg    .       
## map:glu    .       
## tc:ldl     .       
## tc:hdl     .       
## tc:tch     .       
## tc:ltg     .       
## tc:glu     .       
## ldl:hdl    .       
## ldl:tch    .       
## ldl:ltg    .       
## ldl:glu    .       
## hdl:tch    .       
## hdl:ltg    .       
## hdl:glu    .       
## tch:ltg    .       
## tch:glu    .       
## ltg:glu    .

Using lambda.min, we obtain the above regression coefficients. Only 15 of the 64 variables have nonzero coefficients; the coefficients of all the others, such as age, tc and ldl, have been set to zero by the lasso algorithm, greatly reducing the complexity of the model.
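To list the surviving predictors by name, the sparse beta matrix can be queried directly. A small sketch:

#named vector of the nonzero lasso coefficients at lambda.min
b <- fit1$beta[, 1]
b[b != 0]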