#install.packages("glmnet", repos = "http://cran.us.r-project.org")
#install.packages("lars", repos = "http://cran.us.r-project.org")
library(glmnet)
## Warning: package 'glmnet' was built under R version 3.6.3
## Loading required package: Matrix
## Loaded glmnet 3.0-2
library(lars)
## Loaded lars 1.2
data(diabetes)
#head(diabetes)
#View(diabetes)
#attach(diabetes)
#summary(diabetes)
It is a good idea to visually inspect the relationship of each of the predictors with the dependent variable. Generate separate scatterplots with the line of best fit for all the predictors in x with y on the vertical axis. Use a loop to automate the process.
summary(diabetes$x)
## age sex bmi map
## Min. :-0.107226 Min. :-0.04464 Min. :-0.090275 Min. :-0.112400
## 1st Qu.:-0.037299 1st Qu.:-0.04464 1st Qu.:-0.034229 1st Qu.:-0.036656
## Median : 0.005383 Median :-0.04464 Median :-0.007284 Median :-0.005671
## Mean : 0.000000 Mean : 0.00000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.: 0.038076 3rd Qu.: 0.05068 3rd Qu.: 0.031248 3rd Qu.: 0.035644
## Max. : 0.110727 Max. : 0.05068 Max. : 0.170555 Max. : 0.132044
## tc ldl hdl
## Min. :-0.126781 Min. :-0.115613 Min. :-0.102307
## 1st Qu.:-0.034248 1st Qu.:-0.030358 1st Qu.:-0.035117
## Median :-0.004321 Median :-0.003819 Median :-0.006584
## Mean : 0.000000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.: 0.028358 3rd Qu.: 0.029844 3rd Qu.: 0.029312
## Max. : 0.153914 Max. : 0.198788 Max. : 0.181179
## tch ltg glu
## Min. :-0.076395 Min. :-0.126097 Min. :-0.137767
## 1st Qu.:-0.039493 1st Qu.:-0.033249 1st Qu.:-0.033179
## Median :-0.002592 Median :-0.001948 Median :-0.001078
## Mean : 0.000000 Mean : 0.000000 Mean : 0.000000
## 3rd Qu.: 0.034309 3rd Qu.: 0.032433 3rd Qu.: 0.027917
## Max. : 0.185234 Max. : 0.133599 Max. : 0.135612
par(mfrow=c(2,5))
for(i in 1:10)
{
plot(diabetes$x[,i], diabetes$y, xlab = colnames(diabetes$x)[i], ylab="y")
abline(lm(diabetes$y~(diabetes$x)[,i]), col="red")
}
From the above scatterplots, we can see there is a positive relationship between all the predictors of ‘x’ and a dependent variable ‘y’, except hdl showing a negative correlation with ‘y’. Also, we can see that ‘y’ is independent of the precitor variable ‘sex’.
#applying OLS (Ordinary Least Squares) linear regression
model_ols <- lm(diabetes$y ~ diabetes$x)
summary(model_ols)
##
## Call:
## lm(formula = diabetes$y ~ diabetes$x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.829 -38.534 -0.227 37.806 151.355
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 152.133 2.576 59.061 < 2e-16 ***
## diabetes$xage -10.012 59.749 -0.168 0.867000
## diabetes$xsex -239.819 61.222 -3.917 0.000104 ***
## diabetes$xbmi 519.840 66.534 7.813 4.30e-14 ***
## diabetes$xmap 324.390 65.422 4.958 1.02e-06 ***
## diabetes$xtc -792.184 416.684 -1.901 0.057947 .
## diabetes$xldl 476.746 339.035 1.406 0.160389
## diabetes$xhdl 101.045 212.533 0.475 0.634721
## diabetes$xtch 177.064 161.476 1.097 0.273456
## diabetes$xltg 751.279 171.902 4.370 1.56e-05 ***
## diabetes$xglu 67.625 65.984 1.025 0.305998
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 54.15 on 431 degrees of freedom
## Multiple R-squared: 0.5177, Adjusted R-squared: 0.5066
## F-statistic: 46.27 on 10 and 431 DF, p-value: < 2.2e-16
** Note: The coefficient for each explanatory variable reflects both the strength and type of relationship the explanatory variable has to the dependent variable.When the sign associated with the coefficient is negative, the relationship is negative. When the sign is positive, the relationship is positive.
Y(diabetes) shows positive relationship with all the variables except age, sex and tc i.e., causing -10.01, -239.81 and -79.18 fold change in Y (or towards the progression of diabetes). However, the coefficient 519.84 for bmi indicates that for any change in bmi, you can expect diabetes to increase by an average of 519.84 and similarly for others coefficients.
model_lasso <- glmnet(diabetes$x, diabetes$y)
plot(model_lasso, xvar = "norm", label = TRUE)
In the above plot, each curve corresponds to a variable. It shows the path of its coefficient against the L1-norm of the whole coefficient vector as λ varies. The axis above indicates the number of nonzero coefficients at the current lambda(λ), which is the effective degrees of freedom (df) for the lasso.
cv_fit <- cv.glmnet(diabetes$x, diabetes$y, alpha = 1, nlambda = 1000)
plot(cv_fit)
The above graph shows the cross-validation curve (red dotted line), and upper and lower standard deviation curves along the lambda sequence (error bars). Two selected lambda’s are indicated by the vertical dotted lines (see below).
The left dashed vertical line indicates that the log of the optimal value of lambda is approximately -0.8, which is the one that minimizes the prediction error. This lambda value will give the most accurate model. The exact value of lambda can be viewed as follow:
#this will give minimum mean cross-validated error
cv_fit$lambda.min
## [1] 0.8811392
Note that some coefficients have been shrunk to zero. This indicates which predictors are important in explaining the variation in y.
Using lambda.min as the best lambda, we get the following regression coefficients:
fit <- glmnet(diabetes$x, diabetes$y, alpha = 1, lambda=cv_fit$lambda.min)
fit$beta
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## age .
## sex -200.735691
## bmi 522.542673
## map 298.892997
## tc -110.472867
## ldl .
## hdl -220.091964
## tch 7.226631
## ltg 515.834123
## glu 55.715575
The coefficients of the variables age and ldl have been have been set to zero by the lasso algorithm, reducing the complexity of the model.
cv_fit$lambda.1se
## [1] 8.670368
fit <- glmnet(diabetes$x, diabetes$y, alpha = 1, lambda=cv_fit$lambda.1se)
fit$beta
## 10 x 1 sparse Matrix of class "dgCMatrix"
## s0
## age .
## sex .
## bmi 485.80177
## map 159.83203
## tc .
## ldl .
## hdl -82.07465
## tch .
## ltg 421.50654
## glu .
Using lambda.1se as the best lambda, gives the above regression coefficients. Using lambda.1se, only 5 variables have non-zero coefficients. The coefficients of all other variables such as age, tc, ldl, tch and glu have been set to zero by the lasso algorithm, reducing the complexity of the model.
model_ols2 <- lm(diabetes$y~diabetes$x2)
summary(model_ols2)
##
## Call:
## lm(formula = diabetes$y ~ diabetes$x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -158.216 -30.809 -3.857 31.348 153.946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 152.133 2.532 60.086 < 2e-16 ***
## diabetes$x2age 50.721 65.513 0.774 0.4393
## diabetes$x2sex -267.344 65.270 -4.096 5.15e-05 ***
## diabetes$x2bmi 460.721 84.601 5.446 9.32e-08 ***
## diabetes$x2map 342.933 72.447 4.734 3.13e-06 ***
## diabetes$x2tc -3599.542 60575.187 -0.059 0.9526
## diabetes$x2ldl 3028.281 53238.699 0.057 0.9547
## diabetes$x2hdl 1103.047 22636.179 0.049 0.9612
## diabetes$x2tch 74.937 275.807 0.272 0.7860
## diabetes$x2ltg 1828.210 19914.504 0.092 0.9269
## diabetes$x2glu 62.754 70.398 0.891 0.3733
## diabetes$x2age^2 67.691 69.470 0.974 0.3305
## diabetes$x2bmi^2 45.849 83.288 0.550 0.5823
## diabetes$x2map^2 -8.460 71.652 -0.118 0.9061
## diabetes$x2tc^2 6668.449 7059.159 0.945 0.3454
## diabetes$x2ldl^2 3583.174 5326.148 0.673 0.5015
## diabetes$x2hdl^2 1731.821 1590.574 1.089 0.2769
## diabetes$x2tch^2 773.374 606.967 1.274 0.2034
## diabetes$x2ltg^2 1451.581 1730.103 0.839 0.4020
## diabetes$x2glu^2 114.149 94.122 1.213 0.2260
## diabetes$x2age:sex 148.678 73.407 2.025 0.0435 *
## diabetes$x2age:bmi -18.052 79.620 -0.227 0.8208
## diabetes$x2age:map 18.534 76.303 0.243 0.8082
## diabetes$x2age:tc -158.891 617.109 -0.257 0.7970
## diabetes$x2age:ldl -67.285 494.527 -0.136 0.8918
## diabetes$x2age:hdl 209.245 280.614 0.746 0.4563
## diabetes$x2age:tch 184.960 210.330 0.879 0.3798
## diabetes$x2age:ltg 124.667 223.765 0.557 0.5778
## diabetes$x2age:glu 62.575 80.377 0.779 0.4367
## diabetes$x2sex:bmi 64.612 77.902 0.829 0.4074
## diabetes$x2sex:map 88.472 74.744 1.184 0.2373
## diabetes$x2sex:tc 433.598 590.709 0.734 0.4634
## diabetes$x2sex:ldl -352.823 468.951 -0.752 0.4523
## diabetes$x2sex:hdl -124.731 273.870 -0.455 0.6491
## diabetes$x2sex:tch -131.223 199.714 -0.657 0.5115
## diabetes$x2sex:ltg -118.995 226.493 -0.525 0.5996
## diabetes$x2sex:glu 45.758 73.650 0.621 0.5348
## diabetes$x2bmi:map 154.720 86.340 1.792 0.0739 .
## diabetes$x2bmi:tc -302.045 667.930 -0.452 0.6514
## diabetes$x2bmi:ldl 241.540 561.026 0.431 0.6671
## diabetes$x2bmi:hdl 121.942 329.884 0.370 0.7118
## diabetes$x2bmi:tch -33.445 230.836 -0.145 0.8849
## diabetes$x2bmi:ltg 114.673 255.987 0.448 0.6544
## diabetes$x2bmi:glu 23.377 91.037 0.257 0.7975
## diabetes$x2map:tc 478.303 682.264 0.701 0.4837
## diabetes$x2map:ldl -326.740 574.317 -0.569 0.5697
## diabetes$x2map:hdl -187.305 309.589 -0.605 0.5455
## diabetes$x2map:tch -58.294 198.601 -0.294 0.7693
## diabetes$x2map:ltg -154.795 271.966 -0.569 0.5696
## diabetes$x2map:glu -133.476 91.314 -1.462 0.1447
## diabetes$x2tc:ldl -9313.775 11771.220 -0.791 0.4293
## diabetes$x2tc:hdl -3932.025 3816.572 -1.030 0.3036
## diabetes$x2tc:tch -2205.910 1761.843 -1.252 0.2113
## diabetes$x2tc:ltg -3801.442 13166.091 -0.289 0.7729
## diabetes$x2tc:glu -176.295 595.459 -0.296 0.7673
## diabetes$x2ldl:hdl 2642.645 3165.926 0.835 0.4044
## diabetes$x2ldl:tch 1206.822 1470.512 0.821 0.4123
## diabetes$x2ldl:ltg 2773.697 10960.214 0.253 0.8004
## diabetes$x2ldl:glu 85.626 505.102 0.170 0.8655
## diabetes$x2hdl:tch 1188.406 1002.242 1.186 0.2365
## diabetes$x2hdl:ltg 1467.845 4609.793 0.318 0.7503
## diabetes$x2hdl:glu 217.541 296.749 0.733 0.4640
## diabetes$x2tch:ltg 389.805 624.671 0.624 0.5330
## diabetes$x2tch:glu 235.693 235.064 1.003 0.3167
## diabetes$x2ltg:glu 83.525 264.726 0.316 0.7525
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.23 on 377 degrees of freedom
## Multiple R-squared: 0.5924, Adjusted R-squared: 0.5233
## F-statistic: 8.563 on 64 and 377 DF, p-value: < 2.2e-16
In other words, we can say that sex,bmi and map can cause -267.344, 460.721 and 342.9 fold change respectively in Y (or towards the progression of diabetes).)
model_lasso1 <- glmnet(diabetes$x2, diabetes$y)
plot(model_lasso1, xvar = "norm", label = TRUE)
In the above plot, each curve corresponds to a variable. It shows the path of its coefficient against the L1-norm of the whole coefficient vector as λ varies. The axis above indicates the number of nonzero coefficients at the current lambda(λ), which is the effective degrees of freedom (df) for the lasso.
cv_fit1 <- cv.glmnet(diabetes$x2, diabetes$y, alpha = 1, nlambda = 1000)
plot(cv_fit1)
cv_fit1$lambda.min
## [1] 2.867854
The above graph shows the cross-validation curve (red dotted line), and upper and lower standard deviation curves along the lambda sequence (error bars). Two selected lambda’s are indicated by the vertical dotted lines (see below). The left dashed vertical line indicates that the log of the optimal value of lambda is approximately 1, which is the one that minimizes the prediction error.
fit1 <- glmnet(diabetes$x2, diabetes$y, alpha = 1, lambda=cv_fit1$lambda.min)
fit1$beta
## 64 x 1 sparse Matrix of class "dgCMatrix"
## s0
## age .
## sex -118.523835
## bmi 501.383490
## map 255.335272
## tc .
## ldl .
## hdl -191.870846
## tch .
## ltg 468.455446
## glu 20.444426
## age^2 10.773639
## bmi^2 40.019997
## map^2 .
## tc^2 .
## ldl^2 .
## hdl^2 .
## tch^2 .
## ltg^2 .
## glu^2 71.929497
## age:sex 110.181166
## age:bmi .
## age:map 30.247639
## age:tc .
## age:ldl .
## age:hdl .
## age:tch .
## age:ltg 9.884011
## age:glu 10.886423
## sex:bmi .
## sex:map 2.452862
## sex:tc .
## sex:ldl .
## sex:hdl .
## sex:tch .
## sex:ltg .
## sex:glu .
## bmi:map 87.260298
## bmi:tc .
## bmi:ldl .
## bmi:hdl .
## bmi:tch .
## bmi:ltg .
## bmi:glu .
## map:tc .
## map:ldl .
## map:hdl .
## map:tch .
## map:ltg .
## map:glu .
## tc:ldl .
## tc:hdl .
## tc:tch .
## tc:ltg .
## tc:glu .
## ldl:hdl .
## ldl:tch .
## ldl:ltg .
## ldl:glu .
## hdl:tch .
## hdl:ltg .
## hdl:glu .
## tch:ltg .
## tch:glu .
## ltg:glu .
Using lambda.min, we get the above regression coefficients. Only 15 variables have non-zero coefficients. The coefficients of all other variables such as age, tc, ldl and so on have been set to zero by the lasso algorithm, thus, reducing the complexity of the model.