Introduction
This blog explores the differences between Lasso and Ridge regression in
predicting diabetes disease progression. The dataset used for this
analysis is commonly employed for practicing these regression techniques
and is available in various packages, such as LARS.
Part I: Download Data & Inspection
There are 442 rows/observations and 11 columns, 10 of which are
predictor variables. The ten predictor variables have been standardized
to a common scale with mean 0. There are no records with missing
values.
For this analysis, the variable ‘y’, which is diabetes progression one year after the baseline measurements, is the dependent variable. The other ten variables are the independent variables.
The variables are:
* diabetes.x.age - Age of Individual
* diabetes.x.sex - Sex of Individual
* diabetes.x.bmi - Body Mass Index
* diabetes.x.map - Mean Arterial Pressure
* diabetes.x.tc - Total Cholesterol
* diabetes.x.ldl - Low-Density Lipoprotein
* diabetes.x.hdl - High-Density Lipoprotein
* diabetes.x.tch - Total Cholesterol to HDL ratio
* diabetes.x.ltg - Log-transformed triglyceride levels
* diabetes.x.glu - Blood Glucose Level
* y - Disease progression of diabetes one year after the baseline
measurements
library(readr)     # read_csv()
library(dplyr)     # data manipulation (select, %>%)
library(tidyverse)
library(psych)     # descriptive statistics
library(glmnet)    # For ridge and lasso regression
library(ggplot2)   # For plotting
url <- "https://raw.githubusercontent.com/greggmaloy/621/refs/heads/main/diabetes_dataset.csv"
data <- read_csv(url)    # load the dataset from GitHub
head(data)               # preview the first few rows
colSums(is.na(data))     # count missing values in each column
## diabetes.x.age diabetes.x.sex diabetes.x.bmi diabetes.x.map diabetes.x.tc
## 0 0 0 0 0
## diabetes.x.ldl diabetes.x.hdl diabetes.x.tch diabetes.x.ltg diabetes.x.glu
## 0 0 0 0 0
## y
## 0
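As a quick sanity check on the description above (this check is not part of the original code), the dimensions of the data and the centering of the predictors can be verified:
# 442 observations and 11 columns are expected
dim(data)
# predictor means should all be approximately 0
round(colMeans(select(data, -y)), 4)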
Part II: Running Ridge & Lasso Models
Below, the ridge regression will utilize cross-validation to select an
optimal value for the regularization parameter. Tuning this value is
important because it helps prevent both underfitting and overfitting.
In ridge regression, the regularization parameter essentially 'adds a
penalty' proportional to the sum of the squared coefficient values. By
penalizing large coefficients, the ridge model promotes smaller
coefficient values, thereby helping to prevent overfitting and
underfitting.
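In standard notation (the formula below is a textbook statement of the ridge objective, not something from the original post), the ridge estimate minimizes the residual sum of squares plus an L2 penalty on the coefficients:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

where the regularization parameter lambda is the value selected by cross-validation below.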
Ridge Model
X <- as.matrix(data %>% select(-y)) # all independent variables
y <- data$y # dependent variable
# Splitting 80 - 20
set.seed(42)
train_indices <- sample(1:nrow(X), size = 0.8 * nrow(X))
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]
# Ridge regression
ridge_cv <- cv.glmnet(X_train, y_train, alpha = 0, lambda = 10^seq(-3, 3, length = 100))
best_lambda <- ridge_cv$lambda.min
ridge_model <- glmnet(X_train, y_train, alpha = 0, lambda = best_lambda)
# Ridge predictions
ridge_pred <- predict(ridge_model, s = best_lambda, newx = X_test)
# Mean Squared Error
ridge_mse <- mean((y_test - ridge_pred)^2)
# R squared
ridge_r2 <- 1 - sum((y_test - ridge_pred)^2) / sum((y_test - mean(y_test))^2)
# Print results
cat("Ridge Regression:\n")
## Ridge Regression:
cat("Best Lambda:", best_lambda, "\n") # Fixed this line
## Best Lambda: 7.564633
cat("Test MSE:", ridge_mse, "\n")
## Test MSE: 2740.842
cat("Test R-squared:", ridge_r2, "\n\n")
## Test R-squared: 0.4005957
# Coefficient values
coefficients <- as.matrix(coef(ridge_model))
# convert to df and plot
coeff_df <- data.frame(Feature = rownames(coefficients), Coefficient = coefficients[, 1])
ggplot(coeff_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
labs(title = paste("Ridge Coefficients (lambda =", round(best_lambda, 4), ")"),
x = "Features",
y = "Coefficient Value") +
theme_minimal()
coeff_df[order(-abs(coeff_df$Coefficient)), ]  # coefficients sorted by magnitude
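For completeness, the cross-validation curve used to pick this lambda can be visualized (an optional diagnostic, not shown in the original post):
# CV error across the lambda grid; the dashed line marks the selected lambda
plot(ridge_cv)
abline(v = log(best_lambda), lty = 2)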
Lasso Model
As with ridge regression, lasso regression will utilize cross-validation
to select the optimal value for the regularization parameter. Again, as
with the ridge model, optimizing the regularization parameter is crucial
to prevent both underfitting and overfitting. In the case of lasso
regression, the regularization parameter adds a penalty to the sum of
the absolute values of the coefficients. This penalty encourages the
model to shrink some coefficients all the way to zero, thereby selecting
only the important variables and reducing the size and complexity of the
model.
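The lasso objective differs from the ridge objective only in the penalty term, which uses absolute values (an L1 penalty) rather than squares (again, a textbook formulation rather than something from the original post):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right|$$

It is this L1 penalty that allows coefficients to be shrunk exactly to zero.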
# Lasso cross-validation
lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1, lambda = 10^seq(-3, 3, length = 100))
lasso_best_lambda <- lasso_cv$lambda.min
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = lasso_best_lambda)
# Predictions on the test set
lasso_pred <- predict(lasso_model, s = lasso_best_lambda, newx = X_test)
# Mean Squared Error
lasso_mse <- mean((y_test - lasso_pred)^2)
# R-squared
lasso_r2 <- 1 - sum((y_test - lasso_pred)^2) / sum((y_test - mean(y_test))^2)
# print
cat("Lasso Regression:\n")
## Lasso Regression:
cat("Best Lambda:", lasso_best_lambda, "\n")
## Best Lambda: 0.9326033
cat("Test MSE:", lasso_mse, "\n")
## Test MSE: 2750.067
cat("Test R-squared:", lasso_r2, "\n")
## Test R-squared: 0.3985781
# Coefficient values
coefficients <- as.matrix(coef(lasso_model))
# convert to df and plot
coeff_df <- data.frame(Feature = rownames(coefficients), Coefficient = coefficients[, 1])
ggplot(coeff_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
labs(title = paste("Lasso Coefficients (lambda =", round(best_lambda, 4), ")"),
x = "Features",
y = "Coefficient Value") +
theme_minimal()
# Print the coefficients sorted by magnitude
coeff_df[order(-abs(coeff_df$Coefficient)), ]
Part III: Discussion
MSE
Both models exhibited comparable MSE (ridge = 2740.842 vs lasso = 2750.067).
The ridge model's MSE is slightly smaller, indicating slightly better
predictive accuracy. Generally speaking, a lower MSE was expected for
the ridge model, as there was strong suspicion that the independent
variables are correlated with one another.
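One quick way to probe that suspicion (this check was not part of the original analysis) is to inspect the pairwise correlations among the predictors:
# Pairwise correlations among the predictors; values far from 0 point to multicollinearity
round(cor(X), 2)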
#cat("Lasso Regression:\n")
#cat("Best Lambda:", lasso_best_lambda, "\n")
cat("Test Lasso MSE:", lasso_mse, "\n")
## Test Lasso MSE: 2750.067
cat("Test Ridge MSE:", ridge_mse, "\n")
## Test Ridge MSE: 2740.842
R-Squared
The R-squared values are nearly identical (lasso = 0.3985781 vs ridge = 0.4005957),
indicating that roughly 40% of the variance in diabetes disease
progression is explained by the independent variables. Because the
results are so similar, other factors would best drive the decision as
to which model is the better fit.
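For reference, the R-squared reported here is computed on the test set using the standard definition, which matches the code above:

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}$$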
cat("Test Lasso R-squared:", lasso_r2, "\n")
## Test Lasso R-squared: 0.3985781
cat("Test Ridge R-squared:", ridge_r2, "\n")
## Test Ridge R-squared: 0.4005957
Coefficient Comparison
ridge_coefficients <- as.matrix(coef(ridge_model))
lasso_coefficients <- as.matrix(coef(lasso_model))
coeff_comparison <- data.frame(
Feature = rownames(ridge_coefficients),
Ridge_Coefficient = ridge_coefficients[, 1],
Lasso_Coefficient = lasso_coefficients[, 1]
)
print(coeff_comparison)
## Feature Ridge_Coefficient Lasso_Coefficient
## (Intercept) (Intercept) 152.33934 152.29007
## diabetes.x.age diabetes.x.age -33.66398 -17.61919
## diabetes.x.sex diabetes.x.sex -237.97247 -235.22852
## diabetes.x.bmi diabetes.x.bmi 494.66101 525.95045
## diabetes.x.map diabetes.x.map 281.51802 268.43276
## diabetes.x.tc diabetes.x.tc -64.85190 -61.02029
## diabetes.x.ldl diabetes.x.ldl -44.39878 0.00000
## diabetes.x.hdl diabetes.x.hdl -193.28656 -231.43687
## diabetes.x.tch diabetes.x.tch 99.70404 0.00000
## diabetes.x.ltg diabetes.x.ltg 462.75285 518.92651
## diabetes.x.glu diabetes.x.glu 128.54321 101.64482
Feature Selection:
The lasso model excluded the LDL and TCH variables from the regression, as indicated by their zero coefficient values. This is one of the primary features of lasso regression: it automatically drops variables of lesser importance from the model.
As expected, the ridge model retained all variables.
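A programmatic way to list the variables the lasso retained (a small convenience check, not part of the original code):
lasso_coefs <- as.matrix(coef(lasso_model))
# names of the non-zero coefficients, i.e., the terms kept by the lasso (the intercept is always included)
rownames(lasso_coefs)[lasso_coefs[, 1] != 0]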
Magnitude of Coefficients:
Both models are comparable in their findings. Both indicate that BMI, LTG, and MAP are positively associated with diabetes progression, while HDL, age, and being male are negatively associated and therefore appear protective.
The ridge model retained LDL and TCH, with LDL negatively associated and TCH positively associated with diabetes progression.
Generally speaking, the coefficient values are comparable between the two models for the intercept, sex, MAP, and total cholesterol, while the differences for age, BMI, HDL, and LTG are more pronounced.
Part IV: Conclusion
Although debatable, the ridge model would be the preferred model, as
indicated by its slightly smaller MSE, nearly identical R-squared, and
similar coefficient values. I would hesitate to drop any of these
variables from the model, as the lasso does, because there is a strong
possibility of multicollinearity between the independent variables given
that disease progression can often be influenced by other diseases and
health states. Seemingly unimportant labs may actually be relevant.