Introduction
This blog explores the differences between Lasso and Ridge regression in
predicting diabetes disease progression. The dataset used for this
analysis is commonly employed for practicing these regression techniques
and is available in various packages, such as LARS.
Part I: Download Data & Inspection
There are 442 rows/observations and 11 columns, 10 of which are
predictor variables. The ten predictor variables have been standardized
to a common scale with mean 0. There are no records with missing
values.
For this analysis, the variable ‘y’, which is diabetes progression one year after the baseline measurements, is the dependent variable. The other ten variables are the independent variables.
The variables are:
* diabetes.x.age - Age of Individual
* diabetes.x.sex - Sex of Individual
* diabetes.x.bmi - Body Mass Index
* diabetes.x.map - Mean Arterial Pressure
* diabetes.x.tc - Total Cholesterol
* diabetes.x.ldl - Low-Density Lipoprotein
* diabetes.x.hdl - High-Density Lipoprotein
* diabetes.x.tch - Total Cholesterol to HDL ratio
* diabetes.x.ltg - Log-transformed triglyceride levels
* diabetes.x.glu - Blood Glucose Level
* y - Disease progression of diabetes one year after the baseline
measurements
library(readr)     # read_csv()
library(dplyr)     # data manipulation (select, %>%)
library(tidyverse)
library(psych)     # descriptive statistics
library(glmnet)    # For ridge and lasso regression
library(ggplot2)   # For plotting
url <- "https://raw.githubusercontent.com/greggmaloy/621/refs/heads/main/diabetes_dataset.csv"
data <- read_csv(url)    # load the dataset from GitHub
head(data)               # preview the first few rows
colSums(is.na(data))     # count missing values in each column
## diabetes.x.age diabetes.x.sex diabetes.x.bmi diabetes.x.map diabetes.x.tc
## 0 0 0 0 0
## diabetes.x.ldl diabetes.x.hdl diabetes.x.tch diabetes.x.ltg diabetes.x.glu
## 0 0 0 0 0
## y
## 0
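As a quick sanity check on the description above (this check is not part of the original code), the dimensions of the data and the centering of the predictors can be verified:
# 442 observations and 11 columns are expected
dim(data)
# predictor means should all be approximately 0
round(colMeans(select(data, -y)), 4)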
Part II: Running Ridge & Lasso Models
Below, the ridge regression will utilize cross-validation to select an
optimal value for the regularization parameter. Tuning this value is
important because it helps prevent both underfitting and overfitting.
In ridge regression, the regularization parameter essentially 'adds a
penalty' proportional to the sum of the squared coefficient values. By
penalizing large coefficients, the ridge model promotes smaller
coefficient values, thereby helping to prevent overfitting and
underfitting.
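In standard notation (the formula below is a textbook statement of the ridge objective, not something from the original post), the ridge estimate minimizes the residual sum of squares plus an L2 penalty on the coefficients:

$$\hat{\beta}^{\text{ridge}} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$

where the regularization parameter lambda is the value selected by cross-validation below.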
Ridge Model
X <- as.matrix(data %>% select(-y)) # all independent variables
y <- data$y # dependent variable
# Splitting 80 - 20
set.seed(42)
train_indices <- sample(1:nrow(X), size = 0.8 * nrow(X))
X_train <- X[train_indices, ]
X_test <- X[-train_indices, ]
y_train <- y[train_indices]
y_test <- y[-train_indices]
# Ridge regression
ridge_cv <- cv.glmnet(X_train, y_train, alpha = 0, lambda = 10^seq(-3, 3, length = 100))
best_lambda <- ridge_cv$lambda.min
ridge_model <- glmnet(X_train, y_train, alpha = 0, lambda = best_lambda)
# Ridge predictions
ridge_pred <- predict(ridge_model, s = best_lambda, newx = X_test)
# Mean Squared Error
ridge_mse <- mean((y_test - ridge_pred)^2)
# R squared
ridge_r2 <- 1 - sum((y_test - ridge_pred)^2) / sum((y_test - mean(y_test))^2)
# Print results
cat("Ridge Regression:\n")
## Ridge Regression:
cat("Best Lambda:", best_lambda, "\n") # Fixed this line
## Best Lambda: 7.564633
cat("Test MSE:", ridge_mse, "\n")
## Test MSE: 2740.842
cat("Test R-squared:", ridge_r2, "\n\n")
## Test R-squared: 0.4005957
# Coefficient values
coefficients <- as.matrix(coef(ridge_model))
# convert to df and plot
coeff_df <- data.frame(Feature = rownames(coefficients), Coefficient = coefficients[, 1])
ggplot(coeff_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
labs(title = paste("Ridge Coefficients (lambda =", round(best_lambda, 4), ")"),
x = "Features",
y = "Coefficient Value") +
theme_minimal()
coeff_df[order(-abs(coeff_df$Coefficient)), ]  # coefficients sorted by magnitude
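For completeness, the cross-validation curve used to pick this lambda can be visualized (an optional diagnostic, not shown in the original post):
# CV error across the lambda grid; the dashed line marks the selected lambda
plot(ridge_cv)
abline(v = log(best_lambda), lty = 2)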
Lasso Model
As with ridge regression, lasso regression will utilize cross-validation
to select the optimal value for the regularization parameter. Again, as
with the ridge model, optimizing the regularization parameter is crucial
to prevent both underfitting and overfitting. In the case of lasso
regression, the regularization parameter adds a penalty to the sum of
the absolute values of the coefficients. This penalty encourages the
model to shrink some coefficients all the way to zero, thereby selecting
only the important variables and reducing the size and complexity of the
model.
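The lasso objective differs from the ridge objective only in the penalty term, which uses absolute values (an L1 penalty) rather than squares (again, a textbook formulation rather than something from the original post):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta_0,\,\beta} \sum_{i=1}^{n}\left(y_i - \beta_0 - x_i^{\top}\beta\right)^2 + \lambda \sum_{j=1}^{p}\left|\beta_j\right|$$

It is this L1 penalty that allows coefficients to be shrunk exactly to zero.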
# Lasso cross-validation
lasso_cv <- cv.glmnet(X_train, y_train, alpha = 1, lambda = 10^seq(-3, 3, length = 100))
lasso_best_lambda <- lasso_cv$lambda.min
lasso_model <- glmnet(X_train, y_train, alpha = 1, lambda = lasso_best_lambda)
# Predictions on the test set
lasso_pred <- predict(lasso_model, s = lasso_best_lambda, newx = X_test)
# Mean Squared Error
lasso_mse <- mean((y_test - lasso_pred)^2)
# R-squared
lasso_r2 <- 1 - sum((y_test - lasso_pred)^2) / sum((y_test - mean(y_test))^2)
# print
cat("Lasso Regression:\n")
## Lasso Regression:
cat("Best Lambda:", lasso_best_lambda, "\n")
## Best Lambda: 0.9326033
cat("Test MSE:", lasso_mse, "\n")
## Test MSE: 2750.067
cat("Test R-squared:", lasso_r2, "\n")
## Test R-squared: 0.3985781
# Coefficient values
coefficients <- as.matrix(coef(lasso_model))
# convert to df and plot
coeff_df <- data.frame(Feature = rownames(coefficients), Coefficient = coefficients[, 1])
ggplot(coeff_df, aes(x = reorder(Feature, Coefficient), y = Coefficient)) +
geom_bar(stat = "identity", fill = "blue") +
coord_flip() +
labs(title = paste("Lasso Coefficients (lambda =", round(best_lambda, 4), ")"),
x = "Features",
y = "Coefficient Value") +
theme_minimal()
# Print the coefficients sorted by magnitude
coeff_df[order(-abs(coeff_df$Coefficient)), ]
Part III: Discussion
MSE
Both models exhibited comparable MSE (ridge = 2740.842 vs lasso = 2750.067).
The ridge model's MSE is slightly smaller, indicating slightly better
predictive accuracy. Generally speaking, a lower MSE was expected for
the ridge model, as there was strong suspicion that the independent
variables are correlated with one another.
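One quick way to probe that suspicion (this check was not part of the original analysis) is to inspect the pairwise correlations among the predictors:
# Pairwise correlations among the predictors; values far from 0 point to multicollinearity
round(cor(X), 2)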
#cat("Lasso Regression:\n")
#cat("Best Lambda:", lasso_best_lambda, "\n")
cat("Test Lasso MSE:", lasso_mse, "\n")
## Test Lasso MSE: 2750.067
cat("Test Ridge MSE:", ridge_mse, "\n")
## Test Ridge MSE: 2740.842
R-Squared
The R-squared values are nearly identical (lasso = 0.3985781 vs ridge = 0.4005957),
indicating that roughly 40% of the variance in diabetes disease
progression is explained by the independent variables. Because the
results are so similar, other factors would best drive the decision as
to which model is the better fit.
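For reference, the R-squared reported here is computed on the test set using the standard definition, which matches the code above:

$$R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}$$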
cat("Test Lasso R-squared:", lasso_r2, "\n")
## Test Lasso R-squared: 0.3985781
cat("Test Ridge R-squared:", ridge_r2, "\n")
## Test Ridge R-squared: 0.4005957
Coefficient Comparison
ridge_coefficients <- as.matrix(coef(ridge_model))
lasso_coefficients <- as.matrix(coef(lasso_model))
coeff_comparison <- data.frame(
Feature = rownames(ridge_coefficients),
Ridge_Coefficient = ridge_coefficients[, 1],
Lasso_Coefficient = lasso_coefficients[, 1]
)
print(coeff_comparison)
## Feature Ridge_Coefficient Lasso_Coefficient
## (Intercept) (Intercept) 152.33934 152.29007
## diabetes.x.age diabetes.x.age -33.66398 -17.61919
## diabetes.x.sex diabetes.x.sex -237.97247 -235.22852
## diabetes.x.bmi diabetes.x.bmi 494.66101 525.95045
## diabetes.x.map diabetes.x.map 281.51802 268.43276
## diabetes.x.tc diabetes.x.tc -64.85190 -61.02029
## diabetes.x.ldl diabetes.x.ldl -44.39878 0.00000
## diabetes.x.hdl diabetes.x.hdl -193.28656 -231.43687
## diabetes.x.tch diabetes.x.tch 99.70404 0.00000
## diabetes.x.ltg diabetes.x.ltg 462.75285 518.92651
## diabetes.x.glu diabetes.x.glu 128.54321 101.64482
Feature Selection:
The lasso model excluded the LDL and TCH variables from the regression, as indicated by their zero coefficient values. This is one of the primary features of lasso regression: it automatically drops variables of lesser importance from the model.
As expected, the ridge model retained all variables.
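A programmatic way to list the variables the lasso retained (a small convenience check, not part of the original code):
lasso_coefs <- as.matrix(coef(lasso_model))
# names of the non-zero coefficients, i.e., the terms kept by the lasso (the intercept is always included)
rownames(lasso_coefs)[lasso_coefs[, 1] != 0]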
Magnitude of Coefficients:
Both models are comparable in their findings. Both indicate that BMI, LTG, and MAP are positively associated with diabetes progression, while HDL, age, and being male are negatively associated and therefore appear protective.
The ridge model retained LDL and TCH, with LDL negatively associated and TCH positively associated with diabetes progression.
Generally speaking, the coefficient values are comparable between the two models for the intercept, sex, MAP, and total cholesterol, while the differences for age, BMI, HDL, and LTG are more pronounced.
Part IV: Conclusion
Although debatable, the ridge model would be the preferred model, as
indicated by its slightly smaller MSE, nearly identical R-squared, and
similar coefficient values. I would hesitate to drop any of these
variables from the model, as the lasso does, because there is a strong
possibility of multicollinearity between the independent variables given
that disease progression can often be influenced by other diseases and
health states. Seemingly unimportant labs may actually be relevant.