---
title: "Multiple Logistic Regression Analysis: Risk Factors for Diabetes Complications"
author: "Munir, Sanggary, Farihah, Khai"
date: "`r Sys.Date()`"
description: "An in-depth multiple logistic regression analysis to identify significant risk factors associated with diabetes complications."
categories: [R, Logistic Regression, Medical Statistics, Diabetes]
tags: [R, Logistic Regression, Medical Statistics, Diabetes]
format:
  html:
    theme: cerulean
    toc: true
    toc-expand: true
    toc-location: left
    code-fold: true
    code-summary: "Show code"
    code-tools: true
    number-sections: true
    number-depth: 3
execute:
  echo: true
  warning: false
  message: false
  comment: NA
---
# Introduction
## Research Question / Objective
To investigate the association between various risk factors and the occurrence of diabetes complications (yes/no) using multiple logistic regression.
## Dataset and variables
```{r}
library(tidyverse)
set.seed(123) # For reproducibility
n <- 1000
age <- rnorm(n, mean=55, sd=10)
bmi <- rnorm(n, mean=28, sd=5)
sbp <- rnorm(n, mean=130, sd=15)
cholesterol <- rnorm(n, mean=200, sd=30)
smoking_status <- rbinom(n, 1, 0.3) # 30% smokers
# Logistic model to generate diabetes complications
logit_prob <- -5 + 0.02*age + 0.1*bmi + 0.005*sbp + 0.004*cholesterol + 1.0*smoking_status
prob_diabetes_complications <- 1 / (1 + exp(-logit_prob))
diabetes_complications <- rbinom(n, 1, prob_diabetes_complications)
```
A simulated dataset of 1,000 patients was generated to reflect realistic clinical characteristics.
The dataset includes the following variables:
- Age (continuous)
- Body Mass Index (BMI) (continuous)
- Systolic Blood Pressure (SBP) (continuous)
- Cholesterol Level (continuous)
- Smoking Status (binary: 0 = non-smoker, 1 = smoker)
- Diabetes Complications (binary outcome: 0 = no, 1 = yes)
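
The simulation code above turns the linear predictor into a probability with the inverse-logit transform. As a minimal sketch (the `demo_` object names are illustrative and not part of the analysis), the hand-written formula agrees with base R's `plogis()`, and the same coefficients give the true log-odds for a non-smoker sitting exactly at the simulated means:

```{r}
# Inverse logit written out by hand, as in the data-generating code
demo_logit <- c(-2, 0, 1.5)
demo_manual <- 1 / (1 + exp(-demo_logit))

# Base R's logistic distribution function computes the same quantity
demo_builtin <- plogis(demo_logit)
all.equal(demo_manual, demo_builtin)

# True log-odds for a non-smoker at the simulated means
# (age 55, BMI 28, SBP 130, cholesterol 200, smoking_status 0)
demo_avg_logit <- -5 + 0.02 * 55 + 0.1 * 28 + 0.005 * 130 + 0.004 * 200 + 1.0 * 0
demo_avg_prob <- plogis(demo_avg_logit)
c(log_odds = demo_avg_logit, probability = demo_avg_prob)
```

Because the log-odds at the means is positive, more than half of such patients are expected to have complications in the simulated data.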
# Data Preparation
## Load Libraries
```{r}
library(tidyverse)
library(gt)
library(broom)
library(haven)
library(here)
library(gtsummary)
library(corrplot)
library(caret)
library(mfp)
library(MASS)
library(MuMIn)
library(DT)
library(sjPlot)
```
## Data cleaning and coding
```{r}
diabetes_data <- data.frame(
age = age,
bmi = bmi,
sbp = sbp,
cholesterol = cholesterol,
smoking_status = smoking_status,
diabetes_complications = diabetes_complications
)
glimpse(diabetes_data)
```
## Convert variables to factor with ordered levels
```{r}
diabetes_data <- diabetes_data %>%
mutate(
smoking_status = factor(smoking_status, levels = c(0, 1), labels = c("Non-smoker", "Smoker")),
diabetes_complications = factor(diabetes_complications, levels = c(0, 1), labels = c("No", "Yes"))
)
glimpse(diabetes_data)
```
# Descriptive Analysis
## Summary statistics table
```{r}
diabetes_data %>%
tbl_summary(
by = diabetes_complications,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"
)) %>%
bold_labels() %>%
as_gt() %>%
tab_caption(md("**Table 1: Descriptive Statistics**"))
```
## Explore data
Histograms for the numerical variables and a bar plot for the categorical variable, each split by complication status.
### Age
```{r}
ggplot(diabetes_data, aes(x = age)) +
geom_histogram() +
facet_grid (~diabetes_complications)
```
### BMI
```{r}
ggplot(diabetes_data, aes(x = bmi)) +
geom_histogram() +
facet_grid (~diabetes_complications)
```
### Systolic Blood Pressure
```{r}
ggplot(diabetes_data, aes(x = sbp)) +
geom_histogram() +
facet_grid (~diabetes_complications)
```
### Cholesterol
```{r}
ggplot(diabetes_data, aes(x = cholesterol)) +
geom_histogram() +
facet_grid (~diabetes_complications)
```
### Smoking Status
```{r}
ggplot(diabetes_data, aes(x = smoking_status)) +
geom_bar() +
facet_grid (~diabetes_complications)
```
# Univariate Analysis
## Null model
```{r}
modlog_null <- glm(diabetes_complications ~ 1, data = diabetes_data, family = binomial)
tbl_regression(modlog_null, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Null Model", subtitle = "Intercept-only")
```
## Age
```{r}
modlog_age <- glm(diabetes_complications ~ age, data = diabetes_data, family = binomial)
tbl_regression(modlog_age, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Univariate Model: Age", subtitle = "Effect of Age on Diabetes Complications")
```
## BMI
```{r}
modlog_bmi <- glm(diabetes_complications ~ bmi, data = diabetes_data, family = binomial)
tbl_regression(modlog_bmi, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Univariate Model: BMI", subtitle = "Effect of BMI on Diabetes Complications")
```
## Systolic Blood Pressure
```{r}
modlog_sbp <- glm(diabetes_complications ~ sbp, data = diabetes_data, family = binomial)
tbl_regression(modlog_sbp, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Univariate Model: Systolic Blood Pressure", subtitle = "Effect of SBP on Diabetes Complications")
```
## Cholesterol
```{r}
modlog_cholesterol <- glm(diabetes_complications ~ cholesterol, data = diabetes_data, family = binomial)
tbl_regression(modlog_cholesterol, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Univariate Model: Cholesterol", subtitle = "Effect of Cholesterol on Diabetes Complications")
```
## Smoking Status
```{r}
modlog_smoking <- glm(diabetes_complications ~ smoking_status, data = diabetes_data, family = binomial)
tbl_regression(modlog_smoking, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Univariate Model: Smoking Status", subtitle = "Effect of Smoking on Diabetes Complications")
```
*Interpretation* : Based on the univariate analyses, age, BMI, and smoking status all show significant associations with the occurrence of diabetes complications (95% CI does not include 1), whereas systolic blood pressure and cholesterol do not (95% CI includes 1).
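
The "95% CI does not include 1" criterion can be illustrated by building an odds ratio and its interval by hand. The sketch below uses a small simulated dataset (the `demo_` names and the simulated effect size are assumptions for illustration only) and a Wald interval from `confint.default()`. The intervals printed by `tbl_regression()` may be computed differently, for example by profile likelihood, so values can differ slightly.

```{r}
# Small illustrative dataset, unrelated to diabetes_data
set.seed(1)
demo_x <- rnorm(200)
demo_y <- rbinom(200, 1, plogis(-0.5 + 0.8 * demo_x))
demo_fit <- glm(demo_y ~ demo_x, family = binomial)

# Odds ratios are the exponentiated coefficients
demo_or <- exp(coef(demo_fit))

# Wald 95% CI: exp(beta +/- 1.96 * SE), via confint.default()
demo_ci <- exp(confint.default(demo_fit))

# A predictor is "significant" by this criterion when its CI excludes 1
demo_excludes_one <- demo_ci[, 1] > 1 | demo_ci[, 2] < 1
cbind(OR = demo_or, demo_ci, excludes_1 = demo_excludes_one)
```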
# Variable selection
## Preliminary model
We include the variables that were significant in the univariate analysis (age, BMI, smoking status) and clinically relevant variables (systolic blood pressure, cholesterol) in the preliminary multivariable model.
```{r}
modlog_prelim <- glm(diabetes_complications ~ age + bmi + sbp + cholesterol + smoking_status, data = diabetes_data, family = binomial)
tbl_regression(modlog_prelim, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Preliminary Multivariable Model", subtitle = "All selected predictors")
```
*Interpretation* : In the preliminary multivariable model, age, BMI, and smoking status remain significant predictors of diabetes complications. Systolic blood pressure and cholesterol are not significant in this model.
## Backward selection
```{r}
step_bw <- MASS::stepAIC(modlog_prelim, direction = "backward")
```
## Forward selection
```{r}
step_fw <- MASS::stepAIC(modlog_null, scope = list(lower = modlog_null, upper = modlog_prelim), direction = "forward")
```
## Both selection
```{r}
step_both <- MASS::stepAIC(modlog_prelim, direction = "both",
scope = list(lower = modlog_null, upper = modlog_prelim))
```
## Compare selected models
```{r}
anova_combine <- bind_rows(
as.data.frame(step_bw$anova) %>% mutate(Model = "Backward Selection"),
as.data.frame(step_fw$anova) %>% mutate(Model = "Forward Selection"),
as.data.frame(step_both$anova) %>% mutate(Model = "Both Selection")
) %>%
relocate(Model, .before = 1)
anova_combine %>%
gt() %>%
tab_header(
title = "Model Selection Comparison",
subtitle = "ANOVA Results for Different Selection Methods"
) %>%
fmt_number(
columns = c(Df, Deviance, `Resid. Df`, `Resid. Dev`), # c() replaces the deprecated vars() helper
decimals = 2
)
```
*Interpretation* : All three selection methods (backward, forward, and both) converged on the same model, retaining age, BMI, smoking status, and cholesterol level. Retention by AIC does not by itself mean that every retained predictor is statistically significant.
# Multivariable Analysis
## Age and BMI
```{r}
modlog_age.bmi <- glm(diabetes_complications ~ age + bmi, data = diabetes_data, family = binomial)
tbl_regression(modlog_age.bmi, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Multivariable Model: Age and BMI", subtitle = "Effects on Diabetes Complications")
```
## Age, BMI and Smoking Status
```{r}
modlog_age.bmi.smoking <- glm(diabetes_complications ~ age + bmi + smoking_status, data = diabetes_data, family = binomial)
tbl_regression(modlog_age.bmi.smoking, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Multivariable Model: Age, BMI and Smoking Status", subtitle = "Effects on Diabetes Complications")
```
## Full model (Age, BMI, Smoking Status, Cholesterol)
```{r}
modlog_full <- glm(diabetes_complications ~ age + bmi + smoking_status + cholesterol, data = diabetes_data, family = binomial)
tbl_regression(modlog_full, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Multivariable Model: Age, BMI, Smoking Status and Cholesterol", subtitle = "Effects on Diabetes Complications")
```
# Model comparison and selection
## Using AICc
```{r}
model_list <- list(modlog_age, modlog_age.bmi, modlog_age.bmi.smoking, modlog_full)
model_comparison <- model.sel(model_list)
model_comparison %>%
as.data.frame() %>%
rownames_to_column(var = "Model") %>%
gt() %>%
tab_header(
title = "Model Comparison using AICc",
subtitle = "AICc Values for Different Models"
) %>%
fmt_number(
columns = c(AICc, delta, weight), # model.sel() reports AICc rather than AIC
decimals = 2
)
```
*Interpretation* : Based on the AICc values, the model including age, BMI, and smoking status is preferred, as it has the lowest AICc among the compared models.
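
For reference, the information criteria used in this comparison can be reproduced from the log-likelihood. The sketch below assumes the usual definitions, AIC = -2 logLik + 2k and AICc = AIC + 2k(k + 1) / (n - k - 1), and fits a small simulated model whose `demo_` names are illustrative only.

```{r}
# Small illustrative model, unrelated to diabetes_data
set.seed(3)
demo_aic_x <- rnorm(150)
demo_aic_y <- rbinom(150, 1, plogis(0.2 + 0.6 * demo_aic_x))
demo_aic_fit <- glm(demo_aic_y ~ demo_aic_x, family = binomial)

demo_k <- length(coef(demo_aic_fit))        # number of estimated parameters
demo_n <- nobs(demo_aic_fit)                # number of observations
demo_ll <- as.numeric(logLik(demo_aic_fit)) # maximised log-likelihood

# AIC penalises each parameter by 2; AICc adds a small-sample correction
demo_aic <- -2 * demo_ll + 2 * demo_k
demo_aicc <- demo_aic + (2 * demo_k * (demo_k + 1)) / (demo_n - demo_k - 1)

c(AIC = demo_aic, AICc = demo_aicc, builtin_AIC = AIC(demo_aic_fit))
```

With a sample as large as the one analysed here, the correction term is small, so AIC and AICc rank models almost identically.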
## ANOVA to compare nested models
```{r}
# test = "Chisq" requests likelihood ratio tests so that p-values are reported
anova(modlog_age, modlog_age.bmi, modlog_age.bmi.smoking, modlog_full, test = "Chisq")
```
*Interpretation* : The ANOVA results indicate that adding BMI and smoking status significantly improves the model fit compared to the simpler models (p \< 0.05). However, adding cholesterol does not significantly improve the model (p \> 0.05). Thus, the preferred model includes age, BMI, and smoking status.
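
The p-values in a nested-model comparison come from a likelihood ratio test: the drop in residual deviance is compared with a chi-squared distribution whose degrees of freedom equal the number of added parameters. A minimal sketch on simulated data (all `demo_` names are illustrative assumptions) shows the hand calculation agreeing with `anova(..., test = "Chisq")`:

```{r}
# Small illustrative dataset with two predictors
set.seed(2)
demo_d <- data.frame(x1 = rnorm(300), x2 = rnorm(300))
demo_d$y <- rbinom(300, 1, plogis(-0.3 + 0.7 * demo_d$x1 + 0.5 * demo_d$x2))

demo_small <- glm(y ~ x1, data = demo_d, family = binomial)
demo_large <- glm(y ~ x1 + x2, data = demo_d, family = binomial)

# Likelihood ratio statistic = reduction in residual deviance
demo_lr_stat <- deviance(demo_small) - deviance(demo_large)
demo_lr_df <- df.residual(demo_small) - df.residual(demo_large)
demo_lr_p <- pchisq(demo_lr_stat, df = demo_lr_df, lower.tail = FALSE)

# The same test as computed by anova()
demo_anova <- anova(demo_small, demo_large, test = "Chisq")
c(statistic = demo_lr_stat, df = demo_lr_df, p_manual = demo_lr_p,
  p_anova = demo_anova$`Pr(>Chi)`[2])
```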
# Check for confounding and mediation
## DAG plot
```{r}
library(ggdag)
dag <- dagify(
complications ~ smoking + bmi + age,
smoking ~ age,
bmi ~ age
)
ggdag(dag, text = TRUE) + theme_dag()
```
## Check correlation between variables
### Correlation between age and BMI
```{r}
cor(diabetes_data$age, diabetes_data$bmi, use = "complete.obs")
```
*Interpretation* : The correlation between age and BMI is low (\< 0.3), indicating minimal risk of multicollinearity and confounding between these predictors.
### Correlation matrix
```{r}
cor_matrix <- diabetes_data %>%
dplyr::select(age, bmi) %>%
cor(use = "complete.obs")
cor_matrix
```
### Visualize correlation matrix
```{r}
corrplot(cor_matrix, method = "number", type = "upper", tl.col = "black", tl.srt = 45)
```
*Interpretation* : The correlation matrix visualization confirms that there are no strong correlations among the predictors, supporting the conclusion that confounding is unlikely to be a significant issue in the multivariable model.
## Assess confounding by examining percent change in OR
Main exposure: Age
```{r}
models <- list(
age = modlog_age,
age_bmi = modlog_age.bmi,
age_bmi_smoking = modlog_age.bmi.smoking
)
confounding <- map_df(models, ~ tidy(.x, exponentiate = TRUE), .id = "model") %>%
filter(term != "(Intercept)") %>%
filter(term == "age")
# Extract unadjusted OR for age
or_unadj <- confounding %>%
filter(model == "age") %>%
pull(estimate)
# Add percent change column
confounding <- confounding %>%
mutate(percent_change = round((estimate - or_unadj) / or_unadj * 100, 1))
confounding
```
*Interpretation* : The adjusted odds ratios for age change by less than 10% when adding BMI and smoking status to the model, indicating that there is no significant confounding effect from these variables on the relationship between age and diabetes complications.
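
The 10% rule applied here reduces to simple arithmetic on the odds ratios. The sketch below uses hypothetical odds ratios (not values from this analysis) and a hypothetical helper, `demo_pct_change()`, to show how the percent change is computed and flagged.

```{r}
# Hypothetical helper: percent change of an adjusted OR relative to the unadjusted OR
demo_pct_change <- function(or_adjusted, or_unadjusted) {
  round((or_adjusted - or_unadjusted) / or_unadjusted * 100, 1)
}

# Hypothetical values for illustration only
demo_or_unadjusted <- 1.030
demo_or_adjusted <- c(small_shift = 1.033, large_shift = 1.150)

demo_change <- demo_pct_change(demo_or_adjusted, demo_or_unadjusted)

# A change larger than 10% in either direction suggests confounding
data.frame(percent_change = demo_change, exceeds_10pct = abs(demo_change) > 10)
```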
## Check the association of each variable with the outcome (diabetes complications)
```{r}
slr.age.bmi.smoke <-
diabetes_data %>%
dplyr::select(age, bmi, smoking_status) %>%
purrr::map(~ glm(diabetes_complications ~ .x, data = diabetes_data, family = binomial)) %>%
purrr::map(tidy) %>% bind_rows()
#Display the result
slr.age.bmi.smoke %>%
mutate(model = c('b0', 'age', 'b0', 'bmi', 'b0', 'smoking_statusSmoker')) %>% # label matches the "Smoker" factor level
dplyr::select(model, everything())
```
*Interpretation* : Age, BMI, and smoking status are all statistically significant and clinically relevant predictors of diabetes complications. Each contributes independently to risk, with smoking showing the strongest effect. These results support including all three variables in the final model.
# Check for interaction
## Age and BMI
```{r}
mloginter_age.bmi <- glm(
diabetes_complications ~ age + bmi + smoking_status + age:bmi,
data = diabetes_data,
family = binomial
)
tbl_regression(mloginter_age.bmi, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Interaction Model: Age and BMI", subtitle = "Effects on Diabetes Complications")
```
## Age and Smoking Status
```{r}
mloginter_age.smoking <- glm(
diabetes_complications ~ age + bmi + smoking_status + age:smoking_status,
data = diabetes_data,
family = binomial
)
tbl_regression(mloginter_age.smoking, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Interaction Model: Age and Smoking Status", subtitle = "Effects on Diabetes Complications")
```
## BMI and Smoking Status
```{r}
mloginter_bmi.smoking <- glm(
diabetes_complications ~ age + bmi + smoking_status + bmi:smoking_status,
data = diabetes_data,
family = binomial
)
tbl_regression(mloginter_bmi.smoking, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Interaction Model: BMI and Smoking Status", subtitle = "Effects on Diabetes Complications")
```
## Age, BMI and Smoking Status
```{r}
mloginter_all <- glm(
diabetes_complications ~ age * bmi * smoking_status,
data = diabetes_data,
family = binomial
)
tbl_regression(mloginter_all, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Interaction Model: Age, BMI and Smoking Status", subtitle = "Effects on Diabetes Complications")
```
*Interpretation* : None of the interaction terms between age, BMI, and smoking status are statistically significant (95% CI includes 1). This suggests that the effects of these predictors on diabetes complications are independent and do not modify each other.
# Model assessment
## Choose final model
The final model includes age, BMI, and smoking status as main effects, without interaction terms.
```{r}
final_model <- glm(
diabetes_complications ~ age + bmi + smoking_status,
data = diabetes_data,
family = binomial
)
tbl_regression(final_model, intercept = TRUE, exponentiate = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Final Multivariable Model", subtitle = "Effects on Diabetes Complications")
```
## Create predicted classes based on a 0.5 threshold (Overall fitness)
- Accuracy
- Sensitivity
- Specificity
```{r}
final.m.prob <- augment(final_model, type.predict = "response") %>%
mutate(pred.class = ifelse(.fitted >= 0.5, "Yes", "No"))
# Confusion matrix (fixed factor levels keep both classes even if one is never predicted)
confusionMatrix(factor(final.m.prob$pred.class, levels = c("No", "Yes")), diabetes_data$diabetes_complications, positive = "Yes")
```
*Interpretation* :
- Final model demonstrates high sensitivity but low specificity.
- It is effective at identifying patients with diabetes complications (few missed cases), but it produces many false alarms by incorrectly labeling patients without complications as positive.
- This imbalance suggests the model prioritizes detecting complications at the expense of over-predicting them.
- Although this may be acceptable in a clinical screening context (where missing a complication is riskier than over-diagnosing), further refinement is needed to improve specificity and overall balance.
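
The three metrics listed above follow directly from the cells of the confusion matrix. As a minimal sketch, the hand calculation below uses a tiny hypothetical set of eight predictions (not output from the final model), with "Yes" as the positive class:

```{r}
# Hypothetical true and predicted classes for eight patients
demo_truth <- factor(c("Yes", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"),
                     levels = c("No", "Yes"))
demo_pred <- factor(c("Yes", "Yes", "No", "Yes", "No", "Yes", "Yes", "Yes"),
                    levels = c("No", "Yes"))

demo_tab <- table(Predicted = demo_pred, Actual = demo_truth)

# Sensitivity: true positives / all actual positives
demo_sens <- demo_tab["Yes", "Yes"] / sum(demo_tab[, "Yes"])
# Specificity: true negatives / all actual negatives
demo_spec <- demo_tab["No", "No"] / sum(demo_tab[, "No"])
# Accuracy: correct predictions / all predictions
demo_acc <- sum(diag(demo_tab)) / sum(demo_tab)

c(sensitivity = demo_sens, specificity = demo_spec, accuracy = demo_acc)
```

In this toy example sensitivity is high and specificity is low, which is the same pattern described for the final model.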
## Check for linearity of covariates with logit (numerical variables)
```{r}
lin.age.bmi <- mfp(diabetes_complications ~ fp(age) + fp(bmi), data = diabetes_data, family = binomial(link = "logit"),
verbose = TRUE)
summary(lin.age.bmi)
```
*Interpretation* : Fractional polynomial modeling did not identify any nonlinear transformations that improved fit, suggesting that simple linear relationships adequately describe the effects of age and BMI in this dataset.
# Diagnostic plot
```{r}
par(mfrow = c(2, 2))
plot(final_model)
```
*Interpretation* : The model shows signs of non-linearity, heteroscedasticity, and influential points. While logistic regression doesn’t require normal residuals, these patterns suggest that investigating outliers could improve model performance.
# Check for influential points
## Identify influential observations
```{r}
# Add diagnostics columns
diabetes_diag <- augment(final_model, diabetes_data)
# Define thresholds
n_obs <- nrow(diabetes_data)
p_preds <- length(coef(final_model))
cooks_cutoff <- 4 / n_obs
leverage_cutoff <- (2 * p_preds + 2) / n_obs
# Filter Influential Observations
influential_obs <- diabetes_diag %>%
filter(.cooksd > cooks_cutoff | abs(.std.resid) > 2 | .hat > leverage_cutoff)
# Print count
nrow(influential_obs)
```
## Cook's distance plot
```{r}
cutoff <- 4/(nrow(diabetes_data)-length(final_model$coefficients)-2)
plot(final_model, which=4, cook.levels=cutoff)
```
*Interpretation* : The Cook’s distance plot shows that a few observations exceed the threshold of 0.004, indicating they may be influential points affecting the regression results.
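
The diagnostics that `augment()` returns can also be obtained from base R. The sketch below is run on a small simulated model with illustrative `demo_` names. It assumes the common 4/n rule of thumb for Cook's distance and an absolute standardized residual above 2, which mirror but do not replace the thresholds used in this analysis.

```{r}
# Small illustrative model, unrelated to diabetes_data
set.seed(4)
demo_inf <- data.frame(x = rnorm(100))
demo_inf$y <- rbinom(100, 1, plogis(0.5 * demo_inf$x))
demo_inf_fit <- glm(y ~ x, data = demo_inf, family = binomial)

demo_cooks <- cooks.distance(demo_inf_fit) # influence of each observation on the fit
demo_hat <- hatvalues(demo_inf_fit)        # leverage of each observation
demo_rstd <- rstandard(demo_inf_fit)       # standardized deviance residuals

# Flag observations that exceed either rule-of-thumb threshold
demo_n_obs <- nobs(demo_inf_fit)
demo_flag <- demo_cooks > 4 / demo_n_obs | abs(demo_rstd) > 2
sum(demo_flag)
```

Flagged observations are candidates for inspection. Whether to remove them is a judgment call that should rest on why they are unusual, not on the threshold alone.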
## Identify the influential observations
```{r}
# Create diagnostic dataset (predictions and residuals)
diabetes_pred.res <- augment(final_model)
diabetes_pred.res %>%
datatable()
# Identify influential observations based on cook's distance, residuals, and leverage
diabetes_pred.res %>%
filter(.std.resid > 2 | .std.resid < -2 | .hat > 0.038 | .cooksd > 0.5) %>%
datatable()
```
## Remove Influential Observations and Refit Model
```{r}
# Remove influential observations based on Cook's distance, residuals, and leverage
clean_data <- diabetes_diag %>%
filter(.cooksd <= cooks_cutoff & abs(.std.resid) <= 2 & .hat <= leverage_cutoff)
# Refit the logistic regression model on cleaned data
final_model_clean <- glm(diabetes_complications ~ age + bmi + smoking_status,
data = clean_data, family = binomial)
# Create regression summary table
tbl <- tbl_regression(
final_model_clean,
exponentiate = TRUE, # report odds ratios so the values match the column header
intercept = TRUE,
label = list(
age ~ "Age (years)",
bmi ~ "BMI (kg/m²)",
smoking_status ~ "Smoking Status"
)
) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
add_glance_source_note(include = c(AIC, BIC, logLik)) %>% # valid for GLM
modify_header(estimate ~ "**Adjusted Odds Ratio**")
# Convert to gt and style
tbl %>%
as_gt() %>%
tab_header(
title = "Final Logistic Regression Model (Influential Observations Removed)",
subtitle = "Predictors of Diabetes Complications"
)
```
## Compare model fit before and after removing influential observations
```{r}
AIC(final_model, final_model_clean)
BIC(final_model, final_model_clean)
```
*Interpretation* : After removing influential observations, age, BMI, and smoking status remained statistically significant predictors of diabetes complications. AIC and BIC were lower for the refitted model, but because the two models are fitted to different numbers of observations, these values are not directly comparable and the decrease should not be read as proof of a better fit.
# Predictive analysis
## Fitted probabilities for existing patients
```{r}
# Fitted value
prob_dm.com <- augment(final_model_clean,
type.predict = 'response',
type.residuals = 'pearson')
prob_dm.com %>%
datatable()
```
## Predicted probabilities for new patients
```{r}
new_diabetes <- expand.grid(
age = c(40, 50, 60, 70),
bmi = c(20, 25, 30, 35),
smoking_status = c("Non-smoker", "Smoker")
)
# Add log-odds and predicted probabilities
new_diabetes$predicted_logodds <- predict(final_model_clean,
newdata = new_diabetes,
type = "link") # type = "link" returns log-odds, not odds
new_diabetes$predicted_prob <- predict(final_model_clean,
newdata = new_diabetes,
type = "response")
# Create table
new_diabetes %>%
gt() %>%
tab_header(
title = "Predicted Log-Odds and Probabilities for New Patients",
subtitle = "Based on Final Multivariable Model"
) %>%
fmt_number(
columns = c(predicted_logodds, predicted_prob),
decimals = 3
)
```
*Interpretation* :
- A 60-year-old smoker with BMI 30 has a predicted probability of 0.62 for diabetes complications.
- A 40-year-old non-smoker with BMI 20 has a predicted probability of 0.24.
- This suggests that age, BMI, and smoking status all contribute meaningfully to risk, and the model captures their combined effect.
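
The two prediction types used above are linked by the inverse logit: `type = "link"` returns log-odds and `type = "response"` returns the corresponding probability. A minimal sketch on a small simulated model (the `demo_` names are illustrative and unrelated to the final model) confirms the relationship:

```{r}
# Small illustrative model, unrelated to diabetes_data
set.seed(5)
demo_pr <- data.frame(x = rnorm(120))
demo_pr$y <- rbinom(120, 1, plogis(-0.2 + 0.9 * demo_pr$x))
demo_pr_fit <- glm(y ~ x, data = demo_pr, family = binomial)

demo_new <- data.frame(x = c(-1, 0, 1))
demo_link <- predict(demo_pr_fit, newdata = demo_new, type = "link")     # log-odds
demo_resp <- predict(demo_pr_fit, newdata = demo_new, type = "response") # probability

# Applying the inverse logit to the log-odds recovers the probabilities
all.equal(plogis(demo_link), demo_resp)
data.frame(demo_new, log_odds = demo_link, odds = exp(demo_link), probability = demo_resp)
```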
# Model presentation
## Final model table
```{r}
tbl <- tbl_regression(
final_model_clean,
exponentiate = TRUE, # <-- this converts log-odds to odds ratios
intercept = TRUE,
label = list(
age ~ "Age (years)",
bmi ~ "BMI (kg/m²)",
smoking_status ~ "Smoking Status"
)
) %>%
add_glance_source_note(include = c(AIC, BIC, logLik)) %>%
bold_labels() %>%
bold_p(t = 0.05) %>%
modify_header(estimate ~ "**Adj. OR**")
# Convert to gt and apply styling
tbl %>%
as_gt() %>%
tab_header(
title = "Final Logistic Regression Model",
subtitle = "Predictors of Diabetes Complications"
)
```
## Forest plot
```{r}
plot_model(final_model_clean,
type = "est", # estimates with CI
show.values = TRUE, # show OR values
value.offset = .3, # nudge labels
title = "Forest Plot of Predictors of Diabetes Complications",
axis.labels = c(
"age" = "Age (Years)",
"bmi" = "BMI (kg/m²)",
"smoking_statusSmoker" = "Smoking: Smoker"
),
vline.color = "red") +
theme_minimal()
```
*Interpretation* :
- The forest plot illustrates that age, BMI, and smoking status are significant predictors of diabetes complications.
- Smoking showed the strongest association, with an odds ratio of 6.42 (p \< 0.001), indicating a markedly elevated risk among smokers.
- BMI and age also demonstrated positive associations, with ORs of 1.15 and 1.03 respectively, suggesting that both increasing body mass and older age contribute to complication risk.
# Result and conclusion
## Model Specification and Justification
The final analysis employed a multivariable main‑effects logistic regression model to estimate the independent contributions of clinical predictors to the odds of diabetes complications.
- *Final Formula*:
The final logistic regression model estimates the odds of diabetes complications based on age, BMI, and smoking status. Expressed in terms of adjusted odds ratios:
$$
\text{Odds}_{\text{Complication}} = 0.01 \times (1.03)^{\text{Age}} \times (1.15)^{\text{BMI}} \times (6.42)^{\text{Smoking=yes}}
$$
Where:
- $0.01$ is the baseline odds (intercept)
- $1.03$ is the odds ratio per year of age
- $1.15$ is the odds ratio per unit increase in BMI
- $6.42$ is the odds ratio for smokers compared to non-smokers
- *Explanatory Power* : Fit statistics for the final logistic regression model were AIC = 1,016, BIC = 1,035, and log‑likelihood = −504.
- *Bias and Variance* : Interaction terms were evaluated but found to be statistically non-significant (95% CI includes 1).
- *Variable selection* : Variable selection was performed using AIC-based stepwise regression. The final main-effects model was selected as it achieved the lowest AIC, representing the optimal balance between explanatory accuracy and model parsimony.
- *Model selection* : The final logistic regression model was selected based on the lowest corrected Akaike Information Criterion (AICc = 1,207.11). Nested model comparisons using likelihood ratio tests showed significant improvement when BMI and smoking status were added (p \< 0.001), while the inclusion of cholesterol did not yield a statistically significant improvement (p = 0.1568). Therefore, the preferred model includes age, BMI, and smoking status.
## Analysis of Coefficients
These values represent the adjusted effect of each predictor on the odds of developing diabetes complications, holding all other variables constant (ceteris paribus):
- Age (years):\
Each additional year of age is associated with a 3% increase in the odds of diabetes complications (Adj. OR = 1.03; 95% CI: 1.02, 1.05; p \< 0.001).
- BMI (kg/m²):\
Each unit increase in BMI is associated with a 15% increase in the odds of complications (Adj. OR = 1.15; 95% CI: 1.11, 1.19; p \< 0.001).
- Smoking Status:\
Compared to non-smokers, smokers have 6.42 times the odds of developing diabetes complications (Adj. OR = 6.42; 95% CI: 4.12, 10.4; p \< 0.001).
## Subgroup Interpretation: Cumulative Risk
Because the final model contains main effects only, it is additive on the log-odds scale, so the effects of the risk factors multiply on the odds scale. This allows us to interpret the combined burden of risk across different patient profiles:
- *Metabolic Cost of Obesity* : Each unit increase in BMI (e.g., from 25 to 26) increases the odds of complications by 15%, holding other variables constant.
- *Age-Related Risk Accumulation* : Each additional year of age increases the odds of complications by 3% (OR = 1.03). For example, a 60-year-old patient has approximately 1.81 times the odds of complications of a 40-year-old peer ($1.03^{20} \approx 1.81$).
This illustrates how age, BMI, and smoking status combine multiplicatively on the odds scale to elevate complication risk.
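
The multiplicative arithmetic can be checked directly from the rounded odds ratios reported above. The sketch below reproduces the 20-year age comparison and, as an additional illustrative profile that is not taken from the analysis, the combined odds ratio for a smoker whose BMI is 5 units higher than that of a non-smoker of the same age.

```{r}
# Rounded adjusted odds ratios as reported in the final model table
demo_or_age <- 1.03
demo_or_bmi <- 1.15
demo_or_smoke <- 6.42

# 60-year-old versus 40-year-old: 20 one-year steps multiply together
demo_age20 <- demo_or_age^20

# Illustrative profile: smoker with BMI 5 units higher, same age
demo_combined <- demo_or_smoke * demo_or_bmi^5

round(c(age_20_years = demo_age20, smoker_plus_5_bmi = demo_combined), 2)
```

Because these inputs are rounded to two decimals, results based on the unrounded model coefficients may differ slightly.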
## Model Assessment and Diagnostics
The final model was rigorously validated to ensure robustness and reliability of inference.
- *Linearity of the Logit* : Fractional polynomial modeling did not identify any nonlinear transformations that improved fit, suggesting that simple linear relationships adequately describe the effects of age and BMI in this dataset.
- *Influential Observations* : A few influential observations were identified based on Cook’s distance, standardized residuals, and leverage. After removing these points, the refitted model maintained consistent coefficient estimates and significance levels, indicating robustness of findings.
## Conclusion
- This study demonstrates that older age, higher BMI, and smoking status are each independently associated with an increased risk of diabetes complications.
- The model reveals a clear dose–response relationship with BMI, highlighting its dominant role in driving risk, while age exerts a modest but consistent linear effect. Smoking status further amplifies complication risk, underscoring its importance as a modifiable factor.
- Although the model explains a moderate proportion of variance, the findings suggest that additional determinants, such as medication adherence, physical activity, and diet quality also contribute to outcomes.
- Clinically, these results emphasize the need for a multifaceted approach to diabetes management.
- Structured weight reduction strategies should be prioritized given the strong effect of BMI, while smoking cessation interventions remain critical for reducing complication risk.
- Age-related considerations highlight the importance of tailoring treatment targets and strategies for older adults, balancing efficacy with safety.
- Ultimately, comprehensive care plans that integrate weight management, smoking cessation, and individualized monitoring are essential to optimize outcomes and reduce the burden of diabetes complications.