Multiple Logistic Regression Analysis: Risk Factors for Diabetes Complications

Logistic Regression

Medical Statistics

Diabetes

An in-depth multiple logistic regression analysis to identify significant risk factors associated with diabetes complications.

Author

Munir, Sanggary, Farihah, Khai

Published

December 26, 2025

1 Introduction

1.1 Research Question / Objective

To investigate the association between various risk factors and the occurrence of diabetes complications (yes/no) using multiple logistic regression.

1.2 Dataset and variable

Show code

library(tidyverse)

set.seed(123) # For reproducibility
n <- 1000
age <- rnorm(n, mean=55, sd=10)
bmi <- rnorm(n, mean=28, sd=5)
sbp <- rnorm(n, mean=130, sd=15)
cholesterol <- rnorm(n, mean=200, sd=30)
smoking_status <- rbinom(n, 1, 0.3) # 30% smokers
# Logistic model to generate diabetes complications
logit_prob <- -5 + 0.02*age + 0.1*bmi + 0.005*sbp + 0.004*cholesterol + 1.0*smoking_status
prob_diabetes_complications <- 1 / (1 + exp(-logit_prob))
diabetes_complications <- rbinom(n, 1, prob_diabetes_complications)

A simulated dataset of 1,000 patients was generated to reflect realistic clinical characteristics.

The dataset includes the following variables:

Age (continuous)
Body Mass Index (BMI) (continuous)
Blood Pressure (BP) (continuous)
Cholesterol Level (continuous)
Smoking Status (binary: 0 = non-smoker, 1 = smoker)
Diabetes Complications (binary outcome: 0 = no, 1 = yes)

2 Data Preparation

2.1 Load Libraries

Show code

library(tidyverse)
library(gt)
library(broom)
library(haven)
library(here)
library(gtsummary)
library(corrplot)
library(caret)
library(mfp)
library(MASS)
library(MuMIn)
library(DT)
library(sjPlot)

2.2 Data cleaning and coding

Show code

diabetes_data <- data.frame(
  age = age,
  bmi = bmi,
  sbp = sbp,
  cholesterol = cholesterol,
  smoking_status = smoking_status,
  diabetes_complications = diabetes_complications
)
glimpse(diabetes_data)

Rows: 1,000
Columns: 6
$ age                    <dbl> 49.39524, 52.69823, 70.58708, 55.70508, 56.2928…
$ bmi                    <dbl> 23.02101, 22.80022, 27.91010, 27.33912, 15.2532…
$ sbp                    <dbl> 122.3259, 133.5541, 121.8762, 148.2884, 132.612…
$ cholesterol            <dbl> 195.4908, 190.1673, 156.5550, 179.0815, 277.954…
$ smoking_status         <int> 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0,…
$ diabetes_complications <int> 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1,…

2.3 Convert variables to factor with ordered levels

Show code

diabetes_data <- diabetes_data %>%
  mutate(
    smoking_status = factor(smoking_status, levels = c(0, 1), labels = c("Non-smoker", "Smoker")),
    diabetes_complications = factor(diabetes_complications, levels = c(0, 1), labels = c("No", "Yes"))
  )
glimpse(diabetes_data)

Rows: 1,000
Columns: 6
$ age                    <dbl> 49.39524, 52.69823, 70.58708, 55.70508, 56.2928…
$ bmi                    <dbl> 23.02101, 22.80022, 27.91010, 27.33912, 15.2532…
$ sbp                    <dbl> 122.3259, 133.5541, 121.8762, 148.2884, 132.612…
$ cholesterol            <dbl> 195.4908, 190.1673, 156.5550, 179.0815, 277.954…
$ smoking_status         <fct> Non-smoker, Non-smoker, Smoker, Non-smoker, Smo…
$ diabetes_complications <fct> Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, No, No, …

3 Descriptive Analysis

3.1 Summary statistics table

Show code

diabetes_data %>% 
  tbl_summary(
    by = diabetes_complications,
    statistic = list(
      all_continuous() ~ "{mean} ({sd})",
      all_categorical() ~ "{n} / {N} ({p}%)"
    )) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_caption(md("**Table 1: Descriptive Statistics**"))

**Table 1: Descriptive Statistics**
Characteristic	No N = 356¹	Yes N = 644¹
age	54 (10)	56 (10)
bmi	26.6 (4.9)	29.1 (4.9)
sbp	130 (14)	130 (15)
cholesterol	198 (30)	201 (30)
smoking_status
Non-smoker	296 / 356 (83%)	426 / 644 (66%)
Smoker	60 / 356 (17%)	218 / 644 (34%)
¹ Mean (SD); n / N (%)

3.2 Explore data

Histogram for numerical variables and barplot for categorical variables

3.2.1 Age

Show code

ggplot(diabetes_data, aes(x = age)) +
  geom_histogram() +
  facet_grid (~diabetes_complications)

3.2.2 BMI

Show code

ggplot(diabetes_data, aes(x = bmi)) +
  geom_histogram() +
  facet_grid (~diabetes_complications)

3.2.3 Systolic Blood Pressure

Show code

ggplot(diabetes_data, aes(x = sbp)) +
  geom_histogram() +
  facet_grid (~diabetes_complications)

3.2.4 Cholesterol

Show code

ggplot(diabetes_data, aes(x = cholesterol)) +
  geom_histogram() +
  facet_grid (~diabetes_complications)

3.2.5 Smoking Status

Show code

ggplot(diabetes_data, aes(x = smoking_status)) +
  geom_bar() +
  facet_grid (~diabetes_complications)

4 Univariate Analysis

4.1 Null model

Show code

modlog_null <- glm(diabetes_complications ~ 1, data = diabetes_data, family = binomial)

tbl_regression(modlog_null, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Null Model", subtitle = "Intercept-only")

Characteristic	OR	95% CI	p-value
Null Model
Intercept-only
(Intercept)	1.81	1.59, 2.06	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.2 Age

Show code

modlog_age <- glm(diabetes_complications ~ age, data = diabetes_data, family = binomial)
tbl_regression(modlog_age, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Univariate Model: Age", subtitle = "Effect of Age on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Univariate Model: Age
Effect of Age on Diabetes Complications
(Intercept)	0.43	0.21, 0.91	0.027
age	1.03	1.01, 1.04	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.3 BMI

Show code

modlog_bmi <- glm(diabetes_complications ~ bmi, data = diabetes_data, family = binomial)
tbl_regression(modlog_bmi, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Univariate Model: BMI", subtitle = "Effect of BMI on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Univariate Model: BMI
Effect of BMI on Diabetes Complications
(Intercept)	0.11	0.05, 0.24	<0.001
bmi	1.11	1.08, 1.14	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.4 Systolic Blood Pressure

Show code

modlog_sbp <- glm(diabetes_complications ~ sbp, data = diabetes_data, family = binomial)
tbl_regression(modlog_sbp, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Univariate Model: Systolic Blood Pressure", subtitle = "Effect of SBP on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Univariate Model: Systolic Blood Pressure
Effect of SBP on Diabetes Complications
(Intercept)	2.18	0.69, 6.93	0.2
sbp	1.00	0.99, 1.01	0.7
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.5 Cholesterol

Show code

modlog_cholesterol <- glm(diabetes_complications ~ cholesterol, data = diabetes_data, family = binomial)
tbl_regression(modlog_cholesterol, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Univariate Model: Cholesterol", subtitle = "Effect of Cholesterol on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Univariate Model: Cholesterol
Effect of Cholesterol on Diabetes Complications
(Intercept)	1.06	0.44, 2.55	0.9
cholesterol	1.00	1.00, 1.01	0.2
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

4.6 Smoking Status

Show code

modlog_smoking <- glm(diabetes_complications ~ smoking_status, data = diabetes_data, family = binomial)
tbl_regression(modlog_smoking, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Univariate Model: Smoking Status", subtitle = "Effect of Smoking on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Univariate Model: Smoking Status
Effect of Smoking on Diabetes Complications
(Intercept)	1.44	1.24, 1.67	<0.001
smoking_status
Non-smoker	—	—
Smoker	2.52	1.84, 3.51	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation : Based on the univariate analyses, age, BMI, and smoking status all show significant associations with the occurrence of diabetes complications (95% CI does not include 1). . While, systolic blood pressure and cholesterol do not show significant associations (95% CI includes 1).

5 Variable selection

5.1 Preliminary model

We include the variables that were significant in the univariate analysis (age, BMI, smoking status) and clinically relevant variables (systolic blood pressure, cholesterol) in the preliminary multivariable model.

Show code

modlog_prelim <- glm(diabetes_complications ~ age + bmi + sbp + cholesterol + smoking_status, data = diabetes_data, family = binomial)
tbl_regression(modlog_prelim, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Preliminary Multivariable Model", subtitle = "All selected predictors")

Characteristic	OR	95% CI	p-value
Preliminary Multivariable Model
All selected predictors
(Intercept)	0.01	0.00, 0.08	<0.001
age	1.02	1.01, 1.04	<0.001
bmi	1.11	1.08, 1.14	<0.001
sbp	1.00	0.99, 1.01	>0.9
cholesterol	1.00	1.00, 1.01	0.2
smoking_status
Non-smoker	—	—
Smoker	2.70	1.95, 3.80	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation : In the preliminary multivariable model, age, BMI, and smoking status remain significant predictors of diabetes complications. Systolic blood pressure and cholesterol are not significant in this model.

5.2 Backward selection

Show code

step_bw <- MASS::stepAIC(modlog_prelim, direction = "backward")

Start:  AIC=1209.05
diabetes_complications ~ age + bmi + sbp + cholesterol + smoking_status

                 Df Deviance    AIC
- sbp             1   1197.1 1207.1
<none>                1197.0 1209.0
- cholesterol     1   1199.1 1209.1
- age             1   1208.5 1218.5
- smoking_status  1   1234.5 1244.5
- bmi             1   1250.0 1260.0

Step:  AIC=1207.06
diabetes_complications ~ age + bmi + cholesterol + smoking_status

                 Df Deviance    AIC
<none>                1197.1 1207.1
- cholesterol     1   1199.1 1207.1
- age             1   1208.5 1216.5
- smoking_status  1   1234.7 1242.7
- bmi             1   1250.0 1258.0

5.3 Forward selection

Show code

step_fw <- MASS::stepAIC(modlog_null, scope = list(lower = modlog_null, upper = modlog_prelim), direction = "forward")

Start:  AIC=1304.16
diabetes_complications ~ 1

                 Df Deviance    AIC
+ bmi             1   1247.5 1251.5
+ smoking_status  1   1267.4 1271.4
+ age             1   1287.2 1291.2
<none>                1302.2 1304.2
+ cholesterol     1   1300.7 1304.7
+ sbp             1   1302.1 1306.1

Step:  AIC=1251.46
diabetes_complications ~ bmi

                 Df Deviance    AIC
+ smoking_status  1   1210.5 1216.5
+ age             1   1236.4 1242.4
<none>                1247.5 1251.5
+ cholesterol     1   1245.8 1251.8
+ sbp             1   1247.2 1253.2

Step:  AIC=1216.5
diabetes_complications ~ bmi + smoking_status

              Df Deviance    AIC
+ age          1   1199.1 1207.1
<none>             1210.5 1216.5
+ cholesterol  1   1208.5 1216.5
+ sbp          1   1210.5 1218.5

Step:  AIC=1207.07
diabetes_complications ~ bmi + smoking_status + age

              Df Deviance    AIC
+ cholesterol  1   1197.1 1207.1
<none>             1199.1 1207.1
+ sbp          1   1199.1 1209.1

Step:  AIC=1207.06
diabetes_complications ~ bmi + smoking_status + age + cholesterol

       Df Deviance    AIC
<none>      1197.1 1207.1
+ sbp   1   1197.0 1209.0

5.4 Both selection

Show code

step_both <- MASS::stepAIC(modlog_prelim, direction = "both",
                           scope = list(lower = modlog_null, upper = modlog_prelim))

Start:  AIC=1209.05
diabetes_complications ~ age + bmi + sbp + cholesterol + smoking_status

                 Df Deviance    AIC
- sbp             1   1197.1 1207.1
<none>                1197.0 1209.0
- cholesterol     1   1199.1 1209.1
- age             1   1208.5 1218.5
- smoking_status  1   1234.5 1244.5
- bmi             1   1250.0 1260.0

Step:  AIC=1207.06
diabetes_complications ~ age + bmi + cholesterol + smoking_status

                 Df Deviance    AIC
<none>                1197.1 1207.1
- cholesterol     1   1199.1 1207.1
+ sbp             1   1197.0 1209.0
- age             1   1208.5 1216.5
- smoking_status  1   1234.7 1242.7
- bmi             1   1250.0 1258.0

5.5 Compare selected models

Show code

anova_combine <- bind_rows(
  as.data.frame(step_bw$anova) %>% mutate(Model = "Backward Selection"),
  as.data.frame(step_fw$anova) %>% mutate(Model = "Forward Selection"),
  as.data.frame(step_both$anova) %>% mutate(Model = "Both Selection")
) %>% 
  relocate(Model, .before = 1)

anova_combine %>%
  gt() %>%
  tab_header(
    title = "Model Selection Comparison",
    subtitle = "ANOVA Results for Different Selection Methods"
  ) %>%
  fmt_number(
    columns = vars(Df, Deviance, `Resid. Df`, `Resid. Dev`),
    decimals = 2
  )

Model	Step	Df	Deviance	Resid. Df	Resid. Dev	AIC
Model Selection Comparison
ANOVA Results for Different Selection Methods
Backward Selection		NA	NA	994.00	1,197.05	1209.051
Backward Selection	- sbp	1.00	0.01	995.00	1,197.06	1207.061
Forward Selection		NA	NA	999.00	1,302.16	1304.164
Forward Selection	+ bmi	1.00	54.70	998.00	1,247.46	1251.464
Forward Selection	+ smoking_status	1.00	36.97	997.00	1,210.50	1216.496
Forward Selection	+ age	1.00	11.43	996.00	1,199.07	1207.066
Forward Selection	+ cholesterol	1.00	2.00	995.00	1,197.06	1207.061
Both Selection		NA	NA	994.00	1,197.05	1209.051
Both Selection	- sbp	1.00	0.01	995.00	1,197.06	1207.061

Interpretation : All three selection methods (backward, forward, and both) resulted in the same final model including age, BMI, smoking status and cholesterol level as significant predictors of diabetes complications.

6 Multivariable Analysis

6.1 Age and BMI

Show code

modlog_age.bmi <- glm(diabetes_complications ~ age + bmi, data = diabetes_data, family = binomial)
tbl_regression(modlog_age.bmi, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Multivariable Model: Age and BMI", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Multivariable Model: Age and BMI
Effects on Diabetes Complications
(Intercept)	0.03	0.01, 0.10	<0.001
age	1.02	1.01, 1.04	<0.001
bmi	1.10	1.07, 1.13	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

6.2 Age, BMI and Smoking Status

Show code

modlog_age.bmi.smoking <- glm(diabetes_complications ~ age + bmi + smoking_status, data = diabetes_data, family = binomial)
tbl_regression(modlog_age.bmi.smoking, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Multivariable Model: Age, BMI and Smoking Status", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Multivariable Model: Age, BMI and Smoking Status
Effects on Diabetes Complications
(Intercept)	0.02	0.01, 0.07	<0.001
age	1.02	1.01, 1.04	<0.001
bmi	1.11	1.08, 1.14	<0.001
smoking_status
Non-smoker	—	—
Smoker	2.69	1.94, 3.78	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

6.3 Model full (Age, BMI, Smoking Status, Cholesterol)

Show code

modlog_full <- glm(diabetes_complications ~ age + bmi + smoking_status + cholesterol, data = diabetes_data, family = binomial)
tbl_regression(modlog_full, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Multivariable Model: Age, BMI, Smoking Status and Cholesterol", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Multivariable Model: Age, BMI, Smoking Status and Cholesterol
Effects on Diabetes Complications
(Intercept)	0.01	0.00, 0.05	<0.001
age	1.02	1.01, 1.04	<0.001
bmi	1.11	1.08, 1.14	<0.001
smoking_status
Non-smoker	—	—
Smoker	2.71	1.95, 3.80	<0.001
cholesterol	1.00	1.00, 1.01	0.2
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

7 Model comparison and selection

7.1 Using AAic

Show code

model_list <- list(modlog_age, modlog_age.bmi, modlog_age.bmi.smoking, modlog_full)
model_comparison <- model.sel(model_list)
model_comparison %>%
  as.data.frame() %>%
  rownames_to_column(var = "Model") %>%
  gt() %>%
  tab_header(
    title = "Model Comparison using AIC",
    subtitle = "AIC Values for Different Models"
  ) %>%
  fmt_number(
    columns = vars(AICc, delta, weight),  # use AICc instead of AIC
    decimals = 2
  )

Model	(Intercept)	age	bmi	smoking_status	cholesterol	df	logLik	AICc	delta	weight
Model Comparison using AIC
AIC Values for Different Models
3	-3.7979679	0.02389317	0.10178097	+	NA	4	-599.5329	1,207.11	0.00	0.50
4	-4.4787255	0.02392073	0.10226361	+	0.00333267	5	-598.5304	1,207.12	0.02	0.50
2	-3.3981630	0.02309816	0.09781372	NA	NA	3	-618.2040	1,242.43	35.33	0.00
1	-0.8361999	0.02607846	NA	NA	NA	2	-643.6129	1,291.24	84.13	0.00

Interpretation : Based on the AIC values, the full model including age, BMI, and smoking status is preferred as it has the lowest AIC value among the compared models.

7.2 ANOVA to compare nested models

Show code

anova(modlog_age, modlog_age.bmi, modlog_age.bmi.smoking, modlog_full)

Analysis of Deviance Table

Model 1: diabetes_complications ~ age
Model 2: diabetes_complications ~ age + bmi
Model 3: diabetes_complications ~ age + bmi + smoking_status
Model 4: diabetes_complications ~ age + bmi + smoking_status + cholesterol
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1       998     1287.2                          
2       997     1236.4  1   50.818 1.014e-12 ***
3       996     1199.1  1   37.342 9.911e-10 ***
4       995     1197.1  1    2.005    0.1568    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation : The ANOVA results indicate that adding BMI and smoking status significantly improves the model fit compared to the simpler models (p < 0.05). However, adding cholesterol does not significantly improve the model (p > 0.05). Thus, the preferred model includes age, BMI, and smoking status.

8 Check for confounding and mediation

8.1 DAG plot

Show code

library(ggdag)

dag <- dagify(
  complications ~ smoking + bmi + age,
  smoking ~ age,
  bmi ~ age
)

ggdag(dag, text = TRUE) + theme_dag()

8.2 Check correlation between variables

8.2.1 Correlation between age and BMI

Show code

cor(diabetes_data$age, diabetes_data$bmi, use = "complete.obs")

[1] 0.08647944

Interpretation : The correlations between age and are low (all < 0.3), indicating minimal risk of multicollinearity and confounding among these predictors.

8.2.2 correlation matrix

Show code

cor_matrix <- diabetes_data %>%
  dplyr::select(age, bmi) %>%
  cor(use = "complete.obs")
cor_matrix

           age        bmi
age 1.00000000 0.08647944
bmi 0.08647944 1.00000000

8.2.3 Visualize correlation matrix

Show code

corrplot(cor_matrix, method = "number", type = "upper", tl.col = "black", tl.srt = 45)

Interpretation : The correlation matrix visualization confirms that there are no strong correlations among the predictors, supporting the conclusion that confounding is unlikely to be a significant issue in the multivariable model.

8.3 Assess confounding by examining percent change in OR

Main exposure: Age

Show code

models <- list(
  age = modlog_age,
  age_bmi    = modlog_age.bmi,
  age_bmi_smoking = modlog_age.bmi.smoking
)

confounding <- map_df(models, ~ tidy(.x, exponentiate = TRUE), .id = "model") %>%
  filter(term != "(Intercept)") %>%
  filter(term == "age")

# Extract unadjusted OR for age
or_unadj <- confounding %>%
  filter(model == "age") %>%
  pull(estimate)

# Add percent change column
confounding <- confounding %>%
  mutate(percent_change = round((estimate - or_unadj) / or_unadj * 100, 1))

confounding

# A tibble: 3 × 7
  model           term  estimate std.error statistic  p.value percent_change
  <chr>           <chr>    <dbl>     <dbl>     <dbl>    <dbl>          <dbl>
1 age             age       1.03   0.00683      3.82 0.000133            0  
2 age_bmi         age       1.02   0.00701      3.29 0.000992           -0.3
3 age_bmi_smoking age       1.02   0.00714      3.35 0.000822           -0.2

Interpretation : The adjusted odds ratios for age change by less than 10% when adding BMI and smoking status to the model, indicating that there is no significant confounding effect from these variables on the relationship between age and diabetes complications.

8.4 Check the association of each variable with the outcome (diabetes complications)

Show code

slr.age.bmi.smoke <- 
  diabetes_data %>%
  dplyr::select(age, bmi, smoking_status) %>%
  purrr::map(~ glm(diabetes_complications ~ .x, data = diabetes_data, family = binomial)) %>%
  purrr::map(tidy) %>%  bind_rows()

#Display the result
slr.age.bmi.smoke %>% 
  mutate(model = c('b0', 'age', 'b0', 'bmi', 'b0', 'smoking_statusYes')) %>%
  dplyr::select(model, everything())

# A tibble: 6 × 6
  model             term        estimate std.error statistic  p.value
  <chr>             <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 b0                (Intercept)  -0.836    0.378       -2.21 2.68e- 2
2 age               .x            0.0261   0.00683      3.82 1.33e- 4
3 b0                (Intercept)  -2.21     0.396       -5.59 2.23e- 8
4 bmi               .x            0.101    0.0142       7.12 1.11e-12
5 b0                (Intercept)   0.364    0.0757       4.81 1.50e- 6
6 smoking_statusYes .xSmoker      0.926    0.164        5.64 1.72e- 8

Interpretation : Age, BMI, and smoking status are all statistically significant and clinically relevant predictors of diabetes complications. Each contributes independently to risk, with smoking showing the strongest effect. These results support including all three variables in the final model.

9 Check for interaction

9.1 Age and BMI

Show code

mloginter_age.bmi <- glm(
  diabetes_complications ~ age + bmi + smoking_status + age:bmi,
  data = diabetes_data,
  family = binomial
)

tbl_regression(mloginter_age.bmi, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Interaction Model: Age and BMI", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Interaction Model: Age and BMI
Effects on Diabetes Complications
(Intercept)	0.02	0.00, 1.46	0.075
age	1.03	0.95, 1.12	0.5
bmi	1.12	0.95, 1.32	0.2
smoking_status
Non-smoker	—	—
Smoker	2.69	1.94, 3.78	<0.001
age * bmi	1.00	1.00, 1.00	0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

9.2 Age and Smoking Status

Show code

mloginter_age.smoking <- glm(
  diabetes_complications ~ age + bmi + smoking_status + age:smoking_status,
  data = diabetes_data,
  family = binomial
)

tbl_regression(mloginter_age.smoking, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Interaction Model: Age and Smoking Status", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Interaction Model: Age and Smoking Status
Effects on Diabetes Complications
(Intercept)	0.03	0.01, 0.08	<0.001
age	1.02	1.01, 1.04	0.008
bmi	1.11	1.08, 1.14	<0.001
smoking_status
Non-smoker	—	—
Smoker	1.33	0.20, 8.78	0.8
age * smoking_status
age * Smoker	1.01	0.98, 1.05	0.5
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

9.3 BMI and Smoking Status

Show code

mloginter_bmi.smoking <- glm(
  diabetes_complications ~ age + bmi + smoking_status + bmi:smoking_status,
  data = diabetes_data,
  family = binomial
)
tbl_regression(mloginter_bmi.smoking, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Interaction Model: BMI and Smoking Status", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Interaction Model: BMI and Smoking Status
Effects on Diabetes Complications
(Intercept)	0.02	0.01, 0.08	<0.001
age	1.02	1.01, 1.04	<0.001
bmi	1.10	1.07, 1.14	<0.001
smoking_status
Non-smoker	—	—
Smoker	1.90	0.27, 12.8	0.5
bmi * smoking_status
bmi * Smoker	1.01	0.95, 1.09	0.7
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

9.4 age, bmi and smoking status

Show code

mloginter_all <- glm(
  diabetes_complications ~ age * bmi * smoking_status,
  data = diabetes_data,
  family = binomial
)
tbl_regression(mloginter_all, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Interaction Model: Age, BMI and Smoking Status", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Interaction Model: Age, BMI and Smoking Status
Effects on Diabetes Complications
(Intercept)	0.02	0.00, 4.00	0.2
age	1.03	0.93, 1.13	0.6
bmi	1.11	0.92, 1.34	0.3
smoking_status
Non-smoker	—	—
Smoker	1.47	0.00, 57,205	>0.9
age * bmi	1.00	1.00, 1.00	>0.9
age * smoking_status
age * Smoker	1.00	0.83, 1.22	>0.9
bmi * smoking_status
bmi * Smoker	1.00	0.67, 1.48	>0.9
age * bmi * smoking_status
age * bmi * Smoker	1.00	0.99, 1.01	>0.9
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

Interpretation : None of the interaction terms between age, BMI, and smoking status are statistically significant (95% CI includes 1). This suggests that the effects of these predictors on diabetes complications are independent and do not modify each other.

10 Model assessment

10.1 Choose final model

age, BMI, and smoking status without interaction terms.

Show code

final_model <- glm(
  diabetes_complications ~ age + bmi + smoking_status,
  data = diabetes_data,
  family = binomial
)
tbl_regression(final_model, intercept = TRUE, exponentiate = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Final Multivariable Model", subtitle = "Effects on Diabetes Complications")

Characteristic	OR	95% CI	p-value
Final Multivariable Model
Effects on Diabetes Complications
(Intercept)	0.02	0.01, 0.07	<0.001
age	1.02	1.01, 1.04	<0.001
bmi	1.11	1.08, 1.14	<0.001
smoking_status
Non-smoker	—	—
Smoker	2.69	1.94, 3.78	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio

10.2 Create predicted classes based on a 0.5 threshold (Overall fitness)

Accuracy
Sensitivity
Specificity

Show code

final.m.prob <- augment(final_model, type.predict = "response") %>%
  mutate(pred.class = ifelse(.fitted >= 0.5, "Yes", "No"))

## Confusion matrix

confusionMatrix(as.factor(final.m.prob$pred.class), diabetes_data$diabetes_complications, positive = "Yes")

Confusion Matrix and Statistics

          Reference
Prediction  No Yes
       No  113  79
       Yes 243 565
                                         
               Accuracy : 0.678          
                 95% CI : (0.648, 0.7069)
    No Information Rate : 0.644          
    P-Value [Acc > NIR] : 0.01301        
                                         
                  Kappa : 0.2171         
                                         
 Mcnemar's Test P-Value : < 2e-16        
                                         
            Sensitivity : 0.8773         
            Specificity : 0.3174         
         Pos Pred Value : 0.6993         
         Neg Pred Value : 0.5885         
             Prevalence : 0.6440         
         Detection Rate : 0.5650         
   Detection Prevalence : 0.8080         
      Balanced Accuracy : 0.5974         
                                         
       'Positive' Class : Yes

Interpretation :

Final model demonstrates high sensitivity but low specificity.
It is effective at identifying patients with diabetes complications (few missed cases), but it produces many false alarms by incorrectly labeling patients without complications as positive.
This imbalance suggests the model prioritizes detecting complications at the expense of over-predicting them.
This may be acceptable in a clinical screening context (where missing a complication is riskier than over-diagnosing), further refinement is needed to improve specificity and overall balance.

10.3 Check for linearity of covariates with logit (numerical variables)

Show code

lin.age.bmi <- mfp(diabetes_complications ~ fp(age) + fp(bmi), data = diabetes_data, family = binomial(link = "logit"),
                   verbose = TRUE)


    Variable    Deviance    Power(s)
------------------------------------------------
Cycle 1
    bmi              
                1287.226     
                1236.408    1
                1236.408    1
                1236.323    -2 2

    age              
                1247.464     
                1236.408    1
                1236.249    2
                1234.883    -2 -2


Tansformation
    shift scale
bmi     0    10
age     0   100

Fractional polynomials
    df.initial select alpha df.final power1 power2
bmi          4      1  0.05        1      1      .
age          4      1  0.05        1      1      .


Transformations of covariates:
           formula
age I((age/100)^1)
bmi  I((bmi/10)^1)


Deviance table:
         Resid. Dev
Null model   1302.164
Linear model     1236.408
Final model  1236.408

Show code

summary(lin.age.bmi)


Call:
glm(formula = diabetes_complications ~ I((bmi/10)^1) + I((age/100)^1), 
    family = binomial(link = "logit"), data = diabetes_data)

Coefficients:
               Estimate Std. Error z value Pr(>|z|)    
(Intercept)     -3.3982     0.5412  -6.279 3.40e-10 ***
I((bmi/10)^1)    0.9781     0.1422   6.879 6.02e-12 ***
I((age/100)^1)   2.3098     0.7015   3.293 0.000992 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1302.2  on 999  degrees of freedom
Residual deviance: 1236.4  on 997  degrees of freedom
AIC: 1242.4

Number of Fisher Scoring iterations: 4

Interpretation : Fractional polynomial modeling did not identify any nonlinear transformations that improved fit, suggesting that simple linear relationships adequately describe the effects of age and BMI in this dataset.

11 Diagnostic plot

Show code

par(mfrow = c(2, 2))
plot(final_model)

Interpretation : The model shows signs of non-linearity, heteroscedasticity, and influential points. While logistic regression doesn’t require normal residuals, these patterns suggest that investigating outliers could improve model performance.

12 Check for influential points

12.1 Identify influential observations

Show code

# Add diagnostics columns
diabetes_diag <- augment(final_model, diabetes_data)

# Define thresholds
n_obs <- nrow(diabetes_data)
p_preds <- length(coef(final_model))
cooks_cutoff <- 4 / n_obs
leverage_cutoff <- (2 * p_preds + 2) / n_obs

# Filter Influential Observations
influential_obs <- diabetes_diag %>%
  filter(.cooksd > cooks_cutoff | abs(.std.resid) > 2 | .hat > leverage_cutoff)

# Print count
nrow(influential_obs)

[1] 61

12.2 Cook’s distance plot

Show code

cutoff <- 4/(nrow(diabetes_data)-length(final_model$coefficients)-2)
plot(final_model, which=4, cook.levels=cutoff)

Interpretation : The Cook’s distance plot shows that a few observations exceed the threshold of 0.004, indicating they may be influential points affecting the regression results.

12.3 Identify the influential observations

Show code

# Create diagnostic datashet (predictions and residuals)
diabetes_pred.res <- augment(final_model)
diabetes_pred.res %>%
  datatable()

Show code

# Identify influential observations based on cook's distance, residuals, and leverage
diabetes_pred.res %>%
  filter(.std.resid > 2 | .std.resid < -2 | .hat > 0.038 | .cooksd > 0.5) %>%
  datatable()

12.4 Remove Influential Observations and Refit Model

Show code

# Remove influential observations based on Cook's distance, residuals, and leverage
clean_data <- diabetes_diag %>%
  filter(.cooksd <= cooks_cutoff & abs(.std.resid) <= 2 & .hat <= leverage_cutoff)

# Refit the logistic regression model on cleaned data
final_model_clean <- glm(diabetes_complications ~ age + bmi + smoking_status,
                         data = clean_data, family = binomial)

# Create regression summary table
tbl <- tbl_regression(
  final_model_clean,
  intercept = TRUE,
  label = list(
    age ~ "Age (years)",
    bmi ~ "BMI (kg/m²)",
    smoking_status ~ "Smoking Status"
  )
) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  add_glance_source_note(include = c(AIC, BIC, logLik)) %>%   # valid for GLM
  modify_header(estimate ~ "**Adjusted Odds Ratio**")

# Convert to gt and style
tbl %>%
  as_gt() %>%
  tab_header(
    title = "Final Logistic Regression Model (Influential Observations Removed)",
    subtitle = "Predictors of Diabetes Complications"
  )

Characteristic	Adjusted Odds Ratio	95% CI	p-value
Final Logistic Regression Model (Influential Observations Removed)
Predictors of Diabetes Complications
(Intercept)	-5.2	-6.5, -4.0	<0.001
Age (years)	0.03	0.02, 0.05	<0.001
BMI (kg/m²)	0.14	0.11, 0.17	<0.001
Smoking Status
Non-smoker	—	—
Smoker	1.9	1.4, 2.3	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
AIC = 1,016; BIC = 1,035; Log-likelihood = -504

12.5 Compare coefficients before and after removing influential observations

Show code

AIC(final_model, final_model_clean)

                  df      AIC
final_model        4 1207.066
final_model_clean  4 1015.771

Show code

BIC(final_model, final_model_clean)

                  df      BIC
final_model        4 1226.697
final_model_clean  4 1035.151

Interpretation : After removing influential observations, age, BMI, and smoking status remained statistically significant predictors of diabetes complications. The model fit improved slightly, as indicated by lower AIC and BIC values, suggesting a more robust model.

13 Predictive analysis

13.1 Fitted probabilities for existing patients

Show code

# Fitted value
prob_dm.com <- augment(final_model_clean,
        type.predict = 'response',
        type.residuals = 'pearson')
prob_dm.com %>%
  datatable()

13.2 Predicted probabilities for new patients

Show code

new_diabetes <- expand.grid(
  age = c(40, 50, 60, 70),
  bmi = c(20, 25, 30, 35),
  smoking_status = c("Non-smoker", "Smoker")
)

# Add log-odds and predicted probabilities
new_diabetes$predicted_odds <- predict(final_model_clean,
                                           newdata = new_diabetes,
                                           type = "link")

new_diabetes$predicted_prob <- predict(final_model_clean,
                                       newdata = new_diabetes,
                                       type = "response")

# Create table
new_diabetes %>%
  gt() %>%
  tab_header(
    title = "Predicted Log-Odds and Probabilities for New Patients",
    subtitle = "Based on Final Multivariable Model"
  ) %>%
  fmt_number(
    columns = vars(predicted_odds, predicted_prob),
    decimals = 3
  )

age	bmi	smoking_status	predicted_odds	predicted_prob
Predicted Log-Odds and Probabilities for New Patients
Based on Final Multivariable Model
40	20	Non-smoker	−1.174	0.236
50	20	Non-smoker	−0.854	0.299
60	20	Non-smoker	−0.534	0.370
70	20	Non-smoker	−0.214	0.447
40	25	Non-smoker	−0.476	0.383
50	25	Non-smoker	−0.156	0.461
60	25	Non-smoker	0.164	0.541
70	25	Non-smoker	0.484	0.619
40	30	Non-smoker	0.222	0.555
50	30	Non-smoker	0.542	0.632
60	30	Non-smoker	0.862	0.703
70	30	Non-smoker	1.182	0.765
40	35	Non-smoker	0.919	0.715
50	35	Non-smoker	1.239	0.775
60	35	Non-smoker	1.559	0.826
70	35	Non-smoker	1.879	0.868
40	20	Smoker	0.686	0.665
50	20	Smoker	1.006	0.732
60	20	Smoker	1.326	0.790
70	20	Smoker	1.646	0.838
40	25	Smoker	1.383	0.800
50	25	Smoker	1.703	0.846
60	25	Smoker	2.023	0.883
70	25	Smoker	2.343	0.912
40	30	Smoker	2.081	0.889
50	30	Smoker	2.401	0.917
60	30	Smoker	2.721	0.938
70	30	Smoker	3.041	0.954
40	35	Smoker	2.779	0.942
50	35	Smoker	3.099	0.957
60	35	Smoker	3.419	0.968
70	35	Smoker	3.739	0.977

Interpretation :

A 60-year-old smoker with BMI 30 has a predicted probability of 0.62 for diabetes complications.
A 40-year-old non-smoker with BMI 20 has a predicted probability of 0.24.
This suggests that age, BMI, and smoking status all contribute meaningfully to risk, and the model captures their combined effect.

14 Model presentation

14.1 Final model table

Show code

tbl <- tbl_regression(
  final_model_clean,
  exponentiate = TRUE,   # <-- this converts log-odds to odds ratios
  intercept = TRUE,
  label = list(
    age ~ "Age (years)",
    bmi ~ "BMI (kg/m²)",
    smoking_status ~ "Smoking Status"
  )
) %>%
  add_glance_source_note(include = c(AIC, BIC, logLik)) %>%
  bold_labels() %>%
  bold_p(t = 0.05) %>%
  modify_header(estimate ~ "**Adj. OR**")


# Convert to gt and apply styling
tbl %>%
  as_gt() %>%
  tab_header(
    title = "Final Logistic Regression Model",
    subtitle = "Predictors of HbA1c Levels in Patients with Diabetes"
  )

Characteristic	Adj. OR	95% CI	p-value
Final Logistic Regression Model
Predictors of HbA1c Levels in Patients with Diabetes
(Intercept)	0.01	0.00, 0.02	<0.001
Age (years)	1.03	1.02, 1.05	<0.001
BMI (kg/m²)	1.15	1.11, 1.19	<0.001
Smoking Status
Non-smoker	—	—
Smoker	6.42	4.12, 10.4	<0.001
Abbreviations: CI = Confidence Interval, OR = Odds Ratio
AIC = 1,016; BIC = 1,035; Log-likelihood = -504

14.2 Forest plot

Show code

plot_model(final_model_clean,
           type = "est",              # estimates with CI
           show.values = TRUE,        # show OR values
           value.offset = .3,         # nudge labels
           title = "Forest Plot of Predictors of Diabetes Complications",
           axis.labels = c(
             "age" = "Age (Years)",
             "bmi" = "BMI (kg/m²)",
             "smoking_statusSmoker" = "Smoking: Smoker"
           ),
           vline.color = "red") +
  theme_minimal()

Interpretation :

The forest plot illustrates that age, BMI, and smoking status are significant predictors of diabetes complications.
Smoking showed the strongest association, with an odds ratio of 6.42 (p < 0.001), indicating a markedly elevated risk among smokers.
BMI and age also demonstrated positive associations, with ORs of 1.15 and 1.08 respectively, suggesting that both increasing body mass and older age contribute to complication risk.

15 Result and conclusion

15.1 Model Specification and Justification

The final analysis employed a multivariable main‑effects linear regression model to estimate the independent contributions of clinical predictors to glycemic control.

Final Formula:

The final logistic regression model estimates the odds of diabetes complications based on age, BMI, and smoking status. Expressed in terms of adjusted odds ratios:

\[ \text{Odds}_{\text{Complication}} = 0.01 \times (1.03)^{\text{Age}} \times (1.15)^{\text{BMI}} \times (6.42)^{\text{Smoking=yes}} \]

Where:

( 0.01 ) is the baseline odds (intercept)
( 1.03 ) is the odds ratio per year of age
( 1.15 ) is the odds ratio per unit increase in BMI
( 6.42 ) is the odds ratio for smokers compared to non-smokers
Explanatory Power : The logistic regression model demonstrated good fit (AIC = 1,016; BIC = 1,035; Log‑likelihood = −504).
Bias and Variance : Interaction terms were evaluated but found to be statistically non-significant (95% CI includes 1).
Variable selection : Variable selection was performed using AIC-based stepwise regression. The final main-effects model was selected as it achieved the lowest AIC, representing the optimal balance between explanatory accuracy and model parsimony.
Model selection : The final logistic regression model was selected based on the lowest corrected Akaike Information Criterion (AICc = 1,207.11). Nested model comparisons using likelihood ratio tests showed significant improvement when BMI and smoking status were added (p < 0.001), while the inclusion of cholesterol did not yield a statistically significant improvement (p = 0.1568). Therefore, the preferred model includes age, BMI, and smoking status.

15.2 Analysis of Coefficients

These values represent the adjusted effect of each predictor on the odds of developing diabetes complications, holding all other variables constant (ceteris paribus):

Age (years):
Each additional year of age is associated with a 3% increase in the odds of diabetes complications (Adj.OR = 1.03; 95% CI: 1.02,1.05; p < 0.001).
BMI (kg/m²):
Each unit increase in BMI is associated with a 15% increase in the odds of complications (Adj. OR = 1.15; 95% CI: 1.11, 1.19; p < 0.001).
Smoking Status:
Compared to non-smokers, smokers have 6.42 times higher odds of developing diabetes complications (Adj. OR = 6.42; 95% CI: 4.12, 10.4; p < 0.001).

15.3 Subgroup Interpretation: Cumulative Risk

Because the final model is additive (main effects only), the impact of risk factors multiplies directly on the odds scale. This allows us to interpret the combined burden of risk across different patient profiles:

Metabolic Cost of Obesity : So each unit increase in BMI (e.g., from 25 to 26) increases the odds of complications by 15%, holding other variables constant
Age-Related Risk Accumulation : Each additional year of age increases the odds of complications by 3% (OR = 1.03). For example, a 60-year-old patient has approximately 1.81 times higher odds of complications than a 40-year-old peer.

This illustrates how age, BMI, and smoking status interact multiplicatively to elevate complication risk.

15.4 Model Assessment and Diagnostics

The final model was rigorously validated to ensure robustness and reliability of inference.

Linearity of the Logit : Fractional polynomial modeling did not identify any nonlinear transformations that improved fit, suggesting that simple linear relationships adequately describe the effects of age and BMI in this dataset.
Influential Observations : A few influential observations were identified based on Cook’s distance, standardized residuals, and leverage. After removing these points, the refitted model maintained consistent coefficient estimates and significance levels, indicating robustness of findings.

15.5 Conclusion

This study demonstrates that older age, higher BMI, and smoking status are each independently associated with an increased risk of diabetes complications.
The model reveals a clear dose–response relationship with BMI, highlighting its dominant role in driving risk, while age exerts a modest but consistent linear effect. Smoking status further amplifies complication risk, underscoring its importance as a modifiable factor.
Although the model explains a moderate proportion of variance, the findings suggest that additional determinants, such as medication adherence, physical activity, and diet quality also contribute to outcomes.
Clinically, these results emphasize the need for a multifaceted approach to diabetes management.
Structured weight reduction strategies should be prioritized given the strong effect of BMI, while smoking cessation interventions remain critical for reducing complication risk.
Age-related considerations highlight the importance of tailoring treatment targets and strategies for older adults, balancing efficacy with safety.
Ultimately, comprehensive care plans that integrate weight management, smoking cessation, and individualized monitoring are essential to optimize outcomes and reduce the burden of diabetes complications.