To identify significant predictors of glycemic control, measured by HbA1c levels, among patients with Type 2 diabetes.
library(tidyverse)
set.seed(123)
n <- 1000
# Generate variables
age <- rnorm(n, mean = 55, sd = 10)
bmi <- sample(c("underweight", "normal", "overweight", "obese"),
n, replace = TRUE,
prob = c(0.1, 0.4, 0.3, 0.2))
# Duration of diabetes (always positive)
duration_diabetes <- rlnorm(n, meanlog = log(10), sdlog = 0.3)
# HbA1c outcome
hba1c <- 4 +
0.05 * age +
ifelse(bmi == "underweight", 0,
ifelse(bmi == "normal", 0.5,
ifelse(bmi == "overweight", 1, 1.5))) +
0.1 * duration_diabetes +
rnorm(n, mean = 0, sd = 1)
# Assemble into a tibble
diabetes_data <- tibble( age = round(age, 1),
bmi = bmi,
duration_diabetes = round(duration_diabetes, 1),
hba1c = round(hba1c, 1) )
head(diabetes_data)
# A tibble: 6 × 4
age bmi duration_diabetes hba1c
<dbl> <chr> <dbl> <dbl>
1 49.4 normal 7.8 7.1
2 52.7 normal 9.1 8.6
3 70.6 normal 7.6 8.1
4 55.7 overweight 12.1 8.5
5 56.3 overweight 14 10
6 72.2 overweight 18.9 10
A simulated dataset of 1,000 patients was generated to reflect realistic clinical characteristics.
The dataset includes:
library(broom) # For tidying model outputs
library(gtsummary) # NEW: Automates regression tables
library(performance) # NEW: Best-in-class model diagnostics
library(MASS) # For stepAIC
library(MuMIn) # For model selection using AICc
library(dplyr) # For data manipulation
library(ggplot2) # For data visualization
library(gt) # For enhanced table formatting
library(lmtest) # For Breusch-Pagan test
library(car) # For VIF calculation
library(DT) # For interactive tables
library(patchwork) # For combining ggplots
library(sjPlot) # For plotting model effects
glimpse(diabetes_data)
Rows: 1,000
Columns: 4
$ age <dbl> 49.4, 52.7, 70.6, 55.7, 56.3, 72.2, 59.6, 42.3, 48.1…
$ bmi <chr> "normal", "normal", "normal", "overweight", "overwei…
$ duration_diabetes <dbl> 7.8, 9.1, 7.6, 12.1, 14.0, 18.9, 11.2, 7.7, 13.6, 13…
$ hba1c <dbl> 7.1, 8.6, 8.1, 8.5, 10.0, 10.0, 9.1, 8.4, 7.1, 9.9, …
diabetes_data <- diabetes_data %>%
mutate(across(bmi, ~ factor(.x,
levels = c("normal", "underweight", "overweight", "obese"))))
glimpse(diabetes_data)
Rows: 1,000
Columns: 4
$ age <dbl> 49.4, 52.7, 70.6, 55.7, 56.3, 72.2, 59.6, 42.3, 48.1…
$ bmi <fct> normal, normal, normal, overweight, overweight, over…
$ duration_diabetes <dbl> 7.8, 9.1, 7.6, 12.1, 14.0, 18.9, 11.2, 7.7, 13.6, 13…
$ hba1c <dbl> 7.1, 8.6, 8.1, 8.5, 10.0, 10.0, 9.1, 8.4, 7.1, 9.9, …
diabetes_data %>%
tbl_summary(
label = list(age ~ "Age (Years)", bmi ~ "BMI Category", duration_diabetes ~ "Duration (Years)", hba1c ~ "HbA1c (%)"),
statistic = list(all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} ({p}%)")
) %>%
modify_header(label = "**Variable**") %>%
bold_labels()
| Variable | N = 1,0001 |
|---|---|
| Age (Years) | 55 (10) |
| BMI Category | |
| normal | 402 (40%) |
| underweight | 107 (11%) |
| overweight | 291 (29%) |
| obese | 200 (20%) |
| Duration (Years) | 10.4 (3.1) |
| HbA1c (%) | 8.59 (1.25) |
| 1 Mean (SD); n (%) | |
mod_null <- lm(hba1c ~ 1, data = diabetes_data)
tbl_regression(mod_null, intercept = TRUE) %>%
bold_p(t = 0.05) %>%
as_gt() %>%
tab_header(title = "Null Model", subtitle = "Intercept-only")
| Null Model | |||
| Intercept-only | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 8.6 | 8.5, 8.7 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
mod_age <- lm(hba1c ~ age, data = diabetes_data)
tbl_regression(mod_age) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Univariate: Age")
| Univariate: Age | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.04, 0.05 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
mod_bmi <- lm(hba1c ~ bmi, data = diabetes_data)
tbl_regression(mod_bmi) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Univariate: BMI")
| Univariate: BMI | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| bmi | |||
| normal | — | — | |
| underweight | -0.51 | -0.76, -0.26 | <0.001 |
| overweight | 0.42 | 0.25, 0.59 | <0.001 |
| obese | 1.1 | 0.92, 1.3 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
mod_duration <- lm(hba1c ~ duration_diabetes, data = diabetes_data)
tbl_regression(mod_duration) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Univariate: Duration")
| Univariate: Duration | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| duration_diabetes | 0.10 | 0.07, 0.12 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
All Predictors in univariable analysis with p-value <0.25 are included in multivariable analysis (here, we include all predictors).
We now build a model using stepwise variable selection methods (forward, backward, both) based on AIC.
prelim_mod <- lm(hba1c ~ age + bmi + duration_diabetes, data = diabetes_data)
tbl_regression(prelim_mod) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Preliminary Model: Age, BMI and Duration")
| Preliminary Model: Age, BMI and Duration | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.04, 0.05 | <0.001 |
| bmi | |||
| normal | — | — | |
| underweight | -0.53 | -0.74, -0.31 | <0.001 |
| overweight | 0.36 | 0.21, 0.51 | <0.001 |
| obese | 1.1 | 0.89, 1.2 | <0.001 |
| duration_diabetes | 0.11 | 0.09, 0.13 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
step_bw <- MASS::stepAIC(prelim_mod, direction = "backward")
Start: AIC=11.76
hba1c ~ age + bmi + duration_diabetes
Df Sum of Sq RSS AIC
<none> 999.76 11.763
- duration_diabetes 1 120.95 1120.71 123.966
- age 1 218.59 1218.35 207.498
- bmi 3 224.79 1224.55 208.573
step_both <- MASS::stepAIC(mod_null,
direction = "both",
scope = list(upper = prelim_mod, lower = mod_null))
Start: AIC=448.96
hba1c ~ 1
Df Sum of Sq RSS AIC
+ bmi 3 242.024 1321.5 286.79
+ age 1 223.730 1339.8 296.54
+ duration_diabetes 1 95.988 1467.6 387.60
<none> 1563.5 448.96
Step: AIC=286.79
hba1c ~ bmi
Df Sum of Sq RSS AIC
+ age 1 200.81 1120.7 123.97
+ duration_diabetes 1 103.18 1218.3 207.50
<none> 1321.5 286.79
- bmi 3 242.02 1563.5 448.96
Step: AIC=123.97
hba1c ~ bmi + age
Df Sum of Sq RSS AIC
+ duration_diabetes 1 120.95 999.76 11.763
<none> 1120.71 123.966
- age 1 200.81 1321.53 286.787
- bmi 3 219.11 1339.82 296.536
Step: AIC=11.76
hba1c ~ bmi + age + duration_diabetes
Df Sum of Sq RSS AIC
<none> 999.76 11.763
- duration_diabetes 1 120.95 1120.71 123.966
- age 1 218.59 1218.35 207.498
- bmi 3 224.79 1224.55 208.573
step_fw <- MASS::stepAIC(mod_null,
direction = "forward",
scope = list(upper = prelim_mod, lower = mod_null))
Start: AIC=448.96
hba1c ~ 1
Df Sum of Sq RSS AIC
+ bmi 3 242.024 1321.5 286.79
+ age 1 223.730 1339.8 296.54
+ duration_diabetes 1 95.988 1467.6 387.60
<none> 1563.5 448.96
Step: AIC=286.79
hba1c ~ bmi
Df Sum of Sq RSS AIC
+ age 1 200.81 1120.7 123.97
+ duration_diabetes 1 103.18 1218.3 207.50
<none> 1321.5 286.79
Step: AIC=123.97
hba1c ~ bmi + age
Df Sum of Sq RSS AIC
+ duration_diabetes 1 120.95 999.76 11.763
<none> 1120.71 123.966
Step: AIC=11.76
hba1c ~ bmi + age + duration_diabetes
anova_combined <- bind_rows(
as.data.frame(step_bw$anova) %>% mutate(Method = "Backward"),
as.data.frame(step_both$anova) %>% mutate(Method = "Both"),
as.data.frame(step_fw$anova) %>% mutate(Method = "Forward")
) %>%
relocate(Method, .before = 1)
anova_combined
Method Step Df Deviance Resid. Df Resid. Dev AIC
1 Backward NA NA 994 999.7629 11.76289
2 Both NA NA 999 1563.5508 448.95941
3 Both + bmi 3 242.0244 996 1321.5265 286.78749
4 Both + age 1 200.8128 995 1120.7137 123.96571
5 Both + duration_diabetes 1 120.9508 994 999.7629 11.76289
6 Forward NA NA 999 1563.5508 448.95941
7 Forward + bmi 3 242.0244 996 1321.5265 286.78749
8 Forward + age 1 200.8128 995 1120.7137 123.96571
9 Forward + duration_diabetes 1 120.9508 994 999.7629 11.76289
Intepretation: All three selection methods (forward, backward, and both) consistently identified age, BMI, and duration of diabetes as significant predictors of HbA1c levels.
mod_full <- prelim_mod
tbl_regression(mod_full) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
add_glance_source_note(include = c(r.squared, AIC)) %>%
as_gt() %>%
tab_header(title = "Multivariable: Full Model")
| Multivariable: Full Model | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.04, 0.05 | <0.001 |
| bmi | |||
| normal | — | — | |
| underweight | -0.53 | -0.74, -0.31 | <0.001 |
| overweight | 0.36 | 0.21, 0.51 | <0.001 |
| obese | 1.1 | 0.89, 1.2 | <0.001 |
| duration_diabetes | 0.11 | 0.09, 0.13 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.361; AIC = 2,852 | |||
Interpretation: In the full model, age, BMI, and duration of diabetes are all significant predictors of HbA1c levels (p < 0.001). The model explains approximately 36.1% of the variance in HbA1c levels (R² = 0.361).
mod_age.bmi <- lm(hba1c ~ age + bmi, data = diabetes_data)
tbl_regression(mod_age.bmi) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
add_glance_source_note(include = c(r.squared, AIC)) %>%
as_gt() %>%
tab_header(title = "Multivariable: Age and BMI")
| Multivariable: Age and BMI | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.04, 0.05 | <0.001 |
| bmi | |||
| normal | — | — | |
| underweight | -0.49 | -0.72, -0.27 | <0.001 |
| overweight | 0.43 | 0.27, 0.59 | <0.001 |
| obese | 1.0 | 0.87, 1.2 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.283; AIC = 2,964 | |||
Interpretation: In the model with age and BMI, both predictors are significant (p < 0.001). The model explains approximately 23.8% of the variance in HbA1c levels (R² = 0.238).
# Create a list of models to compare
model_list <- list(mod_null, mod_age, mod_age.bmi, mod_full)
model_comparison <- model.sel(model_list)
model_comparison
Model selection table
(Int) age bmi drt_dbt df logLik AICc delta weight
4 4.560 0.04733 + 0.1114 7 -1418.820 2851.8 0.00 1
3 5.815 0.04529 + 6 -1475.921 2963.9 112.17 0
2 5.963 0.04771 3 -1565.207 3136.4 284.68 0
1 8.595 2 -1642.418 3288.8 437.10 0
Models ranked by AICc(x)
Interpretation: The best model for predicting HbA1c includes age, BMI, and duration of diabetes (Model Full), with an AICc of 2852.
anova(mod_age, mod_age.bmi, mod_full)
Analysis of Variance Table
Model 1: hba1c ~ age
Model 2: hba1c ~ age + bmi
Model 3: hba1c ~ age + bmi + duration_diabetes
Res.Df RSS Df Sum of Sq F Pr(>F)
1 998 1339.82
2 995 1120.71 3 219.11 72.615 < 2.2e-16 ***
3 994 999.76 1 120.95 120.254 < 2.2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Interpretation: The ANOVA results indicate that adding BMI and duration of diabetes significantly improves model fit compared to age alone (p < 0.001). Thus, the full model is preferred.
# age and Duration of diabetes
diabetes_data %>%
ggplot(aes(age, duration_diabetes)) +
geom_point() +
geom_smooth(method="lm", color="red")
# Age and BMI
diabetes_data %>%
ggplot(aes(age, fill = bmi)) +
geom_boxplot()
# BMI and Duration of diabetes
diabetes_data %>%
ggplot(aes(duration_diabetes, fill = bmi)) +
geom_boxplot()
Interpretation :
# Collect results into one data frame
models <- list(mod_age, mod_age.bmi,mod_full)
coef_table <- lapply(models, function(m) tidy(m, conf.int = TRUE))
names(coef_table) <- c("mod_age", "mod_age.bmi", "mod_full")
coef_compare <- bind_rows(coef_table, .id = "Model")
coef_compare
# A tibble: 13 × 8
Model term estimate std.error statistic p.value conf.low conf.high
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 mod_age (Inter… 5.96 0.207 28.8 3.80e-133 5.56 6.37
2 mod_age age 0.0477 0.00370 12.9 2.27e- 35 0.0405 0.0550
3 mod_age.bmi (Inter… 5.82 0.194 30.0 2.18e-141 5.43 6.20
4 mod_age.bmi age 0.0453 0.00339 13.4 1.58e- 37 0.0386 0.0519
5 mod_age.bmi bmiund… -0.493 0.115 -4.27 2.13e- 5 -0.720 -0.267
6 mod_age.bmi bmiove… 0.427 0.0817 5.23 2.06e- 7 0.267 0.588
7 mod_age.bmi bmiobe… 1.05 0.0920 11.4 2.34e- 28 0.867 1.23
8 mod_full (Inter… 4.56 0.216 21.1 4.70e- 82 4.14 4.98
9 mod_full age 0.0473 0.00321 14.7 1.24e- 44 0.0410 0.0536
10 mod_full bmiund… -0.528 0.109 -4.84 1.53e- 6 -0.742 -0.314
11 mod_full bmiove… 0.360 0.0774 4.66 3.67e- 6 0.209 0.512
12 mod_full bmiobe… 1.06 0.0869 12.2 3.77e- 32 0.893 1.23
13 mod_full durati… 0.111 0.0102 11.0 1.71e- 26 0.0915 0.131
*Interpretation: The coefficients for age and duration remain stable when BMI is added to the model, indicating minimal confounding. Similarly, the effect of BMI is robust to adjustment for both age and duration of diabetes. No major confounding is detected. All three variables contribute independently and should be retained in the final multivariable model.
inter_age.bmi <- lm(hba1c ~ age + bmi + duration_diabetes + age:bmi, data = diabetes_data)
tbl_regression(inter_age.bmi) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Model with interaction: Age and BMI")
| Model with interaction: Age and BMI | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.04, 0.06 | <0.001 |
| bmi | |||
| normal | — | — | |
| underweight | -0.72 | -1.9, 0.45 | 0.2 |
| overweight | 0.80 | -0.10, 1.7 | 0.081 |
| obese | 0.98 | 0.05, 1.9 | 0.039 |
| duration_diabetes | 0.11 | 0.09, 0.13 | <0.001 |
| age * bmi | |||
| age * underweight | 0.00 | -0.02, 0.02 | 0.7 |
| age * overweight | -0.01 | -0.02, 0.01 | 0.3 |
| age * obese | 0.00 | -0.01, 0.02 | 0.9 |
| Abbreviation: CI = Confidence Interval | |||
inter_age.duration <- lm(hba1c ~ age + bmi + duration_diabetes + age:duration_diabetes, data = diabetes_data)
tbl_regression(inter_age.duration) %>%
add_global_p() %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Model with interaction: Age and Duration of diabetes")
| Model with interaction: Age and Duration of diabetes | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.03, 0.07 | <0.001 |
| bmi | <0.001 | ||
| normal | — | — | |
| underweight | -0.53 | -0.74, -0.31 | |
| overweight | 0.36 | 0.21, 0.51 | |
| obese | 1.1 | 0.89, 1.2 | |
| duration_diabetes | 0.12 | 0.01, 0.22 | 0.033 |
| age * duration_diabetes | 0.00 | 0.00, 0.00 | >0.9 |
| Abbreviation: CI = Confidence Interval | |||
inter_bmi.duration <- lm(hba1c ~ age + bmi + duration_diabetes + bmi:duration_diabetes, data = diabetes_data)
tbl_regression(inter_bmi.duration) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Model with interaction: BMI and Duration of diabetes")
| Model with interaction: BMI and Duration of diabetes | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.04, 0.05 | <0.001 |
| bmi | |||
| normal | — | — | |
| underweight | -0.36 | -1.1, 0.39 | 0.3 |
| overweight | 0.74 | 0.22, 1.3 | 0.005 |
| obese | 1.0 | 0.38, 1.6 | 0.002 |
| duration_diabetes | 0.13 | 0.09, 0.16 | <0.001 |
| bmi * duration_diabetes | |||
| underweight * duration_diabetes | -0.02 | -0.08, 0.05 | 0.6 |
| overweight * duration_diabetes | -0.04 | -0.08, 0.01 | 0.13 |
| obese * duration_diabetes | 0.01 | -0.05, 0.07 | 0.8 |
| Abbreviation: CI = Confidence Interval | |||
inter_age.bmi.duration <- lm(hba1c ~ age + bmi + duration_diabetes + age:bmi:duration_diabetes, data = diabetes_data)
tbl_regression(inter_age.bmi.duration) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
as_gt() %>%
tab_header(title = "Model with interaction: Age, BMI and Duration of diabetes")
| Model with interaction: Age, BMI and Duration of diabetes | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.03, 0.07 | <0.001 |
| bmi | |||
| normal | — | — | |
| underweight | -0.50 | -1.2, 0.18 | 0.15 |
| overweight | 0.81 | 0.34, 1.3 | <0.001 |
| obese | 1.0 | 0.48, 1.5 | <0.001 |
| duration_diabetes | 0.11 | 0.01, 0.22 | 0.038 |
| age * bmi * duration_diabetes | |||
| age * normal * duration_diabetes | 0.00 | 0.00, 0.00 | 0.8 |
| age * underweight * duration_diabetes | 0.00 | 0.00, 0.00 | 0.9 |
| age * overweight * duration_diabetes | 0.00 | 0.00, 0.00 | 0.6 |
| age * obese * duration_diabetes | 0.00 | 0.00, 0.00 | 0.7 |
| Abbreviation: CI = Confidence Interval | |||
Intepretation: There is no significant interaction effect between the predictors.
par(mfrow=c(2,2))
plot(mod_full)
Interpretation:
qqnorm(residuals(mod_full))
qqline(residuals(mod_full))
bptest(mod_full)
studentized Breusch-Pagan test
data: mod_full
BP = 6.0617, df = 5, p-value = 0.3003
Interpretation: Breusch–Pagan test indicated no violation of the homoscedasticity assumption (BP = 6.26, df = 5, p = 0.28). Thus, the residuals appear to have constant variance, supporting the validity of the regression model’s standard errors and inference.
shapiro.test(mod_full$residuals)
Shapiro-Wilk normality test
data: mod_full$residuals
W = 0.99899, p-value = 0.8684
Interpretation: Normally distributed regression model.
res <- residuals(mod_full)
# Age vs residuals
ggplot(diabetes_data, aes(x = age, y = res)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE, color = "blue") +
geom_hline(yintercept = 0, color = "red") +
labs(title = "Residuals vs Age", y = "Residuals")
# BMI vs residuals (categorical)
ggplot(diabetes_data, aes(x = bmi, y = res)) +
geom_boxplot(fill = "lightblue") +
geom_hline(yintercept = 0, color = "red") +
labs(title = "Residuals vs BMI", y = "Residuals")
# Duration vs residuals
ggplot(diabetes_data, aes(x = duration_diabetes, y = res)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE, color = "blue") +
geom_hline(yintercept = 0, color = "red") +
labs(title = "Residuals vs Duration of Diabetes", y = "Residuals")
Interpretation:
head(res)
1 2 3 4 5 6
-0.6672671 0.5317223 -0.6484015 -0.4049344 0.8550088 -0.4434072
hist(res)
Intepretation: The residual is normally distributed.
check_collinearity(mod_full)
# Check for Multicollinearity
Low Correlation
Term VIF VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
age 1.01 [1.00, 43.18] 1.00 0.99 [0.02, 1.00]
bmi 1.01 [1.00, 2.83] 1.00 0.99 [0.35, 1.00]
duration_diabetes 1.01 [1.00, 2.84] 1.01 0.99 [0.35, 1.00]
Interpretation: No multicollinearity.
MASS::boxcox(mod_full)
Intepretation:
# check_model() runs: VIF, Normality of Residuals, Homoscedasticity, Influential points
check_model(mod_full)
diabetes_pred.res <- augment(mod_full)
diabetes_pred.res %>%
datatable()
new_diabetes.data <- expand_grid( # expand_grid is the tidy version of expand.grid
age = seq(30, 80, by=5),
bmi = factor(c("underweight", "normal", "overweight", "obese"), levels=c("normal", "underweight", "overweight", "obese")),
duration_diabetes = mean(diabetes_data$duration_diabetes)
)
## Predict with confidence intervals
augment(mod_full, newdata = new_diabetes.data, interval = "confidence") %>%
head() %>%
knitr::kable()
| age | bmi | duration_diabetes | .fitted | .lower | .upper |
|---|---|---|---|---|---|
| 30 | underweight | 10.436 | 6.614753 | 6.369285 | 6.860222 |
| 30 | normal | 10.436 | 7.142696 | 6.957418 | 7.327975 |
| 30 | overweight | 10.436 | 7.503163 | 7.308281 | 7.698045 |
| 30 | obese | 10.436 | 8.205749 | 7.989278 | 8.422221 |
| 35 | underweight | 10.436 | 6.851408 | 6.624523 | 7.078292 |
| 35 | normal | 10.436 | 7.379351 | 7.219906 | 7.538795 |
pred.hbA1c <- augment(mod_full, newdata = new_diabetes.data)
pred.hbA1c %>% datatable()
# 1. Define the profiles
risk_profiles <- tibble(
Profile = c("Low Risk Scenario", "Medium Risk Scenario", "High Risk Scenario"),
age = c(40, 55, 70), # Young vs Middle vs Old
bmi = factor(c("normal", "overweight", "obese"),
levels = c("normal", "underweight", "overweight", "obese")),
duration_diabetes = c(2, 10, 20) # Short vs Medium vs Long duration
)
# 2. Predict HbA1c for these profiles
risk_profiles |>
mutate(
Predicted_HbA1c = predict(mod_full, newdata = risk_profiles),
# Optional: Add 95% Confidence Interval to see the range
CI_Lower = predict(mod_full, newdata = risk_profiles, interval = "confidence")[,2],
CI_Upper = predict(mod_full, newdata = risk_profiles, interval = "confidence")[,3]
) |>
knitr::kable(digits = 2, caption = "Predicted HbA1c for Hypothetical Patient Profiles")
| Profile | age | bmi | duration_diabetes | Predicted_HbA1c | CI_Lower | CI_Upper |
|---|---|---|---|---|---|---|
| Low Risk Scenario | 40 | normal | 2 | 6.68 | 6.46 | 6.89 |
| Medium Risk Scenario | 55 | overweight | 10 | 8.64 | 8.52 | 8.75 |
| High Risk Scenario | 70 | obese | 20 | 11.16 | 10.90 | 11.42 |
Interpretation: The predicted HbA1c levels increase from the low-risk to high-risk scenarios, reflecting the cumulative impact of age, BMI, and duration of diabetes on glycemic control.
# 1. Create a dataset varying ONLY Age
age_effect <- tibble(
age = seq(30, 80, by = 5), # Vary Age from 30 to 80
bmi = factor("normal", levels = c("normal", "underweight", "overweight", "obese")), # Fix BMI at "Normal"
duration_diabetes = mean(diabetes_data$duration_diabetes) # Fix Duration at the average
) |>
mutate(
Predicted_HbA1c = predict(mod_full, newdata = pick(everything()))
)
# 2. Plot the relationship
ggplot(age_effect, aes(x = age, y = Predicted_HbA1c)) +
geom_line(color = "steelblue", linewidth = 1.2) + # Note: 'linewidth' is preferred over 'size' in new ggplot2
geom_point(color = "steelblue", size = 3) +
labs(
title = "Effect of Age on HbA1c Levels",
subtitle = "Prediction for a patient with Normal BMI and average diabetes duration",
x = "Age (years)",
y = "Predicted HbA1c (%)"
) +
theme_minimal()
Interpretation: The plot illustrates a positive linear relationship between age and predicted HbA1c levels, indicating that as patients age, their glycemic control tends to worsen, even when BMI and duration of diabetes are held constant.
Identifying Influential Data Points in Regression
We used three diagnostic criteria to identify potentially influential observations:
Where:
- \(n\) = number of observations
- \(k\) = number of predictors in the
model
These thresholds help flag data points that may disproportionately influence model estimates or distort inference.
# Add diagnostics columns
diabetes_diag <- augment(mod_full, diabetes_data)
# Define thresholds
n_obs <- nrow(diabetes_data)
p_preds <- length(coef(mod_full))
cooks_cutoff <- 4 / n_obs
leverage_cutoff <- (2 * p_preds + 2) / n_obs
# Filter Influential Observations
influential_obs <- diabetes_diag %>%
filter(.cooksd > cooks_cutoff | abs(.std.resid) > 2 | .hat > leverage_cutoff)
# Print count
nrow(influential_obs)
[1] 85
cutoff <- 4/(nrow(diabetes_data)-length(mod_full$coefficients)-2)
plot(mod_full, which=4, cook.levels=cutoff)
Interpretation: The Cook’s distance plot shows that a few observations exceed the threshold of 0.004, indicating they may be influential points affecting the regression results.
diabetes_pred.res %>%
filter(.std.resid > 2 | .std.resid < -2 | .hat > 0.038) %>%
datatable()
# Remove influentials and Refit
clean_data <- diabetes_diag %>%
filter(.cooksd <= cooks_cutoff & abs(.std.resid) <= 2 & .hat <= leverage_cutoff)
# Re-run the final model
final_model_clean <- lm(hba1c ~ age + bmi + duration_diabetes, data = clean_data)
tbl_regression(final_model_clean) %>%
bold_p(t = 0.05) %>%
bold_labels() %>%
add_glance_source_note(include = c(r.squared, AIC)) %>%
as_gt() %>%
tab_header(title = "Final Model after Removing Influential Observations")
| Final Model after Removing Influential Observations | |||
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| age | 0.05 | 0.04, 0.05 | <0.001 |
| bmi | |||
| normal | — | — | |
| underweight | -0.47 | -0.67, -0.27 | <0.001 |
| overweight | 0.36 | 0.23, 0.50 | <0.001 |
| obese | 1.1 | 0.92, 1.2 | <0.001 |
| duration_diabetes | 0.11 | 0.09, 0.13 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.415; AIC = 2,320 | |||
# Run diagnostic test
par(mfrow=c(2,2))
plot(final_model_clean)
Interpretation:
tbl <- tbl_regression(
final_model_clean,
intercept = TRUE,
label = list(
age ~ "Age (years)",
bmi ~ "BMI (kg/m²)",
duration_diabetes ~ "Duration (years)"
)
) %>%
# Add model fit metrics
add_glance_source_note(include = c(r.squared, adj.r.squared, AIC)) %>%
bold_labels() %>%
bold_p(t = 0.05) %>%
modify_header(estimate ~ "**Adj. beta**")
# Relabel BMI levels safely (plain text first)
tbl <- tbl %>%
modify_table_body(
~ .x %>%
mutate(
label = case_when(
variable == "bmi" & label == "normal" ~ "Normal",
variable == "bmi" & label == "underweight" ~ "Underweight",
variable == "bmi" & label == "overweight" ~ "Overweight",
variable == "bmi" & label == "obese" ~ "Obese",
TRUE ~ label
)
)
)
# Convert to gt and apply styling
tbl %>%
as_gt() %>%
tab_header(
title = "Final Multiple Linear Regression Model",
subtitle = "Predictors of HbA1c (Outliers Removed)"
) %>%
# Italicize BMI levels by matching text
tab_style(
style = cell_text(style = "italic"),
locations = cells_body(
rows = grepl("BMI", variable)
)
) %>%
# Add reference category footnote safely
tab_footnote(
footnote = "Reference category: Normal BMI",
locations = cells_body(
columns = "estimate",
rows = variable == "bmi" & label %in% c("Underweight", "Overweight", "Obese")
)
)
| Final Multiple Linear Regression Model | |||
| Predictors of HbA1c (Outliers Removed) | |||
| Characteristic | Adj. beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 4.5 | 4.1, 4.9 | <0.001 |
| Age (years) | 0.05 | 0.04, 0.05 | <0.001 |
| BMI (kg/m²) | |||
| Normal | — | — | |
| Underweight | -0.471 | -0.67, -0.27 | <0.001 |
| Overweight | 0.361 | 0.23, 0.50 | <0.001 |
| Obese | 1.11 | 0.92, 1.2 | <0.001 |
| Duration (years) | 0.11 | 0.09, 0.13 | <0.001 |
| 1 Reference category: Normal BMI | |||
| Abbreviation: CI = Confidence Interval | |||
| R² = 0.415; Adjusted R² = 0.412; AIC = 2,320 | |||
# Generate Forest Plot for the cleaned final model
plot_model(final_model_clean,
type = "est",
show.values = TRUE,
value.offset = .3,
title = "Forest Plot of HbA1c Predictors",
# Map the variable names (left) to clean labels (right)
axis.labels = c(
"duration_diabetes" = "Duration of Diabetes (Years)",
"bmiobese" = "BMI: Obese",
"bmioverweight" = "BMI: Overweight",
"bmiunderweight" = "BMI: Underweight",
"age" = "Age (Years)"
),
vline.color = "red") +
theme_minimal()
Interpretation: The forest plot visually summarizes the estimated effects of each predictor on HbA1c levels, along with their 95% confidence intervals. Age, BMI categories, and duration of diabetes all show significant associations with HbA1c, as indicated by confidence intervals that do not cross zero.
The final analysis used a multivariable main-effects linear regression model to estimate the independent contributions of clinical predictors to glycemic control.
Final Formula:
\[ \widehat{HbA1c} = 4.50 + 0.05(\text{Age}) + 0.36(\text{Overweight}) + 1.10(\text{Obese}) - 0.47(\text{Underweight}) + 0.11(\text{Duration of Diabetes}) \]
Explanatory Power: The model accounts for approximately 41.5% of the variance in HbA1c levels (Adjusted R² = 0.415), indicating moderate predictive strength.
Bias and Variance: Interaction terms were evaluated but found to be statistically non-significant (p > 0.05) and increased model complexity without improving fit.
Variable selection: Using AIC-based stepwise selection, the main-effects model was identified as the optimal balance between parsimony and explanatory accuracy.
Model selection: The final model was chosen based on lowest AICc (2852) and significant improvement over nested models (ANOVA p < 0.001).
These values represent the adjusted effect of each predictor on HbA1c levels, holding all other variables constant (ceteris paribus):
Age: Each additional year of age is associated with a 0.05% increase in HbA1c (95% CI: 0.04, 0.05) after adjusting for BMI and duration.
Duration of Diabetes: Each additional year living with diabetes is associated with a 0.11% increase in HbA1c (95% CI: 0.09, 0.13) after adjusted for age and BMI.
BMI (Obesity):
Obese individuals have an HbA1c level 1.10% higher (95% CI: 0.92, 1.20) compared to normal weight, after adjusting for age and duration.
Overweight individuals have an HbA1c level 0.36% higher (95% CI: 0.23, 0.50) compared to normal weight, after adjusting for age and duration.
Underweight individuals have a 0.47% lower HbA1c level (95% CI: −0.67, −0.27) compared to normal weight, after adjusting for age and duration.
Because the final model is additive (main effects only), the impact of risk factors adds up directly. This allows us to interpret the combined burden of disease across different patient profiles:
The Metabolic Cost of Obesity: An obese patient begins with a glycemic disadvantage equivalent to approximately 22 years of aging compared to a normal‑weight peer of the same age and diabetes duration. This highlights the substantial independent effect of obesity on HbA1c.
Disease Progression: Each year of diabetes duration contributes an additional 0.11% increase in HbA1c. For example, a patient with a 10‑year history of diabetes is predicted to have HbA1c levels that are 1.1% higher
The final model was rigorously validated to ensure robustness and reliability of inference.
Older age, higher BMI, and longer duration of diabetes are each independently associated with higher HbA1c levels. The model demonstrates a clear dose–response relationship with BMI and a strong linear effect of diabetes duration. With a moderate fit, the findings suggest that additional factors beyond those measured also influence glycemic control.
These results highlighted the importance of weight management and vigilant monitoring of glycemic control as patients age and live longer with diabetes.
These findings carry important implications for patient care and diabetes management:
Weight Management: Since obesity exerts the strongest independent effect on HbA1c, structured interventions targeting weight reduction (e.g., lifestyle modification, nutrition counseling, pharmacotherapy) are critical for improving glycemic control.
Long-Term Monitoring: The progressive rise in HbA1c with longer diabetes duration underscores the need for vigilant, ongoing monitoring. Regular HbA1c testing and timely treatment adjustments are essential to mitigate disease progression.
Age-Related Care: The modest but consistent effect of aging highlights the importance of tailoring glycemic targets and treatment strategies for older adults, balancing efficacy with safety.
Holistic Approach: With the model explaining ~41% of HbA1c variance, other factors (such as medication adherence, physical activity, diet quality, and genetic predisposition) must also be addressed in comprehensive care plans.