1 Introduction

1.1 Research Question / Objective

To identify significant predictors of glycemic control, measured by HbA1c levels, among patients with Type 2 diabetes.

1.2 Dataset and Variables

library(tidyverse)

set.seed(123)
n <- 1000

# Generate variables
age <- rnorm(n, mean = 55, sd = 10)

bmi <- sample(c("underweight", "normal", "overweight", "obese"),
              n, replace = TRUE,
              prob = c(0.1, 0.4, 0.3, 0.2))

# Duration of diabetes (always positive)
duration_diabetes <- rlnorm(n, meanlog = log(10), sdlog = 0.3)

# HbA1c outcome
hba1c <- 4 +
  0.05 * age +
  ifelse(bmi == "underweight", 0,
         ifelse(bmi == "normal", 0.5,
                ifelse(bmi == "overweight", 1, 1.5))) +
  0.1 * duration_diabetes +
  rnorm(n, mean = 0, sd = 1)

# Assemble into a tibble
diabetes_data <- tibble( age = round(age, 1), 
                         bmi = bmi,
                         duration_diabetes = round(duration_diabetes, 1), 
                         hba1c = round(hba1c, 1) ) 
head(diabetes_data)
# A tibble: 6 × 4
    age bmi        duration_diabetes hba1c
  <dbl> <chr>                  <dbl> <dbl>
1  49.4 normal                   7.8   7.1
2  52.7 normal                   9.1   8.6
3  70.6 normal                   7.6   8.1
4  55.7 overweight              12.1   8.5
5  56.3 overweight              14    10  
6  72.2 overweight              18.9  10  

A simulated dataset of 1,000 patients was generated to reflect realistic clinical characteristics.

The dataset includes:

  • Age (years)
  • BMI (categorical)
  • Duration of Diabetes (years)
  • Physical Activity (hours/week)
  • HbA1c (%)

2 Data preparation

2.1 Load libraries

library(broom)       # For tidying model outputs
library(gtsummary)   # NEW: Automates regression tables
library(performance) # NEW: Best-in-class model diagnostics
library(MASS)        # For stepAIC
library(MuMIn)        # For model selection using AICc
library(dplyr)       # For data manipulation
library(ggplot2)   # For data visualization
library(gt)          # For enhanced table formatting
library(lmtest)      # For Breusch-Pagan test
library(car)         # For VIF calculation
library(DT)          # For interactive tables
library(patchwork)    # For combining ggplots
library(sjPlot)      # For plotting model effects

2.2 Data cleaning and coding

glimpse(diabetes_data)
Rows: 1,000
Columns: 4
$ age               <dbl> 49.4, 52.7, 70.6, 55.7, 56.3, 72.2, 59.6, 42.3, 48.1…
$ bmi               <chr> "normal", "normal", "normal", "overweight", "overwei…
$ duration_diabetes <dbl> 7.8, 9.1, 7.6, 12.1, 14.0, 18.9, 11.2, 7.7, 13.6, 13…
$ hba1c             <dbl> 7.1, 8.6, 8.1, 8.5, 10.0, 10.0, 9.1, 8.4, 7.1, 9.9, …

2.3 Convert BMI to factor with ordered levels

diabetes_data <- diabetes_data %>%
  mutate(across(bmi, ~ factor(.x,
                              levels = c("normal", "underweight", "overweight", "obese"))))
glimpse(diabetes_data)
Rows: 1,000
Columns: 4
$ age               <dbl> 49.4, 52.7, 70.6, 55.7, 56.3, 72.2, 59.6, 42.3, 48.1…
$ bmi               <fct> normal, normal, normal, overweight, overweight, over…
$ duration_diabetes <dbl> 7.8, 9.1, 7.6, 12.1, 14.0, 18.9, 11.2, 7.7, 13.6, 13…
$ hba1c             <dbl> 7.1, 8.6, 8.1, 8.5, 10.0, 10.0, 9.1, 8.4, 7.1, 9.9, …

3 Descriptive analysis

diabetes_data %>%
  tbl_summary(
    label = list(age ~ "Age (Years)", bmi ~ "BMI Category", duration_diabetes ~ "Duration (Years)", hba1c ~ "HbA1c (%)"),
    statistic = list(all_continuous() ~ "{mean} ({sd})", all_categorical() ~ "{n} ({p}%)")
  ) %>%
  modify_header(label = "**Variable**") %>%
  bold_labels()
Variable N = 1,0001
Age (Years) 55 (10)
BMI Category
    normal 402 (40%)
    underweight 107 (11%)
    overweight 291 (29%)
    obese 200 (20%)
Duration (Years) 10.4 (3.1)
HbA1c (%) 8.59 (1.25)
1 Mean (SD); n (%)

4 Univariable analysis

4.1 Null model

mod_null <- lm(hba1c ~ 1, data = diabetes_data)

tbl_regression(mod_null, intercept = TRUE) %>%
  bold_p(t = 0.05) %>%
  as_gt() %>%
  tab_header(title = "Null Model", subtitle = "Intercept-only")
Null Model
Intercept-only
Characteristic Beta 95% CI p-value
(Intercept) 8.6 8.5, 8.7 <0.001
Abbreviation: CI = Confidence Interval

4.2 Age

mod_age <- lm(hba1c ~ age, data = diabetes_data)

tbl_regression(mod_age) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Univariate: Age")
Univariate: Age
Characteristic Beta 95% CI p-value
age 0.05 0.04, 0.05 <0.001
Abbreviation: CI = Confidence Interval

4.3 BMI

mod_bmi <- lm(hba1c ~ bmi, data = diabetes_data)

tbl_regression(mod_bmi) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Univariate: BMI")
Univariate: BMI
Characteristic Beta 95% CI p-value
bmi


    normal
    underweight -0.51 -0.76, -0.26 <0.001
    overweight 0.42 0.25, 0.59 <0.001
    obese 1.1 0.92, 1.3 <0.001
Abbreviation: CI = Confidence Interval

4.4 Duration of diabetes

mod_duration <- lm(hba1c ~ duration_diabetes, data = diabetes_data)

tbl_regression(mod_duration) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Univariate: Duration")
Univariate: Duration
Characteristic Beta 95% CI p-value
duration_diabetes 0.10 0.07, 0.12 <0.001
Abbreviation: CI = Confidence Interval

5 Variable selection

All Predictors in univariable analysis with p-value <0.25 are included in multivariable analysis (here, we include all predictors).

We now build a model using stepwise variable selection methods (forward, backward, both) based on AIC.

5.1 Preliminary model with all significant predictors in univariable analysis

prelim_mod <- lm(hba1c ~ age + bmi + duration_diabetes, data = diabetes_data)
tbl_regression(prelim_mod) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Preliminary Model: Age, BMI and Duration")
Preliminary Model: Age, BMI and Duration
Characteristic Beta 95% CI p-value
age 0.05 0.04, 0.05 <0.001
bmi


    normal
    underweight -0.53 -0.74, -0.31 <0.001
    overweight 0.36 0.21, 0.51 <0.001
    obese 1.1 0.89, 1.2 <0.001
duration_diabetes 0.11 0.09, 0.13 <0.001
Abbreviation: CI = Confidence Interval

5.2 Backward selection

step_bw <- MASS::stepAIC(prelim_mod, direction = "backward")
Start:  AIC=11.76
hba1c ~ age + bmi + duration_diabetes

                    Df Sum of Sq     RSS     AIC
<none>                            999.76  11.763
- duration_diabetes  1    120.95 1120.71 123.966
- age                1    218.59 1218.35 207.498
- bmi                3    224.79 1224.55 208.573

5.3 Both selection

step_both <- MASS::stepAIC(mod_null, 
                           direction = "both", 
                           scope = list(upper = prelim_mod, lower = mod_null))
Start:  AIC=448.96
hba1c ~ 1

                    Df Sum of Sq    RSS    AIC
+ bmi                3   242.024 1321.5 286.79
+ age                1   223.730 1339.8 296.54
+ duration_diabetes  1    95.988 1467.6 387.60
<none>                           1563.5 448.96

Step:  AIC=286.79
hba1c ~ bmi

                    Df Sum of Sq    RSS    AIC
+ age                1    200.81 1120.7 123.97
+ duration_diabetes  1    103.18 1218.3 207.50
<none>                           1321.5 286.79
- bmi                3    242.02 1563.5 448.96

Step:  AIC=123.97
hba1c ~ bmi + age

                    Df Sum of Sq     RSS     AIC
+ duration_diabetes  1    120.95  999.76  11.763
<none>                           1120.71 123.966
- age                1    200.81 1321.53 286.787
- bmi                3    219.11 1339.82 296.536

Step:  AIC=11.76
hba1c ~ bmi + age + duration_diabetes

                    Df Sum of Sq     RSS     AIC
<none>                            999.76  11.763
- duration_diabetes  1    120.95 1120.71 123.966
- age                1    218.59 1218.35 207.498
- bmi                3    224.79 1224.55 208.573

5.4 Forward selection

step_fw <- MASS::stepAIC(mod_null, 
                         direction = "forward", 
                         scope = list(upper = prelim_mod, lower = mod_null))
Start:  AIC=448.96
hba1c ~ 1

                    Df Sum of Sq    RSS    AIC
+ bmi                3   242.024 1321.5 286.79
+ age                1   223.730 1339.8 296.54
+ duration_diabetes  1    95.988 1467.6 387.60
<none>                           1563.5 448.96

Step:  AIC=286.79
hba1c ~ bmi

                    Df Sum of Sq    RSS    AIC
+ age                1    200.81 1120.7 123.97
+ duration_diabetes  1    103.18 1218.3 207.50
<none>                           1321.5 286.79

Step:  AIC=123.97
hba1c ~ bmi + age

                    Df Sum of Sq     RSS     AIC
+ duration_diabetes  1    120.95  999.76  11.763
<none>                           1120.71 123.966

Step:  AIC=11.76
hba1c ~ bmi + age + duration_diabetes

5.5 Compare results (Anova Table)

anova_combined <- bind_rows(
  as.data.frame(step_bw$anova)   %>% mutate(Method = "Backward"),
  as.data.frame(step_both$anova) %>% mutate(Method = "Both"),
  as.data.frame(step_fw$anova)   %>% mutate(Method = "Forward")
) %>%
  relocate(Method, .before = 1)

anova_combined
    Method                Step Df Deviance Resid. Df Resid. Dev       AIC
1 Backward                     NA       NA       994   999.7629  11.76289
2     Both                     NA       NA       999  1563.5508 448.95941
3     Both               + bmi  3 242.0244       996  1321.5265 286.78749
4     Both               + age  1 200.8128       995  1120.7137 123.96571
5     Both + duration_diabetes  1 120.9508       994   999.7629  11.76289
6  Forward                     NA       NA       999  1563.5508 448.95941
7  Forward               + bmi  3 242.0244       996  1321.5265 286.78749
8  Forward               + age  1 200.8128       995  1120.7137 123.96571
9  Forward + duration_diabetes  1 120.9508       994   999.7629  11.76289

Intepretation: All three selection methods (forward, backward, and both) consistently identified age, BMI, and duration of diabetes as significant predictors of HbA1c levels.

6 Multivariable analysis

6.1 Model full (Age, BMI, Duration of Diabetes)

mod_full <- prelim_mod

tbl_regression(mod_full) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  add_glance_source_note(include = c(r.squared, AIC)) %>%
  as_gt() %>%
  tab_header(title = "Multivariable: Full Model")
Multivariable: Full Model
Characteristic Beta 95% CI p-value
age 0.05 0.04, 0.05 <0.001
bmi


    normal
    underweight -0.53 -0.74, -0.31 <0.001
    overweight 0.36 0.21, 0.51 <0.001
    obese 1.1 0.89, 1.2 <0.001
duration_diabetes 0.11 0.09, 0.13 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.361; AIC = 2,852

Interpretation: In the full model, age, BMI, and duration of diabetes are all significant predictors of HbA1c levels (p < 0.001). The model explains approximately 36.1% of the variance in HbA1c levels (R² = 0.361).

6.2 Age and BMI

mod_age.bmi <- lm(hba1c ~ age + bmi, data = diabetes_data)

tbl_regression(mod_age.bmi) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  add_glance_source_note(include = c(r.squared, AIC)) %>%
  as_gt() %>%
  tab_header(title = "Multivariable: Age and BMI")
Multivariable: Age and BMI
Characteristic Beta 95% CI p-value
age 0.05 0.04, 0.05 <0.001
bmi


    normal
    underweight -0.49 -0.72, -0.27 <0.001
    overweight 0.43 0.27, 0.59 <0.001
    obese 1.0 0.87, 1.2 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.283; AIC = 2,964

Interpretation: In the model with age and BMI, both predictors are significant (p < 0.001). The model explains approximately 23.8% of the variance in HbA1c levels (R² = 0.238).

7 Model comparison and selection

7.1 Using AICc for multi model comparison

# Create a list of models to compare
model_list <- list(mod_null, mod_age, mod_age.bmi, mod_full)
model_comparison <- model.sel(model_list)
model_comparison
Model selection table 
  (Int)     age bmi drt_dbt df    logLik   AICc  delta weight
4 4.560 0.04733   +  0.1114  7 -1418.820 2851.8   0.00      1
3 5.815 0.04529   +          6 -1475.921 2963.9 112.17      0
2 5.963 0.04771              3 -1565.207 3136.4 284.68      0
1 8.595                      2 -1642.418 3288.8 437.10      0
Models ranked by AICc(x) 

Interpretation: The best model for predicting HbA1c includes age, BMI, and duration of diabetes (Model Full), with an AICc of 2852.

7.2 ANOVA to compare nested models

anova(mod_age, mod_age.bmi, mod_full)
Analysis of Variance Table

Model 1: hba1c ~ age
Model 2: hba1c ~ age + bmi
Model 3: hba1c ~ age + bmi + duration_diabetes
  Res.Df     RSS Df Sum of Sq       F    Pr(>F)    
1    998 1339.82                                   
2    995 1120.71  3    219.11  72.615 < 2.2e-16 ***
3    994  999.76  1    120.95 120.254 < 2.2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Interpretation: The ANOVA results indicate that adding BMI and duration of diabetes significantly improves model fit compared to age alone (p < 0.001). Thus, the full model is preferred.

8 Check for confounders

8.1 Visualize relationship between age, BMI and duration of diabetes

# age and Duration of diabetes
diabetes_data %>%
  ggplot(aes(age, duration_diabetes)) +
  geom_point() +
  geom_smooth(method="lm", color="red")

# Age and BMI
diabetes_data %>%
  ggplot(aes(age, fill = bmi)) +
  geom_boxplot()

# BMI and Duration of diabetes
diabetes_data %>%
  ggplot(aes(duration_diabetes, fill = bmi)) +
  geom_boxplot()

Interpretation :

  • age and duration of diabetes: The scatterplot shows a weak negative relationship: older individuals tend to have shorter durations of diagnosed diabetes. The modest slope suggests age and duration are not strongly correlated, supporting their independent inclusion in regression models.
  • age and BMI: NBoxplots of age across BMI categories reveal that underweight individuals are generally younger, while obese individuals span a wider age range with slightly higher median age. This pattern suggests BMI varies with age, but the overlap across categories indicates that age and BMI are not tightly coupled.
  • BMI and duration of diabetes: Boxplots of diabetes duration across BMI categories show that underweight individuals may have longer durations, while other groups display overlapping distributions. This could imply that prolonged disease influences weight status, or that BMI shifts over time with disease progression..

8.2 Comparing coefficients across models

# Collect results into one data frame
models <- list(mod_age, mod_age.bmi,mod_full)

coef_table <- lapply(models, function(m) tidy(m, conf.int = TRUE))
names(coef_table) <- c("mod_age", "mod_age.bmi", "mod_full")

coef_compare <- bind_rows(coef_table, .id = "Model")
coef_compare
# A tibble: 13 × 8
   Model       term    estimate std.error statistic   p.value conf.low conf.high
   <chr>       <chr>      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
 1 mod_age     (Inter…   5.96     0.207       28.8  3.80e-133   5.56      6.37  
 2 mod_age     age       0.0477   0.00370     12.9  2.27e- 35   0.0405    0.0550
 3 mod_age.bmi (Inter…   5.82     0.194       30.0  2.18e-141   5.43      6.20  
 4 mod_age.bmi age       0.0453   0.00339     13.4  1.58e- 37   0.0386    0.0519
 5 mod_age.bmi bmiund…  -0.493    0.115       -4.27 2.13e-  5  -0.720    -0.267 
 6 mod_age.bmi bmiove…   0.427    0.0817       5.23 2.06e-  7   0.267     0.588 
 7 mod_age.bmi bmiobe…   1.05     0.0920      11.4  2.34e- 28   0.867     1.23  
 8 mod_full    (Inter…   4.56     0.216       21.1  4.70e- 82   4.14      4.98  
 9 mod_full    age       0.0473   0.00321     14.7  1.24e- 44   0.0410    0.0536
10 mod_full    bmiund…  -0.528    0.109       -4.84 1.53e-  6  -0.742    -0.314 
11 mod_full    bmiove…   0.360    0.0774       4.66 3.67e-  6   0.209     0.512 
12 mod_full    bmiobe…   1.06     0.0869      12.2  3.77e- 32   0.893     1.23  
13 mod_full    durati…   0.111    0.0102      11.0  1.71e- 26   0.0915    0.131 

*Interpretation: The coefficients for age and duration remain stable when BMI is added to the model, indicating minimal confounding. Similarly, the effect of BMI is robust to adjustment for both age and duration of diabetes. No major confounding is detected. All three variables contribute independently and should be retained in the final multivariable model.

9 Check for interaction

9.1 Age and BMI

inter_age.bmi <- lm(hba1c ~ age + bmi + duration_diabetes + age:bmi, data = diabetes_data)
tbl_regression(inter_age.bmi) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Model with interaction: Age and BMI")
Model with interaction: Age and BMI
Characteristic Beta 95% CI p-value
age 0.05 0.04, 0.06 <0.001
bmi


    normal
    underweight -0.72 -1.9, 0.45 0.2
    overweight 0.80 -0.10, 1.7 0.081
    obese 0.98 0.05, 1.9 0.039
duration_diabetes 0.11 0.09, 0.13 <0.001
age * bmi


    age * underweight 0.00 -0.02, 0.02 0.7
    age * overweight -0.01 -0.02, 0.01 0.3
    age * obese 0.00 -0.01, 0.02 0.9
Abbreviation: CI = Confidence Interval

9.2 Age and Duration of diabetes

inter_age.duration <- lm(hba1c ~ age + bmi + duration_diabetes + age:duration_diabetes, data = diabetes_data)
tbl_regression(inter_age.duration) %>%
  add_global_p() %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Model with interaction: Age and Duration of diabetes")
Model with interaction: Age and Duration of diabetes
Characteristic Beta 95% CI p-value
age 0.05 0.03, 0.07 <0.001
bmi

<0.001
    normal
    underweight -0.53 -0.74, -0.31
    overweight 0.36 0.21, 0.51
    obese 1.1 0.89, 1.2
duration_diabetes 0.12 0.01, 0.22 0.033
age * duration_diabetes 0.00 0.00, 0.00 >0.9
Abbreviation: CI = Confidence Interval

9.3 BMI and Duration of diabetes

inter_bmi.duration <- lm(hba1c ~ age + bmi + duration_diabetes + bmi:duration_diabetes, data = diabetes_data)
tbl_regression(inter_bmi.duration) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Model with interaction: BMI and Duration of diabetes")
Model with interaction: BMI and Duration of diabetes
Characteristic Beta 95% CI p-value
age 0.05 0.04, 0.05 <0.001
bmi


    normal
    underweight -0.36 -1.1, 0.39 0.3
    overweight 0.74 0.22, 1.3 0.005
    obese 1.0 0.38, 1.6 0.002
duration_diabetes 0.13 0.09, 0.16 <0.001
bmi * duration_diabetes


    underweight * duration_diabetes -0.02 -0.08, 0.05 0.6
    overweight * duration_diabetes -0.04 -0.08, 0.01 0.13
    obese * duration_diabetes 0.01 -0.05, 0.07 0.8
Abbreviation: CI = Confidence Interval

9.4 Age, BMI and Duration of diabetes

inter_age.bmi.duration <- lm(hba1c ~ age + bmi + duration_diabetes + age:bmi:duration_diabetes, data = diabetes_data)
tbl_regression(inter_age.bmi.duration) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  as_gt() %>%
  tab_header(title = "Model with interaction: Age, BMI and Duration of diabetes")
Model with interaction: Age, BMI and Duration of diabetes
Characteristic Beta 95% CI p-value
age 0.05 0.03, 0.07 <0.001
bmi


    normal
    underweight -0.50 -1.2, 0.18 0.15
    overweight 0.81 0.34, 1.3 <0.001
    obese 1.0 0.48, 1.5 <0.001
duration_diabetes 0.11 0.01, 0.22 0.038
age * bmi * duration_diabetes


    age * normal * duration_diabetes 0.00 0.00, 0.00 0.8
    age * underweight * duration_diabetes 0.00 0.00, 0.00 0.9
    age * overweight * duration_diabetes 0.00 0.00, 0.00 0.6
    age * obese * duration_diabetes 0.00 0.00, 0.00 0.7
Abbreviation: CI = Confidence Interval

Intepretation: There is no significant interaction effect between the predictors.

10 Model assessment

10.1 Diagnostic plots for the final model

par(mfrow=c(2,2))
plot(mod_full)

Interpretation:

  • Residuals vs Fitted: The residuals appear randomly scattered around zero, indicating homoscedasticity.
  • Normal Q-Q: The points closely follow the diagonal line, suggesting that the residuals are normally distributed.
  • Scale-Location: The spread of residuals is fairly constant across fitted values, supporting the homoscedasticity assumption.
  • Residuals vs Leverage: No points exceed the Cook’s distance lines, indicating no highly influential observations.

10.1.1 QQ plot for normality of residuals (Optional)

qqnorm(residuals(mod_full))
qqline(residuals(mod_full))

10.2 Heteroscedasticity of residual (Breusch–Pagan test)

bptest(mod_full)

    studentized Breusch-Pagan test

data:  mod_full
BP = 6.0617, df = 5, p-value = 0.3003

Interpretation: Breusch–Pagan test indicated no violation of the homoscedasticity assumption (BP = 6.26, df = 5, p = 0.28). Thus, the residuals appear to have constant variance, supporting the validity of the regression model’s standard errors and inference.

10.3 Shapiro-wilk test for normality of residuals

shapiro.test(mod_full$residuals)

    Shapiro-Wilk normality test

data:  mod_full$residuals
W = 0.99899, p-value = 0.8684

Interpretation: Normally distributed regression model.

10.4 Plot residuals against independent variables

res <- residuals(mod_full)

# Age vs residuals
ggplot(diabetes_data, aes(x = age, y = res)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs Age", y = "Residuals")

# BMI vs residuals (categorical)
ggplot(diabetes_data, aes(x = bmi, y = res)) +
  geom_boxplot(fill = "lightblue") +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs BMI", y = "Residuals")

# Duration vs residuals
ggplot(diabetes_data, aes(x = duration_diabetes, y = res)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "Residuals vs Duration of Diabetes", y = "Residuals")

Interpretation:

  • Residual equal variance across age. Homoscedasticity is met.
  • Residual equal variance across BMI categories. Homoscedasticity is met.
  • Residual equal variance across duration of diabetes. Homoscedasticity is met.

10.5 Fitted value and residual

head(res)
         1          2          3          4          5          6 
-0.6672671  0.5317223 -0.6484015 -0.4049344  0.8550088 -0.4434072 
hist(res)

Intepretation: The residual is normally distributed.

10.6 VIF (multicollinearity)

check_collinearity(mod_full)
# Check for Multicollinearity

Low Correlation

              Term  VIF    VIF 95% CI adj. VIF Tolerance Tolerance 95% CI
               age 1.01 [1.00, 43.18]     1.00      0.99     [0.02, 1.00]
               bmi 1.01 [1.00,  2.83]     1.00      0.99     [0.35, 1.00]
 duration_diabetes 1.01 [1.00,  2.84]     1.01      0.99     [0.35, 1.00]

Interpretation: No multicollinearity.

10.7 Box-Cox plot (Check for transformation needed

  • if lamda is outside the range of -1 to 1, transformation is needed due to validity of normality and homoscedasticity assumptions.
MASS::boxcox(mod_full)

Intepretation:

  • The curve is smooth and the peak lies close to λ = 1, and the confidence interval includes 1.
  • This suggests that no transformation is necessary (normality and homoscedasticity are met).

10.7.1 Diagnostic Dashboard (Alternative method)

# check_model() runs: VIF, Normality of Residuals, Homoscedasticity, Influential points
check_model(mod_full)

10.8 Create diagnostic datashet (predictions and residuals)

diabetes_pred.res <- augment(mod_full)
diabetes_pred.res %>%
  datatable()

11 Prediction

11.1 Generate new data

new_diabetes.data <- expand_grid( # expand_grid is the tidy version of expand.grid
  age = seq(30, 80, by=5),
  bmi = factor(c("underweight", "normal", "overweight", "obese"), levels=c("normal", "underweight", "overweight", "obese")),
  duration_diabetes = mean(diabetes_data$duration_diabetes)
)

## Predict with confidence intervals
augment(mod_full, newdata = new_diabetes.data, interval = "confidence") %>%
  head() %>%
  knitr::kable()
age bmi duration_diabetes .fitted .lower .upper
30 underweight 10.436 6.614753 6.369285 6.860222
30 normal 10.436 7.142696 6.957418 7.327975
30 overweight 10.436 7.503163 7.308281 7.698045
30 obese 10.436 8.205749 7.989278 8.422221
35 underweight 10.436 6.851408 6.624523 7.078292
35 normal 10.436 7.379351 7.219906 7.538795

11.2 Predict HbA1c levels for new data

pred.hbA1c <- augment(mod_full, newdata = new_diabetes.data)
pred.hbA1c %>% datatable()

11.3 Predict HbA1c for hypothetical patient profiles

# 1. Define the profiles
risk_profiles <- tibble(
  Profile = c("Low Risk Scenario", "Medium Risk Scenario", "High Risk Scenario"),
  age = c(40, 55, 70),  # Young vs Middle vs Old
  bmi = factor(c("normal", "overweight", "obese"), 
               levels = c("normal", "underweight", "overweight", "obese")), 
  duration_diabetes = c(2, 10, 20) # Short vs Medium vs Long duration
)

# 2. Predict HbA1c for these profiles
risk_profiles |> 
  mutate(
    Predicted_HbA1c = predict(mod_full, newdata = risk_profiles),
    # Optional: Add 95% Confidence Interval to see the range
    CI_Lower = predict(mod_full, newdata = risk_profiles, interval = "confidence")[,2],
    CI_Upper = predict(mod_full, newdata = risk_profiles, interval = "confidence")[,3]
  ) |>
  knitr::kable(digits = 2, caption = "Predicted HbA1c for Hypothetical Patient Profiles")
Predicted HbA1c for Hypothetical Patient Profiles
Profile age bmi duration_diabetes Predicted_HbA1c CI_Lower CI_Upper
Low Risk Scenario 40 normal 2 6.68 6.46 6.89
Medium Risk Scenario 55 overweight 10 8.64 8.52 8.75
High Risk Scenario 70 obese 20 11.16 10.90 11.42

Interpretation: The predicted HbA1c levels increase from the low-risk to high-risk scenarios, reflecting the cumulative impact of age, BMI, and duration of diabetes on glycemic control.

11.4 Visualize effect of Age on HbA1c

# 1. Create a dataset varying ONLY Age
age_effect <- tibble(
  age = seq(30, 80, by = 5), # Vary Age from 30 to 80
  bmi = factor("normal", levels = c("normal", "underweight", "overweight", "obese")), # Fix BMI at "Normal"
  duration_diabetes = mean(diabetes_data$duration_diabetes) # Fix Duration at the average
) |> 
  mutate(
    Predicted_HbA1c = predict(mod_full, newdata = pick(everything()))
  )

# 2. Plot the relationship
ggplot(age_effect, aes(x = age, y = Predicted_HbA1c)) +
  geom_line(color = "steelblue", linewidth = 1.2) + # Note: 'linewidth' is preferred over 'size' in new ggplot2
  geom_point(color = "steelblue", size = 3) +
  labs(
    title = "Effect of Age on HbA1c Levels",
    subtitle = "Prediction for a patient with Normal BMI and average diabetes duration",
    x = "Age (years)",
    y = "Predicted HbA1c (%)"
  ) +
  theme_minimal()

Interpretation: The plot illustrates a positive linear relationship between age and predicted HbA1c levels, indicating that as patients age, their glycemic control tends to worsen, even when BMI and duration of diabetes are held constant.

12 Check Influentials

Identifying Influential Data Points in Regression

We used three diagnostic criteria to identify potentially influential observations:

  1. Cook’s Distance
    • Threshold: values above 1, or above \(\frac{4}{n}\)
  2. Standardized Residuals
    • Threshold: values greater than 2 or less than −2
  3. Leverage
    • Threshold: values above \(\frac{2k + 2}{n}\)
    • Example: \(\frac{2(7) + 2}{4225} = 0.0038\)

Where:
- \(n\) = number of observations
- \(k\) = number of predictors in the model

These thresholds help flag data points that may disproportionately influence model estimates or distort inference.

12.1 Identify Influential Observations

# Add diagnostics columns
diabetes_diag <- augment(mod_full, diabetes_data)

# Define thresholds
n_obs <- nrow(diabetes_data)
p_preds <- length(coef(mod_full))
cooks_cutoff <- 4 / n_obs
leverage_cutoff <- (2 * p_preds + 2) / n_obs

# Filter Influential Observations
influential_obs <- diabetes_diag %>%
  filter(.cooksd > cooks_cutoff | abs(.std.resid) > 2 | .hat > leverage_cutoff)

# Print count
nrow(influential_obs)
[1] 85

12.2 Cook’s distance

cutoff <- 4/(nrow(diabetes_data)-length(mod_full$coefficients)-2)
plot(mod_full, which=4, cook.levels=cutoff)

Interpretation: The Cook’s distance plot shows that a few observations exceed the threshold of 0.004, indicating they may be influential points affecting the regression results.

12.3 Leverage and standardized residuals

  • to indentify observations with high leverage and large standardized residuals
diabetes_pred.res %>% 
  filter(.std.resid > 2 | .std.resid < -2 | .hat > 0.038) %>% 
  datatable()

12.4 Remove Influential Observations and Refit Model

# Remove influentials and Refit
clean_data <- diabetes_diag %>%
  filter(.cooksd <= cooks_cutoff & abs(.std.resid) <= 2 & .hat <= leverage_cutoff)

# Re-run the final model
final_model_clean <- lm(hba1c ~ age + bmi + duration_diabetes, data = clean_data)

tbl_regression(final_model_clean) %>%
  bold_p(t = 0.05) %>%
  bold_labels() %>%
  add_glance_source_note(include = c(r.squared, AIC)) %>%
  as_gt() %>%
  tab_header(title = "Final Model after Removing Influential Observations")
Final Model after Removing Influential Observations
Characteristic Beta 95% CI p-value
age 0.05 0.04, 0.05 <0.001
bmi


    normal
    underweight -0.47 -0.67, -0.27 <0.001
    overweight 0.36 0.23, 0.50 <0.001
    obese 1.1 0.92, 1.2 <0.001
duration_diabetes 0.11 0.09, 0.13 <0.001
Abbreviation: CI = Confidence Interval
R² = 0.415; AIC = 2,320
# Run diagnostic test
par(mfrow=c(2,2))
plot(final_model_clean)

Interpretation:

  • After removing influential observations, the regression coefficients and significance levels remained largely unchanged.
  • This indicates that the original model was robust and not overly sensitive to a few extreme data points.
  • The diagnostic plots for the cleaned model continue to support the assumptions of linear regression, including linearity, homoscedasticity, normality of residuals, and absence of undue influence.

13 Model presentation

13.1 Final model table

tbl <- tbl_regression(
  final_model_clean,
  intercept = TRUE,
  label = list(
    age ~ "Age (years)",
    bmi ~ "BMI (kg/m²)",
    duration_diabetes ~ "Duration (years)"
  )
) %>%
  # Add model fit metrics
  add_glance_source_note(include = c(r.squared, adj.r.squared, AIC)) %>%
  bold_labels() %>%
  bold_p(t = 0.05) %>%
  modify_header(estimate ~ "**Adj. beta**")

# Relabel BMI levels safely (plain text first)
tbl <- tbl %>%
  modify_table_body(
    ~ .x %>%
      mutate(
        label = case_when(
          variable == "bmi" & label == "normal" ~ "Normal",
          variable == "bmi" & label == "underweight" ~ "Underweight",
          variable == "bmi" & label == "overweight" ~ "Overweight",
          variable == "bmi" & label == "obese" ~ "Obese",
          TRUE ~ label
        )
      )
  )

# Convert to gt and apply styling
tbl %>%
  as_gt() %>%
  tab_header(
    title = "Final Multiple Linear Regression Model",
    subtitle = "Predictors of HbA1c (Outliers Removed)"
  ) %>%
  # Italicize BMI levels by matching text
  tab_style(
    style = cell_text(style = "italic"),
    locations = cells_body(
      rows = grepl("BMI", variable)
    )
  ) %>%
  # Add reference category footnote safely
tab_footnote(
  footnote = "Reference category: Normal BMI",
  locations = cells_body(
    columns = "estimate",
    rows = variable == "bmi" & label %in% c("Underweight", "Overweight", "Obese")
  )
)
Final Multiple Linear Regression Model
Predictors of HbA1c (Outliers Removed)
Characteristic Adj. beta 95% CI p-value
(Intercept) 4.5 4.1, 4.9 <0.001
Age (years) 0.05 0.04, 0.05 <0.001
BMI (kg/m²)


    Normal
    Underweight -0.471 -0.67, -0.27 <0.001
    Overweight 0.361 0.23, 0.50 <0.001
    Obese 1.11 0.92, 1.2 <0.001
Duration (years) 0.11 0.09, 0.13 <0.001
1 Reference category: Normal BMI
Abbreviation: CI = Confidence Interval
R² = 0.415; Adjusted R² = 0.412; AIC = 2,320

13.2 Forest plot of the final model

# Generate Forest Plot for the cleaned final model
plot_model(final_model_clean, 
           type = "est", 
           show.values = TRUE, 
           value.offset = .3, 
           title = "Forest Plot of HbA1c Predictors",
           # Map the variable names (left) to clean labels (right)
           axis.labels = c(
             "duration_diabetes" = "Duration of Diabetes (Years)",
             "bmiobese" = "BMI: Obese",
             "bmioverweight" = "BMI: Overweight",
             "bmiunderweight" = "BMI: Underweight",
             "age" = "Age (Years)"
           ),
           vline.color = "red") + 
  theme_minimal()

Interpretation: The forest plot visually summarizes the estimated effects of each predictor on HbA1c levels, along with their 95% confidence intervals. Age, BMI categories, and duration of diabetes all show significant associations with HbA1c, as indicated by confidence intervals that do not cross zero.

14 Result and conclusion

14.1 Model Specification and Justification

The final analysis used a multivariable main-effects linear regression model to estimate the independent contributions of clinical predictors to glycemic control.

Final Formula:

\[ \widehat{HbA1c} = 4.50 + 0.05(\text{Age}) + 0.36(\text{Overweight}) + 1.10(\text{Obese}) - 0.47(\text{Underweight}) + 0.11(\text{Duration of Diabetes}) \]

  • Explanatory Power: The model accounts for approximately 41.5% of the variance in HbA1c levels (Adjusted R² = 0.415), indicating moderate predictive strength.

  • Bias and Variance: Interaction terms were evaluated but found to be statistically non-significant (p > 0.05) and increased model complexity without improving fit.

  • Variable selection: Using AIC-based stepwise selection, the main-effects model was identified as the optimal balance between parsimony and explanatory accuracy.

  • Model selection: The final model was chosen based on lowest AICc (2852) and significant improvement over nested models (ANOVA p < 0.001).

14.2 Analysis of Coefficients

These values represent the adjusted effect of each predictor on HbA1c levels, holding all other variables constant (ceteris paribus):

  • Age: Each additional year of age is associated with a 0.05% increase in HbA1c (95% CI: 0.04, 0.05) after adjusting for BMI and duration.

    • This reflects a gradual, age-related decline in glycemic control.
  • Duration of Diabetes: Each additional year living with diabetes is associated with a 0.11% increase in HbA1c (95% CI: 0.09, 0.13) after adjusted for age and BMI.

    • This effect is independent of age and highlights the progressive nature of the disease.
  • BMI (Obesity):

    • Obese individuals have an HbA1c level 1.10% higher (95% CI: 0.92, 1.20) compared to normal weight, after adjusting for age and duration.

    • Overweight individuals have an HbA1c level 0.36% higher (95% CI: 0.23, 0.50) compared to normal weight, after adjusting for age and duration.

    • Underweight individuals have a 0.47% lower HbA1c level (95% CI: −0.67, −0.27) compared to normal weight, after adjusting for age and duration.

14.3 Subgroup Interpretation: Cumulative Risk

Because the final model is additive (main effects only), the impact of risk factors adds up directly. This allows us to interpret the combined burden of disease across different patient profiles:

  • The Metabolic Cost of Obesity: An obese patient begins with a glycemic disadvantage equivalent to approximately 22 years of aging compared to a normal‑weight peer of the same age and diabetes duration. This highlights the substantial independent effect of obesity on HbA1c.

  • Disease Progression: Each year of diabetes duration contributes an additional 0.11% increase in HbA1c. For example, a patient with a 10‑year history of diabetes is predicted to have HbA1c levels that are 1.1% higher

14.4 Model Assessment and Diagnostics

The final model was rigorously validated to ensure robustness and reliability of inference.

  • Homoscedasticity: The Breusch-Pagan test () indicates we fail to reject the null hypothesis of constant variance. This confirms the model is homoscedastic and standard errors are valid.
  • Normality of Residuals: The Shapiro-Wilk test () suggests the residuals are normally distributed, supporting the validity of hypothesis tests and confidence intervals.
  • Multicollinearity: Variance Inflation Factors (VIF) for all predictors were low (< 2.0), confirming that Age, BMI, and Duration are sufficiently independent and not essentially collinear.
  • Influence and Outliers: Potentially influential observations were identified using Cook’s Distance (Threshold ). While some points exceeded this cutoff, removing them did not alter the direction or significance of the predictors, suggesting the model is stable and not driven by a few extreme cases.

14.5 Conclusion

  1. Older age, higher BMI, and longer duration of diabetes are each independently associated with higher HbA1c levels. The model demonstrates a clear dose–response relationship with BMI and a strong linear effect of diabetes duration. With a moderate fit, the findings suggest that additional factors beyond those measured also influence glycemic control.

  2. These results highlighted the importance of weight management and vigilant monitoring of glycemic control as patients age and live longer with diabetes.

  3. These findings carry important implications for patient care and diabetes management:

  • Weight Management: Since obesity exerts the strongest independent effect on HbA1c, structured interventions targeting weight reduction (e.g., lifestyle modification, nutrition counseling, pharmacotherapy) are critical for improving glycemic control.

  • Long-Term Monitoring: The progressive rise in HbA1c with longer diabetes duration underscores the need for vigilant, ongoing monitoring. Regular HbA1c testing and timely treatment adjustments are essential to mitigate disease progression.

  • Age-Related Care: The modest but consistent effect of aging highlights the importance of tailoring glycemic targets and treatment strategies for older adults, balancing efficacy with safety.

  • Holistic Approach: With the model explaining ~41% of HbA1c variance, other factors (such as medication adherence, physical activity, diet quality, and genetic predisposition) must also be addressed in comprehensive care plans.