Diabetes Clustering

Clustering Diabetes Data

# ---- PACKAGES ----

library(readxl)
library(gtsummary)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(ggplot2)
library(crosstable)

Attaching package: 'crosstable'
The following object is masked from 'package:gtsummary':

    as_gt
library(tidyr)
library(corrplot)
corrplot 0.95 loaded
library(gmodels)
library(caret)
Loading required package: lattice
library(clustMixType)
library(cluster)
# Directory
#setwd("C:/Users/lucyq/Downloads")

# ---- 1) IMPORTING THE DATA ----

path <- "C:/Users/lucyq/Downloads/diabetes_study_filled.xlsx"

d2015 <- read_excel(path, sheet = "2015")
d2025 <- read_excel(path, sheet = "2025")

Check the structure of the data in the Environment panel: each column is listed as either num (numeric) or chr (character).

To use clustering functions for MIXED data, we will need to separate the NUMERICAL variables from the CATEGORICAL variables.

  • d2015_num is the numerical data
  • d2015_cat is the categorical data

BUT year is actually a categorical variable, even though it is stored as a number:

# Separate numerical variables from categorical variables:

d2015_num <- d2015 %>% select(where(is.numeric), -year)
d2015_cat <- d2015 %>% select(where(is.character), year)

Convert categorical variables to unordered factors:

# Convert every categorical variable to FACTORS:
d2015_fac <- d2015_cat %>% mutate(across(everything(), as.factor))
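The same split-and-convert step can be sketched in base R on a tiny invented data frame. The object `toy` and its values are made up for illustration; the real code above uses dplyr on the study data.

```r
# Invented example data: `year` is stored as a number but is really a category
toy <- data.frame(
  age  = c(45, 60, 52),
  year = c(2015, 2015, 2015),
  sex  = c("men", "woman", "woman")
)

# Numeric columns, excluding year
is_num  <- sapply(toy, is.numeric) & names(toy) != "year"
toy_num <- toy[, is_num, drop = FALSE]

# Everything else (character columns plus year) becomes an unordered factor
toy_fac <- as.data.frame(lapply(toy[, !is_num, drop = FALSE], as.factor))

str(toy_num)
str(toy_fac)
```

Treating year as a factor matters because a clustering function would otherwise measure distances between years as if they were quantities.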

Let’s check that the numerical variables vary enough to HELP clustering.

The nearZeroVar function from caret gives us:
- Frequency ratio = count of the most common value / count of the 2nd most common value (>19 is problematic)
- The percentage of values that are unique (<10% is problematic)
- Zero Variance: TRUE is problematic (no variance at all, all values identical)

and the ultimate verdict
- Near Zero Variance: TRUE indicates we need to eliminate this variable (unbalanced or no variation). With caret's default cut-offs, a variable is flagged only when BOTH the frequency ratio and the percentage-unique conditions are met, or when it has zero variance.
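To make these metrics concrete, here is a minimal base R sketch that recomputes them by hand. `nzv_metrics` is a hypothetical helper written for this note, not part of caret, and it assumes caret's default cut-offs (freqCut = 95/5 = 19, uniqueCut = 10). The toy vector is invented; its counts are chosen so that the frequency ratio comes out at 124.

```r
# Hypothetical helper: recomputes the nearZeroVar metrics for one vector
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Most common count divided by second most common count;
  # set to 0 when there is only one distinct value
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else 0
  percent_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) == 1
  data.frame(
    freqRatio     = freq_ratio,
    percentUnique = percent_unique,
    zeroVar       = zero_var,
    # Flagged when both conditions hold, or when there is no variation at all
    nzv = (freq_ratio > freq_cut & percent_unique <= unique_cut) | zero_var
  )
}

# Invented toy variable: 248 "no" and 2 "yes" out of 250 observations
toy <- c(rep("no", 248), rep("yes", 2))
nzv_metrics(toy)

# A constant variable, like a single survey year
nzv_metrics(rep(2015, 250))
```

A factor with only a handful of levels will always have a low percentage of unique values, so for categorical variables the frequency ratio is the metric that decides the verdict.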

# Identify variables with zero or very low variance
nearZeroVar(d2015_num, saveMetrics=TRUE)
          freqRatio percentUnique zeroVar   nzv
age        1.454545          17.6   FALSE FALSE
lab_hba1c  1.000000          13.6   FALSE FALSE

For our numeric variables nzv is FALSE so these are good to go!

Let’s check that the categorical variables are sufficiently varied to HELP clustering:

nearZeroVar(d2015_fac, saveMetrics = TRUE)
                      freqRatio percentUnique zeroVar   nzv
country                1.015152           2.0   FALSE FALSE
sex                    1.000000           0.8   FALSE FALSE
BMI_cat                1.117647           1.6   FALSE FALSE
SDI                    1.000000           1.2   FALSE FALSE
diabetes_self_report  14.625000           0.8   FALSE FALSE
gestational_diabetes 124.000000           0.8   FALSE  TRUE
year                   0.000000           0.4    TRUE  TRUE

WATCH OUT:

  • year - only one value across the entire dataset (zeroVar is TRUE)! This will not help differentiate clusters.

  • gestational_diabetes - highly imbalanced variable (freqRatio 124)! This will not help differentiate clusters.

  • diabetes_self_report - imbalanced variable (freqRatio 14.6 is below the cut-off of 19, so nzv is FALSE, but it is still high enough that we drop it as a judgement call)

REMOVE these:

# Remove the problematic factor variables
d2015_fac_clean <- d2015_fac %>% select(-gestational_diabetes, -diabetes_self_report, -year)

Combine data

# Combine numerical and factor data
d2015_mixed <- cbind(d2015_num, d2015_fac_clean)

K-prototypes clustering

# Set seed so that we will always get the same result:
set.seed(1234)

# Run k-prototypes (k=3 clusters)
kprot_result <- kproto(d2015_mixed, k = 3, nstart = 25)
# NAs in variables:
      age lab_hba1c   country       sex   BMI_cat       SDI 
        0         0         0         0         0         0 
0 observation(s) with NAs.

Estimated lambda: 113.5272 
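The estimated lambda is the weight that balances numeric and categorical variables. A rough sketch of a Huang-style k-prototypes distance, assuming squared Euclidean distance on the numeric part plus lambda times the number of categorical mismatches, looks like this. `kproto_distance` is a hypothetical helper for illustration, not clustMixType's own code, and the observation and prototype are invented.

```r
# Hypothetical helper: mixed-type distance between one observation and one prototype
kproto_distance <- function(x_num, x_cat, proto_num, proto_cat, lambda) {
  numeric_part     <- sum((x_num - proto_num)^2)  # squared Euclidean distance
  categorical_part <- sum(x_cat != proto_cat)     # number of mismatching categories
  numeric_part + lambda * categorical_part
}

# Invented observation and prototype, reusing variable names from the data
obs_num   <- c(age = 55, lab_hba1c = 6.0)
obs_cat   <- c(sex = "woman", SDI = "High")
proto_num <- c(age = 52, lab_hba1c = 5.7)
proto_cat <- c(sex = "woman", SDI = "Low")

# lambda rounded from the estimate printed above
d <- kproto_distance(obs_num, obs_cat, proto_num, proto_cat, lambda = 113.5)
d
```

With a lambda of about 113.5, a single categorical mismatch costs more than a 10-year age difference (10^2 = 100), while differences in lab_hba1c of a few tenths contribute very little on this scale.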

# Add cluster assignments back to data
d2015_mixed$cluster <- kprot_result$cluster

# View cluster sizes
table(kprot_result$cluster)

 1  2  3 
89 93 68 
# View the distribution of each variable within each cluster
summary(kprot_result)
age 
  Min. 1st Qu. Median     Mean 3rd Qu. Max.
1   40      48     52 52.23596   57.00   70
2   40      44     47 49.00000   54.00   64
3   60      67     72 73.13235   78.25   85

-----------------------------------------------------------------
lab_hba1c 
  Min. 1st Qu. Median     Mean 3rd Qu. Max.
1  4.8     5.2    5.8 5.748315     6.1  8.6
2  4.8     5.4    5.7 5.981720     6.2  9.0
3  4.8     5.2    5.7 5.860294     6.2  9.0

-----------------------------------------------------------------
country 
       
cluster Africa Brazil China  Rome   USA
      1  0.169  0.079 0.124 0.225 0.404
      2  0.387  0.151 0.344 0.086 0.032
      3  0.235  0.176 0.338 0.088 0.162

-----------------------------------------------------------------
sex 
       
cluster   men woman
      1 0.258 0.742
      2 0.753 0.247
      3 0.471 0.529

-----------------------------------------------------------------
BMI_cat 
       
cluster Normal weight Obesity Overweight Underweight
      1         0.180   0.202      0.539       0.079
      2         0.430   0.333      0.183       0.054
      3         0.574   0.103      0.294       0.029

-----------------------------------------------------------------
SDI 
       
cluster  High Intermediate   Low
      1 0.719        0.124 0.157
      2 0.161        0.581 0.258
      3 0.309        0.515 0.176

-----------------------------------------------------------------

Visualise age and lab_hba1c:

# plot lab_hba1c against age, per cluster
ggplot(d2015_mixed, aes(x = age, y = lab_hba1c, color = as.factor(cluster))) +
  geom_point() +
  theme_minimal()

Compare age & SDI:

ggplot(d2015_mixed, aes(x = age, y = SDI, color = as.factor(cluster))) +
  geom_point() +
  theme_minimal()

KEY PATTERNS

# Create summary by cluster
cluster_summary <- d2015_mixed %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    mean_age = round(mean(age), 1),
    mean_hba1c = round(mean(lab_hba1c), 2),
    most_common_sex = names(sort(table(sex), decreasing = TRUE))[1],
    most_common_country = names(sort(table(country), decreasing = TRUE))[1],
    most_common_SDI = names(sort(table(SDI), decreasing = TRUE))[1],
    most_common_BMI = names(sort(table(BMI_cat), decreasing = TRUE))[1]
  )

# View the table
cluster_summary
# A tibble: 3 × 8
  cluster     n mean_age mean_hba1c most_common_sex most_common_country
    <int> <int>    <dbl>      <dbl> <chr>           <chr>              
1       1    89     52.2       5.75 woman           USA                
2       2    93     49         5.98 men             Africa             
3       3    68     73.1       5.86 woman           China              
# ℹ 2 more variables: most_common_SDI <chr>, most_common_BMI <chr>

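Cluster 1 (Red): 89 people, mean age 52.2 (range 40-70), mean HbA1c 5.75; mostly women (74%), most often from the USA (40%), mostly High SDI (72%) and Overweight (54%).

Cluster 2 (Green): 93 people, mean age 49.0 (range 40-64), mean HbA1c 5.98; mostly men (75%), most often from Africa (39%) or China (34%), mostly Intermediate SDI (58%), with Normal weight (43%) and Obesity (33%) the most common BMI categories.

Cluster 3 (Blue): 68 people, mean age 73.1 (range 60-85), mean HbA1c 5.86; sexes roughly balanced (53% women), most often from China (34%), mostly Intermediate SDI (52%) and Normal weight (57%).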

Just to compare, what happens if we DON’T remove variables which don’t vary much?

# Include the problematic variables
d2015_mixed_bad <- cbind(d2015_num, d2015_fac)  

# Cluster with unbalanced data
kprot_bad <- kproto(d2015_mixed_bad, k = 3, nstart = 25)
# NAs in variables:
                 age            lab_hba1c              country 
                   0                    0                    0 
                 sex              BMI_cat                  SDI 
                   0                    0                    0 
diabetes_self_report gestational_diabetes                 year 
                   0                    0                    0 
0 observation(s) with NAs.

Estimated lambda: 188.8523 

# Add cluster assignments back to data
d2015_mixed_bad$cluster <- kprot_bad$cluster
# Create summary by cluster
cluster_summary_bad <- d2015_mixed_bad %>%
  group_by(cluster) %>%
  summarise(
    n = n(),
    mean_age = round(mean(age), 1),
    mean_hba1c = round(mean(lab_hba1c), 2),
    most_common_sex = names(sort(table(sex), decreasing = TRUE))[1],
    most_common_country = names(sort(table(country), decreasing = TRUE))[1],
    most_common_SDI = names(sort(table(SDI), decreasing = TRUE))[1],
    most_common_BMI = names(sort(table(BMI_cat), decreasing = TRUE))[1],
    most_common_dia_sr = names(sort(table(diabetes_self_report), decreasing = TRUE))[1],
    most_common_gest = names(sort(table(gestational_diabetes), decreasing = TRUE))[1],
    most_common_year = names(sort(table(year), decreasing = TRUE))[1]
  )

# View the table
cluster_summary_bad
# A tibble: 3 × 11
  cluster     n mean_age mean_hba1c most_common_sex most_common_country
    <int> <int>    <dbl>      <dbl> <chr>           <chr>              
1       1    89     48.6       5.99 men             Africa             
2       2    97     53.6       5.74 woman           USA                
3       3    64     72.6       5.89 men             China              
# ℹ 5 more variables: most_common_SDI <chr>, most_common_BMI <chr>,
#   most_common_dia_sr <chr>, most_common_gest <chr>, most_common_year <chr>

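So the clusters are different, although not wildly: cluster numbers are arbitrary labels, and cluster 1 here (n = 89, mean age 48.6, mostly men, Africa) looks a lot like cluster 2 from the clean run. So what? How can we tell whether one solution is better than the other?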

Clustering Quality Metrics

Compare total within-cluster distance (lower = better):

# Lower is better = more compact clusters
kprot_result$tot.withinss      # Clean clustering
[1] 62655.81
kprot_bad$tot.withinss         # With unbalanced variables
[1] 99527.87

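The ‘clean’ clustering has a much lower total within-cluster distance; that is, the clusters are more compact. Bear in mind that the two runs use different sets of variables and different estimated lambda values (113.5 vs 188.9), so part of the gap simply reflects the extra variables rather than better clusters.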

We can also examine the silhouette scores to see how well-separated the clusters are:

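# Calculate silhouette scores
# NOTE: d2015_mixed and d2015_mixed_bad already contain the cluster column added above,
# so these Gower distances include the cluster label itself. Drop it first
# (e.g. select(-cluster)) for a fairer score.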
sil_clean <- silhouette(kprot_result$cluster, daisy(d2015_mixed, metric = "gower"))
sil_bad <- silhouette(kprot_bad$cluster, daisy(d2015_mixed_bad, metric = "gower"))

# Average silhouette width (higher = better)
mean(sil_clean[,3])
[1] 0.2671279
mean(sil_bad[,3])
[1] 0.2737974
# Visualize
plot(sil_clean)

plot(sil_bad)

In this case, the silhouette scores weren’t great either time (about 0.27 for both, with the unbalanced run marginally higher), indicating that both clustering solutions have overlapping clusters. A silhouette score of 0.26-0.5 indicates a weak structure, 0.51-0.7 indicates a reasonable structure and 0.71 or above indicates a strong structure (very rare with real-life data)!
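For intuition about what the average silhouette width measures, here is a small base R sketch that computes it by hand. `silhouette_width` is a hypothetical helper written for this note (the analysis above uses cluster::silhouette), and the four points are invented. For each observation, a is the mean distance to the other members of its own cluster, b is the mean distance to the nearest other cluster, and the width is (b - a) / max(a, b).

```r
# Hypothetical helper: silhouette width for every observation
# d  : a dist object or distance matrix
# cl : integer vector of cluster labels
silhouette_width <- function(d, cl) {
  d <- as.matrix(d)
  sapply(seq_along(cl), function(i) {
    same <- cl == cl[i]
    same[i] <- FALSE                 # exclude the point itself
    if (!any(same)) return(0)        # convention for single-member clusters
    other <- cl != cl[i]
    a <- mean(d[i, same])                             # cohesion
    b <- min(tapply(d[i, other], cl[other], mean))    # separation
    (b - a) / max(a, b)
  })
}

# Invented toy data: two tight, well-separated groups on a line
pts <- c(0, 1, 10, 11)
cl  <- c(1, 1, 2, 2)

sil <- silhouette_width(dist(pts), cl)
mean(sil)
```

Well-separated groups like these give an average close to 1. Values near 0.27, as in both runs above, mean that many observations are almost as close to a neighbouring cluster as to their own.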

CONCLUSION

  • Use clean data

  • Eliminate unbalanced variables

  • Play with ‘k’ to see if you can improve the clustering results
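One common way to play with ‘k’ is an elbow check: fit the model for several values of k and see where the total within-cluster distance stops dropping sharply. The sketch below uses base R kmeans on simulated numeric data as a stand-in, so it runs without extra packages; the data and object names are invented. For the mixed data above, the same loop could call kproto(d2015_mixed, k = k, nstart = 25) on a copy without the cluster column and read its tot.withinss instead.

```r
set.seed(1234)

# Simulated numeric data with three well-separated groups (invented for illustration)
toy <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 5),  ncol = 2),
  matrix(rnorm(100, mean = 10), ncol = 2)
)

# Fit k-means for k = 1..6 and record the total within-cluster sum of squares
ks  <- 1:6
wss <- sapply(ks, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

# Look for the k after which the improvement flattens out
elbow <- data.frame(k = ks, tot_withinss = round(wss, 1))
elbow
```

The total within-cluster distance keeps shrinking as k grows, so the aim is to find the point of diminishing returns rather than the minimum, and to check the silhouette scores for the candidate values of k as well.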