The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
library(ggplot2)
library(crosstable)
Attaching package: 'crosstable'
The following object is masked from 'package:gtsummary':
as_gt
library(tidyr)
library(corrplot)
corrplot 0.95 loaded
library(gmodels)
library(caret)
Loading required package: lattice
library(clustMixType)
library(cluster)
# Directory
# setwd("C:/Users/lucyq/Downloads")

# ---- 1) IMPORTING THE DATA ----
path <- "C:/Users/lucyq/Downloads/diabetes_study_filled.xlsx"
d2015 <- read_excel(path, sheet = "2015")
d2025 <- read_excel(path, sheet = "2025")
Check the structure of the data in the ENVIRONMENT PANEL: each column is listed as either num (numeric) or chr (character).
To use clustering functions for MIXED data, we will need to separate the NUMERICAL variables from the CATEGORICAL variables.
d2015_num is the numerical data
d2015_cat is the categorical data
BUT actually the YEAR is a categorical variable, even though it is stored as a number:
# Separate numerical variables from categorical variables:
d2015_num <- d2015 %>% select(where(is.numeric), -year)
d2015_cat <- d2015 %>% select(where(is.character), year)
Convert categorical variables to unordered factors:
# Convert every categorical variable to FACTORS:
d2015_fac <- d2015_cat %>% mutate(across(everything(), as.factor))
Let’s check that the numerical variables vary enough to HELP the clustering.
The nearZeroVar function from caret (with saveMetrics = TRUE) gives us:
- Frequency ratio = count of the most common value / count of the 2nd most common value (>19 is problematic)
- The percentage of values that are unique (<10% is problematic)
- Zero Variance: TRUE is problematic (no variance at all, all values identical)
and the final verdict:
- Near Zero Variance: TRUE indicates we need to eliminate this variable (unbalanced or no variation). With the default cut-offs, a variable is flagged when it has zero variance, or when both the frequency ratio and the percentage of unique values are problematic.
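To make the two cut-offs concrete, here is a small worked sketch in base R. It recomputes the frequency ratio and the percentage of unique values by hand rather than calling nearZeroVar(); the vector `x` is made-up illustration data, not part of the study.

```r
# Hypothetical, very unbalanced variable: 98 "no" and 2 "yes"
x <- c(rep("no", 98), rep("yes", 2))

# Counts per value, most common first
counts <- sort(table(x), decreasing = TRUE)

# Frequency ratio = most common / 2nd most common
freq_ratio <- as.numeric(counts[1] / counts[2])

# Percentage of values that are unique
percent_unique <- 100 * length(unique(x)) / length(x)

# Flag the variable when BOTH cut-offs are exceeded (19 is 95/5)
near_zero_var <- freq_ratio > 95 / 5 && percent_unique < 10

freq_ratio
percent_unique
near_zero_var
```

Here the frequency ratio is 98 / 2 = 49 and only 2% of the values are unique, so this variable would be flagged.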
# Identify variables with zero or very low variance
nearZeroVar(d2015_num, saveMetrics = TRUE)
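The next step uses d2015_fac_clean, but the code that builds it is not shown. Judging by the variable lists printed by the two kproto() runs, the dropped columns appear to be diabetes_self_report, gestational_diabetes and year. The following is only a sketch of how such a cleaned factor table could be built, assuming the same nearZeroVar check is applied to the factor columns. `toy_fac` and `toy_fac_clean` are hypothetical stand-ins for d2015_fac and d2015_fac_clean.

```r
library(caret)

# Toy stand-in for d2015_fac (made-up values, not the study data)
toy_fac <- data.frame(
  sex  = factor(rep(c("men", "woman"), each = 50)),             # balanced
  year = factor(rep("2015", 100)),                              # constant
  gestational_diabetes = factor(c(rep("no", 98), "yes", "yes")) # 98 vs 2
)

# Names of the columns flagged as zero or near-zero variance
nzv_names <- nearZeroVar(toy_fac, names = TRUE)

# Keep only the columns that were not flagged
toy_fac_clean <- toy_fac[, setdiff(names(toy_fac), nzv_names), drop = FALSE]

names(toy_fac_clean)
```

In this toy example only sex survives: year is constant, and gestational_diabetes has a frequency ratio of 49 with 2% unique values.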
# Combine numerical and factor data
d2015_mixed <- cbind(d2015_num, d2015_fac_clean)
K-prototypes clustering
# Set seed so that we will always get the same result:
set.seed(1234)

# Run k-prototypes (k = 3 clusters)
kprot_result <- kproto(d2015_mixed, k = 3, nstart = 25)
# NAs in variables:
age lab_hba1c country sex BMI_cat SDI
0 0 0 0 0 0
0 observation(s) with NAs.
Estimated lambda: 113.5272
# Add cluster assignments back to data
d2015_mixed$cluster <- kprot_result$cluster

# View cluster sizes
table(kprot_result$cluster)
1 2 3
89 93 68
# View cluster centers
summary(kprot_result)
age
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 40 48 52 52.23596 57.00 70
2 40 44 47 49.00000 54.00 64
3 60 67 72 73.13235 78.25 85
-----------------------------------------------------------------
lab_hba1c
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 4.8 5.2 5.8 5.748315 6.1 8.6
2 4.8 5.4 5.7 5.981720 6.2 9.0
3 4.8 5.2 5.7 5.860294 6.2 9.0
-----------------------------------------------------------------
country
cluster Africa Brazil China Rome USA
1 0.169 0.079 0.124 0.225 0.404
2 0.387 0.151 0.344 0.086 0.032
3 0.235 0.176 0.338 0.088 0.162
-----------------------------------------------------------------
sex
cluster men woman
1 0.258 0.742
2 0.753 0.247
3 0.471 0.529
-----------------------------------------------------------------
BMI_cat
cluster Normal weight Obesity Overweight Underweight
1 0.180 0.202 0.539 0.079
2 0.430 0.333 0.183 0.054
3 0.574 0.103 0.294 0.029
-----------------------------------------------------------------
SDI
cluster High Intermediate Low
1 0.719 0.124 0.157
2 0.161 0.581 0.258
3 0.309 0.515 0.176
-----------------------------------------------------------------
Visualise age and lab_hba1c:
# Plot lab_hba1c against age, per cluster
ggplot(d2015_mixed, aes(x = age, y = lab_hba1c, color = as.factor(cluster))) +
  geom_point() +
  theme_minimal()
Compare age & SDI:
ggplot(d2015_mixed, aes(x = age, y = SDI, color = as.factor(cluster))) +
  geom_point() +
  theme_minimal()
# A tibble: 3 × 8
cluster n mean_age mean_hba1c most_common_sex most_common_country
<int> <int> <dbl> <dbl> <chr> <chr>
1 1 89 52.2 5.75 woman USA
2 2 93 49 5.98 men Africa
3 3 68 73.1 5.86 woman China
# ℹ 2 more variables: most_common_SDI <chr>, most_common_BMI <chr>
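The code that produced the summary table above is not shown. A minimal sketch of how a per-cluster summary of this shape could be built with dplyr follows. The helper `most_common()` and the data frame `toy` are hypothetical and use made-up values; on the real data the same pipeline would start from d2015_mixed.

```r
library(dplyr)

# Hypothetical helper: the most frequent value of a vector, as a string
most_common <- function(x) names(which.max(table(x)))

# Toy stand-in for d2015_mixed (made-up values)
toy <- data.frame(
  cluster   = c(1, 1, 1, 2, 2, 2),
  age       = c(50, 52, 54, 70, 72, 74),
  lab_hba1c = c(5.5, 5.7, 5.9, 6.0, 6.2, 6.4),
  sex       = c("woman", "woman", "men", "men", "men", "woman")
)

# One row per cluster: size, numeric means, and modal category
cluster_summary <- toy %>%
  group_by(cluster) %>%
  summarise(
    n               = n(),
    mean_age        = mean(age),
    mean_hba1c      = mean(lab_hba1c),
    most_common_sex = most_common(sex)
  )

cluster_summary
```

Note that which.max() returns the first level in case of a tie, so a "most common" category should be read together with the proportions in the summary() output.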
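Cluster 1 (Red): middle-aged (mean age 52.2), mostly women (74%), most often from the USA (40%), predominantly high SDI (72%) and overweight (54%); mean HbA1c 5.75.
Cluster 2 (Green): the youngest group (mean age 49.0), mostly men (75%), most often from Africa (39%) or China (34%), mainly intermediate SDI (58%), most often normal weight (43%) but with the largest share of obesity (33%); mean HbA1c 5.98.
Cluster 3 (Blue): the oldest group (mean age 73.1, range 60 to 85), roughly balanced by sex (53% women), most often from China (34%), mainly intermediate SDI (52%) and normal weight (57%); mean HbA1c 5.86.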
Just to compare, what happens if we DON’T remove variables which don’t vary much?
# Include the problematic variables
d2015_mixed_bad <- cbind(d2015_num, d2015_fac)

# Cluster with unbalanced data
kprot_bad <- kproto(d2015_mixed_bad, k = 3, nstart = 25)
# NAs in variables:
age lab_hba1c country
0 0 0
sex BMI_cat SDI
0 0 0
diabetes_self_report gestational_diabetes year
0 0 0
0 observation(s) with NAs.
Estimated lambda: 188.8523
# Add cluster assignments back to data
d2015_mixed_bad$cluster <- kprot_bad$cluster
In this case, the silhouette scores weren’t great either time, indicating that both clustering solutions have overlapping clusters. A silhouette score of 0.26-0.50 indicates a weak structure, 0.51-0.70 indicates a reasonable structure and 0.71 or above indicates a strong structure (very rare with real-life data)!
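The silhouette calculation itself is not shown above. One possible way to get an average silhouette width for mixed data is sketched below, assuming a Gower dissimilarity from the cluster package. This is an assumption: Gower distance is not the distance kproto() optimises, and the source does not say how its scores were computed. To keep the sketch runnable with the cluster package alone, pam() stands in for kproto(), and `toy` is made-up data.

```r
library(cluster)

# Toy mixed data: one numeric and one categorical variable (made-up values)
toy <- data.frame(
  age = c(41, 43, 45, 44, 71, 73, 75, 74),
  sex = factor(c("men", "men", "men", "woman",
                 "woman", "woman", "woman", "men"))
)

# Gower dissimilarity handles numeric and factor columns together
gower_dist <- daisy(toy, metric = "gower")

# Stand-in clustering (pam on the dissimilarities); with the real data,
# the cluster vector would be kprot_result$cluster instead
clusters <- pam(gower_dist, k = 2, diss = TRUE)$clustering

# Silhouette width per observation, then the overall average
sil <- silhouette(clusters, gower_dist)
avg_sil <- mean(sil[, "sil_width"])
avg_sil
```

The average can then be read against the ranges quoted above, and the same calculation repeated for the clean and the unbalanced data sets to compare them.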
CONCLUSION
Use clean data
Eliminate unbalanced variables
Play with ‘k’ to see if you can improve the clustering results
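One way to "play with k" is to rerun kproto() for several values of k and compare the total within-cluster distance, looking for the point where adding clusters stops helping much (an elbow). The sketch below uses simulated toy data and a small nstart to keep it quick; both are assumptions for illustration. verbose = FALSE suppresses the repeated NA reports that kproto() prints on each start.

```r
library(clustMixType)

set.seed(1234)

# Simulated mixed data with three age groups (not the study data)
toy <- data.frame(
  age = c(rnorm(30, 45, 3), rnorm(30, 60, 3), rnorm(30, 75, 3)),
  sex = factor(sample(c("men", "woman"), 90, replace = TRUE))
)

# Candidate numbers of clusters
ks <- 2:5

# Total within-cluster distance for each k
tot_within <- sapply(ks, function(k) {
  kproto(toy, k = k, nstart = 5, verbose = FALSE)$tot.withinss
})

data.frame(k = ks, tot_within = tot_within)
```

The total within-cluster distance generally shrinks as k grows, so the useful signal is where the decrease flattens out, ideally checked alongside the silhouette scores.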