Introduction

Clustering analysis is a technique that groups similar observations into a number of clusters based on the observed values of several variables for each observation.

In this project, clustering analysis will be employed to group individuals from the diabetes health indicators dataset into clusters of similar health conditions, which can be valuable to the health industry, medical professionals, and nutritionists.

We will explore various clustering techniques like K-Means, Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA) and Hierarchical Clustering.

Dataset

The dataset was acquired from Kaggle and has a total of 70,692 observations and 22 variables (21 health indicators plus the binary diabetes indicator). It can be found at https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv

Original dataset
diabetes_data <- read.csv("diabetes_df_BRFSS2015.csv")
sprintf("The original dataset has %s observations with a total of %s features.", nrow(diabetes_data), ncol(diabetes_data))
## [1] "The original dataset has 70692 observations with a total of 22 features."
Description of Variables

HighBP: 0 = no high BP 1 = high BP

HighChol: 0 = no high cholesterol 1 = high cholesterol

CholCheck: 0 = no cholesterol check in 5 years 1 = yes cholesterol check in 5 years

BMI: Body Mass Index

MentHlth: days of poor mental health scale 1-30 days

DiffWalk: difficulty walking or climbing stairs? 0 = no 1 = yes

Age: 13-level age category 1=18-24, 5=40-44, 6=45-49, 8=55-59, 13=80 or older

Income: scale 1-8; 1 = less than $10,000, 8 = $75,000 or more

Education: scale 1-6 1 = Never attended school/only kindergarten 2 = elementary

Descriptions of the other variables can be found at https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf

For clustering analysis, let’s work with a subset of the original dataset containing 3,535 observations (5% of the original observations).
# draw a random sample of 3,535 rows without replacement; set.seed() could be called first for a reproducible sample
diabetes_df <- diabetes_data[sample(1:nrow(diabetes_data), 3535, replace=FALSE), ]
sprintf("The new dataset has %s observations with a total of %s features.", nrow(diabetes_df), ncol(diabetes_df))
## [1] "The new dataset has 3535 observations with a total of 22 features."
Packages Used for Analysis
library(corrplot)
library(stats)
library(factoextra)
library(ggplot2)
library(gridExtra)
library(ggplotify)
library(cluster)
library(flexclust)
library(fpc)
library(ClusterR)
library(ggpubr)
library(dplyr)
library(knitr)

Data Preprocessing and Analysis

Dataset Structure
str(diabetes_df)
## 'data.frame':    3535 obs. of  22 variables:
##  $ Diabetes_binary     : num  1 1 1 0 0 0 0 0 1 0 ...
##  $ HighBP              : num  1 1 0 0 0 0 1 1 1 0 ...
##  $ HighChol            : num  0 1 0 0 0 0 0 1 1 1 ...
##  $ CholCheck           : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BMI                 : num  24 22 20 29 26 18 20 40 27 33 ...
##  $ Smoker              : num  1 0 1 0 0 0 0 0 0 0 ...
##  $ Stroke              : num  0 0 0 0 0 0 0 1 0 0 ...
##  $ HeartDiseaseorAttack: num  0 1 0 0 0 0 0 0 0 0 ...
##  $ PhysActivity        : num  1 0 1 1 1 1 1 0 1 1 ...
##  $ Fruits              : num  1 1 1 1 0 1 1 0 1 0 ...
##  $ Veggies             : num  1 1 1 1 0 1 1 0 1 1 ...
##  $ HvyAlcoholConsump   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ AnyHealthcare       : num  1 0 1 1 1 1 1 1 1 1 ...
##  $ NoDocbcCost         : num  0 1 0 0 1 1 0 0 0 0 ...
##  $ GenHlth             : num  3 1 4 2 1 3 2 5 3 5 ...
##  $ MentHlth            : num  0 30 0 10 0 0 0 30 0 30 ...
##  $ PhysHlth            : num  0 6 30 30 0 0 0 30 1 30 ...
##  $ DiffWalk            : num  0 1 0 1 0 0 0 1 0 0 ...
##  $ Sex                 : num  0 1 0 0 1 0 0 1 0 0 ...
##  $ Age                 : num  8 7 9 11 4 1 13 7 9 12 ...
##  $ Education           : num  5 4 5 6 6 5 5 4 4 2 ...
##  $ Income              : num  1 1 1 7 7 1 6 1 8 1 ...
Missing Values
sprintf("The dataset contain %s missing values .",sum(is.na(diabetes_df)))
## [1] "The dataset contain 0 missing values ."
Dataset Normalisation using Z-score

Normalising the data prevents features with larger numeric ranges from dominating the analysis simply because they are measured on different scales.

# z-score standardise every feature; column 1 (Diabetes_binary) is excluded because it is the class label
diabetes_df_norm <- as.data.frame(lapply(diabetes_df[,2:ncol(diabetes_df)], scale))
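For reference, z-score standardisation subtracts the feature mean and divides by the standard deviation, so each scaled feature has mean 0 and standard deviation 1. A minimal sketch on a single column (BMI is used purely as an illustration):
# Manual z-score for one column, equivalent to what scale() does above
bmi_z <- (diabetes_df$BMI - mean(diabetes_df$BMI)) / sd(diabetes_df$BMI)
all.equal(as.numeric(scale(diabetes_df$BMI)), bmi_z)   # should be TRUE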
Correlation Matrix
corr_matrix<- cor(diabetes_df_norm)
corrplot(corr_matrix, type="lower", method = "square")

The correlation matrix shows some fairly strong positive correlations (e.g. GenHlth with PhysHlth and DiffWalk, and Education with Income) and a moderately strong negative correlation between Income and GenHlth.
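To make the visual reading easier to verify, the strongest pairwise correlations can also be listed numerically; a minimal sketch using the corr_matrix computed above:
# Flatten the correlation matrix and list the five strongest pairs by absolute value
corr_long <- as.data.frame(as.table(corr_matrix))
corr_long <- subset(corr_long, as.character(Var1) < as.character(Var2))   # drop the diagonal and duplicate pairs
head(corr_long[order(-abs(corr_long$Freq)), ], 5)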

Outliers detection using IQR Rule
var <- colnames(diabetes_df_norm)
outlier <- c()
for(i in var){
  # Tukey's IQR rule: values above Q3 + 1.5*IQR or below Q1 - 1.5*IQR are flagged as outliers
  max <- quantile(diabetes_df_norm[,i], 0.75) + (IQR(diabetes_df_norm[,i]) * 1.5 )
  min <- quantile(diabetes_df_norm[,i], 0.25) - (IQR(diabetes_df_norm[,i]) * 1.5 )
  idx <- which(diabetes_df_norm[,i] < min | diabetes_df_norm[,i] > max)
  print(paste(i, length(idx), sep=' '))   # variable name and number of flagged observations
  outlier <- c(outlier, idx) 
}
## [1] "HighBP 0"
## [1] "HighChol 0"
## [1] "CholCheck 90"
## [1] "BMI 107"
## [1] "Smoker 0"
## [1] "Stroke 212"
## [1] "HeartDiseaseorAttack 514"
## [1] "PhysActivity 0"
## [1] "Fruits 0"
## [1] "Veggies 786"
## [1] "HvyAlcoholConsump 156"
## [1] "AnyHealthcare 155"
## [1] "NoDocbcCost 345"
## [1] "GenHlth 0"
## [1] "MentHlth 580"
## [1] "PhysHlth 625"
## [1] "DiffWalk 849"
## [1] "Sex 0"
## [1] "Age 45"
## [1] "Education 0"
## [1] "Income 0"

Quite a large number of outliers are present in HeartDiseaseorAttack, Veggies, MentHlth, PhysHlth and DiffWalk.
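As a sanity check, the same fences can be expressed on the original (unscaled) measurement scale; a minimal sketch for BMI:
# Tukey fences for BMI in its original units
q1 <- quantile(diabetes_df$BMI, 0.25); q3 <- quantile(diabetes_df$BMI, 0.75)
c(lower = unname(q1 - 1.5 * IQR(diabetes_df$BMI)), upper = unname(q3 + 1.5 * IQR(diabetes_df$BMI)))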

Clusterability Assessment

Let’s run the Hopkins statistic, via get_clust_tendency, to assess the clusterability of the data. Under the interpretation adopted here, a Hopkins statistic close to 1 means we fail to reject the null hypothesis of uniformity and conclude that the dataset is not significantly clusterable.

diabetes_hstat <- get_clust_tendency(diabetes_df_norm, n = 20, graph = FALSE)
diabetes_hstat$hopkins_stat
## [1] 0.7588758

Under this interpretation, the value of about 0.76 suggests the data is not strongly clusterable; however, we will proceed with the clustering analysis to see if we can find any groupings.

Optimal Number of Clusters

We use the average silhouette method to find the optimal number of clusters for K-Means, PAM, CLARA and Hierarchical Clustering.
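For context, the average silhouette criterion simply averages each point’s silhouette width s(i) = (b(i) - a(i)) / max(a(i), b(i)) over the whole sample for each candidate k. A minimal sketch of what fviz_nbclust computes, using K-Means as the example (this may take a moment on 3,535 observations):
# Average silhouette width for k = 2..6, computed directly with kmeans() and silhouette()
d <- dist(diabetes_df_norm)
avg_sil <- sapply(2:6, function(k) {
  km <- kmeans(diabetes_df_norm, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
setNames(avg_sil, paste0("k=", 2:6))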

K-Means and Hierarchical Clustering
k <- fviz_nbclust(diabetes_df_norm, FUNcluster = kmeans, method = "silhouette") +  ggtitle("K-Means") 
h <- fviz_nbclust(diabetes_df_norm, FUNcluster = hcut, method = "silhouette") + ggtitle("Hierarchical") 
grid.arrange(k, h, ncol=2)

PAM and CLARA
p <- fviz_nbclust(diabetes_df_norm, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM") 
c <- fviz_nbclust(diabetes_df_norm, FUNcluster = cluster::clara, method = "silhouette") + ggtitle("CLARA") 
grid.arrange(p, c, ncol=2)

Clustering Analysis

K-Means Clustering

K-Means is a centroid-based, partitional clustering algorithm in which each data point is assigned to the cluster whose centroid it is closest to (Natasha, 2021). Its goal is to minimize the differences within a cluster and maximize the differences between clusters. The K-Means clustering method is sensitive to outliers.
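For intuition, a single K-Means iteration alternates two steps: assign each point to its nearest centroid, then recompute each centroid as the mean of its assigned points. A minimal sketch of one such iteration (illustrative only; kmeans() below runs this to convergence):
# One K-Means iteration: assignment step followed by centroid update
X <- as.matrix(diabetes_df_norm)
n_clusters <- 2
centers <- X[sample(nrow(X), n_clusters), ]            # random initial centroids
d2 <- sapply(1:n_clusters, function(j) rowSums(sweep(X, 2, centers[j, ])^2))   # squared distance to each centroid
assignment <- max.col(-d2)                             # index of the nearest centroid per point
centers <- t(sapply(1:n_clusters, function(j) colMeans(X[assignment == j, , drop = FALSE])))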

Using the elbow method, let’s confirm the optimal number of clusters to be used for K-Means clustering analysis.

Optimal_Clusters_KMeans(diabetes_df_norm, max_clusters=10, plot_clusters = TRUE)

##  [1] 1.0000000 0.9495487 0.8331614 0.7927930 0.7565445 0.7170454 0.7157230
##  [8] 0.6943536 0.6723967 0.6337929
The elbow method also seems to point towards an optimal number of 2 clusters.
Nevertheless, let’s apply K-Means using both 2 and 3 clusters for comparison.
diabetes_df_km2 <- kmeans(diabetes_df_norm, 2)

diabetes_df_km3 <- kmeans(diabetes_df_norm, 3)
K-Means Cluster Visualization
2 clusters
clus_2 <- fviz_cluster(list(data=diabetes_df_norm, cluster=diabetes_df_km2$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", 
             ggtheme=theme_classic())+ stat_mean(aes(color = cluster), size = 6)

df_sil2 <- silhouette(diabetes_df_km2$cluster, dist(diabetes_df_norm))
sil_2 <- fviz_silhouette(df_sil2)
##   cluster size ave.sil.width
## 1       1 1087          0.02
## 2       2 2448          0.23
grid.arrange(clus_2, sil_2, ncol=2)

3 clusters
clus_3 <- fviz_cluster(list(data=diabetes_df_norm, cluster=diabetes_df_km3$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, palette="jco", 
             ggtheme=theme_classic())+ stat_mean(aes(color = cluster), size = 6)

df_sil3 <- silhouette(diabetes_df_km3$cluster, dist(diabetes_df_norm))
sil_3 <- fviz_silhouette(df_sil3)
##   cluster size ave.sil.width
## 1       1 1273          0.11
## 2       2 1481          0.09
## 3       3  781          0.01
grid.arrange(clus_3, sil_3, ncol=2)

K-Means Cluster Statistics
2 clusters
stat_km2 <- cclust(diabetes_df_norm, 2, dist="euclidean")
stripes(stat_km2)

3 clusters
stat_km3 <- cclust(diabetes_df_norm, 3, dist="euclidean")
stripes(stat_km3)

It is observed that K-Means clustered the dataset better using 2 clusters, with a higher overall average silhouette width of about 0.17 compared to the 3-cluster solution. Cluster 2 was considerably larger than Cluster 1 (2,448 vs 1,087 observations).
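The overall figure quoted above can be read straight off the silhouette objects created earlier; a quick check (values are approximate):
# Overall average silhouette width for the 2- and 3-cluster K-Means solutions
mean(df_sil2[, "sil_width"])   # roughly 0.17
mean(df_sil3[, "sil_width"])   # lower than the 2-cluster value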

Partitioning Around Medoids (PAM)

PAM is a medoid-based partitioning algorithm: each cluster is represented by an actual observation (the medoid) rather than a computed centroid. It is therefore more robust than K-Means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values.
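A toy illustration of why: with a single extreme value, the mean (which K-Means uses) moves substantially, while the central observation that would serve as a medoid barely changes.
# One outlier pulls the mean far more than the central observation
x <- c(2, 3, 4, 5, 100)
mean(x)     # 22.8 - dragged towards the outlier
median(x)   # 4   - in one dimension the medoid coincides with the median observation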

Let’s apply PAM using 2 and 3 clusters.
diabetes_df_pam2 <- pam(diabetes_df_norm,2) 

diabetes_df_pam3 <-pam(diabetes_df_norm,3)
PAM Cluster Visualization
2 clusters
clus_2_pam <- fviz_cluster(diabetes_df_pam2, geom = "point", ellipse.type = "convex",stand=FALSE, palette="jco", ggtheme=theme_classic())

pam_sil2 <- fviz_silhouette(silhouette(diabetes_df_pam2))
##   cluster size ave.sil.width
## 1       1 1808          0.12
## 2       2 1727          0.03
grid.arrange(clus_2_pam, pam_sil2, ncol=2)

3 clusters
clus_3_pam <- fviz_cluster(diabetes_df_pam3, geom = "point", ellipse.type = "convex",stand=FALSE, palette="jco", ggtheme=theme_classic())

pam_sil3 <- fviz_silhouette(silhouette(diabetes_df_pam3))
##   cluster size ave.sil.width
## 1       1 1428          0.08
## 2       2 1268          0.08
## 3       3  839          0.01
grid.arrange(clus_3_pam, pam_sil3, ncol=2)

It is observed that PAM also clustered the dataset better using 2 clusters, with a higher average silhouette width than the 3-cluster solution. Unlike K-Means, however, Cluster 1 was slightly larger than Cluster 2 here (1,808 vs 1,727 observations).

Clustering Large Applications (CLARA)

CLARA extends PAM to larger datasets by running PAM on repeated sub-samples of the data and keeping the medoids that fit the full dataset best. Let’s also apply the CLARA clustering algorithm using 2 and 3 clusters; a sketch of calling clara() directly follows the eclust() calls below.
diabetes_df_clara2<- eclust(diabetes_df_norm, "clara", k=2)
diabetes_df_clara3<- eclust(diabetes_df_norm, "clara", k=3)
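eclust() is a convenience wrapper; calling cluster::clara() directly exposes the sampling behaviour that distinguishes CLARA from PAM. A minimal sketch (the samples value of 50 is an illustrative choice, not necessarily the one used above):
# CLARA run directly: PAM is fitted on 50 random sub-samples and the best medoid set is kept
clara_fit <- clara(diabetes_df_norm, k = 2, samples = 50, pamLike = TRUE)
clara_fit$silinfo$avg.width   # average silhouette width of the best clustering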
CLARA Cluster Visualization
2 clusters
clus_2_cla <- fviz_cluster(diabetes_df_clara2, palette=c("#00AFBB", "goldenrod4", "#E7B800"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic())

cla_sil2 <- fviz_silhouette(diabetes_df_clara2)
##   cluster size ave.sil.width
## 1       1 1984          0.01
## 2       2 1551          0.18
grid.arrange(clus_2_cla, cla_sil2, ncol=2)

3 clusters
clus_3_cla <- fviz_cluster(diabetes_df_clara3, palette=c("#00AFBB", "goldenrod4", "#E7B800"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic())

cla_sil3 <- fviz_silhouette(diabetes_df_clara3)
##   cluster size ave.sil.width
## 1       1 1428          0.00
## 2       2 1040          0.00
## 3       3 1067          0.13
grid.arrange(clus_3_cla, cla_sil3, ncol=2)

Hierarchical Clustering

Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram.

Using the gap statistic, let’s confirm the optimal number of clusters to be used for hierarchical clustering analysis.

# note: the raw subset is used here; diabetes_df_norm could be used instead for consistency with the other methods
gap_stat <- clusGap(diabetes_df, FUN = hcut, nstart = 25, K.max = 5, B = 50)
fviz_gap_stat(gap_stat)

Hierarchical Clustering Using the Complete Linkage Method and Ward’s Method
diabetes_df_matrix <- dist(diabetes_df_norm, method = "euclidean")
Complete Linkage Method
diabetes_df_hc <- hclust(diabetes_df_matrix, method = "complete")
plot(diabetes_df_hc, cex = 0.6, hang = -1, main = "Dendrogram with Complete Linkage")
rect.hclust(diabetes_df_hc, k=2, border='red')

2 Clusters
diabetes_df_c2 <- cutree(diabetes_df_hc, k = 2)
table(diabetes_df_c2)
## diabetes_df_c2
##    1    2 
## 3520   15
3 Clusters
diabetes_df_c3 <- cutree(diabetes_df_hc, k = 3)
table(diabetes_df_c3)
## diabetes_df_c3
##    1    2    3 
## 3433   87   15
Ward’s Linkage Method
diabetes_df_hc_w <- hclust(diabetes_df_matrix, method = "ward.D2" )
plot(diabetes_df_hc_w, cex = 0.6, hang = -1, main = "Dendrogram with Ward Linkage")
rect.hclust(diabetes_df_hc_w, k=2, border='red')

2 Clusters
diabetes_df_w2 <- cutree(diabetes_df_hc_w, k = 2)
table(diabetes_df_w2)
## diabetes_df_w2
##    1    2 
## 2014 1521
3 Clusters
diabetes_df_w3 <- cutree(diabetes_df_hc_w, k = 3)
table(diabetes_df_w3)
## diabetes_df_w3
##    1    2    3 
## 2014  453 1068
Agglomerative coefficient

The agglomerative coefficient measures the strength of the clustering structure found. A value closer to 1 suggests a strong cluster structure.

Complete linkage method
diabetes_hc <- agnes(diabetes_df_norm, method = "complete" )
diabetes_hc$ac
## [1] 0.8589288
Ward’s method
diabetes_hc_w <- agnes(diabetes_df_norm, method = "ward" )
diabetes_hc_w$ac
## [1] 0.9813054
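Other linkage methods can be compared on the same footing; a minimal sketch (single and average linkage are included purely for comparison and are not used elsewhere in this analysis, and agnes() on 3,535 observations may take a little while):
# Agglomerative coefficient for several linkage methods
linkages <- c("single", "average", "complete", "ward")
sapply(linkages, function(m) agnes(diabetes_df_norm, method = m)$ac)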

Summary

The main goal of this article was to run clustering analysis on the diabetes health indicators dataset to observe what groupings can be found in the data. Two clusters of similar observations were found, possibly representing people who have diabetes and those who do not. The clusters were not well separated and overlapped somewhat, which could be tied to the dataset not being significantly clusterable. Also, it would be best to stick with clustering algorithms that are less susceptible to outliers, such as Ward’s linkage hierarchical clustering, PAM and CLARA.

I hope you found this article interesting and got some new insights into Clustering as a tool for Unsupervised Learning.