Clustering analysis is a technique that groups similar observations into a number of clusters based on the observed values of several variables for each observation.
In this project, clustering analysis will be employed to group individuals with similar conditions from the diabetes health indicators dataset into clusters, which can be valuable to the health industry, medical professionals, and nutritionists.
We will explore several clustering techniques: K-Means, Partitioning Around Medoids (PAM), Clustering Large Applications (CLARA), and Hierarchical Clustering.
The dataset was acquired from Kaggle and contains 70,692 observations and 22 variables: 21 health indicators plus a binary diabetes status variable. It can be found at https://www.kaggle.com/alexteboul/diabetes-health-indicators-dataset?select=diabetes_binary_5050split_health_indicators_BRFSS2015.csv
diabetes_data <- read.csv("diabetes_df_BRFSS2015.csv")
sprintf("The original dataset has %s observations with a total of %s features.", nrow(diabetes_data), ncol(diabetes_data))
## [1] "The original dataset has 70692 observations with a total of 22 features."
HighBP: 0 = no high blood pressure, 1 = high blood pressure
HighChol: 0 = no high cholesterol, 1 = high cholesterol
CholCheck: 0 = no cholesterol check in the past 5 years, 1 = cholesterol check in the past 5 years
BMI: Body Mass Index
MentHlth: days of poor mental health in the past 30 days (scale 1-30)
DiffWalk: difficulty walking or climbing stairs? 0 = no, 1 = yes
Age: 13-level age category (1 = 18-24, 5 = 40-44, 6 = 45-49, 8 = 55-59, 13 = 80 or older)
Income: scale 1-8 (1 = less than $10,000, 8 = $75,000 or more)
Education: scale 1-6 (1 = never attended school or only kindergarten, 2 = elementary)
Descriptions of the remaining variables can be found in the BRFSS 2015 codebook: https://www.cdc.gov/brfss/annual_data/2015/pdf/codebook15_llcp.pdf
# Randomly sample 3,535 observations (about 5% of the data) without replacement;
# calling set.seed() beforehand would make the sample reproducible
diabetes_df <- diabetes_data[sample(1:nrow(diabetes_data), 3535, replace=FALSE), ]
sprintf("The new dataset has %s observations with a total of %s features.", nrow(diabetes_df), ncol(diabetes_df))
## [1] "The new dataset has 3535 observations with a total of 22 features."
library(corrplot)
library(stats)
library(factoextra)
library(ggplot2)
library(gridExtra)
library(ggplotify)
library(cluster)
library(flexclust)
library(fpc)
library(ClusterR)
library(ggpubr)
library(dplyr)
library(knitr)
str(diabetes_df)
## 'data.frame': 3535 obs. of 22 variables:
## $ Diabetes_binary : num 1 1 1 0 0 0 0 0 1 0 ...
## $ HighBP : num 1 1 0 0 0 0 1 1 1 0 ...
## $ HighChol : num 0 1 0 0 0 0 0 1 1 1 ...
## $ CholCheck : num 1 1 1 1 1 1 1 1 1 1 ...
## $ BMI : num 24 22 20 29 26 18 20 40 27 33 ...
## $ Smoker : num 1 0 1 0 0 0 0 0 0 0 ...
## $ Stroke : num 0 0 0 0 0 0 0 1 0 0 ...
## $ HeartDiseaseorAttack: num 0 1 0 0 0 0 0 0 0 0 ...
## $ PhysActivity : num 1 0 1 1 1 1 1 0 1 1 ...
## $ Fruits : num 1 1 1 1 0 1 1 0 1 0 ...
## $ Veggies : num 1 1 1 1 0 1 1 0 1 1 ...
## $ HvyAlcoholConsump : num 0 0 0 0 0 0 0 0 0 0 ...
## $ AnyHealthcare : num 1 0 1 1 1 1 1 1 1 1 ...
## $ NoDocbcCost : num 0 1 0 0 1 1 0 0 0 0 ...
## $ GenHlth : num 3 1 4 2 1 3 2 5 3 5 ...
## $ MentHlth : num 0 30 0 10 0 0 0 30 0 30 ...
## $ PhysHlth : num 0 6 30 30 0 0 0 30 1 30 ...
## $ DiffWalk : num 0 1 0 1 0 0 0 1 0 0 ...
## $ Sex : num 0 1 0 0 1 0 0 1 0 0 ...
## $ Age : num 8 7 9 11 4 1 13 7 9 12 ...
## $ Education : num 5 4 5 6 6 5 5 4 4 2 ...
## $ Income : num 1 1 1 7 7 1 6 1 8 1 ...
sprintf("The dataset contain %s missing values .",sum(is.na(diabetes_df)))
## [1] "The dataset contain 0 missing values ."
Because the features are on different scales, data normalization keeps features with larger values from having a disproportionately large impact on the analysis outcome.
# Standardize every feature (z-scores), excluding the Diabetes_binary target in column 1
diabetes_df_norm <- as.data.frame(lapply(diabetes_df[,2:ncol(diabetes_df)], scale))
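As a quick sanity check (a minimal sketch), every standardized column should now have a mean of approximately 0 and a standard deviation of 1:

# Verify the z-score standardization: means ~0, standard deviations = 1
round(colMeans(diabetes_df_norm), 3)
round(apply(diabetes_df_norm, 2, sd), 3)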
corr_matrix<- cor(diabetes_df_norm)
corrplot(corr_matrix, type="lower", method = "square")
The correlation matrix shows some fairly strong positive correlations (among GenHlth, PhysHlth, and DiffWalk, and between Education and Income) and a moderately strong negative correlation between Income and GenHlth.
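To make these observations concrete, here is a small sketch that lists every variable pair whose absolute correlation exceeds an arbitrary 0.4 threshold:

# List variable pairs with |correlation| above an arbitrary 0.4 threshold
high <- which(abs(corr_matrix) > 0.4 & upper.tri(corr_matrix), arr.ind = TRUE)
data.frame(var1 = rownames(corr_matrix)[high[, 1]],
           var2 = colnames(corr_matrix)[high[, 2]],
           corr = round(corr_matrix[high], 2))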
# Flag outliers in each feature using the 1.5 * IQR rule
vars <- colnames(diabetes_df_norm)
outlier <- c()
for (i in vars) {
  upper <- quantile(diabetes_df_norm[, i], 0.75) + 1.5 * IQR(diabetes_df_norm[, i])
  lower <- quantile(diabetes_df_norm[, i], 0.25) - 1.5 * IQR(diabetes_df_norm[, i])
  idx <- which(diabetes_df_norm[, i] < lower | diabetes_df_norm[, i] > upper)
  print(paste(i, length(idx), sep = ' '))
  outlier <- c(outlier, idx)
}
## [1] "HighBP 0"
## [1] "HighChol 0"
## [1] "CholCheck 90"
## [1] "BMI 107"
## [1] "Smoker 0"
## [1] "Stroke 212"
## [1] "HeartDiseaseorAttack 514"
## [1] "PhysActivity 0"
## [1] "Fruits 0"
## [1] "Veggies 786"
## [1] "HvyAlcoholConsump 156"
## [1] "AnyHealthcare 155"
## [1] "NoDocbcCost 345"
## [1] "GenHlth 0"
## [1] "MentHlth 580"
## [1] "PhysHlth 625"
## [1] "DiffWalk 849"
## [1] "Sex 0"
## [1] "Age 45"
## [1] "Education 0"
## [1] "Income 0"
There are quite a large number of outliers present in DiffWalk, Veggies, PhysHlth, MentHlth, and HeartDiseaseorAttack.
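To get a rough sense of the overall outlier burden, a small sketch using the outlier vector collected above counts the share of distinct observations that fall outside the fences in at least one feature:

# Share of observations flagged as an outlier in at least one feature
length(unique(outlier)) / nrow(diabetes_df_norm)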
Let’s run the Hopkins statistic to assess the clusterability of the data using get_clust_tendency. A Hopkins statistic near 0.5 is consistent with the null hypothesis that the data are uniformly distributed; a value close to 1 lets us reject the null hypothesis and conclude that the dataset is significantly clusterable.
diabetes_hstat <- get_clust_tendency(diabetes_df_norm, n = 20, graph = FALSE)
diabetes_hstat$hopkins_stat
## [1] 0.7588758
A value of about 0.76 is above 0.5 but not close to 1, suggesting only a moderate cluster tendency. We will proceed with clustering analysis to see whether any meaningful groupings can be found.
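As a reference point (a minimal sketch, not part of the original analysis), the Hopkins statistic computed on uniform random noise of the same dimensions should land near 0.5:

# Hopkins statistic on uniform noise of the same shape: expect ~0.5
set.seed(42)  # arbitrary seed, for reproducibility of this sketch
noise <- as.data.frame(matrix(runif(nrow(diabetes_df_norm) * ncol(diabetes_df_norm)),
                              nrow = nrow(diabetes_df_norm)))
get_clust_tendency(noise, n = 20, graph = FALSE)$hopkins_stat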
Let's use the average silhouette method to find the optimal number of clusters for K-Means, PAM, CLARA, and Hierarchical Clustering.
k <- fviz_nbclust(diabetes_df_norm, FUNcluster = kmeans, method = "silhouette") + ggtitle("K-Means")
h <- fviz_nbclust(diabetes_df_norm, FUNcluster = hcut, method = "silhouette") + ggtitle("Hierarchical")
grid.arrange(k, h, ncol=2)
p <- fviz_nbclust(diabetes_df_norm, FUNcluster = cluster::pam, method = "silhouette") + ggtitle("PAM")
c <- fviz_nbclust(diabetes_df_norm, FUNcluster = cluster::clara, method = "silhouette") + ggtitle("CLARA")
grid.arrange(p, c, ncol=2)
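Under the hood, method = "silhouette" computes the average silhouette width for each candidate k. Here is a minimal sketch of that computation for k-means (avg_sil is a hypothetical helper, not from the original analysis):

# Average silhouette width per candidate k: what fviz_nbclust plots
d <- dist(diabetes_df_norm)
avg_sil <- function(k) {
  km <- kmeans(diabetes_df_norm, centers = k, nstart = 10)
  mean(silhouette(km$cluster, d)[, "sil_width"])
}
sapply(2:6, avg_sil)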
K-means is a centroid-based clustering algorithm, in which we calculate the distance between each data point and a centroid to assign the point to a cluster (Sharma, 2021). Its goal is to minimize the differences within each cluster and maximize the differences between clusters. It is a partitional clustering algorithm, and it is sensitive to outliers.
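To make that objective concrete, here is a minimal sketch (not part of the original analysis) of the quantity k-means minimizes: the total within-cluster sum of squared distances to each cluster's centroid.

# Total within-cluster sum of squares: the quantity k-means minimizes
wss <- function(data, clusters) {
  sum(sapply(unique(clusters), function(g) {
    pts <- as.matrix(data[clusters == g, , drop = FALSE])
    sum(sweep(pts, 2, colMeans(pts))^2)   # squared deviations from the centroid
  }))
}

Applied to a fitted model, wss(diabetes_df_norm, fit$cluster) should agree with fit$tot.withinss up to floating-point error.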
# Using the elbow method, let's confirm the optimal number of clusters to be used for K-Means clustering analysis.
Optimal_Clusters_KMeans(diabetes_df_norm, max_clusters=10, plot_clusters = TRUE)
## [1] 1.0000000 0.9495487 0.8331614 0.7927930 0.7565445 0.7170454 0.7157230
## [8] 0.6943536 0.6723967 0.6337929
# Fit k-means with k = 2 and k = 3 (nstart > 1 would give more stable solutions)
diabetes_df_km2 <- kmeans(diabetes_df_norm, 2)
diabetes_df_km3 <- kmeans(diabetes_df_norm, 3)
clus_2 <- fviz_cluster(list(data=diabetes_df_norm, cluster=diabetes_df_km2$cluster),
                       ellipse.type="norm", geom="point", stand=FALSE, palette="jco",
                       ggtheme=theme_classic()) + stat_mean(aes(color = cluster), size = 6)
df_sil2 <- silhouette(diabetes_df_km2$cluster, dist(diabetes_df_norm))
sil_2 <- fviz_silhouette(df_sil2)
## cluster size ave.sil.width
## 1 1 1087 0.02
## 2 2 2448 0.23
grid.arrange(clus_2, sil_2, ncol=2)
clus_3 <- fviz_cluster(list(data=diabetes_df_norm, cluster=diabetes_df_km3$cluster),
                       ellipse.type="norm", geom="point", stand=FALSE, palette="jco",
                       ggtheme=theme_classic()) + stat_mean(aes(color = cluster), size = 6)
df_sil3 <- silhouette(diabetes_df_km3$cluster, dist(diabetes_df_norm))
sil_3 <- fviz_silhouette(df_sil3)
## cluster size ave.sil.width
## 1 1 1273 0.11
## 2 2 1481 0.09
## 3 3 781 0.01
grid.arrange(clus_3, sil_3, ncol=2)
# Stripes plots: the distance from each observation to the cluster centroids,
# giving a visual sense of how well separated the clusters are
stat_km2 <- cclust(diabetes_df_norm, 2, dist="euclidean")
stripes(stat_km2)
stat_km3 <- cclust(diabetes_df_norm, 3, dist="euclidean")
stripes(stat_km3)
PAM (Partitioning Around Medoids) is more robust than K-Means in the presence of noise and outliers, because a medoid, an actual observation at the center of a cluster, is less influenced by outliers or other extreme values.
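A toy illustration (not from the dataset) of why a medoid resists outliers while a mean does not:

# One extreme value drags the mean far from the bulk of the data,
# but the medoid (the point minimizing total distance to the others) stays put
x <- c(1, 2, 3, 4, 100)
mean(x)                                                # 22
x[which.min(sapply(x, function(m) sum(abs(x - m))))]   # 3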
diabetes_df_pam2 <- pam(diabetes_df_norm, 2)
diabetes_df_pam3 <- pam(diabetes_df_norm, 3)
clus_2_pam <- fviz_cluster(diabetes_df_pam2, geom = "point", ellipse.type = "convex",stand=FALSE, palette="jco", ggtheme=theme_classic())
pam_sil2 <- fviz_silhouette(silhouette(diabetes_df_pam2))
## cluster size ave.sil.width
## 1 1 1808 0.12
## 2 2 1727 0.03
grid.arrange(clus_2_pam, pam_sil2, ncol=2)
clus_3_pam <- fviz_cluster(diabetes_df_pam3, geom = "point", ellipse.type = "convex",stand=FALSE, palette="jco", ggtheme=theme_classic())
pam_sil3 <- fviz_silhouette(silhouette(diabetes_df_pam3))
## cluster size ave.sil.width
## 1 1 1428 0.08
## 2 2 1268 0.08
## 3 3 839 0.01
grid.arrange(clus_3_pam, pam_sil3, ncol=2)
CLARA (Clustering Large Applications) extends PAM to larger datasets: it draws several samples, runs PAM on each, and keeps the best resulting set of medoids.
diabetes_df_clara2 <- eclust(diabetes_df_norm, "clara", k=2)
diabetes_df_clara3 <- eclust(diabetes_df_norm, "clara", k=3)
clus_2_cla <- fviz_cluster(diabetes_df_clara2, palette=c("#00AFBB", "goldenrod4", "#E7B800"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic())
cla_sil2 <- fviz_silhouette(diabetes_df_clara2)
## cluster size ave.sil.width
## 1 1 1984 0.01
## 2 2 1551 0.18
grid.arrange(clus_2_cla, cla_sil2, ncol=2)
clus_3_cla <- fviz_cluster(diabetes_df_clara3, palette=c("#00AFBB", "goldenrod4", "#E7B800"), ellipse.type="t", geom="point", pointsize=1, ggtheme=theme_classic())
cla_sil3 <- fviz_silhouette(diabetes_df_clara3)
## cluster size ave.sil.width
## 1 1 1428 0.00
## 2 2 1040 0.00
## 3 3 1067 0.13
grid.arrange(clus_3_cla, cla_sil3, ncol=2)
Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree, which can be visualized as a dendrogram.
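As a reminder of what "complete linkage" (used below) means, here is a toy sketch: the distance between two clusters is the largest pairwise distance between their members.

# Complete-linkage distance between two toy clusters A and B
A <- matrix(c(0, 0, 1, 1), ncol = 2, byrow = TRUE)
B <- matrix(c(4, 4, 5, 5), ncol = 2, byrow = TRUE)
max(as.matrix(dist(rbind(A, B)))[1:2, 3:4])   # distance from (0,0) to (5,5)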
# Using the gap statistic, let's confirm the optimal number of clusters to be used for hierarchical clustering analysis.
# Gap statistic for hierarchical clustering (hcut) on the normalized data
gap_stat <- clusGap(diabetes_df_norm, FUN = hcut, K.max = 5, B = 50)
fviz_gap_stat(gap_stat)
diabetes_df_matrix <- dist(diabetes_df_norm, method = "euclidean")
diabetes_df_hc <- hclust(diabetes_df_matrix, method = "complete")
plot(diabetes_df_hc, cex = 0.6, hang = -1, main = "Dendrogram with Complete Linkage")
rect.hclust(diabetes_df_hc, k=2, border='red')
diabetes_df_c2 <- cutree(diabetes_df_hc, k = 2)
table(diabetes_df_c2)
## diabetes_df_c2
## 1 2
## 3520 15
diabetes_df_c3 <- cutree(diabetes_df_hc, k = 3)
table(diabetes_df_c3)
## diabetes_df_c3
## 1 2 3
## 3433 87 15
Complete linkage produces highly unbalanced clusters here (one cluster contains only 15 observations), so let's also try Ward linkage, which tends to produce more compact, balanced groups.
diabetes_df_hc_w <- hclust(diabetes_df_matrix, method = "ward.D2")
plot(diabetes_df_hc_w, cex = 0.6, hang = -1, main = "Dendrogram with Ward Linkage")
rect.hclust(diabetes_df_hc_w, k=2, border='red')
diabetes_df_w2 <- cutree(diabetes_df_hc_w, k = 2)
table(diabetes_df_w2)
## diabetes_df_w2
## 1 2
## 2014 1521
diabetes_df_w3 <- cutree(diabetes_df_hc_w, k = 3)
table(diabetes_df_w3)
## diabetes_df_w3
## 1 2 3
## 2014 453 1068
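The Ward cuts are far more balanced. A quick cross-tabulation (a small sketch, not in the original analysis) shows how the complete- and Ward-linkage two-cluster solutions relate:

# Cross-tabulate the complete-linkage vs Ward-linkage 2-cluster assignments
table(complete = diabetes_df_c2, ward = diabetes_df_w2)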
The agglomerative coefficient measures the strength of the clustering structure found; a value closer to 1 suggests a strong cluster structure.
diabetes_hc <- agnes(diabetes_df_norm, method = "complete" )
diabetes_hc$ac
## [1] 0.8589288
diabetes_hc_w <- agnes(diabetes_df_norm, method = "ward" )
diabetes_hc_w$ac
## [1] 0.9813054
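For completeness, a quick sketch (not in the original analysis) comparing agglomerative coefficients across several linkage methods supported by agnes:

# Agglomerative coefficient by linkage method; higher = stronger structure
sapply(c("average", "single", "complete", "ward"),
       function(m) agnes(diabetes_df_norm, method = m)$ac)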
The main goal of this article was to run clustering analysis on the diabetes health indicators dataset and observe what groupings could be found. Two clusters of similar observations were identified, possibly representing people who have diabetes and those who do not. The clusters were not well separated and overlapped somewhat, which is consistent with the dataset's only moderate cluster tendency (Hopkins statistic of about 0.76). Given the outliers identified earlier, it would also be best to stick with clustering algorithms that are less susceptible to outliers, such as Ward-linkage hierarchical clustering, PAM, and CLARA.
I hope you found this article interesting and gained some new insights into clustering as a tool for unsupervised learning.
https://stats.stackexchange.com/questions/332651/validating-cluster-tendency-using-hopkins-statistic
https://www.sciencedirect.com/topics/medicine-and-dentistry/cluster-analysis
Natasha Sharma (2021) https://neptune.ai/blog/k-means-clustering