Obesity has been a prominent topic for a long time, often referred to as the "obesity epidemic," a term widely used in US social media and acknowledged on the WHO website. Typically, discussions of obesity revolve around an individual's weight and societal standards of a healthy appearance. The dataset I have chosen includes a column classifying each subject by BMI. I won't be using this information for clustering, since unsupervised learning is about identifying hidden structure in "unlabeled" data; moreover, there is ample research questioning the reliability of BMI as a measure. I'm curious to see whether the data clusters in a way that differs from what BMI would suggest.
The primary goal is to first determine if the data clusters efficiently and stably. To achieve this, I will employ various clustering algorithms.
This dataset comes from the UC Irvine Machine Learning Repository, an online repository.
The data are partly estimated: 23% of the records were collected from real users, and the remaining 77% were generated synthetically using the SMOTE filter and the Weka tool.
Before we proceed, let’s examine how it is constructed:
library(tidyverse)
library(corrplot)
library("viridis")
# Creating a less detiled NGen -generalized variable with WHO labels
obesity$NGen <- case_when(
  obesity$NObeyesdad == "Insufficient_Weight" ~ "Insufficient",
  obesity$NObeyesdad == "Normal_Weight" ~ "Normal",
  obesity$NObeyesdad %in% c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III") ~ "Obese",
  obesity$NObeyesdad %in% c("Overweight_Level_I", "Overweight_Level_II") ~ "Overweight",
  TRUE ~ NA_character_
)
# Removing NObeyesdad as a redundant label
obesity <- select(obesity, -c(NObeyesdad))
Note: The data have been preprocessed: categorical variables were transformed into factors, and a generalized label variable was created for later comparison with the clustering output. The label is based on the WHO categories that classify people by BMI; it is a simplified version that does not distinguish between the levels of the obesity and overweight categories.
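For illustration, a minimal sketch of the kind of recoding involved (the level labels below are taken from the str() output shown later; the raw file's coding may differ, so the labels would need adjusting):
# Sketch only: recode survey answers as factors with an explicit level order
# (levels taken from the str() output below; adjust if the raw coding differs)
obesity$Gender <- factor(obesity$Gender, levels = c("F", "M"))
obesity$CAEC   <- factor(obesity$CAEC, levels = c("No", "Sometimes", "Frequently", "Always"))
obesity$CALC   <- factor(obesity$CALC, levels = c("No", "Sometimes", "Frequently", "Always"))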
colSums(is.na(obesity))
## Gender
## 0
## Age
## 0
## Height
## 0
## Weight
## 0
## family_history_with_overweight
## 0
## FAVC
## 0
## FCVC
## 0
## NCP
## 0
## CAEC
## 0
## SMOKE
## 0
## CH2O
## 0
## SCC
## 0
## FAF
## 0
## TUE
## 0
## CALC
## 0
## MTRANS
## 0
## NGen
## 0
str(obesity)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : Factor w/ 2 levels "F","M": 1 1 2 2 2 2 1 2 2 2 ...
## $ Age : int 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ family_history_with_overweight: int 1 1 1 0 0 0 1 0 1 1 ...
## $ FAVC : int 0 0 0 0 0 1 1 0 1 1 ...
## $ FCVC : Factor w/ 3 levels "rarely","sometimes",..: 2 3 2 3 2 2 3 2 3 2 ...
## $ NCP : int 3 3 3 3 1 3 3 3 3 3 ...
## $ CAEC : Factor w/ 4 levels "No","Sometimes",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ SMOKE : int 0 1 0 0 0 0 0 0 0 0 ...
## $ CH2O : Factor w/ 3 levels "Less than 1l",..: 2 3 2 2 2 2 2 2 2 2 ...
## $ SCC : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FAF : Factor w/ 4 levels "No Activity",..: 1 4 3 3 1 1 2 4 2 2 ...
## $ TUE : Factor w/ 3 levels "0-2h","2-5h",..: 2 1 2 1 1 1 1 1 2 2 ...
## $ CALC : Factor w/ 4 levels "No","Sometimes",..: 1 2 3 3 2 2 2 2 3 1 ...
## $ MTRANS : chr "Public" "Public" "Public" "Walking" ...
## $ NGen : chr "Normal" "Normal" "Normal" "Overweight" ...
summary(obesity)
## Gender Age Height Weight
## F:1043 Min. :14.00 Min. :1.450 Min. : 39.00
## M:1068 1st Qu.:20.00 1st Qu.:1.630 1st Qu.: 65.47
## Median :23.00 Median :1.700 Median : 83.00
## Mean :24.32 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.770 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC
## Min. :0.0000 Min. :0.0000
## 1st Qu.:1.0000 1st Qu.:1.0000
## Median :1.0000 Median :1.0000
## Mean :0.8176 Mean :0.8839
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000
## FCVC NCP CAEC
## rarely : 102 Min. :1.000 No : 51
## sometimes:1013 1st Qu.:3.000 Sometimes :1765
## always : 996 Median :3.000 Frequently: 242
## Mean :2.688 Always : 53
## 3rd Qu.:3.000
## Max. :4.000
## SMOKE CH2O SCC
## Min. :0.00000 Less than 1l: 485 Min. :0.00000
## 1st Qu.:0.00000 1-2l :1110 1st Qu.:0.00000
## Median :0.00000 More than 2l: 516 Median :0.00000
## Mean :0.02084 Mean :0.04548
## 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :1.00000 Max. :1.00000
## FAF TUE CALC
## No Activity:720 0-2h :952 No : 639
## 1-2 days :776 2-5h :915 Sometimes :1401
## 3-4 days :496 More than 5h:244 Frequently: 70
## 5-6 days :119 Always : 1
##
##
## MTRANS NGen
## Length:2111 Length:2111
## Class :character Class :character
## Mode :character Mode :character
##
##
##
SUMMARY:
The data was preprocessed and synthesized before it was put in the repository, resulting in no missing values. In this dataset, we have 17 variables:
Gender,
Age,
Height,
Weight,
Family history with overweight,
FAVC (Frequent habit of eating high-caloric food - fast food),
FCVC (Eating vegetables in meals),
NCP (Number of meals),
CAEC (Eating between meals),
SMOKE (Smoking),
CH2O (Drinking water),
SCC (Monitoring of calorie intake),
FAF (Physical activity),
TUE (Habit of time dedicated to technology),
CALC (Alcohol consumption),
MTRANS (Type of transportation used),
NGen (generalized WHO label, added by me).
Looking at the structure, of the 16 variables describing the data (excluding NGen, which is a label and not used for clustering), 4 are numerical, 4 are binary, and 8 are categorical (some ordinal, some nominal).
This part delves into Exploratory Data Analysis, which is crucial for understanding and summarizing the main characteristics of a dataset and is the first stop in any analysis.
obesity%>% summarise(AvgAge=mean(Age),AvgWeight=mean(Weight),AvgHeight=mean(Height), PercFemale = sum(Gender == "F") / n())
## AvgAge AvgWeight AvgHeight PercFemale
## 1 24.31596 86.58604 1.70162 0.4940786
The calculated averages provide a snapshot of the dataset, indicating the average age, weight, height, and the percentage of females. This statistical summary serves as a basis for further exploration.
# Bar plots for the factor, character and binary (integer-coded) variables; Age is numeric, so it is excluded here
var_fchar <- names(obesity)[sapply(obesity, function(x) is.factor(x) | is.character(x) | is.integer(x))]
var_fchar <- var_fchar[!(var_fchar %in% c("Age"))]
par(mfrow = c(5, 3), mar = c(2, 2, 2, 2))
for (attribute in var_fchar) {
attribute_counts <- table(obesity[[attribute]])
barplot(attribute_counts, main = attribute, col = viridis(length(attribute_counts)), cex.names=0.7)
}
# Reset the plotting layout
par(mfrow = c(1, 1))
var_num <- c("Age","Height","Weight")
par(mfrow = c(1, 3), mar = c(2, 2, 2, 2))
for (attribute in var_num) {
hist_data <- obesity[[attribute]]
hist(hist_data, main = attribute, col = viridis(10))
}
# Reset the plotting layout
par(mfrow = c(1, 1))
This part shows bar plots and histograms of each variable. As we can observe, most of the factor and character variables are not evenly distributed across their categories. The age variable is right-skewed (its mean lies above its median, with a long tail towards older ages), while height and weight follow a roughly normal distribution.
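A quick way to sanity-check the skewness claim, without extra packages, is to compare means and medians, since a right-skewed variable has its mean above its median; a minimal sketch:
# For a right-skewed variable the mean lies above the median
sapply(obesity[, c("Age", "Height", "Weight")],
       function(x) c(mean = mean(x), median = median(x)))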
obesity_matrix <- data.matrix(obesity, rownames.force = NA)
O <- cor(obesity_matrix)
corrplot(O, method = "number", number.cex = 0.5, order="hclust", tl.cex=0.5)
The correlation plot reveals interesting insights. Height and gender emerge as the most correlated variables, which aligns with our general understanding of human biology and body composition. We can also observe a negative correlation of about -0.6 between age and the chosen means of transport, implying an inverse relationship: as age increases, the preference for certain means of transport decreases, and vice versa (bearing in mind that MTRANS is represented by arbitrary numeric codes here, so the sign of the correlation depends on the coding).
This chapter, focusing on basic statistics, EDA and correlation analysis, provides a comprehensive overview of the dataset, offering insights into demographics and highlighting key relationships between variables.
In this part, we can embark on the crucial step of preparing the data for clustering analysis. Leveraging a suite of R packages mentioned below, I aim to ensure the dataset is suitable for various clustering algorithms. The process involves handling categorical variables, scaling numerical features, and evaluating the dataset’s structure to enhance the efficacy of subsequent clustering techniques.
library(cluster)
library(factoextra)
library(dendextend)
library(flexclust)
library(fpc)
library(clustertend)
library(ClusterR)
library(purrr)
library(gridExtra)
library(cowplot)
First, we dummy-code the factor variables: Gender, FCVC, CAEC, CH2O, FAF, TUE, and CALC are converted into dummy variables with model.matrix, so that the categorical data are represented in a format suitable for clustering analysis. We then drop the original categorical columns that are no longer needed, along with MTRANS, the intercept column, and the label variable NGen (NObeyesdad was already removed earlier). Finally, the numerical variables (Age, Height, Weight, NCP) are standardized with the scale function, so that variables on different scales do not disproportionately influence the clustering results.
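The preparation is summarised in the sketch below; it is my reconstruction of the steps described above (the object name clustering_data matches the object used in the rest of the analysis, but the exact column selection is an assumption):
# Sketch of the preparation described above (reconstruction, not the exact original code)
dummies  <- model.matrix(~ Gender + FCVC + CAEC + CH2O + FAF + TUE + CALC,
                         data = obesity)[, -1]                     # drop the intercept column
numerics <- scale(obesity[, c("Age", "Height", "Weight", "NCP")])  # standardize numeric variables
binaries <- obesity[, c("family_history_with_overweight", "FAVC", "SMOKE", "SCC")]
clustering_data <- data.frame(numerics, binaries, dummies)         # MTRANS and NGen are left out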
The summary statistics and correlation matrix, already provided in the Basic Statistics section, offer valuable insights into the structure of the dataset. The summary statistics give a snapshot of central tendencies and distributions for each variable; no variable shows visible outliers, and all minimum and maximum values of the numerical variables lie within the range of plausible values.
In 1954, Hopkins and Skellam introduced a statistic to test for spatial randomness of data, in other words, how well the data can cluster. In the original approach, the null hypothesis states that the dataset is uniformly distributed, and the alternative is that it is not uniformly distributed, meaning it is possible to cluster the data. The general interpretation is that we cannot reject H0 when the Hopkins statistic is at or below about 0.5 (values near 0.5 indicate randomly distributed data), while a statistic close to 1 means we can reject the null hypothesis and assume the data are not uniformly distributed.
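For reference, one standard formulation of the statistic (in the orientation where values near 1 indicate clusterable data) is

$$H = \frac{\sum_{i=1}^{m} u_i^{\,d}}{\sum_{i=1}^{m} u_i^{\,d} + \sum_{i=1}^{m} w_i^{\,d}},$$

where the $u_i$ are nearest-neighbour distances from $m$ points sampled uniformly over the data space to the real observations, the $w_i$ are nearest-neighbour distances among $m$ randomly selected real observations, and $d$ is the dimension of the data (some implementations omit the exponent $d$).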
set.seed(77)
hopkins(clustering_data, n=nrow(clustering_data)-1)
## $H
## [1] 0.1887477
res <- get_clust_tendency(clustering_data, n = nrow(clustering_data)-1, graph = FALSE)
res$hopkins_stat
## [1] 0.8062487
In R we have a few functions for computing the Hopkins statistic. Taken at face value, a value below 0.5 would suggest that the data have a low tendency to cluster. However, it's essential to note that the hopkins() function actually returns 1 minus the Hopkins statistic. The second approach I used, get_clust_tendency(), returns the Hopkins statistic directly and follows the original interpretation. Both values therefore suggest that the obesity dataset has a tendency to cluster. However, the Hopkins statistic is only an initial test, as there are many distributions that are not truly random yet do not cluster well [Wright, Kevin. "Will the Real Hopkins Statistic Please Stand Up?" The R Journal 14.3 (2022)]. Therefore, below we perform further checks to motivate the clustering analysis.
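A quick consistency check of that claim, using the two values printed above (they differ only by sampling noise, since each call draws its own random points):
# hopkins() reported 1 - H; flipping it roughly recovers the get_clust_tendency() value
1 - 0.1887477   # ~0.81, close to res$hopkins_stat = 0.8062487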
# Calinski-Harabasz & Duda-Hart
km1<-kmeans(clustering_data, 4) # stats::
round(calinhara(clustering_data, km1$cluster),digits=2) #fpc::calinhara()
## [1] 379.61
km2<-kmeans(clustering_data, 5) # stats::
round(calinhara(clustering_data, km2$cluster),digits=2) #fpc::calinhara()
## [1] 316.73
km3<-kmeans(clustering_data,4)
dudahart2(clustering_data, km3$cluster) #fpc::
## $p.value
## [1] 0
##
## $dh
## [1] 0.3344053
##
## $compare
## [1] 0.954389
##
## $cluster1
## [1] FALSE
##
## $alpha
## [1] 0.001
##
## $z
## [1] 3.090232
Above we computed two additional statistics. The first is the Calinski-Harabasz index, a clustering quality measure that evaluates the ratio of between-cluster variance to within-cluster variance, giving a higher score to well-separated and compact clusters. The second is the Duda-Hart statistic, which for the k-means algorithm has a straightforward interpretation: the null hypothesis is homogeneity of the data (no split into clusters), while the alternative is heterogeneity (a split is warranted).
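For reference, the Calinski-Harabasz index for $k$ clusters and $n$ observations is

$$CH(k) = \frac{\operatorname{tr}(B_k)/(k-1)}{\operatorname{tr}(W_k)/(n-k)},$$

where $B_k$ and $W_k$ are the between-cluster and within-cluster dispersion matrices, so larger values correspond to clusters that are compact and well separated.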
Analyzing the outputs of both Calinski-Harabasz & Duda-Hart statistics based on K-means, we can confidently state that the data exhibits clusterability. The Calinski-Harabasz statistic, where a higher value is desirable, is often employed for comparing the number of clusters. In this case, a cluster count of 4 demonstrates a higher statistic, suggesting its suitability. This will be further validated in the Optimal Number of Clusters section.
The Duda-Hart statistic reinforces that the data divide readily into clusters. Both verification criteria are met: $cluster1 is FALSE and $dh < $compare, further supporting the clustering potential identified by the Calinski-Harabasz statistic.
di <- get_dist(clustering_data, method = "euclidean")  # Euclidean distance matrix
fviz_dist(di, show_labels = FALSE)+ labs(title="obesity data")
To provide a visual confirmation of the clustering tendency, a dissimilarity plot based on the Euclidean distance is generated. Observing distinct blocks of different colors on the plot confirms the presence of a clustering structure within the obesity dataset. This visual approach aligns with the statistical assessments conducted earlier, reinforcing the conclusion that the data is indeed clusterable.
a <- fviz_nbclust(clustering_data, FUNcluster = kmeans, method = "silhouette",linecolor = "steelblue") +
labs(title= "Optimal N of clusters K-means") + theme_classic(base_size = 8)
b <- fviz_nbclust(clustering_data, FUNcluster = cluster::pam, method = "silhouette",linecolor = "darkolivegreen") +
labs(title= "Optimal N of clusters PAM") + theme_classic(base_size = 8)
c <- fviz_nbclust(clustering_data, FUNcluster = hcut, method = "silhouette",linecolor = "brown4") +
labs(title= "Optimal N of clusters Hcut agg") + theme_classic(base_size = 8)
d <- fviz_nbclust(clustering_data, FUNcluster = hcut,hc_func = "diana", method = "silhouette",linecolor = "pink4") +
labs(title= "Optimal N of clusters Hcut div") + theme_classic(base_size = 8)
plot_grid(a,b,c,d, ncol=2, align = "h")
To determine the optimal number of clusters, silhouette analysis is conducted with different clustering algorithms: K-means, PAM (Partitioning Around Medoids), and hierarchical clustering, using both the agglomerative approach and the divisive one (hcut). The charts above show the silhouette method for each algorithm. The silhouette method measures the compactness and separation of clusters by comparing, for each data point, the average distance to the other points in its own cluster with the average distance to the points in the nearest neighbouring cluster.
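Formally, for an observation $i$ with average within-cluster distance $a(i)$ and average distance $b(i)$ to the members of the nearest other cluster, the silhouette width is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1,$$

and the curves above plot the average of $s(i)$ over all observations for each candidate number of clusters.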
- K-means: The optimal number of clusters is suggested to be 4, where the silhouette score is maximized.
- PAM (Partitioning Around Medoids): The silhouette method proposes 6 clusters as the optimal choice.
- Hierarchical clustering, agglomerative (hcut): Silhouette analysis indicates that 4 clusters are optimal.
- Hierarchical clustering, divisive (hcut with diana): Silhouette analysis indicates that 3 clusters are optimal.
These findings will guide the subsequent steps in the clustering process, helping choose an appropriate number of clusters for each algorithm. There are four NGen groups, so it is interesting that the optimal number of clusters for two of the methods agrees with the number of label subgroups.
In this segment, we delve into the application of K-means clustering, a widely used unsupervised learning algorithm that partitions data points into distinct clusters based on similarity. The primary objective is to identify patterns and groupings within the obesity dataset, shedding light on potential insights and relationships among the variables.
kmp<-eclust(clustering_data, k=4, FUNcluster="kmeans", graph=FALSE)
# k-means
cluster_km <-eclust(clustering_data, "kmeans", k= 4,graph = FALSE)
#cluster_km <- kmeans(clustering_data, 4)
obesity$clusterkm<- cluster_km$cluster # check the assignment into clusters
# general characteristics of clusters
obesity%>%group_by(clusterkm)%>% summarise(AvgAge=mean(Age),AvgWeight=mean(Weight),AvgHeight=mean(Height), PercFemale = sum(Gender == "F") / n(),size=n())
## # A tibble: 4 × 6
## clusterkm AvgAge AvgWeight AvgHeight PercFemale size
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 21.9 71.8 1.64 0.617 345
## 2 2 20.4 65.0 1.68 0.523 684
## 3 3 32.5 93.9 1.66 0.696 460
## 4 4 23.9 113. 1.79 0.244 622
The dataset is clustered into four groups using the K-means algorithm. A new variable clusterkm is added to the dataset, indicating the cluster assignment for each observation.
General Characteristics of Clusters: The summary table provides average values for age, weight, height, the percentage of females, and the size of each cluster. Cluster 1: Characterized by an average age of approximately 21.9 years, moderate weight and height, with around 61.7% females. Cluster 2: Represents individuals with an average age of about 20.4 years, lower weight and slightly taller, with approximately 52.3% females. Cluster 3: Comprises individuals with an average age of around 32.5 years, higher weight and moderate height, with about 69.6% females. Cluster 4: Shows an average age of approximately 23.9 years, the highest weight and tallest height, with a lower percentage of females at around 24.4%.
We can see that the clusters differ most clearly in average body weight (ranging from about 65 kg in cluster 2 to about 113 kg in cluster 4), suggesting that the grouping is largely driven by body mass.
fviz_silhouette(kmp)
## cluster size ave.sil.width
## 1 1 345 0.19
## 2 2 684 0.14
## 3 3 460 0.15
## 4 4 622 0.22
The average silhouette width across these 4 clusters is about 0.17 for both k-means runs. Values this low (below roughly 0.25) are usually taken to indicate weak or no substantial cluster structure, which prompts us to try a different clustering method.
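The overall value can be read directly from the eclust object, or recomputed as a size-weighted mean of the per-cluster widths printed above (a minimal sketch, assuming the silinfo component that eclust attaches to its result):
kmp$silinfo$avg.width   # overall average silhouette width
weighted.mean(c(0.19, 0.14, 0.15, 0.22), w = c(345, 684, 460, 622))  # ~0.17, from the table above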
In this section, we explore the application of Partitioning Around Medoids (PAM) clustering on the obesity dataset. PAM, a robust alternative to K-means in the presence of noise and outliers, identifies medoids as representative data points and partitions the dataset into clusters based on their similarities. This analysis aims to reveal additional insights into the underlying structure of the data, complementing the findings from the K-means clustering approach.
pamp<-eclust(clustering_data, k=6, FUNcluster="pam", graph=FALSE)
# pam
cluster_pam<-eclust(clustering_data, "pam", k= 6,graph = FALSE)
obesity$clusterpam<- cluster_pam$clustering
obesity%>%group_by(clusterpam)%>% summarise(AvgAge=mean(Age),AvgWeight=mean(Weight),AvgHeight=mean(Height), PercFemale = sum(Gender == "F") / n(),size=n())
## # A tibble: 6 × 6
## clusterpam AvgAge AvgWeight AvgHeight PercFemale size
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 21.2 74.4 1.73 0.209 526
## 2 2 19.9 55.9 1.62 0.901 322
## 3 3 26.0 114. 1.80 0.105 419
## 4 4 23.0 75.3 1.65 0.597 290
## 5 5 38.6 79.1 1.65 0.732 179
## 6 6 24.8 111. 1.68 0.787 375
Identifiable characteristics of the PAM clusters: Cluster 1 (Young, predominantly male): the second-lowest average age, moderate weight, and the second-tallest average height; consists mostly of males. Cluster 2 (Youthful and lean): a markedly lower average age, the lowest weight, and a shorter stature; predominantly female. Cluster 3 (Heavy and tall): the highest weight and height, with a very low percentage of females. Cluster 4 (Moderate): an average age close to the dataset mean, moderate weight and height, with a slight majority of females. Cluster 5 (Oldest): by far the highest average age, with otherwise moderate values; mostly female. Cluster 6 (Adults with high weight): adults with a much higher average weight, predominantly female.
fviz_silhouette(pamp)
## cluster size ave.sil.width
## 1 1 526 0.09
## 2 2 322 0.08
## 3 3 419 0.14
## 4 4 290 0.17
## 5 5 179 0.18
## 6 6 375 0.27
We can observe a slight reduction in silhouette width compared to the K-means clusters, which motivates trying the next clustering algorithm. In my approach I decided not to try the CLARA algorithm. CLARA is designed for larger datasets: it draws multiple samples from the data, applies PAM to each, and returns the best clustering found. Given the moderate size of the obesity dataset (about 2,100 observations), it is practical to apply Partitioning Around Medoids (PAM) directly, without the additional computational overhead of CLARA. This keeps the clustering analysis efficient while remaining tailored to the characteristics of the data.
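For completeness, a CLARA run would look roughly like the sketch below (not executed here; the samples and sampsize values are illustrative, not tuned):
# Not run: CLARA draws `samples` subsets of size `sampsize`, applies PAM to each,
# and keeps the medoids with the lowest average dissimilarity
# cluster_clara <- cluster::clara(clustering_data, k = 6,
#                                 samples = 50, sampsize = 200, pamLike = TRUE)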
In the upcoming section, I delve into hierarchical clustering, an exploratory technique that organizes data into a tree-like structure based on similarity. Hierarchical clustering can be approached in two primary methods: agglomerative and divisive.
Agglomerative Clustering: This bottom-up approach starts with individual data points as separate clusters and progressively merges them based on similarity, forming a hierarchy of clusters.
Divisive Clustering: In contrast, divisive clustering begins with a single cluster that encompasses all data points and recursively divides it into smaller clusters, creating a hierarchical structure.
# multiple methods to assess
m <- c( "average", "single", "complete", "ward","weighted")
names(m) <- c( "average", "single", "complete", "ward","weighted")
# the coefficient
ac1 <- function(x) {
agnes(clustering_data, method = x)$ac
}
map_dbl(m, ac1)
## average single complete ward weighted
## 0.8705132 0.7940348 0.9166149 0.9886381 0.8832498
agnes(clustering_data, method = "ward",metric = "euclidean")$ac
## [1] 0.9886381
agnes(clustering_data, method = "ward",metric = "manhattan")$ac
## [1] 0.9931877
To determine the most effective agglomerative clustering method for our dataset, we assess several methods: "average", "single", "complete", "weighted", and "ward". The agglomerative coefficient is used as a metric to evaluate the quality of the clustering achieved by each method. Comparing these coefficients, the Ward method shows the highest agglomerative coefficient, so we use it in our clustering analysis as the method most suitable for revealing meaningful structures within the obesity dataset. Similarly, we can compare the two metrics available in agnes, Euclidean and Manhattan; the Manhattan metric clearly gives the higher coefficient and is the one to choose.
# cut tree into 4 groups
hc_a <- eclust(clustering_data, k=4, FUNcluster="hclust", hc_method = "ward.D2", hc_metric = "manhattan")
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use
## "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra
## package.
## Please report the issue at
## <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where
## this warning was generated.
cta <- cutree(as.hclust(hc_a), k = 4)
# plots with borders
plot(hc_a, cex=0.6, hang=-1, main = "Dendrogram Aglomerative")
rect.hclust(hc_a, k=4, border='red')
In this step, the hierarchical clustering dendrogram obtained with Ward's method is cut into four distinct groups. Cutting the tree into four clusters matches the previously determined optimal number of clusters, ensuring consistency with the overall clustering analysis. The resulting dendrogram plot visually represents the hierarchical structure of the clusters, and the red borders mark the boundaries of the four identified groups.
# observations and their groups
obesity <- obesity %>% mutate(clusterha = cta)
obesity%>%group_by(clusterha)%>% summarise(AvgAge=mean(Age),AvgWeight=mean(Weight),AvgHeight=mean(Height), PercFemale = sum(Gender == "F") / n(),size=n())
## # A tibble: 4 × 6
## clusterha AvgAge AvgWeight AvgHeight PercFemale size
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 25.2 71.3 1.61 0.826 298
## 2 2 24.8 99.7 1.71 0.572 817
## 3 3 25.2 93.4 1.76 0.0801 662
## 4 4 20.6 54.8 1.64 0.829 334
With the agglomerative approach, the gender split is far from even in most clusters: clusters 1 and 4 are predominantly female, while cluster 3 is almost entirely male. Cluster 1: mostly female, with moderate weight (about 71 kg) and the shortest average height (1.61 m). Cluster 2: the largest and heaviest group, with an average weight close to 100 kg and a fairly mixed gender composition. Cluster 3: almost exclusively male, with high weight (about 93 kg) and the tallest average height (1.76 m). Cluster 4: the youngest group (average age about 20.6 years), with the lowest weight (about 55 kg) and mostly female.
fviz_silhouette(hc_a)
## cluster size ave.sil.width
## 1 1 298 0.12
## 2 2 817 0.14
## 3 3 662 0.23
## 4 4 334 0.09
hd <- diana(clustering_data,metric = "euclidean")
hd$dc
## [1] 0.9035308
hd1 <- diana(clustering_data,metric = "manhattan")
hd1$dc
## [1] 0.9399839
In the divisive clustering analysis, dissimilarities between observations are calculated using both the Euclidean and the Manhattan metric. The divisive coefficient (DC) is used to assess the quality of the divisive clustering. The Manhattan metric performs better in capturing dissimilarities between observations, as evidenced by its higher divisive coefficient, so it is chosen for the further divisive clustering analysis.
# cut tree into 3 groups
hc_d <- eclust(clustering_data, k=3, FUNcluster="diana", hc_metric="manhattan")
ctd <- cutree(as.hclust(hc_d), k = 3)
pltree(hc_d, cex = 0.6, hang = -1, main = "Dendrogram of DIANA")
rect.hclust(hc_d, k=3, border='red')
In this step, the divisive hierarchical clustering dendrogram obtained using the DIANA method with the Manhattan metric is cut into three distinct groups.
# observations and their groups
obesity <- obesity %>% mutate(clusterhd = ctd)
obesity%>%group_by(clusterhd)%>% summarise(AvgAge=mean(Age),AvgWeight=mean(Weight),AvgHeight=mean(Height), PercFemale = sum(Gender == "F") / n(),size=n())
## # A tibble: 3 × 6
## clusterhd AvgAge AvgWeight AvgHeight PercFemale size
## <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 1 20.5 57.4 1.60 0.851 316
## 2 2 25.3 96.2 1.73 0.392 1593
## 3 3 22.5 56.6 1.66 0.738 202
Cluster 1: mostly young females with very low weight and rather short stature. Cluster 2: by far the largest cluster, with an average age close to the dataset mean, predominantly male, the highest average weight (about 96 kg), and a taller average height than the other clusters. Cluster 3: again a mostly female cluster, with a weight well below the dataset average but a taller stature than the first cluster.
fviz_silhouette(hc_d)
## cluster size ave.sil.width
## 1 1 316 0.17
## 2 2 1593 0.22
## 3 3 202 0.16
In this section, chi-square tests are conducted and contingency tables are constructed to examine the relationship between the identified clusters and the categorical variable "NGen" (generalized WHO labels) within the obesity dataset.
chisq.test(table(kmp$cluster, obesity$NGen))
##
## Pearson's Chi-squared test
##
## data: table(kmp$cluster, obesity$NGen)
## X-squared = 953.71, df = 9, p-value < 2.2e-16
chisq.test(table(pamp$cluster, obesity$NGen))
##
## Pearson's Chi-squared test
##
## data: table(pamp$cluster, obesity$NGen)
## X-squared = 1267.1, df = 15, p-value < 2.2e-16
chisq.test(table(hc_a$cluster, obesity$NGen))
##
## Pearson's Chi-squared test
##
## data: table(hc_a$cluster, obesity$NGen)
## X-squared = 975.63, df = 9, p-value < 2.2e-16
chisq.test(table(hc_d$cluster, obesity$NGen))
##
## Pearson's Chi-squared test
##
## data: table(hc_d$cluster, obesity$NGen)
## X-squared = 883.96, df = 6, p-value < 2.2e-16
The extremely low p-values obtained from the chi-squared tests for each clustering method indicate a statistically significant association between the clusters created by the respective method and the NGen variable, which represents the obesity level. This implies that the clusters are not independent of the obesity level, and there is evidence that each clustering method differentiates between groups with different obesity levels. In practical terms, the clustering methods have identified patterns or structures in the data that align with the variation in obesity levels as determined by BMI.
# Agglomerative
contingency_table5 <- tableGrob(table(hc_a$cluster, obesity$NGen), theme=ttheme_default(base_size = 11))
# Divisive
contingency_table6 <- tableGrob(table(hc_d$cluster, obesity$NGen),theme=ttheme_default(base_size = 11))
# K-means
contingency_table <- tableGrob(table(kmp$cluster, obesity$NGen),theme=ttheme_default(base_size = 11))
# PAM
contingency_table3 <- tableGrob(table(pamp$cluster, obesity$NGen),theme=ttheme_default(base_size = 10))
grid.arrange(contingency_table5,contingency_table6,contingency_table, contingency_table3, ncol=2)
The tables visually display the distribution of the generalized weight labels (insufficient, normal, overweight, obese) within each cluster for each method.
In this clustering project focused on obesity data, we explored three clustering methodologies: K-means, Partitioning Around Medoids (PAM), and Hierarchical Clustering. The analysis aimed to identify inherent patterns in the dataset related to various attributes and provide insights into potential subgroups within the population. The choice of clustering method (K-means, PAM, Hierarchical) depends on the dataset’s characteristics and the research objectives. Each method provided unique insights, contributing to a comprehensive understanding of the data.
The chi-square tests confirmed that the observed WHO label distribution within clusters differs significantly from chance, indicating that the identified subgroups are meaningful. However, before concluding from this that the BMI-based class descriptions themselves are meaningful, we need to keep the limitations of this dataset in mind. Its partly synthetic nature (a mix of real and SMOTE-generated records) may introduce biases, and clustering results are sensitive to the choice of distance metrics and preprocessing steps.
In my opinion, exploring additional clustering algorithms, incorporating dimensionality reduction, and validating clusters against external criteria could enhance the robustness of the subgroup identification.
Annotation: This RMarkdown was prepared with the help of study materials for the course "Unsupervised Learning" at WNE, University of Warsaw.