The aim of this paper is to analyze obesity levels based on the eating habits and physical condition of the surveyed individuals with the use of clustering methods. The data are also analyzed with dimension reduction techniques.
The data set comes from 2019 and presents the obesity levels of individuals from Mexico, Peru and Colombia. It contains 17 attributes and 2111 records and was downloaded from Kaggle.
# Loading necessary packages
library(readr)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(psych)
library(corrplot)
library(cluster)
library(factoextra)
library(pdp)
library(gridExtra)
library(clustertend)
library(flexclust)
library(mclust)
The data set contains 17 attributes. Basic personal information: Gender, Age, Height, Weight. Attributes connected with eating habits:
family_history_with_overweight - family history of overweight (Yes/No),
FAVC - frequent consumption of high caloric food (Yes/No),
FCVC - frequency of consumption of vegetables (Never/Sometimes/Always),
NCP - number of main meals,
CAEC - consumption of food between meals (No/Sometimes/Frequently/Always),
SMOKE - consumption of cigarettes (Yes/No),
CH2O - daily consumption of water (Less than a liter/Between 1-2L/More),
CALC - consumption of alcohol (I do not drink/Sometimes/Frequently/Always),
and attributes related to physical condition:
SCC - calories consumption monitoring (Yes/No),
FAF - physical activity frequency (I do not have/1 or 2 days/2 or 4 days/4 or 5 days),
TUE - time using technology devices (0–2 hours/3–5 hours/More than 5 hours),
MTRANS - transportation used (Automobile/Motorbike/Bike/Public Transportation/Walking).
Finally, the variable NObeyesdad was created with the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III, based on the BMI formula weight/height^2 and on information from the WHO and Mexican Normativity [1].
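To make this derivation concrete, below is a minimal sketch of how such a classification could be computed from the BMI formula. The 18.5/25/30/35/40 cut-offs are the standard WHO values; the 27.5 split between the two overweight levels and the helper name bmi_class are illustrative assumptions, since the exact thresholds combine WHO and Mexican Normativity.
# Hypothetical helper: classify BMI = weight/height^2 into the seven levels
# (the 27.5 overweight split is assumed, not taken from the source paper)
bmi_class <- function(weight, height) {
  bmi <- weight / height^2
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 27.5, 30, 35, 40, Inf),
      labels = c("Insufficient_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Overweight_Level_II",
                 "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"))
}
bmi_class(90, 1.70)  # BMI ~ 31.1 -> Obesity_Type_I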
The first step is to load the data set and prepare it for further analysis.
# Loading dataset
ObesityDataSet <- read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")
ObesityDataSet_num <- ObesityDataSet
ObesityDataSet_num$CAEC <- ifelse(ObesityDataSet$CAEC == "no", 0,
ifelse(ObesityDataSet$CAEC == "Sometimes", 1,
ifelse(ObesityDataSet$CAEC == "Frequently", 2, 3)))
ObesityDataSet_num$CALC <- ifelse(ObesityDataSet$CALC == "no", 0,
ifelse(ObesityDataSet$CALC == "Sometimes", 1,
ifelse(ObesityDataSet$CALC == "Frequently", 2, 3)))
# Note: the level in the raw data is "Public_Transportation"; a partial string
# such as "Public_Trans" never matches and would miscode it together with Walking
ObesityDataSet_num$MTRANS <- ifelse(ObesityDataSet$MTRANS == "Automobile", 1,
ifelse(ObesityDataSet$MTRANS == "Bike", 2,
ifelse(ObesityDataSet$MTRANS == "Motorbike", 3,
ifelse(ObesityDataSet$MTRANS == "Public_Transportation", 4, 5))))
ObesityDataSet_num$NObeyesdad <- ifelse(ObesityDataSet$NObeyesdad == "Insufficient_Weight", 1,
ifelse(ObesityDataSet$NObeyesdad == "Normal_Weight", 2,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_I", 3,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_II", 4,
ifelse(ObesityDataSet$NObeyesdad == "Obesity_Type_I", 5,
ifelse(ObesityDataSet$NObeyesdad == "Obesity_Type_II", 6,7
))))))
ObesityDataSet_num$Gender <- ifelse(ObesityDataSet$Gender == "Male", 0, 1)
ObesityDataSet_num$family_history_with_overweight <- ifelse(ObesityDataSet$family_history_with_overweight == "no", 0, 1)
ObesityDataSet_num$FAVC <- ifelse(ObesityDataSet$FAVC == "no", 0, 1)
ObesityDataSet_num$SMOKE <- ifelse(ObesityDataSet$SMOKE == "no", 0, 1)
ObesityDataSet_num$SCC <- ifelse(ObesityDataSet$SCC == "no", 0, 1)
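The nested ifelse() chains above can also be written more compactly with named lookup vectors; a small sketch that reproduces the CAEC recoding and verifies that both approaches agree:
# Lookup-vector alternative to the nested ifelse() chain for CAEC
caec_map <- c(no = 0, Sometimes = 1, Frequently = 2, Always = 3)
stopifnot(identical(unname(caec_map[ObesityDataSet$CAEC]),
                    as.numeric(ObesityDataSet_num$CAEC)))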
# Summary of dataset
summary(ObesityDataSet_num)
## Gender Age Height Weight
## Min. :0.0000 Min. :14.00 Min. :1.450 Min. : 39.00
## 1st Qu.:0.0000 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Median :0.0000 Median :22.78 Median :1.700 Median : 83.00
## Mean :0.4941 Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :1.0000 Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:2.000
## Median :1.0000 Median :1.0000 Median :2.386
## Mean :0.8176 Mean :0.8839 Mean :2.419
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.0000 Max. :1.0000 Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Min. :0.000 Min. :0.00000 Min. :1.000
## 1st Qu.:2.659 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:1.585
## Median :3.000 Median :1.000 Median :0.00000 Median :2.000
## Mean :2.686 Mean :1.141 Mean :0.02084 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000 Max. :1.00000 Max. :3.000
## SCC FAF TUE CALC
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.1245 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000 Median :0.6253 Median :1.0000
## Mean :0.04548 Mean :1.0103 Mean :0.6579 Mean :0.7314
## 3rd Qu.:0.00000 3rd Qu.:1.6667 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :3.0000 Max. :2.0000 Max. :3.0000
## MTRANS NObeyesdad
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:2.000
## Median :4.000 Median :4.000
## Mean :3.365 Mean :4.112
## 3rd Qu.:4.000 3rd Qu.:6.000
## Max. :5.000 Max. :7.000
# The synthetic records contain non-integer values for the ordinal variables,
# so they are rounded before being converted to labelled factors
ObesityDataSet <- ObesityDataSet %>%
mutate_at(c(7,8,11,13,14), round)
ObesityDataSet$FCVC <- factor(ObesityDataSet$FCVC,
levels = c(1,2,3),
labels = c("Never", "Sometimes", "Always"))
ObesityDataSet$CH2O <- factor(ObesityDataSet$CH2O,
levels = c(1,2,3),
labels = c("Less than a liter", "Between 1-2L", "More"))
ObesityDataSet$FAF <- factor(ObesityDataSet$FAF,
levels = c(0,1,2,3),
labels = c("I do not have", "1 or 2 days", "2 or 4 days", "4 or 5 days"))
ObesityDataSet$TUE <- factor(ObesityDataSet$TUE,
levels = c(0,1,2),
labels = c("0–2 hours", "3–5 hours", "More than 5 hours"))
ObesityDataSet <- ObesityDataSet %>%
mutate_at(c(1,5,6,8,9,10,12,15,16,17), as.factor)
# Summary of dataset
summary(ObesityDataSet)
## Gender Age Height Weight
## Female:1043 Min. :14.00 Min. :1.450 Min. : 39.00
## Male :1068 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
##
## family_history_with_overweight FAVC FCVC NCP
## no : 385 no : 245 Never : 102 1: 316
## yes:1726 yes:1866 Sometimes:1013 2: 176
## Always : 996 3:1470
## 4: 149
##
##
##
## CAEC SMOKE CH2O SCC
## Always : 53 no :2067 Less than a liter: 485 no :2015
## Frequently: 242 yes: 44 Between 1-2L :1110 yes: 96
## no : 51 More : 516
## Sometimes :1765
##
##
##
## FAF TUE CALC
## I do not have:720 0–2 hours :952 Always : 1
## 1 or 2 days :776 3–5 hours :915 Frequently: 70
## 2 or 4 days :496 More than 5 hours:244 no : 639
## 4 or 5 days :119 Sometimes :1401
##
##
##
## MTRANS NObeyesdad
## Automobile : 457 Insufficient_Weight:272
## Bike : 7 Normal_Weight :287
## Motorbike : 11 Obesity_Type_I :351
## Public_Transportation:1580 Obesity_Type_II :297
## Walking : 56 Obesity_Type_III :324
## Overweight_Level_I :290
## Overweight_Level_II:290
For the needs of this analysis I created two versions of the data set: one with only numeric variables (normalized later for the PCA and clustering) and one with categorical variables.
The distributions of all variables from the data set are presented below.
ObesityDataSetPlot_cat <- melt(ObesityDataSet[,c(1,5:16)], id.vars = NULL)
ObesityDataSetPlot_num <- melt(ObesityDataSet[,c(2:4)], id.vars = NULL)
ggplot(data = ObesityDataSetPlot_cat,aes(x = value)) +
facet_wrap(~variable, scales = "free", ncol = 3) +
geom_bar() +
theme(axis.text = element_text(size=10)) +
theme(axis.title = element_text(size = 12))
ggplot(data = ObesityDataSet,aes(x = NObeyesdad)) +
geom_bar() +
theme(axis.text = element_text(size=20)) +
theme(axis.title = element_text(size = 25))
ggplot(data = ObesityDataSetPlot_num,aes(x = value)) +
facet_wrap(~variable, scales = "free", ncol = 3) +
geom_histogram()
There are 3 continuous variables: Age, Height and Weight; the rest are categorical.
Looking at the distribution and summary of the data, we may notice some basic information about the respondents. The data set is balanced with regard to gender, roughly 50:50 between females and males. Respondents' ages start at 14 and the oldest person is 61, but most of them are rather young: 75% are no older than 26. Height is close to normally distributed, ranging from 1.45 m to 1.98 m with an average of 1.70 m. Weight covers a broad range, from 39 kg to 173 kg, with a mean of 87 kg. Most respondents (82%) report a family history of overweight. The data set is also balanced with regard to BMI level (the variable NObeyesdad), with Obesity Type I being the most frequent group.
Before the clustering analysis I would like to apply dimension reduction in order to choose the most relevant variables. The goal is to simplify the data without losing critical information.
The data set contains 14 categorical variables out of 17, so PCA may not be the most efficient dimension reduction method, as it is better suited to continuous variables. MDS could be more suitable for categorical data; however, it is mainly used to visualize similarities or dissimilarities between objects. Given that my main goal is to simplify the data without losing critical information, I choose PCA.
First I compute the PCA and try to find the optimal number of variables. PCA requires numeric variables, so I dropped the variable MTRANS, whose codes cannot be meaningfully interpreted as numeric, and then scaled the data set so that each variable has mean 0 and unit variance.
I calculated correlations using the Spearman method, as it is more appropriate for ordinal data which is not normally distributed.
ObesityDataSet_norm <- scale(ObesityDataSet_num[,-16])
cor_w <- cor(ObesityDataSet_norm, method="spearman")
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
The highest correlations are observed between Height and Gender, Weight and Height, Weight and BMI level (NObeyesdad), and family_history_with_overweight and Weight. For this analysis the most interesting is the relationship between family_history_with_overweight and Weight, as it may suggest an influence of family history on the obesity problem. The remaining strong correlations are rather expected.
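The same conclusions can be checked numerically instead of being read off the plot; a small sketch listing the strongest pairwise correlations from the cor_w matrix computed above:
# Extract the top correlated pairs (upper triangle only, sorted by |rho|)
cor_pairs <- melt(cor_w, varnames = c("var1", "var2"), value.name = "rho")
cor_pairs <- subset(cor_pairs, as.integer(var1) < as.integer(var2))
head(cor_pairs[order(-abs(cor_pairs$rho)), ], 8)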
Now it is time to compute the PCA. I remove the variable NObeyesdad, as it is the dependent variable in this analysis, and Weight, as it directly determines the BMI level.
The first plot shows the percentage of variance explained by each principal component.
In PCA the Kaiser criterion drops the components whose eigenvalues are less than 1 (when the data are standardized). An eigenvalue greater than 1 indicates that the corresponding component explains more variance than a single variable, given that each standardized variable accounts for one unit of variance [2].
ObesityDataSet_norm <- scale(ObesityDataSet_num[,c(-4,-16,-17)])
res.pca <- prcomp(ObesityDataSet_norm)
fviz_eig(res.pca, addlabels = TRUE, ncp = 15, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4")
fviz_eig(res.pca, choice = "eigenvalue", ncp = 15, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")
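The Kaiser criterion can also be applied numerically rather than read from the plots; a minimal sketch using the res.pca object defined above:
# For prcomp on standardized data the eigenvalues are the squared component sds
eigenvalues <- res.pca$sdev^2
sum(eigenvalues > 1)                     # components retained by the Kaiser rule
cumsum(eigenvalues) / sum(eigenvalues)   # cumulative share of explained variance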
According to the Kaiser criterion there are 7 components with eigenvalues above 1. By reducing the dimensions to 7 we can explain 67.5% of the variance. Let's check which variables contribute most to these components with the use of another visualization.
The chosen variables are the following: Height, Gender, CALC, Age, FCVC, TUE, CH2O.
pca_var <- get_pca_var(res.pca)
fviz_contrib(res.pca, "var", axes = 1:7, fill = "hotpink3", color = "brown4")
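The ranking of variables can also be extracted numerically from pca_var; a sketch using a simple unweighted average of contributions across the first seven components (fviz_contrib weights the dimensions by their eigenvalues, so its ordering may differ slightly):
# pca_var$contrib is a variables-by-dimensions matrix of contributions in %
avg_contrib <- rowMeans(pca_var$contrib[, 1:7])
sort(avg_contrib, decreasing = TRUE)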
The next plot, a graph of variables, shows correlations and contributions. Positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides. Additionally, the gradient color represents the contribution of each variable to the principal components [3].
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The graph below shows individuals grouped by BMI level. It is hard to distinguish the groups based on the individuals' profiles; only Obesity Type II appears more structured.
groups <- ObesityDataSet$NObeyesdad
fviz_pca_ind(res.pca,
col.ind = groups, # color by groups
palette = c("lightblue","pink","#00AFBB", "#E7B800", "#FC4E07", "darkred", "lightgreen"),
addEllipses = TRUE, # Concentration ellipses
ellipse.type = "confidence",
legend.title = "Groups",
repel = TRUE
)
Now I would like to analyze the data using clustering algorithms. Cluster analysis is a technique that aims to group similar data points together into clusters or subgroups. The main goal of clustering is to identify patterns in the data without prior knowledge of group memberships. The basic idea is to measure the similarity or dissimilarity between each pair of data points and then group the points into clusters.
Based on the principal component analysis conducted above, I created a new data set with the 7 variables that had the highest contributions. They are the following: Gender, Height, Age, FCVC (frequency of consumption of vegetables), CH2O (daily consumption of water), CALC (consumption of alcohol) and TUE (time using technology devices).
ObesityDataSet_norm_7 <- ObesityDataSet_norm[,c("Height","Gender","CH2O", "CALC", "FCVC", "Age", "TUE")]
The hopkins function is used to estimate the clustering tendency of a data set. It compares the distances between real data points and their nearest neighbours with the distances between uniformly generated random points and their nearest real neighbours. The Hopkins statistic ranges between 0 and 1, with values closer to 1 indicating a higher degree of clustering tendency in the data set.
A Hopkins value of 0.1-0.2 indicates that the data set has a low degree of clustering tendency; in other words, the data points are distributed rather uniformly instead of being clustered together. This suggests that applying clustering algorithms to this data set may not be the best approach.
paste0("Hopkins statistic for the limited data set: ", clustertend::hopkins(ObesityDataSet_norm_7))
## [1] "Hopkins statistic for the limited data set: 0.157632108218407"
paste0("Hopkins statistic for the whole data set: ", clustertend::hopkins(ObesityDataSet_norm))
## [1] "Hopkins statistic for the whole data set: 0.182092541793173"
Comparing the silhouette widths and total within-cluster sums of squares below, the clustering quality for the whole data set is lower than for the chosen 7 variables. Therefore I use the limited data set in the further analysis.
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = kmeans, method = "silhouette") +
theme_classic() + ggtitle("Kmeans - Whole Data Set") + ylab("Silhouette Width")
pam_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = cluster::pam, method = "silhouette") +
theme_classic() + ggtitle("Pam - Whole Data Set") + ylab("Silhouette Width")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = hcut, method = "silhouette") +
theme_classic() + ggtitle("Hierarchical - Whole Data Set") + ylab("Silhouette Width")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = kmeans, method = "wss") +
theme_classic() + ggtitle("Kmeans - Whole Data Set")
pam_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = cluster::pam, method = "wss") +
theme_classic() + ggtitle("Pam - Whole Data Set")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = hcut, method = "wss") +
theme_classic() + ggtitle("Hierarchical - Whole Data Set")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = kmeans, method = "silhouette") +
theme_classic() + ggtitle("Kmeans - 7 variables") + ylab("Silhouette Width")
pam_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = cluster::pam, method = "silhouette") +
theme_classic() + ggtitle("Pam - 7 variables") + ylab("Silhouette Width")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = hcut, method = "silhouette") +
theme_classic() + ggtitle("Hierarchical - 7 variables") + ylab("Silhouette Width")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = kmeans, method = "wss") +
theme_classic() + ggtitle("Kmeans - 7 variables")
pam_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = cluster::pam, method = "wss") +
theme_classic() + ggtitle("Pam - 7 variables")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = hcut, method = "wss") +
theme_classic() + ggtitle("Hierarchical - 7 variables")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
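The four blocks above differ only in the data set, the algorithm and the method, so they could be generated with a small helper; a sketch of such a refactoring (same calls and defaults, purely for brevity):
# Helper producing one stacked panel of fviz_nbclust plots
nbclust_panel <- function(data, label, method) {
  algs <- list(Kmeans = kmeans, Pam = cluster::pam, Hierarchical = hcut)
  plots <- Map(function(f, nm) {
    fviz_nbclust(data, FUNcluster = f, method = method) +
      theme_classic() + ggtitle(paste(nm, "-", label))
  }, algs, names(algs))
  grid.arrange(grobs = plots, ncol = 1)
}
nbclust_panel(ObesityDataSet_norm_7, "7 variables", "silhouette")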
The PAM and k-means algorithms provided similar results. Since k-means is computationally cheaper, I choose it for the further analysis.
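The efficiency claim can be checked with a rough timing comparison; a sketch (absolute timings depend on hardware and, for k-means, on nstart and the number of iterations):
# Rough timing of k-means vs PAM on the limited data set
system.time(kmeans(ObesityDataSet_norm_7, centers = 2, nstart = 10))
system.time(cluster::pam(ObesityDataSet_norm_7, k = 2))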
According to the visualizations above, the optimal number of clusters is 2. In addition, I decided to verify the results for 7 clusters, since there are 7 groups in the dependent variable NObeyesdad.
Clusters and silhouette widths for 2 and 7 clusters are visualized below.
cl_kmeans <- eclust(ObesityDataSet_norm_7, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
kmeans_sil2 <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 1060 0.21
## 2 2 1051 0.22
kmeans_cluster2 <- fviz_cluster(cl_kmeans, data = ObesityDataSet_norm_7, ellipse.type = "convex", labelsize = 0, palette=c("#F8766D", "#00BA38", "#619CFF", "#FFA600", "#7A6A5B", "#00FFFF", "#FF00FF")) + theme_minimal()
grid.arrange(kmeans_sil2, kmeans_cluster2, ncol=2)
cl_kmeans <- eclust(ObesityDataSet_norm_7, k=7, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
kmeans_sil7 <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 308 0.10
## 2 2 254 0.13
## 3 3 356 0.38
## 4 4 266 0.20
## 5 5 415 0.19
## 6 6 207 0.13
## 7 7 305 0.15
kmeans_cluster7 <- fviz_cluster(cl_kmeans, data = ObesityDataSet_norm_7, ellipse.type = "convex", labelsize = 0, palette=c("#F8766D", "#00BA38", "#619CFF", "#FFA600", "#7A6A5B", "#00FFFF", "#FF00FF")) + theme_minimal()
grid.arrange(kmeans_sil7, kmeans_cluster7, ncol=2)
To evaluate the results, the plots below show the distances of the data points to the cluster centroids using stripes plots.
# Distance-to-centroid stripes plots for 2 and 7 clusters
stats2 <- cclust(ObesityDataSet_norm_7, 2, dist="euclidean")
stripes(stats2)
stats7 <- cclust(ObesityDataSet_norm_7, 7, dist="euclidean")
stripes(stats7)
Finally, I would like to verify the correctness of the clustering by comparing the results to the true labels from the data set.
The first approach is to compute the adjusted Rand index (ARI), which compares two classifications. I also generated a random permutation of the labels in order to check whether the clustering classification is more accurate than chance.
In order to compare the results with 2 clusters, I created the variable Obesity_binar, which classifies Obesity Types I, II and III as 2 and the rest as 1; the proportion is 1139 to 972 observations.
Obesity_binar <- ifelse(ObesityDataSet$NObeyesdad == "Insufficient_Weight", 1,
ifelse(ObesityDataSet$NObeyesdad == "Normal_Weight", 1,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_I", 1,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_II", 1,2
))))
table(Obesity_binar)
## Obesity_binar
## 1 2
## 1139 972
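Before applying the index to the clustering results, a toy example illustrates its scale (a sketch using mclust's adjustedRandIndex): identical partitions score 1 even when the labels are permuted, while partitions with no agreement beyond chance score around or below 0.
# ARI is invariant to label switching
adjustedRandIndex(c(1, 1, 2, 2), c(2, 2, 1, 1))  # 1: same partition, labels swapped
adjustedRandIndex(c(1, 1, 2, 2), c(1, 2, 1, 2))  # negative: no agreement beyond chance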
ARI_2clust <- adjustedRandIndex(kmeans_cluster2$data$cluster, Obesity_binar)
paste0("Adjusted Rand Index for 2 clusters: ", ARI_2clust)
## [1] "Adjusted Rand Index for 2 clusters: -0.000196274851551966"
rand_ARI_2clust <- randIndex(Obesity_binar, sample(Obesity_binar))
paste0("Rand Index for 2 clusters and random sample: ", rand_ARI_2clust)
## [1] "Rand Index for 2 clusters and random sample: 0.000222002830914471"
ARI_7clust <- adjustedRandIndex(kmeans_cluster7$data$cluster, ObesityDataSet_num$NObeyesdad)
paste0("Adjusted Rand Index for 7 clusters: ", ARI_7clust)
## [1] "Adjusted Rand Index for 7 clusters: 0.143730328422885"
rand_ARI_7clust <- randIndex(ObesityDataSet_num$NObeyesdad, sample(ObesityDataSet_num$NObeyesdad))
paste0("Rand Index for 7 clusters and random sample: ", rand_ARI_7clust)
## [1] "Rand Index for 7 clusters and random sample: 0.000377448323153927"
I would also like to visualize the results. To do so, I created a confusion matrix which counts the number of instances where the predicted cluster assignment coincides with each true class label. The confusion matrix was then plotted with ggplot2: the x-axis represents the true class labels, the y-axis the predicted cluster assignments, and the fill color the count of observations, on a scale from white (low count) to red (high count).
# Create a confusion matrix
conf_mat7 <- table(ObesityDataSet_num$NObeyesdad, kmeans_cluster7$data$cluster)
# Create a confusion matrix plot
ggplot(data.frame(expand.grid(actual = rownames(conf_mat7), predicted = colnames(conf_mat7)), count = as.vector(conf_mat7)), aes(x = actual, y = predicted, fill = count)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "True Class", y = "Predicted Cluster", fill = "Count")
# Create a confusion matrix
conf_mat2 <- table(Obesity_binar, kmeans_cluster2$data$cluster)
# Create a confusion matrix plot
ggplot(data.frame(expand.grid(actual = rownames(conf_mat2), predicted = colnames(conf_mat2)), count = as.vector(conf_mat2)), aes(x = actual, y = predicted, fill = count)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "True Class", y = "Predicted Cluster", fill = "Count")
Comparing the results of the verification tests, 7 clusters gave a noticeably better fit than 2 clusters and than the random permutation.
According to the Hopkins value of about 0.15, the data set has a rather low degree of clustering tendency. As such, clustering algorithms may not be the most effective approach to my goal of identifying the key determinants of obesity levels. Although the adjusted Rand index of 0.14 may appear somewhat unsatisfactory, the application of principal component analysis (PCA) yielded favorable outcomes and improved the model fit despite the fact that most variables are categorical.
[1] Palechor, F.M., Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, Volume 25, 104344. https://doi.org/10.1016/j.dib.2019.104344
[2] Beavers, A.S., Lounsbury, J.W., Richards, J.K., Huck, S.W., Skolits, G.J., Esquivel, S.L. (2013). Practical considerations for using exploratory factor analysis in educational research. Practical Assessment, Research and Evaluation.
[3] STHDA, Principal Component Analysis in R: prcomp vs princomp. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/