The aim of this paper is to analyze obesity levels based on the eating habits and physical condition of the surveyed individuals with the use of clustering methods. The data are also analyzed with dimension reduction techniques.
The data set comes from 2019 and presents the obesity levels of individuals from Mexico, Peru and Colombia. It contains 17 attributes and 2111 records and was downloaded from Kaggle.
# Loading necessary packages
library(readr)
library(tidyverse)
library(ggplot2)
library(reshape2)
library(psych)
library(corrplot)
library(cluster)
library(factoextra)
library(pdp)
library(gridExtra)
library(clustertend)
library(flexclust)
library(mclust)
The data set contains 17 attributes. Basic personal information: Gender, Age, Height, Weight. Attributes connected with eating habits:
family_history_with_overweight - family history of overweight (Yes/No),
FAVC - frequent consumption of high caloric food (Yes/No),
FCVC - frequency of consumption of vegetables (Never/Sometimes/Always),
NCP - number of main meals,
CAEC - consumption of food between meals (No/Sometimes/Frequently/Always),
SMOKE - consumption of cigarettes (Yes/No),
CH2O - daily consumption of water (Less than a liter/Between 1-2L/More),
CALC - consumption of alcohol (I do not drink/Sometimes/Frequently/Always),
and attributes related to physical condition:
SCC - calories consumption monitoring (Yes/No),
FAF - physical activity frequency (I do not have/1 or 2 days/2 or 4 days/4 or 5 days),
TUE - time using technology devices (0–2 hours/3–5 hours/More than 5 hours),
MTRANS - transportation used (Automobile/Motorbike/Bike/Public Transportation/Walking).
Finally, the variable NObeyesdad was created with the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III, based on the BMI formula weight/height^2 and on information from the WHO and Mexican Normativity [1].
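To make this derivation concrete, below is a minimal sketch of how such a classification could be computed from the BMI formula. The 18.5/25/30/35/40 cut-offs are the standard WHO values; the 27.5 split between the two overweight levels and the helper name bmi_class are illustrative assumptions, since the exact thresholds combine WHO and Mexican Normativity.
# Hypothetical helper: classify BMI = weight/height^2 into the seven levels
# (the 27.5 overweight split is assumed, not taken from the source paper)
bmi_class <- function(weight, height) {
  bmi <- weight / height^2
  cut(bmi,
      breaks = c(-Inf, 18.5, 25, 27.5, 30, 35, 40, Inf),
      labels = c("Insufficient_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Overweight_Level_II",
                 "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III"))
}
bmi_class(90, 1.70)  # BMI ~ 31.1 -> Obesity_Type_I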
The first step is to load the data set and prepare it for further analysis.
# Loading dataset
ObesityDataSet <- read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")
ObesityDataSet_num <- ObesityDataSet
ObesityDataSet_num$CAEC <- ifelse(ObesityDataSet$CAEC == "no", 0,
ifelse(ObesityDataSet$CAEC == "Sometimes", 1,
ifelse(ObesityDataSet$CAEC == "Frequently", 2, 3)))
ObesityDataSet_num$CALC <- ifelse(ObesityDataSet$CALC == "no", 0,
ifelse(ObesityDataSet$CALC == "Sometimes", 1,
ifelse(ObesityDataSet$CALC == "Frequently", 2, 3)))
# Note: the level in the raw data is "Public_Transportation"; a partial string
# such as "Public_Trans" never matches and would miscode it together with Walking
ObesityDataSet_num$MTRANS <- ifelse(ObesityDataSet$MTRANS == "Automobile", 1,
ifelse(ObesityDataSet$MTRANS == "Bike", 2,
ifelse(ObesityDataSet$MTRANS == "Motorbike", 3,
ifelse(ObesityDataSet$MTRANS == "Public_Transportation", 4, 5))))
ObesityDataSet_num$NObeyesdad <- ifelse(ObesityDataSet$NObeyesdad == "Insufficient_Weight", 1,
ifelse(ObesityDataSet$NObeyesdad == "Normal_Weight", 2,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_I", 3,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_II", 4,
ifelse(ObesityDataSet$NObeyesdad == "Obesity_Type_I", 5,
ifelse(ObesityDataSet$NObeyesdad == "Obesity_Type_II", 6,7
))))))
ObesityDataSet_num$Gender <- ifelse(ObesityDataSet$Gender == "Male", 0, 1)
ObesityDataSet_num$family_history_with_overweight <- ifelse(ObesityDataSet$family_history_with_overweight == "no", 0, 1)
ObesityDataSet_num$FAVC <- ifelse(ObesityDataSet$FAVC == "no", 0, 1)
ObesityDataSet_num$SMOKE <- ifelse(ObesityDataSet$SMOKE == "no", 0, 1)
ObesityDataSet_num$SCC <- ifelse(ObesityDataSet$SCC == "no", 0, 1)
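The nested ifelse() chains above can also be written more compactly with named lookup vectors; a small sketch that reproduces the CAEC recoding and verifies that both approaches agree:
# Lookup-vector alternative to the nested ifelse() chain for CAEC
caec_map <- c(no = 0, Sometimes = 1, Frequently = 2, Always = 3)
stopifnot(identical(unname(caec_map[ObesityDataSet$CAEC]),
                    as.numeric(ObesityDataSet_num$CAEC)))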
# Summary of dataset
summary(ObesityDataSet_num)
## Gender Age Height Weight
## Min. :0.0000 Min. :14.00 Min. :1.450 Min. : 39.00
## 1st Qu.:0.0000 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Median :0.0000 Median :22.78 Median :1.700 Median : 83.00
## Mean :0.4941 Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:1.0000 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :1.0000 Max. :61.00 Max. :1.980 Max. :173.00
## family_history_with_overweight FAVC FCVC
## Min. :0.0000 Min. :0.0000 Min. :1.000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.:2.000
## Median :1.0000 Median :1.0000 Median :2.386
## Mean :0.8176 Mean :0.8839 Mean :2.419
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:3.000
## Max. :1.0000 Max. :1.0000 Max. :3.000
## NCP CAEC SMOKE CH2O
## Min. :1.000 Min. :0.000 Min. :0.00000 Min. :1.000
## 1st Qu.:2.659 1st Qu.:1.000 1st Qu.:0.00000 1st Qu.:1.585
## Median :3.000 Median :1.000 Median :0.00000 Median :2.000
## Mean :2.686 Mean :1.141 Mean :0.02084 Mean :2.008
## 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:0.00000 3rd Qu.:2.477
## Max. :4.000 Max. :3.000 Max. :1.00000 Max. :3.000
## SCC FAF TUE CALC
## Min. :0.00000 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.00000 1st Qu.:0.1245 1st Qu.:0.0000 1st Qu.:0.0000
## Median :0.00000 Median :1.0000 Median :0.6253 Median :1.0000
## Mean :0.04548 Mean :1.0103 Mean :0.6579 Mean :0.7314
## 3rd Qu.:0.00000 3rd Qu.:1.6667 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.00000 Max. :3.0000 Max. :2.0000 Max. :3.0000
## MTRANS NObeyesdad
## Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:2.000
## Median :4.000 Median :4.000
## Mean :3.365 Mean :4.112
## 3rd Qu.:4.000 3rd Qu.:6.000
## Max. :5.000 Max. :7.000
# The synthetic records contain non-integer values for the ordinal variables,
# so they are rounded before being converted to labelled factors
ObesityDataSet <- ObesityDataSet %>%
mutate_at(c(7,8,11,13,14), round)
ObesityDataSet$FCVC <- factor(ObesityDataSet$FCVC,
levels = c(1,2,3),
labels = c("Never", "Sometimes", "Always"))
ObesityDataSet$CH2O <- factor(ObesityDataSet$CH2O,
levels = c(1,2,3),
labels = c("Less than a liter", "Between 1-2L", "More"))
ObesityDataSet$FAF <- factor(ObesityDataSet$FAF,
levels = c(0,1,2,3),
labels = c("I do not have", "1 or 2 days", "2 or 4 days", "4 or 5 days"))
ObesityDataSet$TUE <- factor(ObesityDataSet$TUE,
levels = c(0,1,2),
labels = c("0–2 hours", "3–5 hours", "More than 5 hours"))
ObesityDataSet <- ObesityDataSet %>%
mutate_at(c(1,5,6,8,9,10,12,15,16,17), as.factor)
# Summary of dataset
summary(ObesityDataSet)
## Gender Age Height Weight
## Female:1043 Min. :14.00 Min. :1.450 Min. : 39.00
## Male :1068 1st Qu.:19.95 1st Qu.:1.630 1st Qu.: 65.47
## Median :22.78 Median :1.700 Median : 83.00
## Mean :24.31 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
##
## family_history_with_overweight FAVC FCVC NCP
## no : 385 no : 245 Never : 102 1: 316
## yes:1726 yes:1866 Sometimes:1013 2: 176
## Always : 996 3:1470
## 4: 149
##
##
##
## CAEC SMOKE CH2O SCC
## Always : 53 no :2067 Less than a liter: 485 no :2015
## Frequently: 242 yes: 44 Between 1-2L :1110 yes: 96
## no : 51 More : 516
## Sometimes :1765
##
##
##
## FAF TUE CALC
## I do not have:720 0–2 hours :952 Always : 1
## 1 or 2 days :776 3–5 hours :915 Frequently: 70
## 2 or 4 days :496 More than 5 hours:244 no : 639
## 4 or 5 days :119 Sometimes :1401
##
##
##
## MTRANS NObeyesdad
## Automobile : 457 Insufficient_Weight:272
## Bike : 7 Normal_Weight :287
## Motorbike : 11 Obesity_Type_I :351
## Public_Transportation:1580 Obesity_Type_II :297
## Walking : 56 Obesity_Type_III :324
## Overweight_Level_I :290
## Overweight_Level_II:290
For the needs of this analysis I created two versions of the data set: one with only numeric variables (normalized later for the PCA and clustering) and one with categorical variables.
The distributions of all variables from the data set are presented below.
ObesityDataSetPlot_cat <- melt(ObesityDataSet[,c(1,5:16)], id.vars = NULL)
ObesityDataSetPlot_num <- melt(ObesityDataSet[,c(2:4)], id.vars = NULL)
ggplot(data = ObesityDataSetPlot_cat,aes(x = value)) +
facet_wrap(~variable, scales = "free", ncol = 3) +
geom_bar() +
theme(axis.text = element_text(size=10)) +
theme(axis.title = element_text(size = 12))
ggplot(data = ObesityDataSet,aes(x = NObeyesdad)) +
geom_bar() +
theme(axis.text = element_text(size=20)) +
theme(axis.title = element_text(size = 25))
ggplot(data = ObesityDataSetPlot_num,aes(x = value)) +
facet_wrap(~variable, scales = "free", ncol = 3) +
geom_histogram()
There are 3 continuous variables: Age, Height and Weight; the rest are categorical.
Looking at the distribution and summary of the data, we may notice some basic information about the respondents. The data set is balanced with regard to gender, roughly 50:50 between females and males. Respondents' ages start at 14 and the oldest person is 61, but most of them are rather young: 75% are no older than 26. Height is close to normally distributed, ranging from 1.45 m to 1.98 m with an average of 1.70 m. Weight covers a broad range, from 39 kg to 173 kg, with a mean of 87 kg. Most respondents (82%) report a family history of overweight. The data set is also balanced with regard to BMI level (the variable NObeyesdad), with Obesity Type I being the most frequent group.
Before the clustering analysis I would like to apply dimension reduction in order to choose the most relevant variables. The goal is to simplify the data without losing critical information.
The data set contains 14 categorical variables out of 17, so PCA may not be the most efficient dimension reduction method, as it is better suited to continuous variables. MDS could be more suitable for categorical data; however, it is mainly used to visualize similarities or dissimilarities between objects. Given that my main goal is to simplify the data without losing critical information, I choose PCA.
First I compute the PCA and try to find the optimal number of variables. PCA requires numeric variables, so I dropped the variable MTRANS, whose codes cannot be meaningfully interpreted as numeric, and then scaled the data set so that each variable has mean 0 and unit variance.
I calculated correlations using the Spearman method, as it is more appropriate for ordinal data which is not normally distributed.
ObesityDataSet_norm <- scale(ObesityDataSet_num[,-16])
cor_w <- cor(ObesityDataSet_norm, method="spearman")
corrplot(cor_w, type = "lower", order = "hclust", tl.col = "black", tl.cex = 0.5)
The highest correlations are observed between Height and Gender, Weight and Height, Weight and BMI level (NObeyesdad), and family_history_with_overweight and Weight. For this analysis the most interesting is the relationship between family_history_with_overweight and Weight, as it may suggest an influence of family history on the obesity problem. The remaining strong correlations are rather expected.
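The same conclusions can be checked numerically instead of being read off the plot; a small sketch listing the strongest pairwise correlations from the cor_w matrix computed above:
# Extract the top correlated pairs (upper triangle only, sorted by |rho|)
cor_pairs <- melt(cor_w, varnames = c("var1", "var2"), value.name = "rho")
cor_pairs <- subset(cor_pairs, as.integer(var1) < as.integer(var2))
head(cor_pairs[order(-abs(cor_pairs$rho)), ], 8)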
Now it is time to compute the PCA. I remove the variable NObeyesdad, as it is the dependent variable in this analysis, and Weight, as it directly determines the BMI level.
The first plot shows the percentage of variance explained by each principal component.
In PCA the Kaiser criterion drops the components whose eigenvalues are less than 1 (when the data are standardized). An eigenvalue greater than 1 indicates that the corresponding component explains more variance than a single variable, given that each standardized variable accounts for one unit of variance [2].
ObesityDataSet_norm <- scale(ObesityDataSet_num[,c(-4,-16,-17)])
res.pca <- prcomp(ObesityDataSet_norm)
fviz_eig(res.pca, addlabels = TRUE, ncp = 15, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4")
fviz_eig(res.pca, choice = "eigenvalue", ncp = 15, barfill = "hotpink3", barcolor = "hotpink4", linecolor = "brown4", addlabels = TRUE, main = "Eigenvalues")
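The Kaiser criterion can also be applied numerically rather than read from the plots; a minimal sketch using the res.pca object defined above:
# For prcomp on standardized data the eigenvalues are the squared component sds
eigenvalues <- res.pca$sdev^2
sum(eigenvalues > 1)                     # components retained by the Kaiser rule
cumsum(eigenvalues) / sum(eigenvalues)   # cumulative share of explained variance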
According to the Kaiser criterion there are 7 components with eigenvalues above 1. By reducing the dimensions to 7 we can explain 67.5% of the variance. Let's check which variables contribute most to these components with the use of another visualization.
The chosen variables are the following: Height, Gender, CALC, Age, FCVC, TUE, CH2O.
pca_var <- get_pca_var(res.pca)
fviz_contrib(res.pca, "var", axes = 1:7, fill = "hotpink3", color = "brown4")
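The ranking of variables can also be extracted numerically from pca_var; a sketch using a simple unweighted average of contributions across the first seven components (fviz_contrib weights the dimensions by their eigenvalues, so its ordering may differ slightly):
# pca_var$contrib is a variables-by-dimensions matrix of contributions in %
avg_contrib <- rowMeans(pca_var$contrib[, 1:7])
sort(avg_contrib, decreasing = TRUE)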
The next plot, a graph of variables, shows correlations and contributions. Positively correlated variables point to the same side of the plot, while negatively correlated variables point to opposite sides. Additionally, the gradient color represents the contribution of each variable to the principal components [3].
fviz_pca_var(res.pca,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
The graph below shows individuals grouped by BMI level. It is hard to distinguish the groups based on the individuals' profiles; only Obesity Type II appears more structured.
groups <- ObesityDataSet$NObeyesdad
fviz_pca_ind(res.pca,
col.ind = groups, # color by groups
palette = c("lightblue","pink","#00AFBB", "#E7B800", "#FC4E07", "darkred", "lightgreen"),
addEllipses = TRUE, # Concentration ellipses
ellipse.type = "confidence",
legend.title = "Groups",
repel = TRUE
)
Now I would like to analyze the data using clustering algorithms. Cluster analysis is a technique that aims to group similar data points together into clusters or subgroups. The main goal of clustering is to identify patterns in the data without prior knowledge of group memberships. The basic idea is to measure the similarity or dissimilarity between each pair of data points and then group the points into clusters.
Based on the principal component analysis conducted above, I created a new data set with the 7 variables that had the highest contributions. They are the following: Gender, Height, Age, FCVC (frequency of consumption of vegetables), CH2O (daily consumption of water), CALC (consumption of alcohol) and TUE (time using technology devices).
ObesityDataSet_norm_7 <- ObesityDataSet_norm[,c("Height","Gender","CH2O", "CALC", "FCVC", "Age", "TUE")]
The hopkins function is used to estimate the clustering tendency of a data set. It compares the distances between real data points and their nearest neighbours with the distances between uniformly generated random points and their nearest real neighbours. The Hopkins statistic ranges between 0 and 1, with values closer to 1 indicating a higher degree of clustering tendency in the data set.
A Hopkins value of 0.1-0.2 indicates that the data set has a low degree of clustering tendency; in other words, the data points are distributed rather uniformly instead of being clustered together. This suggests that applying clustering algorithms to this data set may not be the best approach.
paste0("Hopkins statistic for the limited data set: ", clustertend::hopkins(ObesityDataSet_norm_7))
## [1] "Hopkins statistic for the limited data set: 0.157632108218407"
paste0("Hopkins statistic for the whole data set: ", clustertend::hopkins(ObesityDataSet_norm))
## [1] "Hopkins statistic for the whole data set: 0.182092541793173"
Comparing the silhouette widths and total within-cluster sums of squares below, the clustering quality for the whole data set is lower than for the chosen 7 variables. Therefore I use the limited data set in the further analysis.
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = kmeans, method = "silhouette") +
theme_classic() + ggtitle("Kmeans - Whole Data Set") + ylab("Silhouette Width")
pam_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = cluster::pam, method = "silhouette") +
theme_classic() + ggtitle("Pam - Whole Data Set") + ylab("Silhouette Width")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = hcut, method = "silhouette") +
theme_classic() + ggtitle("Hierarchical - Whole Data Set") + ylab("Silhouette Width")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = kmeans, method = "wss") +
theme_classic() + ggtitle("Kmeans - Whole Data Set")
pam_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = cluster::pam, method = "wss") +
theme_classic() + ggtitle("Pam - Whole Data Set")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm, FUNcluster = hcut, method = "wss") +
theme_classic() + ggtitle("Hierarchical - Whole Data Set")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = kmeans, method = "silhouette") +
theme_classic() + ggtitle("Kmeans - 7 variables") + ylab("Silhouette Width")
pam_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = cluster::pam, method = "silhouette") +
theme_classic() + ggtitle("Pam - 7 variables") + ylab("Silhouette Width")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = hcut, method = "silhouette") +
theme_classic() + ggtitle("Hierarchical - 7 variables") + ylab("Silhouette Width")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
kmeans_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = kmeans, method = "wss") +
theme_classic() + ggtitle("Kmeans - 7 variables")
pam_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = cluster::pam, method = "wss") +
theme_classic() + ggtitle("Pam - 7 variables")
hierarchical_plot <- fviz_nbclust(ObesityDataSet_norm_7, FUNcluster = hcut, method = "wss") +
theme_classic() + ggtitle("Hierarchical - 7 variables")
grid.arrange(kmeans_plot,pam_plot,hierarchical_plot, ncol=1)
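The four blocks above differ only in the data set, the algorithm and the method, so they could be generated with a small helper; a sketch of such a refactoring (same calls and defaults, purely for brevity):
# Helper producing one stacked panel of fviz_nbclust plots
nbclust_panel <- function(data, label, method) {
  algs <- list(Kmeans = kmeans, Pam = cluster::pam, Hierarchical = hcut)
  plots <- Map(function(f, nm) {
    fviz_nbclust(data, FUNcluster = f, method = method) +
      theme_classic() + ggtitle(paste(nm, "-", label))
  }, algs, names(algs))
  grid.arrange(grobs = plots, ncol = 1)
}
nbclust_panel(ObesityDataSet_norm_7, "7 variables", "silhouette")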
The PAM and k-means algorithms provided similar results. Since k-means is computationally cheaper, I choose it for the further analysis.
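The efficiency claim can be checked with a rough timing comparison; a sketch (absolute timings depend on hardware and, for k-means, on nstart and the number of iterations):
# Rough timing of k-means vs PAM on the limited data set
system.time(kmeans(ObesityDataSet_norm_7, centers = 2, nstart = 10))
system.time(cluster::pam(ObesityDataSet_norm_7, k = 2))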
According to the visualizations above, the optimal number of clusters is 2. In addition, I decided to verify the results for 7 clusters, since there are 7 groups in the dependent variable NObeyesdad.
Clusters and silhouette widths for 2 and 7 clusters are visualized below.
cl_kmeans <- eclust(ObesityDataSet_norm_7, k=2, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
kmeans_sil2 <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 1060 0.21
## 2 2 1051 0.22
kmeans_cluster2 <- fviz_cluster(cl_kmeans, data = ObesityDataSet_norm_7, ellipse.type = "convex", labelsize = 0, palette=c("#F8766D", "#00BA38", "#619CFF", "#FFA600", "#7A6A5B", "#00FFFF", "#FF00FF")) + theme_minimal()
grid.arrange(kmeans_sil2, kmeans_cluster2, ncol=2)
cl_kmeans <- eclust(ObesityDataSet_norm_7, k=7, FUNcluster="kmeans", hc_metric="euclidean", graph=FALSE)
kmeans_sil7 <- fviz_silhouette(cl_kmeans)
## cluster size ave.sil.width
## 1 1 308 0.10
## 2 2 254 0.13
## 3 3 356 0.38
## 4 4 266 0.20
## 5 5 415 0.19
## 6 6 207 0.13
## 7 7 305 0.15
kmeans_cluster7 <- fviz_cluster(cl_kmeans, data = ObesityDataSet_norm_7, ellipse.type = "convex", labelsize = 0, palette=c("#F8766D", "#00BA38", "#619CFF", "#FFA600", "#7A6A5B", "#00FFFF", "#FF00FF")) + theme_minimal()
grid.arrange(kmeans_sil7, kmeans_cluster7, ncol=2)
To evaluate the results, the plots below show the distances of the data points to the cluster centroids using stripes plots.
# Distance-to-centroid stripes plots for 2 and 7 clusters
stats2 <- cclust(ObesityDataSet_norm_7, 2, dist="euclidean")
stripes(stats2)
stats7 <- cclust(ObesityDataSet_norm_7, 7, dist="euclidean")
stripes(stats7)
Finally, I would like to verify the correctness of the clustering by comparing the results to the true labels from the data set.
The first approach is to compute the adjusted Rand index (ARI), which compares two classifications. I also generated a random permutation of the labels in order to check whether the clustering classification is more accurate than chance.
In order to compare the results with 2 clusters, I created the variable Obesity_binar, which classifies Obesity Types I, II and III as 2 and the rest as 1; the proportion is 1139 to 972 observations.
Obesity_binar <- ifelse(ObesityDataSet$NObeyesdad == "Insufficient_Weight", 1,
ifelse(ObesityDataSet$NObeyesdad == "Normal_Weight", 1,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_I", 1,
ifelse(ObesityDataSet$NObeyesdad == "Overweight_Level_II", 1,2
))))
table(Obesity_binar)
## Obesity_binar
## 1 2
## 1139 972
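Before applying the index to the clustering results, a toy example illustrates its scale (a sketch using mclust's adjustedRandIndex): identical partitions score 1 even when the labels are permuted, while partitions with no agreement beyond chance score around or below 0.
# ARI is invariant to label switching
adjustedRandIndex(c(1, 1, 2, 2), c(2, 2, 1, 1))  # 1: same partition, labels swapped
adjustedRandIndex(c(1, 1, 2, 2), c(1, 2, 1, 2))  # negative: no agreement beyond chance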
ARI_2clust <- adjustedRandIndex(kmeans_cluster2$data$cluster, Obesity_binar)
paste0("Adjusted Rand Index for 2 clusters: ", ARI_2clust)
## [1] "Adjusted Rand Index for 2 clusters: -0.000196274851551966"
rand_ARI_2clust <- randIndex(Obesity_binar, sample(Obesity_binar))
paste0("Rand Index for 2 clusters and random sample: ", rand_ARI_2clust)
## [1] "Rand Index for 2 clusters and random sample: 0.000222002830914471"
ARI_7clust <- adjustedRandIndex(kmeans_cluster7$data$cluster, ObesityDataSet_num$NObeyesdad)
paste0("Adjusted Rand Index for 7 clusters: ", ARI_7clust)
## [1] "Adjusted Rand Index for 7 clusters: 0.143730328422885"
rand_ARI_7clust <- randIndex(ObesityDataSet_num$NObeyesdad, sample(ObesityDataSet_num$NObeyesdad))
paste0("Rand Index for 7 clusters and random sample: ", rand_ARI_7clust)
## [1] "Rand Index for 7 clusters and random sample: 0.000377448323153927"
I would also like to visualize the results. To do so, I created a confusion matrix which counts the number of instances where the predicted cluster assignment coincides with each true class label. The confusion matrix was then plotted with ggplot2: the x-axis represents the true class labels, the y-axis the predicted cluster assignments, and the fill color the count of observations, on a scale from white (low count) to red (high count).
# Create a confusion matrix
conf_mat7 <- table(ObesityDataSet_num$NObeyesdad, kmeans_cluster7$data$cluster)
# Create a confusion matrix plot
ggplot(data.frame(expand.grid(actual = rownames(conf_mat7), predicted = colnames(conf_mat7)), count = as.vector(conf_mat7)), aes(x = actual, y = predicted, fill = count)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "True Class", y = "Predicted Cluster", fill = "Count")
# Create a confusion matrix
conf_mat2 <- table(Obesity_binar, kmeans_cluster2$data$cluster)
# Create a confusion matrix plot
ggplot(data.frame(expand.grid(actual = rownames(conf_mat2), predicted = colnames(conf_mat2)), count = as.vector(conf_mat2)), aes(x = actual, y = predicted, fill = count)) +
geom_tile() +
scale_fill_gradient(low = "white", high = "red") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(x = "True Class", y = "Predicted Cluster", fill = "Count")
Comparing the results of the verification tests, 7 clusters gave a noticeably better fit than 2 clusters and than the random permutation.
According to the Hopkins value of about 0.15, the data set has a rather low degree of clustering tendency. As such, clustering algorithms may not be the most effective approach to my goal of identifying the key determinants of obesity levels. Although the adjusted Rand index of 0.14 may appear somewhat unsatisfactory, the application of principal component analysis (PCA) yielded favorable outcomes and improved the model fit despite the fact that most variables are categorical.
[1] Palechor, F.M., Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, Volume 25, 104344. https://doi.org/10.1016/j.dib.2019.104344
[2] Beavers, A.S., Lounsbury, J.W., Richards, J.K., Huck, S.W., Skolits, G.J., Esquivel, S.L. (2013). Practical considerations for using exploratory factor analysis in educational research. Practical Assessment, Research and Evaluation.
[3] STHDA, Principal Component Analysis in R: prcomp vs princomp. http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/