This report uses unsupervised learning techniques to analyze the dataset:
Estimation of Obesity Levels Based On Eating Habits and Physical Condition https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+…
The goal is to identify distinct patterns in individuals’ physical characteristics and lifestyle habits that correspond to different levels of obesity.
Unlike supervised tasks, where a known target is predicted, this analysis sets the obesity label aside and uses dimensionality reduction to visualize the complex, high-dimensional structure of the data. We aim to see whether lifestyle variables (such as eating habits and transportation modes) naturally group individuals into patterns that relate to clinical obesity classifications.
The main variables in the dataset are:
The primary objective is to construct a low-dimensional typology of obesity risk factors. Specifically, we aim to:
Scope: 2,111 individuals from Mexico, Peru, and Colombia. Variables: 17 features including dietary habits, physical condition, and demographic data.
We use R packages designed for mixed-type data analysis (cluster), dimension reduction (MASS, Rtsne, vegan), and visualization (ggplot2, plotly, viridis).
library(tidyverse)
library(cluster)
library(Rtsne)
library(corrplot)
library(gridExtra)
library(MASS)
library(vegan)
library(plotly)
library(viridis)
library(knitr)
set.seed(123)
The dataset contains a mix of numerical and categorical variables. A critical step before dimension reduction is ensuring that no two individuals have identical predictor values. Identical rows have a dissimilarity of exactly zero, and isoMDS stops with an error when it encounters a zero or negative dissimilarity.
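As a minimal sketch of the duplicate check, the toy data frame below (hypothetical values, not taken from the dataset) has two rows with identical predictors but different labels. It shows how `duplicated()` flags the second occurrence when the label column is left out.

```r
# Hypothetical three-row example: rows 1 and 3 share the same predictors.
toy <- data.frame(
  Age    = c(21, 35, 21),
  MTRANS = factor(c("Walking", "Automobile", "Walking")),
  Label  = c("Normal_Weight", "Obesity_Type_I", "Overweight_Level_I")
)

# Flag rows whose predictors repeat an earlier row (the label is ignored).
predictors <- toy[, c("Age", "MTRANS")]
is_dup <- duplicated(predictors)

# Keep only the first occurrence of each predictor combination.
toy_clean <- toy[!is_dup, ]
print(toy_clean)
```

One consequence of this design is that when duplicated predictors carry different labels, only the label of the first occurrence survives, so the order of the rows matters.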
We load the data and remove duplicate entries based on the predictor variables.
df <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv", stringsAsFactors = TRUE)
print(paste("Original dataset dimensions:", nrow(df), "rows,", ncol(df), "columns"))
## [1] "Original dataset dimensions: 2111 rows, 17 columns"
# We exclude the target variable 'NObeyesdad' to check for identical features
df <- df[!duplicated(dplyr::select(df, -NObeyesdad)), ]
print(paste("Cleaned dataset dimensions:", nrow(df), "rows,", ncol(df), "columns."))
## [1] "Cleaned dataset dimensions: 2087 rows, 17 columns."
ordered_levels <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
"Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")
df$NObeyesdad <- factor(df$NObeyesdad, levels = ordered_levels)
p1 <- ggplot(df, aes(x = NObeyesdad, fill = NObeyesdad)) +
geom_bar() +
theme_minimal() +
coord_flip() +
scale_fill_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) +
labs(title = "Distribution of Obesity Levels",
x = "Obesity Level", y = "Count") +
theme(legend.position = "none")
print(p1)
# Correlation Matrix
numeric_vars <- df %>% select_if(is.numeric)
corrplot(cor(numeric_vars), method = "color", type = "upper",
tl.col = "black", title = "Correlation of Numerical Features", mar=c(0,0,1,0))
We see strong correlations between weight and height variables, which is expected and shows that some numerical predictors carry similar information. However, these simple correlations only use numeric variables and ignore important categorical variables (like type of transport or family history). This is one reason why we move to a more flexible distance measure.
Standard Euclidean distance is not suitable for this dataset because it cannot handle differences between categories like “Public Transportation” and “Automobile”.
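To solve this, we use the Gower distance. This metric treats each variable according to its type: a numeric variable contributes the absolute difference between two individuals divided by the variable's range, a categorical variable contributes 0 for a match and 1 for a mismatch, and the per-variable contributions are averaged.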
This gives a dissimilarity matrix where 0 means that two individuals are identical on every variable and 1 means that they are maximally different on every variable.
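To make the Gower calculation concrete, here is a small worked sketch on a hypothetical three-person data frame (the values are invented for illustration). It computes one pairwise dissimilarity by hand and compares it with `daisy()` from the cluster package.

```r
library(cluster)

# Hypothetical toy data: one numeric and one categorical variable.
toy <- data.frame(
  Age    = c(20, 30, 40),
  MTRANS = factor(c("Automobile", "Public_Transportation", "Automobile"))
)

# Hand computation for individuals 1 and 2.
age_part    <- abs(toy$Age[1] - toy$Age[2]) / diff(range(toy$Age))  # 10 / 20
mtrans_part <- as.numeric(toy$MTRANS[1] != toy$MTRANS[2])            # mismatch = 1
manual_d12  <- mean(c(age_part, mtrans_part))

# Same quantity from daisy().
gower_toy <- daisy(toy, metric = "gower")
daisy_d12 <- as.matrix(gower_toy)[1, 2]

print(c(manual = manual_d12, daisy = daisy_d12))
```

Individuals 1 and 3 use the same transport mode but sit at opposite ends of the age range, so their dissimilarity is (1 + 0) / 2 = 0.5.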
# Excluding the target label
# dplyr::select is called explicitly because MASS (loaded after tidyverse) masks select()
df_features <- df %>% dplyr::select(-NObeyesdad)
gower_dist <- daisy(df_features, metric = "gower")
We use non-metric MDS (isoMDS) rather than classical MDS (cmdscale) because Gower dissimilarities computed from mixed numeric and categorical variables are better treated as ordinal than as exact metric quantities. Non-metric MDS tries to preserve the rank ordering of distances rather than their exact values.
The algorithm minimizes “Stress”, a measure of how different the distances in the 2D map are from the original high-dimensional distances.
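The stress reported by `isoMDS` can be reproduced by hand, which may help clarify what it measures. The sketch below uses the built-in `eurodist` distances purely as a stand-in for the Gower matrix, and assumes the usual Kruskal stress-1 form: the configuration distances are compared with their monotone (isotonic) fit, and the result is expressed in percent.

```r
library(MASS)

# eurodist is a built-in distance matrix, used here only as a stand-in.
fit <- isoMDS(eurodist, k = 2, trace = FALSE)

# Shepard() returns the original dissimilarities (x), the configuration
# distances (y), and the monotone (isotonic) fit to those distances (yf).
sh <- Shepard(eurodist, fit$points)

# Kruskal stress-1, in percent.
stress_manual <- 100 * sqrt(sum((sh$y - sh$yf)^2) / sum(sh$y^2))

print(c(manual = stress_manual, reported = fit$stress))
```

The two numbers should agree closely. Because the comparison is made against a monotone fit rather than the raw dissimilarities, stress penalises violations of rank order, not differences in scale.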
# Non-Metric MDS
mds_fit <- isoMDS(gower_dist, k = 2, trace = FALSE)
# Extracting coordinates
mds_data <- as.data.frame(mds_fit$points)
colnames(mds_data) <- c("Dim1", "Dim2")
mds_data$Obesity_Level <- df$NObeyesdad
plot_mds <- ggplot(mds_data, aes(x = Dim1, y = Dim2, color = Obesity_Level)) +
geom_point(alpha = 0.6, size = 2) +
# Ellipses
stat_ellipse(type = "norm", level = 0.95, linetype = 2, alpha = 0.5) +
theme_minimal() +
scale_color_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) +
labs(title = "Non-Metric MDS (Gower Distance)",
subtitle = paste("Stress Value:", round(mds_fit$stress, 3), "%"),
x = "Dimension 1", y = "Dimension 2")
print(plot_mds)
The MDS plot shows how individuals with different obesity levels are arranged in a 2D space. Neighbouring levels often overlap, which is expected because the differences between adjacent classes (for example Normal and Overweight_Level_I) are subtle and partly based on cut-off choices.
The reported stress is around 25.5%. Stress values above about 20% are commonly read as indicating a relatively poor global fit. This means that compressing the 16 predictor variables into only two dimensions inevitably distorts some of the distances between individuals.
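One way to judge how much of the distortion comes from the restriction to two dimensions is to refit the solution with more dimensions and compare the stress values. The following is a minimal sketch on the built-in `eurodist` distances (a stand-in, not the obesity data); the range of `k` is an arbitrary choice for illustration.

```r
library(MASS)

# Fit non-metric MDS for several target dimensionalities.
dims <- 2:4
stress_by_k <- sapply(dims, function(k) {
  isoMDS(eurodist, k = k, trace = FALSE)$stress
})
names(stress_by_k) <- paste0("k=", dims)

print(round(stress_by_k, 2))
```

Stress usually falls as dimensions are added. A sharp drop from k = 2 to k = 3 would suggest that a 2D map hides a substantial part of the structure.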
To check how well the 2D map reflects the original distances, we use the Shepard correlation and a Shepard plot.
# 1. Shepard Correlation
mds_2d_dist <- dist(mds_fit$points)
fit_correlation <- cor(gower_dist, mds_2d_dist)
print(paste("Shepard Correlation (Validity of 2D Map):", round(fit_correlation, 4)))
## [1] "Shepard Correlation (Validity of 2D Map): 0.8497"
# 2. Shepard Plot (Using a random sample of 5000)
set.seed(123)
sample_indices <- sample(length(gower_dist), 5000)
shepard_df <- data.frame(
Original = as.numeric(gower_dist)[sample_indices],
MDS = as.numeric(mds_2d_dist)[sample_indices]
)
plot_shepard <- ggplot(shepard_df, aes(x = Original, y = MDS)) +
geom_point(alpha = 0.3, size = 1) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
theme_minimal() +
labs(title = "Shepard Plot (Goodness of Fit)",
subtitle = paste("Correlation:", round(fit_correlation, 3), "- Higher values mean better preservation of distances"))
print(plot_shepard)
In this analysis, the Shepard correlation is about 0.85. Since cor() computes the Pearson coefficient by default, this value indicates a strong, approximately linear relationship between the original distances and the distances in the 2D map.
Taken together, the stress value is high (about 25.5%), so forcing the data into two dimensions distorts distances noticeably. At the same time, the Shepard correlation (about 0.85) indicates that larger original distances still tend to map to larger distances in the 2D map. The 2D MDS map is therefore not suitable for precise distance measurement, but it is useful for qualitative exploration of the overall structure and for comparing groups.
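Because non-metric MDS targets rank order, a rank-based coefficient is a natural complement to the default Pearson correlation. The sketch below computes both on the built-in `eurodist` distances, which are again only a stand-in for the Gower matrix.

```r
library(MASS)

fit <- isoMDS(eurodist, k = 2, trace = FALSE)

# Flatten both distance objects to plain vectors of pairwise distances.
orig_d <- as.numeric(eurodist)
map_d  <- as.numeric(dist(fit$points))

pearson_r  <- cor(orig_d, map_d, method = "pearson")   # linear agreement
spearman_r <- cor(orig_d, map_d, method = "spearman")  # rank (monotonic) agreement

print(round(c(pearson = pearson_r, spearman = spearman_r), 3))
```

If the Spearman value is clearly higher than the Pearson value, the map preserves the ordering of distances well even though the relationship is not linear.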
We project the original variables onto the MDS ordination using the envfit function from the vegan package. Each arrow shows the direction in which the corresponding numeric variable increases most strongly.
fit_vectors <- envfit(mds_fit, df_features, permutations = 999, na.rm = TRUE)
scores_vectors <- as.data.frame(scores(fit_vectors, display = "vectors"))
scores_vectors$Variable <- rownames(scores_vectors)
plot_mds_vectors <- ggplot(mds_data, aes(x = Dim1, y = Dim2)) +
geom_point(aes(color = Obesity_Level), alpha = 0.4) +
geom_segment(data = scores_vectors, aes(x = 0, xend = Dim1, y = 0, yend = Dim2),
arrow = arrow(length = unit(0.2, "cm")), color = "black", linewidth = 1) +
geom_text(data = scores_vectors, aes(x = Dim1, y = Dim2, label = Variable),
vjust = -0.5, fontface = "bold", color = "black") +
scale_color_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) +
theme_minimal() +
labs(title = "MDS Interpretation: Variable Vectors",
subtitle = "Arrows indicate the direction of variable influence")
print(plot_mds_vectors)
This plot helps to read the axes as “gradients” of behaviour. Variables with longer arrows have stronger relationships with the configuration.
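The idea behind a fitted arrow can be illustrated with a plain regression. The sketch below is a rough base-R analogue of vector fitting, not vegan's implementation, and its details may differ from `envfit`. It uses classical MDS on a few `mtcars` columns as a stand-in ordination and regresses one variable on the two axes.

```r
# Stand-in ordination: classical MDS on a few standardised mtcars columns.
vars   <- scale(mtcars[, c("mpg", "hp", "wt", "qsec")])
coords <- cmdscale(dist(vars), k = 2)
colnames(coords) <- c("Dim1", "Dim2")

# Regress one variable on the two ordination axes.
fit_wt <- lm(mtcars$wt ~ coords[, "Dim1"] + coords[, "Dim2"])
slopes <- coef(fit_wt)[-1]

direction <- unname(slopes / sqrt(sum(slopes^2)))  # unit-length arrow direction
r2        <- summary(fit_wt)$r.squared             # strength of the relationship
arrow     <- direction * sqrt(r2)                  # arrow scaled by sqrt(r2)

print(round(c(Dim1 = arrow[1], Dim2 = arrow[2], r2 = r2), 3))
```

Under this reading, a long arrow means that the two axes explain much of the variable's variation, and a short arrow means that the variable is largely unrelated to the map.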
While MDS focuses on preserving overall distance structure, t-SNE (t-Distributed Stochastic Neighbor Embedding) focuses more on local neighbourhoods: it tries to keep nearby points together in the low-dimensional space.
We run t-SNE directly on the Gower distance matrix to see whether it can reveal more fine-grained separation between obesity classes.
set.seed(123)
tsne_fit <- Rtsne(gower_dist, is_distance = TRUE, perplexity = 40, max_iter = 1000)
tsne_data <- as.data.frame(tsne_fit$Y)
colnames(tsne_data) <- c("tSNE_1", "tSNE_2")
tsne_data$Obesity_Level <- df$NObeyesdad
# Visualizing t-SNE
plot_tsne <- ggplot(tsne_data, aes(x = tSNE_1, y = tSNE_2, color = Obesity_Level)) +
geom_point(alpha = 0.7, size = 2) +
stat_ellipse(type = "t", level = 0.95, linetype = 2, alpha = 0.5) +
theme_minimal() +
scale_color_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) +
labs(title = "t-SNE Analysis (Perplexity = 40)",
subtitle = "t-SNE emphasises local neighbourhoods of individuals",
x = "t-SNE Dimension 1", y = "t-SNE Dimension 2")
print(plot_tsne)
In the t-SNE map, points from different obesity levels tend to form overlapping but structured clouds. Neighbouring classes (for example Normal_Weight and Overweight_Level_I) often mix, while some more extreme classes can look slightly more concentrated. This supports the idea that the transition between categories is gradual, with clearer differences mainly at the extremes.
As before, this pattern is descriptive: it does not prove any biological mechanism, but it shows how the sample is organised in terms of the recorded habits and characteristics.
To better see the structure of the data, we also project the Gower distances into a 3-dimensional t-SNE space. This can reveal separation that is not visible in 2D.
# Run 3D t-SNE
set.seed(123)
tsne_3d <- Rtsne(gower_dist, is_distance = TRUE, perplexity = 30, dims = 3, max_iter = 1000)
tsne_3d_data <- as.data.frame(tsne_3d$Y)
colnames(tsne_3d_data) <- c("X", "Y", "Z")
tsne_3d_data$Obesity_Level <- df$NObeyesdad
# Generate matching colors
custom_colors <- viridis::viridis(7, begin = 0, end = 0.8, direction = -1)
# Create Plotly object
interactive_plot <- plot_ly(tsne_3d_data, x = ~X, y = ~Y, z = ~Z,
color = ~Obesity_Level, colors = custom_colors,
type = 'scatter3d', mode = 'markers',
marker = list(size = 3)) %>%
layout(title = "3D Interactive t-SNE of Obesity Levels",
scene = list(xaxis = list(title = 'Dim 1'),
yaxis = list(title = 'Dim 2'),
zaxis = list(title = 'Dim 3')))
interactive_plot
Finally, we check how clusters derived from the reduced MDS space relate to the original obesity labels. We use Partitioning Around Medoids (PAM) with k = 7 (the same number as the obesity classes) on the MDS coordinates.
# PAM Clustering
pam_fit <- pam(mds_fit$points, k = 7)
# Confusion Matrix
conf_matrix <- table(Predicted_Cluster = pam_fit$clustering, Actual_Label = df$NObeyesdad)
print(conf_matrix)
## Actual_Label
## Predicted_Cluster Insufficient_Weight Normal_Weight Overweight_Level_I
## 1 65 64 32
## 2 104 59 38
## 3 46 26 24
## 4 13 55 23
## 5 11 37 17
## 6 0 21 77
## 7 28 20 65
## Actual_Label
## Predicted_Cluster Overweight_Level_II Obesity_Type_I Obesity_Type_II
## 1 76 78 1
## 2 14 2 0
## 3 55 86 70
## 4 14 5 0
## 5 33 55 95
## 6 15 75 1
## 7 83 50 130
## Actual_Label
## Predicted_Cluster Obesity_Type_III
## 1 1
## 2 0
## 3 0
## 4 0
## 5 1
## 6 322
## 7 0
# Silhouette Plot
sil_info <- silhouette(pam_fit)
plot(sil_info, col = custom_colors, border = NA, main = "Silhouette Plot for MDS Clustering")
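The confusion matrix shows how well the seven PAM clusters line up with the seven obesity levels. Perfect agreement would mean one cluster per label, which is not realistic in this kind of behavioural dataset. Instead, we see that cluster 6 contains 322 of the 324 Obesity_Type_III individuals (although it also absorbs 77 Overweight_Level_I and 75 Obesity_Type_I cases), that Obesity_Type_II falls almost entirely into clusters 3, 5 and 7, and that Normal_Weight is spread across all seven clusters.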
This pattern is consistent with a messy middle: the boundary between normal weight and mild overweight is gradual, while the most extreme obesity level forms a more distinct group in this dataset.
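A confusion matrix of this kind can be condensed into two simple summaries: how pure each cluster is, and how much of each label ends up in a single cluster. The sketch below shows both on the built-in `iris` data with k = 3, used only as a stand-in; the names `purity` and `label_capture` are chosen here for illustration.

```r
library(cluster)

# Stand-in example: PAM on iris measurements, compared with the species label.
pam_iris <- pam(iris[, 1:4], k = 3)
conf <- table(Cluster = pam_iris$clustering, Label = iris$Species)

# Share of each cluster taken by each label (rows sum to 1).
row_share <- prop.table(conf, margin = 1)

# Purity: the largest label share within each cluster.
purity <- apply(row_share, 1, max)

# Capture: the share of each label that lands in its most common cluster.
label_capture <- apply(prop.table(conf, margin = 2), 2, max)

print(round(purity, 2))
print(round(label_capture, 2))
```

The two views can disagree: a cluster can capture nearly all members of one label and still have low purity if it also absorbs many members of other labels.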
The silhouette plot summarises how compact and well-separated the PAM clusters are in the MDS space. Higher average silhouette values mean that individuals are closer to their own cluster than to other clusters. In this case, the values suggest that the clusters are not perfect, but they carry more structure than would be expected from random noise.
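The silhouette plot can be backed by numbers. As a minimal sketch, again on the built-in `iris` data rather than the MDS coordinates, the average silhouette width overall and per cluster can be read from the silhouette object.

```r
library(cluster)

pam_iris <- pam(iris[, 1:4], k = 3)
sil <- silhouette(pam_iris)

# Overall average silhouette width (also stored in pam_iris$silinfo$avg.width).
avg_width <- mean(sil[, "sil_width"])

# Average silhouette width within each cluster.
cluster_width <- tapply(sil[, "sil_width"], sil[, "cluster"], mean)

print(round(avg_width, 3))
print(round(cluster_width, 3))
```

Values close to 1 indicate compact, well-separated clusters, values near 0 indicate overlapping clusters, and negative values indicate points that sit closer to a neighbouring cluster than to their own.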
This analysis shows how dimensionality reduction and clustering can be used to explore obesity levels in a mixed-type dataset.
Overall, the study does not try to predict obesity or to establish causal links. Instead, it offers a descriptive view of how lifestyle and demographic variables position individuals in a lower-dimensional space, and how this structure relates to the seven obesity categories in the dataset. The combination of a relatively high stress value with a strong Shepard correlation supports the use of these maps for careful, qualitative interpretation, while reminding us that they should not be used as precise quantitative models of distance.
Estimation of Obesity Levels Based On Eating Habits and Physical Condition [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z.
Mardia, K. V. (1978). Some properties of clasical multi-dimesional scaling. Communications in Statistics-Theory and Methods, 7(13), 1233-1241.
Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857-871.
An AI assistant was consulted for assistance with RMarkdown syntax and error debugging.