1. Introduction

This report uses unsupervised learning techniques to analyze the dataset:

Estimation of Obesity Levels Based On Eating Habits and Physical Condition https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+…

The goal is to identify distinct patterns in individuals’ physical characteristics and lifestyle habits that correspond to different levels of obesity.

Unlike supervised tasks, where the known target is used to fit a model, this analysis withholds the obesity label when computing distances and uses it only afterwards to colour and interpret the maps. The focus is on dimensionality reduction to visualize the high-dimensional structure of the data. We aim to see whether lifestyle variables (such as eating habits and transportation modes) naturally group individuals into patterns that relate to clinical obesity classifications.

The main variables in the dataset are:

  • Gender – Biological sex of the individual; Categorical
  • Age – Age in years; Numerical
  • Height – Height in meters; Numerical
  • Weight – Weight in kilograms; Numerical
  • family_history_with_overweight – Has a family member who suffered or suffers from overweight?; Binary (Yes/No)
  • FAVC – Frequent consumption of high caloric food; Binary (Yes/No)
  • FCVC – Frequency of consumption of vegetables; Numerical (Ordinal)
  • NCP – Number of main meals; Numerical
  • CAEC – Consumption of food between meals; Categorical (Always, Frequently, Sometimes, No)
  • SMOKE – Smoking status; Binary (Yes/No)
  • CH2O – Daily water consumption; Numerical (Ordinal)
  • SCC – Calories consumption monitoring; Binary (Yes/No)
  • FAF – Physical activity frequency; Numerical (Ordinal)
  • TUE – Time using technology devices; Numerical (Ordinal)
  • CALC – Consumption of alcohol; Categorical (Always, Frequently, Sometimes, No)
  • MTRANS – Transportation used; Categorical (Auto, Bike, Motorbike, Public, Walking)
  • NObeyesdad – Target Variable: Obesity level classification; Categorical (7 Levels)

1.1 Research Objective

The primary objective is to construct a low-dimensional typology of obesity risk factors. Specifically, we aim to:

  • Quantify Dissimilarity: Calculate a distance metric that handles both numerical (e.g., Age, Weight) and categorical (e.g., Transportation, Family History) data.
  • Visualize Structure: Use Non-Metric Multidimensional Scaling (isoMDS) and t-Distributed Stochastic Neighbor Embedding (t-SNE) to map the 16 predictor variables into 2D and 3D spaces.
  • Validate Results: Statistically assess the quality of these reductions using Shepard diagrams and Stress values.
  • Interpret Drivers: Identify which specific variables drive the separation between obesity levels.

Scope: 2,111 individuals from Mexico, Peru, and Colombia. Variables: 17 in total, namely 16 predictors covering dietary habits, physical condition, and demographic data, plus the obesity-level target.


2. Data and Feature Engineering

2.1 Loading Packages

We use R packages designed for mixed-type data analysis (cluster), dimension reduction (MASS, Rtsne, vegan), and visualization (ggplot2, plotly, viridis).

library(tidyverse)
library(cluster)
library(Rtsne)
library(corrplot)
library(gridExtra)
library(MASS)      # loaded after tidyverse, so MASS::select() masks dplyr::select()
library(vegan)
library(plotly)
library(viridis)
library(knitr)

set.seed(123)

2.2 Reading and Cleaning Data

The dataset contains a mix of numerical and categorical variables. Before running isoMDS, we must make sure that no two individuals are identical on every predictor: identical rows have a dissimilarity of exactly zero, and MASS::isoMDS stops with an error when it encounters a zero or negative distance between two objects.

We load the data and remove duplicate entries based on the predictor variables.

df <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv", stringsAsFactors = TRUE)

print(paste("Original dataset dimensions:", nrow(df), "rows,", ncol(df), "columns"))
## [1] "Original dataset dimensions: 2111 rows, 17 columns"
# We exclude the target variable 'NObeyesdad' to check for identical features
df <- df[!duplicated(dplyr::select(df, -NObeyesdad)), ]

print(paste("Cleaned dataset dimensions:", nrow(df), "rows,", ncol(df), "columns."))
## [1] "Cleaned dataset dimensions: 2087 rows, 17 columns."
ordered_levels <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I", 
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")
df$NObeyesdad <- factor(df$NObeyesdad, levels = ordered_levels)
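
To illustrate why the duplicate check matters, here is a minimal sketch on a hypothetical four-row toy data frame (the values and the names `toy`, `toy_dist`, and `mds_error` are invented for illustration and are not taken from the obesity data). It shows that a duplicated row produces a Gower dissimilarity of zero, which isoMDS rejects, and that dropping the duplicate removes the problem.

```r
library(cluster)  # daisy()
library(MASS)     # isoMDS()

# Hypothetical toy data: rows 1 and 2 are identical on every feature
toy <- data.frame(
  Age    = c(21, 21, 35, 50),
  Weight = c(60, 60, 82, 95),
  MTRANS = factor(c("Walking", "Walking", "Automobile", "Bike"))
)

toy_dist <- daisy(toy, metric = "gower")
min_dist <- min(toy_dist)  # zero, because of the duplicated row

# isoMDS() refuses zero dissimilarities; capture the error instead of stopping
mds_error <- tryCatch(
  {
    isoMDS(toy_dist, k = 2, trace = FALSE)
    NA_character_
  },
  error = function(e) conditionMessage(e)
)

# After dropping the duplicate, every pairwise dissimilarity is positive
toy_clean_dist <- daisy(toy[!duplicated(toy), ], metric = "gower")
```

Inspecting `mds_error` shows the message isoMDS raises, and `min(toy_clean_dist) > 0` confirms that the cleaned data can be passed on.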

2.3 Exploratory Data Analysis (EDA)

p1 <- ggplot(df, aes(x = NObeyesdad, fill = NObeyesdad)) +
  geom_bar() +
  theme_minimal() +
  coord_flip() +
  scale_fill_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) + 
  labs(title = "Distribution of Obesity Levels", 
       x = "Obesity Level", y = "Count") +
  theme(legend.position = "none")

print(p1)

# Correlation Matrix
numeric_vars <- df %>% select_if(is.numeric)
corrplot(cor(numeric_vars), method = "color", type = "upper", 
         tl.col = "black", title = "Correlation of Numerical Features", mar=c(0,0,1,0))

We see strong correlations between weight and height variables, which is expected and shows that some numerical predictors carry similar information. However, these simple correlations only use numeric variables and ignore important categorical variables (like type of transport or family history). This is one reason why we move to a more flexible distance measure.


3. Methodology: Distance Calculation

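Standard Euclidean distance is not suitable for this dataset because it is only defined for numeric coordinates: it has no natural way to measure the difference between categories such as “Public Transportation” and “Automobile”.

To solve this, we use the Gower distance. For each pair of individuals, this metric:

  • Uses the absolute difference divided by the variable’s range for numerical variables.
  • Uses simple matching (0 if the categories agree, 1 if they differ) for categorical variables.

The per-variable values are then averaged. This gives a dissimilarity matrix where 0 means the two individuals are identical on every variable and 1 means they are maximally different on every variable.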
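
As a worked example of the two rules above, the sketch below computes one Gower dissimilarity by hand and compares it with the value returned by daisy. The three-row data frame `people` and the helper `gower_manual` are hypothetical and exist only for this illustration; the helper assumes no missing values and treats every non-numeric column as nominal.

```r
library(cluster)  # daisy()

# Hypothetical toy data with two numeric variables and one nominal variable
people <- data.frame(
  Age    = c(20, 30, 40),
  Weight = c(50, 70, 100),
  MTRANS = factor(c("Walking", "Automobile", "Walking"))
)

# Hand-rolled Gower dissimilarity between rows i and j (illustrative helper)
gower_manual <- function(i, j, data) {
  per_variable <- sapply(data, function(col) {
    if (is.numeric(col)) {
      abs(col[i] - col[j]) / diff(range(col))  # range-scaled difference
    } else {
      as.numeric(col[i] != col[j])             # simple matching: 0 or 1
    }
  })
  mean(per_variable)                           # average over variables
}

d_manual <- gower_manual(1, 2, people)
d_daisy  <- as.matrix(daisy(people, metric = "gower"))[1, 2]
```

For rows 1 and 2 the three contributions are 10/20 for Age, 20/50 for Weight, and 1 for MTRANS, so both `d_manual` and `d_daisy` equal (0.5 + 0.4 + 1) / 3.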

# Excluding the target label (explicit dplyr:: prefix because MASS masks select())
df_features <- df %>% dplyr::select(-NObeyesdad)

gower_dist <- daisy(df_features, metric = "gower")

4. Non-Metric Multidimensional Scaling (isoMDS)

We employ Non-Metric MDS (isoMDS) rather than Classical MDS (cmdscale) because Gower dissimilarities on mixed-type data are not guaranteed to behave like Euclidean distances, so their rank order is more trustworthy than their exact magnitudes. Non-metric MDS tries to preserve the rank ordering of distances rather than the exact values.

The algorithm minimizes “Stress”, a measure of how far the distances in the 2D map depart from a monotone transformation of the original high-dimensional dissimilarities.

# Non-Metric MDS
mds_fit <- isoMDS(gower_dist, k = 2, trace = FALSE) 

# Extracting coordinates
mds_data <- as.data.frame(mds_fit$points)
colnames(mds_data) <- c("Dim1", "Dim2")
mds_data$Obesity_Level <- df$NObeyesdad

plot_mds <- ggplot(mds_data, aes(x = Dim1, y = Dim2, color = Obesity_Level)) +
  geom_point(alpha = 0.6, size = 2) +
  # Ellipses
  stat_ellipse(type = "norm", level = 0.95, linetype = 2, alpha = 0.5) +
  theme_minimal() +
  scale_color_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) +
  labs(title = "Non-Metric MDS (Gower Distance)",
       subtitle = paste("Stress Value:", round(mds_fit$stress, 3), "%"),
       x = "Dimension 1", y = "Dimension 2")

print(plot_mds)

The MDS plot shows how individuals with different obesity levels are arranged in a 2D space. Neighbouring levels often overlap, which is expected because the differences between adjacent classes (for example Normal and Overweight_Level_I) are subtle and partly based on cut-off choices.

The reported stress is around 25.5%. By Kruskal’s widely used rule of thumb, stress values above about 20% indicate a poor fit. This means that compressing the 16 predictor variables into only two dimensions inevitably distorts some distances, so individual distances on the map should not be read literally.
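
For readers who want to see what this number measures, the sketch below recomputes the stress from the Shepard quantities returned by MASS. It uses the built-in swiss data purely as a stand-in (the names `toy_dist`, `toy_mds`, and `stress_manual` are illustrative), and it assumes the Kruskal stress-1 definition in percent, which is how the isoMDS documentation describes its stress component.

```r
library(MASS)  # isoMDS(), Shepard()

# Built-in 'swiss' data, used only as a stand-in for illustration
toy_dist <- dist(scale(swiss))
toy_mds  <- isoMDS(toy_dist, k = 2, trace = FALSE)

# Shepard() returns the map distances (y) and their
# monotone-regression fit (yf) for every pair of objects
sh <- Shepard(toy_dist, toy_mds$points)

# Kruskal's stress-1, expressed in percent
stress_manual <- 100 * sqrt(sum((sh$y - sh$yf)^2) / sum(sh$y^2))

# Compare with the value reported by isoMDS()
stress_reported <- toy_mds$stress
```

The two values should agree up to numerical precision, which makes explicit that stress is a normalised residual between the map distances and their best monotone fit.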

4.1 Statistical Validation

To check how well the 2D map reflects the original distances, we use:

  • Shepard Correlation: The Pearson correlation between the original Gower distances and the new 2D distances, which measures how linearly the two are related.
  • Shepard Diagram: A scatter plot of these pairs of distances.

# 1. Shepard Correlation
mds_2d_dist <- dist(mds_fit$points)
fit_correlation <- cor(gower_dist, mds_2d_dist)
print(paste("Shepard Correlation (Validity of 2D Map):", round(fit_correlation, 4)))
## [1] "Shepard Correlation (Validity of 2D Map): 0.8497"
# 2. Shepard Plot (Using a random sample of 5000)
set.seed(123)
sample_indices <- sample(length(gower_dist), 5000) 
shepard_df <- data.frame(
  Original = as.numeric(gower_dist)[sample_indices],
  MDS = as.numeric(mds_2d_dist)[sample_indices]
)

plot_shepard <- ggplot(shepard_df, aes(x = Original, y = MDS)) +
  geom_point(alpha = 0.3, size = 1) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  theme_minimal() +
  labs(title = "Shepard Plot (Goodness of Fit)", 
       subtitle = paste("Correlation:", round(fit_correlation, 3), "- Higher values mean better preservation of distances"))

print(plot_shepard)

In this analysis, the Shepard correlation is about 0.85. This is high and shows a strong positive, roughly linear association between the original distances and the distances in the 2D map. Because it is a Pearson coefficient, it does not by itself measure monotonicity; a rank-based (Spearman) correlation would be the more direct check of the rank preservation that non-metric MDS targets.

Taken together, the results indicate that the stress value is high (about 25.5%), which tells us that there is noticeable distortion when we force the data into two dimensions. At the same time, the Shepard correlation (about 0.85) suggests that the broad pattern of large and small distances is still preserved to a large extent. This means that the 2D MDS map is not suitable for precise distance measurement, but it is useful for qualitative exploration of the overall structure and for comparing groups.
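
Since non-metric MDS targets rank order, a rank-based correlation is a natural companion to the Pearson value. The sketch below shows how both could be computed side by side; it again uses the built-in swiss data as a stand-in, so the resulting numbers say nothing about the obesity data, and all object names are illustrative.

```r
library(MASS)  # isoMDS()

# Stand-in data; replace with the real dissimilarities and MDS fit
toy_dist <- dist(scale(swiss))
toy_mds  <- isoMDS(toy_dist, k = 2, trace = FALSE)
map_dist <- dist(toy_mds$points)

# Flatten both 'dist' objects to plain vectors of pairwise distances
original <- as.numeric(toy_dist)
mapped   <- as.numeric(map_dist)

# Pearson: strength of the linear association
shepard_pearson <- cor(original, mapped)

# Spearman: strength of the monotone (rank-order) association
shepard_spearman <- cor(original, mapped, method = "spearman")
```

On the real data, the same two calls applied to `gower_dist` and `mds_2d_dist` would show whether the rank-based agreement is higher or lower than the reported Pearson value.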

4.2 Interpreting the Axes

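We project the original variables onto the MDS ordination using the envfit function from the vegan package. For numeric variables, envfit fits an arrow that points in the direction in which the variable increases most rapidly across the map. Categorical variables are summarised by envfit as level centroids rather than arrows, so only the numeric variables appear in the plot below.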

fit_vectors <- envfit(mds_fit, df_features, permutations = 999, na.rm = TRUE)

scores_vectors <- as.data.frame(scores(fit_vectors, display = "vectors"))
scores_vectors$Variable <- rownames(scores_vectors)

plot_mds_vectors <- ggplot(mds_data, aes(x = Dim1, y = Dim2)) +
  geom_point(aes(color = Obesity_Level), alpha = 0.4) +
  geom_segment(data = scores_vectors, aes(x = 0, xend = Dim1, y = 0, yend = Dim2), 
               arrow = arrow(length = unit(0.2, "cm")), color = "black", linewidth = 1) +
  geom_text(data = scores_vectors, aes(x = Dim1, y = Dim2, label = Variable), 
            vjust = -0.5, fontface = "bold", color = "black") +
  scale_color_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) +
  theme_minimal() +
  labs(title = "MDS Interpretation: Variable Vectors",
       subtitle = "Arrows indicate the direction of variable influence")

print(plot_mds_vectors)

This plot helps to read the axes as “gradients” of behaviour. Arrow length is scaled by the strength of the fit, so numeric variables with longer arrows have stronger relationships with the configuration.
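
The arrows cover numeric variables only. As a hedged sketch of how the categorical side of an envfit result could be inspected, the example below uses the dune and dune.env example data shipped with vegan rather than the obesity data; the object names are illustrative and the permutation count is kept small only to make the example quick.

```r
library(MASS)   # isoMDS()
library(vegan)  # vegdist(), envfit(), scores(), example data

data(dune)      # example community data from vegan
data(dune.env)  # matching environmental data: one numeric column, several factors

set.seed(123)
ord <- isoMDS(vegdist(dune), k = 2, trace = FALSE)
fit <- envfit(ord, dune.env, permutations = 99)

# Arrow end points for numeric variables (scaled by the strength of the fit)
vector_scores <- scores(fit, display = "vectors")

# Average map position (centroid) of each factor level
factor_centroids <- scores(fit, display = "factors")
```

Applied to `fit_vectors`, the `display = "factors"` call would give one centroid per level of variables such as MTRANS or CAEC, which could be added to the plot as labelled points.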


5. t-SNE Method

While MDS focuses on preserving overall distance structure, t-SNE (t-Distributed Stochastic Neighbor Embedding) focuses more on local neighbourhoods: it tries to keep nearby points together in the low-dimensional space.

We run t-SNE directly on the Gower distance matrix to see whether it can reveal more fine-grained separation between obesity classes.

set.seed(123)
tsne_fit <- Rtsne(gower_dist, is_distance = TRUE, perplexity = 40, max_iter = 1000)

tsne_data <- as.data.frame(tsne_fit$Y)
colnames(tsne_data) <- c("tSNE_1", "tSNE_2")
tsne_data$Obesity_Level <- df$NObeyesdad

# Visualizing t-SNE
plot_tsne <- ggplot(tsne_data, aes(x = tSNE_1, y = tSNE_2, color = Obesity_Level)) +
  geom_point(alpha = 0.7, size = 2) +
  stat_ellipse(type = "t", level = 0.95, linetype = 2, alpha = 0.5) +
  theme_minimal() +
  scale_color_viridis_d(option = "viridis", begin = 0, end = 0.8, direction = -1) +
  labs(title = "t-SNE Analysis (Perplexity = 40)",
       subtitle = "t-SNE emphasises local neighbourhoods of individuals",
       x = "t-SNE Dimension 1", y = "t-SNE Dimension 2")

print(plot_tsne)

In the t-SNE map, points from different obesity levels tend to form overlapping but structured clouds. Neighbouring classes (for example Normal_Weight and Overweight_Level_I) often mix, while some more extreme classes can look slightly more concentrated. This supports the idea that the transition between categories is gradual, with clearer differences mainly at the extremes.

This pattern is descriptive: it does not prove any biological mechanism, but it shows how the sample is organised in terms of the recorded habits and characteristics.
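
Because t-SNE maps depend on the perplexity setting, it can be worth checking that the qualitative picture is stable across a few values. The sketch below shows one way to do this on a distance matrix; it uses the built-in iris measurements as a stand-in, and the perplexity values and object names are illustrative choices rather than recommendations.

```r
library(Rtsne)

# Stand-in data: unique rows of the numeric iris columns
toy      <- unique(iris[, 1:4])
toy_dist <- dist(scale(toy))

# Illustrative perplexity values; each must be small relative to nrow(toy)
perplexities <- c(5, 30, 40)

embeddings <- lapply(perplexities, function(p) {
  set.seed(123)  # same seed for each run so only perplexity differs
  Rtsne(toy_dist, is_distance = TRUE, perplexity = p, max_iter = 500)$Y
})
names(embeddings) <- paste0("perplexity_", perplexities)
```

Each element of `embeddings` is a two-column coordinate matrix that can be plotted with the same ggplot code as above; groupings that appear at every perplexity are more credible than those that appear at only one.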


6. Interactive 3D Visualization

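To better see the structure of the data, we also embed the Gower distances in a 3-dimensional t-SNE space. A third dimension can reveal separation that is not visible in 2D. Note that this run uses a perplexity of 30 rather than the 40 used for the 2D map, so the two embeddings are not directly comparable.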

# Run 3D t-SNE
set.seed(123)
tsne_3d <- Rtsne(gower_dist, is_distance = TRUE, perplexity = 30, dims = 3, max_iter = 1000)
tsne_3d_data <- as.data.frame(tsne_3d$Y)
colnames(tsne_3d_data) <- c("X", "Y", "Z")
tsne_3d_data$Obesity_Level <- df$NObeyesdad

# Generate matching colors
custom_colors <- viridis::viridis(7, begin = 0, end = 0.8, direction = -1)

# Create Plotly object
interactive_plot <- plot_ly(tsne_3d_data, x = ~X, y = ~Y, z = ~Z, 
                            color = ~Obesity_Level, colors = custom_colors, 
                            type = 'scatter3d', mode = 'markers', 
                            marker = list(size = 3)) %>%
  layout(title = "3D Interactive t-SNE of Obesity Levels",
         scene = list(xaxis = list(title = 'Dim 1'),
                      yaxis = list(title = 'Dim 2'),
                      zaxis = list(title = 'Dim 3')))

interactive_plot

7. Clustering Validation (PAM)

Finally, we check how clusters derived from the reduced MDS space relate to the original obesity labels. We use Partitioning Around Medoids (PAM) with k = 7 (the same number as the obesity classes) on the MDS coordinates.

# PAM Clustering
pam_fit <- pam(mds_fit$points, k = 7)

# Confusion Matrix
conf_matrix <- table(Predicted_Cluster = pam_fit$clustering, Actual_Label = df$NObeyesdad)
print(conf_matrix)
##                  Actual_Label
## Predicted_Cluster Insufficient_Weight Normal_Weight Overweight_Level_I
##                 1                  65            64                 32
##                 2                 104            59                 38
##                 3                  46            26                 24
##                 4                  13            55                 23
##                 5                  11            37                 17
##                 6                   0            21                 77
##                 7                  28            20                 65
##                  Actual_Label
## Predicted_Cluster Overweight_Level_II Obesity_Type_I Obesity_Type_II
##                 1                  76             78               1
##                 2                  14              2               0
##                 3                  55             86              70
##                 4                  14              5               0
##                 5                  33             55              95
##                 6                  15             75               1
##                 7                  83             50             130
##                  Actual_Label
## Predicted_Cluster Obesity_Type_III
##                 1                1
##                 2                0
##                 3                0
##                 4                0
##                 5                1
##                 6              322
##                 7                0
# Silhouette Plot
sil_info <- silhouette(pam_fit)
plot(sil_info, col = custom_colors, border = NA, main = "Silhouette Plot for MDS Clustering")

The confusion matrix shows how well the seven PAM clusters line up with the seven obesity levels. Perfect agreement would mean one cluster per label, which is not realistic in this kind of behavioural dataset. Instead, we see:

  • One label is almost entirely captured by a single cluster: 322 of the 324 Obesity_Type_III individuals fall in cluster 6. That cluster is not pure, however, since it also contains sizeable numbers of Overweight_Level_I and Obesity_Type_I individuals.
  • Insufficient_Weight is only moderately concentrated: its largest share (104 of 267) is in cluster 2, with the rest spread mainly over clusters 1, 3, and 7.
  • Middle categories, such as Normal_Weight and Overweight_Level_I, are more mixed. They are split across several clusters and often appear together.

This pattern is consistent with a messy middle: the boundary between normal weight and mild overweight is gradual, while the most extreme obesity level forms a more distinct group in this dataset.
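
The concentration of each label can be quantified directly from the table. The sketch below re-enters the counts printed above as a plain matrix (so that it runs on its own) and computes, for every obesity level, the share of its individuals that falls in its single largest cluster. The object names `obesity_labels`, `label_share`, `top_share`, and `cluster_purity` are illustrative.

```r
obesity_labels <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                    "Obesity_Type_III")

# Counts copied from the printed confusion matrix (rows = PAM clusters 1 to 7)
conf_matrix <- matrix(
  c( 65, 64, 32, 76, 78,   1,   1,
    104, 59, 38, 14,  2,   0,   0,
     46, 26, 24, 55, 86,  70,   0,
     13, 55, 23, 14,  5,   0,   0,
     11, 37, 17, 33, 55,  95,   1,
      0, 21, 77, 15, 75,   1, 322,
     28, 20, 65, 83, 50, 130,   0),
  nrow = 7, byrow = TRUE,
  dimnames = list(Predicted_Cluster = as.character(1:7),
                  Actual_Label = obesity_labels)
)

# Column-wise proportions: how each label is distributed over the clusters
label_share <- prop.table(conf_matrix, margin = 2)

# Largest single-cluster share per label (1 would mean perfect concentration)
top_share <- apply(label_share, 2, max)

# Row-wise view: share of each cluster taken up by its dominant label
cluster_purity <- apply(prop.table(conf_matrix, margin = 1), 1, max)
```

With these counts, `top_share` is about 0.99 for Obesity_Type_III (322 of 324) and about 0.39 for Insufficient_Weight (104 of 267). On the real objects, the same `prop.table()` calls can be applied to `conf_matrix` as produced by `table()`.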

The silhouette plot summarises how compact and well-separated the PAM clusters are in the MDS space. Silhouette widths range from -1 to 1; higher average values mean that individuals are closer to their own cluster than to the nearest other cluster. In this case, the values suggest that the clusters are not perfect, but they carry more structure than would be expected from random noise.
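
The silhouette summary can also be read off numerically instead of from the plot. The following sketch shows where a pam fit stores these values; it clusters the built-in iris measurements as a stand-in, so the numbers are not those of the obesity analysis, and the object names are illustrative.

```r
library(cluster)  # pam()

# Stand-in data and clustering; replace with the real pam fit
toy     <- scale(iris[, 1:4])
toy_pam <- pam(toy, k = 3)

# Overall average silhouette width (between -1 and 1)
avg_width <- toy_pam$silinfo$avg.width

# Average silhouette width per cluster
cluster_widths <- toy_pam$silinfo$clus.avg.widths
```

For the analysis above, `pam_fit$silinfo$avg.width` would give the single figure that backs up the visual impression from the silhouette plot.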


8. Conclusions

This analysis shows how dimensionality reduction and clustering can be used to explore obesity levels in a mixed-type dataset.

  • Gower Distance and isoMDS provide a 2D map that reflects a large part of the original dissimilarities between individuals. The stress value is high (about 25.5%), which indicates that some distance information is lost when we compress 16 predictor variables into 2 dimensions. However, the Shepard correlation is also high (about 0.85), which suggests that the broad structure of distances is still preserved to a useful degree. In other words, the map is not exact, but it is informative for visual and qualitative analysis.
  • t-SNE focuses on local relationships and helps to visualise small clusters and denser regions. Middle categories (for example normal weight and mild overweight) tend to overlap, while more extreme categories can appear more compact. This is consistent with the idea of a “messy middle”, where neighbouring weight categories are hard to separate cleanly.
  • The Shepard correlation indicates that the 2D MDS map keeps a meaningful amount of information from the original 16 predictor variables, even though some information is inevitably lost. The silhouette values describe a different property, namely how well separated the PAM clusters are within that map.
  • PAM clustering in the reduced space does not reproduce the obesity labels one-to-one, but it reveals a clear pattern:
    • The most extreme level (Obesity_Type_III) is grouped quite consistently, which suggests that individuals in this category share more similar combinations of recorded habits in this dataset.
    • Normal_Weight and Overweight_Level_I are less clearly separated and often appear together, which matches the real-world intuition that the boundary between normal and slightly overweight is gradual rather than sharp.

Overall, the study does not try to predict obesity or to establish causal links. Instead, it offers a descriptive view of how lifestyle and demographic variables position individuals in a lower-dimensional space, and how this structure relates to the seven obesity categories in the dataset. The combination of a relatively high stress value with a strong Shepard correlation supports the use of these maps for careful, qualitative interpretation, while reminding us that they should not be used as precise quantitative models of distance.


Bibliography

Estimation of Obesity Levels Based On Eating Habits and Physical Condition [Dataset]. (2019). UCI Machine Learning Repository. https://doi.org/10.24432/C5H31Z.

Mardia, K. V. (1978). Some properties of clasical multi-dimesional scaling. Communications in Statistics-Theory and Methods, 7(13), 1233-1241.

Gower, J. C. (1971). A general coefficient of similarity and some of its properties. Biometrics, 27(4), 857-871.

An AI assistant was consulted for assistance with RMarkdown syntax and error debugging.