Obesity is a growing global health concern strongly linked to behavioral factors like diet, physical activity, and transportation. While studies such as Palechor & De la Hoz Manotas (2019) have applied machine learning to predict obesity, they often emphasize accuracy over interpretability. This project proposes a categorical data analysis approach to uncover direct statistical associations and visualize how lifestyle behaviors correspond with obesity. By prioritizing explainability, we aim to provide a health-relevant perspective that complements “black box” predictive models.
The primary objectives of this analysis are to:
investigate associations between categorical lifestyle factors and obesity categories;
identify which variables show the strongest relationships with obesity status;
visualize multidimensional associations using Correspondence Analysis (CA);
compare findings with results from previous machine-learning studies.
The study utilizes the UCI “Estimation of Obesity Levels
Based on Eating Habits and Physical Condition” dataset
available via the UCI Machine Learning Repository. The sample consists
of 2,111 individuals from Mexico, Peru, and Colombia. The target
variable, NObeyesdad (Obesity Level), contains seven levels
ranging from Insufficient Weight to Obesity Type III.
The dataset includes 16 predictors covering demographics, eating
habits, and physical activity; most are categorical or ordinal, while
Age, Height, and Weight are numeric. Notably, 23% of the data was
collected directly from users via a web platform, while 77% was
generated synthetically using the Weka tool and the SMOTE filter.
After preprocessing, Age is binned into groups and the lifestyle predictors are treated as categorical or ordinal factors; Height and Weight remain numeric.
The dataset contains a mix of real and synthetic data generated using SMOTE (Synthetic Minority Over-sampling Technique). As a result, several integer-coded variables (e.g., number of meals, vegetable consumption) appear as decimal values. To perform categorical analysis, we first preprocess the data by renaming the variables, rounding the affected values to the nearest integer, binning age into groups, and converting the variables to factors:
library(tidyverse)
library(gtsummary)

# 1. Load data
raw_data <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv")

# 2. Data cleaning
obesity_clean <- raw_data %>%
  # rename cryptic columns (Gender, Age, Height, Weight keep their names)
  rename(FamilyHistory = family_history_with_overweight,
         HighCaloricFood = FAVC,
         VegConsumption = FCVC,
         MealsPerDay = NCP,            # number of main meals/day
         SnackBetweenMeals = CAEC,
         Smoker = SMOKE,
         WaterIntake = CH2O,
         CalorieMonitoring = SCC,
         PhysicalActivity = FAF,
         TechUseTime = TUE,
         Alcohol = CALC,
         Transport = MTRANS,
         ObesityLevel = NObeyesdad) %>%
  # round SMOTE-generated decimals to the nearest integer
  mutate(VegConsumption = round(VegConsumption),
         MealsPerDay = round(MealsPerDay),
         WaterIntake = round(WaterIntake),
         PhysicalActivity = round(PhysicalActivity),
         TechUseTime = round(TechUseTime)) %>%
  # create age groups; cut() is right-closed by default, so the bins are
  # (0,20], (20,30], (30,40], (40,100] and an age of exactly 20 is "Under 20"
  mutate(AgeGroup = cut(Age,
                        breaks = c(0, 20, 30, 40, 100),
                        labels = c("Under 20", "20-29", "30-39", "40+"))) %>%
  # convert to factors
  mutate(ObesityLevel = factor(ObesityLevel,
                               levels = c("Insufficient_Weight",
                                          "Normal_Weight",
                                          "Overweight_Level_I",
                                          "Overweight_Level_II",
                                          "Obesity_Type_I",
                                          "Obesity_Type_II",
                                          "Obesity_Type_III"),
                               ordered = TRUE),
         SnackBetweenMeals = factor(SnackBetweenMeals,
                                    levels = c("no", "Sometimes", "Frequently", "Always"),
                                    ordered = TRUE),
         Alcohol = factor(Alcohol,
                          levels = c("no", "Sometimes", "Frequently", "Always"),
                          ordered = TRUE),
         # nominal factors (no order) and integer-to-factor
         across(c(Gender, FamilyHistory, HighCaloricFood, Smoker,
                  CalorieMonitoring, Transport, AgeGroup), as.factor),
         across(c(VegConsumption, MealsPerDay, WaterIntake,
                  PhysicalActivity, TechUseTime), as.factor))
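One detail of the age binning is worth checking: `cut()` closes its intervals on the right by default, so an age of exactly 20 lands in "Under 20" rather than "20-29". The sketch below uses a few made-up ages (not taken from the dataset) to contrast the default with `right = FALSE`. Which convention the analysis intends is an assumption to confirm, and switching it would change the age-group counts reported later.

```r
# Hypothetical ages chosen to sit on or near the bin edges
ages   <- c(19, 20, 21, 30, 40)
breaks <- c(0, 20, 30, 40, 100)
labels <- c("Under 20", "20-29", "30-39", "40+")

# Default (right = TRUE): (0,20], (20,30], (30,40], (40,100]
default_bins <- cut(ages, breaks = breaks, labels = labels)

# Left-closed alternative: [0,20), [20,30), [30,40), [40,100)
left_closed_bins <- cut(ages, breaks = breaks, labels = labels, right = FALSE)

data.frame(age = ages, default = default_bins, left_closed = left_closed_bins)
```

With the default, ages 20, 30, and 40 fall into the lower bin; with `right = FALSE` they fall into the bin whose label starts with that age.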
Before proceeding, we verify that the dataset contains no missing values and inspect the distribution of the target variable.
# 1. check for missing values
any_missing <- sum(is.na(obesity_clean))
cat("Total Missing Values in Dataset:", any_missing, "\n")
## Total Missing Values in Dataset: 0
# 2. verify target variable balance
# since SMOTE was used, we expect these counts to be roughly equal.
table(obesity_clean$ObesityLevel)
##
## Insufficient_Weight Normal_Weight Overweight_Level_I Overweight_Level_II
## 272 287 290 290
## Obesity_Type_I Obesity_Type_II Obesity_Type_III
## 351 297 324
| Original Name | New Name | Description | Coding / Units / Levels |
|---|---|---|---|
| Gender | Gender | Gender of the individual | Male, Female |
| Age | Age | Age of the individual | Years (Numeric) |
| Height | Height | Height of the individual | Meters (Numeric) |
| Weight | Weight | Weight of the individual | Kilograms (Numeric) |
| family_history… | FamilyHistory | Has a family member who is overweight | yes, no |
| FAVC | HighCaloricFood | Frequent consumption of high caloric food | yes, no |
| FCVC | VegConsumption | Frequency of vegetable consumption | 1 = Never; 2 = Sometimes; 3 = Always |
| NCP | MealsPerDay | Number of main meals per day | 1, 2, 3, or 4 meals |
| CAEC | SnackBetweenMeals | Consumption of food between meals | no, Sometimes, Frequently, Always |
| SMOKE | Smoker | Does the person smoke? | yes, no |
| CH2O | WaterIntake | Daily water consumption | 1 = Less than 1 L; 2 = Between 1–2 L; 3 = More than 2 L |
| SCC | CalorieMonitoring | Monitors calorie consumption | yes, no |
| FAF | PhysicalActivity | Physical activity frequency | 0 = None; 1 = 1–2 days/week; 2 = 2–4 days/week; 3 = 4–5 days/week |
| TUE | TechUseTime | Time using technology devices | 0 = 0–2 hours; 1 = 3–5 hours; 2 = More than 5 hours |
| CALC | Alcohol | Alcohol consumption frequency | no, Sometimes, Frequently, Always |
| MTRANS | Transport | Main mode of transportation | Public_Transportation, Walking, Automobile, Motorbike, Bike |
| NObeyesdad | ObesityLevel | Target variable (obesity classification) | Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III |
This stacked bar chart visualizes the relationship between the mode of transportation and obesity categories.
library(ggplot2)
library(viridis)
library(viridisLite)

ggplot(obesity_clean, aes(x = Transport, fill = ObesityLevel)) +
  geom_bar(position = "fill") +
  labs(
    title = "Obesity Levels by Transportation Mode",
    x = "Transportation Mode",
    y = "Proportion",
    fill = "Obesity Level"
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
The chart suggests an association between active transportation (Walking/Bike) and lower weight categories, as those bars consist mostly of ‘Normal_Weight’ and ‘Overweight_Level_I’. These groups are small, however (56 walkers and 7 cyclists in the sample), so their proportions are unstable. In contrast, ‘Public_Transportation’ shows the highest prevalence of ‘Obesity_Type_III’ (yellow), while ‘Automobile’ users show the most varied distribution across weight classes.
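The proportions behind the stacked bars can be recomputed from the Transport by ObesityLevel counts printed later in this report. The sketch below rebuilds that table by hand (the object names `transport_tab`, `row_n`, and `row_props` are illustrative) and converts each row to proportions, which also makes the very different group sizes visible.

```r
levels_ob <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
               "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
               "Obesity_Type_III")

# Counts copied from the Transport x ObesityLevel cross-tabulation
transport_tab <- matrix(
  c( 46,  45,  66,  94, 110,  95,   1,
      0,   4,   2,   0,   0,   1,   0,
      0,   6,   1,   1,   3,   0,   0,
    220, 200, 212, 189, 236, 200, 323,
      6,  32,   9,   6,   2,   1,   0),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Transport = c("Automobile", "Bike", "Motorbike",
                  "Public_Transportation", "Walking"),
    ObesityLevel = levels_ob))

row_n     <- rowSums(transport_tab)               # group sizes
row_props <- prop.table(transport_tab, margin = 1) # each row sums to 1

round(cbind(n = row_n, row_props), 2)
```

Reading the `n` column next to the proportions helps separate stable patterns (Public_Transportation, Automobile) from those resting on a handful of individuals (Bike, Motorbike).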
This box plot compares the age distribution across different obesity categories. The horizontal line inside each box represents the median age for that group.
ggplot(obesity_clean, aes(x = ObesityLevel, y = Age, fill = ObesityLevel)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red", outlier.shape = 1) +
  labs(title = "Age Distribution across Obesity Levels",
       subtitle = "Comparing median ages and variability",
       x = "Obesity Category",
       y = "Age (Years)",
       fill = "Obesity Level") +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") # don't need legend
The box plot reveals distinct age-related patterns across obesity categories:
Younger Demographics: Individuals classified as Insufficient Weight, Normal Weight, and Overweight Level I differ from the heavier groups, with median ages typically concentrated in the “Under 25” demographic.
Peak Age in Middle Categories: A shift is observed for Overweight Level II, Obesity Type I, and Obesity Type II, where the median age increases, suggesting these categories are more prevalent in individuals in their late 20s to 30s.
Low Variability in Obesity Type III: The Obesity Type III group exhibits a distinct distribution with very low interquartile variability (a short box), suggesting a highly clustered age group.
This jitter plot visualizes individual data points to show the density of physical activity habits across obesity categories. We add a small amount of random noise (“jitter”) to separate overlapping points.
ggplot(obesity_clean, aes(x = ObesityLevel,
                          y = PhysicalActivity,
                          color = ObesityLevel)) +
  geom_jitter(alpha = 0.6, width = 0.2, height = 0.2, size = 1.5) +
  labs(title = "Physical Activity Frequency by Obesity Level",
       subtitle = "Visualizing the density of activity habits",
       x = "Obesity Category", y = "Physical Activity Frequency",
       color = "Obesity Level") +
  scale_color_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")
The jitter plot suggests an inverse relationship between physical activity frequency (0 to 3) and obesity level, although the pattern is not uniform across categories.
The band for zero physical activity is densest for Obesity Type III, which is consistent with inactivity being associated with severe obesity.
The highest activity level (3) contains no individuals with Obesity Type II or III; it is populated mainly by the lower and middle weight categories.
This 100% stacked bar chart compares the distribution of obesity levels between individuals with and without a family history of overweight.
ggplot(obesity_clean, aes(x = FamilyHistory, fill = ObesityLevel)) +
  geom_bar(position = "fill") +
  labs(title = "Obesity Levels by Family History",
       subtitle = "Comparing those with vs. without a family history of overweight",
       x = "Family History of Overweight",
       y = "Proportion",
       fill = "Obesity Level") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(legend.position = "right")
There is a strong association between family history and obesity level: individuals with a family history of overweight are far more likely to fall into the obese categories than those without one.
To evaluate the relationship between categorical lifestyle variables (e.g., Transportation, Family History) and Obesity Levels, we employ Pearson’s Chi-Square Test of Independence. This test determines whether the observed distribution of frequencies differs significantly from what would be expected under the assumption of independence.
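For each categorical predictor variable, we test the following hypotheses:
\(H_0\): The lifestyle variable and Obesity Level are independent.
\(H_1\): The lifestyle variable and Obesity Level are associated (not independent).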
The test statistic (\(\chi^2\)) is calculated by summing the squared differences between observed (\(O_{ij}\)) and expected (\(E_{ij}\)) frequencies, normalized by the expected frequencies:
\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
Where:
\(r\) = Number of rows (categories of the lifestyle variable).
\(c\) = Number of columns (categories of Obesity Level).
\(O_{ij}\) = The observed count of individuals in row \(i\) and column \(j\).
\(E_{ij}\) = The expected count under the assumption of independence.
The expected frequency for each cell is calculated based on the marginal totals (row and column sums):
\[E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total (N)}}\]
The calculated \(\chi^2\) statistic follows a Chi-Square distribution with degrees of freedom \(df = (r-1)(c-1)\).
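As a minimal sketch of the formulas above, the statistic can be computed by hand for Family History by Obesity Level. The "yes" counts are taken from the summary table in this report; the "no" counts are derived here as the group size minus the "yes" count, which is an assumption of this sketch. Object names are illustrative.

```r
levels_ob <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
               "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
               "Obesity_Type_III")
group_n <- c(272, 287, 290, 290, 351, 297, 324)
yes_n   <- c(126, 155, 209, 272, 344, 296, 324)

obs <- rbind(no = group_n - yes_n, yes = yes_n)
colnames(obs) <- levels_ob

N        <- sum(obs)
expected <- outer(rowSums(obs), colSums(obs)) / N   # E_ij = row total * column total / N
chi_sq   <- sum((obs - expected)^2 / expected)      # Pearson statistic
df       <- (nrow(obs) - 1) * (ncol(obs) - 1)       # (r - 1)(c - 1)
p_value  <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Cross-check against R's built-in test
builtin <- chisq.test(obs, correct = FALSE)
c(by_hand = chi_sq, builtin = unname(builtin$statistic), df = df)
```

The hand computation and `chisq.test()` should agree, since both evaluate the same sum over the 2 × 7 cells with 6 degrees of freedom.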
The following table summarizes the demographic profile and lifestyle behaviors of the study population, stratified by obesity category. P-values indicate significant differences across groups (Chi-square test for categorical variables, Kruskal-Wallis for continuous/ordinal).
library(gtsummary)

# create summary table
obesity_clean %>%
  select(ObesityLevel, AgeGroup, Gender, Height, Weight, FamilyHistory,
         PhysicalActivity, HighCaloricFood, VegConsumption, MealsPerDay,
         SnackBetweenMeals, Smoker, WaterIntake, CalorieMonitoring,
         TechUseTime, Alcohol, Transport) %>%
  tbl_summary(
    by = ObesityLevel, # split the table by Obesity Level
    statistic = list(all_categorical() ~ "{n} ({p}%)"),
    digits = all_continuous() ~ 1,
    label = list(
      AgeGroup ~ "Age Group",
      Gender ~ "Gender",
      Height ~ "Height",
      Weight ~ "Weight",
      FamilyHistory ~ "Family History of Obesity",
      PhysicalActivity ~ "Physical Activity (Days/Week)",
      HighCaloricFood ~ "High Caloric Food Intake",
      VegConsumption ~ "Vegetable Consumption",
      MealsPerDay ~ "Meals Per Day",
      Smoker ~ "Smoker",
      SnackBetweenMeals ~ "Snack Between Meals",
      CalorieMonitoring ~ "Calorie Monitoring",
      WaterIntake ~ "Daily Water Intake",
      TechUseTime ~ "Tech Use Time",
      Alcohol ~ "Alcohol Consumption Frequency",
      Transport ~ "Transportation Mode"
    )
  ) %>%
  add_p() %>%       # add p-values automatically
  add_overall() %>% # add a column for the total population
  bold_labels()
| Characteristic | Overall N = 2,111¹ | Insufficient_Weight N = 272¹ | Normal_Weight N = 287¹ | Overweight_Level_I N = 290¹ | Overweight_Level_II N = 290¹ | Obesity_Type_I N = 351¹ | Obesity_Type_II N = 297¹ | Obesity_Type_III N = 324¹ | p-value² |
|---|---|---|---|---|---|---|---|---|---|
| Age Group | | | | | | | | | <0.001 |
| Under 20 | 585 (28%) | 180 (66%) | 132 (46%) | 94 (32%) | 59 (20%) | 76 (22%) | 1 (0.3%) | 43 (13%) | |
| 20-29 | 1,170 (55%) | 89 (33%) | 137 (48%) | 153 (53%) | 135 (47%) | 189 (54%) | 186 (63%) | 281 (87%) | |
| 30-39 | 299 (14%) | 3 (1.1%) | 16 (5.6%) | 40 (14%) | 80 (28%) | 62 (18%) | 98 (33%) | 0 (0%) | |
| 40+ | 57 (2.7%) | 0 (0%) | 2 (0.7%) | 3 (1.0%) | 16 (5.5%) | 24 (6.8%) | 12 (4.0%) | 0 (0%) | |
| Gender | | | | | | | | | <0.001 |
| Female | 1,043 (49%) | 173 (64%) | 141 (49%) | 145 (50%) | 103 (36%) | 156 (44%) | 2 (0.7%) | 323 (100%) | |
| Male | 1,068 (51%) | 99 (36%) | 146 (51%) | 145 (50%) | 187 (64%) | 195 (56%) | 295 (99%) | 1 (0.3%) | |
| Height | 1.7 (1.6, 1.8) | 1.7 (1.6, 1.8) | 1.7 (1.6, 1.8) | 1.7 (1.6, 1.8) | 1.7 (1.7, 1.8) | 1.7 (1.6, 1.8) | 1.8 (1.8, 1.8) | 1.7 (1.6, 1.7) | <0.001 |
| Weight | 83.0 (65.4, 107.5) | 50.0 (44.7, 53.7) | 61.0 (55.0, 69.0) | 75.0 (68.1, 80.0) | 82.0 (78.0, 86.9) | 90.7 (82.1, 103.8) | 117.8 (112.0, 120.8) | 112.0 (109.1, 133.5) | <0.001 |
| Family History of Obesity | 1,726 (82%) | 126 (46%) | 155 (54%) | 209 (72%) | 272 (94%) | 344 (98%) | 296 (100%) | 324 (100%) | <0.001 |
| Physical Activity (Days/Week) | | | | | | | | | <0.001 |
| 0 | 720 (34%) | 72 (26%) | 80 (28%) | 84 (29%) | 97 (33%) | 131 (37%) | 69 (23%) | 187 (58%) | |
| 1 | 776 (37%) | 72 (26%) | 97 (34%) | 126 (43%) | 125 (43%) | 123 (35%) | 165 (56%) | 68 (21%) | |
| 2 | 496 (23%) | 117 (43%) | 69 (24%) | 56 (19%) | 50 (17%) | 72 (21%) | 63 (21%) | 69 (21%) | |
| 3 | 119 (5.6%) | 11 (4.0%) | 41 (14%) | 24 (8.3%) | 18 (6.2%) | 25 (7.1%) | 0 (0%) | 0 (0%) | |
| High Caloric Food Intake | 1,866 (88%) | 221 (81%) | 208 (72%) | 268 (92%) | 216 (74%) | 340 (97%) | 290 (98%) | 323 (100%) | <0.001 |
| Vegetable Consumption | | | | | | | | | <0.001 |
| 1 | 102 (4.8%) | 23 (8.5%) | 18 (6.3%) | 14 (4.8%) | 9 (3.1%) | 17 (4.8%) | 21 (7.1%) | 0 (0%) | |
| 2 | 1,013 (48%) | 86 (32%) | 155 (54%) | 186 (64%) | 192 (66%) | 256 (73%) | 138 (46%) | 0 (0%) | |
| 3 | 996 (47%) | 163 (60%) | 114 (40%) | 90 (31%) | 89 (31%) | 78 (22%) | 138 (46%) | 324 (100%) | |
| Meals Per Day | | | | | | | | | <0.001 |
| 1 | 316 (15%) | 37 (14%) | 52 (18%) | 76 (26%) | 48 (17%) | 79 (23%) | 24 (8.1%) | 0 (0%) | |
| 2 | 176 (8.3%) | 18 (6.6%) | 0 (0%) | 23 (7.9%) | 52 (18%) | 47 (13%) | 36 (12%) | 0 (0%) | |
| 3 | 1,470 (70%) | 145 (53%) | 206 (72%) | 158 (54%) | 184 (63%) | 225 (64%) | 228 (77%) | 324 (100%) | |
| 4 | 149 (7.1%) | 72 (26%) | 29 (10%) | 33 (11%) | 6 (2.1%) | 0 (0%) | 9 (3.0%) | 0 (0%) | |
| Snack Between Meals | | | | | | | | | <0.001 |
| no | 51 (2.4%) | 3 (1.1%) | 10 (3.5%) | 35 (12%) | 1 (0.3%) | 1 (0.3%) | 1 (0.3%) | 0 (0%) | |
| Sometimes | 1,765 (84%) | 146 (54%) | 159 (55%) | 236 (81%) | 270 (93%) | 338 (96%) | 293 (99%) | 323 (100%) | |
| Frequently | 242 (11%) | 121 (44%) | 83 (29%) | 14 (4.8%) | 16 (5.5%) | 6 (1.7%) | 1 (0.3%) | 1 (0.3%) | |
| Always | 53 (2.5%) | 2 (0.7%) | 35 (12%) | 5 (1.7%) | 3 (1.0%) | 6 (1.7%) | 2 (0.7%) | 0 (0%) | |
| Smoker | 44 (2.1%) | 1 (0.4%) | 13 (4.5%) | 3 (1.0%) | 5 (1.7%) | 6 (1.7%) | 15 (5.1%) | 1 (0.3%) | <0.001 |
| Daily Water Intake | | | | | | | | | <0.001 |
| 1 | 485 (23%) | 84 (31%) | 83 (29%) | 60 (21%) | 47 (16%) | 68 (19%) | 82 (28%) | 61 (19%) | |
| 2 | 1,110 (53%) | 142 (52%) | 164 (57%) | 154 (53%) | 186 (64%) | 173 (49%) | 177 (60%) | 114 (35%) | |
| 3 | 516 (24%) | 46 (17%) | 40 (14%) | 76 (26%) | 57 (20%) | 110 (31%) | 38 (13%) | 149 (46%) | |
| Calorie Monitoring | 96 (4.5%) | 22 (8.1%) | 30 (10%) | 37 (13%) | 4 (1.4%) | 2 (0.6%) | 1 (0.3%) | 0 (0%) | <0.001 |
| Tech Use Time | | | | | | | | | <0.001 |
| 0 | 952 (45%) | 94 (35%) | 129 (45%) | 164 (57%) | 114 (39%) | 169 (48%) | 173 (58%) | 109 (34%) | |
| 1 | 915 (43%) | 127 (47%) | 122 (43%) | 82 (28%) | 145 (50%) | 121 (34%) | 103 (35%) | 215 (66%) | |
| 2 | 244 (12%) | 51 (19%) | 36 (13%) | 44 (15%) | 31 (11%) | 61 (17%) | 21 (7.1%) | 0 (0%) | |
| Alcohol Consumption Frequency | |||||||||
| no | 639 (30%) | 117 (43%) | 107 (37%) | 50 (17%) | 128 (44%) | 165 (47%) | 71 (24%) | 1 (0.3%) | |
| Sometimes | 1,401 (66%) | 154 (57%) | 161 (56%) | 224 (77%) | 143 (49%) | 172 (49%) | 224 (75%) | 323 (100%) | |
| Frequently | 70 (3.3%) | 1 (0.4%) | 18 (6.3%) | 16 (5.5%) | 19 (6.6%) | 14 (4.0%) | 2 (0.7%) | 0 (0%) | |
| Always | 1 (<0.1%) | 0 (0%) | 1 (0.3%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | |
| Transportation Mode | |||||||||
| Automobile | 457 (22%) | 46 (17%) | 45 (16%) | 66 (23%) | 94 (32%) | 110 (31%) | 95 (32%) | 1 (0.3%) | |
| Bike | 7 (0.3%) | 0 (0%) | 4 (1.4%) | 2 (0.7%) | 0 (0%) | 0 (0%) | 1 (0.3%) | 0 (0%) | |
| Motorbike | 11 (0.5%) | 0 (0%) | 6 (2.1%) | 1 (0.3%) | 1 (0.3%) | 3 (0.9%) | 0 (0%) | 0 (0%) | |
| Public_Transportation | 1,580 (75%) | 220 (81%) | 200 (70%) | 212 (73%) | 189 (65%) | 236 (67%) | 200 (67%) | 323 (100%) | |
| Walking | 56 (2.7%) | 6 (2.2%) | 32 (11%) | 9 (3.1%) | 6 (2.1%) | 2 (0.6%) | 1 (0.3%) | 0 (0%) | |
| ¹ n (%); Median (Q1, Q3) | | | | | | | | | |
| ² Pearson’s Chi-squared test; Kruskal-Wallis rank sum test; NA | | | | | | | | | |
Why are there no p-values for Alcohol and Transportation?
Let’s take a look at the counts:
table(obesity_clean$Transport, obesity_clean$ObesityLevel)
##
## Insufficient_Weight Normal_Weight Overweight_Level_I
## Automobile 46 45 66
## Bike 0 4 2
## Motorbike 0 6 1
## Public_Transportation 220 200 212
## Walking 6 32 9
##
## Overweight_Level_II Obesity_Type_I Obesity_Type_II
## Automobile 94 110 95
## Bike 0 0 1
## Motorbike 1 3 0
## Public_Transportation 189 236 200
## Walking 6 2 1
##
## Obesity_Type_III
## Automobile 1
## Bike 0
## Motorbike 0
## Public_Transportation 323
## Walking 0
table(obesity_clean$Alcohol, obesity_clean$ObesityLevel)
##
## Insufficient_Weight Normal_Weight Overweight_Level_I
## no 117 107 50
## Sometimes 154 161 224
## Frequently 1 18 16
## Always 0 1 0
##
## Overweight_Level_II Obesity_Type_I Obesity_Type_II
## no 128 165 71
## Sometimes 143 172 224
## Frequently 19 14 2
## Always 0 0 0
##
## Obesity_Type_III
## no 1
## Sometimes 323
## Frequently 0
## Always 0
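The sparse cells we see (e.g., no individuals with Obesity Type III who walk, and only one individual who “Always” drinks alcohol) are the cause. Observed zeros do not break the Chi-Square formula by themselves: the statistic divides by the expected count \(E_{ij}\), which is zero only when an entire row or column is empty. The issue is that many expected counts fall below 5, where the Chi-Square approximation becomes unreliable.
By default, `add_p()` in gtsummary switches from the Chi-Square test to Fisher’s exact test when any expected count is below 5. For tables of this size (5 × 7 and 4 × 7, with N = 2,111) the exact computation most likely failed, so the p-value is left empty (note the “NA” in the table footnote).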
The Solution: Monte Carlo Simulation
We therefore use a Monte Carlo version of the Chi-Square test: instead of relying on the asymptotic Chi-Square distribution, R generates 2,000 random tables with the same row and column totals and reports how often their statistic is at least as large as the observed one.
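As a sketch of the simulation step on its own, the Alcohol by Obesity Level counts printed above can be passed directly to `chisq.test()` with `simulate.p.value = TRUE`. The matrix is rebuilt by hand, the object names are illustrative, and the seed is an arbitrary choice made only for reproducibility.

```r
levels_ob <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
               "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
               "Obesity_Type_III")

alcohol_tab <- matrix(
  c(117, 107,  50, 128, 165,  71,   1,
    154, 161, 224, 143, 172, 224, 323,
      1,  18,  16,  19,  14,   2,   0,
      0,   1,   0,   0,   0,   0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Alcohol = c("no", "Sometimes", "Frequently", "Always"),
                  ObesityLevel = levels_ob))

# Expected counts under independence; small values make the asymptotic test doubtful
expected <- outer(rowSums(alcohol_tab), colSums(alcohol_tab)) / sum(alcohol_tab)
n_small  <- sum(expected < 5)

set.seed(2024)  # arbitrary seed, assumption of this sketch
mc <- chisq.test(alcohol_tab, simulate.p.value = TRUE, B = 2000)

list(cells_with_expected_below_5 = n_small, simulated_p = mc$p.value)
```

With `B = 2000` the smallest p-value the simulation can report is 1/2001 (about 0.0005), which a summary table would display as `<0.001`.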
Let’s rerun the code:
# create summary table
obesity_clean %>%
  select(ObesityLevel, AgeGroup, Gender, Height, Weight, FamilyHistory,
         PhysicalActivity, HighCaloricFood, VegConsumption, MealsPerDay,
         SnackBetweenMeals, Smoker, WaterIntake, CalorieMonitoring,
         TechUseTime, Alcohol, Transport) %>%
  tbl_summary(
    by = ObesityLevel, # split the table by Obesity Level
    statistic = list(all_categorical() ~ "{n} ({p}%)"),
    digits = all_continuous() ~ 1,
    label = list(
      AgeGroup ~ "Age Group",
      Gender ~ "Gender",
      Height ~ "Height",
      Weight ~ "Weight",
      FamilyHistory ~ "Family History of Obesity",
      PhysicalActivity ~ "Physical Activity (Days/Week)",
      HighCaloricFood ~ "High Caloric Food Intake",
      VegConsumption ~ "Vegetable Consumption",
      MealsPerDay ~ "Meals Per Day",
      Smoker ~ "Smoker",
      SnackBetweenMeals ~ "Snack Between Meals",
      CalorieMonitoring ~ "Calorie Monitoring",
      WaterIntake ~ "Daily Water Intake",
      TechUseTime ~ "Tech Use Time",
      Alcohol ~ "Alcohol Consumption Frequency",
      Transport ~ "Transportation Mode")) %>%
  # Monte Carlo fix: simulated p-values for all categorical variables;
  # continuous variables keep the default test. add_p() is called only once.
  add_p(test = all_categorical() ~ "chisq.test",
        test.args = all_tests("chisq.test") ~ list(simulate.p.value = TRUE, B = 2000)) %>%
  add_overall() %>% # add a column for the total population
  bold_labels()
| Characteristic | Overall N = 2,111¹ | Insufficient_Weight N = 272¹ | Normal_Weight N = 287¹ | Overweight_Level_I N = 290¹ | Overweight_Level_II N = 290¹ | Obesity_Type_I N = 351¹ | Obesity_Type_II N = 297¹ | Obesity_Type_III N = 324¹ | p-value² |
|---|---|---|---|---|---|---|---|---|---|
| Age Group | | | | | | | | | <0.001 |
| Under 20 | 585 (28%) | 180 (66%) | 132 (46%) | 94 (32%) | 59 (20%) | 76 (22%) | 1 (0.3%) | 43 (13%) | |
| 20-29 | 1,170 (55%) | 89 (33%) | 137 (48%) | 153 (53%) | 135 (47%) | 189 (54%) | 186 (63%) | 281 (87%) | |
| 30-39 | 299 (14%) | 3 (1.1%) | 16 (5.6%) | 40 (14%) | 80 (28%) | 62 (18%) | 98 (33%) | 0 (0%) | |
| 40+ | 57 (2.7%) | 0 (0%) | 2 (0.7%) | 3 (1.0%) | 16 (5.5%) | 24 (6.8%) | 12 (4.0%) | 0 (0%) | |
| Gender | | | | | | | | | <0.001 |
| Female | 1,043 (49%) | 173 (64%) | 141 (49%) | 145 (50%) | 103 (36%) | 156 (44%) | 2 (0.7%) | 323 (100%) | |
| Male | 1,068 (51%) | 99 (36%) | 146 (51%) | 145 (50%) | 187 (64%) | 195 (56%) | 295 (99%) | 1 (0.3%) | |
| Height | 1.7 (1.6, 1.8) | 1.7 (1.6, 1.8) | 1.7 (1.6, 1.8) | 1.7 (1.6, 1.8) | 1.7 (1.7, 1.8) | 1.7 (1.6, 1.8) | 1.8 (1.8, 1.8) | 1.7 (1.6, 1.7) | <0.001 |
| Weight | 83.0 (65.4, 107.5) | 50.0 (44.7, 53.7) | 61.0 (55.0, 69.0) | 75.0 (68.1, 80.0) | 82.0 (78.0, 86.9) | 90.7 (82.1, 103.8) | 117.8 (112.0, 120.8) | 112.0 (109.1, 133.5) | <0.001 |
| Family History of Obesity | 1,726 (82%) | 126 (46%) | 155 (54%) | 209 (72%) | 272 (94%) | 344 (98%) | 296 (100%) | 324 (100%) | <0.001 |
| Physical Activity (Days/Week) | | | | | | | | | <0.001 |
| 0 | 720 (34%) | 72 (26%) | 80 (28%) | 84 (29%) | 97 (33%) | 131 (37%) | 69 (23%) | 187 (58%) | |
| 1 | 776 (37%) | 72 (26%) | 97 (34%) | 126 (43%) | 125 (43%) | 123 (35%) | 165 (56%) | 68 (21%) | |
| 2 | 496 (23%) | 117 (43%) | 69 (24%) | 56 (19%) | 50 (17%) | 72 (21%) | 63 (21%) | 69 (21%) | |
| 3 | 119 (5.6%) | 11 (4.0%) | 41 (14%) | 24 (8.3%) | 18 (6.2%) | 25 (7.1%) | 0 (0%) | 0 (0%) | |
| High Caloric Food Intake | 1,866 (88%) | 221 (81%) | 208 (72%) | 268 (92%) | 216 (74%) | 340 (97%) | 290 (98%) | 323 (100%) | <0.001 |
| Vegetable Consumption | | | | | | | | | <0.001 |
| 1 | 102 (4.8%) | 23 (8.5%) | 18 (6.3%) | 14 (4.8%) | 9 (3.1%) | 17 (4.8%) | 21 (7.1%) | 0 (0%) | |
| 2 | 1,013 (48%) | 86 (32%) | 155 (54%) | 186 (64%) | 192 (66%) | 256 (73%) | 138 (46%) | 0 (0%) | |
| 3 | 996 (47%) | 163 (60%) | 114 (40%) | 90 (31%) | 89 (31%) | 78 (22%) | 138 (46%) | 324 (100%) | |
| Meals Per Day | | | | | | | | | <0.001 |
| 1 | 316 (15%) | 37 (14%) | 52 (18%) | 76 (26%) | 48 (17%) | 79 (23%) | 24 (8.1%) | 0 (0%) | |
| 2 | 176 (8.3%) | 18 (6.6%) | 0 (0%) | 23 (7.9%) | 52 (18%) | 47 (13%) | 36 (12%) | 0 (0%) | |
| 3 | 1,470 (70%) | 145 (53%) | 206 (72%) | 158 (54%) | 184 (63%) | 225 (64%) | 228 (77%) | 324 (100%) | |
| 4 | 149 (7.1%) | 72 (26%) | 29 (10%) | 33 (11%) | 6 (2.1%) | 0 (0%) | 9 (3.0%) | 0 (0%) | |
| Snack Between Meals | | | | | | | | | <0.001 |
| no | 51 (2.4%) | 3 (1.1%) | 10 (3.5%) | 35 (12%) | 1 (0.3%) | 1 (0.3%) | 1 (0.3%) | 0 (0%) | |
| Sometimes | 1,765 (84%) | 146 (54%) | 159 (55%) | 236 (81%) | 270 (93%) | 338 (96%) | 293 (99%) | 323 (100%) | |
| Frequently | 242 (11%) | 121 (44%) | 83 (29%) | 14 (4.8%) | 16 (5.5%) | 6 (1.7%) | 1 (0.3%) | 1 (0.3%) | |
| Always | 53 (2.5%) | 2 (0.7%) | 35 (12%) | 5 (1.7%) | 3 (1.0%) | 6 (1.7%) | 2 (0.7%) | 0 (0%) | |
| Smoker | 44 (2.1%) | 1 (0.4%) | 13 (4.5%) | 3 (1.0%) | 5 (1.7%) | 6 (1.7%) | 15 (5.1%) | 1 (0.3%) | <0.001 |
| Daily Water Intake | | | | | | | | | <0.001 |
| 1 | 485 (23%) | 84 (31%) | 83 (29%) | 60 (21%) | 47 (16%) | 68 (19%) | 82 (28%) | 61 (19%) | |
| 2 | 1,110 (53%) | 142 (52%) | 164 (57%) | 154 (53%) | 186 (64%) | 173 (49%) | 177 (60%) | 114 (35%) | |
| 3 | 516 (24%) | 46 (17%) | 40 (14%) | 76 (26%) | 57 (20%) | 110 (31%) | 38 (13%) | 149 (46%) | |
| Calorie Monitoring | 96 (4.5%) | 22 (8.1%) | 30 (10%) | 37 (13%) | 4 (1.4%) | 2 (0.6%) | 1 (0.3%) | 0 (0%) | <0.001 |
| Tech Use Time | | | | | | | | | <0.001 |
| 0 | 952 (45%) | 94 (35%) | 129 (45%) | 164 (57%) | 114 (39%) | 169 (48%) | 173 (58%) | 109 (34%) | |
| 1 | 915 (43%) | 127 (47%) | 122 (43%) | 82 (28%) | 145 (50%) | 121 (34%) | 103 (35%) | 215 (66%) | |
| 2 | 244 (12%) | 51 (19%) | 36 (13%) | 44 (15%) | 31 (11%) | 61 (17%) | 21 (7.1%) | 0 (0%) | |
| Alcohol Consumption Frequency | | | | | | | | | <0.001 |
| no | 639 (30%) | 117 (43%) | 107 (37%) | 50 (17%) | 128 (44%) | 165 (47%) | 71 (24%) | 1 (0.3%) | |
| Sometimes | 1,401 (66%) | 154 (57%) | 161 (56%) | 224 (77%) | 143 (49%) | 172 (49%) | 224 (75%) | 323 (100%) | |
| Frequently | 70 (3.3%) | 1 (0.4%) | 18 (6.3%) | 16 (5.5%) | 19 (6.6%) | 14 (4.0%) | 2 (0.7%) | 0 (0%) | |
| Always | 1 (<0.1%) | 0 (0%) | 1 (0.3%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | 0 (0%) | |
| Transportation Mode | | | | | | | | | <0.001 |
| Automobile | 457 (22%) | 46 (17%) | 45 (16%) | 66 (23%) | 94 (32%) | 110 (31%) | 95 (32%) | 1 (0.3%) | |
| Bike | 7 (0.3%) | 0 (0%) | 4 (1.4%) | 2 (0.7%) | 0 (0%) | 0 (0%) | 1 (0.3%) | 0 (0%) | |
| Motorbike | 11 (0.5%) | 0 (0%) | 6 (2.1%) | 1 (0.3%) | 1 (0.3%) | 3 (0.9%) | 0 (0%) | 0 (0%) | |
| Public_Transportation | 1,580 (75%) | 220 (81%) | 200 (70%) | 212 (73%) | 189 (65%) | 236 (67%) | 200 (67%) | 323 (100%) | |
| Walking | 56 (2.7%) | 6 (2.2%) | 32 (11%) | 9 (3.1%) | 6 (2.1%) | 2 (0.6%) | 1 (0.3%) | 0 (0%) | |
| ¹ n (%); Median (Q1, Q3) | | | | | | | | | |
| ² Pearson’s Chi-squared test with simulated p-value (based on 2000 replicates); Kruskal-Wallis rank sum test | | | | | | | | | |
All variables show a statistically significant association with obesity level (p < 0.001). [Note: in the PDF version the table appears at the very end of this report, because R places it automatically. Please see the HTML version for a better layout.]
While the Chi-Square test of independence (performed in Table 4) tells us if a relationship exists (statistical significance), it does not quantify how strong that relationship is. In large datasets, even weak relationships can yield statistically significant p-values.
To address this, we employ two distinct effect size metrics for categorical data: Cramér’s V and Goodman-Kruskal Lambda.
Cramér’s V measures the strength of association between two nominal variables. It is a symmetric measure, meaning the relationship between \(X \rightarrow Y\) is treated the same as \(Y \rightarrow X\).
Mathematical Formulation: \[V = \sqrt{\frac{\chi^2}{N \cdot \min(r-1, c-1)}}\]
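Where:
\(\chi^2\) = The Pearson Chi-Square statistic.
\(N\) = The total sample size.
\(r\), \(c\) = The number of rows and columns of the contingency table.
Interpretation: The coefficient ranges from 0 to 1:
\(V = 0\) indicates no association between the two variables.
\(V = 1\) indicates a perfect association.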
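To make the formula concrete, the sketch below computes Cramér’s V from scratch for Gender by Obesity Level, using the counts from the summary table. The helper name `cramers_v` is hypothetical and implements the uncorrected formula given above; it is not the function used elsewhere in this report.

```r
# Uncorrected Cramér's V for a two-way table of counts
cramers_v <- function(tab) {
  n        <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n
  chi_sq   <- sum((tab - expected)^2 / expected)
  sqrt(chi_sq / (n * min(nrow(tab) - 1, ncol(tab) - 1)))
}

gender_tab <- rbind(
  Female = c(173, 141, 145, 103, 156,   2, 323),
  Male   = c( 99, 146, 145, 187, 195, 295,   1)
)
colnames(gender_tab) <- c("Insufficient_Weight", "Normal_Weight",
                          "Overweight_Level_I", "Overweight_Level_II",
                          "Obesity_Type_I", "Obesity_Type_II",
                          "Obesity_Type_III")

v_gender <- cramers_v(gender_tab)
round(v_gender, 3)
```

Because Gender has two levels, \(\min(r-1, c-1) = 1\) and V reduces to \(\sqrt{\chi^2 / N}\); the result should agree with the value reported for Gender in the ranking table of this report.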
Unlike Cramér’s V, Lambda is an asymmetric measure of association based on the concept of Proportional Reduction in Error (PRE). It answers the question: “By knowing the value of the independent variable (Lifestyle Factor), by what percentage do we reduce the error in guessing the dependent variable (Obesity Level)?”
Mathematical Formulation: \[\lambda = \frac{E_1 - E_2}{E_1}\]
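Where:
\(E_1\) = The number of prediction errors made without knowing the independent variable (always guessing the most frequent category of the dependent variable).
\(E_2\) = The number of prediction errors made when the independent variable is known (guessing the most frequent category within each of its levels).
Interpretation:
\(\lambda = 0\) means that knowing the independent variable does not reduce prediction error.
\(\lambda = 1\) means that knowing the independent variable eliminates prediction error.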
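Because Lambda is asymmetric, the direction matters. The sketch below implements the \(E_1\), \(E_2\) definition directly for the Gender by Obesity Level counts from the summary table and evaluates both directions. The helper `lambda_pre` and its `dependent` argument are hypothetical names for this illustration, not a library API.

```r
# Goodman-Kruskal lambda from a two-way table of counts.
# dependent = "column": predict the column variable from the row variable.
# dependent = "row":    predict the row variable from the column variable.
lambda_pre <- function(tab, dependent = c("column", "row")) {
  dependent <- match.arg(dependent)
  n <- sum(tab)
  if (dependent == "column") {
    e1 <- n - max(colSums(tab))        # errors when always guessing the modal column
    e2 <- n - sum(apply(tab, 1, max))  # errors when guessing the modal column per row
  } else {
    e1 <- n - max(rowSums(tab))        # errors when always guessing the modal row
    e2 <- n - sum(apply(tab, 2, max))  # errors when guessing the modal row per column
  }
  (e1 - e2) / e1
}

gender_tab <- rbind(
  Female = c(173, 141, 145, 103, 156,   2, 323),
  Male   = c( 99, 146, 145, 187, 195, 295,   1)
)

obesity_from_gender <- lambda_pre(gender_tab, dependent = "column")  # 267 / 1760
gender_from_obesity <- lambda_pre(gender_tab, dependent = "row")     # 396 / 1043

round(c(obesity_from_gender = obesity_from_gender,
        gender_from_obesity = gender_from_obesity), 3)
```

The two directions give roughly 0.152 and 0.380. Only the first answers the question posed above, namely how much knowing the lifestyle factor reduces the error in guessing the obesity level.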
The following heatmap visualizes the pairwise strength of association between all predictor variables. This allows us to identify multicollinearity (variables that are highly correlated with each other).
library(vcd)      # association stats
library(reshape2) # reshaping the matrix for plotting
library(ggplot2)
library(dplyr)

# 1. define function to calculate Cramér's V for a pair of vectors
get_cramer_v <- function(x, y) {
  tbl <- table(x, y)
  stats <- assocstats(tbl)
  return(stats$cramer)
}

# 2. select variables (remove continuous Height, Weight, Age)
vars <- obesity_clean %>%
  select(-Height, -Weight, -Age)

# 3. create an empty matrix
var_names <- names(vars)
n <- length(var_names)
cramer_mat <- matrix(0, nrow = n, ncol = n, dimnames = list(var_names, var_names))

# 4. fill matrix (the diagonal is 1 by construction)
for (i in 1:n) {
  for (j in 1:n) {
    cramer_mat[i, j] <- get_cramer_v(vars[[i]], vars[[j]])
  }
}

# 5. reshape for ggplot
melted_cramer <- melt(cramer_mat)

# 6. plot heatmap
ggplot(melted_cramer, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "white", high = "#c0392b", mid = "#e67e22",
                       midpoint = 0.5, limits = c(0, 1), name = "Cramer's V") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  coord_fixed() +
  labs(title = "Cramér's V Association Matrix",
       subtitle = "Measuring strength of relationships between categorical variables")
The heatmap shows that Gender (\(V=0.56\)) and Family History (\(V=0.54\)) have the strongest associations with obesity level. The Gender value should be read with care: in the summary table, Obesity Type II is 99% male and Obesity Type III is almost entirely female, so much of this association comes from those two categories. Lifestyle habits such as Vegetable Consumption and Snacking show moderate associations. The large pale area of the chart indicates that most predictors are only weakly associated with one another, which suggests low multicollinearity: each variable contributes largely distinct information to the analysis.
library(DescTools) # for Lambda function
library(knitr)

# initialize a dataframe to store results
ranking <- data.frame(
  Variable = character(),
  CramersV = numeric(),
  Lambda = numeric(), # proportional reduction in error
  stringsAsFactors = FALSE
)

# loop through predictors to calculate stats against ObesityLevel
predictors <- names(obesity_clean)[!names(obesity_clean) %in% c("ObesityLevel", "Height", "Weight", "Age")]

for (var in predictors) {
  # create table: predictor in rows, ObesityLevel in columns
  tbl <- table(obesity_clean[[var]], obesity_clean$ObesityLevel)
  # calculate stats
  v_stat <- CramerV(tbl)
  # NOTE: direction = "row" treats the row variable (the predictor) as the one
  # being predicted; direction = "column" would predict ObesityLevel instead
  lambda_stat <- Lambda(tbl, direction = "row")
  # add to dataframe
  ranking <- rbind(ranking, data.frame(
    Variable = var,
    CramersV = v_stat,
    Lambda = lambda_stat
  ))
}

ranking %>%
  arrange(desc(CramersV)) %>%
  kable(digits = 3, caption = "Ranking Predictors by Association Strength with Obesity Level")
| Variable | CramersV | Lambda |
|---|---|---|
| Gender | 0.558 | 0.380 |
| FamilyHistory | 0.543 | 0.052 |
| VegConsumption | 0.366 | 0.365 |
| SnackBetweenMeals | 0.356 | 0.000 |
| HighCaloricFood | 0.332 | 0.000 |
| AgeGroup | 0.321 | 0.097 |
| MealsPerDay | 0.278 | 0.000 |
| CalorieMonitoring | 0.241 | 0.000 |
| Alcohol | 0.231 | 0.000 |
| PhysicalActivity | 0.208 | 0.129 |
| TechUseTime | 0.205 | 0.147 |
| WaterIntake | 0.197 | 0.035 |
| Transport | 0.186 | 0.000 |
| Smoker | 0.123 | 0.000 |
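This ranking highlights the difference between a variable being associated with obesity and a variable reducing prediction error. Family History has a high Cramér’s V but a low Lambda, whereas Gender and Vegetable Consumption have the highest Lambda values. One caveat applies to the Lambda column: the tables are built with the lifestyle variable in the rows, and the reported values match the direction in which the row variable is predicted from ObesityLevel (for Gender, \((1464 - 1068)/(2111 - 1068) \approx 0.380\) from the counts in the summary table). They should therefore be checked against the intended direction before being read as predictive power for obesity. A value of \(\lambda=0\), as seen for Snacking, Alcohol, and Transportation, does not mean there is no association. It arises when the same category is the most frequent one in every group (e.g., “Sometimes” for snacking and Public_Transportation for transport), so knowing the group never changes the best guess.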
While pairwise statistics (Cramér’s V) reveal individual links, they cannot fully capture the multidimensional structure of the data. To visualize how all 15+ lifestyle variables interact simultaneously, we employ Multiple Correspondence Analysis (MCA).
MCA is a dimensionality reduction technique specifically designed for categorical data. It can be viewed as the categorical equivalent of Principal Component Analysis (PCA).
Unlike standard correlation which uses raw numbers, MCA operates on an Indicator Matrix (or a Disjunctive Table). If an individual belongs to category \(k\), the value is 1; otherwise, it is 0.
\[ X_{ik} = \begin{cases} 1 & \text{if individual } i \text{ belongs to category } k \\ 0 & \text{otherwise} \end{cases} \]
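As a small illustration of this definition, the sketch below builds an indicator matrix by hand in base R. The four-row data frame `toy` and the helper `one_hot` are made up for the example and are not the study data.

```r
# Made-up data: 4 individuals, 2 categorical variables
toy <- data.frame(
  Transport = factor(c("Car", "Bike", "Car", "Walk")),
  Smoker    = factor(c("no", "no", "yes", "no"))
)

# Build one 0/1 column per category of a given variable
one_hot <- function(v) {
  f <- toy[[v]]
  m <- sapply(levels(f), function(lv) as.integer(f == lv))
  colnames(m) <- paste(v, levels(f), sep = "_")
  m
}

# Bind the blocks of all variables side by side
indicator <- do.call(cbind, lapply(names(toy), one_hot))
print(indicator)
```

Each row contains exactly one 1 per variable, so every row sums to the number of variables (here 2).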
MCA calculates the distance between individuals or categories using the Chi-Square metric. This differs from standard Euclidean distance (straight lines) by weighting categories based on their rarity.
Rare categories (e.g., “Bike Commuters”) contribute more to the inertia (variance) than common categories.
Two categories are “close” in the generated map if they are frequently chosen by the same individuals.
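For reference, the chi-square distances used in MCA are commonly written as follows. This is a sketch in standard textbook notation, where \(I\) is the number of individuals, \(J\) the number of active variables, and \(I_k\) the number of individuals in category \(k\); these symbols are introduced here for illustration.

\[ d^2(i, i') = \frac{I}{J} \sum_{k} \frac{(X_{ik} - X_{i'k})^2}{I_k}, \qquad d^2(k, O) = \frac{I}{I_k} - 1, \qquad \text{Inertia}(k) = \frac{1}{J}\left(1 - \frac{I_k}{I}\right) \]

The first expression shows that disagreeing on a rare category (small \(I_k\)) pushes two individuals further apart. The second is the squared distance of category \(k\) from the origin \(O\) of the map, and the third is its share of the total inertia; both grow as the category becomes rarer.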
The algorithm decomposes the total variation (Inertia) in the dataset into orthogonal Dimensions (axes):
Dimension 1: Represents the pattern accounting for the most variance in the data.
Dimension 2: Represents the second most dominant pattern, independent of the first.
The goal is to project the high-dimensional cloud of data points onto a 2D plane (Dim 1 vs. Dim 2) while retaining as much information as possible.
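The decomposition into dimensions can be sketched in base R as a singular value decomposition of the standardized indicator matrix. This is a minimal illustration on made-up data, under the assumption that MCA is computed as a correspondence analysis of the indicator matrix; it is not the FactoMineR implementation, and all object names are hypothetical.

```r
# Made-up data: 6 individuals, 2 categorical variables
toy <- data.frame(
  Transport = factor(c("Car", "Bike", "Car", "Walk", "Car", "Walk")),
  Smoker    = factor(c("no", "no", "yes", "no", "yes", "no"))
)

# Indicator matrix: one 0/1 column per category
Z <- do.call(cbind, lapply(toy, function(f) {
  sapply(levels(f), function(lv) as.numeric(f == lv))
}))

P      <- Z / sum(Z)   # correspondence matrix
r_mass <- rowSums(P)   # row (individual) masses
c_mass <- colSums(P)   # column (category) masses

# Standardized residuals: chi-square weighting by the masses
S <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))

eig <- svd(S)$d^2              # inertia carried by each dimension
pct <- 100 * eig / sum(eig)    # share of total inertia per dimension
print(round(pct, 1))
```

The total inertia of an indicator matrix depends only on the coding (number of categories divided by number of variables, minus 1), which is why the percentages per dimension in MCA tend to look small compared with PCA.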
A critical methodological choice in this analysis is the treatment of
ObesityLevel as a Supplementary
Variable.
Active Variables: The map is constructed solely based on the lifestyle behaviors (Diet, Transport, Activity, etc.).
Supplementary Variable: The
ObesityLevel categories are not used to build the axes.
Instead, they are projected onto the established lifestyle map
afterward.
Why do this? This allows us to see where specific obesity levels “naturally land” within the landscape of lifestyle habits, without the obesity label itself forcing the structure of the map.
library(FactoMineR) # computing MCA
library(factoextra)
# 1. select active variables, remove Height/Weight/Age
mca_data <- obesity_clean %>%
select(Gender, AgeGroup, FamilyHistory,
HighCaloricFood, VegConsumption, MealsPerDay, SnackBetweenMeals,
Smoker, WaterIntake, CalorieMonitoring, PhysicalActivity,
TechUseTime, Alcohol, Transport,
ObesityLevel)
# 2. run MCA
res.mca <- MCA(mca_data,
quali.sup = 15, # Index of ObesityLevel
graph = FALSE)
# 3. visualize the variable categories
# active categories are drawn in black, supplementary ObesityLevel categories in red
fviz_mca_var(res.mca,
  choice = "var.cat",       # plot variable categories
  repel = TRUE,             # avoid overlapping labels
  col.var = "black",        # active variables
  shape.var = 19,
  col.quali.sup = "red",    # supplementary variable (ObesityLevel)
  shape.sup = 15,
  title = "The 'Lifestyle Map': MCA of Habits with Obesity Overlay") +
  theme_minimal() +
  labs(subtitle = "Red Dots = Obesity Levels (Supplementary)\nBlack Dots = Lifestyle Habits (Active)")
The map shows that weight status is not randomly distributed across lifestyle profiles.
Healthy weight clusters with people under 20 who walk or travel by bike or motorbike and have no family history of obesity.
Overweight and Obesity Types I and II cluster with older adults who drive cars and report little physical activity.
Severe Obesity (Type III) forms a distinct group strongly linked to public transportation users and women.
# identify top contributing variables
# plot contributions for Dimension 1 (the horizontal axis)
p1 <- fviz_contrib(res.mca, choice = "var", axes = 1, top = 10,
title = "Key Variables for Dimension 1 (Horizontal)")
# plot contributions for Dimension 2 (the vertical axis)
p2 <- fviz_contrib(res.mca, choice = "var", axes = 2, top = 10,
title = "Key Variables for Dimension 2 (Vertical)")
# put side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)
These contribution plots reveal the specific “ingredients” that define the axes of our MCA map, showing which variables contribute most to the data’s structure. The horizontal axis (Dimension 1) is dominated by Family History and Age, creating a clear contrast between younger individuals with no family history of obesity and older adults in the 30–39 range. This suggests that the primary distinction in the dataset is between “low-risk” (young, no family history) and “established-risk” profiles. The vertical axis (Dimension 2) is driven largely by Transportation modes, specifically separating Automobile drivers from Public Transportation users. Together, these results suggest that family history, age, and daily mobility habits are the main sources of variation among lifestyle profiles in this population.
To quantify how lifestyle factors increase or decrease the likelihood of moving into a higher obesity category, we employ Ordinal Logistic Regression (specifically the Proportional Odds Model).
Unlike standard classification, which treats categories as separate buckets, this model respects the ranking of the seven levels of the target variable (\(Y\)): \[\text{Insufficient} < \text{Normal} < \text{Overweight I} < \text{Overweight II} < \text{Obesity I} < \text{Obesity II} < \text{Obesity III}\]
The model predicts the cumulative probability that an individual’s obesity level is less than or equal to a specific category \(j\):
\[\ln \left( \frac{P(Y \le j)}{1 - P(Y \le j)} \right) = \alpha_j - \beta X\]
Where:

* \(P(Y \le j)\): The probability of being in category \(j\) or lower.
* \(\alpha_j\): The intercept (cut-point) for category \(j\).
* \(\beta\): The coefficient for the predictor variables.
* \(X\): The vector of lifestyle predictors.

We interpret the results using Odds Ratios (OR):

* OR > 1: The factor is associated with higher odds of being in a heavier weight category (Risk Factor).
* OR < 1: The factor is associated with lower odds of being in a heavier weight category (Protective Factor).
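The link between the model equation and the odds-ratio reading can be made explicit with a short derivation. This is a sketch for a single predictor \(x\) with coefficient \(\beta\), using the same sign convention as the equation above. Negating the cumulative logit gives the log-odds of falling above category \(j\):

\[ \ln \left( \frac{P(Y > j)}{P(Y \le j)} \right) = \beta x - \alpha_j \]

Comparing \(x + 1\) with \(x\), the cut-point cancels:

\[ \text{OR} = \frac{\text{odds}(Y > j \mid x + 1)}{\text{odds}(Y > j \mid x)} = e^{\beta} \]

Because \(\alpha_j\) drops out, the same odds ratio applies at every cut-point \(j\), which is exactly the proportional odds assumption.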
We fit the model using the clm function from the
ordinal package. We exclude Height and
Weight to avoid data leakage, as they directly define
BMI.
library(ordinal)
library(gtsummary)
# 1. fit the model
clm_model <- clm(ObesityLevel ~ FamilyHistory + AgeGroup + Gender +
Transport + PhysicalActivity + VegConsumption +
HighCaloricFood + SnackBetweenMeals,
data = obesity_clean)
summary(clm_model)
## formula:
## ObesityLevel ~ FamilyHistory + AgeGroup + Gender + Transport + PhysicalActivity + VegConsumption + HighCaloricFood + SnackBetweenMeals
## data: obesity_clean
##
## link threshold nobs logLik AIC niter max.grad cond.H
## logit flexible 2111 -3346.14 6740.28 5(0) 6.11e-09 1.6e+03
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## FamilyHistoryyes 2.17737 0.12661 17.197 < 2e-16 ***
## AgeGroup20-29 1.56693 0.10538 14.869 < 2e-16 ***
## AgeGroup30-39 1.78080 0.16166 11.016 < 2e-16 ***
## AgeGroup40+ 2.33485 0.26869 8.690 < 2e-16 ***
## GenderMale -0.09681 0.08719 -1.110 0.26684
## TransportBike 0.42148 0.64837 0.650 0.51565
## TransportMotorbike 1.46547 0.63139 2.321 0.02029 *
## TransportPublic_Transportation 0.97554 0.12922 7.550 4.36e-14 ***
## TransportWalking 0.25181 0.28233 0.892 0.37244
## PhysicalActivity1 -0.27212 0.09820 -2.771 0.00559 **
## PhysicalActivity2 -0.68279 0.11287 -6.049 1.46e-09 ***
## PhysicalActivity3 -0.83960 0.18000 -4.664 3.09e-06 ***
## VegConsumption2 -0.03726 0.19177 -0.194 0.84594
## VegConsumption3 1.02361 0.19494 5.251 1.51e-07 ***
## HighCaloricFoodyes 0.62442 0.13345 4.679 2.88e-06 ***
## SnackBetweenMeals.L -1.14322 0.23996 -4.764 1.90e-06 ***
## SnackBetweenMeals.Q 0.55210 0.19340 2.855 0.00431 **
## SnackBetweenMeals.C 1.81722 0.13449 13.511 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Threshold coefficients:
## Estimate Std. Error z value
## Insufficient_Weight|Normal_Weight 2.0765 0.2752 7.546
## Normal_Weight|Overweight_Level_I 3.5550 0.2797 12.712
## Overweight_Level_I|Overweight_Level_II 4.6690 0.2879 16.217
## Overweight_Level_II|Obesity_Type_I 5.5531 0.2947 18.846
## Obesity_Type_I|Obesity_Type_II 6.6049 0.3029 21.804
## Obesity_Type_II|Obesity_Type_III 7.7558 0.3123 24.834
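To show how the threshold coefficients translate into category probabilities, the sketch below applies the cumulative logit formula to the six thresholds printed above. The baseline linear predictor `eta = 3` is a hypothetical value chosen only for illustration; it does not correspond to a specific individual in the data, and the helper `category_probs` is not part of the original analysis.

```r
# Threshold (cut-point) estimates copied from the model summary
theta <- c(2.0765, 3.5550, 4.6690, 5.5531, 6.6049, 7.7558)
lvl <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
         "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
         "Obesity_Type_III")

# Category probabilities for a given linear predictor eta = beta * X
category_probs <- function(eta) {
  cum <- plogis(theta - eta)         # P(Y <= j) for j = 1..6
  setNames(diff(c(0, cum, 1)), lvl)  # differences give P(Y = j)
}

eta_base  <- 3                                   # hypothetical baseline profile
p_without <- category_probs(eta_base)            # no family history
p_with    <- category_probs(eta_base + 2.17737)  # same profile plus FamilyHistoryyes

print(round(rbind(p_without, p_with), 3))
```

Adding the FamilyHistory coefficient shifts probability mass from the lighter categories toward the heavier ones while the thresholds stay fixed.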
Family History is the Dominant Factor: FamilyHistory is the most
precisely estimated strong effect in the model (z = 17.2). With a
coefficient of 2.18 (odds ratio of about 8.8), individuals with a family
history of obesity have nearly 9 times the odds of being in a higher
obesity category than those without, holding all other variables constant.
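The odds ratio quoted above can be reproduced by exponentiating the printed estimate. The sketch below does this for a few coefficients copied from the summary and adds approximate 95% Wald intervals; the interval method is an assumption made for illustration, and on a fitted model object one would normally work from the stored estimates rather than retyped values.

```r
# Estimates and standard errors copied from the model summary
est <- c(FamilyHistoryyes               = 2.17737,
         PhysicalActivity2              = -0.68279,
         PhysicalActivity3              = -0.83960,
         TransportPublic_Transportation = 0.97554)
se  <- c(0.12661, 0.11287, 0.18000, 0.12922)

z <- qnorm(0.975)  # about 1.96 for a 95% interval
or_table <- data.frame(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se),
  row.names = names(est)
)
print(round(or_table, 2))
```

Odds ratios below 1 (the physical activity levels) indicate lower odds of a heavier category, and those above 1 indicate higher odds.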
The Protective Power of Exercise: Physical activity shows a clear “dose-response” relationship. Compared to sedentary individuals:
Moderate activity (PhysicalActivity2) lowers the log-odds of a higher obesity category by 0.68 (OR of about 0.51).
High activity (PhysicalActivity3) lowers the log-odds by 0.84 (OR of about 0.43). This is consistent with more frequent physical activity being a significant protective factor.
The Public Transport Anomaly: Consistent with our MCA findings, Public_Transportation is a significant risk factor (Estimate 0.98, OR of about 2.65, p<0.001) compared to the reference group (Automobile). This is in line with the cluster found in the visualization, where public transit usage sat close to Obesity Type III.
Age Progression: Risk increases monotonically with age, although not in equal steps. Compared to the “Under 20” group, the coefficients rise steadily: 20-29 (1.57), 30-39 (1.78), and 40+ (2.33).
A Note on Vegetable Consumption: Unexpectedly, frequent vegetable consumption (VegConsumption3) appears as a positive predictor for obesity (Estimate 1.02). In cross-sectional studies, this is often attributed to “reverse causality”—individuals with higher obesity levels may report eating more vegetables because they are currently attempting to lose weight.
This study utilized a categorical data analysis framework to deconstruct the relationship between lifestyle habits and obesity levels. By integrating association metrics (Cramér’s V), geometric mapping (MCA), and probabilistic modeling (Ordinal Regression), we arrived at three major conclusions.
Across all analytical methods, Family History emerged as one of the most important correlates of obesity status. The heatmap identified it as a dominant correlate (\(V = 0.54\)), and the ordinal regression model quantified the association, showing that individuals with a family history have 8.8 times the odds (OR = 8.8) of being in a higher weight category. While “lifestyle” is often the focus of public health, these data suggest that genetic or shared environmental factors (the “home environment”) may be the primary drivers here.
The MCA visualized a clear lifestyle trajectory. The “Active
Youth” cluster shows healthy weight strongly associated with
individuals under 20 who use Walking or
Biking. As individuals enter the 30–39 age
bracket, the map shows a “Sedentary Transition” toward
Automobile usage, which sits close to
Overweight Level I and II. The regression does not confirm a
transport effect on its own, however: relative to Automobile, the
Walking and Bike coefficients are positive and not statistically
significant (p = 0.37 and p = 0.52), so the pattern on the map may
largely reflect age rather than commuting mode.
A unique finding was the distinct clustering of
Obesity Type III (Severe Obesity) with
Public Transportation usage and Female gender.
Unlike moderate obesity, which is linked to car ownership (often a proxy
for middle-class sedentary behavior), severe obesity appears linked to
public transit. In this Latin American context, this may serve as a
proxy for lower socioeconomic status, often correlated with limited
access to healthy food options.
Our model indicated that frequent Vegetable Consumption
was associated with higher obesity levels (OR > 1). This is
a likely example of reverse causality in cross-sectional data; it is
probable that individuals with higher BMI are eating more vegetables
because they are currently trying to lose weight, rather than vegetables
causing weight gain.
Based on these findings, effective interventions should move beyond generic “eat less, move more” advice:
Family-Based Interventions: Since Family History is the strongest predictor (OR 8.8), screening should target entire households rather than individuals.
Active Commuting: Urban planning policies promoting walkability and cycling could mitigate the “age-related” weight gain associated with automobile adoption in the 30s.
Targeted Support: Health campaigns should specifically target public transportation hubs, as this demographic shows the highest prevalence of severe obesity.
While machine learning models can predict obesity with high accuracy, this categorical analysis reveals the structure of the problem. Obesity in this population is not driven by a single bad habit, but by a combination of hereditary risk and a transition to sedentary, motorized lifestyles in adulthood.
Palechor, F. M., & De la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 25, 104344.
Quiroz, J. C., et al. (2022). Estimation of obesity levels based on dietary habits and physical condition using computational intelligence. Informatics in Medicine Unlocked, 29, 100954.
Görmez, A., et al. (2025). Prediction of obesity levels based on physical activity and eating habits with explainable AI. Frontiers in Physiology.