1. Introduction

Background and Rationale

Obesity is a growing global health concern strongly linked to behavioral factors like diet, physical activity, and transportation. While studies such as Palechor & De la Hoz Manotas (2019) have applied machine learning to predict obesity, they often emphasize accuracy over interpretability. This project proposes a categorical data analysis approach to uncover direct statistical associations and visualize how lifestyle behaviors correspond with obesity. By prioritizing explainability, we aim to provide a health-relevant perspective that complements “black box” predictive models.

Objectives

The primary objectives of this analysis are to:

investigate associations between categorical lifestyle factors and obesity categories;
identify which variables show the strongest relationships with obesity status;
visualize multidimensional associations using Correspondence Analysis (CA);
compare findings with results from previous machine-learning studies.

Data Description

The study uses the “Estimation of Obesity Levels Based on Eating Habits and Physical Condition” dataset from the UCI Machine Learning Repository. The sample consists of 2,111 individuals from Mexico, Peru, and Colombia. The target variable, NObeyesdad (Obesity Level), contains seven levels ranging from Insufficient Weight to Obesity Type III. The dataset includes 16 predictors covering demographics, eating habits, and physical activity. Notably, 23% of the data was collected directly from users via a web platform, while 77% was generated synthetically using the Weka tool and the SMOTE filter.

Age, Height, and Weight are numeric. After the preprocessing described in Section 2, all other predictors are treated as categorical or ordinal factors.

2. Data Preparation

Data Loading and Cleaning

The dataset contains a mix of real and synthetic data generated using SMOTE (Synthetic Minority Over-sampling Technique). As a result, several integer-based variables (e.g., number of meals, vegetable consumption) appear as decimal values. To perform categorical analysis, we must first preprocess the data by:

Renaming cryptic column abbreviations to interpretable names.
Rounding synthetic decimal values to the nearest integer.
Binning continuous variables like Age into groups.
Converting character variables into ordered or unordered factors.

library(tidyverse)
library(gtsummary)

# 1. Load Data
raw_data <- read.csv("ObesityDataSet_raw_and_data_sinthetic.csv")

# 2. Data cleaning
obesity_clean <- raw_data %>%
  # Gender, Age, Height, and Weight already have readable names
  rename(FamilyHistory = family_history_with_overweight,
         HighCaloricFood = FAVC,
         VegConsumption = FCVC,
         MealsPerDay = NCP,  # number of main meals per day
         SnackBetweenMeals = CAEC,
         Smoker = SMOKE,
         WaterIntake = CH2O,
         CalorieMonitoring = SCC,
         PhysicalActivity = FAF,
         TechUseTime = TUE, 
         Alcohol = CALC,
         Transport = MTRANS,
         ObesityLevel = NObeyesdad) %>%
  
  #Fix floats to nearest integer
  mutate(VegConsumption = round(VegConsumption),
         MealsPerDay = round(MealsPerDay),
         WaterIntake = round(WaterIntake),
         PhysicalActivity = round(PhysicalActivity),
         TechUseTime = round(TechUseTime)) %>%
  
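  # create Age groups
  # note: cut() uses right-closed intervals by default, i.e. (0,20], (20,30],
  # (30,40], (40,100], so an age of exactly 20 is labelled "Under 20"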
  mutate(AgeGroup = cut(Age, 
                   breaks = c(0, 20, 30, 40, 100),
                   labels = c("Under 20", "20-29", "30-39", "40+"))) %>%
  
  # convert to Factors
  mutate(ObesityLevel = factor(ObesityLevel, 
                          levels = c("Insufficient_Weight",
                                     "Normal_Weight", 
                                     "Overweight_Level_I",
                                     "Overweight_Level_II",
                                     "Obesity_Type_I",
                                     "Obesity_Type_II",
                                     "Obesity_Type_III"),
                          ordered = TRUE),
         SnackBetweenMeals = factor(SnackBetweenMeals,
                                    levels = c("no", "Sometimes", "Frequently", "Always"),
                                    ordered = TRUE),
         Alcohol = factor(Alcohol,
                          levels = c("no", "Sometimes", "Frequently", "Always"),
                          ordered = TRUE),
    
    # nominal factors (No order) & integer-to-factor
    across(c(Gender, FamilyHistory, HighCaloricFood, Smoker, CalorieMonitoring, Transport, AgeGroup), as.factor),
    across(c(VegConsumption, MealsPerDay, WaterIntake, PhysicalActivity, TechUseTime), as.factor))

#View(obesity_clean)
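
The binning step depends on how cut() treats values that fall exactly on a break. The sketch below uses a handful of hypothetical ages (not the real data) to compare the default right-closed intervals used above with the left-closed alternative (right = FALSE), which would match labels such as "20-29" literally. It is only an illustration of the boundary behaviour; the results reported in this document were produced with the default setting.

```r
# Hypothetical ages chosen to sit on or near the bin edges
ages <- c(19.5, 20, 20.3, 30, 40, 61)
age_breaks <- c(0, 20, 30, 40, 100)
age_labels <- c("Under 20", "20-29", "30-39", "40+")

# Default (right = TRUE): intervals are (0,20], (20,30], (30,40], (40,100]
default_bins <- cut(ages, breaks = age_breaks, labels = age_labels)

# right = FALSE: intervals are [0,20), [20,30), [30,40), [40,100)
left_closed_bins <- cut(ages, breaks = age_breaks, labels = age_labels,
                        right = FALSE)

# Side-by-side comparison; only the ages exactly on a break differ
print(data.frame(age = ages,
                 default = default_bins,
                 left_closed = left_closed_bins))
```

Only ages that are exactly 20, 30, or 40 change group, so the choice matters mainly for the survey respondents who reported whole-number ages.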

Data Quality Check

Before proceeding, we verify that the dataset contains no missing values and inspect the distribution of the target variable.

# 1. check for missing values
any_missing <- sum(is.na(obesity_clean))
cat("Total Missing Values in Dataset:", any_missing, "\n")

## Total Missing Values in Dataset: 0

# 2. verify target variable balance
# since SMOTE was used, we expect these counts to be roughly equal.
table(obesity_clean$ObesityLevel)

## 
## Insufficient_Weight       Normal_Weight  Overweight_Level_I Overweight_Level_II 
##                 272                 287                 290                 290 
##      Obesity_Type_I     Obesity_Type_II    Obesity_Type_III 
##                 351                 297                 324

Table 2: Variable Descriptions and Data Dictionary

Original Name	New Name	Description	Coding / Units / Levels
Gender	`Gender`	Gender of the individual	Male, Female
Age	`Age`	Age of the individual	Years (Numeric)
Height	`Height`	Height of the individual	Meters (Numeric)
Weight	`Weight`	Weight of the individual	Kilograms (Numeric)
family_history…	`FamilyHistory`	Has a family member who is overweight	yes, no
FAVC	`HighCaloricFood`	Frequent consumption of high caloric food	yes, no
FCVC	`VegConsumption`	Frequency of vegetable consumption	1 = Never; 2 = Sometimes; 3 = Always
NCP	`MealsPerDay`	Number of main meals per day	1, 2, 3, or 4 meals
CAEC	`SnackBetweenMeals`	Consumption of food between meals	no, Sometimes, Frequently, Always
SMOKE	`Smoker`	Does the person smoke?	yes, no
CH2O	`WaterIntake`	Daily water consumption	1 = Less than 1 L; 2 = Between 1–2 L; 3 = More than 2 L
SCC	`CalorieMonitoring`	Monitors calorie consumption	yes, no
FAF	`PhysicalActivity`	Physical activity frequency	0 = None; 1 = 1–2 days/week; 2 = 2–4 days/week; 3 = 4–5 days/week
TUE	`TechUseTime`	Time using technology devices	0 = 0–2 hours; 1 = 3–5 hours; 2 = More than 5 hours
CALC	`Alcohol`	Alcohol consumption frequency	no, Sometimes, Frequently, Always
MTRANS	`Transport`	Main mode of transportation	Public_Transportation, Walking, Automobile, Motorbike, Bike
NObeyesdad	`ObesityLevel`	Target Variable (Obesity Classification)	Insufficient_Weight, Normal_Weight, Overweight_Level_I, Overweight_Level_II, Obesity_Type_I, Obesity_Type_II, Obesity_Type_III

3. Descriptive Statistics

Visualizing Associations

Figure 3a: Distribution of Obesity Levels by Transportation Mode

This stacked bar chart visualizes the relationship between the mode of transportation and obesity categories.

library(ggplot2)
library(viridis)
library(viridisLite)

ggplot(obesity_clean, aes(x = Transport, fill = ObesityLevel)) +
  geom_bar(position = "fill") +
  labs(
    title = "Obesity Levels by Transportation Mode",
    x = "Transportation Mode",
    y = "Proportion",
    fill = "Obesity Level"
  ) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

The chart suggests an association between active transportation (Walking, Bike) and lower obesity levels: those bars consist mostly of the ‘Normal_Weight’ and ‘Overweight_Level_I’ categories. In contrast, ‘Public_Transportation’ shows the highest share of ‘Obesity_Type_III’ (yellow), while ‘Automobile’ users show the most varied distribution across the weight classes. The Bike and Motorbike bars rest on very few observations (7 and 11 individuals respectively, see Table 4.1), so their proportions should be read with caution.

Figure 3b: Age Distribution by Obesity Level

This box plot compares the age distribution across different obesity categories. The horizontal line inside each box represents the median age for that group.

ggplot(obesity_clean, aes(x = ObesityLevel, y = Age, fill = ObesityLevel)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red", outlier.shape = 1) +
  labs(title = "Age Distribution across Obesity Levels",
       subtitle = "Comparing median ages and variability",
       x = "Obesity Category",
       y = "Age (Years)",
       fill = "Obesity Level") +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") # legend is redundant with the x-axis labels

The box plot reveals distinct age-related patterns across obesity categories:

Younger Demographics: Individuals classified as Insufficient Weight, Normal Weight, and Overweight Level I are younger than the heavier groups, with median ages typically under 25.

Peak Age in Middle Categories: A shift is observed for Overweight Level II, Obesity Type I, and Obesity Type II, where the median age increases, suggesting these categories are more prevalent in individuals in their late 20s to 30s.

Low Variability in Obesity Type III: The Obesity Type III group exhibits a distinct distribution with very low interquartile variability (a short box), suggesting a highly clustered age group.

Figure 3c: Physical Activity Frequency vs. Obesity Level

This jitter plot visualizes individual data points to show the density of physical activity habits across obesity categories. We add a small amount of random noise (“jitter”) to separate overlapping points.

ggplot(obesity_clean, aes(x = ObesityLevel,
                          y = PhysicalActivity,
                          color = ObesityLevel)) +
  geom_jitter(alpha = 0.6, width = 0.2, height = 0.2, size = 1.5) +
  labs(title = "Physical Activity Frequency by Obesity Level",
       subtitle = "Visualizing the density of activity habits",
       x = "Obesity Category", y = "Physical Activity Frequency",
       color = "Obesity Level") +
  scale_color_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none")

The jitter plot suggests an inverse relationship between physical activity frequency (0 to 3) and obesity level, although the pattern is not uniform across categories.

The densest cluster at zero physical activity is in Obesity Type III, where 58% of individuals report no activity (Table 4.1). This is consistent with inactivity being associated with higher obesity levels.
Conversely, the highest activity level (3) is most common in the Normal Weight group (14%) and is absent from Obesity Type II and Obesity Type III.
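
To put numbers on the visual impression, column percentages can be computed from the cross-tabulation. The sketch below rebuilds the Physical Activity by Obesity Level counts by hand from Table 4.1 so that it runs without the dataset; the object names are illustrative. With the cleaned data, the same counts would come from table(obesity_clean$PhysicalActivity, obesity_clean$ObesityLevel).

```r
levels_obesity <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                    "Obesity_Type_III")

# Counts copied from Table 4.1 (one row per activity level 0-3)
activity_counts <- matrix(
  c( 72, 80,  84,  97, 131,  69, 187,
     72, 97, 126, 125, 123, 165,  68,
    117, 69,  56,  50,  72,  63,  69,
     11, 41,  24,  18,  25,   0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(PhysicalActivity = as.character(0:3),
                  ObesityLevel = levels_obesity)
)

# Share of each activity level within every obesity category
# (margin = 2 makes each column sum to 100%)
activity_pct <- round(100 * prop.table(activity_counts, margin = 2), 1)
print(activity_pct)
```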

Figure 3d: Impact of Family History on Obesity

This 100% stacked bar chart compares the distribution of obesity levels between individuals with and without a family history of overweight.

ggplot(obesity_clean, aes(x = FamilyHistory, fill = ObesityLevel)) +
  geom_bar(position = "fill") +
  labs(title = "Obesity Levels by Family History",
       subtitle = "Comparing those with vs. without a family history of overweight",
       x = "Family History of Overweight",
       y = "Proportion",
       fill = "Obesity Level") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_viridis(discrete = TRUE, option = "D") +
  theme_minimal() +
  theme(legend.position = "right")

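There is a strong association between family history and obesity level. Fewer than half of the Insufficient Weight group (46%) report a family history of overweight, compared with 94% to 100% of the Overweight Level II and Obesity groups (Table 4.1). Because the data are cross-sectional and largely synthetic, this is evidence of association rather than of a causal effect.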

4. Statistical Analysis

4.1 Pearson’s Chi-Square Test of Independence

To evaluate the relationship between categorical lifestyle variables (e.g., Transportation, Family History) and Obesity Levels, we employ Pearson’s Chi-Square Test of Independence. This test determines whether the observed distribution of frequencies differs significantly from what would be expected under the assumption of independence.

Hypotheses

For each categorical predictor variable, we test the following hypotheses:

Null Hypothesis (\(H_0\)): There is no association; the distribution of obesity is the same across all groups.
Alternative Hypothesis (\(H_1\)): There is a significant association; the distribution of obesity varies by group.

Mathematical Formulation

The test statistic (\(\chi^2\)) is calculated by summing the squared differences between observed (\(O_{ij}\)) and expected (\(E_{ij}\)) frequencies, normalized by the expected frequencies:

\[\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

Where:

\(r\) = Number of rows (categories of the lifestyle variable).
\(c\) = Number of columns (categories of Obesity Level).
\(O_{ij}\) = The observed count of individuals in row \(i\) and column \(j\).
\(E_{ij}\) = The expected count under the assumption of independence.

Expected Frequencies

The expected frequency for each cell is calculated based on the marginal totals (row and column sums):

\[E_{ij} = \frac{(\text{Row Total}_i) \times (\text{Column Total}_j)}{\text{Grand Total (N)}}\]

Decision Rule

Under \(H_0\), the calculated \(\chi^2\) statistic approximately follows a Chi-Square distribution with degrees of freedom \(df = (r-1)(c-1)\), provided the expected counts are not too small.

We calculate a p-value, which represents the probability of observing a \(\chi^2\) statistic at least as extreme as the one calculated, assuming \(H_0\) is true.
Significance Level (\(\alpha\)): We set \(\alpha = 0.05\).
If p-value < 0.05, we reject \(H_0\) and conclude that there is a statistically significant association between the lifestyle factor and obesity.
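
The formulas above can be traced step by step in a short worked example. The sketch below rebuilds the Gender by Obesity Level counts from Table 4.1 (so it runs without the dataset), computes the expected counts, the \(\chi^2\) statistic, the degrees of freedom, and the p-value by hand, and then cross-checks the statistic against base R's chisq.test(). Object names such as gender_tbl and dof are illustrative.

```r
levels_obesity <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                    "Obesity_Type_III")

# Female and Male counts per obesity level, copied from Table 4.1
gender_tbl <- matrix(
  c(173, 141, 145, 103, 156,   2, 323,
     99, 146, 145, 187, 195, 295,   1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"), ObesityLevel = levels_obesity)
)

# Expected counts under independence: E_ij = (row total_i * column total_j) / N
expected <- outer(rowSums(gender_tbl), colSums(gender_tbl)) / sum(gender_tbl)

# Pearson statistic, degrees of freedom, and upper-tail p-value
chi_sq  <- sum((gender_tbl - expected)^2 / expected)
dof     <- (nrow(gender_tbl) - 1) * (ncol(gender_tbl) - 1)
p_value <- pchisq(chi_sq, df = dof, lower.tail = FALSE)

# Cross-check against the built-in test (continuity correction only
# applies to 2 x 2 tables, so correct = FALSE changes nothing here)
builtin <- chisq.test(gender_tbl, correct = FALSE)
print(c(manual = chi_sq, builtin = unname(builtin$statistic),
        df = dof, p_value = p_value))
```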

Table 4.1: Demographic and Lifestyle Characteristics by Obesity Level

The following table summarizes the demographic profile and lifestyle behaviors of the study population, stratified by obesity category. P-values test for differences across obesity categories (Chi-square test for categorical variables, Kruskal-Wallis test for the continuous variables Height and Weight).

library(gtsummary)

# create summary table
obesity_clean %>%
  select(ObesityLevel, AgeGroup, Gender,Height, Weight, FamilyHistory, PhysicalActivity, HighCaloricFood, VegConsumption,MealsPerDay,SnackBetweenMeals,Smoker, WaterIntake,CalorieMonitoring,TechUseTime,Alcohol, Transport) %>%
  tbl_summary(
    by = ObesityLevel, # split the table by Obesity Level
    statistic = list(all_categorical() ~ "{n} ({p}%)"),
    digits = all_continuous() ~ 1,
    label = list(
      AgeGroup ~ "Age Group",
      Gender ~ "Gender",
      Height ~ "Height",
      Weight ~ "Weight",
      FamilyHistory ~ "Family History of Obesity",
      PhysicalActivity ~ "Physical Activity (Days/Week)",
      HighCaloricFood ~ "High Caloric Food Intake",
      VegConsumption ~ "Vegetable Consumption",
      MealsPerDay ~ "Meals Per Day",
      Smoker ~ "Smoker",
      SnackBetweenMeals ~ "Snack Between Meals",
      CalorieMonitoring ~ "Calorie Monitoring",
      WaterIntake ~ "Daily Water Intake",
      TechUseTime ~ "Tech Use Time",
      Alcohol ~ "Alcohol Consumption Frequency",
      Transport ~ "Transportation Mode"
    )
  ) %>%
  add_p() %>%         # add P-values automatically
  add_overall() %>%   # add a column for the total population
  bold_labels()

Characteristic	Overall N = 2,111¹	Insufficient_Weight N = 272¹	Normal_Weight N = 287¹	Overweight_Level_I N = 290¹	Overweight_Level_II N = 290¹	Obesity_Type_I N = 351¹	Obesity_Type_II N = 297¹	Obesity_Type_III N = 324¹	p-value²
Age Group									<0.001
Under 20	585 (28%)	180 (66%)	132 (46%)	94 (32%)	59 (20%)	76 (22%)	1 (0.3%)	43 (13%)
20-29	1,170 (55%)	89 (33%)	137 (48%)	153 (53%)	135 (47%)	189 (54%)	186 (63%)	281 (87%)
30-39	299 (14%)	3 (1.1%)	16 (5.6%)	40 (14%)	80 (28%)	62 (18%)	98 (33%)	0 (0%)
40+	57 (2.7%)	0 (0%)	2 (0.7%)	3 (1.0%)	16 (5.5%)	24 (6.8%)	12 (4.0%)	0 (0%)
Gender									<0.001
Female	1,043 (49%)	173 (64%)	141 (49%)	145 (50%)	103 (36%)	156 (44%)	2 (0.7%)	323 (100%)
Male	1,068 (51%)	99 (36%)	146 (51%)	145 (50%)	187 (64%)	195 (56%)	295 (99%)	1 (0.3%)
Height	1.7 (1.6, 1.8)	1.7 (1.6, 1.8)	1.7 (1.6, 1.8)	1.7 (1.6, 1.8)	1.7 (1.7, 1.8)	1.7 (1.6, 1.8)	1.8 (1.8, 1.8)	1.7 (1.6, 1.7)	<0.001
Weight	83.0 (65.4, 107.5)	50.0 (44.7, 53.7)	61.0 (55.0, 69.0)	75.0 (68.1, 80.0)	82.0 (78.0, 86.9)	90.7 (82.1, 103.8)	117.8 (112.0, 120.8)	112.0 (109.1, 133.5)	<0.001
Family History of Obesity	1,726 (82%)	126 (46%)	155 (54%)	209 (72%)	272 (94%)	344 (98%)	296 (100%)	324 (100%)	<0.001
Physical Activity (Days/Week)									<0.001
0	720 (34%)	72 (26%)	80 (28%)	84 (29%)	97 (33%)	131 (37%)	69 (23%)	187 (58%)
1	776 (37%)	72 (26%)	97 (34%)	126 (43%)	125 (43%)	123 (35%)	165 (56%)	68 (21%)
2	496 (23%)	117 (43%)	69 (24%)	56 (19%)	50 (17%)	72 (21%)	63 (21%)	69 (21%)
3	119 (5.6%)	11 (4.0%)	41 (14%)	24 (8.3%)	18 (6.2%)	25 (7.1%)	0 (0%)	0 (0%)
High Caloric Food Intake	1,866 (88%)	221 (81%)	208 (72%)	268 (92%)	216 (74%)	340 (97%)	290 (98%)	323 (100%)	<0.001
Vegetable Consumption									<0.001
1	102 (4.8%)	23 (8.5%)	18 (6.3%)	14 (4.8%)	9 (3.1%)	17 (4.8%)	21 (7.1%)	0 (0%)
2	1,013 (48%)	86 (32%)	155 (54%)	186 (64%)	192 (66%)	256 (73%)	138 (46%)	0 (0%)
3	996 (47%)	163 (60%)	114 (40%)	90 (31%)	89 (31%)	78 (22%)	138 (46%)	324 (100%)
Meals Per Day									<0.001
1	316 (15%)	37 (14%)	52 (18%)	76 (26%)	48 (17%)	79 (23%)	24 (8.1%)	0 (0%)
2	176 (8.3%)	18 (6.6%)	0 (0%)	23 (7.9%)	52 (18%)	47 (13%)	36 (12%)	0 (0%)
3	1,470 (70%)	145 (53%)	206 (72%)	158 (54%)	184 (63%)	225 (64%)	228 (77%)	324 (100%)
4	149 (7.1%)	72 (26%)	29 (10%)	33 (11%)	6 (2.1%)	0 (0%)	9 (3.0%)	0 (0%)
Snack Between Meals									<0.001
no	51 (2.4%)	3 (1.1%)	10 (3.5%)	35 (12%)	1 (0.3%)	1 (0.3%)	1 (0.3%)	0 (0%)
Sometimes	1,765 (84%)	146 (54%)	159 (55%)	236 (81%)	270 (93%)	338 (96%)	293 (99%)	323 (100%)
Frequently	242 (11%)	121 (44%)	83 (29%)	14 (4.8%)	16 (5.5%)	6 (1.7%)	1 (0.3%)	1 (0.3%)
Always	53 (2.5%)	2 (0.7%)	35 (12%)	5 (1.7%)	3 (1.0%)	6 (1.7%)	2 (0.7%)	0 (0%)
Smoker	44 (2.1%)	1 (0.4%)	13 (4.5%)	3 (1.0%)	5 (1.7%)	6 (1.7%)	15 (5.1%)	1 (0.3%)	<0.001
Daily Water Intake									<0.001
1	485 (23%)	84 (31%)	83 (29%)	60 (21%)	47 (16%)	68 (19%)	82 (28%)	61 (19%)
2	1,110 (53%)	142 (52%)	164 (57%)	154 (53%)	186 (64%)	173 (49%)	177 (60%)	114 (35%)
3	516 (24%)	46 (17%)	40 (14%)	76 (26%)	57 (20%)	110 (31%)	38 (13%)	149 (46%)
Calorie Monitoring	96 (4.5%)	22 (8.1%)	30 (10%)	37 (13%)	4 (1.4%)	2 (0.6%)	1 (0.3%)	0 (0%)	<0.001
Tech Use Time									<0.001
0	952 (45%)	94 (35%)	129 (45%)	164 (57%)	114 (39%)	169 (48%)	173 (58%)	109 (34%)
1	915 (43%)	127 (47%)	122 (43%)	82 (28%)	145 (50%)	121 (34%)	103 (35%)	215 (66%)
2	244 (12%)	51 (19%)	36 (13%)	44 (15%)	31 (11%)	61 (17%)	21 (7.1%)	0 (0%)
Alcohol Consumption Frequency
no	639 (30%)	117 (43%)	107 (37%)	50 (17%)	128 (44%)	165 (47%)	71 (24%)	1 (0.3%)
Sometimes	1,401 (66%)	154 (57%)	161 (56%)	224 (77%)	143 (49%)	172 (49%)	224 (75%)	323 (100%)
Frequently	70 (3.3%)	1 (0.4%)	18 (6.3%)	16 (5.5%)	19 (6.6%)	14 (4.0%)	2 (0.7%)	0 (0%)
Always	1 (<0.1%)	0 (0%)	1 (0.3%)	0 (0%)	0 (0%)	0 (0%)	0 (0%)	0 (0%)
Transportation Mode
Automobile	457 (22%)	46 (17%)	45 (16%)	66 (23%)	94 (32%)	110 (31%)	95 (32%)	1 (0.3%)
Bike	7 (0.3%)	0 (0%)	4 (1.4%)	2 (0.7%)	0 (0%)	0 (0%)	1 (0.3%)	0 (0%)
Motorbike	11 (0.5%)	0 (0%)	6 (2.1%)	1 (0.3%)	1 (0.3%)	3 (0.9%)	0 (0%)	0 (0%)
Public_Transportation	1,580 (75%)	220 (81%)	200 (70%)	212 (73%)	189 (65%)	236 (67%)	200 (67%)	323 (100%)
Walking	56 (2.7%)	6 (2.2%)	32 (11%)	9 (3.1%)	6 (2.1%)	2 (0.6%)	1 (0.3%)	0 (0%)
¹ n (%); Median (Q1, Q3)
² Pearson’s Chi-squared test; Kruskal-Wallis rank sum test; NA

Why are there no p-values for Alcohol and Transportation?

Let’s take a look at the counts:

table(obesity_clean$Transport, obesity_clean$ObesityLevel)

##                        
##                         Insufficient_Weight Normal_Weight Overweight_Level_I
##   Automobile                             46            45                 66
##   Bike                                    0             4                  2
##   Motorbike                               0             6                  1
##   Public_Transportation                 220           200                212
##   Walking                                 6            32                  9
##                        
##                         Overweight_Level_II Obesity_Type_I Obesity_Type_II
##   Automobile                             94            110              95
##   Bike                                    0              0               1
##   Motorbike                               1              3               0
##   Public_Transportation                 189            236             200
##   Walking                                 6              2               1
##                        
##                         Obesity_Type_III
##   Automobile                           1
##   Bike                                 0
##   Motorbike                            0
##   Public_Transportation              323
##   Walking                              0

table(obesity_clean$Alcohol, obesity_clean$ObesityLevel)

##             
##              Insufficient_Weight Normal_Weight Overweight_Level_I
##   no                         117           107                 50
##   Sometimes                  154           161                224
##   Frequently                   1            18                 16
##   Always                       0             1                  0
##             
##              Overweight_Level_II Obesity_Type_I Obesity_Type_II
##   no                         128            165              71
##   Sometimes                  143            172             224
##   Frequently                  19             14               2
##   Always                       0              0               0
##             
##              Obesity_Type_III
##   no                        1
##   Sometimes               323
##   Frequently                0
##   Always                    0

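The sparse cells we see (e.g., 0 individuals with Obesity Type III who walk, and only 1 individual in the entire dataset who drinks alcohol Always) undermine the standard Chi-Square test.

The Chi-Square statistic divides by each cell’s expected count, and its p-value relies on a large-sample approximation that is only trustworthy when the expected counts are reasonably large (a common rule of thumb is at least 5 per cell). Here several expected counts are far smaller: for Always drinkers with Insufficient Weight, \(E = 1 \times 272 / 2111 \approx 0.13\). When any expected count falls below 5, gtsummary’s add_p() switches by default from the Chi-Square test to Fisher’s exact test, which most likely could not be computed for tables of this size, so the p-value is reported as NA (see footnote 2 of the table).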

The Solution: Monte Carlo Simulation

We will use Monte Carlo simulation: instead of relying on the large-sample approximation, R generates 2,000 random tables with the same row and column totals as the observed table and reports how often their \(\chi^2\) statistic is at least as large as the observed one.
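
The idea can be tried directly with base R before touching the summary table. The sketch below rebuilds the Alcohol by Obesity Level counts printed above, counts how many expected values fall below the rule-of-thumb threshold of 5, and then requests a simulated p-value from chisq.test(). The seed value is an arbitrary choice made only so that the result is reproducible.

```r
levels_obesity <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                    "Obesity_Type_III")

# Counts copied from the Alcohol x ObesityLevel table above
alcohol_tbl <- matrix(
  c(117, 107,  50, 128, 165,  71,   1,
    154, 161, 224, 143, 172, 224, 323,
      1,  18,  16,  19,  14,   2,   0,
      0,   1,   0,   0,   0,   0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(Alcohol = c("no", "Sometimes", "Frequently", "Always"),
                  ObesityLevel = levels_obesity)
)

# Expected counts under independence and how many are below 5
expected <- outer(rowSums(alcohol_tbl), colSums(alcohol_tbl)) / sum(alcohol_tbl)
n_sparse <- sum(expected < 5)
print(n_sparse)

# Monte Carlo p-value: B random tables with the observed margins
set.seed(2024)  # arbitrary seed, for reproducibility only
mc_test <- chisq.test(alcohol_tbl, simulate.p.value = TRUE, B = 2000)
print(mc_test$p.value)
```

With B replicates, the smallest p-value the simulation can return is \(1/(B+1)\), roughly 0.0005 for B = 2000, which is why strongly associated variables appear as <0.001 in the summary table.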

Let’s rerun the code:

# create  summary table
obesity_clean %>%
  select(ObesityLevel, AgeGroup, Gender,Height, Weight, FamilyHistory, PhysicalActivity, HighCaloricFood, VegConsumption,MealsPerDay,SnackBetweenMeals,Smoker, WaterIntake,CalorieMonitoring,TechUseTime,Alcohol, Transport) %>%
  tbl_summary(
    by = ObesityLevel, # split the table by Obesity Level
    statistic = list(all_categorical() ~ "{n} ({p}%)"),
    digits = all_continuous() ~ 1,
    label = list(
      AgeGroup ~ "Age Group",
      Gender ~ "Gender",
      Height ~ "Height",
      Weight ~ "Weight",
      FamilyHistory ~ "Family History of Obesity",
      PhysicalActivity ~ "Physical Activity (Days/Week)",
      HighCaloricFood ~ "High Caloric Food Intake",
      VegConsumption ~ "Vegetable Consumption",
      MealsPerDay ~ "Meals Per Day",
      Smoker ~ "Smoker",
      SnackBetweenMeals ~ "Snack Between Meals",
      CalorieMonitoring ~ "Calorie Monitoring",
      WaterIntake ~ "Daily Water Intake",
      TechUseTime ~ "Tech Use Time",
      Alcohol ~ "Alcohol Consumption Frequency",
      Transport ~ "Transportation Mode")) %>%
  # Monte Carlo fix: simulated p-values for every chi-square test
  # (add_p() is called once; a second add_p() call is not needed)
  add_p(test = all_categorical() ~ "chisq.test",
        test.args = all_tests("chisq.test") ~ list(simulate.p.value = TRUE, B = 2000)) %>%
  add_overall() %>%   # add a column for the total population
  bold_labels()

Characteristic	Overall N = 2,111¹	Insufficient_Weight N = 272¹	Normal_Weight N = 287¹	Overweight_Level_I N = 290¹	Overweight_Level_II N = 290¹	Obesity_Type_I N = 351¹	Obesity_Type_II N = 297¹	Obesity_Type_III N = 324¹	p-value²
Age Group									<0.001
Under 20	585 (28%)	180 (66%)	132 (46%)	94 (32%)	59 (20%)	76 (22%)	1 (0.3%)	43 (13%)
20-29	1,170 (55%)	89 (33%)	137 (48%)	153 (53%)	135 (47%)	189 (54%)	186 (63%)	281 (87%)
30-39	299 (14%)	3 (1.1%)	16 (5.6%)	40 (14%)	80 (28%)	62 (18%)	98 (33%)	0 (0%)
40+	57 (2.7%)	0 (0%)	2 (0.7%)	3 (1.0%)	16 (5.5%)	24 (6.8%)	12 (4.0%)	0 (0%)
Gender									<0.001
Female	1,043 (49%)	173 (64%)	141 (49%)	145 (50%)	103 (36%)	156 (44%)	2 (0.7%)	323 (100%)
Male	1,068 (51%)	99 (36%)	146 (51%)	145 (50%)	187 (64%)	195 (56%)	295 (99%)	1 (0.3%)
Height	1.7 (1.6, 1.8)	1.7 (1.6, 1.8)	1.7 (1.6, 1.8)	1.7 (1.6, 1.8)	1.7 (1.7, 1.8)	1.7 (1.6, 1.8)	1.8 (1.8, 1.8)	1.7 (1.6, 1.7)	<0.001
Weight	83.0 (65.4, 107.5)	50.0 (44.7, 53.7)	61.0 (55.0, 69.0)	75.0 (68.1, 80.0)	82.0 (78.0, 86.9)	90.7 (82.1, 103.8)	117.8 (112.0, 120.8)	112.0 (109.1, 133.5)	<0.001
Family History of Obesity	1,726 (82%)	126 (46%)	155 (54%)	209 (72%)	272 (94%)	344 (98%)	296 (100%)	324 (100%)	<0.001
Physical Activity (Days/Week)									<0.001
0	720 (34%)	72 (26%)	80 (28%)	84 (29%)	97 (33%)	131 (37%)	69 (23%)	187 (58%)
1	776 (37%)	72 (26%)	97 (34%)	126 (43%)	125 (43%)	123 (35%)	165 (56%)	68 (21%)
2	496 (23%)	117 (43%)	69 (24%)	56 (19%)	50 (17%)	72 (21%)	63 (21%)	69 (21%)
3	119 (5.6%)	11 (4.0%)	41 (14%)	24 (8.3%)	18 (6.2%)	25 (7.1%)	0 (0%)	0 (0%)
High Caloric Food Intake	1,866 (88%)	221 (81%)	208 (72%)	268 (92%)	216 (74%)	340 (97%)	290 (98%)	323 (100%)	<0.001
Vegetable Consumption									<0.001
1	102 (4.8%)	23 (8.5%)	18 (6.3%)	14 (4.8%)	9 (3.1%)	17 (4.8%)	21 (7.1%)	0 (0%)
2	1,013 (48%)	86 (32%)	155 (54%)	186 (64%)	192 (66%)	256 (73%)	138 (46%)	0 (0%)
3	996 (47%)	163 (60%)	114 (40%)	90 (31%)	89 (31%)	78 (22%)	138 (46%)	324 (100%)
Meals Per Day									<0.001
1	316 (15%)	37 (14%)	52 (18%)	76 (26%)	48 (17%)	79 (23%)	24 (8.1%)	0 (0%)
2	176 (8.3%)	18 (6.6%)	0 (0%)	23 (7.9%)	52 (18%)	47 (13%)	36 (12%)	0 (0%)
3	1,470 (70%)	145 (53%)	206 (72%)	158 (54%)	184 (63%)	225 (64%)	228 (77%)	324 (100%)
4	149 (7.1%)	72 (26%)	29 (10%)	33 (11%)	6 (2.1%)	0 (0%)	9 (3.0%)	0 (0%)
Snack Between Meals									<0.001
no	51 (2.4%)	3 (1.1%)	10 (3.5%)	35 (12%)	1 (0.3%)	1 (0.3%)	1 (0.3%)	0 (0%)
Sometimes	1,765 (84%)	146 (54%)	159 (55%)	236 (81%)	270 (93%)	338 (96%)	293 (99%)	323 (100%)
Frequently	242 (11%)	121 (44%)	83 (29%)	14 (4.8%)	16 (5.5%)	6 (1.7%)	1 (0.3%)	1 (0.3%)
Always	53 (2.5%)	2 (0.7%)	35 (12%)	5 (1.7%)	3 (1.0%)	6 (1.7%)	2 (0.7%)	0 (0%)
Smoker	44 (2.1%)	1 (0.4%)	13 (4.5%)	3 (1.0%)	5 (1.7%)	6 (1.7%)	15 (5.1%)	1 (0.3%)	<0.001
Daily Water Intake									<0.001
1	485 (23%)	84 (31%)	83 (29%)	60 (21%)	47 (16%)	68 (19%)	82 (28%)	61 (19%)
2	1,110 (53%)	142 (52%)	164 (57%)	154 (53%)	186 (64%)	173 (49%)	177 (60%)	114 (35%)
3	516 (24%)	46 (17%)	40 (14%)	76 (26%)	57 (20%)	110 (31%)	38 (13%)	149 (46%)
Calorie Monitoring	96 (4.5%)	22 (8.1%)	30 (10%)	37 (13%)	4 (1.4%)	2 (0.6%)	1 (0.3%)	0 (0%)	<0.001
Tech Use Time									<0.001
0	952 (45%)	94 (35%)	129 (45%)	164 (57%)	114 (39%)	169 (48%)	173 (58%)	109 (34%)
1	915 (43%)	127 (47%)	122 (43%)	82 (28%)	145 (50%)	121 (34%)	103 (35%)	215 (66%)
2	244 (12%)	51 (19%)	36 (13%)	44 (15%)	31 (11%)	61 (17%)	21 (7.1%)	0 (0%)
Alcohol Consumption Frequency									<0.001
no	639 (30%)	117 (43%)	107 (37%)	50 (17%)	128 (44%)	165 (47%)	71 (24%)	1 (0.3%)
Sometimes	1,401 (66%)	154 (57%)	161 (56%)	224 (77%)	143 (49%)	172 (49%)	224 (75%)	323 (100%)
Frequently	70 (3.3%)	1 (0.4%)	18 (6.3%)	16 (5.5%)	19 (6.6%)	14 (4.0%)	2 (0.7%)	0 (0%)
Always	1 (<0.1%)	0 (0%)	1 (0.3%)	0 (0%)	0 (0%)	0 (0%)	0 (0%)	0 (0%)
Transportation Mode									<0.001
Automobile	457 (22%)	46 (17%)	45 (16%)	66 (23%)	94 (32%)	110 (31%)	95 (32%)	1 (0.3%)
Bike	7 (0.3%)	0 (0%)	4 (1.4%)	2 (0.7%)	0 (0%)	0 (0%)	1 (0.3%)	0 (0%)
Motorbike	11 (0.5%)	0 (0%)	6 (2.1%)	1 (0.3%)	1 (0.3%)	3 (0.9%)	0 (0%)	0 (0%)
Public_Transportation	1,580 (75%)	220 (81%)	200 (70%)	212 (73%)	189 (65%)	236 (67%)	200 (67%)	323 (100%)
Walking	56 (2.7%)	6 (2.2%)	32 (11%)	9 (3.1%)	6 (2.1%)	2 (0.6%)	1 (0.3%)	0 (0%)
¹ n (%); Median (Q1, Q3)
² Pearson’s Chi-squared test with simulated p-value (based on 2000 replicates); Kruskal-Wallis rank sum test

All variables show a statistically significant association with obesity level (p < 0.001). [Note: in the PDF version, the table appears at the very end of this report because R places it automatically. Please see the HTML version for better visualization.]

4.2 Cramér’s V and Goodman-Kruskal Lambda: Strength of Association and Effect Size

Theoretical Framework

While the Chi-Square test of independence (reported in Table 4.1) tells us whether a relationship exists (statistical significance), it does not quantify how strong that relationship is. In large datasets, even weak relationships can yield statistically significant p-values.

To address this, we employ two distinct effect size metrics for categorical data: Cramér’s V and Goodman-Kruskal Lambda.

4.2.1 Cramér’s V (Symmetric Association)

Cramér’s V measures the strength of association between two nominal variables. It is a symmetric measure, meaning the relationship between \(X \rightarrow Y\) is treated the same as \(Y \rightarrow X\).

Mathematical Formulation: \[V = \sqrt{\frac{\chi^2}{N \cdot \min(r-1, c-1)}}\]

Where:

\(\chi^2\): The Pearson Chi-Square statistic.
\(N\): Total sample size.
\(r\): Number of rows (categories in variable 1).
\(c\): Number of columns (categories in variable 2).
\(\min(r-1, c-1)\): The minimum dimension, used to normalize the score.

Interpretation: The coefficient ranges from 0 to 1:

0: No association.
0.1: Weak association.
0.3: Moderate association.
> 0.5: Strong association.
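
As a check on the formula, Cramér’s V can be computed by hand. The sketch below uses the Family History counts from Table 4.1 (the "no" row is obtained as each column total minus the "yes" count) and a small helper, cramers_v(), which is a hypothetical name introduced here for illustration and not part of any package.

```r
levels_obesity <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                    "Obesity_Type_III")

# Family history of overweight by obesity level (from Table 4.1)
family_tbl <- matrix(
  c(126, 155, 209, 272, 344, 296, 324,   # yes
    146, 132,  81,  18,   7,   1,   0),  # no = column total - yes
  nrow = 2, byrow = TRUE,
  dimnames = list(FamilyHistory = c("yes", "no"), ObesityLevel = levels_obesity)
)

# V = sqrt( chi^2 / (N * min(r - 1, c - 1)) )
cramers_v <- function(tbl) {
  chi_sq <- unname(chisq.test(tbl, correct = FALSE)$statistic)
  sqrt(chi_sq / (sum(tbl) * min(nrow(tbl) - 1, ncol(tbl) - 1)))
}

v_family <- cramers_v(family_tbl)
print(round(v_family, 3))
```

Because the table has only two rows, \(\min(r-1, c-1) = 1\) and V reduces to \(\sqrt{\chi^2 / N}\).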

4.2.2 Goodman-Kruskal Lambda (\(\lambda\)) (Predictive Power)

Unlike Cramér’s V, Lambda is an asymmetric measure of association based on the concept of Proportional Reduction in Error (PRE). It answers the question: “By knowing the value of the independent variable (Lifestyle Factor), by what percentage do we reduce the error in guessing the dependent variable (Obesity Level)?”

Mathematical Formulation: \[\lambda = \frac{E_1 - E_2}{E_1}\]

Where:

\(E_1\): The error made when predicting the target variable without any extra information (usually by guessing the modal category).
\(E_2\): The error made when predicting the target variable given the predictor variable.

Interpretation:

\(\lambda = 0\): The predictor provides no information (predictive power is nonexistent).
\(\lambda = 1\): The predictor allows for perfect prediction of the target.
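
Because Lambda is asymmetric, the direction of prediction changes the result. The sketch below implements the \(E_1\) and \(E_2\) definitions above for the Gender by Obesity Level counts from Table 4.1; gk_lambda() is a hypothetical helper written for this illustration, not a library function. Predicting Obesity Level from Gender gives a much smaller value than predicting Gender from Obesity Level, so it is worth confirming which direction a reported Lambda refers to.

```r
levels_obesity <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
                    "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
                    "Obesity_Type_III")

# Rows = Gender (predictor), columns = Obesity Level (target), from Table 4.1
gender_tbl <- matrix(
  c(173, 141, 145, 103, 156,   2, 323,
     99, 146, 145, 187, 195, 295,   1),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"), ObesityLevel = levels_obesity)
)

# Goodman-Kruskal lambda; `predict` names the variable being predicted
gk_lambda <- function(tbl, predict = c("column", "row")) {
  predict <- match.arg(predict)
  if (predict == "row") tbl <- t(tbl)  # after transposing, always predict columns
  n  <- sum(tbl)
  e1 <- n - max(colSums(tbl))          # errors when guessing the overall mode
  e2 <- n - sum(apply(tbl, 1, max))    # errors when guessing the mode within each row
  (e1 - e2) / e1
}

lambda_obesity_given_gender <- gk_lambda(gender_tbl, predict = "column")
lambda_gender_given_obesity <- gk_lambda(gender_tbl, predict = "row")
print(round(c(obesity_given_gender = lambda_obesity_given_gender,
              gender_given_obesity = lambda_gender_given_obesity), 3))
```

The second value (about 0.380) matches the Lambda reported for Gender in the ranking table later in this section, which indicates that the reported figures measure how well Obesity Level predicts each lifestyle factor rather than the reverse.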

Analysis Implementation

Association Heatmap (Cramér’s V)

The following heatmap visualizes the pairwise strength of association between all categorical variables, including the target ObesityLevel. This allows us to check for redundancy among the predictors, the categorical analogue of multicollinearity (variables that are strongly associated with each other).

library(vcd)       #association stats
library(reshape2)  #reshaping the matrix for plotting
library(ggplot2)
library(dplyr)

# 1. define function to calculate Cramer's V for a pair of vectors
get_cramer_v <- function(x, y) {
  tbl <- table(x, y)
  stats <- assocstats(tbl)
  return(stats$cramer)
}

# 2. select variables (remove continuous Height,Weight, Age)
vars <- obesity_clean %>% 
  select(-Height, -Weight, -Age) 

# 3. create an empty matrix
var_names <- names(vars)
n <- length(var_names)
cramer_mat <- matrix(0, nrow=n, ncol=n, dimnames=list(var_names, var_names))

# 4. fill matrix (the diagonal is each variable against itself, so V = 1)
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    cramer_mat[i, j] <- get_cramer_v(vars[[i]], vars[[j]])
  }
}

# 5. reshape for ggplot
melted_cramer <- melt(cramer_mat)

# 6. plot heatmap
ggplot(melted_cramer, aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "white", high = "#c0392b", mid = "#e67e22", 
                       midpoint = 0.5, limits = c(0, 1), name = "Cramer's V") +
  geom_text(aes(label = round(value, 2)), color = "black", size = 3) +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
        axis.title.x = element_blank(),
        axis.title.y = element_blank()) +
  coord_fixed() +
  labs(title = "Cramér's V Association Matrix",
       subtitle = "Measuring strength of relationships between categorical variables")

The heatmap tells a clear story: the strongest associations with obesity level belong to Gender (\(V=0.56\)) and Family History (\(V=0.54\)). Lifestyle habits such as Vegetable Consumption and Snacking also show moderate associations, though weaker than these two. Because Cramér’s V measures association rather than causation, these values rank how strongly each variable co-varies with obesity level, not how much it drives it. The large beige area in the chart is a good sign: the weak associations among the predictors themselves indicate low multicollinearity, so each variable contributes largely distinct information to the analysis.
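The V values shown in the heatmap can be sanity-checked without any add-on package. The sketch below is a minimal base-R version of Cramér’s V built on the Pearson chi-square statistic; `cramers_v` is a hypothetical helper name and the two toy factors are invented, so it is a cross-check under those assumptions rather than the code used for the figure.

```r
# Cramér's V from the Pearson chi-square statistic, base R only.
# `cramers_v` is a hypothetical helper, not part of vcd or DescTools.
cramers_v <- function(x, y) {
  tbl  <- table(x, y)
  # correct = FALSE disables the Yates correction (it only affects 2 x 2 tables)
  chi2 <- suppressWarnings(chisq.test(tbl, correct = FALSE)$statistic)
  k    <- min(nrow(tbl), ncol(tbl)) - 1
  unname(sqrt(chi2 / (sum(tbl) * k)))
}

# Invented toy data: 60 observations in three equal groups
x <- factor(rep(c("a", "b", "c"), each = 20))

# A variable compared with itself is perfectly associated (V = 1),
# which is why the diagonal of the heatmap is 1
v_same <- cramers_v(x, x)

# A variable split evenly within every group of x is independent of x (V = 0)
y_indep <- factor(rep(c("u", "v"), times = 30))
v_indep <- cramers_v(x, y_indep)

c(same = v_same, independent = v_indep)
```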

library(DescTools) # for Lambda function
library(knitr)

# initialize a dataframe to store results
ranking <- data.frame(
  Variable = character(),
  CramersV = numeric(),
  Lambda = numeric(), # predictive power
  stringsAsFactors = FALSE
)

# loop through predictors to calculate stats against ObesityLevel
predictors <- names(obesity_clean)[!names(obesity_clean) %in% c("ObesityLevel", "Height", "Weight", "Age")]

for (var in predictors) {
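  # create the contingency table (rows = predictor, columns = ObesityLevel)
  tbl <- table(obesity_clean[[var]], obesity_clean$ObesityLevel)
  
  # calculate stats
  v_stat <- CramerV(tbl)
  # direction = "row" treats the row variable (the predictor) as the dependent
  # variable, i.e. how well does ObesityLevel predict the lifestyle factor?
  # Use direction = "column" to predict ObesityLevel from the predictor instead.
  lambda_stat <- Lambda(tbl, direction = "row") 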
  
  # add to dataframe
  ranking <- rbind(ranking, data.frame(
    Variable = var,
    CramersV = v_stat,
    Lambda = lambda_stat
  ))
}


ranking %>%
  arrange(desc(CramersV)) %>%
  kable(digits = 3, caption = "Ranking Predictors by Association Strength with Obesity Level")

Ranking Predictors by Association Strength with Obesity Level
Variable	CramersV	Lambda
Gender	0.558	0.380
FamilyHistory	0.543	0.052
VegConsumption	0.366	0.365
SnackBetweenMeals	0.356	0.000
HighCaloricFood	0.332	0.000
AgeGroup	0.321	0.097
MealsPerDay	0.278	0.000
CalorieMonitoring	0.241	0.000
Alcohol	0.231	0.000
PhysicalActivity	0.208	0.129
TechUseTime	0.205	0.147
WaterIntake	0.197	0.035
Transport	0.186	0.000
Smoker	0.123	0.000

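This ranking highlights a crucial difference between two variables being related and one of them helping us guess the other. Note the direction of the Lambda column: each table places the lifestyle factor in the rows and the code calls Lambda with direction = "row", which treats the row variable as the dependent one. Each value is therefore the proportional reduction in error when guessing the lifestyle factor from the obesity level, not the reverse. Read this way, Gender (\(\lambda = 0.380\)) and Vegetable Consumption (\(\lambda = 0.365\)) are the factors that obesity level helps most to guess, while Family History, despite its high Cramér’s V, reaches only \(\lambda = 0.052\). Many habits such as Snacking, Alcohol, and Transportation show \(\lambda = 0\). This happens whenever one category of the habit is the most common within every obesity level, so the modal guess never changes. It does not contradict the nonzero Cramér’s V, which picks up shifts in proportions that leave the mode untouched.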
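To make the gap between association and Lambda concrete, the sketch below computes both on an invented 2 x 2 table. `gk_lambda` is a hypothetical base-R helper written for this illustration (it is not the DescTools implementation), and its `dependent` argument names the variable being guessed.

```r
# Invented counts: outcome "A" is the most common outcome in both habit groups
tbl <- matrix(c(60, 40,
                90, 10),
              nrow = 2, byrow = TRUE,
              dimnames = list(habit = c("no", "yes"), outcome = c("A", "B")))

# Hypothetical helper: Goodman-Kruskal lambda for a chosen dependent variable
gk_lambda <- function(tbl, dependent = c("column", "row")) {
  dependent <- match.arg(dependent)
  if (dependent == "row") tbl <- t(tbl)   # always guess the column variable
  n  <- sum(tbl)
  e1 <- n - max(colSums(tbl))             # errors using the overall mode
  e2 <- n - sum(apply(tbl, 1, max))       # errors using the mode within each row
  (e1 - e2) / e1
}

lambda_outcome <- gk_lambda(tbl, dependent = "column")  # guess outcome from habit
lambda_habit   <- gk_lambda(tbl, dependent = "row")     # guess habit from outcome

# Cramér's V on the same table (no Yates correction)
chi2 <- unname(chisq.test(tbl, correct = FALSE)$statistic)
v    <- sqrt(chi2 / (sum(tbl) * (min(dim(tbl)) - 1)))

c(lambda_outcome = lambda_outcome, lambda_habit = lambda_habit, cramers_v = v)
```

The habit shifts the share of outcome "A" from 60% to 90%, which Cramér’s V registers, yet "A" remains the best guess in both groups, so Lambda for the outcome is zero. Swapping the dependent variable gives a different, nonzero value, which is the asymmetry discussed above.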

4.3 Multivariate Visualization: Multiple Correspondence Analysis (MCA)

Theoretical Framework

While pairwise statistics (Cramér’s V) reveal individual links, they cannot fully capture the multidimensional structure of the data. To visualize how all 14 active variables interact simultaneously, we employ Multiple Correspondence Analysis (MCA).

MCA is a dimensionality reduction technique specifically designed for categorical data. It can be viewed as the categorical equivalent of Principal Component Analysis (PCA).

4.3.1. The Indicator Matrix

Unlike standard correlation which uses raw numbers, MCA operates on an Indicator Matrix (or a Disjunctive Table). If an individual belongs to category \(k\), the value is 1; otherwise, it is 0.

\[ X_{ik} = \begin{cases} 1 & \text{if individual } i \text{ belongs to category } k \\ 0 & \text{otherwise} \end{cases} \]
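A minimal sketch of this coding in base R is shown below. The four toy individuals and the object names (`toy`, `Z`) are invented for illustration; in practice the MCA routine builds this matrix internally.

```r
# Four invented individuals described by two categorical variables
toy <- data.frame(
  Transport = factor(c("Walking", "Automobile", "Walking", "Bike")),
  Smoker    = factor(c("no", "no", "yes", "no"))
)

# One 0/1 column per category of each variable, bound side by side
Z <- do.call(cbind, lapply(names(toy), function(v) {
  m <- sapply(levels(toy[[v]]), function(lv) as.integer(toy[[v]] == lv))
  colnames(m) <- paste(v, levels(toy[[v]]), sep = "_")
  m
}))

Z
rowSums(Z)  # every individual has exactly one 1 per variable
colSums(Z)  # category frequencies
```

Each row sums to the number of variables, and each column sum is the number of individuals in that category.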

4.3.2. Geometric Distance (\(\chi^2\)-Distance)

MCA calculates the distance between individuals or categories using the Chi-Square metric. This differs from standard Euclidean distance (straight lines) by weighting categories based on their rarity.

Rare categories (e.g., “Bike Commuters”) contribute more to the inertia (variance) than common categories.
Two categories are “close” in the generated map if they are frequently chosen by the same individuals.
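The rarity weighting can be seen in a small calculation. For an indicator matrix with \(n\) individuals and \(Q\) variables, the squared \(\chi^2\)-distance between individuals \(i\) and \(i'\) is \(\frac{n}{Q}\sum_k (x_{ik} - x_{i'k})^2 / n_k\), where \(n_k\) is the number of individuals in category \(k\). The sketch below applies this to six invented individuals; `chi2_dist2` is a hypothetical helper name.

```r
# Six invented individuals; "Bike" is rarer than "yes" (smoker)
toy <- data.frame(
  Transport = factor(c("Automobile", "Automobile", "Automobile",
                       "Bike", "Automobile", "Automobile")),
  Smoker    = factor(c("no", "no", "yes", "no", "yes", "no"))
)

# Indicator matrix: one 0/1 column per category
Z <- do.call(cbind, lapply(toy, function(f) {
  sapply(levels(f), function(lv) as.integer(f == lv))
}))

n   <- nrow(Z)     # number of individuals
Q   <- ncol(toy)   # number of variables
n_k <- colSums(Z)  # category frequencies

# Squared chi-square distance between individuals i and j
chi2_dist2 <- function(i, j) (n / Q) * sum((Z[i, ] - Z[j, ])^2 / n_k)

d_rare   <- chi2_dist2(1, 4)  # differ only on Transport (Bike, 1 of 6)
d_common <- chi2_dist2(1, 3)  # differ only on Smoker (yes, 2 of 6)

c(rare = d_rare, common = d_common)
```

Both pairs differ on exactly one variable, but the pair separated by the rarer category ends up farther apart.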

4.3.3. Inertia and Dimensions

The algorithm decomposes the total variation (Inertia) in the dataset into orthogonal Dimensions (axes):

Dimension 1: Represents the pattern accounting for the most variance in the data.
Dimension 2: Represents the second most dominant pattern, independent of the first.

The goal is to project the high-dimensional cloud of data points onto a 2D plane (Dim 1 vs. Dim 2) while retaining as much information as possible.
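The decomposition can be sketched in base R by running a correspondence analysis on the indicator matrix through a singular value decomposition. The toy data and object names below are invented; the eigenvalues should correspond to what an MCA routine reports for the same data, but this is an illustration rather than the FactoMineR implementation.

```r
# Eight invented individuals, two variables with 3 and 2 categories
toy <- data.frame(
  A = factor(c("x", "x", "y", "y", "z", "z", "x", "y")),
  B = factor(c("p", "q", "p", "q", "p", "q", "q", "p"))
)

# Indicator matrix
Z <- do.call(cbind, lapply(toy, function(f) {
  sapply(levels(f), function(lv) as.integer(f == lv))
}))

Q <- ncol(toy)  # number of variables
K <- ncol(Z)    # total number of categories

# Correspondence matrix with row and column masses
P       <- Z / sum(Z)
r_mass  <- rowSums(P)
c_mass  <- colSums(P)

# Standardized residuals; their squared singular values are the principal inertias
S   <- diag(1 / sqrt(r_mass)) %*% (P - r_mass %o% c_mass) %*% diag(1 / sqrt(c_mass))
eig <- svd(S)$d^2

total_inertia <- sum(eig)
share         <- eig / total_inertia  # proportion of inertia per dimension

total_inertia  # equals K / Q - 1 for an indicator matrix
round(share, 3)
```

Because the total inertia of an indicator matrix is fixed at \(K/Q - 1\), the percentages reported for the first two dimensions of an MCA are often modest even when the map is informative.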

4.3.4 Supplementary Variables

A critical methodological choice in this analysis is the treatment of ObesityLevel as a Supplementary Variable.

Active Variables: The map is constructed solely based on the lifestyle behaviors (Diet, Transport, Activity, etc.).
Supplementary Variable: The ObesityLevel categories are not used to build the axes. Instead, they are projected onto the established lifestyle map afterward.

Why do this? This allows us to see where specific obesity levels “naturally land” within the landscape of lifestyle habits, without the obesity label itself forcing the structure of the map.

library(FactoMineR) # computing MCA
library(factoextra)


# 1. select active variables, remove Height/Weight/Age
mca_data <- obesity_clean %>%
  select(Gender, AgeGroup, FamilyHistory, 
         HighCaloricFood, VegConsumption, MealsPerDay, SnackBetweenMeals, 
         Smoker, WaterIntake, CalorieMonitoring, PhysicalActivity, 
         TechUseTime, Alcohol, Transport, 
         ObesityLevel) 

# 2. run MCA
res.mca <- MCA(mca_data, 
               quali.sup = 15, # Index of ObesityLevel
               graph = FALSE)

# 3. visualize the variable categories
# active categories are drawn in black, supplementary ObesityLevel categories in red
fviz_mca_var(res.mca, 
             choice = "var.cat",     # plot variable categories
             repel = TRUE,           # avoid text overlapping
             
             # active variables (lifestyle habits)
             col.var = "black", 
             shape.var = 19,      
             
             # supplementary variable (ObesityLevel)
             col.quali.sup = "red", 
             shape.sup = 15,        
             
             title = "The 'Lifestyle Map': MCA of Habits with Obesity Overlay") +
  
  theme_minimal() +
  labs(subtitle = "Red Dots = Obesity Levels (Supplementary)\nBlack Dots = Lifestyle Habits (Active)")

The map shows that weight status is not randomly scattered across lifestyle profiles.

Healthy weight clusters near categories such as age under 20, walking, motorbike or bike transport, and no family history of obesity.
Overweight and Obesity Types I and II cluster near older adults who drive cars and report little physical activity, among other categories.
Severe Obesity (Type III) forms a distinct group strongly linked to public transportation users and women.

Which variables define Dimension 1 and Dimension 2?

# identify top contributing variables


# plot contributions for Dimension 1 (the horizontal axis)
p1 <- fviz_contrib(res.mca, choice = "var", axes = 1, top = 10,
             title = "Key Variables for Dimension 1 (Horizontal)")

# plot contributions for Dimension 2 (the vertical axis)
p2 <- fviz_contrib(res.mca, choice = "var", axes = 2, top = 10,
             title = "Key Variables for Dimension 2 (Vertical)")

# put side by side
library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

These contribution plots reveal the specific “ingredients” that define the axes of our MCA map, showing which categories contribute most to the data’s structure. The horizontal axis (Dimension 1) is dominated by Family History and Age, creating a clear contrast between younger individuals with no family history of obesity and older adults in the 30–39 range. This suggests that the primary distinction in the dataset is between “low-risk” (young, no family history) and “established-risk” profiles. The vertical axis (Dimension 2) is driven largely by Transportation modes, specifically separating Automobile drivers from Public Transportation users. Together, these results suggest that family history, age, and daily mobility habits are the main axes along which lifestyle profiles differ in this population.

4.4 Predictive Modeling: Ordinal Logistic Regression

Theoretical Framework

To quantify how lifestyle factors increase or decrease the likelihood of moving into a higher obesity category, we employ Ordinal Logistic Regression (specifically the Proportional Odds Model).

Unlike standard classification, which treats categories as separate buckets, this model respects the ranking of the seven levels of the target variable (\(Y\)): \[\text{Insufficient} < \text{Normal} < \text{Overweight I} < \text{Overweight II} < \text{Obesity I} < \text{Obesity II} < \text{Obesity III}\]

The Mathematical Model

The model predicts the cumulative probability that an individual’s obesity level is less than or equal to a specific category \(j\):

\[\ln \left( \frac{P(Y \le j)}{1 - P(Y \le j)} \right) = \alpha_j - \beta X\]

Where:

\(P(Y \le j)\): The probability of being in category \(j\) or lower.
\(\alpha_j\): The intercept (cut-point) for category \(j\).
\(\beta\): The coefficient for the predictor variables.
\(X\): The vector of lifestyle predictors.
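To see how the cut-points and a coefficient turn into category probabilities, the sketch below plugs the threshold estimates and the FamilyHistoryyes coefficient from the fitted model output into the formula. The baseline linear predictor `eta_no = 3` is an arbitrary, hypothetical value chosen only for illustration, and `category_probs` and `cum_odds` are hypothetical helper names.

```r
# Threshold (cut-point) estimates alpha_j, copied from the model summary
alpha <- c(2.0765, 3.5550, 4.6690, 5.5531, 6.6049, 7.7558)
levels_y <- c("Insufficient_Weight", "Normal_Weight", "Overweight_Level_I",
              "Overweight_Level_II", "Obesity_Type_I", "Obesity_Type_II",
              "Obesity_Type_III")

beta_family <- 2.17737        # FamilyHistoryyes estimate from the summary
eta_no  <- 3                  # hypothetical linear predictor without family history
eta_yes <- eta_no + beta_family

# P(Y <= j) = plogis(alpha_j - eta); category probabilities are successive differences
category_probs <- function(eta) {
  cum <- c(plogis(alpha - eta), 1)
  setNames(diff(c(0, cum)), levels_y)
}

p_no  <- category_probs(eta_no)
p_yes <- category_probs(eta_yes)
round(rbind(no_history = p_no, family_history = p_yes), 3)

# Proportional odds: the ratio of cumulative odds is the same at every cut-point
cum_odds   <- function(eta) { p <- plogis(alpha - eta); p / (1 - p) }
odds_ratio <- cum_odds(eta_no) / cum_odds(eta_yes)
odds_ratio  # constant, equal to exp(beta_family)
```

The constant ratio in the last line is the proportional odds assumption in action: one coefficient shifts every cumulative logit by the same amount.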

Interpretation (Odds Ratios)

We interpret the results using Odds Ratios (OR):

OR > 1: The factor is associated with higher odds of being in a heavier weight category (Risk Factor).
OR < 1: The factor is associated with lower odds of being in a heavier weight category (Protective Factor).
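Odds ratios are obtained by exponentiating the coefficients. The sketch below does this for four estimates and standard errors copied from the model summary and adds approximate 95% Wald intervals; the object names are illustrative.

```r
# Estimates and standard errors copied from the model summary
est <- c(FamilyHistoryyes               =  2.17737,
         PhysicalActivity2              = -0.68279,
         PhysicalActivity3              = -0.83960,
         TransportPublic_Transportation =  0.97554)
se  <- c(0.12661, 0.11287, 0.18000, 0.12922)

z <- qnorm(0.975)  # about 1.96 for a 95% interval

# Exponentiate the estimate and the Wald interval limits
or_table <- data.frame(
  OR    = exp(est),
  lower = exp(est - z * se),
  upper = exp(est + z * se)
)

round(or_table, 2)
```

With the fitted object available, the same numbers would typically come from applying `exp()` to `coef(clm_model)`, which also returns the threshold coefficients, so those rows should be ignored when reading odds ratios.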

Model Implementation

We fit the model using the clm function from the ordinal package. We exclude Height and Weight to avoid data leakage, as BMI, which defines the obesity categories, is computed directly from them.

library(ordinal)   
library(gtsummary)

# 1. fit the model
clm_model <- clm(ObesityLevel ~ FamilyHistory + AgeGroup + Gender + 
                 Transport + PhysicalActivity + VegConsumption + 
                 HighCaloricFood + SnackBetweenMeals,
                 data = obesity_clean)

summary(clm_model)

## formula: 
## ObesityLevel ~ FamilyHistory + AgeGroup + Gender + Transport + PhysicalActivity + VegConsumption + HighCaloricFood + SnackBetweenMeals
## data:    obesity_clean
## 
##  link  threshold nobs logLik   AIC     niter max.grad cond.H 
##  logit flexible  2111 -3346.14 6740.28 5(0)  6.11e-09 1.6e+03
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## FamilyHistoryyes                2.17737    0.12661  17.197  < 2e-16 ***
## AgeGroup20-29                   1.56693    0.10538  14.869  < 2e-16 ***
## AgeGroup30-39                   1.78080    0.16166  11.016  < 2e-16 ***
## AgeGroup40+                     2.33485    0.26869   8.690  < 2e-16 ***
## GenderMale                     -0.09681    0.08719  -1.110  0.26684    
## TransportBike                   0.42148    0.64837   0.650  0.51565    
## TransportMotorbike              1.46547    0.63139   2.321  0.02029 *  
## TransportPublic_Transportation  0.97554    0.12922   7.550 4.36e-14 ***
## TransportWalking                0.25181    0.28233   0.892  0.37244    
## PhysicalActivity1              -0.27212    0.09820  -2.771  0.00559 ** 
## PhysicalActivity2              -0.68279    0.11287  -6.049 1.46e-09 ***
## PhysicalActivity3              -0.83960    0.18000  -4.664 3.09e-06 ***
## VegConsumption2                -0.03726    0.19177  -0.194  0.84594    
## VegConsumption3                 1.02361    0.19494   5.251 1.51e-07 ***
## HighCaloricFoodyes              0.62442    0.13345   4.679 2.88e-06 ***
## SnackBetweenMeals.L            -1.14322    0.23996  -4.764 1.90e-06 ***
## SnackBetweenMeals.Q             0.55210    0.19340   2.855  0.00431 ** 
## SnackBetweenMeals.C             1.81722    0.13449  13.511  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Threshold coefficients:
##                                        Estimate Std. Error z value
## Insufficient_Weight|Normal_Weight        2.0765     0.2752   7.546
## Normal_Weight|Overweight_Level_I         3.5550     0.2797  12.712
## Overweight_Level_I|Overweight_Level_II   4.6690     0.2879  16.217
## Overweight_Level_II|Obesity_Type_I       5.5531     0.2947  18.846
## Obesity_Type_I|Obesity_Type_II           6.6049     0.3029  21.804
## Obesity_Type_II|Obesity_Type_III         7.7558     0.3123  24.834

Family History is the Dominant Factor: FamilyHistory has the largest z-statistic in the model (z = 17.2). With a coefficient of 2.18 (odds ratio \(e^{2.18} \approx 8.8\)), individuals with a family history of obesity have roughly 8.8 times the odds of being in a higher obesity category than those without, holding all other variables constant.
The Protective Power of Exercise: Physical activity shows a clear “dose-response” relationship. Compared to sedentary individuals (the reference level):
- Moderate activity (PhysicalActivity2) lowers the log-odds of a higher obesity category by 0.68 (OR ≈ 0.51).
- High activity (PhysicalActivity3) lowers the log-odds by 0.84 (OR ≈ 0.43). This pattern is consistent with more frequent physical activity acting as a protective factor.
The Public Transport Anomaly: Consistent with our MCA findings, Public_Transportation is a significant risk factor (Estimate 0.98, OR ≈ 2.65, p<0.001) compared to the reference group (Automobile). This reinforces the cluster found in the visualization where public transit usage correlated with Obesity Type III.
Age Progression: Risk increases monotonically with age. Compared to the “Under 20” group, the coefficients rise from 1.57 (20-29) to 1.78 (30-39) to 2.33 (40+). The steps are uneven, with most of the increase occurring between the under-20 and 20-29 groups.
A Note on Vegetable Consumption: Unexpectedly, frequent vegetable consumption (VegConsumption3) appears as a positive predictor for obesity (Estimate 1.02, OR ≈ 2.78), while the intermediate level (VegConsumption2) does not differ significantly from the reference. In cross-sectional studies, such a pattern is often attributed to “reverse causality”: individuals with higher obesity levels may report eating more vegetables because they are currently attempting to lose weight.

5. Discussion and Conclusion

Synthesis of Findings

This study utilized a categorical data analysis framework to deconstruct the relationship between lifestyle habits and obesity levels. By integrating association metrics (Cramér’s V), geometric mapping (MCA), and probabilistic modeling (Ordinal Regression), we arrived at four major conclusions.

5.1 The Dominance of Biological Factors

Family History emerged as one of the most consistent correlates of obesity status across the analyses. The heatmap identified it as a strong correlate (\(V = 0.54\), second only to Gender), and the ordinal regression model quantified the association: individuals with a family history have about 8.8 times the odds (OR = 8.8) of being in a higher weight category. While “lifestyle” is often the focus of public health, this data suggests that genetic or shared environmental factors (the “home environment”) play a primary role here.

5.2 The “Lifecycle of Obesity” (Age & Transport)

The MCA visualized a clear lifestyle trajectory. The “Active Youth” cluster shows healthy weight strongly associated with individuals under 20 who use Walking or Biking. As individuals enter the 30–39 age bracket, the map shows a “Sedentary Transition” toward Automobile usage, correlating strongly with Overweight Level I and II. The regression supports the age part of this trajectory, with every older age group showing significantly higher odds than the under-20 group. It does not confirm the transport part: relative to Automobile (the reference level), the Walking coefficient is small and not statistically significant (0.25, p = 0.37).

5.3 The Public Transport Anomaly (Socioeconomic Indicators)

A unique finding was the distinct clustering of Obesity Type III (Severe Obesity) with Public Transportation usage and Female gender. Unlike moderate obesity, which is linked to car ownership (often a proxy for middle-class sedentary behavior), severe obesity appears linked to public transit. In this Latin American context, this may serve as a proxy for lower socioeconomic status, often correlated with limited access to healthy food options.

5.4 Behavioral Paradoxes (Reverse Causality)

Our model indicated that frequent Vegetable Consumption was associated with higher obesity levels (OR > 1). This is a likely example of reverse causality in cross-sectional data; it is probable that individuals with higher BMI are eating more vegetables because they are currently trying to lose weight, rather than vegetables causing weight gain.

Recommendations

Based on these findings, effective interventions should move beyond generic “eat less, move more” advice:

Family-Based Interventions: Since Family History is the strongest predictor (OR 8.8), screening should target entire households rather than individuals.
Active Commuting: Urban planning policies promoting walkability and cycling could help counter the shift toward automobile use that the MCA associates with the 30–39 age group.
Targeted Support: Health campaigns could specifically target public transportation hubs, as public transport users cluster with Obesity Type III in the MCA and show elevated odds of higher obesity categories in the regression.

Final Thoughts

While machine learning models can predict obesity with high accuracy, this categorical analysis reveals the structure of the problem. Obesity in this population is not driven by a single bad habit, but by a combination of hereditary risk and a transition to sedentary, motorized lifestyles in adulthood.

6. References

Palechor, F. M., & De la Hoz Manotas, A. (2019). Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico. Data in Brief, 25, 104344.
Quiroz, J. C., et al. (2022). Estimation of obesity levels based on dietary habits and physical condition using computational intelligence. Informatics in Medicine Unlocked, 29, 100954.
Görmez, A., et al. (2025). Prediction of obesity levels based on physical activity and eating habits with explainable AI. Frontiers in Physiology.

Exploring Lifestyle Factors and Obesity Categories through Categorical Data Analysis

Mimi Tran

12/7/2025
