Overall Goal

In this project, my goal is to determine which lifestyle habits affect the probability of obesity. Because I am very active and consider myself healthy, I’m interested in understanding what factors contribute to people becoming or remaining obese. Diving into this dataset allowed me to explore that question and narrow my own knowledge gap about the behaviors most strongly linked to obesity.

Loading Libraries

library(ggplot2)
library(dplyr)
library(plotly)
library(readxl)

Introduction and Understanding Variables

This project investigates lifestyle and behavioral factors that may contribute to obesity. The dataset includes various features such as dietary habits, physical activity, and technology use, and aims to identify patterns linked to different obesity levels.

Although we are not creating predictive models for all the factors in this dataset, below is a description of each variable:

Variables

Importing the Data

data <- read_excel("/Users/nickcalip/Desktop/R-Programming/Project1-DAT301/Obesity_Dataset.xlsx")

head(data)

Data Cleaning and Preprocessing

To prepare the dataset for analysis, I identified the columns that represent categorical variables and converted them to factor type. This ensures that R treats them as discrete categories rather than numeric or character values. I also trimmed any extra whitespace to avoid inconsistencies in the data.

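# Coded categorical columns; Number_of_Main_Meals_Daily and Liquid_Intake_Daily
# are included because their levels are relabeled later and must be factors too
cols_to_factor <- c(
  "Sex", "Overweight_Obese_Family", "Consumption_of_Fast_Food",
  "Frequency_of_Consuming_Vegetables", "Number_of_Main_Meals_Daily",
  "Food_Intake_Between_Meals", "Smoking", "Liquid_Intake_Daily",
  "Calculation_of_Calorie_Intake", "Physical_Excercise",
  "Schedule_Dedicated_to_Technology", "Type_of_Transportation_Used",
  "Class"
)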
data[cols_to_factor] <- lapply(data[cols_to_factor], function(x) as.factor(trimws(x)))

Relabeling Categorical Variables

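To improve interpretability, I replaced the numeric codes with descriptive factor labels. Note that assigning to levels() relabels by position, in the sorted order of the existing codes, so each label vector has to follow the order given in the dataset’s codebook.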

levels(data$Sex) <- c("Male", "Female")
levels(data$Overweight_Obese_Family) <- c("No", "Yes")
levels(data$Consumption_of_Fast_Food) <- c("No", "Yes")
levels(data$Frequency_of_Consuming_Vegetables) <- c("Rarely", "Sometimes", "Always")
levels(data$Number_of_Main_Meals_Daily) <- c("1-2 meals", "3 meals", "More than 3")
levels(data$Food_Intake_Between_Meals) <- c("Rarely", "Sometimes", "Usually", "Always")
levels(data$Smoking) <- c("Yes", "No")
levels(data$Liquid_Intake_Daily) <- c("<1L", "1-2L", ">2L")
levels(data$Calculation_of_Calorie_Intake) <- c("Yes", "No")
levels(data$Physical_Excercise) <- c("None", "1-2 days", "3-4 days", "5-6 days", "6+ days")
levels(data$Schedule_Dedicated_to_Technology) <- c("0-2 hours", "3-5 hours", "5+ hours")
levels(data$Type_of_Transportation_Used) <- c("Automobile", "Motorbike", "Bike",
                                              "Public Transport", "Walking")
levels(data$Class) <- c("Underweight", "Normal", "Overweight", "Obese")
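Because positional relabeling silently depends on the order of the codes, it can help to check it on a small example. The sketch below uses a made-up vector (raw_codes is hypothetical, and the mapping of "1" to "Male" is assumed purely for illustration, not taken from the codebook). It shows that assigning to levels() keeps the counts and only renames the categories, and that factor() with explicit levels and labels gives the same result while stating the mapping in one place.

```r
# Toy codes standing in for a raw column (hypothetical values, not the dataset)
raw_codes <- c("2", "1", " 1", "2 ", "1")

# Same cleaning step as in the report: trim whitespace, then convert to factor
f <- as.factor(trimws(raw_codes))
levels(f)                               # levels are the sorted unique codes
counts_before <- as.integer(table(f))

# Positional relabeling: the first label replaces the first sorted code, and so on
levels(f) <- c("Male", "Female")        # assumed mapping, for illustration only
counts_after <- as.integer(table(f))

# Explicit alternative: state the code-to-label mapping in a single call
g <- factor(trimws(raw_codes), levels = c("1", "2"), labels = c("Male", "Female"))

# Both approaches should give the same labels for every observation
identical(as.character(f), as.character(g))
```

Comparing table() output before and after relabeling is a quick way to confirm that only the names changed, but it cannot tell you whether the names are in the right order. That still has to be checked against the codebook.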

Exploratory Data Analysis (EDA)

In this section, I explore the distribution of key variables in the dataset and examine how they relate to obesity levels. Let’s focus on visualizing age, physical activity, and other lifestyle factors using histograms and bar plots.

ggplot(data, aes(x = Class, fill = Class)) +
  geom_bar() +
  labs(title = "Obesity Level Distribution", x = "Obesity Class", y = "Count") +
  theme_minimal()

This plot shows how individuals are distributed across the four obesity categories.

ggplot(data, aes(x = Age, fill = Class)) +
  geom_histogram(bins = 20, position = "dodge", alpha = 0.7) +
  labs(title = "Age Distribution by Obesity Class", x = "Age", y = "Count") +
  theme_minimal()

The histogram above shows the age distribution within each obesity class. The next bar plot shows how the mix of obesity categories varies across levels of physical exercise.

ggplot(data, aes(x = Physical_Excercise, fill = Class)) +
  geom_bar(position = "fill") +
  labs(title = "Physical Activity by Obesity Level", x = "Exercise Level", y = "Proportion") +
  theme_minimal()

Building Predictive Models Based on Three Important Factors

Next, I want to build three separate predictive models, each relating one variable to obesity. My goal is to see whether these three variables play a key role in predicting the probability of being obese for someone whose habits match those of the people in this sample. I will start with Physical Exercise vs Obesity.

data$Is_Obese <- ifelse(data$Class == "Obese", 1, 0)

model_exercise <- glm(Is_Obese ~ Physical_Excercise, data = data, family = "binomial")

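# Build the prediction grid as a factor so the plot keeps the level order (not alphabetical)
levels_exercise <- levels(data$Physical_Excercise)
predict_data <- data.frame(Physical_Excercise = factor(levels_exercise, levels = levels_exercise))
predict_data$predicted_prob <- predict(model_exercise, newdata = predict_data, type = "response")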

ggplot(predict_data, aes(x = Physical_Excercise, y = predicted_prob)) +
  geom_col(fill = "darkred") +
  labs(
    title = "Predicted Probability of Obesity by Physical Activity",
    x = "Exercise Level",
    y = "Probability of Being Obese"
  ) +
  theme_minimal()

Questioning the Model

The bar plot from this model seemed odd to me. It suggests that people who exercise 6+ days per week have the highest probability of being obese. So I went back to the raw data to see whether it showed the same pattern.

table(data$Physical_Excercise, data$Class)
##           
##            Underweight Normal Overweight Obese
##   None              53    113         31     9
##   1-2 days          14    145         93    38
##   3-4 days           4    187        127    52
##   5-6 days           2    116        172    68
##   6+ days            0     97        169   120

The raw counts do not contradict the model; they agree with it. In this sample, the share of obese individuals rises with exercise frequency, from 9 of 206 people (about 4%) in the None group to 120 of 386 (about 31%) in the 6+ days group. One plausible explanation is reverse causality: people who are already obese may be more motivated to exercise frequently. This sample alone cannot confirm that explanation, which highlights the importance of interpreting the model’s output carefully to see what the real message is.
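The agreement between the model and the table is not a coincidence. A logistic regression with a single categorical predictor has one parameter per group, so its fitted probabilities reproduce the observed group proportions. The sketch below illustrates this using the counts copied from the cross-tabulation above rather than the original spreadsheet. The names counts, grouped, and fit are introduced here for illustration only, and the model is fitted on grouped counts, which yields the same estimates as fitting on the individual rows.

```r
# Counts copied from the cross-tabulation above (rows: exercise level)
counts <- matrix(
  c(53, 113,  31,   9,
    14, 145,  93,  38,
     4, 187, 127,  52,
     2, 116, 172,  68,
     0,  97, 169, 120),
  ncol = 4, byrow = TRUE,
  dimnames = list(
    c("None", "1-2 days", "3-4 days", "5-6 days", "6+ days"),
    c("Underweight", "Normal", "Overweight", "Obese")
  )
)

# One row per exercise level: obese vs. not-obese counts
grouped <- data.frame(
  exercise  = factor(rownames(counts), levels = rownames(counts)),
  obese     = unname(counts[, "Obese"]),
  not_obese = unname(rowSums(counts) - counts[, "Obese"])
)

# Binomial GLM on grouped counts (successes, failures)
fit <- glm(cbind(obese, not_obese) ~ exercise, data = grouped, family = binomial)

# Observed share of obese people per group vs. the model's fitted probability
observed  <- grouped$obese / (grouped$obese + grouped$not_obese)
predicted <- unname(predict(fit, newdata = grouped, type = "response"))

round(cbind(observed, predicted), 4)
```

Because the two columns match, the model adds no information beyond the row proportions here. It is a convenient way to summarize them, not independent evidence for the pattern.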

Fast Food Consumption vs Obesity

Let’s take this model in reverse order and look at the raw data first.

table(data$Consumption_of_Fast_Food, data$Class)
##      
##       Underweight Normal Overweight Obese
##   No            8     65        200   163
##   Yes          65    593        392   124
# Proportion of each fast-food group (row) that is classified as obese
prop.table(table(data$Consumption_of_Fast_Food, data$Class), margin = 1)[, "Obese"]
##        No       Yes 
## 0.3738532 0.1056218

At first glance, the table shows more obese people among those who don’t consume fast food (163) than among those who do (124), and the vast majority of normal-weight people (593 of 658) do consume fast food. Again, this could reflect reverse causality: obese participants in this sample may have been trying to lose weight and therefore eating less fast food.
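The direction of the proportion matters when reading these numbers. With margin = 1, prop.table() divides by row totals, giving the share of each fast-food group that is obese. With margin = 2, it divides by column totals, giving the share of obese people who fall in each fast-food group. The sketch below rebuilds the table from the counts printed above (tab, by_group, and by_class are names introduced here for illustration) and computes both.

```r
# Counts copied from the fast-food cross-tabulation above
tab <- as.table(matrix(
  c( 8,  65, 200, 163,
    65, 593, 392, 124),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    Fast_Food = c("No", "Yes"),
    Class     = c("Underweight", "Normal", "Overweight", "Obese")
  )
))

# margin = 1: rows sum to 1 -> share of each fast-food group that is obese
by_group <- prop.table(tab, margin = 1)[, "Obese"]

# margin = 2: columns sum to 1 -> share of obese people in each fast-food group
by_class <- prop.table(tab, margin = 2)[, "Obese"]

round(by_group, 3)
round(by_class, 3)
```

The first vector answers "how likely is someone in this group to be obese?", which is the quantity the logistic model estimates. The second answers "where do the obese participants sit?", which depends on how large each group is.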

data$Consumption_of_Fast_Food <- relevel(data$Consumption_of_Fast_Food, ref = "No")

model_fastfood <- glm(Is_Obese ~ Consumption_of_Fast_Food, data = data, family = "binomial")

predict_df <- data.frame(Consumption_of_Fast_Food = levels(data$Consumption_of_Fast_Food))
predict_df$predicted_prob <- predict(model_fastfood, newdata = predict_df, type = "response")

ggplot(predict_df, aes(x = Consumption_of_Fast_Food, y = predicted_prob)) +
  geom_col(fill = "orange") +
  labs(
    title = "Predicted Probability of Being Obese by Fast Food Consumption",
    x = "Fast Food Consumption",
    y = "Probability of Obesity"
  ) +
  theme_minimal()

Interpreting the Fast Food Model

The logistic regression model predicts that individuals who report not consuming fast food have a higher probability of being obese than those who do. While this result may appear counterintuitive, it matches the raw data and may reflect reverse causality: individuals who are already obese may be more health-conscious and intentionally avoid fast food to manage their weight.

Final Predictive Model

Before coming to a conclusion, let’s look at one final variable to help us understand what affects the chances of becoming obese. We will now test the frequency of consuming vegetables against obesity to see what role it plays, starting with the raw data grouped by category.

table(data$Frequency_of_Consuming_Vegetables, data$Class)
##            
##             Underweight Normal Overweight Obese
##   Rarely              6     47        163   184
##   Sometimes          26    252        327   103
##   Always             41    359        102     0
# Proportion of each vegetable-consumption group (row) that is classified as obese
prop.table(table(data$Frequency_of_Consuming_Vegetables, data$Class), margin = 1)[, "Obese"]
##    Rarely Sometimes    Always 
## 0.4600000 0.1454802 0.0000000

The raw data show the expected pattern: 46% of the people who rarely eat vegetables are obese, while none of the 502 people who always eat vegetables are. Let’s see what the predictive model has to say.
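The code for the vegetable model is not shown in this report. A minimal sketch that follows the same pattern as the two earlier models, fitted on the counts from the table above, might look like the following. The names veg_levels, grouped_veg, and model_veg are hypothetical and not the original code. Because the Always group contains no obese cases, the coefficient for that level has no finite maximum-likelihood estimate, so glm() may warn about fitted probabilities numerically equal to 0 or 1. The predicted probability for that group is still effectively zero.

```r
# Counts copied from the vegetable cross-tabulation above
veg_levels <- c("Rarely", "Sometimes", "Always")
grouped_veg <- data.frame(
  vegetables = factor(veg_levels, levels = veg_levels),
  obese      = c(184, 103, 0),
  not_obese  = c(6 + 47 + 163, 26 + 252 + 327, 41 + 359 + 102)
)

# The empty "Always"/obese cell can trigger a separation-related warning;
# it is suppressed here only to keep the sketch's output clean
model_veg <- suppressWarnings(
  glm(cbind(obese, not_obese) ~ vegetables, data = grouped_veg, family = binomial)
)

# Predicted probability of obesity for each consumption level
grouped_veg$predicted_prob <- unname(
  predict(model_veg, newdata = grouped_veg, type = "response")
)

grouped_veg[, c("vegetables", "predicted_prob")]
```

The coefficient and standard error reported for the Always level should not be interpreted in a fit like this. Only the fitted probabilities are meaningful.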

The model produces clearer results this time around. Individuals who rarely eat vegetables are predicted to have the highest probability of obesity, while those who always eat vegetables have the lowest. In fact, the dataset contained no obese individuals among those who reported always consuming vegetables. This strong and consistent pattern is in line with vegetable intake playing an important role in managing weight and preventing obesity.

Conclusion

This project set out to explore which lifestyle habits are most associated with obesity using a dataset of behavioral and health-related variables. By focusing on three key factors — physical activity, fast food consumption, and vegetable intake — I built simple predictive models to examine how each relates to the probability of being obese.

The results revealed some unexpected patterns. For both physical activity and fast food consumption, the models suggested that more exercise or less fast food was associated with a higher probability of obesity. The raw data showed the same patterns, which likely reflect reverse causality: individuals who are already obese may be more likely to exercise more or avoid fast food in an effort to improve their health.

In contrast, the relationship between vegetable intake and obesity was much more intuitive and consistent. Individuals who reported always eating vegetables had no observed cases of obesity in the dataset, while those who rarely consumed vegetables had the highest obesity rate. This supports the widely accepted notion that consistent vegetable consumption is a key factor in maintaining a healthy weight.

Overall, this project highlighted not only the importance of specific habits in relation to obesity, but also the value of combining statistical modeling with raw data checks to ensure meaningful interpretation. One thing I learned during this project is that not all datasets are equal. Another dataset with different factors and a similar number of participants could show the opposite of what my models showed simply because it drew on a different sample of people.

Citations

Koklu, N., & Sulak, S. A. (2024). Using artificial intelligence techniques for the analysis of obesity status according to the individuals’ social and physical activities. Sinop Üniversitesi Fen Bilimleri Dergisi, 9(1), 217-239. https://doi.org/10.33484/sinopfbd.1445215

https://www.kaggle.com/datasets/suleymansulak/obesity-dataset