In this project, my goal is to determine which lifestyle habits affect the probability of obesity. I am very active and consider myself a healthy individual, so I’m interested in understanding what factors contribute to people becoming or remaining obese. Diving into this dataset allowed me to explore that question and narrow my own knowledge gap about the behaviors most strongly linked to obesity.
library(ggplot2)
library(dplyr)
library(plotly)
library(readxl)
This project investigates lifestyle and behavioral factors that may contribute to obesity. The dataset includes features such as dietary habits, physical activity, and technology use, and the analysis aims to identify patterns linked to different obesity levels.
Although we are not creating predictive models for all the factors in this dataset, below is a description of each variable:
Sex: Gender of the respondent
Age: Age in years (integer)
Height: Height in centimeters (integer)
Overweight/Obese Families: Whether the respondent has family members who are overweight or obese
Consumption of Fast Food: Whether the respondent regularly consumes fast food
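Frequency of Consuming Vegetables: How often the respondent eats vegetables (rarely, sometimes, or always)
Number of Main Meals Daily: Number of main meals eaten per day (1-2, 3, or more than 3)
Food Intake Between Meals: How often the respondent eats between meals (rarely, sometimes, usually, or always)
Smoking: Whether the respondent smokes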
Liquid Intake Daily: Amount of water consumed daily
Calculation of Calorie Intake: Whether the respondent actively tracks calories
Physical Exercise: Days per week of physical activity
Schedule Dedicated to Technology: Screen time per day
Type of Transportation Used: Usual mode of transportation (automobile, motorbike, bike, public transport, or walking)
Class (Target Variable): Obesity classification based on BMI and habits
data <- read_excel("/Users/nickcalip/Desktop/R-Programming/Project1-DAT301/Obesity_Dataset.xlsx")
head(data)
To prepare the dataset for analysis, I identified the columns that represent categorical variables and converted them to factors. This ensures that R treats them as discrete categories rather than as numeric or character values. I also trimmed any extra whitespace so that stray spaces do not create duplicate levels.
# Every column that is relabelled with levels() below must be a factor first
cols_to_factor <- c(
"Sex", "Overweight_Obese_Family", "Consumption_of_Fast_Food",
"Frequency_of_Consuming_Vegetables", "Number_of_Main_Meals_Daily",
"Food_Intake_Between_Meals", "Smoking", "Liquid_Intake_Daily",
"Calculation_of_Calorie_Intake", "Physical_Excercise",
"Schedule_Dedicated_to_Technology", "Type_of_Transportation_Used",
"Class"
)
data[cols_to_factor] <- lapply(data[cols_to_factor], function(x) as.factor(trimws(x)))
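To improve interpretability, I replaced the numeric codes with descriptive factor labels. Because levels() assigns labels by position, each vector below must list its labels in the sorted order of the original codes.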
levels(data$Sex) <- c("Male", "Female")
levels(data$Overweight_Obese_Family) <- c("No", "Yes")
levels(data$Consumption_of_Fast_Food) <- c("No", "Yes")
levels(data$Frequency_of_Consuming_Vegetables) <- c("Rarely", "Sometimes", "Always")
levels(data$Number_of_Main_Meals_Daily) <- c("1-2 meals", "3 meals", "More than 3")
levels(data$Food_Intake_Between_Meals) <- c("Rarely", "Sometimes", "Usually", "Always")
levels(data$Smoking) <- c("Yes", "No")
levels(data$Liquid_Intake_Daily) <- c("<1L", "1-2L", ">2L")
levels(data$Calculation_of_Calorie_Intake) <- c("Yes", "No")
levels(data$Physical_Excercise) <- c("None", "1-2 days", "3-4 days", "5-6 days", "6+ days")
levels(data$Schedule_Dedicated_to_Technology) <- c("0-2 hours", "3-5 hours", "5+ hours")
levels(data$Type_of_Transportation_Used) <- c("Automobile", "Motorbike", "Bike",
"Public Transport", "Walking")
levels(data$Class) <- c("Underweight", "Normal", "Overweight", "Obese")
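Since positional relabelling can silently swap labels, it may help to pair each code with its label explicitly. The sketch below does this with factor() on a toy vector. The name `smoking_raw` and the codes "1" and "2" are assumptions for illustration; the mapping of 1 to "Yes" follows the order used above and should be checked against the dataset's documentation.

```r
# Hypothetical raw codes, as they might be read from the spreadsheet
smoking_raw <- c("2", "1", "2", "2", "1")

# Pair each code with its label explicitly instead of relying on position
smoking <- factor(smoking_raw, levels = c("1", "2"), labels = c("Yes", "No"))

# Cross-tabulate codes against labels to confirm the mapping
print(table(code = smoking_raw, label = smoking))
```

Every count in the cross-tabulation should fall on the intended code and label pair; anything else indicates a mislabelled level.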
In this section, I explore the distribution of key variables in the dataset and examine how they relate to obesity levels. Let’s focus on visualizing age, physical activity, and other lifestyle factors using histograms and bar plots.
ggplot(data, aes(x = Class, fill = Class)) +
geom_bar() +
labs(title = "Obesity Level Distribution", x = "Obesity Class", y = "Count") +
theme_minimal()
This plot shows how individuals are distributed across the four obesity categories.
ggplot(data, aes(x = Age, fill = Class)) +
geom_histogram(bins = 20, position = "dodge", alpha = 0.7) +
labs(title = "Age Distribution by Obesity Class", x = "Age", y = "Count") +
theme_minimal()
This bar plot shows the share of each obesity class within each level of physical exercise.
ggplot(data, aes(x = Physical_Excercise, fill = Class)) +
geom_bar(position = "fill") +
labs(title = "Physical Activity by Obesity Level", x = "Exercise Level", y = "Proportion") +
theme_minimal()
Next, I build three separate logistic regression models, each relating a single lifestyle variable to obesity. My goal is to see how strongly each variable is associated with the probability of being obese among the people in this sample. I start with physical exercise.
data$Is_Obese <- ifelse(data$Class == "Obese", 1, 0)
model_exercise <- glm(Is_Obese ~ Physical_Excercise, data = data, family = "binomial")
levels_exercise <- levels(data$Physical_Excercise)
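# Keep the factor order so the plot does not sort the levels alphabetically
predict_data <- data.frame(Physical_Excercise = factor(levels_exercise, levels = levels_exercise))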
predict_data$predicted_prob <- predict(model_exercise, newdata = predict_data, type = "response")
ggplot(predict_data, aes(x = Physical_Excercise, y = predicted_prob)) +
geom_col(fill = "darkred") +
labs(
title = "Predicted Probability of Obesity by Physical Activity",
x = "Exercise Level",
y = "Probability of Being Obese"
) +
theme_minimal()
This bar plot seemed odd to me. It suggests that the probability of being obese is highest for people who exercise 6+ days per week. So I went back to the raw data to check whether the counts tell the same story.
table(data$Physical_Excercise, data$Class)
##
## Underweight Normal Overweight Obese
## None 53 113 31 9
## 1-2 days 14 145 93 38
## 3-4 days 4 187 127 52
## 5-6 days 2 116 172 68
## 6+ days 0 97 169 120
The raw counts do not contradict the model. Only 9 of the 206 respondents who report no exercise are obese (about 4%), compared with 120 of the 386 who exercise 6+ days per week (about 31%), although even in that group overweight respondents (169) outnumber obese ones. A plausible explanation is reverse causality: people who are already obese may be more motivated to exercise frequently. This highlights the importance of interpreting a model’s output carefully to see what the real message is.
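Because the exercise groups differ in size, row proportions make the comparison easier than raw counts. The following sketch re-enters the printed table by hand (the matrix `counts` is typed in from the output above, not read from the file) and computes the share of obese respondents in each exercise group.

```r
# Counts copied from the printed table: rows are exercise levels, columns are classes
counts <- matrix(
  c(53, 113,  31,   9,
    14, 145,  93,  38,
     4, 187, 127,  52,
     2, 116, 172,  68,
     0,  97, 169, 120),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    Exercise = c("None", "1-2 days", "3-4 days", "5-6 days", "6+ days"),
    Class = c("Underweight", "Normal", "Overweight", "Obese")
  )
)

# margin = 1 divides each row by its total, giving shares within each exercise group
obese_share <- prop.table(counts, margin = 1)[, "Obese"]
print(round(obese_share, 3))
```

In these counts the obese share rises at every step, from about 4% with no exercise to about 31% at 6+ days per week.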
For fast food consumption, let’s reverse the order and look at the raw data first.
table(data$Consumption_of_Fast_Food, data$Class)
##
## Underweight Normal Overweight Obese
## No 8 65 200 163
## Yes 65 593 392 124
# Proportion of respondents in each fast food group who are obese
prop.table(table(data$Consumption_of_Fast_Food, data$Class), margin = 1)[, "Obese"]
## No Yes
## 0.3738532 0.1056218
At first glance, this shows that a larger share of respondents who don’t consume fast food are obese (about 37%) than of those who do (about 11%), and that the vast majority of normal-weight respondents (593 of 658) actually do consume fast food. Again, this could reflect reverse causality: obese respondents in this sample may have been trying to lose weight and therefore eating less fast food, although the dataset cannot confirm this.
data$Consumption_of_Fast_Food <- relevel(data$Consumption_of_Fast_Food, ref = "No")
model_fastfood <- glm(Is_Obese ~ Consumption_of_Fast_Food, data = data, family = "binomial")
predict_df <- data.frame(Consumption_of_Fast_Food = levels(data$Consumption_of_Fast_Food))
predict_df$predicted_prob <- predict(model_fastfood, newdata = predict_df, type = "response")
ggplot(predict_df, aes(x = Consumption_of_Fast_Food, y = predicted_prob)) +
geom_col(fill = "orange") +
labs(
title = "Predicted Probability of Being Obese by Fast Food Consumption",
x = "Fast Food Consumption",
y = "Probability of Obesity"
) +
theme_minimal()
## Interpreting the Fast Food Model
The logistic regression model predicts that individuals who report not consuming fast food have a higher probability of being obese than those who do. While this result may appear contradictory, it aligns with the raw data and likely reflects reverse causality: individuals who are already obese may be more health-conscious and intentionally avoid fast food to manage their weight.
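The agreement between model and raw data is expected. A logistic regression with a single categorical predictor has one parameter per group, so its fitted probabilities equal the observed group proportions. The sketch below illustrates this on a data frame rebuilt by hand from the printed counts (`toy` and its column names are illustrative, not the original file).

```r
# One row per respondent, rebuilt from the printed counts:
# "No": 163 obese of 436; "Yes": 124 obese of 1174
toy <- data.frame(
  fast_food = factor(rep(c("No", "Yes"), times = c(436, 1174))),
  is_obese = c(rep(c(1, 0), times = c(163, 436 - 163)),
               rep(c(1, 0), times = c(124, 1174 - 124)))
)

fit <- glm(is_obese ~ fast_food, data = toy, family = binomial)

# Predicted probability for each group versus the observed proportion
new_groups <- data.frame(fast_food = factor(c("No", "Yes")))
predicted <- predict(fit, newdata = new_groups, type = "response")
observed <- tapply(toy$is_obese, toy$fast_food, mean)
print(cbind(observed, predicted))
```

The two columns match, so a single-predictor model like this one summarizes the table rather than adding information to it.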
Before coming to a conclusion, let’s look at one final variable to help us understand what affects the chances of becoming obese. We will now test the frequency of consuming vegetables against obesity to see what role it plays, starting with the raw data grouped by vegetable consumption.
table(data$Frequency_of_Consuming_Vegetables, data$Class)
##
## Underweight Normal Overweight Obese
## Rarely 6 47 163 184
## Sometimes 26 252 327 103
## Always 41 359 102 0
# Proportion of respondents in each vegetable consumption group who are obese
prop.table(table(data$Frequency_of_Consuming_Vegetables, data$Class), margin = 1)[, "Obese"]
## Rarely Sometimes Always
## 0.4600000 0.1454802 0.0000000
Looking at the raw data, the outcome is as expected: in this sample, none of the respondents who always consume vegetables are obese. Let’s see what the predictive model has to say.
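The code for this model is not shown in this write-up. A minimal sketch that mirrors the two earlier models is given below, assuming the same Is_Obese ~ predictor setup; it runs on a data frame rebuilt by hand from the printed counts (`toy_veg`, `model_veg`, and `predict_veg` are illustrative names). Because the "Always" group contains no obese respondents, the coefficient for that level is pushed toward negative infinity (quasi-complete separation). R may warn about fitted probabilities of 0 or 1, and that coefficient's standard error should not be interpreted, but the predicted probabilities remain usable.

```r
veg_levels <- c("Rarely", "Sometimes", "Always")
group_n <- c(400, 708, 502)    # row totals of the printed table
group_obese <- c(184, 103, 0)  # "Obese" column of the printed table

# One row per respondent, rebuilt from the counts (not the original file)
toy_veg <- data.frame(
  Frequency_of_Consuming_Vegetables = factor(rep(veg_levels, times = group_n),
                                             levels = veg_levels),
  Is_Obese = unlist(Map(function(n, k) rep(c(1, 0), times = c(k, n - k)),
                        group_n, group_obese))
)

model_veg <- glm(Is_Obese ~ Frequency_of_Consuming_Vegetables,
                 data = toy_veg, family = "binomial")

# Predicted probability of obesity for each consumption level
predict_veg <- data.frame(
  Frequency_of_Consuming_Vegetables = factor(veg_levels, levels = veg_levels)
)
predict_veg$predicted_prob <- predict(model_veg, newdata = predict_veg, type = "response")
print(predict_veg)
```

The resulting data frame can be plotted with the same geom_col() call used for the exercise and fast food models.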
The model produces clearer results this time around. Individuals who rarely eat vegetables are predicted to have the highest probability of obesity, while those who always eat vegetables have the lowest. In fact, the dataset contains no obese individuals among those who reported always consuming vegetables. This strong and consistent pattern is in line with the view that vegetable intake matters for managing weight and preventing obesity, though on its own it shows an association rather than a cause.
This project set out to explore which lifestyle habits are most associated with obesity using a dataset of behavioral and health-related variables. By focusing on three key factors — physical activity, fast food consumption, and vegetable intake — I built simple predictive models to examine how each relates to the probability of being obese.
The results revealed some unexpected patterns. For both physical activity and fast food consumption, the models suggested that more exercise or less fast food was associated with a higher probability of obesity, and the raw counts showed the same thing. These patterns may reflect reverse causality, where individuals who are already obese exercise more or avoid fast food in an effort to improve their health, but this dataset alone cannot establish the direction of the effect.
In contrast, the relationship between vegetable intake and obesity was much more intuitive and consistent. Individuals who reported always eating vegetables had no observed cases of obesity in the dataset, while those who rarely consumed vegetables had the highest obesity rate. This is consistent with the widely accepted notion that regular vegetable consumption is a key factor in maintaining a healthy weight.
Overall, this project highlighted not only the importance of specific habits in relation to obesity, but also the value of combining statistical modeling with raw data checks to ensure meaningful interpretation. Something I learned during this project is that not all datasets are equal. Another dataset with different factors and a similar number of participants could show the opposite of what my models showed, simply because it drew on a different sample of people.
Citation Request: Koklu, N., & Sulak, S.A. (2024). Using artificial intelligence techniques for the analysis of obesity status according to the individuals’ social and physical activities. Sinop Üniversitesi Fen Bilimleri Dergisi, 9(1), 217-239. https://doi.org/10.33484/sinopfbd.1445215
https://www.kaggle.com/datasets/suleymansulak/obesity-dataset