Project Overview: Predicting BMI Based on Lifestyle Factors
This analysis uses the Obesity Levels dataset from
the UCI Machine Learning Repository, which includes both real and
synthetic data describing individuals’ demographics, dietary habits, and
physical activity levels.
Our goal is to predict Body Mass Index (BMI) — a
continuous health indicator — using lifestyle and behavioral
variables.
Data Preparation & Initial Insights
Before modeling, we performed the following steps:
- Created a new target variable, BMI = Weight / Height²
- Removed the original Weight, Height, and NObeyesdad columns to avoid data leakage
- Converted categorical text fields into factors
- Split the data into an 80% training set and a 20% test set
Key variables include:
- Age, Gender
- Dietary patterns (e.g., frequency of high-calorie food
consumption)
- Physical activity (e.g., exercise, screen time)
- Behavioral and familial traits (e.g., smoking, water intake, family
history of overweight)
# Load libraries
library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(knitr)
# Load dataset
obesity_data <- read.csv("~/Desktop/ObesityDataSet_raw_and_data_sinthetic.csv")
# Calculate BMI and remove related columns to avoid leakage
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)
obesity_data <- obesity_data %>% select(-Weight, -Height, -NObeyesdad)
# View cleaned data
head(obesity_data)
Data Preprocessing
We first verified data types and ensured categorical variables were
converted into factors, then removed any missing values to prevent
errors during modeling.
Finally, we split the dataset:
- Training set (80%) to train the model
- Testing set (20%) to evaluate its generalization performance
# Check structure
str(obesity_data)
'data.frame': 2111 obs. of 15 variables:
$ Gender : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
$ Age : num 21 21 23 27 22 29 23 22 24 22 ...
$ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
$ FAVC : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
$ FCVC : num 2 3 2 3 2 2 3 2 3 2 ...
$ NCP : num 3 3 3 3 1 3 3 3 3 3 ...
$ CAEC : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
$ SMOKE : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ CH2O : num 2 3 2 2 2 2 2 2 2 2 ...
$ SCC : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
$ FAF : num 0 3 2 2 0 0 1 3 1 1 ...
$ TUE : num 1 0 1 0 0 0 0 0 1 1 ...
$ CALC : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
$ MTRANS : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
$ BMI : num 24.4 24.2 23.8 26.9 28.3 ...
# Convert character columns to factors
obesity_data <- obesity_data %>% mutate(across(where(is.character), as.factor))
# Remove missing values (if any)
obesity_data <- na.omit(obesity_data)
# Create training and testing sets
set.seed(123)
train_index <- createDataPartition(obesity_data$BMI, p = 0.8, list = FALSE)
train_data <- obesity_data[train_index, ]
test_data <- obesity_data[-train_index, ]
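For a numeric outcome, createDataPartition() does more than draw a random 80% of rows: it groups the outcome by percentile and samples within each group, so the training and test sets see similar BMI distributions. The idea can be sketched in base R roughly as follows. The synthetic values, the five quintile bins, and all variable names here are illustrative assumptions, not caret's exact internals.

```r
# Synthetic stand-in for the BMI column (not the real dataset)
set.seed(123)
bmi <- rnorm(200, mean = 30, sd = 8)

# Bin the outcome into quintiles
bins <- cut(bmi,
            breaks = quantile(bmi, probs = seq(0, 1, by = 0.2)),
            include.lowest = TRUE)

# Sample 80% of the row indices within each bin
train_idx <- unlist(lapply(split(seq_along(bmi), bins), function(rows) {
  rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
}))

# Everything not sampled goes to the test set
test_idx <- setdiff(seq_along(bmi), train_idx)

cat("Train rows:", length(train_idx), " Test rows:", length(test_idx), "\n")
```

Sampling within bins keeps very low and very high BMI values represented in both sets, which a plain random draw does not guarantee on small data.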
Modeling: Shallow Decision Tree
We built a decision tree model using rpart() to predict BMI from all
lifestyle variables.
The tree was kept shallow to enhance interpretability:
- Maximum depth = 3
- Complexity parameter (cp) = 0.01
# Train a shallow decision tree
tree_model <- rpart(BMI ~ .,
                    data = train_data,
                    method = "anova",
                    control = rpart.control(maxdepth = 3, cp = 0.01))
# Visualize the tree
rpart.plot(tree_model,
           type = 4,
           extra = 101,
           fallen.leaves = TRUE,
           box.palette = "Blues",
           shadow.col = "gray")
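To make the effect of the depth limit concrete: rpart counts the root node as depth 0, so maxdepth = 3 allows at most 2^3 = 8 terminal nodes, meaning at most 8 distinct predicted values. The sketch below checks this on a small synthetic data frame; the toy data and the names x1, x2, and y are assumptions for illustration and are unrelated to the obesity dataset.

```r
library(rpart)

# Synthetic regression data (illustrative only)
set.seed(1)
toy <- data.frame(x1 = runif(300), x2 = runif(300))
toy$y <- 10 * toy$x1 + 5 * (toy$x2 > 0.5) + rnorm(300, sd = 0.5)

# Same control settings as the model above
shallow <- rpart(y ~ ., data = toy, method = "anova",
                 control = rpart.control(maxdepth = 3, cp = 0.01))

# Terminal nodes are marked "<leaf>" in the fitted tree's frame
n_leaves <- sum(shallow$frame$var == "<leaf>")
cat("Leaves:", n_leaves, "(upper bound for maxdepth = 3 is 8)\n")
```

The cp setting can reduce the leaf count further, since a split is kept only if it improves the overall fit by at least that fraction.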

Model Evaluation: Root Mean Squared Error (RMSE)
We used RMSE to evaluate prediction accuracy on both training and
testing datasets.
# Predict BMI
train_pred <- predict(tree_model, newdata = train_data)
test_pred <- predict(tree_model, newdata = test_data)
# Calculate RMSE
train_rmse <- sqrt(mean((train_pred - train_data$BMI)^2))
test_rmse <- sqrt(mean((test_pred - test_data$BMI)^2))
# Print results
cat("Training RMSE:", round(train_rmse, 2), "\n")
Training RMSE: 5.72
cat("Testing RMSE:", round(test_rmse, 2), "\n")
Testing RMSE: 5.7
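Training RMSE shows how well the model fits the data it was trained
on, while testing RMSE estimates how well it generalizes to new, unseen
data. Here the two are nearly identical (5.72 vs. 5.70 BMI units), so
the shallow tree shows no sign of overfitting. The error is large,
however, relative to the 5-unit width of standard BMI categories such
as overweight (25 to 30), which points to underfitting.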
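An RMSE value is easier to judge next to a baseline. The sketch below works through the RMSE formula on four made-up BMI values and compares it with the RMSE of always predicting the mean. The numbers and the rmse() helper are hypothetical and are not taken from the analysis.

```r
# Helper: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

# Made-up BMI values and predictions (illustrative only)
actual    <- c(22, 25, 31, 40)
predicted <- c(24, 25, 28, 36)

# Errors are 2, 0, -3, -4, so the mean squared error is 29 / 4 = 7.25
model_rmse <- rmse(actual, predicted)

# Baseline: predict the mean of the observed values for everyone
baseline_rmse <- rmse(actual, rep(mean(actual), length(actual)))

cat("Model RMSE:   ", round(model_rmse, 2), "\n")
cat("Baseline RMSE:", round(baseline_rmse, 2), "\n")
```

Applied to the real model, the same comparison would use the training-set mean of BMI as the baseline prediction for test_data.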
Variable Importance
We extracted feature importance scores to understand which factors
most influenced BMI predictions.
# Get variable importance
importance <- data.frame(
  Variable = names(tree_model$variable.importance),
  Importance = as.numeric(tree_model$variable.importance)
)
# Display as table
kable(importance, caption = "Variable Importance in Decision Tree Model")
Variable Importance in Decision Tree Model

| Variable | Importance |
|---|---|
| family_history_with_overweight | 24987.33400 |
| FCVC | 14733.48519 |
| CAEC | 13175.45035 |
| Gender | 2501.11946 |
| Age | 1363.33016 |
| TUE | 419.38583 |
| MTRANS | 352.16697 |
| CH2O | 183.48130 |
| CALC | 84.41767 |
| NCP | 80.08761 |
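This table ranks predictors by rpart's importance score: the total
improvement in fit from every split where the variable was the primary
splitter, plus a weighted credit for splits where it served as a
surrogate. The surrogate credit is why ten variables appear even though
a depth-3 tree has at most seven splits.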
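Raw importance scores are hard to compare at a glance, so it can help to rescale them to percentages of the total. A minimal sketch, using the scores from the table above typed in by hand; the share variable and the rounding are illustrative choices.

```r
# Importance scores copied from the table above
importance <- c(
  family_history_with_overweight = 24987.33400,
  FCVC   = 14733.48519,
  CAEC   = 13175.45035,
  Gender = 2501.11946,
  Age    = 1363.33016,
  TUE    = 419.38583,
  MTRANS = 352.16697,
  CH2O   = 183.48130,
  CALC   = 84.41767,
  NCP    = 80.08761
)

# Each variable's share of the total importance, in percent
share <- 100 * importance / sum(importance)
print(round(share, 1))

# Combined share of the three highest-ranked variables
top3_share <- sum(sort(share, decreasing = TRUE)[1:3])
cat("Top three variables:", round(top3_share, 1), "% of total importance\n")
```

On these scores, family history, vegetable consumption frequency (FCVC), and eating between meals (CAEC) together account for roughly 91% of the total.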
Conclusion
Model: A shallow regression tree predicting BMI
Approach: Simple, interpretable model with intentionally limited depth
Results:
- RMSE on training and testing datasets gives insight into accuracy and generalization
- Visualization of splits aids explainability
- Variable importance identifies key health and lifestyle predictors
Limitations
While decision trees are easy to interpret, they may underfit complex
relationships. Future work could explore more flexible models (e.g.,
XGBoost or Random Forest) for improved prediction accuracy while
maintaining interpretability through tools like SHAP.
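As a rough illustration of what a more flexible model could look like without adding new packages, the sketch below bags deeper rpart trees by hand and compares the result with a single shallow tree. Bagging is a simpler relative of Random Forest and is used here only as a stand-in. The synthetic data, the number of trees, and the control settings are all assumptions, and no result from this toy example says anything about the obesity dataset.

```r
library(rpart)

# Synthetic data with a nonlinear signal (illustrative only)
set.seed(42)
n <- 400
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 20 + 10 * sin(2 * pi * toy$x1) + 5 * toy$x2 + rnorm(n)
train <- toy[1:300, ]
test  <- toy[301:400, ]

rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))

# Single shallow tree, same settings as in the analysis above
shallow <- rpart(y ~ ., data = train, method = "anova",
                 control = rpart.control(maxdepth = 3, cp = 0.01))

# Bagging: fit deeper trees on bootstrap resamples of the training set
n_trees <- 50
bag_preds <- sapply(seq_len(n_trees), function(i) {
  boot <- train[sample(nrow(train), replace = TRUE), ]
  fit <- rpart(y ~ ., data = boot, method = "anova",
               control = rpart.control(maxdepth = 8, cp = 0.001))
  predict(fit, newdata = test)
})

# Average the per-tree predictions for each test row
bagged_pred <- rowMeans(bag_preds)

shallow_rmse <- rmse(test$y, predict(shallow, newdata = test))
bagged_rmse  <- rmse(test$y, bagged_pred)
cat("Shallow tree RMSE:", round(shallow_rmse, 2), "\n")
cat("Bagged trees RMSE:", round(bagged_rmse, 2), "\n")
```

The trade-off is the one noted above: an average over 50 trees can no longer be read off a single diagram, which is where post-hoc tools such as SHAP come in.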