Project Overview: Predicting BMI Based on Lifestyle Factors

This analysis uses the Obesity Levels dataset from the UCI Machine Learning Repository, which includes both real and synthetic data describing individuals’ demographics, dietary habits, and physical activity levels.

Our goal is to predict Body Mass Index (BMI), a continuous health indicator, using lifestyle and behavioral variables.


Data Preparation & Initial Insights

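Before modeling, we performed the following steps:

  • Created a new target variable, BMI = Weight / Height²
  • Removed the original Weight, Height, and NObeyesdad columns to avoid data leakage
  • Converted categorical text fields into factors
  • Split the data into an 80% training set and a 20% test set

Key variables include: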

  • Age, Gender
  • Dietary patterns (e.g., frequency of high-calorie food consumption)
  • Physical activity (e.g., exercise, screen time)
  • Behavioral and familial traits (e.g., smoking, water intake, family history of overweight)

# Load libraries
library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(knitr)

# Load dataset
obesity_data <- read.csv("~/Desktop/ObesityDataSet_raw_and_data_sinthetic.csv")

# Calculate BMI (kg/m^2; assumes Weight in kg and Height in m), then drop
# Weight, Height, and the NObeyesdad obesity label to avoid target leakage
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)
obesity_data <- obesity_data %>% select(-Weight, -Height, -NObeyesdad)

# View cleaned data
head(obesity_data)
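
As a quick check of the BMI formula used above, the following minimal sketch wraps it in a helper. The function name bmi() and the sample values are illustrative assumptions, not rows from the dataset; the sketch assumes weight in kilograms and height in meters.

```r
# Hypothetical helper: BMI = weight (kg) / height (m)^2
bmi <- function(weight_kg, height_m) {
  stopifnot(all(height_m > 0))  # guard against zero or negative heights
  weight_kg / height_m^2
}

# Illustrative values only (not taken from the dataset)
example_bmi <- bmi(weight_kg = c(64, 80), height_m = c(1.62, 1.80))
```

If Height were recorded in centimeters instead, it would need to be divided by 100 before applying the formula.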

Data Preprocessing

We first verified data types and ensured categorical variables were converted into factors, then removed any missing values to prevent errors during modeling.

Finally, we split the dataset:

  • Training set (80%), used to fit the model
  • Testing set (20%), used to evaluate generalization performance

# Check structure
str(obesity_data)
'data.frame':   2111 obs. of  15 variables:
 $ Gender                        : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
 $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
 $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
 $ FAVC                          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
 $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
 $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
 $ CAEC                          : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ SMOKE                         : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
 $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
 $ SCC                           : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
 $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
 $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
 $ CALC                          : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
 $ MTRANS                        : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
 $ BMI                           : num  24.4 24.2 23.8 26.9 28.3 ...
# Convert character columns to factors
obesity_data <- obesity_data %>% mutate(across(where(is.character), as.factor))

# Remove missing values (if any)
obesity_data <- na.omit(obesity_data)

# Create training and testing sets
set.seed(123)
train_index <- createDataPartition(obesity_data$BMI, p = 0.8, list = FALSE)
train_data <- obesity_data[train_index, ]
test_data <- obesity_data[-train_index, ]
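
For a numeric outcome, createDataPartition() samples within percentile groups of the outcome rather than from all rows at once, so both partitions see a similar BMI distribution. The sketch below approximates that idea in base R; the simulated bmi vector and the choice of quintile bins are assumptions made for illustration, and this is not caret's actual implementation.

```r
# Simulated stand-in for the BMI column (hypothetical values)
set.seed(123)
bmi <- rnorm(1000, mean = 30, sd = 8)

# Bin the outcome into quintiles
bins <- cut(bmi,
            breaks = quantile(bmi, probs = seq(0, 1, by = 0.2)),
            include.lowest = TRUE)

# Sample 80% of the row indices within each bin
train_index <- unlist(
  lapply(split(seq_along(bmi), bins), function(idx) {
    idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
  }),
  use.names = FALSE
)
test_index <- setdiff(seq_along(bmi), train_index)

# Compare the outcome distribution across the two partitions
summary(bmi[train_index])
summary(bmi[test_index])
```

Comparing the two summaries is a cheap way to confirm that neither partition ended up with an unusual share of very low or very high BMI values.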

Modeling: Shallow Decision Tree

We built a decision tree model using rpart() to predict BMI from all lifestyle variables.

The tree was kept shallow to enhance interpretability:

  • Maximum depth = 3
  • Complexity parameter (cp) = 0.01

# Train a shallow decision tree
tree_model <- rpart(BMI ~ ., 
                    data = train_data, 
                    method = "anova", 
                    control = rpart.control(maxdepth = 3, cp = 0.01))

# Visualize the tree
rpart.plot(tree_model,
           type = 4,
           extra = 101,
           fallen.leaves = TRUE,
           box.palette = "Blues",
           shadow.col = "gray")
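
With method = "anova", cp is the minimum gain in overall R-squared that a split must deliver to be kept, and maxdepth caps how many levels lie below the root. One way to see how the two settings interact is the fitted tree's cptable. The sketch below uses R's built-in mtcars data as a stand-in (an assumption made so the example runs on its own); the same calls would apply to tree_model.

```r
library(rpart)

# mtcars is a built-in stand-in here, not the obesity data
set.seed(123)  # the cross-validated errors in cptable depend on random folds
fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(maxdepth = 3, cp = 0.01))

# One row per candidate subtree: CP, nsplit, rel error, xerror, xstd
cp_table <- fit$cptable
print(cp_table)

# Pick the cp with the lowest cross-validated error and prune to it
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned <- prune(fit, cp = best_cp)

# rpart numbers nodes so that depth = floor(log2(node number))
node_ids <- as.numeric(rownames(fit$frame))
max_depth <- max(floor(log2(node_ids)))
```

If the row with the lowest xerror is also the last row of the table, the depth or cp limit may be stopping the tree before cross-validated error has bottomed out.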

Model Evaluation: Root Mean Squared Error (RMSE)

We used RMSE to evaluate prediction accuracy on both training and testing datasets.

# Predict BMI
train_pred <- predict(tree_model, newdata = train_data)
test_pred <- predict(tree_model, newdata = test_data)

# Calculate RMSE
train_rmse <- sqrt(mean((train_pred - train_data$BMI)^2))
test_rmse <- sqrt(mean((test_pred - test_data$BMI)^2))

# Print results
cat("Training RMSE:", round(train_rmse, 2), "\n")
Training RMSE: 5.72 
cat("Testing RMSE:", round(test_rmse, 2), "\n")
Testing RMSE: 5.7 

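Training RMSE shows how well the model fits the data it was trained on, while testing RMSE estimates how well it generalizes to new, unseen data. Here the two values are nearly identical (5.72 vs. 5.70), which suggests the shallow tree is not overfitting. A typical error of roughly 5.7 BMI units is still sizable, so the model likely captures only part of the variation in BMI.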
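
An RMSE value is easier to judge against a reference point. The sketch below defines RMSE as a reusable function and compares it with a mean-only baseline that always predicts the training-set average. The helper names rmse() and r_squared() and the simulated outcome vectors are assumptions for illustration, not part of the original analysis.

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((predicted - actual)^2))
}

# Share of squared error removed relative to a constant baseline prediction
r_squared <- function(actual, predicted, baseline) {
  1 - sum((actual - predicted)^2) / sum((actual - baseline)^2)
}

# Simulated outcomes standing in for train_data$BMI and test_data$BMI
set.seed(123)
train_y <- rnorm(200, mean = 30, sd = 8)
test_y  <- rnorm(50, mean = 30, sd = 8)

# Baseline: predict the training mean for every test row
baseline <- mean(train_y)
baseline_rmse <- rmse(test_y, rep(baseline, length(test_y)))
```

In the notebook itself, the equivalent baseline would be rmse(test_data$BMI, rep(mean(train_data$BMI), nrow(test_data))). The tree is only useful to the extent that its testing RMSE falls clearly below that number.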

Variable Importance

We extracted feature importance scores to understand which factors most influenced BMI predictions.

# Get variable importance
importance <- data.frame(
  Variable = names(tree_model$variable.importance),
  Importance = as.numeric(tree_model$variable.importance)
)

# Display as table
kable(importance, caption = "Variable Importance in Decision Tree Model")
Variable Importance in Decision Tree Model

Variable                         Importance
family_history_with_overweight  24987.33400
FCVC                            14733.48519
CAEC                            13175.45035
Gender                           2501.11946
Age                              1363.33016
TUE                               419.38583
MTRANS                            352.16697
CH2O                              183.48130
CALC                               84.41767
NCP                                80.08761

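This table ranks predictors by rpart's importance score: the total improvement in fit from every split where the variable was the primary splitter, plus a weighted credit for splits where it served as a surrogate. A depth-3 tree has at most seven splits, so at least three of the ten variables listed earn their score only as surrogates and do not appear in the plotted tree.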
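
Raw importance scores are hard to compare at a glance, so it can help to rescale them to percentages of the total. The sketch below does this with the scores copied from the table above; the variable names share and top3_share are introduced here for illustration.

```r
# Importance scores copied from the table above
importance <- c(
  family_history_with_overweight = 24987.33400,
  FCVC   = 14733.48519,
  CAEC   = 13175.45035,
  Gender = 2501.11946,
  Age    = 1363.33016,
  TUE    = 419.38583,
  MTRANS = 352.16697,
  CH2O   = 183.48130,
  CALC   = 84.41767,
  NCP    = 80.08761
)

# Each variable's share of the total importance, in percent
share <- 100 * importance / sum(importance)
print(round(share, 1))

# Combined share of the three highest-ranked predictors (a fraction of 1)
top3_share <- sum(importance[1:3]) / sum(importance)
```

On these numbers the first three predictors account for roughly nine tenths of the total, which shows how concentrated the tree's splits are.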

Conclusion

Model: A shallow regression tree predicting BMI

Approach: Simple, interpretable model with intentionally limited depth

Results:

  • Training and testing RMSE were nearly identical (5.72 and 5.70 BMI units), indicating consistent generalization
  • Visualization of the splits aids explainability
  • Variable importance identifies family_history_with_overweight, FCVC, and CAEC as the strongest predictors

Limitations

While decision trees are easy to interpret, a tree limited to depth 3 can output at most eight distinct BMI values and may underfit complex relationships. Future work could explore more flexible models (e.g., Random Forest or XGBoost) for improved prediction accuracy, using tools like SHAP to retain some interpretability.
