---
title: "R Notebook"
output: html_notebook
---


# Project Overview: Predicting BMI from Lifestyle Factors

For our project, we chose the Obesity Levels dataset from the UCI Machine Learning Repository. This dataset combines real and synthetic data, capturing details about people’s eating habits, physical activity, and biological traits (like age, gender, and height). Our main goal is to predict body mass index (BMI), a continuous variable that we compute from weight and height, using these lifestyle and health factors.

# Data Preparation & Initial Insights

Before diving into modeling, we cleaned and prepared the data by:

- Converting text-based (categorical) variables into factors for analysis.
- Removing any rows with missing values that could skew results.
- Splitting the data into an 80% training set (to build models) and a 20% testing set (to evaluate their performance fairly).

# Key observations from early exploration

- Variables include age, height, dietary patterns (e.g., frequent high-calorie food consumption), and activity levels (exercise frequency, screen time).
- BMI is our target, analyzed as a continuous value, which makes it well suited to regression techniques.
- The dataset was already quite clean, requiring minimal adjustments before modeling.

# Load the Obesity Dataset

```{r}
# Load necessary libraries
library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(knitr)  # for neat tables

# Read the dataset
obesity_data <- read.csv("~/Desktop/ObesityDataSet_raw_and_data_sinthetic.csv")

# View first few rows
head(obesity_data)

# Calculate BMI
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)

# Remove NObeyesdad (target leakage)
obesity_data <- obesity_data %>% select(-NObeyesdad)
```
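
As a quick sanity check of the BMI formula used in the chunk above (weight in kilograms divided by height in meters squared), the sketch below recomputes BMI for a person weighing 64 kg at 1.62 m, which are the values of the dataset's first row. The `demo_` names are hypothetical and exist only for this illustration.

```{r}
# Hypothetical demo values: weight in kg and height in m (first row of the dataset)
demo_weight <- 64
demo_height <- 1.62

# BMI = weight / height^2
demo_bmi <- demo_weight / demo_height^2
round(demo_bmi, 1)  # 24.4
```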


# Data Preprocessing

```{r}
# Check variable types
str(obesity_data)

# Convert character variables to factors
obesity_data <- obesity_data %>% 
  mutate(across(where(is.character), as.factor))

# Remove any missing values
obesity_data <- na.omit(obesity_data)

# Split the data into training and testing sets (80/20 split)
set.seed(123)
train_index <- createDataPartition(obesity_data$BMI, p = 0.8, list = FALSE)
train_data <- obesity_data[train_index, ]
test_data <- obesity_data[-train_index, ]
```

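First, we checked the structure of our dataset (obesity_data) with str() to see what types of variables we were working with: numeric columns such as Age and Height, and text columns such as Gender and CAEC. After adding BMI and dropping NObeyesdad, the data frame has 2,111 observations of 17 variables.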

Since some columns were stored as text (characters), we converted them into factors, the type R uses to represent categorical variables in statistical modeling. This lets the models treat values like "Male" or "Female" as categories rather than arbitrary strings.
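
To make the conversion concrete, here is a minimal sketch on a made-up data frame. The `toy_data` object and its values are hypothetical, and the base R code is assumed to behave like the `mutate(across(where(is.character), as.factor))` call used above.

```{r}
# Hypothetical toy data frame with two text columns and one numeric column
toy_data <- data.frame(
  Gender = c("Female", "Male", "Female", "Male"),
  SMOKE  = c("no", "yes", "no", "no"),
  Age    = c(21, 23, 27, 22),
  stringsAsFactors = FALSE
)

# Convert only the character columns to factors; numeric columns are left alone
is_text <- vapply(toy_data, is.character, logical(1))
toy_data[is_text] <- lapply(toy_data[is_text], as.factor)

# Inspect the resulting column classes and the category labels R stored
sapply(toy_data, class)
levels(toy_data$Gender)
```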

Next, we made sure our data was clean and complete by removing any rows with missing values. This avoids errors or biased results later on.

Finally, we split the data into two parts:

- Training set (80%): Used to build and train our models.
- Testing set (20%): Reserved to check how well the models perform on unseen data.

We set a random seed (set.seed(123)) to ensure this split is reproducible, meaning anyone running the code gets the same training/testing groups for fair comparisons.
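
The effect of the seed can be checked directly. The sketch below uses a made-up numeric vector (`toy_bmi`, hypothetical) in place of the real BMI column and confirms that two calls to `createDataPartition()` with the same seed return the same rows, and that the training and testing rows together cover the data exactly once.

```{r}
library(caret)

# Hypothetical numeric outcome standing in for the BMI column
toy_bmi <- seq(18, 45, length.out = 100)

# Same seed before each call, so the same rows should be selected
set.seed(123)
split_a <- createDataPartition(toy_bmi, p = 0.8, list = FALSE)
set.seed(123)
split_b <- createDataPartition(toy_bmi, p = 0.8, list = FALSE)
identical(split_a, split_b)

# Rows not chosen for training form the testing set, with no overlap
train_rows <- split_a[, 1]
toy_train  <- toy_bmi[train_rows]
toy_test   <- toy_bmi[-train_rows]
length(toy_train) + length(toy_test) == length(toy_bmi)
```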


# Build a Shallow Decision Tree

```{r}
# Train a shallow decision tree to predict BMI
tree_model <- rpart(BMI ~ ., 
                    data = train_data, 
                    method = "anova", 
                    control = rpart.control(maxdepth = 3, cp = 0.01))

# Plot the decision tree
rpart.plot(tree_model,
           type = 4,
           extra = 101,
           fallen.leaves = TRUE,
           box.palette = "Blues",
           shadow.col = "gray")
```
We started by training a basic Decision Tree model to predict Body Mass Index (BMI) using all other available variables in the dataset (excluding NObeyesdad, which was removed to prevent target leakage).

To keep the model intentionally simple and interpretable, we restricted its complexity in two ways:

- We limited the maximum depth to 3 levels (maxdepth = 3) to avoid overfitting and unnecessary complexity.
- We set the complexity parameter to cp = 0.01, which is also rpart's default, so a split is attempted only if it improves the overall fit (R-squared) by at least 0.01.

Since BMI is a continuous variable, we used the 'anova' method in the rpart() function to indicate regression rather than classification.
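
To see how the complexity parameter interacts with the depth limit, the following sketch fits two trees on R's built-in `mtcars` data (a stand-in, not our dataset) with the same `maxdepth` but different `cp` values. The helper `count_leaves()` and the value `cp = 0.20` are assumptions chosen for illustration.

```{r}
library(rpart)

# Hypothetical helper: count the terminal nodes of a fitted rpart tree
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Same depth limit, two different complexity parameters
fit_default_cp <- rpart(mpg ~ ., data = mtcars, method = "anova",
                        control = rpart.control(maxdepth = 3, cp = 0.01))
fit_strict_cp  <- rpart(mpg ~ ., data = mtcars, method = "anova",
                        control = rpart.control(maxdepth = 3, cp = 0.20))

# A larger cp can only remove splits, so the stricter tree is never bigger
count_leaves(fit_default_cp)
count_leaves(fit_strict_cp)
```

With `maxdepth = 3` a tree can have at most 8 leaves, whatever the value of `cp`.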

For visualization, we created a clean and informative tree plot that shows:

- The hierarchical decision points (splits) based on the most influential predictors
- The predicted BMI value at each node, along with the number and percentage of observations it contains (extra = 101)
- A color gradient (using a blue palette) to visually indicate different prediction ranges
- Subtle shadows to improve readability and distinguish branches

This restrained approach gives us a model that is:

- Easy to interpret and explain to non-technical stakeholders
- Quick to train and computationally efficient
- A solid baseline for comparison with more complex models such as XGBoost

The tree visualization acts as both a diagnostic tool (to assess if splits make logical sense) and a communication tool (to help explain how input factors influence BMI outcomes).


# Evaluate the Model

```{r}
# Predict on training and test data
train_pred <- predict(tree_model, newdata = train_data)
test_pred <- predict(tree_model, newdata = test_data)

# Calculate RMSE (Root Mean Squared Error)
train_rmse <- sqrt(mean((train_pred - train_data$BMI)^2))
test_rmse <- sqrt(mean((test_pred - test_data$BMI)^2))

# Print RMSE neatly
cat("Training RMSE:", round(train_rmse, 2), "\n")
cat("Testing RMSE:", round(test_rmse, 2), "\n")
```

We used our decision tree to predict BMI in two scenarios:

- For individuals in the training set (data the model had already seen)
- For individuals in the testing set (new, unseen data)

To evaluate how accurate the model’s predictions were, we calculated the Root Mean Squared Error (RMSE). RMSE gives us one number that summarizes the typical prediction error in the same units as BMI (kg/m²); smaller values indicate more accurate predictions.

The steps were as follows:

First, we generated predictions for both the training and testing sets.

Then, we calculated RMSE by:

1. Taking the difference between the predicted and actual BMI values
2. Squaring those differences
3. Averaging them
4. Taking the square root of that average
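
The four steps above can be wrapped in a small helper. This is a sketch only: `calc_rmse()` is a hypothetical name, and the BMI values are made up for illustration.

```{r}
# Hypothetical helper that follows the four steps listed above
calc_rmse <- function(predicted, actual) {
  errors  <- predicted - actual   # 1. differences
  squared <- errors^2             # 2. squared differences
  sqrt(mean(squared))             # 3. average, then 4. square root
}

# Made-up BMI values for illustration only
demo_pred   <- c(22, 25, 31, 28)
demo_actual <- c(24, 25, 29, 28)
round(calc_rmse(demo_pred, demo_actual), 2)  # 1.41
```

Here the squared errors are 4, 0, 4, and 0, their mean is 2, and the square root of 2 is about 1.41.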

Finally, we printed the Training RMSE and Testing RMSE:

- Training RMSE (2.51 in our run) shows how well the model fits the data it was trained on.
- Testing RMSE (2.65 in our run) shows how well the model performs on new data, which is a better indicator of real-world performance.

This evaluation helps us understand whether the model is overfitting, underfitting, or performing as expected, and gives us a sense of how accurate its BMI predictions are in practice. Here the two values are close, so the shallow tree shows little sign of overfitting.
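
One simple way to read the two numbers together is to look at the gap between them. The sketch below does this arithmetic for the values printed by this notebook's evaluation chunk (training RMSE 2.51, testing RMSE 2.65). The variable names are hypothetical, and what counts as a small gap is a judgment call rather than a fixed rule.

```{r}
# RMSE values printed by this notebook's evaluation chunk
reported_train_rmse <- 2.51
reported_test_rmse  <- 2.65

# Absolute and relative gap between testing and training error
rmse_gap     <- reported_test_rmse - reported_train_rmse
relative_gap <- rmse_gap / reported_train_rmse

round(rmse_gap, 2)            # 0.14
round(100 * relative_gap, 1)  # 5.6 (percent)
```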


# Analyze Important Features

```{r}
# Check variable importance
importance <- data.frame(
  Variable = names(tree_model$variable.importance),
  Importance = as.numeric(tree_model$variable.importance)
)

# Display neatly
kable(importance, caption = "Variable Importance in Decision Tree Model")
```

# Understanding What Really Affects BMI Predictions

After building our BMI prediction model, we wanted to identify which variables had the most influence on the predictions. To do this, we extracted the variable importance scores from the decision tree model.

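These scores reflect how much each variable improved the fit when it was used to split the data in the tree. In rpart, a variable also earns credit when it serves as a surrogate (backup) split, which is why more variables can receive a score than the depth-3 tree has split points. Variables that played a larger role in creating accurate predictions received higher scores.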
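
The difference between variables used in splits and variables that merely receive a score can be inspected on any fitted tree. The sketch below uses R's built-in `mtcars` data as a stand-in for our dataset; `demo_tree`, `split_vars`, and `scored_vars` are hypothetical names.

```{r}
library(rpart)

# Small tree on a built-in dataset (mtcars is a stand-in, not our data)
demo_tree <- rpart(mpg ~ ., data = mtcars, method = "anova",
                   control = rpart.control(maxdepth = 3, cp = 0.01))

# Variables actually used as primary splits in the tree
split_vars <- setdiff(unique(as.character(demo_tree$frame$var)), "<leaf>")

# Variables that received an importance score (primary or surrogate credit)
scored_vars <- names(demo_tree$variable.importance)

split_vars
setdiff(scored_vars, split_vars)  # scored without being a primary split
```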

We then organized this information into a clear table with two columns:

- Variable: The name of each predictor (e.g., Height, Age, FCVC)
- Importance: A numeric score indicating how influential the variable was in the model
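
Raw importance scores are easier to compare as shares of the total. The sketch below takes the scores reported in this notebook's output table and converts them to percentages; on these numbers Weight accounts for roughly 55% of the total. The object names are hypothetical.

```{r}
# Importance scores as reported in this notebook's output table
reported_importance <- c(
  Weight = 92294.2396, Height = 18200.5106, Age = 16343.7744,
  FCVC = 14226.0842, Gender = 8125.2846, CH2O = 7416.5660,
  CAEC = 3835.4491, family_history_with_overweight = 3096.4359,
  FAF = 2365.9833, NCP = 1446.6894, FAVC = 316.7405
)

# Express each score as a percentage of the total
importance_share <- 100 * reported_importance / sum(reported_importance)
round(importance_share, 1)
```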

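Using the kable() function, we displayed the table in a professional format and added a clear caption for context. This made it easy to compare which features had the most impact. In our run, Weight scored highest by a wide margin (about 92,294), followed by Height (18,201), Age (16,344), and FCVC (14,226), while FAVC scored lowest (about 317). Weight and Height lead because BMI is computed directly from them.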

This analysis helps answer key questions such as:

1. Which lifestyle or demographic factors most influence BMI?
2. Are there specific variables we should focus on in future data collection?
3. Do the most important predictors align with our expectations and domain knowledge?


# Conclusion

- Methods: We built a shallow decision tree (maximum depth of 3, cp = 0.01) to balance interpretability and predictive accuracy.
- Results: The tree reached a training RMSE of 2.51 and a testing RMSE of 2.65. The decision tree plot (Figure 1) and the variable importance table (Table 1) summarize the fitted model.
- Discussion: Decision trees are simple to interpret but can underfit complex patterns, which is why this model serves as a baseline for more flexible methods.


