This is an R Markdown Notebook. When you execute code within the notebook, the results appear beneath the code.

Load the Obesity Dataset

Project Overview: Predicting BMI from Lifestyle Factors

For our project, we chose the Obesity Levels dataset from the UCI Machine Learning Repository. This dataset combines real and synthetic data, capturing details about people’s eating habits, physical activity, and biological traits (like age, gender, and height). Our main goal is to predict Body Mass Index (BMI), a continuous variable that we derive from the recorded weight and height, using these lifestyle and health factors.

Data Preparation & Initial Insights

Before diving into modeling, we cleaned and prepared the data by:

- Converting text-based (categorical) variables into factors, R's categorical type, for analysis.
- Ensuring no missing values were present that could skew results.
- Splitting the data into an 80% training set (to build models) and a 20% testing set (to evaluate their performance fairly).

Key observations from early exploration:

- Variables include age, height, dietary patterns (e.g., frequent high-calorie food consumption), and activity levels (exercise frequency, screen time).
- BMI, computed as weight divided by height squared, is our target; it is analyzed as a continuous value, which suits regression techniques.
- The dataset was already quite clean, requiring minimal adjustments before modeling.

# Load necessary libraries
library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(knitr)  # for neat tables

# Read the dataset
obesity_data <- read.csv("~/Desktop/ObesityDataSet_raw_and_data_sinthetic.csv")

# View first few rows
head(obesity_data)

# Calculate BMI
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)

# Remove NObeyesdad (target leakage)
obesity_data <- obesity_data %>% select(-NObeyesdad)

Data Preprocessing

# Check variable types
str(obesity_data)
'data.frame':   2111 obs. of  17 variables:
 $ Gender                        : chr  "Female" "Female" "Male" "Male" ...
 $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
 $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
 $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
 $ family_history_with_overweight: chr  "yes" "yes" "yes" "no" ...
 $ FAVC                          : chr  "no" "no" "no" "no" ...
 $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
 $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
 $ CAEC                          : chr  "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
 $ SMOKE                         : chr  "no" "yes" "no" "no" ...
 $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
 $ SCC                           : chr  "no" "yes" "no" "no" ...
 $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
 $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
 $ CALC                          : chr  "no" "Sometimes" "Frequently" "Frequently" ...
 $ MTRANS                        : chr  "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
 $ BMI                           : num  24.4 24.2 23.8 26.9 28.3 ...
# Convert character variables to factors
obesity_data <- obesity_data %>% 
  mutate(across(where(is.character), as.factor))

# Remove any missing values
obesity_data <- na.omit(obesity_data)

# Split the data into training and testing sets (80/20 split)
set.seed(123)
train_index <- createDataPartition(obesity_data$BMI, p = 0.8, list = FALSE)
train_data <- obesity_data[train_index, ]
test_data <- obesity_data[-train_index, ]

First, we checked the structure of our dataset (obesity_data) to see what types of variables we were working with—like numbers, categories, or text.

Since some columns were stored as text (characters), we converted them into factors—a format R understands better for statistical modeling. This helps the models recognize categories (like “Male” or “Female”) properly.

Next, we made sure our data was clean and complete by removing any rows with missing values. This avoids errors or biased results later on.

Finally, we split the data into two parts:

- Training set (80%): Used to build and train our models.
- Testing set (20%): Reserved to check how well the models perform on unseen data.

We set a random seed (set.seed(123)) to ensure this split is reproducible, meaning anyone running the code gets the same training/testing groups for fair comparisons.

Build a Shallow Decision Tree

# Train a shallow decision tree to predict BMI
tree_model <- rpart(BMI ~ ., 
                    data = train_data, 
                    method = "anova", 
                    control = rpart.control(maxdepth = 3, cp = 0.01))

# Plot the decision tree
rpart.plot(tree_model,
           type = 4,
           extra = 101,
           fallen.leaves = TRUE,
           box.palette = "Blues",
           shadow.col = "gray")

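We started by training a basic Decision Tree model to predict Body Mass Index (BMI) using all other available variables in the dataset (excluding NObeyesdad, which was removed to prevent target leakage). Note that Weight and Height remain among the predictors even though BMI is computed directly from them, so the tree can recover much of BMI from those two columns; this should be kept in mind when interpreting the results.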

To keep the model intentionally simple and interpretable, we restricted its complexity in two ways:

- We limited the maximum depth to 3 levels to avoid overfitting and unnecessary complexity.
- We set the complexity parameter to cp = 0.01, which is also the rpart default, so a split is kept only if it improves the overall fit (R-squared) by at least 0.01.

Since BMI is a continuous variable, we used the ‘anova’ method in the rpart() function to indicate regression rather than classification.

For visualization, we created a clean and informative tree plot that shows:

- The hierarchical decision points (splits) based on the most influential predictors
- The predicted BMI values at each terminal node (leaf), along with the number and percentage of observations in the node (extra = 101)
- A color gradient (using a blue palette) to visually indicate different prediction ranges
- Subtle shadows to improve readability and distinguish branches

This restrained approach gives us a model that:

- Is easy to interpret and explain to non-technical stakeholders
- Is quick to train and computationally efficient
- Provides a solid baseline for comparison with more complex models such as XGBoost

The tree visualization acts as both a diagnostic tool (to assess if splits make logical sense) and a communication tool (to help explain how input factors influence BMI outcomes).

Evaluate the Model

# Predict on training and test data
train_pred <- predict(tree_model, newdata = train_data)
test_pred <- predict(tree_model, newdata = test_data)

# Calculate RMSE (Root Mean Squared Error)
train_rmse <- sqrt(mean((train_pred - train_data$BMI)^2))
test_rmse <- sqrt(mean((test_pred - test_data$BMI)^2))

# Print RMSE neatly
cat("Training RMSE:", round(train_rmse, 2), "\n")
Training RMSE: 2.51 
cat("Testing RMSE:", round(test_rmse, 2), "\n")
Testing RMSE: 2.65 

We used our decision tree to predict BMI in two scenarios:

- For individuals in the training set (data the model had already seen)
- For individuals in the testing set (new, unseen data)

To evaluate how accurate the model’s predictions were, we calculated the Root Mean Squared Error (RMSE). RMSE gives us one number that summarizes the typical prediction error — smaller values indicate more accurate predictions.

The steps were as follows:

First, we generated predictions for both the training and testing sets.

Then, we calculated RMSE by:

  1. Taking the difference between the predicted and actual BMI values
  2. Squaring those differences
  3. Averaging them
  4. Taking the square root of that average

Finally, we printed the Training RMSE and Testing RMSE:

- Training RMSE shows how well the model fits the data it was trained on.
- Testing RMSE shows how well the model performs on new data, which is a better indicator of real-world performance.

This evaluation helps us understand whether the model is overfitting, underfitting, or performing as expected, and gives us a sense of how accurate its BMI predictions are in practice.

Analyze Important Features

# Check variable importance
importance <- data.frame(
  Variable = names(tree_model$variable.importance),
  Importance = as.numeric(tree_model$variable.importance)
)

# Display neatly
kable(importance, caption = "Variable Importance in Decision Tree Model")
Variable Importance in Decision Tree Model

| Variable                       | Importance |
|:-------------------------------|-----------:|
| Weight                         | 92294.2396 |
| Height                         | 18200.5106 |
| Age                            | 16343.7744 |
| FCVC                           | 14226.0842 |
| Gender                         |  8125.2846 |
| CH2O                           |  7416.5660 |
| CAEC                           |  3835.4491 |
| family_history_with_overweight |  3096.4359 |
| FAF                            |  2365.9833 |
| NCP                            |  1446.6894 |
| FAVC                           |   316.7405 |

Understanding What Really Affects BMI Predictions

After building our BMI prediction model, we wanted to identify which variables had the most influence on the predictions. To do this, we extracted the variable importance scores from the decision tree model.

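In rpart, a variable's score is the sum of the improvement in fit from every split where it is the primary splitting variable, plus a discounted contribution from splits where it serves as a surrogate. This is why more variables can receive a score than are visible as splits in a depth-3 tree. Variables that account for more of the reduction in error receive higher scores.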

We then organized this information into a clear table with two columns:

- Variable: The name of each predictor (e.g., Height, Age, FCVC)
- Importance: A numeric score indicating how influential the variable was in the model

Using the kable() function, we displayed the table in a professional format and added a clear caption for context. This made it easy to compare which features had the most impact.

This analysis helps answer key questions such as:

  1. Which lifestyle or demographic factors most influence BMI?
  2. Are there specific variables we should focus on in future data collection?
  3. Do the most important predictors align with our expectations and domain knowledge?

Conclusion:

- Methods: Describe building a shallow Decision Tree with depth = 3, reason for balance between interpretability and prediction.
- Results: Report Training RMSE, Testing RMSE, include the Decision Tree plot (Figure 1), and Variable Importance (Table 1).
- Discussion: Discuss Decision Trees being simple to interpret but sometimes underfitting complex patterns.

---
title: "R Notebook"
output: html_notebook
---

This is an [R Markdown](http://rmarkdown.rstudio.com) Notebook. When you execute code within the notebook, the results appear beneath the code. 

--------

# Load the Obesity Dataset


# Project Overview: Predicting BMI from Lifestyle Factors

For our project, we chose the Obesity Levels dataset from the UCI Machine Learning Repository. This dataset combines real and synthetic data, capturing details about people’s eating habits, physical activity, and biological traits (like age, gender, and height). Our main goal is to predict Body Mass Index (BMI), a continuous variable that we derive from the recorded weight and height, using these lifestyle and health factors.

# Data Preparation & Initial Insights
Before diving into modeling, we cleaned and prepared the data by:

- Converting text-based (categorical) variables into factors, R's categorical type, for analysis.
- Ensuring no missing values were present that could skew results.
- Splitting the data into an 80% training set (to build models) and a 20% testing set (to evaluate their performance fairly).

# Key observations from early exploration:

- Variables include age, height, dietary patterns (e.g., frequent high-calorie food consumption), and activity levels (exercise frequency, screen time).
- BMI, computed as weight divided by height squared, is our target; it is analyzed as a continuous value, which suits regression techniques.
- The dataset was already quite clean, requiring minimal adjustments before modeling.

```{r}
# Load necessary libraries
library(tidyverse)
library(rpart)
library(rpart.plot)
library(caret)
library(knitr)  # for neat tables

# Read the dataset
obesity_data <- read.csv("~/Desktop/ObesityDataSet_raw_and_data_sinthetic.csv")

# View first few rows
head(obesity_data)

# Calculate BMI
obesity_data$BMI <- obesity_data$Weight / (obesity_data$Height^2)

# Remove NObeyesdad (target leakage)
obesity_data <- obesity_data %>% select(-NObeyesdad)
```
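
The `BMI` column computed in the chunk above follows the standard definition. The formula below assumes weight is recorded in kilograms and height in meters, which is what the first rows of the data suggest; treat the units as an assumption if your copy of the file differs.

$$
\mathrm{BMI} = \frac{\text{Weight (kg)}}{\left(\text{Height (m)}\right)^2}
$$

For example, a person weighing 64 kg with a height of 1.62 m has a BMI of $64 / 1.62^2 \approx 24.4$.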


# Data Preprocessing

```{r}
# Check variable types
str(obesity_data)

# Convert character variables to factors
obesity_data <- obesity_data %>% 
  mutate(across(where(is.character), as.factor))

# Remove any missing values
obesity_data <- na.omit(obesity_data)

# Split the data into training and testing sets (80/20 split)
set.seed(123)
train_index <- createDataPartition(obesity_data$BMI, p = 0.8, list = FALSE)
train_data <- obesity_data[train_index, ]
test_data <- obesity_data[-train_index, ]
```

First, we checked the structure of our dataset (obesity_data) to see what types of variables we were working with—like numbers, categories, or text.

Since some columns were stored as text (characters), we converted them into factors—a format R understands better for statistical modeling. This helps the models recognize categories (like "Male" or "Female") properly.
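
As a small illustration of this conversion, the sketch below applies the same `mutate(across(...))` call to a made-up data frame. The object `toy` and its values are hypothetical and not part of the project data.

```{r}
library(dplyr)

# Hypothetical toy data: one text column, one numeric column
toy <- data.frame(Gender = c("Female", "Male", "Female"),
                  Age = c(21, 23, 27),
                  stringsAsFactors = FALSE)

# Convert every character column to a factor; numeric columns are untouched
toy_fct <- toy %>% mutate(across(where(is.character), as.factor))

str(toy_fct)
levels(toy_fct$Gender)  # "Female" "Male" (levels are sorted alphabetically)
```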

Next, we made sure our data was clean and complete by removing any rows with missing values. This avoids errors or biased results later on.
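
Before dropping rows, it can help to see where missing values actually occur. The following is a minimal sketch on a hypothetical data frame (`toy_na` is made up for illustration); the same two calls could be run on `obesity_data`.

```{r}
# Hypothetical toy data with one missing value in each column
toy_na <- data.frame(a = c(1, NA, 3),
                     b = c("x", "y", NA))

# Number of missing values per column
colSums(is.na(toy_na))

# Rows remaining after removing incomplete rows (only the first row is complete)
nrow(na.omit(toy_na))  # 1
```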

Finally, we split the data into two parts:

- Training set (80%): Used to build and train our models.
- Testing set (20%): Reserved to check how well the models perform on unseen data.

We set a random seed (set.seed(123)) to ensure this split is reproducible, meaning anyone running the code gets the same training/testing groups for fair comparisons.
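
The reproducibility claim can be checked directly. The sketch below uses simulated data (`toy_split` and its columns are hypothetical stand-ins for `obesity_data`) and draws the partition twice with the same seed.

```{r}
library(caret)

# Simulated stand-in for the real data; y plays the role of BMI
set.seed(1)
toy_split <- data.frame(y = rnorm(200, mean = 27, sd = 5),
                        x = runif(200))

# Draw the same 80/20 partition twice, resetting the seed each time
set.seed(123)
idx_a <- createDataPartition(toy_split$y, p = 0.8, list = FALSE)
set.seed(123)
idx_b <- createDataPartition(toy_split$y, p = 0.8, list = FALSE)

identical(idx_a, idx_b)            # TRUE: same seed, same rows
length(idx_a) / nrow(toy_split)    # roughly 0.8
```

For a numeric outcome, `createDataPartition()` samples within quantile groups of the outcome, so the training share is close to, but not always exactly, the requested proportion.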


# Build a Shallow Decision Tree

```{r}
# Train a shallow decision tree to predict BMI
tree_model <- rpart(BMI ~ ., 
                    data = train_data, 
                    method = "anova", 
                    control = rpart.control(maxdepth = 3, cp = 0.01))

# Plot the decision tree
rpart.plot(tree_model,
           type = 4,
           extra = 101,
           fallen.leaves = TRUE,
           box.palette = "Blues",
           shadow.col = "gray")
```
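We started by training a basic Decision Tree model to predict Body Mass Index (BMI) using all other available variables in the dataset (excluding NObeyesdad, which was removed to prevent target leakage). Note that Weight and Height remain among the predictors even though BMI is computed directly from them, so the tree can recover much of BMI from those two columns; this should be kept in mind when interpreting the results.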

To keep the model intentionally simple and interpretable, we restricted its complexity in two ways:

- We limited the maximum depth to 3 levels to avoid overfitting and unnecessary complexity.
- We set the complexity parameter to cp = 0.01, which is also the rpart default, so a split is kept only if it improves the overall fit (R-squared) by at least 0.01.

Since BMI is a continuous variable, we used the 'anova' method in the rpart() function to indicate regression rather than classification.
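
To get a feel for how these controls limit tree size, the sketch below fits two regression trees to the built-in `mtcars` data, which is only a stand-in for the obesity data. The looser settings for the second tree (`maxdepth = 10`, `cp = 0.001`, `minsplit = 5`) are arbitrary values chosen for contrast.

```{r}
library(rpart)

# Same controls as our model, applied to a stand-in dataset
shallow <- rpart(mpg ~ ., data = mtcars, method = "anova",
                 control = rpart.control(maxdepth = 3, cp = 0.01))

# Looser controls (illustrative values) allow the tree to grow further
deeper <- rpart(mpg ~ ., data = mtcars, method = "anova",
                control = rpart.control(maxdepth = 10, cp = 0.001,
                                        minsplit = 5))

# Count terminal nodes (leaves) in each tree
n_shallow <- sum(shallow$frame$var == "<leaf>")
n_deeper  <- sum(deeper$frame$var == "<leaf>")
c(shallow = n_shallow, deeper = n_deeper)

# The cp table shows the error at each candidate tree size
printcp(shallow)
```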

For visualization, we created a clean and informative tree plot that shows:

- The hierarchical decision points (splits) based on the most influential predictors
- The predicted BMI values at each terminal node (leaf), along with the number and percentage of observations in the node (extra = 101)
- A color gradient (using a blue palette) to visually indicate different prediction ranges
- Subtle shadows to improve readability and distinguish branches

This restrained approach gives us a model that:

- Is easy to interpret and explain to non-technical stakeholders
- Is quick to train and computationally efficient
- Provides a solid baseline for comparison with more complex models such as XGBoost

The tree visualization acts as both a diagnostic tool (to assess if splits make logical sense) and a communication tool (to help explain how input factors influence BMI outcomes).


# Evaluate the Model

```{r}
# Predict on training and test data
train_pred <- predict(tree_model, newdata = train_data)
test_pred <- predict(tree_model, newdata = test_data)

# Calculate RMSE (Root Mean Squared Error)
train_rmse <- sqrt(mean((train_pred - train_data$BMI)^2))
test_rmse <- sqrt(mean((test_pred - test_data$BMI)^2))

# Print RMSE neatly
cat("Training RMSE:", round(train_rmse, 2), "\n")
cat("Testing RMSE:", round(test_rmse, 2), "\n")
```

We used our decision tree to predict BMI in two scenarios:

- For individuals in the training set (data the model had already seen)
- For individuals in the testing set (new, unseen data)

To evaluate how accurate the model’s predictions were, we calculated the Root Mean Squared Error (RMSE). RMSE gives us one number that summarizes the typical prediction error — smaller values indicate more accurate predictions.

The steps were as follows:

First, we generated predictions for both the training and testing sets.

Then, we calculated RMSE by:

1. Taking the difference between the predicted and actual BMI values
2. Squaring those differences
3. Averaging them
4. Taking the square root of that average
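
The four steps above can be traced on a tiny made-up example. The helper name `rmse` and the three predicted and actual values are hypothetical, chosen so the arithmetic is easy to follow by hand.

```{r}
# Hypothetical helper; the name `rmse` is ours, not from a package
rmse <- function(predicted, actual) {
  sqrt(mean((predicted - actual)^2))
}

predicted <- c(24, 27, 30)
actual    <- c(25, 25, 31)

# Differences: -1, 2, -1; squares: 1, 4, 1; mean: 2; square root: about 1.41
rmse(predicted, actual)
```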

Finally, we printed the Training RMSE and Testing RMSE:

- Training RMSE shows how well the model fits the data it was trained on.
- Testing RMSE shows how well the model performs on new data, which is a better indicator of real-world performance.

This evaluation helps us understand whether the model is overfitting, underfitting, or performing as expected, and gives us a sense of how accurate its BMI predictions are in practice.
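
One rough way to read the two numbers together is to look at the relative gap between them. The sketch below plugs in the values printed in our run of the evaluation chunk (2.51 and 2.65); the variable names are hypothetical, and the relative gap is only a rule-of-thumb check, not a formal test.

```{r}
# RMSE values as printed in our run of the evaluation chunk
train_rmse_reported <- 2.51
test_rmse_reported  <- 2.65

# Relative increase in error when moving from training to test data
relative_gap <- (test_rmse_reported - train_rmse_reported) / train_rmse_reported
round(100 * relative_gap, 1)  # about 5.6 (percent)
```

A gap of a few percent suggests the shallow tree is not overfitting badly; whether the error itself is acceptable is a separate question.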


# Analyze Important Features

```{r}
# Check variable importance
importance <- data.frame(
  Variable = names(tree_model$variable.importance),
  Importance = as.numeric(tree_model$variable.importance)
)

# Display neatly
kable(importance, caption = "Variable Importance in Decision Tree Model")
```

# Understanding What Really Affects BMI Predictions

After building our BMI prediction model, we wanted to identify which variables had the most influence on the predictions. To do this, we extracted the variable importance scores from the decision tree model.

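In rpart, a variable's score is the sum of the improvement in fit from every split where it is the primary splitting variable, plus a discounted contribution from splits where it serves as a surrogate. This is why more variables can receive a score than are visible as splits in a depth-3 tree. Variables that account for more of the reduction in error receive higher scores.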
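
Importance scores in rpart can also go to variables that never appear as a visible split, because surrogate splits contribute as well. A minimal sketch on the built-in `mtcars` data (a stand-in, not our dataset) compares the two sets of variables and rescales the scores to percentages; the object names are hypothetical.

```{r}
library(rpart)

fit <- rpart(mpg ~ ., data = mtcars, method = "anova",
             control = rpart.control(maxdepth = 3, cp = 0.01))

# Variables used as primary splits in the fitted tree
split_vars <- setdiff(unique(as.character(fit$frame$var)), "<leaf>")

# Variables that receive an importance score (primary or surrogate role)
scored_vars <- names(fit$variable.importance)

split_vars
scored_vars

# Scores as a percentage of the total, which is easier to compare
round(100 * fit$variable.importance / sum(fit$variable.importance), 1)
```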

We then organized this information into a clear table with two columns:

- Variable: The name of each predictor (e.g., Height, Age, FCVC)
- Importance: A numeric score indicating how influential the variable was in the model

Using the kable() function, we displayed the table in a professional format and added a clear caption for context. This made it easy to compare which features had the most impact.

This analysis helps answer key questions such as:

1. Which lifestyle or demographic factors most influence BMI?
2. Are there specific variables we should focus on in future data collection?
3. Do the most important predictors align with our expectations and domain knowledge?


# Conclusion:

- Methods: Describe building a shallow Decision Tree with depth = 3, reason for balance between interpretability and prediction.
- Results: Report Training RMSE, Testing RMSE, include the Decision Tree plot (Figure 1), and Variable Importance (Table 1).
- Discussion: Discuss Decision Trees being simple to interpret but sometimes underfitting complex patterns.


