Introduction

Problem Statement

Obesity presents a significant global challenge affecting health, economies, and societal well-being. The rise in chronic diseases like diabetes imposes significant demand on health care systems across the globe highlighting the pressing need to understand its multifaceted causes, including environmental, social, and behavioral factors.

Approach & Methodology

Utilizing R Studio, our approach encompasses data cleaning, exploratory analysis, and modeling. We begin by importing and cleaning the data set, preparing categorical variables and handling missing data. Exploratory data analysis (EDA) will be employed to visualize key patterns associated with obesity.

Proposed Analytical Techniques

RStudio is utilized for data manipulation and visualization with tidyverse and ggplot2. Descriptive statistics and exploratory data analysis techniques like scatter plots, histograms, and correlation matrices are employed to understand variable relationships. Various algorithms, including logistic regression, ensemble methods, decision trees, random forests, support vector machines, and neural networks, are tested for model building. Optimization techniques like grid search or random search are used to fine-tune algorithm parameters, and feature selection methods help identify key factors influencing obesity levels.

Potential Benefits for the Consumer

Accurate obesity rate predictions empower healthcare and wellness businesses to provide tailored interventions, leading to improved health outcomes and consumer satisfaction. Predictive algorithms enable health insurers to assess disease risks and customize coverage. Food and beverage companies can leverage findings to develop healthier products and targeted marketing strategies, benefiting both consumers and businesses.

Packages

library(tidyverse) The Tidyverse is a collection of R packages designed for data science, providing a consistent and efficient way to manipulate, visualize, and model data. library(dplyr) Dplyr is part of the Tidyverse and provides a set of functions for data manipulation tasks such as filtering rows, selecting columns, creating new variables, and summarizing data. . library(caret) The caret package (Classification And Regression Training) is a comprehensive toolkit for building predictive models in R. It facilitates tasks such as data preprocessing, feature selection, model training, tuning, and evaluation. library(random Forest) The random Forest package is an ensemble learning method for classification and regression. It builds multiple decision trees during training and combines their predictions to improve accuracy and reduce over fitting, making it suitable for large data sets with high-dimensional feature spaces. library(xgboost) XGBoost (eXtreme Gradient Boosting) is a scalable and efficient implementation of gradient boosting algorithms. It is known for its speed and performance, often outperforming other machine learning methods. library(ggplot2) ggplot2 is a data visualization package based on the Grammar of Graphics principles. It allows users to create a wide range of visualizations, from simple scatter plots to complex multi-layered plots, by mapping variables to aesthetic attributes such as color, shape, size, and position.

Data Preparation

Data Source

The original dataset was obtained from https://www.kaggle.com/datasets/fatemehmehrparvar/obesity-levels/data

Background

This dataset comprises 2111 records from Mexico, Peru, and Colombia, focusing on obesity levels estimation based on eating habits and physical condition. With 17 attributes, including NObesity classification, it offers insights into the relationship between lifestyle factors and obesity across diverse populations.

Data Importing and Cleaning

Importing data into R usually involves using functions like read.csv() to load the dataset from its source. Cleaning steps address missing values, outliers, and incorrect data entries to ensure the data is analysis-ready. This can include standardizing or normalizing variables, managing categorical variables, and resolving any other data inconsistencies.

Final Cleaned Dataset

# Load required library
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.3
# Create a data frame with your variable descriptions
variable_descriptions <- data.frame(
  Variable = c("Gender", "Age", "Height", "Weight", "Family History", "FAVC", "FCVC", "NCP", "CAEC", "CH20", "CALC", "SCC", "FAF", "TUE", "MTRANS", "NObesity"),
  Description = c("Female/Male", "Numeric value representing age", "Numeric value representing height in meters", "Numeric value representing weight in kilograms", "Presence of family history with overweight", "Frequent consumption of high caloric food", "Frequency of consumption of vegetables", "Number of main meals", "Consumption of food between meals", "Consumption of water daily", "Consumption of alcohol", "Calories consumption monitoring", "Physical activity frequency", "Time using technology devices", "Transportation used", "Obesity level (target variable)"),
  Data_Type = c("Categorical", "Numeric", "Numeric", "Numeric", "Categorical", "Categorical", "Numeric", "Numeric", "Categorical", "Numeric", "Categorical", "Categorical", "Numeric", "Numeric", "Categorical", "Categorical")
)

# Print the table using kable
kable(variable_descriptions, caption = "Variable Descriptions")
Variable Descriptions
Variable Description Data_Type
Gender Female/Male Categorical
Age Numeric value representing age Numeric
Height Numeric value representing height in meters Numeric
Weight Numeric value representing weight in kilograms Numeric
Family History Presence of family history with overweight Categorical
FAVC Frequent consumption of high caloric food Categorical
FCVC Frequency of consumption of vegetables Numeric
NCP Number of main meals Numeric
CAEC Consumption of food between meals Categorical
CH20 Consumption of water daily Numeric
CALC Consumption of alcohol Categorical
SCC Calories consumption monitoring Categorical
FAF Physical activity frequency Numeric
TUE Time using technology devices Numeric
MTRANS Transportation used Categorical
NObesity Obesity level (target variable) Categorical

Key Variables

The data set contains 17 attributes including basic personal information such as gender, age, height, weight; Attributes connected with eating disorders: family_history_with_overweight - (Yes/No), FAVC - Frequent consumption of high caloric food (Yes/No), FCVC - Frequency of consumption of vegetables (Never/Sometimes/Always), NCP - Number of main meals, CAEC - Consumption of food between meals (No/Sometimes/Frequently/Always), SMOKE - Consumption of cigarettes (Yes/No), CH20 - Consumption of water daily (Less than a liter/Between 1-2L/More), CALC - Consumption of alcohol (I do not drink/Sometimes/Frequently/Always), Attributes related with the physical condition which are: SCC - Calories consumption monitoring (Yes/No), FAF - Physical activity frequency (I do not have/1 or 2 days/2 or 4 days/4 or 5 days), TUE - Time using technology devices (0–2 hours/3–5 hours/More than 5 hours), MTRANS - Transportation used (Automobile/Motorbike/Bike/Public Transportation/Walking). *NObeyesdad - Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II and Obesity Type III

Exploratory Data Analysis

Loading Libraries and Dataset

# Load required libraries
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.3
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'tidyr' was built under R version 4.3.3
## Warning: package 'readr' was built under R version 4.3.3
## Warning: package 'purrr' was built under R version 4.3.3
## Warning: package 'dplyr' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.3
## Warning: package 'forcats' was built under R version 4.3.3
## Warning: package 'lubridate' was built under R version 4.3.3
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.0     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(caret)
## Warning: package 'caret' was built under R version 4.3.3
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.3.3
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine
## 
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(xgboost)
## Warning: package 'xgboost' was built under R version 4.3.3
## 
## Attaching package: 'xgboost'
## 
## The following object is masked from 'package:dplyr':
## 
##     slice
# Read the data set
ObesityDataSet <- read.csv("C:/data/ObesityDataSet.csv")

Data Summary and Structure

# summary of the datasets
summary(ObesityDataSet)
##       Age           Gender              Height          Weight      
##  Min.   :14.00   Length:2111        Min.   :1.450   Min.   : 39.00  
##  1st Qu.:19.95   Class :character   1st Qu.:1.630   1st Qu.: 65.47  
##  Median :22.78   Mode  :character   Median :1.700   Median : 83.00  
##  Mean   :24.31                      Mean   :1.702   Mean   : 86.59  
##  3rd Qu.:26.00                      3rd Qu.:1.768   3rd Qu.:107.43  
##  Max.   :61.00                      Max.   :1.980   Max.   :173.00  
##      CALC               FAVC                FCVC            NCP       
##  Length:2111        Length:2111        Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:2.000   1st Qu.:2.659  
##  Mode  :character   Mode  :character   Median :2.386   Median :3.000  
##                                        Mean   :2.419   Mean   :2.686  
##                                        3rd Qu.:3.000   3rd Qu.:3.000  
##                                        Max.   :3.000   Max.   :4.000  
##      SCC               SMOKE                CH2O      
##  Length:2111        Length:2111        Min.   :1.000  
##  Class :character   Class :character   1st Qu.:1.585  
##  Mode  :character   Mode  :character   Median :2.000  
##                                        Mean   :2.008  
##                                        3rd Qu.:2.477  
##                                        Max.   :3.000  
##  family_history_with_overweight      FAF              TUE        
##  Length:2111                    Min.   :0.0000   Min.   :0.0000  
##  Class :character               1st Qu.:0.1245   1st Qu.:0.0000  
##  Mode  :character               Median :1.0000   Median :0.6253  
##                                 Mean   :1.0103   Mean   :0.6579  
##                                 3rd Qu.:1.6667   3rd Qu.:1.0000  
##                                 Max.   :3.0000   Max.   :2.0000  
##      CAEC              MTRANS           NObeyesdad       
##  Length:2111        Length:2111        Length:2111       
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
## 
# Checking missing values
sum(is.na(ObesityDataSet))
## [1] 0
# Convert character variables to factor
ObesityDataSet <- ObesityDataSet %>%
  mutate_if(is.character, as.factor)

# Check the structure of the dataframe
str(ObesityDataSet)
## 'data.frame':    2111 obs. of  17 variables:
##  $ Age                           : num  21 21 23 27 22 29 23 22 24 22 ...
##  $ Gender                        : Factor w/ 2 levels "Female","Male": 1 1 2 2 2 2 1 2 2 2 ...
##  $ Height                        : num  1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
##  $ Weight                        : num  64 56 77 87 89.8 53 55 53 64 68 ...
##  $ CALC                          : Factor w/ 4 levels "Always","Frequently",..: 3 4 2 2 4 4 4 4 2 3 ...
##  $ FAVC                          : Factor w/ 2 levels "no","yes": 1 1 1 1 1 2 2 1 2 2 ...
##  $ FCVC                          : num  2 3 2 3 2 2 3 2 3 2 ...
##  $ NCP                           : num  3 3 3 3 1 3 3 3 3 3 ...
##  $ SCC                           : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ SMOKE                         : Factor w/ 2 levels "no","yes": 1 2 1 1 1 1 1 1 1 1 ...
##  $ CH2O                          : num  2 3 2 2 2 2 2 2 2 2 ...
##  $ family_history_with_overweight: Factor w/ 2 levels "no","yes": 2 2 2 1 1 1 2 1 2 2 ...
##  $ FAF                           : num  0 3 2 2 0 0 1 3 1 1 ...
##  $ TUE                           : num  1 0 1 0 0 0 0 0 1 1 ...
##  $ CAEC                          : Factor w/ 4 levels "Always","Frequently",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ MTRANS                        : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 5 4 1 3 4 4 4 ...
##  $ NObeyesdad                    : Factor w/ 7 levels "Insufficient_Weight",..: 2 2 2 6 7 2 2 2 2 2 ...

Visualization

# Histogram of Age
ggplot(ObesityDataSet, aes(Age)) +
  geom_histogram() +
  labs(title = "Histogram of Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# Barplot of Gender
ggplot(ObesityDataSet, aes(Gender)) +
  geom_bar() +
  labs(title = "Barplot of Gender")

# Barplot of CALC
ggplot(ObesityDataSet, aes(CALC)) +
  geom_bar() +
  labs(title = "Barplot of CALC")

# Barplot of FAVC
ggplot(ObesityDataSet, aes(FAVC)) +
  geom_bar() +
  labs(title = "Barplot of FAVC")

# Barplot of SMOKE
ggplot(ObesityDataSet, aes(SMOKE)) +
  geom_bar() +
  labs(title = "Barplot of SMOKE")

# Scatter plot of weight and height
ggplot(ObesityDataSet, aes(Height, Weight)) +
  geom_point() +
  labs(title = "Scatter plot of weight and height")

# Stacked barplot
ggplot(ObesityDataSet, aes(x = family_history_with_overweight, fill = NObeyesdad)) +
  geom_bar(position = "stack") +
  labs(x = "Family History with Overweight", y = "Count", fill = "Obesity Category") +
  ggtitle("Stacked Barplot of Family History with Overweight and Obesity Category")

# Boxplot of Weight by Gender
ggplot(ObesityDataSet, aes(x = Gender, y = Weight, fill = Gender)) +
  geom_boxplot() +
  labs(x = "Gender", y = "Weight", fill = "Gender") +
  ggtitle("Boxplot of Weight by Gender")

Model Training and Evaluation

Splitting Data and Model Fitting

# Set seed for reproducibility
set.seed(123)

# Define response variable
response_var <- "FAVC"

# Split data into training and testing sets
train_index <- createDataPartition(ObesityDataSet[[response_var]], p = 0.7, list = FALSE)
train_data <- ObesityDataSet[train_index, ]
test_data <- ObesityDataSet[-train_index, ]

# Fit logistic regression model
logit_model <- glm(FAVC ~ ., data = train_data, family = binomial)
summary(logit_model)
## 
## Call:
## glm(formula = FAVC ~ ., family = binomial, data = train_data)
## 
## Coefficients:
##                                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                         7.345602 535.422599   0.014 0.989054    
## Age                                -0.004767   0.020850  -0.229 0.819151    
## GenderMale                         -0.479272   0.295396  -1.622 0.104703    
## Height                              5.397748   2.573099   2.098 0.035926 *  
## Weight                              0.032456   0.027145   1.196 0.231837    
## CALCFrequently                    -13.857197 535.411594  -0.026 0.979352    
## CALCno                            -13.117439 535.411474  -0.024 0.980454    
## CALCSometimes                     -12.436985 535.411480  -0.023 0.981468    
## FCVC                               -0.228156   0.197647  -1.154 0.248354    
## NCP                                -0.268257   0.121488  -2.208 0.027238 *  
## SCCyes                             -1.397572   0.338340  -4.131 3.62e-05 ***
## SMOKEyes                           -0.118772   0.613916  -0.193 0.846594    
## CH2O                               -0.303360   0.171941  -1.764 0.077677 .  
## family_history_with_overweightyes   0.801893   0.251458   3.189 0.001428 ** 
## FAF                                -0.260007   0.115665  -2.248 0.024581 *  
## TUE                                 0.659906   0.176689   3.735 0.000188 ***
## CAECFrequently                     -1.386684   0.592984  -2.338 0.019362 *  
## CAECno                             -0.924202   0.770336  -1.200 0.230241    
## CAECSometimes                      -0.701944   0.568630  -1.234 0.217036    
## MTRANSBike                         -5.193088   1.309253  -3.966 7.30e-05 ***
## MTRANSMotorbike                    -1.555542   0.929984  -1.673 0.094395 .  
## MTRANSPublic_Transportation        -0.732667   0.345908  -2.118 0.034167 *  
## MTRANSWalking                      -2.821840   0.518952  -5.438 5.40e-08 ***
## NObeyesdadNormal_Weight            -0.807861   0.465625  -1.735 0.082740 .  
## NObeyesdadObesity_Type_I           -0.660527   1.210003  -0.546 0.585143    
## NObeyesdadObesity_Type_II          -0.861317   1.661751  -0.518 0.604236    
## NObeyesdadObesity_Type_III          0.417563   2.074984   0.201 0.840513    
## NObeyesdadOverweight_Level_I        0.195535   0.773277   0.253 0.800372    
## NObeyesdadOverweight_Level_II      -2.554522   0.904758  -2.823 0.004751 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1063.33  on 1478  degrees of freedom
## Residual deviance:  722.83  on 1450  degrees of freedom
## AIC: 780.83
## 
## Number of Fisher Scoring iterations: 12
## Model Evaluation using logistic regression
# Predictions on test set using logistic regression
logit_pred <- predict(logit_model, newdata = test_data, type = "response")
logit_pred_class <- ifelse(logit_pred > 0.5, "yes", "no")

# Calculate accuracy of logistic regression model
logit_accuracy <- mean(logit_pred_class == test_data$FAVC)
print(paste("Logistic Regression Accuracy:", logit_accuracy))
## [1] "Logistic Regression Accuracy: 0.897151898734177"
# Fit random forest model
rf_model <- randomForest(FAVC ~ ., data = train_data)

importance(rf_model)
##                                MeanDecreaseGini
## Age                                   36.110942
## Gender                                 4.506788
## Height                                38.946930
## Weight                                43.530170
## CALC                                  12.425862
## FCVC                                  13.404975
## NCP                                   13.496078
## SCC                                    8.491194
## SMOKE                                  1.671654
## CH2O                                  17.047683
## family_history_with_overweight         8.637269
## FAF                                   20.353754
## TUE                                   15.672244
## CAEC                                  11.001413
## MTRANS                                17.104498
## NObeyesdad                            36.389201
# Predictions on test set using random forest
rf_pred <- predict(rf_model, newdata = test_data)

# Calculate accuracy of random forest model
rf_accuracy <- mean(rf_pred == test_data$FAVC)
print(paste("Random Forest Accuracy:", rf_accuracy))
## [1] "Random Forest Accuracy: 0.920886075949367"
test_data.num<-test_data %>%
  mutate_if(is.factor, as.numeric)

# Fit gradient boosting model (xgboost)
xgb_model <- xgboost(data = as.matrix(test_data.num[, -6]), 
                     label = as.numeric(test_data.num[[response_var]]) - 1, 
                     nrounds = 10, 
                     objective = "binary:logistic")
## [1]  train-logloss:0.489278 
## [2]  train-logloss:0.374235 
## [3]  train-logloss:0.297478 
## [4]  train-logloss:0.245899 
## [5]  train-logloss:0.206920 
## [6]  train-logloss:0.179137 
## [7]  train-logloss:0.155424 
## [8]  train-logloss:0.136599 
## [9]  train-logloss:0.118675 
## [10] train-logloss:0.107959
summary(xgb_model)
##                Length Class              Mode       
## handle             1  xgb.Booster.handle externalptr
## raw            24106  -none-             raw        
## niter              1  -none-             numeric    
## evaluation_log     2  data.table         list       
## call              14  -none-             call       
## params             2  -none-             list       
## callbacks          2  -none-             list       
## feature_names     16  -none-             character  
## nfeatures          1  -none-             numeric
# Predictions on test set using gradient boosting
xgb_pred <- predict(xgb_model, as.matrix(test_data.num[, -6]))
xgb_pred_class <- ifelse(xgb_pred > 0.5, "yes", "no")

# Calculate accuracy of gradient boosting model
xgb_accuracy <- mean(xgb_pred_class == test_data$FAVC)
print(paste("Gradient Boosting Accuracy:", xgb_accuracy))
## [1] "Gradient Boosting Accuracy: 0.979430379746835"

#Key Insights: In this analysis, we uncovered key insights into the factors influencing obesity levels, particularly focusing on predicting frequent consumption of high-caloric food (FAVC). Through data mining techniques, we identified significant predictors such as age, gender, dietary habits, and physical activity frequency. Our models, including logistic regression, random forest, and XGBoost, demonstrated strong predictive accuracy, with XGBoost leading at 98.0%. These findings highlight the potential for targeted healthcare interventions, personalized insurance coverage, and informed product development strategies to combat obesity and promote healthier lifestyles.

#Conclusion: To conclude, based on the brief explanation of the obesity level that has been done through the R software program using the data set of obesity to provide accurate predictions of obesity prevalence relying on data mining techniques for a diverse set of indicators. The libraries used are Tidyverse, Dplyr, Caret, RandomForest, and Xgboost. The obesity data is taken from Kaggle with the dataset containing information related to obesity levels and various attributes related to individuals, such as gender, age, height, weight, family history with overweight, dietary habits, physical activity frequency, time using technology devices, transportation used, etc. Through exploratory data analysis and modeling, we gained valuable insights into the drivers of obesity and developed predictive models to aid in decision-making processes for healthcare and wellness businesses. Model Performance: The logistic regression model achieved a testing accuracy of 89.7%, the random forest model achieved 92.0% accuracy, and the XGBoost model outperformed both with an accuracy of 98.0%. This indicates the effectiveness of machine learning algorithms in predicting obesity levels.

#Implications Healthcare Intervention: Accurate predictions of obesity prevalence can enable targeted interventions and personalized health services delivery, leading to better health outcomes for individuals. Insurance Coverage: Health insurers can leverage predictive algorithms to assess the risk of obesity-related diseases and adjust insurance coverage accordingly, optimizing resource allocation. Product Development: Food and beverage companies can utilize insights from the analysis to formulate healthier products and tailor marketing strategies to promote healthy eating habits.

#Future Directions External Data Integration: Further analysis could involve integrating external data sources such as demographic information or regional health statistics to explore additional factors influencing obesity levels. Model Optimization: Continual refinement and optimization of predictive models using advanced techniques like hyperparameter tuning and feature selection can enhance accuracy and robustness.