Obesity has plagued humans for many generations, and since 1975 its prevalence has almost doubled, turning it into a global epidemic. The current human dependence on technology has aggravated the problem, with the effects most visible in late teenagers and early adults. To date, researchers have tried numerous ways to determine the factors that cause obesity in early adults.
Mexico, Peru, and Colombia face the same health challenge. In these countries, overweight and obesity are major risk factors for non-communicable diseases such as cardiovascular disease (mainly heart disease and stroke), which is the leading cause of death in the region (30% of deaths from all causes), as well as diabetes, hypertension, and chronic kidney disease. How overweight and obesity can be prevented through healthier food and regular physical activity has become a major concern for healthcare professionals, public health organizations, and individuals concerned about obesity and its consequences.
The objective of this project is to build and evaluate several predictive models that predict whether an individual is obese (obesity or not) based on their eating habits and physical condition, and to identify the relative importance of the attributes that contribute to obesity in Mexico, Peru, and Colombia.
To address this problem, the main dataset will be gathered from UC Irvine’s Machine Learning Repository that includes data for the estimation of obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. The data was originally collected using a web platform with a survey where anonymous users answered each question, then the information was processed obtaining 17 attributes and 2111 records. The data contains numeric data and categorical data.
The attributes related to eating habits are: frequent consumption of high-caloric food (FAVC), frequency of consumption of vegetables (FCVC), number of main meals (NCP), consumption of food between meals (CAEC), daily consumption of water (CH2O), and consumption of alcohol (CALC). The attributes related to physical condition are: calorie consumption monitoring (SCC), physical activity frequency (FAF), time using technology devices (TUE), and transportation used (MTRANS). The other variables obtained were Gender, Age, Height, and Weight. The variable NObeyesdad was created with the values Insufficient Weight, Normal Weight, Overweight Level I, Overweight Level II, Obesity Type I, Obesity Type II, and Obesity Type III, based on the body mass index (BMI = Weight/(Height*Height)) and information from the WHO and the Mexican Normativity.
In the data preparation phase, Tableau Prep will be used to preprocess the original dataset: handling missing values, rounding the decimal values of categorical variables, and ensuring overall data quality. For data visualization, Tableau will be used to create charts and dashboards that give insight into the distribution of variables, the correlation between variables, and any patterns and trends. For predictive modelling and statistical analysis, RStudio will be used to build the models and evaluate their predictions. Predictive modelling techniques such as logistic regression, random forest, boosting, and the generalized additive model will be employed.
During data preparation, some of the variables will be renamed so that they are easier for the target audience to understand. The target variable NObeyesdad will be reduced to two levels (Yes/No) so that the models predict obesity or not.
By creating graphical displays of the dataset (such as bar charts and histograms), we will visualize any potential relationships between obesity and the other variables.
Before building predictive models, the dataset will be randomly split into two parts, a training dataset and a testing dataset. The training dataset will be used for building the models, and the testing dataset for testing them and comparing their performance. Several predictive models, such as logistic regression, random forest, boosting, and the generalized additive model, will be built on the same training dataset.
After developing the models, we will examine their predictive capabilities by calculating a confusion matrix and the corresponding misclassification error rate for each of them. The best model will then be selected by comparing these results. Based on the boosting model, the importance of the variables that influence obesity can also be identified.
The stakeholders for this analysis may include healthcare professionals, policymakers, public health organizations, and individuals concerned about obesity and its consequences. However, the primary audience is healthcare professionals and individuals.
With an effective predictive model, healthcare professionals will be able to offer more effective guidance and personalized healthcare strategies to each individual. Individuals can also assess their own obesity risk and, based on the prediction for their particular conditions, reduce it by adjusting their diet and behaviors.
By identifying the relative importance of the variables that contribute to obesity, our analysis will provide valuable insights for designing targeted health promotion programs and evaluating the effectiveness of interventions.
Ultimately, the goal is to empower individuals to make healthier lifestyle choices and reduce the burden of obesity-related diseases in these countries.
This original dataset was downloaded from UC Irvine’s Machine Learning Repository, containing 17 attributes and 2111 records.
The data dictionary for the original dataset is as follows:
In the NObeyesdad column, the data was labeled in two steps, as follows:
The body mass index (BMI) was calculated for each individual using the equation \(\text{BMI} = \text{Weight} / \text{Height}^2\), with weight in kilograms and height in meters.
The results were compared with the data provided by WHO and the Mexican Normativity: Underweight for Less than 18.5, Normal for 18.5 to 24.9, Overweight for 25.0 to 29.9, Obesity I for 30.0 to 34.9, Obesity II for 35.0 to 39.9, Obesity III for Higher than 40.
For more information, please read the article “Dataset for estimation of obesity levels based on eating habits and physical condition in individuals from Colombia, Peru and Mexico”.
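The two labeling steps above can be sketched in R as follows. This is only an illustration under stated assumptions: `bmi_category()` is a hypothetical helper, the bands are treated as half-open intervals (so 25.0 starts the Overweight band), and the input values are made up. The thresholds quoted above do not say how Overweight is split into Level I and Level II, so the sketch does not reproduce that split.

```r
# Hypothetical helper: assign a BMI band using the thresholds quoted above.
bmi_category <- function(weight_kg, height_m) {
  bmi <- weight_kg / height_m^2
  cut(
    bmi,
    breaks = c(-Inf, 18.5, 25, 30, 35, 40, Inf),
    labels = c("Underweight", "Normal", "Overweight",
               "Obesity I", "Obesity II", "Obesity III"),
    right = FALSE  # intervals are [lower, upper), so a BMI of 18.5 is "Normal"
  )
}

# Made-up weights (kg) and heights (m), for illustration only
bmi_category(weight_kg = c(50, 70, 95), height_m = c(1.80, 1.75, 1.70))
```

Using `cut()` with explicit breaks keeps the thresholds in one place, which makes it easy to adjust the bands if a different reference table is used.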
After the dataset was imported into Tableau Prep, the following steps were carried out to clean the data and make it ready for analysis. First, I renamed the columns that had abbreviated names to make them easier to understand. Then I checked whether any of the columns contained missing data. Last, I transformed the target variable “Obesity” as well as some other columns that contained decimal data.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/Screenshot_2.3.png")
To make the variable names easier to understand when analyzing and presenting the data, the attributes with abbreviated names were renamed as follows:
- family_w_Obesity (family history of obesity) was renamed as Family_w_Overweight.
- FAVC (frequent consumption of high caloric food) was renamed as High_Caloric_Food.
- FCVC (frequency of consumption of vegetables) was renamed as Vegetable.
- NCP (number of main meals) was renamed as Main_Meals.
- CAEC (consumption of food between meals) was renamed as Food_btw_Meals.
- CH2O (consumption of water) was renamed as Water.
- SCC (calorie consumption monitoring) was renamed as Calories_Monitor.
- FAF (physical activity frequency per week) was renamed as Physical_Activity.
- TUE (time using technology devices a day) was renamed as El_Device_Time.
- CALC (consumption of alcohol) was renamed as Alcohol.
- MTRANS (the ways of transportation) was renamed as Transportation.
- NObeyesdad (the categories referring to the body mass index) was renamed as Obesity.
Fortunately, no missing (null, n/a) data were found in any column of this dataset. Tableau Prep makes it quick to check whether a column has missing values and how its values are distributed.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/Screenshot_2.3.2.png")
The target variable Obesity was reclassified by grouping the “insufficient weight, normal weight, level I overweight, level II overweight” categories as 0 (no obesity) and the “type I obesity, type II obesity, type III obesity” categories as 1 (obesity), in line with the project objective of predicting obesity or not.
The data type of the Age column was changed to integer because some rows contained decimal values that were produced when the original dataset was created.
Five further columns were also converted to integer, because they contained decimal values although the data dictionary defines them as integer-coded categories: Vegetable, Main_Meals, Water, Physical_Activity, and El_Device_Time.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/Screenshot_2.3.3.png")
Overall, the whole cleaning process has been well documented in Tableau Prep.
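For readers who prefer to stay in R, the same kind of cleaning could be approximated as in the sketch below. It does not reproduce the Tableau Prep flow used in this project: the toy rows are made up, the raw class labels in `obese_levels` are assumed spellings, and `round()` is only one possible way of converting decimals to integers.

```r
# Toy rows standing in for the raw file; all values are made up.
raw <- data.frame(
  Age = c(21.0, 22.9, 30.4),
  FCVC = c(2, 2.6, 3),
  NObeyesdad = c("Normal_Weight", "Overweight_Level_II", "Obesity_Type_I"),
  stringsAsFactors = FALSE
)

# Assumed raw labels of the three obesity classes
obese_levels <- c("Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

clean <- raw
# 1 = obesity, 0 = no obesity
clean$Obesity <- as.integer(clean$NObeyesdad %in% obese_levels)
# Convert decimal values to integers by rounding
clean$Age <- as.integer(round(clean$Age))
clean$FCVC <- as.integer(round(clean$FCVC))
clean$NObeyesdad <- NULL

# Quick check for missing values in every column
colSums(is.na(clean))
clean
```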
Here is a glimpse of the final dataset in R after the cleaned data was exported from Tableau Prep and imported into RStudio.
# Load the final dataset into R
obesity <- read.csv("C:/YL Uindy/624_Capstone/Obesity_Dataset/Obs_Prep_Output.csv")
str(obesity)
## 'data.frame': 2111 obs. of 17 variables:
## $ Gender : chr "Female" "Female" "Male" "Male" ...
## $ Age : int 21 21 23 27 22 29 23 22 24 22 ...
## $ Height : num 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
## $ Weight : num 64 56 77 87 89.8 53 55 53 64 68 ...
## $ Family_w_Overweight: chr "yes" "yes" "yes" "no" ...
## $ High_Caloric_Food : chr "no" "no" "no" "no" ...
## $ Vegetable : int 2 3 2 3 2 2 3 2 3 2 ...
## $ Main_Meals : int 3 3 3 3 1 3 3 3 3 3 ...
## $ Food_btw_Meals : chr "Sometimes" "Sometimes" "Sometimes" "Sometimes" ...
## $ Smoke : chr "no" "yes" "no" "no" ...
## $ Water : int 2 3 2 2 2 2 2 2 2 2 ...
## $ Calories_Monitor : chr "no" "yes" "no" "no" ...
## $ Physical_Activity : int 0 3 2 2 0 0 1 3 1 1 ...
## $ El_Device_Time : int 1 0 1 0 0 0 0 0 1 1 ...
## $ Alcohol : chr "no" "Sometimes" "Frequently" "Frequently" ...
## $ Transportation : chr "Public_Transportation" "Public_Transportation" "Public_Transportation" "Walking" ...
## $ Obesity : int 0 0 0 0 0 0 0 0 0 0 ...
head(obesity)
The following R packages are used to develop predictive models to predict obesity.
In this section, Tableau is the main tool used to conduct exploratory data analysis and to create various types of visualizations that explore the relationships between the target variable “Obesity” and other key variables.
An easy and quick way to get descriptive statistics for the numerical variables in the dataset is to use the function summary() in R.
summary(obesity)
## Gender Age Height Weight
## Length:2111 Min. :14.00 Min. :1.450 Min. : 39.00
## Class :character 1st Qu.:19.00 1st Qu.:1.630 1st Qu.: 65.47
## Mode :character Median :22.00 Median :1.700 Median : 83.00
## Mean :23.97 Mean :1.702 Mean : 86.59
## 3rd Qu.:26.00 3rd Qu.:1.768 3rd Qu.:107.43
## Max. :61.00 Max. :1.980 Max. :173.00
## Family_w_Overweight High_Caloric_Food Vegetable Main_Meals
## Length:2111 Length:2111 Min. :1.000 Min. :1.000
## Class :character Class :character 1st Qu.:2.000 1st Qu.:2.000
## Mode :character Mode :character Median :2.000 Median :3.000
## Mean :2.213 Mean :2.523
## 3rd Qu.:3.000 3rd Qu.:3.000
## Max. :3.000 Max. :4.000
## Food_btw_Meals Smoke Water Calories_Monitor
## Length:2111 Length:2111 Min. :1.000 Length:2111
## Class :character Class :character 1st Qu.:1.000 Class :character
## Mode :character Mode :character Median :2.000 Mode :character
## Mean :1.712
## 3rd Qu.:2.000
## Max. :3.000
## Physical_Activity El_Device_Time Alcohol Transportation
## Min. :0.0000 Min. :0.0000 Length:2111 Length:2111
## 1st Qu.:0.0000 1st Qu.:0.0000 Class :character Class :character
## Median :1.0000 Median :0.0000 Mode :character Mode :character
## Mean :0.7347 Mean :0.3813
## 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :3.0000 Max. :2.0000
## Obesity
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4604
## 3rd Qu.:1.0000
## Max. :1.0000
The summary above gives basic statistics such as the mean, median, minimum, maximum, and quartiles for the numerical variables. For example, the ages of respondents range from 14 to 61 with an average of 23.97, and their heights range from 1.45 m to 1.98 m.
Next, the dataset was imported into Tableau for more in-depth analyses.
First, let’s look at how the target variable “Obesity” is distributed in the dataset. The bar chart below shows that 46% of respondents are classified as obese, which means that both classes are well represented in the sample.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.1_Obesity_bar.png")
Which age and gender groups were the data collected from? Is any group underrepresented or abnormally distributed? To answer these questions, I created bar, histogram, and boxplot charts.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.2_Gender.png")
From the bar chart above, we can see that the numbers of male and female respondents in the dataset are quite close, which shows that there is enough data on each gender for proper modelling.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.2_Age_hist.png")
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.2_Age_boxplot.png")
However, the histogram shows that most of the data came from people between 14 and 42 years old and that the data for people above 42 are very limited. Additionally, the boxplot for Age shows that the observations for people above 36 years old lie beyond the upper whisker, which suggests that the models to be developed might perform worse for people above 36.
What we eat and drink has a big impact on our body weight, so one of my main concerns in this project is which eating habits are significantly related to obesity. In this section, I created one bar chart to look at the relationship between vegetable consumption and obesity, and another to look at the relationship between alcohol consumption and obesity.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.3_Veg.png")
From the chart above, we can see that the group Vegetable - 3 (always) has a higher ratio of obesity (57.7%) than the other two groups (40% or so). So, this variable is probably a significant predictor of obesity.
Next, let us look at the chart below to see whether there is any association between alcohol consumption and obesity.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.3_Alcohol.png")
From this chart, we can see that the groups Alcohol - no and Alcohol - Frequently have a clearly lower ratio of obesity than the group Alcohol - Sometimes. So, how frequently people drink alcohol is probably also a significant predictor of whether they become obese.
The other key question is which physical conditions of individuals are significantly related to becoming obese. So, I further explored the variables Physical_Activity and Transportation to see whether they are clearly related to obesity.
I first created a bar chart to check whether the ratios of obesity differ among the four categories of physical activity.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.4_Physical.png")
From the chart above, we can clearly see that the categories Physical - 2 (2-4 days per week) and Physical - 3 (4-5 days per week) have a much lower obesity ratio than the categories Physical - 0 (none per week) and Physical - 1 (1-2 days per week). This suggests that exercising on more than two days per week is associated with a considerably lower likelihood of obesity.
Next, let us look at the relationship between transportation choices and obesity.
knitr::include_graphics("C:/YL Uindy/624_Capstone/Obesity_Report/3.3.4_Transportation.png")
From the chart above, we can see that the Motorbike, Bike, and Walking categories have a much lower ratio of obesity (less than 30%) than the Public_Transportation and Automobile categories. So, the mode of transportation that people choose may indicate their likelihood of being obese.
Through the exploratory data analysis, I have seen some clear relationships between the target Obesity and certain categories of the predictor variables, and I have also formed some hypotheses about which variables may or may not be significant in predicting obesity.
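The obesity ratios read off the Tableau charts can also be computed directly in R. The sketch below shows one way to do so with `tapply()`; the data frame `toy` is made up for illustration and does not contain the project data.

```r
# Made-up example data: a grouping variable and a 0/1 obesity indicator
toy <- data.frame(
  Physical_Activity = c(0, 0, 0, 1, 1, 2, 2, 3),
  Obesity           = c(1, 1, 0, 1, 0, 0, 0, 0)
)

# The mean of a 0/1 variable within each group is the share of obese respondents
rate_by_group <- tapply(toy$Obesity, toy$Physical_Activity, mean)
round(100 * rate_by_group, 1)

# The same comparison as a row-wise percentage table
round(100 * prop.table(table(toy$Physical_Activity, toy$Obesity), margin = 1), 1)
```

Applied to the cleaned dataset, the same two lines would take columns such as Vegetable, Alcohol, or Transportation as the grouping variable.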
In this section, three predictive models are developed: a logistic regression model, a generalized additive model, and a boosting model. I did not use the Weight column because weight directly determines the obesity level through the BMI formula explained in Section 2.2 (data understanding), and it behaves more like the target variable, being a consequence of individuals’ eating habits and physical activities. I therefore removed this column.
# remove the column of Weight
obesity1 <- obesity[,-4]
Next, I split the dataset into two parts: 70% of the data for the training set and 30% for the testing set. The training dataset is used for developing the predictive models, and the testing dataset for verifying and comparing their prediction performance.
# split the dataset into training and testing sets
set.seed(123)
sample_index <- sample(nrow(obesity), nrow(obesity)*0.70)
obesity_train <- obesity1[sample_index,]
obesity_test <- obesity1[-sample_index,]
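After a random split it can be worth confirming the subset sizes and checking that the class proportions are similar in both parts. The sketch below does this on simulated data; `toy`, `n_train`, and `idx` are illustrative names, not objects from this project. `sample()` truncates a non-integer size, which is consistent with the 1477 training rows reported later, so the explicit `floor()` here only makes that behavior visible.

```r
set.seed(123)

# Simulated stand-in for the modelling data
toy <- data.frame(x = rnorm(200), Obesity = rbinom(200, 1, 0.46))

n_train <- floor(0.70 * nrow(toy))   # explicit floor of the training size
idx <- sample(nrow(toy), n_train)    # row indices drawn without replacement
train <- toy[idx, ]
test  <- toy[-idx, ]

# Sizes of the two subsets
c(train = nrow(train), test = nrow(test))

# Share of each class in the two subsets; large gaps would suggest an unlucky split
rbind(
  train = prop.table(table(train$Obesity)),
  test  = prop.table(table(test$Obesity))
)
```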
I started with logistic regression because it is a classical statistical method suitable for binary classification tasks. It is also simple to interpret, since its coefficients indicate the strength and direction of the relationships between the predictors and the target.
obe_glm <- glm(Obesity ~ ., family = binomial, data = obesity_train)
summary(obe_glm)
##
## Call:
## glm(formula = Obesity ~ ., family = binomial, data = obesity_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -18.13553 324.75170 -0.056 0.955466
## GenderMale 0.10865 0.20003 0.543 0.587022
## Age 0.09536 0.01560 6.113 9.80e-10 ***
## Height 0.33351 1.16292 0.287 0.774275
## Family_w_Overweightyes 3.38273 0.42989 7.869 3.58e-15 ***
## High_Caloric_Foodyes 2.37528 0.33618 7.065 1.60e-12 ***
## Vegetable 0.77535 0.13903 5.577 2.45e-08 ***
## Main_Meals 0.23222 0.08942 2.597 0.009403 **
## Food_btw_MealsFrequently -1.85366 0.74168 -2.499 0.012445 *
## Food_btw_Mealsno 0.26186 0.98244 0.267 0.789827
## Food_btw_MealsSometimes 1.56572 0.57382 2.729 0.006360 **
## Smokeyes 1.10811 0.61878 1.791 0.073324 .
## Water 0.15661 0.13221 1.185 0.236190
## Calories_Monitoryes -2.12962 0.77716 -2.740 0.006139 **
## Physical_Activity -0.34852 0.10117 -3.445 0.000571 ***
## El_Device_Time -0.38922 0.13325 -2.921 0.003490 **
## AlcoholFrequently 4.97885 324.74613 0.015 0.987768
## Alcoholno 5.23785 324.74585 0.016 0.987131
## AlcoholSometimes 5.29953 324.74590 0.016 0.986980
## TransportationBike 0.46761 2.24450 0.208 0.834968
## TransportationMotorbike 2.97494 1.36783 2.175 0.029634 *
## TransportationPublic_Transportation 1.44239 0.21406 6.738 1.60e-11 ***
## TransportationWalking -1.98227 1.07055 -1.852 0.064078 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2036.6 on 1476 degrees of freedom
## Residual deviance: 1271.3 on 1454 degrees of freedom
## AIC: 1317.3
##
## Number of Fisher Scoring iterations: 11
According to the summary of the regression model, and in particular the p-values of the coefficients, Age, Family_w_Overweight(yes), High_Caloric_Food(yes), Vegetable, Main_Meals, Food_btw_Meals(Frequently), Food_btw_Meals(Sometimes), Calories_Monitor(yes), Physical_Activity, El_Device_Time, Transportation(Motorbike), and Transportation(Public_Transportation) are statistically significant (p < 0.05) predictors of the target Obesity.
Next, I used this model to conduct prediction with the testing dataset. With the prediction results, I created a confusion matrix to see the prediction accuracy.
# predict
glm_pred <- predict(obe_glm, newdata = obesity_test, type = "response")
# create a confusion matrix and calculate the misclassification rate
conf_glm <- table(obesity_test$Obesity, (glm_pred > 0.5)*1, dnn = c("Truth", "Predicted"))
conf_glm
## Predicted
## Truth 0 1
## 0 230 107
## 1 49 248
err_glm <- (conf_glm[1,2]+conf_glm[2,1])/sum(conf_glm)
err_glm
## [1] 0.2460568
Based on the results, 107 non-obese individuals were predicted as obese and 49 obese individuals were predicted as non-obese. Overall, the misclassification rate is 24.6%, which corresponds to an accuracy of 75.4%. The prediction result is acceptable but not good enough.
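The error rate treats both kinds of mistake alike, so it can help to also look at sensitivity and specificity. The sketch below rebuilds the confusion matrix from the counts printed above and derives these measures; `classification_metrics()` is a hypothetical helper written for this illustration.

```r
# Confusion matrix of the logistic regression, rebuilt from the printed counts
conf <- matrix(
  c(230, 49, 107, 248), nrow = 2,
  dimnames = list(Truth = c("0", "1"), Predicted = c("0", "1"))
)

# Hypothetical helper; expects truth in rows and prediction in columns
classification_metrics <- function(conf) {
  accuracy <- sum(diag(conf)) / sum(conf)
  c(
    accuracy    = accuracy,
    error       = 1 - accuracy,
    sensitivity = conf["1", "1"] / sum(conf["1", ]),  # obese correctly flagged
    specificity = conf["0", "0"] / sum(conf["0", ])   # non-obese correctly cleared
  )
}

round(classification_metrics(conf), 3)
```

With these counts, most of the errors are non-obese respondents flagged as obese, so the specificity is noticeably lower than the sensitivity.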
In this section, I used the Generalized Additive Model (GAM) methodology to see whether I could develop a better model for this prediction. GAMs are most useful when the relationship between the predictors and the target is nonlinear and cannot easily be captured by linear models such as logistic regression. The GAM methodology is flexible in modelling nonlinear relationships, can accommodate interactions between predictors when they are specified, and often provides good predictive performance.
I kept using the same training dataset to build the GAM and the same testing dataset for prediction, so that the different models developed in this project can be compared fairly. Note that the gam() call below does not specify a family, so mgcv uses its default Gaussian family with an identity link, as the summary output confirms. The fitted values are then thresholded at 0.5 in the same way as the logistic regression probabilities.
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.8-42. For overview type 'help("mgcv-package")'.
obe_gam <- gam(Obesity ~ Gender + s(Age) + s(Height) + Family_w_Overweight + High_Caloric_Food + Vegetable + Main_Meals + Food_btw_Meals + Smoke + Water + Calories_Monitor + Physical_Activity + El_Device_Time + Alcohol + Transportation, data = obesity_train)
summary(obe_gam)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Obesity ~ Gender + s(Age) + s(Height) + Family_w_Overweight +
## High_Caloric_Food + Vegetable + Main_Meals + Food_btw_Meals +
## Smoke + Water + Calories_Monitor + Physical_Activity + El_Device_Time +
## Alcohol + Transportation
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.269497 0.407299 -0.662 0.50829
## GenderMale -0.004268 0.028894 -0.148 0.88258
## Family_w_Overweightyes 0.270403 0.030807 8.777 < 2e-16 ***
## High_Caloric_Foodyes 0.223272 0.033260 6.713 2.73e-11 ***
## Vegetable 0.111903 0.019033 5.879 5.11e-09 ***
## Main_Meals 0.031084 0.013423 2.316 0.02072 *
## Food_btw_MealsFrequently -0.086453 0.068226 -1.267 0.20531
## Food_btw_Mealsno -0.044400 0.092412 -0.480 0.63098
## Food_btw_MealsSometimes 0.192669 0.062899 3.063 0.00223 **
## Smokeyes 0.124068 0.076444 1.623 0.10481
## Water -0.022845 0.019314 -1.183 0.23709
## Calories_Monitoryes -0.073827 0.050236 -1.470 0.14189
## Physical_Activity -0.044572 0.014202 -3.139 0.00173 **
## El_Device_Time -0.064343 0.019091 -3.370 0.00077 ***
## AlcoholFrequently -0.308556 0.395807 -0.780 0.43578
## Alcoholno -0.219894 0.392558 -0.560 0.57546
## AlcoholSometimes -0.221489 0.393182 -0.563 0.57330
## TransportationBike -0.071484 0.161169 -0.444 0.65745
## TransportationMotorbike 0.221055 0.150765 1.466 0.14281
## TransportationPublic_Transportation 0.212955 0.034937 6.095 1.40e-09 ***
## TransportationWalking -0.002534 0.069518 -0.036 0.97093
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(Age) 8.103 8.758 13.31 <2e-16 ***
## s(Height) 7.577 8.506 7.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.407 Deviance explained = 42.2%
## GCV = 0.15088 Scale est. = 0.14713 n = 1477
According to the summary of the GAM, Age, Height, Family_w_Overweight(yes), High_Caloric_Food(yes), Vegetable, Main_Meals, Food_btw_Meals(Sometimes), Physical_Activity, El_Device_Time, and Transportation(Public_Transportation) are statistically significant predictors of the target Obesity. The set of significant predictors differs from that of the logistic regression model: the smooth term for Height is newly significant here, whereas Food_btw_Meals(Frequently), Calories_Monitor(yes), and Transportation(Motorbike) are no longer significant.
Next, let us look at the prediction results of this GAM.
# predict
gam_pred <- predict(obe_gam, newdata = obesity_test, type = "response")
length(gam_pred)
## [1] 634
# create a confusion matrix and calculate the misclassification rate
conf_gam <- table(obesity_test$Obesity, (gam_pred > 0.5)*1, dnn = c("Truth", "Predicted"))
conf_gam
## Predicted
## Truth 0 1
## 0 270 67
## 1 45 252
err_gam <- (conf_gam[1,2]+conf_gam[2,1])/sum(conf_gam)
err_gam
## [1] 0.1766562
Based on the prediction result, 67 non-obese people were predicted as obese and 45 obese people were predicted as non-obese. Overall, the misclassification rate is 17.7%, which corresponds to an accuracy of 82.3%. The GAM therefore performs better than the logistic regression model I first developed, probably because nonlinear relationships between the predictors and the target exist and are captured by the smooth terms in the GAM.
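The summary output above reports a Gaussian family with an identity link. A variant that may be worth comparing is a GAM fitted with `family = binomial`, which models the log-odds and therefore returns fitted values that are proper probabilities. The sketch below illustrates this on simulated data only. The data frame `toy`, the simulated effect, and the choice of `method = "REML"` are assumptions made for the illustration, and no claim is made about how such a model would perform on the project data.

```r
library(mgcv)

set.seed(42)
n <- 600

# Simulated predictors on roughly the same scale as Age and Height
toy <- data.frame(
  Age    = runif(n, 14, 61),
  Height = rnorm(n, mean = 1.70, sd = 0.09)
)

# Simulated 0/1 outcome with a nonlinear age effect (purely illustrative)
p_true <- plogis(-0.5 + 1.5 * sin(toy$Age / 8))
toy$Obesity <- rbinom(n, 1, p_true)

# Logistic GAM: smooth terms act on the log-odds scale
fit <- gam(Obesity ~ s(Age) + s(Height), family = binomial,
           data = toy, method = "REML")

# type = "response" returns predicted probabilities between 0 and 1
prob <- predict(fit, newdata = toy, type = "response")
range(prob)

# Same 0.5 threshold as used for the other models
table(Truth = toy$Obesity, Predicted = as.integer(prob > 0.5))
```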
In this section, I developed a boosting model to see whether the prediction could be improved further. Boosting models are powerful techniques that excel at capturing complex nonlinear relationships and interactions in the data. They often achieve higher predictive accuracy and can handle a large number of predictors effectively. Another advantage of the boosting model is that it provides “importance” scores for the predictor variables, which indicate the relative contribution of each predictor to the predictions. These scores give insight into which predictors are most influential in determining the outcome of the model.
Before fitting the boosting model, turning categorical data to factors in R is a prerequisite.
# convert categorical variables to factors
obesity_train$Gender <- as.factor(obesity_train$Gender)
obesity_train$Family_w_Overweight <- as.factor(obesity_train$Family_w_Overweight)
obesity_train$High_Caloric_Food <- as.factor(obesity_train$High_Caloric_Food)
obesity_train$Food_btw_Meals <- as.factor(obesity_train$Food_btw_Meals)
obesity_train$Smoke <- as.factor(obesity_train$Smoke)
obesity_train$Calories_Monitor <- as.factor(obesity_train$Calories_Monitor)
obesity_train$Alcohol <- as.factor(obesity_train$Alcohol)
obesity_train$Transportation <- as.factor(obesity_train$Transportation)
obesity_test$Gender <- as.factor(obesity_test$Gender)
obesity_test$Family_w_Overweight <- as.factor(obesity_test$Family_w_Overweight)
obesity_test$High_Caloric_Food <- as.factor(obesity_test$High_Caloric_Food)
obesity_test$Food_btw_Meals <- as.factor(obesity_test$Food_btw_Meals)
obesity_test$Smoke <- as.factor(obesity_test$Smoke)
obesity_test$Calories_Monitor <- as.factor(obesity_test$Calories_Monitor)
obesity_test$Alcohol <- as.factor(obesity_test$Alcohol)
obesity_test$Transportation <- as.factor(obesity_test$Transportation)
str(obesity_train)
## 'data.frame': 1477 obs. of 16 variables:
## $ Gender : Factor w/ 2 levels "Female","Male": 1 1 1 1 2 1 2 1 2 1 ...
## $ Age : int 19 23 26 22 23 43 18 21 25 25 ...
## $ Height : num 1.52 1.6 1.63 1.7 1.79 ...
## $ Family_w_Overweight: Factor w/ 2 levels "no","yes": 1 2 2 2 2 2 2 2 2 2 ...
## $ High_Caloric_Food : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
## $ Vegetable : int 3 3 3 2 2 2 2 3 1 2 ...
## $ Main_Meals : int 1 3 3 2 3 3 1 3 3 3 ...
## $ Food_btw_Meals : Factor w/ 4 levels "Always","Frequently",..: 2 2 4 4 4 4 4 2 4 4 ...
## $ Smoke : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Water : int 1 3 2 1 2 2 1 1 2 2 ...
## $ Calories_Monitor : Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 1 1 1 ...
## $ Physical_Activity : int 0 3 0 0 0 0 1 0 1 0 ...
## $ El_Device_Time : int 0 0 0 1 0 0 0 0 0 1 ...
## $ Alcohol : Factor w/ 4 levels "Always","Frequently",..: 4 3 4 3 4 4 4 3 4 2 ...
## $ Transportation : Factor w/ 5 levels "Automobile","Bike",..: 4 4 4 4 4 1 4 4 4 4 ...
## $ Obesity : int 0 1 1 0 1 1 0 0 1 0 ...
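The column-by-column conversion above can also be written more compactly. The sketch below uses a hypothetical helper, `chr_to_factor()`, that converts every character column of a data frame in one step, and applies it to a made-up data frame. Converting before the split, as shown here, also guarantees that the training and testing sets share the same factor levels, which is not automatic when the two subsets are converted separately and a rare level occurs in only one of them.

```r
# Hypothetical helper: turn every character column into a factor
chr_to_factor <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], factor)
  df
}

# Made-up rows for illustration
toy <- data.frame(
  Smoke   = c("no", "yes", "no", "no"),
  Alcohol = c("no", "Sometimes", "Frequently", "Always"),
  Age     = c(21L, 23L, 27L, 22L),
  stringsAsFactors = FALSE
)

toy <- chr_to_factor(toy)        # convert once, before splitting
train <- toy[1:3, ]
test  <- toy[4, , drop = FALSE]

# Both subsets keep the full set of levels, even for categories they do not contain
identical(levels(train$Alcohol), levels(test$Alcohol))
```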
After converting the data types, I developed the boosting model and then obtained the importance scores for the predictor variables in this model.
#library(ggplot2)
#library(caret)
library(adabag)
## Warning: package 'adabag' was built under R version 4.3.2
## Loading required package: rpart
## Warning: package 'rpart' was built under R version 4.3.2
## Loading required package: caret
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.3.1
## Loading required package: lattice
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 4.3.2
## Loading required package: doParallel
## Warning: package 'doParallel' was built under R version 4.3.2
## Loading required package: iterators
## Warning: package 'iterators' was built under R version 4.3.2
## Loading required package: parallel
obesity_train$Obesity <- as.factor(obesity_train$Obesity)
obe_boost <- boosting(Obesity ~ ., data = obesity_train, boos = TRUE)
#find the importance in decreasing order
sort(obe_boost$importance, decreasing = TRUE)
## Height Age Main_Meals Alcohol
## 30.8262806 17.4649823 6.2259961 6.1246859
## Physical_Activity Vegetable Transportation Family_w_Overweight
## 5.5782318 5.1912231 5.1465710 4.7169066
## Water Food_btw_Meals El_Device_Time Gender
## 4.5214108 3.6547313 3.5594614 3.0757536
## High_Caloric_Food Smoke Calories_Monitor
## 2.2985005 1.0465921 0.5686728
According to the importance scores obtained for all the predictors, Height is the most important predictor, followed by Age, Main_Meals, Alcohol, Physical_Activity, etc. Calories_Monitor and Smoke are the least important factors in making the prediction.
Next, I conducted the prediction with the boosting model, and also got the confusion matrix and error rate.
#predict
boost_pred = predict(obe_boost, newdata = obesity_test)
#confusion table and misclassification rate
boost_pred$confusion
## Observed Class
## Predicted Class 0 1
## 0 312 20
## 1 25 277
boost_pred$error
## [1] 0.07097792
Based on the prediction results, 25 non-obese people were predicted as obese and 20 obese people were predicted as non-obese (note that this table has the predicted class in rows and the observed class in columns). Overall, the misclassification rate is as low as 7.1%, which corresponds to an accuracy of 92.9%, much higher than that of the logistic regression and GAM models.
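To compare the three models side by side, their confusion matrices can be collected into one table. The sketch below rebuilds the matrices from the counts printed in this report. `error_rate()` and the object names are illustrative. Because the adabag table has the predicted class in rows, it is transposed so that all three matrices share the same layout.

```r
# Counts copied from the printed confusion matrices (filled column by column)
glm_conf   <- matrix(c(230, 49, 107, 248), nrow = 2)  # rows = truth
gam_conf   <- matrix(c(270, 45, 67, 252), nrow = 2)   # rows = truth
boost_conf <- matrix(c(312, 25, 20, 277), nrow = 2)   # rows = predicted class

# Put the boosting table into the same orientation as the other two
boost_conf <- t(boost_conf)

# Share of off-diagonal (misclassified) cases
error_rate <- function(conf) 1 - sum(diag(conf)) / sum(conf)

comparison <- data.frame(
  model = c("Logistic regression", "GAM", "Boosting"),
  error = c(error_rate(glm_conf), error_rate(gam_conf), error_rate(boost_conf))
)
comparison$accuracy <- 1 - comparison$error

# Best model first
comparison[order(comparison$error), ]
```

The transpose does not change the error rate, but it matters as soon as row-wise measures such as sensitivity are computed from the boosting table.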
The problem we’ve addressed in this project is that the obesity situation in Latin American countries of Mexico, Peru, and Colombia has become a significant public health concern, with high prevalence rates and associated health risks. Healthcare professionals, public health organizations, and individuals have the urgent need to understand the underlying determinants of obesity and promote healthier lifestyles.
By analyzing the data collected from Mexico, Peru, and Colombia, we aim to develop an effective predictive model to predict if one will become obese based on their individual conditions, and identify the key factors significantly contributing to obesity in terms of eating habits and physical activities.
The dataset was obtained from UC Irvine’s Machine Learning Repository and includes data for the estimation of obesity levels in individuals from Mexico, Peru and Colombia, based on their eating habits and physical condition. The data were originally collected through a web-based survey in which anonymous users answered each question; the information was then processed to obtain 17 attributes and 2111 records.
During data preparation, Tableau Prep was used to preprocess the original dataset: detecting missing values, rounding the decimal values of categorical variables, and checking the overall data quality.
In the exploratory data analysis stage, Tableau was used to create various types of charts to gain insight into the distribution of variables and the correlations between them, and to observe patterns and trends. In the predictive modelling and evaluation stage, R packages were employed to develop the models and to conduct prediction and evaluation. Three predictive modelling techniques were used: logistic regression, the generalized additive model, and boosting.
For the predictive modelling and evaluation, I started with logistic regression as a baseline model due to its simplicity and interpretability. Then I utilized the Generalized Additive Model (GAM) to develop the second model, considering that the relationship between predictors and the target may be nonlinear. In the end, I employed the boosting model to build the third model since it is a powerful technique that excels in capturing complex nonlinear relationships and interactions in the data.
Throughout the project, R Markdown was used as the reporting tool, combining the Tableau charts and the R code into a single report.
The exploratory data analysis (EDA) in Section 3 produced some interesting findings on the relationships between obesity and eating habits, as well as between obesity and physical condition.
We first noticed in Section 3.2.3 that people who always eat vegetables have a higher obesity ratio (57.7%) than the other two groups, those who sometimes or seldom eat vegetables (40% or so). The obesity ratio of the group that always eats vegetables is therefore roughly 1.4 to 1.5 times that of the other groups. (However, this does not mean that eating more vegetables leads to obesity. It only indicates that this group is more likely to be obese, perhaps because of other food they eat or some other characteristic they share.) Meanwhile, we also found that people who do not drink alcohol and people who drink it frequently have a clearly lower obesity ratio (37% or below) than people who sometimes drink alcohol (51%). In other words, the obesity ratio of the first two groups is at least 14 percentage points lower, or more than 20% lower in relative terms.
Furthermore, we found that people who exercise 2-4 days per week or 4-5 days per week have a much lower obesity ratio (18%) than people who do not exercise at all or who exercise 1-2 days per week (52%). This means that the obesity ratio of people who exercise two or more days per week is about 34 percentage points lower, or roughly 65% lower in relative terms. Encouraging people to exercise at least 2 days per week is therefore highly recommended as a preventive measure. In addition, we noticed that people who travel by Motorbike, Bike or Walking have a much lower obesity ratio (less than 30%) than those in the Public_Transportation and Automobile categories (45-48%).
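Because the comparisons above mix percentage-point gaps with relative ("x% higher or lower") statements, it helps to compute both explicitly. The sketch below does so using only the obesity ratios quoted in the text; where the text gives an approximate value ("40% or so", "37% or below"), the quoted figure is used as a stand-in, so the results are approximate. The helper names are illustrative.

```r
# Absolute gap in percentage points between a group and its reference group.
point_gap <- function(ratio, reference) ratio - reference

# Relative change of a group's obesity ratio with respect to the reference.
relative_change <- function(ratio, reference) (ratio - reference) / reference

# Obesity ratios (in %) as quoted in the text; approximate where the text is.
comparisons <- data.frame(
  comparison = c("always vegetables vs other groups",
                 "no or frequent alcohol vs sometimes",
                 "exercise 2+ days vs 0-2 days"),
  ratio      = c(57.7, 37, 18),
  reference  = c(40, 51, 52)
)

comparisons$point_gap <- point_gap(comparisons$ratio, comparisons$reference)
comparisons$relative_pct <- round(
  100 * relative_change(comparisons$ratio, comparisons$reference), 1
)
print(comparisons)
```

Stating which of the two measures is meant avoids ambiguity: a 34-point gap between 52% and 18% corresponds to a relative reduction of about 65%, not 34%.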
By comparing the prediction results of the logistic regression model, the generalized additive model and the boosting model, we conclude that the boosting model is the best one, achieving the highest accuracy (92.9%). Meanwhile, the importance scores of the boosting model, which measure the relative contribution of each predictor variable to the predictions, indicate that Height is the most important predictor, followed by Age, Main_Meals, Alcohol, Physical_Activity, etc. Calories_Monitor and Smoke are the least important factors in the model.
The insights that have been uncovered during the exploratory data analysis and predictive modelling can certainly benefit individuals and health professionals in developing better eating habits and more effective physical activities that help people stay in good shape.
Regarding eating and drinking, people who eat vegetables frequently should take care to keep the rest of their diet balanced, because the data show that the group that always eats vegetables has an obesity ratio roughly 1.4 to 1.5 times that of the other groups. Meanwhile, people who drink alcohol from time to time can reduce how often they drink to lower the risk of becoming obese.
As for physical activities, given the large difference in obesity ratio between people who exercise 2 or more days per week and everyone else, individuals who work out less should consider increasing their frequency. They do not need to increase it much further, because there is little difference in obesity ratio between people who exercise 2-4 days and people who exercise 4-5 days per week. Meanwhile, people who usually commute by public transportation or automobile can try other options such as biking and walking.
Based on the importance scores of the predictors from the boosting model, the variables Height, Age, Main_Meals, Alcohol and Physical_Activity are far more important than Calories_Monitor, Smoke and High_Caloric_Food in predicting obesity. Height and Age cannot be changed, so to reduce the high obesity ratio in these countries, focusing on the modifiable variables among the top-ranked ones (such as Main_Meals, Alcohol and Physical_Activity) would be more productive than treating every factor equally just because it sounds healthy, such as not smoking.
When exploring the distribution of the respondents’ ages, we found that it ranges from 14 to 61, with an average of 23.97. However, the histogram shows that most of the data came from people between 14 and 42 years old, and the data for people above 42 are very limited. Additionally, the boxplot for Age shows that the data for people above 36 years old fall beyond the whisker. The boosting model may therefore be biased for people above 36, and it cannot cover ages below 14.
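The "beyond the whisker" check mentioned above follows the usual boxplot rule: a point is flagged when it lies more than 1.5 times the hinge spread above the upper hinge (or below the lower one). The sketch below shows that rule with base R's `boxplot.stats()`. The `ages` vector is hypothetical and made up for illustration; it is not the survey data, so the flagged values will differ from those in the report.

```r
# Hypothetical ages for illustration only; NOT the survey data.
ages <- c(14, 17, 18, 19, 20, 21, 21, 22, 23, 23,
          24, 26, 27, 30, 33, 38, 41, 55, 61)

# boxplot.stats() returns the five boxplot statistics
# (lower whisker, lower hinge, median, upper hinge, upper whisker)
# and the points lying beyond the whiskers.
age_stats <- boxplot.stats(ages, coef = 1.5)

upper_whisker <- age_stats$stats[5]
beyond_whisker <- age_stats$out

cat("Upper whisker ends at age:", upper_whisker, "\n")
cat("Ages beyond the whiskers:", beyond_whisker, "\n")

# Share of records beyond the whiskers: a quick view of how thin the data is
# in the range where the model is most likely to be unreliable.
share_beyond <- length(beyond_whisker) / length(ages)
```

Applied to the real Age column, the same call would give the cutoff behind the "above 36" statement and the share of records that fall past it.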
Meanwhile, the original data were collected from the individuals living in the three countries of Mexico, Peru, and Colombia. So, the insights that we’ve presented and the model that we’ve developed are mainly applicable in these countries, because the relationships between obesity and the variables that are related to eating habits and physical activities may be different from region to region. The application of the boosting model in other areas should be verified and tuned with new data from local people.
For future research, the predictive accuracy of our model could be improved by collecting additional variables, such as occupation and working environment, and adding them to the data frame. Based on the current findings, further in-depth research could also examine some of the unusual relationships that we found. For example, more research could be done and more data could be collected to investigate why the group of people who always eat vegetables has a higher probability of being obese.