library(dplyr)
library(ggplot2)
library(tidyverse)
library(pROC)
library(shiny)
setwd("~/Desktop/datasets")
obesity <- read_csv("ObesityDataSet_raw_and_data_sinthetic.csv")Final Project 110 - Obesity Rates in Latin America
Background Research
According to the NIH, health-related issues surrounding diet is a prominent issue in Latin America. The crises is due to both mass malnutrition and nutritional in the region- often found in the same households. The root of the issue points to changes in the food system, extensive urbanization has led to skyrocketing fast/processed food consumption rates (Popkin and Reardon 1028). Another statistic supports this, citing that the average obesity rate in South America is 7.5%, 1.8% higher than the global average (Global Nutrition Report).
Introduction
Research Question: How does a person’s age, height, weight, and gender contribute to their family history of obesity?
About the dataset: The dataset is titled “Estimation of Obesity Levels Based On Eating Habits and Physical Condition” and comes from the UC Irvine Machine Learning Repository. The data itself was collected/created by NIH researchers Fabio Mendoza Palechor and Alexis de la Hoz Manotas. The data was generated synthetically using the Weka tool and the SMOTE filter, and 23% of the data was collected directly from users through a web platform. The dataset contains information about obesity levels in individuals from the countries of Mexico, Peru and Colombia, based on their eating habits and physical condition. I chose this dataset because I was interested in understanding the genetic factors of obesity, and how it is a hereditory disease. The dataset is 2111 with 16 variables (5 I will be using, defined below).
Gender: categorical, binary variable containing the gender of the person
Height: numerical, continuous variable containing the height of the person
Weight: numerical, continuous variable containing the weight of the person
Age: numerical, continuous variable containing the age of the person
family_history_with_overweight: categorical, binary variable containing information on whether the person has a family member who suffered/suffers from obesity
Load Dataset
Prep for Regression
In preperation, I will filter for 5/16 variables needed for my regression. Next, I will study the structure and convert any variables needed (ex, turn age into a discrete variable) to make the process easier.
Cleaning
#select the 5 variables needed
obesity <- obesity |>
select(c(Age, Weight, Height, family_history_with_overweight, Gender)) |>
#change Age to discrete by rounding it
mutate(Age = round(Age)) |>
#change gender into binary codes (male = 1, female = 0)
mutate(Gender = if_else(Gender == "Male", 1, 0)) |>
#change family_history_with_overweight into binary codes (yes = 1, no = 0)
mutate(family_history_with_overweight = if_else(family_history_with_overweight == "yes", 1, 0)) |>
#change family history variable to a factor class
mutate(family_history_with_overweight = as.factor(family_history_with_overweight)) |>
#change gender variable to a factor class
mutate(Gender = as.factor(Gender))
#double check work, confirm that age and family history variables are correct
str(obesity)tibble [2,111 × 5] (S3: tbl_df/tbl/data.frame)
$ Age : num [1:2111] 21 21 23 27 22 29 23 22 24 22 ...
$ Weight : num [1:2111] 64 56 77 87 89.8 53 55 53 64 68 ...
$ Height : num [1:2111] 1.62 1.52 1.8 1.8 1.78 1.62 1.5 1.64 1.78 1.72 ...
$ family_history_with_overweight: Factor w/ 2 levels "0","1": 2 2 2 1 1 1 2 1 2 2 ...
$ Gender : Factor w/ 2 levels "0","1": 1 1 2 2 2 2 1 2 2 2 ...
#the dataset source states that there are no missing values. Double check this information
sum(is.na(obesity))[1] 0
unique(obesity$family_history_with_overweight)[1] 1 0
Levels: 0 1
Statistical Analysis
For my statistical analysis, I have chosen logistic regression with a binary outcome variable. This analysis requires an equation including a binary outcome variable and predictor variables, followed by a series of tests on the model (confusion/performance matrix). I have chosen this because there is a clear outcome variable (family_history_with_overweight), which is binary, followed by predictors that can help explain the outcome variable (age, weight, gender, height)
#create final model
logistic <- glm(family_history_with_overweight ~ Age + Weight + Height + Gender, data=obesity, family="binomial")
#calculate model summary
summary(logistic)
Call:
glm(formula = family_history_with_overweight ~ Age + Weight +
Height + Gender, family = "binomial", data = obesity)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -10.65812 1.64845 -6.466 1.01e-10 ***
Age 0.01074 0.01302 0.825 0.409612
Weight 0.08966 0.00565 15.868 < 2e-16 ***
Height 3.46943 0.98718 3.514 0.000441 ***
Gender1 -0.88339 0.18939 -4.664 3.10e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2005.4 on 2110 degrees of freedom
Residual deviance: 1292.3 on 2106 degrees of freedom
AIC: 1302.3
Number of Fisher Scoring iterations: 6
Significant Pvalues
Weight, Height, and Gender (male only) are the most significant predictors of having a family history of obesity (pvalue > . 05). According to this alpha, age is not a significant predictor.
Diagostic Plot (ROC Curve and AOC Value)
# ROC curve & AUC on full data
roc_obj <- roc(response = obesity$family_history_with_overweight,
predictor = logistic$fitted.values,
levels = c("0", "1"),
direction = "<")
# Print AUC value
auc_val <- auc(roc_obj); auc_valArea under the curve: 0.8859
plot.roc(roc_obj, print.auc = TRUE, legacy.axes = TRUE,
xlab = "False Positive Rate (1 - Specificity)",
ylab = "True Positive Rate (Sensitivity)")The AUC = .886 means the model is very good at distinguishing between people who have a family history of obesity and people who don’t.
In plain words: if you randomly pick one person who has a family history of obesity and one person in the dataset who doesn’t, the model has about a 88.6% chance of ranking the person who does have a family history of obesity higher.
Visualizations:
#UI/page details
ui <- fluidPage(
titlePanel("Average BMI by Gender"),
sidebarLayout(
sidebarPanel(
#adding a checkbox for family history obesity status
radioButtons(inputId = "familyHistory",
label = "Family History of Overweight:",
choices = c("With Family History" = "yes",
"Without Family History" = "no"),
selected = "yes")
),
mainPanel(
plotOutput(outputId = "bmiBarGraph")
)
)
)
#server
server <- function(input, output) {
output$bmiBarGraph <- renderPlot({
#filter based on checkbox. If the "yes family history of obesity" box is checked, filter the plot data to people who answered "yes" to having a family history of obesity. If the "no family history of obesity" box is checked, filter to the opposite
plot_data <- obesity |>
filter(family_history_with_overweight == input$familyHistory |
#convert family history variables back to categorical for readability
family_history_with_overweight == ifelse(input$familyHistory == "yes", 1, 0)) |>
mutate(
#use height and weight to calculate BMI
BMI = Weight / (Height^2),
#internally do the same for gender- convert back to categorical variables
Gender = ifelse(Gender == 1 | Gender == "Men", "Men", "Women")
)
#calculate average BMI (grouped by Gender)
summary_data <- plot_data |>
group_by(Gender) |>
summarize(AvgBMI = mean(BMI, na.rm = TRUE),
Count = n())
#ggplot bar graph
ggplot(summary_data, aes(x = Gender, y = AvgBMI, fill = Gender)) +
geom_col(color = "white", width = 0.6) +
#custom ggplot colors
scale_fill_manual(values = c("Women" = "#FFB6C1", "Men" = "#A1C9F4"),
name = "Gender") +
#adjust boundaries for graphs so the data stays within the graph area
scale_y_continuous(limits = c(0, max(summary_data$AvgBMI) * 1.2)) +
labs(title = paste("Average BMI for Group:", input$familyHistory),
subtitle = paste("Total Subsample Size: n =", sum(summary_data$Count)),
x = "Gender Identity",
y = "Average Body Mass Index (BMI)",
caption = "Source: Mendoza Palechor & De la Hoz Manotas (2019) | UCI Machine Learning Repository") +
theme_minimal(base_size = 14) +
theme(plot.title = element_text(face = "bold", size = 16),
panel.grid.major.x = element_blank()) #remove the vertical lines in the middle of each bar
})
}
shinyApp(ui = ui, server = server)About the visualization:
This visualization is an interactive shiny app displaying a bar graph between women and men in the dataset. Between women and men, the bar graph displays the average BMI for each group (calculated using height and weight in the dataset, using the equation BMI = Weight / (Height^2)). In the shinyapp, the user is able to filter n for people who have a family history of obesity v.s those who don’t. The data itself in the visualization shows that for people who have a family history of obesity, women have a higher average BMI. For people who don’t have a family history of obesity, men have a higher average BMI.
Tableau Link:
https://public.tableau.com/views/ObesityRates_17789254327670/Sheet1?:language=en-US&publish=yes&:sid=&:redirect=auth&:display_count=n&:origin=viz_share_link
About the visualization:
The visualization shows the relationship between the age of individuals and their BMI, categorized by whether they have a family history of obesity or not. The scatterplot shows a dense cluster of younger individuals without a family history of obesity with lower BMIs. In contrast, individuals with higher BMIs are more scattered, with a heightened concentration of younger individuals with high BMIs.
Conclusion
Both visualizations prove that individuals with a higher average BMI tend to have a family history of obesity. While this did not surprise me, I was surprised to see from the scatterplot that most individuals who had high BMI and a family history of obesity were younger than 30. Additionally, I also did not expect that gender was an important role in varying median BMI (according to the bar chart). Moving forward, I would be interested to see how lifestyle and economic factors such as smoking or household income play a role in obesity, especially when using multivariable approaches with age.
Bibliography
Global Nutrition Report. “Latin America and the Caribbean Regional Nutrition Profile.” Global Nutrition Report, 2022, globalnutritionreport.org/resources/nutrition-profiles/latin-america-and-caribbean/.
Palechor, Fabio Mendoza, and Alexis de la Hoz Manotas. “Analysis of Eating Habits and Physical Condition Data for Estimation of Obesity Levels.” JMIR Medical Informatics, vol. 7, no. 4, 2019, e14323. PubMed Central, https://doi.org/10.2196/14323.
Popkin, B. M., and T. Reardon. “Obesity and the Food System Transformation in Latin America.” Obesity Reviews, vol. 19, no. 8, Aug. 2018, pp. 1028–1064. PubMed Central, https://doi.org/10.1111/obr.12694.