Summary

My data set was built for machine learning models that estimate obesity levels in individuals from multiple predictor variables, including eating habits and physical activity levels. It can be found here: https://archive.ics.uci.edu/dataset/544/estimation+of+obesity+levels+based+on+eating+habits+and+physical+condition

My main question for this project is how well individual lifestyle habits predict obesity, so that the results could ultimately inform some sort of prescriptive advice.

Visualization 1

Relationship between eating habits (CAEC - do you eat food between meals?) and obesity level (NObeyesdad)

library(tidyverse)
# Pick the downloaded CSV interactively; swap in a fixed file path for a reproducible knit.
obesity <- read.csv(file.choose())

library(ggplot2)
ggplot(obesity, aes(x = NObeyesdad, fill = CAEC)) +
  geom_bar() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Consumption of Food Between Meals versus Obesity Level') +
  scale_fill_brewer(palette = 'Set1')

This visualization compares the obesity category (insufficient weight through obesity type III) with how often a person eats between meals. I thought that this would warrant further investigation because, theoretically, eating between meals means an individual is consuming “more” calories overall, which might mean that they weigh more. However, a stacked bar chart of counts doesn’t give a very good overall picture, since the bar heights reflect how many people fall in each category rather than the share of each eating habit within it. I should try looking at the actual numeric values, or at a graph that shows the distribution within each category rather than raw counts.
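One way to show shares instead of counts is to rescale each bar to 100%. The following is a minimal sketch, not part of my original analysis: `plot_caec_share()` and `weight_levels` are names made up for illustration, the category labels are assumed to match the spelling used in the CSV, and the `toy` rows are invented only so the code runs without the file.

```r
library(ggplot2)

# Assumed spelling and order of the NObeyesdad labels, lightest to heaviest.
weight_levels <- c("Insufficient_Weight", "Normal_Weight",
                   "Overweight_Level_I", "Overweight_Level_II",
                   "Obesity_Type_I", "Obesity_Type_II", "Obesity_Type_III")

# Hypothetical helper: stacked bars rescaled so every category sums to 100%.
plot_caec_share <- function(df) {
  # Order the x axis by weight category instead of alphabetically.
  df$NObeyesdad <- factor(df$NObeyesdad, levels = weight_levels)
  ggplot(df, aes(x = NObeyesdad, fill = CAEC)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(title = "Share of Eating Between Meals within Each Obesity Level",
         y = "Share of respondents") +
    scale_fill_brewer(palette = "Set1")
}

# Invented rows, only to show the call; on the real data use plot_caec_share(obesity).
toy <- data.frame(
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Obesity_Type_I", "Obesity_Type_I"),
  CAEC = c("Sometimes", "Frequently", "Sometimes", "Sometimes")
)
p_share <- plot_caec_share(toy)
p_share
```

If a label in the data is spelled differently from `weight_levels`, `factor()` turns it into `NA`, so it is worth checking `unique(obesity$NObeyesdad)` first.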

ggplot(obesity, aes(x = NObeyesdad, y = FAF, fill = NObeyesdad)) +
  geom_boxplot() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Frequency of Physical Activity versus Obesity Level') 

This visualization compares obesity level with frequency of physical activity. I thought this was interesting and would warrant further investigation because, following a similar thought process to my last visualization, I would expect individuals with a higher frequency of physical activity to fall in lower obesity categories than individuals with a lower frequency. A higher frequency of physical activity might indicate a less sedentary lifestyle overall, or it could just mean that the individual is more conscious of their health and body.

My plan moving forward is to look into BMI as a way to better understand how the obesity categories in this data set were assigned, to look more closely at how each column relates to the obesity category, and to see whether each outcome has an explanation or not.
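As a first step toward that plan, BMI can be computed as weight in kilograms divided by the square of height in meters. The sketch below assumes the data set has `Weight` and `Height` columns in those units; `add_bmi()` is a hypothetical helper, and the `toy` rows are invented so the code runs without the CSV.

```r
# Hypothetical helper: append a BMI column (kg / m^2).
# Assumes Weight is in kilograms and Height is in meters.
add_bmi <- function(df) {
  df$BMI <- df$Weight / df$Height^2
  df
}

# Invented rows, only to show the call; on the real data use obesity <- add_bmi(obesity).
toy <- data.frame(
  Height = c(1.60, 1.75, 1.80, 1.70),
  Weight = c(45, 68, 110, 105),
  NObeyesdad = c("Insufficient_Weight", "Normal_Weight",
                 "Obesity_Type_I", "Obesity_Type_I")
)
toy <- add_bmi(toy)

# Minimum, median, and maximum BMI per category, to see where the cut-offs fall.
bmi_by_level <- aggregate(
  BMI ~ NObeyesdad, data = toy,
  FUN = function(x) c(min = min(x), median = median(x), max = max(x))
)
print(bmi_by_level)
```

Comparing the per-category BMI ranges on the real data should show whether the categories follow BMI cut-offs or overlap.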

Initial Findings

My first hypothesis is that individuals who have a family history of being overweight are more likely to be in the overweight or obese categories than those who do not.

ggplot(obesity, aes(x = NObeyesdad, fill = family_history_with_overweight)) +
  geom_bar() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Family History with Overweight by Obesity Level') +
  scale_fill_brewer(palette = "Set2")
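
A chart alone cannot confirm the first hypothesis, so one possible follow-up is a table of shares plus a chi-squared test of independence. This is only a sketch under assumptions: the chi-squared test is my choice of technique here, `family_history_check()` is a hypothetical helper, and the `toy` rows are invented so the code runs without the CSV.

```r
# Hypothetical helper: cross-tabulate obesity level against family history,
# then report within-category shares and a chi-squared test of independence.
family_history_check <- function(df) {
  tab <- table(df$NObeyesdad, df$family_history_with_overweight)
  list(
    counts = tab,
    shares = prop.table(tab, margin = 1),  # each row (obesity level) sums to 1
    test   = chisq.test(tab)
  )
}

# Invented rows, only to show the call; on the real data use family_history_check(obesity).
toy <- data.frame(
  NObeyesdad = rep(c("Obesity_Type_I", "Obesity_Type_I",
                     "Normal_Weight", "Normal_Weight"),
                   times = c(30, 10, 15, 25)),
  family_history_with_overweight = rep(c("yes", "no", "yes", "no"),
                                       times = c(30, 10, 15, 25))
)
res <- family_history_check(toy)
print(res$shares)
print(res$test)
```

A small p-value would suggest that obesity level and family history are not independent, although it says nothing about the direction or cause of the relationship.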

My second hypothesis is that individuals who consume higher amounts of vegetables are more likely to be in the lower obesity levels.

ggplot(obesity, aes(x = NObeyesdad, y = FCVC, fill = NObeyesdad)) +
  geom_boxplot() +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(title = 'Vegetable Consumption by Obesity Level') +
  scale_fill_brewer(palette = "Set3")
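
To put numbers next to the box plot for the second hypothesis, the median vegetable consumption per obesity level can be tabulated and the groups compared with a rank-based test. This is a sketch under assumptions: the Kruskal-Wallis test is one possible choice rather than an established part of this analysis, `fcvc_by_level()` is a hypothetical helper, and the `toy` rows are invented so the code runs without the CSV.

```r
# Hypothetical helper: median FCVC per obesity level plus a Kruskal-Wallis test,
# which compares the groups without assuming normally distributed values.
fcvc_by_level <- function(df) {
  list(
    medians = aggregate(FCVC ~ NObeyesdad, data = df, FUN = median),
    test    = kruskal.test(FCVC ~ NObeyesdad, data = df)
  )
}

# Invented rows, only to show the call; on the real data use fcvc_by_level(obesity).
toy <- data.frame(
  NObeyesdad = rep(c("Normal_Weight", "Obesity_Type_I"), each = 4),
  FCVC = c(3, 2.5, 3, 2, 2, 1.5, 2, 1)
)
res <- fcvc_by_level(toy)
print(res$medians)
print(res$test)
```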