Load the obesity dataset
library(tidyverse)
# file.choose() opens an interactive file picker; select the obesity dataset CSV
obesity <- read.csv(file.choose())
Column 1: FAVC
I don’t know why this variable was named the way it was. It is a binary variable indicating whether the individual frequently eats high-calorie food. I would assume that the F stands for frequency, and the C may stand for calories.
Had I not read the documentation, I would have had no idea what this variable stood for, and I likely would not have used it in my analysis.
Column 2: FAF
I also do not understand why this variable is named the way it is. It is a continuous variable measuring how often the individual does physical activity. One of the F’s might stand for “Feature”, since the metadata categorizes this variable as a feature, and the A might stand for activity.
As with the previous variable, I likely would have had no idea what this one stood for and would not have used it in my analysis.
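One way to get a feel for cryptically named columns like these is to look at their values directly. The sketch below uses a small made-up data frame (`toy`) as a stand-in for `obesity`; the values in it are illustrative assumptions, not taken from the dataset.

```r
library(dplyr)

# Hypothetical stand-in rows; with the real data, use `obesity` instead of `toy`.
toy <- data.frame(
  FAVC = c("yes", "no", "yes", "yes"),
  FAF  = c(0, 1.5, 3, 2)
)

# Binary variable: list the distinct values and how often each occurs.
favc_counts <- count(toy, FAVC)

# Continuous variable: look at its range and centre.
faf_summary <- summary(toy$FAF)

favc_counts
faf_summary
```

Seeing a two-level column next to a numeric one at least suggests which variable is the yes/no question and which is a frequency, even before reading the documentation.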
Column 3: CALC
This is a categorical variable that represents how often the individual drinks alcohol. I think the name uses C for categorical and ALC for alcohol.
This one would be easy to figure out if the dataset used a consistent naming scheme, but the C could stand for either categorical or continuous, and CALC as a whole first made me think of a calculator or some kind of calculation, not alcohol consumption.
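Once the documentation has been read, the three columns could be given self-explanatory names so the ambiguity does not carry into the analysis. This is a minimal sketch on a made-up data frame; the new names (`high_calorie_food`, `physical_activity_freq`, `alcohol_freq`) are my own hypothetical choices, not names from the dataset or its documentation.

```r
library(dplyr)

# Hypothetical stand-in rows; with the real data, use `obesity` instead of `toy`.
toy <- data.frame(
  FAVC = c("yes", "no"),
  FAF  = c(0, 2),
  CALC = c("Sometimes", "no")
)

# rename() takes new_name = old_name pairs and leaves the values untouched.
renamed <- toy |>
  rename(
    high_calorie_food      = FAVC,
    physical_activity_freq = FAF,
    alcohol_freq           = CALC
  )

names(renamed)
```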
obesity |>
  ggplot() +
  geom_bar(mapping = aes(x = NObeyesdad, fill = NObeyesdad)) +
  theme_minimal() +
  scale_fill_brewer(palette = 'RdPu') +
  labs(title = 'Obesity Level', x = 'Obesity Level', y = 'Count') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
This visualization looks at the one column that is still not entirely clear to me even after reading the documentation: NObeyesdad (obesity level). I chose a bar graph because the data is categorical. However, what was unclear to me before is still unclear after visualizing it: what are the requirements for each obesity level, and how were they calculated? Is there a standard, or is it the authors’ own categorization based on their sample? They could be using BMI to categorize people, or just weight, or another technique entirely. As I mentioned previously, there is no single standard way to put people in these categories, so documentation on how individuals were assigned is necessary.
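One way to test the BMI hypothesis would be to compute BMI and cross-tabulate it against NObeyesdad. The sketch below assumes the data has `Height` (in metres) and `Weight` (in kilograms) columns, which are not mentioned above, and uses the standard WHO adult BMI cutoffs (18.5, 25, 30). The rows and labels in `toy` are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in rows; Height (m) and Weight (kg) are assumed column names.
toy <- data.frame(
  Height     = c(1.62, 1.80, 1.75, 1.70),
  Weight     = c(45, 70, 85, 100),
  NObeyesdad = c("Insufficient_Weight", "Normal_Weight",
                 "Overweight_Level_I", "Obesity_Type_I")
)

checked <- toy |>
  mutate(
    # BMI = weight in kg divided by height in metres squared.
    bmi = Weight / Height^2,
    # right = FALSE makes the intervals [lower, upper), so a BMI of exactly 25 is "Overweight".
    who_class = cut(
      bmi,
      breaks = c(-Inf, 18.5, 25, 30, Inf),
      labels = c("Underweight", "Normal", "Overweight", "Obese"),
      right  = FALSE
    )
  )

# Cross-tabulate the dataset's label against the BMI-based class.
count(checked, NObeyesdad, who_class)
```

If each NObeyesdad level in the real data lined up with a single BMI band, that would suggest the labels were derived from BMI; a scattered table would point to some other rule.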