Week 5 Data Dive - Documentation

Load in obesity dataset

library(tidyverse)

## -- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
## v dplyr     1.1.2     v readr     2.1.4
## v forcats   1.0.0     v stringr   1.5.1
## v ggplot2   3.3.5     v tibble    3.2.1
## v lubridate 1.9.2     v tidyr     1.3.0
## v purrr     1.0.1     
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

obesity <- read.csv(file.choose())

1. A list of at least three columns that were unclear until after reading the documentation

Column 1: FAVC

I am not sure why the authors chose this name for the variable. It is a binary variable describing whether or not the individual frequently eats high-calorie food. I would assume that the F stands for frequency and that the C may stand for calories.
If I hadn’t read the documentation, I would have had no idea what this variable stood for and likely would not have used it in my analysis.

Column 2: FAF

I also do not understand why this variable is named the way it is. It is a continuous variable representing how often the individual does physical activity. One of the F’s might stand for “Feature”, as the metadata categorizes this variable as a feature, and the A might stand for activity.
As with the previous variable, I likely would have had no idea what it stood for and would not have used it in my analysis.

Column 3: CALC

This is a categorical variable that represents how often the individual drinks alcohol. I think it was named this way so that the C represents categorical and the ALC refers to alcohol.
I think this one would be easy to figure out if there were a standard naming scheme across all columns, but the C could stand for categorical or continuous, and CALC as a whole first made me think of a calculator or some kind of calculation, not alcohol consumption.
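One way to make these abbreviations easier to work with is to rename them once the documentation has been read. The sketch below is only an illustration: the toy data frame stands in for the real obesity data (its values are made up), and the descriptive names are my own hypothetical suggestions, not names taken from the documentation.

```r
library(dplyr)

# Toy stand-in for the obesity data; values are made up for illustration only.
toy <- data.frame(
  FAVC = c("yes", "no", "yes"),
  FAF  = c(0, 1.5, 3),
  CALC = c("no", "Sometimes", "Frequently")
)

# Hypothetical descriptive names (my own suggestions, not from the documentation).
# rename() takes new_name = old_name pairs and leaves the data unchanged.
toy_renamed <- toy |>
  rename(
    eats_high_calorie_food = FAVC,
    physical_activity_freq = FAF,
    alcohol_consumption    = CALC
  )

names(toy_renamed)
```

The same rename() call could be applied to the full obesity data frame, assuming its columns carry the names listed above.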

2. At least one element that is still unclear after reading the documentation

One element that is still unclear to me after reading the documentation is how individuals are assigned to the categories in the NObeyesdad column, which represents the obesity level. I did some additional research, but different sources use different boundaries for each level of obesity, so I don’t know what these categories are based on.

3. A visualization of the column brought up in question 2

obesity |>
  ggplot() +
  geom_bar(mapping = aes(x = NObeyesdad, fill = NObeyesdad)) +
  theme_minimal() +
  scale_fill_brewer(palette = 'RdPu') +
  labs(title = 'Obesity Level', x = 'Obesity Level', y = 'Count') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

This visualization looks at NObeyesdad (obesity level), the one column that is still not entirely clear to me even after reading the documentation. I chose a bar graph because the data is categorical. However, what was unclear to me before is still unclear after visualizing it: what are the requirements for each obesity type, and how are they calculated? Is there a standard, or is it the authors' own categorization based on their sample? They could be using BMI to categorize people, or just weight, or another technique entirely. As I mentioned previously, there is no single standard for placing people in these categories, so documentation on how individuals are assigned is necessary.
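One way to probe the BMI hypothesis would be to compute BMI (weight in kilograms divided by height in metres squared) and look at the range of values within each obesity level. The sketch below rests on assumptions: it uses a toy data frame with made-up values, it assumes the data has Height (metres) and Weight (kilograms) columns, and the category labels are placeholders rather than values confirmed from the dataset.

```r
library(dplyr)

# Toy stand-in for the obesity data; all values are made up for illustration.
# Assumed columns: Height in metres, Weight in kilograms.
toy <- data.frame(
  Height     = c(1.60, 1.75, 1.80, 1.65),
  Weight     = c(50, 80, 110, 60),
  NObeyesdad = c("Normal_Weight", "Overweight_Level_I",
                 "Obesity_Type_I", "Normal_Weight")  # placeholder labels
)

# BMI range per obesity level, ordered from lowest to highest minimum BMI.
bmi_by_level <- toy |>
  mutate(bmi = Weight / Height^2) |>
  group_by(NObeyesdad) |>
  summarise(min_bmi = min(bmi), max_bmi = max(bmi), n = n()) |>
  arrange(min_bmi)

bmi_by_level
```

If the BMI ranges on the real data barely overlap between levels, that would suggest the categories are BMI-based; heavy overlap would point to some other rule. Either way, this is only a check, not a substitute for documentation from the authors.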

Kylie Heagy

2024-09-26
