Data Dive 3 - Group By and Probabilities

###Importing the obesity dataset for analysis

{r}
library(tidyverse)
obesity <- read.csv(file.choose())

###Group by gender and summarize weight

{r}
library(RColorBrewer)

gender_weight <- obesity |>
  group_by(Gender) |>
  summarise(avg_weight = mean(Weight))

gw_min <- min(gender_weight$avg_weight)

gender_weight |>
  ggplot(mapping = aes(x = Gender, y = avg_weight, fill = Gender)) +
  geom_bar(stat = "identity", color = 'white') +
  theme_minimal() +
  scale_fill_brewer(palette = 'PuBuGn') +
  labs(title='Average Weight By Gender')

####The main conclusion I can draw from this first analysis is that, on average, females weigh less than males in this dataset. To phrase this in terms of probability, if you randomly select an individual from the dataset, they are more likely to weigh less if they are female and more likely to weigh more if they are male. Likewise, a randomly selected group containing more women is likely to have a lower average weight than a group containing more men. After analyzing this, though, I don’t think the result adds much insight to my analysis. My hypothesis for the difference is that females are typically smaller than males in general, so it makes sense that women would weigh less than men on average. This doesn’t account for outliers. It would be interesting to see whether men and women are evenly split across the obesity/normal weight categories or whether significantly more men or women fall into any one category.
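
####The probability statement above can be checked more directly than by comparing two averages. The chunk below is a minimal sketch, not part of the original analysis: it uses a small made-up data frame (`toy`) that only borrows the `Gender` and `Weight` column names, and the helper names `below_median` and `p_female_lighter` are my own.

```r
library(dplyr)

# Toy data for illustration only; these are NOT values from the obesity dataset.
toy <- data.frame(
  Gender = c("Female", "Female", "Female", "Male", "Male", "Male"),
  Weight = c(55, 62, 90, 70, 85, 95)
)

# Share of each gender that falls below the overall median weight.
overall_median <- median(toy$Weight)

below_median <- toy |>
  group_by(Gender) |>
  summarise(p_below_median = mean(Weight < overall_median))

# Estimated probability that a randomly chosen female weighs less than a
# randomly chosen male, computed over every female-male pair.
females <- toy$Weight[toy$Gender == "Female"]
males <- toy$Weight[toy$Gender == "Male"]
p_female_lighter <- mean(outer(females, males, "<"))

below_median
p_female_lighter
```

####Swapping `toy` for `obesity` should give the same quantities for the real data, assuming `Gender` has exactly the two levels shown here.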

###Group by obesity level and summarize physical activity per week

{r}
obesity_activity <- obesity |>
  group_by(NObeyesdad) |>
  summarise(avg_activity = mean(FAF))

oa_min <- min(obesity_activity$avg_activity)

obesity_activity |>
  ggplot(mapping = aes(x = NObeyesdad, y = avg_activity, fill = NObeyesdad)) +
  geom_bar(stat = "identity", color = 'white') +
  theme_minimal() +
  scale_fill_brewer(palette = 'PRGn') +
  labs(title = 'Average Weekly Physical Activity By Obesity Level')

####From this analysis, I can conclude that individuals in the obesity type III category (formerly known as morbidly obese; BMI >= 40, considered high risk) exercise the least per week of all the weight categories. In terms of probability, there is a high probability that someone in obesity type III is not consistently exercising even once per week. Conversely, someone who exercises once or more per week is likely not in the obesity type III category. My testable hypothesis for this group is that some form of exercise, even just a few times a week, is enough to help maintain a healthier weight.
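
####Both probability statements above are conditional probabilities, and they run in opposite directions. A rough sketch of how each could be estimated is below. It runs on made-up toy data, and it assumes that `FAF >= 1` is a fair stand-in for "exercising at least once per week", which is my reading of the variable rather than something the dataset confirms.

```r
library(dplyr)

# Toy data for illustration only; these are NOT values from the obesity dataset.
toy <- data.frame(
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Normal_Weight",
                 "Obesity_Type_III", "Obesity_Type_III", "Obesity_Type_III"),
  FAF = c(2, 1, 0, 0, 0.5, 1)
)

# P(active | obesity level): share of each level with FAF of at least 1.
activity_prob <- toy |>
  group_by(NObeyesdad) |>
  summarise(
    avg_activity = mean(FAF),
    p_active = mean(FAF >= 1)
  )

# P(obesity level | active): among active people, the share in each level.
level_given_active <- toy |>
  filter(FAF >= 1) |>
  count(NObeyesdad) |>
  mutate(p_level = n / sum(n))

activity_prob
level_given_active
```

####The two tables answer different questions, so a low `p_active` for one level does not by itself say how rare that level is among active people.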

###Group by family history of obesity and servings of vegetables

{r}

obesity_vegetables <- obesity |>
  group_by(family_history_with_overweight) |>
  summarize(avg_vegetables = mean(FCVC))
  
ov <- min(obesity_vegetables$avg_vegetables)

obesity_vegetables |>
  ggplot(mapping = aes(x = family_history_with_overweight, y = avg_vegetables, fill = family_history_with_overweight)) +
  geom_bar(stat = "identity", color = 'white') +
  theme_minimal() +
  scale_fill_brewer(palette = 'Pastel2') +
  labs(title = 'Average Daily Servings of Vegetables By Family History of Obesity')

####From this analysis, I can conclude that individuals with a family history of obesity eat slightly more vegetables than individuals without one, but the two averages are very close. In terms of probability, the servings of vegetables an individual eats do not appear to be related to their family history of obesity. Adding onto this, for any randomly selected individual or group, a family history of obesity should not change the expected average servings of vegetables eaten per day. A hypothesis I could test from this is that a family history of obesity has no significant relationship to the servings of vegetables consumed per day. Confirming this would let me say with more confidence that there is no relationship there.
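
####One way this hypothesis might be tested is a two-sample t-test on the vegetable servings. The chunk below is only a sketch on made-up toy data; choosing Welch's t-test (the default for base R's `t.test()`) is my assumption, not a method taken from the original analysis.

```r
# Toy data for illustration only; these are NOT values from the obesity dataset.
toy <- data.frame(
  family_history_with_overweight = rep(c("yes", "no"), each = 5),
  FCVC = c(2, 3, 2, 3, 2, 2, 2, 3, 1, 3)
)

# Welch two-sample t-test comparing mean FCVC between the two groups.
veg_test <- t.test(FCVC ~ family_history_with_overweight, data = toy)

veg_test$p.value   # probability of a difference this large if the true means are equal
veg_test$conf.int  # 95% confidence interval for the difference in means
```

####A large p-value here would mean the data fail to show a difference, which is weaker than showing that no relationship exists, so the confidence interval is worth reading alongside it.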

###Data Frame of Categorical Variables Obesity Level and Do You Eat in between Meals?

{r}

obesity_df <- obesity |>
  select(NObeyesdad, CAEC)
  
unique_obesity_df <- expand.grid(NObeyesdad = unique(obesity_df$NObeyesdad), CAEC = unique(obesity_df$CAEC))

unique_obesity_df

####This data frame lists every possible combination of obesity level and eating-between-meals response, so no unique combination is missing from it.
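
####Because `expand.grid()` builds every possible pairing by construction, a separate step is needed to see whether any pairing never occurs in the actual rows. A possible sketch, using made-up toy data and my own helper names (`all_combos`, `observed`, `missing_combos`):

```r
library(dplyr)

# Toy data for illustration only; these are NOT values from the obesity dataset.
toy <- data.frame(
  NObeyesdad = c("Normal_Weight", "Normal_Weight", "Obesity_Type_I"),
  CAEC = c("Sometimes", "no", "Sometimes")
)

# Every possible combination (kept as character so the join keys match).
all_combos <- expand.grid(
  NObeyesdad = unique(toy$NObeyesdad),
  CAEC = unique(toy$CAEC),
  stringsAsFactors = FALSE
)

# Combinations that actually occur in the data.
observed <- distinct(toy, NObeyesdad, CAEC)

# Possible combinations with no matching row in the data.
missing_combos <- anti_join(all_combos, observed, by = c("NObeyesdad", "CAEC"))

missing_combos
```

####An empty `missing_combos` would mean every possible combination appears at least once.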

{r}
count <- obesity_df |>
  count(NObeyesdad, CAEC)

####The most common combinations are obesity type 1 with sometimes eating between meals, obesity type 2 with sometimes, and overweight level 2 with sometimes. The least common combinations are overweight level 2 with no, obesity type 2 with no, obesity type 3 with frequently, obesity type 1 with no, and obesity type 2 with frequently. These are all over the board, so I don’t feel confident drawing any conclusion from them. My only guess would be that people who are overweight or obese are consuming more food, so it would be rarer for them to not eat between meals, but that is a large overgeneralization.
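
####Raw counts are hard to compare across obesity levels because the levels have different numbers of people. A sketch of one alternative is to turn the counts into within-level proportions. The data below are made up, and `combo_props` is a name I chose for illustration.

```r
library(dplyr)

# Toy data for illustration only; these are NOT values from the obesity dataset.
toy <- data.frame(
  NObeyesdad = c(rep("Normal_Weight", 4), rep("Obesity_Type_I", 4)),
  CAEC = c("Sometimes", "Frequently", "Frequently", "no",
           "Sometimes", "Sometimes", "Sometimes", "Frequently")
)

# Count each combination, then divide by the total for its obesity level,
# giving P(CAEC response | obesity level).
combo_props <- toy |>
  count(NObeyesdad, CAEC) |>
  group_by(NObeyesdad) |>
  mutate(prop = n / sum(n)) |>
  ungroup() |>
  arrange(NObeyesdad, desc(prop))

combo_props
```

####With proportions, "sometimes" being the top response everywhere would no longer hide differences in how often the other responses appear within each level.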

{r}
combination_filter <- count |>
  filter(NObeyesdad == 'Obesity_Type_I')

combination_filter |>
  ggplot(mapping = aes(x = CAEC, y = n, fill = CAEC)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  scale_fill_brewer(palette = 'Set3') +
  labs(title = 'Obesity Type 1 and How Often Eating Between Meals')

####This graph shows that, for obesity type 1, almost every patient reports eating between meals sometimes, and the other response categories are rare by comparison. This lends some support to my hypothesis that individuals who are obese eat more than individuals who are not, but I would want to confirm it by making more visualizations and running other statistical tests.
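
####One candidate for the "other statistical tests" is a chi-squared test of independence between obesity status and eating between meals. The chunk below is a sketch under several assumptions: the data are made up, the two-level `obese` column is a hypothetical grouping that does not exist in the dataset, and only two CAEC responses are included to keep the table small.

```r
# Toy data for illustration only; these are NOT values from the obesity dataset.
# `obese` is a hypothetical two-level grouping created for this example.
toy <- data.frame(
  obese = rep(c("obese", "not_obese"), times = c(20, 20)),
  CAEC = c(rep("Sometimes", 17), rep("Frequently", 3),
           rep("Sometimes", 10), rep("Frequently", 10))
)

# Contingency table of group by response, then a test of independence.
tab <- table(toy$obese, toy$CAEC)
chi <- chisq.test(tab)

tab
chi$p.value
```

####With the real data, `NObeyesdad` would first need to be collapsed into groups, and very small cell counts could make the chi-squared approximation unreliable.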

Kylie Heagy

2024-09-18