Importing the obesity data set for analysis

library(tidyverse)
obesity <- read.csv(file.choose())

Numeric summary for ‘family_history_with_overweight’ where value = ‘yes’

obesity |>
  count(family_history_with_overweight == 'yes')
##   family_history_with_overweight == "yes"    n
## 1                                   FALSE  385
## 2                                    TRUE 1726

Based on the summary of the data set in the environment, we know that there are 2,111 observations. Counting how often ‘yes’ appears in the family_history_with_overweight column shows that 1,726 patients (about 82%) have a family history of overweight and 385 do not. This can provide insight into the genetic aspect of obesity, and it could potentially help identify family history as a risk factor. A question this raises, however, is how strongly a patient's own obesity status is associated with having a family history of overweight.
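To make the split easier to read, the two counts can be converted into proportions. This is a minimal sketch that re-enters the counts printed above by hand; the names history_counts and history_props are my own and are not part of the data set.

```r
# Counts reported by count() above
history_counts <- c(no = 385, yes = 1726)

# Convert the counts to proportions of all 2,111 patients
history_props <- prop.table(history_counts)
round(history_props, 3)
##    no   yes 
## 0.182 0.818 
```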

Numeric summary for ‘Age’: minimum, maximum, and mean age

min_age <- min(obesity$Age) 
max_age <- max(obesity$Age) 
mean_age <- mean(obesity$Age)

min_age
## [1] 14
max_age
## [1] 61
mean_age
## [1] 24.3126

Based on the minimum, maximum, and mean age in this data set, the youngest patient is 14 and the oldest is 61, but the average age is only around 24. Because the mean sits much closer to the minimum than to the maximum, most patients appear to be on the younger end of the scale, which means that we might be missing a lot of information about the older population. A question this raises for me is why that data is missing: is it just because of the population where this data was collected, or was the collection biased in some way?
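One way I could check the impression that the ages lean young is to compare the mean with the median. The sketch below uses a small made-up vector (toy_ages is a hypothetical name and these values are not from the obesity data) just to show the idea: when a few older values pull the mean above the median, the distribution is likely right-skewed.

```r
# Made-up ages for illustration only (not taken from the obesity data)
toy_ages <- c(14, 18, 19, 21, 22, 23, 26, 30, 41, 61)

mean(toy_ages)
## [1] 27.5
median(toy_ages)
## [1] 22.5

# A mean above the median suggests a right-skewed distribution
mean(toy_ages) > median(toy_ages)
## [1] TRUE
```

On the real data, the same comparison would be median(obesity$Age) against mean(obesity$Age), and quantile(obesity$Age) would show where most patients fall.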

3 Novel Questions to investigate

The first question I want to investigate is whether there is a significant relationship between having a family history of obesity and the patient being obese. As I mentioned earlier, understanding this relationship can help determine whether a family history of obesity is a useful indicator to support preventative measures.
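A chi-squared test of independence is one possible way to approach this first question. The sketch below only shows the mechanics: the counts in toy_table are placeholders I made up, not results from the obesity data, and the obese/not-obese split is an indicator I would still have to derive from the data set.

```r
# Placeholder counts for illustration only; NOT results from the obesity data
toy_table <- matrix(
  c(30,  70,    # no family history: not obese, obese
    20, 180),   # family history:    not obese, obese
  nrow = 2, byrow = TRUE,
  dimnames = list(family_history = c('no', 'yes'),
                  obese = c('no', 'yes'))
)

# Test whether the row and column variables are independent
test_result <- chisq.test(toy_table)
test_result$p.value
```

A small p-value would suggest that the two variables are associated, although it would not say anything about the cause of that association.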

The second question I want to investigate is how additional lifestyle factors included in this data set, specifically smoking and forms of transportation, relate to whether the patient is obese. My hypothesis is that individuals who smoke and who walk or bike as their form of transportation will be in the normal weight category. Confirming this hypothesis, however, would require further research into other health markers that this data set doesn’t cover.

The third question I want to investigate is what the distribution of weight looks like over the patient population. I’ve already assessed the age of the patients, so the variance of weight would help me better understand the diversity of the data set.

Address one of those 3 questions with an aggregate function

I will be addressing the variance of the weight across the data set.

weight_mean <- mean(obesity$Weight)
weight_mean 
## [1] 86.58606
weight_var <- var(obesity$Weight)
weight_var
## [1] 685.9775

The mean weight is 86.58606 kg, and the variance is 685.9775 kg². Because variance is measured in squared units, it is easier to interpret through its square root, the standard deviation, which is about 26.2 kg. This means that the weights in this data set vary considerably around the mean. With just this information, it is difficult to draw any significant conclusions, but this does make me interested in the relationship between height, age, and weight. I might be able to visualize these relationships, or calculate the BMI and plot it by age.
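If I do calculate BMI, a small helper function would probably be enough. This is a sketch under assumptions: bmi() is my own hypothetical helper, and the example patient is made up rather than taken from the data.

```r
# BMI = weight in kilograms divided by height in meters, squared
bmi <- function(weight_kg, height_m) {
  weight_kg / height_m^2
}

# Hypothetical patient for illustration: 80 kg and 2.0 m tall
bmi(80, 2.0)
## [1] 20
```

To apply this to the data set, I would first need to confirm that there is a height column and that it is recorded in meters, since the formula gives the wrong scale for heights in centimeters.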

Visualization of the distribution of weight

library(ggplot2)

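# Age is continuous, so geom_boxplot() draws a single box covering all
# patients and ggplot2 warns about the continuous x aesthetic
obesity |>
  ggplot() +
  geom_boxplot(mapping = aes(x = Age, y = Weight)) +
  labs(title = 'Weight Distribution by Age',
       y = 'Weight') +
  theme_minimal()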

This box plot shows the distribution of weight across the range of ages in the data set. Because Age is a continuous variable, the plot is drawn as a single box rather than one box per age group. We can see that the ages range from below 20 to just below 60, and the median weight (the line inside the box) is around 85 kg. However, the weights range from below 60 kg all the way up to above 150 kg. Additionally, this plot shows that outliers do exist in this data set, which is something I should be mindful of going forward in my analysis. The main insight I gained from this graph is the span of the data, which I believe will be valuable for any further analysis since I now have a bigger picture of what I’m working with.
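To actually compare weight across ages, one option would be to group the ages into bins so that each bin gets its own box. The sketch below uses simulated data (the toy data frame is made up and is not the obesity data), and the 10-year bin width is just an assumption I picked for illustration. cut_width() is a ggplot2 helper that splits a numeric variable into bins of equal width.

```r
library(ggplot2)

# Simulated data for illustration only (not the obesity data)
set.seed(42)
toy <- data.frame(
  Age = sample(14:61, size = 200, replace = TRUE),
  Weight = rnorm(200, mean = 85, sd = 25)
)

# cut_width() groups the continuous Age values into 10-year bins,
# so geom_boxplot() draws one box per bin
age_plot <- ggplot(toy) +
  geom_boxplot(mapping = aes(x = cut_width(Age, width = 10, boundary = 10),
                             y = Weight)) +
  labs(title = 'Weight Distribution by Age Group',
       x = 'Age group (years)',
       y = 'Weight (kg)') +
  theme_minimal()

age_plot
```

Swapping toy for obesity should give the same kind of plot for the real data, although the bin width might need adjusting because most patients are young.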

Visualization of transportation

obesity |>
  ggplot() +
  geom_bar(mapping = aes(x = MTRANS, fill = family_history_with_overweight)) +
  theme_minimal() +
  scale_fill_brewer(palette = 'RdPu') +
  labs(title = 'Family History of Obesity by Transportation Used')

This bar chart shows which kind of transportation patients use as their main form, split by whether or not they have a family history of being overweight. The main insight from this visualization is that most people use either public transportation or some kind of automobile, and most of the people who use those forms of transportation have a family history of overweight. This raises the question of whether there is actually a relationship here, or whether the result simply reflects that public transportation and automobiles are the two most popular forms of transportation regardless of whether someone is obese or has a history of obesity in their family. I would also like to see how this graph would look if I replaced the family_history_with_overweight variable with the weight category for each patient.
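One way to separate popularity from an actual relationship would be to compare proportions within each transportation mode instead of raw counts. The sketch below uses a few made-up rows (the toy data frame and its values are for illustration only and are not from the obesity data) to show how row-wise proportions could be computed.

```r
# Made-up rows for illustration only (not the obesity data)
toy <- data.frame(
  MTRANS = c(rep('Public_Transportation', 6),
             rep('Automobile', 4),
             rep('Walking', 2)),
  family_history_with_overweight = c('yes', 'yes', 'yes', 'yes', 'yes', 'no',
                                     'yes', 'yes', 'yes', 'no',
                                     'yes', 'no')
)

# Row-wise proportions: share of yes/no within each transportation mode
history_by_transport <- prop.table(
  table(toy$MTRANS, toy$family_history_with_overweight),
  margin = 1
)
round(history_by_transport, 2)
```

In the bar chart itself, passing position = 'fill' to geom_bar() would show the same idea visually, since every bar is then scaled to the same height and only the yes/no shares differ.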