Project 2

The dataset I used comes from our class google drive and is called “nhanes”. It is a CDC dataset where each line is some basic information about a person (age, gender, Household income, etc.) and then contains a multitude of different variables concerning many aspects of a person’s health, most of which would be answered by the patient or collected during a standard physical.

I chose to focus on the BMI variable, partly because it’s extremely accessible and everyone knows about it and also because it’s a very common experience to deal with weight issues for many people.

First we load in our library and dataset.

library(tidyverse)
library(plotly)

setwd("~/Data 110 Class Folder")
data <- read_csv("nhanes.csv")

I first started by filtering out NA’s in BMI and HHIncomeMid, then graphing. This made me realize HHIncomeMid is a categorical variable, not a numeric one, as it gives the midpoint in the range of HHIncome that you are in, making it useless for what I was trying to analyze. I then decided to switch to using the “Poverty” variable, which takes the individuals household income and puts it on a scale relative to the poverty level. This actually worked, but I ran into a couple issues. First, the correlation was positive, which felt wrong, and second, there was a big cluster of points where poverty = 5. This is when I realized that the poverty variable was capped at 5, which was likely to remove outliers, therefore I had to remove values of exactly 5.The correlation was still positive, but it was also almost zero which still felt wrong, and I then discovered tha BMI is calculated differently if you are under 20 years old, and there is even a seperate variable for it, so I added a third filter to remove those who were too young.The correlation then ended up being almost zero, so I wondered what would happen if I filtered for males vs females.

bmiM <- data %>% 
  filter(!is.na(BMI)) |>
  filter(Poverty != 5) |>
  filter(Age > 19) |>
  filter(Gender == "male")

ggplot(bmiM, aes(x = Poverty, y = BMI)) +
  geom_point() +
  labs(
    title = "Poverty Ratio vs BMI for Males 20 years or older",
    caption = "Source: CDC",
    x = "Poverty Ratio",
    y = "BMI"
  ) +
  theme_bw() +
  geom_smooth(method = 'lm', formula = y ~ x, color = "#021BF9")

This is a graph of Poverty Ratio vs BMI for males 20 year or older, with ratios of exactly 5 filtered out an a regression line added.

cor(bmiM$Poverty, bmiM$BMI)

[1] 0.05511529

This gave me the highest correlation I’d seen so far, 0.055, which was the wrong way vs what I was expecting, so I decided to run the same test but filtered for females only.

bmiW <- data %>% 
  filter(!is.na(BMI)) |>
  filter(Poverty != 5) |>
  filter(Age > 19) |>
  filter(Gender == "female")

ggplot(bmiW, aes(x = Poverty, y = BMI)) +
  geom_point() +
  labs(
    title = "Poverty Ratio vs BMI for Females 20 years or older",
    caption = "Source: CDC",
    x = "Poverty Ratio",
    y = "BMI"
  ) +
  theme_bw() +
  geom_smooth(method = 'lm', formula = y ~ x, color = "#e10000")

This is the same as the previous graph but filtered for females instead of males.

cor(bmiW$Poverty, bmiW$BMI)

[1] -0.1185395

Now we finally have a negative correlation, which is what I was expecting, but more importantly we’ve discovered how different the data is for men and women. Men have a positive correlation of 0.055, while women have a negative correlation of -0.12, which is a pretty substantial swing.

Next we create a linear model of the dataset and analyze it.

linear <- lm(BMI ~ Poverty, data = bmiW)
summary(linear)


Call:
lm(formula = BMI ~ Poverty, data = bmiW)

Residuals:
    Min      1Q  Median      3Q     Max 
-15.016  -5.385  -1.380   3.957  52.036 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  30.7302     0.2880  106.71  < 2e-16 ***
Poverty      -0.6682     0.1106   -6.04 1.76e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 7.573 on 2560 degrees of freedom
Multiple R-squared:  0.01405,   Adjusted R-squared:  0.01367 
F-statistic: 36.48 on 1 and 2560 DF,  p-value: 1.762e-09

We then create a linear model of the data, using the lm function and feeding it BMI and Poverty from our dataset. This gives us a linear estimation of y given x based on: y’ = b’₀ + b’₁x, where b’₀ is our offset (y intercept) and b’₁ is our slope. This assumes the data can be explained by an underlying model y = b₀ + b₁x + e, where e is an error term that allows for points to vary via normal distribution. This also assumes b₀ + b₁ are unknown model parameters that are estimated by the sample in the linear regression. Our linear model says that our equation is y = -0.6682 + 30.7302, which does that for an increase in poverty level by 1, the BMI is expected to decrease by 0.6682.

We can also see the the p value for the data is extremely small (on the order of 10^-9).This means that there is extremely high statistical probability that the poverty ratio does affect bmi in some way. The null hypothesis proposes the slope is 0, and the data shows that the likelihood oh this being the case is extremely small. Separately we can see the R² value is only ~0.014, meaning that Poverty doesn’t explain much of the value of BMI, but it most definitely does explain some of it with an extremely high degree of confidence.

This led me to want to calculate the confidence interval for my slope, which i did below. First we have to calculate t*, which is how many standard deviations from the mean we want for a given interval for our t distribution. I want a 95% 2 tail value, so we give the function 0.975, since we get 0.025 on each side of the distribution adding up to 0.05. We also feed it 2560 for the number of degrees of freedom, which is the number of samles minus our 2 degrees of freedom (variables).

qt(0.975, 2560)

[1] 1.960891

-0.6682 + (1.960891 * 0.1106)

[1] -0.4513255

-0.6682 - (1.960891 * 0.1106)

[1] -0.8850745

We can see that our t* value is ~1.961. We can calculate the confidence interval for our slope by taking this number and multiplying it by the given standard error of our slope (~0.111), then taking the slope and doing plus and minus that value. We can see above that the 95% confidence interval for our slope is ~-0.451 to ~-0.885

Having not quite gotten the results I was expecting from the previous analysis, I decided to take another angle at the relationship of BMI to income, this time making boxplots of the BMI distribution for every income level to see if anything stood out

bmi <- data %>% 
  filter(!is.na(BMI)) |>
  filter(Age > 19) |>
  filter(Gender == "female") |>
  filter(!is.na(HHIncome))

ggplot(bmi, aes(x = HHIncome, y = BMI)) +
  geom_boxplot(fill = "#ff474c") +
  labs(
    title = "BMI Distribution by Household Income Level",
    caption = "Source: CDC",
    x = "Household Income Level",
    y = "BMI"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, family = "mono", face = "bold"))

Just like the last plot, we can’t see much here. A couple income levels are a little bit different but they’re all similar enough that you can’t draw any conclusions. The only 2 things I noted from this plot was the high number of high BMI outliers in the highest income bracket compared to the rest, and the lack of lower BMI outliers.

I then decided to explore if race showed any effect, so I made the following plot with a boxplot for each race in the dataset (Black, Hispanic, Mexican, Other and White). Here we can definitely see that the boxplots for each race are vastly different. Black and Mexican have higher medians, Other has the lowest while also having the lowest interqualtile range and Black and White have many more high outliers then the others.

ggplot(bmi, aes(x = Race1, y = BMI, fill = Race1)) +
  geom_boxplot() +
  labs(
    title = "BMI Distribution by Race",
    caption = "Source: CDC",
    x = "Income Level",
    y = "BMI"
  ) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, face = "bold")) +
  scale_fill_manual(values = c(
    "White" = "#69b3a2",
    "Black" = "#ff474c",
    "Mexican" = "#fdb462",
    "Hispanic" = "#8da0cb",
    "Other" = "#e78ac3"))

This led me to the decision for my final plot, which is a scatterplot with multiple regression lines for each race, color coded for each race.

p1 <-bmiW |>
ggplot(aes(x = Poverty, y = BMI, color = Race1)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(
    title = "BMI vs. Poverty Ratio by Race",
    caption = "Source: CDC",
    x = "Poverty Ratio",
    y = "BMI",
    color = "Race"
  ) +
  theme_bw() +
  scale_color_manual(values = c(
    "White" = "#1aa33f",
    "Black" = "#ff474c",
    "Mexican" = "#fdb462",
    "Hispanic" = "#021BF9",
    "Other" = "#F0E442"
  ))

ggplotly(p1)

`geom_smooth()` using formula = 'y ~ x'

We can see from the plot that the regression lines are all similar, but different in their own ways. The slops all look extremely similar, but they are at quite different places. The “Other” line is much lower then the rest, and the Black line is higher then the rest. The other 3 are all pretty much the same and in a group together. You can also see that the 3 middle ones (white, mexican and hispanic) start the graph apart and as the poverty ratio increases they converge together more.

Overall, we can conclude a few things. First, The correlation between poverty ratio and bmi is quite different between men and women, with it being a positive correlation for men and a negative one for women. We can also see that while there is correlation for both, it is quite small. Finally, we can see that race also affects bmi, but also not a ton.