Variable Selection and Research Question

I will be looking at how BMI is distributed between people who are below the poverty line and people who are above the poverty line. I hypothesize that where one is according to the poverty line (independent variable) will impact BMI (dependent variable.

Data Prep

First, I import the necessary packages and the NHIS data. Then I clean it so that it only contains the two variables I will be studying, and rename these variables so that they are easier to work with. Finally, I filter out any NA responses.

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
NHIS <- read.csv("/Users/Nazija/Desktop/SD3 NHIS Data.csv")
data <- NHIS%>%
  select(Demo_belowpovertyline_B, Health_BMI_N)%>%
  rename(poverty = Demo_belowpovertyline_B, BMI = Health_BMI_N)%>%
  filter(!is.na(BMI), !is.na(poverty))
head(data)
##   poverty   BMI
## 1       1 33.36
## 2       0 20.19
## 3       0 27.27
## 4       0 38.62
## 5       0 39.95
## 6       0 18.83

Data Analysis

Comparison of Means

I will compare the mean BMIs between respondents who were above the poverty line (indicated by a 0) and respondents who were below the poverty line (indicated by a 1).

Table

data%>%
  group_by(poverty)%>%
  summarize(avg = mean(BMI, na.rm = TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   poverty   avg
##     <int> <dbl>
## 1       0  27.9
## 2       1  28.3

Visualization

data%>%
  group_by(poverty)%>%
  summarize(avg_BMI = mean(BMI, na.rm = TRUE))%>%
  ggplot()+
  geom_col(aes(x = poverty, y = avg_BMI, fill = poverty))+
  geom_label(aes(x = poverty, y = avg_BMI, label = round(avg_BMI)))
## `summarise()` ungrouping output (override with `.groups` argument)

Interpretation

The average BMI between both groups is almost the same, and less than 1 point apart. This indicates that poverty may not have an impact on BMI, as respondents from both groups share basically the same average BMI.

Comparison of Response Distributions

I plot the count for each BMI score for the two groups.

data%>%
  ggplot()+
  geom_histogram((aes(x = BMI, fill = poverty)))+
  facet_wrap(~poverty)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Above Poverty Line

data%>%
  filter(poverty == 0)%>%
  ggplot()+
  geom_histogram((aes(x = BMI)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Below Poverty Line

data%>%
  filter(poverty == 1)%>%
  ggplot()+
  geom_histogram((aes(x = BMI)))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation

The two groups have very similar response distributions, though the range for respondents below the poverty threshold is wider than that for the respondents above the poverty threshold. Despite having a few more respondents that had higher BMIs, on average the respondents below the poverty line have the same BMI as respondents above the poverty line.

Sampling Distributions

I separate the data into two groups based on those below the poverty threshold and those above it, and create one thousand 40-person samples from the data for each group. Then I put the means of these one thousand samples into separate objects. I plot these sampling distributions, or means from the 1000 samples, for both groups to see how they compare.

Below_poverty<-data%>%
  filter(poverty == 1)
Above_poverty<-data%>%
  filter(poverty == 0)

Below_sampling<-replicate(10000,
          sample(Below_poverty$BMI, 40)%>%
            mean(na.rm = TRUE))%>%
  data.frame()%>%
  rename("mean" = 1)

Above_sampling<-replicate(10000,
          sample(Above_poverty$BMI, 40)%>%
            mean(na.rm = TRUE))%>%
  data.frame()%>%
  rename("mean" = 1)

Plot

ggplot()+
  geom_histogram(data = Below_sampling, aes(x = mean), fill = "blue", alpha = .75)+
  geom_histogram(data = Above_sampling, aes(x = mean), fill = "red", alpha = .75)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation

The two sampling distributions almost completely overlap, though while the distribution for the group above the poverty threshold is thinner and closer to the mean slightly below 28, the sampling distribution for the group below the poverty threshold is wider and lower, and the mean is closer to 28. This may show that despite having very close average BMIs, the group below the poverty threshold has a slightly higher average BMI.

T-test

t.test(BMI~poverty, data = data)
## 
##  Welch Two Sample t-test
## 
## data:  BMI by poverty
## t = -3.1462, df = 5652.8, p-value = 0.001663
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5882191 -0.1365892
## sample estimates:
## mean in group 0 mean in group 1 
##         27.9355         28.2979

Interpretation

This t-test shows that there is a statistically significant difference in BMI between the group below the poverty threshold and the groups above the poverty threshold, since the p-value is below .05. Therefore, poverty does impact BMI.