Variable Selection & Research Question:

I will analyze data from the provided NHIS data set. I hypothesize that mean BMI differs significantly between married and never-married respondents. The two variables are marital status (Demo_marital_C, the independent variable) and BMI (Health_BMI_N, the dependent variable).

Data Prep

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)

NHISdata <- read.csv("C:/Users/12055/Documents/CUNY - Undergrad/Spring 2021 - CUNY/Data 333/Data/NHIS Data.csv")

nhisdata<-NHISdata%>%
  select(Health_BMI_N, Demo_marital_C)%>%
  filter(Demo_marital_C %in% c("Married", "Never Married"))

Comparison of Means

Table

nhisdata%>%
  summarize(Health_BMI_N=mean(Health_BMI_N, na.rm=TRUE))

##   Health_BMI_N
## 1     27.21887

nhisdata%>%
  group_by(Demo_marital_C)%>%
  summarize(Health_BMI_N = mean(Health_BMI_N, na.rm=TRUE))

## # A tibble: 2 x 2
##   Demo_marital_C Health_BMI_N
##   <chr>                 <dbl>
## 1 Married                27.4
## 2 Never Married          26.9

Visualization

nhisdata%>%
  group_by(Demo_marital_C)%>%
  summarize(Health_BMI_N = mean(Health_BMI_N, na.rm=TRUE))%>%
  ggplot()+
  geom_col(aes(x=Demo_marital_C, y=Health_BMI_N, fill=Demo_marital_C))+
     scale_fill_manual(values = c("Married" = "purple", "Never Married" ="pink"))

Interpretation

The overall mean BMI for the two groups combined is 27.22. The group means are 27.4 for married respondents and 26.9 for never-married respondents, a difference of about 0.5 BMI units. The means do not differ greatly in absolute terms, but a table of means alone cannot show whether the difference is statistically significant; that requires a formal test.
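To put a number on how far apart the group means are, the grouped summary can be extended with group sizes, standard deviations, and the raw difference in means. The following is a minimal sketch on a small made-up data frame (`toy`, `group_summary`, and `mean_diff` are hypothetical names, and the values are not from the NHIS file); only the column names match the analysis.

```r
library(dplyr)

# Hypothetical toy data standing in for nhisdata (values are made up)
toy <- data.frame(
  Demo_marital_C = c("Married", "Married", "Married",
                     "Never Married", "Never Married", "Never Married"),
  Health_BMI_N   = c(28.1, 26.4, NA, 25.0, 27.5, 24.9)
)

# Per-group count of non-missing BMI values, mean, and standard deviation
group_summary <- toy %>%
  group_by(Demo_marital_C) %>%
  summarize(
    n    = sum(!is.na(Health_BMI_N)),
    mean = mean(Health_BMI_N, na.rm = TRUE),
    sd   = sd(Health_BMI_N, na.rm = TRUE)
  )

# Raw difference in group means (Married minus Never Married)
mean_diff <- group_summary$mean[group_summary$Demo_marital_C == "Married"] -
  group_summary$mean[group_summary$Demo_marital_C == "Never Married"]
```

Reporting n and sd next to each mean shows how much data and how much spread lie behind the difference.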

Comparison of Distributions

Visualization

nhisdata%>%
ggplot()+
  geom_histogram(aes(x=Health_BMI_N, fill=Demo_marital_C))+
  facet_wrap(~Demo_marital_C)+
  scale_fill_manual(values = c("Married" = "purple", "Never Married" ="pink"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 16905 rows containing non-finite values (stat_bin).

Interpretation

Both groups have roughly bell-shaped BMI distributions, each with a slight right skew (a longer tail toward higher BMI values).
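The two messages printed with the histogram come from the default bin count and from missing BMI values. One possible way to address both is to drop the missing values first and set the bin width explicitly. This is a sketch on made-up data: `toy` and `bmi_hist` are hypothetical names, and the bin width of 1 BMI unit is an assumption, not a value taken from the analysis.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for nhisdata (values are made up)
toy <- data.frame(
  Demo_marital_C = rep(c("Married", "Never Married"), each = 4),
  Health_BMI_N   = c(24.3, 27.8, 31.2, NA, 22.5, 26.1, 29.4, NA)
)

bmi_hist <- toy %>%
  filter(!is.na(Health_BMI_N)) %>%   # drop missing BMI before plotting
  ggplot() +
  geom_histogram(aes(x = Health_BMI_N, fill = Demo_marital_C),
                 binwidth = 1) +     # assumed bin width of 1 BMI unit
  facet_wrap(~Demo_marital_C) +
  scale_fill_manual(values = c("Married" = "purple", "Never Married" = "pink"))
```

Printing `bmi_hist` draws the faceted histogram.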

Sampling Distribution & T-test

Sampling Distribution

# Split by marital status, dropping missing BMI values so every sample has 40 observations
Married_nhisdata<-nhisdata%>%
  filter(Demo_marital_C=="Married", !is.na(Health_BMI_N))

NeverMarried_nhisdata<-nhisdata%>%
  filter(Demo_marital_C=="Never Married", !is.na(Health_BMI_N))

# Sampling distribution of the mean: 10,000 samples of size 40 per group
Married_Samp_Distro<-replicate(10000, sample(Married_nhisdata$Health_BMI_N, 40)%>%
mean())%>%
data.frame()%>%
rename("mean"=1)

NeverMarried_Samp_Distro<-replicate(10000, sample(NeverMarried_nhisdata$Health_BMI_N, 40)%>%
mean())%>%
data.frame()%>%
rename("mean"=1)

# Semi-transparent fills keep both overlapping distributions visible
ggplot()+
  geom_histogram(data=Married_Samp_Distro, aes(x=mean), fill="purple", alpha=0.6)+
  geom_histogram(data=NeverMarried_Samp_Distro, aes(x=mean), fill="pink", alpha=0.6)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
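The spread of a sampling distribution of means should be close to the standard error, sd / sqrt(n). The sketch below checks this on a simulated population. The values are generated with `rnorm()` under an assumed mean of 27 and standard deviation of 6, not drawn from the NHIS data, and the seed is arbitrary.

```r
set.seed(333)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of BMI values (simulated, not the NHIS data)
population <- rnorm(5000, mean = 27, sd = 6)

# 10,000 sample means from samples of size 40
samp_means <- replicate(10000, mean(sample(population, 40)))

# Standard error predicted by theory vs. spread observed in the simulation
theoretical_se <- sd(population) / sqrt(40)
observed_se    <- sd(samp_means)
```

The two values should agree closely, which is why the sampling distributions are much narrower than the raw BMI histograms.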

T-test

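According to the Welch two-sample t-test, mean BMI differs significantly between married and never-married respondents at a significance level of alpha = 0.05 (t = 27.97, p < 2.2e-16). The estimated difference is about 0.55 BMI units, with a 95 percent confidence interval of 0.51 to 0.59. With a sample this large, even a small difference in means is statistically significant.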

t.test(Health_BMI_N~Demo_marital_C, data=nhisdata)

## 
##  Welch Two Sample t-test
## 
## data:  Health_BMI_N by Demo_marital_C
## t = 27.974, df = 277324, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.5135481 0.5909325
## sample estimates:
##       mean in group Married mean in group Never Married 
##                    27.41481                    26.86257
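To show where the t statistic and degrees of freedom in this output come from, the Welch calculation can be reproduced by hand and compared with `t.test()`. This is a sketch on two small made-up vectors (`married`, `never_married`, and the other names are hypothetical), so the numbers differ from the NHIS results.

```r
# Hypothetical toy BMI values for two groups (made up for illustration)
married       <- c(28.1, 26.4, 30.2, 27.5, 25.9, 29.3)
never_married <- c(25.0, 27.5, 24.9, 26.2, 23.8)

# Variance of each group mean
v1 <- var(married) / length(married)
v2 <- var(never_married) / length(never_married)

# Welch t statistic: difference in means over its unpooled standard error
t_manual <- (mean(married) - mean(never_married)) / sqrt(v1 + v2)

# Welch-Satterthwaite degrees of freedom
df_manual <- (v1 + v2)^2 /
  (v1^2 / (length(married) - 1) + v2^2 / (length(never_married) - 1))

# Built-in test for comparison (Welch is the default, var.equal = FALSE)
fit <- t.test(married, never_married)
```

`fit$statistic` and `fit$parameter` should match `t_manual` and `df_manual`.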

Comparison of means between marital status and BMI
