Variable Selection & Research Question

Categorical (Independent) Variable
The independent variable I will be analyzing is Demo_sex_C, which records whether the respondent is male or female.
Continuous (Dependent) Variable
The dependent variable I will be analyzing is Behav_BingeDrinkDaysYear_N, which records how many days per year respondents reported binge drinking (5+ drinks in a single session).
Research Question/Hypothesis
I hypothesize that there is a relationship between a respondent's sex and how many days per year they binge drink.

Data Prep

Loading packages, selecting necessary variables, and filtering data.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(readr)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
healthdata <- read_csv("/Users/rebeccagibble/Downloads/SD3 NHIS Data.csv")
## Parsed with column specification:
## cols(
##   year = col_double(),
##   Demo_Race = col_character(),
##   Demo_sex_C = col_character(),
##   Demo_sexorien_C = col_character(),
##   Demo_belowpovertyline_B = col_double(),
##   Demo_agerange_C = col_character(),
##   Demo_marital_C = col_character(),
##   Health_SelfRatedHealth_C = col_character(),
##   MentalHealth_MentalIllnessK6_N = col_double(),
##   Health_BMI_N = col_double(),
##   Behav_CigsPerDay_N = col_double(),
##   Behav_AlcDaysPerYear_N = col_double(),
##   Behav_AlcDaysPerWeek_N = col_double(),
##   Behav_BingeDrinkDaysYear_N = col_double()
## )
data<-healthdata%>%
  select(Demo_sex_C, Behav_BingeDrinkDaysYear_N)%>%
  filter(Demo_sex_C %in% c("male","female"))

Data Analysis

Comparison of Means

Table comparing mean of continuous variable between groups
# `data` already holds only the two selected columns and the male/female rows
data%>%
  group_by(Demo_sex_C)%>%
  summarize(Behav_BingeDrinkDaysYear_N=mean(Behav_BingeDrinkDaysYear_N,na.rm=TRUE))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
##   Demo_sex_C Behav_BingeDrinkDaysYear_N
##   <chr>                           <dbl>
## 1 female                           6.99
## 2 male                            16.6
Interpretation
There seems to be a difference in means: females binge drink on average about 7.0 days a year, while males binge drink on average 16.6 days a year.
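A mean alone can hide how many respondents actually answered and how skewed the responses are. As a minimal sketch, the same group_by()/summarize() pattern could also report the number of non-missing responses and the median. The data frame toy_data and its values are made up for illustration and are not the NHIS data; the column names n_reported, mean_days, and median_days are likewise hypothetical.

```r
library(dplyr)

# Hypothetical toy data standing in for the NHIS extract
toy_data <- data.frame(
  Demo_sex_C = c("male", "male", "male", "female", "female", "female"),
  Behav_BingeDrinkDaysYear_N = c(0, 12, NA, 0, 3, 0)
)

toy_summary <- toy_data %>%
  group_by(Demo_sex_C) %>%
  summarize(
    # number of respondents with a non-missing answer
    n_reported = sum(!is.na(Behav_BingeDrinkDaysYear_N)),
    mean_days = mean(Behav_BingeDrinkDaysYear_N, na.rm = TRUE),
    median_days = median(Behav_BingeDrinkDaysYear_N, na.rm = TRUE)
  )

toy_summary
```

Reporting the median next to the mean makes it easier to see when a few heavy drinkers are pulling the group average up.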

Visualization

data%>%
  ggplot()+
  geom_histogram(aes(x=Behav_BingeDrinkDaysYear_N))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11819 rows containing non-finite values (stat_bin).

Interpretation
From the table and visualization, we see that, on average, males binge drink more than females do (16.6 days per year for males vs. 7.0 for females).

Comparison of Response Distributions between Males and Females

data%>%
  ggplot()+
  geom_histogram(aes(x=Behav_BingeDrinkDaysYear_N))+
  facet_wrap(~Demo_sex_C)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 11819 rows containing non-finite values (stat_bin).

Interpretation
From these graphs we can further see that females, on average, binge drink on fewer days per year than males do. However, the majority of both males and females report 0 days of binge drinking per year.

Sampling Distribution

Data Set 1 (male)
# data1 holds the male respondents
data1<-data%>%
  filter(Demo_sex_C=="male")
Data Set 2 (female)
# data2 holds the female respondents
data2<-data%>%
  filter(Demo_sex_C=="female")
Drawing 10,000 random samples of 40 respondents from each of the two data sets above, taking the mean of each sample, and then plotting the resulting sampling distributions for comparison. Both sampling distributions are shown on one graph.
male_data<-replicate(10000,
sample(data1$Behav_BingeDrinkDaysYear_N, 40)%>%mean(na.rm=TRUE)
)%>%
  data.frame()%>%
  rename("mean"=1)

female_data<-replicate(10000,
sample(data2$Behav_BingeDrinkDaysYear_N, 40)%>%mean(na.rm=TRUE)
)%>%
  data.frame()%>%
  rename("mean"=1)

ggplot()+
  geom_histogram(data=male_data,aes(x=mean),fill="blue")+
  geom_histogram(data=female_data,aes(x=mean),fill="pink")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Interpretation of sampling distributions
Because each distribution is built from the means of 10,000 random samples of 40 respondents, it is far closer to a normal distribution than the distribution of the raw responses. The key is that we are plotting sample means obtained by repeated random sampling. The male distribution still sits further to the right, showing that males on average binge drink on more days per year than females do.
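One detail worth noting: the code above draws 40 respondents first and only then drops missing values inside mean(), so a given sample mean may be based on fewer than 40 responses. A possible variant, sketched here on made-up values (toy_days is hypothetical and not the NHIS data, and the seed is arbitrary), removes the missing responses before sampling so that every mean uses exactly 40 values.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for one group's responses, including missing values
toy_days <- c(rep(0, 60), 1, 2, 3, 5, 10, 12, 24, 52, 100, 365, rep(NA, 30))

# Drop missing responses first so every sample really contains 40 values
observed_days <- toy_days[!is.na(toy_days)]

# Mean of each of 10,000 random samples of 40 observed responses
sample_means <- replicate(
  10000,
  mean(sample(observed_days, 40))
)

# The sampling distribution should center near the mean of the observed values
mean(sample_means)
mean(observed_days)
```

Whether this variant is preferable depends on the goal: it keeps the sample size fixed at 40, at the cost of describing only respondents who answered the question.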

T-Test

Hypotheses
Null Hypothesis: There is no difference in the mean number of binge drinking days per year between males and females.
Alt Hypothesis: There is a difference in the mean number of binge drinking days per year between males and females.
t.test(Behav_BingeDrinkDaysYear_N~Demo_sex_C, data=data)
## 
##  Welch Two Sample t-test
## 
## data:  Behav_BingeDrinkDaysYear_N by Demo_sex_C
## t = -16.253, df = 16765, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10.750953  -8.436967
## sample estimates:
## mean in group female   mean in group male 
##             6.987474            16.581434
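The difference between males and females in the average number of days per year on which they consume five or more drinks in one sitting is statistically significant. We can conclude this because the p-value from the t-test (p < 2.2e-16) is smaller than 0.05, so we reject the null hypothesis. The 95 percent confidence interval puts the female mean about 8.4 to 10.8 days per year below the male mean.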
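To make the t value in the output less of a black box, the sketch below computes a Welch statistic by hand and checks it against t.test(). The vectors group_a and group_b are made-up toy samples, not the NHIS data. The statistic is the difference in group means divided by a standard error that does not assume equal variances, which is also why t is negative in the output above: the first group (female) has the smaller mean.

```r
# Hypothetical toy samples, not the NHIS data
group_a <- c(0, 0, 1, 3, 12, 24)
group_b <- c(0, 2, 5, 10, 30, 52, 60)

# Welch statistic: difference in means over the unpooled standard error
se <- sqrt(var(group_a) / length(group_a) + var(group_b) / length(group_b))
t_manual <- (mean(group_a) - mean(group_b)) / se

# Compare with R's built-in Welch test (var.equal = FALSE is the default)
t_builtin <- unname(t.test(group_a, group_b)$statistic)

all.equal(t_manual, t_builtin)
```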