1. Variable Selection & Research Question:

(a) Identify one categorical variable (IV)

The categorical variable for this analysis is Demo_Region, which has 4 groups. After filtering, only 2 groups, South and West, are used in this analysis.

(b) Identify one continuous variable (DV)

The continuous variable for this analysis is Behav_AlcDaysPerYear_N.

(c) Hypothesis

I hypothesize that there is a relationship between Demo_Region and Behav_AlcDaysPerYear_N: the number of days per year on which a respondent drinks alcohol differs between the South region and the West region.

2. Data Preparation:

(a) Load Packages & Import Data

Load the necessary packages, then import the data into R and name it Health_Data.

library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Health_Data = read_csv("/Users/sakif/Desktop/Data 333/NHIS Data.csv")

## 
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Demo_Race = col_logical(),
##   Demo_Hispanic = col_character(),
##   Demo_RaceEthnicity = col_character(),
##   Demo_Region = col_character(),
##   Demo_sex_C = col_character(),
##   Demo_sexorien_C = col_logical(),
##   Demo_agerange_C = col_character(),
##   Demo_marital_C = col_character(),
##   Demo_hourswrk_C = col_character(),
##   MentalHealth_MentalIllnessK6_C = col_character(),
##   MentalHealth_depressionmeds_B = col_logical(),
##   Health_SelfRatedHealth_C = col_character(),
##   Health_diagnosed_STD5yr_B = col_logical(),
##   Health_BirthControlNow_B = col_logical(),
##   Health_EverHavePrediabetes_B = col_logical(),
##   Health_HIVAidsRisk_C = col_character(),
##   Health_BMI_C = col_character(),
##   Health_UsualPlaceHealthcare_C = col_character(),
##   Health_AbnormalPapPast3yr_B = col_logical(),
##   Behav_CigsPerDay_C = col_character()
##   # ... with 1 more columns
## )
## ℹ Use `spec()` for the full column specifications.

## Warning: 683386 parsing failures.
##   row       col           expected                            actual                                          file
## 68557 Demo_Race 1/0/T/F/TRUE/FALSE Black or African American         '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68558 Demo_Race 1/0/T/F/TRUE/FALSE Asian                             '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68559 Demo_Race 1/0/T/F/TRUE/FALSE American Indian or Alaskan Native '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68560 Demo_Race 1/0/T/F/TRUE/FALSE White                             '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68561 Demo_Race 1/0/T/F/TRUE/FALSE White                             '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## ..... ......... .................. ................................. .............................................
## See problems(...) for more details.

head(Health_Data)

## # A tibble: 6 x 50
##     psu sampweight  year year_strata Demo_Race Demo_Hispanic Demo_RaceEthnic…
##   <dbl>      <dbl> <dbl>       <dbl> <lgl>     <chr>         <chr>           
## 1     2       4316  1997       1998. NA        Hispanic      Hispanic (Race …
## 2     2       2845  1997       1998. NA        Hispanic      Hispanic (Race …
## 3     2       3783  1997       1998. NA        Hispanic      Hispanic (Race …
## 4     2       2466  1997       1998. NA        Hispanic      Hispanic (Race …
## 5     2       3794  1997       1998. NA        Hispanic      Hispanic (Race …
## 6     1       1793  1997       1998. NA        Hispanic      Hispanic (Race …
## # … with 43 more variables: Demo_Region <chr>, Demo_sex_C <chr>,
## #   Demo_sexorien_C <lgl>, Demo_belowpovertyline_B <dbl>, Demo_age_N <dbl>,
## #   Demo_agerange_C <chr>, Demo_marital_C <chr>, Demo_hourswrk_C <chr>,
## #   MentalHealth_MentalIllnessK6_N <dbl>, MentalHealth_MentalIllnessK6_C <chr>,
## #   MentalHealth_SeriousMentalIllnessK6_B <dbl>,
## #   MentalHealth_depressionmeds_B <lgl>, Health_SelfRatedHealth_C <chr>,
## #   Health_diagnosed_STD5yr_B <lgl>, Health_BirthControlNow_B <lgl>,
## #   Health_EverHaveHeartAttack_B <dbl>, Health_EverHaveHeartCondition_B <dbl>,
## #   Health_EverHaveCancer_B <dbl>, Health_EverHaveDiabetes_B <dbl>,
## #   Health_EverHavePrediabetes_B <lgl>, Health_EverHaveAsthma_B <dbl>,
## #   Health_StillHaveAsthma_B <dbl>, Health_HIVAidsRisk_C <chr>,
## #   Health_HIVAidsHighRisk_B <dbl>, Health_EverTakeHIVTest_B <dbl>,
## #   Health_EverHaveHypertension_B <dbl>, Health_BMI_N <dbl>,
## #   Health_BMI_C <chr>, Health_BMIOverweight_B <dbl>, Health_BMIObese_B <dbl>,
## #   Health_Weight_N <dbl>, Health_Height_N <dbl>,
## #   Health_UsualPlaceHealthcare_C <chr>, Health_UsualPlaceHealthcare_B <dbl>,
## #   Health_AbnormalPapPast3yr_B <lgl>, Behav_EverSmokeCigs_B <dbl>,
## #   Behav_CigsPerDay_N <dbl>, Behav_CigsPerDay_C <chr>,
## #   Behav_AgeStartSmoking <dbl>, Behav_AlcDaysPerYear_N <dbl>,
## #   Behav_AlcDaysPerWeek_N <dbl>, Behav_BingeDrinkDaysYear_N <dbl>,
## #   Behav_BingeDrinkDaysYear_C <chr>
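
The parsing failures reported above arise because read_csv() guessed Demo_Race to be a logical column from its leading rows (which appear to be empty) and then met text values such as "White". Demo_Race is not used in this analysis, so the results below are unaffected. One way to avoid the warning, sketched here on a small made-up file rather than the NHIS data, is to declare the column type explicitly through the col_types argument. The temporary file and its contents are assumptions for illustration only.

```r
library(readr)

# Toy file (hypothetical, not the NHIS data): Demo_Race is blank in the first row
toy_path <- tempfile(fileext = ".csv")
writeLines(c("Demo_Race,Demo_age_N",
             ",34",
             "White,51",
             "Asian,27"),
           toy_path)

# Declare Demo_Race as character so readr does not have to guess its type;
# columns not named in cols() are still guessed
toy <- read_csv(toy_path,
                col_types = cols(Demo_Race = col_character()))

toy$Demo_Race
```

The same col_types argument could be added to the read_csv() call that creates Health_Data, although that has not been run here.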

(b) Data Filtering & Storing

Keep only the Demo_Region and Behav_AlcDaysPerYear_N columns, and only those respondents who have a non-missing value for Behav_AlcDaysPerYear_N and who live in the South or West region. Store this filtered data in a new object called Alc_Days_Per_Year.

Alc_Days_Per_Year = Health_Data %>%
  select(Demo_Region, Behav_AlcDaysPerYear_N) %>%
  filter(!is.na(Behav_AlcDaysPerYear_N),
         Demo_Region %in% c("West", "South"))

Alc_Days_Per_Year

## # A tibble: 271,669 x 2
##    Demo_Region Behav_AlcDaysPerYear_N
##    <chr>                        <dbl>
##  1 West                             1
##  2 West                             3
##  3 West                             2
##  4 West                             5
##  5 West                             0
##  6 West                             4
##  7 West                            24
##  8 West                             4
##  9 West                             0
## 10 West                           260
## # … with 271,659 more rows

3. Comparison of Means:

(a) Table

Compare the mean of the continuous variable between the two groups.

Alc_Days_Per_Year %>%
  group_by(Demo_Region) %>%
  summarise(Avg_Alc_Days_Per_Year = mean(Behav_AlcDaysPerYear_N))

## `summarise()` ungrouping output (override with `.groups` argument)

## # A tibble: 2 x 2
##   Demo_Region Avg_Alc_Days_Per_Year
##   <chr>                       <dbl>
## 1 South                        58.5
## 2 West                         65.6

(b) Visualization

Visualize the mean of the continuous variable for the two groups.

Alc_Days_Per_Year %>%
  group_by(Demo_Region) %>%
  summarise(Avg_Alc_Days_Per_Year = mean(Behav_AlcDaysPerYear_N)) %>%
  ggplot()+
  geom_col(aes(x = Demo_Region, y = Avg_Alc_Days_Per_Year, fill = Demo_Region))

## `summarise()` ungrouping output (override with `.groups` argument)

(c) Interpretation

The visualization shows that respondents in the West drink alcohol on more days per year, on average, than respondents in the South (65.6 versus 58.5 days).

4. Comparison of Distributions:

(a) Visualization

Visualize the distribution of responses to the continuous variable by showing a separate histogram for each of the two groups.

Alc_Days_Per_Year %>%
  ggplot()+
  geom_histogram(aes(x = Behav_AlcDaysPerYear_N, fill = Demo_Region)) +
  facet_wrap(~Demo_Region)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

(b) Interpretation

In the South, more than 60,000 respondents answered 0, which means they did not drink alcohol at all in the past year. In the West, only about 40,000 respondents answered 0. In the range of 1 to 100 days, the South has more respondents than the West. In the range of 101 to 365 days, however, the West has more respondents than the South. Overall, we can conclude that respondents in the West drink alcohol on more days per year than respondents in the South.
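
If the two regions contribute different numbers of respondents, raw histogram counts are easier to compare once they are converted to within-region shares. A minimal sketch on made-up data (the toy data frame and the name zero_share are assumptions, not the NHIS values):

```r
library(dplyr)

# Hypothetical responses: 4 respondents in the South, 3 in the West
toy <- data.frame(
  Demo_Region = c("South", "South", "South", "South", "West", "West", "West"),
  Behav_AlcDaysPerYear_N = c(0, 0, 12, 52, 0, 24, 260)
)

# Share of respondents in each region who reported 0 drinking days
zero_share <- toy %>%
  group_by(Demo_Region) %>%
  summarise(n = n(),
            share_zero = mean(Behav_AlcDaysPerYear_N == 0))

zero_share
```

In this toy example half of the South respondents and one third of the West respondents report 0 days. The same summarise() call applied to Alc_Days_Per_Year would give the corresponding shares for the real data.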

5. Sampling Distribution & T-test:

(a) Sampling Distribution

Produce two new data objects: one which contains only the first group, and one which contains only the second group. For each group, draw 10,000 samples of 40 respondents and calculate the mean of the continuous variable for each of those 10,000 samples. Store these 10,000 means in new objects.

South = Alc_Days_Per_Year %>%
  filter(Demo_Region == "South")

West = Alc_Days_Per_Year %>%
  filter(Demo_Region == "West")

South_Sample_Dist = replicate(10000, sample(South$Behav_AlcDaysPerYear_N, 40) %>%
  mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)

West_Sample_Dist = replicate(10000, sample(West$Behav_AlcDaysPerYear_N, 40) %>%
  mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)

ggplot()+
  geom_histogram(data = South_Sample_Dist, aes(x = mean), fill = "red") +
  geom_histogram(data = West_Sample_Dist, aes(x = mean), fill = "blue")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
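
Sampling distributions built this way are expected to be roughly bell-shaped even when the raw responses are heavily skewed, with a spread close to the group's standard deviation divided by the square root of the sample size (here 40). The sketch below checks this on simulated skewed data; the simulated population, the seed and the object names are assumptions, not the NHIS data.

```r
set.seed(333)

# Hypothetical skewed "days per year" population: many zeros, long right tail,
# capped at 365 days
toy_pop <- c(rep(0, 3000), round(rexp(7000, rate = 1 / 80)))
toy_pop <- pmin(toy_pop, 365)

n <- 40

# 10,000 samples of 40 values each, keeping only the mean of each sample
toy_means <- replicate(10000, mean(sample(toy_pop, n)))

sd(toy_means)           # observed spread of the 10,000 sample means
sd(toy_pop) / sqrt(n)   # spread predicted by the standard error formula
```

The two printed values should be close to each other, which is the property the t-test in the next step relies on.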

(b) T-test

Below are the results of the t-test. It tells us whether the difference in means between the two groups is statistically significant, given that their sampling distributions are approximately normal.

t.test(Behav_AlcDaysPerYear_N ~ Demo_Region, data = Alc_Days_Per_Year)

## 
##  Welch Two Sample t-test
## 
## data:  Behav_AlcDaysPerYear_N by Demo_Region
## t = -18.534, df = 232840, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -7.859894 -6.356504
## sample estimates:
## mean in group South  mean in group West 
##            58.50267            65.61087

(c) Interpret

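There is a statistically significant difference between the South and the West in the mean number of days per year on which respondents drink alcohol (t = -18.534, p < 2.2e-16). Respondents in the West report about 7 more drinking days per year on average than respondents in the South (65.6 versus 58.5; 95% confidence interval for South minus West: -7.86 to -6.36).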
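
For reference, the Welch statistic reported by t.test() is the difference in group means divided by sqrt(s1^2/n1 + s2^2/n2), where s1 and s2 are the group standard deviations and n1 and n2 the group sizes. The sketch below recomputes it by hand on simulated data (the toy groups, the seed and the object names are assumptions, not the NHIS data) and compares it with the value returned by t.test().

```r
set.seed(333)

# Hypothetical data: 200 South and 150 West respondents
toy <- data.frame(
  Demo_Region = rep(c("South", "West"), times = c(200, 150)),
  Behav_AlcDaysPerYear_N = c(rpois(200, 58), rpois(150, 66))
)

south <- toy$Behav_AlcDaysPerYear_N[toy$Demo_Region == "South"]
west  <- toy$Behav_AlcDaysPerYear_N[toy$Demo_Region == "West"]

# Welch t: difference in means over the unpooled standard error
t_by_hand <- (mean(south) - mean(west)) /
  sqrt(var(south) / length(south) + var(west) / length(west))

t_builtin <- t.test(Behav_AlcDaysPerYear_N ~ Demo_Region, data = toy)$statistic

c(by_hand = t_by_hand, builtin = unname(t_builtin))
```

Because the formula interface orders the groups alphabetically, the statistic is South minus West, which is why the t value and confidence interval in the output above are negative.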

Analysis of Continuous Data

Sakif Shadman