The categorical variable for this analysis is Demo_Region, which has 4 groups. After filtering, only 2 groups, South and West, are used in this analysis.
The continuous variable for this analysis is Behav_AlcDaysPerYear_N.
I hypothesize that there is a relationship between Demo_Region and Behav_AlcDaysPerYear_N: the number of days per year on which a respondent drinks alcohol differs between the South region and the West region.
Load the necessary packages, then import the data into R and name it Health_Data.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Health_Data = read_csv("/Users/sakif/Desktop/Data 333/NHIS Data.csv")
##
## ── Column specification ───────────────────────────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Demo_Race = col_logical(),
## Demo_Hispanic = col_character(),
## Demo_RaceEthnicity = col_character(),
## Demo_Region = col_character(),
## Demo_sex_C = col_character(),
## Demo_sexorien_C = col_logical(),
## Demo_agerange_C = col_character(),
## Demo_marital_C = col_character(),
## Demo_hourswrk_C = col_character(),
## MentalHealth_MentalIllnessK6_C = col_character(),
## MentalHealth_depressionmeds_B = col_logical(),
## Health_SelfRatedHealth_C = col_character(),
## Health_diagnosed_STD5yr_B = col_logical(),
## Health_BirthControlNow_B = col_logical(),
## Health_EverHavePrediabetes_B = col_logical(),
## Health_HIVAidsRisk_C = col_character(),
## Health_BMI_C = col_character(),
## Health_UsualPlaceHealthcare_C = col_character(),
## Health_AbnormalPapPast3yr_B = col_logical(),
## Behav_CigsPerDay_C = col_character()
## # ... with 1 more columns
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 683386 parsing failures.
## row col expected actual file
## 68557 Demo_Race 1/0/T/F/TRUE/FALSE Black or African American '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68558 Demo_Race 1/0/T/F/TRUE/FALSE Asian '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68559 Demo_Race 1/0/T/F/TRUE/FALSE American Indian or Alaskan Native '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68560 Demo_Race 1/0/T/F/TRUE/FALSE White '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## 68561 Demo_Race 1/0/T/F/TRUE/FALSE White '/Users/sakif/Desktop/Data 333/NHIS Data.csv'
## ..... ......... .................. ................................. .............................................
## See problems(...) for more details.
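The parsing failures above arise because Demo_Race was guessed to be a logical column, so later text values such as "White" could not be parsed. One possible way to avoid this is to declare the column type explicitly. The sketch below uses a small hypothetical file (the names tmp and fixed and the id column are illustrative, not part of the NHIS data); raising the guess_max argument of read_csv is another option.

```r
library(readr)

# Hypothetical miniature file: Demo_Race is empty in the first rows and
# only contains text further down, similar to the pattern in the NHIS file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,Demo_Race", "1,", "2,", "3,", "4,White", "5,Asian"), tmp)

# Declaring the type up front means readr does not have to guess it
# from the empty leading rows.
fixed <- read_csv(
  tmp,
  col_types = cols(id = col_double(), Demo_Race = col_character())
)

fixed$Demo_Race
```

Demo_Race is not used in this analysis, so the failures do not affect the results below, but declaring the type keeps the import clean.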
head(Health_Data)
## # A tibble: 6 x 50
## psu sampweight year year_strata Demo_Race Demo_Hispanic Demo_RaceEthnic…
## <dbl> <dbl> <dbl> <dbl> <lgl> <chr> <chr>
## 1 2 4316 1997 1998. NA Hispanic Hispanic (Race …
## 2 2 2845 1997 1998. NA Hispanic Hispanic (Race …
## 3 2 3783 1997 1998. NA Hispanic Hispanic (Race …
## 4 2 2466 1997 1998. NA Hispanic Hispanic (Race …
## 5 2 3794 1997 1998. NA Hispanic Hispanic (Race …
## 6 1 1793 1997 1998. NA Hispanic Hispanic (Race …
## # … with 43 more variables: Demo_Region <chr>, Demo_sex_C <chr>,
## # Demo_sexorien_C <lgl>, Demo_belowpovertyline_B <dbl>, Demo_age_N <dbl>,
## # Demo_agerange_C <chr>, Demo_marital_C <chr>, Demo_hourswrk_C <chr>,
## # MentalHealth_MentalIllnessK6_N <dbl>, MentalHealth_MentalIllnessK6_C <chr>,
## # MentalHealth_SeriousMentalIllnessK6_B <dbl>,
## # MentalHealth_depressionmeds_B <lgl>, Health_SelfRatedHealth_C <chr>,
## # Health_diagnosed_STD5yr_B <lgl>, Health_BirthControlNow_B <lgl>,
## # Health_EverHaveHeartAttack_B <dbl>, Health_EverHaveHeartCondition_B <dbl>,
## # Health_EverHaveCancer_B <dbl>, Health_EverHaveDiabetes_B <dbl>,
## # Health_EverHavePrediabetes_B <lgl>, Health_EverHaveAsthma_B <dbl>,
## # Health_StillHaveAsthma_B <dbl>, Health_HIVAidsRisk_C <chr>,
## # Health_HIVAidsHighRisk_B <dbl>, Health_EverTakeHIVTest_B <dbl>,
## # Health_EverHaveHypertension_B <dbl>, Health_BMI_N <dbl>,
## # Health_BMI_C <chr>, Health_BMIOverweight_B <dbl>, Health_BMIObese_B <dbl>,
## # Health_Weight_N <dbl>, Health_Height_N <dbl>,
## # Health_UsualPlaceHealthcare_C <chr>, Health_UsualPlaceHealthcare_B <dbl>,
## # Health_AbnormalPapPast3yr_B <lgl>, Behav_EverSmokeCigs_B <dbl>,
## # Behav_CigsPerDay_N <dbl>, Behav_CigsPerDay_C <chr>,
## # Behav_AgeStartSmoking <dbl>, Behav_AlcDaysPerYear_N <dbl>,
## # Behav_AlcDaysPerWeek_N <dbl>, Behav_BingeDrinkDaysYear_N <dbl>,
## # Behav_BingeDrinkDaysYear_C <chr>
Keep only the Demo_Region and Behav_AlcDaysPerYear_N columns, drop respondents with a missing value for Behav_AlcDaysPerYear_N, and keep only respondents in the South and West regions. Store the filtered data in a new object called Alc_Days_Per_Year.
Alc_Days_Per_Year = Health_Data %>%
  select(Demo_Region, Behav_AlcDaysPerYear_N) %>%
  filter(!is.na(Behav_AlcDaysPerYear_N),
         Demo_Region %in% c("West", "South"))
Alc_Days_Per_Year
## # A tibble: 271,669 x 2
## Demo_Region Behav_AlcDaysPerYear_N
## <chr> <dbl>
## 1 West 1
## 2 West 3
## 3 West 2
## 4 West 5
## 5 West 0
## 6 West 4
## 7 West 24
## 8 West 4
## 9 West 0
## 10 West 260
## # … with 271,659 more rows
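To make the effect of this step concrete, here is a minimal sketch on a made-up five-row data frame (toy, kept, and the values in them are illustrative assumptions, not NHIS data). It applies the same select and filter calls and shows which rows survive.

```r
library(dplyr)

# Made-up rows covering the cases the filter has to handle
toy <- data.frame(
  Demo_Region = c("West", "South", "Midwest", "Northeast", "South"),
  Behav_AlcDaysPerYear_N = c(12, NA, 52, 0, 365),
  Demo_age_N = c(30, 41, 25, 67, 52)
)

kept <- toy %>%
  select(Demo_Region, Behav_AlcDaysPerYear_N) %>%   # drop unused columns
  filter(!is.na(Behav_AlcDaysPerYear_N),            # drop missing responses
         Demo_Region %in% c("West", "South"))       # keep the two regions

kept
```

Only the West row with 12 days and the South row with 365 days remain: the South row with a missing value and the Midwest and Northeast rows are removed.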
Compare the mean of the continuous variable between the two groups.
Alc_Days_Per_Year %>%
  group_by(Demo_Region) %>%
  summarise(Avg_Alc_Days_Per_Year = mean(Behav_AlcDaysPerYear_N))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 2 x 2
## Demo_Region Avg_Alc_Days_Per_Year
## <chr> <dbl>
## 1 South 58.5
## 2 West 65.6
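Because days of drinking per year is likely to be right-skewed, the mean alone can hide a lot. A possible extension, sketched here on simulated data (toy and region_summary are hypothetical names, and the values are not from NHIS), reports the group size, median, and standard deviation next to the mean. Passing .groups = "drop" also silences the ungrouping message shown above.

```r
library(dplyr)

set.seed(1)
# Simulated stand-in for Alc_Days_Per_Year (not the NHIS data)
toy <- data.frame(
  Demo_Region = rep(c("South", "West"), times = c(600, 400)),
  Behav_AlcDaysPerYear_N = c(rpois(600, 2) * 10, rpois(400, 3) * 10)
)

region_summary <- toy %>%
  group_by(Demo_Region) %>%
  summarise(
    n      = n(),                               # respondents per region
    mean   = mean(Behav_AlcDaysPerYear_N),
    median = median(Behav_AlcDaysPerYear_N),
    sd     = sd(Behav_AlcDaysPerYear_N),
    .groups = "drop"                            # return an ungrouped result
  )

region_summary
```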
Visualize the mean of the continuous variable for the two groups.
Alc_Days_Per_Year %>%
  group_by(Demo_Region) %>%
  summarise(Avg_Alc_Days_Per_Year = mean(Behav_AlcDaysPerYear_N)) %>%
  ggplot() +
  geom_col(aes(x = Demo_Region, y = Avg_Alc_Days_Per_Year, fill = Demo_Region))
## `summarise()` ungrouping output (override with `.groups` argument)
The visualization shows that respondents in the West region drink alcohol on more days per year, on average, than respondents in the South region.
Visualize the distribution of responses to the continuous variable by showing a separate histogram for each of the two groups.
Alc_Days_Per_Year %>%
  ggplot() +
  geom_histogram(aes(x = Behav_AlcDaysPerYear_N, fill = Demo_Region)) +
  facet_wrap(~Demo_Region)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the South, more than 60,000 respondents answered 0, which means they did not drink alcohol at all in the past year, whereas in the West only about 40,000 respondents answered 0. In the range of 1-100 days, the South histogram shows more respondents than the West. In the range of 101-365 days, however, the West shows more respondents than the South. Overall, we can conclude that respondents in the West drink alcohol on more days per year than respondents in the South.
Produce two new data objects: one containing only the first group (South) and one containing only the second group (West). For each group, draw 10,000 samples of 40 respondents and calculate the mean of the continuous variable for each sample. Store these 10,000 means in new objects.
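Raw histogram counts are only directly comparable if the two regions contain the same number of respondents. If they do not, a share within each region is easier to interpret. The sketch below uses a small made-up data frame (toy and zero_share are hypothetical names, not NHIS data) in which the South has twice as many zero responses as the West but the same share of zeros.

```r
# Made-up data: 10 South respondents and 5 West respondents
toy <- data.frame(
  Demo_Region = rep(c("South", "West"), times = c(10, 5)),
  Behav_AlcDaysPerYear_N = c(0, 0, 0, 0, 12, 24, 52, 52, 104, 260,
                             0, 0, 24, 156, 365)
)

# Count of zero responses per region (4 in the South, 2 in the West)
zero_count <- tapply(toy$Behav_AlcDaysPerYear_N == 0, toy$Demo_Region, sum)

# Share of zero responses per region: the mean of a logical vector is a proportion
zero_share <- tapply(toy$Behav_AlcDaysPerYear_N == 0, toy$Demo_Region, mean)

zero_count
zero_share
```

Applied to Alc_Days_Per_Year, the same tapply call would show whether the larger number of zero responses in the South reflects a larger share of non-drinkers or simply a larger group.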
# Split the filtered data into one object per region
South = Alc_Days_Per_Year %>%
  filter(Demo_Region == "South")
West = Alc_Days_Per_Year %>%
  filter(Demo_Region == "West")
# Draw 10,000 samples of 40 respondents per region and keep each sample mean
South_Sample_Dist = replicate(10000, sample(South$Behav_AlcDaysPerYear_N, 40) %>%
                                mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)
West_Sample_Dist = replicate(10000, sample(West$Behav_AlcDaysPerYear_N, 40) %>%
                               mean(na.rm = TRUE)) %>%
  data.frame() %>%
  rename("mean" = 1)
# Overlay both sampling distributions; alpha keeps both visible where they overlap
ggplot() +
  geom_histogram(data = South_Sample_Dist, aes(x = mean), fill = "red", alpha = 0.5) +
  geom_histogram(data = West_Sample_Dist, aes(x = mean), fill = "blue", alpha = 0.5)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
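One way to check that a sampling distribution like these behaves as expected is to compare the spread of the sample means with the theoretical standard error, sd / sqrt(n). The following is a minimal sketch on a simulated, right-skewed population (population, sample_means, and the exponential distribution are assumptions made for illustration, not the NHIS values). It uses fewer replicates than the analysis above to keep it quick.

```r
set.seed(42)

# Simulated right-skewed stand-in for one region's responses (not NHIS data)
population <- rexp(5000, rate = 1 / 60)

# Draw 5,000 samples of 40 values and keep the mean of each sample
sample_means <- replicate(5000, mean(sample(population, 40)))

# The spread of the sample means should be close to sd / sqrt(n)
observed_se <- sd(sample_means)
expected_se <- sd(population) / sqrt(40)

c(observed = observed_se, expected = expected_se)
```

Even though the simulated population is skewed, the sample means cluster around the population mean with roughly the expected spread, which is the property the t-test relies on.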
Below are the results of the t-test. It tells us whether the difference in means between two groups with normally distributed sampling distributions is statistically significant.
t.test(Behav_AlcDaysPerYear_N ~ Demo_Region, data = Alc_Days_Per_Year)
##
## Welch Two Sample t-test
##
## data: Behav_AlcDaysPerYear_N by Demo_Region
## t = -18.534, df = 232840, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -7.859894 -6.356504
## sample estimates:
## mean in group South mean in group West
## 58.50267 65.61087
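There is a statistically significant difference between the South and the West in the mean number of days per year on which a respondent drinks alcohol (t = -18.534, p < 2.2e-16). Respondents in the West drink on about 7 more days per year on average: the 95 percent confidence interval for the South mean minus the West mean is -7.86 to -6.36.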
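The Welch statistic that t.test reports can be reproduced by hand, which makes clear what the test measures: the difference in means divided by an unpooled standard error. The sketch below does this on simulated data (south, west, and the exponential draws are illustrative assumptions, not the NHIS responses) and compares the result with t.test.

```r
set.seed(7)

# Simulated stand-ins for the two regions (not NHIS data)
south <- rexp(500, rate = 1 / 58)
west  <- rexp(400, rate = 1 / 66)

# Per-group variance of the mean
v_south <- var(south) / length(south)
v_west  <- var(west) / length(west)

# Welch t statistic: difference in means over the unpooled standard error
t_manual <- (mean(south) - mean(west)) / sqrt(v_south + v_west)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (v_south + v_west)^2 /
  (v_south^2 / (length(south) - 1) + v_west^2 / (length(west) - 1))

# t.test uses the Welch version by default (var.equal = FALSE)
welch <- t.test(south, west)

c(t_manual = t_manual, t_test = unname(welch$statistic))
c(df_manual = df_manual, df_test = unname(welch$parameter))
```

With samples as large as those in this analysis, even a modest difference in means produces a very small p-value, so the size of the difference (about 7 days per year) is worth reporting alongside the significance result.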