In this assignment by using the NHIS data, I hypothesize there is a relationship between Sex and BMI(Body Mass Index). The relation is how BMI condition is differ between male and female. I will do following steps to test this hypothesis.
Loading the necessary packages.
library(readr)
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Importing data into R and named it Health_Data.
Health_Data = read_csv("/Users/sakif/Downloads/NHIS Data.csv")
##
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
## cols(
## .default = col_double(),
## Demo_Race = col_logical(),
## Demo_Hispanic = col_character(),
## Demo_RaceEthnicity = col_character(),
## Demo_Region = col_character(),
## Demo_sex_C = col_character(),
## Demo_sexorien_C = col_logical(),
## Demo_agerange_C = col_character(),
## Demo_marital_C = col_character(),
## Demo_hourswrk_C = col_character(),
## MentalHealth_MentalIllnessK6_C = col_character(),
## MentalHealth_depressionmeds_B = col_logical(),
## Health_SelfRatedHealth_C = col_character(),
## Health_diagnosed_STD5yr_B = col_logical(),
## Health_BirthControlNow_B = col_logical(),
## Health_EverHavePrediabetes_B = col_logical(),
## Health_HIVAidsRisk_C = col_character(),
## Health_BMI_C = col_character(),
## Health_UsualPlaceHealthcare_C = col_character(),
## Health_AbnormalPapPast3yr_B = col_logical(),
## Behav_CigsPerDay_C = col_character()
## # ... with 1 more columns
## )
## ℹ Use `spec()` for the full column specifications.
## Warning: 683386 parsing failures.
## row col expected actual file
## 68557 Demo_Race 1/0/T/F/TRUE/FALSE Black or African American '/Users/sakif/Downloads/NHIS Data.csv'
## 68558 Demo_Race 1/0/T/F/TRUE/FALSE Asian '/Users/sakif/Downloads/NHIS Data.csv'
## 68559 Demo_Race 1/0/T/F/TRUE/FALSE American Indian or Alaskan Native '/Users/sakif/Downloads/NHIS Data.csv'
## 68560 Demo_Race 1/0/T/F/TRUE/FALSE White '/Users/sakif/Downloads/NHIS Data.csv'
## 68561 Demo_Race 1/0/T/F/TRUE/FALSE White '/Users/sakif/Downloads/NHIS Data.csv'
## ..... ......... .................. ................................. ......................................
## See problems(...) for more details.
head(Health_Data)
## # A tibble: 6 x 50
## psu sampweight year year_strata Demo_Race Demo_Hispanic Demo_RaceEthnic…
## <dbl> <dbl> <dbl> <dbl> <lgl> <chr> <chr>
## 1 2 4316 1997 1998. NA Hispanic Hispanic (Race …
## 2 2 2845 1997 1998. NA Hispanic Hispanic (Race …
## 3 2 3783 1997 1998. NA Hispanic Hispanic (Race …
## 4 2 2466 1997 1998. NA Hispanic Hispanic (Race …
## 5 2 3794 1997 1998. NA Hispanic Hispanic (Race …
## 6 1 1793 1997 1998. NA Hispanic Hispanic (Race …
## # … with 43 more variables: Demo_Region <chr>, Demo_sex_C <chr>,
## # Demo_sexorien_C <lgl>, Demo_belowpovertyline_B <dbl>, Demo_age_N <dbl>,
## # Demo_agerange_C <chr>, Demo_marital_C <chr>, Demo_hourswrk_C <chr>,
## # MentalHealth_MentalIllnessK6_N <dbl>, MentalHealth_MentalIllnessK6_C <chr>,
## # MentalHealth_SeriousMentalIllnessK6_B <dbl>,
## # MentalHealth_depressionmeds_B <lgl>, Health_SelfRatedHealth_C <chr>,
## # Health_diagnosed_STD5yr_B <lgl>, Health_BirthControlNow_B <lgl>,
## # Health_EverHaveHeartAttack_B <dbl>, Health_EverHaveHeartCondition_B <dbl>,
## # Health_EverHaveCancer_B <dbl>, Health_EverHaveDiabetes_B <dbl>,
## # Health_EverHavePrediabetes_B <lgl>, Health_EverHaveAsthma_B <dbl>,
## # Health_StillHaveAsthma_B <dbl>, Health_HIVAidsRisk_C <chr>,
## # Health_HIVAidsHighRisk_B <dbl>, Health_EverTakeHIVTest_B <dbl>,
## # Health_EverHaveHypertension_B <dbl>, Health_BMI_N <dbl>,
## # Health_BMI_C <chr>, Health_BMIOverweight_B <dbl>, Health_BMIObese_B <dbl>,
## # Health_Weight_N <dbl>, Health_Height_N <dbl>,
## # Health_UsualPlaceHealthcare_C <chr>, Health_UsualPlaceHealthcare_B <dbl>,
## # Health_AbnormalPapPast3yr_B <lgl>, Behav_EverSmokeCigs_B <dbl>,
## # Behav_CigsPerDay_N <dbl>, Behav_CigsPerDay_C <chr>,
## # Behav_AgeStartSmoking <dbl>, Behav_AlcDaysPerYear_N <dbl>,
## # Behav_AlcDaysPerWeek_N <dbl>, Behav_BingeDrinkDaysYear_N <dbl>,
## # Behav_BingeDrinkDaysYear_C <chr>
Identifing two categorical variable named Demo_sex_C and Health_BMI_C. After renaming both to Sex and BMI; we can find the relationship between the both variables and named it as BMI_Per_Sex. It shows how BMI condiion differs to male and female.
BMI_Per_Sex = Health_Data %>%
select(Demo_sex_C, Health_BMI_C) %>%
rename(Sex = Demo_sex_C, BMI = Health_BMI_C) %>%
filter(BMI %in% c("Underweight", "Normal", "Overweight", "Obese", "Exremely Obese"))
BMI_Per_Sex
## # A tibble: 595,109 x 2
## Sex BMI
## <chr> <chr>
## 1 female Normal
## 2 female Overweight
## 3 male Obese
## 4 male Normal
## 5 male Normal
## 6 female Overweight
## 7 male Normal
## 8 female Normal
## 9 female Normal
## 10 male Normal
## # … with 595,099 more rows
Distributing data to Null hypothesis and Actual distribution. So we can compare between them.
Distributing both data separately to see their impects.
It shows the actual (%) of responses given for each catagory of Sex.
table(BMI_Per_Sex$Sex) %>%
prop.table() %>%
round(2)
##
## female male
## 0.55 0.45
It shows the actual (%) of responses given for each catagory of BMI.
table(BMI_Per_Sex$BMI) %>%
prop.table() %>%
round(2)
##
## Exremely Obese Normal Obese Overweight Underweight
## 0.04 0.36 0.22 0.35 0.03
It shows the actual (%) of responses given for both catagory together.
table(BMI_Per_Sex$BMI, BMI_Per_Sex$Sex) %>%
prop.table() %>%
round(2)
##
## female male
## Exremely Obese 0.03 0.01
## Normal 0.22 0.14
## Obese 0.12 0.10
## Overweight 0.16 0.19
## Underweight 0.02 0.01
There is no difference between null hypothesis table and actual distribution table. Both got exact value.
Calculating column% to highlight the relationship of interest between the variables.
table(BMI_Per_Sex$BMI, BMI_Per_Sex$Sex) %>%
prop.table(2)
##
## female male
## Exremely Obese 0.04695415 0.02604850
## Normal 0.40646484 0.30851167
## Obese 0.21795071 0.22604100
## Overweight 0.28605022 0.42572992
## Underweight 0.04258009 0.01366890
It’s clearly showing that among the responded, 40% female have the normal weight where 42.5% male got overweight.
Visualizing the results of the column% table.
BMI_Per_Sex %>%
group_by(Sex, BMI)%>%
summarize(n = n()) %>%
mutate(Percent = n/sum(n)) %>%
ggplot() +
geom_col(aes(x = Sex, y = Percent, fill = BMI))
## `summarise()` regrouping output by 'Sex' (override with `.groups` argument)
Calculating a chi-square test to determine if there is a statistically significant relationship between the variables.
chisq.test(BMI_Per_Sex$BMI, BMI_Per_Sex$Sex)
##
## Pearson's Chi-squared test
##
## data: BMI_Per_Sex$BMI and BMI_Per_Sex$Sex
## X-squared = 18040, df = 4, p-value < 2.2e-16
There is a statistically significant relationship between Sex and BMI. It’s showing most of the female have normal weight where most of the male have overweight.