Introduction

In this assignment by using the NHIS data, I hypothesize there is a relationship between Sex and BMI(Body Mass Index). The relation is how BMI condition is differ between male and female. I will do following steps to test this hypothesis.

1. Load Packages

Loading the necessary packages.

library(readr)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

2. Import Data

Importing data into R and named it Health_Data.

Health_Data = read_csv("/Users/sakif/Downloads/NHIS Data.csv")

## 
## ── Column specification ────────────────────────────────────────────────────────────────────────────────────
## cols(
##   .default = col_double(),
##   Demo_Race = col_logical(),
##   Demo_Hispanic = col_character(),
##   Demo_RaceEthnicity = col_character(),
##   Demo_Region = col_character(),
##   Demo_sex_C = col_character(),
##   Demo_sexorien_C = col_logical(),
##   Demo_agerange_C = col_character(),
##   Demo_marital_C = col_character(),
##   Demo_hourswrk_C = col_character(),
##   MentalHealth_MentalIllnessK6_C = col_character(),
##   MentalHealth_depressionmeds_B = col_logical(),
##   Health_SelfRatedHealth_C = col_character(),
##   Health_diagnosed_STD5yr_B = col_logical(),
##   Health_BirthControlNow_B = col_logical(),
##   Health_EverHavePrediabetes_B = col_logical(),
##   Health_HIVAidsRisk_C = col_character(),
##   Health_BMI_C = col_character(),
##   Health_UsualPlaceHealthcare_C = col_character(),
##   Health_AbnormalPapPast3yr_B = col_logical(),
##   Behav_CigsPerDay_C = col_character()
##   # ... with 1 more columns
## )
## ℹ Use `spec()` for the full column specifications.

## Warning: 683386 parsing failures.
##   row       col           expected                            actual                                   file
## 68557 Demo_Race 1/0/T/F/TRUE/FALSE Black or African American         '/Users/sakif/Downloads/NHIS Data.csv'
## 68558 Demo_Race 1/0/T/F/TRUE/FALSE Asian                             '/Users/sakif/Downloads/NHIS Data.csv'
## 68559 Demo_Race 1/0/T/F/TRUE/FALSE American Indian or Alaskan Native '/Users/sakif/Downloads/NHIS Data.csv'
## 68560 Demo_Race 1/0/T/F/TRUE/FALSE White                             '/Users/sakif/Downloads/NHIS Data.csv'
## 68561 Demo_Race 1/0/T/F/TRUE/FALSE White                             '/Users/sakif/Downloads/NHIS Data.csv'
## ..... ......... .................. ................................. ......................................
## See problems(...) for more details.

head(Health_Data)

## # A tibble: 6 x 50
##     psu sampweight  year year_strata Demo_Race Demo_Hispanic Demo_RaceEthnic…
##   <dbl>      <dbl> <dbl>       <dbl> <lgl>     <chr>         <chr>           
## 1     2       4316  1997       1998. NA        Hispanic      Hispanic (Race …
## 2     2       2845  1997       1998. NA        Hispanic      Hispanic (Race …
## 3     2       3783  1997       1998. NA        Hispanic      Hispanic (Race …
## 4     2       2466  1997       1998. NA        Hispanic      Hispanic (Race …
## 5     2       3794  1997       1998. NA        Hispanic      Hispanic (Race …
## 6     1       1793  1997       1998. NA        Hispanic      Hispanic (Race …
## # … with 43 more variables: Demo_Region <chr>, Demo_sex_C <chr>,
## #   Demo_sexorien_C <lgl>, Demo_belowpovertyline_B <dbl>, Demo_age_N <dbl>,
## #   Demo_agerange_C <chr>, Demo_marital_C <chr>, Demo_hourswrk_C <chr>,
## #   MentalHealth_MentalIllnessK6_N <dbl>, MentalHealth_MentalIllnessK6_C <chr>,
## #   MentalHealth_SeriousMentalIllnessK6_B <dbl>,
## #   MentalHealth_depressionmeds_B <lgl>, Health_SelfRatedHealth_C <chr>,
## #   Health_diagnosed_STD5yr_B <lgl>, Health_BirthControlNow_B <lgl>,
## #   Health_EverHaveHeartAttack_B <dbl>, Health_EverHaveHeartCondition_B <dbl>,
## #   Health_EverHaveCancer_B <dbl>, Health_EverHaveDiabetes_B <dbl>,
## #   Health_EverHavePrediabetes_B <lgl>, Health_EverHaveAsthma_B <dbl>,
## #   Health_StillHaveAsthma_B <dbl>, Health_HIVAidsRisk_C <chr>,
## #   Health_HIVAidsHighRisk_B <dbl>, Health_EverTakeHIVTest_B <dbl>,
## #   Health_EverHaveHypertension_B <dbl>, Health_BMI_N <dbl>,
## #   Health_BMI_C <chr>, Health_BMIOverweight_B <dbl>, Health_BMIObese_B <dbl>,
## #   Health_Weight_N <dbl>, Health_Height_N <dbl>,
## #   Health_UsualPlaceHealthcare_C <chr>, Health_UsualPlaceHealthcare_B <dbl>,
## #   Health_AbnormalPapPast3yr_B <lgl>, Behav_EverSmokeCigs_B <dbl>,
## #   Behav_CigsPerDay_N <dbl>, Behav_CigsPerDay_C <chr>,
## #   Behav_AgeStartSmoking <dbl>, Behav_AlcDaysPerYear_N <dbl>,
## #   Behav_AlcDaysPerWeek_N <dbl>, Behav_BingeDrinkDaysYear_N <dbl>,
## #   Behav_BingeDrinkDaysYear_C <chr>

3. Prepare Data

Identifing two categorical variable named Demo_sex_C and Health_BMI_C. After renaming both to Sex and BMI; we can find the relationship between the both variables and named it as BMI_Per_Sex. It shows how BMI condiion differs to male and female.

BMI_Per_Sex = Health_Data %>%
  select(Demo_sex_C, Health_BMI_C) %>%
  rename(Sex = Demo_sex_C, BMI = Health_BMI_C) %>%
  filter(BMI %in% c("Underweight", "Normal", "Overweight", "Obese", "Exremely Obese"))

BMI_Per_Sex

## # A tibble: 595,109 x 2
##    Sex    BMI       
##    <chr>  <chr>     
##  1 female Normal    
##  2 female Overweight
##  3 male   Obese     
##  4 male   Normal    
##  5 male   Normal    
##  6 female Overweight
##  7 male   Normal    
##  8 female Normal    
##  9 female Normal    
## 10 male   Normal    
## # … with 595,099 more rows

4. Table Distribution

Distributing data to Null hypothesis and Actual distribution. So we can compare between them.

(a) Null Hypothesis

Distributing both data separately to see their impects.

I. Distribution of Sex

It shows the actual (%) of responses given for each catagory of Sex.

table(BMI_Per_Sex$Sex) %>%
  prop.table() %>%
  round(2)

## 
## female   male 
##   0.55   0.45

II. Distribution of BMI

It shows the actual (%) of responses given for each catagory of BMI.

table(BMI_Per_Sex$BMI) %>%
  prop.table() %>%
  round(2)

## 
## Exremely Obese         Normal          Obese     Overweight    Underweight 
##           0.04           0.36           0.22           0.35           0.03

(b) Actual Distribution

It shows the actual (%) of responses given for both catagory together.

table(BMI_Per_Sex$BMI, BMI_Per_Sex$Sex) %>%
  prop.table() %>%
  round(2)

##                 
##                  female male
##   Exremely Obese   0.03 0.01
##   Normal           0.22 0.14
##   Obese            0.12 0.10
##   Overweight       0.16 0.19
##   Underweight      0.02 0.01

5. Comparing the null hypothesis table to the actual table distribution

There is no difference between null hypothesis table and actual distribution table. Both got exact value.

6. Data Analysis

Calculating column% to highlight the relationship of interest between the variables.

table(BMI_Per_Sex$BMI, BMI_Per_Sex$Sex) %>%
  prop.table(2)

##                 
##                      female       male
##   Exremely Obese 0.04695415 0.02604850
##   Normal         0.40646484 0.30851167
##   Obese          0.21795071 0.22604100
##   Overweight     0.28605022 0.42572992
##   Underweight    0.04258009 0.01366890

7. Data Interpretation

It’s clearly showing that among the responded, 40% female have the normal weight where 42.5% male got overweight.

8. Data Visualization

Visualizing the results of the column% table.

BMI_Per_Sex  %>%
  group_by(Sex, BMI)%>% 
  summarize(n = n()) %>%
  mutate(Percent = n/sum(n)) %>%
  ggplot() +
  geom_col(aes(x = Sex, y = Percent, fill = BMI))

## `summarise()` regrouping output by 'Sex' (override with `.groups` argument)

9. Chi-Square Statistical Test

Calculating a chi-square test to determine if there is a statistically significant relationship between the variables.

chisq.test(BMI_Per_Sex$BMI, BMI_Per_Sex$Sex)

## 
##  Pearson's Chi-squared test
## 
## data:  BMI_Per_Sex$BMI and BMI_Per_Sex$Sex
## X-squared = 18040, df = 4, p-value < 2.2e-16

10. Interpret the Chi-Square Test

There is a statistically significant relationship between Sex and BMI. It’s showing most of the female have normal weight where most of the male have overweight.

Categorical Data Analysis

Sakif Shadman