library(readr)
library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

NHIdata <- read.csv("C:/Users/12055/Documents/333- Chi squared/NHIS Data.csv")

I will analyze National Health Interview Survey (NHIS) data from 1997-2016 to test whether a relationship exists between survey respondents' status above or below the poverty line and their self-rated health.

Data Preparation

NHIPov <- NHIdata %>%
  select(Demo_belowpovertyline_B, Health_SelfRatedHealth_C) %>%
  filter(
    Demo_belowpovertyline_B %in% c("0", "1"),
    Health_SelfRatedHealth_C %in% c("Poor", "Fair", "Good", "Very Good", "Excellent")
  )

Preview of data

head(NHIPov)
##   Demo_belowpovertyline_B Health_SelfRatedHealth_C
## 1                       1                Excellent
## 2                       0                Very Good
## 3                       0                Excellent
## 4                       0                Very Good
## 5                       1                     Good
## 6                       0                     Poor

Relationship of Interest

The first variable is coded as “1” for respondents below the poverty line and “0” for respondents above the poverty line. The second variable is self-rated health, with five categories from “Poor” to “Excellent”. I hypothesize that respondents above the poverty line will report better self-rated health than respondents below the poverty line.

table(NHIPov$Demo_belowpovertyline_B , NHIPov$Health_SelfRatedHealth_C)%>%
prop.table(1)%>%
round(2)
##    
##     Excellent Fair Good Poor Very Good
##   0      0.29 0.09 0.26 0.02      0.33
##   1      0.19 0.20 0.29 0.08      0.23

The crosstab supports my hypothesis. For example, about 8% of respondents below the poverty line reported poor health compared to 2% of respondents above the poverty line. About 19% of respondents below the poverty line reported excellent health compared to 29% of respondents above the poverty line.
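The crosstab above lists the health categories alphabetically, which makes the trend harder to read. As a minimal sketch, the row proportions can be recomputed from the observed counts reported in the Observed Values section and the columns reordered from worst to best health. The names `observed`, `health_order`, and `ordered_props` are illustrative and not part of the original analysis.

```r
# Observed counts copied from the Observed Values output in this report.
observed <- matrix(
  c(128739, 38451, 112949, 10860, 146421,
     16671, 16754,  25162,  7169,  19840),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    poverty = c("0", "1"),
    health  = c("Excellent", "Fair", "Good", "Poor", "Very Good")
  )
)

# Row proportions: each poverty group sums to 1.
row_props <- prop.table(observed, margin = 1)

# Reorder the columns from worst to best self-rated health (illustrative choice).
health_order  <- c("Poor", "Fair", "Good", "Very Good", "Excellent")
ordered_props <- round(row_props[, health_order], 2)
print(ordered_props)
```

With the columns in this order, the shift toward "Poor" and "Fair" in the below-poverty row is easier to see at a glance.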

Expected Values

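The null hypothesis is that there is no relationship between status above or below the poverty line and self-rated health. This would mean that the two variables are independent of each other. The expected values below are the counts we would see in each cell if the null hypothesis were true.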

chisq.test(NHIPov$Demo_belowpovertyline_B, NHIPov$Health_SelfRatedHealth_C)["expected"]
## $expected
##                               
## NHIPov$Demo_belowpovertyline_B Excellent      Fair      Good      Poor
##                              0 121612.42 46170.234 115507.96 15078.401
##                              1  23797.58  9034.766  22603.04  2950.599
##                               
## NHIPov$Demo_belowpovertyline_B Very Good
##                              0 139050.98
##                              1  27210.02
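To show where these expected values come from, here is a minimal sketch that rebuilds them by hand from the observed counts reported in this analysis. Under independence, each expected cell count is (row total × column total) / grand total. The object names (`observed`, `row_totals`, `col_totals`, `expected`) are illustrative and not part of the original code.

```r
# Observed counts copied from the Observed Values output in this report.
observed <- matrix(
  c(128739, 38451, 112949, 10860, 146421,
     16671, 16754,  25162,  7169,  19840),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    poverty = c("0", "1"),
    health  = c("Excellent", "Fair", "Good", "Poor", "Very Good")
  )
)

row_totals  <- rowSums(observed)   # respondents above (0) and below (1) the poverty line
col_totals  <- colSums(observed)   # respondents in each health category
grand_total <- sum(observed)

# Expected count per cell under independence:
# (row total * column total) / grand total
expected <- outer(row_totals, col_totals) / grand_total
print(round(expected, 2))
```

The result should agree with the `$expected` table printed by `chisq.test()` above, up to rounding.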

Observed Values

The observed values are compared to the expected values to determine if there is a relationship between the two variables.

chisq.test(NHIPov$Demo_belowpovertyline_B, NHIPov$Health_SelfRatedHealth_C)["observed"]
## $observed
##                               
## NHIPov$Demo_belowpovertyline_B Excellent   Fair   Good   Poor Very Good
##                              0    128739  38451 112949  10860    146421
##                              1     16671  16754  25162   7169     19840

Chi-squared test for independence

The results indicate a relationship between status above or below the poverty line and self-rated health. The p-value is less than 0.05, indicating a statistically significant relationship between the two variables. With a p-value below 2.2e-16, the null hypothesis of independence can be rejected.

chisq.test(NHIPov$Demo_belowpovertyline_B, NHIPov$Health_SelfRatedHealth_C)
## 
##  Pearson's Chi-squared test
## 
## data:  NHIPov$Demo_belowpovertyline_B and NHIPov$Health_SelfRatedHealth_C
## X-squared = 20382, df = 4, p-value < 2.2e-16
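As a rough check on this output, the test statistic can be recomputed by hand from the observed counts reported above. This is a sketch under the assumption that no continuity correction is applied (R only applies one to 2 × 2 tables). The names `observed`, `expected`, `chi_sq`, `df`, and `p_value` are illustrative.

```r
# Observed counts copied from the Observed Values output in this report.
observed <- matrix(
  c(128739, 38451, 112949, 10860, 146421,
     16671, 16754,  25162,  7169,  19840),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    poverty = c("0", "1"),
    health  = c("Excellent", "Fair", "Good", "Poor", "Very Good")
  )
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic: sum over all cells of (O - E)^2 / E.
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1).
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

print(c(chi_sq = chi_sq, df = df, p_value = p_value))
```

The statistic and degrees of freedom should match the `X-squared` and `df` values above. The p-value is too small to represent meaningfully, which is why `chisq.test()` reports it only as `< 2.2e-16`.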

Visualization

In the visual below, the left bar (0) shows the breakdown of self-rated health for respondents above the poverty line. The right bar (1) shows the breakdown of self-rated health for respondents below the poverty line. Each bar sums to 100% of its group.

NHIPov %>%
  group_by(Demo_belowpovertyline_B, Health_SelfRatedHealth_C) %>%
  summarize(n = n()) %>%
  # Output is still grouped by poverty status, so percent is the share within each group
  mutate(percent = n / sum(n)) %>%
  ggplot() +
  geom_col(aes(x = Demo_belowpovertyline_B, y = percent, fill = Health_SelfRatedHealth_C))
## `summarise()` has grouped output by 'Demo_belowpovertyline_B'. You can override using the `.groups` argument.