library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
NHIdata <- read.csv("C:/Users/12055/Documents/333- Chi squared/NHIS Data.csv")
I will analyze National Health Interview Survey (NHIS) data from 1997-2016 to test whether there is a relationship between survey respondents' status above or below the poverty line and their self-rated health.
NHIPov <- NHIdata %>%
  select(Demo_belowpovertyline_B, Health_SelfRatedHealth_C) %>%
  filter(Demo_belowpovertyline_B %in% c("0", "1"),
         Health_SelfRatedHealth_C %in% c("Poor", "Fair", "Good", "Very Good", "Excellent"))
head(NHIPov)
## Demo_belowpovertyline_B Health_SelfRatedHealth_C
## 1 1 Excellent
## 2 0 Very Good
## 3 0 Excellent
## 4 0 Very Good
## 5 1 Good
## 6 0 Poor
The first variable is coded as “1” for respondents below the poverty line and “0” for respondents above the poverty line. I hypothesize that respondents above the poverty line will report better self-rated health than respondents below the poverty line.
table(NHIPov$Demo_belowpovertyline_B , NHIPov$Health_SelfRatedHealth_C)%>%
prop.table(1)%>%
round(2)
##
## Excellent Fair Good Poor Very Good
## 0 0.29 0.09 0.26 0.02 0.33
## 1 0.19 0.20 0.29 0.08 0.23
The crosstab supports my hypothesis. For example, about 8% of respondents below the poverty line reported poor health compared to 2% of respondents above the poverty line. About 19% of respondents below the poverty line reported excellent health compared to 29% of respondents above the poverty line.
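The crosstab lists the health categories alphabetically, which makes the gradient harder to read. As a rough sketch, the same row proportions can be laid out from worst to best health. The counts are copied from the observed table that `chisq.test()` reports in this analysis; the names `observed`, `health_order`, and `props` are introduced here only for illustration.

```r
# Observed counts copied from the chisq.test() output in this analysis
# (rows: 0 = above the poverty line, 1 = below the poverty line).
observed <- matrix(
  c(128739, 38451, 112949, 10860, 146421,
     16671, 16754,  25162,  7169,  19840),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    below_poverty = c("0", "1"),
    health = c("Excellent", "Fair", "Good", "Poor", "Very Good")
  )
)

# Reorder the columns from worst to best self-rated health.
health_order <- c("Poor", "Fair", "Good", "Very Good", "Excellent")

# Row proportions: each row sums to 1 (up to rounding).
props <- round(prop.table(observed[, health_order], margin = 1), 2)
props
```

With the full data frame, a similar ordering could presumably be obtained by converting the health variable to a factor with `levels = health_order` before tabulating.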
The null hypothesis is that there is no relationship between status above or below the poverty line and self-rated health. This would mean that the two variables are independent of each other.
chisq.test(NHIPov$Demo_belowpovertyline_B , NHIPov$Health_SelfRatedHealth_C )[7]
## $expected
##
## NHIPov$Demo_belowpovertyline_B Excellent Fair Good Poor
## 0 121612.42 46170.234 115507.96 15078.401
## 1 23797.58 9034.766 22603.04 2950.599
##
## NHIPov$Demo_belowpovertyline_B Very Good
## 0 139050.98
## 1 27210.02
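The expected values are the counts each cell would contain if the two variables were independent. The test compares the observed values with these expected values to determine whether there is a relationship between the two variables.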
chisq.test(NHIPov$Demo_belowpovertyline_B , NHIPov$Health_SelfRatedHealth_C )[6]
## $observed
##
## NHIPov$Demo_belowpovertyline_B Excellent Fair Good Poor Very Good
## 0 128739 38451 112949 10860 146421
## 1 16671 16754 25162 7169 19840
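The expected counts can be reproduced by hand from the observed table: under independence, each cell's expected count is its row total times its column total, divided by the grand total. The sketch below re-enters the observed counts printed above; the names `observed` and `expected` are illustrative and not part of the original analysis.

```r
# Observed counts copied from the $observed output above
# (rows: 0 = above the poverty line, 1 = below the poverty line).
observed <- matrix(
  c(128739, 38451, 112949, 10860, 146421,
     16671, 16754,  25162,  7169,  19840),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    below_poverty = c("0", "1"),
    health = c("Excellent", "Fair", "Good", "Poor", "Very Good")
  )
)

# Expected count = (row total * column total) / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)
```

For example, the expected count for respondents above the poverty line reporting excellent health is 437420 * 145410 / 523016, which is about 121612.42 and agrees with the `$expected` output.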
The test results indicate a relationship between status above or below the poverty line and self-rated health. The p-value is less than 0.05, indicating a statistically significant relationship between the two variables. Because the reported p-value is below 2.2e-16, the null hypothesis can be rejected.
chisq.test(NHIPov$Demo_belowpovertyline_B, NHIPov$Health_SelfRatedHealth_C)
##
## Pearson's Chi-squared test
##
## data: NHIPov$Demo_belowpovertyline_B and NHIPov$Health_SelfRatedHealth_C
## X-squared = 20382, df = 4, p-value < 2.2e-16
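To make the test statistic less of a black box, here is a minimal sketch that recomputes it from the observed counts as the sum of (observed - expected)^2 / expected over all ten cells, with (rows - 1) * (columns - 1) degrees of freedom. The counts are copied from the `$observed` output; the names `observed`, `expected`, `chi_sq`, `deg_free`, and `p_value` are illustrative only.

```r
# Observed counts copied from the $observed output in this analysis.
observed <- matrix(
  c(128739, 38451, 112949, 10860, 146421,
     16671, 16754,  25162,  7169,  19840),
  nrow = 2, byrow = TRUE
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic and its degrees of freedom.
chi_sq   <- sum((observed - expected)^2 / expected)
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

round(chi_sq)
deg_free
p_value
```

The statistic should round to the 20382 reported above with 4 degrees of freedom. The computed p-value is far too small to print meaningfully, which is presumably why R reports it only as `< 2.2e-16`.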
In the visual below, the left bar (0) shows the breakdown of self-rated health for respondents above the poverty line. The right bar (1) shows the breakdown of self-rated health for respondents below the poverty line.
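NHIPov %>%
  group_by(Demo_belowpovertyline_B, Health_SelfRatedHealth_C) %>%
  summarize(n = n()) %>%
  # After summarize(), the data are still grouped by poverty status,
  # so percent is each category's share within its poverty group.
  mutate(percent = n / sum(n)) %>%
  ggplot() +
  geom_col(aes(x = Demo_belowpovertyline_B, y = percent, fill = Health_SelfRatedHealth_C))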
## `summarise()` has grouped output by 'Demo_belowpovertyline_B'. You can override using the `.groups` argument.