library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
NHIdata <- read.csv("C:/Users/12055/Documents/333- Chi squared/NHIS Data.csv" )
I will analyze National Health Interview Survey (NHIS) data from 1997-2016 to test whether there is a relationship between survey respondents' status below or above the poverty line and their self-rated health. I hypothesize that respondents below the poverty line will report lower self-rated health than respondents above the poverty line.
NHIPov <- NHIdata %>%
  select(Demo_belowpovertyline_B, Health_SelfRatedHealth_C) %>%
  mutate(
    # Recode the 0/1 poverty indicator into readable labels (NA stays NA)
    PovertyStatus = ifelse(Demo_belowpovertyline_B == 1, "Below Poverty", "Above Poverty"),
    # Order the health ratings from best to worst; any other value becomes NA
    Health_SelfRatedHealth_C = factor(Health_SelfRatedHealth_C,
                                      levels = c("Excellent", "Very Good", "Good", "Fair", "Poor"))
  ) %>%
  filter(!is.na(PovertyStatus), !is.na(Health_SelfRatedHealth_C))
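The recoding step above relies on two R behaviors: `ifelse()` returns `NA` when its condition is `NA`, and `factor()` turns any value not listed in `levels` into `NA`. The base R sketch below illustrates this on a small made-up data frame; the `toy` data and the "Refused" label are hypothetical and are not taken from the NHIS file.

```r
# Hypothetical toy data, invented only to illustrate the recoding behavior
toy <- data.frame(
  below  = c(1, 0, NA, 0),
  health = c("Good", "Refused", "Poor", "Excellent"),
  stringsAsFactors = FALSE
)

# ifelse() propagates NA: row 3 has no poverty indicator, so its status is NA
toy$status <- ifelse(toy$below == 1, "Below Poverty", "Above Poverty")

# factor() maps labels outside `levels` to NA: "Refused" in row 2 becomes NA
toy$health <- factor(
  toy$health,
  levels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
)

# Dropping rows with NA in either variable leaves only the complete cases
toy_clean <- toy[!is.na(toy$status) & !is.na(toy$health), ]
toy_clean
```

Only rows 1 and 4 survive the filter, which is why the `filter()` call is needed before the chi-squared test.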
Below is a preview of the data:
head(NHIPov)
## Demo_belowpovertyline_B Health_SelfRatedHealth_C PovertyStatus
## 1 1 Excellent Below Poverty
## 2 0 Very Good Above Poverty
## 3 0 Excellent Above Poverty
## 4 0 Very Good Above Poverty
## 5 1 Good Below Poverty
## 6 0 Poor Above Poverty
The null hypothesis is that there is no relationship between status above or below the poverty line and self-rated health; that is, the two variables are independent of each other.
chisq.test(NHIPov$PovertyStatus , NHIPov$Health_SelfRatedHealth_C )[7]
## $expected
##
## NHIPov$PovertyStatus Excellent Very Good Good Fair Poor
## Above Poverty 121612.42 139050.98 115507.96 46170.234 15078.401
## Below Poverty 23797.58 27210.02 22603.04 9034.766 2950.599
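As a check on the expected counts above, they can be reproduced by hand: under independence, each cell's expected count is its row total times its column total, divided by the grand total. The sketch below re-enters the observed counts reported in this analysis as a matrix; the object names `observed` and `expected_by_hand` are illustrative only.

```r
# Observed counts copied from the chisq.test() $observed output in this report
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Above Poverty", "Below Poverty"),
    c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

# Expected count under independence = row total * column total / grand total
expected_by_hand <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected_by_hand, 2)
```

The result should agree with the `$expected` table, for example 121612.42 for above-poverty respondents reporting excellent health.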
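Comparing the observed counts with the expected counts gives a first indication of whether the two variables are related. In the observed counts shown below, respondents above the poverty line appear more often than expected in the Excellent and Very Good categories and less often than expected in Good, Fair, and Poor; the pattern is reversed for respondents below the poverty line.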
chisq.test(NHIPov$PovertyStatus , NHIPov$Health_SelfRatedHealth_C )[6]
## $observed
##
## NHIPov$PovertyStatus Excellent Very Good Good Fair Poor
## Above Poverty 128739 146421 112949 38451 10860
## Below Poverty 16671 19840 25162 16754 7169
The row proportions in the crosstab below support my hypothesis. For example, about 8% of respondents below the poverty line reported poor health, compared with only 2% of respondents above the poverty line. Only 19% of respondents below the poverty line reported excellent health, compared with 29% of respondents above the poverty line.
table(NHIPov$PovertyStatus , NHIPov$Health_SelfRatedHealth_C)%>%
prop.table(1)%>%
round(2)
##
## Excellent Very Good Good Fair Poor
## Above Poverty 0.29 0.33 0.26 0.09 0.02
## Below Poverty 0.19 0.23 0.29 0.20 0.08
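For readers unfamiliar with `prop.table(1)`, the row proportions above are simply each count divided by its row total. A minimal sketch, again re-entering the observed counts from this report (the names `observed` and `row_share` are illustrative only):

```r
# Observed counts copied from the chisq.test() $observed output in this report
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Above Poverty", "Below Poverty"),
    c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

# Divide every cell by its row total so that each row sums to 1
row_share <- sweep(observed, 1, rowSums(observed), "/")
round(row_share, 2)
```

Because each row is scaled separately, the two poverty groups can be compared even though the above-poverty group is roughly five times larger.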
In the stacked bar chart below, the left bar shows the distribution of self-rated health for respondents above the poverty line, and the right bar shows the distribution for respondents below the poverty line.
NHIPov %>%
  group_by(PovertyStatus, Health_SelfRatedHealth_C) %>%
  # Keep the PovertyStatus grouping so the shares sum to 1 within each bar
  summarize(n = n(), .groups = "drop_last") %>%
  mutate(percent = n / sum(n)) %>%
  ggplot() +
  geom_col(aes(x = PovertyStatus, y = percent, fill = Health_SelfRatedHealth_C))
The results indicate a relationship between status above or below the poverty line (the independent variable) and self-rated health (the dependent variable). The chi-squared statistic is 20382 with 4 degrees of freedom, and the p-value is below 2.2e-16, far less than alpha = 0.05, indicating a statistically significant association between the two variables. Based on the chi-squared test results, I reject the null hypothesis.
chisq.test(NHIPov$PovertyStatus, NHIPov$Health_SelfRatedHealth_C)
##
## Pearson's Chi-squared test
##
## data: NHIPov$PovertyStatus and NHIPov$Health_SelfRatedHealth_C
## X-squared = 20382, df = 4, p-value < 2.2e-16
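To make the test output above less of a black box, the statistic can be recomputed by hand as the sum over all cells of (observed - expected)^2 / expected, with (rows - 1) x (columns - 1) degrees of freedom. The sketch below is a cross-check under the assumption that the observed counts printed earlier are the full table; the names `observed`, `expected`, `chi_sq`, `deg_free`, and `p_value` are illustrative only.

```r
# Observed counts copied from the chisq.test() $observed output in this report
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Above Poverty", "Below Poverty"),
    c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

# Expected counts under independence
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic: squared deviations scaled by expected counts
chi_sq <- sum((observed - expected)^2 / expected)

# Degrees of freedom for a 2 x 5 table: (2 - 1) * (5 - 1) = 4
deg_free <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = deg_free, lower.tail = FALSE)

c(statistic = chi_sq, df = deg_free, p.value = p_value)
```

R reports the p-value as `< 2.2e-16` because values below the machine's floating-point precision are not printed exactly; the hand computation leads to the same conclusion.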