library(readr)
library(dplyr) 
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)

# Load the NHIS extract from the author's local path
NHIdata <- read.csv("C:/Users/12055/Documents/333- Chi squared/NHIS Data.csv")

Data and Research Question

I will analyze National Health Interview Survey (NHIS) data from 1997-2016 to test whether there is a relationship between survey respondents' status below or above the poverty line and their self-rated health. I hypothesize that respondents below the poverty line will report lower self-rated health than respondents above the poverty line.

# Keep the two variables of interest, label poverty status, and order the health ratings
NHIPov <- NHIdata %>%
  select(Demo_belowpovertyline_B, Health_SelfRatedHealth_C) %>%
  mutate(
    PovertyStatus = ifelse(Demo_belowpovertyline_B == 1, "Below Poverty", "Above Poverty"),
    Health_SelfRatedHealth_C = factor(Health_SelfRatedHealth_C,
                                      levels = c("Excellent", "Very Good", "Good", "Fair", "Poor"))
  ) %>%
  # Drop respondents missing either variable
  filter(!is.na(PovertyStatus), !is.na(Health_SelfRatedHealth_C))

Below is a preview of the data:

head(NHIPov)
##   Demo_belowpovertyline_B Health_SelfRatedHealth_C PovertyStatus
## 1                       1                Excellent Below Poverty
## 2                       0                Very Good Above Poverty
## 3                       0                Excellent Above Poverty
## 4                       0                Very Good Above Poverty
## 5                       1                     Good Below Poverty
## 6                       0                     Poor Above Poverty

Expected Values

The null hypothesis is that there is no relationship between status above or below the poverty line and self-rated health, that is, the two variables are independent of each other. The expected values below are the cell counts we would see if the null hypothesis were true.

chisq.test(NHIPov$PovertyStatus, NHIPov$Health_SelfRatedHealth_C)["expected"]
## $expected
##                     
## NHIPov$PovertyStatus Excellent Very Good      Good      Fair      Poor
##        Above Poverty 121612.42 139050.98 115507.96 46170.234 15078.401
##        Below Poverty  23797.58  27210.02  22603.04  9034.766  2950.599
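As a rough sketch of where these expected counts come from: under independence, each cell's expected count is its row total times its column total, divided by the grand total. The block below re-enters the observed counts by hand from the Observed Values output of this report; the object names `observed` and `expected` are illustrative and not part of the original analysis. It should reproduce the table above up to rounding.

```r
# Observed counts copied by hand from the report's Observed Values table
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    PovertyStatus   = c("Above Poverty", "Below Poverty"),
    SelfRatedHealth = c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

# Expected count under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)
```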

Observed Values

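The observed counts are compared with the expected counts to check whether a relationship may exist between the two variables. Respondents above the poverty line reported excellent or very good health more often than expected under independence, and good, fair, or poor health less often. For respondents below the poverty line, the pattern is reversed.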

chisq.test(NHIPov$PovertyStatus, NHIPov$Health_SelfRatedHealth_C)["observed"]
## $observed
##                     
## NHIPov$PovertyStatus Excellent Very Good   Good   Fair   Poor
##        Above Poverty    128739    146421 112949  38451  10860
##        Below Poverty     16671     19840  25162  16754   7169
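One way to make the observed-versus-expected comparison more concrete is to look at Pearson residuals, (observed - expected) / sqrt(expected), which show the direction and relative size of each cell's departure from independence. The sketch below is an assumption-laden illustration rather than part of the original analysis: the counts are re-entered by hand from the output above, and the object names are hypothetical. `chisq.test()` also returns these values in its `residuals` component.

```r
# Observed counts copied by hand from the output above
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    PovertyStatus   = c("Above Poverty", "Below Poverty"),
    SelfRatedHealth = c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

# Expected counts under independence, from the table margins
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson residuals: positive means more respondents than expected in that cell
pearson_resid <- (observed - expected) / sqrt(expected)
round(pearson_resid, 1)
```

Positive residuals in the fair and poor columns for respondents below the poverty line would be consistent with the pattern described above.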

Relationship of Interest

The crosstab of row proportions supports my hypothesis. About 8% of respondents below the poverty line reported poor health, compared to 2% of respondents above the poverty line. Conversely, 19% of respondents below the poverty line reported excellent health, compared to 29% of respondents above the poverty line.

table(NHIPov$PovertyStatus , NHIPov$Health_SelfRatedHealth_C)%>%
prop.table(1)%>%
round(2)
##                
##                 Excellent Very Good Good Fair Poor
##   Above Poverty      0.29      0.33 0.26 0.09 0.02
##   Below Poverty      0.19      0.23 0.29 0.20 0.08

Visualization

In the stacked bar chart below, the left bar shows the breakdown of self-rated health for respondents above the poverty line, and the right bar shows the breakdown for respondents below the poverty line.

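# Share of each health rating within each poverty group, drawn as stacked bars
NHIPov %>%
  group_by(PovertyStatus, Health_SelfRatedHealth_C) %>%
  summarize(n = n()) %>%
  # Output is still grouped by PovertyStatus, so sum(n) is the group total
  mutate(percent = n / sum(n)) %>%
  ggplot() +
  geom_col(aes(x = PovertyStatus, y = percent, fill = Health_SelfRatedHealth_C))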
## `summarise()` has grouped output by 'PovertyStatus'. You can override using the `.groups` argument.

Chi-squared test for independence

The results indicate a relationship between status above or below the poverty line (independent variable) and self-rated health (dependent variable). The p-value is less than 2.2e-16, far below alpha = 0.05, so the relationship between the two variables is statistically significant. Based on the chi-squared test results, I reject the null hypothesis of independence.

chisq.test(NHIPov$PovertyStatus, NHIPov$Health_SelfRatedHealth_C)
## 
##  Pearson's Chi-squared test
## 
## data:  NHIPov$PovertyStatus and NHIPov$Health_SelfRatedHealth_C
## X-squared = 20382, df = 4, p-value < 2.2e-16
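For readers who want to see how the reported statistic arises, the sketch below recomputes it by hand as the sum of (observed - expected)^2 / expected over all cells, with (rows - 1) * (columns - 1) degrees of freedom. The counts are re-entered manually from the Observed Values table, and the object names are illustrative assumptions, not part of the original analysis. The result should be close to the reported X-squared of 20382 with df = 4.

```r
# Observed counts copied by hand from the report's Observed Values table
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE
)

# Expected counts under independence, from the table margins
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

c(chi_sq = chi_sq, df = df, p_value = p_value)
```

R prints the p-value as `< 2.2e-16` because the computed value falls below `.Machine$double.eps`, the threshold its p-value formatting uses by default.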