National Health Interview - Chi-Squared Test for Independence

library(readr)
library(dplyr)

library(ggplot2)

NHIdata <- read.csv("C:/Users/12055/Documents/333- Chi squared/NHIS Data.csv" )

Data and Research Question

I will analyze National Health Interview data from 1997-2016 to test whether there is a relationship between survey respondents' poverty status (below or above the poverty line) and their self-rated health. I hypothesize that respondents below the poverty line will report lower self-rated health than respondents above the poverty line.

NHIPov <- NHIdata %>%
  select(Demo_belowpovertyline_B, Health_SelfRatedHealth_C) %>%
  mutate(PovertyStatus = ifelse(Demo_belowpovertyline_B == 1, "Below Poverty", "Above Poverty"),
         Health_SelfRatedHealth_C = factor(Health_SelfRatedHealth_C, levels = c("Excellent", "Very Good", "Good", "Fair", "Poor"))) %>%
  filter(!is.na(PovertyStatus), !is.na(Health_SelfRatedHealth_C))
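
The cleaning step above relies on two behaviors that are easy to miss: `ifelse()` returns `NA` when the poverty flag is missing, and `factor()` turns any label not listed in `levels` into `NA`. The sketch below shows this on a made-up four-row data frame in base R. The toy values, including the "Refused" label, are assumptions for illustration and are not taken from the NHIS file.

```r
# Toy data for illustration only; these values are NOT from the NHIS file.
toy <- data.frame(
  Demo_belowpovertyline_B  = c(1, 0, NA, 0),
  Health_SelfRatedHealth_C = c("Good", "Refused", "Poor", "Excellent")
)

# A missing poverty flag propagates to NA in the recoded label.
toy$PovertyStatus <- ifelse(toy$Demo_belowpovertyline_B == 1,
                            "Below Poverty", "Above Poverty")

# Labels outside the listed levels (here the hypothetical "Refused") become NA.
toy$Health_SelfRatedHealth_C <- factor(
  toy$Health_SelfRatedHealth_C,
  levels = c("Excellent", "Very Good", "Good", "Fair", "Poor")
)

# Keep only rows where both variables are known, as the filter() call does.
kept <- toy[!is.na(toy$PovertyStatus) & !is.na(toy$Health_SelfRatedHealth_C), ]
kept
```

Only the first and last toy rows survive, which is why the analysis sample can be smaller than the raw file.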

Below is a preview of the data:

head(NHIPov)

##   Demo_belowpovertyline_B Health_SelfRatedHealth_C PovertyStatus
## 1                       1                Excellent Below Poverty
## 2                       0                Very Good Above Poverty
## 3                       0                Excellent Above Poverty
## 4                       0                Very Good Above Poverty
## 5                       1                     Good Below Poverty
## 6                       0                     Poor Above Poverty

Expected Values

The null hypothesis is that there is no relationship between poverty status and self-rated health; that is, the two variables are independent of each other. The expected values below are the cell counts we would see if the null hypothesis were true.

chisq.test(NHIPov$PovertyStatus, NHIPov$Health_SelfRatedHealth_C)["expected"]

## $expected
##                     
## NHIPov$PovertyStatus Excellent Very Good      Good      Fair      Poor
##        Above Poverty 121612.42 139050.98 115507.96 46170.234 15078.401
##        Below Poverty  23797.58  27210.02  22603.04  9034.766  2950.599
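
As a rough check on where these numbers come from, each expected count under independence is (row total × column total) / grand total. The sketch below rebuilds the table by hand from the observed counts reported in this document. The matrix is typed in manually, so it does not need the CSV file, and the object names are my own.

```r
# Observed counts copied from the crosstab reported in this document.
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Above Poverty", "Below Poverty"),
    c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

# Expected count for each cell = row total * column total / grand total.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)
```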

Observed Values

Comparing the observed values to the expected values shows whether the data depart from what independence would predict. Two cells illustrate the pattern:

The expected number of respondents above the poverty line reporting excellent health is 121,612.42. The observed value is higher, at 128,739.
The expected number of respondents above the poverty line reporting poor health is 15,078.401. The observed value is lower, at 10,860.

chisq.test(NHIPov$PovertyStatus, NHIPov$Health_SelfRatedHealth_C)["observed"]

## $observed
##                     
## NHIPov$PovertyStatus Excellent Very Good   Good   Fair   Poor
##        Above Poverty    128739    146421 112949  38451  10860
##        Below Poverty     16671     19840  25162  16754   7169
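
One way to compare every cell at once, rather than two hand-picked ones, is the Pearson residual, (observed − expected) / sqrt(expected). This is a sketch that goes slightly beyond the comparison above: it reuses the observed counts typed in by hand, and the object names are assumptions of mine. Positive residuals mark cells with more respondents than independence predicts, and negative residuals mark cells with fewer.

```r
# Observed counts copied from the table above.
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Above Poverty", "Below Poverty"),
    c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

fit <- chisq.test(observed)

# Pearson residuals computed by hand; chisq.test() also stores them in fit$residuals.
pearson <- (fit$observed - fit$expected) / sqrt(fit$expected)
round(pearson, 1)
```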

Relationship of Interest

The crosstab supports my hypothesis. For example, about 8% of respondents below the poverty line reported poor health compared to only 2% of respondents above the poverty line. Only 19% of respondents below the poverty line reported excellent health compared to 29% of respondents above the poverty line.

table(NHIPov$PovertyStatus , NHIPov$Health_SelfRatedHealth_C)%>%
prop.table(1)%>%
round(2)

##                
##                 Excellent Very Good Good Fair Poor
##   Above Poverty      0.29      0.33 0.26 0.09 0.02
##   Below Poverty      0.19      0.23 0.29 0.20 0.08

Visualization

In the stacked bar chart below, the left bar shows the breakdown of self-rated health for respondents above the poverty line, and the right bar shows the breakdown for respondents below the poverty line. Each bar sums to 100% of its group.

NHIPov %>%
  group_by(PovertyStatus, Health_SelfRatedHealth_C) %>%
  summarize(n = n(), .groups = "drop_last") %>%
  mutate(percent = n / sum(n)) %>%
  ggplot() +
  geom_col(aes(x = PovertyStatus, y = percent, fill = Health_SelfRatedHealth_C))

Chi-squared test for independence

The results indicate a relationship between poverty status (independent variable) and self-rated health (dependent variable). The p-value is less than 2.2e-16, well below alpha = 0.05, indicating a statistically significant association between the two variables. Based on the chi-squared test results, I reject the null hypothesis of independence.

chisq.test(NHIPov$PovertyStatus, NHIPov$Health_SelfRatedHealth_C)

## 
##  Pearson's Chi-squared test
## 
## data:  NHIPov$PovertyStatus and NHIPov$Health_SelfRatedHealth_C
## X-squared = 20382, df = 4, p-value < 2.2e-16
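
To make the reported statistic less of a black box, the sketch below recomputes it by hand from the observed counts in this document: X-squared is the sum over all cells of (observed − expected)² / expected, with (rows − 1) × (columns − 1) degrees of freedom. The counts are typed in manually and the object names are my own, so treat this as a cross-check rather than part of the original analysis.

```r
# Observed counts copied from the crosstab reported earlier in this document.
observed <- matrix(
  c(128739, 146421, 112949, 38451, 10860,
     16671,  19840,  25162, 16754,  7169),
  nrow = 2, byrow = TRUE,
  dimnames = list(
    c("Above Poverty", "Below Poverty"),
    c("Excellent", "Very Good", "Good", "Fair", "Poor")
  )
)

# Expected counts under independence.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson chi-squared statistic and its degrees of freedom.
stat <- sum((observed - expected)^2 / expected)
df   <- (nrow(observed) - 1) * (ncol(observed) - 1)

# Upper-tail probability of the chi-squared distribution.
p_value <- pchisq(stat, df = df, lower.tail = FALSE)

c(statistic = stat, df = df, p_value = p_value)
```

With a sample this large the p-value is too small for R to distinguish from zero, which is why `chisq.test()` prints it as `< 2.2e-16` rather than as an exact value.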