\(H_0\): There is no association between a person’s health status and if they have health insurance.
\(H_a\): There is an association between a person’s health status and if they have health insurance coverage.
I will be doing a Chi Square Test for Association.
I will be using the Health Coverage and Health Status dataset from the openintro.org website at https://www.openintro.org/data/index.php?data=health_coverage
The dataset has 20,000 observations on 2 variables. The 2 variables are both categorical; whether a person has health insurance coverage or not and what is the person’s overall health status (the range was excellent, very good, good, fair or poor). The data came from 20,0000 survey responses to the Behavioral Risk Factor Surveillance System.
I am going to look at the dimensions and head of the data as well as check for any missing information.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.0 ✔ readr 2.1.6
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(corrplot)
## corrplot 0.95 loaded
library(lubridate)
library(dplyr)
setwd("~/Downloads/Data 101 Course materials/Data Sets")
health <- read_csv("health_coverage.csv")
## Rows: 20000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): coverage, health_status
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
health
## # A tibble: 20,000 × 2
## coverage health_status
## <chr> <chr>
## 1 No Excellent
## 2 No Excellent
## 3 No Excellent
## 4 No Excellent
## 5 No Excellent
## 6 No Excellent
## 7 No Excellent
## 8 No Excellent
## 9 No Excellent
## 10 No Excellent
## # ℹ 19,990 more rows
str(health)
## spc_tbl_ [20,000 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ coverage : chr [1:20000] "No" "No" "No" "No" ...
## $ health_status: chr [1:20000] "Excellent" "Excellent" "Excellent" "Excellent" ...
## - attr(*, "spec")=
## .. cols(
## .. coverage = col_character(),
## .. health_status = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
head(health)
## # A tibble: 6 × 2
## coverage health_status
## <chr> <chr>
## 1 No Excellent
## 2 No Excellent
## 3 No Excellent
## 4 No Excellent
## 5 No Excellent
## 6 No Excellent
colSums(is.na(health))
## coverage health_status
## 0 0
There are no missing data.
health |>
mutate(across(c(coverage, health_status), as.factor))
## # A tibble: 20,000 × 2
## coverage health_status
## <fct> <fct>
## 1 No Excellent
## 2 No Excellent
## 3 No Excellent
## 4 No Excellent
## 5 No Excellent
## 6 No Excellent
## 7 No Excellent
## 8 No Excellent
## 9 No Excellent
## 10 No Excellent
## # ℹ 19,990 more rows
To save memory, I made the categorical variables factors.
df_health <- health |>
group_by(coverage, health_status) |>
summarise(outcomes= n()
)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by coverage and health_status.
## ℹ Output is grouped by coverage.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(coverage, health_status))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
df_health
## # A tibble: 10 × 3
## # Groups: coverage [2]
## coverage health_status outcomes
## <chr> <chr> <int>
## 1 No Excellent 459
## 2 No Fair 385
## 3 No Good 854
## 4 No Poor 99
## 5 No Very good 727
## 6 Yes Excellent 4198
## 7 Yes Fair 1634
## 8 Yes Good 4821
## 9 Yes Poor 578
## 10 Yes Very good 6245
observed <- matrix(c(4198, 6245, 4821, 1634, 578, 459, 727, 854, 385, 99), nrow = 2, byrow = TRUE,
dimnames = list(c("Had Insurance", "No Insurance"), c("Excellent", "Very Good", "Good", "Fair", "Poor")))
observed
## Excellent Very Good Good Fair Poor
## Had Insurance 4198 6245 4821 1634 578
## No Insurance 459 727 854 385 99
chi_health<- chisq.test(observed)
chi_health
##
## Pearson's Chi-squared test
##
## data: observed
## X-squared = 171.61, df = 4, p-value < 2.2e-16
chi_health$expected
## Excellent Very Good Good Fair Poor
## Had Insurance 4069.2866 6092.1336 4958.815 1764.2022 591.5626
## No Insurance 587.7134 879.8664 716.185 254.7978 85.4374
chi_health
##
## Pearson's Chi-squared test
##
## data: observed
## X-squared = 171.61, df = 4, p-value < 2.2e-16
observed
## Excellent Very Good Good Fair Poor
## Had Insurance 4198 6245 4821 1634 578
## No Insurance 459 727 854 385 99
Here we can compare the difference between the expected and the observed (actual) results and see how they differ.
barplot(observed,
beside = FALSE,
col = c("skyblue", "orange"),
main = "Health Status with & without Health Insurance",
xlab = "Health Status",
ylab = "Total People",)
legend("topright",
legend = c("Has Insurance", "No Insurance"),
fill = c("skyblue", "orange"))
Here is a visualization where you can clearly see the health status and whether there was health insurance coverage or not.
The p-value is less than 2.2e-16 which is less than alpha at 0.05. (However, please note, there is quite an imbalance within the dataset between those who have insurance and those that don’t.There are very many people who have health insurance and very few people without it.)
The results of the test for association are statistically significant. We reject the null hypothesis that there is no association between having health insurance and a person’s health status. We have evidence for the alternative hypothesis, that there is an association between a person’s health status and whether the person has health insurance or not.
This is helpful information for all people to know, that you may have better health if you have health insurance coverage. You may be more likely to use it before you really have a serious health issue, preventing more serious health situations later on.
It may be interesting to look at whether this same association holds true for men and women or people of different ages and also if we get similar results with a more balanced datset.
https://www.openintro.org/data/index.php?data=health_coverage
Office of Surveillance, Epidemiology, and Laboratory Services Behavioral Risk Factor Surveillance System, BRFSS 2010 Survey Data.
https://www.cdc.gov/brfss/questionnaires/pdf-ques/2010brfss.pdf