Final Project

Is there an association between a person’s health status and a person having health insurance coverage?

\(H_0\): There is no association between a person’s health status and if they have health insurance.

\(H_a\): There is an association between a person’s health status and if they have health insurance coverage.

I will be doing a Chi Square Test for Association.

Introduction

I will be using the Health Coverage and Health Status dataset from the openintro.org website at https://www.openintro.org/data/index.php?data=health_coverage

The dataset has 20,000 observations on 2 variables. The 2 variables are both categorical; whether a person has health insurance coverage or not and what is the person’s overall health status (the range was excellent, very good, good, fair or poor). The data came from 20,0000 survey responses to the Behavioral Risk Factor Surveillance System.

Data Analysis

I am going to look at the dimensions and head of the data as well as check for any missing information.

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.2.0     ✔ readr     2.1.6
## ✔ forcats   1.0.1     ✔ stringr   1.6.0
## ✔ ggplot2   4.0.2     ✔ tibble    3.3.1
## ✔ lubridate 1.9.5     ✔ tidyr     1.3.2
## ✔ purrr     1.2.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.95 loaded

library(lubridate)
library(dplyr)


setwd("~/Downloads/Data 101 Course materials/Data Sets")
health <- read_csv("health_coverage.csv")

## Rows: 20000 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): coverage, health_status
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

health

## # A tibble: 20,000 × 2
##    coverage health_status
##    <chr>    <chr>        
##  1 No       Excellent    
##  2 No       Excellent    
##  3 No       Excellent    
##  4 No       Excellent    
##  5 No       Excellent    
##  6 No       Excellent    
##  7 No       Excellent    
##  8 No       Excellent    
##  9 No       Excellent    
## 10 No       Excellent    
## # ℹ 19,990 more rows

str(health)

## spc_tbl_ [20,000 × 2] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ coverage     : chr [1:20000] "No" "No" "No" "No" ...
##  $ health_status: chr [1:20000] "Excellent" "Excellent" "Excellent" "Excellent" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   coverage = col_character(),
##   ..   health_status = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

head(health)

## # A tibble: 6 × 2
##   coverage health_status
##   <chr>    <chr>        
## 1 No       Excellent    
## 2 No       Excellent    
## 3 No       Excellent    
## 4 No       Excellent    
## 5 No       Excellent    
## 6 No       Excellent

colSums(is.na(health))

##      coverage health_status 
##             0             0

There are no missing data.

health |>
  mutate(across(c(coverage, health_status), as.factor))

## # A tibble: 20,000 × 2
##    coverage health_status
##    <fct>    <fct>        
##  1 No       Excellent    
##  2 No       Excellent    
##  3 No       Excellent    
##  4 No       Excellent    
##  5 No       Excellent    
##  6 No       Excellent    
##  7 No       Excellent    
##  8 No       Excellent    
##  9 No       Excellent    
## 10 No       Excellent    
## # ℹ 19,990 more rows

To save memory, I made the categorical variables factors.

df_health <- health |>
  group_by(coverage, health_status) |>
    summarise(outcomes= n()
)

## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by coverage and health_status.
## ℹ Output is grouped by coverage.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(coverage, health_status))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.

df_health

## # A tibble: 10 × 3
## # Groups:   coverage [2]
##    coverage health_status outcomes
##    <chr>    <chr>            <int>
##  1 No       Excellent          459
##  2 No       Fair               385
##  3 No       Good               854
##  4 No       Poor                99
##  5 No       Very good          727
##  6 Yes      Excellent         4198
##  7 Yes      Fair              1634
##  8 Yes      Good              4821
##  9 Yes      Poor               578
## 10 Yes      Very good         6245

observed <- matrix(c(4198, 6245, 4821, 1634, 578, 459, 727, 854, 385, 99), nrow = 2, byrow = TRUE,
                   dimnames = list(c("Had Insurance", "No Insurance"), c("Excellent", "Very Good", "Good", "Fair", "Poor")))

observed

##               Excellent Very Good Good Fair Poor
## Had Insurance      4198      6245 4821 1634  578
## No Insurance        459       727  854  385   99

chi_health<- chisq.test(observed)
chi_health

## 
##  Pearson's Chi-squared test
## 
## data:  observed
## X-squared = 171.61, df = 4, p-value < 2.2e-16

chi_health$expected

##               Excellent Very Good     Good      Fair     Poor
## Had Insurance 4069.2866 6092.1336 4958.815 1764.2022 591.5626
## No Insurance   587.7134  879.8664  716.185  254.7978  85.4374

chi_health

## 
##  Pearson's Chi-squared test
## 
## data:  observed
## X-squared = 171.61, df = 4, p-value < 2.2e-16

observed

##               Excellent Very Good Good Fair Poor
## Had Insurance      4198      6245 4821 1634  578
## No Insurance        459       727  854  385   99

Here we can compare the difference between the expected and the observed (actual) results and see how they differ.

barplot(observed,
        beside = FALSE,   
        col = c("skyblue", "orange"),
        main = "Health Status with & without Health Insurance",
        xlab = "Health Status",
        ylab = "Total People",)
     legend("topright",
       legend = c("Has Insurance", "No Insurance"),
       fill = c("skyblue", "orange"))

Here is a visualization where you can clearly see the health status and whether there was health insurance coverage or not.

Conclusion

The p-value is less than 2.2e-16 which is less than alpha at 0.05. (However, please note, there is quite an imbalance within the dataset between those who have insurance and those that don’t.There are very many people who have health insurance and very few people without it.)

The results of the test for association are statistically significant. We reject the null hypothesis that there is no association between having health insurance and a person’s health status. We have evidence for the alternative hypothesis, that there is an association between a person’s health status and whether the person has health insurance or not.

This is helpful information for all people to know, that you may have better health if you have health insurance coverage. You may be more likely to use it before you really have a serious health issue, preventing more serious health situations later on.

Future Steps

It may be interesting to look at whether this same association holds true for men and women or people of different ages and also if we get similar results with a more balanced datset.

Sources

https://www.openintro.org/data/index.php?data=health_coverage

Office of Surveillance, Epidemiology, and Laboratory Services Behavioral Risk Factor Surveillance System, BRFSS 2010 Survey Data.

https://www.cdc.gov/brfss/questionnaires/pdf-ques/2010brfss.pdf

https://www.cdc.gov/brfss/annual_data/annual_2010.htm