Abstract:

Using the 2021-2023 NHANES family poverty index, we explored how income relates to an individual’s health insurance coverage: “no insurance,” “public insurance,” “private insurance,” or “combined insurance.” Our research question was: “Do different poverty levels have an impact on level of healthcare coverage when controlling for variables like age, sex, and race?” To answer it, we used a Kruskal-Wallis test with a post-hoc Games-Howell test, followed by an ordinal regression. Our findings suggest that poverty level, when controlling for age, race, and sex, was significantly associated with healthcare coverage.

Introduction:

Within the United States, a mix of healthcare coverage types has raised concerns about accessibility and equity. According to the U.S. Census Bureau’s report on health insurance coverage in 2024, “most people (92.0% or 310 million) had health insurance for some or all of the year,” with private coverage (66.1%) more common than public coverage (35.5%) (Bunch & Ketema, 2025). While health coverage is the overarching goal, not all forms of insurance provide uniform coverage, and they are structured around socioeconomic factors. Under public insurance plans, like Medicaid and Medicare, low-income individuals or those over 65 are able to receive healthcare at a reduced price. In Massachusetts, following public insurance reform, poverty decreased by 2.9% among those 65 and under (Korenman & Remler, 2016). Although public insurance was associated with a decrease in poverty among low-income individuals, the same was not seen in higher-income families. Families with higher incomes saw public insurance as less attractive due to “enrollment costs and transactional fees” (De La Mata, 2012), whereas private insurance offered appealing “preventative care benefits” and “supplementary services.” Public health insurance is also associated with bureaucratic inefficiencies that can lead to longer waiting times. This highlights the need to examine how an individual’s demographics (income, race, age, and sex) relate to their insurance outcome.

Methods:
The data used in this project were sourced from the 2021-2023 National Health and Nutrition Examination Survey (NHANES). NHANES is conducted by the National Center for Health Statistics, part of the Centers for Disease Control and Prevention, with the goal of assessing the health status of adults and children in the United States. The data are collected via personal interviews, physical examinations, and biological sample collection (Centers for Disease Control and Prevention, 2024). In our study we used one primary outcome variable and four predictor variables. Our sole outcome variable was insurance status. This ordinal categorical variable was constructed from the insurance items in NHANES, with the following levels: (1) no insurance, (2) public insurance (e.g., Medicare, Medicaid, CHIP), (3) private insurance (e.g., employer-sponsored or direct-purchase insurance), and (4) a combination of public and private insurance. We treated this scale as a hierarchy of financial access and quality of coverage. Our four predictor variables were the Family Poverty Level Index (FMPL), age, sex, and race. FMPL, our primary predictor, is the ratio of reported family income to the federal poverty line. Age is measured in years, sex is a nominal categorical variable (male or female), and race is a nominal categorical variable with six categories: Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, and Multi-racial. Prior to beginning our statistical analysis, we stated our statistical hypotheses. Our null hypothesis (H0) was that there is no statistically significant difference in the distribution of FMPL across the four insurance-level groups.
Our alternative hypothesis (H1) was that there is a statistically significant difference in the distribution of FMPL across the four insurance-level groups. We used an alpha level of 0.05. We first checked whether our data met the assumptions for an ANOVA (independence, normality, and homogeneity of variances) and found that the assumptions of normality and homogeneity of variance were violated, so we ran a Kruskal-Wallis test instead. The Kruskal-Wallis test is a nonparametric alternative to a one-way ANOVA that is used to compare three or more independent groups. It can determine whether a continuous variable differs significantly across groups (commonly described as a difference in medians) without requiring normality or homogeneity of variance. In our case, we tested whether the distribution of our continuous variable (FMPL) differed significantly across the insurance-level groups. After rejecting the null hypothesis, we performed a Games-Howell post-hoc test. The Games-Howell test is typically run after an ANOVA that has found statistical significance but has violated the assumption of equal variances. Because our groups had unequal variances and the Kruskal-Wallis test was significant, we judged that our data met the criteria for a Games-Howell test, which identifies the specific pairs of insurance levels whose FMPL values differ significantly. Finally, we ran an ordinal regression model. Ordinal regression is appropriate here because it models an ordered categorical dependent variable as a function of one or more independent variables.
In our project, the outcome variable has a specific order, in which a number (1-4) denotes a level of insurance (none, public, private, or a combination), and there are several predictor variables (FMPL, age, sex, race). The model provides a log-odds coefficient for each predictor, along with a p-value indicating whether its association with the outcome is significant. We exponentiated these coefficients to obtain odds ratios, which are easier to interpret in the context of the variables.

Results:

To begin our statistical assessment, we first installed the following packages in RStudio: nhanesA and dplyr, using install.packages("nhanesA") and install.packages("dplyr"), and then loaded them with library(). To load the NHANES data, we used the code chunk:

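library(nhanesA)
library(dplyr)
demographics = nhanes("DEMO_L")   # demographic variables
health_ins = nhanes("HIQ_L")      # health insurance questionnaire
income = nhanes("INQ_L")          # income questionnaire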

The NHANES data were loaded as three data frames. The demographics table supplied our control variables (RIAGENDR = sex, RIDAGEYR = age, RIDRETH3 = race/ethnicity). The income table supplied our primary predictor, INDFMMPI, the Family Poverty Level Index. Lastly, the health insurance table supplied our categorical insurance variables (HIQ011, HIQ032A, HIQ032B, HIQ032C, HIQ032D, HIQ032E, HIQ032F, HIQ032H, and HIQ032I). SEQN, the respondent sequence number, was then used to merge all of the data in the following steps. To identify which variables delineated the different levels of insurance, we used the code chunk:

healthdata2 = health_data %>%
  filter(HIQ011 == "Yes" | HIQ032A == "No") %>%
  mutate(HIQ032A = ifelse(HIQ032A == "Covered by insurance", "Yes", "No"))

To identify which individuals in the data frame definitively had any form of insurance, we used the filter step (HIQ011 == "Yes" | HIQ032A == "No"), as it told us whether an individual did or did not have insurance. The mutate step, which uses ifelse, standardized the responses from “Covered by insurance” to a “Yes” or “No” result so that they could be organized into the levels of insurance (1 = no insurance, 2 = public insurance, 3 = private insurance, and 4 = combined) through the code chunk:

mutate(insurance = case_when(HIQ011 == "No" ~ 1))

The case_when function evaluates the recorded Yes/No responses and maps them to a numerical rank (1 = no insurance, 2 = public insurance, 3 = private insurance, and 4 = combined). The following code was used to combine our separate data frames (income_data, demo_data, and healthdata2) into a single data frame called “dataset”. This was done with the merge function, using SEQN as the key that links each respondent’s demographic, income, and insurance data.

dataset <- merge(income_data, merge(demo_data, healthdata2, by = "SEQN"), by = "SEQN")
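The recoding and merging steps above can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the columns has_public and has_private are hypothetical flags, not NHANES variable names, and the toy tables stand in for the real data frames.

```r
library(dplyr)

# Toy stand-ins for the NHANES tables; has_public / has_private are
# hypothetical flags, not actual NHANES variable names.
health_toy <- data.frame(
  SEQN        = 1:5,
  HIQ011      = c("No", "Yes", "Yes", "Yes", "Yes"),
  has_public  = c(FALSE, TRUE, FALSE, TRUE, FALSE),
  has_private = c(FALSE, FALSE, TRUE, TRUE, FALSE)
)
income_toy <- data.frame(SEQN = c(1, 2, 3, 4, 6),
                         FMPL = c(0.8, 1.2, 3.5, 2.1, 4.0))

# case_when checks conditions top to bottom and uses the first match,
# so the combined case must come before the single-type cases.
health_toy <- health_toy %>%
  mutate(insurance = case_when(
    HIQ011 == "No"           ~ 1,        # no insurance
    has_public & has_private ~ 4,        # combined
    has_private              ~ 3,        # private only
    has_public               ~ 2,        # public only
    TRUE                     ~ NA_real_  # insured, type unknown
  ))

# merge() performs an inner join by default: only respondents whose SEQN
# appears in both tables are kept.
merged_toy <- merge(income_toy, health_toy, by = "SEQN")
merged_toy[, c("SEQN", "FMPL", "insurance")]
```

Because the join is an inner join, respondents missing from either table drop out silently, so comparing row counts before and after the merge is a cheap sanity check.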

Following the merging of data frames, we renamed some of our variables to make them shorter and easier to recognize.

dataset <- dataset %>% rename("Respondent Sequence Number" = "SEQN", "FMPL" = "INDFMMPI", "Gender" = "RIAGENDR", "Age" = "RIDAGEYR", "Race" = "RIDRETH3")
dataset = as.data.frame(dataset)

With the aforementioned code, “SEQN” was changed to “Respondent Sequence Number”, “INDFMMPI” was changed to “FMPL”, “RIAGENDR” was changed to “Gender”, “RIDAGEYR” was changed to “Age”, and “RIDRETH3” was changed to “Race”. The final line, dataset = as.data.frame(dataset), was used to ensure that our merged data were recognized by R as a standard data frame, ready for the statistical tests that follow. To remove any NA values from our dataset, we used the code:

dataset_trim = na.omit(dataset)

Removing NA values guarantees that rows with incomplete or missing values are not included in any of the following statistical tests. The next line labels our insurance variable as a factor, so that the numerical values that correspond to insurance type register as categories (levels 1 through 4) rather than as integers, which is needed for the grouping variable in an ANOVA and for the response in an ordinal regression.

dataset_trim$insurance = as.factor(dataset_trim$insurance)
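As a possible variation on the conversion above, the factor can be given readable labels and marked as ordered, which makes the assumed hierarchy explicit. This is a minimal sketch; the vector codes is a hypothetical stand-in for the insurance column.

```r
# Hypothetical codes standing in for dataset_trim$insurance.
codes <- c(1, 3, 2, 4, 3, 1)

# levels fixes the order; labels attaches readable names;
# ordered = TRUE records that None < Public < Private < Combined.
ins <- factor(codes,
              levels  = 1:4,
              labels  = c("None", "Public", "Private", "Combined"),
              ordered = TRUE)

levels(ins)      # level order follows `levels`, not order of appearance
is.ordered(ins)
table(ins)
```

Plain as.factor() yields the same four levels in the same order but does not flag them as ordered, so whether this variant matters depends on the modeling function used downstream.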

Before running an ANOVA, it is necessary to check that three assumptions are met. We first loaded the car package into our library so that we could begin testing the assumptions of normality, homogeneity of variances, and independence.

install.packages("car")
library(car)

To check if our data meets the assumptions of normality, we used the following code:

qqPlot(dataset_trim$FMPL[dataset_trim$insurance == "1"])
qqPlot(dataset_trim$FMPL[dataset_trim$insurance == "2"])
qqPlot(dataset_trim$FMPL[dataset_trim$insurance == "3"])
qqPlot(dataset_trim$FMPL[dataset_trim$insurance == "4"])

The code “qqPlot” is used to test the assumption of normality for the FMPL variable within each insurance level. As seen in the figures below (Fig. 1-Fig. 4), our data did not meet the assumption as none were normally distributed.

Fig. 1

Fig. 2

Fig. 3

Fig. 4
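The per-group normality check shown in the figures can also be sketched with base R alone. The data below are simulated and right-skewed on purpose; they are not NHANES values. The Shapiro-Wilk test is added here as a numerical complement to the plots and was not part of the original analysis.

```r
set.seed(1)
# Simulated, right-skewed stand-in for FMPL in four groups (not NHANES data).
sim <- data.frame(
  insurance = factor(rep(1:4, each = 200)),
  FMPL      = c(rexp(200, 1), rexp(200, 0.8), rexp(200, 0.4), rexp(200, 0.5))
)

# One normal Q-Q plot per insurance level, arranged in a 2 x 2 grid.
op <- par(mfrow = c(2, 2))
for (g in levels(sim$insurance)) {
  x <- sim$FMPL[sim$insurance == g]
  qqnorm(x, main = paste("Insurance level", g))
  qqline(x)  # points far from this line suggest non-normality
}
par(op)

# Shapiro-Wilk p-value per group (shapiro.test accepts 3 to 5000 values).
sw_p <- tapply(sim$FMPL, sim$insurance, function(x) shapiro.test(x)$p.value)
sw_p
```

With very large groups, formal normality tests reject for even minor departures, so the Q-Q plots remain the more informative check.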

Levene’s test for homogeneity of variances tests whether the variances of two or more groups are equal. If the p-value from Levene’s test is greater than the alpha value, we fail to reject the null hypothesis, meaning there is no evidence that the group variances differ. If the p-value is less than the alpha value, we reject the null hypothesis and conclude that the variances are unequal.

leveneTest(FMPL ~ insurance, data = dataset_trim)

In our project we found that our p-value was less than our alpha value, indicating unequal variances (p < 2.2e-16, below α = .05). Because the data failed to meet the assumptions for an ANOVA, a non-parametric alternative, such as the Kruskal-Wallis test, was required. We can assume that our data did pass the assumption of independence, as each respondent has a unique sequence number (SEQN), so each observation comes from a different individual.
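To make the logic of Levene’s test concrete, it can be reproduced by hand as a one-way ANOVA on absolute deviations from each group’s median, which is the centering that car’s leveneTest uses by default. The sketch below uses simulated data with deliberately unequal spread, not the NHANES values.

```r
set.seed(2)
# Simulated stand-in data with deliberately unequal spread (not NHANES data).
sim <- data.frame(
  insurance = factor(rep(1:4, each = 150)),
  FMPL      = c(rnorm(150, 1.0, 0.5), rnorm(150, 1.5, 1.0),
                rnorm(150, 3.0, 1.5), rnorm(150, 2.5, 2.0))
)

# Absolute deviation of each value from its own group median.
abs_dev <- abs(sim$FMPL - ave(sim$FMPL, sim$insurance, FUN = median))

# A one-way ANOVA on those deviations gives the Levene-type F test.
levene_by_hand <- anova(lm(abs_dev ~ sim$insurance))
levene_p <- levene_by_hand[["Pr(>F)"]][1]

levene_by_hand
levene_p < 0.05  # TRUE here means the equal-variance assumption is rejected
```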

Because two of the three assumptions were not met, we ran the Kruskal-Wallis test. As briefly touched upon earlier, this test is used when the ANOVA assumptions are not met and we want to determine whether the medians of two or more samples differ significantly.

test = kruskal.test(FMPL ~ insurance, data = dataset_trim)
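The object returned by kruskal.test stores the quantities that are reported for this kind of test. A minimal sketch on simulated data (not the NHANES values) shows where each one lives; the group medians are added to show the direction of any difference, since the test itself only says that the groups differ.

```r
set.seed(3)
# Simulated, skewed stand-in for FMPL in four groups (not NHANES data).
sim <- data.frame(
  insurance = factor(rep(1:4, each = 100)),
  FMPL      = c(rexp(100, 1.2), rexp(100, 0.9), rexp(100, 0.35), rexp(100, 0.5))
)

kw <- kruskal.test(FMPL ~ insurance, data = sim)

kw$statistic  # Kruskal-Wallis chi-squared
kw$parameter  # degrees of freedom = number of groups - 1
kw$p.value

# Group medians show which groups sit higher or lower.
tapply(sim$FMPL, sim$insurance, median)
```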

After running the Kruskal-Wallis test, we found χ²(3) = 2319.3, p < 2.2e-16. A chi-squared value this large indicates a very large departure from what the null hypothesis predicts, and the very small p-value supports rejecting the null hypothesis: FMPL differs significantly across the insurance levels. Following this significant result, we ran a Games-Howell test. A Games-Howell test is a post-hoc test that compares all pairs of groups to determine which pairs differ significantly from each other. We did this by first installing “rstatix” and then running the code for a Games-Howell test:

install.packages("rstatix")
library(rstatix)
ghtest <- games_howell_test(dataset_trim, FMPL ~ insurance, conf.level = 0.95)

We found statistically significant differences between all possible pairs of insurance levels, as seen in the p-values (Fig. 5).

Fig. 5
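For readers who want to see what a Games-Howell comparison computes, the sketch below implements the usual formulation in base R: a Welch-type standard error and degrees of freedom for each pair, with the p-value taken from the studentized range distribution. The function name games_howell_sketch and the simulated data are illustrative assumptions; this is not the rstatix implementation.

```r
# Hypothetical helper: pairwise Games-Howell p-values for y grouped by g.
games_howell_sketch <- function(y, g) {
  g <- factor(g)
  n <- tapply(y, g, length)
  m <- tapply(y, g, mean)
  v <- tapply(y, g, var)
  k <- nlevels(g)
  pairs <- combn(levels(g), 2)
  p_adj <- apply(pairs, 2, function(p) {
    a <- p[1]; b <- p[2]
    se2 <- v[a] / n[a] + v[b] / n[b]                 # Welch variance of the difference
    df  <- se2^2 / ((v[a] / n[a])^2 / (n[a] - 1) +
                    (v[b] / n[b])^2 / (n[b] - 1))    # Welch-Satterthwaite df
    q   <- abs(m[a] - m[b]) / sqrt(se2 / 2)          # studentized range statistic
    ptukey(q, nmeans = k, df = df, lower.tail = FALSE)
  })
  data.frame(group1 = pairs[1, ], group2 = pairs[2, ], p.adj = as.numeric(p_adj))
}

set.seed(7)
# Simulated data with unequal means and variances (not NHANES data).
sim <- data.frame(
  insurance = factor(rep(1:4, each = 80)),
  FMPL      = c(rnorm(80, 1.0, 0.5), rnorm(80, 1.5, 1.0),
                rnorm(80, 3.0, 1.5), rnorm(80, 2.5, 2.0))
)
gh <- games_howell_sketch(sim$FMPL, sim$insurance)
gh
```

Note that Games-Howell compares group means, so it answers a slightly different question from the rank-based Kruskal-Wallis test that precedes it.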

After running our Kruskal-Wallis and Games-Howell tests, we determined that the alternative hypothesis was supported, as the differences were statistically significant. To check that these results were not driven by extreme values, we looked for outliers. We loaded ggpubr into our library and used the following code to check for outliers in each level of insurance, as visualized via boxplot.

install.packages("ggpubr")
library(ggpubr)
ggplot(dataset_trim, aes(x = insurance, y = FMPL)) + geom_boxplot() + labs(x = "insurance", y = "FMPL")

As seen in Fig. 6, we found that we did have some outliers. The No insurance level had 31 outliers, public insurance had 245 outliers, and private and combined insurances had no outliers.

Fig. 6
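Outlier counts like the ones read off the boxplot can also be obtained programmatically with the same 1.5 × IQR rule. The sketch below uses simulated data, not NHANES values, and base R’s boxplot.stats; its counts may differ slightly from a ggplot2 boxplot because the two compute the box edges a little differently.

```r
set.seed(4)
# Simulated data: two skewed groups and two bounded groups (not NHANES data).
sim <- data.frame(
  insurance = factor(rep(1:4, each = 300)),
  FMPL      = c(rexp(300, 1.5), rexp(300, 1),
                runif(300, 0, 5), runif(300, 0, 5))
)

# boxplot.stats()$out returns the points lying more than 1.5 * IQR
# beyond the box, i.e. the points a boxplot draws individually.
n_out <- tapply(sim$FMPL, sim$insurance,
                function(x) length(boxplot.stats(x)$out))
n_out
```

In this simulation the bounded groups produce no flagged points, which illustrates how a variable with a fixed upper limit can show no outliers in groups whose values spread across the whole range.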

Because the data are not normally distributed, the outlier tests available to us, which assume normality, were not appropriate. We chose to let the outliers remain in our dataset, as our data came from NHANES, a credible federally sourced database, and the extreme values most likely represent true variation. To make sure that all of our variables had a significant impact on the type of insurance, we did a backwards variable elimination. The code:

library(ordinal)
reg = clm(insurance ~ FMPL + Age + Gender + Race, data = dataset_trim)

allowed us to see which variables were statistically significant predictors of insurance type. At each step we identified the term with the highest p-value and considered removing it from the model, repeating the process across all of the variables. In the end we found that all variables were statistically significant, except for the “Other Hispanic” category within the Race variable. We decided to keep this nonsignificant category in our model because the remaining Race categories were significant. Finally, we conducted our ordinal regression, which shows how our ordinal outcome is predicted by our four predictor variables.

ord_reg = clm(insurance ~ FMPL + Age + Gender + Race, data = dataset_trim)

For our ordinal regression model, we used a Mexican American male as our default baseline. We then exponentiated our coefficients so that we could better interpret them using the following code:

exp( ord_reg$coefficients)
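The fit-then-exponentiate workflow can be sketched on simulated data. To keep the example free of extra installs, it swaps in MASS::polr, a different proportional-odds implementation from the clm function used above; the data, the variable Female, and the true coefficients are all made up for illustration.

```r
library(MASS)  # polr(); MASS ships with standard R installations

set.seed(5)
n <- 600
sim <- data.frame(FMPL = runif(n, 0, 5), Female = rbinom(n, 1, 0.5))

# Latent-variable simulation: higher FMPL pushes toward higher categories.
latent <- 0.6 * sim$FMPL + 0.2 * sim$Female + rlogis(n)
sim$insurance <- cut(latent, breaks = c(-Inf, 0, 1.5, 3, Inf),
                     labels = 1:4, ordered_result = TRUE)

fit <- polr(insurance ~ FMPL + Female, data = sim, Hess = TRUE)

# Exponentiate only the slope coefficients; thresholds are not odds ratios.
odds_ratios <- exp(coef(fit))

# Wald 95% intervals on the odds-ratio scale.
se <- sqrt(diag(vcov(fit)))[names(coef(fit))]
ci <- exp(cbind(lower = coef(fit) - 1.96 * se, upper = coef(fit) + 1.96 * se))
cbind(OR = odds_ratios, ci)
```

If a fitted object’s coefficient vector also contains the threshold (cut-point) parameters, exponentiating the whole vector will produce values for those thresholds too; only the exponentiated slopes should be read as odds ratios.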

Our ordinal regression model output demonstrated that there is a significant association between an individual’s socioeconomic status, as seen in FMPL, and their level of insurance. Race is also a significant variable: non-Hispanic white individuals have the highest odds of having better coverage, followed by non-Hispanic Asian individuals, after accounting for financial status (FMPL). This is seen in Figures 7 and 8.

Fig. 7

Fig. 8

Specifically, we found the following results, each holding all other variables equal. For income, a one-unit increase in FMPL increases the odds of a higher insurance level by 58.9% (1.59 times the odds, p < .001). For age, a one-year increase in age increases the odds of a higher insurance status by 0.58%. For sex, an individual who is female has 1.13 times the odds (OR = 1.13, p < .001) of having a higher insurance status than an individual who is male. For race, each group is compared to Mexican Americans. Other Hispanic individuals have an odds ratio of 1.08, which is not statistically significant, so they do not differ significantly from Mexican Americans. Non-Hispanic white individuals have 2.60 times the odds (OR = 2.60, p < .001) of having a higher insurance status. Non-Hispanic Black individuals have 1.22 times the odds (OR = 1.22, p < .001), non-Hispanic Asian individuals have 1.91 times the odds (OR = 1.91, p < .001), and multi-racial individuals have 1.77 times the odds (OR = 1.77, p < .001) of having a higher level of insurance.
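The conversion between an odds ratio and a percent change in odds, used in the interpretation above, is simple arithmetic. The sketch below applies it to the odds ratios reported in the text; the helper name or_to_percent is just an illustrative choice.

```r
# Percent change in odds implied by an odds ratio.
or_to_percent <- function(or) (or - 1) * 100

# Odds ratios as reported in the text above.
reported <- c(FMPL = 1.589, Female = 1.13, NH_White = 2.60,
              NH_Black = 1.22, NH_Asian = 1.91, Multiracial = 1.77)
round(or_to_percent(reported), 1)

# Per-unit odds ratios compound multiplicatively: a two-unit rise in FMPL
# multiplies the odds by the per-unit odds ratio squared.
two_unit_or <- reported[["FMPL"]]^2
two_unit_or
```

This is why “1.59 times the odds” and “a 58.9% increase in the odds” describe the same estimate, while “2.60 times the odds” corresponds to a 160% increase.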

To supplement the results from our earlier tests (the Kruskal-Wallis test and ordinal regression), we also ran an ANOVA model. Even though the ANOVA assumptions of normality and equal variances were not met, we needed the model to obtain the F-statistic and effect size.

The following code was used in addition to our Kruskal-Wallis test, as both assess whether FMPL differs significantly across insurance levels.

ANOVA_test <- aov(FMPL ~ insurance, data = dataset_trim)
summary(ANOVA_test)

The ANOVA returned F(3, 8958) = 992.2, p < .001. This F-value indicates that there is a significant difference in mean FMPL across the insurance levels. Although these results are significant, we followed up with Welch’s one-way ANOVA because the data had failed the equal-variance assumption.

OW_test <- oneway.test(FMPL ~ insurance, data = dataset_trim)
OW_test

Welch’s ANOVA is used when the assumption of equal variances is not met. Our results showed a highly significant difference across our insurance levels. The very small p-value (p < 2.2e-16) indicates that mean FMPL is not the same across all insurance levels. In addition, the large F-value (F = 989.32) shows that the variation between groups was much larger than the variation within each group.
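Since the ANOVA was run partly to obtain an effect size, the sketch below shows one common way to get it: eta squared from the sums of squares, converted to Cohen’s f. The first part uses simulated data (not NHANES values); the last lines apply the standard identity eta² = df1·F / (df1·F + df2) to the F-statistic and degrees of freedom reported above.

```r
set.seed(6)
# Simulated, unbalanced groups with unequal variances (not NHANES data).
sim <- data.frame(
  insurance = factor(rep(1:4, times = c(60, 300, 320, 110))),
  FMPL      = c(rnorm(60, 1.2, 0.8), rnorm(300, 1.6, 1.0),
                rnorm(320, 3.4, 1.4), rnorm(110, 2.8, 1.5))
)

fit <- aov(FMPL ~ insurance, data = sim)
ss  <- summary(fit)[[1]][["Sum Sq"]]     # between-group SS, then residual SS
eta_sq   <- ss[1] / sum(ss)              # share of variance explained by group
cohens_f <- sqrt(eta_sq / (1 - eta_sq))  # Cohen's f

welch <- oneway.test(FMPL ~ insurance, data = sim)  # Welch by default
c(eta_sq = eta_sq, cohens_f = cohens_f, welch_F = unname(welch$statistic))

# Eta squared implied by the reported F(3, 8958) = 992.2.
eta_reported <- (3 * 992.2) / (3 * 992.2 + 8958)
f_reported   <- sqrt(eta_reported / (1 - eta_reported))
c(eta_reported = eta_reported, f_reported = f_reported)
```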

To check that our analysis had enough statistical power to support the conclusions drawn from the Kruskal-Wallis test and ordinal regression, we ran a power analysis using the following code.

install.packages("pwr")
library(pwr)
cohen.ES(test = "anov", size = "large")
table(dataset_trim$insurance)
pwr.anova.test(k = 4, n = c(598, 3290, 3450, 1124), f = 0.4, sig.level = 0.05)

Our power analysis returned a statistical power of approximately 1.0, meaning that with our sample sizes a true large effect (f = 0.4) between groups would be detected essentially every time. A statistical power this close to 1.0 is unusual and, in our case, was most likely caused by our very large sample size.
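The power figure can be cross-checked from first principles with the noncentral F distribution. The sketch below is an approximation under stated assumptions: pwr.anova.test is designed for balanced designs, so here the total sample size is used in the noncentrality parameter (lambda = N × f²), with the group counts taken from the call above. A small hypothetical design is included for contrast.

```r
# Power of a one-way ANOVA F test for effect size f (Cohen's f),
# k groups, and total sample size N, via the noncentral F distribution.
anova_power <- function(N, k, f, alpha = 0.05) {
  lambda <- N * f^2                                  # noncentrality parameter
  crit   <- qf(1 - alpha, df1 = k - 1, df2 = N - k)  # critical F under H0
  pf(crit, df1 = k - 1, df2 = N - k, ncp = lambda, lower.tail = FALSE)
}

n_groups <- c(598, 3290, 3450, 1124)  # group counts listed in the call above
power_large_sample <- anova_power(N = sum(n_groups), k = 4, f = 0.4)
power_large_sample

# Hypothetical contrast: the same effect size with only 20 people per group.
power_small_sample <- anova_power(N = 4 * 20, k = 4, f = 0.4)
power_small_sample
```

The contrast makes the point in the text concrete: the effect size is held fixed, and it is the sample size that drives the power toward 1.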

Conclusion:

Our analysis set out to determine the relationship between socioeconomic status, as measured by the Family Poverty Level Index (FMPL), and insurance status (no insurance, public insurance, private insurance, or a combination), while controlling for age, sex, and race. Because the ANOVA assumptions were violated, we relied on the Kruskal-Wallis test, which showed a significant relationship between FMPL and the insurance groups (χ²(3) = 2319.3, p < .001). The ordinal regression model then quantified this finding: financial status, as seen in FMPL, was the biggest predictor in our model. This finding strongly supports our alternative hypothesis (H1) that there is a statistically significant relationship between socioeconomic status and level of insurance. Our demographic variables were also significant, with race being the most powerful, as seen in non-Hispanic white individuals, who had 2.60 times the odds (OR = 2.60, p < .001) of having a higher insurance status (e.g., private insurance). Our statistical analysis concluded that in the United States access to health insurance is associated with socioeconomic status as well as race, age, and sex.

References:

Andrew Le, M. (2024, October 24). The Pros and Cons of Public Health Insurance. Buoy Health: Check Symptoms & Find the Right Care. https://www.buoyhealth.com/insurance/pros-cons-public-health-insurance?utm_source=chatgpt.com

Centers for Disease Control and Prevention. (2024). National Health and Nutrition Examination Survey. Centers for Disease Control and Prevention. https://www.cdc.gov/nchs/nhanes/index.html

De La Mata, D. (2012). The effect of Medicaid eligibility on coverage, utilization, and children’s health. Health Economics. https://pubmed.ncbi.nlm.nih.gov/22807287/

Korenman, S., & Remler, D. K. (2016, February 15). Including Health Insurance in Poverty Measurement: The Impact of Massachusetts Health Reform on Poverty. NBER. https://www.nber.org/papers/w21990

Bunch, L. N., & Ketema, H. (2025, September 9). Health insurance coverage in the United States: 2024. Census.gov. https://www.census.gov/library/publications/2025/demo/p60-288.html

Norris, A. L. (2025, November 11). What is the Federal Poverty Level (FPL)?. healthinsurance.org. https://www.healthinsurance.org/glossary/federal-poverty-level/

Wray, C. M., Khare, M., & Keyhani, S. (2021). Experiences with care among US adults with private and public health insurance. https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2780540