# Introduction

This project explores whether there is a connection between health concerns and poverty. I work in public health, so I am curious about how social determinants of health impact our health. The data set that I used (described below) is widely used for this type of exploration. I wanted to explore three things with it:

1. Self-reported health ratings by poverty scores
2. The number of health concerns of those below the poverty threshold (less than 1 on a 0-5 scale)
3. The connection between the number of health concerns and poverty
The NHANES data set from 2011-2012 was used for the analysis. The survey data are collected by the US National Center for Health Statistics (NCHS), which asks participants a series of questions about demographics, health, and lifestyle. A health examination is also conducted. More information about the survey and the data set can be found here: https://www.cdc.gov/nchs/nhanes/about_nhanes.htm
Demographic variables:

- SurveyYr: Survey year in which the participant took part.
- Age: Age in years at screening of study participant.
- Gender: Gender (sex) of study participant, coded as male or female.
- Race3: Reported race of study participant.
- Poverty: A ratio of family income to poverty guidelines. Smaller numbers indicate more poverty.
Health-related variables:

- BMI: Body mass index (weight/height^2 in kg/m^2).
- BPSysAve: Combined systolic blood pressure reading.
- BPDiaAve: Combined diastolic blood pressure reading.
- TotChol: Total HDL cholesterol in mmol/L.
- Diabetes: Study participant told by a doctor or health professional that they have diabetes.
- Depressed: Self-reported number of days where participant felt down, depressed or hopeless.
- SleepTrouble: Participant has told a doctor or other health professional that they had trouble sleeping.
- HealthGen: Self-reported rating of participant’s health in general.
Disclaimers: For NHANES datasets, the use of sampling weights and sample design variables is recommended for all analyses because the sample design is a clustered design and incorporates differential probabilities of selection. If you fail to account for the sampling parameters, you may obtain biased estimates and overstate significance levels.
Please note that the data sets provided in this package are derived from the NHANES database and have been adapted for educational purposes. As such, they are NOT suitable for use as a research database. For research purposes you should download original data files from the NHANES website and follow the analysis instructions given there.
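To illustrate the first disclaimer, here is a minimal sketch with made-up numbers showing how a sampling weight can shift an estimate. The `weight` column and the `toy` data frame are hypothetical and are not part of the data set used in this project. A full analysis would also account for the clustered design (for example with the `survey` package), which this sketch does not attempt.

```r
# Toy data: 'weight' is a hypothetical sampling-weight column,
# not a variable from the data set analyzed in this project.
toy <- data.frame(
  Poverty = c(0.5, 0.8, 2.0, 4.0, 5.0),
  weight  = c(3000, 2500, 1000, 500, 500)
)

# Unweighted mean: every respondent counts equally
unweighted <- mean(toy$Poverty)

# Weighted mean: each respondent counts in proportion to 'weight'
weighted <- weighted.mean(toy$Poverty, w = toy$weight)

c(unweighted = unweighted, weighted = weighted)
```

In this toy example the low-poverty-score respondents carry the largest weights, so the weighted mean is lower than the unweighted one.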
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
##
## The following object is masked from 'package:ggplot2':
##
## last_plot
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following object is masked from 'package:graphics':
##
## layout
library(RColorBrewer)
library(viridis)
## Loading required package: viridisLite
library(viridisLite)
library(treemap)
setwd("/Users/smhenderson/Desktop/DATA110/R/Datasets")
nhanes <- read.csv("nhanes.csv")
#Create a subset of the data set to be used during the analysis
nhanes2 <- nhanes %>%
filter(SurveyYr == "2011_12") %>%
filter(Age >=18) %>%
select("HealthGen", "Gender", "Race3", "Poverty", "BMI", "BPSysAve", "BPDiaAve", "Diabetes", "TotChol", "Depressed", "SleepTrouble")
#summary(nhanes2)
#colnames(nhanes2)
First, I wanted to look at the self-reported health ratings by poverty scores. According to an article published by the American Academy of Family Physicians (AAFP), poverty has a significant impact on health because it restricts access to essential resources, such as nutritious food, suitable housing, safe environments to live and work in, and other factors that contribute to an individual’s overall well-being. People living in low-income or high-poverty areas often face health challenges due to the cumulative effect of these factors. The distributions in the boxplot follow a similar trend.
#Handle NAs and recode the HealthGen variable
nhanes_healthgen <- nhanes2 %>%
filter(!is.na(HealthGen) & !is.na(Poverty)) %>%
mutate(HealthGen = recode(HealthGen, "Vgood" = "Very Good"))
#Prepare the data so that it can be used to create a boxplot
nhanes_healthgen$HealthGen <- factor(nhanes_healthgen$HealthGen,
levels = c("Poor", "Fair", "Good", "Very Good", "Excellent"))
num_colors <- length(levels(nhanes_healthgen$HealthGen))
colors <- viridis_pal(option = "D")(num_colors)
#Create boxplot
ggplot(nhanes_healthgen, aes(x = HealthGen, y = Poverty, fill = HealthGen)) +
geom_boxplot() +
scale_fill_manual(values = colors) +
labs(x = "Health Ratings", y = "Poverty Scores", caption = "Poverty Scores: A value less than 1 indicates the family is below the poverty threshold.") +
theme(plot.caption = element_text(hjust = 0, size = 7),
plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "white", color = "gray"),
panel.grid.minor = element_line(color = "gray"),
legend.position = "none") +
ggtitle("Self-Reported General Health Ratings by Poverty Scores")
p <- ggplot(nhanes_healthgen, aes(x = HealthGen, y = Poverty, fill = HealthGen)) +
geom_boxplot() +
labs(x = "Health Ratings", y = "Poverty Scores",
caption = "Poverty Scores: A value less than 1 indicates the family is below the poverty threshold.") +
theme(plot.caption = element_text(hjust = 0, size = 7),
plot.title = element_text(hjust = 0.5),
panel.background = element_rect(fill = "white", color = "gray"),
panel.grid.minor = element_line(color = "gray"),
legend.position = "none") +
ggtitle("Self-Reported General Health Ratings by Poverty Scores")
# Convert the plot to an interactive plotly object
p_interactive <- ggplotly(p, tooltip = "text")
# Show the interactive plot
p_interactive
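A boxplot shows the median of each group as the line inside the box, so the same comparison can also be read off as numbers. The sketch below is a minimal illustration on a made-up data frame (`toy` is hypothetical and uses only three rating levels); it is not the project's data or results.

```r
# Toy data standing in for the HealthGen and Poverty columns
toy <- data.frame(
  HealthGen = factor(
    c("Poor", "Poor", "Good", "Good", "Excellent", "Excellent"),
    levels = c("Poor", "Good", "Excellent")
  ),
  Poverty = c(0.6, 1.2, 1.8, 3.0, 4.2, 5.0)
)

# Median poverty score within each health rating,
# reported in the order of the factor levels
medians <- tapply(toy$Poverty, toy$HealthGen, median)
medians
```

Setting the factor levels first, as the project code does for `HealthGen`, keeps the groups in a meaningful order rather than alphabetical order.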
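#Create binary flags (1 = health concern present, 0 = not present) for each health variable
#Note: bmi2 flags a BMI outside the 18.5-25 range, so it counts underweight as well as overweight
nhanes_health <- nhanes2 %>%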
mutate(bmi2 = ifelse(BMI <= 18.5 | BMI >= 25, 1, 0)) %>%
mutate(diabetes2 = ifelse(Diabetes == "Yes", 1, 0)) %>%
mutate(BP = ifelse(BPSysAve <=120 & BPDiaAve <=80, 0,1)) %>%
mutate(sleeptrouble2 = ifelse(SleepTrouble == "Yes", 1, 0)) %>%
mutate(totchol2 = ifelse(TotChol <5.2, 0, 1)) %>%
mutate(depressed2 = ifelse(Depressed == "Most", 1,0))
#colnames(nhanes_health)
Next, I wanted to look at the number of health concerns among survey participants below the poverty threshold. I thought it would be interesting to see the most commonly indicated health concerns for this group. The treemap shows the health concerns of survey respondents below the poverty threshold, with elevated BMI being the most frequent health concern (note that the BMI flag also counts underweight respondents). This isn’t surprising given that poor dietary and fitness habits can be associated with poverty, as described in the AAFP article above.
#Filter the data set to those below the poverty threshold, then keep only the needed variables
nhanes_health2 <- nhanes_health %>%
filter(Poverty <1) %>%
arrange(desc(Poverty)) %>%
select("bmi2", "diabetes2", "BP", "sleeptrouble2", "totchol2", "depressed2")
#Tally up the number of each health concern
nhanes_health3 <- gather(nhanes_health2, key = "condition", value = "value", bmi2:depressed2, na.rm=TRUE)
nhanes_health4 <- nhanes_health3 %>%
group_by(condition) %>%
summarise(total = sum(value)) %>%
arrange(desc(total))
#Rename conditions in the dataframe to be shown in the treemap
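#Match labels to condition codes by name so they stay correct regardless of the sort order
nhanes_health4 <- nhanes_health4 %>%
mutate(condition_renamed = recode(condition,
"bmi2" = "Elevated Body Mass Index",
"BP" = "Elevated Blood Pressure",
"totchol2" = "Elevated Cholesterol",
"sleeptrouble2" = "Reported Sleep Troubles",
"depressed2" = "Reported Depression",
"diabetes2" = "Reported Diabetes"))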
#Create treemap
treemap(nhanes_health4, index = "condition_renamed", vSize = "total",
vColor = "total", type = "manual",
palette = viridis_pal(option = "D")(length(nhanes_health4$condition)),
title = "Health Concerns of those Below the Poverty Threshold")
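The tally step above reshapes the flag columns from wide to long and then sums each flag. As a minimal sketch of the same idea, the block below uses `pivot_longer()`, which the tidyr documentation recommends over the superseded `gather()`. The `toy_flags` data frame is made up for illustration and has only three of the six flags.

```r
library(dplyr)
library(tidyr)

# Toy 0/1 flags with a few missing values (hypothetical data)
toy_flags <- data.frame(
  bmi2      = c(1, 1, 0, NA),
  diabetes2 = c(0, 1, 0, 0),
  BP        = c(1, NA, 1, 1)
)

# Reshape to one row per respondent-condition pair, dropping NAs,
# then count how many respondents have each condition
toy_tally <- toy_flags %>%
  pivot_longer(everything(),
               names_to = "condition",
               values_to = "value",
               values_drop_na = TRUE) %>%
  group_by(condition) %>%
  summarise(total = sum(value)) %>%
  arrange(desc(total))

toy_tally
```

Because missing values are dropped per condition, each total is based on a slightly different number of respondents, which is worth keeping in mind when comparing tile sizes in the treemap.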
#Create new variable that assigns Gender & Race to each survey respondent
nhanes_demo <- nhanes_health %>%
mutate(race_gender = ifelse(Race3 == "Asian" & Gender == "female", "Asian Women",
ifelse(Race3 == "Asian" & Gender == "male", "Asian Men",
ifelse(Race3 == "Black" & Gender == "female", "Black Women",
ifelse(Race3 == "Black" & Gender == "male", "Black Men",
ifelse(Race3 == "Hispanic" & Gender == "female", "Latinx Women",
ifelse(Race3 == "Mexican" & Gender == "female", "Latinx Women",
ifelse(Race3 == "Hispanic" & Gender == "male", "Latinx Men",
ifelse(Race3 == "Mexican" & Gender == "male", "Latinx Men",
ifelse(Race3 == "White" & Gender == "female", "White Women",
ifelse(Race3 == "White" & Gender == "male", "White Men", NA))))))))))) %>%
filter(!is.na(race_gender))
#Recode Latinx group
nhanes_demo2 <- nhanes_demo %>%
mutate(Race3 =recode(Race3, "Mexican" = "Latinx", "Hispanic" = "Latinx"))
nhanes_demo3 <- nhanes_demo2 %>%
rowwise() %>%
mutate(healthrisks_count = sum(diabetes2, bmi2, BP, sleeptrouble2, totchol2, depressed2)) %>%
select("Poverty", "race_gender", "healthrisks_count") %>%
filter(!is.na(healthrisks_count) & !is.na(Poverty) & !is.na(race_gender))
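The count above is `NA` whenever any one of the six flags is missing, so those respondents are removed by the final filter. The sketch below shows that behavior on a made-up data frame, using base R's vectorized `rowSums()` as an alternative to `rowwise()` with `sum()`. The `toy` data and the `healthrisks_known` column are hypothetical and only illustrate the trade-off.

```r
# Toy 0/1 flags; the third respondent has a missing diabetes flag
toy <- data.frame(
  bmi2      = c(1, 0, 1),
  diabetes2 = c(0, 0, NA),
  BP        = c(1, 1, 1)
)
flag_cols <- c("bmi2", "diabetes2", "BP")

# Strict count: NA if any flag is missing (matches the approach above)
toy$healthrisks_count <- rowSums(toy[, flag_cols])

# Lenient count: adds up only the flags that are known
toy$healthrisks_known <- rowSums(toy[, flag_cols], na.rm = TRUE)

toy
```

The strict version keeps the counts comparable across respondents at the cost of a smaller sample; the lenient version keeps everyone but can undercount concerns for respondents with missing measurements.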
The last thing that I wanted to look at was the number of health concerns by poverty score and by race and gender. The visualization shows that a large share of participants have at least 3 health concerns, regardless of race/gender and poverty score. So, I took it one step further in the next chunk.
plot1 <- nhanes_demo3 %>%
ggplot(aes(Poverty, healthrisks_count))+
geom_point(aes(color = race_gender))+
facet_wrap(~race_gender) +
ggtitle("Number of Health Concerns by Poverty & by Race, Gender") +
labs(x = "Poverty Score", y = "Number of Health Concerns", caption = "Poverty Score: A value less than 1 indicates the family is below the poverty threshold.") +
theme(plot.title = element_text(hjust = 0.5),
legend.position = "none",
plot.caption = element_text(hjust = 0, size = 7))
plot1
When looking only at respondents with 3+ health concerns, a somewhat clearer picture emerges. Asian men and women reported the fewest health concerns. For most of the other groups, it appears that the number of health concerns decreases slightly as the poverty score moves closer to 5.
#Filter to show 3+ health concerns
nhanes_demo4 <- nhanes_demo3 %>%
filter(healthrisks_count >=3)
#Create facet-wrap
plot2 <- nhanes_demo4 %>%
ggplot(aes(Poverty, healthrisks_count)) +
geom_point(aes(color = race_gender), size = 0.8) +
facet_wrap(~race_gender) +
ggtitle("Number of Health Concerns and Poverty by Race, Gender") +
scale_y_continuous(limits = c(3, 6), breaks = seq(0, 6, 1)) +
theme(
panel.background = element_rect(fill = "white", color = "gray"),
panel.grid.minor = element_line(color = "gray"),
legend.position = "none",
plot.title = element_text(hjust = 0.5),
strip.background = element_rect(fill = "navyblue", color = "navyblue"),
strip.text = element_text(color = "white"),
plot.caption = element_text(hjust = 0, size = 7)) +
labs(x = "Poverty Score", y = "Number of Health Concerns", caption = "Poverty Score: A value less than 1 indicates the family is below the poverty threshold.") +
scale_color_brewer(palette = "Set1") +
geom_vline(xintercept = 1, linetype = "solid", color = "black")
plot2
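A visual impression like the one from the faceted plots can be complemented with a single number. One option, offered here only as a hedged sketch and not as part of the original analysis, is Spearman's rank correlation, which suits a count outcome with many ties. The `toy` values are made up and say nothing about the actual survey results.

```r
# Toy data: poverty score and number of health concerns (hypothetical)
toy <- data.frame(
  Poverty           = c(0.4, 0.9, 1.5, 2.2, 3.1, 4.0, 4.8, 5.0),
  healthrisks_count = c(4, 3, 3, 2, 3, 1, 2, 1)
)

# Spearman's rho: correlation between the ranks of the two variables.
# Negative values mean fewer concerns at higher poverty scores.
rho <- cor(toy$Poverty, toy$healthrisks_count, method = "spearman")
rho
```

Computing this within each race/gender group would mirror the facets, and an unweighted correlation carries the same caveat about sampling weights as the rest of the analysis.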
Overall, there does not appear to be a large difference in the number of health concerns across poverty scores. Based on the AAFP article, one would expect noticeably fewer health concerns as the poverty score moves closer to 5. A couple of things to note: if this data set had been weighted, the findings might have been different. If I had more time, I would have dedicated effort to weighting the data set. Also, this data set was created for educational purposes only, and it is unclear how much data manipulation occurred.
References: American Academy of Family Physicians (AAFP), poverty and health policy page: https://www.aafp.org/about/policies/all/poverty-health.html