In this project, I will explore the 2020 Heart Disease dataset, which contains health-related information on various demographic and lifestyle factors associated with heart disease.
• Topic: Heart Disease and its associations with different risk factors.
• Data: “heart_2020_cleaned.csv,” originally derived from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2020.
• Variables of Interest:
HeartDisease (categorical: “Yes”/“No”)
BMI (quantitative)
Smoking (categorical: “Yes”/“No”)
AlcoholDrinking (categorical: “Yes”/“No”)
PhysicalHealth (quantitative: number of days in poor physical health)
GenHealth (quantitative: number of days in poor physical health)
PhysicalActivity (categorical: “Yes”/“No”)
I chose this dataset because cardiovascular health is a worldwide issue, and understanding patterns in heart-disease risk factors is personally meaningful through some of my family having it in the past. The data was collected via a telephone survey as described on the CDC BRFSS website. I cleaned it by removing incomplete or invalid entries (the provided dataset was already curated very well).
Load in Libraries and data
# Load necessary packageslibrary(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
library(patchwork)library(webshot2)# Set working directorysetwd("C:/Users/gitar/Documents/heart_2020_cleaned.csv")# read_csv (not read.csv())heart <-read_csv("heart_2020_cleaned.csv")
Rows: 319795 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick check of data structurehead(heart)
# A tibble: 6 × 18
HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 No 16.6 Yes No No 3 30
2 No 20.3 No No Yes 0 0
3 No 26.6 Yes No No 20 30
4 No 24.2 No No No 0 0
5 No 23.7 No No No 28 0
6 Yes 28.9 Yes No No 6 0
# ℹ 11 more variables: DiffWalking <chr>, Sex <chr>, AgeCategory <chr>,
# Race <chr>, Diabetic <chr>, PhysicalActivity <chr>, GenHealth <chr>,
# SleepTime <dbl>, Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>
Filter the data
heart_clean <- heart |>mutate( # the ifelse command creates a numeric value from the datasets original categorical variable. This helps with the regression and will be undone later.HeartDisease =ifelse(HeartDisease =="Yes", 1, 0)) |>mutate( # Puts GenHelath on a numerical scale from 1-5GenHealth =match(GenHealth, c("Poor", "Fair", "Good", "Very good", "Excellent"))) |>filter(MentalHealth >=1) |># remove missing valuesfilter(PhysicalHealth >=1) |># remove missing valuesselect(BMI, SleepTime, HeartDisease, Smoking, AlcoholDrinking,PhysicalActivity, GenHealth, MentalHealth, PhysicalHealth)
Using mutate I transformed the column HeartDisease from a categorical to a numerical variable, then I did the same for GenHealth but this time using match() since I wanted to use both in the regression model. Using filter() I eliminated any missing data from the MentalHealth and PhysicalHelath columns and used select() to only keep important data points relevant to the visualizations I plan to construct.
Logistic Regression Model
model <-glm(HeartDisease ~ BMI + MentalHealth + PhysicalHealth + SleepTime + GenHealth,data = heart_clean,family ="binomial")# I had to use glm instead of lm since HeartDisease is a binary factor. https://stackoverflow.com/questions/42051315/glm-for-logistic-regression-vs-self-coded-logistic-regression-model-in-rsummary(model)
Call:
glm(formula = HeartDisease ~ BMI + MentalHealth + PhysicalHealth +
SleepTime + GenHealth, family = "binomial", data = heart_clean)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.345854 0.094390 -3.664 0.000248 ***
BMI 0.001585 0.001713 0.925 0.354985
MentalHealth -0.006373 0.001396 -4.564 5.02e-06 ***
PhysicalHealth 0.015245 0.001495 10.198 < 2e-16 ***
SleepTime 0.009953 0.006741 1.477 0.139786
GenHealth -0.736572 0.017283 -42.619 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 38433 on 50732 degrees of freedom
Residual deviance: 34466 on 50727 degrees of freedom
AIC: 34478
Number of Fisher Scoring iterations: 5
Variable
Coefficient
p-value
Interpretation
BMI
+0.0016
0.355
Not significant. BMI doesn’t have a strong relationship with heart disease here.
MentalHealth
-0.0064
5.02e-06
Very counterintuative but each additional day of poor mental health slightly lowers the odds of heart disease.
PhysicalHealth
+0.0152
<2e-16
Each additional day of poor physical health increases the odds of heart disease.
SleepTime
+0.0100
0.140
Not significant. Sleep time doesn’t clearly predict heart disease here.
GenHealth
-0.737
<2e-16
Better self-rated general health strongly reduces the odds of heart disease.
Table of Variables and Significance in Relation to HeartDisease
Now knowing what values are significant I can proceed to randomly selecting around 3,000 observations since the dataset originally has 300,000, and the filtered version still has over 50,000.
Plots comparing Heart Disease with Mental Health, General Health, and Physical Health
heart_sample <- heart_sample |>mutate(HeartDisease =ifelse(HeartDisease ==1, "Yes", "No"))#Same thing I did when I was filtering, except now we need the opposite
Plot 1: Health and BMI in Relation to Heart Disease
#Prepare the dataheart_facet <- heart_sample |>select(HeartDisease, PhysicalHealth, GenHealth, BMI) |>pivot_longer( #covert to long formatcols =c(PhysicalHealth, GenHealth, BMI),names_to ="HealthMetric",values_to ="Value" )# Plot facet_wrapp1 <-ggplot(heart_facet, aes(x = HeartDisease, y = Value, fill = HeartDisease)) +geom_boxplot(alpha =0.7) +facet_wrap(~HealthMetric, scales ="free_y") +scale_fill_manual(values =c("darkgreen", "firebrick")) +labs(title ="Health Metrics by Heart Disease Status",x ="Heart Disease",y ="Value" ) +theme_minimal()ggplotly(p1)
Plot 2: Behavior Patterns in Relation to Heart Disease
#Same thing as first plotheart_long <- heart_sample |>pivot_longer(cols =c(Smoking, AlcoholDrinking, PhysicalActivity),names_to ="Behavior",values_to ="Response" )#summarise and group together the behaviors based on groups associated with Heart diseaseheart_summary <- heart_long |>group_by(Behavior, Response, HeartDisease) |>summarise(count =n(), .groups ="drop") |>group_by(Behavior, Response) |>mutate(prop = count /sum(count))# Plot facet_wrapp2 <-ggplot(heart_summary, aes(x = Response, y = prop, fill = HeartDisease)) +geom_col(position ="fill") +facet_wrap(~Behavior, scales ="free_x") +scale_fill_manual(values =c("skyblue", "firebrick")) +labs(title ="Heart Disease Proportion by Behaviors",x ="Behavior Response",y ="Proportion",fill ="Heart Disease" ) +theme_minimal()ggplotly(p2)
Final Reflections
According to the CDC, about 1 in every 5 deaths in the U.S. is caused by heart disease, and most cases are preventable through changes in daily behaviors. Smoking, poor diet, and lack of exercise are contributing factors to cardiovascular risk (“Heart Disease Facts”, CDC.gov, 2023).
Studies confirm that self-rated health (like the GenHealth variable) is a reliable predictor of future heart events, sometimes even more predictive than lab tests (Idler & Benyamini, 1997). Physical inactivity, especially when combined with poor nutrition and high BMI, significantly increases heart disease risk over time (Lee et al., Lancet, 2012). These findings support the relationships shown in my visualizations.
The visualizations in this project show that smoking and lack of physical activity are pretty strongly associated with higher rates of heart disease, while better general health and more physical activity relate to lower rates. One surprise was that MentalHealth showed a weak or maybe negative relationship with heart disease in some models wjich can show the complexity of how mental health symptoms are reported.
I also found that some variables like SleepTime and BMI had weaker associations than expected. In the future, I want to explore interactions in more depth, possibly looking at demographics or regions as well.
One thing I hoped to include but couldn’t get working due to data limitations was an interactive heatmap of behavior combinations by heart disease status in different regions of the country.
Idler, E. L., & Benyamini, Y. (1997). Self-rated health and mortality: a review of twenty-seven community studies. Journal of Health and Social Behavior, 21–37.
Lee, I. M., Shiroma, E. J., Lobelo, F., et al. (2012). Effect of physical inactivity on major non-communicable diseases worldwide: an analysis of burden of disease and life expectancy. The Lancet, 380(9838), 219–229.