Final Project-Mat Shaposhnikov

Author

Matvei Shaposhnikov

Source: https://www.cdc.gov/heart-disease/data-research/facts-stats/index.html

Source: https://www.cdc.gov/heart-disease/data-research/facts-stats/index.html

DATA 110 Final Project

2020 Heart Disease Data Exploration

Matvei Shaposhnikov

In this project, I will explore the 2020 Heart Disease dataset, which contains health-related information on various demographic and lifestyle factors associated with heart disease.

Topic: Heart Disease and its associations with different risk factors.
Data: “heart_2020_cleaned.csv,” originally derived from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2020.
Variables of Interest:

  • HeartDisease (categorical: “Yes”/“No”)

  • BMI (quantitative)

  • Smoking (categorical: “Yes”/“No”)

  • AlcoholDrinking (categorical: “Yes”/“No”)

  • PhysicalHealth (quantitative: number of days in poor physical health)

  • GenHealth (quantitative: number of days in poor physical health)

  • PhysicalActivity (categorical: “Yes”/“No”)

I chose this dataset because cardiovascular health is a worldwide issue, and understanding patterns in heart-disease risk factors is personally meaningful through some of my family having it in the past. The data was collected via a telephone survey as described on the CDC BRFSS website. I cleaned it by removing incomplete or invalid entries (the provided dataset was already curated very well).

Load in Libraries and data

# Load necessary packages
library(tidyverse) 
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)  

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout
library(patchwork)
library(webshot2)

# Set working directory
setwd("C:/Users/gitar/Documents/heart_2020_cleaned.csv")

# read_csv (not read.csv())
heart <- read_csv("heart_2020_cleaned.csv")
Rows: 319795 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl  (4): BMI, PhysicalHealth, MentalHealth, SleepTime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick check of data structure
head(heart)
# A tibble: 6 × 18
  HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
  <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>        <dbl>
1 No            16.6 Yes     No              No                  3           30
2 No            20.3 No      No              Yes                 0            0
3 No            26.6 Yes     No              No                 20           30
4 No            24.2 No      No              No                  0            0
5 No            23.7 No      No              No                 28            0
6 Yes           28.9 Yes     No              No                  6            0
# ℹ 11 more variables: DiffWalking <chr>, Sex <chr>, AgeCategory <chr>,
#   Race <chr>, Diabetic <chr>, PhysicalActivity <chr>, GenHealth <chr>,
#   SleepTime <dbl>, Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>

Filter the data

heart_clean <- heart |> 
  mutate( # the ifelse command creates a numeric value from the datasets original categorical variable. This helps with the regression and will be undone later.
    HeartDisease = ifelse(HeartDisease == "Yes", 1, 0)) |>
    mutate( # Puts GenHelath on a numerical scale from 1-5
      GenHealth = match(GenHealth, c("Poor", "Fair", "Good", "Very good", "Excellent"))) |>
  filter(MentalHealth >= 1) |> # remove missing values
  filter(PhysicalHealth >= 1) |> # remove missing values
  select(BMI, SleepTime, HeartDisease, Smoking, AlcoholDrinking,PhysicalActivity, GenHealth, MentalHealth, PhysicalHealth)
Using mutate I transformed the column HeartDisease from a categorical to a numerical variable, then I did the same for GenHealth but this time using match() since I wanted to use both in the regression model. Using filter() I eliminated any missing data from the MentalHealth and PhysicalHelath columns and used select() to only keep important data points relevant to the visualizations I plan to construct.

Logistic Regression Model

model <- glm(HeartDisease ~ BMI + MentalHealth + PhysicalHealth + SleepTime + GenHealth,
            data = heart_clean,
            family = "binomial")
# I had to use glm instead of lm since HeartDisease is a binary factor. https://stackoverflow.com/questions/42051315/glm-for-logistic-regression-vs-self-coded-logistic-regression-model-in-r

summary(model)

Call:
glm(formula = HeartDisease ~ BMI + MentalHealth + PhysicalHealth + 
    SleepTime + GenHealth, family = "binomial", data = heart_clean)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -0.345854   0.094390  -3.664 0.000248 ***
BMI             0.001585   0.001713   0.925 0.354985    
MentalHealth   -0.006373   0.001396  -4.564 5.02e-06 ***
PhysicalHealth  0.015245   0.001495  10.198  < 2e-16 ***
SleepTime       0.009953   0.006741   1.477 0.139786    
GenHealth      -0.736572   0.017283 -42.619  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 38433  on 50732  degrees of freedom
Residual deviance: 34466  on 50727  degrees of freedom
AIC: 34478

Number of Fisher Scoring iterations: 5
Variable Coefficient p-value Interpretation
BMI +0.0016 0.355 Not significant. BMI doesn’t have a strong relationship with heart disease here.
MentalHealth -0.0064 5.02e-06 Very counterintuative but each additional day of poor mental health slightly lowers the odds of heart disease.
PhysicalHealth +0.0152 <2e-16 Each additional day of poor physical health increases the odds of heart disease.
SleepTime +0.0100 0.140 Not significant. Sleep time doesn’t clearly predict heart disease here.
GenHealth -0.737 <2e-16 Better self-rated general health strongly reduces the odds of heart disease.

Table of Variables and Significance in Relation to HeartDisease

Now knowing what values are significant I can proceed to randomly selecting around 3,000 observations since the dataset originally has 300,000, and the filtered version still has over 50,000.

Randomly sample 3,000 rows

set.seed(123)  # Random seed ensures reproducible sampling
heart_sample <- heart_clean |>
  slice_sample(n = 3000)

Plots comparing Heart Disease with Mental Health, General Health, and Physical Health

heart_sample <- heart_sample |>
  mutate(HeartDisease = ifelse(HeartDisease == 1, "Yes", "No"))
#Same thing I did when I was filtering, except now we need the opposite

Plot 1: Health and BMI in Relation to Heart Disease

#Prepare the data
heart_facet <- heart_sample |>
  select(HeartDisease, PhysicalHealth, GenHealth, BMI) |>
  pivot_longer( #covert to long format
    cols = c(PhysicalHealth, GenHealth, BMI),
    names_to = "HealthMetric",
    values_to = "Value"
  )

# Plot facet_wrap
p1 <- ggplot(heart_facet, aes(x = HeartDisease, y = Value, fill = HeartDisease)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~HealthMetric, scales = "free_y") +
  scale_fill_manual(values = c("darkgreen", "firebrick")) +
  labs(
    title = "Health Metrics by Heart Disease Status",
    x = "Heart Disease",
    y = "Value"
  ) +
  theme_minimal()

ggplotly(p1)

Plot 2: Behavior Patterns in Relation to Heart Disease

#Same thing as first plot
heart_long <- heart_sample |>
  pivot_longer(
    cols = c(Smoking, AlcoholDrinking, PhysicalActivity),
    names_to = "Behavior",
    values_to = "Response"
  )

#summarise and group together the behaviors based on groups associated with Heart disease
heart_summary <- heart_long |>
  group_by(Behavior, Response, HeartDisease) |>
  summarise(count = n(), .groups = "drop") |>
  group_by(Behavior, Response) |>
  mutate(prop = count / sum(count))

# Plot facet_wrap
p2 <- ggplot(heart_summary, aes(x = Response, y = prop, fill = HeartDisease)) +
  geom_col(position = "fill") +
  facet_wrap(~Behavior, scales = "free_x") +
  scale_fill_manual(values = c("skyblue", "firebrick")) +
  labs(
    title = "Heart Disease Proportion by Behaviors",
    x = "Behavior Response",
    y = "Proportion",
    fill = "Heart Disease"
  ) +
  theme_minimal()

ggplotly(p2)

Final Reflections

According to the CDC, about 1 in every 5 deaths in the U.S. is caused by heart disease, and most cases are preventable through changes in daily behaviors. Smoking, poor diet, and lack of exercise are contributing factors to cardiovascular risk (“Heart Disease Facts”, CDC.gov, 2023).

Studies confirm that self-rated health (like the GenHealth variable) is a reliable predictor of future heart events, sometimes even more predictive than lab tests (Idler & Benyamini, 1997). Physical inactivity, especially when combined with poor nutrition and high BMI, significantly increases heart disease risk over time (Lee et al., Lancet, 2012). These findings support the relationships shown in my visualizations.

The visualizations in this project show that smoking and lack of physical activity are pretty strongly associated with higher rates of heart disease, while better general health and more physical activity relate to lower rates. One surprise was that MentalHealth showed a weak or maybe negative relationship with heart disease in some models wjich can show the complexity of how mental health symptoms are reported.

I also found that some variables like SleepTime and BMI had weaker associations than expected. In the future, I want to explore interactions in more depth, possibly looking at demographics or regions as well.

One thing I hoped to include but couldn’t get working due to data limitations was an interactive heatmap of behavior combinations by heart disease status in different regions of the country.

References:

Centers for Disease Control and Prevention. (2023). Heart Disease Facts. https://www.cdc.gov/heartdisease/facts.htm

Idler, E. L., & Benyamini, Y. (1997). Self-rated health and mortality: a review of twenty-seven community studies. Journal of Health and Social Behavior, 21–37.

Lee, I. M., Shiroma, E. J., Lobelo, F., et al. (2012). Effect of physical inactivity on major non-communicable diseases worldwide: an analysis of burden of disease and life expectancy. The Lancet, 380(9838), 219–229.