Cardiovascular health statistics

Author

Ayomide Joe-Adigwe

CDI — CARDIOVASCULAR DISEASE INFOGRAPHIC

Introduction

For this analysis, I selected the Heart Disease Health Indicators Dataset from the CDC’s 2020 annual survey, which includes data from over 300,000 adults in the United States. The dataset contains both categorical and quantitative variables covering several aspects of cardiovascular health. I chose it to examine the relationship between the prevalence of heart disease and risk factors such as smoking, alcohol drinking, and diabetes. By analyzing these factors, I aim to better understand how lifestyle and chronic conditions are associated with heart disease and to identify risk factors that can inform preventive health strategies.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor) # for cleaning column names


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

# Set the working directory to the folder that contains the data file
setwd("/Users/ayomidealagbada/AYOMIDE'S DATAVISUALITIOM")

Cleaning the column names

# Read and clean the data
heart2020 <- read_csv("heart_2020_cleaned.csv") %>%
  clean_names() %>%
  # Convert character variables to factors
  mutate(
    heart_disease = as.factor(heart_disease),
    smoking = as.factor(smoking),
    alcohol_drinking = as.factor(alcohol_drinking),
    stroke = as.factor(stroke),
    race = as.factor(race),
    sex = as.factor(sex),
    age_category = as.factor(age_category),
    diabetic = as.factor(diabetic)
  )

Rows: 319795 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl  (4): BMI, PhysicalHealth, MentalHealth, SleepTime
lgl  (1): Incomplete_Row

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
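
The repeated as.factor() calls above could also be written as a single pass over the character columns. The following is a minimal sketch in base R, using a small made-up data frame (`toy` and its values are hypothetical, not rows from the survey file). Note that it converts every character column, which is broader than the eight columns converted above.

```r
# Hypothetical stand-in for a few columns of the survey data
toy <- data.frame(
  heart_disease = c("No", "Yes", "No"),
  smoking       = c("Yes", "No", "No"),
  bmi           = c(16.6, 20.3, 26.6),
  stringsAsFactors = FALSE
)

# Find the character columns, then convert only those to factors;
# numeric columns such as bmi are left unchanged
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

# Inspect the resulting column types
str(toy)
```

Checking the result with str() is a quick way to confirm that each variable has the type the later plots and models expect.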

Creating an interactive plot for body mass index (BMI) and heart disease

# Load necessary library
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Create the interactive plot for BMI and heart disease.
# BMI is on the x-axis, so the violins are horizontal and the
# y-axis shows heart disease status rather than a density.
bmi_plot <- plot_ly(
  data = heart2020, 
  x = ~bmi, 
  y = ~heart_disease,
  color = ~heart_disease, 
  type = "violin",
  orientation = "h",
  box = list(visible = TRUE), 
  meanline = list(visible = TRUE)
) %>%
  layout(
    title = list(text = "BMI Distribution by Heart Disease Status"),
    xaxis = list(title = "BMI"),
    yaxis = list(title = "Heart Disease")
  )

# Display the plot
bmi_plot

Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
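
A violin plot shows the shape of each distribution, and a numeric summary per group can make the comparison easier to state in words. The sketch below uses made-up BMI values (`toy` is hypothetical, not survey data) to show one way of computing the median BMI for each heart disease group.

```r
# Hypothetical BMI values, for illustration only
toy <- data.frame(
  bmi           = c(22.1, 24.5, 31.0, 27.3, 29.8, 35.2),
  heart_disease = factor(c("No", "No", "No", "Yes", "Yes", "Yes"))
)

# Median BMI within each heart disease group
bmi_summary <- aggregate(bmi ~ heart_disease, data = toy, FUN = median)
bmi_summary
```

Applied to heart2020, the same call would return one median per heart disease status, which corresponds to the center line of each box inside the violins.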

Examining the relationship between smoking and heart disease

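# Both variables are categorical, so compare counts with grouped bars
# (a density curve needs a continuous x variable)
ggplot(heart2020, aes(x = smoking, fill = heart_disease)) +
  geom_bar(position = "dodge", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Heart Disease by Smoking Status",
       x = "Smoking Status",
       y = "Count",
       fill = "Heart Disease")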
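
Because the smoking and non-smoking groups differ in size, raw counts can be hard to compare. One option is to look at the share of heart disease within each smoking group. The sketch below uses a small made-up data frame (`toy` is hypothetical) to show the idea with table() and prop.table().

```r
# Hypothetical responses, for illustration only
toy <- data.frame(
  smoking       = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  heart_disease = c("No", "No", "No", "Yes", "No", "No", "Yes", "Yes")
)

# Cross-tabulate counts: rows are smoking status, columns are heart disease
counts <- table(smoking = toy$smoking, heart_disease = toy$heart_disease)

# margin = 1 divides each row by its row total, giving the share of
# heart disease within each smoking group
shares <- prop.table(counts, margin = 1)
shares
```

Each row of `shares` sums to 1, so the "Yes" column can be compared directly between smokers and non-smokers.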

Visualize the distribution of heart disease across different age categories

ggplot(heart2020, aes(x = age_category, fill = heart_disease)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Heart Disease by Age Category",
       x = "Age Category",
       y = "Count")

# Create a numeric version of heart disease (0/1)
heart2020$heart_disease_num <- as.numeric(heart2020$heart_disease) - 1
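
The conversion above relies on the factor levels being ordered "No" then "Yes", so that the internal codes 1 and 2 become 0 and 1. The following sketch, on a made-up vector (`status` is hypothetical), shows that assumption and an alternative that compares against the label directly.

```r
# Hypothetical heart disease responses
status <- factor(c("No", "Yes", "Yes", "No"))

# Levels are sorted alphabetically by default, so "No" comes first
levels(status)

# Approach used above: internal codes (1, 2) shifted to (0, 1)
by_position <- as.numeric(status) - 1

# Alternative: compare against the label, independent of level order
by_label <- as.integer(status == "Yes")

# Both approaches agree when the levels are "No", "Yes"
all(by_position == by_label)
```

The label-based version keeps working if the levels are ever reordered, which is why it can be the safer choice.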

Linear regression of heart disease on unhealthy habits

# Fit linear regression model with main risk factors
model <- lm(heart_disease_num ~  smoking + alcohol_drinking +  diabetic,
            data = heart2020)

# Display model summary
summary(model)


Call:
lm(formula = heart_disease_num ~ smoking + alcohol_drinking + 
    diabetic, data = heart2020)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.2500 -0.1021 -0.0447 -0.0447  1.0142 

Coefficients:
                                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)                      0.0447017  0.0006651  67.207  < 2e-16 ***
smokingYes                       0.0573516  0.0009906  57.895  < 2e-16 ***
alcohol_drinkingYes             -0.0365537  0.0019359 -18.882  < 2e-16 ***
diabeticNo, borderline diabetes  0.0486049  0.0033625  14.455  < 2e-16 ***
diabeticYes                      0.1479393  0.0014582 101.453  < 2e-16 ***
diabeticYes (during pregnancy)  -0.0223900  0.0054311  -4.123 3.75e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2734 on 319789 degrees of freedom
Multiple R-squared:  0.04473,   Adjusted R-squared:  0.04472 
F-statistic:  2995 on 5 and 319789 DF,  p-value: < 2.2e-16

Analysis:

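The model suggests that smoking and diabetes are associated with a higher likelihood of heart disease. Because the outcome is coded 0/1, each coefficient can be read as a difference in predicted probability: about 5.7 percentage points higher for smokers, 14.8 points higher for respondents with diabetes, and 4.9 points higher for those with borderline diabetes, each relative to the reference group. The coefficient for alcohol drinking is negative (about -3.7 points), so this model does not show alcohol drinking raising the likelihood of heart disease. With an R-squared of about 0.045, these three factors explain only a small share of the variation, and the results describe associations rather than causes.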
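
The coefficients in the summary above can be added up to get the predicted probability of heart disease for a given profile. The sketch below copies the estimates from the model output (the vector and variable names are my own labels) and works through three profiles.

```r
# Estimates copied from the summary(model) output
coefs <- c(
  intercept    = 0.0447017,
  smoking_yes  = 0.0573516,
  alcohol_yes  = -0.0365537,
  diabetic_yes = 0.1479393
)

# Baseline: non-smoker, no alcohol drinking, not diabetic
p_baseline <- coefs[["intercept"]]

# Smoker, otherwise baseline
p_smoker <- p_baseline + coefs[["smoking_yes"]]

# Smoker with diabetes, no alcohol drinking
p_smoker_diabetic <- p_smoker + coefs[["diabetic_yes"]]

round(c(baseline = p_baseline,
        smoker = p_smoker,
        smoker_diabetic = p_smoker_diabetic), 4)
```

These values line up with the residual quantiles in the output: a respondent without heart disease has a residual of 0 minus the fitted value, which gives -0.0447, -0.1021, and -0.2500 for the three profiles.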

Plotting the data and the regression lines

# Plot heart disease (0/1) against diabetic status, colored by smoking status.
# heart_disease_num is binary, so the points fall only at 0 and 1.
# geom_smooth() fits one straight line per smoking group across the
# diabetic categories, which ggplot2 places at integer positions on the x-axis.
ggplot(heart2020, aes(x = diabetic, y = heart_disease_num, color = smoking)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(aes(group = smoking), method = "lm", se = FALSE, linetype = "dashed") +
  labs(
    title = "Linear Regression of Heart Disease on Risk Factors",
    x = "Diabetic Status",
    y = "Heart Disease (0 = No, 1 = Yes)",
    color = "Smoking Status"
  ) +
  theme_minimal()

`geom_smooth()` using formula = 'y ~ x'

Analysis

The plot shows a dashed regression line for each smoking status (“No” and “Yes”) across the diabetic status categories. The points are individual respondents, colored by smoking status; because heart disease is coded 0 or 1, they sit only at those two values. The key insights I can gather from this plot are:

Smoking appears to be associated with a higher likelihood of heart disease, as indicated by the higher regression line for the “Yes” (smoking) group compared to the “No” (non-smoking) group. Diabetic status also seems to be linked to heart disease: in the model output, respondents with diabetes or borderline diabetes have higher estimated rates than those in the “No” group.
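
Because the outcome is coded 0/1, the mean of heart_disease_num within a group equals the proportion of that group with heart disease. A table of these group proportions can be read more directly than overlapping points. The sketch below uses made-up values (`toy` is hypothetical, with only two diabetic categories for brevity).

```r
# Hypothetical respondents, for illustration only
toy <- data.frame(
  diabetic          = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  smoking           = c("No", "No", "Yes", "Yes", "No", "No", "Yes", "Yes"),
  heart_disease_num = c(0, 0, 0, 1, 0, 1, 1, 1)
)

# Mean of the 0/1 outcome per combination of diabetic and smoking status,
# which is the proportion with heart disease in each group
group_means <- aggregate(heart_disease_num ~ diabetic + smoking,
                         data = toy, FUN = mean)
group_means
```

Running the same aggregate() call on heart2020 would give the observed proportion for every combination of diabetic and smoking status, which is what the regression lines summarize.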

Brief essay

This study underscores the importance of public health initiatives aimed at reducing smoking and managing chronic conditions to mitigate the risks associated with heart disease. Early intervention and lifestyle changes are essential strategies in preventing heart disease, particularly for those with these risk factors. Understanding these relationships, as seen in the data, can help guide more effective preventive measures and health policies.

Reference List:

World Health Organization. “Cardiovascular diseases (CVDs) Fact Sheet.” Retrieved from https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds).
World Health Organization. “Cardiovascular diseases.” Retrieved from https://www.who.int/health-topics/cardiovascular-diseases#tab=tab_1.
Centers for Disease Control and Prevention. “Heart Disease Risk Factors.” Retrieved from https://www.cdc.gov/heart-disease/risk-factors/?CDC_AAref_Val=https://www.cdc.gov/heartdisease/risk_factors.htm

Challenges and What Could Not Be Shown

Geographic Data:

Incorporating geographic data (e.g., regional prevalence of CVD risk factors) could provide additional layers of insight. Unfortunately, the dataset does not include location information.

Longitudinal Trends:

A limitation of the dataset is its cross-sectional nature, which prevents examining trends over time. Adding a time dimension could show how BMI or sleep patterns evolve and impact physical health.

Conclusion:

From my analysis of the Heart Disease Health Indicators Dataset from the CDC’s 2020 annual survey, which includes responses from over 300,000 adults in the United States, individuals who smoke or who have diabetes are more likely to have heart disease in this sample. In the regression, smoking was associated with a roughly 5.7 percentage point higher rate and diabetes with a roughly 14.8 point higher rate, while alcohol drinking had a negative coefficient. Because the survey is cross-sectional and the model explains only a small share of the variation (R-squared of about 0.045), these results show associations rather than proof that these factors cause heart disease. Even so, the links between smoking, diabetes, and heart disease are consistent with the view that lifestyle and chronic conditions matter for cardiovascular health.