For this analysis, I selected the Heart Disease Health Indicators Dataset from the CDC’s 2020 annual survey, which includes data from over 300,000 adults in the United States. The dataset contains both categorical and quantitative variables covering several aspects of cardiovascular health. I chose it to examine the relationship between the prevalence of heart disease and risk factors such as smoking, a lifestyle habit, and diabetes, a chronic condition. By analyzing these factors, I aim to better understand how lifestyle choices and existing conditions relate to heart disease and to identify potential risk factors that can help inform preventive health strategies.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # for cleaning column names
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
# Load the dataset
setwd("/Users/ayomidealagbada/AYOMIDE'S DATAVISUALITIOM")
# Read and clean the data
heart2020 <- read_csv("heart_2020_cleaned.csv") %>%
  clean_names() %>%
  # Convert character variables to factors
  mutate(
    heart_disease = as.factor(heart_disease),
    smoking = as.factor(smoking),
    alcohol_drinking = as.factor(alcohol_drinking),
    stroke = as.factor(stroke),
    race = as.factor(race),
    sex = as.factor(sex),
    age_category = as.factor(age_category),
    diabetic = as.factor(diabetic)
  )
## Rows: 319795 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
## dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Load necessary library
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Create the interactive plot for BMI and heart disease
bmi_plot <- plot_ly(
  data = heart2020,
  x = ~bmi,
  color = ~heart_disease,
  type = "violin",
  box = list(visible = TRUE),
  meanline = list(visible = TRUE)
) %>%
  layout(
    title = list(text = "BMI Distribution by Heart Disease Status"),
    xaxis = list(title = "BMI"),
    # The violins are horizontal, so the y-axis separates the heart disease groups
    yaxis = list(title = "Heart Disease")
  )
# Display the plot
bmi_plot
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
## Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
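# smoking is a categorical variable, so count respondents with bars
# rather than estimating a density
ggplot(heart2020, aes(x = smoking, fill = heart_disease)) +
  geom_bar(position = "dodge", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Heart Disease by Smoking Status",
       x = "Smoking Status",
       y = "Count")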
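Because far fewer respondents report heart disease than not, raw counts make the two smoking groups hard to compare. The sketch below turns the counts into a rate per group. `prevalence_by()` is a hypothetical helper, not part of the original analysis, and it assumes `heart_disease` is coded "Yes"/"No" as in the cleaned data.

```r
library(dplyr)

# Hypothetical helper (not part of the original analysis): share of
# respondents reporting heart disease within each level of `group`.
# Assumes heart_disease is coded "Yes" / "No".
prevalence_by <- function(data, group) {
  data %>%
    group_by({{ group }}) %>%
    summarise(
      n = n(),
      heart_disease_rate = mean(heart_disease == "Yes"),
      .groups = "drop"
    )
}

# Runs only when the cleaned data from the loading step is in the session
if (exists("heart2020")) {
  print(prevalence_by(heart2020, smoking))
}
```

The same helper can be pointed at another column, for example `prevalence_by(heart2020, diabetic)`.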
ggplot(heart2020, aes(x = age_category, fill = heart_disease)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Heart Disease by Age Category",
       x = "Age Category",
       y = "Count")
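Age groups differ in size, so a chart of shares can be easier to read than a chart of counts. The following is a minimal sketch of that alternative view, using `position = "fill"` to scale each bar to 100%. `plot_heart_disease_share()` is a hypothetical helper introduced here for illustration.

```r
library(ggplot2)

# Hypothetical helper (not part of the original analysis): stacked bars
# scaled to 100% within each level of `group`.
plot_heart_disease_share <- function(data, group) {
  ggplot(data, aes(x = {{ group }}, fill = heart_disease)) +
    geom_bar(position = "fill") +
    scale_y_continuous(labels = scales::percent) +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
    labs(y = "Share of respondents", fill = "Heart disease")
}

# Runs only when the cleaned data from the loading step is in the session
if (exists("heart2020")) {
  print(
    plot_heart_disease_share(heart2020, age_category) +
      labs(title = "Share with Heart Disease by Age Category",
           x = "Age Category")
  )
}
```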
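# Create a numeric version of heart disease (0 = No, 1 = Yes);
# this relies on the factor levels being ordered "No", "Yes"
heart2020$heart_disease_num <- as.numeric(heart2020$heart_disease) - 1
# Fit a linear regression (a linear probability model, since the outcome
# is 0/1) with the main risk factors
model <- lm(heart_disease_num ~ smoking + alcohol_drinking + diabetic,
            data = heart2020)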
# Display model summary
summary(model)
##
## Call:
## lm(formula = heart_disease_num ~ smoking + alcohol_drinking +
## diabetic, data = heart2020)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -0.2500 -0.1021 -0.0447 -0.0447  1.0142
##
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)                      0.0447017  0.0006651  67.207  < 2e-16 ***
## smokingYes                       0.0573516  0.0009906  57.895  < 2e-16 ***
## alcohol_drinkingYes             -0.0365537  0.0019359 -18.882  < 2e-16 ***
## diabeticNo, borderline diabetes  0.0486049  0.0033625  14.455  < 2e-16 ***
## diabeticYes                      0.1479393  0.0014582 101.453  < 2e-16 ***
## diabeticYes (during pregnancy)  -0.0223900  0.0054311  -4.123 3.75e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2734 on 319789 degrees of freedom
## Multiple R-squared: 0.04473, Adjusted R-squared: 0.04472
## F-statistic: 2995 on 5 and 319789 DF, p-value: < 2.2e-16
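The model shows that respondents who smoke or have diabetes are more likely to report heart disease. Relative to the reference groups, smoking is associated with a probability of heart disease about 5.7 percentage points higher, and diabetes with one about 14.8 percentage points higher. The coefficient for alcohol drinking is negative (about -3.7 percentage points), which should not be read as a protective effect because the model does not adjust for age or other confounders. With an R-squared of about 0.045, these three factors explain only a small share of the variation in heart disease status.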
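A linear model on a 0/1 outcome can predict values outside the 0 to 1 range. The maximum residual of 1.0142 in the summary above implies a fitted probability slightly below zero for at least one group. A logistic regression is a common alternative for a binary outcome. The sketch below is one possible version, not the model used in this report. `fit_heart_logit()` and `odds_ratios()` are hypothetical helpers, and the code assumes `heart_disease` is coded "Yes"/"No".

```r
# Hypothetical helper (not part of the original analysis): logistic
# regression on the same three predictors as the linear model.
# Assumes heart_disease is coded "Yes" / "No".
fit_heart_logit <- function(data) {
  data$heart_disease_bin <- as.integer(data$heart_disease == "Yes")
  glm(heart_disease_bin ~ smoking + alcohol_drinking + diabetic,
      data = data, family = binomial())
}

# Odds ratios with approximate (Wald) 95% confidence intervals
odds_ratios <- function(model) {
  est <- coef(summary(model))
  data.frame(
    term = rownames(est),
    odds_ratio = exp(est[, "Estimate"]),
    conf_low = exp(est[, "Estimate"] - 1.96 * est[, "Std. Error"]),
    conf_high = exp(est[, "Estimate"] + 1.96 * est[, "Std. Error"]),
    row.names = NULL
  )
}

# Runs only when the cleaned data from the loading step is in the session
if (exists("heart2020")) {
  logit_model <- fit_heart_logit(heart2020)
  print(odds_ratios(logit_model))
}
```

An odds ratio above 1 points to higher odds of heart disease than in the reference group, and a ratio below 1 to lower odds. As with the linear model, these are associations and do not adjust for age or other factors left out of the formula.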
Brief essay
This study underscores the importance of public health initiatives aimed at reducing smoking and managing chronic conditions to mitigate the risks associated with heart disease. Early intervention and lifestyle changes are essential strategies in preventing heart disease, particularly for those with these risk factors. Understanding these relationships, as seen in the data, can help guide more effective preventive measures and health policies.
Incorporating geographic data (e.g., regional prevalence of CVD risk factors) could provide additional layers of insight. Unfortunately, the dataset does not include location information.
A limitation of the dataset is its cross-sectional nature, which prevents examining trends over time. Adding a time dimension could show how BMI or sleep patterns evolve and impact physical health.
From my analysis of the Heart Disease Health Indicators Dataset from the CDC’s 2020 annual survey, which includes responses from over 300,000 adults in the United States, respondents who smoke or who have diabetes are clearly more likely to report heart disease. Because the data are cross-sectional, these results describe associations rather than causes. They are nonetheless consistent with unhealthy habits and chronic conditions playing an important role in heart disease.