For this analysis, I selected the Heart Disease Health Indicators Dataset from the CDC’s 2020 annual survey, which includes data from over 300,000 adults in the United States. The dataset contains both categorical and quantitative variables covering various aspects of cardiovascular health. I chose it to examine the relationship between risk factors, such as smoking and diabetes, and the prevalence of heart disease. By analyzing these factors, I aim to better understand how lifestyle and health conditions relate to heart disease and to identify potential risk factors that can inform preventive health strategies.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(janitor) # for cleaning column names
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
# Load the dataset
setwd("/Users/ayomidealagbada/AYOMIDE'S DATAVISUALITIOM")
Cleaning the column names
# Read and clean the data
heart2020 <- read_csv("heart_2020_cleaned.csv") %>%
  clean_names() %>%
  # Convert character variables to factors
  mutate(
    heart_disease = as.factor(heart_disease),
    smoking = as.factor(smoking),
    alcohol_drinking = as.factor(alcohol_drinking),
    stroke = as.factor(stroke),
    race = as.factor(race),
    sex = as.factor(sex),
    age_category = as.factor(age_category),
    diabetic = as.factor(diabetic)
  )
Rows: 319795 Columns: 19
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
lgl (1): Incomplete_Row
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
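The column-by-column conversion above can also be written more compactly. The sketch below is only an illustration on a small made-up data frame (`toy` and its values are assumptions, not rows from the survey); it converts every character column to a factor with `across()` and `where()` from dplyr.

```r
library(dplyr)

# Toy stand-in for the survey data (values are made up for illustration)
toy <- data.frame(
  heart_disease = c("No", "Yes", "No", "No"),
  smoking       = c("Yes", "Yes", "No", "No"),
  bmi           = c(22.1, 31.4, 27.0, 24.8),
  stringsAsFactors = FALSE
)

# Convert every character column to a factor in one step;
# numeric columns such as bmi are left untouched
toy_factors <- toy %>%
  mutate(across(where(is.character), as.factor))

# Inspect the resulting column types
sapply(toy_factors, class)
```

Note that this converts all character columns, whereas the code above converts only a chosen subset, so the two are equivalent only if every character column should become a factor.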
Creating an interactive plot for body mass index (BMI) and heart disease
# Load necessary library
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Create the interactive plot for BMI and heart disease
# (horizontal violins: BMI on the x-axis, one violin per heart disease status)
bmi_plot <- plot_ly(
  data = heart2020,
  x = ~bmi,
  y = ~heart_disease,
  color = ~heart_disease,
  type = "violin",
  orientation = "h",
  box = list(visible = TRUE),
  meanline = list(visible = TRUE)
) %>%
  layout(
    title = list(text = "BMI Distribution by Heart Disease Status"),
    xaxis = list(title = "BMI"),
    yaxis = list(title = "Heart Disease Status")
  )

# Display the plot
bmi_plot
Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Warning in RColorBrewer::brewer.pal(N, "Set2"): minimal value for n is 3, returning requested palette with 3 different levels
Examining the relationship between smoking and heart disease
# smoking is categorical, so a bar chart of counts is used here
ggplot(heart2020, aes(x = smoking, fill = heart_disease)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  labs(
    title = "Heart Disease by Smoking Status",
    x = "Smoking Status",
    y = "Count"
  )
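If the smoking groups differ in size, raw counts can be hard to compare, so it may help to look at the share of respondents with heart disease within each group. The sketch below uses a small made-up data frame (`toy` and its values are assumptions for illustration only) to show one way of computing that share with dplyr.

```r
library(dplyr)

# Toy data: 4 non-smokers and 4 smokers (made-up values)
toy <- data.frame(
  smoking       = c("No", "No", "No", "No", "Yes", "Yes", "Yes", "Yes"),
  heart_disease = c("No", "No", "No", "Yes", "No", "No", "Yes", "Yes")
)

# Share of respondents with heart disease within each smoking group
rate_by_smoking <- toy %>%
  group_by(smoking) %>%
  summarise(
    n = n(),
    heart_disease_rate = mean(heart_disease == "Yes")
  )

rate_by_smoking
```

Applied to `heart2020`, the same pipeline would give one rate per smoking status, which is easier to compare across groups than the bar heights alone.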
Visualize the distribution of heart disease across different age categories
ggplot(heart2020, aes(x = age_category, fill = heart_disease)) +
  geom_bar(position = "dodge") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Heart Disease by Age Category",
    x = "Age Category",
    y = "Count"
  )
# Create a numeric version of heart disease (0/1)
heart2020$heart_disease_num <- as.numeric(heart2020$heart_disease) - 1
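The conversion above relies on the factor levels being ordered "No" then "Yes", because `as.numeric()` on a factor returns the level codes (1, 2, ...) rather than the labels. The short sketch below, on a made-up vector (`hd` is an illustrative name), shows that behavior next to a more explicit alternative that does not depend on level order.

```r
# Made-up factor with the same labels as heart_disease
hd <- factor(c("No", "Yes", "No", "Yes"))

# Levels are sorted alphabetically by default: "No" is level 1, "Yes" is level 2
levels(hd)

# Level codes minus 1: gives 0/1 only because "No" comes before "Yes"
hd_codes <- as.numeric(hd) - 1

# Explicit comparison: gives 0/1 regardless of how the levels are ordered
hd_explicit <- as.integer(hd == "Yes")

hd_codes
hd_explicit
```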
Linear regression between the heart disease indicator and unhealthy habits
# Fit linear regression model with main risk factors
model <- lm(
  heart_disease_num ~ smoking + alcohol_drinking + diabetic,
  data = heart2020
)

# Display model summary
summary(model)
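Because the outcome is coded 0/1, the linear model above acts as a linear probability model. A common alternative for a binary outcome is logistic regression, which is a different technique from the one used in this analysis. The sketch below fits one with `glm()` on made-up data (the `cells` counts and the `toy` data frame are assumptions for illustration, not survey results) and reports odds ratios.

```r
# Made-up cell counts: 50 respondents per smoking x diabetic combination
cells <- data.frame(
  smoking  = c("No", "No", "Yes", "Yes"),
  diabetic = c("No", "Yes", "No", "Yes"),
  events   = c(3, 8, 6, 15)   # hypothetical number with heart disease
)
n_per_cell <- 50

# Expand the cells into one row per respondent
toy <- cells[rep(seq_len(nrow(cells)), each = n_per_cell), c("smoking", "diabetic")]
toy$heart_disease_num <- unlist(lapply(
  cells$events,
  function(e) rep(c(1, 0), times = c(e, n_per_cell - e))
))

# Logistic regression: models the log-odds of heart disease
logit_model <- glm(
  heart_disease_num ~ smoking + diabetic,
  family = binomial,
  data = toy
)

# Odds ratios: values above 1 indicate higher odds of heart disease
odds_ratios <- exp(coef(logit_model))
odds_ratios
```

On the real data, the same call with `data = heart2020` and the three predictors used above would give coefficients on the log-odds scale, which keeps fitted probabilities between 0 and 1.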
# Plotting the data and regression lines
library(ggplot2)

# heart_disease_num is a 0/1 indicator, treated as numeric here
# so that regression lines can be drawn
ggplot(heart2020, aes(x = diabetic, y = heart_disease_num, color = smoking)) +
  geom_point(alpha = 0.6, size = 2) +
  geom_smooth(aes(group = smoking), method = "lm", se = FALSE, linetype = "dashed") +
  labs(
    title = "Linear Regression of Heart Disease on Risk Factors",
    x = "Diabetic Status",
    y = "Heart Disease (1 = Yes, 0 = No)",
    color = "Smoking Status"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'
Analysis
The plot displays the regression lines for the two smoking status groups (smokers vs. non-smokers) across the categories of diabetic status, including borderline diabetes. Each point represents an individual respondent, colored by smoking status. The key insights I can gather from this plot are:
Smoking appears to be associated with a higher prevalence of heart disease, as indicated by the higher regression line for the “Yes” (smoking) group compared to the “No” (non-smoking) group. Because the outcome is coded 0/1, the height of each line can be read as an estimated proportion of respondents with heart disease, not as a measure of severity. Diabetic status, including borderline diabetes, also seems to be linked to heart disease, with the “Yes” group sitting higher on the regression line than the “No” group.
Brief essay
This study underscores the importance of public health initiatives aimed at reducing smoking and managing chronic conditions to mitigate the risks associated with heart disease. Early intervention and lifestyle changes are essential strategies in preventing heart disease, particularly for those with these risk factors. Understanding these relationships, as seen in the data, can help guide more effective preventive measures and health policies.
Incorporating geographic data (e.g., regional prevalence of cardiovascular disease risk factors) could provide additional layers of insight. Unfortunately, the dataset does not include location information.
Longitudinal Trends:
A limitation of the dataset is its cross-sectional nature, which prevents examining trends over time. Adding a time dimension could show how BMI or sleep patterns evolve and impact physical health.
Conclusion:
From my analysis of the Heart Disease Health Indicators Dataset from the CDC’s 2020 annual survey, which includes responses from over 300,000 adults in the United States, respondents who smoke or who report diabetes appear more likely to also report heart disease. Because the data are cross-sectional, these results describe associations and do not show that smoking or diabetes causes heart disease. Even so, the links between smoking, diabetes, and heart disease seen here are consistent with the idea that unhealthy habits and chronic conditions are important risk factors for cardiovascular health.