In this project, I will explore the 2020 Heart Disease dataset, which contains health-related information on various demographic and lifestyle factors associated with heart disease.
• Topic: Heart Disease and its associations with different risk factors.
• Data: “heart_2020_cleaned.csv”, derived from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2020 survey.
• Variables of Interest:
  • HeartDisease (categorical: “Yes”/“No”)
  • BMI (quantitative)
  • Smoking (categorical: “Yes”/“No”)
  • PhysicalHealth (quantitative: number of days in poor physical health)
I chose this dataset because cardiovascular health is a critical global issue, and heart disease in my family makes understanding its risk factors personally meaningful. The data were collected through a large-scale telephone survey, as described on the CDC BRFSS website. The provided file was already well curated, so my own cleaning was limited to removing incomplete or invalid entries.
Load libraries and data
# Load necessary packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# Set working directory
setwd("C:/Users/gitar/Documents/heart_2020_cleaned.csv")

# read_csv (not read.csv())
heart_data <- read_csv("heart_2020_cleaned.csv")
Rows: 319795 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl (4): BMI, PhysicalHealth, MentalHealth, SleepTime
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick check of data structure
head(heart_data)
# A tibble: 6 × 18
HeartDisease BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
<chr> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 No 16.6 Yes No No 3 30
2 No 20.3 No No Yes 0 0
3 No 26.6 Yes No No 20 30
4 No 24.2 No No No 0 0
5 No 23.7 No No No 28 0
6 Yes 28.9 Yes No No 6 0
# ℹ 11 more variables: DiffWalking <chr>, Sex <chr>, AgeCategory <chr>,
# Race <chr>, Diabetic <chr>, PhysicalActivity <chr>, GenHealth <chr>,
# SleepTime <dbl>, Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>
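The introduction mentions removing incomplete entries. One way to check for them is sketched below, assuming a small made-up data frame (`toy`, hypothetical values) as a stand-in for `heart_data`. It uses only base R and is an illustration, not the exact cleaning that was applied to this file.

```r
# Toy stand-in for heart_data (hypothetical values, not taken from the real file)
toy <- data.frame(
  HeartDisease = c("No", "Yes", NA, "No"),
  BMI = c(16.6, NA, 26.6, 24.2),
  Smoking = c("Yes", "No", "Yes", "No"),
  stringsAsFactors = FALSE
)

# Count missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Keep only the rows that have no missing values in any column
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Running the same two steps on `heart_data` would show whether any incomplete rows remain after the curation described above.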
Filter the data
# Select only the variables needed for the analysis
heart_small <- heart_data |>
  select(HeartDisease, BMI, Smoking, PhysicalHealth, AgeCategory, Sex)

# Filter out extremely high BMIs (> 60) for demonstration, to see how that changes the dataset
heart_bmi_filtered <- heart_small |>
  filter(BMI <= 60)

# Randomly sample 3,000 rows (https://pubs.wsb.wisc.edu/academics/analytics-using-r-2019/setting-the-seed.html)
set.seed(123) # Ensures reproducible sampling
heart_filtered <- heart_bmi_filtered |>
  slice_sample(n = 3000)

head(heart_filtered)
# A tibble: 6 × 6
HeartDisease BMI Smoking PhysicalHealth AgeCategory Sex
<chr> <dbl> <chr> <dbl> <chr> <chr>
1 Yes 24.4 No 0 75-79 Male
2 No 22.7 Yes 0 35-39 Female
3 No 38.1 No 0 35-39 Female
4 No 36.3 Yes 0 50-54 Male
5 No 32.9 Yes 20 60-64 Female
6 No 28.0 No 1 60-64 Female
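To illustrate why the seed matters, here is a minimal sketch showing that the same seed gives the same random sample. It assumes a made-up data frame (`toy`) and a hypothetical helper, `draw_sample()`, neither of which is part of the analysis above. The helper mirrors the filter-then-sample steps on a much smaller scale.

```r
library(dplyr)

# Toy stand-in for heart_small (hypothetical values)
toy <- data.frame(
  id = 1:200,
  BMI = seq(15, by = 0.4, length.out = 200)
)

# Hypothetical helper: drop BMIs above 60, then sample n rows with a fixed seed
draw_sample <- function(data, n, seed) {
  set.seed(seed)
  data |>
    filter(BMI <= 60) |>
    slice_sample(n = n)
}

first <- draw_sample(toy, n = 10, seed = 123)
second <- draw_sample(toy, n = 10, seed = 123)

# With the same seed, both draws should contain the same rows in the same order
print(identical(first$id, second$id))

# The filter guarantees no sampled BMI exceeds 60
print(max(first$BMI))
```

Changing the seed, or removing the `set.seed()` call, would generally produce a different set of rows each time, which is why the seed is fixed before sampling.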