Final Project-Mat Shaposhnikov

Author

Matvei Shaposhnikov

DATA 110 Final Project

2020 Heart Disease Data Exploration

In this project, I will explore the 2020 Heart Disease dataset, which contains health-related information on various demographic and lifestyle factors associated with heart disease.

• Topic: Heart Disease and its associations with different risk factors.
• Data: “heart_2020_cleaned.csv,” originally derived from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) 2020.
• Variables of Interest:
  • HeartDisease (categorical: “Yes”/“No”)
  • BMI (quantitative)
  • Smoking (categorical: “Yes”/“No”)
  • PhysicalHealth (quantitative: number of days in poor physical health)
  • AgeCategory (categorical: e.g., “55-59”, “60-64”, etc.)
  • Sex (categorical: “Male”/“Female”)

I chose this dataset because cardiovascular health is a critical global issue, and understanding patterns in heart-disease risk factors is personally meaningful to me because of my family. The data were collected through a large-scale telephone survey, as described on the CDC BRFSS website. The provided dataset was already very well curated, so my own cleaning was limited to removing incomplete or invalid entries.

Load in Libraries and data

# Load necessary packages
library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Set working directory (setwd() expects a folder, not the path to the CSV file)
setwd("C:/Users/gitar/Documents")

# Read the data with readr's read_csv() (not base R's read.csv())
heart_data <- read_csv("heart_2020_cleaned.csv")

Rows: 319795 Columns: 18
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (14): HeartDisease, Smoking, AlcoholDrinking, Stroke, DiffWalking, Sex, ...
dbl  (4): BMI, PhysicalHealth, MentalHealth, SleepTime

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
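The message above suggests specifying the column types. A minimal sketch of what that could look like follows, and it also shows building the file path with file.path() instead of calling setwd(). The folder `data_dir`, the file `demo_heart.csv`, and the object names are hypothetical stand-ins so the example runs on any machine; the two data rows are copied from the head() output of this dataset.

```r
library(readr)

# Hypothetical stand-in folder and file; the real project would point
# data_dir at the folder that holds heart_2020_cleaned.csv.
data_dir <- tempdir()
demo_path <- file.path(data_dir, "demo_heart.csv")

# Write a tiny CSV with four of the real column names
writeLines(
  c("HeartDisease,BMI,Smoking,PhysicalHealth",
    "No,16.6,Yes,3",
    "Yes,28.9,Yes,6"),
  demo_path
)

# Declaring col_types means readr does not have to guess the types,
# so no column-specification message is printed.
demo_data <- read_csv(
  demo_path,
  col_types = cols(
    HeartDisease   = col_character(),
    BMI            = col_double(),
    Smoking        = col_character(),
    PhysicalHealth = col_double()
  )
)

demo_data
```

Reading from a full path keeps the script independent of the current working directory, which may make the project easier to move to another computer.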

# Quick check of data structure
head(heart_data)

# A tibble: 6 × 18
  HeartDisease   BMI Smoking AlcoholDrinking Stroke PhysicalHealth MentalHealth
  <chr>        <dbl> <chr>   <chr>           <chr>           <dbl>        <dbl>
1 No            16.6 Yes     No              No                  3           30
2 No            20.3 No      No              Yes                 0            0
3 No            26.6 Yes     No              No                 20           30
4 No            24.2 No      No              No                  0            0
5 No            23.7 No      No              No                 28            0
6 Yes           28.9 Yes     No              No                  6            0
# ℹ 11 more variables: DiffWalking <chr>, Sex <chr>, AgeCategory <chr>,
#   Race <chr>, Diabetic <chr>, PhysicalActivity <chr>, GenHealth <chr>,
#   SleepTime <dbl>, Asthma <chr>, KidneyDisease <chr>, SkinCancer <chr>
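Since the dataset is described as already cleaned, one way to confirm that is to count missing values per column before analysis. This is only a sketch on a small made-up tibble (`demo`, with missing values I introduced on purpose), not the check that was actually run on heart_data.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical toy data with two deliberately missing values
demo <- tibble(
  HeartDisease = c("No", "Yes", "No", NA),
  BMI          = c(16.6, 28.9, NA, 24.2),
  AgeCategory  = c("55-59", "60-64", "55-59", "60-64")
)

# Count missing values in every column
na_counts <- demo |>
  summarise(across(everything(), ~ sum(is.na(.x))))

# Keep only the rows with no missing values in any column
demo_complete <- demo |>
  drop_na()

na_counts
demo_complete
```

The same two steps could be applied to heart_data by swapping it in for `demo`.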

Filter the data

# Select only certain variables we need for our analysis
heart_small <- heart_data |>
  select(HeartDisease, BMI, Smoking, PhysicalHealth, AgeCategory, Sex)

# Exclude extremely high BMIs (> 60) to see how that changes the dataset
heart_bmi_filtered <- heart_small |>
  filter(BMI <= 60)

# Randomly sample 3,000 rows (https://pubs.wsb.wisc.edu/academics/analytics-using-r-2019/setting-the-seed.html)
set.seed(123)  # Ensures reproducible sampling
heart_filtered <- heart_bmi_filtered |>
  slice_sample(n = 3000)

head(heart_filtered)

# A tibble: 6 × 6
  HeartDisease   BMI Smoking PhysicalHealth AgeCategory Sex   
  <chr>        <dbl> <chr>            <dbl> <chr>       <chr> 
1 Yes           24.4 No                   0 75-79       Male  
2 No            22.7 Yes                  0 35-39       Female
3 No            38.1 No                   0 35-39       Female
4 No            36.3 Yes                  0 50-54       Male  
5 No            32.9 Yes                 20 60-64       Female
6 No            28.0 No                   1 60-64       Female
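To illustrate the two steps above, here is a small sketch that counts how many rows a BMI filter removes and checks that set.seed() makes slice_sample() reproducible. The tibble `demo` and its `id` column are made up for this example; the numbers do not come from the heart disease data.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: 100 rows with BMI from 15 to 64.5 in steps of 0.5
demo <- tibble(id = 1:100, BMI = seq(15, 64.5, by = 0.5))

# Apply the same kind of filter and count the rows it drops.
# Note that filter() also drops rows where the condition is NA.
demo_filtered <- demo |>
  filter(BMI <= 60)
n_removed <- nrow(demo) - nrow(demo_filtered)

# Setting the same seed before each draw gives the same random sample
set.seed(123)
sample_a <- demo_filtered |>
  slice_sample(n = 10)

set.seed(123)
sample_b <- demo_filtered |>
  slice_sample(n = 10)

n_removed
identical(sample_a$id, sample_b$id)
```

Without the second set.seed() call, the two samples would generally differ, which is why the seed is set right before sampling.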

Bar Chart of Smoking Status by Heart Disease