Project Meeting 1: Notebook

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Introduction:

The dataset I have chosen for my project concerns heart attack prediction. Heart attack is one of the leading causes of death worldwide, so it is important to understand and analyse it. The dataset covers a range of factors considered to contribute to heart attacks, such as age, diet, exercise, sleep, stress, and location, to name a few. It is intended to improve the general understanding of heart health and to support the prediction of heart attacks.

The links to the original dataset and its documentation are provided below.

Original Dataset link:

https://www.kaggle.com/datasets/iamsouravbanerjee/heart-attack-prediction-dataset

Documentation link:

https://www.kaggle.com/code/sukanthen/heart-attack-risk-prediction

Goal of Project:

Everyone has a different lifestyle, which makes it difficult to generalize heart attack risk when predicting it for a general population. Of the many contributing factors in the dataset, some are likely to be more significant than others. Separating the significant factors from the non-significant ones gives a clearer path to predicting heart attack risk in the general population. Hence, the purpose of this project is to identify and rank the significant contributing factors of heart attack so that prediction becomes easier, more reliable, and more general.

Visualizations:

Many visualizations can be generated, and more will be produced as the project continues. So far, two interesting visualizations have emerged, as shown below.

A. Gender vs Stress:

HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv")

# Boxes are filled by Country using ggplot2's default discrete palette.
ggplot(HA, aes(x=Sex,
               y=Stress.Level,
               fill=Country))+
  geom_boxplot()+
  labs(x="Gender",
       y="Stress Level",
       title="Gender vs Stress Level")

The box plot is intriguing: stress levels among females appear to vary noticeably across countries, while males show relatively consistent stress levels across the same countries. This visualization warrants further investigation, as stress levels in females could provide insight into hormonal differences related to stress, coping mechanisms, and the dynamics of their social and work lives.

In many societies, men tend to hold dominant roles, which can influence the experiences of women regarding their social standing, family responsibilities, and work-life balance. However, this dominance does not hold true in every culture, leading to varying experiences of stress for women worldwide.

Delving deeper into this visualization may allow us to assess whether stress is a significant factor contributing to heart attacks within the general population, particularly among women. Understanding these dynamics could enhance our ability to identify risk factors and tailor interventions to improve heart health among women across different cultural contexts.
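One way to put numbers on what the box plot shows is to tabulate the median and interquartile range of stress for each sex and country, then compare how far the country medians spread within each sex. The sketch below is only an illustration: it uses a small made-up data frame (`toy`) in place of the real CSV. The column names `Sex`, `Country`, and `Stress.Level` come from the notebook, while the values and the labels "Female", "Male", "A", and "B" are assumptions.

```r
library(dplyr)

# Toy stand-in for HA; values are invented, only the column names match the notebook.
toy <- data.frame(
  Sex = rep(c("Female", "Male"), each = 6),
  Country = rep(rep(c("A", "B"), each = 3), times = 2),
  Stress.Level = c(2, 5, 9, 4, 6, 7, 5, 5, 6, 4, 5, 6)
)

# Median and interquartile range of stress for every Sex x Country cell.
stress_summary <- toy |>
  group_by(Sex, Country) |>
  summarise(
    n = n(),
    median_stress = median(Stress.Level),
    iqr_stress = IQR(Stress.Level),
    .groups = "drop"
  )

# Range of the country medians within each sex:
# a larger value means more country-to-country variation.
median_range <- stress_summary |>
  group_by(Sex) |>
  summarise(range_of_medians = max(median_stress) - min(median_stress))

print(stress_summary)
print(median_range)
```

Running the same two summaries on `HA` would show whether the female medians really spread more widely across countries than the male medians do.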

B. Age vs Cholesterol:

HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv", nrows = 250)

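# Cutoffs are contiguous so that values such as 24.95 or 29.95
# do not fall through to the final "Obesity" branch.
HA <- HA |> 
  mutate(BMI_Category = case_when(
    BMI < 18.5 ~ "Underweight",
    BMI >= 18.5 & BMI < 25 ~ "Normal weight",
    BMI >= 25 & BMI < 30 ~ "Overweight",
    TRUE ~ "Obesity"
  ))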

ggplot(HA, aes(x = Age, y = Cholesterol)) +
  geom_point(aes(color = BMI_Category)) +
  geom_smooth(method = "lm", color = "blue")+
  facet_wrap(~ BMI_Category) +
  labs(title = "Age vs Cholesterol for different BMI Categories",
       x = "Age",
       y = "Cholesterol Level") +
  theme_minimal()

## `geom_smooth()` using formula = 'y ~ x'

This visualization suggests an intriguing trend in cholesterol levels across BMI categories as individuals age. Notably, in the Overweight category, the fitted line shows cholesterol declining with increasing age, which is unexpected and warrants further exploration, especially since only the first 250 rows of the dataset are plotted here. This trend may be influenced by various factors, including lifestyle changes, dietary habits, or even medical interventions that become more prevalent with age.

To fully understand the implications of this finding, additional data and analysis are necessary. Investigating how cholesterol levels in overweight individuals compare to those in underweight and normal-weight categories could provide valuable insights into the relationship between cholesterol and heart attack risk as the population ages. This exploration could help elucidate the complexities of how cholesterol levels interact with BMI and age, contributing to our understanding of heart health across different demographic groups.
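The age trend in each panel can be quantified by fitting the same straight line that `geom_smooth(method = "lm")` draws, separately for each BMI category, and reading off the slope with its 95% confidence interval. The sketch below is a hedged illustration on simulated data: the column names `Age`, `Cholesterol`, and `BMI_Category` match the notebook, but the values and the built-in declining trend are invented for the example.

```r
# Simulated stand-in for HA; only the column names match the notebook.
set.seed(1)
toy <- data.frame(
  Age = rep(seq(20, 80, by = 5), times = 2),
  BMI_Category = rep(c("Normal weight", "Overweight"), each = 13)
)
# Invented cholesterol values: flat for one group, declining with age for the other.
toy$Cholesterol <- ifelse(toy$BMI_Category == "Overweight",
                          300 - 1.0 * toy$Age,
                          250) + rnorm(nrow(toy), sd = 10)

# Fit Cholesterol ~ Age separately within each BMI category.
fits <- lapply(split(toy, toy$BMI_Category),
               function(d) lm(Cholesterol ~ Age, data = d))

# Collect the Age slope and its 95% confidence interval per category.
slopes <- do.call(rbind, lapply(names(fits), function(k) {
  ci <- confint(fits[[k]])["Age", ]
  data.frame(BMI_Category = k,
             slope = unname(coef(fits[[k]])["Age"]),
             ci_low = unname(ci[1]),
             ci_high = unname(ci[2]))
}))
print(slopes)
```

A confidence interval that includes zero would indicate that the apparent decline in a panel could be due to chance, which is worth checking before interpreting the Overweight trend.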

Future Plan:

It is essential to recognize that a heart attack is a complex event that typically arises from multiple contributing factors rather than a single cause. For instance, if a patient has many risk factors at the same time, their combined effect gives a clearer picture of that patient's likelihood of a heart attack than any single factor does.

Consider a scenario where a patient maintains a poor diet; this alone can elevate their risk of a heart attack. However, if the same patient also exercises regularly, the negative impact of the poor diet may be mitigated, thereby reducing their overall risk. Therefore, a comprehensive analysis of multiple factors together is critical for accurately predicting heart attack risk, and this is the future plan of this project.
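One common way to examine several factors together is a logistic regression with standardised predictors, whose coefficients give a rough ordering of the factors. The sketch below is one possible approach under stated assumptions, not the method chosen for this project. The data are simulated, the outcome rule is invented, and only the column names (`Heart.Attack.Risk`, `Age`, `Stress.Level`, `Sleep.Hours.Per.Day`, `Family.History`) are taken from the notebook.

```r
# Simulated stand-in for HA; only the column names match the notebook.
set.seed(42)
n <- 500
toy <- data.frame(
  Age = sample(18:90, n, replace = TRUE),
  Stress.Level = sample(1:10, n, replace = TRUE),
  Sleep.Hours.Per.Day = sample(4:10, n, replace = TRUE),
  Family.History = rbinom(n, 1, 0.5)
)
# Invented outcome rule, used only so the model has something to fit.
lin <- -3 + 0.03 * toy$Age + 0.15 * toy$Stress.Level - 0.10 * toy$Sleep.Hours.Per.Day
toy$Heart.Attack.Risk <- rbinom(n, 1, plogis(lin))

# Standardise predictors so coefficient sizes are comparable across factors.
predictors <- c("Age", "Stress.Level", "Sleep.Hours.Per.Day", "Family.History")
toy_scaled <- toy
toy_scaled[predictors] <- lapply(toy[predictors], function(x) as.numeric(scale(x)))

# Logistic regression of the 0/1 outcome on all factors at once.
fit <- glm(Heart.Attack.Risk ~ Age + Stress.Level + Sleep.Hours.Per.Day + Family.History,
           data = toy_scaled, family = binomial)

# Rank factors by absolute standardised coefficient (a rough ordering, not a causal claim).
coefs <- summary(fit)$coefficients[-1, , drop = FALSE]
ranking <- data.frame(
  term = rownames(coefs),
  estimate = coefs[, "Estimate"],
  p_value = coefs[, "Pr(>|z|)"],
  row.names = NULL
)
ranking <- ranking[order(-abs(ranking$estimate)), ]
print(ranking)
```

Because every factor is in the model at once, each coefficient describes a factor's association with risk while the others are held fixed, which is the kind of joint view the diet and exercise scenario calls for.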

Hypotheses:

A. Individuals with a family history of heart disease will have a higher risk of experiencing a heart attack compared to those without such a history, regardless of lifestyle factors.

B. Poor sleep quality and insufficient sleep duration are associated with an increased risk of heart attacks, independent of other lifestyle factors such as diet and exercise.

Visualizations:

A. Genetics vs Heart Attack Risk:

HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv", nrows = 250)

unique(HA$Family.History)

## [1] 0 1

summary_data <- HA %>%
  group_by(Family.History) %>%
  summarise(Heart.Attack.Risk = mean(Heart.Attack.Risk), .groups = 'drop')

summary_data$Family.History <- factor(summary_data$Family.History,
                                       levels = c(0, 1),
                                       labels = c("No", "Yes"))

ggplot(summary_data, aes(x = Family.History, y = Heart.Attack.Risk, fill = Family.History)) +
  geom_bar(stat = "identity") +
  labs(title = "Heart Attack Risk by Family History of Heart Disease",
       x = "Family History of Heart Disease",
       y = "Proportion of Heart Attacks") +
  scale_fill_manual(values = c("Yes" = "lightcoral", "No" = "lightblue")) +
  theme_minimal() +
  theme(legend.title = element_blank())

This visualization shows that, in this 250-row sample, patients with a family history of heart disease have a lower chance of heart attack than those with no family history. This seems counterintuitive, and it is tempting to conclude that the result is simply wrong. However, heart disease is not driven by family history alone; many other factors act at the same time, such as diet, sleep, exercise, and stress. Hence, further investigation of this result may reveal something interesting about heart attack risk.
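Whether the gap between the two bars is larger than sampling noise can be checked with a test of equal proportions, for example `prop.test()` from base R. The sketch below uses invented 0/1 values in a 250-row toy data frame, so the counts and the resulting p-value are illustrative only. The column names `Family.History` and `Heart.Attack.Risk` match the notebook.

```r
# Toy stand-in for HA with invented 0/1 values; column names match the notebook.
toy <- data.frame(
  Family.History    = rep(c(0, 1), times = c(120, 130)),
  Heart.Attack.Risk = c(rep(c(1, 0), times = c(45, 75)),   # no family history
                        rep(c(1, 0), times = c(44, 86)))   # family history
)

# 2 x 2 table of counts: rows = family history, columns = heart attack risk.
tab <- table(Family.History = toy$Family.History,
             Heart.Attack.Risk = toy$Heart.Attack.Risk)
print(tab)

# prop.test() takes the number at risk in each group and the group sizes.
at_risk <- tab[, "1"]
group_n <- rowSums(tab)
result <- prop.test(at_risk, group_n)
print(result)
```

A large p-value on the real data would mean the observed difference is compatible with chance in a sample of this size, in which case the counterintuitive direction of the bars would not need a substantive explanation.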

B. Sleep vs Heart Attack Risk:

HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv")

HA <- HA |> 
  mutate(Sleep_Category = case_when(
    Sleep.Hours.Per.Day >= 7 & Sleep.Hours.Per.Day <= 9 ~ "Sufficient",
    Sleep.Hours.Per.Day < 7 ~ "Insufficient",
    TRUE ~ "Other"  # This can catch anyone with more than 9 hours if needed
  ))


summary_data <- HA %>%
  group_by(Sleep_Category) %>%
  summarise(Heart.Attack.Risk = mean(Heart.Attack.Risk), .groups = 'drop')


ggplot(summary_data, aes(x = Sleep_Category, y = Heart.Attack.Risk, fill = Sleep_Category)) +
  geom_bar(stat = "identity") +
  labs(title = "Heart Attack Risk by Sleep Category",
       x = "Sleep Category",
       y = "Proportion of Heart Attacks") +
  scale_fill_manual(values = c("Sufficient" = "lightblue", "Insufficient" = "lightcoral", "Other" = "grey70")) +
  theme_minimal()

This visualization shows that patients who sleep fewer hours have a higher risk of heart attack than patients who sleep more, which is intuitive. However, the difference in heart attack risk between the Insufficient and Sufficient sleep categories is small, and it has not yet been tested for statistical significance. This opens the door to further investigation.
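To judge how small the difference is, it helps to report each category's sample size and an approximate 95% confidence interval alongside its proportion. The sketch below reuses the notebook's sleep cutoffs on an invented toy data frame, so the printed numbers are placeholders. The interval uses the normal approximation, which is an assumption that holds best for large groups.

```r
library(dplyr)

# Toy stand-in for HA; values are invented, only the column names match the notebook.
toy <- data.frame(
  Sleep.Hours.Per.Day = rep(c(5, 8, 10), times = c(100, 80, 20)),
  Heart.Attack.Risk   = c(rep(c(1, 0), times = c(38, 62)),
                          rep(c(1, 0), times = c(28, 52)),
                          rep(c(1, 0), times = c(7, 13)))
)

sleep_summary <- toy |>
  mutate(Sleep_Category = case_when(
    Sleep.Hours.Per.Day < 7 ~ "Insufficient",
    Sleep.Hours.Per.Day <= 9 ~ "Sufficient",
    TRUE ~ "Other"  # more than 9 hours
  )) |>
  group_by(Sleep_Category) |>
  summarise(
    n = n(),
    proportion = mean(Heart.Attack.Risk),
    .groups = "drop"
  ) |>
  mutate(
    se = sqrt(proportion * (1 - proportion) / n),  # normal-approximation standard error
    ci_low = proportion - 1.96 * se,
    ci_high = proportion + 1.96 * se
  )
print(sleep_summary)
```

If the intervals for the Insufficient and Sufficient categories overlap heavily on the real data, the bar chart alone does not support a difference between them.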