library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggthemes)
library(ggrepel)
library(effsize)
library(pwr)

HA <- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv")

Hypothesis 1:

Null Hypothesis (H0):

There is no significant association between age and heart attack risk.

Alternative Hypothesis (H1):

There is a significant association between age and heart attack risk.

Let’s use the alpha level, power level and effect size given below.

Alpha:

An alpha level of 0.05 is used for this test. Because the dataset is about predicting heart attack risk from individuals’ everyday habits (smoking, sleep, exercise, diet and so on), the test is not especially critical. It does not evaluate a medicine that treats heart disease (a critical case), so a 5% chance of falsely rejecting the null hypothesis seems reasonable.

Beta:

A power level of 0.8 is used for this test, which is the conventional standard. Power equals 1 - β, so this corresponds to β = 0.2: a 20% chance of failing to detect an association that really exists (a Type II error). Because heart attack risk is predicted here from individuals’ general habits, the test is not critical enough to demand higher power. Hence, an 80% probability of correctly rejecting a false null hypothesis seems reasonable.

Effect Size:

The effect size is set to 0.3. Under Cohen’s conventions for the chi-squared effect size w (0.1 small, 0.3 medium, 0.5 large), 0.3 is a medium effect; this classification does not depend on the chosen power level. A medium effect size reflects a moderate association between age and heart attack risk, which is reasonable because age is not the sole factor determining the risk of heart attack.
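To make the effect size more concrete, the sketch below computes Cohen’s w for a 3 x 2 table in base R. The probabilities in `p_alt` are hypothetical values chosen for illustration, not estimates from the heart attack dataset.

```r
# Hypothetical joint probabilities for a 3 x 2 table (NOT estimated from the dataset)
p_alt <- matrix(c(0.25, 0.08,
                  0.22, 0.12,
                  0.18, 0.15),
                nrow = 3, byrow = TRUE)
stopifnot(isTRUE(all.equal(sum(p_alt), 1)))

# Cell probabilities under independence: product of the row and column margins
p_null <- outer(rowSums(p_alt), colSums(p_alt))

# Cohen's w: root of the summed squared deviations, scaled by the null probabilities
w <- sqrt(sum((p_alt - p_null)^2 / p_null))
w
```

For observed counts, the same quantity can be recovered from a Pearson chi-squared statistic as sqrt(X-squared / N), where N is the total number of observations.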

Sample Size Calculation:

Let’s calculate the sample size required for an 80% chance of detecting a moderate association between age and heart attack risk, using the code below.

alpha <- 0.05  
power <- 0.8   
effect_size <- 0.3  

sample_size <- pwr.chisq.test(w = effect_size, 
                               N = NULL, 
                               df = 1, 
                               sig.level = alpha, 
                               power = power)

sample_size

## 
##      Chi squared power calculation 
## 
##               w = 0.3
##               N = 87.20954
##              df = 1
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: N is the number of observations

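From the above result, to have an 80% chance of detecting a moderate association (w = 0.3) at a significance level of 0.05 with df = 1, we need approximately 88 observations. As the note in the output states, N is the total number of observations, not a per-group count, so it should not be multiplied by the number of groups.

One caveat: this calculation uses df = 1, but age is categorized below into three groups (Young, Middle-aged and Elderly) and crossed with a binary outcome, which gives a 3 x 2 table with df = (3 - 1)(2 - 1) = 2. With df = 2 and the same alpha, power and effect size, the required total is somewhat larger than 88.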
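As a rough cross-check of the sample size calculation, the sketch below solves for N in base R using the noncentral chi-squared distribution, where the noncentrality parameter is N * w^2. `required_n` is a hypothetical helper introduced here for illustration; it is not a function from the pwr package.

```r
# Hypothetical helper: total N (real-valued) that gives the target power for a
# chi-squared test with effect size w and df degrees of freedom
required_n <- function(w, df, sig.level = 0.05, power = 0.8) {
  crit <- qchisq(sig.level, df = df, lower.tail = FALSE)  # rejection cutoff under H0
  # Gap between the achieved power at N and the target power
  power_gap <- function(N) {
    pchisq(crit, df = df, ncp = N * w^2, lower.tail = FALSE) - power
  }
  uniroot(power_gap, interval = c(1, 1e4), tol = 1e-8)$root
}

n_df1 <- required_n(w = 0.3, df = 1)  # same settings as the pwr.chisq.test call
n_df2 <- required_n(w = 0.3, df = 2)  # 3 x 2 table: three age groups, binary outcome

ceiling(c(df1 = n_df1, df2 = n_df2))
```

The df = 1 value should agree with the N of about 87.2 reported in the output above, and the df = 2 value shows how much the requirement grows for a three-group comparison.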

Chi-squared Test:

Let’s perform a Chi-squared test to decide whether to reject or fail to reject the null hypothesis, as shown below.

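# Bin age into three groups: (-Inf, 40], (40, 60], (60, Inf)
HA <- HA |> 
  mutate(Age_Group = cut(Age,
                         breaks = c(-Inf, 40, 60, Inf),
                         labels = c("Young", "Middle-aged", "Elderly")))

# Cross-tabulate age group against the binary heart attack risk indicator
contingency_table <- table(HA$Age_Group, HA$Heart.Attack.Risk)

# Pearson's chi-squared test of independence on the 3 x 2 table
chi_test <- chisq.test(contingency_table)
print(chi_test)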

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 1.3484, df = 2, p-value = 0.5096

if(chi_test$p.value < 0.05) {
  print("Reject the null hypothesis: There is an association between age group and heart attack risk.")
} else {
  print("Fail to reject the null hypothesis: There is no significant association between age group and heart attack risk.")
}

## [1] "Fail to reject the null hypothesis: There is no significant association between age group and heart attack risk."

Interpretation:

From the above Chi-squared test, we fail to reject the null hypothesis. This means we cannot tell whether patients are at risk of a heart attack just by looking at their age group. Both the X-squared value and the p-value support this.

The small X-squared value of 1.3484 tells us that the observed counts (the actual counts from the dataset) differ little from the expected counts (the counts we would see if there were no association between age group and heart attack risk).
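The comparison of observed and expected counts can be sketched by hand in base R. The counts in `observed` are hypothetical and are not taken from the heart attack dataset; they only illustrate how the statistic is assembled.

```r
# Hypothetical 3 x 2 table of counts (NOT the real dataset)
observed <- matrix(c(40, 20,
                     35, 25,
                     30, 30),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Age_Group = c("Young", "Middle-aged", "Elderly"),
                                   Risk = c("0", "1")))

# Expected count per cell under independence: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: squared deviations scaled by the expected counts, summed over cells
x_squared <- sum((observed - expected)^2 / expected)

x_squared
```

The closer the observed counts are to the expected ones, the smaller this sum becomes, which is why a small X-squared points toward the null hypothesis.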

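The p-value of 0.5096 tells us that, if the null hypothesis were true, an X-squared value at least as large as the one observed would occur about 51% of the time. Because this is well above the 0.05 significance level, the data are consistent with the null hypothesis.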
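The link between the statistic and the p-value can be checked directly: the p-value is the upper-tail probability of a chi-squared distribution with the test’s degrees of freedom. The sketch below plugs in the rounded statistic from the output, so it should match the reported p-value only up to rounding.

```r
x_squared <- 1.3484  # statistic reported in the test output (rounded)
df <- 2              # (3 - 1) * (2 - 1) for a 3 x 2 table

# Upper-tail probability of a chi-squared distribution with df degrees of freedom
p_value <- pchisq(x_squared, df = df, lower.tail = FALSE)

round(p_value, 4)
```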

Intuitively, heart attack risk does not depend solely on an individual’s age; it is also strongly affected by a combination of lifestyle factors such as diet, exercise, sleep and stress.

Hence, we fail to reject the null hypothesis: the data do not provide evidence of a significant association between age group and heart attack risk.

Visualization:

Let’s support the above conclusion by visualizing the proportion of at-risk patients in each age group using the bar graph below.

HA <- HA |> 
  mutate(Age_Group = cut(Age, breaks = c(-Inf, 40, 60, Inf), labels = c("Young", "Middle-aged", "Elderly")))


summary_data <- HA |> 
  group_by(Age_Group) |> 
  summarise(Heart_Attack_Proportion = mean(Heart.Attack.Risk))

ggplot(summary_data, aes(x = Age_Group, y = Heart_Attack_Proportion, fill = Age_Group)) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  labs(title = "Proportion of Heart Attack Risk by Age Group",
       x = "Age Group",
       y = "Proportion of Heart Attacks") +
  theme_minimal() +
  scale_fill_manual(values = c("Young" = "lightblue", "Middle-aged" = "lightgreen", "Elderly" = "lightcoral")) +
  theme(legend.position = "none")

From the above visualization, we can see that the proportion of patients at risk of a heart attack is nearly the same across the three age groups. There are small differences among them, but they are not large enough to support a firm conclusion about an association between age and heart attack risk.

This result warrants further investigation into the factors that drive heart attack risk.
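Bar heights alone do not show how much of a difference could be due to chance. One way to gauge this is to attach approximate 95% intervals to each group proportion. The sketch below uses simulated data in a hypothetical data frame `toy` (not the real dataset) and a normal approximation for the intervals.

```r
# Hypothetical data (NOT the real dataset): group label and 0/1 risk indicator
set.seed(1)
toy <- data.frame(
  Age_Group = rep(c("Young", "Middle-aged", "Elderly"), each = 200),
  Risk      = rbinom(600, size = 1, prob = 0.35)
)

groups <- split(toy$Risk, toy$Age_Group)
n  <- sapply(groups, length)   # group sizes
p  <- sapply(groups, mean)     # proportion at risk per group
se <- sqrt(p * (1 - p) / n)    # standard error of each proportion

# Approximate 95% intervals (normal approximation)
ci <- data.frame(proportion = p, lower = p - 1.96 * se, upper = p + 1.96 * se)
ci
```

If the intervals for the groups overlap heavily, the visual impression of similar proportions agrees with a non-significant test result.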

Hypothesis 2:

Null Hypothesis (H0):

Low sleep hours are not associated with an increased risk of heart attacks.

Alternative Hypothesis (H1):

Low sleep hours are associated with an increased risk of heart attacks.

Fisher’s Exact Test:

Let’s perform Fisher’s exact test to determine whether to reject or fail to reject the null hypothesis, as shown below.

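# Label fewer than 7 hours of sleep per day as insufficient
HA <- HA |> 
  mutate(Sleep_Category = ifelse(Sleep.Hours.Per.Day < 7, "Insufficient", "Sufficient"))

# 2 x 2 table: sleep category by heart attack risk
contingency_table <- table(HA$Sleep_Category, HA$Heart.Attack.Risk)

# Two-sided Fisher's exact test on the 2 x 2 table
fisher_result <- fisher.test(contingency_table)

print(fisher_result)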

## 
##  Fisher's Exact Test for Count Data
## 
## data:  contingency_table
## p-value = 0.2236
## alternative hypothesis: true odds ratio is not equal to 1
## 95 percent confidence interval:
##  0.8650894 1.0341318
## sample estimates:
## odds ratio 
##  0.9458197

Interpretation:

Fisher’s exact test gives a p-value of 0.2236. This is the probability, assuming the null hypothesis is true, of obtaining a table at least as extreme as the one observed. Since the p-value is higher than the significance level (α) of 0.05, we fail to reject the null hypothesis. The estimated odds ratio of 0.946 is close to 1, and its 95% confidence interval (0.865 to 1.034) includes 1, which is consistent with this result. In other words, there is not enough evidence of an association between sleep hours and heart attack risk. Note that the test as run is two-sided, whereas the alternative hypothesis stated above is directional (low sleep increases risk).
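Because the alternative hypothesis here is directional, a one-sided version of the test could also be considered. The sketch below uses hypothetical counts (not the real dataset) and assumes the table layout that `table()` would produce, with rows Insufficient and Sufficient and columns 0 and 1. Under that layout, higher risk with insufficient sleep corresponds to an odds ratio below 1, hence `alternative = "less"`.

```r
# Hypothetical 2 x 2 counts (NOT the real dataset); rows and columns follow
# the alphabetical/numeric order that table() would produce
tab <- matrix(c(30, 20,
                45, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(Sleep = c("Insufficient", "Sufficient"),
                              Risk  = c("0", "1")))

# Sample (cross-product) odds ratio; fisher.test() instead reports a
# conditional maximum likelihood estimate, which is usually close but not identical
sample_or <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

two_sided <- fisher.test(tab)
# With this layout, higher risk under insufficient sleep means odds ratio < 1
one_sided <- fisher.test(tab, alternative = "less")

c(sample_or   = sample_or,
  p_two_sided = two_sided$p.value,
  p_one_sided = one_sided$p.value)
```

The direction of the one-sided alternative depends entirely on the row and column order, so it is worth printing the table before choosing between "less" and "greater".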

However, this conclusion may seem counterintuitive. Typically, we would expect some level of association between sleep patterns and heart attack risk. It’s important to consider that a patient’s overall lifestyle plays a crucial role in this relationship. For instance, an individual who exercises regularly, maintains a healthy diet, experiences minimal stress, and abstains from alcohol may have a lower heart attack risk, even if they only sleep 4 to 5 hours per night. In such cases, the beneficial effects of these other healthy behaviors could mitigate the potential negative impact of inadequate sleep, leading to heart attack risk levels similar to those of people who get sufficient sleep.

Visualization:

Let’s support the above conclusion with the bar graph below.

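# Note: unlike the two-level split used for the Fisher test above, this recoding
# puts more than 9 hours (and missing values) in a separate "Other" category
HA <- HA |> 
  mutate(Sleep_Category = case_when(
    Sleep.Hours.Per.Day >= 7 & Sleep.Hours.Per.Day <= 9 ~ "Sufficient",
    Sleep.Hours.Per.Day < 7 ~ "Insufficient",
    TRUE ~ "Other"  # catches anyone with more than 9 hours
  ))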


summary_data <- HA |> 
  group_by(Sleep_Category) |> 
  summarise(Heart.Attack.Risk = mean(Heart.Attack.Risk), .groups = 'drop')


ggplot(summary_data, aes(x = Sleep_Category, y = Heart.Attack.Risk, fill = Sleep_Category)) +
  geom_col() +  # equivalent to geom_bar(stat = "identity")
  labs(title = "Heart Attack Risk by Sleep Category",
       x = "Sleep Category",
       y = "Proportion of Heart Attacks") +
  # "Other" gets an explicit colour in case that category is present in the data
  scale_fill_manual(values = c("Sufficient" = "lightblue", "Insufficient" = "lightcoral", "Other" = "grey70")) +
  theme_minimal()

From the visualization, the proportion of patients at risk of a heart attack is very similar for those with sufficient and insufficient sleep. This is consistent with a weak or absent association between sleep hours and heart attack risk, as the data show similar outcomes across sleep durations.

Week 7: Data Dive Notebook
