library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
library(ggthemes)
library(ggrepel)
library(lindia)
# remove scientific notation
options(scipen = 6)
# default theme, unless otherwise noted
theme_set(theme_minimal())
HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv")
HAA<- read.csv("/Users/rupeshswarnakar/Downloads/STAT/Cardiovascular_Disease_Dataset.csv")
Heart attack and cardiovascular disease are among the leading causes of death worldwide. According to a 2021 survey, approximately 20.5 million deaths were caused by heart-related disease. Given this alarming number of deaths, a better model for early detection and prevention is of clear significance.
Below is the analysis gathered to support modeling the data to help prevent heart attacks. The analysis flows from lifestyle factors to health-related factors. Lifestyle factors include diet, stress, sleep, and age; health-related factors include cholesterol, blood sugar, and blood pressure.
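ggplot(HA, aes(x = Sex,
               y = Stress.Level,
               fill = Country)) +
  geom_boxplot() +
  labs(x = "Gender",
       y = "Stress Level",
       title = "Gender vs Stress Level") +
  # a scale is a plot component, not a label, so it is added outside labs();
  # only `fill` is mapped here, so this color scale leaves the box fills unchanged
  scale_color_brewer(palette = 'Dark2')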
The box plot is intriguing: it suggests that stress levels among females vary noticeably across countries, while males show relatively consistent stress levels across the same nations. This visualization warrants further investigation, as stress levels in females may provide valuable insight into hormonal differences related to stress, coping mechanisms, and the dynamics of their social and work lives.
In many societies, men tend to hold dominant roles, which can influence the experiences of women regarding their social standing, family responsibilities, and work-life balance. However, this dominance does not hold true in every culture, leading to varying experiences of stress for women worldwide.
Delving deeper into this visualization may allow us to assess whether stress is a significant factor contributing to heart attacks within the general population, particularly among women. Understanding these dynamics could enhance our ability to identify risk factors and tailor interventions to improve heart health among women across different cultural contexts.
HA<- read.csv("/Users/rupeshswarnakar/Desktop/heart_attack_prediction_dataset.csv", nrows = 250)
unique(HA$Family.History)
## [1] 0 1
summary_data <- HA |>
group_by(Family.History) |>
summarise(Heart.Attack.Risk = mean(Heart.Attack.Risk), .groups = 'drop')
summary_data$Family.History <- factor(summary_data$Family.History,
levels = c(0, 1),
labels = c("No", "Yes"))
ggplot(summary_data, aes(x = Family.History, y = Heart.Attack.Risk, fill = Family.History)) +
geom_bar(stat = "identity") +
labs(title = "Heart Attack Risk by Family History of Heart Disease",
x = "Family History of Heart Disease",
y = "Proportion of Heart Attacks") +
scale_fill_manual(values = c("Yes" = "lightcoral", "No" = "lightblue")) +
theme_minimal() +
theme(legend.title = element_blank())
This visualization shows that patients with a family history of heart disease have a lower proportion of heart attack risk than patients with no family history. This seems counterintuitive, and we might be tempted to dismiss it as simply incorrect. However, heart disease is not driven by family history alone but by many simultaneous factors such as diet, sleep, exercise, and stress. Note also that this chart is based on only the first 250 rows of the dataset. Hence, further investigation may reveal interesting results about heart attack risk.
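One way to check whether the gap between the two bars is larger than sampling noise is a two-sample test of proportions. The following is a minimal sketch using base R's `prop.test()`. The counts and the names `at_risk`, `group_n`, and `risk_test` are made up for illustration and are not taken from the dataset.

```r
# Hypothetical counts, NOT taken from the dataset:
# patients flagged at risk and group sizes for
# family history = "No" and family history = "Yes".
at_risk <- c(No = 48, Yes = 40)
group_n <- c(No = 128, Yes = 122)

# Two-sample test for equality of proportions
risk_test <- prop.test(x = at_risk, n = group_n)

# Observed proportion in each group and the p-value of the test
risk_test$estimate
risk_test$p.value
```

With the real data, the counts could be read off `table(HA$Family.History, HA$Heart.Attack.Risk)` before calling `prop.test()`.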
To further support the point that lifestyle factors need a holistic rather than an individual approach, the hypothesis test below evaluates the association between age and heart attack risk.
Null hypothesis: There is no significant association between age and heart attack risk.
Alternative hypothesis: There is a significant association between age and heart attack risk.
Let’s perform a Chi-squared test to decide whether to reject or fail to reject the null hypothesis.
# Step 2: Create age groups
HAA <- HAA |>
mutate(Age_Group = cut(age,
breaks = c(-Inf, 40, 60, Inf),
labels = c("Young", "Middle-aged", "Elderly")))
# Step 3: Create a contingency table
# `target` is the column indicating the heart disease outcome (categorical)
contingency_table <- table(HAA$Age_Group, HAA$target)
# Step 4: Perform the chi-square test
chi_test <- chisq.test(contingency_table)
# Step 5: Print the chi-square test results
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 2.9458, df = 2, p-value = 0.2293
# Step 6: Interpret the results
if (chi_test$p.value < 0.05) {
print("Reject the null hypothesis: There is an association between age group and heart attack risk.")
} else {
print("Fail to reject the null hypothesis: There is no significant association between age group and heart attack risk.")
}
## [1] "Fail to reject the null hypothesis: There is no significant association between age group and heart attack risk."
From the Chi-squared test above, we fail to reject the null hypothesis. This means we cannot tell whether a patient is at risk of a heart attack just by looking at their age group. This is supported by the X-squared value and the p-value.
The small X-squared value of 2.9458 tells us that the observed counts (actual counts from the dataset) differ little from the expected counts (the counts we would see if there were no association between age and heart attack risk).
The p-value of 0.2293 is well above the 0.05 threshold: if the null hypothesis were true, a test statistic at least this large would occur about 23% of the time, so the data are consistent with the null hypothesis.
Intuitively, heart attack risk does not depend solely on an individual’s age; rather, it is strongly affected by a combination of lifestyle factors such as diet, exercise, sleep, and stress.
Hence, we fail to reject the null hypothesis: the data show no significant association between age group and heart attack risk.
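To make the link between observed counts, expected counts, and the X-squared statistic concrete, the statistic can be computed by hand. This is a sketch in base R under stated assumptions: the 3 x 2 table of counts is hypothetical, not the study data, and the names `observed`, `expected`, `x_squared`, and `dof` are illustrative.

```r
# Hypothetical 3 x 2 table of counts (age group by outcome); illustration only
observed <- matrix(c(30, 45,
                     60, 80,
                     40, 70),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(Age_Group = c("Young", "Middle-aged", "Elderly"),
                                   target = c("0", "1")))

# Expected counts under independence: (row total * column total) / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Pearson statistic: sum over cells of (observed - expected)^2 / expected
x_squared <- sum((observed - expected)^2 / expected)

# Degrees of freedom: (rows - 1) * (columns - 1)
dof <- (nrow(observed) - 1) * (ncol(observed) - 1)

# p-value: upper-tail probability of the chi-squared distribution
p_value <- pchisq(x_squared, df = dof, lower.tail = FALSE)

# Built-in test on the same table, for comparison
builtin <- chisq.test(observed)
c(by_hand = x_squared, builtin = unname(builtin$statistic))
```

The hand-computed statistic and `chisq.test()` should agree, because no continuity correction is applied to tables larger than 2 x 2.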
Next, a linear regression model is used to evaluate the significance of health-related factors.
Let’s fit a linear model with resting BP as the response variable and serum cholesterol, fasting blood sugar, and chest pain as explanatory variables. Here, fasting blood sugar is a binary variable, with 0 indicating blood sugar less than 120 mg/dL and 1 indicating greater than 120 mg/dL. Chest pain is coded into four categories: 0 for typical angina, 1 for atypical angina, 2 for non-anginal pain, and 3 for asymptomatic.
Looking at the dataset, we can see some implausible values in the serum cholesterol column. A value of 0 is not realistic, as no human can survive with zero serum cholesterol. Hence, rows with a serum cholesterol of 0 are filtered out to make the linear regression model more robust.
HAA_filtered <- subset(HAA, serumcholestrol != 0)
model <- lm(restingBP ~ serumcholestrol + fastingbloodsugar + chestpain, data =HAA_filtered)
summary(model)
##
## Call:
## lm(formula = restingBP ~ serumcholestrol + fastingbloodsugar +
## chestpain, data = HAA_filtered)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.467 -19.620 -3.245 24.120 62.417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 130.146064 2.871610 45.322 < 2e-16 ***
## serumcholestrol 0.039682 0.008602 4.613 0.000004510 ***
## fastingbloodsugar 8.967851 2.101268 4.268 0.000021732 ***
## chestpain 5.216343 1.017265 5.128 0.000000356 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 28.54 on 943 degrees of freedom
## Multiple R-squared: 0.1018, Adjusted R-squared: 0.09897
## F-statistic: 35.64 on 3 and 943 DF, p-value: < 2.2e-16
The residuals range from a minimum of -72.467 to a maximum of 62.417, with a median of -3.245. The median is close to zero relative to this range, which indicates that the residuals are reasonably well distributed around zero.
The intercept of 130.146 indicates that if all predictors were zero, the predicted resting BP would be 130.146 units. Practically, this value is not useful, since a human body cannot have zero cholesterol. Hence, in this situation the intercept can be ignored.
The coefficient of 5.216 means that for every one-unit increase in the chest pain code, resting BP increases by 5.216 units, holding the other predictors constant. This positive relationship between chest pain and resting BP is supported by a very small p-value of 0.000000356.
The F-statistic (35.64) with a p-value < 2.2e-16 suggests that the model as a whole is statistically significant, meaning at least one of the predictors contributes to explaining the variance in resting BP.
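Let’s check whether the predictors exhibit multicollinearity using variance inflation factors (VIF), computed with `vif()` from the car package. Values close to 1 indicate little correlation among the predictors.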
model <- lm(restingBP ~ serumcholestrol + fastingbloodsugar + chestpain, data = HAA_filtered)
vif_values <- vif(model)
print(vif_values)
## serumcholestrol fastingbloodsugar chestpain
## 1.098520 1.102840 1.104225
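For intuition about what these numbers measure, the VIF of a predictor equals 1 / (1 - R²), where R² comes from regressing that predictor on the remaining predictors. The sketch below reproduces that calculation in base R on simulated data; the variables `x1`, `x2`, and `x3` are invented stand-ins, not columns of the dataset.

```r
set.seed(1)
n <- 200

# Simulated predictors (illustration only); x2 is mildly correlated with x1
x1 <- rnorm(n)
x2 <- 0.3 * x1 + rnorm(n)
x3 <- rnorm(n)

# VIF for predictor j is 1 / (1 - R^2), where R^2 comes from
# regressing predictor j on the remaining predictors
vif_by_hand <- c(
  x1 = 1 / (1 - summary(lm(x1 ~ x2 + x3))$r.squared),
  x2 = 1 / (1 - summary(lm(x2 ~ x1 + x3))$r.squared),
  x3 = 1 / (1 - summary(lm(x3 ~ x1 + x2))$r.squared)
)
vif_by_hand
```

A VIF can never fall below 1; values near 1, like those reported above, mean that each predictor is explained only weakly by the others.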
Let’s diagnose the model using the 5 diagnostic plots as given below.
gg_resfitted(model) +
geom_smooth(se=FALSE)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
In general, the plot above shows residuals spread widely around the horizontal red line. The spread neither fans out nor narrows, suggesting roughly constant residual variance. However, a small bump in the middle and the declining line at the end suggest that some predictors may have a different (non-linear) relationship with the response. A better model could be explored to capture the relationship between the explanatory variables and the response variable.
plots <- gg_resX(model, plot.all = FALSE)
plots$serumcholestrol
plots$fastingbloodsugar
plots$chestpain
From the plots above, we can see that the residuals vs. serum cholesterol and residuals vs. fasting blood sugar plots show a fairly even spread on either side of the horizontal line. However, in the residuals vs. chest pain plot, each chest pain category shows a different distribution of residuals. For example, category 0 shows roughly symmetric residuals, whereas categories 1 and 2 are noticeably skewed. Also, category 3 has no residuals on the horizontal red line. This indicates that chest pain might have a quadratic or cubic relationship with resting BP, which can be explored further.
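Because chest pain is a category code rather than a measured quantity, one option worth exploring is to treat it as a factor instead of a number. This is not what the model above does. The sketch below uses simulated stand-in data, and the `shift` values are invented so that the category effects are deliberately not linear in the code.

```r
set.seed(42)
n <- 400

# Simulated stand-in data (illustration only, not the real dataset)
sim <- data.frame(
  chestpain = sample(0:3, n, replace = TRUE),
  serumcholestrol = runif(n, 150, 400)
)

# Invented category-specific shifts in resting BP, not linear in the code
shift <- c(0, 12, 5, 20)
sim$restingBP <- 120 + 0.04 * sim$serumcholestrol +
  shift[sim$chestpain + 1] + rnorm(n, sd = 15)

# Numeric coding: a single slope, assumes equal steps between categories
fit_numeric <- lm(restingBP ~ serumcholestrol + chestpain, data = sim)

# Factor coding: one coefficient per category, relative to category 0
fit_factor <- lm(restingBP ~ serumcholestrol + factor(chestpain), data = sim)

# Nested-model F test: does the factor coding fit better?
anova(fit_numeric, fit_factor)
```

The numeric model is a special case of the factor model, so the F test compares the two directly. The same comparison could be run on `HAA_filtered` to see whether the factor coding is worth its extra coefficients.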
gg_reshist(model)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the histogram above, we can see that the residuals roughly follow a normal distribution. Some bins in the middle depart from what a normal distribution would predict, indicating that the model could be made more robust. Also, the left and right tails are not symmetric, which indicates some skewness in the residuals and can be explored further.
gg_qqplot(model)
The QQ plot above indicates that the residuals are roughly normal, but also that the model is not yet fully robust. In the center, the residuals deviate slightly from the red line, which indicates that they do not follow a normal (Gaussian) distribution exactly and signals some skewness.
Furthermore, the residuals in the left and right tails lie far from the red line, which is further evidence of outliers or skewness. This calls for a more detailed analysis of the dataset to bring the QQ plot as close as possible to the normal distribution.
gg_cooksd(model, threshold = 'matlab')
From the plot above, we can see that numerous observations have a high Cook’s distance. These high values could be due to extreme cholesterol values or to outliers in other predictors. Specifically, there are patients with serum cholesterol close to 600 mg/dL. Such high values might affect the residuals and hence the Cook’s distance.
This plot suggests further investigation of the observations with high Cook’s distance.
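To follow up on the flagged points, the Cook’s distances can be pulled out directly with base R’s `cooks.distance()`. The sketch below uses simulated data with one injected extreme observation. The cutoff of three times the mean Cook’s distance reflects my understanding of what lindia’s `threshold = 'matlab'` option uses; treat that as an assumption and check the package documentation.

```r
set.seed(7)
n <- 100

# Simulated data (illustration only)
x <- rnorm(n, mean = 300, sd = 50)
y <- 130 + 0.04 * x + rnorm(n, sd = 10)

# Inject one extreme observation so there is something to flag
x[1] <- 600
y[1] <- 60

fit <- lm(y ~ x)
cooks_d <- cooks.distance(fit)

# Assumed cutoff: three times the mean Cook's distance
cutoff <- 3 * mean(cooks_d)

# Row indices of observations above the cutoff
flagged <- which(cooks_d > cutoff)
flagged
```

On the real model, the flagged row indices could be used to inspect the corresponding rows of `HAA_filtered`, for example the patients with serum cholesterol near 600 mg/dL.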
ggplot(HAA_filtered, aes(x = serumcholestrol, y = restingBP, color = as.factor(fastingbloodsugar))) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = FALSE, color = "red") +
facet_wrap(~ chestpain) +
labs(title = "Resting Blood Pressure vs. Serum Cholesterol by Chest Pain Levels",
x = "Serum Cholesterol",
y = "Resting Blood Pressure",
color = "Fasting Blood Sugar") +
scale_color_manual(values = c("0" = "skyblue", "1" = "blue"))
## `geom_smooth()` using formula = 'y ~ x'
These plots convey a lot of information. In the panel for chest pain category 0 (typical angina), which is directly related to heart attack risk, we see a positive correlation between resting BP and serum cholesterol. The data in this panel are well spread, which further supports the positive correlation.
The second panel, for category 1 (atypical angina, less predictable), shows a stronger positive correlation than the first. However, its data are less well spread, indicating a weaker fit.
The third panel, for category 2 (non-anginal pain, not related to heart issues), also shows a positive correlation between resting BP and serum cholesterol.
The fourth panel, for category 3 (asymptomatic), shows a negative correlation between resting BP and cholesterol. Since this category shows no symptoms, implying no clear relation between chest pain and heart disease risk, the fourth panel provides little critical information about the correlation of serum cholesterol with resting BP.
Overall, every panel except the last indicates a positive correlation between resting BP and serum cholesterol. This also opens the door to analyzing the chest pain codes more deeply to establish a better relationship with resting BP. And since the fasting blood sugar points are spread widely across the panels, the relationship between fasting blood sugar and resting BP calls for better modeling.
Let’s build a logistic regression model using target as the response variable and resting BP as the explanatory variable.
model <- glm(target ~ restingBP,
data = HAA,
family = binomial(link = 'logit'))
summary(model)
##
## Call:
## glm(formula = target ~ restingBP, family = binomial(link = "logit"),
## data = HAA)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.57334 0.42621 -13.08 <2e-16 ***
## restingBP 0.03958 0.00287 13.79 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1360.6 on 999 degrees of freedom
## Residual deviance: 1105.7 on 998 degrees of freedom
## AIC: 1109.7
##
## Number of Fisher Scoring iterations: 4
# Confidence interval for restingBP
confint(model, 'restingBP', level = 0.95)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## 0.03408302 0.04534275
exp(model$coefficients)
## (Intercept) restingBP
## 0.003797771 1.040372007
The intercept of -5.57334 is the log-odds of the event (heart disease occurring) when resting BP is zero. This is not practically meaningful because blood pressure is never zero in a living human body; the intercept simply anchors the baseline of the model.
The coefficient of 0.03958 means that for each one-unit increase in resting BP, the log-odds of heart disease increase by about 0.040. In other words, e^{0.03958}, approximately 1.040, is the odds ratio: the odds of a positive outcome are multiplied by about 1.040 for each one-unit increase in resting BP. This is statistically supported by a very small p-value of < 2e-16.
The null deviance of 1360.6 on 999 degrees of freedom represents the fit of the model without any explanatory variables.
The residual deviance of 1105.7 on 998 degrees of freedom represents the fit of the model after taking the predictor into account. The clear drop in deviance indicates a better fit once resting BP is included as the predictor.
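The drop in deviance can be turned into a formal likelihood-ratio test by comparing it with a chi-squared distribution. The sketch below uses the deviances and degrees of freedom printed in the `summary(model)` output above; the variable names are illustrative.

```r
# Deviances and degrees of freedom as printed in the summary output
null_deviance     <- 1360.6
residual_deviance <- 1105.7
null_df           <- 999
residual_df       <- 998

# Drop in deviance attributable to restingBP, and its degrees of freedom
lr_stat <- null_deviance - residual_deviance
lr_df   <- null_df - residual_df

# Upper-tail chi-squared probability for the likelihood-ratio statistic
lr_p <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
lr_stat
lr_p
```

With the fitted model object available, `anova(model, test = "Chisq")` should report the same comparison without retyping the numbers.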
The 95% confidence interval for the resting BP coefficient ranges from 0.03408302 to 0.04534275.
Since the interval contains no negative values (and excludes zero), it indicates a positive relationship between resting BP and the heart disease outcome (target).
Also, the corresponding odds ratios, from e^{0.0341} (approximately 1.035) to e^{0.0453} (approximately 1.046), suggest that each unit increase in resting BP increases the odds of the outcome (heart disease occurring) by approximately 3.5% to 4.6%.
In other words, with 95% confidence, the true coefficient for resting BP lies in the interval from 0.03408302 to 0.04534275.
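The step from log-odds to odds ratios and predicted probabilities can be sketched in a few lines. The coefficients below are the rounded values printed in the model summary; the resting BP values in `bp` are arbitrary illustration points, not recommendations or clinical thresholds.

```r
# Coefficients as printed in the summary output (rounded)
b0 <- -5.57334   # intercept: log-odds at restingBP = 0
b1 <-  0.03958   # change in log-odds per one-unit increase in restingBP

# Odds ratio for a 1-unit and a 10-unit increase in resting BP
or_1  <- exp(b1)
or_10 <- exp(10 * b1)

# Predicted probability at a few resting BP values (chosen for illustration)
bp   <- c(100, 140, 180)
prob <- plogis(b0 + b1 * bp)   # plogis(z) = 1 / (1 + exp(-z))
data.frame(restingBP = bp, probability = round(prob, 3))
```

With the fitted model object, `predict(model, newdata = data.frame(restingBP = bp), type = "response")` should give the same probabilities up to rounding of the coefficients.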
Let’s visualize the heart disease outcome using resting blood pressure as explanatory variable.
ggplot(HAA, aes(x = restingBP, y = target)) +
geom_jitter(width = 0, height = 0.05, shape = 21, color = 'blue', fill = 'lightblue', size = 3) +
geom_smooth(method = "glm", method.args = list(family = "binomial"), color = 'red', se = FALSE) +
labs(title = "Scatter Plot of Target vs. RestingBP",
x = "Resting Blood Pressure",
y = "Target (Binary Response)") +
scale_y_continuous(breaks = c(0, 0.5, 1)) +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The visualization above helps us understand the logistic relationship between the heart disease outcome (target) and resting blood pressure. The plot shows a positive relationship between them. Patients without heart disease (target 0) mostly have resting blood pressure in the lower range, around 90 to 140 mm Hg, whereas for patients with heart disease (target 1), the bulk of the resting blood pressure values lies higher, around 120 to 200 mm Hg.
This logistic plot gives a clear picture of how strongly resting blood pressure indicates the risk of heart disease as blood pressure varies.
The analysis has successfully identified the key predictors of heart attack risk, emphasizing the critical roles of both lifestyle and health-related factors, as well as the interplay between them. The findings suggest that addressing a combination of factors, rather than focusing on a single aspect, is essential for maintaining optimal heart health. While this study provides valuable insights, further advancements in data modeling could enhance the precision and actionability of the results. Ultimately, adopting a holistic approach to improving daily lifestyle choices can significantly reduce the risk of heart disease and promote overall well-being.