library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
HAA <- read.csv("/Users/rupeshswarnakar/Downloads/Cardiovascular_Disease_Dataset.csv")
The Cardiovascular Disease dataset contains many useful columns, and one of the most important is Serum Cholesterol. Cholesterol is a major contributing factor to heart health. It comes in two main types: HDL ("good" cholesterol) and LDL ("bad" cholesterol). The dataset focuses mainly on LDL, to see how it affects patients’ risk of heart attack.
Let’s take Serum Cholesterol as the response variable. To see how cholesterol levels differ across chest pain types, we use chest pain as the explanatory variable.
Null hypothesis (H0): The mean Serum Cholesterol is the same for all types of chest pain.
Alternative hypothesis (H1): The mean Serum Cholesterol differs for at least one type of chest pain.
Let’s test these hypotheses using a one-way ANOVA as follows.
m <- aov(serumcholestrol ~ chestpain, data = HAA)
summary(m)
## Df Sum Sq Mean Sq F value Pr(>F)
## chestpain 1 535018 535018 31.43 2.68e-08 ***
## Residuals 998 16988801 17023
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
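The p-value of 2.68e-08 is far below 0.05, so we reject the null hypothesis: there is enough evidence to conclude that mean cholesterol is related to chest pain type. One caveat: the table reports only 1 degree of freedom for chestpain, which suggests R treated the column as a numeric code rather than as a categorical factor. Strictly speaking, the test then detects a linear trend in mean cholesterol across the chest pain codes, not a general difference among the group means.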
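To compare group means directly, chestpain can be wrapped in `factor()` before fitting. The sketch below is only an illustration: `toy` is a hypothetical data frame of simulated values, and the 0 to 3 coding of chest pain is an assumption, not something the output above confirms. It shows how the degrees of freedom change between the numeric and the factor version of the model.

```r
# Simulated stand-in for HAA; all values are made up for illustration only.
set.seed(1)
toy <- data.frame(
  chestpain       = rep(0:3, each = 25),            # assumed coding: four pain types
  serumcholestrol = rnorm(100, mean = 300, sd = 60)
)

# Numeric predictor: 1 degree of freedom (a linear trend across the codes).
m_numeric <- aov(serumcholestrol ~ chestpain, data = toy)

# Factor predictor: k - 1 = 3 degrees of freedom (a comparison of group means).
m_factor <- aov(serumcholestrol ~ factor(chestpain), data = toy)

# Pull the Df entry for the chestpain term out of each ANOVA table.
df_numeric <- summary(m_numeric)[[1]]$Df[1]
df_factor  <- summary(m_factor)[[1]]$Df[1]
print(c(numeric = df_numeric, factor = df_factor))
```

With the real data, the same idea would be `aov(serumcholestrol ~ factor(chestpain), data = HAA)`; the F value and p-value would then differ from the table shown above.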
This result seems plausible, because high cholesterol is not linked to every type of chest pain. Typical angina is caused by a build-up of cholesterol in the arteries, which restricts normal blood flow. Other types of chest pain, such as those due to acid reflux, chest-muscle strain, or anxiety, can arise from causes unrelated to cholesterol. It is therefore reasonable that the mean cholesterol level differs across chest pain types, as the ANOVA test indicates.
A more detailed analysis of chest pain characteristics, such as sharp pain, pressure in the chest, or acute versus long-term pain, could give a more accurate picture of the relationship between chest pain and cholesterol.
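One way to check which chest pain types actually differ from one another is a post-hoc pairwise comparison such as Tukey's HSD, available in base R as `TukeyHSD()`. The sketch below uses a hypothetical `toy` data frame of simulated values with an assumed 0 to 3 coding for chest pain, so its numbers say nothing about the real data.

```r
# Simulated stand-in for HAA; all values are made up for illustration only.
set.seed(2)
toy <- data.frame(
  chestpain       = factor(rep(0:3, each = 25)),    # assumed coding, stored as a factor
  serumcholestrol = rnorm(100, mean = 300, sd = 60)
)

# One-way ANOVA with chest pain as a categorical predictor.
fit <- aov(serumcholestrol ~ chestpain, data = toy)

# Tukey's HSD: one row per pair of groups, with the difference in means,
# a confidence interval (lwr, upr), and an adjusted p-value.
pairwise <- TukeyHSD(fit)$chestpain
print(round(pairwise, 3))
```

Pairs whose adjusted p-value falls below 0.05 would be the ones whose mean cholesterol differs.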
Let’s take max heart rate as the continuous explanatory variable and model serum cholesterol against it with a linear regression.
First, let’s create a scatter plot to show the relationship between Serum Cholesterol and max heart rate.
HAA |>
  ggplot(mapping = aes(x = maxheartrate, y = serumcholestrol)) +
  geom_point(size = 2) +
  # Overlay the least-squares line without a confidence band
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Effect of Max Heart Rate on Serum Cholesterol",
       x = "Max Heart Rate",
       y = "Serum Cholesterol") +
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The plot shows at most a very weak positive association between max heart rate and cholesterol: the fitted line slopes slightly upward, but the points are scattered widely around it. The weak fit could be driven by a few influential observations at the extremes of the x-axis, or by other factors that affect cholesterol far more than heart rate does.
Intuitively, a positive association is plausible: high cholesterol can block blood vessels around the heart and slow blood flow, which may raise the heart rate and increase the risk of stroke. This points to the importance of keeping cholesterol low to maintain a healthy heart.
model <- lm(serumcholestrol ~ maxheartrate, data = HAA)
summary(model)
##
## Call:
## lm(formula = serumcholestrol ~ maxheartrate, data = HAA)
##
## Residuals:
## Min 1Q Median 3Q Max
## -319.83 -75.27 7.19 93.55 291.38
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 289.4876 18.3101 15.810 <2e-16 ***
## maxheartrate 0.1509 0.1225 1.232 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 132.4 on 998 degrees of freedom
## Multiple R-squared: 0.001518, Adjusted R-squared: 0.000518
## F-statistic: 1.518 on 1 and 998 DF, p-value: 0.2183
The summary above provides several quantities that are useful for drawing conclusions, as described below.
Intercept: The value 289.5 is the predicted cholesterol level when the maximum heart rate is zero. This has no practical meaning, since a living person always has a nonzero heart rate; a heart rate of zero lies far outside the range of the data.
Slope: The value 0.1509 indicates that each one-unit increase in max heart rate is associated with an average increase of about 0.15 mg/dL in cholesterol.
p-value: The slope’s p-value of 0.218 is greater than 0.05, so the slope is not statistically distinguishable from zero. In other words, there is no statistical evidence that max heart rate predicts cholesterol level. This seems plausible, because heart rate is affected by many other factors, such as low blood supply to the heart, stress, and poor diet.
Residual standard error: The value of 132.4 means that observed cholesterol values typically deviate from the fitted values by about 132 mg/dL, which is large and reflects the weak fit.
R-squared: The value of 0.001518 means that max heart rate explains only about 0.15% of the variation in cholesterol level, so it is a poor predictor of cholesterol.
F-statistic: The value of 1.518, with a p-value of 0.2183, indicates that the model as a whole does not predict cholesterol significantly better than a model with no predictors.
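Several of these quantities are linked, and the links can be checked from the reported numbers alone. The short sketch below recomputes a 95% confidence interval for the slope, the F-statistic (which equals the squared t value in a one-predictor model), and the correlation implied by R-squared. The variable names are illustrative; the inputs are copied from the summary output above.

```r
# Values copied from the summary(model) output above.
est    <- 0.1509     # slope estimate for maxheartrate
se     <- 0.1225     # its standard error
t_val  <- 1.232      # reported t value
r2     <- 0.001518   # multiple R-squared
df_res <- 998        # residual degrees of freedom

# 95% confidence interval for the slope: estimate +/- t critical value * SE.
t_crit <- qt(0.975, df_res)
ci <- est + c(-1, 1) * t_crit * se

# With a single predictor, the F-statistic is the square of the slope's t value.
f_from_t <- t_val^2
p_from_f <- pf(f_from_t, 1, df_res, lower.tail = FALSE)

# The correlation is the square root of R-squared, taking the sign of the slope.
r <- sqrt(r2)

print(round(c(ci_lower = ci[1], ci_upper = ci[2],
              F = f_from_t, p = p_from_f, r = r), 4))
```

The interval runs from roughly -0.09 to 0.39 and so includes zero, which agrees with the non-significant p-value; the implied correlation of about 0.04 is very close to none.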
This linear regression model shows only a very weak, statistically non-significant positive relationship between serum cholesterol and maximum heart rate. Hence, max heart rate alone is not a good predictor of cholesterol. However, it could be combined with other factors, such as BMI, resting blood pressure, and blood sugar, to see their combined effect on cholesterol level.
Furthermore, this linear regression model opens the door to a more detailed investigation of cholesterol.
Separately, a general recommendation can still be made: eating appropriate amounts of healthy fats, such as avocado, nuts, seeds, and healthy oils, can help raise HDL levels and thereby reduce LDL in the blood, lowering the risk of heart attack.
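A minimal sketch of such a combined model is given below. The column names `restingBP` and `fastingbloodsugar` are assumptions about how the dataset labels those variables, and the `toy` data frame holds simulated values so that the snippet runs without the CSV file; with the real data, `toy` would be replaced by `HAA` and the names adjusted to match.

```r
# Simulated stand-in for HAA; column names beyond those used above are assumed,
# and every value is made up for illustration only.
set.seed(3)
n <- 200
toy <- data.frame(
  maxheartrate      = rnorm(n, mean = 145, sd = 30),
  restingBP         = rnorm(n, mean = 150, sd = 25),
  fastingbloodsugar = rbinom(n, size = 1, prob = 0.3),   # assumed 0/1 indicator
  serumcholestrol   = rnorm(n, mean = 300, sd = 100)
)

# Multiple linear regression: cholesterol against several predictors at once.
model_multi <- lm(serumcholestrol ~ maxheartrate + restingBP + fastingbloodsugar,
                  data = toy)

# Coefficient table (estimate, standard error, t value, p-value) and adjusted R-squared.
coefs  <- coef(summary(model_multi))
adj_r2 <- summary(model_multi)$adj.r.squared
print(round(coefs, 3))
print(adj_r2)
```

Comparing the adjusted R-squared of this model with that of the single-predictor model would show whether the extra variables add any explanatory power.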