The main objective of this assignment is to examine three clinical features of patients with heart failure (age, ejection fraction, and serum sodium) and to determine whether age is associated with the other two. Early detection helps in the prevention of cardiovascular diseases (CVDs), and statistical and machine learning models can support it. Here, I use hypothesis testing to assess the relationship between age and the two other variables, ejection fraction and serum sodium. The results help in assessing the potential impact of age on ejection fraction and serum sodium levels and contribute to a better understanding of these relationships in the context of heart failure.
The overall problem/question driving the investigation is to determine the relationship between age and two medical variables, ejection fraction and serum sodium level, in patients with heart failure. Specifically, the investigation aims to understand if age is a significant predictor of ejection fraction and serum sodium levels in heart failure patients.
Statistics will be used to analyze the dataset and provide insights into the relationship between age and the two medical variables of interest (ejection fraction and serum sodium level) in heart failure patients. The following statistical techniques and analyses will be employed:
The results obtained will contribute to the understanding of these relationships and their potential implications for patient management and treatment strategies.
setwd("/Users/swetakarmacharya/Documents/Projects/Heart-Failure-Prediction")
heart <- read.csv("heart_failure.csv")
The dataset has been extracted from Kaggle. The URL of the data is: https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data
Variables Description
The dataset contains information about various factors and measurements related to heart failure. It has 13 columns and 299 observations. The variables in the dataset are described as:
Here, age, ejection fraction, and serum sodium serve as the important variables.
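To summarize the important variables, I’ve calculated the mean, median, minimum, maximum, and quartiles for each variable. No missing data were detected. Whether any values count as outliers depends on the rule applied: the summaries show a minimum serum sodium of 113 and a maximum ejection fraction of 80, both far from their quartiles, so these extremes deserve a second look even though all 299 observations are kept in the analysis.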
To visualize the relationships and patterns in the data, scatter plots with regression lines can be created. A scatter plot with a regression line shows the relationship between two variables and helps us understand the nature and strength of that relationship. It allows us to visually examine the pattern of the data points and determine if there is a linear association between the variables.
summary(heart$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 40.00 51.00 60.00 60.83 70.00 95.00
summary(heart$ejection_fraction)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 14.00 30.00 38.00 38.08 45.00 80.00
summary(heart$serum_sodium)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 113.0 134.0 137.0 136.6 140.0 148.0
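One common way to screen for outliers is Tukey's 1.5 × IQR rule. The sketch below applies it to the quartiles printed above rather than to the raw data, so it only tells us whether the minimum or maximum of each variable falls outside the fences. The choice of this rule and the object name `quartiles` are my assumptions, not part of the original analysis.

```r
# Quartiles copied from the summary() output above
quartiles <- data.frame(
  variable = c("age", "ejection_fraction", "serum_sodium"),
  min = c(40, 14, 113),
  q1  = c(51, 30, 134),
  q3  = c(70, 45, 140),
  max = c(95, 80, 148)
)

# Tukey fences: values more than 1.5 * IQR beyond the quartiles are flagged
iqr <- quartiles$q3 - quartiles$q1
quartiles$lower_fence <- quartiles$q1 - 1.5 * iqr
quartiles$upper_fence <- quartiles$q3 + 1.5 * iqr
quartiles$min_flagged <- quartiles$min < quartiles$lower_fence
quartiles$max_flagged <- quartiles$max > quartiles$upper_fence

print(quartiles)
```

Under this rule the maximum ejection fraction (80 against an upper fence of 67.5) and the minimum serum sodium (113 against a lower fence of 125) are flagged, while age stays within its fences. A flagged value is not necessarily an error; it is a prompt to inspect the observation.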
library(ggplot2)

# Scatter plot for age and ejection_fraction
# (column names go in aes() directly; ggplot looks them up in `heart`)
ggplot(heart, aes(x = age, y = ejection_fraction)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Age", y = "Ejection Fraction", title = "Scatter Plot with Regression Line")
The slope of the regression line indicates the direction of the
relationship between the variables. The line slopes slightly upward
from left to right, which suggests a positive relationship: as age
increases, ejection fraction tends to increase. However, the slope is
very shallow, which suggests a weak relationship between the two
variables. The data points are also widely dispersed around the
regression line, indicating considerable variability in ejection
fraction among individuals of the same age. Hence, the plot hints at a
weak positive linear trend; whether that trend is statistically
significant cannot be judged from the plot alone.
# Scatter plot for age and serum_sodium
# (requires ggplot2; column names go in aes() directly)
ggplot(heart, aes(x = age, y = serum_sodium)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "Age", y = "Serum Sodium", title = "Scatter Plot with Regression Line")
Here, the scatter plot and regression line of age and serum_sodium show
a slightly downward slope from left to right, which implies a negative
relationship between age and serum_sodium: as age increases,
serum_sodium levels tend to decrease. In other words, older individuals
tend to have slightly lower serum_sodium than younger individuals.
However, the slope is only slightly downward, which suggests a weak
relationship between the two variables. The data points are dispersed
around the regression line, indicating variability in serum_sodium
among individuals of the same age. Hence, the plot hints at a weak
negative linear trend.
It’s crucial to keep in mind that correlation does not imply causation. While the scatter plots and regression lines suggest a weak positive relationship between age and ejection fraction and a weak negative relationship between age and serum sodium, further analysis would be needed to determine the underlying factors or mechanisms driving these relationships. Because the data show some hints of a linear connection, I proceeded to fit linear regression models in R.
Hypothesis Testing Model 1: Age and Ejection Fraction
model1 <- lm(heart$ejection_fraction ~ heart$age, data = heart)
summary(model1)
##
## Call:
## lm(formula = heart$ejection_fraction ~ heart$age, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.931 -7.944 -1.230 6.368 42.863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.44603 3.57196 9.643 <2e-16 ***
## heart$age 0.05980 0.05763 1.038 0.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.83 on 297 degrees of freedom
## Multiple R-squared: 0.003612, Adjusted R-squared: 0.000257
## F-statistic: 1.077 on 1 and 297 DF, p-value: 0.3003
The p-value for the age coefficient (0.3003) is greater than any conventional significance level (for example, 0.05), so we fail to reject the null hypothesis that the slope is zero. The data therefore provide no statistically significant evidence of a linear relationship between age and ejection fraction. Age explains less than 1% of the variance in ejection fraction (multiple R-squared = 0.003612).
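The test decision for model 1 can be checked by hand from the coefficient table. The sketch below recomputes the t value, the two-sided p-value, and a 95% confidence interval for the slope from the printed estimate and standard error. The significance level of 0.05 is an assumption, since the report does not state one.

```r
# Values copied from the model 1 summary output
estimate <- 0.05980   # slope for age
std_err  <- 0.05763   # standard error of the slope
df_resid <- 297       # residual degrees of freedom
alpha    <- 0.05      # assumed significance level

# t statistic and two-sided p-value under H0: slope = 0
t_value <- estimate / std_err
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# 95% confidence interval for the slope
margin <- qt(1 - alpha / 2, df = df_resid) * std_err
ci <- c(lower = estimate - margin, upper = estimate + margin)

round(c(t = t_value, p = p_value), 3)
round(ci, 3)
p_value < alpha   # FALSE: the null hypothesis is not rejected
```

The confidence interval contains zero, which is the same conclusion stated another way: the data are compatible with no change in ejection fraction per year of age.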
Hypothesis Testing Model 2: Age and Serum Sodium
model2 <- lm(heart$serum_sodium ~ heart$age, data = heart)
summary(model2)
##
## Call:
## lm(formula = heart$serum_sodium ~ heart$age, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23.640 -2.384 0.241 3.045 11.616
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 137.66272 1.33276 103.291 <2e-16 ***
## heart$age -0.01705 0.02150 -0.793 0.428
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.415 on 297 degrees of freedom
## Multiple R-squared: 0.002113, Adjusted R-squared: -0.001247
## F-statistic: 0.6288 on 1 and 297 DF, p-value: 0.4284
The p-value for the age coefficient (0.4284) is greater than any conventional significance level (for example, 0.05), so we fail to reject the null hypothesis that the slope is zero. The data therefore provide no statistically significant evidence of a linear relationship between age and serum sodium. Age explains less than 1% of the variance in serum sodium (multiple R-squared = 0.002113).
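To put the strength of both relationships on a common scale, the R-squared values from the two summaries can be converted into Pearson correlations. This is a small sketch using only the printed output; it relies on the fact that in a regression with one predictor, r is the square root of R-squared with the sign of the slope.

```r
# R-squared values and slopes copied from the two model summaries
r_squared <- c(ejection_fraction = 0.003612, serum_sodium = 0.002113)
slope     <- c(ejection_fraction = 0.05980,  serum_sodium = -0.01705)

# Pearson's r for a one-predictor regression: signed square root of R-squared
pearson_r <- sign(slope) * sqrt(r_squared)

# Share of variance in each outcome explained by age, in percent
pct_explained <- 100 * r_squared

round(pearson_r, 3)
round(pct_explained, 2)
```

Both correlations are close to zero (about 0.06 and -0.05), which matches the nearly flat regression lines in the scatter plots.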
plot(model1, which = 1)
The Residuals vs Fitted plot shows the residuals scattered roughly
symmetrically around the horizontal zero line, with most points
clustered near the center of the plot. Based on these observations, the
assumption of homoscedasticity, which requires equal variance of the
residuals across all levels of the predictor, appears reasonably valid.
The plot gives no indication that the assumption is violated in this
analysis.
plot(model1, which = 2)
The Normal Q-Q plot shows that the points in the center of the graph fall along the line, indicating that the bulk of the residuals follow a normal distribution. Towards the edges of the plot, however, the points curve away from the line, indicating more extreme values than expected under a true normal distribution. This departure is confined to the tails, so the normality assumption is treated as approximately satisfied.
plot(model1, which = 3)
The Scale-Location plot shows the points scattered randomly around the
trend line without any distinct patterns or systematic trends. This
suggests that the variance of the residuals is constant across
different levels of the predictor variable, indicating that the
assumption of homoscedasticity is not violated.
# which = 5 draws Residuals vs Leverage (which = 4 draws the Cook's distance plot)
plot(model1, which = 5)
The Residuals vs Leverage plot shows no cases falling outside the
Cook’s distance lines. All cases lie well within these lines,
indicating that no individual data point has a disproportionate
influence on the fitted regression. Therefore, we can conclude that the
model is not driven by influential observations.
plot(model2, which = 1)
The Residuals vs Fitted plot shows the residuals scattered roughly
symmetrically around the horizontal zero line, with a tendency to
cluster around the center of the plot. These observations indicate that
the assumption of homoscedasticity, which requires equal variance of
the residuals across all levels of the predictor, is reasonably valid.
Based on the pattern observed in the plot, there is no clear violation
of this assumption.
plot(model2, which = 2)
The Normal Q-Q plot reveals that the points in the center of the graph
approximately follow the straight line, indicating that the bulk of the
residuals follow a normal distribution. However, towards the edges of
the plot, the points curve away from the straight line. This curvature
suggests that the residuals contain more extreme values than would be
expected under a true normal distribution. Because the departure is
confined to the tails, the assumption of normality is treated as
approximately satisfied for the analyzed data.
plot(model2, which = 3)
The Scale-Location plot displays the points scattered randomly around the trend line, without any discernible patterns or systematic trends. This suggests that the variance of the residuals remains constant across different levels of the predictor variable. Therefore, the assumption of homoscedasticity, which requires equal variance of the residuals, is not violated.
# which = 5 draws Residuals vs Leverage (which = 4 draws the Cook's distance plot)
plot(model2, which = 5)
The Residuals vs Leverage plot shows that all cases lie well within the Cook’s distance lines, indicating that no individual case has a disproportionate influence on the regression model. Based on this observation, we can conclude that the results are not driven by influential observations.
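The visual checks for both models can be complemented by numeric ones. The sketch below is one possible way to do this in base R, and it is not part of the original analysis. `diagnose()` is a hypothetical helper, and the data frame `demo` is simulated stand-in data used only so that the block runs without `heart_failure.csv`. Its numbers say nothing about the real dataset. The helper assumes a regression with a single predictor.

```r
# Hypothetical helper: numeric companions to the four diagnostic plots
diagnose <- function(model) {
  res <- residuals(model)
  fit <- fitted(model)
  n   <- length(res)

  # Breusch-Pagan style check (Koenker form): regress squared residuals on
  # the fitted values; n * R^2 is compared with a chi-square on 1 df
  res_sq  <- res^2
  aux     <- lm(res_sq ~ fit)
  bp_stat <- n * summary(aux)$r.squared
  bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

  # Shapiro-Wilk test of residual normality
  sw_p <- shapiro.test(res)$p.value

  # Cook's distance: largest value and count above the 4/n rule of thumb
  cd <- cooks.distance(model)

  c(bp_p = bp_p, shapiro_p = sw_p,
    max_cook = max(cd), n_cook_over_4n = sum(cd > 4 / n))
}

# Simulated stand-in data (NOT the heart failure data)
set.seed(1)
demo <- data.frame(age = runif(299, min = 40, max = 95))
demo$y <- 35 + 0.05 * demo$age + rnorm(299, sd = 12)

demo_result <- diagnose(lm(y ~ age, data = demo))
round(demo_result, 3)
```

With the real data loaded, the same helper could be called as `diagnose(model1)` and `diagnose(model2)`. Small p-values would point to heteroscedasticity or non-normal residuals, and a large maximum Cook’s distance would point to an influential case.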
The investigation focused on analyzing the relationship between age and two variables, ejection fraction and serum sodium, in the context of heart failure. The major findings of the investigation are as follows:
Age and Ejection Fraction: The scatter plot and regression line between age and ejection fraction indicated a slight upward slope. However, the regression showed that age had no statistically significant effect on ejection fraction (p = 0.3003), and age explained less than 1% of the variance in ejection fraction (R-squared = 0.003612). The diagnostic plots gave no indication that the assumptions of the linear model were violated.
Age and Serum Sodium: The scatter plot and regression line between age and serum sodium showed a slightly downward slope. As with the previous finding, age had no statistically significant effect on serum sodium levels (p = 0.4284), and age explained less than 1% of the variance in serum sodium (R-squared = 0.002113). The diagnostic plots gave no indication that the assumptions of the linear model were violated.
Strengths
Limitation
Future Directions
However, it is important to acknowledge the limitations of the investigation. The analysis only considers three variables: age, ejection fraction, and serum sodium. This limited scope may prevent a comprehensive understanding of the underlying relationships. To gain a more accurate understanding, future investigations should gather and analyze additional data and variables to provide more insights and considerations.
World Health Organization. (n.d.). Cardiovascular diseases (CVDs). Retrieved from https://www.who.int/news-room/fact-sheets/detail/cardiovascular-diseases-(cvds)
Kaggle. (n.d.). Heart Failure Clinical Data. Retrieved from https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data
Our World in Data. (n.d.). Cardiovascular disease death rates. Retrieved from https://ourworldindata.org/grapher/cardiovascular-disease-death-rates