MATH1324 Introduction to Statistics Assignment 2

Heart Failure Prediction

Sweta Karmacharya (3938458)

28-MAY-2023

Introduction

The main objective of this assignment is to examine three clinical features of patients with heart failure (age, ejection fraction, and serum sodium) and to determine whether age is associated with the other two. Early detection helps prevent cardiovascular diseases (CVDs), and statistical and machine learning models can support it. Here, I use hypothesis testing on linear regression models to assess the relationship between age and two other variables, ejection fraction and serum sodium. The results help assess the potential impact of age on ejection fraction and serum sodium levels and contribute to a better understanding of these relationships in the context of heart failure.

Problem Statement

The overall problem/question driving the investigation is to determine the relationship between age and two medical variables, ejection fraction and serum sodium level, in patients with heart failure. Specifically, the investigation aims to understand if age is a significant predictor of ejection fraction and serum sodium levels in heart failure patients.

Statistics will be used to analyze the dataset and provide insights into the relationship between age and the two medical variables of interest (ejection fraction and serum sodium level) in heart failure patients. The following statistical techniques and analyses will be employed:

  1. Descriptive Statistics: Summary statistics will be computed for the variables age, ejection fraction, and serum sodium. These measures will provide an overview of the central tendency, variability, and distribution of the variables.
  2. Scatter Plot: Scatter plots will be created to visualize the relationship between age and each of the two variables (ejection fraction and serum sodium). The scatter plots will help identify any potential patterns or trends between the variables.
  3. Linear Regression: Linear regression models will be fitted to examine the relationship between age and the two variables of interest. Two separate regression models will be created, one for ejection fraction and another for serum sodium. The regression analysis will provide information on the strength, direction, and significance of the relationships.
  4. Hypothesis Testing: Hypothesis testing will be performed to assess the statistical significance of the relationship between age and the two variables. The regression models will be tested to determine if age significantly predicts ejection fraction and serum sodium levels in heart failure patients.
  5. Assumption Testing: Assumption tests for linear regression will be conducted to ensure the validity and reliability of the regression models. Diagnostic plots will be examined to assess assumptions such as linearity, normality, homoscedasticity, and absence of influential outliers.

The results obtained will contribute to the understanding of these relationships and their potential implications for patient management and treatment strategies.

Data

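library(ggplot2)   # ggplot(), used for the scatter plots
library(magrittr)  # %>%, used when summarising the models
setwd("/Users/swetakarmacharya/Documents/RMIT/Applied Analytics/Assignment2")
heart <- read.csv("heart_failure.csv")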

The dataset has been extracted from Kaggle. The URL of the data is: https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data

Variables Description

The dataset provided consists of the information about various factors and measurements related to heart failure. The dataset has 13 columns and 299 observations. The variables in the dataset are described as:

  1. age: This variable represents the age of the patients. Age is a significant risk factor for heart failure, as the incidence and prevalence of heart failure increase with advancing age.
  2. anaemia: This variable indicates whether a patient has anemia, a condition characterized by a lower-than-normal number of red blood cells or hemoglobin in the blood. Anemia can contribute to heart failure by reducing oxygen-carrying capacity and impairing cardiac function.
  3. creatinine_phosphokinase: This variable represents the level of creatine phosphokinase (CPK) in the blood. CPK is an enzyme found primarily in the heart, brain, and skeletal muscles. Elevated CPK levels may indicate damage or injury to the heart muscle, such as during a heart attack.
  4. diabetes: This variable indicates whether a patient has diabetes. Diabetes can lead to various cardiovascular complications, including damage to the blood vessels and impaired heart function.
  5. ejection_fraction: This variable represents the ejection fraction, which is the percentage of blood pumped out of the heart’s left ventricle with each contraction. A reduced ejection fraction is a hallmark of heart failure and indicates that the heart is not pumping efficiently.
  6. high_blood_pressure: This variable indicates whether a patient has high blood pressure, also known as hypertension. Prolonged high blood pressure can cause damage to the heart and blood vessels, leading to heart failure.
  7. platelets: This variable represents the levels of platelets in the blood. Platelets are blood cells involved in clotting. Abnormal platelet levels can indicate underlying health conditions or potential complications related to heart failure.
  8. serum_creatinine: This variable indicates the levels of serum creatinine, which is a waste product from muscle metabolism. Elevated serum creatinine levels may indicate impaired kidney function, a common complication of heart failure.
  9. serum_sodium: This variable represents the levels of sodium in the blood. Abnormal serum sodium levels can be indicative of electrolyte imbalances, which can affect fluid balance and cardiovascular function.
  10. sex: This variable represents the gender of the patients.
  11. smoking: This variable indicates whether a patient is a smoker. Smoking is a modifiable risk factor for heart failure and contributes to the development and progression of cardiovascular diseases.
  12. time: This variable represents the follow-up period or duration of observation for each patient in the study. It provides information on the length of time over which events and outcomes, such as death or heart failure-related events, are recorded.
  13. DEATH_EVENT: This variable indicates whether a patient experienced a death event during the study period. It serves as an outcome variable, reflecting the occurrence of death related to heart failure.

Here, age, ejection fraction, and serum sodium serve as the variables of interest.

Descriptive Statistics and Visualisation

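To summarize the important variables, I’ve calculated the mean, median, minimum, maximum, and quartiles for each variable. No missing data were detected in the dataset. The maximum ejection fraction (80) and the minimum serum sodium (113) lie more than 1.5 times the interquartile range beyond the quartiles, so they are worth noting as potential outliers, although they were retained in the analysis.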

To visualize the relationships and patterns in the data, scatter plots with regression lines can be created. A scatter plot with a regression line shows the relationship between two variables and helps us understand the nature and strength of that relationship. It allows us to visually examine the pattern of the data points and determine if there is a linear association between the variables.

summary(heart$age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40.00   51.00   60.00   60.83   70.00   95.00
summary(heart$ejection_fraction)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   14.00   30.00   38.00   38.08   45.00   80.00
summary(heart$serum_sodium)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   113.0   134.0   137.0   136.6   140.0   148.0
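As a rough check on extreme values, the 1.5 × IQR (Tukey) fences can be worked out from the quartiles printed above. This is only a sketch: the numbers are copied by hand from the `summary()` output rather than read from the data, and the 1.5 × IQR rule is one common convention among several.

```r
# Quartiles and extremes copied from the summary() output above
stats <- data.frame(
  variable = c("age", "ejection_fraction", "serum_sodium"),
  min = c(40, 14, 113),
  q1  = c(51, 30, 134),
  q3  = c(70, 45, 140),
  max = c(95, 80, 148)
)

# Tukey fences: values beyond 1.5 * IQR from the quartiles are flagged
iqr <- stats$q3 - stats$q1
stats$lower_fence <- stats$q1 - 1.5 * iqr
stats$upper_fence <- stats$q3 + 1.5 * iqr

# Does the smallest or largest observed value fall outside the fences?
stats$min_below_fence <- stats$min < stats$lower_fence
stats$max_above_fence <- stats$max > stats$upper_fence

print(stats)
```

Under this rule the age range sits inside its fences, while the largest ejection fraction and the smallest serum sodium value fall outside theirs.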

Relationship between age and ejection_fraction

# Scatter plot for age and ejection_fraction
ggplot(heart, aes(x = age, y = ejection_fraction)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(x = "Age", y = "Ejection Fraction", title = "Scatter Plot with Regression Line")

The slope of the regression line indicates the direction of the relationship between the variables. The line slopes slightly upward from left to right, which suggests a positive relationship: as age increases, ejection fraction tends to increase. However, the slope is very shallow, which suggests a weak relationship. The data points are also widely dispersed around the regression line, indicating considerable variability in ejection fraction among individuals of the same age. The plot therefore hints at a weak positive linear relationship at most, and its statistical significance is assessed in the Hypothesis Testing section.

Relationship between age and serum_sodium

# Scatter plot for age and serum_sodium
ggplot(heart, aes(x = age, y = serum_sodium)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "blue") +
  labs(x = "Age", y = "Serum Sodium", title = "Scatter Plot with Regression Line")

Here, the scatter plot and regression line of age and serum_sodium show a slightly downward slope from left to right, which implies a negative relationship: as age increases, serum_sodium levels tend to decrease. However, the slope is only slightly downward, which suggests a weak relationship between the two variables. The data points are dispersed around the regression line, indicating variability in serum_sodium among individuals of the same age. The plot therefore hints at a weak negative linear relationship at most, and its statistical significance is assessed in the Hypothesis Testing section.

Hypothesis Testing

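It’s crucial to keep in mind that correlation does not imply causation. The scatter plots and regression lines suggest a weak positive relationship between age and ejection fraction and a weak negative relationship between age and serum sodium, but further analysis is needed to determine whether these trends are statistically significant and what mechanisms might drive them. Because the plots show some hints of a linear association, I’ve proceeded with fitting linear regression models in R. For each model, the null hypothesis is that the slope for age is zero (no linear relationship) and the alternative hypothesis is that the slope is not zero. The conventional 0.05 significance level is used.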

Hypothesis Testing Model 1: Age and Ejection Fraction

model1 <- lm(heart$ejection_fraction ~ heart$age, data = heart)
model1 %>% summary()
## 
## Call:
## lm(formula = heart$ejection_fraction ~ heart$age, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.931  -7.944  -1.230   6.368  42.863 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.44603    3.57196   9.643   <2e-16 ***
## heart$age    0.05980    0.05763   1.038      0.3    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.83 on 297 degrees of freedom
## Multiple R-squared:  0.003612,   Adjusted R-squared:  0.000257 
## F-statistic: 1.077 on 1 and 297 DF,  p-value: 0.3003

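The p-value for the age coefficient (p = 0.300) is greater than the conventional 0.05 significance level, so we fail to reject the null hypothesis that the slope is zero. The data therefore do not provide statistically significant evidence of a linear relationship between age and ejection fraction. The estimated slope is small (0.0598 per year of age), and the model explains only about 0.4% of the variance in ejection fraction (R-squared = 0.0036).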
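The test of the age coefficient can be reproduced by hand from the coefficient table. The sketch below copies the estimate, standard error, and residual degrees of freedom from the model 1 output and assumes a 0.05 significance level, which the report does not state explicitly.

```r
# Values copied from the model 1 coefficient table above
estimate  <- 0.05980   # slope for age
std_error <- 0.05763   # standard error of the slope
df_resid  <- 297       # residual degrees of freedom
alpha     <- 0.05      # assumed significance level

# t statistic for H0: slope = 0, and its two-sided p-value
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Decision rule: reject H0 only when the p-value is below alpha
reject_null <- p_value < alpha

cat("t =", round(t_value, 3),
    " p =", round(p_value, 3),
    " reject H0:", reject_null, "\n")
```

The recomputed statistic agrees with the `t value` and `Pr(>|t|)` columns up to rounding, and the decision rule does not reject the null hypothesis at this level.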

Hypothesis Testing Model 2: Age and Serum Sodium

model2 <- lm(heart$serum_sodium ~ heart$age, data = heart)
model2 %>% summary()
## 
## Call:
## lm(formula = heart$serum_sodium ~ heart$age, data = heart)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.640  -2.384   0.241   3.045  11.616 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 137.66272    1.33276 103.291   <2e-16 ***
## heart$age    -0.01705    0.02150  -0.793    0.428    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.415 on 297 degrees of freedom
## Multiple R-squared:  0.002113,   Adjusted R-squared:  -0.001247 
## F-statistic: 0.6288 on 1 and 297 DF,  p-value: 0.4284

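The p-value for the age coefficient (p = 0.428) is greater than the conventional 0.05 significance level, so we fail to reject the null hypothesis that the slope is zero. The data therefore do not provide statistically significant evidence of a linear relationship between age and serum sodium. The estimated slope is small (-0.017 per year of age), and the model explains only about 0.2% of the variance in serum sodium (R-squared = 0.0021).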
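A confidence interval for each slope gives another view of the same tests. The sketch below builds 95% intervals from the estimates and standard errors printed in the two coefficient tables. The values are copied by hand, and the 95% level is an assumption matching a 0.05 significance level.

```r
# Estimates and standard errors copied from the two coefficient tables
slopes <- data.frame(
  model     = c("ejection_fraction ~ age", "serum_sodium ~ age"),
  estimate  = c(0.05980, -0.01705),
  std_error = c(0.05763, 0.02150)
)
df_resid <- 297                      # residual degrees of freedom (both models)
t_crit   <- qt(0.975, df = df_resid) # critical value for a 95% interval

# Interval: estimate plus or minus t_crit standard errors
slopes$lower <- slopes$estimate - t_crit * slopes$std_error
slopes$upper <- slopes$estimate + t_crit * slopes$std_error

# An interval containing zero is consistent with "no linear relationship"
slopes$contains_zero <- slopes$lower < 0 & slopes$upper > 0

print(slopes)
```

Both intervals contain zero, which matches p-values above 0.05 for both slopes. With the fitted models available, `confint(model1)` and `confint(model2)` would give the same intervals without copying numbers.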

Testing assumptions for model 1

plot(model1, which = 1)

The Residuals vs Fitted plot shows that the residuals lie close to the horizontal reference line and are distributed roughly symmetrically around it, with most points clustered near the center of the plot. Based on these observations, the assumptions of linearity and homoscedasticity (equal variance of the residuals across fitted values) appear reasonably valid and are not violated in this analysis.

plot(model1, which = 2)

The Normal Q-Q plot shows that the points in the center of the graph fall along the line, which is consistent with a normal distribution. Towards the edges of the plot, the points curve away from the line, indicating more extreme values than expected under a true normal distribution. Because the departure is confined to the tails and the sample is large (299 observations), the normality assumption is treated as not seriously violated.
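A formal test can supplement the visual reading of a Q-Q plot. The sketch below runs a Shapiro-Wilk test on the residuals of a regression fitted to simulated stand-in data. The data frame `sim`, its coefficients, and the noise level are hypothetical and are not the heart failure dataset.

```r
set.seed(10)
n <- 299

# Simulated stand-in data (NOT the heart failure dataset)
sim <- data.frame(age = runif(n, min = 40, max = 95))
sim$y <- 34 + 0.06 * sim$age + rnorm(n, sd = 12)

fit <- lm(y ~ age, data = sim)

# Shapiro-Wilk test: H0 is that the residuals are normally distributed
sw <- shapiro.test(residuals(fit))

cat("W =", round(sw$statistic, 4), " p =", round(sw$p.value, 4), "\n")
```

On the real models the call would be `shapiro.test(residuals(model1))`. With around 300 observations the test can flag small departures that matter little in practice, so it is best read alongside the Q-Q plot rather than instead of it.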

plot(model1, which = 3)

The Scale-Location plot shows that the points are scattered randomly around the reference line without any distinct patterns or systematic trends. This suggests that the variance of the residuals is roughly constant across the fitted values, indicating that the homoscedasticity assumption is not violated.
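Constant variance can also be checked numerically. The sketch below computes a studentized (Koenker) Breusch-Pagan statistic in base R by regressing the squared residuals on the predictor. It uses simulated stand-in data, so `sim` and its parameters are hypothetical and the result says nothing about the heart failure models.

```r
set.seed(20)
n <- 299

# Simulated stand-in data (NOT the heart failure dataset)
sim <- data.frame(age = runif(n, min = 40, max = 95))
sim$y <- 34 + 0.06 * sim$age + rnorm(n, sd = 12)

fit <- lm(y ~ age, data = sim)

# Auxiliary regression: squared residuals on the predictor
sim$res_sq <- residuals(fit)^2
aux <- lm(res_sq ~ age, data = sim)

# LM statistic = n * R-squared of the auxiliary regression,
# compared with a chi-squared distribution (df = number of predictors)
lm_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(lm_stat, df = 1, lower.tail = FALSE)

cat("LM statistic =", round(lm_stat, 3), " p =", round(bp_p, 3), "\n")
```

A small p-value would suggest that the residual variance changes with age. A large one is consistent with the constant spread seen in a Scale-Location plot.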

plot(model1, which = 5)

The Residuals vs Leverage plot shows that no cases fall outside the Cook’s distance lines. All cases lie well within the threshold, indicating that no individual data point has a disproportionate influence on the regression model. Therefore, we can conclude that the assumption of no influential cases is not violated in this analysis.
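Cook's distance can also be inspected as numbers rather than read off a plot. The sketch below uses simulated stand-in data and the common 4/n rule of thumb. Both the data frame `sim` and the threshold are assumptions for illustration, not part of the original analysis.

```r
set.seed(30)
n <- 299

# Simulated stand-in data (NOT the heart failure dataset)
sim <- data.frame(age = runif(n, min = 40, max = 95))
sim$y <- 34 + 0.06 * sim$age + rnorm(n, sd = 12)

fit <- lm(y ~ age, data = sim)

# Cook's distance for every observation
cooks <- cooks.distance(fit)

# Rule of thumb: flag observations above 4 / n (not a formal test)
threshold <- 4 / n
flagged   <- which(cooks > threshold)

cat("Largest Cook's distance:", round(max(cooks), 4), "\n")
cat("Observations above 4/n: ", length(flagged), "\n")
```

In base R, `plot(fit, which = 4)` draws these distances as a bar chart, while `plot(fit, which = 5)` draws the Residuals vs Leverage plot with Cook's distance contours. The 4/n cut-off is much stricter than the 0.5 and 1 contours on that plot, so flagged points are candidates for a closer look rather than automatic outliers.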

Testing assumptions for model 2

plot(model2, which = 1)

The Residuals vs Fitted plot shows that the residuals lie close to the horizontal reference line and are distributed roughly symmetrically around it, with most points clustered near the center of the plot. These observations indicate that the assumptions of linearity and homoscedasticity (equal variance of the residuals across fitted values) are reasonably valid, with no significant violation.

plot(model2, which = 2)

The Normal Q-Q plot reveals that the points in the center of the graph approximately follow the straight line, which is consistent with a normal distribution. Towards the edges of the plot, the points curve away from the line, which suggests that the residuals contain more extreme values than would be expected under a true normal distribution. Because the departure is confined to the tails and the sample is large (299 observations), the normality assumption is treated as not seriously violated.

plot(model2, which = 3)

The Scale-Location plot displays the points scattered randomly around the reference line, without any discernible patterns or systematic trends. This suggests that the variance of the residuals remains roughly constant across the fitted values. Therefore, the assumption of homoscedasticity is not violated.

plot(model2, which = 5)

The Residuals vs Leverage plot shows that all cases lie well within the Cook’s distance lines, indicating that no individual data point has a disproportionate influence on the regression model. Based on this observation, we can conclude that the assumption of no influential cases is not violated in this analysis.

Discussion

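The investigation analyzed the relationship between age and two variables, ejection fraction and serum sodium, in patients with heart failure. The major findings are as follows: the scatter plots suggested a slight positive trend between age and ejection fraction and a slight negative trend between age and serum sodium, but neither regression slope was statistically significant (p = 0.300 and p = 0.428, respectively), and age explained less than 1% of the variance in either variable.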

Strengths

The dataset of 299 observations had no missing values, and the assumptions of each regression model were checked with diagnostic plots.

Limitation

The analysis only considers three variables: age, ejection fraction, and serum sodium. This limited scope may prevent a comprehensive understanding of the underlying relationships.

Future Directions

To gain a more accurate understanding, future investigations should gather and analyze additional data and variables to provide more insights and considerations.

References