Survival Analysis of Heart Failure Patients

This project studies the survival of patients with heart failure. It builds on earlier research into cardiovascular disease, a condition that has become very common in medical practice. For more background on the study, see the link below.

https://bmcmedinformdecismak.biomedcentral.com/articles/10.1186/s12911-020-1023-5

For this project, we used different models from the lifelines Python library to predict the survival of patients with heart failure. The data comprises 13 features and 299 observations.

Features: age, anaemia, creatinine_phosphokinase, diabetes, ejection_fraction, high_blood_pressure, platelets, serum_creatinine, serum_sodium, sex, smoking, time and death_event

Next, we import the basic libraries and load the data to run the survival analysis.

Data Cleaning

Data description and basic statistics

The table below shows basic statistics for each variable: the mean, standard deviation, quartiles, and minimum and maximum values. It also gives an indication of how the data is distributed.
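As a rough illustration of what such a summary table contains, the sketch below computes the same statistics for a single column using only the Python standard library. The `ages` list and the `describe` helper are hypothetical and exist only for this example; they are not taken from the dataset or the project code.

```python
import statistics

# Hypothetical ages for a handful of patients (illustrative values only,
# not rows from the heart failure dataset).
ages = [45, 50, 55, 60, 65, 70, 75]

def describe(values):
    """Return count, mean, sample std, min, quartiles and max of a list."""
    # method="inclusive" interpolates linearly between observed values.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": max(values),
    }

summary = describe(ages)
for name, value in summary.items():
    print(f"{name:>5}: {value}")
```

A large gap between the mean and the median, or a maximum far above the 75% quartile, is a quick hint that a variable is skewed.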

Correlation

Because this study looks at several features that may influence the survival (or death) of patients with heart failure, we computed correlations to get an overview of how the features are associated and to see whether any other relationships exist between them.

The correlation plot helps us identify other relationships between the features. Time (the follow-up period) shows the strongest relationship with death_event. This is understandable, since a patient's outcome is tied to how long they are followed up. There also seems to be a correlation of about 0.4 between sex and smoking. These results can be used in future studies with other multivariate techniques.
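To make the correlation figures concrete, here is a minimal sketch of the Pearson coefficient between a follow-up time and a binary event flag. The `pearson` helper and the toy values are assumptions made for illustration; the sign and size of the result say nothing about the real dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical follow-up times (days) and event flags (1 = death).
time = [10, 30, 60, 120, 200, 250]
death_event = [1, 1, 1, 0, 0, 0]

r = pearson(time, death_event)
print(f"correlation between time and death_event: {r:.2f}")
```

The coefficient lies between -1 and 1; values near either end indicate a strong linear association, and the sign gives its direction.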

Lifelines plots

Histogram showing the frequency of deaths against survival time in days

Cumulative failure plots describing the death rate

From the plot, there is a slight difference between the Kaplan-Meier curve and the cumulative failure curve. Both curves indicate a death rate that increases over time, but the increase stops before the 250th day of follow-up, after which there appear to be no more deaths from heart failure.
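For readers who want to see what sits behind such curves, the following is a small hand-written Kaplan-Meier estimator. It is a sketch, not the lifelines implementation, and it assumes the cumulative failure curve is taken as 1 - S(t). The durations and event flags are made up.

```python
def kaplan_meier(durations, events):
    """Return [(time, S(time))] at each distinct time with at least one death.

    events[i] is 1 if the death was observed and 0 if the patient was censored.
    """
    pairs = sorted(zip(durations, events))
    at_risk = len(pairs)
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who leave the study at the same time t.
        while i < len(pairs) and pairs[i][0] == t:
            deaths += pairs[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

# Hypothetical follow-up times (days) and event flags.
durations = [5, 8, 8, 12, 20, 25]
events = [1, 1, 0, 1, 0, 1]

km = kaplan_meier(durations, events)
cumulative_failure = [(t, 1 - s) for t, s in km]
```

Censored patients never cause a drop in the curve, but they do shrink the number at risk, which is why later deaths produce larger steps.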

Next, we worked on developing a model to predict the survival rate of patients with heart failure. For this we tried different distribution models to check which one gives a better fit. The two distributions we considered were the Weibull distribution and the exponential distribution.

We also compared the model estimates using the maximized log-likelihood and the AIC of both models. The model with the smaller AIC and the higher log-likelihood is the better fit.
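The comparison can be sketched by hand. AIC is defined as 2k - 2 log L, where k is the number of fitted parameters and log L is the maximized log-likelihood. The code below is an illustration under stated assumptions: the data are made up, the Weibull shape is found with a crude grid search rather than a proper optimizer, and none of it is the lifelines fitting routine.

```python
import math

def weibull_loglik(rho, durations, events):
    """Censoring-aware Weibull log-likelihood at shape rho.

    The scale is set to its maximum-likelihood value for the given rho,
    using lambda ** rho = sum(t ** rho) / number_of_deaths.
    """
    deaths = sum(events)
    scale_pow = sum(t ** rho for t in durations) / deaths  # lambda ** rho
    loglik = 0.0
    for t, e in zip(durations, events):
        if e:  # observed death contributes the log hazard
            loglik += math.log(rho) - math.log(scale_pow) + (rho - 1) * math.log(t)
        loglik -= t ** rho / scale_pow  # every patient contributes log S(t)
    return loglik

def aic(loglik, n_params):
    """Akaike information criterion: smaller is better."""
    return 2 * n_params - 2 * loglik

# Hypothetical follow-up times (days) and event flags.
durations = [5, 8, 8, 12, 20, 25]
events = [1, 1, 0, 1, 0, 1]

# The exponential model is the Weibull model with shape rho = 1.
ll_exponential = weibull_loglik(1.0, durations, events)

# Crude grid search over the Weibull shape (0.50, 0.55, ..., 3.00).
shapes = [k / 20 for k in range(10, 61)]
best_rho = max(shapes, key=lambda r: weibull_loglik(r, durations, events))
ll_weibull = weibull_loglik(best_rho, durations, events)

aic_exponential = aic(ll_exponential, n_params=1)
aic_weibull = aic(ll_weibull, n_params=2)
```

Because the exponential model is a special case of the Weibull, the Weibull log-likelihood can never be lower; the AIC penalty of 2 per parameter is what decides whether the extra shape parameter is worth keeping.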

We also plotted the Kaplan-Meier survival curve against the survival curves of both models to compare their shapes.

Finally, we compared the probability plots of the data to check how closely the observations cluster along the fitted line. The more closely the observations align with the line, the better the model fits.
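The idea behind a Weibull probability plot can be shown with a short sketch. If S(t) = exp(-(t / lambda) ** rho), then ln(-ln S(t)) = rho * ln(t) - rho * ln(lambda), so the transformed points fall on a straight line with slope rho. The parameter values below are hypothetical, and the plotting conventions of any particular library may differ.

```python
import math

def weibull_survival(t, lam, rho):
    """Weibull survival function S(t) = exp(-(t / lam) ** rho)."""
    return math.exp(-((t / lam) ** rho))

def probability_plot_point(t, survival):
    """Map (t, S(t)) to the axes on which a Weibull model is a straight line."""
    return math.log(t), math.log(-math.log(survival))

# Hypothetical scale and shape, chosen only for illustration.
lam, rho = 100.0, 1.5

points = [
    probability_plot_point(t, weibull_survival(t, lam, rho))
    for t in (10, 50, 100, 200)
]

# Slopes between consecutive points; for exact Weibull data they all equal rho.
slopes = [
    (y2 - y1) / (x2 - x1)
    for (x1, y1), (x2, y2) in zip(points, points[1:])
]
```

With real data, S(t) is replaced by an empirical estimate, and the scatter of the points around the line shows how well the distribution fits.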

After running the initial analysis, we observed that the Weibull distribution has a higher log-likelihood and a smaller AIC than the exponential model. It is therefore the better fit. Below is a visual comparison of the distribution models with the Kaplan-Meier fit.

The curve indicates a steady decline in the patients' survival rate after 250 days. This means most of the deaths from heart failure occurred in the later days of treatment. This might be due to the underlying ailment that led to the heart failure, or perhaps the patients were unable to get adequate treatment.

Finally, the survival plot of the two models also reveals which model follows the Kaplan-Meier curve. The green curve, which represents the exponential survival function, deviates slightly from the blue curve, which represents the Kaplan-Meier survival function, while the orange curve for the Weibull model stays in line with it.

To explore further which model is a better fit, we can use the probability plots of the two models to see which one has more observations clustered around the fitted line.

A careful comparison of the two plots shows that Weibull is the better fit, as most of the data points (observations) cluster along the fitted line. Next, we use the Weibull model to examine the features that affect heart failure in the patients. This will help us determine which features are most strongly associated with patient survival.

Among the different features, we observed the alpha of each one. serum_creatinine, with an alpha of 2.063559, and ejection_fraction, with an alpha of 37.613387, are very low compared with the others. Low alpha values indicate features associated with risk, because alpha describes the characteristic life of the model, that is, how long it lasts before it fails. Hence we want high alpha values. Age is another likely indicator: it has the third-lowest alpha, 70.651805. Judging from the curve, patients aged 70 and above are likely to have a low survival probability. Other studies reached the same conclusion, that older patients are more likely to have heart failure.
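The phrase "characteristic life" has a precise meaning if alpha is read as the Weibull scale parameter, which is an assumption here since the fitting code is not shown. Under that reading, the survival probability at t = alpha is always exp(-1), about 36.8%, whatever the shape parameter. The sketch below checks this with hypothetical values.

```python
import math

def weibull_survival(t, alpha, beta):
    """Weibull survival with scale alpha (characteristic life) and shape beta."""
    return math.exp(-((t / alpha) ** beta))

# Hypothetical characteristic life; the shape values are also illustrative.
alpha = 100.0
survival_at_alpha = [weibull_survival(alpha, alpha, beta) for beta in (0.5, 1.0, 2.0)]

for value in survival_at_alpha:
    print(f"S(alpha) = {value:.3f}")
```

This is why a small alpha signals risk: it is the point on the time axis by which roughly 63% of failures are expected to have occurred.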

We can use CoxPHFitter from lifelines to check model accuracy and fit. It also helps rank the features by how strongly they indicate risk when predicting the survival probability of the patients.
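To give a sense of what a Cox proportional hazards model optimizes, the sketch below writes out the partial log-likelihood for a single covariate and maximizes it with a coarse grid search. This is a hand-written illustration, not the CoxPHFitter implementation, which handles many covariates and uses a proper numerical optimizer. The data and the binary `risk_factor` column are hypothetical.

```python
import math

def cox_partial_loglik(beta, durations, events, x):
    """Cox partial log-likelihood for one covariate (Breslow handling of ties)."""
    loglik = 0.0
    for t_i, e_i, x_i in zip(durations, events, x):
        if not e_i:
            continue  # censored patients only appear inside risk sets
        # Risk set: everyone still under observation at time t_i.
        risk = sum(
            math.exp(beta * x_j)
            for t_j, x_j in zip(durations, x)
            if t_j >= t_i
        )
        loglik += beta * x_i - math.log(risk)
    return loglik

# Hypothetical follow-up times (days), event flags and a binary risk factor.
durations = [5, 8, 8, 12, 20, 25]
events = [1, 0, 1, 1, 0, 1]
risk_factor = [1, 1, 0, 1, 0, 0]

# Coarse grid search over the coefficient (-3.0, -2.9, ..., 3.0).
betas = [k / 10 for k in range(-30, 31)]
best_beta = max(
    betas, key=lambda b: cox_partial_loglik(b, durations, events, risk_factor)
)
hazard_ratio = math.exp(best_beta)
```

A hazard ratio above 1 means the factor is associated with a higher risk of death at any given time; a ratio below 1 means a lower risk.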

Based on the ranking, ejection_fraction ranks highest at 5.99, serum_creatinine is second at 3.23, and the other continuous features are below 1.76. Among the categorical features, smoking is 0.52, high_blood_pressure is 0.18 and sex is 0.15; the others are below 0.01.

Serum creatinine measures the level of creatinine in the blood. Creatinine is a waste product that comes from the muscles, and healthy kidneys filter it out of the blood through urine. Levels above the normal range of roughly 0.7 to 1.3 mg/dL suggest the kidneys are not functioning well, which may lead to hypertension and heart failure. Ejection fraction measures the percentage of blood pumped out of the left ventricle with each contraction. High: above 70%; normal: 55% to 70%; low: 40% to 55%; heart failure: below 40%.

Based on the Kaplan-Meier model, more patients with high serum creatinine levels have low survival rates than patients with low ejection fraction levels.

Hence, serum creatinine is more strongly associated with heart failure among the patients in this study.

From the model plots, we used the 50th day and a 50% survival rate as reference points for each feature. Patients with high_blood_pressure appear more at risk from heart failure than those flagged by the other features, because their survival curve falls below 50% by the 50th day of follow-up. The survival rates for male and female patients both drop over the follow-up period, with female patients showing a higher survival rate than male patients at some points. Patients who smoke also have a lower survival rate than non-smoking patients.
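Reading a survival curve at a fixed day, or finding when it first drops to 50%, can be sketched as follows. The two step curves are invented for illustration and are not the curves from the plots; `survival_at` and `median_survival_time` are hypothetical helper names.

```python
def survival_at(curve, day):
    """Survival probability at `day` for a step curve [(time, S(time)), ...].

    The curve must be sorted by time; survival is 1.0 before the first step.
    """
    survival = 1.0
    for t, value in curve:
        if t > day:
            break
        survival = value
    return survival

def median_survival_time(curve):
    """First time at which survival is 50% or lower, or None if never reached."""
    for t, value in curve:
        if value <= 0.5:
            return t
    return None

# Hypothetical Kaplan-Meier step curves for two groups.
group_a = [(10, 0.90), (30, 0.70), (45, 0.48), (90, 0.40)]
group_b = [(20, 0.95), (60, 0.80), (150, 0.60)]

a_day_50 = survival_at(group_a, 50)
b_day_50 = survival_at(group_b, 50)
a_median = median_survival_time(group_a)
b_median = median_survival_time(group_b)
```

A group whose curve never reaches 50% within the follow-up window has no defined median survival time, which is itself a sign of better survival over that period.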

From the model plots, there seems to be no significant difference between diabetes and anaemia when comparing survival rates below 50%. However, at 100 days of follow-up, patients with diabetes have a higher survival rate than patients with anaemia.