2024-09-16

Introduction

Cardiovascular (heart) diseases account for 31% of deaths globally every year. The risk of dying from heart disease can be drastically reduced through behavioral changes such as maintaining a good diet, exercising, and avoiding excessive alcohol and drug use. Early detection of heart disease is especially beneficial for people with underlying risk factors like hypertension or diabetes.

Logistic Regression

Logistic regression is a statistical method that models the relationship between a binary dependent variable and one or more independent variables, predicting the probability that an event occurs. It uses the logistic function to transform a linear combination of the inputs into a value between 0 and 1, which makes it suitable for classification tasks.

Formula: \(P(Y=1) = \frac{1}{1+e^{-(\beta_{0}+\beta_{1}X_{1}+\beta_{2}X_{2}+...+\beta_{n}X_{n})}}\)

Here we find the probability of Y=1 (Y being true). \(\beta_{0}\) is the intercept of the model and \(\beta_{1},\beta_{2}, ..., \beta_{n}\) are the coefficients of each independent variable \(X_{1}, X_{2},..., X_{n}\).
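As a minimal sketch of how the formula turns inputs into a probability, the R snippet below evaluates it for one observation. The coefficients and input values are hypothetical, chosen only for illustration; they are not fitted from the heart dataset.

```r
# Hypothetical coefficients, chosen only for illustration (not fitted values)
beta0 <- -1.5          # intercept
beta  <- c(0.04, 0.6)  # coefficients for X1 and X2
x     <- c(60, 1.2)    # one made-up observation (X1, X2)

# Linear combination beta0 + beta1*X1 + beta2*X2
z <- beta0 + sum(beta * x)

# Logistic function written out by hand, and the equivalent base R helper
p_manual  <- 1 / (1 + exp(-z))
p_builtin <- plogis(z)

print(c(z = z, p = p_manual))
```

Whatever value the linear combination takes, the result always lies strictly between 0 and 1, which is why it can be read as P(Y=1).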

Dataset Overview

The dataset used for the logistic regression is the Heart Failure Prediction Dataset from Kaggle. It contains the following columns with units:

Age (Number), Anemia (0 = No, 1 = Yes), Creatinine_Phosphokinase (mcg/L), Diabetes (0 = No, 1 = Yes), Ejection_Fraction (Percentage), High_Blood_Pressure (0 = No, 1 = Yes), Platelets (kiloplatelets/mL), Serum_Creatinine (mg/dL), Serum_Sodium (mEq/L), Sex (0 = Female, 1 = Male), Smoking (0 = No, 1 = Yes), Time (Follow-Up Period - Days), DEATH_EVENT (0 = No, 1 = Yes) {Result}

Logistic Regression Model From Dataset

To fit a logistic regression model to the heart dataset, we use some key features from it: age, serum_creatinine level, serum_sodium level, ejection_fraction and platelet count.

# Fit a logistic regression (binomial GLM) on the data frame heartdf
heartModel <- glm(DEATH_EVENT ~ age + serum_creatinine + serum_sodium
                  + ejection_fraction + platelets, data = heartdf,
                  family = binomial)
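The following is a sketch of how such a model could be inspected and used for prediction. So that it runs on its own, it fits the same formula to a synthetic stand-in data frame. The names `fake_heartdf` and `fakeModel` and all simulated values are assumptions for illustration and say nothing about the real dataset.

```r
set.seed(1)  # reproducible made-up data
n <- 200

# Synthetic stand-in for heartdf: same column names, simulated values
fake_heartdf <- data.frame(
  age               = round(runif(n, 40, 95)),
  serum_creatinine  = runif(n, 0.5, 4),
  serum_sodium      = round(runif(n, 120, 145)),
  ejection_fraction = round(runif(n, 15, 70)),
  platelets         = runif(n, 50000, 500000)
)

# Simulated outcome, so that the model has something to fit
risk <- plogis(-2 + 0.03 * fake_heartdf$age
               - 0.04 * fake_heartdf$ejection_fraction
               + 0.5 * fake_heartdf$serum_creatinine)
fake_heartdf$DEATH_EVENT <- rbinom(n, 1, risk)

# Same formula and family as the model above, fitted to the synthetic data
fakeModel <- glm(DEATH_EVENT ~ age + serum_creatinine + serum_sodium
                 + ejection_fraction + platelets,
                 data = fake_heartdf, family = binomial)

# Estimated intercept and coefficients (beta_0 ... beta_5)
coefs <- coef(fakeModel)
print(coefs)

# Predicted probabilities of DEATH_EVENT = 1 for the first five rows
pred <- predict(fakeModel, newdata = fake_heartdf[1:5, ], type = "response")
print(pred)
```

With the real data, the same `coef()` and `predict(..., type = "response")` calls would apply to `heartModel`; `type = "response"` returns probabilities rather than log odds.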

Age Trends

Finding Correlations

The dataset is useful for finding correlations between parameters such as serum_creatinine (the amount of creatinine in the blood) and serum_sodium (the amount of sodium in the blood), and for seeing how their levels relate to fatal heart failure.
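One way to look at such relationships is a correlation matrix. The sketch below uses a handful of made-up rows (the data frame `toy` is hypothetical and not taken from the Kaggle data), so the numbers it prints only illustrate the mechanics.

```r
# Made-up rows for illustration only; not taken from the Kaggle dataset
toy <- data.frame(
  serum_creatinine = c(0.9, 1.1, 1.9, 2.7, 1.0, 3.5, 0.8, 1.3),
  serum_sodium     = c(140, 138, 130, 129, 137, 127, 141, 136),
  DEATH_EVENT      = c(0, 0, 1, 1, 0, 1, 0, 0)
)

# Pearson correlations between every pair of columns; for a 0/1 column
# this is the point-biserial correlation
cor_matrix <- cor(toy)
print(round(cor_matrix, 2))
```

On the real data, the same call would be applied to the corresponding columns of `heartdf`.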

Diabetes and Heart Failure

Diabetes is a disease that may eventually lead to heart failure. Here we plot the relationship between the presence of diabetes and death from heart failure.
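As a sketch of the numbers behind such a plot, the snippet below cross-tabulates diabetes against the outcome and computes the share of deaths within each group. The ten rows in `toy` are made up for illustration and do not reflect the real dataset.

```r
# Made-up rows for illustration only; not taken from the Kaggle dataset
toy <- data.frame(
  diabetes    = c(0, 0, 0, 0, 1, 1, 1, 1, 0, 1),
  DEATH_EVENT = c(0, 0, 1, 0, 1, 0, 1, 0, 0, 0)
)

# Counts of each (diabetes, death) combination
counts <- table(diabetes = toy$diabetes, death = toy$DEATH_EVENT)

# Row-wise proportions: share of survivors and deaths within each diabetes group
death_share <- prop.table(counts, margin = 1)
print(death_share)
```

A table like `death_share` is the kind of summary that a grouped bar chart of diabetes against DEATH_EVENT would display.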

Odds and Log Odds

Odds: In logistic regression, the odds of an event (in this case, death by heart failure) are the ratio of the probability that the event occurs to the probability that it does not occur.

\(\text{Odds} = \frac{P}{1-P}\)

Log Odds: The logarithm of the odds, which equals the linear combination of the independent variables with their coefficients, plus the intercept.

\(\log\left(\frac{P}{1-P}\right) = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + ... + \beta_{n}X_{n}\)
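A short numeric sketch may make the link between probability, odds and log odds concrete. The probability `p` and the coefficient `beta1` below are made-up values, not estimates from the heart model.

```r
p <- 0.8                  # made-up probability of the event

odds     <- p / (1 - p)   # odds of the event
log_odds <- log(odds)     # log odds (logit); same as qlogis(p)

# The logistic function maps the log odds back to the probability
p_back <- plogis(log_odds)

# Hypothetical coefficient: raising X1 by one unit adds beta1 to the log odds,
# which multiplies the odds by exp(beta1)
beta1      <- 0.5
new_odds   <- exp(log_odds + beta1)
odds_ratio <- new_odds / odds

print(c(odds = odds, log_odds = log_odds, odds_ratio = odds_ratio))
```

This is why exponentiated logistic regression coefficients are commonly read as odds ratios.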

Thank you