Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year and accounting for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age (World Health Organization, 2021).
This project uses the Heart Failure Prediction Dataset from Kaggle. It contains 11 features that can be used to predict heart disease, with a binary target (1: heart disease, 0: normal).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import ConfusionMatrixDisplay
from matplotlib.colors import ListedColormap
df = pd.read_csv('heart.csv')
X=df[["Age","Sex","ChestPainType","RestingBP","Cholesterol","FastingBS","RestingECG","MaxHR","ExerciseAngina","Oldpeak","ST_Slope"]].copy()  # copy so the encoding below does not raise SettingWithCopyWarning
y=df['HeartDisease']
X.head()
Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
---|---|---|---|---|---|---|---|---|---|---|
40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up |
49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat |
37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up |
48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat |
54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up |
X.isnull().sum()
Age 0
Sex 0
ChestPainType 0
RestingBP 0
Cholesterol 0
FastingBS 0
RestingECG 0
MaxHR 0
ExerciseAngina 0
Oldpeak 0
ST_Slope 0
dtype: int64
There are no missing values in this dataset.
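It is also worth checking how balanced the two target classes are, since a strong imbalance would make plain accuracy harder to interpret; a quick check:
# Share of patients with (1) and without (0) heart disease
print(y.value_counts(normalize=True))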
X.loc[X['Sex'] == 'F', 'Sex'] = 0
X.loc[X['Sex'] == 'M', 'Sex'] = 1
X.loc[X['ChestPainType'] == 'TA', 'ChestPainType'] = 0
X.loc[X['ChestPainType'] == 'ATA', 'ChestPainType'] = 1
X.loc[X['ChestPainType'] == 'NAP', 'ChestPainType'] = 2
X.loc[X['ChestPainType'] == 'ASY', 'ChestPainType'] = 3
X.loc[X['RestingECG'] == 'Normal', 'RestingECG'] = 0
X.loc[X['RestingECG'] == 'ST', 'RestingECG'] = 1
X.loc[X['RestingECG'] == 'LVH', 'RestingECG'] = 2
X.loc[X['ExerciseAngina'] == 'N', 'ExerciseAngina'] = 0
X.loc[X['ExerciseAngina'] == 'Y', 'ExerciseAngina'] = 1
X.loc[X['ST_Slope'] == 'Up', 'ST_Slope'] = 0
X.loc[X['ST_Slope'] == 'Flat', 'ST_Slope'] = 1
X.loc[X['ST_Slope'] == 'Down', 'ST_Slope'] = 2
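For reference, the same encoding can be written more compactly with a nested mapping dictionary and DataFrame.replace; a minimal sketch that reproduces the scheme above (X_alt is just an illustrative name):
# Category-to-integer scheme identical to the .loc assignments above
encoding = {
    "Sex": {"F": 0, "M": 1},
    "ChestPainType": {"TA": 0, "ATA": 1, "NAP": 2, "ASY": 3},
    "RestingECG": {"Normal": 0, "ST": 1, "LVH": 2},
    "ExerciseAngina": {"N": 0, "Y": 1},
    "ST_Slope": {"Up": 0, "Flat": 1, "Down": 2},
}
X_alt = df.drop(columns="HeartDisease").replace(encoding)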
X.head()
Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
---|---|---|---|---|---|---|---|---|---|---|
40 | 1 | 1 | 140 | 289 | 0 | 0 | 172 | 0 | 0.0 | 0 |
49 | 0 | 2 | 160 | 180 | 0 | 0 | 156 | 0 | 1.0 | 1 |
37 | 1 | 1 | 130 | 283 | 0 | 1 | 98 | 0 | 0.0 | 0 |
48 | 0 | 3 | 138 | 214 | 0 | 0 | 108 | 1 | 1.5 | 1 |
54 | 1 | 2 | 150 | 195 | 0 | 0 | 122 | 0 | 0.0 | 0 |
Histograms of Features:
To better understand the distribution of each variable, we will plot histograms for all features in the dataset, allowing us to identify patterns, skewness, and potential outliers.
A = X.astype(float)
A.hist(figsize=(10, 8))
plt.suptitle("Histograms of features", y=1.02)
plt.tight_layout()
plt.show()
The histogram analysis provided insights into feature distributions, highlighting skewness, categorical imbalances, and potential outliers, but it does not show how these features interact in a higher-dimensional space.
Applying Principal Component Analysis (PCA):
To uncover patterns and relationships, we apply Principal Component Analysis (PCA), which reduces dimensionality while preserving key information, allowing us to visualize clusters and assess separability—particularly in relation to heart disease.
scaler = StandardScaler()
X_scaled=scaler.fit_transform(X)
pca = decomposition.PCA(n_components=2)
X_pca=pca.fit_transform(X_scaled)
print(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Heart Disease Data')
plt.show()
[[-2.69481538 0.12283566]
[-0.89748099 1.01278169]
[-1.65699679 -0.23249668]
...
[ 1.60979098 -0.68730071]
[-1.73366459 1.81553359]
[-2.26015928 -0.68564168]]
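Before interpreting the scatter plot, it is worth checking how much of the total variance the first two components actually capture; a short sketch using the fitted pca object:
# Fraction of total variance explained by each principal component
print(pca.explained_variance_ratio_)
print("Total variance explained by 2 components:", pca.explained_variance_ratio_.sum())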
From the PCA scatter plot, we observe a degree of separation between individuals with and without heart disease. This suggests that classification models may perform well in predicting heart disease based on our features.
However, PCA also highlights that some features contribute more to the variance than others. To optimize our model, we should remove highly correlated features, as they provide redundant information and can lead to overfitting.
To do this, we will generate a correlation heatmap to identify feature pairs with high correlation and selectively drop one of them:
Heatmap of Feature Correlations:
correlation_matrix = X.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
Using a threshold of 0.7, no feature pair exhibits high correlation, so all features are retained for modeling. With a clearer understanding of the data from PCA and correlation analysis, we can now move on to building predictive models, focusing on selecting the classification algorithm that best distinguishes between individuals with and without heart disease based on their health metrics.
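For completeness, the same 0.7 threshold check can be automated; a small sketch that would list one feature from any pair whose absolute correlation exceeds the threshold (for this dataset the list comes back empty):
# Flag features that are highly correlated (|r| > 0.7) with an earlier feature
corr = X.astype(float).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print("Features to drop:", to_drop)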
Since our goal is to predict a binary categorical outcome (0 or 1), we will use classification models. In this project, four models are compared: Logistic Regression, a Decision Tree, a Random Forest, and a Multi-Layer Perceptron (Neural Network) classifier.
To ensure optimal model performance, we need to tune key hyperparameters:
Max depth for the Decision Tree and Random Forest, to balance model complexity against overfitting, and the hidden-layer architecture and learning rate for the Neural Network.
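The searches below are written as explicit loops for transparency; the same tuning could also be done with scikit-learn's GridSearchCV, sketched here for the Random Forest as an illustration:
from sklearn.model_selection import GridSearchCV

# Hypothetical grid search over max_depth with the same 10-fold cross-validation
param_grid = {"max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, None]}
grid = GridSearchCV(RandomForestClassifier(criterion="gini"), param_grid, cv=10)
grid.fit(X_scaled, y)
print("Best max_depth:", grid.best_params_, "CV accuracy:", grid.best_score_)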
1. Logistic Regression:
clfReg = LogisticRegression()
scores = cross_val_score(clfReg, X_scaled, y, cv=10)
print("The Accuracy for the Logistic Regression Model is :", np.mean(scores))
The Accuracy for the Logistic Regression Model is : 0.8288580984233158
2. Decision tree:
score = []
n_k= [1,2,3,4,5,6,7,8,9,10,11,None]
for k in n_k:
    clf = DecisionTreeClassifier(criterion="gini", max_depth=k)
    scores = cross_val_score(clf, X_scaled, y, cv=10)
    score.append(np.mean(scores))
print(score)
plt.plot(n_k, score, color='b')
plt.title('Decision Tree Classifier Accuracy vs. Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.show()
[0.8134615384615385, 0.7775919732441473, 0.8047778308647875, 0.8156832298136646, 0.8222646918299092, 0.8037386526516961, 0.8103081700907788, 0.8059722885809842, 0.8070234113712373, 0.7994147157190635, 0.8015527950310559, 0.7928571428571429]
We can see that the optimal max depth for the Decision Tree is 5. Let's check the accuracy at this point:
print("The accuracy when max depth is 5 for Decision Tree classifier is:",score[4])
The accuracy when max depth is 5 for Decision Tree classifier is: 0.845102723363593
3. Random Forest:
score = []
n_k= [1,2,3,4,5,6,7,8,9,10,11,None]
for k in n_k:
    clfRF = RandomForestClassifier(criterion="gini", max_depth=k)
    scores = cross_val_score(clfRF, X_scaled, y, cv=10)
    score.append(np.mean(scores))
print(score)
plt.plot(n_k, score, color='b')
plt.title('Random Forest Classifier Accuracy vs. Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.show()
[0.8342451027233636, 0.8440516005733396, 0.8461896798853321, 0.8385809842331581, 0.8505375059722885, 0.8472766364070712, 0.8538580984233158, 0.8527711419015767, 0.8429646440516005, 0.8495102723363592, 0.8495461060678453, 0.8539058767319636]
We can see that the optimal max depth for the Random Forest model is 7. Let's check the accuracy at this point:
print("The accuracy when max depth is 7 for Random Forest Classifier is :", score[6])
The accuracy when max depth is 7 for Random Forest Classifier is : 0.8571189679885333
4. Neural Network Classifier:
clf = MLPClassifier(
hidden_layer_sizes=(10,20),
max_iter=300,
alpha=1e-4,
solver="sgd",
verbose=False,
random_state=1,
learning_rate_init=0.5,
)
clf1 = MLPClassifier(
hidden_layer_sizes=(20,),
max_iter=300,
alpha=1e-4,
solver="sgd",
verbose=False,
random_state=1,
learning_rate_init=0.1,
)
clf2 = MLPClassifier(
hidden_layer_sizes=(40,),
max_iter=300,
alpha=1e-4,
solver="sgd",
verbose=False,
random_state=1,
learning_rate_init=0.5,
)
scores = cross_val_score(clf, X_scaled, y, cv=10)
scores1 = cross_val_score(clf1, X_scaled, y, cv=10)
scores2 = cross_val_score(clf2, X_scaled, y, cv=10)
print("The accuracy for the Neural Network with hidden layers (10, 20) is:", np.mean(scores))
print("The accuracy for the Neural Network with hidden layer (20,) is:", np.mean(scores1))
print("The accuracy for the Neural Network with hidden layer (40,) is:", np.mean(scores2))
The accuracy for the Neural Network with hidden layers (10, 20) is: 0.8365145723841376
The accuracy for the Neural Network with hidden layer (20,) is: 0.8190993788819876
The accuracy for the Neural Network with hidden layer (40,) is: 0.8070234113712373
Model | Parameters | Accuracy |
---|---|---|
Logistic Regression | - | 0.8289 |
Decision Tree Classifier | max_depth = 5 | 0.8223 |
Random Forest Classifier | max_depth = 7 | 0.8549 |
Neural Network (1) | hidden layers (10, 20) | 0.8365 |
Neural Network (2) | hidden layer (20,) | 0.8191 |
Neural Network (3) | hidden layer (40,) | 0.8070 |
We can see that the model with the highest accuracy is the Random Forest Classifier with a max depth of 7.
clfRF = RandomForestClassifier(criterion="gini", max_depth=7)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
clfRF.fit(X_train,y_train)
ypred3=clfRF.predict(X_test)
pca1 = decomposition.PCA(n_components=2)
X_train_pca=pca1.fit_transform(X_train)
X_test_pca=pca1.transform(X_test)  # transform the test set with the PCA fitted on the training data
plt.scatter(X_test_pca[:, 0], X_test_pca[:, 1], c=ypred3, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Random Forest Predictions in PCA Space')
plt.show()
The PCA plot shows a clear separation between the two predicted classes, with distinct clusters of yellow and dark purple points. While the Random Forest model effectively distinguishes between cases, some overlap suggests areas of misclassification, likely due to feature similarities or noise in the dataset.
Next, we plot the normalized confusion matrix for the Random Forest model on the held-out test set:
disp = ConfusionMatrixDisplay.from_estimator(
clfRF,
X_test,
y_test,
# display_labels=cnames,
cmap=plt.cm.Blues,
normalize="true",
)
disp.ax_.set_title("Normalized confusion matrix")
print("Normalized confusion matrix")
print(disp.confusion_matrix)
plt.show()
Normalized confusion matrix
[[0.8372093 0.1627907 ]
[0.07142857 0.92857143]]
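For a per-class view beyond the normalized confusion matrix, precision, recall, and F1 can be reported on the same test split; a short sketch (class 0 = Normal, class 1 = heart disease, as defined in the dataset):
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the fitted Random Forest on the held-out test set
print(classification_report(y_test, ypred3, target_names=["Normal", "Heart disease"]))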
cmap_light = ListedColormap(["orange", "cyan"])
cmap_bold = ["darkorange", "c"]
clf3_pca=RandomForestClassifier(criterion="gini", max_depth=5)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2)  # separate names so the original train/test split is not overwritten
clf3_pca.fit(X_train_pca,y_train_pca)
_, ax = plt.subplots()
DecisionBoundaryDisplay.from_estimator(
clf3_pca,
X_train_pca,
cmap=cmap_light,
ax=ax,
response_method="predict",
plot_method="pcolormesh",
shading="auto",
)
# Plot also the training points
sns.scatterplot(
x=X_train_pca[:, 0],
y=X_train_pca[:, 1],
hue=y_train_pca,
palette=cmap_bold,
alpha=1.0,
edgecolor="black",
)
plt.title("Boundary plot")
plt.show()
Confusion Matrix: The model achieves a high true positive rate (0.93) and a strong true negative rate (0.84), indicating good performance in correctly classifying both classes. However, a 16% false positive rate suggests some normal cases are misclassified as heart disease.
Decision Boundary Plot: The boundary plot shows a clear separation between heart disease (cyan) and non-disease (orange) cases. While the model captures the decision boundary well, some misclassified points near the boundary indicate overlapping feature distributions.
Having evaluated the model's predictive performance, we now analyze the feature importances to understand which variables contribute the most to heart disease prediction.
print(clfRF.feature_importances_)
[0.0587954 0.03478422 0.11170242 0.05540695 0.10893785 0.02001005
0.02047004 0.10539044 0.11123337 0.10167428 0.27159498]
The feature importance scores indicate that “ST_Slope” (slope of the ST segment) and “ChestPainType” are the most influential factors in predicting heart disease. This aligns with medical knowledge, as changes in the ST segment and different types of chest pain are strong indicators of cardiovascular issues. Other moderately important features include “MaxHR” (maximum heart rate) and “Cholesterol”, suggesting that heart rate dynamics and cholesterol levels also play significant roles.
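To make these scores easier to read, they can be paired with the column names and plotted in sorted order; a small sketch:
# Pair importances with feature names and plot them sorted
importances = pd.Series(clfRF.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", figsize=(8, 6))
plt.title("Random Forest Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()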
To evaluate our model, let’s randomly select four rows from our dataset and predict their values.
import random

# Valid row positions are 0 to len(X) - 1
random_value1 = random.randint(0, len(X) - 1)
random_value2 = random.randint(0, len(X) - 1)
random_value3 = random.randint(0, len(X) - 1)
random_value4 = random.randint(0, len(X) - 1)
A=X.iloc[random_value1].values
B=X.iloc[random_value2].values
C=X.iloc[random_value3].values
D=X.iloc[random_value4].values
A=scaler.transform([A])
B=scaler.transform([B])
C=scaler.transform([C])
D=scaler.transform([D])
A_pred=clfRF.predict(A)
B_pred=clfRF.predict(B)
C_pred=clfRF.predict(C)
D_pred=clfRF.predict(D)
print("True Predicted")
print(y[random_value1]," ",A_pred)
print(y[random_value2]," ",B_pred)
print(y[random_value3]," ",C_pred)
print(y[random_value4]," ",D_pred)
True Predicted
0 [0]
1 [1]
1 [1]
0 [0]
From the output, all the randomly selected values were predicted correctly! 🎯
Since majority voting works best with an odd number of models (it avoids ties), we select the three best-performing models by cross-validated accuracy, namely Logistic Regression, the Random Forest, and the (10, 20) Neural Network:
clfReg.fit(X_train, y_train)
clfRF.fit(X_train, y_train)
clf.fit(X_train, y_train)
predReg = clfReg.predict(X_test)
predRF = clfRF.predict(X_test)
predNN = clf.predict(X_test)
prediction_ensemble = [predReg, predRF, predNN]
predictions = []
for i in range(len(X_test)):  # iterate over the test dataset
    votes = [prediction[i] for prediction in prediction_ensemble]
    vote_result = max(set(votes), key=votes.count)  # majority vote across the three models
    predictions.append(vote_result)
ensemble_accuracy = accuracy_score(y_test, predictions) # Use y_test for accuracy calculation
print("Accuracy of Ensemble: ", ensemble_accuracy)
Accuracy of Ensemble: 0.8967391304347826
By using an ensemble of the top three models, we improved our accuracy from 0.85 to 0.89 🎉. This demonstrates that ensemble methods can enhance predictive performance by leveraging the strengths of multiple models.
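For reference, scikit-learn's VotingClassifier wraps the same hard-voting idea in a single estimator; a minimal sketch with the same three model types (exact numbers may differ slightly from the manual vote above because of the random train/test split and weight initialization):
from sklearn.ensemble import VotingClassifier

# Hard-voting ensemble of the three best-performing model types
voting = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression()),
        ("rf", RandomForestClassifier(criterion="gini", max_depth=7)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(10, 20), max_iter=300,
                              solver="sgd", learning_rate_init=0.5, random_state=1)),
    ],
    voting="hard",
)
voting.fit(X_train, y_train)
print("VotingClassifier accuracy:", voting.score(X_test, y_test))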
This project demonstrated the effectiveness of machine learning in predicting heart disease using clinical data. After preprocessing, exploratory analysis, and model evaluation, the Random Forest Classifier achieved the highest individual accuracy of 85.5%, with the ensemble model boosting performance to 89.7%. Key features such as ST_Slope and ChestPainType were identified as the most influential in predicting heart disease. These results highlight the potential of data-driven approaches in supporting early diagnosis and improving cardiovascular health outcomes.