Cardiovascular diseases (CVDs) are the leading cause of death globally, taking an estimated 17.9 million lives each year and accounting for 31% of all deaths worldwide. Four out of five CVD deaths are due to heart attacks and strokes, and one third of these deaths occur prematurely in people under 70 years of age (World Health Organization, 2021).
This project uses the Heart Failure Prediction Dataset from Kaggle. It contains 11 features that can be used to predict heart disease, with a binary target (1: heart disease, 0: normal).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn import decomposition
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.metrics import ConfusionMatrixDisplay
from matplotlib.colors import ListedColormap
df = pd.read_csv('heart.csv')
X=df[["Age","Sex","ChestPainType","RestingBP","Cholesterol","FastingBS","RestingECG","MaxHR","ExerciseAngina","Oldpeak","ST_Slope"]].copy()  # copy so the encoding below does not raise SettingWithCopyWarning
y=df['HeartDisease']
X.head()
Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
---|---|---|---|---|---|---|---|---|---|---|
40 | M | ATA | 140 | 289 | 0 | Normal | 172 | N | 0.0 | Up |
49 | F | NAP | 160 | 180 | 0 | Normal | 156 | N | 1.0 | Flat |
37 | M | ATA | 130 | 283 | 0 | ST | 98 | N | 0.0 | Up |
48 | F | ASY | 138 | 214 | 0 | Normal | 108 | Y | 1.5 | Flat |
54 | M | NAP | 150 | 195 | 0 | Normal | 122 | N | 0.0 | Up |
X.isnull().sum()
Age 0
Sex 0
ChestPainType 0
RestingBP 0
Cholesterol 0
FastingBS 0
RestingECG 0
MaxHR 0
ExerciseAngina 0
Oldpeak 0
ST_Slope 0
dtype: int64
There are no missing values in this dataset.
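It is also worth checking how balanced the two target classes are, since a strong imbalance would make plain accuracy harder to interpret; a quick check:
# Share of patients with (1) and without (0) heart disease
print(y.value_counts(normalize=True))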
X.loc[X['Sex'] == 'F', 'Sex'] = 0
X.loc[X['Sex'] == 'M', 'Sex'] = 1
X.loc[X['ChestPainType'] == 'TA', 'ChestPainType'] = 0
X.loc[X['ChestPainType'] == 'ATA', 'ChestPainType'] = 1
X.loc[X['ChestPainType'] == 'NAP', 'ChestPainType'] = 2
X.loc[X['ChestPainType'] == 'ASY', 'ChestPainType'] = 3
X.loc[X['RestingECG'] == 'Normal', 'RestingECG'] = 0
X.loc[X['RestingECG'] == 'ST', 'RestingECG'] = 1
X.loc[X['RestingECG'] == 'LVH', 'RestingECG'] = 2
X.loc[X['ExerciseAngina'] == 'N', 'ExerciseAngina'] = 0
X.loc[X['ExerciseAngina'] == 'Y', 'ExerciseAngina'] = 1
X.loc[X['ST_Slope'] == 'Up', 'ST_Slope'] = 0
X.loc[X['ST_Slope'] == 'Flat', 'ST_Slope'] = 1
X.loc[X['ST_Slope'] == 'Down', 'ST_Slope'] = 2
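For reference, the same encoding can be written more compactly with a nested mapping dictionary and DataFrame.replace; a minimal sketch that reproduces the scheme above (X_alt is just an illustrative name):
# Category-to-integer scheme identical to the .loc assignments above
encoding = {
    "Sex": {"F": 0, "M": 1},
    "ChestPainType": {"TA": 0, "ATA": 1, "NAP": 2, "ASY": 3},
    "RestingECG": {"Normal": 0, "ST": 1, "LVH": 2},
    "ExerciseAngina": {"N": 0, "Y": 1},
    "ST_Slope": {"Up": 0, "Flat": 1, "Down": 2},
}
X_alt = df.drop(columns="HeartDisease").replace(encoding)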
X.head()
Age | Sex | ChestPainType | RestingBP | Cholesterol | FastingBS | RestingECG | MaxHR | ExerciseAngina | Oldpeak | ST_Slope |
---|---|---|---|---|---|---|---|---|---|---|
40 | 1 | 1 | 140 | 289 | 0 | 0 | 172 | 0 | 0.0 | 0 |
49 | 0 | 2 | 160 | 180 | 0 | 0 | 156 | 0 | 1.0 | 1 |
37 | 1 | 1 | 130 | 283 | 0 | 1 | 98 | 0 | 0.0 | 0 |
48 | 0 | 3 | 138 | 214 | 0 | 0 | 108 | 1 | 1.5 | 1 |
54 | 1 | 2 | 150 | 195 | 0 | 0 | 122 | 0 | 0.0 | 0 |
Histograms of Features:
To better understand the distribution of each variable, we will plot histograms for all features in the dataset, allowing us to identify patterns, skewness, and potential outliers.
A = X.astype(float)
A.hist(figsize=(10, 8))
plt.suptitle("Histograms of features", y=1.02)
plt.tight_layout()
plt.show()
The histogram analysis provided insights into feature distributions, highlighting skewness, categorical imbalances, and potential outliers, but it does not show how these features interact in a higher-dimensional space.
Applying Principal Component Analysis (PCA):
To uncover patterns and relationships, we apply Principal Component Analysis (PCA), which reduces dimensionality while preserving key information, allowing us to visualize clusters and assess separability—particularly in relation to heart disease.
scaler = StandardScaler()
X_scaled=scaler.fit_transform(X)
pca = decomposition.PCA(n_components=2)
X_pca=pca.fit_transform(X_scaled)
print(X_pca)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Heart Disease Data')
plt.show()
[[-2.69481538 0.12283566]
[-0.89748099 1.01278169]
[-1.65699679 -0.23249668]
...
[ 1.60979098 -0.68730071]
[-1.73366459 1.81553359]
[-2.26015928 -0.68564168]]
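Before interpreting the scatter plot, it is worth checking how much of the total variance the first two components actually capture; a short sketch using the fitted pca object:
# Fraction of total variance explained by each principal component
print(pca.explained_variance_ratio_)
print("Total variance explained by 2 components:", pca.explained_variance_ratio_.sum())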
From the PCA scatter plot, we observe a degree of separation between individuals with and without heart disease. This suggests that classification models may perform well in predicting heart disease based on our features.
However, PCA also highlights that some features contribute more to the variance than others. To optimize our model, we should remove highly correlated features, as they provide redundant information and can lead to overfitting.
To do this, we will generate a correlation heatmap to identify feature pairs with high correlation and selectively drop one of them:
Heatmap of Feature Correlations:
correlation_matrix = X.corr()
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Correlation Heatmap")
plt.show()
Using a threshold of 0.7, no feature pair exhibits high correlation, so all features are retained for modeling. With a clearer understanding of the data from PCA and correlation analysis, we can now move on to building predictive models, focusing on selecting the classification algorithm that best distinguishes between individuals with and without heart disease based on their health metrics.
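For completeness, the same 0.7 threshold check can be automated; a small sketch that would list one feature from any pair whose absolute correlation exceeds the threshold (for this dataset the list comes back empty):
# Flag features that are highly correlated (|r| > 0.7) with an earlier feature
corr = X.astype(float).corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.7).any()]
print("Features to drop:", to_drop)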
Since our goal is to predict a binary categorical outcome (0 or 1), we will use classification models. In this project, four models are compared: Logistic Regression, a Decision Tree, a Random Forest, and a Multi-Layer Perceptron (Neural Network) classifier.
To ensure optimal model performance, we need to tune key hyperparameters:
Max depth for the Decision Tree and Random Forest, to balance model complexity against overfitting, and the hidden-layer architecture and learning rate for the Neural Network.
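The searches below are written as explicit loops for transparency; the same tuning could also be done with scikit-learn's GridSearchCV, sketched here for the Random Forest as an illustration:
from sklearn.model_selection import GridSearchCV

# Hypothetical grid search over max_depth with the same 10-fold cross-validation
param_grid = {"max_depth": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, None]}
grid = GridSearchCV(RandomForestClassifier(criterion="gini"), param_grid, cv=10)
grid.fit(X_scaled, y)
print("Best max_depth:", grid.best_params_, "CV accuracy:", grid.best_score_)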
1. Logistic Regression:
clfReg = LogisticRegression()
scores = cross_val_score(clfReg, X_scaled, y, cv=10)
print("The Accuracy for the Logistic Regression Model is :", np.mean(scores))
The Accuracy for the Logistic Regression Model is : 0.8288580984233158
2. Decision tree:
score = []
n_k= [1,2,3,4,5,6,7,8,9,10,11,None]
for k in n_k:
    clf = DecisionTreeClassifier(criterion="gini", max_depth=k)
    scores = cross_val_score(clf, X_scaled, y, cv=10)
    score.append(np.mean(scores))
print(score)
plt.plot(n_k, score, color='b')
plt.title('Decision Tree Classifier Accuracy vs. Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.show()
[0.8134615384615385, 0.7775919732441473, 0.8047778308647875, 0.8156832298136646, 0.8222646918299092, 0.8037386526516961, 0.8103081700907788, 0.8059722885809842, 0.8070234113712373, 0.7994147157190635, 0.8015527950310559, 0.7928571428571429]
We can see that the optimal max depth for the Decision Tree is 5. Let's check the accuracy at this point:
print("The accuracy when max depth is 5 for Decision Tree classifier is:",score[4])
The accuracy when max depth is 5 for Decision Tree classifier is: 0.845102723363593
3. Random Forest:
score = []
n_k= [1,2,3,4,5,6,7,8,9,10,11,None]
for k in n_k:
    clfRF = RandomForestClassifier(criterion="gini", max_depth=k)
    scores = cross_val_score(clfRF, X_scaled, y, cv=10)
    score.append(np.mean(scores))
print(score)
plt.plot(n_k, score, color='b')
plt.title('Random Forest Classifier Accuracy vs. Max Depth')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.show()
[0.8342451027233636, 0.8440516005733396, 0.8461896798853321, 0.8385809842331581, 0.8505375059722885, 0.8472766364070712, 0.8538580984233158, 0.8527711419015767, 0.8429646440516005, 0.8495102723363592, 0.8495461060678453, 0.8539058767319636]
We can see that the optimal max depth for the Random Forest model is 7. Let's check the accuracy at this point:
print("The accuracy when max depth is 7 for Random Forest Classifier is :", score[6])
The accuracy when max depth is 7 for Random Forest Classifier is : 0.8571189679885333
4. Neural Network Classifier:
clf = MLPClassifier(
hidden_layer_sizes=(10,20),
max_iter=300,
alpha=1e-4,
solver="sgd",
verbose=False,
random_state=1,
learning_rate_init=0.5,
)
clf1 = MLPClassifier(
hidden_layer_sizes=(20,),
max_iter=300,
alpha=1e-4,
solver="sgd",
verbose=False,
random_state=1,
learning_rate_init=0.1,
)
clf2 = MLPClassifier(
hidden_layer_sizes=(40,),
max_iter=300,
alpha=1e-4,
solver="sgd",
verbose=False,
random_state=1,
learning_rate_init=0.5,
)
scores = cross_val_score(clf, X_scaled, y, cv=10)
scores1 = cross_val_score(clf1, X_scaled, y, cv=10)
scores2 = cross_val_score(clf2, X_scaled, y, cv=10)
print("The accuracy for the Neural Network with hidden layers (10, 20) is:", np.mean(scores))
print("The accuracy for the Neural Network with hidden layer (20,) is:", np.mean(scores1))
print("The accuracy for the Neural Network with hidden layer (40,) is:", np.mean(scores2))
The accuracy for the Neural Network with hidden layers (10, 20) is: 0.8365145723841376
The accuracy for the Neural Network with hidden layer (20,) is: 0.8190993788819876
The accuracy for the Neural Network with hidden layer (40,) is: 0.8070234113712373
Model | Parameters | Accuracy |
---|---|---|
Logistic Regression | - | 0.8289 |
Decision Tree Classifier | max_depth = 5 | 0.8223 |
Random Forest Classifier | max_depth = 7 | 0.8549 |
Neural Network (1) | hidden layers (10, 20) | 0.8365 |
Neural Network (2) | hidden layer (20,) | 0.8191 |
Neural Network (3) | hidden layer (40,) | 0.8070 |
We can see that the model with the highest accuracy is the Random Forest Classifier with a max depth of 7.
clfRF = RandomForestClassifier(criterion="gini", max_depth=7)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2)
clfRF.fit(X_train,y_train)
ypred3=clfRF.predict(X_test)
pca1 = decomposition.PCA(n_components=2)
X_train_pca=pca1.fit_transform(X_train)
X_test_pca=pca1.transform(X_test)  # transform the test set with the PCA fitted on the training data
plt.scatter(X_test_pca[:, 0], X_test_pca[:, 1], c=ypred3, cmap='viridis')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('Random Forest Predictions in PCA Space')
plt.show()
The PCA plot shows a clear separation between the two predicted classes, with distinct clusters of yellow and dark purple points. While the Random Forest model effectively distinguishes between cases, some overlap suggests areas of misclassification, likely due to feature similarities or noise in the dataset.
Next, we plot the normalized confusion matrix for the Random Forest model on the held-out test set:
disp = ConfusionMatrixDisplay.from_estimator(
clfRF,
X_test,
y_test,
# display_labels=cnames,
cmap=plt.cm.Blues,
normalize="true",
)
disp.ax_.set_title("Normalized confusion matrix")
print("Normalized confusion matrix")
print(disp.confusion_matrix)
plt.show()
Normalized confusion matrix
[[0.8372093 0.1627907 ]
[0.07142857 0.92857143]]
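For a per-class view beyond the normalized confusion matrix, precision, recall, and F1 can be reported on the same test split; a short sketch (class 0 = Normal, class 1 = heart disease, as defined in the dataset):
from sklearn.metrics import classification_report

# Per-class precision, recall, and F1 for the fitted Random Forest on the held-out test set
print(classification_report(y_test, ypred3, target_names=["Normal", "Heart disease"]))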
cmap_light = ListedColormap(["orange", "cyan"])
cmap_bold = ["darkorange", "c"]
clf3_pca=RandomForestClassifier(criterion="gini", max_depth=5)
X_train_pca, X_test_pca, y_train_pca, y_test_pca = train_test_split(X_pca, y, test_size=0.2)  # separate names so the original train/test split is not overwritten
clf3_pca.fit(X_train_pca,y_train_pca)
_, ax = plt.subplots()
DecisionBoundaryDisplay.from_estimator(
clf3_pca,
X_train_pca,
cmap=cmap_light,
ax=ax,
response_method="predict",
plot_method="pcolormesh",
shading="auto",
)
# Plot also the training points
sns.scatterplot(
x=X_train_pca[:, 0],
y=X_train_pca[:, 1],
hue=y_train_pca,
palette=cmap_bold,
alpha=1.0,
edgecolor="black",
)
plt.title("Boundary plot")
plt.show()
Confusion Matrix: The model achieves a high true positive rate (0.93) and a strong true negative rate (0.84), indicating good performance in correctly classifying both classes. However, a 16% false positive rate suggests some normal cases are misclassified as heart disease.
Decision Boundary Plot: The boundary plot shows a clear separation between heart disease (cyan) and non-disease (orange) cases. While the model captures the decision boundary well, some misclassified points near the boundary indicate overlapping feature distributions.
Having evaluated the model's predictive performance, we now analyze the feature importances to understand which variables contribute the most to heart disease prediction.
print(clfRF.feature_importances_)
[0.0587954 0.03478422 0.11170242 0.05540695 0.10893785 0.02001005
0.02047004 0.10539044 0.11123337 0.10167428 0.27159498]
The feature importance scores indicate that “ST_Slope” (slope of the ST segment) and “ChestPainType” are the most influential factors in predicting heart disease. This aligns with medical knowledge, as changes in the ST segment and different types of chest pain are strong indicators of cardiovascular issues. Other moderately important features include “MaxHR” (maximum heart rate) and “Cholesterol”, suggesting that heart rate dynamics and cholesterol levels also play significant roles.
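To make these scores easier to read, they can be paired with the column names and plotted in sorted order; a small sketch:
# Pair importances with feature names and plot them sorted
importances = pd.Series(clfRF.feature_importances_, index=X.columns).sort_values()
importances.plot(kind="barh", figsize=(8, 6))
plt.title("Random Forest Feature Importances")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()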
To evaluate our model, let’s randomly select four rows from our dataset and predict their values.
import random

# Valid row positions are 0 to len(X) - 1
random_value1 = random.randint(0, len(X) - 1)
random_value2 = random.randint(0, len(X) - 1)
random_value3 = random.randint(0, len(X) - 1)
random_value4 = random.randint(0, len(X) - 1)
A=X.iloc[random_value1].values
B=X.iloc[random_value2].values
C=X.iloc[random_value3].values
D=X.iloc[random_value4].values
A=scaler.transform([A])
B=scaler.transform([B])
C=scaler.transform([C])
D=scaler.transform([D])
A_pred=clfRF.predict(A)
B_pred=clfRF.predict(B)
C_pred=clfRF.predict(C)
D_pred=clfRF.predict(D)
print("True Predicted")
print(y[random_value1]," ",A_pred)
print(y[random_value2]," ",B_pred)
print(y[random_value3]," ",C_pred)
print(y[random_value4]," ",D_pred)
True Predicted
0 [0]
1 [1]
1 [1]
0 [0]
From the output, all the randomly selected values were predicted correctly! 🎯
Since majority voting works best with an odd number of models (it avoids ties), we select the three best-performing models by cross-validated accuracy, namely Logistic Regression, the Random Forest, and the (10, 20) Neural Network:
clfReg.fit(X_train, y_train)
clfRF.fit(X_train, y_train)
clf.fit(X_train, y_train)
predReg = clfReg.predict(X_test)
predRF = clfRF.predict(X_test)
predNN = clf.predict(X_test)
prediction_ensemble = [predReg, predRF, predNN]
predictions = []
for i in range(len(X_test)):  # iterate over the test dataset
    votes = [prediction[i] for prediction in prediction_ensemble]
    vote_result = max(set(votes), key=votes.count)  # majority vote across the three models
    predictions.append(vote_result)
ensemble_accuracy = accuracy_score(y_test, predictions) # Use y_test for accuracy calculation
print("Accuracy of Ensemble: ", ensemble_accuracy)
Accuracy of Ensemble: 0.8967391304347826
By using an ensemble of the top three models, we improved our accuracy from 0.85 to 0.89 🎉. This demonstrates that ensemble methods can enhance predictive performance by leveraging the strengths of multiple models.
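For reference, scikit-learn's VotingClassifier wraps the same hard-voting idea in a single estimator; a minimal sketch with the same three model types (exact numbers may differ slightly from the manual vote above because of the random train/test split and weight initialization):
from sklearn.ensemble import VotingClassifier

# Hard-voting ensemble of the three best-performing model types
voting = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression()),
        ("rf", RandomForestClassifier(criterion="gini", max_depth=7)),
        ("mlp", MLPClassifier(hidden_layer_sizes=(10, 20), max_iter=300,
                              solver="sgd", learning_rate_init=0.5, random_state=1)),
    ],
    voting="hard",
)
voting.fit(X_train, y_train)
print("VotingClassifier accuracy:", voting.score(X_test, y_test))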
This project demonstrated the effectiveness of machine learning in predicting heart disease using clinical data. After preprocessing, exploratory analysis, and model evaluation, the Random Forest Classifier achieved the highest individual accuracy of 85.5%, with the ensemble model boosting performance to 89.7%. Key features such as ST_Slope and ChestPainType were identified as the most influential in predicting heart disease. These results highlight the potential of data-driven approaches in supporting early diagnosis and improving cardiovascular health outcomes.