Heart disease is a major concern because it is the leading cause of death globally. There are different types of heart disease, such as coronary artery disease, congenital heart defects, arrhythmia and myocardial infarction. Several indicators hint at whether heart disease is present. These include chest pain (typical, atypical, non-anginal, asymptomatic); fasting blood sugar, where anything above 100 mg/dl increases the risk of heart disease by 300% [2]; serum cholesterol, which causes the arteries to become stiff; resting blood pressure; and ST depression, which is closely associated with a high risk of cardiac events [3]. No single indicator necessarily shows that heart disease is present. An experienced medical professional can analyze the combined results and deduce whether heart disease is present, but this is time consuming and the diagnosis might be wrong. It is more efficient to use Machine Learning (ML) to screen a large number of patients automatically, so that the medical professional only intervenes when there is a positive diagnosis. This project analyzes and compares the results of three ML models: K-Nearest Neighbor (KNN), Random Forest Classifier and Support Vector Machines (SVM).
The original data set available on the UCI ML repository contains 76 attributes. However, only a 13-feature subset has been used by ML researchers [4]. This subset is also available on Kaggle [5], and that version is used in this article. It has 303 rows, no missing values and one duplicated row. The last column is the class label, which indicates whether the patient has heart disease (1) or not (0).
| # | Code | Name | Type |
|---|------|------|------|
| 1 | age | Age | Integer |
| 2 | sex | Sex | Categorical (0, 1) |
| 3 | cp | Chest Pain | Categorical (0-4) |
| 4 | trestbps | Resting Blood Pressure | Integer |
| 5 | chol | Serum Cholesterol | Integer |
| 6 | fbs | Fasting Blood Sugar | Categorical (0, 1) |
| 7 | restecg | Resting Electrocardiographic Results | Categorical (0-2) |
| 8 | thalach | Maximum Heart Rate | Integer |
| 9 | exang | Exercise Induced Angina | Categorical (0, 1) |
| 10 | oldpeak | ST depression induced by exercise relative to rest | Float |
| 11 | slope | Slope of peak exercise ST segment | Categorical (0-2) |
| 12 | ca | Number of major vessels by fluoroscopy | Categorical (0-3) |
| 13 | thal | Thalassemia | Categorical (3, 6, 7) |
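The basic integrity checks mentioned for this data set (missing values and duplicated rows) can be sketched with pandas. The tiny table below is a made-up stand-in, not the real data, and the variable names are illustrative only; per-feature skewness is included as a common follow-up check.

```python
import pandas as pd

# Hypothetical miniature stand-in for the heart-disease table (values are made up).
df = pd.DataFrame(
    {
        "age": [63, 37, 41, 41, 56],
        "chol": [233, 250, 204, 204, 236],
        "oldpeak": [2.3, 3.5, 1.4, 1.4, 0.8],
        "target": [1, 1, 1, 1, 0],
    }
)

n_missing = int(df.isna().sum().sum())     # total number of missing cells
n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row

# Drop repeated rows, keeping the first occurrence of each.
clean = df.drop_duplicates().reset_index(drop=True)

# Sample skewness of every feature column (the label column is excluded).
skewness = clean.drop(columns="target").skew()

print(n_missing, n_duplicates, clean.shape)
print(skewness)
```

On the real file, the same calls would be applied to the frame returned by `pd.read_csv`.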
Only one duplicate row was detected and removed, and the data set has no missing values. The skewness of every feature is under 1, except for ‘chol’ at 1.15. Since the features have different units and their ranges vary drastically, it is essential to bring them to a common scale. Three scalers were considered: MinMaxScaler, which maps each feature to the range zero to one, RobustScaler and StandardScaler. Scaling is particularly important before reducing the dimensionality of the data. In this project, Principal Component Analysis (PCA) was used to reduce dimensionality. To find the best scaler, each one was followed by a PCA to find the smallest number of dimensions that describes 90% of the variance of the data set.
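| # | Normalization Method | Minimum Dimensions to Describe 90% of Variance |
|---|----------------------|------------------------------------------------|
| 1 | MinMax Scaler | 7 |
| 2 | Robust Scaler | 8 |
| 3 | Standard Scaler | 9 |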
The MinMax Scaler produced the best results. Since this data set is a subset of the original one with 76 attributes, it is understandable that the dimensions cannot be reduced further.
MinMax Scaler: 7 dimensions describe 90% of the variance of the data set
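The scaler comparison can be sketched as follows. This is a minimal illustration on synthetic data from `make_classification`, not the heart-disease table, so the counts it prints will not match the ones reported above. The helper `components_for_variance` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Synthetic stand-in with 13 features (assumption: not the real data set).
X, _ = make_classification(n_samples=300, n_features=13, n_informative=6, random_state=0)


def components_for_variance(X_scaled, threshold=0.90):
    """Smallest number of principal components whose cumulative
    explained variance ratio reaches `threshold`."""
    cumulative = np.cumsum(PCA().fit(X_scaled).explained_variance_ratio_)
    # searchsorted returns the first index whose cumulative value is >= threshold.
    return int(np.searchsorted(cumulative, threshold) + 1)


scalers = {
    "MinMax Scaler": MinMaxScaler(),
    "Robust Scaler": RobustScaler(),
    "Standard Scaler": StandardScaler(),
}
results = {name: components_for_variance(s.fit_transform(X)) for name, s in scalers.items()}

for name, n in results.items():
    print(f"{name}: {n} dimensions for 90% of the variance")
```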
Before performing PCA, feature selection was performed using SelectKBest to reduce the number of features to a minimum. The figure below shows how accuracy, F1 score and recall vary with the number of features. To ensure reliable results, these scores are obtained using cross validation on the training set.
Plot of accuracy, recall and F1 score vs Number of Features
As the figure above shows, five features are enough to describe the data set best. These features are ca, cp, exang, oldpeak and thalach. A new data set with these five features is created and PCA is performed. Fig. 3 shows that three dimensions are sufficient to preserve 95% of the variance.
PCA vs Cumulative Explained Variance
Therefore, the data set is reduced to five features, which PCA then projects onto three dimensions.
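A sweep over the number of selected features, followed by PCA on the chosen subset, might look like the sketch below. Several choices here are assumptions, since the text does not state them: the `f_classif` score function, the KNN estimator used for scoring, and the synthetic data. In a real experiment the sweep would be run on the training split only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the training set (assumption: not the real data).
X, y = make_classification(
    n_samples=300, n_features=13, n_informative=5, n_redundant=2, random_state=0
)

# Cross-validated F1 score for every possible number of selected features.
scores_by_k = {}
for k in range(1, X.shape[1] + 1):
    model = make_pipeline(MinMaxScaler(), SelectKBest(f_classif, k=k), KNeighborsClassifier())
    scores_by_k[k] = cross_val_score(model, X, y, cv=3, scoring="f1").mean()

best_k = max(scores_by_k, key=scores_by_k.get)

# Keep the best_k features, then count the components needed for 95% of the variance.
selected = make_pipeline(MinMaxScaler(), SelectKBest(f_classif, k=best_k)).fit_transform(X, y)
cumulative = np.cumsum(PCA().fit(selected).explained_variance_ratio_)
n_components = int(np.searchsorted(cumulative, 0.95) + 1)

print(best_k, n_components)
```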
There are many classification models available. This project analyzes three of them: KNN, Random Forest Classifier and SVM.
The KNN classifier is a simple supervised ML model that is mostly used for classification. The model memorizes the training set and uses a voting system when presented with a test data point: the K nearest training points take part in the vote, and the class held by the majority of them wins. The distance metric can be set to Euclidean, Manhattan, Chebyshev and a few others. Since this model is very sensitive to the distance between points, it is crucial to normalize the data to a common scale. K is a hyperparameter that can be tuned by the programmer. For binary classification it is usually set to an odd number so that the vote cannot end in a tie. K can be tuned experimentally to obtain the best results; however, it is normally taken to be close to the square root of the number of data points. It is important to choose a value that neither overfits the data (K too small) nor underfits it (K too big).
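The voting rule can be illustrated with a few lines of NumPy. This is a from-scratch sketch for explanation only, not the implementation used in the project; the function name `knn_predict` and the toy points are made up.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_query, k=3, metric="euclidean"):
    """Classify x_query by majority vote among its k nearest training points."""
    diff = np.abs(X_train - x_query)
    if metric == "euclidean":
        dist = np.sqrt((diff ** 2).sum(axis=1))
    elif metric == "manhattan":
        dist = diff.sum(axis=1)
    elif metric == "chebyshev":
        dist = diff.max(axis=1)
    else:
        raise ValueError(f"unknown metric: {metric}")
    nearest = np.argsort(dist)[:k]             # indices of the k closest points
    votes = Counter(y_train[nearest].tolist())  # class label -> number of votes
    return votes.most_common(1)[0][0]


# Toy training set, already on a common 0-1 scale.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.9, 1.0], [1.0, 0.8]])
y_train = np.array([0, 0, 0, 1, 1])

near_origin = knn_predict(X_train, y_train, np.array([0.15, 0.15]))  # three class-0 neighbours
near_corner = knn_predict(X_train, y_train, np.array([0.95, 0.90]))  # two class-1 votes against one
print(near_origin, near_corner)
```

With K = 3 the second query is outvoted 2 to 1 in favour of class 1; with an even K the same vote could end in a tie.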
The random forest classifier uses a collection of decision trees. A decision tree is a tree-like structure that has branches and nodes. Each node splits the data based on the value of an attribute. When a new data point must be classified, it travels from the root down to a leaf, and it is assigned the class of that leaf. At every node along the way it is tested against a condition, for example, “if green, go to the left; if blue, go to the right”. To ensure that the decision tree does not overfit, it should be regularized, for example by limiting its maximum depth. Unlike KNN, decision trees do not require the data to be normalized. A random forest is a collection of uncorrelated decision trees that act together as an ensemble. When classifying a data point, each decision tree makes its own prediction, and the class that gets the majority of the votes wins.
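The ensemble idea can be sketched with scikit-learn's `RandomForestClassifier`. The data is synthetic and the hyperparameter values (`n_estimators=25`, `max_depth=4`) are arbitrary choices for illustration, not the values tuned in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic binary problem (assumption: stands in for the reduced data set).
X, y = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# max_depth limits how deep each tree can grow, which regularizes the forest.
forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

# Ask every individual tree for its prediction on the first test point.
tree_votes = np.array(
    [tree.predict(X_test[:1])[0] for tree in forest.estimators_]
).astype(int)
majority = int(np.bincount(tree_votes).argmax())

print("votes per class:", np.bincount(tree_votes, minlength=2))
print("hard majority vote:", majority)
print("test accuracy:", forest.score(X_test, y_test))
```

Note that scikit-learn's forest combines the trees by averaging their predicted class probabilities rather than by counting hard votes, so `forest.predict` can occasionally differ from the hard majority computed here.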
The SVM model can perform linear or non-linear classification, regression and even outlier detection [6]. In binary classification, for example, if the two classes can be separated by a straight line, they are said to be linearly separable. If they are not linearly separable, one solution is to add polynomial features, which might make the data set linearly separable.
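A one-dimensional toy example shows how a polynomial feature can make classes linearly separable. The seven points below are made up: class 1 sits on both sides of class 0, so no single threshold on x separates them, but adding x² as a second feature does.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.svm import SVC

# Made-up data: class 1 lies on both sides of class 0 along the x axis.
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]).reshape(-1, 1)
y = np.array([1, 1, 0, 0, 0, 1, 1])

# A linear boundary on x alone cannot classify every point correctly.
acc_raw = SVC(kernel="linear", C=1000).fit(x, y).score(x, y)

# Add the squared feature: the columns are now [x, x^2].
x_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(x)
acc_poly = SVC(kernel="linear", C=1000).fit(x_poly, y).score(x_poly, y)

print(acc_raw, acc_poly)
```

In the expanded space the classes are split by a horizontal line between x² = 1 and x² = 4.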
Linearly Separable Data
The straight line represents the boundary that separates these linearly separable data. However, infinitely many lines satisfy this condition. The best line to choose is the one that has the greatest distance between itself and the points closest to it, that is, the largest margin. In practice, this is rarely possible because real data is not perfectly separable like this. The solution is to allow a small number of data points to be misclassified. This is controlled by a parameter called C, the cost of misclassification. The higher the value of C, the fewer training points are allowed to be misclassified. It is important to note that a value of C that is too high will cause overfitting. If the data points are not linearly separable, the SVM can still find a maximum-margin boundary by using the kernel trick. The essence of the kernel trick is to map the classification problem into a space in which it has a simple separation boundary, even though the boundary is complex in the original space [7].
Kernel trick on non-linearly separable data
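The effect of the kernel and of C can be sketched on a classic non-linearly separable toy set, two concentric circles from `make_circles`. The data and the parameter values are illustrative assumptions, not those of the heart-disease experiment.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them.
X, y = make_circles(n_samples=200, noise=0.05, factor=0.4, random_state=0)

# Same C, different kernels: the RBF kernel separates the rings, the linear one cannot.
linear_acc = SVC(kernel="linear", C=1.0).fit(X, y).score(X, y)
rbf_acc = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y).score(X, y)
print("linear:", linear_acc, "rbf:", rbf_acc)

# Same kernel, different C: report training accuracy and the number of support vectors.
for C in (0.01, 1.0, 100.0):
    model = SVC(kernel="rbf", C=C, gamma="scale").fit(X, y)
    print(C, model.score(X, y), int(model.n_support_.sum()))
```

A small C tolerates more margin violations and typically leaves more support vectors; a large C penalizes them heavily and fits the training data more tightly.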
In all the models tested, the reduced data set that consists of 5 features and 3 dimensions (discussed in section III) has been used. The data set has been split into training and test sets with a ratio of 3:1. To ensure that the model is not biased and that there is no data leakage, the hyperparameters are tuned on the training set only, using cross validation with a stratified K-fold with 3 splits. Cross validation reduces the bias that a single split of the data could introduce. In this case, the training set is split into three folds, and each fold serves once as the validation set while the other two are used for fitting. The resulting score is the average over the three folds. It is important to note that the cross-validation score is not the final result of the experiment. The hyperparameters have been tuned specifically for the data used in fitting, so it cannot be assumed that the model will perform equally well on data it has not seen before. As a result, the models are evaluated on their test-set scores. To further ensure that the model is unbiased, the test set plays no part in fitting the normalization: the scaler is fitted on the training data only and then applied to the test data. Pipelines were used as a convenient way to do this, since a pipeline groups the normalization, the PCA and the model into one object and fits every step on the training data only.

Since correctly identifying patients who have heart disease can be a matter of life or death, recall is ideally the score to optimize. However, it was found that too much emphasis on recall lowers the precision. This is not ideal, since classifying many healthy patients as having heart disease would render the test meaningless. More emphasis is therefore placed on the F1 score.
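The evaluation protocol described above can be sketched with a scikit-learn `Pipeline` and `GridSearchCV`. The data is synthetic, the SVM grid is an arbitrary example, and placing `SelectKBest` inside the pipeline is an assumption made here to keep the sketch self-contained; the step names (`scale`, `select`, `pca`, `clf`) are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 13-feature table (assumption: not the real data).
X, y = make_classification(n_samples=300, n_features=13, n_informative=5, random_state=0)

# 3:1 split; the test set is held out until the very end.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Every step is fitted on training data only, inside each cross-validation fold.
pipe = Pipeline(
    [
        ("scale", MinMaxScaler()),
        ("select", SelectKBest(f_classif, k=5)),
        ("pca", PCA(n_components=3)),
        ("clf", SVC()),
    ]
)
param_grid = {"clf__C": [0.1, 1, 10], "clf__kernel": ["linear", "rbf"]}  # example grid
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(pipe, param_grid, scoring="f1", cv=cv).fit(X_train, y_train)

cv_f1 = search.best_score_                           # mean F1 over the three folds
test_f1 = f1_score(y_test, search.predict(X_test))   # score on unseen data

print(search.best_params_, round(cv_f1, 3), round(test_f1, 3))
```

Because the scaler lives inside the pipeline, `search.predict(X_test)` applies the scaling learned from the training data to the test data instead of refitting it.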
The code was run on an HP laptop with 8 GB of RAM and an Intel Core i7-8550U processor @ 2GHz. To avoid overfitting, all models have been regularized.
In this project, three different classification models have been analyzed. The data set was reduced to the best five features using SelectKBest and then down to three dimensions using PCA. Accuracy is not a reliable metric for predicting heart disease in patients. This is because it is crucial to positively diagnose patients who actually have heart disease, even if a small number of patients who do not have it are identified as positive. However, too many of the latter will result in an unreliable test. Therefore, the F1 metric, which is the harmonic mean of precision and recall, is the best choice. Table III shows that the random forest model performs best, but with a time penalty. KNN is the fastest but gives slightly worse (though still acceptable) results. SVM is approximately 9 times faster than random forest, with results that closely match it. A recently published article by Agarwal et al. [8] reports an F1 score of 90.9 for Random Forest using all 13 features and without any dimension reduction. This is only 5% higher than the score obtained here with 5 features and 3 dimensions. Although the other models used in this project differ from theirs, in general only a 5% loss is observed in almost all metrics.
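The relationship between precision, recall and F1 can be checked on a small made-up set of predictions. The labels below are invented purely for illustration and are unrelated to the results in this article.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Made-up labels: 4 patients with heart disease (1) and 6 without (0).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 5
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / total = 7 / 10

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, accuracy, f1_manual, f1_score(y_true, y_pred))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model cannot reach a high F1 by maximizing recall while letting precision collapse.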