The objective of this project was to build classifiers that predict whether a person is diagnosed with heart disease, based on the Cleveland database sourced from the UCI Machine Learning Repository. In Phase I, we cleaned the data, filled in the missing values for two descriptive features (ca and thal) and recoded the target feature (num) into a binary target (target). In Phase II, we built three binary classifiers on the cleaned data. The rest of this report is organised as follows. Section 2 gives an overview of the methodology. Section 3 discusses the classifiers' fine-tuning process and gives a detailed performance analysis of each classifier. Section 4 compares the performance of the classifiers using the same resampling method. Section 5 critiques our methodology. The last section concludes with a summary.
We considered three classifiers: Random Forest (RF), Naive Bayes (NB) and \(K\)-Nearest Neighbour (KNN). The classifiers were trained to output probability predictions, because this allows the prediction threshold to be adjusted to refine performance. We split the full data set into a 70% training set and a 30% test set. For the fine-tuning process, we used five-fold cross-validation with simple random sampling, as the data set is fairly balanced (139 observations with target = yes and 164 with target = no).
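The split and fold assignment described above can be sketched in base R as follows. This is only an illustration under stated assumptions: the data frame `heart` is a synthetic stand-in with the reported class counts, the seed is arbitrary, and the report's own analysis used the mlr resampling tools rather than this code.

```r
# Minimal sketch of a 70/30 split followed by five-fold assignment.
# `heart` is a synthetic placeholder, NOT the cleaned Cleveland data.
set.seed(1)  # arbitrary seed, for reproducibility of the sketch only
n <- 139 + 164  # class counts reported in the text (yes + no = 303)
heart <- data.frame(
  id = seq_len(n),
  target = factor(rep(c("yes", "no"), times = c(139, 164)))
)

# Simple random sampling: 70% of the rows go to the training set.
train_idx <- sample(n, size = round(0.7 * n))
train <- heart[train_idx, ]
test <- heart[-train_idx, ]

# Randomly assign each training row to one of five folds.
folds <- sample(rep(1:5, length.out = nrow(train)))

c(train = nrow(train), test = nrow(test))
table(folds)
```

With 303 observations this gives 212 training rows and 91 test rows, which agrees with the 91 cases that appear in each test-set confusion matrix.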
Next, for each classifier, we determined the optimal probability threshold. Using the tuned hyperparameters and the optimal thresholds, we made predictions on the test data. During model training (hyperparameter tuning and threshold adjustment), we relied on the mean misclassification error rate (mmce). In addition to mmce, we used the confusion matrix on the test data to evaluate the classifiers' performance based on precision, recall and F1 measures. The modelling was implemented in R with the mlr package.
We fine-tuned the number of features randomly selected as candidates at each split (i.e. mtry). For a classification problem, some scholars suggest mtry = \(\sqrt{p}\), where \(p\) is the number of descriptive features. In our case, \(\sqrt{p} = \sqrt{13} \approx 3.61\). Therefore, we experimented with mtry = 2, 3 and 4. We left the other hyperparameters, such as the number of trees to grow, at their default values. The optimal value was mtry = 3, with a mean mmce test error of 0.193.
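As a small worked example of the reasoning above, the candidate grid can be derived from \(\sqrt{p}\) and the best value picked from the cross-validated errors. The error values are copied from the tuning log in this report. The way the grid is built around \(\sqrt{p}\) is an assumption made for illustration, not necessarily how the authors chose it.

```r
# Number of descriptive features in the cleaned data set.
p <- 13
sqrt_p <- sqrt(p)  # roughly 3.61

# Assumed rule for illustration: one below floor(sqrt(p)) up to ceiling(sqrt(p)).
mtry_grid <- (floor(sqrt_p) - 1):ceiling(sqrt_p)

# Mean cross-validated mmce per candidate, as printed in the tuning log.
cv_mmce <- c("2" = 0.2025471, "3" = 0.1931340, "4" = 0.2213732)

# The selected mtry is the candidate with the smallest mean mmce.
best_mtry <- as.integer(names(which.min(cv_mmce)))

mtry_grid
best_mtry
```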
Since the training set may not cover all possible descriptive feature combinations, the NB classifier might produce some fitted zero probabilities as predictions. To mitigate this, we ran a grid search to determine the optimal value of the Laplace smoothing parameter. We experimented with ten evenly spaced values ranging from 0 to 30. The optimal Laplace parameter was zero, with a mean mmce test error of 0.180. This suggests that we do not need Laplace smoothing, as the training data already represent the descriptive feature levels well.
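To make the role of the smoothing parameter concrete, the sketch below computes a Laplace-smoothed conditional probability for a single categorical feature level. The function name `smoothed_prob` and the counts are hypothetical and chosen only for illustration. The formula is the usual additive smoothing estimate.

```r
# Laplace-smoothed estimate of P(feature = level | class).
#   count       : training rows in the class that have this level
#   class_total : all training rows in the class
#   n_levels    : number of levels of the categorical feature
#   laplace     : smoothing parameter (0 means no smoothing)
smoothed_prob <- function(count, class_total, n_levels, laplace = 0) {
  (count + laplace) / (class_total + laplace * n_levels)
}

# Hypothetical case: a level never observed among 50 rows of one class.
smoothed_prob(0, 50, 3, laplace = 0)  # unsmoothed estimate is exactly 0
smoothed_prob(0, 50, 3, laplace = 1)  # smoothed estimate is 1/53
```

A zero estimate forces the whole class posterior to zero for any instance with that level, which is the problem smoothing is meant to avoid. Large smoothing values pull all estimates towards a uniform distribution.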
Using the optimal kernel, we ran a grid search over \(k = 2, 3, \ldots, 30\). The optimal value was \(k = 11\), with a mean mmce test error of 0.165.
## [Tune] Started tuning learner classif.randomForest for parameter set:
## Type len Def Constr Req Tunable Trafo
## mtry discrete - - 2,3,4 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: mtry=2
## [Tune-y] 1: mmce.test.mean=0.2025471; time: 0.0 min
## [Tune-x] 2: mtry=3
## [Tune-y] 2: mmce.test.mean=0.1931340; time: 0.0 min
## [Tune-x] 3: mtry=4
## [Tune-y] 3: mmce.test.mean=0.2213732; time: 0.0 min
## [Tune] Result: mtry=3 : mmce.test.mean=0.1931340
## [Tune] Started tuning learner classif.naiveBayes for parameter set:
## Type len Def Constr Req Tunable Trafo
## laplace numeric - - 0 to 30 - TRUE -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: laplace=0
## [Tune-y] 1: mmce.test.mean=0.1795127; time: 0.0 min
## [Tune-x] 2: laplace=3.33
## [Tune-y] 2: mmce.test.mean=0.1840532; time: 0.0 min
## [Tune-x] 3: laplace=6.67
## [Tune-y] 3: mmce.test.mean=0.1888151; time: 0.0 min
## [Tune-x] 4: laplace=10
## [Tune-y] 4: mmce.test.mean=0.1981174; time: 0.0 min
## [Tune-x] 5: laplace=13.3
## [Tune-y] 5: mmce.test.mean=0.2029900; time: 0.0 min
## [Tune-x] 6: laplace=16.7
## [Tune-y] 6: mmce.test.mean=0.2029900; time: 0.0 min
## [Tune-x] 7: laplace=20
## [Tune-y] 7: mmce.test.mean=0.2029900; time: 0.0 min
## [Tune-x] 8: laplace=23.3
## [Tune-y] 8: mmce.test.mean=0.2029900; time: 0.0 min
## [Tune-x] 9: laplace=26.7
## [Tune-y] 9: mmce.test.mean=0.2029900; time: 0.0 min
## [Tune-x] 10: laplace=30
## [Tune-y] 10: mmce.test.mean=0.2077519; time: 0.0 min
## [Tune] Result: laplace=0 : mmce.test.mean=0.1795127
## [Tune] Started tuning learner classif.kknn for parameter set:
## Type len Def Constr Req Tunable
## k discrete - - 2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,... - TRUE
## Trafo
## k -
## With control class: TuneControlGrid
## Imputation value: 1
## [Tune-x] 1: k=2
## [Tune-y] 1: mmce.test.mean=0.2223699; time: 0.0 min
## [Tune-x] 2: k=3
## [Tune-y] 2: mmce.test.mean=0.2223699; time: 0.0 min
## [Tune-x] 3: k=4
## [Tune-y] 3: mmce.test.mean=0.2223699; time: 0.0 min
## [Tune-x] 4: k=5
## [Tune-y] 4: mmce.test.mean=0.1985604; time: 0.0 min
## [Tune-x] 5: k=6
## [Tune-y] 5: mmce.test.mean=0.2077519; time: 0.0 min
## [Tune-x] 6: k=7
## [Tune-y] 6: mmce.test.mean=0.1795127; time: 0.0 min
## [Tune-x] 7: k=8
## [Tune-y] 7: mmce.test.mean=0.1702104; time: 0.0 min
## [Tune-x] 8: k=9
## [Tune-y] 8: mmce.test.mean=0.1699889; time: 0.0 min
## [Tune-x] 9: k=10
## [Tune-y] 9: mmce.test.mean=0.1700997; time: 0.0 min
## [Tune-x] 10: k=11
## [Tune-y] 10: mmce.test.mean=0.1653378; time: 0.0 min
## [Tune-x] 11: k=12
## [Tune-y] 11: mmce.test.mean=0.1700997; time: 0.0 min
## [Tune-x] 12: k=13
## [Tune-y] 12: mmce.test.mean=0.1748616; time: 0.0 min
## [Tune-x] 13: k=14
## [Tune-y] 13: mmce.test.mean=0.1796235; time: 0.0 min
## [Tune-x] 14: k=15
## [Tune-y] 14: mmce.test.mean=0.1796235; time: 0.0 min
## [Tune-x] 15: k=16
## [Tune-y] 15: mmce.test.mean=0.1842746; time: 0.0 min
## [Tune-x] 16: k=17
## [Tune-y] 16: mmce.test.mean=0.1750831; time: 0.0 min
## [Tune-x] 17: k=18
## [Tune-y] 17: mmce.test.mean=0.1798450; time: 0.0 min
## [Tune-x] 18: k=19
## [Tune-y] 18: mmce.test.mean=0.1846069; time: 0.0 min
## [Tune-x] 19: k=20
## [Tune-y] 19: mmce.test.mean=0.1846069; time: 0.0 min
## [Tune-x] 20: k=21
## [Tune-y] 20: mmce.test.mean=0.1846069; time: 0.0 min
## [Tune-x] 21: k=22
## [Tune-y] 21: mmce.test.mean=0.1846069; time: 0.0 min
## [Tune-x] 22: k=23
## [Tune-y] 22: mmce.test.mean=0.1846069; time: 0.0 min
## [Tune-x] 23: k=24
## [Tune-y] 23: mmce.test.mean=0.1846069; time: 0.0 min
## [Tune-x] 24: k=25
## [Tune-y] 24: mmce.test.mean=0.1846069; time: 0.0 min
## [Tune-x] 25: k=26
## [Tune-y] 25: mmce.test.mean=0.1798450; time: 0.0 min
## [Tune-x] 26: k=27
## [Tune-y] 26: mmce.test.mean=0.1798450; time: 0.0 min
## [Tune-x] 27: k=28
## [Tune-y] 27: mmce.test.mean=0.1798450; time: 0.0 min
## [Tune-x] 28: k=29
## [Tune-y] 28: mmce.test.mean=0.1798450; time: 0.0 min
## [Tune-x] 29: k=30
## [Tune-y] 29: mmce.test.mean=0.1844961; time: 0.0 min
## [Tune] Result: k=11 : mmce.test.mean=0.1653378
The following plots depict the value of mmce against the range of probability thresholds. The optimal thresholds were around 0.44, 0.21 and 0.42 for the RF, NB and KNN classifiers respectively. These thresholds were used to convert the predicted probabilities into a prediction of whether a person is diagnosed with heart disease.
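The threshold search can be illustrated with a short base R sketch. The predicted probabilities here are simulated, so the resulting threshold does not correspond to any value reported above. The helper `mmce_at` and the rule "predict Yes when the probability is at least the threshold" are assumptions made for illustration.

```r
# Simulated stand-in for predicted probabilities of the positive class.
set.seed(42)  # arbitrary seed, for reproducibility of the sketch only
truth <- rep(c("Yes", "No"), times = c(34, 57))
prob_yes <- ifelse(truth == "Yes", rbeta(91, 4, 2), rbeta(91, 2, 4))

# Misclassification rate when predicting "Yes" at or above the threshold.
mmce_at <- function(threshold, prob, truth) {
  predicted <- ifelse(prob >= threshold, "Yes", "No")
  mean(predicted != truth)
}

# Evaluate mmce over a grid of thresholds and keep the minimiser.
thresholds <- seq(0.05, 0.95, by = 0.01)
mmce <- sapply(thresholds, mmce_at, prob = prob_yes, truth = truth)
best_threshold <- thresholds[which.min(mmce)]

best_threshold
min(mmce)
```

Plotting `mmce` against `thresholds` gives a curve of the same kind as the plots described above.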
Using the tuned parameters and threshold levels, we calculated the confusion matrix for each classifier. The confusion matrix of the RF classifier is as follows:
## Relative confusion matrix (normalized by row/column):
## predicted
## true No Yes -err.-
## No 0.89/0.89 0.11/0.18 0.11
## Yes 0.18/0.11 0.82/0.82 0.18
## -err.- 0.11 0.18 0.13
##
##
## Absolute confusion matrix:
## predicted
## true No Yes -err.-
## No 51 6 6
## Yes 6 28 6
## -err.- 6 6 12
The confusion matrix of the NB classifier is as follows:
## Relative confusion matrix (normalized by row/column):
## predicted
## true No Yes -err.-
## No 0.91/0.84 0.09/0.17 0.09
## Yes 0.29/0.16 0.71/0.83 0.29
## -err.- 0.16 0.17 0.16
##
##
## Absolute confusion matrix:
## predicted
## true No Yes -err.-
## No 52 5 5
## Yes 10 24 10
## -err.- 10 5 15
The confusion matrix of the KNN classifier is as follows:
## Relative confusion matrix (normalized by row/column):
## predicted
## true No Yes -err.-
## No 0.89/0.91 0.11/0.17 0.11
## Yes 0.15/0.09 0.85/0.83 0.15
## -err.- 0.09 0.17 0.12
##
##
## Absolute confusion matrix:
## predicted
## true No Yes -err.-
## No 51 6 6
## Yes 5 29 5
## -err.- 5 6 11
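# Evaluation measures, taking target = Yes as the positive class.
# Values follow from the absolute confusion matrices above:
# precision = TP / (TP + FP), recall = TP / (TP + FN).
EM <- matrix(c(82.35, 82.35, 0.8235,
               82.76, 70.59, 0.7619,
               82.86, 85.29, 0.8406), ncol = 3, byrow = TRUE)
colnames(EM) <- c("Precision(%)", "Recall(%)", "F1")
rownames(EM) <- c("RF", "NB", "KNN")
EM <- as.table(EM)
EM
## Precision(%) Recall(%) F1
## RF 82.3500 82.3500 0.8235
## NB 82.7600 70.5900 0.7619
## KNN 82.8600 85.2900 0.8406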
Here we take target = yes as the positive class when comparing the evaluation measures. No classifier achieves 100% accuracy on the test set. Based on precision, recall and F1, the KNN classifier provides the best predictions beyond the training set.
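With Yes as the positive class, the three measures can be recomputed directly from the absolute confusion matrices printed earlier. The helper name `measures` is hypothetical. The counts (true positives, false negatives, false positives) are read from those matrices.

```r
# Precision, recall and F1 from confusion-matrix counts (positive = Yes).
measures <- function(tp, fn, fp) {
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  f1 <- 2 * precision * recall / (precision + recall)
  c(precision = 100 * precision, recall = 100 * recall, f1 = f1)
}

# Counts taken from the absolute confusion matrices in this report.
res <- rbind(
  RF  = measures(tp = 28, fn = 6,  fp = 6),
  NB  = measures(tp = 24, fn = 10, fp = 5),
  KNN = measures(tp = 29, fn = 5,  fp = 6)
)

round(res, 4)
```

Recall divides by the row total of true Yes cases (34 for every classifier), while precision divides by the number of predicted Yes cases, which differs between classifiers.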
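In this analysis we are more concerned with recall, since it tells us how confident we can be that all persons with heart disease have been found by the model. Missing a true case matters more than flagging a person who does not actually have heart disease. Computed from the confusion matrices above as TP / (TP + FN), recall is 28/34 = 82.35% for RF, 24/34 = 70.59% for NB and 29/34 = 85.29% for KNN, so KNN and RF clearly outperform NB in this respect. Precision, by contrast, is very similar across the three models (about 82% to 83%).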
The NB model assumes that the numeric descriptive features are normally distributed, which is not necessarily true. A possible remedy is to transform the numeric features. In addition, the numeric descriptive features in the data set are measured on different scales, so it would be more reasonable to normalise these features before the analysis.
Among the three classifiers, the \(K\)-Nearest Neighbour classifier produces the best performance in predicting whether a person is diagnosed with heart disease, with a recall of 85.29% (29 of the 34 positive cases in the test set). We split the data into training and test sets and used cross-validation for model fine-tuning. We determined the optimal value of the selected hyperparameter of each classifier and the probability threshold. For future work, we propose considering data normalisation to improve the predictive power of all three classifiers.