Statistical learning models were applied to a heart disease dataset to predict the severity of individuals' heart disease from their clinical and demographic information. Several learning techniques, including k-nearest neighbors, random forests, neural networks, and gradient boosting machines (GBM), were explored and validated; random forest models ultimately showed the best accuracy.
According to statistics from the WHO, heart disease is the leading cause of death in developing countries. In most cases, the diagnosis of heart disease depends on a complex combination of clinical and pathological data [1].
Detecting heart disease at an early stage could reduce excessive medical costs and improve the quality of medical care. Statistical and machine learning techniques are the main approaches used to predict heart disease status from clinical and demographic data [2, 3].
In this analysis, statistical learning techniques were applied to a heart disease dataset from the UCI Machine Learning Repository. It contains heart disease severity levels, demographic information, and other clinical information. The main purpose is to predict whether patients have heart disease and, if so, how severe it is. Therefore, binary and multiclass classification methods, including k-nearest neighbors, random forest, neural network, and GBM models, were applied. The results indicate that, among the models considered, random forest models can probably achieve reasonable accuracy. However, further investigation of the data, and of how these demographic and clinical variables relate to heart disease, is still needed through practical and statistical methods.
The original Heart Disease Data Set, which contains 76 attributes, can be accessed via the UCI Machine Learning Repository [4]. For this analysis, we use only a subset of 14 of them, the same subset that all published experiments refer to. It contains heart disease severity levels, demographic information (age, sex, location), and other clinical information. The response variable is the number of vessels with greater than 50% diameter narrowing. It has five levels, where \(V_0\) indicates no evidence of heart disease and \(V_4\) indicates the most severe evidence of heart disease.
In the training dataset, 51.7% of individuals show evidence of heart disease and 2.9% show the most severe evidence. Moreover, 22.5% of individuals in the training dataset are female. The median age in the training dataset is 54 years, with a range of 28 to 77 years.
A new binary variable was created to indicate whether an individual has heart disease. If the number of vessels with greater than 50% diameter narrowing is zero, the individual is considered to show no evidence of heart disease. Otherwise, heart disease is considered present, at a severity level from one to four.
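The binary recoding described above can be sketched in a few lines of Python. This is only an illustration: the function name `to_binary` and the sample values are assumptions, not part of the original analysis.

```python
def to_binary(num):
    """Map a severity level (0 to 4) to a binary label.

    0 = no evidence of heart disease, 1 = heart disease at any severity.
    """
    if num not in range(5):
        raise ValueError(f"expected a level from 0 to 4, got {num!r}")
    return 0 if num == 0 else 1


# Made-up severity levels for illustration only (not rows from the dataset).
levels = [0, 2, 0, 1, 4, 3, 0]
has_disease = [to_binary(n) for n in levels]
print(has_disease)  # [0, 1, 0, 1, 1, 1, 0]
```

Levels one to four all collapse to the same label, so the binary models ignore severity and only separate \(V_0\) from the rest.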
To determine whether an individual has heart disease, and to predict the level of heart disease severity, four modeling techniques were considered: k-nearest neighbors, random forest, neural network, and gradient boosting machine (GBM) models. All available predictor variables were used for both the binary and the multiclass classification models. To evaluate how well the different models predict the severity of heart disease, the data were split into training and testing sets.
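As a rough illustration of the train/test workflow, the sketch below splits a dataset within each class and scores a plain k-nearest neighbors classifier on the held-out part. It uses only the Python standard library and synthetic data. The helper names (`stratified_split`, `knn_predict`), the 25% test share, and the choice of a stratified split are assumptions for the example, not details of the original analysis.

```python
import math
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling the test share within each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return sorted(train_idx), sorted(test_idx)


def knn_predict(train_x, train_y, query, k):
    """Majority vote among the k training points closest to `query` (Euclidean)."""
    distances = (math.dist(x, query) for x in train_x)
    nearest = sorted(zip(distances, train_y))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]


# Synthetic two-feature data, generated only for illustration.
rng = random.Random(42)
X, y = [], []
for label, centre in [(0, (0.0, 0.0)), (1, (2.0, 2.0))]:
    for _ in range(40):
        X.append((rng.gauss(centre[0], 1.0), rng.gauss(centre[1], 1.0)))
        y.append(label)

train_idx, test_idx = stratified_split(y, test_fraction=0.25, seed=1)
train_x = [X[i] for i in train_idx]
train_y = [y[i] for i in train_idx]

# Score the classifier on the held-out observations only.
correct = sum(knn_predict(train_x, train_y, X[i], k=9) == y[i] for i in test_idx)
accuracy = correct / len(test_idx)
print(f"test accuracy: {accuracy:.3f}")
```

With real clinical predictors on different scales (for example age and blood pressure), the features would usually be standardized before computing distances. That step is omitted here for brevity.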
| Model Name | Accuracy | Tuning Parameters |
|---|---|---|
| KNN Model | 0.6825 | k = 9 |
| Random Forest Model | 0.8007 | mtry = 2 |
| Neural Network Model | 0.8260 | size = 3, decay = 0.1 |
| GBM Model | 0.8210 | shrinkage = 0.1, interaction depth = 3, n.minobsinnode = 10, number of trees = 50 |
| Model Name | Accuracy | Tuning Parameters |
|---|---|---|
| KNN Model | 0.5000 | k = 7 |
| Random Forest Model | 0.5895 | mtry = 2 |
| Neural Network Model | 0.5827 | size = 3, decay = 0.1 |
| GBM Model | 0.6132 | shrinkage = 0.1, interaction depth = 1, n.minobsinnode = 10, number of trees = 100 |
Based on model performance on the testing dataset, the random forest model with mtry = 2 stood out among the four types of models. An accuracy of 89.2% for multiclass classification and 83.8% for binary classification could be obtained on the test data. However, compared with the no-information rate of 51.7% (the percentage of individuals with evidence of heart disease) in this dataset, this suggests a model that is acceptable but does not perform very well at the prediction task.
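To make the comparison with the no-information rate concrete, the sketch below computes both quantities for a small set of made-up labels. It assumes the common definition of the no-information rate as the accuracy of always predicting the most frequent class. The function names and the example labels are illustrative only and are not taken from the analysis.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def no_information_rate(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)


# Made-up labels for illustration only (1 = heart disease, 0 = no evidence).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]

acc = accuracy(y_true, y_pred)
nir = no_information_rate(y_true)
print(f"accuracy = {acc:.2f}, no-information rate = {nir:.2f}")
# accuracy = 0.80, no-information rate = 0.60
```

A model adds value only to the extent that its accuracy exceeds this baseline, which is why raw accuracy alone can be misleading when one class dominates.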
Statistically, the applicability of the chosen model is somewhat limited by the nature of the data. The predictors used to train the model were only a subset of 14 attributes out of 76, and there is little evidence that the remaining variables have no significant effect on the response. Furthermore, the sample size is relatively small, with only 740 observations in the whole dataset. To generalize this model to a larger population, more data would need to be included.
- age: age in years
- sex: sex
- cp: chest pain type
- trestbps: resting blood pressure (in mm Hg on admission to the hospital)
- chol: serum cholesterol in mg/dl
- fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
- restecg: resting electrocardiographic results
- thalach: maximum heart rate achieved
- exang: exercise induced angina
- oldpeak: ST depression induced by exercise relative to rest
- num: five-level variable
- location: source creators
For additional background on the data, see the data source on the UCI Machine Learning Repository.
## References
1. Wu R, Peters W, Morgan MW. The next generation clinical decision support: linking evidence to best practice. J Healthc Inf Manag 2002;16:50-5.
2. Anbarasi M, Anupriya E, Iyengar NCHSN. Enhanced prediction of heart disease with feature subset selection using genetic algorithm. International Journal of Engineering Science and Technology 2010;2:5370-76.
3. Palaniappan S, Awang R. Intelligent heart disease prediction system using data mining techniques. International Journal of Computer Science and Network Security 2008;8:343-50.