Abstract

Statistical learning models were applied to a heart disease dataset to predict the severity of each individual’s heart disease from clinical and demographic information. A variety of learning techniques, including k-nearest neighbors, random forest, neural network, and gradient boosting machine (GBM) models, were explored and validated. Ultimately, random forest models showed the best accuracy.


Introduction

Heart disease is the leading cause of death in developing countries, according to statistics from the WHO. In most cases, the diagnosis of heart disease depends on a complex combination of clinical and pathological data [1].

Detecting heart disease at an early stage can reduce excessive medical costs and improve the quality of medical care. Statistical and machine learning techniques are the main approaches used to predict heart disease status from clinical and demographic data [2, 3].

In this analysis, statistical learning techniques were applied to a heart disease dataset from the UCI Machine Learning Repository. The dataset contains heart disease severity levels, demographic information, and other clinical information. The main purpose is to predict whether patients have heart disease and, if so, how severe it is. Therefore, binary and multiclass classification methods, including k-nearest neighbors, random forest, neural network, and GBM models, were applied. The results indicate that, among the models considered, random forest models can predict heart disease with reasonable accuracy. However, further investigation of the data, and of how these demographic and clinical variables relate to heart disease, is still needed through practical and statistical methods.


Methods

Data

The original Heart Disease Data Set, containing 76 attributes, can be accessed via the UCI Machine Learning Repository [4]. For the purposes of this analysis, we use only a subset of 14 attributes, as all published experiments do. The subset contains heart disease severity levels, demographic information (age, sex, location), and other clinical information. The response variable is the number of vessels with greater than 50% diameter narrowing. It has five levels, where \(V_0\) indicates no evidence of heart disease and \(V_4\) indicates the most severe evidence of heart disease.

In the training dataset, 51.7% of individuals show evidence of heart disease, while 2.9% show the most severe evidence of heart disease. Moreover, 22.5% of individuals in the training dataset are female. The median age in the training dataset is 54 years, with a range from 28 to 77 years.

Modeling

A new binary response variable was created to indicate whether an individual has heart disease. If the number of vessels with greater than 50% diameter narrowing is zero, the individual is considered to show no evidence of heart disease. Otherwise, heart disease is considered present, with a severity level from one to four.
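
The derivation of the binary variable can be sketched as follows. This is only an illustration in Python, not the code used in the analysis; the level labels follow the data dictionary, while the function name `to_binary` and the output labels `"none"` and `"some"` are assumptions made for the example.

```python
# Five severity levels of the response, as listed in the data dictionary.
SEVERITY_LEVELS = ["v0", "v1", "v2", "v3", "v4"]


def to_binary(num_level: str) -> str:
    """Collapse the five-level response into absence/presence of heart disease.

    The labels "none" and "some" are hypothetical names chosen for this sketch.
    """
    if num_level not in SEVERITY_LEVELS:
        raise ValueError(f"unknown severity level: {num_level!r}")
    # v0 means zero narrowed vessels, i.e. no evidence of heart disease.
    return "none" if num_level == "v0" else "some"


# Made-up example labels, not taken from the dataset.
labels = ["v0", "v2", "v4", "v0", "v1"]
binary = [to_binary(level) for level in labels]
print(binary)
```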

To determine whether an individual has heart disease and to predict the level of heart disease severity, four modeling techniques were considered: k-nearest neighbors models, random forest models, neural network models, and gradient boosting machine (GBM) models. For both the binary and the multiclass classification models, all available predictor variables were used. To evaluate the ability of the different models to predict the level of heart disease severity, the data were split into training and testing sets.
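
A train/test split of this kind can be sketched as below. The text does not state the split proportion or whether the split was stratified by class, so the 80/20 ratio, the stratification, the seed, and the function name `stratified_split` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx), splitting each class separately so that
    class proportions are approximately preserved in both sets.

    test_fraction and seed are illustrative defaults, not values from the report.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic labels for demonstration only.
labels = ["v0"] * 50 + ["v1"] * 30 + ["v2"] * 20
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting within each class keeps rare severity levels represented in both sets, which matters when one level accounts for only a few percent of the observations.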

  • K-nearest neighbors models were trained using 10-fold cross validation. The value of k was chosen from 30 candidate values generated by default.
  • Random forest models were trained using out-of-bag validation. The best value of the mtry parameter was chosen from 10 candidate values generated by default.
  • Neural network models were trained using 10-fold cross validation. The best values of the size and decay parameters were chosen from a default tuning grid with 10 candidate values.
  • Gradient boosting machine (GBM) models were trained using 10-fold cross validation. The best values of the shrinkage, interaction depth, minobsinnode, and number of trees parameters were chosen from a default tuning grid with 10 candidate values.
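
The tuning procedure described above, choosing a parameter by 10-fold cross validation, can be sketched for the k-nearest neighbors case. This is a minimal standard-library Python illustration on synthetic data, not the implementation used in the analysis; the helper names `knn_predict` and `cv_accuracy`, the candidate values of k, and the data are all assumptions.

```python
import math
import random
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def cv_accuracy(X, y, k, n_folds=10, seed=0):
    """Mean accuracy of k-NN across n_folds cross-validation folds."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    scores = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        train_X = [X[i] for i in train]
        train_y = [y[i] for i in train]
        hits = sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / n_folds


# Synthetic two-class data: two Gaussian clusters (not the heart disease data).
rng = random.Random(1)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(60)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(60)]
y = ["none"] * 60 + ["some"] * 60

# Hypothetical candidate grid; odd values avoid tied votes with two classes.
candidates = [1, 3, 5, 7, 9]
results = {k: cv_accuracy(X, y, k) for k in candidates}
best_k = max(results, key=results.get)
print(best_k, round(results[best_k], 3))
```

The same loop applies to the other models: only the fitting function and the candidate grid change, while the fold construction and the averaging of held-out accuracy stay the same.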

Results

The results for binary classification models are as follows:

| Model Name | Accuracy | Tuning Parameters |
|---|---|---|
| KNN Model | 0.6825 | k = 9 |
| Random Forest Model | 0.8007 | mtry = 2 |
| Neural Network Model | 0.8260 | size = 3, decay = 0.1 |
| GBM Model | 0.8210 | shrinkage = 0.1, interaction depth = 3, minobsinnode = 10, number of trees = 50 |

The results for multiclass classification models are as follows:

| Model Name | Accuracy | Tuning Parameters |
|---|---|---|
| KNN Model | 0.5000 | k = 7 |
| Random Forest Model | 0.5895 | mtry = 2 |
| Neural Network Model | 0.5827 | size = 3, decay = 0.1 |
| GBM Model | 0.6132 | shrinkage = 0.1, interaction depth = 1, minobsinnode = 10, number of trees = 100 |

Discussion

According to the model performance assessed on the testing dataset, the random forest model with mtry = 2 performed best among the four types of models. It obtained an accuracy of 89.2% for multiclass classification and 83.8% for binary classification on the test data. However, compared with the no-information rate of 51.7% (the percentage of individuals with evidence of heart disease) in this dataset, this suggests a model that is acceptable but does not perform very well at the prediction task.
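
The comparison between accuracy and the no-information rate can be made concrete with a small sketch. The labels below are made up for illustration and are not the study's predictions; the function names are assumptions.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def no_information_rate(y_true):
    """Accuracy obtained by always predicting the most frequent class."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Made-up labels: 6 individuals with heart disease, 4 without.
y_true = ["some"] * 6 + ["none"] * 4
y_pred = ["some"] * 5 + ["none"] + ["none"] * 3 + ["some"]

acc = accuracy(y_true, y_pred)
nir = no_information_rate(y_true)
print(f"accuracy = {acc:.2f}, no-information rate = {nir:.2f}")
```

A model is only informative to the extent that its accuracy exceeds the no-information rate, since that rate can be reached without using any predictor.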

Statistically, the application of the chosen model is limited to some extent by the nature of the data. The predictors used to train the model were only a subset of 14 attributes out of 76, and there is little evidence that the remaining variables have no significant effect on the response. Furthermore, the sample size is relatively small, with only 740 observations in the whole dataset. To generalize this model to a larger population, more data would need to be included.


Appendix

Data Dictionary

  • age - age in years
  • sex - sex
  • cp - chest pain type
  • trestbps - resting blood pressure (in mm Hg on admission to the hospital)
  • chol - serum cholesterol in mg/dl
  • fbs - (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
  • restecg - resting electrocardiographic results
  • thalach - maximum heart rate achieved
  • exang - exercise induced angina
  • oldpeak - ST depression induced by exercise relative to rest
  • num - five-level response variable
    • v0: 0 vessels with greater than 50% diameter narrowing. (No presence of heart disease.)
    • v1: 1 vessel with greater than 50% diameter narrowing. (Some presence of heart disease.)
    • v2: 2 vessels with greater than 50% diameter narrowing. (Some presence of heart disease.)
    • v3: 3 vessels with greater than 50% diameter narrowing. (Some presence of heart disease.)
    • v4: 4 vessels with greater than 50% diameter narrowing. (Some presence of heart disease.)
  • location - source creators
    • ch: University Hospital
    • cl: Cleveland Clinic Foundation
    • hu: Hungarian Institute of Cardiology
    • va: V.A. Medical Center

For additional background on the data, see the data source on the UCI Machine Learning Repository.

EDA - Exploratory Data Analysis

References


  1. Wu R, Peters W, Morgan MW. The next generation clinical decision support: linking evidence to best practice. J Healthc Inf Manag 2002;16:50-5.

  2. Anbarasi M, Anupriya E, Iyengar NCHSN. Enhanced prediction of heart disease with feature subset selection using genetic algorithm. International Journal of Engineering Science and Technology 2010;2:5370-76.

  3. Palaniappan S, Awang R. Intelligent heart disease prediction system using data mining techniques. International Journal of Computer Science and Network Security 2008;8:343-50.

  4. UCI Machine Learning Repository