The goal of this project is to use tree-based machine learning methods to predict the probability of heart disease from patients' clinical information. Such predictions can help patients prepare for potentially life-threatening heart disease.
The data come from the Heart Disease Cleveland UCI dataset (https://www.kaggle.com/cherngs/heart-disease-cleveland-uci). The dataset includes 303 patients, of whom 108 had no heart disease and 165 had experienced heart disease at least once before. The clinical information includes: age, sex, chest pain type, resting blood pressure, serum cholesterol level, fasting blood sugar, resting EKG result, maximum heart rate, exercise-induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, number of major vessels colored by fluoroscopy, thal, and whether the patient has experienced heart disease before.
A description of the data can be found here: https://www.kaggle.com/ronitf/heart-disease-uci/discussion/105877
In this project, we are interested in which tree-based method better predicts heart disease: random forest (RF) or gradient boosting machine (GBM). The assessment is based on the ROC curve and the precision score.
knitr::include_graphics("full_figure.pdf")
As can be seen from the figure, GBM gives better overall predictions than RF. The average precision score is 0.88 for RF vs. 0.93 for GBM, and the AUC is 0.86 for RF vs. 0.91 for GBM.
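For readers unfamiliar with the two metrics, the sketch below shows one way to compute them by hand. It is only an illustration, not the code from the notebook: it assumes that the "average precision score" is the usual average-precision metric, that labels are 0/1, and that there are no tied scores in the average-precision calculation. The function names `roc_auc` and `average_precision` and the four-sample toy data are hypothetical.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tied pair counts as half a correct ranking.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, y_score):
    """Mean of the precision values measured at the rank of each true positive."""
    # Sort samples from highest to lowest predicted score.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision among the top `rank` samples
    if hits == 0:
        raise ValueError("need at least one positive label")
    return total / hits


# Toy data (hypothetical, not taken from the heart disease dataset).
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
toy_ap = average_precision(toy_labels, toy_scores)
```

In this toy case three of the four positive/negative pairs are ranked correctly, so `toy_auc` is 0.75. A model that ranks every diseased patient above every healthy one would score 1.0 on both metrics.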
The Python code can be found in the Jupyter notebook pred.ipynb.
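The notebook itself is not reproduced here. As a rough idea of what such a comparison could look like, the following is a minimal sketch that assumes scikit-learn is available. The helper name `compare_models`, the 70/30 split, the hyperparameters, and the synthetic stand-in data are all assumptions made for illustration, so the scores it produces will not match those reported for the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split


def compare_models(X, y, seed=0):
    """Fit RF and GBM on one train split and score both on the same test split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    # Hyperparameters are illustrative defaults, not the notebook's settings.
    models = {
        "RF": RandomForestClassifier(n_estimators=200, random_state=seed),
        "GBM": GradientBoostingClassifier(random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        # Predicted probability of the positive class (heart disease).
        prob = model.predict_proba(X_test)[:, 1]
        scores[name] = {
            "auc": roc_auc_score(y_test, prob),
            "average_precision": average_precision_score(y_test, prob),
        }
    return scores


# Synthetic stand-in with 13 features, roughly the shape of the clinical table.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)
demo_scores = compare_models(X_demo, y_demo)
for name, result in demo_scores.items():
    print(name, result)
```

To run this on the real data, `X_demo` and `y_demo` would be replaced by the feature columns and the heart disease label loaded from the Kaggle file.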
# Visualization
The last part shows the prediction results from GBM and RF. The black dots represent the test data, and the triangles represent the GBM and RF predictions after training on the training set.
knitr::include_graphics("result.pdf")