A medical doctor wants to predict the probability that a patient is diabetes-positive from several clinical variables, such as number of pregnancies, age, and blood pressure.
## # A tibble: 392 x 10
##    pregnant glucose pressure triceps insulin  mass pedigree   age diabetes   `3`
##       <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl> <fct>    <dbl>
##  1        1      89       66      23      94  28.1    0.167    21 neg          3
##  2        0     137       40      35     168  43.1    2.29     33 pos          3
##  3        3      78       50      32      88  31      0.248    26 pos          3
##  4        2     197       70      45     543  30.5    0.158    53 pos          3
##  5        1     189       60      23     846  30.1    0.398    59 pos          3
##  6        5     166       72      19     175  25.8    0.587    51 pos          3
##  7        0     118       84      47     230  45.8    0.551    31 pos          3
##  8        1     103       30      38      83  43.3    0.183    33 neg          3
##  9        1     115       70      30      96  34.6    0.529    32 pos          3
## 10        3     126       88      41     235  39.3    0.704    27 neg          3
## # ... with 382 more rows
Randomly split the data into a training set (80%, used to build the predictive model) and a test set (20%, used to evaluate it).
##    pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 4         1      89       66      23      94 28.1    0.167  21      neg
## 5         0     137       40      35     168 43.1    2.288  33      pos
## 7         3      78       50      32      88 31.0    0.248  26      pos
## 9         2     197       70      45     543 30.5    0.158  53      pos
## 14        1     189       60      23     846 30.1    0.398  59      pos
## 15        5     166       72      19     175 25.8    0.587  51      pos
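The 80/20 split described above can be sketched in base R as follows. This is only an illustration: `toy` is a synthetic stand-in for the diabetes data, and the names `train_idx`, `train_data`, and `test_data` are assumptions rather than the names used to produce the output shown.

```r
# Minimal sketch of an 80/20 random split in base R.
# `toy` is synthetic illustration data, not the real diabetes dataset.
set.seed(123)  # make the random split reproducible
toy <- data.frame(
  glucose  = round(rnorm(100, mean = 120, sd = 30)),
  age      = sample(21:60, 100, replace = TRUE),
  diabetes = factor(sample(c("neg", "pos"), 100, replace = TRUE))
)

n <- nrow(toy)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))  # rows drawn for training

train_data <- toy[train_idx, ]   # 80%: used to build the model
test_data  <- toy[-train_idx, ]  # remaining 20%: used to evaluate it

nrow(train_data)
nrow(test_data)
```

A plain random draw like this does not preserve the proportion of `neg` and `pos` cases in each subset. If that matters, a stratified split such as `caret::createDataPartition(y, p = 0.8, list = FALSE)` is a common alternative.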
The random forest algorithm combines the outputs of multiple randomly built decision trees to produce the final prediction.
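For classification, "combining the outputs" usually means a majority vote over the trees. The sketch below shows only that voting step, with made-up votes from three hypothetical trees for five patients. The matrix `tree_votes` and the helper `majority_vote` are illustrative names, not part of any package.

```r
# Each row holds one hypothetical tree's predicted class for 5 patients.
tree_votes <- rbind(
  tree1 = c("neg", "pos", "pos", "neg", "pos"),
  tree2 = c("neg", "neg", "pos", "neg", "pos"),
  tree3 = c("pos", "pos", "pos", "neg", "neg")
)

# Majority vote: return the most frequent class among the trees' votes.
majority_vote <- function(votes) names(which.max(table(votes)))

# Apply the vote column by column, i.e. once per patient.
forest_pred <- apply(tree_votes, 2, majority_vote)
forest_pred
```

Using an odd number of trees with two classes avoids ties in this toy setting.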
Automatically select the optimal number of predictor variables randomly sampled as candidates at each split (`mtry`), then fit the final random forest model that best fits our data.
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 8
##
## OOB estimate of error rate: 22.29%
## Confusion matrix:
##     neg pos class.error
## neg 181  29   0.1380952
## pos  41  63   0.3942308
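The `Call:` line above, with `mtry = param$mtry`, is what the caret package typically produces when it refits the best model, so a fit of this kind can plausibly be obtained as sketched below. Several things here are assumptions, not confirmed by the text: that the data are `PimaIndiansDiabetes2` from the mlbench package with missing rows dropped, that 10-fold cross-validation is used for tuning, and the seed value. The packages caret, randomForest, and mlbench are assumed to be installed.

```r
library(caret)         # assumed installed: train(), trainControl(), createDataPartition()
library(randomForest)  # assumed installed: backend used by method = "rf"

# Assumed data source: Pima Indians diabetes data from the mlbench package.
data("PimaIndiansDiabetes2", package = "mlbench")
pima <- na.omit(PimaIndiansDiabetes2)  # keep complete cases only

# Stratified 80/20 split on the outcome.
set.seed(123)
idx <- createDataPartition(pima$diabetes, p = 0.8, list = FALSE)
train_data <- pima[idx, ]
test_data  <- pima[-idx, ]

# Tune mtry by 10-fold cross-validation (an assumed resampling scheme),
# then refit the best model on the whole training set.
set.seed(123)
rf_model <- train(
  diabetes ~ ., data = train_data,
  method = "rf",
  trControl = trainControl(method = "cv", number = 10),
  importance = TRUE
)

rf_model$bestTune    # selected value of mtry
rf_model$finalModel  # the refitted random forest

# Predict classes on the held-out test set and compute the accuracy.
predicted <- predict(rf_model, test_data)
accuracy  <- mean(predicted == test_data$diabetes)
```

The exact error rates depend on the random seed, so they need not reproduce the figures printed above.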
## [1] neg neg pos pos neg neg pos neg neg neg pos pos pos neg pos neg neg pos pos
## [20] pos pos pos pos neg neg neg neg neg pos neg neg neg neg pos pos neg neg neg
## [39] neg pos neg pos neg neg pos neg neg neg pos neg neg neg pos neg pos neg pos
## [58] neg neg neg neg neg pos neg neg pos pos neg neg pos pos neg neg neg pos neg
## [77] neg neg
## Levels: neg pos
Model prediction accuracy rate
## [1] 0.7307692
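The accuracy rate is the share of test observations whose predicted class equals the observed class. A minimal sketch with made-up vectors follows. The vectors `observed` and `predicted` are hypothetical and are not the article's data.

```r
# Hypothetical observed and predicted classes for six test patients.
observed  <- factor(c("neg", "pos", "pos", "neg", "pos", "neg"),
                    levels = c("neg", "pos"))
predicted <- factor(c("neg", "pos", "neg", "neg", "pos", "pos"),
                    levels = c("neg", "pos"))

# Accuracy: proportion of predictions that match the observed class.
accuracy <- mean(predicted == observed)
accuracy

# Cross-tabulation shows where the errors fall.
table(Predicted = predicted, Observed = observed)
```

The value reported above is consistent with 57 correct predictions out of the 78 test observations, since 57 / 78 = 0.7307692.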
Variable importance
##                neg        pos MeanDecreaseAccuracy MeanDecreaseGini
## pregnant 12.038738 -0.8356251            10.292108         8.712386
## glucose  26.081787 21.7942954            32.038582        48.207319
## pressure  2.806192  1.6953171             3.056023         8.922643
## triceps   8.664626  3.7763038             9.448042        11.870341
## insulin   3.111335 12.5118702            11.979180        15.477573
## mass     11.202207  5.2066818            11.753305        16.673158
## pedigree  4.417445  1.7948117             4.285141        13.853196
## age      11.613195  9.7479344            15.818250        15.165869
Boosting builds trees sequentially, so that each new tree helps to correct the errors made by the previously trained trees.
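The idea of each tree correcting its predecessors can be sketched with a bare-bones gradient boosting loop for squared error, using one-split trees (stumps) on synthetic data. This is a simplified illustration in base R, not the xgboost implementation. The helpers `fit_stump` and `predict_stump`, the learning rate, and the number of trees are all illustrative assumptions.

```r
# Synthetic 1-D regression data, for illustration only.
set.seed(1)
x <- seq(0, 10, length.out = 50)
y <- sin(x) + rnorm(50, sd = 0.1)

# Fit a decision stump: one split on x, constant prediction on each side.
fit_stump <- function(x, r) {
  xs   <- sort(unique(x))
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2  # midpoints between neighbours
  sse  <- sapply(cuts, function(cut) {
    left  <- r[x <= cut]
    right <- r[x > cut]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  best <- cuts[which.min(sse)]               # split with the smallest error
  list(cut = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$cut, stump$left, stump$right)
}

learning_rate <- 0.3  # shrinks each tree's contribution
n_trees <- 50
pred <- rep(mean(y), length(y))  # start from a constant model
mse  <- numeric(n_trees)

for (m in seq_len(n_trees)) {
  residual <- y - pred                # errors of the current ensemble
  stump    <- fit_stump(x, residual)  # the new tree is trained on those errors
  pred     <- pred + learning_rate * predict_stump(stump, x)
  mse[m]   <- mean((y - pred)^2)      # training error after m trees
}

mse[c(1, n_trees)]  # training error after the first and the last tree
```

The training error shrinks as trees are added because every stump is fitted to what the ensemble still gets wrong. Production libraries add regularization and other refinements on top of this basic loop.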
## [1] neg neg pos pos neg neg
## Levels: neg pos
Model prediction accuracy rate
## [1] 0.7564103
Variable importance, scaled so that the most important variable scores 100:
## xgbTree variable importance
##
##          Overall
## glucose  100.000
## age       51.147
## insulin   46.261
## mass      26.292
## pedigree  24.164
## pregnant   7.398
## triceps    3.584
## pressure   0.000
##           randomForest BoostingGradient
## Precision    0.8163265        0.8113208
## Recall       0.7692308        0.8269231
## F1           0.7920792        0.8190476
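The three metrics in this table follow directly from confusion-matrix counts. The sketch below recomputes them under two assumptions: that `neg` is treated as the positive class, and that the counts are the ones inferred from the reported values and the 78 test observations. The counts were back-calculated for illustration and are not printed anywhere in the original output. `metrics` is a hypothetical helper.

```r
# Precision, recall and F1 from true positives, false positives, false negatives.
metrics <- function(tp, fp, fn) {
  precision <- tp / (tp + fp)  # share of predicted positives that are correct
  recall    <- tp / (tp + fn)  # share of actual positives that are found
  f1        <- 2 * precision * recall / (precision + recall)  # harmonic mean
  c(Precision = precision, Recall = recall, F1 = f1)
}

# Counts inferred from the reported metrics (assumption: "neg" is the positive class).
rf <- metrics(tp = 40, fp = 9,  fn = 12)  # random forest
gb <- metrics(tp = 43, fp = 10, fn = 9)   # gradient boosting

round(cbind(randomForest = rf, BoostingGradient = gb), 7)
```

Under these assumptions the boosted model has slightly lower precision but higher recall, which gives it the higher F1 score.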