Application of ML techniques: random forest and gradient boosting

Minoo Ashtiani

IQVIA, second interview, second presentation

October 05, 2020

Data preparation

A medical doctor wants to predict the probability that a patient tests positive for diabetes from several clinical variables, e.g. number of pregnancies, age, and blood pressure.

## # A tibble: 392 x 10
##    pregnant glucose pressure triceps insulin  mass pedigree   age diabetes   `3`
##       <dbl>   <dbl>    <dbl>   <dbl>   <dbl> <dbl>    <dbl> <dbl> <fct>    <dbl>
##  1        1      89       66      23      94  28.1    0.167    21 neg          3
##  2        0     137       40      35     168  43.1    2.29     33 pos          3
##  3        3      78       50      32      88  31      0.248    26 pos          3
##  4        2     197       70      45     543  30.5    0.158    53 pos          3
##  5        1     189       60      23     846  30.1    0.398    59 pos          3
##  6        5     166       72      19     175  25.8    0.587    51 pos          3
##  7        0     118       84      47     230  45.8    0.551    31 pos          3
##  8        1     103       30      38      83  43.3    0.183    33 neg          3
##  9        1     115       70      30      96  34.6    0.529    32 pos          3
## 10        3     126       88      41     235  39.3    0.704    27 neg          3
## # ... with 382 more rows

Create train and test data sets

Randomly split the data into a training set (80%, used to build the predictive model) and a test set (20%, used to evaluate the model).
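As an illustration of the 80/20 split, here is a minimal Python sketch. It is not the code behind these slides (the results shown come from R); the function name `train_test_split`, the seed, and the stand-in row list are assumptions made for the example.

```python
import random


def train_test_split(rows, train_fraction=0.8, seed=123):
    """Shuffle row indices and cut them into a training and a test part."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so the split is reproducible
    indices = list(range(len(rows)))
    rng.shuffle(indices)
    n_train = round(train_fraction * len(rows))
    train = [rows[i] for i in indices[:n_train]]
    test = [rows[i] for i in indices[n_train:]]
    return train, test


# Stand-in for the 392 rows of the data set (row numbers only, no real values).
rows = list(range(392))
train, test = train_test_split(rows)
print(len(train), len(test))  # 314 78
```

With 392 rows this yields 314 training rows and 78 test rows, which matches the row counts visible in the outputs further down.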

##    pregnant glucose pressure triceps insulin mass pedigree age diabetes
## 4         1      89       66      23      94 28.1    0.167  21      neg
## 5         0     137       40      35     168 43.1    2.288  33      pos
## 7         3      78       50      32      88 31.0    0.248  26      pos
## 9         2     197       70      45     543 30.5    0.158  53      pos
## 14        1     189       60      23     846 30.1    0.398  59      pos
## 15        5     166       72      19     175 25.8    0.587  51      pos

Computing random forest classifier

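A random forest combines the predictions of many decision trees, each grown on a bootstrap sample of the training data with a random subset of predictors considered at each split. For classification, the final prediction is the class that most trees vote for.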
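For classification, a common way to combine the outputs of the individual trees is a majority vote. The sketch below shows only that combination step; the function name `majority_vote` and the five example votes are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]


# Hypothetical votes of five trees for a single patient.
votes = ["neg", "pos", "pos", "neg", "pos"]
print(majority_vote(votes))  # pos
```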

Fit the random forest model on the training set

Automatically select the optimal number of predictor variables randomly sampled as candidates at each split (mtry), then fit the final random forest model that best fits our data.

  • Final model
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 8
## 
##         OOB estimate of  error rate: 22.29%
## Confusion matrix:
##     neg pos class.error
## neg 181  29   0.1380952
## pos  41  63   0.3942308
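The out-of-bag error rate and the per-class errors reported above follow directly from the confusion matrix. The short Python check below recomputes them, assuming rows are the observed classes and columns the predicted classes; the function names are chosen for this example only.

```python
# Confusion matrix from the output above: outer key = observed, inner key = predicted.
confusion = {
    "neg": {"neg": 181, "pos": 29},
    "pos": {"neg": 41, "pos": 63},
}


def class_error(confusion, observed):
    """Share of cases of one observed class that were misclassified."""
    row = confusion[observed]
    return 1 - row[observed] / sum(row.values())


def overall_error(confusion):
    """Share of all cases that were misclassified."""
    total = sum(sum(row.values()) for row in confusion.values())
    correct = sum(confusion[c][c] for c in confusion)
    return 1 - correct / total


print(round(100 * overall_error(confusion), 2))  # 22.29
print(class_error(confusion, "neg"), class_error(confusion, "pos"))
```

The overall rate is (29 + 41) / 314, and the much higher error for the positive class (41 of 104) shows that the model misses positive cases more often than negative ones.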

Make predictions on the test data

##  [1] neg neg pos pos neg neg pos neg neg neg pos pos pos neg pos neg neg pos pos
## [20] pos pos pos pos neg neg neg neg neg pos neg neg neg neg pos pos neg neg neg
## [39] neg pos neg pos neg neg pos neg neg neg pos neg neg neg pos neg pos neg pos
## [58] neg neg neg neg neg pos neg neg pos pos neg neg pos pos neg neg neg pos neg
## [77] neg neg
## Levels: neg pos

Model prediction accuracy rate

## [1] 0.7307692
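The accuracy rate is the share of test cases whose predicted class equals the observed class. A minimal sketch follows, with made-up labels; the count of 57 correct predictions out of 78 is inferred from the reported rate and is not stated in the slides.

```python
def accuracy(predicted, observed):
    """Fraction of predictions that match the observed class."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    return sum(p == o for p, o in zip(predicted, observed)) / len(observed)


# Hypothetical labels for four patients.
predicted = ["neg", "neg", "pos", "pos"]
observed = ["neg", "pos", "pos", "pos"]
print(accuracy(predicted, observed))  # 0.75

# Inferred, not given in the slides: 57 correct out of 78 test rows.
print(round(57 / 78, 7))  # 0.7307692
```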

Variable importance

##                neg        pos MeanDecreaseAccuracy MeanDecreaseGini
## pregnant 12.038738 -0.8356251            10.292108         8.712386
## glucose  26.081787 21.7942954            32.038582        48.207319
## pressure  2.806192  1.6953171             3.056023         8.922643
## triceps   8.664626  3.7763038             9.448042        11.870341
## insulin   3.111335 12.5118702            11.979180        15.477573
## mass     11.202207  5.2066818            11.753305        16.673158
## pedigree  4.417445  1.7948117             4.285141        13.853196
## age      11.613195  9.7479344            15.818250        15.165869

Gradient boosting approach

Gradient boosting builds trees sequentially, so that each new tree helps to correct the errors made by the previously trained trees.
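The idea of each new tree correcting the errors of the earlier ones can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the boosted model used in these slides: it fits one-split trees (stumps) to the residuals under squared error, on hypothetical toy data, and all names and the learning rate are assumptions.

```python
def fit_stump(xs, residuals):
    """Find the one-split tree (stump) that best fits the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def boost(xs, ys, n_rounds, learning_rate=0.3):
    """Return the training predictions after n_rounds of boosting."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean outcome
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]  # errors made so far
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        preds = [p + learning_rate * (left_mean if x <= threshold else right_mean)
                 for x, p in zip(xs, preds)]
    return preds


def mse(ys, preds):
    """Mean squared error of the predictions."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Toy data (hypothetical): one predictor, 0/1 outcome.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
for rounds in (0, 1, 20):
    print(rounds, round(mse(ys, boost(xs, ys, rounds)), 4))
```

The printed training error shrinks as rounds are added, because every stump is fitted to what the current ensemble still gets wrong. Production libraries add regularisation and a classification loss on top of this basic loop.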

Boosted classification trees

Make predictions on the test data

## [1] neg neg pos pos neg neg
## Levels: neg pos

Model prediction accuracy rate

## [1] 0.7564103

Variable importance, scaled so that the most important variable is 100:

## xgbTree variable importance
## 
##          Overall
## glucose  100.000
## age       51.147
## insulin   46.261
## mass      26.292
## pedigree  24.164
## pregnant   7.398
## triceps    3.584
## pressure   0.000

Comparison between random forest and gradient boosting models

##           randomForest BoostingGradient
## Precision    0.8163265        0.8113208
## Recall       0.7692308        0.8269231
## F1           0.7920792        0.8190476
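F1 is the harmonic mean of precision and recall, so the third row of the table can be checked against the first two. The Python sketch below does that using only the values reported above; the function name `f1_score` and the dictionary layout are choices made for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) taken from the comparison table above.
reported = {
    "randomForest": (0.8163265, 0.7692308, 0.7920792),
    "BoostingGradient": (0.8113208, 0.8269231, 0.8190476),
}

for model, (precision, recall, f1) in reported.items():
    print(model, round(f1_score(precision, recall), 7), f1)
```

The two models have nearly the same precision; the higher F1 of the boosted model comes from its higher recall.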

ROC curve between two models

…