0.1 Distributed Random Forest (DRF): Supervised

Distributed Random Forest (DRF) with H2O

[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html

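Distributed Random Forest (DRF) is a powerful tool for classification and regression.

Given a dataset, DRF builds a forest of classification or regression trees rather than a single classification or regression tree.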
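
To make the difference between a forest and a single tree concrete, the following base R sketch averages the class-1 probabilities of several trees into one forest prediction. The per-tree probabilities and the 0.5 cutoff are invented for illustration; they are not output from H2O, and H2O chooses its own labeling threshold.

```r
# Hypothetical class-1 probabilities from five trees for three observations
# (rows = observations, columns = trees). These numbers are made up.
tree_probs <- matrix(
  c(0.9, 0.8, 0.6, 0.7, 1.0,
    0.2, 0.4, 0.1, 0.6, 0.2,
    0.5, 0.7, 0.3, 0.6, 0.3),
  nrow = 3, byrow = TRUE
)

# The forest prediction is the average over the individual trees.
forest_p1 <- rowMeans(tree_probs)

# Assumed 0.5 cutoff for turning probabilities into class labels.
forest_label <- as.integer(forest_p1 > 0.5)
```

A single tree can be far off for a given observation (the fourth tree in the second row, for example), while the average across trees is more stable.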


0.1.1 packages
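
The chunk that produced the connection message in this section is not shown. A minimal sketch, assuming the `h2o` package and a Java runtime are installed and that a local cluster on the default port is wanted (consistent with `localhost` and `54321` in the output). `connect_h2o` is a hypothetical helper name, not part of H2O.

```r
# connect_h2o() is a hypothetical wrapper, defined here only for illustration.
# It assumes the h2o package and a Java runtime are installed.
connect_h2o <- function(ip = "localhost", port = 54321) {
  # Starts a local H2O cluster, or connects to one already running at ip:port.
  h2o::h2o.init(ip = ip, port = port)
}
```

Calling `connect_h2o()` should print a cluster summary similar to the one in this section.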

## Warning: package 'h2o' was built under R version 4.0.3
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 14 minutes 
##     H2O cluster timezone:       Asia/Seoul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    25 days  
##     H2O cluster name:           H2O_started_from_R_user_uho906 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.96 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.2 (2020-06-22)

0.1.2 data import
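
The import chunk is not shown either. The columns printed in this section match the `cars_20mpg.csv` dataset used in the DRF example of [Reference 1], so the sketch below assumes that file. The URL is the one used in the H2O documentation and is an assumption about this document; `import_cars` is a hypothetical helper name.

```r
# Assumed data source: the cars dataset from the H2O DRF documentation example.
cars_url <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv"

# import_cars() is a hypothetical helper; it needs a running H2O cluster.
import_cars <- function(path = cars_url) {
  cars <- h2o::h2o.importFile(path)
  # Treat the 0/1 response as a factor so that DRF fits a classifier.
  cars["economy_20mpg"] <- h2o::h2o.asfactor(cars["economy_20mpg"])
  cars
}
```

With a cluster running, `cars <- import_cars()` followed by `head(cars)` should show the first rows of the frame.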

##                      name economy cylinders displacement power weight
## 1 AMC Ambassador Brougham    13.0         8          360   175   3821
## 2      AMC Ambassador DPL    15.0         8          390   190   3850
## 3      AMC Ambassador SST    17.0         8          304   150   3672
## 4        AMC Concord DL 6    20.2         6          232    90   3265
## 5          AMC Concord DL    18.1         6          258   120   3410
## 6          AMC Concord DL    23.0         4          151   NaN   3035
##   acceleration year economy_20mpg
## 1         11.0   73             0
## 2          8.5   70             0
## 3         11.5   72             0
## 4         18.2   79             1
## 5         15.1   78             0
## 6         20.5   82             1
## 
## [406 rows x 9 columns]

0.1.4 modeling
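
The training chunk is not shown. The sketch below follows the DRF example in [Reference 1]; the predictor list, split ratio, seed, and tree settings are taken from that example and are assumptions about what was actually run here. `fit_cars_drf` is a hypothetical helper name. The `calibrate_model = TRUE` setting is consistent with the `cal_p0` and `cal_p1` columns in the prediction output.

```r
# fit_cars_drf() is a hypothetical helper. Its settings come from the DRF
# example in the H2O documentation, not from this document.
fit_cars_drf <- function(cars, seed = 1234) {
  predictors <- c("displacement", "power", "weight", "acceleration", "year")
  response <- "economy_20mpg"

  # Hold out roughly 20% of the rows for validation and calibration.
  parts <- h2o::h2o.splitFrame(data = cars, ratios = 0.8, seed = seed)
  train <- parts[[1]]
  valid <- parts[[2]]

  model <- h2o::h2o.randomForest(
    x = predictors,
    y = response,
    training_frame = train,
    validation_frame = valid,
    ntrees = 10,
    max_depth = 5,
    min_rows = 10,
    calibrate_model = TRUE,       # adds calibrated probability columns
    calibration_frame = valid,
    binomial_double_trees = TRUE
  )

  list(model = model, train = train, valid = valid)
}
```

Given an H2OFrame `cars` whose response is a factor, `fit <- fit_cars_drf(cars)` followed by `h2o::h2o.performance(fit$model)` reports the training metrics, which for DRF are computed on out-of-bag samples.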

## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  0.0693541
## RMSE:  0.2633517
## LogLoss:  0.3234257
## Mean Per-Class Error:  0.07899854
## AUC:  0.9618781
## AUCPR:  0.9722646
## Gini:  0.9237563
## R^2:  0.7032277
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      109  12 0.099174  =12/121
## 1       12 192 0.058824  =12/204
## Totals 121 204 0.073846  =24/325
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.604396   0.941176  98
## 2                       max f2  0.314719   0.953289 125
## 3                 max f0point5  0.604396   0.941176  98
## 4                 max accuracy  0.604396   0.926154  98
## 5                max precision  0.953934   0.991935  25
## 6                   max recall  0.023810   1.000000 170
## 7              max specificity  1.000000   0.991736   0
## 8             max absolute_mcc  0.604396   0.842003  98
## 9   max min_per_class_accuracy  0.660194   0.900826  91
## 10 max mean_per_class_accuracy  0.604396   0.921001  98
## 11                     max tns  1.000000 120.000000   0
## 12                     max fns  1.000000 116.000000   0
## 13                     max fps  0.000000 121.000000 177
## 14                     max tps  0.023810 204.000000 170
## 15                     max tnr  1.000000   0.991736   0
## 16                     max fnr  1.000000   0.568627   0
## 17                     max fpr  0.000000   1.000000 177
## 18                     max tpr  0.023810   1.000000 170
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
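
Several of the reported numbers can be checked by hand from the confusion matrix and the scalar metrics. The base R sketch below copies the counts and values from the output above and recomputes the mean per-class error, accuracy, F1, RMSE, and Gini; variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (rows = actual, cols = predicted).
cm <- matrix(c(109, 12,
               12, 192),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# Per-class error = misclassified rows / actual rows of that class.
class_error <- 1 - diag(cm) / rowSums(cm)
mean_per_class_error <- mean(class_error)

# Overall accuracy, and F1 for the positive class "1".
accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["1", "1"] / sum(cm[, "1"])
recall    <- cm["1", "1"] / sum(cm["1", ])
f1        <- 2 * precision * recall / (precision + recall)

# Relations between the scalar metrics reported above.
rmse <- sqrt(0.0693541)      # RMSE = sqrt(MSE)
gini <- 2 * 0.9618781 - 1    # Gini = 2 * AUC - 1
```

Up to rounding, these agree with the reported values: the per-class errors 12/121 and 12/204 average to the Mean Per-Class Error, and the accuracy and F1 match the `max accuracy` and `max f1` rows at threshold 0.604396.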

0.1.5 prediction
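
The prediction chunk is not shown. A minimal sketch, assuming a trained H2O model and an H2OFrame of held-out rows are available; `predict_holdout` is a hypothetical helper name.

```r
# predict_holdout() is a hypothetical helper. `model` is assumed to be a
# trained H2O model and `newdata` an H2OFrame with the same predictor columns.
predict_holdout <- function(model, newdata) {
  # Returns an H2OFrame with the predicted label and class probabilities.
  # Calibrated columns appear only if the model was trained with calibration.
  h2o::h2o.predict(model, newdata = newdata)
}
```

In the result, `predict` is the predicted class, `p0` and `p1` are the class probabilities from the forest, and `cal_p0` and `cal_p1` are their calibrated counterparts.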

##   predict        p0          p1    cal_p0     cal_p1
## 1       1 0.5933097 0.406690294 0.8134390 0.18656100
## 2       0 0.6788790 0.321121008 0.8845413 0.11545873
## 3       0 0.7189684 0.281031570 0.9088956 0.09110437
## 4       0 1.0000000 0.000000000 0.9845012 0.01549877
## 5       0 0.9909091 0.009090909 0.9835605 0.01643949
## 6       0 0.9570183 0.042981711 0.9795329 0.02046715
## 
## [76 rows x 5 columns]