RandomForest

0.1 Distributed Random Forest (DRF): Supervised

Distributed Random Forest (DRF) with H2O

[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html

DRF (Distributed Random Forest) is a powerful tool for classification and regression.

Given a dataset, DRF builds a forest of classification or regression trees rather than a single classification or regression tree.
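
The idea of combining many trees can be illustrated with a minimal base R sketch. The per-tree probabilities below are hypothetical values, not output from H2O, and the 0.5 cutoff is an assumption. The sketch follows the H2O documentation's description that DRF averages the predictions of its trees.

```r
# Hypothetical class-1 probabilities from five individual trees for one row
# (illustrative numbers only, not produced by H2O).
tree_p1 <- c(0.9, 0.6, 0.2, 0.8, 0.7)

# The forest prediction is taken as the average over the trees.
forest_p1 <- mean(tree_p1)

# Assign a class label with an assumed cutoff of 0.5.
forest_class <- ifelse(forest_p1 > 0.5, 1L, 0L)

print(forest_p1)
print(forest_class)
```

A single tree in this toy example (the third one) would vote for class 0, but the averaged forest prediction does not follow it.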

0.1.1 packages

library(h2o)

## Warning: package 'h2o' was built under R version 4.0.3

# Initial setup: start or connect to a local H2O cluster
h2o.init()

##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 14 minutes 
##     H2O cluster timezone:       Asia/Seoul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    25 days  
##     H2O cluster name:           H2O_started_from_R_user_uho906 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.96 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.2 (2020-06-22)

0.1.2 data import

# Import the cars dataset into H2O:
cars <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")

cars

##                      name economy cylinders displacement power weight
## 1 AMC Ambassador Brougham    13.0         8          360   175   3821
## 2      AMC Ambassador DPL    15.0         8          390   190   3850
## 3      AMC Ambassador SST    17.0         8          304   150   3672
## 4        AMC Concord DL 6    20.2         6          232    90   3265
## 5          AMC Concord DL    18.1         6          258   120   3410
## 6          AMC Concord DL    23.0         4          151   NaN   3035
##   acceleration year economy_20mpg
## 1         11.0   73             0
## 2          8.5   70             0
## 3         11.5   72             0
## 4         18.2   79             1
## 5         15.1   78             0
## 6         20.5   82             1
## 
## [406 rows x 9 columns]

0.1.3 data munging

# Set the predictors and response;
# set the response as a factor:
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "economy_20mpg"

# Split the dataset into a train and valid set:
cars_split <- h2o.splitFrame(data = cars, ratios = 0.8, seed = 1234)
train <- cars_split[[1]]
valid <- cars_split[[2]]
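
The H2O documentation notes that h2o.splitFrame uses a probabilistic split rather than an exact one, so ratios = 0.8 only approximates an 80/20 split. The prediction output in section 0.1.5 has 76 rows, which is about 18.7% of the 406 rows. The base R sketch below illustrates the general idea of a per-row random assignment. It is an assumption about the principle, not H2O's internal implementation, and its seed does not reproduce H2O's split.

```r
# Sketch of a probabilistic 80/20 split (illustrates the general idea only;
# this is not H2O's internal code).
set.seed(1234)                     # arbitrary seed for reproducibility
n_rows <- 406                      # row count of the cars frame
to_train <- runif(n_rows) < 0.8    # each row goes to train with probability 0.8

n_train <- sum(to_train)
n_valid <- sum(!to_train)

# The realized share is close to, but usually not exactly, 0.8.
print(c(train = n_train, valid = n_valid))
print(n_train / n_rows)
```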

0.1.4 modeling

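# Build and train the model; calibrate_model = TRUE adds calibrated
# probabilities (cal_p0, cal_p1) to the predictions, fitted on calibration_frame: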
cars_drf <- h2o.randomForest(x = predictors,
                             y = response,
                             ntrees = 10,
                             max_depth = 5,
                             min_rows = 10,
                             calibrate_model = TRUE,
                             calibration_frame = valid,
                             binomial_double_trees = TRUE,
                             training_frame = train,
                             validation_frame = valid)

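# Evaluate performance; with no frame specified, the training metrics are
# returned, which for DRF are computed on out-of-bag samples: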
perf <- h2o.performance(cars_drf)
perf

## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  0.0693541
## RMSE:  0.2633517
## LogLoss:  0.3234257
## Mean Per-Class Error:  0.07899854
## AUC:  0.9618781
## AUCPR:  0.9722646
## Gini:  0.9237563
## R^2:  0.7032277
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      109  12 0.099174  =12/121
## 1       12 192 0.058824  =12/204
## Totals 121 204 0.073846  =24/325
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.604396   0.941176  98
## 2                       max f2  0.314719   0.953289 125
## 3                 max f0point5  0.604396   0.941176  98
## 4                 max accuracy  0.604396   0.926154  98
## 5                max precision  0.953934   0.991935  25
## 6                   max recall  0.023810   1.000000 170
## 7              max specificity  1.000000   0.991736   0
## 8             max absolute_mcc  0.604396   0.842003  98
## 9   max min_per_class_accuracy  0.660194   0.900826  91
## 10 max mean_per_class_accuracy  0.604396   0.921001  98
## 11                     max tns  1.000000 120.000000   0
## 12                     max fns  1.000000 116.000000   0
## 13                     max fps  0.000000 121.000000 177
## 14                     max tps  0.023810 204.000000 170
## 15                     max tnr  1.000000   0.991736   0
## 16                     max fnr  1.000000   0.568627   0
## 17                     max fpr  0.000000   1.000000 177
## 18                     max tpr  0.023810   1.000000 170
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
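
The headline numbers in this report follow directly from the confusion matrix. As a check, the base R sketch below recomputes them from the four printed counts, treating class 1 as the positive class. The variable names are chosen here for illustration.

```r
# Counts copied from the confusion matrix above (rows: actual, cols: predicted).
tn <- 109; fp <- 12
fn <- 12;  tp <- 192

err_0     <- fp / (tn + fp)                       # class 0 error: 12/121
err_1     <- fn / (fn + tp)                       # class 1 error: 12/204
err_total <- (fp + fn) / (tn + fp + fn + tp)      # overall error: 24/325

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)
accuracy  <- 1 - err_total
mean_per_class_error <- (err_0 + err_1) / 2

print(round(c(err_0 = err_0, err_1 = err_1, err_total = err_total), 6))
print(round(c(f1 = f1, accuracy = accuracy,
              mean_per_class_error = mean_per_class_error), 6))
```

The results agree with the printed Error column, the max f1 and max accuracy values at threshold 0.604396, and the Mean Per-Class Error line.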

0.1.5 prediction

# Generate predictions on the validation set:
predict <- h2o.predict(cars_drf, newdata = valid)

predict

##   predict        p0          p1    cal_p0     cal_p1
## 1       1 0.5933097 0.406690294 0.8134390 0.18656100
## 2       0 0.6788790 0.321121008 0.8845413 0.11545873
## 3       0 0.7189684 0.281031570 0.9088956 0.09110437
## 4       0 1.0000000 0.000000000 0.9845012 0.01549877
## 5       0 0.9909091 0.009090909 0.9835605 0.01643949
## 6       0 0.9570183 0.042981711 0.9795329 0.02046715
## 
## [76 rows x 5 columns]
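
The cal_p0 and cal_p1 columns appear because the model was trained with calibrate_model = TRUE, which the H2O documentation describes as Platt scaling. Platt scaling passes the raw score through a fitted sigmoid, so logit(cal_p1) should be a linear function of p1. The base R sketch below checks this on the six printed rows only. The straight-line fit is a reconstruction for illustration, not the coefficients stored in the H2O model.

```r
# First six prediction rows, copied from the printed output above.
preds <- data.frame(
  p1     = c(0.406690294, 0.321121008, 0.281031570,
             0.000000000, 0.009090909, 0.042981711),
  cal_p1 = c(0.18656100, 0.11545873, 0.09110437,
             0.01549877, 0.01643949, 0.02046715)
)

# Platt scaling models logit(cal_p1) as a linear function of the raw score,
# so a straight-line fit on the logit scale should be nearly exact.
fit <- lm(qlogis(cal_p1) ~ p1, data = preds)

print(coef(fit))                  # intercept and slope on the logit scale
print(max(abs(residuals(fit))))   # near zero if the relation is linear

# Re-apply the fitted sigmoid to reproduce the calibrated column.
cal_hat <- plogis(coef(fit)[1] + coef(fit)[2] * preds$p1)
print(round(cbind(printed = preds$cal_p1, refit = cal_hat), 6))
```

Because the fitted slope is positive, calibration changes the probability values but not the ranking of the rows.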