Distributed Random Forest (DRF) with H2O
[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/drf.html
DRF (Distributed Random Forest) is a powerful tool for both classification and regression.
Given a dataset, DRF builds a forest of classification or regression trees rather than a single classification or regression tree.
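The connection summary printed next comes from loading the h2o package and connecting to a cluster; the chunk that produced it is not shown in the original. A minimal sketch, assuming a local cluster with default settings (the exact arguments the author used are unknown):

```r
# Load the H2O R package (assumes it is already installed).
library(h2o)

# Start a local H2O cluster, or connect to one that is already running.
# The defaults (ip = "localhost", port = 54321) match the connection
# summary shown in this document.
h2o.init()
```

Calling `h2o.init()` again in the same session reconnects to the running cluster instead of starting a new one, which is consistent with the multi-hour uptime reported in the summary.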
## Warning: package 'h2o' was built under R version 4.0.3
## Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         5 hours 14 minutes
##     H2O cluster timezone:       Asia/Seoul
##     H2O data parsing timezone:  UTC
##     H2O cluster version:        3.32.0.1
##     H2O cluster version age:    25 days
##     H2O cluster name:           H2O_started_from_R_user_uho906
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   3.96 GB
##     H2O cluster total cores:    4
##     H2O cluster allowed cores:  4
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     H2O Internal Security:      FALSE
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
##     R Version:                  R version 4.0.2 (2020-06-22)
# Import the cars dataset into H2O:
cars <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/junit/cars_20mpg.csv")
##                      name economy cylinders displacement power weight
## 1 AMC Ambassador Brougham    13.0         8          360   175   3821
## 2      AMC Ambassador DPL    15.0         8          390   190   3850
## 3      AMC Ambassador SST    17.0         8          304   150   3672
## 4        AMC Concord DL 6    20.2         6          232    90   3265
## 5          AMC Concord DL    18.1         6          258   120   3410
## 6          AMC Concord DL    23.0         4          151   NaN   3035
##   acceleration year economy_20mpg
## 1         11.0   73             0
## 2          8.5   70             0
## 3         11.5   72             0
## 4         18.2   79             1
## 5         15.1   78             0
## 6         20.5   82             1
##
## [406 rows x 9 columns]
# Set the predictors and response;
# set the response as a factor:
cars["economy_20mpg"] <- as.factor(cars["economy_20mpg"])
predictors <- c("displacement", "power", "weight", "acceleration", "year")
response <- "economy_20mpg"
# Split the dataset into a train and valid set:
cars_split <- h2o.splitFrame(data = cars, ratios = 0.8, seed = 1234)
train <- cars_split[[1]]
valid <- cars_split[[2]]
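# Build and train the model:
cars_drf <- h2o.randomForest(
  x = predictors,
  y = response,
  ntrees = 10,                   # number of trees in the forest
  max_depth = 5,                 # maximum depth of each tree
  min_rows = 10,                 # minimum number of observations in a leaf
  calibrate_model = TRUE,        # also return calibrated class probabilities
  calibration_frame = valid,     # data used to fit the calibration
  binomial_double_trees = TRUE,  # binary response: build one tree per class
  training_frame = train,
  validation_frame = valid
)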
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 0.0693541
## RMSE: 0.2633517
## LogLoss: 0.3234257
## Mean Per-Class Error: 0.07899854
## AUC: 0.9618781
## AUCPR: 0.9722646
## Gini: 0.9237563
## R^2: 0.7032277
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      109  12 0.099174  =12/121
## 1       12 192 0.058824  =12/204
## Totals 121 204 0.073846  =24/325
##
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
##  1                      max f1  0.604396   0.941176  98
##  2                      max f2  0.314719   0.953289 125
##  3                max f0point5  0.604396   0.941176  98
##  4                max accuracy  0.604396   0.926154  98
##  5               max precision  0.953934   0.991935  25
##  6                  max recall  0.023810   1.000000 170
##  7             max specificity  1.000000   0.991736   0
##  8            max absolute_mcc  0.604396   0.842003  98
##  9  max min_per_class_accuracy  0.660194   0.900826  91
## 10 max mean_per_class_accuracy  0.604396   0.921001  98
## 11                     max tns  1.000000 120.000000   0
## 12                     max fns  1.000000 116.000000   0
## 13                     max fps  0.000000 121.000000 177
## 14                     max tps  0.023810 204.000000 170
## 15                     max tnr  1.000000   0.991736   0
## 16                     max fnr  1.000000   0.568627   0
## 17                     max fpr  0.000000   1.000000 177
## 18                     max tpr  0.023810   1.000000 170
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
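Several of the numbers in the report above can be reproduced by hand from the confusion matrix, which helps when reading the printout. The following base-R sketch copies the four cell counts and two reported values (MSE and AUC) from the output above; the variable names are chosen here for illustration only.

```r
# Cell counts copied from the confusion matrix at the F1-optimal threshold
# (rows: actual class, columns: predicted class).
tn <- 109; fp <- 12   # actual 0
fn <- 12;  tp <- 192  # actual 1

accuracy  <- (tp + tn) / (tp + tn + fp + fn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Mean per-class error: average of the two row-wise error rates.
mean_per_class_error <- (fp / (tn + fp) + fn / (fn + tp)) / 2

# Identities linking other reported values.
gini <- 2 * 0.9618781 - 1   # Gini = 2 * AUC - 1
rmse <- sqrt(0.0693541)     # RMSE = sqrt(MSE)

print(round(c(accuracy = accuracy, f1 = f1,
              mean_per_class_error = mean_per_class_error,
              gini = gini, rmse = rmse), 6))
```

Up to rounding, these agree with the `max accuracy`, `max f1`, `Mean Per-Class Error`, `Gini`, and `RMSE` entries in the report. Note that all of these are out-of-bag training metrics, not validation metrics.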
# Generate predictions on a validation set (if necessary):
# (stored as `pred` to avoid reusing the name of the generic predict() function)
pred <- h2o.predict(cars_drf, newdata = valid)
##   predict        p0          p1    cal_p0     cal_p1
## 1       1 0.5933097 0.406690294 0.8134390 0.18656100
## 2       0 0.6788790 0.321121008 0.8845413 0.11545873
## 3       0 0.7189684 0.281031570 0.9088956 0.09110437
## 4       0 1.0000000 0.000000000 0.9845012 0.01549877
## 5       0 0.9909091 0.009090909 0.9835605 0.01643949
## 6       0 0.9570183 0.042981711 0.9795329 0.02046715
##
## [76 rows x 5 columns]
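The `cal_p0` and `cal_p1` columns are present because the model was trained with `calibrate_model = TRUE`. H2O's documentation describes this option as Platt scaling: a logistic regression, fitted on the calibration frame, that maps the model's raw class probabilities to calibrated ones. The base-R sketch below illustrates the idea on synthetic data. The variable names and the data-generating rule are assumptions made for illustration, and this is not H2O's internal implementation.

```r
# Base-R sketch of Platt scaling on synthetic data (illustration only).
set.seed(1234)
n <- 200

# Hypothetical raw class-1 probabilities from some classifier.
raw_p1 <- runif(n)

# Synthetic labels: the true positive rate is raw_p1^2, so the raw
# probabilities overstate the chance of class 1.
y <- rbinom(n, size = 1, prob = raw_p1^2)

# Platt scaling: a one-variable logistic regression of the observed label
# on the raw score, fitted on held-out (calibration) data.
platt <- glm(y ~ raw_p1, family = binomial())

# Map new raw scores to calibrated probabilities.
new_scores <- data.frame(raw_p1 = c(0.1, 0.5, 0.9))
cal_p1 <- predict(platt, newdata = new_scores, type = "response")
cal_p0 <- 1 - cal_p1

print(cbind(new_scores, cal_p0 = cal_p0, cal_p1 = cal_p1))
```

Because the mapping is a monotone function of the raw score, calibration changes the probability values but not the ranking of rows, so the calibrated columns are mainly useful when the probabilities themselves are interpreted.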