AutoML with the H2O package
[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html#install-in-r
[Reference 2] https://r-kor.org/wp-content/uploads/2018/08/h2o__v3.pdf
[Reference 3] https://github.com/DarrenCook/h2o/
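# Loading the package and starting a local H2O cluster produces the connection
# banner below. The init chunk itself is not shown, so this is a minimal
# sketch assuming default options:
library(h2o)
h2o.init() # starts (or connects to) a local cluster on localhost:54321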
## Warning: package 'h2o' was built under R version 4.0.3
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 39 minutes
## H2O cluster timezone: Asia/Seoul
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.1
## H2O cluster version age: 20 days
## H2O cluster name: H2O_started_from_R_user_gax063
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.33 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.2 (2020-06-22)
#Import a sample binary outcome train/test set into H2O
#If the import fails: first, try the http:// URL; second, download the file from the site and load it with fread(), then convert it with as.h2o()
#train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
#test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
#train <- fread("higgs_train_10k.csv")
#train <- as.h2o(train)
#test <- fread("higgs_test_5k.csv")
#test <- as.h2o(test)
train <- h2o.importFile("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("http://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
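# As a sanity check, the class balance of the binary response can be inspected
# with h2o.table (sketch, output not shown):
# h2o.table(train[, y])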
# Run AutoML for 10 base models (limited to 1 hour max runtime by default)
aml <- h2o.automl(x = x, y = y,
training_frame = train,
max_models = 10, # the H2O docs example uses 20
seed = 1)
##
## 09:42:26.98: AutoML: XGBoost is not available; skipping it.
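# AutoML can also be bounded by wall-clock time instead of model count. A
# minimal sketch (the 10-minute budget here is illustrative, not from the
# original run):
# aml_timed <- h2o.automl(x = x, y = y,
#                         training_frame = train,
#                         max_runtime_secs = 600, # stop after 10 minutes
#                         seed = 1)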
# View the AutoML Leaderboard
lb <- aml@leaderboard
print(lb, n = nrow(lb)) # Print all rows instead of default (6 rows)
## model_id auc logloss
## 1 StackedEnsemble_AllModels_AutoML_20201029_094226 0.7862690 0.5556553
## 2 StackedEnsemble_BestOfFamily_AutoML_20201029_094226 0.7839050 0.5578148
## 3 GBM_5_AutoML_20201029_094226 0.7808615 0.5597083
## 4 GBM_1_AutoML_20201029_094226 0.7789967 0.5615896
## 5 GBM_2_AutoML_20201029_094226 0.7783378 0.5615272
## 6 GBM_grid__1_AutoML_20201029_094226_model_1 0.7778602 0.5646552
## 7 GBM_3_AutoML_20201029_094226 0.7763885 0.5639063
## 8 GBM_4_AutoML_20201029_094226 0.7707584 0.5709119
## 9 DRF_1_AutoML_20201029_094226 0.7651511 0.5802457
## 10 XRT_1_AutoML_20201029_094226 0.7651343 0.5821720
## 11 DeepLearning_1_AutoML_20201029_094226 0.7066764 0.6266720
## 12 GLM_1_AutoML_20201029_094226 0.6826481 0.6385205
## aucpr mean_per_class_error rmse mse
## 1 0.8042508 0.3303502 0.4341678 0.1885017
## 2 0.8019306 0.3214664 0.4350932 0.1893061
## 3 0.7991448 0.3253993 0.4360831 0.1901685
## 4 0.7972760 0.3266970 0.4370027 0.1909714
## 5 0.7964971 0.3298046 0.4371993 0.1911432
## 6 0.7953585 0.3337600 0.4380880 0.1919211
## 7 0.7940316 0.3280652 0.4382742 0.1920843
## 8 0.7911179 0.3537433 0.4416805 0.1950817
## 9 0.7836572 0.3404911 0.4452915 0.1982845
## 10 0.7832421 0.3491710 0.4460132 0.1989278
## 11 0.7147696 0.3871564 0.4669705 0.2180615
## 12 0.6807151 0.3972341 0.4726827 0.2234290
##
## [12 rows x 7 columns]
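# Printing the leader model (stored in aml@leader) produces the detailed
# summary below:
aml@leader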
## Model Details:
## ==============
##
## H2OBinomialModel: stackedensemble
## Model ID: StackedEnsemble_AllModels_AutoML_20201029_094226
## Number of Base Models: 10
##
## Base Models (count by algorithm type):
##
## deeplearning drf gbm glm
## 1 2 6 1
##
## Metalearner:
##
## Metalearner algorithm: glm
## Metalearner cross-validation fold assignment:
## Fold assignment scheme: AUTO
## Number of folds: 5
## Fold column: NULL
## Metalearner hyperparameters:
##
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 0.07866061
## RMSE: 0.280465
## LogLoss: 0.2991412
## Mean Per-Class Error: 0.06808661
## AUC: 0.9843968
## AUCPR: 0.9860152
## Gini: 0.9687936
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 4298 407 0.086504 =407/4705
## 1 263 5032 0.049669 =263/5295
## Totals 4561 5439 0.067000 =670/10000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.506144 0.937582 203
## 2 max f2 0.388737 0.960679 243
## 3 max f0point5 0.616250 0.943439 161
## 4 max accuracy 0.506144 0.933000 203
## 5 max precision 0.942976 1.000000 0
## 6 max recall 0.240237 1.000000 308
## 7 max specificity 0.942976 1.000000 0
## 8 max absolute_mcc 0.506144 0.865665 203
## 9 max min_per_class_accuracy 0.535633 0.929862 192
## 10 max mean_per_class_accuracy 0.506144 0.931913 203
## 11 max tns 0.942976 4705.000000 0
## 12 max fns 0.942976 5292.000000 0
## 13 max fps 0.054423 4705.000000 399
## 14 max tps 0.240237 5295.000000 308
## 15 max tnr 0.942976 1.000000 0
## 16 max fnr 0.942976 0.999433 0
## 17 max fpr 0.054423 1.000000 399
## 18 max tpr 0.240237 1.000000 308
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1885017
## RMSE: 0.4341678
## LogLoss: 0.5556553
## Mean Per-Class Error: 0.3303502
## AUC: 0.786269
## AUCPR: 0.8042508
## Gini: 0.572538
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 2110 2595 0.551541 =2595/4705
## 1 578 4717 0.109160 =578/5295
## Totals 2688 7312 0.317300 =3173/10000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.324731 0.748314 284
## 2 max f2 0.159562 0.860017 358
## 3 max f0point5 0.641925 0.738875 139
## 4 max accuracy 0.533394 0.709100 192
## 5 max precision 0.942302 1.000000 0
## 6 max recall 0.083961 1.000000 392
## 7 max specificity 0.942302 1.000000 0
## 8 max absolute_mcc 0.575882 0.422902 172
## 9 max min_per_class_accuracy 0.528311 0.707758 194
## 10 max mean_per_class_accuracy 0.575882 0.711301 172
## 11 max tns 0.942302 4705.000000 0
## 12 max fns 0.942302 5293.000000 0
## 13 max fps 0.056529 4705.000000 399
## 14 max tps 0.083961 5295.000000 392
## 15 max tnr 0.942302 1.000000 0
## 16 max fnr 0.942302 0.999622 0
## 17 max fpr 0.056529 1.000000 399
## 18 max tpr 0.083961 1.000000 392
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
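# Following the hint printed above, the gains/lift table for the leader on the
# test frame can be extracted directly (sketch, not run here):
# h2o.gainsLift(aml@leader, test)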
# To generate predictions on a test set, you can make predictions
# directly on the `"H2OAutoML"` object or on the leader model object
pred <- h2o.predict(aml, test) # predict(aml, test) also works
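# `pred` holds the predicted class plus the class probabilities. Predicting
# with the leader model gives the same result (sketch, output not shown):
# head(pred)
# pred_leader <- h2o.predict(aml@leader, test)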
# Get the leaderboard with extra_columns = 'ALL'
lb <- h2o.get_leaderboard(object = aml, extra_columns = 'ALL')
lb
## model_id auc logloss
## 1 StackedEnsemble_AllModels_AutoML_20201029_094226 0.7862690 0.5556553
## 2 StackedEnsemble_BestOfFamily_AutoML_20201029_094226 0.7839050 0.5578148
## 3 GBM_5_AutoML_20201029_094226 0.7808615 0.5597083
## 4 GBM_1_AutoML_20201029_094226 0.7789967 0.5615896
## 5 GBM_2_AutoML_20201029_094226 0.7783378 0.5615272
## 6 GBM_grid__1_AutoML_20201029_094226_model_1 0.7778602 0.5646552
## aucpr mean_per_class_error rmse mse training_time_ms
## 1 0.8042508 0.3303502 0.4341678 0.1885017 622
## 2 0.8019306 0.3214664 0.4350932 0.1893061 382
## 3 0.7991448 0.3253993 0.4360831 0.1901685 666
## 4 0.7972760 0.3266970 0.4370027 0.1909714 532
## 5 0.7964971 0.3298046 0.4371993 0.1911432 560
## 6 0.7953585 0.3337600 0.4380880 0.1919211 456
## predict_time_per_row_ms
## 1 0.038802
## 2 0.017179
## 3 0.008661
## 4 0.005583
## 5 0.005582
## 6 0.004339
##
## [12 rows x 9 columns]
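# To wrap up, the leader can be scored on the held-out test frame, saved to
# disk, and the cluster shut down. The save path is a placeholder (sketch):
# perf <- h2o.performance(aml@leader, newdata = test)
# h2o.auc(perf) # CV AUC above was ~0.786; test AUC should be similar
# h2o.saveModel(aml@leader, path = "./h2o_models")
# h2o.shutdown(prompt = FALSE) # stop the local cluster when done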