AutoML with the H2O package
[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/downloading.html#install-in-r
[Reference 2] https://r-kor.org/wp-content/uploads/2018/08/h2o__v3.pdf
[Reference 3] https://github.com/DarrenCook/h2o/
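# Loading the package and starting a local H2O cluster produces the connection
# banner below. The init chunk itself is not shown, so this is a minimal
# sketch assuming default options:
library(h2o)
h2o.init() # starts (or connects to) a local cluster on localhost:54321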
## Warning: package 'h2o' was built under R version 4.0.3
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 1 days 39 minutes
## H2O cluster timezone: Asia/Seoul
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.1
## H2O cluster version age: 20 days
## H2O cluster name: H2O_started_from_R_user_gax063
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.33 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.2 (2020-06-22)
#Import a sample binary outcome train/test set into H2O
#If the import fails: first, try the http:// URL; second, download the file from the site and load it with fread(), then convert it with as.h2o()
#train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
#test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
#train <- fread("higgs_train_10k.csv")
#train <- as.h2o(train)
#test <- fread("higgs_test_5k.csv")
#test <- as.h2o(test)
train <- h2o.importFile("http://s3.amazonaws.com/erin-data/higgs/higgs_train_10k.csv")
test <- h2o.importFile("http://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
# Identify predictors and response
y <- "response"
x <- setdiff(names(train), y)
# For binary classification, response should be a factor
train[, y] <- as.factor(train[, y])
test[, y] <- as.factor(test[, y])
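# As a sanity check, the class balance of the binary response can be inspected
# with h2o.table (sketch, output not shown):
# h2o.table(train[, y])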
# Run AutoML for 10 base models (limited to 1 hour max runtime by default)
aml <- h2o.automl(x = x, y = y,
training_frame = train,
max_models = 10, # the H2O docs example uses 20
seed = 1)
##
## 09:42:26.98: AutoML: XGBoost is not available; skipping it.
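# AutoML can also be bounded by wall-clock time instead of model count. A
# minimal sketch (the 10-minute budget here is illustrative, not from the
# original run):
# aml_timed <- h2o.automl(x = x, y = y,
#                         training_frame = train,
#                         max_runtime_secs = 600, # stop after 10 minutes
#                         seed = 1)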
# View the AutoML Leaderboard
lb <- aml@leaderboard
print(lb, n = nrow(lb)) # Print all rows instead of default (6 rows)
## model_id auc logloss
## 1 StackedEnsemble_AllModels_AutoML_20201029_094226 0.7862690 0.5556553
## 2 StackedEnsemble_BestOfFamily_AutoML_20201029_094226 0.7839050 0.5578148
## 3 GBM_5_AutoML_20201029_094226 0.7808615 0.5597083
## 4 GBM_1_AutoML_20201029_094226 0.7789967 0.5615896
## 5 GBM_2_AutoML_20201029_094226 0.7783378 0.5615272
## 6 GBM_grid__1_AutoML_20201029_094226_model_1 0.7778602 0.5646552
## 7 GBM_3_AutoML_20201029_094226 0.7763885 0.5639063
## 8 GBM_4_AutoML_20201029_094226 0.7707584 0.5709119
## 9 DRF_1_AutoML_20201029_094226 0.7651511 0.5802457
## 10 XRT_1_AutoML_20201029_094226 0.7651343 0.5821720
## 11 DeepLearning_1_AutoML_20201029_094226 0.7066764 0.6266720
## 12 GLM_1_AutoML_20201029_094226 0.6826481 0.6385205
## aucpr mean_per_class_error rmse mse
## 1 0.8042508 0.3303502 0.4341678 0.1885017
## 2 0.8019306 0.3214664 0.4350932 0.1893061
## 3 0.7991448 0.3253993 0.4360831 0.1901685
## 4 0.7972760 0.3266970 0.4370027 0.1909714
## 5 0.7964971 0.3298046 0.4371993 0.1911432
## 6 0.7953585 0.3337600 0.4380880 0.1919211
## 7 0.7940316 0.3280652 0.4382742 0.1920843
## 8 0.7911179 0.3537433 0.4416805 0.1950817
## 9 0.7836572 0.3404911 0.4452915 0.1982845
## 10 0.7832421 0.3491710 0.4460132 0.1989278
## 11 0.7147696 0.3871564 0.4669705 0.2180615
## 12 0.6807151 0.3972341 0.4726827 0.2234290
##
## [12 rows x 7 columns]
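# Printing the leader model (stored in aml@leader) produces the detailed
# summary below:
aml@leader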
## Model Details:
## ==============
##
## H2OBinomialModel: stackedensemble
## Model ID: StackedEnsemble_AllModels_AutoML_20201029_094226
## Number of Base Models: 10
##
## Base Models (count by algorithm type):
##
## deeplearning drf gbm glm
## 1 2 6 1
##
## Metalearner:
##
## Metalearner algorithm: glm
## Metalearner cross-validation fold assignment:
## Fold assignment scheme: AUTO
## Number of folds: 5
## Fold column: NULL
## Metalearner hyperparameters:
##
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on training data. **
##
## MSE: 0.07866061
## RMSE: 0.280465
## LogLoss: 0.2991412
## Mean Per-Class Error: 0.06808661
## AUC: 0.9843968
## AUCPR: 0.9860152
## Gini: 0.9687936
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 4298 407 0.086504 =407/4705
## 1 263 5032 0.049669 =263/5295
## Totals 4561 5439 0.067000 =670/10000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.506144 0.937582 203
## 2 max f2 0.388737 0.960679 243
## 3 max f0point5 0.616250 0.943439 161
## 4 max accuracy 0.506144 0.933000 203
## 5 max precision 0.942976 1.000000 0
## 6 max recall 0.240237 1.000000 308
## 7 max specificity 0.942976 1.000000 0
## 8 max absolute_mcc 0.506144 0.865665 203
## 9 max min_per_class_accuracy 0.535633 0.929862 192
## 10 max mean_per_class_accuracy 0.506144 0.931913 203
## 11 max tns 0.942976 4705.000000 0
## 12 max fns 0.942976 5292.000000 0
## 13 max fps 0.054423 4705.000000 399
## 14 max tps 0.240237 5295.000000 308
## 15 max tnr 0.942976 1.000000 0
## 16 max fnr 0.942976 0.999433 0
## 17 max fpr 0.054423 1.000000 399
## 18 max tpr 0.240237 1.000000 308
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: stackedensemble
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1885017
## RMSE: 0.4341678
## LogLoss: 0.5556553
## Mean Per-Class Error: 0.3303502
## AUC: 0.786269
## AUCPR: 0.8042508
## Gini: 0.572538
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 2110 2595 0.551541 =2595/4705
## 1 578 4717 0.109160 =578/5295
## Totals 2688 7312 0.317300 =3173/10000
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.324731 0.748314 284
## 2 max f2 0.159562 0.860017 358
## 3 max f0point5 0.641925 0.738875 139
## 4 max accuracy 0.533394 0.709100 192
## 5 max precision 0.942302 1.000000 0
## 6 max recall 0.083961 1.000000 392
## 7 max specificity 0.942302 1.000000 0
## 8 max absolute_mcc 0.575882 0.422902 172
## 9 max min_per_class_accuracy 0.528311 0.707758 194
## 10 max mean_per_class_accuracy 0.575882 0.711301 172
## 11 max tns 0.942302 4705.000000 0
## 12 max fns 0.942302 5293.000000 0
## 13 max fps 0.056529 4705.000000 399
## 14 max tps 0.083961 5295.000000 392
## 15 max tnr 0.942302 1.000000 0
## 16 max fnr 0.942302 0.999622 0
## 17 max fpr 0.056529 1.000000 399
## 18 max tpr 0.083961 1.000000 392
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
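# Following the hint printed above, the gains/lift table for the leader on the
# test frame can be extracted directly (sketch, not run here):
# h2o.gainsLift(aml@leader, test)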
# To generate predictions on a test set, you can make predictions
# directly on the `"H2OAutoML"` object or on the leader model object
pred <- h2o.predict(aml, test) # predict(aml, test) also works
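# `pred` holds the predicted class plus the class probabilities. Predicting
# with the leader model gives the same result (sketch, output not shown):
# head(pred)
# pred_leader <- h2o.predict(aml@leader, test)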
# Get the leaderboard with extra_columns = 'ALL'
lb <- h2o.get_leaderboard(object = aml, extra_columns = 'ALL')
lb
## model_id auc logloss
## 1 StackedEnsemble_AllModels_AutoML_20201029_094226 0.7862690 0.5556553
## 2 StackedEnsemble_BestOfFamily_AutoML_20201029_094226 0.7839050 0.5578148
## 3 GBM_5_AutoML_20201029_094226 0.7808615 0.5597083
## 4 GBM_1_AutoML_20201029_094226 0.7789967 0.5615896
## 5 GBM_2_AutoML_20201029_094226 0.7783378 0.5615272
## 6 GBM_grid__1_AutoML_20201029_094226_model_1 0.7778602 0.5646552
## aucpr mean_per_class_error rmse mse training_time_ms
## 1 0.8042508 0.3303502 0.4341678 0.1885017 622
## 2 0.8019306 0.3214664 0.4350932 0.1893061 382
## 3 0.7991448 0.3253993 0.4360831 0.1901685 666
## 4 0.7972760 0.3266970 0.4370027 0.1909714 532
## 5 0.7964971 0.3298046 0.4371993 0.1911432 560
## 6 0.7953585 0.3337600 0.4380880 0.1919211 456
## predict_time_per_row_ms
## 1 0.038802
## 2 0.017179
## 3 0.008661
## 4 0.005583
## 5 0.005582
## 6 0.004339
##
## [12 rows x 9 columns]
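# To wrap up, the leader can be scored on the held-out test frame, saved to
# disk, and the cluster shut down. The save path is a placeholder (sketch):
# perf <- h2o.performance(aml@leader, newdata = test)
# h2o.auc(perf) # CV AUC above was ~0.786; test AUC should be similar
# h2o.saveModel(aml@leader, path = "./h2o_models")
# h2o.shutdown(prompt = FALSE) # stop the local cluster when done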