0.1 Gradient Boosting Machine (GBM) - Supervised

Gradient Boosting Machine (GBM) with H2O

[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html

Gradient Boosting Machine (for both regression and classification) is a forward-learning ensemble method: trees are added one at a time, each new tree fitted to correct the errors of the ensemble built so far.
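As a brief sketch in standard gradient-boosting notation (not quoted from the H2O documentation above), the ensemble is built additively: at stage $m$ a small tree $h_m$ is fitted to the pseudo-residuals (the negative gradient of the loss at the current model) and added with a learning rate $\nu$:

$$
F_m(x) = F_{m-1}(x) + \nu \, h_m(x)
$$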


0.1.1 packages
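The cluster information below is produced when H2O is initialized from R. A minimal sketch of the setup chunk (the chunk is not echoed in this document, so the exact call is an assumption):

# Load h2o and start (or connect to) a local H2O cluster.
# nthreads = -1 uses all available cores; max_mem_size = "4G" is an
# assumption chosen to roughly match the ~3.87 GB reported below.
library(h2o)

h2o.init(nthreads = -1, max_mem_size = "4G")

# Print cluster status (version, memory, cores), as in the output below.
h2o.clusterInfo()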

## Warning: package 'h2o' was built under R version 4.0.3
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         5 hours 5 minutes 
##     H2O cluster timezone:       Asia/Seoul 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    1 month and 2 days  
##     H2O cluster name:           H2O_started_from_R_user_jna970 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.87 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.2 (2020-06-22)

0.1.2 data import
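The data are the H2O prostate example (380 rows, 9 columns, binary target CAPSULE). A sketch of the import step; the public S3 URL is an assumption, and the original may have read a local CSV instead:

# Import the prostate data into the H2O cluster as an H2OFrame.
prostate_path <- "https://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv"
prostate <- h2o.importFile(path = prostate_path, destination_frame = "prostate.hex")

# Printing the frame shows the first rows plus the dimensions, as below.
prostate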

##   ID CAPSULE AGE RACE DPROS DCAPS  PSA  VOL GLEASON
## 1  1       0  65    1     2     1  1.4  0.0       6
## 2  2       0  72    1     3     2  6.7  0.0       7
## 3  3       0  70    1     1     2  4.9  0.0       6
## 4  4       0  76    2     2     1 51.2 20.0       7
## 5  5       0  69    1     1     1 12.3 55.9       6
## 6  6       1  71    1     3     2  3.3  0.0       8
## 
## [380 rows x 9 columns]

0.1.4 modeling (& k-fold CV)
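The summary below corresponds to a binomial GBM with 50 trees of depth 5 and 5-fold cross-validation. A reconstruction of the modeling call, with parameter values inferred from the printed summary (the seed and the exact argument list are assumptions):

# CAPSULE is 0/1; convert it to a factor so H2O fits a binomial (classification) GBM.
prostate$CAPSULE <- as.factor(prostate$CAPSULE)

# Predictors: all remaining columns. The variable-importance table below shows
# that ID was kept as a predictor, so it is not dropped here.
predictors <- setdiff(names(prostate), "CAPSULE")

gbm_model <- h2o.gbm(
  x = predictors,
  y = "CAPSULE",
  training_frame = prostate,
  ntrees = 50,            # matches "number_of_trees: 50" in the summary
  max_depth = 5,          # matches min/max/mean depth of 5
  nfolds = 5,             # 5-fold cross-validation, as reported below
  keep_cross_validation_predictions = TRUE,  # needed for combined holdout metrics
  seed = 1234             # assumption: some seed was set for reproducibility
)

# Print training and cross-validation metrics, scoring history and variable importances.
gbm_model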

## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model Key:  GBM_model_R_1605053712803_3812 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1              50                       50               12171         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          7         24    14.74000
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.0637863
## RMSE:  0.2525595
## LogLoss:  0.2444918
## Mean Per-Class Error:  0.05819009
## AUC:  0.9882526
## AUCPR:  0.9834103
## Gini:  0.9765051
## R^2:  0.7347977
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      208  19 0.083700  =19/227
## 1        5 148 0.032680   =5/153
## Totals 213 167 0.063158  =24/380
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.369290   0.925000 166
## 2                       max f2  0.328487   0.955414 172
## 3                 max f0point5  0.570988   0.947137 131
## 4                 max accuracy  0.461767   0.936842 150
## 5                max precision  0.974091   1.000000   0
## 6                   max recall  0.271605   1.000000 200
## 7              max specificity  0.974091   1.000000   0
## 8             max absolute_mcc  0.369290   0.873124 166
## 9   max min_per_class_accuracy  0.430389   0.929515 158
## 10 max mean_per_class_accuracy  0.369290   0.941810 166
## 11                     max tns  0.974091 227.000000   0
## 12                     max fns  0.974091 152.000000   0
## 13                     max fps  0.009748 227.000000 377
## 14                     max tps  0.271605 153.000000 200
## 15                     max tnr  0.974091   1.000000   0
## 16                     max fnr  0.974091   0.993464   0
## 17                     max fpr  0.009748   1.000000 377
## 18                     max tpr  0.271605   1.000000 200
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## 
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.199719
## RMSE:  0.4468994
## LogLoss:  0.6036669
## Mean Per-Class Error:  0.3065129
## AUC:  0.7639861
## AUCPR:  0.6483184
## Gini:  0.5279721
## R^2:  0.1696344
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error      Rate
## 0      119 108 0.475771  =108/227
## 1       21 132 0.137255   =21/153
## Totals 140 240 0.339474  =129/380
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.212030   0.671756 239
## 2                       max f2  0.106855   0.796460 291
## 3                 max f0point5  0.608746   0.665025 113
## 4                 max accuracy  0.608746   0.723684 113
## 5                max precision  0.950591   0.857143   6
## 6                   max recall  0.016471   1.000000 375
## 7              max specificity  0.970327   0.995595   0
## 8             max absolute_mcc  0.478099   0.421373 154
## 9   max min_per_class_accuracy  0.411816   0.704846 174
## 10 max mean_per_class_accuracy  0.478099   0.711123 154
## 11                     max tns  0.970327 226.000000   0
## 12                     max fns  0.970327 153.000000   0
## 13                     max fps  0.013624 227.000000 379
## 14                     max tps  0.016471 153.000000 375
## 15                     max tnr  0.970327   0.995595   0
## 16                     max fnr  0.970327   1.000000   0
## 17                     max fpr  0.013624   1.000000 379
## 18                     max tpr  0.016471   1.000000 375
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean         sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy   0.6878125 0.06919531 0.67105263  0.6388889  0.6785714 0.64285713
## auc        0.7617476  0.0774213  0.7645611 0.71266234  0.8188235 0.66083914
## aucpr     0.64184284 0.16101824  0.5026728  0.5690787  0.7809037 0.50938404
## err       0.31218752 0.06919531 0.32894737  0.3611111 0.32142857 0.35714287
## err_count       23.6  4.8785243       25.0       26.0       27.0       25.0
##           cv_5_valid
## accuracy   0.8076923
## auc        0.8518519
## aucpr       0.847175
## err        0.1923077
## err_count       15.0
## 
## ---
##                   mean          sd cv_1_valid  cv_2_valid cv_3_valid
## pr_auc      0.64184284  0.16101824  0.5026728   0.5690787  0.7809037
## precision   0.56509644  0.11003888 0.47619048   0.5217391 0.55932206
## r2          0.13813825  0.14245723 0.12889759 0.041160017 0.28060743
## recall       0.8760893 0.095223315  0.8695652  0.85714287  0.9705882
## rmse         0.4476493 0.034771368  0.4287685   0.4773599 0.41632083
## specificity  0.5589407  0.06669616  0.5849057         0.5       0.48
##               cv_4_valid cv_5_valid
## pr_auc        0.50938404   0.847175
## precision      0.5135135   0.754717
## r2          -0.039253294  0.2792795
## recall         0.7307692 0.95238096
## rmse          0.49257874 0.42321858
## specificity   0.59090906  0.6388889
## 
## Scoring History: 
##             timestamp   duration number_of_trees training_rmse training_logloss
## 1 2020-11-11 14:21:57  0.692 sec               0       0.49043          0.67406
## 2 2020-11-11 14:21:57  0.697 sec               1       0.46862          0.63072
## 3 2020-11-11 14:21:57  0.718 sec               2       0.45119          0.59728
## 4 2020-11-11 14:21:57  0.721 sec               3       0.43645          0.56953
## 5 2020-11-11 14:21:57  0.724 sec               4       0.42351          0.54545
##   training_auc training_pr_auc training_lift training_classification_error
## 1      0.50000         0.40263       1.00000                       0.59737
## 2      0.88931         0.85602       2.48366                       0.20526
## 3      0.89688         0.86382       2.48366                       0.18684
## 4      0.89898         0.87029       2.48366                       0.18158
## 5      0.90318         0.87597       2.48366                       0.16842
## 
## ---
##              timestamp   duration number_of_trees training_rmse
## 46 2020-11-11 14:21:57  0.851 sec              45       0.26228
## 47 2020-11-11 14:21:57  0.855 sec              46       0.26080
## 48 2020-11-11 14:21:57  0.858 sec              47       0.25792
## 49 2020-11-11 14:21:57  0.862 sec              48       0.25706
## 50 2020-11-11 14:21:57  0.865 sec              49       0.25352
## 51 2020-11-11 14:21:57  0.869 sec              50       0.25256
##    training_logloss training_auc training_pr_auc training_lift
## 46          0.25888      0.98546         0.97997       2.48366
## 47          0.25665      0.98578         0.98028       2.48366
## 48          0.25239      0.98667         0.98150       2.48366
## 49          0.25089      0.98670         0.98145       2.48366
## 50          0.24592      0.98805         0.98317       2.48366
## 51          0.24449      0.98825         0.98341       2.48366
##    training_classification_error
## 46                       0.07105
## 47                       0.06316
## 48                       0.06579
## 49                       0.06316
## 50                       0.06316
## 51                       0.06316
## 
## Variable Importances: (Extract with `h2o.varimp`) 
## =================================================
## 
## Variable Importances: 
##   variable relative_importance scaled_importance percentage
## 1  GLEASON          110.325813          1.000000   0.322999
## 2       ID           58.696804          0.532031   0.171845
## 3      PSA           57.767464          0.523608   0.169125
## 4      VOL           40.088531          0.363365   0.117366
## 5    DPROS           34.955631          0.316840   0.102339
## 6      AGE           31.499592          0.285514   0.092221
## 7    DCAPS            7.156589          0.064868   0.020952
## 8     RACE            1.076955          0.009762   0.003153

0.1.5 AUC of cross-validated holdout predictions
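The single value below matches the cross-validation AUC reported in the model summary above; it can be retrieved directly with h2o.auc() (a sketch, reusing the gbm_model object from the modeling step):

# AUC on the combined 5-fold cross-validation holdout predictions.
h2o.auc(gbm_model, xval = TRUE)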

## [1] 0.7639861

0.1.6 prediction
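The prediction output contains the predicted class plus the class probabilities p0 and p1. A minimal sketch of the scoring step; since no separate test set appears in this section, the model is applied back to the training frame:

# Score the fitted GBM; h2o.predict returns predict, p0 and p1 for a binomial model.
pred <- h2o.predict(gbm_model, newdata = prostate)
pred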

##   predict        p0         p1
## 1       0 0.9673551 0.03264488
## 2       0 0.6499115 0.35008845
## 3       0 0.9618150 0.03818497
## 4       0 0.8815766 0.11842341
## 5       0 0.9767848 0.02321521
## 6       1 0.5241713 0.47582865
## 
## [380 rows x 3 columns]