Gradient Boosting Machine (GBM) with H2O
[Reference 1] http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm.html
Gradient Boosting Machine (for regression and classification) is a forward-learning ensemble method: it obtains good predictive results through increasingly refined approximations, built as a sequence of trees.
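The cluster information printed below comes from connecting R to an H2O instance. The chunk that produced it is not shown in the original; presumably it simply loaded the package and called h2o.init(), roughly as in this minimal sketch:

# Load h2o and start (or connect to) a local H2O cluster:
library(h2o)
h2o.init()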
## Warning: package 'h2o' was built under R version 4.0.3
## Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 5 hours 5 minutes
## H2O cluster timezone: Asia/Seoul
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.1
## H2O cluster version age: 1 month and 2 days
## H2O cluster name: H2O_started_from_R_user_jna970
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.87 GB
## H2O cluster total cores: 4
## H2O cluster allowed cores: 4
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.2 (2020-06-22)
# Import the prostate dataset into H2O:
prostate <- h2o.importFile("http://s3.amazonaws.com/h2o-public-test-data/smalldata/prostate/prostate.csv")
## ID CAPSULE AGE RACE DPROS DCAPS PSA VOL GLEASON
## 1 1 0 65 1 2 1 1.4 0.0 6
## 2 2 0 72 1 3 2 6.7 0.0 7
## 3 3 0 70 1 1 2 4.9 0.0 6
## 4 4 0 76 2 2 1 51.2 20.0 7
## 5 5 0 69 1 1 1 12.3 55.9 6
## 6 6 1 71 1 3 2 3.3 0.0 8
##
## [380 rows x 9 columns]
# Set the predictors and response; set the factors:
prostate$CAPSULE <- as.factor(prostate$CAPSULE)
predictors <- c("ID", "AGE", "RACE", "DPROS", "DCAPS", "PSA", "VOL", "GLEASON")
response <- "CAPSULE"# Build and train the model:
pros_gbm <- h2o.gbm(x = predictors,
                    y = response,
                    nfolds = 5,
                    seed = 1111,
                    keep_cross_validation_predictions = TRUE,
                    training_frame = prostate)
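With nfolds = 5, H2O additionally trains five cross-validation models and scores each fold's holdout rows, so the printed model summary below reports both training and cross-validation metrics. Because keep_cross_validation_predictions = TRUE, the combined holdout predictions are stored with the model and can be pulled back as an H2O frame, for example (a sketch, not part of the original output):

# Retrieve the combined cross-validation holdout predictions:
cv_preds <- h2o.cross_validation_holdout_predictions(pros_gbm)
head(cv_preds)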
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model Key: GBM_model_R_1605053712803_3812
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 50 50 12171 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 7 24 14.74000
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.0637863
## RMSE: 0.2525595
## LogLoss: 0.2444918
## Mean Per-Class Error: 0.05819009
## AUC: 0.9882526
## AUCPR: 0.9834103
## Gini: 0.9765051
## R^2: 0.7347977
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 208 19 0.083700 =19/227
## 1 5 148 0.032680 =5/153
## Totals 213 167 0.063158 =24/380
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.369290 0.925000 166
## 2 max f2 0.328487 0.955414 172
## 3 max f0point5 0.570988 0.947137 131
## 4 max accuracy 0.461767 0.936842 150
## 5 max precision 0.974091 1.000000 0
## 6 max recall 0.271605 1.000000 200
## 7 max specificity 0.974091 1.000000 0
## 8 max absolute_mcc 0.369290 0.873124 166
## 9 max min_per_class_accuracy 0.430389 0.929515 158
## 10 max mean_per_class_accuracy 0.369290 0.941810 166
## 11 max tns 0.974091 227.000000 0
## 12 max fns 0.974091 152.000000 0
## 13 max fps 0.009748 227.000000 377
## 14 max tps 0.271605 153.000000 200
## 15 max tnr 0.974091 1.000000 0
## 16 max fnr 0.974091 0.993464 0
## 17 max fpr 0.009748 1.000000 377
## 18 max tpr 0.271605 1.000000 200
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
##
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.199719
## RMSE: 0.4468994
## LogLoss: 0.6036669
## Mean Per-Class Error: 0.3065129
## AUC: 0.7639861
## AUCPR: 0.6483184
## Gini: 0.5279721
## R^2: 0.1696344
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 119 108 0.475771 =108/227
## 1 21 132 0.137255 =21/153
## Totals 140 240 0.339474 =129/380
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.212030 0.671756 239
## 2 max f2 0.106855 0.796460 291
## 3 max f0point5 0.608746 0.665025 113
## 4 max accuracy 0.608746 0.723684 113
## 5 max precision 0.950591 0.857143 6
## 6 max recall 0.016471 1.000000 375
## 7 max specificity 0.970327 0.995595 0
## 8 max absolute_mcc 0.478099 0.421373 154
## 9 max min_per_class_accuracy 0.411816 0.704846 174
## 10 max mean_per_class_accuracy 0.478099 0.711123 154
## 11 max tns 0.970327 226.000000 0
## 12 max fns 0.970327 153.000000 0
## 13 max fps 0.013624 227.000000 379
## 14 max tps 0.016471 153.000000 375
## 15 max tnr 0.970327 0.995595 0
## 16 max fnr 0.970327 1.000000 0
## 17 max fpr 0.013624 1.000000 379
## 18 max tpr 0.016471 1.000000 375
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy 0.6878125 0.06919531 0.67105263 0.6388889 0.6785714 0.64285713
## auc 0.7617476 0.0774213 0.7645611 0.71266234 0.8188235 0.66083914
## aucpr 0.64184284 0.16101824 0.5026728 0.5690787 0.7809037 0.50938404
## err 0.31218752 0.06919531 0.32894737 0.3611111 0.32142857 0.35714287
## err_count 23.6 4.8785243 25.0 26.0 27.0 25.0
## cv_5_valid
## accuracy 0.8076923
## auc 0.8518519
## aucpr 0.847175
## err 0.1923077
## err_count 15.0
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid
## pr_auc 0.64184284 0.16101824 0.5026728 0.5690787 0.7809037
## precision 0.56509644 0.11003888 0.47619048 0.5217391 0.55932206
## r2 0.13813825 0.14245723 0.12889759 0.041160017 0.28060743
## recall 0.8760893 0.095223315 0.8695652 0.85714287 0.9705882
## rmse 0.4476493 0.034771368 0.4287685 0.4773599 0.41632083
## specificity 0.5589407 0.06669616 0.5849057 0.5 0.48
## cv_4_valid cv_5_valid
## pr_auc 0.50938404 0.847175
## precision 0.5135135 0.754717
## r2 -0.039253294 0.2792795
## recall 0.7307692 0.95238096
## rmse 0.49257874 0.42321858
## specificity 0.59090906 0.6388889
##
## Scoring History:
## timestamp duration number_of_trees training_rmse training_logloss
## 1 2020-11-11 14:21:57 0.692 sec 0 0.49043 0.67406
## 2 2020-11-11 14:21:57 0.697 sec 1 0.46862 0.63072
## 3 2020-11-11 14:21:57 0.718 sec 2 0.45119 0.59728
## 4 2020-11-11 14:21:57 0.721 sec 3 0.43645 0.56953
## 5 2020-11-11 14:21:57 0.724 sec 4 0.42351 0.54545
## training_auc training_pr_auc training_lift training_classification_error
## 1 0.50000 0.40263 1.00000 0.59737
## 2 0.88931 0.85602 2.48366 0.20526
## 3 0.89688 0.86382 2.48366 0.18684
## 4 0.89898 0.87029 2.48366 0.18158
## 5 0.90318 0.87597 2.48366 0.16842
##
## ---
## timestamp duration number_of_trees training_rmse
## 46 2020-11-11 14:21:57 0.851 sec 45 0.26228
## 47 2020-11-11 14:21:57 0.855 sec 46 0.26080
## 48 2020-11-11 14:21:57 0.858 sec 47 0.25792
## 49 2020-11-11 14:21:57 0.862 sec 48 0.25706
## 50 2020-11-11 14:21:57 0.865 sec 49 0.25352
## 51 2020-11-11 14:21:57 0.869 sec 50 0.25256
## training_logloss training_auc training_pr_auc training_lift
## 46 0.25888 0.98546 0.97997 2.48366
## 47 0.25665 0.98578 0.98028 2.48366
## 48 0.25239 0.98667 0.98150 2.48366
## 49 0.25089 0.98670 0.98145 2.48366
## 50 0.24592 0.98805 0.98317 2.48366
## 51 0.24449 0.98825 0.98341 2.48366
## training_classification_error
## 46 0.07105
## 47 0.06316
## 48 0.06579
## 49 0.06316
## 50 0.06316
## 51 0.06316
##
## Variable Importances: (Extract with `h2o.varimp`)
## =================================================
##
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 GLEASON 110.325813 1.000000 0.322999
## 2 ID 58.696804 0.532031 0.171845
## 3 PSA 57.767464 0.523608 0.169125
## 4 VOL 40.088531 0.363365 0.117366
## 5 DPROS 34.955631 0.316840 0.102339
## 6 AGE 31.499592 0.285514 0.092221
## 7 DCAPS 7.156589 0.064868 0.020952
## 8 RACE 1.076955 0.009762 0.003153
## [1] 0.7639861
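The standalone value above is the 5-fold cross-validation AUC already shown in the model summary (0.7639861). The chunk that printed it is not included; the same number can be recovered from the model object, for example (assumed, not the original code):

# Cross-validation AUC of the fitted model:
h2o.auc(pros_gbm, xval = TRUE)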
# Generate predictions on a validation set (if necessary):
pred <- h2o.predict(pros_gbm, newdata = prostate)
## predict p0 p1
## 1 0 0.9673551 0.03264488
## 2 0 0.6499115 0.35008845
## 3 0 0.9618150 0.03818497
## 4 0 0.8815766 0.11842341
## 5 0 0.9767848 0.02321521
## 6 1 0.5241713 0.47582865
##
## [380 rows x 3 columns]
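Note that the predictions above are generated on the same frame the model was trained on. If a genuine validation set is needed, the data can first be split with h2o.splitFrame and the model scored on the held-out part; a minimal sketch (the variable names and the 0.8 ratio are illustrative, not from the original):

# Split the data into training and validation frames:
splits <- h2o.splitFrame(data = prostate, ratios = 0.8, seed = 1111)
train <- splits[[1]]
valid <- splits[[2]]

# Refit on the training split and score the held-out validation split:
gbm_split <- h2o.gbm(x = predictors, y = response,
                     training_frame = train, validation_frame = valid,
                     seed = 1111)
pred_valid <- h2o.predict(gbm_split, newdata = valid)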