Split the dataset into training, validation, and test sets
# Two ratios yield three frames: 50% train, 20% valid, and the remaining 30% test
splits = h2o.splitFrame(
  data = loans,
  ratios = c(0.5, 0.2),
  destination_frames = c("train.hex", "valid.hex", "test.hex"),
  seed = 4569)
loans_train = splits[[1]]
loans_valid = splits[[2]]
loans_test = splits[[3]]
# dim(loans_train)
# dim(loans_valid)
# dim(loans_test)
Gradient Boosting Machine Learning Model Settings
y_response = "default"
x_predictors = setdiff(names(loans), c(y_response, "names"))
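gbm_loans = h2o.gbm(x = x_predictors,
  y = y_response,
  training_frame = loans_train,
  validation_frame = loans_valid,
  #validation_frame = loans_test,
  ntrees = 500,                  # upper bound on the number of trees
  learn_rate = 0.005,
  learn_rate_annealing = 0.99,   # learn_rate is multiplied by 0.99 after every tree
  stopping_rounds = 10,          # early stopping: look back over 10 scoring events
  stopping_tolerance = 1e-4,     # minimum relative improvement required to continue
  stopping_metric = "AUC",
  sample_rate = 0.9,             # fraction of rows sampled per tree
  col_sample_rate = 0.9,         # fraction of columns sampled
  seed = 4569,
  score_tree_interval = 10       # score the model every 10 trees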
)
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1558919351183_3389
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 500 500 180748 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 17 31 23.96400
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.176599
## RMSE: 0.4202368
## LogLoss: 0.5373888
## Mean Per-Class Error: 0.1910963
## AUC: 0.8918551
## pr_auc: 0.8035924
## Gini: 0.7837101
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## no yes Error Rate
## no 584 78 0.117825 =78/662
## yes 92 256 0.264368 =92/348
## Totals 676 334 0.168317 =170/1010
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.354626 0.750733 164
## 2 max f2 0.290111 0.812562 291
## 3 max f0point5 0.400812 0.820957 98
## 4 max accuracy 0.368542 0.838614 142
## 5 max precision 0.598200 1.000000 0
## 6 max recall 0.264495 1.000000 371
## 7 max specificity 0.598200 1.000000 0
## 8 max absolute_mcc 0.368542 0.633925 142
## 9 max min_per_class_accuracy 0.335600 0.796073 200
## 10 max mean_per_class_accuracy 0.354626 0.808904 164
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.1901109
## RMSE: 0.4360171
## LogLoss: 0.5668623
## Mean Per-Class Error: 0.2957044
## AUC: 0.7756624
## pr_auc: 0.6490403
## Gini: 0.5513248
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## no yes Error Rate
## no 156 101 0.392996 =101/257
## yes 25 101 0.198413 =25/126
## Totals 181 202 0.328982 =126/383
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.307786 0.615854 179
## 2 max f2 0.283216 0.758755 241
## 3 max f0point5 0.400520 0.679487 49
## 4 max accuracy 0.400520 0.775457 49
## 5 max precision 0.598200 1.000000 0
## 6 max recall 0.248641 1.000000 320
## 7 max specificity 0.598200 1.000000 0
## 8 max absolute_mcc 0.400520 0.460375 49
## 9 max min_per_class_accuracy 0.324252 0.690476 147
## 10 max mean_per_class_accuracy 0.307786 0.704296 179
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 months_loan 4882.286133 1.000000 0.235244
## 2 amount 3386.633057 0.693657 0.163178
## 3 savings_balance 2718.146729 0.556736 0.130969
## 4 age 1999.284546 0.409498 0.096332
## 5 housing 1767.279297 0.361978 0.085153
## 6 employment 1442.025513 0.295359 0.069481
## 7 credit_record 1282.387573 0.262661 0.061789
## 8 job 1063.397217 0.217807 0.051238
## 9 purpose 744.287109 0.152446 0.035862
## 10 fixed_assets 668.080750 0.136838 0.032190
## 11 other_credit 353.054474 0.072313 0.017011
## 12 loans_count 272.191284 0.055751 0.013115
## 13 dependents 175.124817 0.035869 0.008438
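The `scaled_importance` and `percentage` columns in the table above can be recomputed from `relative_importance` alone: the first divides each value by the largest one, the second by the column total. A minimal base-R sketch, with the values copied from the table (the vector names are illustrative and do not come from the original code):

```r
# relative_importance values copied from the Variable Importances table above
relative_importance <- c(
  months_loan = 4882.286133, amount = 3386.633057,
  savings_balance = 2718.146729, age = 1999.284546,
  housing = 1767.279297, employment = 1442.025513,
  credit_record = 1282.387573, job = 1063.397217,
  purpose = 744.287109, fixed_assets = 668.080750,
  other_credit = 353.054474, loans_count = 272.191284,
  dependents = 175.124817
)

# Scale so that the most important predictor equals 1
scaled_importance <- relative_importance / max(relative_importance)

# Share of the total importance carried by each predictor
percentage <- relative_importance / sum(relative_importance)

round(cbind(scaled_importance, percentage), 6)
```

Read this way, the top three predictors (`months_loan`, `amount`, `savings_balance`) carry roughly 53% of the total importance.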
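# Assumption: varimp_gbm_loans is not defined in this section; it is presumably
# the variable-importance table, e.g. as.data.frame(h2o.varimp(gbm_loans)).
p = treemap(varimp_gbm_loans,
            index = c("variable"),
            vSize = "scaled_importance",
            type = "index",
            fontsize.labels = c(18),
            title = "gbm_loans: Relative Importance of Predictors",
            fontsize.title = 24)
## predict no yes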
## 1 yes 0.4819037 0.5180963
## 2 no 0.7083291 0.2916709
## 3 yes 0.6627646 0.3372354
## 4 yes 0.4709696 0.5290304
## 5 yes 0.6629977 0.3370023
## 6 no 0.7343484 0.2656516
##
## [607 rows x 3 columns]
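The `predict` column is not a 0.5 cutoff on the `yes` probability: row 3 is labelled `yes` with a probability of only 0.337. The six labels shown are consistent with a cutoff at the validation max-F1 threshold (0.307786) reported in the model output. That this is the threshold applied here is an assumption inferred from these rows, so the base-R check below is a sketch, not a description of the library's internals:

```r
# p(yes) for the six rows printed above
p_yes <- c(0.5180963, 0.2916709, 0.3372354, 0.5290304, 0.3370023, 0.2656516)

# Labels printed in the predict column
shown <- c("yes", "no", "yes", "yes", "yes", "no")

# Assumed cutoff: max-F1 threshold from the validation metrics
threshold <- 0.307786

labels_f1   <- ifelse(p_yes >= threshold, "yes", "no")
labels_half <- ifelse(p_yes >= 0.5, "yes", "no")

identical(labels_f1, shown)    # does the assumed cutoff reproduce the labels?
identical(labels_half, shown)  # does a plain 0.5 cutoff?
```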
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.354625927965481:
## no yes Error Rate
## no 584 78 0.117825 =78/662
## yes 92 256 0.264368 =92/348
## Totals 676 334 0.168317 =170/1010
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.307786176353324:
## no yes Error Rate
## no 156 101 0.392996 =101/257
## yes 25 101 0.198413 =25/126
## Totals 181 202 0.328982 =126/383
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.32610821859585:
## no yes Error Rate
## no 275 113 0.291237 =113/388
## yes 62 157 0.283105 =62/219
## Totals 337 270 0.288303 =175/607
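Each `Error` and `Rate` entry follows directly from the four cell counts. As a worked example, the sketch below rebuilds the figures of the last matrix (607 rows, matching the size of the prediction frame) in base R and adds precision, recall, and F1 for the `yes` class. The variable names are illustrative.

```r
# Counts from the last matrix: rows are actual, columns are predicted
cm <- matrix(c(275, 113,
                62, 157),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("no", "yes"),
                             predicted = c("no", "yes")))

tn <- cm["no", "no"];  fp <- cm["no", "yes"]
fn <- cm["yes", "no"]; tp <- cm["yes", "yes"]

error_no  <- fp / (tn + fp)        # 113/388
error_yes <- fn / (fn + tp)        # 62/219
error_all <- (fp + fn) / sum(cm)   # 175/607

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

round(c(error_no = error_no, error_yes = error_yes, error_all = error_all,
        precision = precision, recall = recall, f1 = f1), 6)
```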
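AUC: Area Under the ROC Curve
- AUC measures the performance of a classifier as the area under its ROC curve. The three values below appear in the same order as the confusion matrices above: training, validation, and test.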
## [1] 0.8918485
## [1] 0.7756624
## [1] 0.7750788
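AUC can also be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one. The base-R sketch below computes it that way with the rank-sum (Mann-Whitney) identity. The function `auc_rank`, the scores, and the labels are hypothetical and not taken from the loans data, and this is not necessarily how H2O computes the values it reports.

```r
# AUC via the rank-sum (Mann-Whitney U) identity; ties receive average ranks
auc_rank <- function(score, is_positive) {
  r <- rank(score)
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical scores and true classes
score       <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE)

auc_rank(score, is_positive)
```

In this toy example 12 of the 16 positive-negative pairs are ordered correctly, so the result is 0.75. A value of 0.5 corresponds to random ranking and 1 to perfect separation.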