HK Loan Market: Gradient Boosting Machine Learning with R + H2O

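library(h2o); h2o.init()  # start or connect to a local H2O cluster
loans = as.h2o(read.csv("credit.csv", stringsAsFactors = TRUE))  # h2o.splitFrame() needs an H2OFrame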

Split the dataset into training, validation, and test sets

splits = h2o.splitFrame(
  data = loans,
  ratios = c(0.5, 0.2),
  destination_frames = c("train.hex", "valid.hex", "test.hex"),
  seed = 4569
)
loans_train = splits[[1]]
loans_valid = splits[[2]]
loans_test  = splits[[3]]

# dim(loans_train)
# dim(loans_valid)
# dim(loans_test)
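
h2o.splitFrame() assigns rows at random, so the realized split sizes only approximate the requested ratios. A small base R check, using the row counts that appear in the confusion-matrix totals further down (1010, 383, and 607), might look like the following; the variable names are illustrative only.

```r
# Row counts taken from the confusion-matrix totals reported in this document
n_rows <- c(train = 1010, valid = 383, test = 607)

# Requested proportions: ratios = c(0.5, 0.2), the remainder goes to the test set
requested <- c(train = 0.5, valid = 0.2, test = 1 - 0.5 - 0.2)

# Realized proportions and their deviation from the request
realized <- n_rows / sum(n_rows)
print(round(rbind(requested, realized, diff = realized - requested), 4))
```

The realized shares come out close to, but not exactly, 50/20/30, which is expected for a probabilistic split.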

Gradient Boosting Machine Learning Model Settings

y_response = "default"
x_predictors = setdiff(names(loans), c(y_response, "names")) 
gbm_loans = h2o.gbm(x = x_predictors,
                    y = y_response, 
                    training_frame = loans_train,
                    validation_frame = loans_valid,
                    #validation_frame = loans_test,
                    ntrees=500,
                    learn_rate=0.005,
                    learn_rate_annealing = 0.99,
                    stopping_rounds = 10, 
                    stopping_tolerance = 1e-4, 
                    stopping_metric = "AUC",
                    sample_rate = 0.9,
                    col_sample_rate = 0.9,
                    seed = 4569,
                    score_tree_interval = 10
                    )
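
With learn_rate_annealing = 0.99 the learning rate is multiplied by 0.99 after every tree, so later trees contribute far less than earlier ones. The sketch below works out that schedule in base R. It assumes the first tree uses the unscaled learn_rate; the exact indexing inside H2O is an assumption here, not something the output above confirms.

```r
learn_rate <- 0.005
annealing  <- 0.99
ntrees     <- 500

# Assumed schedule: tree k uses learn_rate * annealing^(k - 1)
rates <- learn_rate * annealing^(0:(ntrees - 1))

rates[1]        # rate for the first tree
rates[ntrees]   # rate for the last tree, more than 100 times smaller
sum(rates)      # total step size accumulated over all 500 trees
```

Under this assumption the accumulated step size is roughly 0.5, compared with 500 * 0.005 = 2.5 without annealing. That may help explain why the predicted probabilities stay in a fairly narrow band around the base rate.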

gbm_loans

## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1558919351183_3389 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             500                      500              180748         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000         17         31    23.96400
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.176599
## RMSE:  0.4202368
## LogLoss:  0.5373888
## Mean Per-Class Error:  0.1910963
## AUC:  0.8918551
## pr_auc:  0.8035924
## Gini:  0.7837101
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         no yes    Error       Rate
## no     584  78 0.117825    =78/662
## yes     92 256 0.264368    =92/348
## Totals 676 334 0.168317  =170/1010
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.354626 0.750733 164
## 2                       max f2  0.290111 0.812562 291
## 3                 max f0point5  0.400812 0.820957  98
## 4                 max accuracy  0.368542 0.838614 142
## 5                max precision  0.598200 1.000000   0
## 6                   max recall  0.264495 1.000000 371
## 7              max specificity  0.598200 1.000000   0
## 8             max absolute_mcc  0.368542 0.633925 142
## 9   max min_per_class_accuracy  0.335600 0.796073 200
## 10 max mean_per_class_accuracy  0.354626 0.808904 164
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.1901109
## RMSE:  0.4360171
## LogLoss:  0.5668623
## Mean Per-Class Error:  0.2957044
## AUC:  0.7756624
## pr_auc:  0.6490403
## Gini:  0.5513248
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         no yes    Error      Rate
## no     156 101 0.392996  =101/257
## yes     25 101 0.198413   =25/126
## Totals 181 202 0.328982  =126/383
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.307786 0.615854 179
## 2                       max f2  0.283216 0.758755 241
## 3                 max f0point5  0.400520 0.679487  49
## 4                 max accuracy  0.400520 0.775457  49
## 5                max precision  0.598200 1.000000   0
## 6                   max recall  0.248641 1.000000 320
## 7              max specificity  0.598200 1.000000   0
## 8             max absolute_mcc  0.400520 0.460375  49
## 9   max min_per_class_accuracy  0.324252 0.690476 147
## 10 max mean_per_class_accuracy  0.307786 0.704296 179
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

varimp_gbm_loans = h2o.varimp(gbm_loans)
varimp_gbm_loans

## Variable Importances: 
##           variable relative_importance scaled_importance percentage
## 1      months_loan         4882.286133          1.000000   0.235244
## 2           amount         3386.633057          0.693657   0.163178
## 3  savings_balance         2718.146729          0.556736   0.130969
## 4              age         1999.284546          0.409498   0.096332
## 5          housing         1767.279297          0.361978   0.085153
## 6       employment         1442.025513          0.295359   0.069481
## 7    credit_record         1282.387573          0.262661   0.061789
## 8              job         1063.397217          0.217807   0.051238
## 9          purpose          744.287109          0.152446   0.035862
## 10    fixed_assets          668.080750          0.136838   0.032190
## 11    other_credit          353.054474          0.072313   0.017011
## 12     loans_count          272.191284          0.055751   0.013115
## 13      dependents          175.124817          0.035869   0.008438
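
The three numeric columns carry the same information on different scales. As a quick check in base R (values copied from the table above), scaled_importance is the relative importance divided by the largest value, and percentage is the relative importance divided by the column total:

```r
# Relative importances copied from the table above
rel_imp <- c(months_loan = 4882.286133, amount = 3386.633057,
             savings_balance = 2718.146729, age = 1999.284546,
             housing = 1767.279297, employment = 1442.025513,
             credit_record = 1282.387573, job = 1063.397217,
             purpose = 744.287109, fixed_assets = 668.080750,
             other_credit = 353.054474, loans_count = 272.191284,
             dependents = 175.124817)

scaled     <- rel_imp / max(rel_imp)   # matches the scaled_importance column
percentage <- rel_imp / sum(rel_imp)   # matches the percentage column

print(round(head(cbind(scaled, percentage), 3), 6))
```

The top three predictors (months_loan, amount, savings_balance) together account for a little over half of the total importance.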

library(treemap)
# treemap() expects a plain data.frame
p = treemap(as.data.frame(varimp_gbm_loans),
            index = "variable",
            vSize = "scaled_importance",
            type = "index",
            fontsize.labels = 18,
            title = "gbm_loans: Relative Importance of Predictors",
            fontsize.title = 24)

plot(gbm_loans)

loans_preds = h2o.predict(gbm_loans, loans_test)

loans_preds

##   predict        no       yes
## 1     yes 0.4819037 0.5180963
## 2      no 0.7083291 0.2916709
## 3     yes 0.6627646 0.3372354
## 4     yes 0.4709696 0.5290304
## 5     yes 0.6629977 0.3370023
## 6      no 0.7343484 0.2656516
## 
## [607 rows x 3 columns]
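
Rows 3 and 5 are labelled yes even though p(yes) is only about 0.34, so the predict column is clearly not cut at 0.5. The labels look consistent with the max-F1 threshold found on the validation frame (0.307786 in the model summary). The base R sketch below reproduces the six printed labels under that assumption; it is a check against the printed rows, not a statement of how H2O chooses the cut-off in every configuration.

```r
# p(yes) for the six rows printed above
p_yes <- c(0.5180963, 0.2916709, 0.3372354, 0.5290304, 0.3370023, 0.2656516)

# Assumed cut-off: validation max-F1 threshold from the model summary
threshold <- 0.307786

predicted <- ifelse(p_yes >= threshold, "yes", "no")
print(predicted)
```

If a different operating point is wanted, for example a stricter one that flags fewer loans, the yes column can be thresholded directly in the same way.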

h2o.confusionMatrix(gbm_loans, loans_train)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.354625927965481:
##         no yes    Error       Rate
## no     584  78 0.117825    =78/662
## yes     92 256 0.264368    =92/348
## Totals 676 334 0.168317  =170/1010

h2o.confusionMatrix(gbm_loans, loans_valid)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.307786176353324:
##         no yes    Error      Rate
## no     156 101 0.392996  =101/257
## yes     25 101 0.198413   =25/126
## Totals 181 202 0.328982  =126/383

h2o.confusionMatrix(gbm_loans, loans_test)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.32610821859585:
##         no yes    Error      Rate
## no     275 113 0.291237  =113/388
## yes     62 157 0.283105   =62/219
## Totals 337 270 0.288303  =175/607
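
To compare the three matrices on a common footing, precision, recall, F1, and the overall error rate can be recomputed from the cell counts. The helper below is a minimal base R sketch (cm_metrics is a hypothetical name, not an H2O function) that treats yes as the positive class.

```r
# Precision, recall, F1, and error rate from the four cells of a binary confusion matrix
cm_metrics <- function(tn, fp, fn, tp) {
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  error     <- (fp + fn) / (tn + fp + fn + tp)
  c(precision = precision, recall = recall, f1 = f1, error = error)
}

# Counts read off the three matrices above
results <- rbind(
  train = cm_metrics(tn = 584, fp = 78,  fn = 92, tp = 256),
  valid = cm_metrics(tn = 156, fp = 101, fn = 25, tp = 101),
  test  = cm_metrics(tn = 275, fp = 113, fn = 62, tp = 157)
)
print(round(results, 4))
```

The F1 values for the training and validation frames agree with the max f1 rows in the model summary (0.750733 and 0.615854), and the error column matches the Totals row of each matrix.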

AUC: Area Under ROC Curve

AUC measures a classifier's performance as the area under the ROC curve: 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.
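
One way to make this concrete: AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base R sketch below computes it from ranks on a small made-up example (the scores and labels are toy data, not taken from the loans dataset; auc_rank is a hypothetical helper). It also derives the Gini coefficient, which equals 2 * AUC - 1.

```r
# Rank-based (Mann-Whitney) AUC; `labels` is 1 for positive, 0 for negative
auc_rank <- function(scores, labels) {
  r     <- rank(scores)            # average ranks handle tied scores
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data for illustration only
scores <- c(0.90, 0.80, 0.70, 0.60, 0.55, 0.40)
labels <- c(1,    1,    0,    1,    0,    0)

auc  <- auc_rank(scores, labels)   # 8 of the 9 positive/negative pairs are ordered correctly
gini <- 2 * auc - 1
print(c(auc = auc, gini = gini))
```

Applied to the training AUC in the model summary, 2 * 0.8918551 - 1 gives 0.7837102, in line with the reported Gini of 0.7837101.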

h2o.auc(h2o.performance(gbm_loans, newdata = loans_train))

## [1] 0.8918485

h2o.auc(h2o.performance(gbm_loans, newdata = loans_valid))

## [1] 0.7756624

h2o.auc(h2o.performance(gbm_loans, newdata = loans_test))

## [1] 0.7750788

Written by DK WC

5/26/2019