Identifying Risky Loans using Ensemble Trees

German Credit Data Set

In this example, we use tree ensembles from H2O, a random forest and a gradient boosting machine (GBM), to determine the creditworthiness of a loan applicant. The data set is the German Credit data from the UCI Machine Learning Repository.

Load and Inspect the Data

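# Read the German credit data (in R >= 4.0, text columns are read as chr).
credit.df = read.csv('data/german_credit_data.csv')
# Recode the response: 1 -> 'Yes', 2 -> 'No'. The UCI documentation codes
# 1 as good credit and 2 as bad credit, so verify this mapping for your file.
credit.df$default = factor(credit.df$default, levels = c(1, 2), labels = c('Yes', 'No'))
str(credit.df)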

## 'data.frame':    1000 obs. of  21 variables:
##  $ checking_balance    : chr  "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
##  $ months_loan_duration: int  6 48 12 42 24 36 24 36 12 30 ...
##  $ credit_history      : chr  "critical" "repaid" "critical" "repaid" ...
##  $ purpose             : chr  "radio/tv" "radio/tv" "education" "furniture" ...
##  $ amount              : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
##  $ savings_balance     : chr  "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
##  $ employment_length   : chr  "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
##  $ installment_rate    : int  4 2 2 2 3 2 3 2 2 4 ...
##  $ personal_status     : chr  "single male" "female" "single male" "single male" ...
##  $ other_debtors       : chr  "none" "none" "none" "guarantor" ...
##  $ residence_history   : int  4 2 3 4 4 4 4 2 4 2 ...
##  $ property            : chr  "real estate" "real estate" "real estate" "building society savings" ...
##  $ age                 : int  67 22 49 45 53 35 53 35 61 28 ...
##  $ installment_plan    : chr  "none" "none" "none" "none" ...
##  $ housing             : chr  "own" "own" "own" "for free" ...
##  $ existing_credits    : int  2 1 1 1 2 1 1 1 1 2 ...
##  $ default             : Factor w/ 2 levels "Yes","No": 1 2 1 1 2 1 1 1 1 2 ...
##  $ dependents          : int  1 1 2 2 2 2 1 1 1 1 ...
##  $ telephone           : chr  "yes" "none" "none" "none" ...
##  $ foreign_worker      : chr  "yes" "yes" "yes" "yes" ...
##  $ job                 : chr  "skilled employee" "skilled employee" "unskilled resident" "skilled employee" ...

Summary statistics

summary(credit.df)

##  checking_balance   months_loan_duration credit_history       purpose         
##  Length:1000        Min.   : 4.0         Length:1000        Length:1000       
##  Class :character   1st Qu.:12.0         Class :character   Class :character  
##  Mode  :character   Median :18.0         Mode  :character   Mode  :character  
##                     Mean   :20.9                                              
##                     3rd Qu.:24.0                                              
##                     Max.   :72.0                                              
##      amount      savings_balance    employment_length  installment_rate
##  Min.   :  250   Length:1000        Length:1000        Min.   :1.000   
##  1st Qu.: 1366   Class :character   Class :character   1st Qu.:2.000   
##  Median : 2320   Mode  :character   Mode  :character   Median :3.000   
##  Mean   : 3271                                         Mean   :2.973   
##  3rd Qu.: 3972                                         3rd Qu.:4.000   
##  Max.   :18424                                         Max.   :4.000   
##  personal_status    other_debtors      residence_history   property        
##  Length:1000        Length:1000        Min.   :1.000     Length:1000       
##  Class :character   Class :character   1st Qu.:2.000     Class :character  
##  Mode  :character   Mode  :character   Median :3.000     Mode  :character  
##                                        Mean   :2.845                       
##                                        3rd Qu.:4.000                       
##                                        Max.   :4.000                       
##       age        installment_plan     housing          existing_credits
##  Min.   :19.00   Length:1000        Length:1000        Min.   :1.000   
##  1st Qu.:27.00   Class :character   Class :character   1st Qu.:1.000   
##  Median :33.00   Mode  :character   Mode  :character   Median :1.000   
##  Mean   :35.55                                         Mean   :1.407   
##  3rd Qu.:42.00                                         3rd Qu.:2.000   
##  Max.   :75.00                                         Max.   :4.000   
##  default     dependents     telephone         foreign_worker    
##  Yes:700   Min.   :1.000   Length:1000        Length:1000       
##  No :300   1st Qu.:1.000   Class :character   Class :character  
##            Median :1.000   Mode  :character   Mode  :character  
##            Mean   :1.155                                        
##            3rd Qu.:1.000                                        
##            Max.   :2.000                                        
##      job           
##  Length:1000       
##  Class :character  
##  Mode  :character  
##                    
##                    
##

Sneak-peek at the Data

knitr::kable(head(credit.df, n = 15), format = "html")

checking_balance	months_loan_duration	credit_history	purpose	amount	savings_balance	employment_length	installment_rate	personal_status	other_debtors	residence_history	property	age	installment_plan	housing	existing_credits	default	dependents	telephone	foreign_worker	job
< 0 DM	6	critical	radio/tv	1169	unknown	> 7 yrs	4	single male	none	4	real estate	67	none	own	2	Yes	1	yes	yes	skilled employee
1 - 200 DM	48	repaid	radio/tv	5951	< 100 DM	1 - 4 yrs	2	female	none	2	real estate	22	none	own	1	No	1	none	yes	skilled employee
unknown	12	critical	education	2096	< 100 DM	4 - 7 yrs	2	single male	none	3	real estate	49	none	own	1	Yes	2	none	yes	unskilled resident
< 0 DM	42	repaid	furniture	7882	< 100 DM	4 - 7 yrs	2	single male	guarantor	4	building society savings	45	none	for free	1	Yes	2	none	yes	skilled employee
< 0 DM	24	delayed	car (new)	4870	< 100 DM	1 - 4 yrs	3	single male	none	4	unknown/none	53	none	for free	2	No	2	none	yes	skilled employee
unknown	36	repaid	education	9055	unknown	1 - 4 yrs	2	single male	none	4	unknown/none	35	none	for free	1	Yes	2	yes	yes	unskilled resident
unknown	24	repaid	furniture	2835	501 - 1000 DM	> 7 yrs	3	single male	none	4	building society savings	53	none	own	1	Yes	1	none	yes	skilled employee
1 - 200 DM	36	repaid	car (used)	6948	< 100 DM	1 - 4 yrs	2	single male	none	2	other	35	none	rent	1	Yes	1	yes	yes	mangement self-employed
unknown	12	repaid	radio/tv	3059	> 1000 DM	4 - 7 yrs	2	divorced male	none	4	real estate	61	none	own	1	Yes	1	none	yes	unskilled resident
1 - 200 DM	30	critical	car (new)	5234	< 100 DM	unemployed	4	married male	none	2	other	28	none	own	2	No	1	none	yes	mangement self-employed
1 - 200 DM	12	repaid	car (new)	1295	< 100 DM	0 - 1 yrs	3	female	none	1	other	25	none	rent	1	No	1	none	yes	skilled employee
< 0 DM	48	repaid	business	4308	< 100 DM	0 - 1 yrs	3	female	none	4	building society savings	24	none	rent	1	No	1	none	yes	skilled employee
1 - 200 DM	12	repaid	radio/tv	1567	< 100 DM	1 - 4 yrs	1	female	none	1	other	22	none	own	1	Yes	1	yes	yes	skilled employee
< 0 DM	24	critical	car (new)	1199	< 100 DM	> 7 yrs	4	single male	none	4	other	60	none	own	2	No	1	none	yes	unskilled resident
< 0 DM	15	repaid	car (new)	1403	< 100 DM	1 - 4 yrs	2	female	none	4	other	28	none	rent	1	Yes	1	none	yes	skilled employee

Start the H2O instance

library(h2o)
h2o.init()

## 
## H2O is not running yet, starting it now...
## 
## Note:  In case of errors look at the following log files:
##     C:\Users\dncha\AppData\Local\Temp\Rtmp0mP5Qk\file42581eec36fa/h2o_dncha_started_from_r.out
##     C:\Users\dncha\AppData\Local\Temp\Rtmp0mP5Qk\file425841f24974/h2o_dncha_started_from_r.err
## 
## 
## Starting H2O JVM and connecting:  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 seconds 778 milliseconds 
##     H2O cluster timezone:       America/New_York 
##     H2O data parsing timezone:  UTC 
##     H2O cluster version:        3.32.0.1 
##     H2O cluster version age:    2 months and 3 days  
##     H2O cluster name:           H2O_started_from_R_dncha_xic696 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.52 GB 
##     H2O cluster total cores:    6 
##     H2O cluster allowed cores:  6 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     H2O API Extensions:         Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4 
##     R Version:                  R version 4.0.3 (2020-10-10)

Import the dataset into H2O

credit.df = as.h2o(credit.df)

Set the predictors and response

credit.df["default"] = as.factor(credit.df["default"])
predictors = c("checking_balance","months_loan_duration","credit_history","purpose","amount","savings_balance","employment_length",
                "installment_rate","personal_status","other_debtors","residence_history","property","age","installment_plan","housing",
                "existing_credits","dependents","telephone","foreign_worker","job" )
response = "default"

Split the dataset into training and validation sets

credit_split = h2o.splitFrame(data = credit.df, ratios = 0.8, seed = 2021)
train = credit_split[[1]]
valid = credit_split[[2]]

Let us check the quality of the split

h2o.table(train[,17])

##   default Count
## 1      No   248
## 2     Yes   564
## 
## [2 rows x 2 columns]

prop.table(h2o.table(train[,17]))[,2]

##       Count
## 1 0.3050431
## 2 0.6937269
## 
## [2 rows x 1 column]

h2o.table(valid[,17])

##   default Count
## 1      No    52
## 2     Yes   136
## 
## [2 rows x 2 columns]

prop.table(h2o.table(valid[,17]))[,2]

##       Count
## 1 0.2751323
## 2 0.7195767
## 
## [2 rows x 1 column]
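
The proportions printed above do not quite sum to 1 (0.3050 + 0.6937 is about 0.9988). The implied denominators are 813 and 189 rather than 812 and 188, which suggests that `prop.table` is summing every cell of the two-column frame, including the numerically coded `default` column, rather than only `Count`. A minimal base R sketch, using the counts from the output above and assuming that explanation is right, divides by the total of the counts only:

```r
# Class counts copied from the h2o.table() output above.
train_counts <- c(No = 248, Yes = 564)
valid_counts <- c(No = 52, Yes = 136)

# Divide each count by the total of the counts only.
train_props <- train_counts / sum(train_counts)
valid_props <- valid_counts / sum(valid_counts)

round(rbind(train = train_props, valid = valid_props), 4)
```

With these counts the training set is about 30.5% `No` and the validation set about 27.7% `No`, so the two splits are reasonably balanced. On the H2O frame itself, the same idea would be `tab <- as.data.frame(h2o.table(train[, "default"]))` followed by `tab$Count / sum(tab$Count)`.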

Build and train the RandomForest model

credit_drf = h2o.randomForest(x = predictors,
                             y = response,
                             ntrees = 100,
                             nfolds = 5,
                             max_depth = 5,
                             min_rows = 10,
                             calibrate_model = TRUE,
                             calibration_frame = valid,
                             binomial_double_trees = TRUE,
                             training_frame = train,
                             validation_frame = valid,
                             keep_cross_validation_predictions = TRUE,
                             seed=2020)

## Warning in .h2o.processResponseWarnings(res): Dropping bad and constant columns: [checking_balance, installment_plan, purpose, housing, employment_length, telephone, other_debtors, credit_history, savings_balance, foreign_worker, property, personal_status, job].
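
The warning above says that all 13 character columns were dropped, so the forest is trained on the seven integer predictors only. A likely cause, stated here as an assumption, is that `read.csv` in R 4.0 and later keeps text columns as `chr`, and `as.h2o` uploads those as string columns rather than categorical (enum) columns, which the tree models do not use. A minimal sketch on a toy data frame (values copied from the preview above) converts every character column to a factor before the upload:

```r
# Toy stand-in for credit.df as it looks right after read.csv().
toy <- data.frame(
  checking_balance = c("< 0 DM", "1 - 200 DM", "unknown"),
  amount = c(1169L, 5951L, 2096L),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor.
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)

str(toy)
```

Applying the same two lines to `credit.df` before `as.h2o(credit.df)`, or reading the file with `read.csv(..., stringsAsFactors = TRUE)`, should let the categorical predictors reach the model. The metrics would then differ from those shown here.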

credit_drf

## Model Details:
## ==============
## 
## H2OBinomialModel: drf
## Model ID:  DRF_model_R_1607781240421_1 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      200               52250         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          9         25    16.14000
## 
## 
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  0.1954185
## RMSE:  0.4420616
## LogLoss:  0.5756885
## Mean Per-Class Error:  0.4323238
## AUC:  0.6747634
## AUCPR:  0.8149597
## Gini:  0.3495267
## R^2:  0.07881493
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error      Rate
## No     45 203 0.818548  =203/248
## Yes    26 538 0.046099   =26/564
## Totals 71 741 0.282020  =229/812
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.527147   0.824521 342
## 2                       max f2  0.288040   0.919166 399
## 3                 max f0point5  0.647740   0.775510 260
## 4                 max accuracy  0.527147   0.717980 342
## 5                max precision  0.904195   1.000000   0
## 6                   max recall  0.288040   1.000000 399
## 7              max specificity  0.904195   1.000000   0
## 8             max absolute_mcc  0.647740   0.261967 260
## 9   max min_per_class_accuracy  0.702283   0.604839 193
## 10 max mean_per_class_accuracy  0.647740   0.626030 260
## 11                     max tns  0.904195 248.000000   0
## 12                     max fns  0.904195 563.000000   0
## 13                     max fps  0.337473 248.000000 397
## 14                     max tps  0.288040 564.000000 399
## 15                     max tnr  0.904195   1.000000   0
## 16                     max fnr  0.904195   0.998227   0
## 17                     max fpr  0.337473   1.000000 397
## 18                     max tpr  0.288040   1.000000 399
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on validation data. **
## 
## MSE:  0.1970796
## RMSE:  0.4439365
## LogLoss:  0.5812905
## Mean Per-Class Error:  0.4295814
## AUC:  0.584983
## AUCPR:  0.793562
## Gini:  0.1699661
## R^2:  0.01504772
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error     Rate
## No     10  42 0.807692   =42/52
## Yes     7 129 0.051471   =7/136
## Totals 17 171 0.260638  =49/188
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.522234   0.840391 170
## 2                       max f2  0.363811   0.928962 187
## 3                 max f0point5  0.522234   0.786585 170
## 4                 max accuracy  0.522234   0.739362 170
## 5                max precision  0.882854   1.000000   0
## 6                   max recall  0.363811   1.000000 187
## 7              max specificity  0.882854   1.000000   0
## 8             max absolute_mcc  0.522234   0.219667 170
## 9   max min_per_class_accuracy  0.709162   0.519231  95
## 10 max mean_per_class_accuracy  0.766978   0.571550  51
## 11                     max tns  0.882854  52.000000   0
## 12                     max fns  0.882854 135.000000   0
## 13                     max fps  0.424059  52.000000 185
## 14                     max tps  0.363811 136.000000 187
## 15                     max tnr  0.882854   1.000000   0
## 16                     max fnr  0.882854   0.992647   0
## 17                     max fpr  0.424059   1.000000 185
## 18                     max tpr  0.363811   1.000000 187
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.1980747
## RMSE:  0.4450558
## LogLoss:  0.5820837
## Mean Per-Class Error:  0.4782229
## AUC:  0.6571615
## AUCPR:  0.8007964
## Gini:  0.3143231
## R^2:  0.06629377
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error      Rate
## No     13 235 0.947581  =235/248
## Yes     5 559 0.008865    =5/564
## Totals 18 794 0.295567  =240/812
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.426927   0.823270 382
## 2                       max f2  0.324056   0.919465 398
## 3                 max f0point5  0.567618   0.767312 322
## 4                 max accuracy  0.567618   0.717980 322
## 5                max precision  0.903290   1.000000   0
## 6                   max recall  0.324056   1.000000 398
## 7              max specificity  0.903290   1.000000   0
## 8             max absolute_mcc  0.672086   0.242262 226
## 9   max min_per_class_accuracy  0.698816   0.600806 193
## 10 max mean_per_class_accuracy  0.672086   0.626301 226
## 11                     max tns  0.903290 248.000000   0
## 12                     max fns  0.903290 563.000000   0
## 13                     max fps  0.301628 248.000000 399
## 14                     max tps  0.324056 564.000000 398
## 15                     max tnr  0.903290   1.000000   0
## 16                     max fnr  0.903290   0.998227   0
## 17                     max fpr  0.301628   1.000000 399
## 18                     max tpr  0.324056   1.000000 398
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy  0.73221344  0.03010463  0.7096774  0.7826087     0.7375  0.7160494
## auc        0.6686505  0.05936365  0.5887619 0.66851443  0.7253103  0.6343402
## aucpr      0.8099034 0.054801244 0.74830717  0.8773992 0.85262334 0.76748276
## err       0.26778653  0.03010463 0.29032257  0.2173913     0.2625 0.28395063
## err_count       43.2   2.3874674       45.0       40.0       42.0       46.0
##           cv_5_valid
## accuracy   0.7152318
## auc       0.72632575
## aucpr      0.8037045
## err       0.28476822
## err_count       43.0
## 
## ---
##                   mean          sd  cv_1_valid   cv_2_valid cv_3_valid
## pr_auc       0.8099034 0.054801244  0.74830717    0.8773992 0.85262334
## precision     0.724168  0.03439666         0.7   0.78142077 0.72789115
## r2          0.05476902 0.052296337 0.017857224 -0.007952967 0.11634714
## recall      0.98675823  0.01249434         1.0          1.0 0.98165137
## rmse        0.44578502 0.018204482  0.46327117   0.41779405 0.43804547
## specificity 0.15029694   0.0916091         0.1  0.024390243 0.21568628
##              cv_4_valid cv_5_valid
## pr_auc       0.76748276  0.8037045
## precision     0.7152318  0.6962963
## r2          0.050057463 0.09753624
## recall         0.972973  0.9791667
## rmse         0.45266846 0.45714596
## specificity  0.15686275 0.25454545
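
The `mean` and `sd` columns in the cross-validation summary are computed across the five folds. As a quick check, the sketch below recomputes them for `accuracy` from the fold values printed above; the vector name `acc` is only illustrative.

```r
# Per-fold accuracy copied from the Cross-Validation Metrics Summary.
acc <- c(0.7096774, 0.7826087, 0.7375, 0.7160494, 0.7152318)

mean(acc)  # average accuracy over the 5 folds
sd(acc)    # sample standard deviation (n - 1 denominator)
```

Both values agree with the printed summary (about 0.732 and 0.030), so the reported `sd` appears to be the sample standard deviation. The fold-to-fold spread is worth keeping in mind when comparing models on a validation set of only 188 rows.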

Feature Importance in RandomForest Model

h2o.varimp(credit_drf)

## Variable Importances: 
##               variable relative_importance scaled_importance percentage
## 1 months_loan_duration         1060.386963          1.000000   0.322269
## 2               amount         1043.569702          0.984140   0.317158
## 3                  age          669.808533          0.631664   0.203566
## 4     installment_rate          235.961121          0.222524   0.071712
## 5     existing_credits          121.715500          0.114784   0.036991
## 6    residence_history          114.824768          0.108286   0.034897
## 7           dependents           44.115959          0.041604   0.013408

h2o.varimp_plot(credit_drf)
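
The three importance columns carry the same information on different scales. Judging from the numbers above, `scaled_importance` is `relative_importance` divided by its maximum, and `percentage` is `relative_importance` divided by its sum. The sketch below reproduces both columns in base R from the printed values:

```r
# relative_importance values copied from the h2o.varimp() output.
rel <- c(months_loan_duration = 1060.386963, amount = 1043.569702,
         age = 669.808533, installment_rate = 235.961121,
         existing_credits = 121.715500, residence_history = 114.824768,
         dependents = 44.115959)

scaled <- rel / max(rel)      # top variable becomes 1
percentage <- rel / sum(rel)  # shares that sum to 1

round(cbind(scaled, percentage), 6)
```

Only seven variables appear because the categorical predictors were dropped during training, as the warning during the model build reports.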

Evaluate RandomForest model performance on Training dataset

perf = h2o.performance(credit_drf)
perf

## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
## 
## MSE:  0.1954185
## RMSE:  0.4420616
## LogLoss:  0.5756885
## Mean Per-Class Error:  0.4323238
## AUC:  0.6747634
## AUCPR:  0.8149597
## Gini:  0.3495267
## R^2:  0.07881493
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error      Rate
## No     45 203 0.818548  =203/248
## Yes    26 538 0.046099   =26/564
## Totals 71 741 0.282020  =229/812
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.527147   0.824521 342
## 2                       max f2  0.288040   0.919166 399
## 3                 max f0point5  0.647740   0.775510 260
## 4                 max accuracy  0.527147   0.717980 342
## 5                max precision  0.904195   1.000000   0
## 6                   max recall  0.288040   1.000000 399
## 7              max specificity  0.904195   1.000000   0
## 8             max absolute_mcc  0.647740   0.261967 260
## 9   max min_per_class_accuracy  0.702283   0.604839 193
## 10 max mean_per_class_accuracy  0.647740   0.626030 260
## 11                     max tns  0.904195 248.000000   0
## 12                     max fns  0.904195 563.000000   0
## 13                     max fps  0.337473 248.000000 397
## 14                     max tps  0.288040 564.000000 399
## 15                     max tnr  0.904195   1.000000   0
## 16                     max fnr  0.904195   0.998227   0
## 17                     max fpr  0.337473   1.000000 397
## 18                     max tpr  0.288040   1.000000 399
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Evaluate RandomForest model performance on Validation dataset

perf_drf_val = h2o.performance(credit_drf, valid)
perf_drf_val

## H2OBinomialMetrics: drf
## 
## MSE:  0.1970796
## RMSE:  0.4439365
## LogLoss:  0.5812905
## Mean Per-Class Error:  0.4295814
## AUC:  0.584983
## AUCPR:  0.793562
## Gini:  0.1699661
## R^2:  0.01504772
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error     Rate
## No     10  42 0.807692   =42/52
## Yes     7 129 0.051471   =7/136
## Totals 17 171 0.260638  =49/188
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.522234   0.840391 170
## 2                       max f2  0.363811   0.928962 187
## 3                 max f0point5  0.522234   0.786585 170
## 4                 max accuracy  0.522234   0.739362 170
## 5                max precision  0.882854   1.000000   0
## 6                   max recall  0.363811   1.000000 187
## 7              max specificity  0.882854   1.000000   0
## 8             max absolute_mcc  0.522234   0.219667 170
## 9   max min_per_class_accuracy  0.709162   0.519231  95
## 10 max mean_per_class_accuracy  0.766978   0.571550  51
## 11                     max tns  0.882854  52.000000   0
## 12                     max fns  0.882854 135.000000   0
## 13                     max fps  0.424059  52.000000 185
## 14                     max tps  0.363811 136.000000 187
## 15                     max tnr  0.882854   1.000000   0
## 16                     max fnr  0.882854   0.992647   0
## 17                     max fpr  0.424059   1.000000 185
## 18                     max tpr  0.363811   1.000000 187
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
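
The headline numbers in this report can be recovered from the confusion matrix and the AUC. The sketch below, in base R with counts copied from the validation output above, treats `Yes` as the positive class and recomputes accuracy, precision, recall, specificity, F1 and the Gini coefficient (taken here as 2 * AUC - 1, which matches the printed value).

```r
# Validation confusion matrix at the F1-optimal threshold (rows: actual).
tn <- 10; fp <- 42   # actual No
fn <- 7;  tp <- 129  # actual Yes

accuracy    <- (tp + tn) / (tp + tn + fp + fn)
precision   <- tp / (tp + fp)
recall      <- tp / (tp + fn)
specificity <- tn / (tn + fp)
f1          <- 2 * precision * recall / (precision + recall)
gini        <- 2 * 0.584983 - 1  # AUC copied from the report

round(c(accuracy = accuracy, precision = precision, recall = recall,
        specificity = specificity, f1 = f1, gini = gini), 4)
```

Accuracy (0.7394) and F1 (0.8404) match the `max accuracy` and `max f1` rows, while specificity is only about 0.19: the model labels most `No` cases as `Yes`. Accuracy is also close to what always predicting `Yes` would give (136/188, about 0.72).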

Partial Dependency Plots for RandomForest Model

h2o.partialPlot(object = credit_drf, data = credit.df, cols = c('amount', 
                                                                'age',
                                                                'months_loan_duration',
                                                                'existing_credits',
                                                                'residence_history',
                                                                'installment_rate',
                                                                'dependents'))

## [[1]]
## PartialDependence: Partial dependency plot for amount
##          amount mean_response stddev_response std_error_mean_response
## 1    250.000000      0.681666        0.099656                0.003151
## 2   1206.526316      0.690344        0.091499                0.002893
## 3   2163.052632      0.712094        0.096583                0.003054
## 4   3119.578947      0.725546        0.090936                0.002876
## 5   4076.105263      0.704348        0.083092                0.002628
## 6   5032.631579      0.692170        0.080203                0.002536
## 7   5989.157895      0.673860        0.086180                0.002725
## 8   6945.684211      0.639456        0.090300                0.002856
## 9   7902.210526      0.627722        0.082316                0.002603
## 10  8858.736842      0.580340        0.070622                0.002233
## 11  9815.263158      0.574778        0.071543                0.002262
## 12 10771.789474      0.561890        0.069406                0.002195
## 13 11728.315789      0.461472        0.047279                0.001495
## 14 12684.842105      0.455652        0.048186                0.001524
## 15 13641.368421      0.455652        0.048186                0.001524
## 16 14597.894737      0.455652        0.048186                0.001524
## 17 15554.421053      0.455652        0.048186                0.001524
## 18 16510.947368      0.455652        0.048186                0.001524
## 19 17467.473684      0.455652        0.048186                0.001524
## 20 18424.000000      0.455652        0.048186                0.001524
## 
## [[2]]
## PartialDependence: Partial dependency plot for age
##          age mean_response stddev_response std_error_mean_response
## 1  19.000000      0.617441        0.102681                0.003247
## 2  21.947368      0.617441        0.102681                0.003247
## 3  24.894737      0.646385        0.099068                0.003133
## 4  27.842105      0.675101        0.103413                0.003270
## 5  30.789474      0.673716        0.091435                0.002891
## 6  33.736842      0.715941        0.096748                0.003059
## 7  36.684211      0.745373        0.095777                0.003029
## 8  39.631579      0.737151        0.097613                0.003087
## 9  42.578947      0.734783        0.103350                0.003268
## 10 45.526316      0.736638        0.104037                0.003290
## 11 48.473684      0.738011        0.104764                0.003313
## 12 51.421053      0.735592        0.111089                0.003513
## 13 54.368421      0.729086        0.109397                0.003459
## 14 57.315789      0.727456        0.108995                0.003447
## 15 60.263158      0.726534        0.108238                0.003423
## 16 63.210526      0.729979        0.107900                0.003412
## 17 66.157895      0.728900        0.107530                0.003400
## 18 69.105263      0.728900        0.107530                0.003400
## 19 72.052632      0.728900        0.107530                0.003400
## 20 75.000000      0.728900        0.107530                0.003400
## 
## [[3]]
## PartialDependence: Partial dependency plot for months_loan_duration
##    months_loan_duration mean_response stddev_response std_error_mean_response
## 1              4.000000      0.760020        0.080747                0.002553
## 2              7.578947      0.755063        0.080017                0.002530
## 3             11.157895      0.735514        0.078578                0.002485
## 4             14.736842      0.729653        0.076627                0.002423
## 5             18.315789      0.697437        0.074280                0.002349
## 6             21.894737      0.694335        0.073465                0.002323
## 7             25.473684      0.682869        0.072863                0.002304
## 8             29.052632      0.654520        0.076739                0.002427
## 9             32.631579      0.643906        0.075382                0.002384
## 10            36.210526      0.589429        0.069988                0.002213
## 11            39.789474      0.563508        0.073546                0.002326
## 12            43.368421      0.560697        0.074481                0.002355
## 13            46.947368      0.526293        0.069878                0.002210
## 14            50.526316      0.513843        0.068217                0.002157
## 15            54.105263      0.513843        0.068217                0.002157
## 16            57.684211      0.513843        0.068217                0.002157
## 17            61.263158      0.513843        0.068217                0.002157
## 18            64.842105      0.513843        0.068217                0.002157
## 19            68.421053      0.513843        0.068217                0.002157
## 20            72.000000      0.513843        0.068217                0.002157
## 
## [[4]]
## PartialDependence: Partial dependency plot for existing_credits
##   existing_credits mean_response stddev_response std_error_mean_response
## 1         1.000000      0.687315        0.108602                0.003434
## 2         2.000000      0.707802        0.110617                0.003498
## 3         3.000000      0.734493        0.098054                0.003101
## 4         4.000000      0.734493        0.098054                0.003101
## 
## [[5]]
## PartialDependence: Partial dependency plot for residence_history
##   residence_history mean_response stddev_response std_error_mean_response
## 1          1.000000      0.703842        0.094538                0.002990
## 2          2.000000      0.694231        0.106841                0.003379
## 3          3.000000      0.694573        0.110901                0.003507
## 4          4.000000      0.691912        0.116933                0.003698
## 
## [[6]]
## PartialDependence: Partial dependency plot for installment_rate
##   installment_rate mean_response stddev_response std_error_mean_response
## 1         1.000000      0.738315        0.095438                0.003018
## 2         2.000000      0.718990        0.101822                0.003220
## 3         3.000000      0.685391        0.108502                0.003431
## 4         4.000000      0.671080        0.112188                0.003548
## 
## [[7]]
## PartialDependence: Partial dependency plot for dependents
##   dependents mean_response stddev_response std_error_mean_response
## 1   1.000000      0.697008        0.111889                0.003538
## 2   2.000000      0.684354        0.106949                0.003382
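
A partial dependence table like the ones above can be computed by hand: for each grid value of the chosen column, overwrite that column in every row, score the modified data, and average the predictions. The sketch below does this in base R on simulated data, with a logistic regression standing in for the forest. The data, the model and the helper name `partial_dependence` are all illustrative assumptions, not part of H2O. The grid of 20 equally spaced points between the column minimum and maximum mirrors the spacing seen in the tables above.

```r
set.seed(1)

# Simulated stand-in data; NOT the German credit data.
n <- 200
toy <- data.frame(amount = runif(n, 250, 18424), age = runif(n, 19, 75))
toy$repaid <- rbinom(n, 1, plogis(1.5 - 0.0001 * toy$amount + 0.01 * toy$age))

# Stand-in model: logistic regression instead of a random forest.
fit <- glm(repaid ~ amount + age, data = toy, family = binomial)

# Hypothetical helper: partial dependence of the prediction on one column.
partial_dependence <- function(model, data, col, nbins = 20) {
  grid <- seq(min(data[[col]]), max(data[[col]]), length.out = nbins)
  t(sapply(grid, function(v) {
    modified <- data
    modified[[col]] <- v  # force every row to the same grid value
    p <- predict(model, newdata = modified, type = "response")
    c(value = v,
      mean_response = mean(p),
      stddev_response = sd(p),
      std_error_mean_response = sd(p) / sqrt(length(p)))
  }))
}

pd <- partial_dependence(fit, toy, "amount")
head(pd, 3)
```

In the H2O tables above, `std_error_mean_response` equals `stddev_response / sqrt(1000)`, consistent with the 1,000 rows of `credit.df` passed as `data`. Note that `sd()` in this sketch uses the sample formula, which may differ slightly from what H2O reports.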

Build and train the GBM model

credit_gbm = h2o.gbm(x = predictors,
                     y = response,
                     ntrees=100,
                     nfolds = 5,
                     training_frame = train,
                     validation_frame = valid,
                     distribution = "bernoulli",
                     keep_cross_validation_predictions = TRUE,
                     seed = 2020)

## Warning in .h2o.processResponseWarnings(res): Dropping bad and constant columns: [checking_balance, installment_plan, purpose, housing, employment_length, telephone, other_debtors, credit_history, savings_balance, foreign_worker, property, personal_status, job].

credit_gbm

## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1607781240421_1216 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             100                      100               26308         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000          7         26    16.29000
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.08072184
## RMSE:  0.2841159
## LogLoss:  0.2889944
## Mean Per-Class Error:  0.08756577
## AUC:  0.9740441
## AUCPR:  0.9886578
## Gini:  0.9480883
## R^2:  0.6194845
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error     Rate
## No     216  32 0.129032  =32/248
## Yes     26 538 0.046099  =26/564
## Totals 242 570 0.071429  =58/812
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.597487   0.948854 231
## 2                       max f2  0.486085   0.967293 263
## 3                 max f0point5  0.686787   0.956672 197
## 4                 max accuracy  0.602242   0.928571 229
## 5                max precision  0.995919   1.000000   0
## 6                   max recall  0.192998   1.000000 357
## 7              max specificity  0.995919   1.000000   0
## 8             max absolute_mcc  0.602242   0.830938 229
## 9   max min_per_class_accuracy  0.647444   0.918440 213
## 10 max mean_per_class_accuracy  0.642998   0.920184 216
## 11                     max tns  0.995919 248.000000   0
## 12                     max fns  0.995919 561.000000   0
## 13                     max fps  0.027782 248.000000 399
## 14                     max tps  0.192998 564.000000 357
## 15                     max tnr  0.995919   1.000000   0
## 16                     max fnr  0.995919   0.994681   0
## 17                     max fpr  0.027782   1.000000 399
## 18                     max tpr  0.192998   1.000000 357
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.2228741
## RMSE:  0.4720954
## LogLoss:  0.6524698
## Mean Per-Class Error:  0.4807692
## AUC:  0.5996889
## AUCPR:  0.8046438
## Gini:  0.1993778
## R^2:  -0.1138662
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error     Rate
## No      2  50 0.961538   =50/52
## Yes     0 136 0.000000   =0/136
## Totals  2 186 0.265957  =50/188
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.140224   0.844720 185
## 2                       max f2  0.140224   0.931507 185
## 3                 max f0point5  0.245108   0.777512 174
## 4                 max accuracy  0.210653   0.734043 179
## 5                max precision  0.989209   1.000000   0
## 6                   max recall  0.140224   1.000000 185
## 7              max specificity  0.989209   1.000000   0
## 8             max absolute_mcc  0.950781   0.175397  13
## 9   max min_per_class_accuracy  0.741064   0.557692  98
## 10 max mean_per_class_accuracy  0.795601   0.596154  83
## 11                     max tns  0.989209  52.000000   0
## 12                     max fns  0.989209 135.000000   0
## 13                     max fps  0.093880  52.000000 187
## 14                     max tps  0.140224 136.000000 185
## 15                     max tnr  0.989209   1.000000   0
## 16                     max fnr  0.989209   0.992647   0
## 17                     max fpr  0.093880   1.000000 187
## 18                     max tpr  0.140224   1.000000 185
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
## 
## MSE:  0.2133289
## RMSE:  0.4618754
## LogLoss:  0.6443836
## Mean Per-Class Error:  0.4514413
## AUC:  0.6504054
## AUCPR:  0.7847516
## Gini:  0.3008107
## R^2:  -0.005613097
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error      Rate
## No     32 216 0.870968  =216/248
## Yes    18 546 0.031915   =18/564
## Totals 50 762 0.288177  =234/812
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.254081   0.823529 357
## 2                       max f2  0.014005   0.919465 398
## 3                 max f0point5  0.634923   0.768398 228
## 4                 max accuracy  0.254081   0.711823 357
## 5                max precision  0.966253   0.875000  14
## 6                   max recall  0.014005   1.000000 398
## 7              max specificity  0.991132   0.995968   0
## 8             max absolute_mcc  0.634923   0.244092 228
## 9   max min_per_class_accuracy  0.757644   0.611702 165
## 10 max mean_per_class_accuracy  0.699854   0.629132 197
## 11                     max tns  0.991132 247.000000   0
## 12                     max fns  0.991132 562.000000   0
## 13                     max fps  0.011657 248.000000 399
## 14                     max tps  0.014005 564.000000 398
## 15                     max tnr  0.991132   0.995968   0
## 16                     max fnr  0.991132   0.996454   0
## 17                     max fpr  0.011657   1.000000 399
## 18                     max tpr  0.014005   1.000000 398
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary: 
##                 mean          sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy  0.71925884 0.041758135 0.67741936  0.7880435        0.7  0.7222222
## auc       0.65698946 0.035979126  0.5987619  0.6527375  0.6601907  0.6855679
## aucpr      0.7910727 0.050494414 0.71627074  0.8571369  0.7918857  0.8052056
## err       0.28074113 0.041758135 0.32258064 0.21195652        0.3  0.2777778
## err_count       45.2   4.2071366       50.0       39.0       48.0       45.0
##           cv_5_valid
## accuracy   0.7086093
## auc        0.6876894
## aucpr      0.7848649
## err       0.29139072
## err_count       44.0
## 
## ---
##                     mean          sd  cv_1_valid  cv_2_valid  cv_3_valid
## pr_auc         0.7910727 0.050494414  0.71627074   0.8571369   0.7918857
## precision      0.7156215 0.044764668  0.67741936   0.7921348   0.6942675
## r2          -0.019336531  0.10400455 -0.12294832 -0.13473132 0.012843479
## recall        0.98498434 0.021575985         1.0    0.986014         1.0
## rmse          0.46234497 0.021716403  0.49536788  0.44329074  0.46298975
## specificity    0.1169097 0.109665364         0.0  0.09756097  0.05882353
##             cv_4_valid  cv_5_valid
## pr_auc       0.8052056   0.7848649
## precision   0.71428573         0.7
## r2          0.09316581 0.054987703
## recall        0.990991   0.9479167
## rmse        0.44227818   0.4677984
## specificity 0.13725491  0.29090908

Feature Importance in GBM Model

h2o.varimp(credit_gbm)

## Variable Importances: 
##               variable relative_importance scaled_importance percentage
## 1               amount          201.582108          1.000000   0.384392
## 2                  age          110.161858          0.546486   0.210065
## 3 months_loan_duration          105.774239          0.524720   0.201699
## 4     existing_credits           32.525982          0.161354   0.062023
## 5    residence_history           29.147604          0.144594   0.055581
## 6     installment_rate           28.376364          0.140768   0.054110
## 7           dependents           16.849354          0.083586   0.032130

h2o.varimp_plot(credit_gbm)
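
The scaled_importance and percentage columns are derived from relative_importance: the first divides by the largest value, the second by the total. As a quick check in base R, using values copied from the table above (the vector name rel_imp is only for illustration):

```r
# relative_importance values copied from the h2o.varimp() output
rel_imp = c(amount               = 201.582108,
            age                  = 110.161858,
            months_loan_duration = 105.774239,
            existing_credits     = 32.525982,
            residence_history    = 29.147604,
            installment_rate     = 28.376364,
            dependents           = 16.849354)

scaled_importance = rel_imp / max(rel_imp)   # largest variable becomes 1
percentage        = rel_imp / sum(rel_imp)   # shares that sum to 1

round(cbind(scaled_importance, percentage), 6)

# Share of total importance held by the top three variables
sum(percentage[1:3])
```

On these numbers, amount, age, and months_loan_duration together account for roughly 80% of the total importance.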

Evaluate GBM model performance on Training dataset

perf_gbm_train = h2o.performance(credit_gbm)
perf_gbm_train

## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.08072184
## RMSE:  0.2841159
## LogLoss:  0.2889944
## Mean Per-Class Error:  0.08756577
## AUC:  0.9740441
## AUCPR:  0.9886578
## Gini:  0.9480883
## R^2:  0.6194845
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         No Yes    Error     Rate
## No     216  32 0.129032  =32/248
## Yes     26 538 0.046099  =26/564
## Totals 242 570 0.071429  =58/812
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.597487   0.948854 231
## 2                       max f2  0.486085   0.967293 263
## 3                 max f0point5  0.686787   0.956672 197
## 4                 max accuracy  0.602242   0.928571 229
## 5                max precision  0.995919   1.000000   0
## 6                   max recall  0.192998   1.000000 357
## 7              max specificity  0.995919   1.000000   0
## 8             max absolute_mcc  0.602242   0.830938 229
## 9   max min_per_class_accuracy  0.647444   0.918440 213
## 10 max mean_per_class_accuracy  0.642998   0.920184 216
## 11                     max tns  0.995919 248.000000   0
## 12                     max fns  0.995919 561.000000   0
## 13                     max fps  0.027782 248.000000 399
## 14                     max tps  0.192998 564.000000 357
## 15                     max tnr  0.995919   1.000000   0
## 16                     max fnr  0.995919   0.994681   0
## 17                     max fpr  0.027782   1.000000 399
## 18                     max tpr  0.192998   1.000000 357
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

Evaluate GBM model performance on Validation dataset

predict_gbm = h2o.predict(credit_gbm, newdata = valid)

perf_gbm_val = h2o.performance(credit_gbm, valid)
perf_gbm_val

## H2OBinomialMetrics: gbm
## 
## MSE:  0.2228741
## RMSE:  0.4720954
## LogLoss:  0.6524698
## Mean Per-Class Error:  0.4807692
## AUC:  0.5996889
## AUCPR:  0.8046438
## Gini:  0.1993778
## R^2:  -0.1138662
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##        No Yes    Error     Rate
## No      2  50 0.961538   =50/52
## Yes     0 136 0.000000   =0/136
## Totals  2 186 0.265957  =50/188
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold      value idx
## 1                       max f1  0.140224   0.844720 185
## 2                       max f2  0.140224   0.931507 185
## 3                 max f0point5  0.245108   0.777512 174
## 4                 max accuracy  0.210653   0.734043 179
## 5                max precision  0.989209   1.000000   0
## 6                   max recall  0.140224   1.000000 185
## 7              max specificity  0.989209   1.000000   0
## 8             max absolute_mcc  0.950781   0.175397  13
## 9   max min_per_class_accuracy  0.741064   0.557692  98
## 10 max mean_per_class_accuracy  0.795601   0.596154  83
## 11                     max tns  0.989209  52.000000   0
## 12                     max fns  0.989209 135.000000   0
## 13                     max fps  0.093880  52.000000 187
## 14                     max tps  0.140224 136.000000 185
## 15                     max tnr  0.989209   1.000000   0
## 16                     max fnr  0.989209   0.992647   0
## 17                     max fpr  0.093880   1.000000 187
## 18                     max tpr  0.140224   1.000000 185
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
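
The Error and Rate columns follow directly from the confusion-matrix counts. The sketch below recomputes them in base R from the validation matrix above and compares accuracy with the share of the majority class (a naive baseline that always predicts "Yes"). It also lines up the AUC values reported on training, cross-validation, and validation data. All numbers are copied from the output in this section; the variable names are illustrative.

```r
# Counts copied from the validation confusion matrix
# (rows: actual class, columns: predicted class)
cm = matrix(c(2,  50,
              0, 136),
            nrow = 2, byrow = TRUE,
            dimnames = list(actual    = c("No", "Yes"),
                            predicted = c("No", "Yes")))

accuracy    = sum(diag(cm)) / sum(cm)               # (2 + 136) / 188
baseline    = max(rowSums(cm)) / sum(cm)            # always predict the majority class
specificity = cm["No", "No"] / sum(cm["No", ])      # 1 - error rate of the "No" row
recall      = cm["Yes", "Yes"] / sum(cm["Yes", ])   # 1 - error rate of the "Yes" row

round(c(accuracy = accuracy, baseline = baseline,
        specificity = specificity, recall = recall), 4)

# AUC values copied from the model summary
auc = c(train = 0.9740441, xval = 0.6504054, valid = 0.5996889)
auc["train"] - auc[c("xval", "valid")]   # gap between training and held-out AUC
```

On these numbers accuracy (138/188) is only slightly above the baseline (136/188), and the training AUC is far above the cross-validation and validation AUCs, which is consistent with overfitting.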

Partial Dependency Plots for GBM Model

h2o.partialPlot(object = credit_gbm,
                data = credit.df,
                cols = c('amount', 'age', 'months_loan_duration',
                         'existing_credits', 'residence_history',
                         'installment_rate', 'dependents'))

## [[1]]
## PartialDependence: Partial dependency plot for amount
##          amount mean_response stddev_response std_error_mean_response
## 1    250.000000      0.816605        0.179518                0.005677
## 2   1206.526316      0.479805        0.240422                0.007603
## 3   2163.052632      0.638468        0.262501                0.008301
## 4   3119.578947      0.814451        0.172167                0.005444
## 5   4076.105263      0.499470        0.201377                0.006368
## 6   5032.631579      0.587170        0.212258                0.006712
## 7   5989.157895      0.688832        0.233964                0.007399
## 8   6945.684211      0.668347        0.244814                0.007742
## 9   7902.210526      0.648302        0.252116                0.007973
## 10  8858.736842      0.227349        0.200570                0.006343
## 11  9815.263158      0.339576        0.224751                0.007107
## 12 10771.789474      0.344699        0.221123                0.006993
## 13 11728.315789      0.228508        0.203443                0.006433
## 14 12684.842105      0.087733        0.130003                0.004111
## 15 13641.368421      0.093373        0.134894                0.004266
## 16 14597.894737      0.088250        0.130353                0.004122
## 17 15554.421053      0.088250        0.130353                0.004122
## 18 16510.947368      0.088250        0.130353                0.004122
## 19 17467.473684      0.088250        0.130353                0.004122
## 20 18424.000000      0.088250        0.130353                0.004122
## 
## [[2]]
## PartialDependence: Partial dependency plot for age
##          age mean_response stddev_response std_error_mean_response
## 1  19.000000      0.501949        0.253478                0.008016
## 2  21.947368      0.563037        0.263407                0.008330
## 3  24.894737      0.628822        0.249826                0.007900
## 4  27.842105      0.675405        0.234801                0.007425
## 5  30.789474      0.647186        0.226378                0.007159
## 6  33.736842      0.678719        0.229565                0.007259
## 7  36.684211      0.832353        0.161509                0.005107
## 8  39.631579      0.796781        0.182556                0.005773
## 9  42.578947      0.748857        0.218104                0.006897
## 10 45.526316      0.799486        0.191049                0.006041
## 11 48.473684      0.801373        0.193954                0.006133
## 12 51.421053      0.696768        0.264239                0.008356
## 13 54.368421      0.671279        0.274165                0.008670
## 14 57.315789      0.719032        0.239743                0.007581
## 15 60.263158      0.680278        0.240948                0.007619
## 16 63.210526      0.790093        0.196858                0.006225
## 17 66.157895      0.723962        0.227134                0.007183
## 18 69.105263      0.723962        0.227134                0.007183
## 19 72.052632      0.723962        0.227134                0.007183
## 20 75.000000      0.723962        0.227134                0.007183
## 
## [[3]]
## PartialDependence: Partial dependency plot for months_loan_duration
##    months_loan_duration mean_response stddev_response std_error_mean_response
## 1              4.000000      0.777827        0.245616                0.007767
## 2              7.578947      0.765740        0.248302                0.007852
## 3             11.157895      0.724953        0.245542                0.007765
## 4             14.736842      0.720862        0.254015                0.008033
## 5             18.315789      0.671168        0.236638                0.007483
## 6             21.894737      0.674866        0.232270                0.007345
## 7             25.473684      0.671474        0.233319                0.007378
## 8             29.052632      0.590455        0.246653                0.007800
## 9             32.631579      0.564291        0.253771                0.008025
## 10            36.210526      0.490392        0.256615                0.008115
## 11            39.789474      0.437204        0.254624                0.008052
## 12            43.368421      0.428207        0.250317                0.007916
## 13            46.947368      0.363505        0.219758                0.006949
## 14            50.526316      0.329331        0.216915                0.006859
## 15            54.105263      0.329331        0.216915                0.006859
## 16            57.684211      0.334110        0.217403                0.006875
## 17            61.263158      0.334110        0.217403                0.006875
## 18            64.842105      0.334110        0.217403                0.006875
## 19            68.421053      0.334110        0.217403                0.006875
## 20            72.000000      0.334110        0.217403                0.006875
## 
## [[4]]
## PartialDependence: Partial dependency plot for existing_credits
##   existing_credits mean_response stddev_response std_error_mean_response
## 1         1.000000      0.688385        0.250241                0.007913
## 2         2.000000      0.723687        0.245062                0.007750
## 3         3.000000      0.734272        0.243546                0.007702
## 4         4.000000      0.734272        0.243546                0.007702
## 
## [[5]]
## PartialDependence: Partial dependency plot for residence_history
##   residence_history mean_response stddev_response std_error_mean_response
## 1          1.000000      0.720821        0.228991                0.007241
## 2          2.000000      0.685075        0.245908                0.007776
## 3          3.000000      0.704677        0.248096                0.007845
## 4          4.000000      0.685858        0.257893                0.008155
## 
## [[6]]
## PartialDependence: Partial dependency plot for installment_rate
##   installment_rate mean_response stddev_response std_error_mean_response
## 1         1.000000      0.711415        0.238594                0.007545
## 2         2.000000      0.709392        0.244679                0.007737
## 3         3.000000      0.689506        0.250506                0.007922
## 4         4.000000      0.680057        0.255345                0.008075
## 
## [[7]]
## PartialDependence: Partial dependency plot for dependents
##   dependents mean_response stddev_response std_error_mean_response
## 1   1.000000      0.704368        0.252720                0.007992
## 2   2.000000      0.653643        0.254764                0.008056
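
Each row of a partial dependence table comes from the same recipe: set the chosen column to one grid value for every row, predict, then summarise the predictions. The sketch below reproduces that recipe in base R. It is not H2O's implementation: a logistic regression (glm) stands in for the GBM, the data are simulated, and the helper partial_dependence() is hypothetical. The standard error is taken as sd / sqrt(n), which matches the ratio between stddev_response and std_error_mean_response in the tables above for n = 1000.

```r
# Simulated toy data; NOT the German credit data
set.seed(2020)
n = 200
toy = data.frame(amount = runif(n, 250, 18424),
                 age    = sample(19:75, n, replace = TRUE))
# Simulated binary outcome whose probability falls as amount grows
toy$y = rbinom(n, 1, plogis(1.5 - 0.0002 * toy$amount + 0.01 * toy$age))

# Stand-in model: logistic regression instead of the H2O GBM
fit = glm(y ~ amount + age, data = toy, family = binomial)

# Hypothetical helper: partial dependence of `model` on column `col`
partial_dependence = function(model, data, col, grid) {
  rows = lapply(grid, function(v) {
    tmp = data
    tmp[[col]] = rep(v, nrow(tmp))   # force every row to the grid value
    p = predict(model, newdata = tmp, type = "response")
    data.frame(value = v,
               mean_response = mean(p),
               stddev_response = sd(p),
               std_error_mean_response = sd(p) / sqrt(length(p)))
  })
  do.call(rbind, rows)
}

# 20 evenly spaced grid points, as in the tables for the continuous columns
grid = seq(min(toy$amount), max(toy$amount), length.out = 20)
pd_amount = partial_dependence(fit, toy, "amount", grid)
print(pd_amount)
```

Plotting mean_response against value gives the partial dependence curve. Because the other columns keep their observed values, the curve shows the model's average prediction as a function of one variable, not a causal effect.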

Shut down the H2O instance

h2o.shutdown(prompt = FALSE)