German Credit Data Set
In this example, we will use H2O's Distributed Random Forest and Gradient Boosting Machine (GBM) algorithms to assess the creditworthiness of loan applicants. The data set used in this example comes from the UCI Machine Learning Repository.
Let's Load and Inspect the Data
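The chunks that follow call kable() and a number of h2o functions, but the write-up does not show the package setup; a minimal sketch, assuming both packages are already installed:
# assumed setup, not shown in the original run
library(h2o)    # h2o.init(), h2o.randomForest(), h2o.gbm(), ...
library(knitr)  # kable() for the data preview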
credit.df = read.csv('data/german_credit_data.csv')
credit.df$default = factor(credit.df$default, levels = c(1, 2), labels = c('Yes', 'No'))
str(credit.df)
## 'data.frame': 1000 obs. of 21 variables:
## $ checking_balance : chr "< 0 DM" "1 - 200 DM" "unknown" "< 0 DM" ...
## $ months_loan_duration: int 6 48 12 42 24 36 24 36 12 30 ...
## $ credit_history : chr "critical" "repaid" "critical" "repaid" ...
## $ purpose : chr "radio/tv" "radio/tv" "education" "furniture" ...
## $ amount : int 1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
## $ savings_balance : chr "unknown" "< 100 DM" "< 100 DM" "< 100 DM" ...
## $ employment_length : chr "> 7 yrs" "1 - 4 yrs" "4 - 7 yrs" "4 - 7 yrs" ...
## $ installment_rate : int 4 2 2 2 3 2 3 2 2 4 ...
## $ personal_status : chr "single male" "female" "single male" "single male" ...
## $ other_debtors : chr "none" "none" "none" "guarantor" ...
## $ residence_history : int 4 2 3 4 4 4 4 2 4 2 ...
## $ property : chr "real estate" "real estate" "real estate" "building society savings" ...
## $ age : int 67 22 49 45 53 35 53 35 61 28 ...
## $ installment_plan : chr "none" "none" "none" "none" ...
## $ housing : chr "own" "own" "own" "for free" ...
## $ existing_credits : int 2 1 1 1 2 1 1 1 1 2 ...
## $ default : Factor w/ 2 levels "Yes","No": 1 2 1 1 2 1 1 1 1 2 ...
## $ dependents : int 1 1 2 2 2 2 1 1 1 1 ...
## $ telephone : chr "yes" "none" "none" "none" ...
## $ foreign_worker : chr "yes" "yes" "yes" "yes" ...
## $ job : chr "skilled employee" "skilled employee" "unskilled resident" "skilled employee" ...
Summary statistics
summary(credit.df)
## checking_balance months_loan_duration credit_history purpose
## Length:1000 Min. : 4.0 Length:1000 Length:1000
## Class :character 1st Qu.:12.0 Class :character Class :character
## Mode :character Median :18.0 Mode :character Mode :character
## Mean :20.9
## 3rd Qu.:24.0
## Max. :72.0
## amount savings_balance employment_length installment_rate
## Min. : 250 Length:1000 Length:1000 Min. :1.000
## 1st Qu.: 1366 Class :character Class :character 1st Qu.:2.000
## Median : 2320 Mode :character Mode :character Median :3.000
## Mean : 3271 Mean :2.973
## 3rd Qu.: 3972 3rd Qu.:4.000
## Max. :18424 Max. :4.000
## personal_status other_debtors residence_history property
## Length:1000 Length:1000 Min. :1.000 Length:1000
## Class :character Class :character 1st Qu.:2.000 Class :character
## Mode :character Mode :character Median :3.000 Mode :character
## Mean :2.845
## 3rd Qu.:4.000
## Max. :4.000
## age installment_plan housing existing_credits
## Min. :19.00 Length:1000 Length:1000 Min. :1.000
## 1st Qu.:27.00 Class :character Class :character 1st Qu.:1.000
## Median :33.00 Mode :character Mode :character Median :1.000
## Mean :35.55 Mean :1.407
## 3rd Qu.:42.00 3rd Qu.:2.000
## Max. :75.00 Max. :4.000
## default dependents telephone foreign_worker
## Yes:700 Min. :1.000 Length:1000 Length:1000
## No :300 1st Qu.:1.000 Class :character Class :character
## Median :1.000 Mode :character Mode :character
## Mean :1.155
## 3rd Qu.:1.000
## Max. :2.000
## job
## Length:1000
## Class :character
## Mode :character
##
##
##
Sneak peek at the Data
kable(head(credit.df, n=15), format="html")

| checking_balance | months_loan_duration | credit_history | purpose | amount | savings_balance | employment_length | installment_rate | personal_status | other_debtors | residence_history | property | age | installment_plan | housing | existing_credits | default | dependents | telephone | foreign_worker | job |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| < 0 DM | 6 | critical | radio/tv | 1169 | unknown | > 7 yrs | 4 | single male | none | 4 | real estate | 67 | none | own | 2 | Yes | 1 | yes | yes | skilled employee |
| 1 - 200 DM | 48 | repaid | radio/tv | 5951 | < 100 DM | 1 - 4 yrs | 2 | female | none | 2 | real estate | 22 | none | own | 1 | No | 1 | none | yes | skilled employee |
| unknown | 12 | critical | education | 2096 | < 100 DM | 4 - 7 yrs | 2 | single male | none | 3 | real estate | 49 | none | own | 1 | Yes | 2 | none | yes | unskilled resident |
| < 0 DM | 42 | repaid | furniture | 7882 | < 100 DM | 4 - 7 yrs | 2 | single male | guarantor | 4 | building society savings | 45 | none | for free | 1 | Yes | 2 | none | yes | skilled employee |
| < 0 DM | 24 | delayed | car (new) | 4870 | < 100 DM | 1 - 4 yrs | 3 | single male | none | 4 | unknown/none | 53 | none | for free | 2 | No | 2 | none | yes | skilled employee |
| unknown | 36 | repaid | education | 9055 | unknown | 1 - 4 yrs | 2 | single male | none | 4 | unknown/none | 35 | none | for free | 1 | Yes | 2 | yes | yes | unskilled resident |
| unknown | 24 | repaid | furniture | 2835 | 501 - 1000 DM | > 7 yrs | 3 | single male | none | 4 | building society savings | 53 | none | own | 1 | Yes | 1 | none | yes | skilled employee |
| 1 - 200 DM | 36 | repaid | car (used) | 6948 | < 100 DM | 1 - 4 yrs | 2 | single male | none | 2 | other | 35 | none | rent | 1 | Yes | 1 | yes | yes | mangement self-employed |
| unknown | 12 | repaid | radio/tv | 3059 | > 1000 DM | 4 - 7 yrs | 2 | divorced male | none | 4 | real estate | 61 | none | own | 1 | Yes | 1 | none | yes | unskilled resident |
| 1 - 200 DM | 30 | critical | car (new) | 5234 | < 100 DM | unemployed | 4 | married male | none | 2 | other | 28 | none | own | 2 | No | 1 | none | yes | mangement self-employed |
| 1 - 200 DM | 12 | repaid | car (new) | 1295 | < 100 DM | 0 - 1 yrs | 3 | female | none | 1 | other | 25 | none | rent | 1 | No | 1 | none | yes | skilled employee |
| < 0 DM | 48 | repaid | business | 4308 | < 100 DM | 0 - 1 yrs | 3 | female | none | 4 | building society savings | 24 | none | rent | 1 | No | 1 | none | yes | skilled employee |
| 1 - 200 DM | 12 | repaid | radio/tv | 1567 | < 100 DM | 1 - 4 yrs | 1 | female | none | 1 | other | 22 | none | own | 1 | Yes | 1 | yes | yes | skilled employee |
| < 0 DM | 24 | critical | car (new) | 1199 | < 100 DM | > 7 yrs | 4 | single male | none | 4 | other | 60 | none | own | 2 | No | 1 | none | yes | unskilled resident |
| < 0 DM | 15 | repaid | car (new) | 1403 | < 100 DM | 1 - 4 yrs | 2 | female | none | 4 | other | 28 | none | rent | 1 | Yes | 1 | none | yes | skilled employee |
Start the H2O instance
h2o.init()
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\dncha\AppData\Local\Temp\Rtmp0mP5Qk\file42581eec36fa/h2o_dncha_started_from_r.out
## C:\Users\dncha\AppData\Local\Temp\Rtmp0mP5Qk\file425841f24974/h2o_dncha_started_from_r.err
##
##
## Starting H2O JVM and connecting: Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 2 seconds 778 milliseconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.32.0.1
## H2O cluster version age: 2 months and 3 days
## H2O cluster name: H2O_started_from_R_dncha_xic696
## H2O cluster total nodes: 1
## H2O cluster total memory: 3.52 GB
## H2O cluster total cores: 6
## H2O cluster allowed cores: 6
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 54321
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Amazon S3, Algos, AutoML, Core V3, TargetEncoder, Core V4
## R Version: R version 4.0.3 (2020-10-10)
Import the dataset into H2O
credit.df = as.h2o(credit.df)
Set the Predictors and response
credit.df["default"] = as.factor(credit.df["default"])
predictors = c("checking_balance","months_loan_duration","credit_history","purpose","amount","savings_balance","employment_length",
"installment_rate","personal_status","other_debtors","residence_history","property","age","installment_plan","housing",
"existing_credits","dependents","telephone","foreign_worker","job" )
response = "default"
Split the dataset into training and validation data sets
credit_split = h2o.splitFrame(data = credit.df, ratios = 0.8, seed = 2021)
train = credit_split[[1]]
valid = credit_split[[2]]
Let us check the quality of the split
h2o.table(train[,17])
## default Count
## 1 No 248
## 2 Yes 564
##
## [2 rows x 2 columns]
prop.table(h2o.table(train[,17]))[,2]
## Count
## 1 0.3050431
## 2 0.6937269
##
## [2 rows x 1 column]
h2o.table(valid[,17])
## default Count
## 1 No 52
## 2 Yes 136
##
## [2 rows x 2 columns]
prop.table(h2o.table(valid[,17]))[,2]
## Count
## 1 0.2751323
## 2 0.7195767
##
## [2 rows x 1 column]
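Both frames keep roughly the same class mix (about 70% Yes / 30% No), so the split looks reasonable. As an alternative check, the counts can be pulled back into R and compared side by side; a small sketch, assumed rather than part of the original output:
# assumed helper: class proportions of the training and validation frames
train_tab = as.data.frame(h2o.table(train["default"]))
valid_tab = as.data.frame(h2o.table(valid["default"]))
data.frame(class      = train_tab$default,
           train_prop = train_tab$Count / sum(train_tab$Count),
           valid_prop = valid_tab$Count / sum(valid_tab$Count))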
Build and train the RandomForest model
credit_drf = h2o.randomForest(x = predictors,
y = response,
ntrees = 100,
nfolds = 5,
max_depth = 5,
min_rows = 10,
calibrate_model = TRUE,
calibration_frame = valid,
binomial_double_trees = TRUE,
training_frame = train,
validation_frame = valid,
keep_cross_validation_predictions=T,
seed=2020)
## Warning in .h2o.processResponseWarnings(res): Dropping bad and constant columns: [checking_balance, installment_plan, purpose, housing, employment_length, telephone, other_debtors, credit_history, savings_balance, foreign_worker, property, personal_status, job].
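Note the warning: every character column has been dropped, so the forest is grown on the numeric features only. This most likely happens because as.h2o() imported those columns as plain string columns rather than categorical enums. A hedged sketch of one way to keep them, applied to the original R data.frame before the as.h2o() call:
# assumed pre-processing step, not part of the run shown here:
# convert character columns to factors so H2O parses them as enums
char_cols = sapply(credit.df, is.character)
credit.df[char_cols] = lapply(credit.df[char_cols], as.factor)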
credit_drf
## Model Details:
## ==============
##
## H2OBinomialModel: drf
## Model ID: DRF_model_R_1607781240421_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 200 52250 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 9 25 16.14000
##
##
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 0.1954185
## RMSE: 0.4420616
## LogLoss: 0.5756885
## Mean Per-Class Error: 0.4323238
## AUC: 0.6747634
## AUCPR: 0.8149597
## Gini: 0.3495267
## R^2: 0.07881493
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 45 203 0.818548 =203/248
## Yes 26 538 0.046099 =26/564
## Totals 71 741 0.282020 =229/812
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.527147 0.824521 342
## 2 max f2 0.288040 0.919166 399
## 3 max f0point5 0.647740 0.775510 260
## 4 max accuracy 0.527147 0.717980 342
## 5 max precision 0.904195 1.000000 0
## 6 max recall 0.288040 1.000000 399
## 7 max specificity 0.904195 1.000000 0
## 8 max absolute_mcc 0.647740 0.261967 260
## 9 max min_per_class_accuracy 0.702283 0.604839 193
## 10 max mean_per_class_accuracy 0.647740 0.626030 260
## 11 max tns 0.904195 248.000000 0
## 12 max fns 0.904195 563.000000 0
## 13 max fps 0.337473 248.000000 397
## 14 max tps 0.288040 564.000000 399
## 15 max tnr 0.904195 1.000000 0
## 16 max fnr 0.904195 0.998227 0
## 17 max fpr 0.337473 1.000000 397
## 18 max tpr 0.288040 1.000000 399
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on validation data. **
##
## MSE: 0.1970796
## RMSE: 0.4439365
## LogLoss: 0.5812905
## Mean Per-Class Error: 0.4295814
## AUC: 0.584983
## AUCPR: 0.793562
## Gini: 0.1699661
## R^2: 0.01504772
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 10 42 0.807692 =42/52
## Yes 7 129 0.051471 =7/136
## Totals 17 171 0.260638 =49/188
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.522234 0.840391 170
## 2 max f2 0.363811 0.928962 187
## 3 max f0point5 0.522234 0.786585 170
## 4 max accuracy 0.522234 0.739362 170
## 5 max precision 0.882854 1.000000 0
## 6 max recall 0.363811 1.000000 187
## 7 max specificity 0.882854 1.000000 0
## 8 max absolute_mcc 0.522234 0.219667 170
## 9 max min_per_class_accuracy 0.709162 0.519231 95
## 10 max mean_per_class_accuracy 0.766978 0.571550 51
## 11 max tns 0.882854 52.000000 0
## 12 max fns 0.882854 135.000000 0
## 13 max fps 0.424059 52.000000 185
## 14 max tps 0.363811 136.000000 187
## 15 max tnr 0.882854 1.000000 0
## 16 max fnr 0.882854 0.992647 0
## 17 max fpr 0.424059 1.000000 185
## 18 max tpr 0.363811 1.000000 187
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: drf
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.1980747
## RMSE: 0.4450558
## LogLoss: 0.5820837
## Mean Per-Class Error: 0.4782229
## AUC: 0.6571615
## AUCPR: 0.8007964
## Gini: 0.3143231
## R^2: 0.06629377
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 13 235 0.947581 =235/248
## Yes 5 559 0.008865 =5/564
## Totals 18 794 0.295567 =240/812
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.426927 0.823270 382
## 2 max f2 0.324056 0.919465 398
## 3 max f0point5 0.567618 0.767312 322
## 4 max accuracy 0.567618 0.717980 322
## 5 max precision 0.903290 1.000000 0
## 6 max recall 0.324056 1.000000 398
## 7 max specificity 0.903290 1.000000 0
## 8 max absolute_mcc 0.672086 0.242262 226
## 9 max min_per_class_accuracy 0.698816 0.600806 193
## 10 max mean_per_class_accuracy 0.672086 0.626301 226
## 11 max tns 0.903290 248.000000 0
## 12 max fns 0.903290 563.000000 0
## 13 max fps 0.301628 248.000000 399
## 14 max tps 0.324056 564.000000 398
## 15 max tnr 0.903290 1.000000 0
## 16 max fnr 0.903290 0.998227 0
## 17 max fpr 0.301628 1.000000 399
## 18 max tpr 0.324056 1.000000 398
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy 0.73221344 0.03010463 0.7096774 0.7826087 0.7375 0.7160494
## auc 0.6686505 0.05936365 0.5887619 0.66851443 0.7253103 0.6343402
## aucpr 0.8099034 0.054801244 0.74830717 0.8773992 0.85262334 0.76748276
## err 0.26778653 0.03010463 0.29032257 0.2173913 0.2625 0.28395063
## err_count 43.2 2.3874674 45.0 40.0 42.0 46.0
## cv_5_valid
## accuracy 0.7152318
## auc 0.72632575
## aucpr 0.8037045
## err 0.28476822
## err_count 43.0
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid
## pr_auc 0.8099034 0.054801244 0.74830717 0.8773992 0.85262334
## precision 0.724168 0.03439666 0.7 0.78142077 0.72789115
## r2 0.05476902 0.052296337 0.017857224 -0.007952967 0.11634714
## recall 0.98675823 0.01249434 1.0 1.0 0.98165137
## rmse 0.44578502 0.018204482 0.46327117 0.41779405 0.43804547
## specificity 0.15029694 0.0916091 0.1 0.024390243 0.21568628
## cv_4_valid cv_5_valid
## pr_auc 0.76748276 0.8037045
## precision 0.7152318 0.6962963
## r2 0.050057463 0.09753624
## recall 0.972973 0.9791667
## rmse 0.45266846 0.45714596
## specificity 0.15686275 0.25454545
Feature Importance in RandomForest Model
h2o.varimp(credit_drf)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 months_loan_duration 1060.386963 1.000000 0.322269
## 2 amount 1043.569702 0.984140 0.317158
## 3 age 669.808533 0.631664 0.203566
## 4 installment_rate 235.961121 0.222524 0.071712
## 5 existing_credits 121.715500 0.114784 0.036991
## 6 residence_history 114.824768 0.108286 0.034897
## 7 dependents 44.115959 0.041604 0.013408
h2o.varimp_plot(credit_drf)
Evaluate RandomForest model performance on Training dataset
perf = h2o.performance(credit_drf)
perf
## H2OBinomialMetrics: drf
## ** Reported on training data. **
## ** Metrics reported on Out-Of-Bag training samples **
##
## MSE: 0.1954185
## RMSE: 0.4420616
## LogLoss: 0.5756885
## Mean Per-Class Error: 0.4323238
## AUC: 0.6747634
## AUCPR: 0.8149597
## Gini: 0.3495267
## R^2: 0.07881493
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 45 203 0.818548 =203/248
## Yes 26 538 0.046099 =26/564
## Totals 71 741 0.282020 =229/812
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.527147 0.824521 342
## 2 max f2 0.288040 0.919166 399
## 3 max f0point5 0.647740 0.775510 260
## 4 max accuracy 0.527147 0.717980 342
## 5 max precision 0.904195 1.000000 0
## 6 max recall 0.288040 1.000000 399
## 7 max specificity 0.904195 1.000000 0
## 8 max absolute_mcc 0.647740 0.261967 260
## 9 max min_per_class_accuracy 0.702283 0.604839 193
## 10 max mean_per_class_accuracy 0.647740 0.626030 260
## 11 max tns 0.904195 248.000000 0
## 12 max fns 0.904195 563.000000 0
## 13 max fps 0.337473 248.000000 397
## 14 max tps 0.288040 564.000000 399
## 15 max tnr 0.904195 1.000000 0
## 16 max fnr 0.904195 0.998227 0
## 17 max fpr 0.337473 1.000000 397
## 18 max tpr 0.288040 1.000000 399
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Evaluate RandomForest model performance on Validation dataset
perf_drf_val = h2o.performance(credit_drf, valid)
perf_drf_val
## H2OBinomialMetrics: drf
##
## MSE: 0.1970796
## RMSE: 0.4439365
## LogLoss: 0.5812905
## Mean Per-Class Error: 0.4295814
## AUC: 0.584983
## AUCPR: 0.793562
## Gini: 0.1699661
## R^2: 0.01504772
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 10 42 0.807692 =42/52
## Yes 7 129 0.051471 =7/136
## Totals 17 171 0.260638 =49/188
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.522234 0.840391 170
## 2 max f2 0.363811 0.928962 187
## 3 max f0point5 0.522234 0.786585 170
## 4 max accuracy 0.522234 0.739362 170
## 5 max precision 0.882854 1.000000 0
## 6 max recall 0.363811 1.000000 187
## 7 max specificity 0.882854 1.000000 0
## 8 max absolute_mcc 0.522234 0.219667 170
## 9 max min_per_class_accuracy 0.709162 0.519231 95
## 10 max mean_per_class_accuracy 0.766978 0.571550 51
## 11 max tns 0.882854 52.000000 0
## 12 max fns 0.882854 135.000000 0
## 13 max fps 0.424059 52.000000 185
## 14 max tps 0.363811 136.000000 187
## 15 max tnr 0.882854 1.000000 0
## 16 max fnr 0.882854 0.992647 0
## 17 max fpr 0.424059 1.000000 185
## 18 max tpr 0.363811 1.000000 187
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
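The validation AUC (about 0.58) is noticeably lower than the out-of-bag training AUC (about 0.67), so the forest does not generalize particularly well on this split. Individual metrics can also be read straight off the performance object; a brief sketch using the standard accessors:
# pull headline validation metrics out of the H2OModelMetrics object
h2o.auc(perf_drf_val)              # area under the ROC curve
h2o.logloss(perf_drf_val)          # log loss
h2o.confusionMatrix(perf_drf_val)  # confusion matrix at the F1-optimal threshold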
Partial Dependency Plots for RandomForest Model
h2o.partialPlot(object = credit_drf, data = credit.df, cols = c('amount',
'age',
'months_loan_duration',
'existing_credits',
'residence_history',
'installment_rate',
'dependents'))
## [[1]]
## PartialDependence: Partial dependency plot for amount
## amount mean_response stddev_response std_error_mean_response
## 1 250.000000 0.681666 0.099656 0.003151
## 2 1206.526316 0.690344 0.091499 0.002893
## 3 2163.052632 0.712094 0.096583 0.003054
## 4 3119.578947 0.725546 0.090936 0.002876
## 5 4076.105263 0.704348 0.083092 0.002628
## 6 5032.631579 0.692170 0.080203 0.002536
## 7 5989.157895 0.673860 0.086180 0.002725
## 8 6945.684211 0.639456 0.090300 0.002856
## 9 7902.210526 0.627722 0.082316 0.002603
## 10 8858.736842 0.580340 0.070622 0.002233
## 11 9815.263158 0.574778 0.071543 0.002262
## 12 10771.789474 0.561890 0.069406 0.002195
## 13 11728.315789 0.461472 0.047279 0.001495
## 14 12684.842105 0.455652 0.048186 0.001524
## 15 13641.368421 0.455652 0.048186 0.001524
## 16 14597.894737 0.455652 0.048186 0.001524
## 17 15554.421053 0.455652 0.048186 0.001524
## 18 16510.947368 0.455652 0.048186 0.001524
## 19 17467.473684 0.455652 0.048186 0.001524
## 20 18424.000000 0.455652 0.048186 0.001524
##
## [[2]]
## PartialDependence: Partial dependency plot for age
## age mean_response stddev_response std_error_mean_response
## 1 19.000000 0.617441 0.102681 0.003247
## 2 21.947368 0.617441 0.102681 0.003247
## 3 24.894737 0.646385 0.099068 0.003133
## 4 27.842105 0.675101 0.103413 0.003270
## 5 30.789474 0.673716 0.091435 0.002891
## 6 33.736842 0.715941 0.096748 0.003059
## 7 36.684211 0.745373 0.095777 0.003029
## 8 39.631579 0.737151 0.097613 0.003087
## 9 42.578947 0.734783 0.103350 0.003268
## 10 45.526316 0.736638 0.104037 0.003290
## 11 48.473684 0.738011 0.104764 0.003313
## 12 51.421053 0.735592 0.111089 0.003513
## 13 54.368421 0.729086 0.109397 0.003459
## 14 57.315789 0.727456 0.108995 0.003447
## 15 60.263158 0.726534 0.108238 0.003423
## 16 63.210526 0.729979 0.107900 0.003412
## 17 66.157895 0.728900 0.107530 0.003400
## 18 69.105263 0.728900 0.107530 0.003400
## 19 72.052632 0.728900 0.107530 0.003400
## 20 75.000000 0.728900 0.107530 0.003400
##
## [[3]]
## PartialDependence: Partial dependency plot for months_loan_duration
## months_loan_duration mean_response stddev_response std_error_mean_response
## 1 4.000000 0.760020 0.080747 0.002553
## 2 7.578947 0.755063 0.080017 0.002530
## 3 11.157895 0.735514 0.078578 0.002485
## 4 14.736842 0.729653 0.076627 0.002423
## 5 18.315789 0.697437 0.074280 0.002349
## 6 21.894737 0.694335 0.073465 0.002323
## 7 25.473684 0.682869 0.072863 0.002304
## 8 29.052632 0.654520 0.076739 0.002427
## 9 32.631579 0.643906 0.075382 0.002384
## 10 36.210526 0.589429 0.069988 0.002213
## 11 39.789474 0.563508 0.073546 0.002326
## 12 43.368421 0.560697 0.074481 0.002355
## 13 46.947368 0.526293 0.069878 0.002210
## 14 50.526316 0.513843 0.068217 0.002157
## 15 54.105263 0.513843 0.068217 0.002157
## 16 57.684211 0.513843 0.068217 0.002157
## 17 61.263158 0.513843 0.068217 0.002157
## 18 64.842105 0.513843 0.068217 0.002157
## 19 68.421053 0.513843 0.068217 0.002157
## 20 72.000000 0.513843 0.068217 0.002157
##
## [[4]]
## PartialDependence: Partial dependency plot for existing_credits
## existing_credits mean_response stddev_response std_error_mean_response
## 1 1.000000 0.687315 0.108602 0.003434
## 2 2.000000 0.707802 0.110617 0.003498
## 3 3.000000 0.734493 0.098054 0.003101
## 4 4.000000 0.734493 0.098054 0.003101
##
## [[5]]
## PartialDependence: Partial dependency plot for residence_history
## residence_history mean_response stddev_response std_error_mean_response
## 1 1.000000 0.703842 0.094538 0.002990
## 2 2.000000 0.694231 0.106841 0.003379
## 3 3.000000 0.694573 0.110901 0.003507
## 4 4.000000 0.691912 0.116933 0.003698
##
## [[6]]
## PartialDependence: Partial dependency plot for installment_rate
## installment_rate mean_response stddev_response std_error_mean_response
## 1 1.000000 0.738315 0.095438 0.003018
## 2 2.000000 0.718990 0.101822 0.003220
## 3 3.000000 0.685391 0.108502 0.003431
## 4 4.000000 0.671080 0.112188 0.003548
##
## [[7]]
## PartialDependence: Partial dependency plot for dependents
## dependents mean_response stddev_response std_error_mean_response
## 1 1.000000 0.697008 0.111889 0.003538
## 2 2.000000 0.684354 0.106949 0.003382
Build and train the GBM model
credit_gbm = h2o.gbm(x = predictors,
y = response,
ntrees=100,
nfolds = 5,
training_frame = train,
validation_frame = valid,
distribution = "bernoulli",
keep_cross_validation_predictions=T,
seed = 2020)
## Warning in .h2o.processResponseWarnings(res): Dropping bad and constant columns: [checking_balance, installment_plan, purpose, housing, employment_length, telephone, other_debtors, credit_history, savings_balance, foreign_worker, property, personal_status, job].
credit_gbm
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1607781240421_1216
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 100 100 26308 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 7 26 16.29000
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.08072184
## RMSE: 0.2841159
## LogLoss: 0.2889944
## Mean Per-Class Error: 0.08756577
## AUC: 0.9740441
## AUCPR: 0.9886578
## Gini: 0.9480883
## R^2: 0.6194845
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 216 32 0.129032 =32/248
## Yes 26 538 0.046099 =26/564
## Totals 242 570 0.071429 =58/812
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.597487 0.948854 231
## 2 max f2 0.486085 0.967293 263
## 3 max f0point5 0.686787 0.956672 197
## 4 max accuracy 0.602242 0.928571 229
## 5 max precision 0.995919 1.000000 0
## 6 max recall 0.192998 1.000000 357
## 7 max specificity 0.995919 1.000000 0
## 8 max absolute_mcc 0.602242 0.830938 229
## 9 max min_per_class_accuracy 0.647444 0.918440 213
## 10 max mean_per_class_accuracy 0.642998 0.920184 216
## 11 max tns 0.995919 248.000000 0
## 12 max fns 0.995919 561.000000 0
## 13 max fps 0.027782 248.000000 399
## 14 max tps 0.192998 564.000000 357
## 15 max tnr 0.995919 1.000000 0
## 16 max fnr 0.995919 0.994681 0
## 17 max fpr 0.027782 1.000000 399
## 18 max tpr 0.192998 1.000000 357
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.2228741
## RMSE: 0.4720954
## LogLoss: 0.6524698
## Mean Per-Class Error: 0.4807692
## AUC: 0.5996889
## AUCPR: 0.8046438
## Gini: 0.1993778
## R^2: -0.1138662
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 2 50 0.961538 =50/52
## Yes 0 136 0.000000 =0/136
## Totals 2 186 0.265957 =50/188
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.140224 0.844720 185
## 2 max f2 0.140224 0.931507 185
## 3 max f0point5 0.245108 0.777512 174
## 4 max accuracy 0.210653 0.734043 179
## 5 max precision 0.989209 1.000000 0
## 6 max recall 0.140224 1.000000 185
## 7 max specificity 0.989209 1.000000 0
## 8 max absolute_mcc 0.950781 0.175397 13
## 9 max min_per_class_accuracy 0.741064 0.557692 98
## 10 max mean_per_class_accuracy 0.795601 0.596154 83
## 11 max tns 0.989209 52.000000 0
## 12 max fns 0.989209 135.000000 0
## 13 max fps 0.093880 52.000000 187
## 14 max tps 0.140224 136.000000 185
## 15 max tnr 0.989209 1.000000 0
## 16 max fnr 0.989209 0.992647 0
## 17 max fpr 0.093880 1.000000 187
## 18 max tpr 0.140224 1.000000 185
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on cross-validation data. **
## ** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
##
## MSE: 0.2133289
## RMSE: 0.4618754
## LogLoss: 0.6443836
## Mean Per-Class Error: 0.4514413
## AUC: 0.6504054
## AUCPR: 0.7847516
## Gini: 0.3008107
## R^2: -0.005613097
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 32 216 0.870968 =216/248
## Yes 18 546 0.031915 =18/564
## Totals 50 762 0.288177 =234/812
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.254081 0.823529 357
## 2 max f2 0.014005 0.919465 398
## 3 max f0point5 0.634923 0.768398 228
## 4 max accuracy 0.254081 0.711823 357
## 5 max precision 0.966253 0.875000 14
## 6 max recall 0.014005 1.000000 398
## 7 max specificity 0.991132 0.995968 0
## 8 max absolute_mcc 0.634923 0.244092 228
## 9 max min_per_class_accuracy 0.757644 0.611702 165
## 10 max mean_per_class_accuracy 0.699854 0.629132 197
## 11 max tns 0.991132 247.000000 0
## 12 max fns 0.991132 562.000000 0
## 13 max fps 0.011657 248.000000 399
## 14 max tps 0.014005 564.000000 398
## 15 max tnr 0.991132 0.995968 0
## 16 max fnr 0.991132 0.996454 0
## 17 max fpr 0.011657 1.000000 399
## 18 max tpr 0.014005 1.000000 398
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## Cross-Validation Metrics Summary:
## mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid
## accuracy 0.71925884 0.041758135 0.67741936 0.7880435 0.7 0.7222222
## auc 0.65698946 0.035979126 0.5987619 0.6527375 0.6601907 0.6855679
## aucpr 0.7910727 0.050494414 0.71627074 0.8571369 0.7918857 0.8052056
## err 0.28074113 0.041758135 0.32258064 0.21195652 0.3 0.2777778
## err_count 45.2 4.2071366 50.0 39.0 48.0 45.0
## cv_5_valid
## accuracy 0.7086093
## auc 0.6876894
## aucpr 0.7848649
## err 0.29139072
## err_count 44.0
##
## ---
## mean sd cv_1_valid cv_2_valid cv_3_valid
## pr_auc 0.7910727 0.050494414 0.71627074 0.8571369 0.7918857
## precision 0.7156215 0.044764668 0.67741936 0.7921348 0.6942675
## r2 -0.019336531 0.10400455 -0.12294832 -0.13473132 0.012843479
## recall 0.98498434 0.021575985 1.0 0.986014 1.0
## rmse 0.46234497 0.021716403 0.49536788 0.44329074 0.46298975
## specificity 0.1169097 0.109665364 0.0 0.09756097 0.05882353
## cv_4_valid cv_5_valid
## pr_auc 0.8052056 0.7848649
## precision 0.71428573 0.7
## r2 0.09316581 0.054987703
## recall 0.990991 0.9479167
## rmse 0.44227818 0.4677984
## specificity 0.13725491 0.29090908
Feature Importance in GBM Model
h2o.varimp(credit_gbm)
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 amount 201.582108 1.000000 0.384392
## 2 age 110.161858 0.546486 0.210065
## 3 months_loan_duration 105.774239 0.524720 0.201699
## 4 existing_credits 32.525982 0.161354 0.062023
## 5 residence_history 29.147604 0.144594 0.055581
## 6 installment_rate 28.376364 0.140768 0.054110
## 7 dependents 16.849354 0.083586 0.032130
h2o.varimp_plot(credit_gbm)
Evaluate GBM model performance on Training dataset
perf_gbm_train = h2o.performance(credit_gbm)
perf_gbm_train
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.08072184
## RMSE: 0.2841159
## LogLoss: 0.2889944
## Mean Per-Class Error: 0.08756577
## AUC: 0.9740441
## AUCPR: 0.9886578
## Gini: 0.9480883
## R^2: 0.6194845
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 216 32 0.129032 =32/248
## Yes 26 538 0.046099 =26/564
## Totals 242 570 0.071429 =58/812
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.597487 0.948854 231
## 2 max f2 0.486085 0.967293 263
## 3 max f0point5 0.686787 0.956672 197
## 4 max accuracy 0.602242 0.928571 229
## 5 max precision 0.995919 1.000000 0
## 6 max recall 0.192998 1.000000 357
## 7 max specificity 0.995919 1.000000 0
## 8 max absolute_mcc 0.602242 0.830938 229
## 9 max min_per_class_accuracy 0.647444 0.918440 213
## 10 max mean_per_class_accuracy 0.642998 0.920184 216
## 11 max tns 0.995919 248.000000 0
## 12 max fns 0.995919 561.000000 0
## 13 max fps 0.027782 248.000000 399
## 14 max tps 0.192998 564.000000 357
## 15 max tnr 0.995919 1.000000 0
## 16 max fnr 0.995919 0.994681 0
## 17 max fpr 0.027782 1.000000 399
## 18 max tpr 0.192998 1.000000 357
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Evaluate GBM model performance on Validation dataset
predict_gbm = h2o.predict(credit_gbm, newdata = valid)
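The prediction frame itself is not printed above; a quick peek (a sketch, output not reproduced here) would show the predicted class along with one probability column per class:
head(predict_gbm)  # predicted label plus class probabilities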
perf_gbm_val = h2o.performance(credit_gbm, valid)
perf_gbm_val
## H2OBinomialMetrics: gbm
##
## MSE: 0.2228741
## RMSE: 0.4720954
## LogLoss: 0.6524698
## Mean Per-Class Error: 0.4807692
## AUC: 0.5996889
## AUCPR: 0.8046438
## Gini: 0.1993778
## R^2: -0.1138662
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## No Yes Error Rate
## No 2 50 0.961538 =50/52
## Yes 0 136 0.000000 =0/136
## Totals 2 186 0.265957 =50/188
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.140224 0.844720 185
## 2 max f2 0.140224 0.931507 185
## 3 max f0point5 0.245108 0.777512 174
## 4 max accuracy 0.210653 0.734043 179
## 5 max precision 0.989209 1.000000 0
## 6 max recall 0.140224 1.000000 185
## 7 max specificity 0.989209 1.000000 0
## 8 max absolute_mcc 0.950781 0.175397 13
## 9 max min_per_class_accuracy 0.741064 0.557692 98
## 10 max mean_per_class_accuracy 0.795601 0.596154 83
## 11 max tns 0.989209 52.000000 0
## 12 max fns 0.989209 135.000000 0
## 13 max fps 0.093880 52.000000 187
## 14 max tps 0.140224 136.000000 185
## 15 max tnr 0.989209 1.000000 0
## 16 max fnr 0.989209 0.992647 0
## 17 max fpr 0.093880 1.000000 187
## 18 max tpr 0.140224 1.000000 185
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
Partial Dependency Plots for GBM Model
h2o.partialPlot(object = credit_gbm, data = credit.df, cols = c('amount',
'age',
'months_loan_duration',
'existing_credits',
'residence_history',
'installment_rate',
'dependents'))
## [[1]]
## PartialDependence: Partial dependency plot for amount
## amount mean_response stddev_response std_error_mean_response
## 1 250.000000 0.816605 0.179518 0.005677
## 2 1206.526316 0.479805 0.240422 0.007603
## 3 2163.052632 0.638468 0.262501 0.008301
## 4 3119.578947 0.814451 0.172167 0.005444
## 5 4076.105263 0.499470 0.201377 0.006368
## 6 5032.631579 0.587170 0.212258 0.006712
## 7 5989.157895 0.688832 0.233964 0.007399
## 8 6945.684211 0.668347 0.244814 0.007742
## 9 7902.210526 0.648302 0.252116 0.007973
## 10 8858.736842 0.227349 0.200570 0.006343
## 11 9815.263158 0.339576 0.224751 0.007107
## 12 10771.789474 0.344699 0.221123 0.006993
## 13 11728.315789 0.228508 0.203443 0.006433
## 14 12684.842105 0.087733 0.130003 0.004111
## 15 13641.368421 0.093373 0.134894 0.004266
## 16 14597.894737 0.088250 0.130353 0.004122
## 17 15554.421053 0.088250 0.130353 0.004122
## 18 16510.947368 0.088250 0.130353 0.004122
## 19 17467.473684 0.088250 0.130353 0.004122
## 20 18424.000000 0.088250 0.130353 0.004122
##
## [[2]]
## PartialDependence: Partial dependency plot for age
## age mean_response stddev_response std_error_mean_response
## 1 19.000000 0.501949 0.253478 0.008016
## 2 21.947368 0.563037 0.263407 0.008330
## 3 24.894737 0.628822 0.249826 0.007900
## 4 27.842105 0.675405 0.234801 0.007425
## 5 30.789474 0.647186 0.226378 0.007159
## 6 33.736842 0.678719 0.229565 0.007259
## 7 36.684211 0.832353 0.161509 0.005107
## 8 39.631579 0.796781 0.182556 0.005773
## 9 42.578947 0.748857 0.218104 0.006897
## 10 45.526316 0.799486 0.191049 0.006041
## 11 48.473684 0.801373 0.193954 0.006133
## 12 51.421053 0.696768 0.264239 0.008356
## 13 54.368421 0.671279 0.274165 0.008670
## 14 57.315789 0.719032 0.239743 0.007581
## 15 60.263158 0.680278 0.240948 0.007619
## 16 63.210526 0.790093 0.196858 0.006225
## 17 66.157895 0.723962 0.227134 0.007183
## 18 69.105263 0.723962 0.227134 0.007183
## 19 72.052632 0.723962 0.227134 0.007183
## 20 75.000000 0.723962 0.227134 0.007183
##
## [[3]]
## PartialDependence: Partial dependency plot for months_loan_duration
## months_loan_duration mean_response stddev_response std_error_mean_response
## 1 4.000000 0.777827 0.245616 0.007767
## 2 7.578947 0.765740 0.248302 0.007852
## 3 11.157895 0.724953 0.245542 0.007765
## 4 14.736842 0.720862 0.254015 0.008033
## 5 18.315789 0.671168 0.236638 0.007483
## 6 21.894737 0.674866 0.232270 0.007345
## 7 25.473684 0.671474 0.233319 0.007378
## 8 29.052632 0.590455 0.246653 0.007800
## 9 32.631579 0.564291 0.253771 0.008025
## 10 36.210526 0.490392 0.256615 0.008115
## 11 39.789474 0.437204 0.254624 0.008052
## 12 43.368421 0.428207 0.250317 0.007916
## 13 46.947368 0.363505 0.219758 0.006949
## 14 50.526316 0.329331 0.216915 0.006859
## 15 54.105263 0.329331 0.216915 0.006859
## 16 57.684211 0.334110 0.217403 0.006875
## 17 61.263158 0.334110 0.217403 0.006875
## 18 64.842105 0.334110 0.217403 0.006875
## 19 68.421053 0.334110 0.217403 0.006875
## 20 72.000000 0.334110 0.217403 0.006875
##
## [[4]]
## PartialDependence: Partial dependency plot for existing_credits
## existing_credits mean_response stddev_response std_error_mean_response
## 1 1.000000 0.688385 0.250241 0.007913
## 2 2.000000 0.723687 0.245062 0.007750
## 3 3.000000 0.734272 0.243546 0.007702
## 4 4.000000 0.734272 0.243546 0.007702
##
## [[5]]
## PartialDependence: Partial dependency plot for residence_history
## residence_history mean_response stddev_response std_error_mean_response
## 1 1.000000 0.720821 0.228991 0.007241
## 2 2.000000 0.685075 0.245908 0.007776
## 3 3.000000 0.704677 0.248096 0.007845
## 4 4.000000 0.685858 0.257893 0.008155
##
## [[6]]
## PartialDependence: Partial dependency plot for installment_rate
## installment_rate mean_response stddev_response std_error_mean_response
## 1 1.000000 0.711415 0.238594 0.007545
## 2 2.000000 0.709392 0.244679 0.007737
## 3 3.000000 0.689506 0.250506 0.007922
## 4 4.000000 0.680057 0.255345 0.008075
##
## [[7]]
## PartialDependence: Partial dependency plot for dependents
## dependents mean_response stddev_response std_error_mean_response
## 1 1.000000 0.704368 0.252720 0.007992
## 2 2.000000 0.653643 0.254764 0.008056
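Before shutting the cluster down, it is worth putting the two models side by side on the validation frame, reusing the performance objects computed above; a short comparison sketch (assumed, not part of the original output):
# assumed comparison of the two models on the validation set
data.frame(model   = c("DRF", "GBM"),
           auc     = c(h2o.auc(perf_drf_val), h2o.auc(perf_gbm_val)),
           logloss = c(h2o.logloss(perf_drf_val), h2o.logloss(perf_gbm_val)))
On the numbers reported above, neither model separates the classes well on the held-out data (validation AUC around 0.58-0.60), which is consistent with all of the categorical predictors having been dropped during training.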
Shut down the H2O instance
h2o.shutdown(prompt = F)