Predictive Analytics of Car Insurance Claims (USA): Gradient Boosting Machine Learning vs. Decision Tree Analysis
Car Insurance Claims (USA) : Circular TreeMap (interactive)
Car_Insurance_claim = read.csv("carinsuranceclaimUSV02.csv", header = TRUE)
Car Insurance Claims (USA) : Search DataTable
Car Insurance Claims (USA) : Decision Tree Analysis
indata_model = ctree(CLAIM_FLAG ~. , indata_train)
plot(indata_model)
indata_model
##
## Model formula:
## CLAIM_FLAG ~ OWER_AGE_CLS + MSTATUS + GENDER + EDUCATION + OCCUPATION +
## CLM_FREQ_CLS
##
## Fitted party:
## [1] root
## | [2] CLM_FREQ_CLS in CLM_FREQ_0, CLM_FREQ_5
## | | [3] OWER_AGE_CLS in 31=< AGE <45, 46=< AGE <60, AGE >=< 61
## | | | [4] EDUCATION <High School, High School: Claim_No (n = 1394, err = 27.0%)
## | | | [5] EDUCATION in Bachelors, Masters, PhD
## | | | | [6] OWER_AGE_CLS in 31=< AGE <45, AGE >=< 61: Claim_No (n = 845, err = 21.5%)
## | | | | [7] OWER_AGE_CLS in 46=< AGE <60: Claim_No (n = 979, err = 13.4%)
## | | [8] OWER_AGE_CLS in 31=<< AGE <45, AGE < 30: Claim_No (n = 147, err = 46.9%)
## | [9] CLM_FREQ_CLS in CLM_FREQ_1, CLM_FREQ_2, CLM_FREQ_3, CLM_FREQ_4
## | | [10] OCCUPATION in Blue Collar, Clerical, Home Maker, Professional, Student, Unemployed
## | | | [11] EDUCATION <High School, High School: Claim_Yes (n = 1099, err = 43.8%)
## | | | [12] EDUCATION in Bachelors, Masters, PhD: Claim_No (n = 843, err = 44.4%)
## | | [13] OCCUPATION in Doctor, Lawyer, Manager
## | | | [14] GENDER in Female
## | | | | [15] OCCUPATION in Doctor, Lawyer: Claim_No (n = 134, err = 42.5%)
## | | | | [16] OCCUPATION in Manager: Claim_No (n = 98, err = 23.5%)
## | | | [17] GENDER in Male: Claim_No (n = 254, err = 18.5%)
##
## Number of inner nodes: 8
## Number of terminal nodes: 9
Decision Tree Model : Confusion Matrix
indata_pred = predict(indata_model, indata_test)
indata_conft = table("prediction" = indata_pred , "actual" = indata_test$CLAIM_FLAG)
| prediction \ actual | Claim_No | Claim_Yes |
|---|---|---|
| Claim_No | 2851 | 490 |
| Claim_Yes | 325 | 197 |
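The counts in this table can be condensed into the usual classification metrics. The following is a minimal sketch in base R that re-enters the four counts by hand (rows are predictions, columns are actual classes, matching the `table()` call) and treats `Claim_Yes` as the positive class. That choice, and the object names `conf`, `accuracy`, `recall` and `precision`, are assumptions made for illustration and are not part of the original analysis.

```r
# Counts copied from the decision tree confusion matrix
# (rows: prediction, columns: actual). matrix() fills column by column.
conf <- matrix(c(2851, 325, 490, 197), nrow = 2,
               dimnames = list(prediction = c("Claim_No", "Claim_Yes"),
                               actual     = c("Claim_No", "Claim_Yes")))

# Share of all test rows that were classified correctly.
accuracy <- sum(diag(conf)) / sum(conf)

# Of the actual Claim_Yes rows, the share predicted as Claim_Yes.
recall <- conf["Claim_Yes", "Claim_Yes"] / sum(conf[, "Claim_Yes"])

# Of the rows predicted as Claim_Yes, the share that really are Claim_Yes.
precision <- conf["Claim_Yes", "Claim_Yes"] / sum(conf["Claim_Yes", ])

round(c(accuracy = accuracy, recall = recall, precision = precision), 3)
```

With these counts the accuracy is about 0.79, while recall and precision for `Claim_Yes` are about 0.29 and 0.38, so the tree mostly gets its accuracy from the majority class.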
Gradient Boosting Machine Learning
splits = h2o.splitFrame(
  data = indata,
  ratios = c(0.6, 0.1),
  destination_frames = c("train.hex", "valid.hex", "test.hex"),
  seed = 2256)
indata_train = splits[[1]]
indata_valid = splits[[2]]
indata_test = splits[[3]]
# dim(indata_train)
# dim(indata_valid)
# dim(indata_test)
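With `ratios = c(0.6, 0.1)` the remaining share (0.3) goes to the third frame. `h2o.splitFrame` splits probabilistically, so the realised shares are close to, but not exactly, 60/10/30. As a rough check, the sketch below uses the row counts that appear in the model output further down (5783 training, 996 validation and 2877 test rows); the vector name `n_rows` is hypothetical.

```r
# Row counts read off the confusion-matrix totals in the model output.
n_rows <- c(train = 5783, valid = 996, test = 2877)

# Realised share of each frame, to compare with the requested 0.6 / 0.1 / 0.3.
share <- n_rows / sum(n_rows)

round(share, 3)
```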
Gradient Boosting Machine Learning Model Setting
y_response = "CLAIM_FLAG"
x_predictors = setdiff(names(indata), c(y_response, "names"))
gbm_indata = h2o.gbm(x = x_predictors,
                     y = y_response,
                     training_frame = indata_train,
                     validation_frame = indata_valid,
                     #validation_frame = indata_test,
                     ntrees = 500,
                     learn_rate = 0.001,
                     learn_rate_annealing = 0.99,
                     stopping_rounds = 20,
                     stopping_tolerance = 1e-4,
                     stopping_metric = "AUC",
                     sample_rate = 0.9,
                     col_sample_rate = 0.9,
                     seed = 2256,
                     score_tree_interval = 20
)
gbm_indata
## Model Details:
## ==============
##
## H2OBinomialModel: gbm
## Model ID: GBM_model_R_1559377896373_1
## Model Summary:
## number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1 500 500 227904 5
## max_depth mean_depth min_leaves max_leaves mean_leaves
## 1 5 5.00000 28 32 31.51200
##
##
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
##
## MSE: 0.1903705
## RMSE: 0.4363147
## LogLoss: 0.5669711
## Mean Per-Class Error: 0.3228044
## AUC: 0.7374177
## pr_auc: 0.4944572
## Gini: 0.4748354
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Claim_No Claim_Yes Error Rate
## Claim_No 2647 1602 0.377030 =1602/4249
## Claim_Yes 412 1122 0.268579 =412/1534
## Totals 3059 2724 0.348262 =2014/5783
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.263002 0.527008 269
## 2 max f2 0.253121 0.678330 349
## 3 max f0point5 0.279741 0.506144 142
## 4 max accuracy 0.289444 0.752378 62
## 5 max precision 0.304281 1.000000 0
## 6 max recall 0.246149 1.000000 398
## 7 max specificity 0.304281 1.000000 0
## 8 max absolute_mcc 0.269102 0.322348 220
## 9 max min_per_class_accuracy 0.265229 0.673570 248
## 10 max mean_per_class_accuracy 0.263002 0.677196 269
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
##
## MSE: 0.1932356
## RMSE: 0.4395858
## LogLoss: 0.573392
## Mean Per-Class Error: 0.3388755
## AUC: 0.7164597
## pr_auc: 0.474162
## Gini: 0.4329193
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## Claim_No Claim_Yes Error Rate
## Claim_No 364 363 0.499312 =363/727
## Claim_Yes 48 221 0.178439 =48/269
## Totals 412 584 0.412651 =411/996
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.258521 0.518171 283
## 2 max f2 0.253119 0.689940 331
## 3 max f0point5 0.271672 0.463668 167
## 4 max accuracy 0.289961 0.752008 36
## 5 max precision 0.302393 1.000000 0
## 6 max recall 0.246428 1.000000 393
## 7 max specificity 0.302393 1.000000 0
## 8 max absolute_mcc 0.257937 0.290673 292
## 9 max min_per_class_accuracy 0.264574 0.646840 225
## 10 max mean_per_class_accuracy 0.258521 0.661125 283
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
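The model settings combine a small `learn_rate` with `learn_rate_annealing`, which shrinks the learning rate by a constant factor after every tree. The sketch below, in base R, works out what that implies for 500 trees. It assumes that tree k is fitted with `learn_rate * learn_rate_annealing^(k - 1)`; the exact indexing inside H2O may differ by one tree, and the variable names are hypothetical.

```r
# Values taken from the h2o.gbm() call.
learn_rate <- 0.001
annealing  <- 0.99
ntrees     <- 500

# Assumption: tree k (k = 1..ntrees) uses learn_rate * annealing^(k - 1).
rate_per_tree <- learn_rate * annealing^(seq_len(ntrees) - 1)

final_rate <- rate_per_tree[ntrees]  # rate applied to the last tree
total_rate <- sum(rate_per_tree)     # cumulative step size over all trees

# Share of Claim_Yes in the training frame, from the training confusion matrix.
base_rate <- 1534 / 5783

signif(c(final_rate = final_rate, total_rate = total_rate,
         base_rate = base_rate), 3)
```

Under that assumption the cumulative step size stays just below 0.1, which may explain why the thresholds in the metrics tables all fall in a narrow band (roughly 0.246 to 0.304) around the training base rate of about 0.265.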
varimp_gbm_indata =h2o.varimp(gbm_indata)
varimp_gbm_indata
## Variable Importances:
## variable relative_importance scaled_importance percentage
## 1 CLM_FREQ_CLS 29743.185547 1.000000 0.503663
## 2 OCCUPATION 14980.474609 0.503661 0.253675
## 3 OWER_AGE_CLS 5238.026855 0.176108 0.088699
## 4 MSTATUS 4985.911621 0.167632 0.084430
## 5 EDUCATION 3595.964844 0.120900 0.060893
## 6 GENDER 510.127869 0.017151 0.008638
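The three importance columns are related by simple rescaling. As a sketch, the base R snippet below re-enters the `relative_importance` values from the table and reproduces the other two columns; it is a consistency check on the printed output, not a call into H2O, and the vector names are chosen here for illustration.

```r
# relative_importance values copied from the variable importance table.
relative_importance <- c(CLM_FREQ_CLS = 29743.185547,
                         OCCUPATION   = 14980.474609,
                         OWER_AGE_CLS = 5238.026855,
                         MSTATUS      = 4985.911621,
                         EDUCATION    = 3595.964844,
                         GENDER       = 510.127869)

# scaled_importance: each value relative to the most important variable.
scaled_importance <- relative_importance / max(relative_importance)

# percentage: each value as a share of the total importance.
percentage <- relative_importance / sum(relative_importance)

round(cbind(scaled_importance, percentage), 6)
```

Read this way, `CLM_FREQ_CLS` and `OCCUPATION` together account for roughly three quarters of the total importance.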
p = treemap(varimp_gbm_indata,
            index = c("variable"),
            vSize = "scaled_importance",
            type = "index",
            fontsize.labels = c(18),
            title = "gbm_indata : Relative Importance of Predictors",
            fontsize.title = 24
)
plot(gbm_indata)
indata_preds = h2o.predict(gbm_indata, indata_test)
indata_preds
## predict Claim_No Claim_Yes
## 1 Claim_Yes 0.7303725 0.2696275
## 2 Claim_Yes 0.7304108 0.2695892
## 3 Claim_Yes 0.7303725 0.2696275
## 4 Claim_Yes 0.7303725 0.2696275
## 5 Claim_Yes 0.7102944 0.2897056
## 6 Claim_Yes 0.7092393 0.2907607
##
## [2877 rows x 3 columns]
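The rows above are labelled `Claim_Yes` even though the `Claim_Yes` probability is only about 0.27. This is consistent with the label being assigned by comparing that probability with an F1-optimal threshold near 0.26 rather than with 0.5. The sketch below illustrates the idea in base R using the six printed probabilities; the threshold 0.263002 is the training max-F1 threshold from the model output, and using that particular value is an assumption, since H2O may pick the threshold from a different metric set.

```r
# Claim_Yes probabilities copied from the six printed prediction rows.
p_claim_yes <- c(0.2696275, 0.2695892, 0.2696275,
                 0.2696275, 0.2897056, 0.2907607)

# Assumption: the label uses the max-F1 threshold reported on training data.
threshold <- 0.263002

# Label a row Claim_Yes when its probability reaches the threshold.
predicted <- ifelse(p_claim_yes >= threshold, "Claim_Yes", "Claim_No")

data.frame(predict = predicted, Claim_Yes = p_claim_yes)
```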
h2o.confusionMatrix(gbm_indata, indata_train)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.264281590535036:
## Claim_No Claim_Yes Error Rate
## Claim_No 2772 1477 0.347611 =1477/4249
## Claim_Yes 458 1076 0.298566 =458/1534
## Totals 3230 2553 0.334601 =1935/5783
h2o.confusionMatrix(gbm_indata, indata_valid)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.258521076723682:
## Claim_No Claim_Yes Error Rate
## Claim_No 364 363 0.499312 =363/727
## Claim_Yes 48 221 0.178439 =48/269
## Totals 412 584 0.412651 =411/996
h2o.confusionMatrix(gbm_indata, indata_test)
## Confusion Matrix (vertical: actual; across: predicted) for max f1 @ threshold = 0.265406378986216:
## Claim_No Claim_Yes Error Rate
## Claim_No 1441 674 0.318676 =674/2115
## Claim_Yes 267 495 0.350394 =267/762
## Totals 1708 1169 0.327077 =941/2877
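The `Error` and `Rate` columns can be re-derived from the four counts, along with precision, recall and F1 for `Claim_Yes`. The following base R sketch does this for the test-set matrix. Note that H2O prints actual classes in rows and predictions in columns, the opposite of the `table()` layout used for the decision tree. Treating `Claim_Yes` as the positive class and the object names are assumptions for illustration.

```r
# Counts copied from the test-set confusion matrix
# (rows: actual, columns: predicted). matrix() fills column by column.
conf_test <- matrix(c(1441, 267, 674, 495), nrow = 2,
                    dimnames = list(actual    = c("Claim_No", "Claim_Yes"),
                                    predicted = c("Claim_No", "Claim_Yes")))

tp <- conf_test["Claim_Yes", "Claim_Yes"]  # actual yes, predicted yes
fp <- conf_test["Claim_No",  "Claim_Yes"]  # actual no,  predicted yes
fn <- conf_test["Claim_Yes", "Claim_No"]   # actual yes, predicted no

error_rate <- (fp + fn) / sum(conf_test)   # overall misclassification rate
precision  <- tp / (tp + fp)
recall     <- tp / (tp + fn)
f1         <- 2 * precision * recall / (precision + recall)

round(c(error_rate = error_rate, precision = precision,
        recall = recall, f1 = f1), 4)
```

The overall error rate reproduces the 0.327077 shown in the `Totals` row. Because the decision tree was scored on a different test split, any comparison between the two sets of metrics is only indicative.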
AUC: Area Under ROC Curve
- AUC measures the performance of a classifier as the area under the ROC curve; 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.
h2o.auc(h2o.performance(gbm_indata, newdata = indata_train))
## [1] 0.7374285
h2o.auc(h2o.performance(gbm_indata, newdata = indata_valid))
## [1] 0.7164597
h2o.auc(h2o.performance(gbm_indata, newdata = indata_test))
## [1] 0.7190677
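AUC can also be read as the probability that a randomly chosen `Claim_Yes` case receives a higher score than a randomly chosen `Claim_No` case. The base R sketch below computes it with the rank-sum (Mann-Whitney) formula on a small made-up example; the helper `auc_rank` and the toy `scores` and `labels` are hypothetical and are not taken from the insurance data. It also checks the relation Gini = 2 * AUC - 1 against the training metrics printed in the model output.

```r
# Rank-based AUC: labels are 1 for the positive class and 0 for the negative.
auc_rank <- function(scores, labels) {
  r     <- rank(scores)          # average ranks, so ties count as one half
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy data (hypothetical): six scores with three positives and three negatives.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
labels <- c(1, 1, 0, 1, 0, 0)
auc_toy <- auc_rank(scores, labels)

# Gini implied by the training AUC from the model output (0.7374177).
gini_train <- 2 * 0.7374177 - 1

c(auc_toy = auc_toy, gini_train = gini_train)
```

In the toy example 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. The Gini value derived from the training AUC agrees with the 0.4748354 shown in the training metrics.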
01 June, 2019