Predictive Analytics of Car Insurance Claims (USA): Gradient Boosting Machine Learning Vs Decision Tree Analysis

Car Insurance Claims (USA) : Circular TreeMap (interactive)

Car_Insurance_claim = read.csv("carinsuranceclaimUSV02.csv", header = TRUE)

Car Insurance Claims (USA) : Searchable DataTable

Car Insurance Claims (USA) : Decision Tree Analysis

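library(partykit)  # provides ctree(); the "Fitted party" printout below is partykit's format
# Conditional inference tree of CLAIM_FLAG on all other columns of the training data
indata_model = ctree(CLAIM_FLAG ~ ., data = indata_train)
plot(indata_model)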

indata_model

## 
## Model formula:
## CLAIM_FLAG ~ OWER_AGE_CLS + MSTATUS + GENDER + EDUCATION + OCCUPATION + 
##     CLM_FREQ_CLS
## 
## Fitted party:
## [1] root
## |   [2] CLM_FREQ_CLS in CLM_FREQ_0, CLM_FREQ_5
## |   |   [3] OWER_AGE_CLS in 31=< AGE <45, 46=< AGE <60, AGE >=< 61
## |   |   |   [4] EDUCATION in <High School, High School: Claim_No (n = 1394, err = 27.0%)
## |   |   |   [5] EDUCATION in Bachelors, Masters, PhD
## |   |   |   |   [6] OWER_AGE_CLS in 31=< AGE <45, AGE >=< 61: Claim_No (n = 845, err = 21.5%)
## |   |   |   |   [7] OWER_AGE_CLS in 46=< AGE <60: Claim_No (n = 979, err = 13.4%)
## |   |   [8] OWER_AGE_CLS in 31=<< AGE <45, AGE < 30: Claim_No (n = 147, err = 46.9%)
## |   [9] CLM_FREQ_CLS in CLM_FREQ_1, CLM_FREQ_2, CLM_FREQ_3, CLM_FREQ_4
## |   |   [10] OCCUPATION in Blue Collar, Clerical, Home Maker, Professional, Student, Unemployed
## |   |   |   [11] EDUCATION in <High School, High School: Claim_Yes (n = 1099, err = 43.8%)
## |   |   |   [12] EDUCATION in Bachelors, Masters, PhD: Claim_No (n = 843, err = 44.4%)
## |   |   [13] OCCUPATION in Doctor, Lawyer, Manager
## |   |   |   [14] GENDER in Female
## |   |   |   |   [15] OCCUPATION in Doctor, Lawyer: Claim_No (n = 134, err = 42.5%)
## |   |   |   |   [16] OCCUPATION in Manager: Claim_No (n = 98, err = 23.5%)
## |   |   |   [17] GENDER in Male: Claim_No (n = 254, err = 18.5%)
## 
## Number of inner nodes:    8
## Number of terminal nodes: 9

Decision Tree Model : Confusion Matrix

indata_pred = predict(indata_model, indata_test)
indata_conft = table("prediction" = indata_pred , "actual" = indata_test$CLAIM_FLAG)

Table 1: Confusion Matrix, Decision Tree, Test Data: Car Insurance Claims
Predicted    Actual Claim_No    Actual Claim_Yes
Claim_No     2851               490
Claim_Yes    325                197
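The counts in Table 1 can be turned into summary rates. The sketch below re-enters the four counts by hand and treats Claim_Yes as the positive class; the names `tree_cm`, `accuracy`, `sensitivity` and `specificity` are illustrative and are not objects from the original analysis.

```r
# Table 1 counts: rows = predicted class, columns = actual class
# (matrix() fills column by column)
tree_cm = matrix(c(2851, 325, 490, 197), nrow = 2,
                 dimnames = list(prediction = c("Claim_No", "Claim_Yes"),
                                 actual     = c("Claim_No", "Claim_Yes")))

# Share of all test rows that were classified correctly
accuracy = sum(diag(tree_cm)) / sum(tree_cm)

# Share of actual claims that the tree flagged as Claim_Yes
sensitivity = tree_cm["Claim_Yes", "Claim_Yes"] / sum(tree_cm[, "Claim_Yes"])

# Share of actual non-claims that the tree flagged as Claim_No
specificity = tree_cm["Claim_No", "Claim_No"] / sum(tree_cm[, "Claim_No"])

print(round(c(accuracy = accuracy,
              sensitivity = sensitivity,
              specificity = specificity), 3))
```

On these counts the tree is right on roughly 79% of the test rows but flags only about 29% of the actual claims, so accuracy alone overstates how well it finds claimants.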

Gradient Boosting Machine Learning

# 60% train, 10% validation, remaining 30% test (indata must be an H2OFrame)
splits = h2o.splitFrame(
  data = indata,
  ratios = c(0.6, 0.1),
  destination_frames = c("train.hex", "valid.hex", "test.hex"),
  seed = 2256)
indata_train = splits[[1]]
indata_valid = splits[[2]]
indata_test  = splits[[3]]
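With `ratios = c(0.6, 0.1)` the third frame receives whatever is left, about 30%. As far as I know `h2o.splitFrame` assigns rows probabilistically, so the realized shares are only approximate. A quick check, using the row totals that appear in the confusion matrices of this report (the variable names are illustrative):

```r
# Row counts read off the "Totals" rows of the train, validation and test
# confusion matrices printed in this report
n_rows = c(train = 5783, valid = 996, test = 2877)

# Realized share of each frame
split_share = n_rows / sum(n_rows)
print(round(split_share, 3))
```

The shares come out close to, but not exactly, 0.6 / 0.1 / 0.3.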

# dim(indata_train)
# dim(indata_valid)
# dim(indata_test)

Gradient Boosting Machine Learning Model Settings

y_response = "CLAIM_FLAG"
# Use every column except the response (and "names", if present) as a predictor
x_predictors = setdiff(names(indata), c(y_response, "names"))
gbm_indata = h2o.gbm(x = x_predictors,
                     y = y_response,
                     training_frame = indata_train,
                     validation_frame = indata_valid,
                     # validation_frame = indata_test,
                     ntrees = 500,
                     learn_rate = 0.001,
                     learn_rate_annealing = 0.99,
                     stopping_rounds = 20,
                     stopping_tolerance = 1e-4,
                     stopping_metric = "AUC",
                     sample_rate = 0.9,
                     col_sample_rate = 0.9,
                     seed = 2256,
                     score_tree_interval = 20
                     )
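To get a feel for what `learn_rate = 0.001` combined with `learn_rate_annealing = 0.99` means over 500 trees, the sketch below tabulates the per-tree learning rate. It assumes, as the H2O documentation describes, that the rate is multiplied by the annealing factor after each tree; `rate_per_tree` is an illustrative name, not something H2O exposes.

```r
learn_rate = 0.001
learn_rate_annealing = 0.99
ntrees = 500

# Learning rate applied to tree k, assuming the rate is multiplied by the
# annealing factor after every tree (tree 1 uses the unscaled rate)
rate_per_tree = learn_rate * learn_rate_annealing^(0:(ntrees - 1))

# Rate for the first, 100th and last tree, then the cumulative rate
print(rate_per_tree[c(1, 100, 500)])
print(sum(rate_per_tree))
```

Under that assumption the cumulative rate stays just below 0.1 and the last trees are fitted with a rate under 1e-5, so most of the 500 trees change the predictions very little. This may be why the predicted probabilities further down sit in a narrow band between roughly 0.25 and 0.30.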

gbm_indata

## Model Details:
## ==============
## 
## H2OBinomialModel: gbm
## Model ID:  GBM_model_R_1559377896373_1 
## Model Summary: 
##   number_of_trees number_of_internal_trees model_size_in_bytes min_depth
## 1             500                      500              227904         5
##   max_depth mean_depth min_leaves max_leaves mean_leaves
## 1         5    5.00000         28         32    31.51200
## 
## 
## H2OBinomialMetrics: gbm
## ** Reported on training data. **
## 
## MSE:  0.1903705
## RMSE:  0.4363147
## LogLoss:  0.5669711
## Mean Per-Class Error:  0.3228044
## AUC:  0.7374177
## pr_auc:  0.4944572
## Gini:  0.4748354
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           Claim_No Claim_Yes    Error        Rate
## Claim_No      2647      1602 0.377030  =1602/4249
## Claim_Yes      412      1122 0.268579   =412/1534
## Totals        3059      2724 0.348262  =2014/5783
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.263002 0.527008 269
## 2                       max f2  0.253121 0.678330 349
## 3                 max f0point5  0.279741 0.506144 142
## 4                 max accuracy  0.289444 0.752378  62
## 5                max precision  0.304281 1.000000   0
## 6                   max recall  0.246149 1.000000 398
## 7              max specificity  0.304281 1.000000   0
## 8             max absolute_mcc  0.269102 0.322348 220
## 9   max min_per_class_accuracy  0.265229 0.673570 248
## 10 max mean_per_class_accuracy  0.263002 0.677196 269
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
## H2OBinomialMetrics: gbm
## ** Reported on validation data. **
## 
## MSE:  0.1932356
## RMSE:  0.4395858
## LogLoss:  0.573392
## Mean Per-Class Error:  0.3388755
## AUC:  0.7164597
## pr_auc:  0.474162
## Gini:  0.4329193
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##           Claim_No Claim_Yes    Error      Rate
## Claim_No       364       363 0.499312  =363/727
## Claim_Yes       48       221 0.178439   =48/269
## Totals         412       584 0.412651  =411/996
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.258521 0.518171 283
## 2                       max f2  0.253119 0.689940 331
## 3                 max f0point5  0.271672 0.463668 167
## 4                 max accuracy  0.289961 0.752008  36
## 5                max precision  0.302393 1.000000   0
## 6                   max recall  0.246428 1.000000 393
## 7              max specificity  0.302393 1.000000   0
## 8             max absolute_mcc  0.257937 0.290673 292
## 9   max min_per_class_accuracy  0.264574 0.646840 225
## 10 max mean_per_class_accuracy  0.258521 0.661125 283
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`

varimp_gbm_indata = h2o.varimp(gbm_indata)
varimp_gbm_indata

## Variable Importances: 
##       variable relative_importance scaled_importance percentage
## 1 CLM_FREQ_CLS        29743.185547          1.000000   0.503663
## 2   OCCUPATION        14980.474609          0.503661   0.253675
## 3 OWER_AGE_CLS         5238.026855          0.176108   0.088699
## 4      MSTATUS         4985.911621          0.167632   0.084430
## 5    EDUCATION         3595.964844          0.120900   0.060893
## 6       GENDER          510.127869          0.017151   0.008638
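The three numeric columns carry the same information on different scales. The sketch below re-derives `scaled_importance` and `percentage` from the `relative_importance` values printed above; the vector `rel_imp` is typed in by hand for illustration.

```r
# relative_importance column, copied from the table above
rel_imp = c(CLM_FREQ_CLS = 29743.185547, OCCUPATION = 14980.474609,
            OWER_AGE_CLS = 5238.026855,  MSTATUS    = 4985.911621,
            EDUCATION    = 3595.964844,  GENDER     = 510.127869)

# scaled_importance: each value relative to the most important variable
scaled = rel_imp / max(rel_imp)

# percentage: each value as a share of the total importance
percentage = rel_imp / sum(rel_imp)

print(round(cbind(scaled, percentage), 6))
```

Read this way, CLM_FREQ_CLS and OCCUPATION together account for about 76% of the total importance, while GENDER contributes less than 1%.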

# treemap() comes from the treemap package
p = treemap(varimp_gbm_indata,
            index = c("variable"),
            vSize = "scaled_importance",
            type = "index",
            fontsize.labels = c(18),
            title = "gbm_indata : Relative Importance of Predictors",
            fontsize.title = 24
            )

plot(gbm_indata)

indata_preds = h2o.predict(gbm_indata, indata_test)

indata_preds

##     predict  Claim_No Claim_Yes
## 1 Claim_Yes 0.7303725 0.2696275
## 2 Claim_Yes 0.7304108 0.2695892
## 3 Claim_Yes 0.7303725 0.2696275
## 4 Claim_Yes 0.7303725 0.2696275
## 5 Claim_Yes 0.7102944 0.2897056
## 6 Claim_Yes 0.7092393 0.2907607
## 
## [2877 rows x 3 columns]
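Every row shown is labelled Claim_Yes even though p(Claim_Yes) is below 0.5. That is consistent with H2O labelling by the max-F1 threshold rather than by 0.5. The sketch below assumes the threshold is the validation max-F1 value from the model summary (0.258521) and applies it to the six printed probabilities; `p_yes`, `threshold` and `label` are illustrative names.

```r
# p(Claim_Yes) for the six rows printed above
p_yes = c(0.2696275, 0.2695892, 0.2696275, 0.2696275, 0.2897056, 0.2907607)

# Max-F1 threshold reported on the validation data (assumed to be the one
# H2O uses when it fills the "predict" column)
threshold = 0.258521

# Label a row Claim_Yes when its probability reaches the threshold
label = ifelse(p_yes >= threshold, "Claim_Yes", "Claim_No")
print(data.frame(p_yes, label))
```

With a 0.5 cut-off none of these rows would be flagged, so the choice of threshold matters more than usual for this model.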

h2o.confusionMatrix(gbm_indata, indata_train)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.264281590535036:
##           Claim_No Claim_Yes    Error        Rate
## Claim_No      2772      1477 0.347611  =1477/4249
## Claim_Yes      458      1076 0.298566   =458/1534
## Totals        3230      2553 0.334601  =1935/5783

h2o.confusionMatrix(gbm_indata, indata_valid)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.258521076723682:
##           Claim_No Claim_Yes    Error      Rate
## Claim_No       364       363 0.499312  =363/727
## Claim_Yes       48       221 0.178439   =48/269
## Totals         412       584 0.412651  =411/996

h2o.confusionMatrix(gbm_indata, indata_test)

## Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.265406378986216:
##           Claim_No Claim_Yes    Error       Rate
## Claim_No      1441       674 0.318676  =674/2115
## Claim_Yes      267       495 0.350394   =267/762
## Totals        1708      1169 0.327077  =941/2877
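Since H2O picks its threshold by F1, it helps to see how precision, recall and F1 follow from the test-set counts above. This is a minimal sketch with the four cells entered by hand and Claim_Yes as the positive class; the variable names are illustrative.

```r
# Cells of the test-set confusion matrix above (Claim_Yes = positive class)
tp = 495    # actual Claim_Yes, predicted Claim_Yes
fn = 267    # actual Claim_Yes, predicted Claim_No
fp = 674    # actual Claim_No,  predicted Claim_Yes
tn = 1441   # actual Claim_No,  predicted Claim_No

precision  = tp / (tp + fp)   # flagged rows that really are claims
recall     = tp / (tp + fn)   # claims that were flagged (1 - Claim_Yes error)
f1         = 2 * precision * recall / (precision + recall)
error_rate = (fp + fn) / (tp + fn + fp + tn)   # the "Totals" error above

print(round(c(precision = precision, recall = recall,
              f1 = f1, error_rate = error_rate), 3))
```

The GBM flags about 65% of actual claims on its test frame, against about 29% for the decision tree in Table 1, at the price of a precision near 42%. The two test sets differ in size (2877 vs 3863 rows), so the comparison is indicative rather than strictly like-for-like.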

AUC: Area Under ROC Curve

AUC, the area under the ROC curve, measures how well a classifier ranks actual claims above non-claims across all possible thresholds.
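The ranking interpretation can be checked by hand on a small example. The sketch below uses the rank-sum (Mann-Whitney) form of the AUC in base R; the scores and labels are invented for illustration and are not taken from the insurance data.

```r
# Toy scores and true classes (1 = positive), invented for illustration only
score  = c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4)
actual = c(1, 1, 0, 1, 0, 0)

n_pos = sum(actual == 1)
n_neg = sum(actual == 0)

# Rank-sum form of the AUC: the share of positive/negative pairs in which
# the positive case receives the higher score
auc = (sum(rank(score)[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
print(auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, giving an AUC of 8/9. A value of 0.5 corresponds to random ranking and 1 to a perfect ranking.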

h2o.auc(h2o.performance(gbm_indata, newdata = indata_train))

## [1] 0.7374285

h2o.auc(h2o.performance(gbm_indata, newdata = indata_valid))

## [1] 0.7164597

h2o.auc(h2o.performance(gbm_indata, newdata = indata_test))

## [1] 0.7190677

Written by DK WC

01 June, 2019