EDA

  • Optional coverage is more common in claimed policies (75% overall, rising to 84% for policies with 1+ employees) than in unclaimed policies (54%)
  • Average public liability limit is higher for claimed policies (2.86 million) than for unclaimed policies (2.48 million)
  • Average net premium is much higher for claimed policies ($480) than for unclaimed policies ($179); the medians are $212 and $93.16 respectively
  • Average commission is much higher for claimed policies ($240) than for unclaimed policies ($98); the medians are $120.78 and $54 respectively

  • Combined trade risk level 1 makes up a much smaller share of claimed policies (~9%) than of unclaimed policies (~25%)
  • Combined trade risk level 7 makes up a much larger share of claimed policies (~20%) than of unclaimed policies (~10%). The other combined trade risk levels are distributed similarly in both groups
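
The comparisons above amount to splitting policies on a claim flag and summarising each group. A minimal Python sketch of that idea follows. The records are made up for illustration, and the field names `claimed`, `net_premium` and `trade_risk` are assumptions rather than the actual column names. It reports both mean and median because a few large premiums can pull the mean well above the median, as the figures above suggest.

```python
from statistics import mean, median

# Hypothetical toy records; the values are illustrative only,
# not taken from the policy data.
policies = [
    {"claimed": 1, "net_premium": 150.0, "trade_risk": 7},
    {"claimed": 1, "net_premium": 200.0, "trade_risk": 3},
    {"claimed": 1, "net_premium": 1000.0, "trade_risk": 7},
    {"claimed": 0, "net_premium": 80.0, "trade_risk": 1},
    {"claimed": 0, "net_premium": 90.0, "trade_risk": 3},
    {"claimed": 0, "net_premium": 100.0, "trade_risk": 1},
    {"claimed": 0, "net_premium": 400.0, "trade_risk": 7},
]


def summarise(records, flag):
    """Mean premium, median premium and trade-risk shares for one claim group."""
    group = [r for r in records if r["claimed"] == flag]
    premiums = [r["net_premium"] for r in group]
    levels = sorted({r["trade_risk"] for r in group})
    # Share of the group that falls in each trade-risk level.
    shares = {
        lvl: sum(r["trade_risk"] == lvl for r in group) / len(group)
        for lvl in levels
    }
    return {"mean": mean(premiums), "median": median(premiums), "risk_share": shares}


claimed_summary = summarise(policies, 1)
unclaimed_summary = summarise(policies, 0)
print("claimed:  ", claimed_summary)
print("unclaimed:", unclaimed_summary)
```

Comparing `risk_share` between the two summaries is the same kind of check as the trade risk bullets above: a level whose share differs sharply between groups is a candidate predictor.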

Feature Importance

  • The Boruta package in R was used to determine important features
  • Product, Trade 1 Category and Commission Amount were the three most important features identified
  • Other features in the model were selected based on basic insurance business sense and EDA

[Figure: feature importance]

ADS creation

  • Because claim-incurring policies were very rare in the original data (~1.4%), an ADS was created in which claim-incurring policies form 27.2% of the dataset
  • Stratified sampling was used to create the ADS
  • Additional calculated fields were created: Optional_Flag (whether any optional coverage was taken), grouped_total_emp (0, 1 or 1+ employees) and gross_premium_bucket
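
One way to rebalance the data as described is to keep every claim-incurring policy and sample the non-claim policies at a lower rate. The Python sketch below assumes the claim flag is the only stratum, since the exact strata and sampling rates are not stated here. The input fields `n_optional` and `total_emp`, the function names and the synthetic data are all hypothetical.

```python
import random


def group_employees(total_emp):
    """Bucket an employee count into '0', '1' or '1+' (in the spirit of grouped_total_emp)."""
    if total_emp <= 0:
        return "0"
    return "1" if total_emp == 1 else "1+"


def build_ads(policies, target_share, seed=0):
    """Keep all claimed policies and sample unclaimed ones so that claimed
    policies make up roughly `target_share` of the result."""
    rng = random.Random(seed)
    claimed = [p for p in policies if p["claimed"] == 1]
    unclaimed = [p for p in policies if p["claimed"] == 0]
    # Solve claimed / (claimed + n_unclaimed) = target_share for n_unclaimed.
    n_unclaimed = round(len(claimed) * (1 - target_share) / target_share)
    n_unclaimed = min(n_unclaimed, len(unclaimed))
    sample = claimed + rng.sample(unclaimed, n_unclaimed)
    rng.shuffle(sample)
    # Add the derived fields to a copy of each sampled record.
    return [
        dict(
            p,
            optional_flag=int(p["n_optional"] > 0),
            grouped_total_emp=group_employees(p["total_emp"]),
        )
        for p in sample
    ]


# Synthetic data: 1,000 policies, of which 14 (1.4%) incurred a claim.
data_rng = random.Random(42)
policies = [
    {
        "claimed": int(i < 14),
        "n_optional": data_rng.randint(0, 2),
        "total_emp": data_rng.randint(0, 4),
    }
    for i in range(1000)
]

ads = build_ads(policies, target_share=0.272)
claim_share = sum(p["claimed"] for p in ads) / len(ads)
print(len(ads), round(claim_share, 3))
```

Because the sample size is rounded to a whole number of policies, the resulting claim share lands near the target rather than exactly on it.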

Model creation

  • Random Forest, Logistic Regression and Gradient Boosting models were trained on the training data
  • Product, Transaction Type, Public Liability Limit, Optional Flag, Trade 1 Category, Gross Premium Bucket, Combined Trade Risk Level and Grouped Total Employee were taken as the features in the models
  • Test accuracy was ~82% for Random Forest, 79.45% for Logistic Regression (classification threshold 0.5) and 81.6% for Gradient Boosting (GBM)

Random Forest confusion matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 2475  516
##          1  126  489
##                                           
##                Accuracy : 0.822           
##                  95% CI : (0.8091, 0.8343)
##     No Information Rate : 0.7213          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4973          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.9516          
##             Specificity : 0.4866          
##          Pos Pred Value : 0.8275          
##          Neg Pred Value : 0.7951          
##              Prevalence : 0.7213          
##          Detection Rate : 0.6864          
##    Detection Prevalence : 0.8295          
##       Balanced Accuracy : 0.7191          
##                                           
##        'Positive' Class : 0               
##
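
The statistics in this output can be recomputed from the four cell counts. The short Python sketch below does so with the 'Positive' class set to 0 (no claim), as in the output. The variable names are illustrative.

```python
# Cell counts from the Random Forest confusion matrix above
# (rows = prediction, columns = reference).
pred0_ref0, pred0_ref1 = 2475, 516
pred1_ref0, pred1_ref1 = 126, 489

total = pred0_ref0 + pred0_ref1 + pred1_ref0 + pred1_ref1

# The 'Positive' class is 0, so a "true positive" is a correctly
# predicted no-claim policy.
tp, fp = pred0_ref0, pred0_ref1
fn, tn = pred1_ref0, pred1_ref1

accuracy = (tp + tn) / total
# No Information Rate: accuracy of always predicting the larger class.
no_information_rate = max(tp + fn, fp + tn) / total
sensitivity = tp / (tp + fn)   # no-claim policies correctly identified
specificity = tn / (tn + fp)   # claim policies correctly identified
pos_pred_value = tp / (tp + fp)
neg_pred_value = tn / (tn + fn)
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [
    ("Accuracy", accuracy),
    ("No Information Rate", no_information_rate),
    ("Kappa", kappa),
    ("Sensitivity", sensitivity),
    ("Specificity", specificity),
    ("Pos Pred Value", pos_pred_value),
    ("Neg Pred Value", neg_pred_value),
    ("Balanced Accuracy", balanced_accuracy),
]:
    print(f"{name:>20} : {value:.4f}")
```

Since class 0 is the positive class, the specificity of 0.4866 is the share of claim-incurring policies (class 1) that the model flags correctly: 489 of 1,005. The accuracy of 0.822 should also be read against the No Information Rate of 0.7213, which is what always predicting "no claim" would score on this test set.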