This document is written to illustrate the issues observed in the data set that is related to the data completeness, accuracy and consistency. Do note that issues that are raised here does not necessary mean that there is an inherent issue, but rather it is potentially an area that we should follow-up with Client X to ensure the correctness of the report.
For claims data, there are two files in total.
historical claim
open claim
Issue 1: Inconsistency between cause of loss and event type - An example is shown below:
historical_claims[ EVENT_TYPE == "THEFT", table(CAUSE_OF_LOSS, EVENT_TYPE)]
## EVENT_TYPE
## CAUSE_OF_LOSS THEFT
## COLLISION 4187
## COLLISION WITH ANIMAL 2807
## COLLISION WITH COMMERICAL VEHICLES/FLEETS 2085
## COLLISION WITH FLYING OBJECTS (NOT AIRCRAFT) 2067
## COLLISION WITH NON-VEHICULAR OBJECT 2859
## COLLISION WITH OTHER NON-VEHICULAR PROPERTY 2099
## COLLISION WITH OTHER VEHICLE - AT FAULT 2028
## COLLISION WITH OTHER VEHICLE - NOT AT FAULT 2088
## COLLISION WITH TREE 2847
## FIRE 2940
## FLOOD 5650
## HIT BY ANOTHER VEHICLE 2908
## INFLUENCED BY FALLING/FLYING OBJECTS (NOT AIRCRAFT) 2735
## LOSS OF KEYS 2777
## MALICIOUS DAMAGE 2790
## PART THEFT 2821
## THEFT 2799
Issue 2: Distribution of claims - Need to confirm if there are any missing claims and if IBNR has been considered.
ALL_CLAIMS[year(DATE_OF_LOSS) >= 2019, table(format(DATE_OF_LOSS, format = "%Y-%m"), SOURCE )]
## SOURCE
## Historical OPEN
## 2019-01 6197 0
## 2019-02 5433 0
## 2019-03 5782 0
## 2019-04 5474 0
## 2019-05 5222 0
## 2019-06 4901 0
## 2019-07 4925 0
## 2019-08 4429 0
## 2019-09 4162 0
## 2019-10 3987 0
## 2019-11 3652 0
## 2019-12 3422 0
## 2020-01 2953 0
## 2020-02 2610 0
## 2020-03 2334 0
## 2020-04 1955 0
## 2020-05 1689 0
## 2020-06 1267 0
## 2020-07 946 0
## 2020-08 648 0
## 2020-09 301 35
## 2020-10 22 72
## 2020-11 0 8
#######################################################################################
# Appending of the claims data #
#######################################################################################
historical_claims[, SOURCE := "Historical"]
open_claims[ , SOURCE := "OPEN" ]
ALL_CLAIMS <- rbind(historical_claims, open_claims)
names(ALL_CLAIMS)[stringr::str_detect(names(ALL_CLAIMS), pattern = "DATE")][1:15] -> col_for_Date
for(i in 1:15){
ALL_CLAIMS[ , col_for_Date[i] := as.Date(get(col_for_Date[i]), format = "%m/%d/%Y")]
}
For exposure data, there are three files in total.
Vehicle details
Customer details
Policy details
Issue 1: Year of Manufacture of Vehicle are all 2012 or earlier. This needs to be further updated. On a similar note, the years of driving experience for customers are outdated too.
## YEAR_OF_MANUFACTURE N
## 1: 1995 19075
## 2: 1996 19307
## 3: 1997 19317
## 4: 1998 19118
## 5: 1999 19173
## 6: 2000 137424
## 7: 2001 118235
## 8: 2002 118183
## 9: 2003 118442
## 10: 2004 117343
## 11: 2005 117495
## 12: 2006 118023
## 13: 2007 118325
## 14: 2008 118176
## 15: 2009 117922
## 16: 2010 118232
## 17: 2011 117786
## 18: 2012 118424
Issue 2: Vehicle Gross Weight might not be the same unit for all weight. It is odd that a bus is of the same weight as a motorbike
vehicle_new[ VEHICLE_GROSS_WEIGHT < 800, table(VEHICLE_GROSS_WEIGHT, VEHICLE_TYPE)]
## VEHICLE_TYPE
## VEHICLE_GROSS_WEIGHT BUS CARRY_GOODS MOTORBIKE MOTORCAR
## 500 373 396 6714 4028
## 510 402 384 6763 4028
## 520 420 396 6768 3998
## 530 411 385 6848 4018
## 540 369 327 6859 4190
## 550 364 389 6745 4033
## 560 375 361 6825 4076
## 570 380 394 6937 4116
## 580 376 360 6686 4042
## 590 385 358 6780 4016
## 600 391 381 6949 4063
## 610 379 375 6798 4133
## 620 399 391 6767 4034
## 630 378 375 6755 4019
## 640 359 358 6806 4071
## 650 363 377 6774 4077
## 660 333 368 6940 4025
## 670 372 357 6837 4130
## 680 381 390 6790 3965
## 690 389 352 6855 4103
## 700 368 410 6727 3941
## 710 357 345 6919 3929
## 720 391 360 6916 4045
## 730 350 408 6751 4032
## 740 402 394 6883 4021
## 750 369 376 6805 4044
## 760 402 401 6600 4008
## 770 358 377 6758 3979
## 780 353 364 6774 4100
## 790 380 384 6631 3984
Issue 3: Engine Size might not be the same unit for all entries. It is odd that the engine size of a bus is smaller than a motorbike. I suspect that it is written in horsepower rather than cc.
vehicle_new[ ENGINE_SIZE_CC < 1000 , table(ENGINE_SIZE_CC, VEHICLE_TYPE)]
## VEHICLE_TYPE
## ENGINE_SIZE_CC BUS CARRY_GOODS MOTORBIKE MOTORCAR
## 500 1522 714 17752 13031
## 520 1490 722 17673 12992
## 540 1501 697 17936 13383
## 560 1426 726 17693 13308
## 580 1450 779 17808 13094
## 600 1500 712 17912 13200
## 620 1462 736 17789 13286
## 640 1458 688 17862 13217
## 660 1453 727 17890 13151
## 680 1485 751 17738 13277
## 700 1494 724 17859 13443
## 720 1499 745 17692 13313
## 740 1437 779 17815 13144
## 760 1432 720 17964 13190
## 780 1538 736 17901 13426
## 800 1536 731 17890 13249
## 820 1457 782 17938 13145
## 840 1467 736 17655 12984
## 860 1459 798 17800 13367
## 880 1497 751 17693 13292
## 900 1481 748 17970 13389
## 920 1515 745 17750 13389
## 940 1425 705 17593 13252
## 960 1473 718 17763 13210
## 980 1423 738 17615 13406
Customer Details.
Issue 4: 103,834 customer cannot be found to have a vehicle
customer[!CUSTOMER_IDENTIFIER %in% vehicle_new$CUSTOMER_IDENTIFIER,]
## CUSTOMER_IDENTIFIER GENDER DRIVING_EXPERIENCE
## 1: CUSTOMER_3 M THREE_TO_FIVE_YEARS
## 2: CUSTOMER_28 M THREE_TO_FIVE_YEARS
## 3: CUSTOMER_33 M THREE_TO_FIVE_YEARS
## 4: CUSTOMER_34 M FIVE_TO_TEN_YEARS
## 5: CUSTOMER_39 M THREE_TO_FIVE_YEARS
## ---
## 103830: CUSTOMER_1038528 F THREE_TO_FIVE_YEARS
## 103831: CUSTOMER_1038531 M THREE_TO_FIVE_YEARS
## 103832: CUSTOMER_1038532 M THREE_TO_FIVE_YEARS
## 103833: CUSTOMER_1038535 M THREE_TO_FIVE_YEARS
## 103834: CUSTOMER_1038551 M THREE_TO_FIVE_YEARS
# vehicle_new[!CUSTOMER_IDENTIFIER %in% customer$CUSTOMER_IDENTIFIER,]
# policies_new[, i_terdt := CANCELLATION_DATE]
# policies_new[is.na(i_terdt), i_terdt := POLICY_END_DATE]
Issue 5: 203,352 Vehicles do not have corresponding policies.
vehicle_new[!VEHICLE_IDENTIFIER %in% policies_new$VEHICLE_IDENTIFIER]
## VEHICLE_IDENTIFIER YEAR_OF_MANUFACTURE MAKE_OF_VEHICLE
## 1: VEHICLE_4771 2004 MARUTI SUZUKI
## 2: VEHICLE_4838 2005 JOHN DEERE
## 3: VEHICLE_7194 2005 Jaguar
## 4: VEHICLE_7867 2002 OPEL
## 5: VEHICLE_11450 2008 Mini
## ---
## 203348: VEHICLE_1649794 1995 HMT
## 203349: VEHICLE_1649831 2000 Renault
## 203350: VEHICLE_1649936 2007 SHRACHI
## 203351: VEHICLE_1649960 2002 Maserati
## 203352: VEHICLE_1649971 2000 TRISHUL AGRO INDUSTRIES
## VEHICLE_TYPE SEATING_CAPACITY ENGINE_SIZE_CC VEHICLE_GROSS_WEIGHT
## 1: MOTORBIKE 1 1600 530
## 2: CARRY_GOODS 2 1410 2500
## 3: MOTORCAR 5 1000 710
## 4: MOTORBIKE 2 1765 890
## 5: MOTORCAR 5 7280 1865
## ---
## 203348: BUS 58 960 3100
## 203349: MOTORCAR 2 840 2290
## 203350: CARRY_GOODS 7 2763 2800
## 203351: MOTORCAR 2 1000 630
## 203352: MOTORBIKE 1 580 660
## CUSTOMER_IDENTIFIER
## 1: CUSTOMER_936772
## 2: CUSTOMER_29936
## 3: CUSTOMER_107
## 4: CUSTOMER_710934
## 5: CUSTOMER_1018887
## ---
## 203348: CUSTOMER_231383
## 203349: CUSTOMER_199253
## 203350: CUSTOMER_730870
## 203351: CUSTOMER_567963
## 203352: CUSTOMER_767561
Step 1: Merge vehicle_new & customer together using “CUSTOMER_IDENTIFIER”
Step 2: Merge the result of step 1 with policies_new using “VEHICLE_IDENTIFIER”
The completion of this step will provide a policy data with information on vehicle and customer.
customer[vehicle_new, on = c("CUSTOMER_IDENTIFIER")] -> vehicle_customer
vehicle_customer[policies_new, on = c("VEHICLE_IDENTIFIER")] -> policy_vehicle_customer
Step 3: A termination date of the policy is also added to this data by taking the earlier of cancellation date or policy date. This termination date will help to calculate exposure later.
policy_vehicle_customer[, i_terdt := CANCELLATION_DATE]
policy_vehicle_customer[is.na(i_terdt), i_terdt := POLICY_END_DATE]
The data set “policy_vehicle_customer” will then be exported later.
Next - To create exposure according to period of time in-force in a year. For instance, a policy start date in 1st Dec 2012 and policy end date in 1st Dec 2013 will be given exposure of 1/12 in 2012 and 11/12 in 2013.
exposure.computation <- function(finalpol = policies_new ,survey.start = as.Date("2012-01-01", format = "%Y-%m-%d") ,
survey.end = as.Date("2020-12-31", format = "%Y-%m-%d"))
{
exposures_old <- finalpol
exposures <- NULL
# Step 1 # - Calendar Year
for (calendar_year in as.numeric(format(survey.start, "%Y")):as.numeric(format(survey.end, "%Y")) ) {
#cat("Calendar Year", calendar_year, "\n")
tmp <- exposures_old
tmp$calendar_year <- calendar_year
calendar_year_start <- as.Date(paste(calendar_year, '01', '01', sep = "-"))
calendar_year_end <- as.Date(paste(calendar_year + 1, '01', '01', sep = "-"))
if (calendar_year_start < survey.start) {
calendar_year_start = survey.start
}
if (calendar_year_end > survey.end) {
calendar_year_end = survey.end
}
tmp$start_date <- as.Date(ifelse(calendar_year_start > tmp$POLICY_START_DATE, calendar_year_start, tmp$POLICY_START_DATE), origin = "1970-01-01")
tmp$end_date <- as.Date(ifelse(is.na(tmp$i_terdt), calendar_year_end, ifelse(calendar_year_end<tmp$i_terdt, calendar_year_end, tmp$i_terdt) ), origin = "1970-01-01")
tmp <- tmp[which(tmp$start_date < tmp$end_date),]
if (exists('exposures')) {
exposures <- rbind(exposures, tmp)
} else {
exposures <- tmp
}
}
rm(tmp)
rm(calendar_year)
rm(calendar_year_start)
rm(calendar_year_end)
rm(exposures_old)
exposures$exposure <- as.numeric(exposures$end_date - exposures$start_date)/365.25
return(exposures)
}
exposure_new <- exposure.computation(finalpol = policy_vehicle_customer ,survey.start = as.Date("2012-01-01", format = "%Y-%m-%d") ,
survey.end = as.Date("2021-01-01", format = "%Y-%m-%d"))
Now that the result of the exposure has been computed, we will create a column for “Earned Premium” by using the exposure years computed and multiply it with the Gross Written Premium (GWP).
exposure_new[ , Earned_Premium := exposure * GWP]
exposure_new[, .(POLICY_IDENTIFIER, calendar_year, exposure, Earned_Premium)] -> exposure_new
Finally, we have the following data set to be uploaded into Power BI
ALL_CLAIMS as ALL Claims 20220814
policy_vehicle_customer as policies_new_20220814
exposure_new as “exposure 20220814”
These three data set will be added into Power BI with the following diagram as the Model between them.
Data Flow of Data into Power BI
Power BI Data Model