Motor Insurance Analysis
Date : 04/08/2022
- Introduction
- Part One: data set view,summary-descriptive analysis
- Part Two: insurance risk measurements by factors.
- Part Three: claims count response to the predictors variables
- Part four: rating factors
- Conclusion:
Introduction
This analysis is conducted by using R-programming language as an analysis tool in the insurance industry, and specially for motor insurance .
The data set used called (datacar) and can be found on the (insurancedata) package on r.
>we have also used packages such as:
-
tidyvarse.
- tidyr.
- janitor.
- knitr.
- insurancerating ,
and others.
>what we are going to do ?
looking at the data summary and description on order to understand the shape of the data.
calculating the claims cost and group it by factors.
calculating the claims count and group it by factors.
calculating the total exposure and group it by factors.
calculating the frequency cost and group it by factors.
calculating the average severity cost and group it by factors.
calculating the pure premium cost and group it by factors.
calculating the rating factor for each.
———————————————-
Part One: data set view,summary-descriptive analysis
data set dimensions
lets have a look at the data set dimension :
## [1] 67856 10
the data set has 10 variables and and 67856 rows , the 1st 5 rows are as below :
| veh_value | exposure | clm | numclaims | claimcst0 | veh_body | veh_age | gender | area | agecat |
|---|---|---|---|---|---|---|---|---|---|
| 1.06 | 0.3039014 | 0 | 0 | 0 | HBACK | 3 | F | C | 2 |
| 1.03 | 0.6488706 | 0 | 0 | 0 | HBACK | 2 | F | A | 4 |
| 3.26 | 0.5694730 | 0 | 0 | 0 | UTE | 2 | F | E | 2 |
| 4.14 | 0.3175907 | 0 | 0 | 0 | STNWG | 2 | F | D | 2 |
| 0.72 | 0.6488706 | 0 | 0 | 0 | HBACK | 4 | F | C | 2 |
| 2.01 | 0.8542094 | 0 | 0 | 0 | HDTOP | 3 | M | C | 4 |
Data set variables types
lets have a look at the variables types:
## Rows: 67,856
## Columns: 10
## $ veh_value <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, 0.38, …
## $ exposure <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.64887064, …
## $ clm <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ numclaims <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ claimcst0 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
## $ veh_body <fct> HBACK, HBACK, UTE, STNWG, HBACK, HDTOP, PANVN, HBACK, HBACK,…
## $ veh_age <int> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, 4, 3, …
## $ gender <fct> F, F, F, F, F, M, M, M, F, F, M, M, F, M, M, M, F, M, F, F, …
## $ area <fct> C, A, E, D, C, C, A, B, A, B, A, C, C, A, B, C, F, C, D, C, …
## $ agecat <int> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, 2, 3, …
now we will make some data transformation,will do the below :
1. convert agecat & veh_age types into factor
2. create a new variable called: veh_value_group
lets look again to the variables types :
## Rows: 67,856
## Columns: 11
## $ veh_value <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, …
## $ exposure <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.6488…
## $ clm <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ numclaims <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ claimcst0 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000…
## $ veh_body <fct> HBACK, HBACK, UTE, STNWG, HBACK, HDTOP, PANVN, HBACK, …
## $ veh_age <fct> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, …
## $ gender <fct> F, F, F, F, F, M, M, M, F, F, M, M, F, M, M, M, F, M, …
## $ area <fct> C, A, E, D, C, C, A, B, A, B, A, C, C, A, B, C, F, C, …
## $ agecat <fct> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, …
## $ veh_value_group <fct> group 1, group 1, group 1, group 1, group 1, group 1, …
Data set summary and distribution
lets have the data summary and the variables distribution :
| veh_value | exposure | clm | numclaims | claimcst0 | veh_body | veh_age | gender | area | agecat | veh_value_group | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 0.000 | Min. :0.002738 | Min. :0.00000 | Min. :0.00000 | Min. : 0.0 | SEDAN :22233 | 1:12257 | F:38603 | A:16312 | 1: 5742 | group 1:67779 | |
| 1st Qu.: 1.010 | 1st Qu.:0.219028 | 1st Qu.:0.00000 | 1st Qu.:0.00000 | 1st Qu.: 0.0 | HBACK :18915 | 2:16587 | M:29253 | B:13341 | 2:12875 | group 2: 70 | |
| Median : 1.500 | Median :0.446270 | Median :0.00000 | Median :0.00000 | Median : 0.0 | STNWG :16261 | 3:20064 | NA | C:20540 | 3:15767 | group 3: 6 | |
| Mean : 1.777 | Mean :0.468651 | Mean :0.06814 | Mean :0.07276 | Mean : 137.3 | UTE : 4586 | 4:18948 | NA | D: 8173 | 4:16189 | group 4: 1 | |
| 3rd Qu.: 2.150 | 3rd Qu.:0.709103 | 3rd Qu.:0.00000 | 3rd Qu.:0.00000 | 3rd Qu.: 0.0 | TRUCK : 1750 | NA | NA | E: 5912 | 5:10736 | NA | |
| Max. :34.560 | Max. :0.999316 | Max. :1.00000 | Max. :4.00000 | Max. :55922.1 | HDTOP : 1579 | NA | NA | F: 3578 | 6: 6547 | NA | |
| NA | NA | NA | NA | NA | (Other): 2532 | NA | NA | NA | NA | NA |
Numerical variables
the vehicle value range is between 0 and 34,560 with 1,777 on average
the exposure range is between 0 and 1 with 0.44 on average
the claims cost range is between 0 and 55,922.1 with 137,3 on average
Categotical variables
7% of the vehicles in this data had claims .
43% of the data samples are for males and 57% are for females.
more than 50% of our vehicles are old (3 & 4 ages) and only 18% are brand new vehicles.
area C has the highest number of customers with 30 % , and area F has the lowest 5%.
more than 45% of the customers are located in the age group 3 & 4.
more than 99% of our vehicles values are within group 1.
sedan vehicles has the highest percentage with 32% ,and (Bus,CONVT,MCARA,RDSTR) has the lowest with less than 1%.
———————————————-
Part Two: insurance risk measurements by factors.
as the insurance risk level is measured by the frequency and the severity; we will calculate the the same and group it by factors.
gender factor
| gender | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| F | 4908749 | 2832 | 17954.60 | 0.1577311 | 1733.315 | 273.3978 |
| M | 4405855 | 2105 | 13846.21 | 0.1520271 | 2093.043 | 318.1993 |
we can see that :
females number of claims are more than males.
females claims cost are more than males but; the average severity is less ,and this is due to the larger number of claims(more than males)
both females and males almost have the same frequency.
female risk premium is lesser than males ,and this due to the higher exposure females have.
the below chart visualize all together for more clear understanding:
veh_age factor
| veh_age | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| 3 | 2718237 | 1446 | 9542.111 | 0.1515388 | 1879.832 | 284.8675 |
| 2 | 2486217 | 1354 | 7923.677 | 0.1708803 | 1836.202 | 313.7706 |
| 4 | 2554895 | 1261 | 8996.079 | 0.1401722 | 2026.087 | 284.0010 |
| 1 | 1555255 | 876 | 5338.951 | 0.1640772 | 1775.405 | 291.3034 |
we can see that :
the claims cost for the vehicle age 1 is the lowest while for vehicle age 3 is the highest.
vehicle age 1 has the lowest number on claims while the vehicle age 3 has the highest.
the exposure for the vehicle age 1 is the lowest while for vehicle age 3 is the highest.
the frequency for the vehicle age 4 is the lowest while for vehicle age 2 is the highest.
the average_severity for the vehicle age 1 is the lowest while for vehicle age 4 is the highest.
the risk premium for the vehicle age 3 & 4 is the lowest and almost the same, while its the highest for vehicle age 2
the below chart visualize all together for more clear understating :
veh_body
| veh_body | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| HBACK | 2589136.192 | 1330 | 8810.31348 | 0.1509594 | 1946.7189 | 293.8756 |
| UTE | 597208.965 | 276 | 2105.73032 | 0.1310709 | 2163.8006 | 283.6113 |
| STNWG | 2363091.211 | 1248 | 7638.39014 | 0.1633852 | 1893.5026 | 309.3703 |
| HDTOP | 294811.869 | 136 | 783.29911 | 0.1736246 | 2167.7343 | 376.3720 |
| PANVN | 133113.412 | 68 | 409.16085 | 0.1661938 | 1957.5502 | 325.3327 |
| SEDAN | 2681622.477 | 1598 | 10444.59959 | 0.1529977 | 1678.1117 | 256.7473 |
| TRUCK | 319496.849 | 130 | 843.96441 | 0.1540349 | 2457.6681 | 378.5667 |
| COUPE | 187723.251 | 75 | 319.12663 | 0.2350164 | 2502.9767 | 588.2406 |
| MIBUS | 116104.880 | 45 | 316.84052 | 0.1420273 | 2580.1084 | 366.4458 |
| MCARA | 10673.950 | 15 | 59.27995 | 0.2530367 | 711.5967 | 180.0601 |
| BUS | 13363.120 | 10 | 25.84805 | 0.3868764 | 1336.3120 | 516.9876 |
| CONVT | 6888.810 | 3 | 32.59685 | 0.0920334 | 2296.2700 | 211.3336 |
| RDSTR | 1369.458 | 3 | 11.66872 | 0.2570976 | 456.4861 | 117.3615 |
We can see that:
sedan,Hback,and Stnwg have the highest claims cost (), while RDSTR has the lowest claims cost.
RDSTR ,convt,Bus,MCARA,COUPE & PANVN have less than 100 claims, SEDAN,STNWG & HBACK have more than 1000 claims ,other have claims between 130 and 276.
BUS has the highest frequency with 38.3% , (RDSTR,MCARA, and COUPE have frequincies (27.7%,25.3% and 23.5%), other groups have frequencies less than 18%.
RDSTR has the lowest average severity , (MIBUS,COUPE and TRUCK ) have the highest.
the risk premium is the less than 200 for RDSTR and MCARA, the more tgan 500 for BUS and COUPE.
the below chart visualize all together for more clear understanding:
area
| area | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| C | 2865707.2 | 1493 | 9578.494 | 0.1558700 | 1919.429 | 299.1814 |
| A | 2071765.6 | 1181 | 7597.101 | 0.1554540 | 1754.247 | 272.7048 |
| E | 868822.9 | 413 | 2771.866 | 0.1489971 | 2103.687 | 313.4434 |
| D | 911058.2 | 524 | 3819.518 | 0.1371901 | 1738.661 | 238.5270 |
| B | 1795295.2 | 1021 | 6297.848 | 0.1621189 | 1758.369 | 285.0649 |
| F | 801955.4 | 305 | 1735.992 | 0.1756921 | 2629.362 | 461.9581 |
we can see that :
the lowest claims cost is for area F , and the highest cost is for area C.
area F has the lowest number of claims , while area C has the highest .
area D has the lowest frequency , while area F has the highest.
area D has the lowest average frequency ,while area F has the highest.
area D has the lowest risk premium , while area F has the highest.
the below chart visualize all together for more clear understanding:
age category
| agecat | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| 2 | 1984840.8 | 1000 | 5891.871 | 0.1697254 | 1984.841 | 336.8778 |
| 4 | 2145303.0 | 1185 | 7616.542 | 0.1555824 | 1810.382 | 281.6636 |
| 6 | 683568.5 | 390 | 3099.666 | 0.1258200 | 1752.740 | 220.5297 |
| 3 | 2132107.1 | 1189 | 7409.457 | 0.1604706 | 1793.194 | 287.7549 |
| 5 | 1061412.2 | 648 | 5171.009 | 0.1253140 | 1637.982 | 205.2621 |
| 1 | 1307372.9 | 525 | 2612.274 | 0.2009743 | 2490.234 | 500.4732 |
we can see that :
age category 6 has the lowest claims cost , while age category 4 has the highest
age category 6 has the lowest number of claims , while age category 4 & 3 has the highest (1185,1189) .
age category 6 & 5 has the lowest frequency with almost(12.5%) , while age category 1 has the highest with 20%.
age category 5 has the lowest average frequency ,while age category 1 has the highest.
age category 5 has the lowest risk premium , while age category 1 has the highest.
the below chart visualize all together for more clear understanding:
now, lets create a new model that contains all factor variables and see how much it differs from the previous models and which one is the best.
## $statistics
## aic bic bayes.factor p
## numclaims_glm_all 34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_agecat 34862.03 34916.78 1.002062e+33
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.001 0.004 0.009 0.247
## $statistics
## aic bic bayes.factor p
## numclaims_glm_all 34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_area 34938.22 34992.97 2.858943e+16
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.002 0.006 0.014 0.284
## $statistics
## aic bic bayes.factor p
## numclaims_glm_all 34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_gender 34944.03 34962.28 1.317565e+23
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.002 0.007 0.015 0.282
## $statistics
## aic bic bayes.factor p
## numclaims_glm_all 34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_veh_age 34920.50 34957.00 1.850433e+24
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.002 0.006 0.013 0.287
## $statistics
## aic bic bayes.factor p
## numclaims_glm_all 34822.37 35068.75 0.00 <2e-16
## numclaims_glm_veh_body 34930.11 35048.74 22196.62
##
## $predicted_differences
## 0% 25% 50% 75% 100%
## 0.000 0.002 0.006 0.014 0.132
from our results we can say that the numclaims_glm_all model is best .
lets apply the anova function on the model so we can well know how much the variables are important to the model
## Analysis of Deviance Table
##
## Model: poisson, link: log
##
## Response: numclaims
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 67855 25507
## veh_body 12 37.563 67843 25469 0.000181 ***
## veh_age 3 33.543 67840 25436 2.474e-07 ***
## gender 1 1.595 67839 25434 0.206575
## area 5 14.525 67834 25420 0.012598 *
## agecat 5 86.074 67829 25334 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop in deviance allows to test whether the model terms are significant :
adding veh_body to the NULL model causes a drop in deviance of 37.563 corresponding to 67843 degree of freedom and resulting residual deviance of 25469
adding veh_age to the NULL model causes a drop in deviance of 33.543 corresponding to 67840 degree of freedom and resulting residual deviance of 25436
adding gender to the NULL model causes a drop in deviance of 1.595 corresponding to 67839 degree of freedom and resulting residual deviance of 25434
adding area to the NULL model causes a drop in deviance of 14.525 corresponding to 67834 degree of freedom and resulting residual deviance of 25420
adding agecat to the NULL model causes a drop in deviance of 86.074 corresponding to 68729 degree of freedom and resulting residual deviance of 25334
so the gender variable has the least impact on the model
—————————–
Part Three: claims count response to the predictors variables
the below chart express visually how number of claims will act (decrease on increase)-along with the lower and upper confident intervals-, when any of the factors increase by one unit:
if any factor increases by one unit ;the number of claims will increase or decrease by the coefficient value for the same factor.
—————————–
Part four: rating factors
now;lets look at the rating factor for the new model (numclaims_glm_all) which is the best performed model:
## Significance levels: *** p < 0.001; ** p < 0.01;
## * p < 0.05; . p < 0.1
| risk_factor | level | est_numclaims_glm_all | exposure |
|---|---|---|---|
| (Intercept) | (Intercept) | 0.154456 | NA |
| veh_body | BUS | 2.539240 ** | 26 |
| veh_body | CONVT | 0.548256 | 33 |
| veh_body | COUPE | 1.534809 *** | 319 |
| veh_body | HBACK | 0.938495 . | 8810 |
| veh_body | HDTOP | 1.117535 | 783 |
| veh_body | MCARA | 1.824920 * | 59 |
| veh_body | MIBUS | 0.957521 | 317 |
| veh_body | PANVN | 1.074029 | 409 |
| veh_body | RDSTR | 1.513937 | 12 |
| veh_body | SEDAN | 1.000000 | 10445 |
| veh_body | STNWG | 1.045286 | 7638 |
| veh_body | TRUCK | 0.995693 | 844 |
| veh_body | UTE | 0.840990 ** | 2106 |
| veh_age | 1 | 1.089375 * | 5339 |
| veh_age | 2 | 1.134451 *** | 7924 |
| veh_age | 3 | 1.000000 | 9542 |
| veh_age | 4 | 0.925126 * | 8996 |
| gender | F | 1.000000 | 17955 |
| gender | M | 0.976814 | 13846 |
| area | A | 0.996318 | 7597 |
| area | B | 1.048834 | 6298 |
| area | C | 1.000000 | 9578 |
| area | D | 0.891774 * | 3820 |
| area | E | 0.965319 | 2772 |
| area | F | 1.065872 | 1736 |
| agecat | 1 | 1.293463 *** | 2612 |
| agecat | 2 | 1.087360 . | 5892 |
| agecat | 3 | 1.027766 | 7409 |
| agecat | 4 | 1.000000 | 7617 |
| agecat | 5 | 0.805326 *** | 5171 |
| agecat | 6 | 0.820623 *** | 3100 |
on graph:
## Significance levels: *** p < 0.001; ** p < 0.01;
## * p < 0.05; . p < 0.1
—————————–
Conclusion:
preferred segments
area : D.
agecat : 5,6
veh_body : UTE,CONVT.
the high risk segments
- veh_body : BUS,COUPE,MACARA,RDSTS