Motor Insurance Analysis

Date : 04/08/2022

Author : Omar Soub


Introduction

This analysis is conducted by using R-programming language as an analysis tool in the insurance industry, and specially for motor insurance .

The data set used called (datacar) and can be found on the (insurancedata) package on r.

>we have also used packages such as:
- tidyvarse.
- tidyr.
- janitor.
- knitr.
- insurancerating , and others.

>what we are going to do ?

———————————————-

Part One: data set view,summary-descriptive analysis


data set dimensions

lets have a look at the data set dimension :

## [1] 67856    10

the data set has 10 variables and and 67856 rows , the 1st 5 rows are as below :

veh_value exposure clm numclaims claimcst0 veh_body veh_age gender area agecat
1.06 0.3039014 0 0 0 HBACK 3 F C 2
1.03 0.6488706 0 0 0 HBACK 2 F A 4
3.26 0.5694730 0 0 0 UTE 2 F E 2
4.14 0.3175907 0 0 0 STNWG 2 F D 2
0.72 0.6488706 0 0 0 HBACK 4 F C 2
2.01 0.8542094 0 0 0 HDTOP 3 M C 4

Data set variables types

lets have a look at the variables types:

## Rows: 67,856
## Columns: 10
## $ veh_value <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, 0.38, …
## $ exposure  <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.64887064, …
## $ clm       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ numclaims <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ claimcst0 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
## $ veh_body  <fct> HBACK, HBACK, UTE, STNWG, HBACK, HDTOP, PANVN, HBACK, HBACK,…
## $ veh_age   <int> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, 4, 3, …
## $ gender    <fct> F, F, F, F, F, M, M, M, F, F, M, M, F, M, M, M, F, M, F, F, …
## $ area      <fct> C, A, E, D, C, C, A, B, A, B, A, C, C, A, B, C, F, C, D, C, …
## $ agecat    <int> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, 2, 3, …

now we will make some data transformation,will do the below :

1. convert agecat & veh_age types into factor

2. create a new variable called: veh_value_group


lets look again to the variables types :

## Rows: 67,856
## Columns: 11
## $ veh_value       <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, …
## $ exposure        <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.6488…
## $ clm             <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ numclaims       <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, …
## $ claimcst0       <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000…
## $ veh_body        <fct> HBACK, HBACK, UTE, STNWG, HBACK, HDTOP, PANVN, HBACK, …
## $ veh_age         <fct> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, …
## $ gender          <fct> F, F, F, F, F, M, M, M, F, F, M, M, F, M, M, M, F, M, …
## $ area            <fct> C, A, E, D, C, C, A, B, A, B, A, C, C, A, B, C, F, C, …
## $ agecat          <fct> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, …
## $ veh_value_group <fct> group 1, group 1, group 1, group 1, group 1, group 1, …

Data set summary and distribution

lets have the data summary and the variables distribution :

veh_value exposure clm numclaims claimcst0 veh_body veh_age gender area agecat veh_value_group
Min. : 0.000 Min. :0.002738 Min. :0.00000 Min. :0.00000 Min. : 0.0 SEDAN :22233 1:12257 F:38603 A:16312 1: 5742 group 1:67779
1st Qu.: 1.010 1st Qu.:0.219028 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.: 0.0 HBACK :18915 2:16587 M:29253 B:13341 2:12875 group 2: 70
Median : 1.500 Median :0.446270 Median :0.00000 Median :0.00000 Median : 0.0 STNWG :16261 3:20064 NA C:20540 3:15767 group 3: 6
Mean : 1.777 Mean :0.468651 Mean :0.06814 Mean :0.07276 Mean : 137.3 UTE : 4586 4:18948 NA D: 8173 4:16189 group 4: 1
3rd Qu.: 2.150 3rd Qu.:0.709103 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.: 0.0 TRUCK : 1750 NA NA E: 5912 5:10736 NA
Max. :34.560 Max. :0.999316 Max. :1.00000 Max. :4.00000 Max. :55922.1 HDTOP : 1579 NA NA F: 3578 6: 6547 NA
NA NA NA NA NA (Other): 2532 NA NA NA NA NA

Numerical variables


Categotical variables

———————————————-

Part Two: insurance risk measurements by factors.

as the insurance risk level is measured by the frequency and the severity; we will calculate the the same and group it by factors.

gender factor

gender claimcst0 numclaims exposure frequency average_severity risk_premium
F 4908749 2832 17954.60 0.1577311 1733.315 273.3978
M 4405855 2105 13846.21 0.1520271 2093.043 318.1993

we can see that :

the below chart visualize all together for more clear understanding:


veh_age factor

veh_age claimcst0 numclaims exposure frequency average_severity risk_premium
3 2718237 1446 9542.111 0.1515388 1879.832 284.8675
2 2486217 1354 7923.677 0.1708803 1836.202 313.7706
4 2554895 1261 8996.079 0.1401722 2026.087 284.0010
1 1555255 876 5338.951 0.1640772 1775.405 291.3034

we can see that :

the below chart visualize all together for more clear understating :


veh_body

veh_body claimcst0 numclaims exposure frequency average_severity risk_premium
HBACK 2589136.192 1330 8810.31348 0.1509594 1946.7189 293.8756
UTE 597208.965 276 2105.73032 0.1310709 2163.8006 283.6113
STNWG 2363091.211 1248 7638.39014 0.1633852 1893.5026 309.3703
HDTOP 294811.869 136 783.29911 0.1736246 2167.7343 376.3720
PANVN 133113.412 68 409.16085 0.1661938 1957.5502 325.3327
SEDAN 2681622.477 1598 10444.59959 0.1529977 1678.1117 256.7473
TRUCK 319496.849 130 843.96441 0.1540349 2457.6681 378.5667
COUPE 187723.251 75 319.12663 0.2350164 2502.9767 588.2406
MIBUS 116104.880 45 316.84052 0.1420273 2580.1084 366.4458
MCARA 10673.950 15 59.27995 0.2530367 711.5967 180.0601
BUS 13363.120 10 25.84805 0.3868764 1336.3120 516.9876
CONVT 6888.810 3 32.59685 0.0920334 2296.2700 211.3336
RDSTR 1369.458 3 11.66872 0.2570976 456.4861 117.3615

We can see that:

the below chart visualize all together for more clear understanding:


area

area claimcst0 numclaims exposure frequency average_severity risk_premium
C 2865707.2 1493 9578.494 0.1558700 1919.429 299.1814
A 2071765.6 1181 7597.101 0.1554540 1754.247 272.7048
E 868822.9 413 2771.866 0.1489971 2103.687 313.4434
D 911058.2 524 3819.518 0.1371901 1738.661 238.5270
B 1795295.2 1021 6297.848 0.1621189 1758.369 285.0649
F 801955.4 305 1735.992 0.1756921 2629.362 461.9581

we can see that :

the below chart visualize all together for more clear understanding:


age category

agecat claimcst0 numclaims exposure frequency average_severity risk_premium
2 1984840.8 1000 5891.871 0.1697254 1984.841 336.8778
4 2145303.0 1185 7616.542 0.1555824 1810.382 281.6636
6 683568.5 390 3099.666 0.1258200 1752.740 220.5297
3 2132107.1 1189 7409.457 0.1604706 1793.194 287.7549
5 1061412.2 648 5171.009 0.1253140 1637.982 205.2621
1 1307372.9 525 2612.274 0.2009743 2490.234 500.4732

we can see that :

the below chart visualize all together for more clear understanding:


now, lets create a new model that contains all factor variables and see how much it differs from the previous models and which one is the best.

## $statistics
##                           aic      bic bayes.factor      p
## numclaims_glm_all    34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_agecat 34862.03 34916.78 1.002062e+33       
## 
## $predicted_differences
##    0%   25%   50%   75%  100% 
## 0.000 0.001 0.004 0.009 0.247
## $statistics
##                         aic      bic bayes.factor      p
## numclaims_glm_all  34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_area 34938.22 34992.97 2.858943e+16       
## 
## $predicted_differences
##    0%   25%   50%   75%  100% 
## 0.000 0.002 0.006 0.014 0.284
## $statistics
##                           aic      bic bayes.factor      p
## numclaims_glm_all    34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_gender 34944.03 34962.28 1.317565e+23       
## 
## $predicted_differences
##    0%   25%   50%   75%  100% 
## 0.000 0.002 0.007 0.015 0.282
## $statistics
##                            aic      bic bayes.factor      p
## numclaims_glm_all     34822.37 35068.75 0.000000e+00 <2e-16
## numclaims_glm_veh_age 34920.50 34957.00 1.850433e+24       
## 
## $predicted_differences
##    0%   25%   50%   75%  100% 
## 0.000 0.002 0.006 0.013 0.287
## $statistics
##                             aic      bic bayes.factor      p
## numclaims_glm_all      34822.37 35068.75         0.00 <2e-16
## numclaims_glm_veh_body 34930.11 35048.74     22196.62       
## 
## $predicted_differences
##    0%   25%   50%   75%  100% 
## 0.000 0.002 0.006 0.014 0.132

from our results we can say that the numclaims_glm_all model is best .


lets apply the anova function on the model so we can well know how much the variables are important to the model

## Analysis of Deviance Table
## 
## Model: poisson, link: log
## 
## Response: numclaims
## 
## Terms added sequentially (first to last)
## 
## 
##          Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                     67855      25507              
## veh_body 12   37.563     67843      25469  0.000181 ***
## veh_age   3   33.543     67840      25436 2.474e-07 ***
## gender    1    1.595     67839      25434  0.206575    
## area      5   14.525     67834      25420  0.012598 *  
## agecat    5   86.074     67829      25334 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The drop in deviance allows to test whether the model terms are significant :

so the gender variable has the least impact on the model

—————————–

Part Three: claims count response to the predictors variables

the below chart express visually how number of claims will act (decrease on increase)-along with the lower and upper confident intervals-, when any of the factors increase by one unit:

if any factor increases by one unit ;the number of claims will increase or decrease by the coefficient value for the same factor.

—————————–

Part four: rating factors

now;lets look at the rating factor for the new model (numclaims_glm_all) which is the best performed model:

## Significance levels: *** p < 0.001; ** p < 0.01;
##     * p < 0.05; . p < 0.1
risk_factor level est_numclaims_glm_all exposure
(Intercept) (Intercept) 0.154456 NA
veh_body BUS 2.539240 ** 26
veh_body CONVT 0.548256 33
veh_body COUPE 1.534809 *** 319
veh_body HBACK 0.938495 . 8810
veh_body HDTOP 1.117535 783
veh_body MCARA 1.824920 * 59
veh_body MIBUS 0.957521 317
veh_body PANVN 1.074029 409
veh_body RDSTR 1.513937 12
veh_body SEDAN 1.000000 10445
veh_body STNWG 1.045286 7638
veh_body TRUCK 0.995693 844
veh_body UTE 0.840990 ** 2106
veh_age 1 1.089375 * 5339
veh_age 2 1.134451 *** 7924
veh_age 3 1.000000 9542
veh_age 4 0.925126 * 8996
gender F 1.000000 17955
gender M 0.976814 13846
area A 0.996318 7597
area B 1.048834 6298
area C 1.000000 9578
area D 0.891774 * 3820
area E 0.965319 2772
area F 1.065872 1736
agecat 1 1.293463 *** 2612
agecat 2 1.087360 . 5892
agecat 3 1.027766 7409
agecat 4 1.000000 7617
agecat 5 0.805326 *** 5171
agecat 6 0.820623 *** 3100

on graph:

## Significance levels: *** p < 0.001; ** p < 0.01;
##     * p < 0.05; . p < 0.1

—————————–

Conclusion:

preferred segments

the high risk segments