Introduction

  • In this chapter, we will use the R programming language to carry out exploratory and inferential analysis of the data in order to support decision making.

————————————————————————————————————————————————

Data Set and Data Types Overview


Data Set (first six rows):

veh_value exposure clm numclaims claimcst0 veh_body veh_age gender area agecat
1.06 0.3039014 0 0 0 HBACK 3 F C 2
1.03 0.6488706 0 0 0 HBACK 2 F A 4
3.26 0.5694730 0 0 0 UTE 2 F E 2
4.14 0.3175907 0 0 0 STNWG 2 F D 2
0.72 0.6488706 0 0 0 HBACK 4 F C 2
2.01 0.8542094 0 0 0 HDTOP 3 M C 4

Data set dimensions

Let's have a look at the data set dimensions:

## [1] 67856    10

The data set has 67,856 rows and 10 columns.
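As a quick sketch, the dimensions above and the column listing in the next subsection can be reproduced as follows (the data frame name data is an assumption):

library(dplyr)   # for glimpse()

# Inspect size, first rows and column types (sketch; 'data' is an assumed object name)
dim(data)
head(data)
glimpse(data)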


Data set variable types

Let's have a look at the variable types:

## Rows: 67,856
## Columns: 10
## $ veh_value <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, 0.38, …
## $ exposure  <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.64887064, …
## $ clm       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ numclaims <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ claimcst0 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
## $ veh_body  <chr> "HBACK", "HBACK", "UTE", "STNWG", "HBACK", "HDTOP", "PANVN",…
## $ veh_age   <dbl> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, 4, 3, …
## $ gender    <chr> "F", "F", "F", "F", "F", "M", "M", "M", "F", "F", "M", "M", …
## $ area      <chr> "C", "A", "E", "D", "C", "C", "A", "B", "A", "B", "A", "C", …
## $ agecat    <dbl> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, 2, 3, …

Now we will apply some data transformations, as follows (see the sketch below):

convert clm, numclaims, agecat, veh_age, veh_body, gender and area into factors
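A minimal sketch of this conversion (the data frame name data is an assumption):

# Convert the selected columns to factors (sketch)
factor_cols <- c("clm", "numclaims", "veh_body", "veh_age", "gender", "area", "agecat")
data[factor_cols] <- lapply(data[factor_cols], as.factor)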


Let's look again at the variable types after the conversion:

## Rows: 67,856
## Columns: 10
## $ veh_value <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, 0.38, …
## $ exposure  <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.64887064, …
## $ clm       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ numclaims <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ claimcst0 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
## $ veh_body  <fct> HBACK, HBACK, UTE, STNWG, HBACK, HDTOP, PANVN, HBACK, HBACK,…
## $ veh_age   <fct> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, 4, 3, …
## $ gender    <fct> F, F, F, F, F, M, M, M, F, F, M, M, F, M, M, M, F, M, F, F, …
## $ area      <fct> C, A, E, D, C, C, A, B, A, B, A, C, C, A, B, C, F, C, D, C, …
## $ agecat    <fct> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, 2, 3, …

————————————————————————

Data summary

##    veh_value         exposure        clm       numclaims   claimcst0      
##  Min.   : 0.000   Min.   :0.002738   0:63232   0:63232   Min.   :    0.0  
##  1st Qu.: 1.010   1st Qu.:0.219028   1: 4624   1: 4333   1st Qu.:    0.0  
##  Median : 1.500   Median :0.446270             2:  271   Median :    0.0  
##  Mean   : 1.777   Mean   :0.468651             3:   18   Mean   :  137.3  
##  3rd Qu.: 2.150   3rd Qu.:0.709103             4:    2   3rd Qu.:    0.0  
##  Max.   :34.560   Max.   :0.999316                       Max.   :55922.1  
##                                                                           
##     veh_body     veh_age   gender    area      agecat   
##  SEDAN  :22233   1:12257   F:38603   A:16312   1: 5742  
##  HBACK  :18915   2:16587   M:29253   B:13341   2:12875  
##  STNWG  :16261   3:20064             C:20540   3:15767  
##  UTE    : 4586   4:18948             D: 8173   4:16189  
##  TRUCK  : 1750                       E: 5912   5:10736  
##  HDTOP  : 1579                       F: 3578   6: 6547  
##  (Other): 2532

Numerical variables EDA

We will create a new data frame that contains only the numerical variables of interest: veh_value, exposure and claimcst0.
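A sketch of how this subset and the summary below could be produced (the summary layout resembles summarytools::descr() output; the object names and the package choice are assumptions):

library(dplyr)
library(summarytools)   # assumed; descr() prints a summary like the one below

# Numeric subset and descriptive statistics (sketch)
Numerical_variables <- data %>% select(veh_value, exposure, claimcst0)
descr(Numerical_variables)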


Numerical_variables summary

claimcst0 exposure veh_value
Mean 1.372702e+02 0.4686515 1.777021e+00
Std.Dev 1.056298e+03 0.2900254 1.205232e+00
Min 0.000000e+00 0.0027379 0.000000e+00
Q1 0.000000e+00 0.2190281 1.010000e+00
Median 0.000000e+00 0.4462697 1.500000e+00
Q3 0.000000e+00 0.7091034 2.150000e+00
Max 5.592213e+04 0.9993155 3.456000e+01
MAD 0.000000e+00 0.3612632 8.006040e-01
IQR 0.000000e+00 0.4900753 1.140000e+00
CV 7.695028e+00 0.6188509 6.782316e-01
Skewness 1.750173e+01 0.1755497 2.967891e+00
SE.Skewness 9.403100e-03 0.0094031 9.403100e-03
Kurtosis 4.798760e+02 -1.1425554 2.675256e+01
N.Valid 6.785600e+04 67856.0000000 6.785600e+04
Pct.Valid 1.000000e+02 100.0000000 1.000000e+02

Observations:

  • A. for claimcst0 (claim cost): the mean is 137, the median is 0, and the interquartile range (middle 50% of the data) is 0 as well ==>

    • A.1 mean > median, so the data are right skewed

    • A.2 the values are non-negative and heavily right skewed, so a gamma distribution is a reasonable candidate for the positive claim costs

  • B. for exposure: the mean is 0.469, the median is 0.446, and the interquartile range (middle 50% of the data) is 0.49 ==>

    • B.1 the values are not evenly spread over (0, 1), so the distribution is non-uniform
  • C. for veh_value: the mean is 1.78, the median is 1.50, and the interquartile range (middle 50% of the data) is 1.14 ==>

    • C.1 mean > median, so the data are right skewed

    • C.2 the values are non-negative and right skewed, so a gamma distribution is a reasonable candidate


Numerical variables Visualization


Correlation between Numerical variables

pairs plot

As the data are not normally distributed, we will compute the correlations using the Spearman method:
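A minimal sketch of the correlation matrix below (Numerical_variables is the assumed name of the numeric subset):

# Spearman rank correlations between the numeric variables (sketch)
cor(Numerical_variables, method = "spearman")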

veh_value exposure claimcst0
veh_value 1.0000000 0.0091092 0.0265644
exposure 0.0091092 1.0000000 0.1328116
claimcst0 0.0265644 0.1328116 1.0000000

Categorical variables EDA

We will create a new data frame that contains only the categorical variables of interest: numclaims, veh_body, veh_age, gender, area and agecat.
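A sketch of how this subset and the frequency tables below could be produced (the tables resemble summarytools::freq() output; the object names and the package choice are assumptions):

library(dplyr)
library(summarytools)   # assumed; freq() prints frequency tables like the ones below

# Categorical subset and frequency tables (sketch)
Categorical_variables <- data %>%
  select(numclaims, veh_body, veh_age, gender, area, agecat)
freq(Categorical_variables)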


Categorical_variables summary:

## Frequencies  
## Categorical_variables$numclaims  
## Type: Factor  
## 
##                Freq    % Valid   % Valid Cum.    % Total   % Total Cum.
## ----------- ------- ---------- -------------- ---------- --------------
##           0   63232    93.1856        93.1856    93.1856        93.1856
##           1    4333     6.3856        99.5712     6.3856        99.5712
##           2     271     0.3994        99.9705     0.3994        99.9705
##           3      18     0.0265        99.9971     0.0265        99.9971
##           4       2     0.0029       100.0000     0.0029       100.0000
##        <NA>       0                               0.0000       100.0000
##       Total   67856   100.0000       100.0000   100.0000       100.0000
## 
## Categorical_variables$veh_body  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##         BUS      48     0.071          0.071     0.071          0.071
##       CONVT      81     0.119          0.190     0.119          0.190
##       COUPE     780     1.149          1.340     1.149          1.340
##       HBACK   18915    27.875         29.215    27.875         29.215
##       HDTOP    1579     2.327         31.542     2.327         31.542
##       MCARA     127     0.187         31.729     0.187         31.729
##       MIBUS     717     1.057         32.786     1.057         32.786
##       PANVN     752     1.108         33.894     1.108         33.894
##       RDSTR      27     0.040         33.934     0.040         33.934
##       SEDAN   22233    32.765         66.699    32.765         66.699
##       STNWG   16261    23.964         90.663    23.964         90.663
##       TRUCK    1750     2.579         93.242     2.579         93.242
##         UTE    4586     6.758        100.000     6.758        100.000
##        <NA>       0                              0.000        100.000
##       Total   67856   100.000        100.000   100.000        100.000
## 
## Categorical_variables$veh_age  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           1   12257     18.06          18.06     18.06          18.06
##           2   16587     24.44          42.51     24.44          42.51
##           3   20064     29.57          72.08     29.57          72.08
##           4   18948     27.92         100.00     27.92         100.00
##        <NA>       0                               0.00         100.00
##       Total   67856    100.00         100.00    100.00         100.00
## 
## Categorical_variables$gender  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           F   38603     56.89          56.89     56.89          56.89
##           M   29253     43.11         100.00     43.11         100.00
##        <NA>       0                               0.00         100.00
##       Total   67856    100.00         100.00    100.00         100.00
## 
## Categorical_variables$area  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           A   16312     24.04          24.04     24.04          24.04
##           B   13341     19.66          43.70     19.66          43.70
##           C   20540     30.27          73.97     30.27          73.97
##           D    8173     12.04          86.01     12.04          86.01
##           E    5912      8.71          94.73      8.71          94.73
##           F    3578      5.27         100.00      5.27         100.00
##        <NA>       0                               0.00         100.00
##       Total   67856    100.00         100.00    100.00         100.00
## 
## Categorical_variables$agecat  
## Type: Factor  
## 
##                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
##           1    5742      8.46           8.46      8.46           8.46
##           2   12875     18.97          27.44     18.97          27.44
##           3   15767     23.24          50.67     23.24          50.67
##           4   16189     23.86          74.53     23.86          74.53
##           5   10736     15.82          90.35     15.82          90.35
##           6    6547      9.65         100.00      9.65         100.00
##        <NA>       0                               0.00         100.00
##       Total   67856    100.00         100.00    100.00         100.00

Observations:

  • A. for numclaims: ==>

    • A.1 The number of risks with no claims is 63,232 (93.19%) out of 67,856 risks in total

    • A.2 The number of risks with one claim is 4,333 (6.39%)

    • A.3 The number of risks with two claims is 271 (0.40%)

    • A.4 The number of risks with three claims is 18 (0.03%)

    • A.5 The number of risks with four claims is 2 (0.003%)

  • B. for veh_body: ==>

    • B.1 SEDAN, HBACK and STNWG have the highest frequencies, with 22,233 (32.77%), 18,915 (27.88%) and 16,261 (23.96%) respectively

    • B.2 RDSTR, BUS and CONVT have the lowest frequencies, with 27 (0.04%), 48 (0.07%) and 81 (0.12%) respectively

  • C. for veh_age: ==>

    • C.1 The most frequent category is veh_age == 3 with 20,064 risks (29.57%)

    • C.2 The least frequent category is veh_age == 1 with 12,257 risks (18.06%)

  • D. for gender: ==>

    • D.1 Females are more frequent than males, with 38,603 (56.89%) females and 29,253 (43.11%) males

  • E. for area: ==>

    • E.1 The most frequent category is area C with 20,540 risks (30.27%)

    • E.2 The least frequent category is area F with 3,578 risks (5.27%)

  • F. for agecat: ==>

    • F.1 The most frequent categories are agecat == 4 and agecat == 3, with 16,189 (23.86%) and 15,767 (23.24%) respectively

    • F.2 The least frequent categories are agecat == 1 and agecat == 6, with 5,742 (8.46%) and 6,547 (9.65%) respectively


Categorical variables Visualization


Categorical variables in terms of claims occurrence:

1. gender :
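The cross-tabulations in this subsection look like gtsummary tables; a hedged sketch of how the gender table below could be built, assuming that package and a data frame named data:

library(dplyr)
library(gtsummary)   # assumed

# Tabulate gender by claim occurrence with an overall column and a chi-squared p-value (sketch)
data %>%
  select(gender, clm) %>%
  tbl_summary(by = clm) %>%
  add_overall() %>%
  add_p()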

Characteristic    N        Overall, N = 67,856¹    0, N = 63,232¹    1, N = 4,624¹    p-value²
gender            67,856                                                              0.6
    F                      38,603 (57%)            35,955 (57%)      2,648 (57%)
    M                      29,253 (43%)            27,277 (43%)      1,976 (43%)
¹ n (%)
² Pearson's Chi-squared test
  • Insights

A. Out of 67,856 risks, 4,624 have positive claims

B. Females account for a larger share of the claims than males (57% vs 43%), roughly in line with their share of the portfolio

C. The p-value of the Pearson chi-squared test is 0.6, which is greater than 0.05, so statistically there is no significant difference between genders with respect to claim occurrence


  • Visualizing gender against clm


now we want to assess and report the effect size and the magnitude of differences between these groups

library(effectsize)   # assumed source of interpret_phi(); it may also be a user-defined helper

table <- table(data$clm, data$gender)
chi_square_test <- chisq.test(table)
phi_coefficient <- sqrt(chi_square_test$statistic / sum(table))
cat(paste("The effect size (Phi Coefficient) is", phi_coefficient, "indicating", interpret_phi(phi_coefficient), "effect size"))
## The effect size (Phi Coefficient) is 0.0019987264321987 indicating tiny effect size
  • claims probabilities by gender
##  gender   prob      SE  df asymp.LCL asymp.UCL null  z.ratio p.value
##  F      0.0686 0.00129 Inf    0.0661    0.0712  0.5 -129.543  <.0001
##  M      0.0675 0.00147 Inf    0.0647    0.0705  0.5 -112.676  <.0001
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale 
## Tests are performed on the logit scale
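A sketch of how estimated claim probabilities like those above could be produced, assuming a binomial GLM and the emmeans package (model and object names are assumptions):

library(emmeans)   # assumed

# Estimated claim probability by gender from a logistic regression (sketch)
fit <- glm(clm ~ gender, family = binomial, data = data)
emm <- emmeans(fit, ~ gender, type = "response")   # back-transform from the logit scale
summary(emm, infer = TRUE)                         # asymptotic CIs and tests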
  • Insights

The probability of a claim is slightly higher for females than for males: 0.0686 for females versus 0.0675 for males.

  • To visualize our results :


2. area :

Characteristic    N        Overall, N = 67,856¹    0, N = 63,232¹    1, N = 4,624¹    p-value²
area              67,856                                                              0.003
    A                      16,312 (24%)            15,227 (24%)      1,085 (23%)
    B                      13,341 (20%)            12,376 (20%)      965 (21%)
    C                      20,540 (30%)            19,128 (30%)      1,412 (31%)
    D                      8,173 (12%)             7,677 (12%)       496 (11%)
    E                      5,912 (8.7%)            5,526 (8.7%)      386 (8.3%)
    F                      3,578 (5.3%)            3,298 (5.2%)      280 (6.1%)
¹ n (%)
² Pearson's Chi-squared test
  • Insights

A. Area C has the highest number of claims, accounting for 31% of the risks with a claim

B. Area F has the lowest number of claims, accounting for 6.1% of the risks with a claim

C. The p-value of the Pearson chi-squared test is 0.003, which is less than 0.05, so statistically there is a significant difference between areas with respect to claim occurrence


  • Visualizing area against clm


now we want to assess and report the effect size and the magnitude of differences between these groups

library(vcd)          # assocstats() for Cramér's V
library(effectsize)   # assumed source of interpret_cramers_v(); it may also be a user-defined helper

table <- table(data$clm, data$area)
cat(paste("The effect size (Cramér's V) is", assocstats(table)$cramer, "indicating", interpret_cramers_v(assocstats(table)$cramer), "effect size"))
## The effect size (Cramér's V) is  0.0163593665773446 indicating tiny effect size
  • claims probabilities by area
##  area   prob      SE  df asymp.LCL asymp.UCL null z.ratio p.value
##  A    0.0665 0.00195 Inf    0.0628    0.0704  0.5 -84.066  <.0001
##  B    0.0723 0.00224 Inf    0.0681    0.0769  0.5 -76.337  <.0001
##  C    0.0687 0.00177 Inf    0.0654    0.0723  0.5 -94.504  <.0001
##  D    0.0607 0.00264 Inf    0.0557    0.0661  0.5 -59.130  <.0001
##  E    0.0653 0.00321 Inf    0.0593    0.0719  0.5 -50.552  <.0001
##  F    0.0783 0.00449 Inf    0.0699    0.0875  0.5 -39.621  <.0001
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale 
## Tests are performed on the logit scale
  • Insights

We can see that area F has the highest claim probability (0.0783, about 7.8%) and area D has the lowest (0.0607, about 6.1%).

  • To visualize our results :


3. agecat :

Characteristic    N        Overall, N = 67,856¹    0, N = 63,232¹    1, N = 4,624¹    p-value²
agecat            67,856                                                              <0.001
    1                      5,742 (8.5%)            5,246 (8.3%)      496 (11%)
    2                      12,875 (19%)            11,943 (19%)      932 (20%)
    3                      15,767 (23%)            14,654 (23%)      1,113 (24%)
    4                      16,189 (24%)            15,085 (24%)      1,104 (24%)
    5                      10,736 (16%)            10,122 (16%)      614 (13%)
    6                      6,547 (9.6%)            6,182 (9.8%)      365 (7.9%)
¹ n (%)
² Pearson's Chi-squared test

Insights

A. Age categories 3 and 4 have the highest numbers of claims, each accounting for about 24% of the risks with a claim (noting that both categories also have the highest overall counts of observations)

B. Age category 6 has the lowest number of claims, accounting for 7.9% of the risks with a claim

C. The p-value of the Pearson chi-squared test is < 0.001, which is less than 0.05, so statistically there is a significant difference between age categories with respect to claim occurrence


  • Visualizing Age Category against clm


now we want to assess and report the effect size and the magnitude of differences between these groups

## The effect size (Cramér's V) is  0.0324228335586784 indicating tiny effect size
  • claims probabilities by agecat
##  agecat   prob      SE  df asymp.LCL asymp.UCL null z.ratio p.value
##  1      0.0864 0.00371 Inf    0.0794    0.0939  0.5 -50.210  <.0001
##  2      0.0724 0.00228 Inf    0.0680    0.0770  0.5 -74.994  <.0001
##  3      0.0706 0.00204 Inf    0.0667    0.0747  0.5 -82.904  <.0001
##  4      0.0682 0.00198 Inf    0.0644    0.0722  0.5 -83.865  <.0001
##  5      0.0572 0.00224 Inf    0.0530    0.0617  0.5 -67.429  <.0001
##  6      0.0558 0.00284 Inf    0.0504    0.0616  0.5 -52.530  <.0001
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale 
## Tests are performed on the logit scale
  • Insights

We can see that agecat 1 has the highest claim probability (0.0864, about 8.6%) and agecat 6 has the lowest (0.0558, about 5.6%).

  • To visualize our results :


4. veh_age :

Characteristic    N        Overall, N = 67,856¹    0, N = 63,232¹    1, N = 4,624¹    p-value²
veh_age           67,856                                                              <0.001
    1                      12,257 (18%)            11,432 (18%)      825 (18%)
    2                      16,587 (24%)            15,328 (24%)      1,259 (27%)
    3                      20,064 (30%)            18,702 (30%)      1,362 (29%)
    4                      18,948 (28%)            17,770 (28%)      1,178 (25%)
¹ n (%)
² Pearson's Chi-squared test

Insights

A. Vehicle age 3 has the highest number of claims, accounting for 29% of the risks with a claim (noting that it also has the highest overall count of observations)

B. Vehicle age 1 has the lowest number of claims, accounting for 18% of the risks with a claim

C. The p-value of the Pearson chi-squared test is < 0.001, which is less than 0.05, so statistically there is a significant difference between vehicle ages with respect to claim occurrence


  • Visualizing Vehicle Age Category against clm


now we want to assess and report the effect size and the magnitude of differences between these groups

## The effect size (Cramér's V) is  0.0197729280570061 indicating tiny effect size
  • claims probabilities by veh_age
##  veh_age   prob      SE  df asymp.LCL asymp.UCL null z.ratio p.value
##  1       0.0673 0.00226 Inf    0.0630    0.0719  0.5 -72.921  <.0001
##  2       0.0759 0.00206 Inf    0.0720    0.0800  0.5 -85.251  <.0001
##  3       0.0679 0.00178 Inf    0.0645    0.0714  0.5 -93.341  <.0001
##  4       0.0622 0.00175 Inf    0.0588    0.0657  0.5 -90.198  <.0001
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale 
## Tests are performed on the logit scale
  • Insights

We can see that veh_age 2 has the highest claim probability (0.0759, about 7.6%) and veh_age 4 has the lowest (0.0622, about 6.2%).

  • To visualize our results :


5. veh_body :

Characteristic    N        Overall, N = 67,856¹    0, N = 63,232¹    1, N = 4,624¹    p-value
veh_body          67,856
    BUS                    48 (<0.1%)              39 (<0.1%)        9 (0.2%)
    CONVT                  81 (0.1%)               78 (0.1%)         3 (<0.1%)
    COUPE                  780 (1.1%)              712 (1.1%)        68 (1.5%)
    HBACK                  18,915 (28%)            17,651 (28%)      1,264 (27%)
    HDTOP                  1,579 (2.3%)            1,449 (2.3%)      130 (2.8%)
    MCARA                  127 (0.2%)              113 (0.2%)        14 (0.3%)
    MIBUS                  717 (1.1%)              674 (1.1%)        43 (0.9%)
    PANVN                  752 (1.1%)              690 (1.1%)        62 (1.3%)
    RDSTR                  27 (<0.1%)              25 (<0.1%)        2 (<0.1%)
    SEDAN                  22,233 (33%)            20,757 (33%)      1,476 (32%)
    STNWG                  16,261 (24%)            15,088 (24%)      1,173 (25%)
    TRUCK                  1,750 (2.6%)            1,630 (2.6%)      120 (2.6%)
    UTE                    4,586 (6.8%)            4,326 (6.8%)      260 (5.6%)
¹ n (%)

Insights

A. SEDAN has the highest number of claims, accounting for 32% of the risks with a claim (noting that it also has the highest overall count of observations)

B. CONVT and RDSTR have the lowest numbers of claims, each accounting for less than 0.1% of the risks with a claim


  • Visualizing veh_body Category against clm


now we want to assess and report the effect size and the magnitude of differences between these groups

## The effect size (Cramér's V) is  0.0252738312900611 indicating tiny effect size
  • claims probabilities by veh_body
##  veh_body   prob      SE  df asymp.LCL asymp.UCL null z.ratio p.value
##  BUS      0.1875 0.05634 Inf    0.1005    0.3227  0.5  -3.965  0.0001
##  CONVT    0.0370 0.02097 Inf    0.0120    0.1086  0.5  -5.540  <.0001
##  COUPE    0.0872 0.01010 Inf    0.0693    0.1091  0.5 -18.503  <.0001
##  HBACK    0.0668 0.00182 Inf    0.0634    0.0705  0.5 -90.549  <.0001
##  HDTOP    0.0823 0.00692 Inf    0.0697    0.0969  0.5 -26.335  <.0001
##  MCARA    0.1102 0.02779 Inf    0.0664    0.1776  0.5  -7.371  <.0001
##  MIBUS    0.0600 0.00887 Inf    0.0448    0.0799  0.5 -17.497  <.0001
##  PANVN    0.0824 0.01003 Inf    0.0648    0.1044  0.5 -18.174  <.0001
##  RDSTR    0.0741 0.05040 Inf    0.0186    0.2525  0.5  -3.437  0.0006
##  SEDAN    0.0664 0.00167 Inf    0.0632    0.0697  0.5 -98.133  <.0001
##  STNWG    0.0721 0.00203 Inf    0.0683    0.0762  0.5 -84.269  <.0001
##  TRUCK    0.0686 0.00604 Inf    0.0576    0.0814  0.5 -27.581  <.0001
##  UTE      0.0567 0.00341 Inf    0.0504    0.0638  0.5 -44.034  <.0001
## 
## Confidence level used: 0.95 
## Intervals are back-transformed from the logit scale 
## Tests are performed on the logit scale
  • Insights

We can see that veh_body BUS has the highest claim probability (0.1875, about 18.8%) and veh_body CONVT has the lowest (0.0370, about 3.7%).

  • To visualize our results :

————————————————————————

Data summary - Positive claims

The positive-claims data is the subset of records with claims > 0 (claimcst0 > 0, equivalently clm == 1); see the sketch below.
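A minimal sketch of how this subset could be built (the object names data and df are assumptions; the later tests appear to use df for this subset):

library(dplyr)

# Keep only the risks with a positive claim (sketch)
df <- data %>%
  filter(claimcst0 > 0) %>%   # equivalently clm == 1
  select(-clm)                # clm is constant in this subset, so it is dropped
summary(df)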


##    veh_value         exposure        numclaims   claimcst0          veh_body   
##  Min.   : 0.000   Min.   :0.002738   0:   0    Min.   :  200.0   SEDAN  :1476  
##  1st Qu.: 1.100   1st Qu.:0.410678   1:4333    1st Qu.:  353.8   HBACK  :1264  
##  Median : 1.570   Median :0.637919   2: 271    Median :  761.6   STNWG  :1173  
##  Mean   : 1.859   Mean   :0.611271   3:  18    Mean   : 2014.4   UTE    : 260  
##  3rd Qu.: 2.310   3rd Qu.:0.832307   4:   2    3rd Qu.: 2091.4   HDTOP  : 130  
##  Max.   :13.900   Max.   :0.999316             Max.   :55922.1   TRUCK  : 120  
##                                                                  (Other): 201  
##  veh_age  gender   area     agecat  
##  1: 825   F:2648   A:1085   1: 496  
##  2:1259   M:1976   B: 965   2: 932  
##  3:1362            C:1412   3:1113  
##  4:1178            D: 496   4:1104  
##                    E: 386   5: 614  
##                    F: 280   6: 365  
## 

Numerical variables EDA

We will create a new data frame that contains only the numerical variables of interest: veh_value, exposure and claimcst0.


Numerical_variables summary

claimcst0 exposure veh_value
Mean 2.014404e+03 0.6112714 1.8591955
Std.Dev 3.548907e+03 0.2616474 1.1595951
Min 2.000000e+02 0.0027379 0.0000000
Q1 3.537700e+02 0.4106776 1.1000000
Median 7.615650e+02 0.6379192 1.5700000
Q3 2.091900e+03 0.8323066 2.3100000
Max 5.592213e+04 0.9993155 13.9000000
MAD 8.325763e+02 0.3125536 0.8302560
IQR 1.737655e+03 0.4216290 1.2100000
CV 1.761765e+00 0.4280381 0.6237080
Skewness 5.038509e+00 -0.3468537 1.8535754
SE.Skewness 3.601020e-02 0.0360102 0.0360102
Kurtosis 4.019591e+01 -0.8719807 6.9007387
N.Valid 4.624000e+03 4624.0000000 4624.0000000
Pct.Valid 1.000000e+02 100.0000000 100.0000000

Observations:

  • A. for claimcst0 (claim cost): the mean is 2,014, the median is 762, and the interquartile range (middle 50% of the data) is 1,738 ==>

    (noting that for the full data set the mean was 137, the median was 0, and the interquartile range was 0)

  • B. for exposure: the mean is 0.611, the median is 0.638, and the interquartile range (middle 50% of the data) is 0.42 ==>

    (noting that for the full data set the mean was 0.469, the median was 0.446, and the interquartile range was 0.49)

  • C. for veh_value: the mean is 1.86, the median is 1.57, and the interquartile range (middle 50% of the data) is 1.21 ==>

    (noting that for the full data set the mean was 1.78, the median was 1.50, and the interquartile range was 1.14)


Numerical variables Visualization


Correlation between Numerical variables

pairs plot

As the data are not normally distributed, we will compute the correlations using the Spearman method:

veh_value exposure claimcst0
veh_value 1.0000000 0.0624592 -0.0219103
exposure 0.0624592 1.0000000 -0.0418187
claimcst0 -0.0219103 -0.0418187 1.0000000

Categorical variables EDA

We will create a new data frame that contains only the categorical variables of interest: numclaims, veh_body, veh_age, gender, area and agecat. The first six rows are shown below:

numclaims veh_body veh_age gender area agecat
1 SEDAN 3 M B 6
1 SEDAN 3 F F 4
1 HBACK 3 M C 4
2 STNWG 3 M F 2
1 STNWG 2 M F 3
1 HBACK 3 F A 4

Categorical_variables summary:

## Frequencies  
## Categorical_variables$numclaims  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           0      0     0.000          0.000     0.000          0.000
##           1   4333    93.707         93.707    93.707         93.707
##           2    271     5.861         99.567     5.861         99.567
##           3     18     0.389         99.957     0.389         99.957
##           4      2     0.043        100.000     0.043        100.000
##        <NA>      0                              0.000        100.000
##       Total   4624   100.000        100.000   100.000        100.000
## 
## Categorical_variables$veh_body  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##         BUS      9     0.195          0.195     0.195          0.195
##       CONVT      3     0.065          0.260     0.065          0.260
##       COUPE     68     1.471          1.730     1.471          1.730
##       HBACK   1264    27.336         29.066    27.336         29.066
##       HDTOP    130     2.811         31.877     2.811         31.877
##       MCARA     14     0.303         32.180     0.303         32.180
##       MIBUS     43     0.930         33.110     0.930         33.110
##       PANVN     62     1.341         34.451     1.341         34.451
##       RDSTR      2     0.043         34.494     0.043         34.494
##       SEDAN   1476    31.920         66.414    31.920         66.414
##       STNWG   1173    25.368         91.782    25.368         91.782
##       TRUCK    120     2.595         94.377     2.595         94.377
##         UTE    260     5.623        100.000     5.623        100.000
##        <NA>      0                              0.000        100.000
##       Total   4624   100.000        100.000   100.000        100.000
## 
## Categorical_variables$veh_age  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    825     17.84          17.84     17.84          17.84
##           2   1259     27.23          45.07     27.23          45.07
##           3   1362     29.46          74.52     29.46          74.52
##           4   1178     25.48         100.00     25.48         100.00
##        <NA>      0                               0.00         100.00
##       Total   4624    100.00         100.00    100.00         100.00
## 
## Categorical_variables$gender  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           F   2648     57.27          57.27     57.27          57.27
##           M   1976     42.73         100.00     42.73         100.00
##        <NA>      0                               0.00         100.00
##       Total   4624    100.00         100.00    100.00         100.00
## 
## Categorical_variables$area  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           A   1085     23.46          23.46     23.46          23.46
##           B    965     20.87          44.33     20.87          44.33
##           C   1412     30.54          74.87     30.54          74.87
##           D    496     10.73          85.60     10.73          85.60
##           E    386      8.35          93.94      8.35          93.94
##           F    280      6.06         100.00      6.06         100.00
##        <NA>      0                               0.00         100.00
##       Total   4624    100.00         100.00    100.00         100.00
## 
## Categorical_variables$agecat  
## Type: Factor  
## 
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    496     10.73          10.73     10.73          10.73
##           2    932     20.16          30.88     20.16          30.88
##           3   1113     24.07          54.95     24.07          54.95
##           4   1104     23.88          78.83     23.88          78.83
##           5    614     13.28          92.11     13.28          92.11
##           6    365      7.89         100.00      7.89         100.00
##        <NA>      0                               0.00         100.00
##       Total   4624    100.00         100.00    100.00         100.00

Observations:

  • A. for numclaims: ==>

    • A.1 The number of risks with one claim is 4,333 (93.71%) out of the 4,624 risks with positive claims

    • A.2 The number of risks with two claims is 271 (5.86%)

    • A.3 The number of risks with three claims is 18 (0.39%)

    • A.4 The number of risks with four claims is 2 (0.04%)

  • B. for veh_body: ==>

    • B.1 SEDAN, HBACK and STNWG account for the most claims, with 1,476 (31.92%), 1,264 (27.34%) and 1,173 (25.37%) respectively out of the 4,624 risks with positive claims

    • B.2 RDSTR, CONVT and BUS account for the fewest claims, with 2 (0.04%), 3 (0.07%) and 9 (0.20%) respectively

  • C. for veh_age: ==>

    • C.1 The category with the most claims is veh_age == 3 with 1,362 (29.46%)

    • C.2 The category with the fewest claims is veh_age == 1 with 825 (17.84%)

  • D. for gender: ==>

    • D.1 Females account for more claims than males, with 2,648 (57.27%) for females and 1,976 (42.73%) for males

  • E. for area: ==>

    • E.1 The category with the most claims is area C with 1,412 (30.54%)

    • E.2 The category with the fewest claims is area F with 280 (6.06%)

  • F. for agecat: ==>

    • F.1 The categories with the most claims are agecat == 3 and agecat == 4, with 1,113 (24.07%) and 1,104 (23.88%) respectively

    • F.2 The category with the fewest claims is agecat == 6 with 365 (7.89%)


Categorical variables Visualization

————————————————————————

Inferential Analysis

Choosing the right statistical tests

  • In order to determine which statistical tests to apply, we first need to check the normality of the target variable, claimcst0 (the claim cost):

Normality check of claimcst0 by visualization


Normality check of claimcst0 using the Anderson-Darling normality test
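A minimal sketch of the test call, assuming the nortest package:

library(nortest)   # assumed; provides ad.test()

# Anderson-Darling test for normality of the claim cost (sketch)
ad.test(df$claimcst0)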

## 
##  Anderson-Darling normality test
## 
## data:  df$claimcst0
## A = 654.01, p-value < 2.2e-16

Based on our above analysis results:

  • Normality: The Anderson-Darling normality test (p-value < 0.05) indicates non-normal distribution.

==> Given these results, we will use non-parametric statistical tests.


Applying non-parametric statistical tests

A. Mann-Whitney U Test

The Mann-Whitney U test is used for features with two categories. In R it is equivalent to wilcox.test(), the Wilcoxon rank-sum test with continuity correction; the Wilcoxon signed-rank (exact) test is not appropriate here because it is designed for paired samples.

  • A.1 Mann-Whitney U Test on claimcst0 against gender
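A minimal sketch of this test call, using the positive-claims data frame df:

# Wilcoxon rank-sum (Mann-Whitney U) test of claim cost by gender (sketch)
wilcox.test(df$claimcst0 ~ df$gender)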
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  df$claimcst0 by df$gender
## W = 2520284, p-value = 0.03234
## alternative hypothesis: true location shift is not equal to 0

After applying the test we found that:

The p-value is less than 0.05, so statistically there is a significant difference in the claim-cost distribution between genders.


  • A.2 Visualization


  • A.3 wilcox_effsize :

  • now we want to assess and report the effect size and the magnitude of differences between these groups
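A sketch of this calculation, assuming the rstatix package (the object name effect_size is chosen to match the power-analysis code further below):

library(rstatix)   # assumed; provides wilcox_effsize()

# Effect size r (Z / sqrt(N)) for the Wilcoxon rank-sum test (sketch)
effect_size <- wilcox_effsize(df, claimcst0 ~ gender)
effect_size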

## # A tibble: 1 × 7
##   .y.       group1 group2 effsize    n1    n2 magnitude
## * <chr>     <chr>  <chr>    <dbl> <int> <int> <ord>    
## 1 claimcst0 F      M       0.0315  2648  1976 small

The wilcox_effsize() result shows an effect size of 0.0315, indicating a small effect.


  • A.4 Assumption Checks:

  • Check the homogeneity of variances using Levene’s test.

library(car)   # assumed source of leveneTest()

levene_test <- leveneTest(claimcst0 ~ gender, data = df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    1  12.543 0.0004016 ***
##       4622                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The test indicates that the variances are significantly different across the groups (p-value < 0.05).

  • A.5 Post-hoc Power Analysis:

  • Conduct a post-hoc power analysis based on the effect size.

# Post-hoc Power Analysis
# t-test approximation (two samples with unequal n)
library(pwr)
## Warning: package 'pwr' was built under R version 4.3.3
effect_size_value <- effect_size$effsize
# Note: length(df$gender == "M") returns nrow(df), so n1 = n2 = 4,624 in the output below;
# sum(df$gender == "M") would give the actual per-group counts (1,976 males and 2,648 females).
power_analysis <- pwr.t2n.test(n1 = length(df$gender=="M"), n2 = length(df$gender=="F"),
                               d = effect_size_value,
                               sig.level = 0.05, power = NULL)
print(power_analysis)
## 
##      t test power calculation 
## 
##              n1 = 4624
##              n2 = 4624
##               d = 0.03147296
##       sig.level = 0.05
##           power = 0.3277674
##     alternative = two.sided
  • The power of the test is 0.328, meaning there is roughly a 32.8% chance of detecting a true effect of this size.

B. kruskal.test

B.1 kruskal.test on claims against area

  • B.1.1 kruskal.test on claimcst0 against area, as area has more than two levels
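A minimal sketch of this test call:

# Kruskal-Wallis rank-sum test of claim cost across areas (sketch)
kruskal.test(df$claimcst0, df$area)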
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$claimcst0 and df$area
## Kruskal-Wallis chi-squared = 26.616, df = 5, p-value = 6.776e-05

After applying the test we found that:

The p-value is less than 0.05, so statistically there is a significant difference in the claim-cost distribution between areas.


  • B.1.2 Visualization


  • B.1.3 kruskal_effsize :

  • now we want to assess and report the effect size and the magnitude of differences between these groups
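A sketch of this calculation, assuming the rstatix package (the object name eff_size is chosen to match the power-analysis code further below):

library(rstatix)   # assumed; provides kruskal_effsize()

# Eta-squared effect size based on the Kruskal-Wallis H statistic (sketch)
eff_size <- kruskal_effsize(df, claimcst0 ~ area)
eff_size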

.y. n effsize method magnitude
claimcst0 4624 0.0046807 eta2[H] small
  • The kruskal_effsize() result shows an effect size (eta-squared based on the H statistic) of 0.0047, indicating a small effect.

  • B.1.4 To examine these differences in detail, we perform Dunn's test with Holm correction of the p-values (see the sketch below):
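A minimal sketch of Dunn's test, assuming the FSA package (whose dunnTest() prints columns like those below):

library(FSA)   # assumed; provides dunnTest()

# Pairwise Dunn's test with Holm-adjusted p-values (sketch)
dunnTest(claimcst0 ~ area, data = df, method = "holm")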
##    Comparison          Z      P.unadj        P.adj
## 1       A - B  0.3929272 6.943733e-01 1.0000000000
## 2       A - C -0.1226836 9.023577e-01 0.9023576683
## 3       B - C -0.5348567 5.927489e-01 1.0000000000
## 4       A - D -1.0455938 2.957486e-01 1.0000000000
## 5       B - D -1.3404712 1.800922e-01 1.0000000000
## 6       C - D -0.9908884 3.217401e-01 1.0000000000
## 7       A - E -2.8037166 5.051729e-03 0.0505172881
## 8       B - E -3.0477582 2.305554e-03 0.0276666430
## 9       C - E -2.8067635 5.004197e-03 0.0550461713
## 10      D - E -1.6131418 1.067137e-01 0.8537095799
## 11      A - F -3.9167774 8.974053e-05 0.0011666269
## 12      B - F -4.1238931 3.725219e-05 0.0005587828
## 13      C - F -3.9375509 8.231746e-05 0.0011524445
## 14      D - F -2.7541273 5.884887e-03 0.0529639842
## 15      E - F -1.2278040 2.195206e-01 1.0000000000

From the table above we can see that:

  • The adjusted p-value is below 0.05 for the pairs B-E, A-F, B-F and C-F, so statistically there is a significant difference in the claim-cost distribution between these area pairs; there is no statistically significant difference between any of the other area pairs

  • B.1.5 Assumption Checks:

  • Check the homogeneity of variances using Levene’s test.

levene_test <- leveneTest(claimcst0 ~area,data=df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value   Pr(>F)   
## group    5  3.9598 0.001384 **
##       4618                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The test indicates that the variances are significantly different across the groups (p-value < 0.05).

  • B.1.6 Post-hoc Power Analysis:

  • Conduct a post-hoc power analysis based on the effect size.

# Approximate Cohen's f from eta-squared (f = sqrt(eta2 / (1 - eta2)), roughly sqrt(eta2) for small effects)
effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$area)),
                                 n = nrow(df) / length(unique(df$area)),  # assumes roughly balanced groups
                                 f = effect_size,
                                 sig.level = 0.05,
                                 power = NULL)
print(power_analysis)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 6
##               n = 770.6667
##               f = 0.06841581
##       sig.level = 0.05
##           power = 0.966717
## 
## NOTE: n is number in each group
  • The power of the test is 0.967, meaning there is roughly a 96.7% chance of detecting a true effect of this size.

B.2 kruskal.test on claims against veh_body

  • B.2.1 kruskal.test on claimcst0 against veh_body, as veh_body has more than two levels
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$claimcst0 and df$veh_body
## Kruskal-Wallis chi-squared = 18.358, df = 12, p-value = 0.1052

After applying the test we found that:

The p-value is greater than 0.05, so statistically there is no significant difference in the claim-cost distribution between vehicle body types.


  • B.2.2 Visualization


B.3 kruskal.test on claims against age category

  • B.3.1 kruskal.test on claimcst0 against agecat, as agecat has more than two levels
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$claimcst0 and df$agecat
## Kruskal-Wallis chi-squared = 11.099, df = 5, p-value = 0.04946

After applying the test we found that:

The p-value is less than 0.05 (though only marginally, at 0.049), so statistically there is a significant difference in the claim-cost distribution between age categories.


  • B.3.2 Visualization


  • B.3.3 kruskal_effsize :

  • now we want to assess and report the effect size and the magnitude of differences between these groups

.y. n effsize method magnitude
claimcst0 4624 0.0013206 eta2[H] small
  • The kruskal_effsize() result shows an effect size (eta-squared based on the H statistic) of 0.0013, indicating a small effect.

  • B.3.4 To examine these differences in detail, we perform Dunn's test with Holm correction of the p-values:
##    Comparison           Z     P.unadj      P.adj
## 1       1 - 2  2.25523606 0.024118516 0.28942219
## 2       1 - 3  2.50162552 0.012362461 0.16071199
## 3       2 - 3  0.21869627 0.826886659 1.00000000
## 4       1 - 4  2.74351582 0.006078512 0.08509917
## 5       2 - 4  0.51604229 0.605824873 1.00000000
## 6       3 - 4  0.31181652 0.755179965 1.00000000
## 7       1 - 5  3.16592394 0.001545912 0.02318868
## 8       2 - 5  1.26571867 0.205613823 1.00000000
## 9       3 - 5  1.11552540 0.264625342 1.00000000
## 10      4 - 5  0.85082071 0.394868957 1.00000000
## 11      1 - 6  1.94645859 0.051599677 0.56759645
## 12      2 - 6  0.14394179 0.885546439 1.00000000
## 13      3 - 6 -0.01363333 0.989122512 0.98912251
## 14      4 - 6 -0.23298594 0.815772326 1.00000000
## 15      5 - 6 -0.86090528 0.389290214 1.00000000

From the table above we can see that:

  • The adjusted p-value is below 0.05 only for the pair agecat 1-5, so statistically there is a significant difference in the claim-cost distribution between these two groups; there is no statistically significant difference between any of the other agecat pairs

  • B.3.5 Assumption Checks:

  • Check the homogeneity of variances using Levene’s test.

levene_test <- leveneTest(claimcst0 ~agecat,data=df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value    Pr(>F)    
## group    5  4.5834 0.0003578 ***
##       4618                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The test indicates that the variances are significantly different across the groups (p-value < 0.05).

  • B.3.6 Post-hoc Power Analysis:

  • Conduct a post-hoc power analysis based on the effect size.

effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$agecat)),
                                 n = nrow(df) / length(unique(df$agecat)),
                                 f = effect_size,
                                 sig.level = 0.05,
                                 power = NULL)
print(power_analysis)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 6
##               n = 770.6667
##               f = 0.03634062
##       sig.level = 0.05
##           power = 0.4397375
## 
## NOTE: n is number in each group
  • The power of the test is 0.440, meaning there is roughly a 44% chance of detecting a true effect of this size.

B.4 kruskal.test on claims against number of claims

  • B.4.1 kruskal.test on claimcst0 against numclaims, as numclaims has more than two levels
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$claimcst0 and df$numclaims
## Kruskal-Wallis chi-squared = 159.02, df = 3, p-value < 2.2e-16

After applying the test we found that:

The p-value is less than 0.05, so statistically there is a significant difference in the claim-cost distribution between numbers of claims.


  • B.4.2 Visualization


  • B.4.3 kruskal_effsize :

  • now we want to assess and report the effect size and the magnitude of differences between these groups

.y. n effsize method magnitude
claimcst0 4624 0.0337713 eta2[H] small
  • The kruskal_effsize() result shows an effect size (eta-squared based on the H statistic) of 0.0338, indicating a small effect.

  • B.4.4 To examine these differences in detail, we perform Dunn's test with Holm correction of the p-values:
##   Comparison           Z      P.unadj        P.adj
## 1      1 - 2 -11.6335467 2.782897e-31 1.669738e-30
## 2      1 - 3  -4.7312845 2.231036e-06 1.115518e-05
## 3      2 - 3  -1.5983152 1.099729e-01 3.299186e-01
## 4      1 - 4  -1.7924333 7.306358e-02 2.922543e-01
## 5      2 - 4  -0.7598618 4.473372e-01 8.946744e-01
## 6      3 - 4  -0.2015760 8.402482e-01 8.402482e-01

From the table above we can see that:

  • The adjusted p-value is below 0.05 for the pairs numclaims 1-2 and numclaims 1-3, so statistically there is a significant difference in the claim-cost distribution between these groups; there is no statistically significant difference between any of the other numclaims pairs

  • B.4.5 Assumption Checks:

  • Check the homogeneity of variances using Levene’s test.

levene_test <- leveneTest(claimcst0 ~numclaims,data=df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    3  0.9987 0.3923
##       4620
  • The test indicates that the variances are not significantly different across the groups (p-value > 0.05).

  • B.4.6 Post-hoc Power Analysis:

  • Conduct a post-hoc power analysis based on the effect size.

effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$numclaims)),
                                 n = nrow(df) / length(unique(df$numclaims)),
                                 f = effect_size,
                                 sig.level = 0.05,
                                 power = NULL)
print(power_analysis)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 4
##               n = 1156
##               f = 0.1837697
##       sig.level = 0.05
##           power = 1
## 
## NOTE: n is number in each group
  • The power of the test is essentially 1, meaning a near-100% chance of detecting a true effect of this size.

B.5 kruskal.test on claims against veh_age

  • B.5.1 kruskal.test on claimcst0 against veh_age, as veh_age has more than two levels
## 
##  Kruskal-Wallis rank sum test
## 
## data:  df$claimcst0 and df$veh_age
## Kruskal-Wallis chi-squared = 18.668, df = 3, p-value = 0.0003201

After applying the test we found that:

The p-value is less than 0.05, so statistically there is a significant difference in the claim-cost distribution between vehicle ages.


  • B.5.2 Visualization


  • B.5.3 kruskal_effsize :

  • now we want to assess and report the effect size and the magnitude of differences between these groups

.y. n effsize method magnitude
claimcst0 4624 0.0033914 eta2[H] small
  • The kruskal_effsize() result shows an effect size (eta-squared based on the H statistic) of 0.0034, indicating a small effect.

  • B.5.4 To examine these differences in detail, we perform Dunn's test with Holm correction of the p-values:
##   Comparison         Z      P.unadj        P.adj
## 1      1 - 2 -1.401384 1.610994e-01 0.3221987936
## 2      1 - 3 -2.860319 4.232150e-03 0.0169286002
## 3      2 - 3 -1.622098 1.047823e-01 0.3143468597
## 4      1 - 4 -3.988218 6.657135e-05 0.0003994281
## 5      2 - 4 -2.918060 3.522165e-03 0.0176108256
## 6      3 - 4 -1.379031 1.678852e-01 0.1678852290

From the table above we can see that:

  • The adjusted p-value is below 0.05 only for the pairs veh_age 1-3, 1-4 and 2-4, so statistically there is a significant difference in the claim-cost distribution between these groups; there is no statistically significant difference between any of the other veh_age pairs

  • B.5.5 Assumption Checks:

  • Check the homogeneity of variances using Levene’s test.

levene_test <- leveneTest(claimcst0 ~veh_age,data=df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
##         Df F value Pr(>F)
## group    3   0.926 0.4272
##       4620
  • The test indicates that the variances are not significantly different across the groups (p-value > 0.05).

  • B.5.6 Post-hoc Power Analysis:

  • Conduct a post-hoc power analysis based on the effect size.

effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$veh_age)),
                                 n = nrow(df) / length(unique(df$veh_age)),
                                 f = effect_size,
                                 sig.level = 0.05,
                                 power = NULL)
print(power_analysis)
## 
##      Balanced one-way analysis of variance power calculation 
## 
##               k = 4
##               n = 1156
##               f = 0.0582361
##       sig.level = 0.05
##           power = 0.9288195
## 
## NOTE: n is number in each group
  • The power of the test is 0.929, meaning there is roughly a 92.9% chance of detecting a true effect of this size.

————————————————————————

Insurance risk measurements by factors.

  • Since insurance risk is measured by claim frequency and severity, we compute the claim frequency (claims per unit of exposure), the average severity (cost per claim) and the risk premium (cost per unit of exposure), grouped by each factor (see the sketch below).
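A sketch of how these grouped measures could be computed with dplyr (column names follow the tables below; treating numclaims as numeric is an assumption, since it was converted to a factor earlier):

library(dplyr)

# Risk measures by a grouping factor, e.g. gender (sketch)
data %>%
  group_by(gender) %>%
  summarise(claimcst0 = sum(claimcst0),
            numclaims = sum(as.numeric(as.character(numclaims))),
            exposure  = sum(exposure)) %>%
  mutate(frequency        = numclaims / exposure,   # claims per unit of exposure
         average_severity = claimcst0 / numclaims,  # cost per claim
         risk_premium     = claimcst0 / exposure)   # cost per unit of exposure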

gender factor

gender claimcst0 numclaims exposure frequency average_severity risk_premium
F 4908749 2832 17954.60 0.1577311 1733.315 273.3978
M 4405855 2105 13846.21 0.1520271 2093.043 318.1993

We can see that:

  • females have more claims than males.

  • the total claim cost for females is higher than for males, but the average severity is lower, because the cost is spread over a larger number of claims.

  • females and males have almost the same claim frequency.

  • the female risk premium is lower than the male one, driven by the higher total exposure (and lower severity) for females.

The chart below visualizes all of these measures together for a clearer picture:


veh_age factor

veh_age claimcst0 numclaims exposure frequency average_severity risk_premium
3 2718237 1446 9542.111 0.1515388 1879.832 284.8675
2 2486217 1354 7923.677 0.1708803 1836.202 313.7706
4 2554895 1261 8996.079 0.1401722 2026.087 284.0010
1 1555255 876 5338.951 0.1640772 1775.405 291.3034

We can see that:

  • the total claim cost for vehicle age 1 is the lowest, while vehicle age 3 has the highest.

  • vehicle age 1 has the lowest number of claims, while vehicle age 3 has the highest.

  • the exposure for vehicle age 1 is the lowest, while vehicle age 3 has the highest.

  • the claim frequency for vehicle age 4 is the lowest, while vehicle age 2 has the highest.

  • the average severity for vehicle age 1 is the lowest, while vehicle age 4 has the highest.

  • the risk premium for vehicle ages 3 and 4 is the lowest and almost identical, while vehicle age 2 has the highest.

The chart below visualizes all of these measures together for a clearer picture:


veh_body factor

veh_body claimcst0 numclaims exposure frequency average_severity risk_premium
HBACK 2589136.192 1330 8810.31348 0.1509594 1946.7189 293.8756
UTE 597208.965 276 2105.73032 0.1310709 2163.8006 283.6113
STNWG 2363091.211 1248 7638.39014 0.1633852 1893.5026 309.3703
HDTOP 294811.869 136 783.29911 0.1736246 2167.7343 376.3720
PANVN 133113.412 68 409.16085 0.1661938 1957.5502 325.3327
SEDAN 2681622.477 1598 10444.59959 0.1529977 1678.1117 256.7473
TRUCK 319496.849 130 843.96441 0.1540349 2457.6681 378.5667
COUPE 187723.251 75 319.12663 0.2350164 2502.9767 588.2406
MIBUS 116104.880 45 316.84052 0.1420273 2580.1084 366.4458
MCARA 10673.950 15 59.27995 0.2530367 711.5967 180.0601
BUS 13363.120 10 25.84805 0.3868764 1336.3120 516.9876
CONVT 6888.810 3 32.59685 0.0920334 2296.2700 211.3336
RDSTR 1369.458 3 11.66872 0.2570976 456.4861 117.3615

We can see that:

  • SEDAN, HBACK and STNWG have the highest total claim costs (about 2.68M, 2.59M and 2.36M respectively), while RDSTR has the lowest.

  • RDSTR, CONVT, BUS, MCARA, MIBUS, PANVN and COUPE have fewer than 100 claims; SEDAN, STNWG and HBACK have more than 1,000 claims; the remaining body types have between 130 and 276 claims.

  • BUS has the highest claim frequency at 38.7%; RDSTR, MCARA and COUPE have frequencies of 25.7%, 25.3% and 23.5%; all other groups have frequencies below 18%.

  • RDSTR has the lowest average severity, while MIBUS, COUPE and TRUCK have the highest.

  • the risk premium is below 200 for RDSTR and MCARA, and above 500 for BUS and COUPE.

The chart below visualizes all of these measures together for a clearer picture:


area factor

area claimcst0 numclaims exposure frequency average_severity risk_premium
C 2865707.2 1493 9578.494 0.1558700 1919.429 299.1814
A 2071765.6 1181 7597.101 0.1554540 1754.247 272.7048
E 868822.9 413 2771.866 0.1489971 2103.687 313.4434
D 911058.2 524 3819.518 0.1371901 1738.661 238.5270
B 1795295.2 1021 6297.848 0.1621189 1758.369 285.0649
F 801955.4 305 1735.992 0.1756921 2629.362 461.9581

We can see that:

  • the total claim cost is lowest for area F and highest for area C.

  • area F has the lowest number of claims, while area C has the highest.

  • area D has the lowest claim frequency, while area F has the highest.

  • area D has the lowest average severity, while area F has the highest.

  • area D has the lowest risk premium, while area F has the highest.

The chart below visualizes all of these measures together for a clearer picture:


age category factor

agecat claimcst0 numclaims exposure frequency average_severity risk_premium
2 1984840.8 1000 5891.871 0.1697254 1984.841 336.8778
4 2145303.0 1185 7616.542 0.1555824 1810.382 281.6636
6 683568.5 390 3099.666 0.1258200 1752.740 220.5297
3 2132107.1 1189 7409.457 0.1604706 1793.194 287.7549
5 1061412.2 648 5171.009 0.1253140 1637.982 205.2621
1 1307372.9 525 2612.274 0.2009743 2490.234 500.4732

We can see that:

  • age category 6 has the lowest total claim cost, while age category 4 has the highest.

  • age category 6 has the lowest number of claims, while age categories 3 and 4 have the highest (1,189 and 1,185).

  • age categories 5 and 6 have the lowest claim frequency (about 12.5%), while age category 1 has the highest at about 20%.

  • age category 5 has the lowest average severity, while age category 1 has the highest.

  • age category 5 has the lowest risk premium, while age category 1 has the highest.

The chart below visualizes all of these measures together for a clearer picture: