————————————————————————————————————————————————
| veh_value | exposure | clm | numclaims | claimcst0 | veh_body | veh_age | gender | area | agecat |
|---|---|---|---|---|---|---|---|---|---|
| 1.06 | 0.3039014 | 0 | 0 | 0 | HBACK | 3 | F | C | 2 |
| 1.03 | 0.6488706 | 0 | 0 | 0 | HBACK | 2 | F | A | 4 |
| 3.26 | 0.5694730 | 0 | 0 | 0 | UTE | 2 | F | E | 2 |
| 4.14 | 0.3175907 | 0 | 0 | 0 | STNWG | 2 | F | D | 2 |
| 0.72 | 0.6488706 | 0 | 0 | 0 | HBACK | 4 | F | C | 2 |
| 2.01 | 0.8542094 | 0 | 0 | 0 | HDTOP | 3 | M | C | 4 |
Let's look at the dimensions of the data set:
## [1] 67856 10
The data set has 67856 rows and 10 columns.
Let's look at the variable types:
## Rows: 67,856
## Columns: 10
## $ veh_value <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, 0.38, …
## $ exposure <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.64887064, …
## $ clm <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ numclaims <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ claimcst0 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
## $ veh_body <chr> "HBACK", "HBACK", "UTE", "STNWG", "HBACK", "HDTOP", "PANVN",…
## $ veh_age <dbl> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, 4, 3, …
## $ gender <chr> "F", "F", "F", "F", "F", "M", "M", "M", "F", "F", "M", "M", …
## $ area <chr> "C", "A", "E", "D", "C", "C", "A", "B", "A", "B", "A", "C", …
## $ agecat <dbl> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, 2, 3, …
Now we will transform the data as follows:
Convert clm, numclaims, agecat, veh_age, veh_body, gender and area to factors.
Let's look at the variable types again after the conversion:
## Rows: 67,856
## Columns: 10
## $ veh_value <dbl> 1.06, 1.03, 3.26, 4.14, 0.72, 2.01, 1.60, 1.47, 0.52, 0.38, …
## $ exposure <dbl> 0.30390144, 0.64887064, 0.56947296, 0.31759069, 0.64887064, …
## $ clm <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ numclaims <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, …
## $ claimcst0 <dbl> 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.0000, 0.00…
## $ veh_body <fct> HBACK, HBACK, UTE, STNWG, HBACK, HDTOP, PANVN, HBACK, HBACK,…
## $ veh_age <fct> 3, 2, 2, 2, 4, 3, 3, 2, 4, 4, 2, 3, 2, 1, 3, 2, 3, 3, 4, 3, …
## $ gender <fct> F, F, F, F, F, M, M, M, F, F, M, M, F, M, M, M, F, M, F, F, …
## $ area <fct> C, A, E, D, C, C, A, B, A, B, A, C, C, A, B, C, F, C, D, C, …
## $ agecat <fct> 2, 4, 2, 2, 2, 4, 4, 6, 3, 4, 2, 4, 4, 5, 6, 4, 4, 4, 2, 3, …
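The conversion to factors can be sketched in base R as below. This is a minimal illustration, not the report's own code: the toy `data` frame only holds the first six rows printed at the top of this section, and `factor_cols` is a name introduced here. `clm` and `numclaims` are included because the output above shows them as `<fct>` as well.

```r
# Toy data: the first six rows shown at the top of this section
data <- data.frame(
  veh_value = c(1.06, 1.03, 3.26, 4.14, 0.72, 2.01),
  clm       = c(0, 0, 0, 0, 0, 0),
  numclaims = c(0, 0, 0, 0, 0, 0),
  veh_body  = c("HBACK", "HBACK", "UTE", "STNWG", "HBACK", "HDTOP"),
  veh_age   = c(3, 2, 2, 2, 4, 3),
  gender    = c("F", "F", "F", "F", "F", "M"),
  area      = c("C", "A", "E", "D", "C", "C"),
  agecat    = c(2, 4, 2, 2, 2, 4),
  stringsAsFactors = FALSE
)

# Columns to treat as categorical (hypothetical helper name)
factor_cols <- c("clm", "numclaims", "veh_body", "veh_age",
                 "gender", "area", "agecat")

# Convert every listed column to a factor in one step
data[factor_cols] <- lapply(data[factor_cols], factor)

str(data)
```

Numeric columns such as `veh_value` are left untouched, so they still feed into the numerical summaries.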
————————————————————————
## veh_value exposure clm numclaims claimcst0
## Min. : 0.000 Min. :0.002738 0:63232 0:63232 Min. : 0.0
## 1st Qu.: 1.010 1st Qu.:0.219028 1: 4624 1: 4333 1st Qu.: 0.0
## Median : 1.500 Median :0.446270 2: 271 Median : 0.0
## Mean : 1.777 Mean :0.468651 3: 18 Mean : 137.3
## 3rd Qu.: 2.150 3rd Qu.:0.709103 4: 2 3rd Qu.: 0.0
## Max. :34.560 Max. :0.999316 Max. :55922.1
##
## veh_body veh_age gender area agecat
## SEDAN :22233 1:12257 F:38603 A:16312 1: 5742
## HBACK :18915 2:16587 M:29253 B:13341 2:12875
## STNWG :16261 3:20064 C:20540 3:15767
## UTE : 4586 4:18948 D: 8173 4:16189
## TRUCK : 1750 E: 5912 5:10736
## HDTOP : 1579 F: 3578 6: 6547
## (Other): 2532
We will create a new data frame that contains only the needed numerical variables: veh_value, exposure and claimcst0
| Statistic | claimcst0 | exposure | veh_value |
|---|---|---|---|
| Mean | 1.372702e+02 | 0.4686515 | 1.777021e+00 |
| Std.Dev | 1.056298e+03 | 0.2900254 | 1.205232e+00 |
| Min | 0.000000e+00 | 0.0027379 | 0.000000e+00 |
| Q1 | 0.000000e+00 | 0.2190281 | 1.010000e+00 |
| Median | 0.000000e+00 | 0.4462697 | 1.500000e+00 |
| Q3 | 0.000000e+00 | 0.7091034 | 2.150000e+00 |
| Max | 5.592213e+04 | 0.9993155 | 3.456000e+01 |
| MAD | 0.000000e+00 | 0.3612632 | 8.006040e-01 |
| IQR | 0.000000e+00 | 0.4900753 | 1.140000e+00 |
| CV | 7.695028e+00 | 0.6188509 | 6.782316e-01 |
| Skewness | 1.750173e+01 | 0.1755497 | 2.967891e+00 |
| SE.Skewness | 9.403100e-03 | 0.0094031 | 9.403100e-03 |
| Kurtosis | 4.798760e+02 | -1.1425554 | 2.675256e+01 |
| N.Valid | 6.785600e+04 | 67856.0000000 | 6.785600e+04 |
| Pct.Valid | 1.000000e+02 | 100.0000000 | 1.000000e+02 |
Observations:
A. for claims: the mean is 137, the median is 0, and the interquartile range (the spread of the middle 50% of the data) is 0 as well ==>
A.1 mean > median and the data is right skewed
A.2 the data is non-negative and right skewed ==> a gamma distribution is a candidate for the positive claim amounts (the zeros, about 93% of the rows, have to be handled separately)
B. for exposure: the mean is 0.469, the median is 0.446, and the interquartile range is 0.490 ==> the mean is close to the median and the skewness is only 0.18, so the distribution is roughly symmetric
C. for veh_value: the mean is 1.777, the median is 1.500, and the interquartile range is 1.140 ==>
C.1 mean > median and the data is right skewed
C.2 the data is non-negative and right skewed ==> a gamma distribution is a plausible candidate (note that the minimum is exactly 0)
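The statistics behind these observations can be reproduced in base R. The sketch below is an assumption-laden illustration: `describe_numeric` is a hypothetical helper, the skewness is the simple moment estimator (the summary table above may use a slightly different estimator), and `toy_claims` is made-up data that mimics a zero-inflated claim cost.

```r
# Hypothetical helper: a few of the statistics from the summary table
describe_numeric <- function(x) {
  m <- mean(x)
  s <- sd(x)
  c(
    mean     = m,
    median   = median(x),
    sd       = s,
    iqr      = IQR(x),          # Q3 - Q1
    cv       = s / m,           # coefficient of variation
    skewness = mean((x - m)^3) / mean((x - m)^2)^1.5  # moment estimator
  )
}

# Made-up zero-inflated sample: nine policies without a claim, one with a cost of 500
toy_claims <- c(rep(0, 9), 500)
stats <- describe_numeric(toy_claims)
print(round(stats, 3))
```

With most values at zero, the median and the interquartile range are both 0 while the mean is pulled upwards by the single large value, which is the same pattern seen for claimcst0.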
pairs plot
As the data is not normally distributed, we will compute the correlations using the ‘spearman’ method
| | veh_value | exposure | claimcst0 |
|---|---|---|---|
| veh_value | 1.0000000 | 0.0091092 | 0.0265644 |
| exposure | 0.0091092 | 1.0000000 | 0.1328116 |
| claimcst0 | 0.0265644 | 0.1328116 | 1.0000000 |
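As a brief illustration of why a rank-based correlation suits skewed data, the sketch below compares Pearson and Spearman on simulated values. The variables `x` and `y` are synthetic and not taken from the data set.

```r
set.seed(1)
x <- runif(50, 0, 5)
y <- exp(x)  # monotone in x, but strongly non-linear and right skewed

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

cat("Pearson: ", round(pearson, 3), "\n")
cat("Spearman:", round(spearman, 3), "\n")

# For a whole data frame of numeric columns, cor() returns a matrix
cor_matrix <- cor(data.frame(x = x, y = y), method = "spearman")
print(cor_matrix)
```

Spearman works on ranks, so a perfectly monotone relationship scores 1 even when the relationship is far from linear.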
We will create a new data frame that contains only the needed categorical variables: numclaims, veh_body, veh_age, gender, area and agecat
## Frequencies
## Categorical_variables$numclaims
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- ---------- -------------- ---------- --------------
## 0 63232 93.1856 93.1856 93.1856 93.1856
## 1 4333 6.3856 99.5712 6.3856 99.5712
## 2 271 0.3994 99.9705 0.3994 99.9705
## 3 18 0.0265 99.9971 0.0265 99.9971
## 4 2 0.0029 100.0000 0.0029 100.0000
## <NA> 0 0.0000 100.0000
## Total 67856 100.0000 100.0000 100.0000 100.0000
##
## Categorical_variables$veh_body
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## BUS 48 0.071 0.071 0.071 0.071
## CONVT 81 0.119 0.190 0.119 0.190
## COUPE 780 1.149 1.340 1.149 1.340
## HBACK 18915 27.875 29.215 27.875 29.215
## HDTOP 1579 2.327 31.542 2.327 31.542
## MCARA 127 0.187 31.729 0.187 31.729
## MIBUS 717 1.057 32.786 1.057 32.786
## PANVN 752 1.108 33.894 1.108 33.894
## RDSTR 27 0.040 33.934 0.040 33.934
## SEDAN 22233 32.765 66.699 32.765 66.699
## STNWG 16261 23.964 90.663 23.964 90.663
## TRUCK 1750 2.579 93.242 2.579 93.242
## UTE 4586 6.758 100.000 6.758 100.000
## <NA> 0 0.000 100.000
## Total 67856 100.000 100.000 100.000 100.000
##
## Categorical_variables$veh_age
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## 1 12257 18.06 18.06 18.06 18.06
## 2 16587 24.44 42.51 24.44 42.51
## 3 20064 29.57 72.08 29.57 72.08
## 4 18948 27.92 100.00 27.92 100.00
## <NA> 0 0.00 100.00
## Total 67856 100.00 100.00 100.00 100.00
##
## Categorical_variables$gender
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## F 38603 56.89 56.89 56.89 56.89
## M 29253 43.11 100.00 43.11 100.00
## <NA> 0 0.00 100.00
## Total 67856 100.00 100.00 100.00 100.00
##
## Categorical_variables$area
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## A 16312 24.04 24.04 24.04 24.04
## B 13341 19.66 43.70 19.66 43.70
## C 20540 30.27 73.97 30.27 73.97
## D 8173 12.04 86.01 12.04 86.01
## E 5912 8.71 94.73 8.71 94.73
## F 3578 5.27 100.00 5.27 100.00
## <NA> 0 0.00 100.00
## Total 67856 100.00 100.00 100.00 100.00
##
## Categorical_variables$agecat
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------- --------- -------------- --------- --------------
## 1 5742 8.46 8.46 8.46 8.46
## 2 12875 18.97 27.44 18.97 27.44
## 3 15767 23.24 50.67 23.24 50.67
## 4 16189 23.86 74.53 23.86 74.53
## 5 10736 15.82 90.35 15.82 90.35
## 6 6547 9.65 100.00 9.65 100.00
## <NA> 0 0.00 100.00
## Total 67856 100.00 100.00 100.00 100.00
Observations:
A. for numclaims: ==>
A.1 The number of risks with no claims is 63232 (93.19%) out of the total of 67856 risks
A.2 The number of risks with one claim is 4333 (6.39%) out of the total of 67856 risks
A.3 The number of risks with two claims is 271 (0.40%) out of the total of 67856 risks
A.4 The number of risks with three claims is 18 (0.03%) out of the total of 67856 risks
A.5 The number of risks with four claims is 2 (0.003%) out of the total of 67856 risks
B. for veh_body: ==>
B.1 SEDAN, HBACK and STNWG have the highest frequencies, with 22233 (32.77%), 18915 (27.88%) and 16261 (23.96%) respectively, out of the total of 67856 risks
B.2 RDSTR, BUS and CONVT have the lowest frequencies, with 27 (0.04%), 48 (0.07%) and 81 (0.12%) respectively, out of the total of 67856 risks
C. for veh_age: ==>
C.1 The most frequent category is veh_age == 3 with 20064 (29.57%) out of the total of 67856 risks
C.2 The least frequent category is veh_age == 1 with 12257 (18.06%) out of the total of 67856 risks
D. for gender: ==>
D.1 Females account for 38603 (56.89%) and males for 29253 (43.11%) of the total of 67856 risks
F. for area: ==>
F.1 The most frequent category is area C with 20540 (30.27%) out of the total of 67856 risks
F.2 The least frequent category is area F with 3578 (5.27%) out of the total of 67856 risks
G. for agecat: ==>
G.1 The most frequent categories are agecat == 4 and agecat == 3 with 16189 (23.86%) and 15767 (23.24%) respectively, out of the total of 67856 risks
G.2 The least frequent categories are agecat == 1 and agecat == 6 with 5742 (8.46%) and 6547 (9.65%) respectively, out of the total of 67856 risks
| Characteristic | N | Overall, N = 67,856¹ | 0, N = 63,232¹ | 1, N = 4,624¹ | p-value² |
|---|---|---|---|---|---|
| gender | 67,856 | | | | 0.6 |
| F | | 38,603 (57%) | 35,955 (57%) | 2,648 (57%) | |
| M | | 29,253 (43%) | 27,277 (43%) | 1,976 (43%) | |
| ¹ n (%) | | | | | |
| ² Pearson’s Chi-squared test | | | | | |
A. Out of 67856 risks, 4624 have positive claims
B. Females account for a larger share of the risks with claims than males (57% vs 43%), in line with their share of all risks
C. The p-value of the Pearson chi-squared test is 0.6 ==> greater than 0.05, so there is no statistically significant association between gender and having a claim
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
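library(effectsize)  # provides interpret_phi()
# 2x2 contingency table of the claim indicator by gender
tab <- table(data$clm, data$gender)
chi_square_test <- chisq.test(tab)
# phi = sqrt(chi-squared / N)
phi_coefficient <- sqrt(chi_square_test$statistic / sum(tab))
cat(paste("The effect size (Phi Coefficient) is", phi_coefficient, "indicating", interpret_phi(phi_coefficient), "effect size"))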
## The effect size (Phi Coefficient) is 0.0019987264321987 indicating tiny effect size
## gender prob SE df asymp.LCL asymp.UCL null z.ratio p.value
## F 0.0686 0.00129 Inf 0.0661 0.0712 0.5 -129.543 <.0001
## M 0.0675 0.00147 Inf 0.0647 0.0705 0.5 -112.676 <.0001
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## Tests are performed on the logit scale
The probability of having a claim is slightly higher for females than for males: 0.0686 for females and 0.0675 for males
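The probabilities, standard errors and intervals in the output above can be checked by hand from the counts in the gender table. The sketch below is a rough reconstruction under the assumption that the output comes from a logistic model with gender as the only predictor; `counts` and `res` are names introduced here.

```r
# Counts taken from the gender table above
counts <- data.frame(
  gender = c("F", "M"),
  claims = c(2648, 1976),
  n      = c(38603, 29253)
)

p        <- counts$claims / counts$n          # observed claim probability
se_logit <- 1 / sqrt(counts$n * p * (1 - p))  # standard error on the logit scale
z        <- qnorm(0.975)                      # 95% confidence level

res <- data.frame(
  gender  = counts$gender,
  prob    = p,
  SE      = p * (1 - p) * se_logit,               # delta method, response scale
  lower   = plogis(qlogis(p) - z * se_logit),     # back-transformed from the logit
  upper   = plogis(qlogis(p) + z * se_logit),
  z.ratio = qlogis(p) / se_logit                  # tests prob = 0.5 on the logit scale
)
print(res, digits = 3)
```

The large negative z ratios only say that both probabilities are far below the null value of 0.5; they do not compare females with males.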
| Characteristic | N | Overall, N = 67,856¹ | 0, N = 63,232¹ | 1, N = 4,624¹ | p-value² |
|---|---|---|---|---|---|
| area | 67,856 | | | | 0.003 |
| A | | 16,312 (24%) | 15,227 (24%) | 1,085 (23%) | |
| B | | 13,341 (20%) | 12,376 (20%) | 965 (21%) | |
| C | | 20,540 (30%) | 19,128 (30%) | 1,412 (31%) | |
| D | | 8,173 (12%) | 7,677 (12%) | 496 (11%) | |
| E | | 5,912 (8.7%) | 5,526 (8.7%) | 386 (8.3%) | |
| F | | 3,578 (5.3%) | 3,298 (5.2%) | 280 (6.1%) | |
| ¹ n (%) | | | | | |
| ² Pearson’s Chi-squared test | | | | | |
A. Area C has the highest number of claims: 1,412, or 31% of all risks with claims
B. Area F has the lowest number of claims: 280, or 6.1% of all risks with claims
C. The p-value of the Pearson chi-squared test is 0.003 ==> less than 0.05, so there is a statistically significant association between area and having a claim
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
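library(vcd)         # provides assocstats()
library(effectsize)  # provides interpret_cramers_v()
tab <- table(data$clm, data$area)
cramers_v <- assocstats(tab)$cramer  # compute once, reuse below
cat(paste("The effect size (Cramér's V) is ", cramers_v, "indicating", interpret_cramers_v(cramers_v), "effect size"))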
## The effect size (Cramér's V) is 0.0163593665773446 indicating tiny effect size
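Cramér's V can also be computed directly from the counts in the area table, without extra packages. This is a sketch in base R; the matrix `tab` is typed in from the table above and the formula used is V = sqrt(chi-squared / (N * (min(rows, columns) - 1))).

```r
# Counts from the area table above: rows are clm = 0 / 1, columns are areas
tab <- matrix(
  c(15227, 12376, 19128, 7677, 5526, 3298,
     1085,   965,  1412,  496,  386,  280),
  nrow = 2, byrow = TRUE,
  dimnames = list(clm = c("0", "1"), area = c("A", "B", "C", "D", "E", "F"))
)

# Pearson chi-squared statistic (no continuity correction)
chi2 <- unname(chisq.test(tab, correct = FALSE)$statistic)

# Cramér's V
cramers_v <- sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1)))
cat("Cramér's V:", round(cramers_v, 4), "\n")
```

With almost 68,000 rows even a very small V comes with a small p-value, which is why the effect size is reported next to the test.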
## area prob SE df asymp.LCL asymp.UCL null z.ratio p.value
## A 0.0665 0.00195 Inf 0.0628 0.0704 0.5 -84.066 <.0001
## B 0.0723 0.00224 Inf 0.0681 0.0769 0.5 -76.337 <.0001
## C 0.0687 0.00177 Inf 0.0654 0.0723 0.5 -94.504 <.0001
## D 0.0607 0.00264 Inf 0.0557 0.0661 0.5 -59.130 <.0001
## E 0.0653 0.00321 Inf 0.0593 0.0719 0.5 -50.552 <.0001
## F 0.0783 0.00449 Inf 0.0699 0.0875 0.5 -39.621 <.0001
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## Tests are performed on the logit scale
We can see that area == F has the highest claim probability, 0.0783 (7.83%), and area == D has the lowest, 0.0607 (6.07%)
| Characteristic | N | Overall, N = 67,856¹ | 0, N = 63,232¹ | 1, N = 4,624¹ | p-value² |
|---|---|---|---|---|---|
| agecat | 67,856 | | | | <0.001 |
| 1 | | 5,742 (8.5%) | 5,246 (8.3%) | 496 (11%) | |
| 2 | | 12,875 (19%) | 11,943 (19%) | 932 (20%) | |
| 3 | | 15,767 (23%) | 14,654 (23%) | 1,113 (24%) | |
| 4 | | 16,189 (24%) | 15,085 (24%) | 1,104 (24%) | |
| 5 | | 10,736 (16%) | 10,122 (16%) | 614 (13%) | |
| 6 | | 6,547 (9.6%) | 6,182 (9.8%) | 365 (7.9%) | |
| ¹ n (%) | | | | | |
| ² Pearson’s Chi-squared test | | | | | |
Insights
A. Age categories 3 and 4 have the highest number of claims, 1,113 and 1,104 (24% each); note that these two categories also have the highest total counts of agecat observations
B. Age category 6 has the lowest number of claims: 365 (7.9%)
C. The p-value of the Pearson chi-squared test is < 0.001 ==> less than 0.05, so there is a statistically significant association between age category and having a claim
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
## The effect size (Cramér's V) is 0.0324228335586784 indicating tiny effect size
## agecat prob SE df asymp.LCL asymp.UCL null z.ratio p.value
## 1 0.0864 0.00371 Inf 0.0794 0.0939 0.5 -50.210 <.0001
## 2 0.0724 0.00228 Inf 0.0680 0.0770 0.5 -74.994 <.0001
## 3 0.0706 0.00204 Inf 0.0667 0.0747 0.5 -82.904 <.0001
## 4 0.0682 0.00198 Inf 0.0644 0.0722 0.5 -83.865 <.0001
## 5 0.0572 0.00224 Inf 0.0530 0.0617 0.5 -67.429 <.0001
## 6 0.0558 0.00284 Inf 0.0504 0.0616 0.5 -52.530 <.0001
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## Tests are performed on the logit scale
We can see that agecat == 1 has the highest claim probability, 0.0864 (8.64%), and agecat == 6 has the lowest, 0.0558 (5.58%)
| Characteristic | N | Overall, N = 67,856¹ | 0, N = 63,232¹ | 1, N = 4,624¹ | p-value² |
|---|---|---|---|---|---|
| veh_age | 67,856 | | | | <0.001 |
| 1 | | 12,257 (18%) | 11,432 (18%) | 825 (18%) | |
| 2 | | 16,587 (24%) | 15,328 (24%) | 1,259 (27%) | |
| 3 | | 20,064 (30%) | 18,702 (30%) | 1,362 (29%) | |
| 4 | | 18,948 (28%) | 17,770 (28%) | 1,178 (25%) | |
| ¹ n (%) | | | | | |
| ² Pearson’s Chi-squared test | | | | | |
Insights
A. Vehicle age 3 has the highest number of claims: 1,362 (29%); note that it also has the highest total count of observations among the vehicle age categories
B. Vehicle age 1 has the lowest number of claims: 825 (18%)
C. The p-value of the Pearson chi-squared test is < 0.001 ==> less than 0.05, so there is a statistically significant association between vehicle age and having a claim
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
## The effect size (Cramér's V) is 0.0197729280570061 indicating tiny effect size
## veh_age prob SE df asymp.LCL asymp.UCL null z.ratio p.value
## 1 0.0673 0.00226 Inf 0.0630 0.0719 0.5 -72.921 <.0001
## 2 0.0759 0.00206 Inf 0.0720 0.0800 0.5 -85.251 <.0001
## 3 0.0679 0.00178 Inf 0.0645 0.0714 0.5 -93.341 <.0001
## 4 0.0622 0.00175 Inf 0.0588 0.0657 0.5 -90.198 <.0001
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## Tests are performed on the logit scale
We can see that veh_age == 2 has the highest claim probability, 0.0759 (7.59%), and veh_age == 4 has the lowest, 0.0622 (6.22%)
| Characteristic | N | Overall, N = 67,856¹ | 0, N = 63,232¹ | 1, N = 4,624¹ | p-value |
|---|---|---|---|---|---|
| veh_body | 67,856 | | | | |
| BUS | | 48 (<0.1%) | 39 (<0.1%) | 9 (0.2%) | |
| CONVT | | 81 (0.1%) | 78 (0.1%) | 3 (<0.1%) | |
| COUPE | | 780 (1.1%) | 712 (1.1%) | 68 (1.5%) | |
| HBACK | | 18,915 (28%) | 17,651 (28%) | 1,264 (27%) | |
| HDTOP | | 1,579 (2.3%) | 1,449 (2.3%) | 130 (2.8%) | |
| MCARA | | 127 (0.2%) | 113 (0.2%) | 14 (0.3%) | |
| MIBUS | | 717 (1.1%) | 674 (1.1%) | 43 (0.9%) | |
| PANVN | | 752 (1.1%) | 690 (1.1%) | 62 (1.3%) | |
| RDSTR | | 27 (<0.1%) | 25 (<0.1%) | 2 (<0.1%) | |
| SEDAN | | 22,233 (33%) | 20,757 (33%) | 1,476 (32%) | |
| STNWG | | 16,261 (24%) | 15,088 (24%) | 1,173 (25%) | |
| TRUCK | | 1,750 (2.6%) | 1,630 (2.6%) | 120 (2.6%) | |
| UTE | | 4,586 (6.8%) | 4,326 (6.8%) | 260 (5.6%) | |
| ¹ n (%) | | | | | |
Insights
A. SEDAN has the highest number of claims: 1,476 (32%); note that it also has the highest total count of veh_body observations
B. CONVT and RDSTR have the lowest number of claims, 3 and 2, each less than 0.1% of all risks with claims
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
## The effect size (Cramér's V) is 0.0252738312900611 indicating tiny effect size
## veh_body prob SE df asymp.LCL asymp.UCL null z.ratio p.value
## BUS 0.1875 0.05634 Inf 0.1005 0.3227 0.5 -3.965 0.0001
## CONVT 0.0370 0.02097 Inf 0.0120 0.1086 0.5 -5.540 <.0001
## COUPE 0.0872 0.01010 Inf 0.0693 0.1091 0.5 -18.503 <.0001
## HBACK 0.0668 0.00182 Inf 0.0634 0.0705 0.5 -90.549 <.0001
## HDTOP 0.0823 0.00692 Inf 0.0697 0.0969 0.5 -26.335 <.0001
## MCARA 0.1102 0.02779 Inf 0.0664 0.1776 0.5 -7.371 <.0001
## MIBUS 0.0600 0.00887 Inf 0.0448 0.0799 0.5 -17.497 <.0001
## PANVN 0.0824 0.01003 Inf 0.0648 0.1044 0.5 -18.174 <.0001
## RDSTR 0.0741 0.05040 Inf 0.0186 0.2525 0.5 -3.437 0.0006
## SEDAN 0.0664 0.00167 Inf 0.0632 0.0697 0.5 -98.133 <.0001
## STNWG 0.0721 0.00203 Inf 0.0683 0.0762 0.5 -84.269 <.0001
## TRUCK 0.0686 0.00604 Inf 0.0576 0.0814 0.5 -27.581 <.0001
## UTE 0.0567 0.00341 Inf 0.0504 0.0638 0.5 -44.034 <.0001
##
## Confidence level used: 0.95
## Intervals are back-transformed from the logit scale
## Tests are performed on the logit scale
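We can see that veh_body == BUS has the highest claim probability, 0.1875 (18.75%), and veh_body == CONVT has the lowest, 0.0370 (3.70%). Both estimates rest on few observations (48 and 81 risks), which shows in their wide confidence intervals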
————————————————————————
The positive claims data is the subset of the data with claims > 0
## veh_value exposure numclaims claimcst0 veh_body
## Min. : 0.000 Min. :0.002738 0: 0 Min. : 200.0 SEDAN :1476
## 1st Qu.: 1.100 1st Qu.:0.410678 1:4333 1st Qu.: 353.8 HBACK :1264
## Median : 1.570 Median :0.637919 2: 271 Median : 761.6 STNWG :1173
## Mean : 1.859 Mean :0.611271 3: 18 Mean : 2014.4 UTE : 260
## 3rd Qu.: 2.310 3rd Qu.:0.832307 4: 2 3rd Qu.: 2091.4 HDTOP : 130
## Max. :13.900 Max. :0.999316 Max. :55922.1 TRUCK : 120
## (Other): 201
## veh_age gender area agecat
## 1: 825 F:2648 A:1085 1: 496
## 2:1259 M:1976 B: 965 2: 932
## 3:1362 C:1412 3:1113
## 4:1178 D: 496 4:1104
## E: 386 5: 614
## F: 280 6: 365
##
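The subset summarised above can be obtained with a simple logical filter. The sketch below uses a made-up six-row `data` frame (the claim costs are hypothetical) and also shows why the summary still lists `numclaims` level 0 with a count of 0: subsetting keeps unused factor levels unless `droplevels()` is called.

```r
# Hypothetical toy data: three policies without and three with a claim
data <- data.frame(
  claimcst0 = c(0, 0, 250, 0, 1200, 400),
  numclaims = factor(c(0, 0, 1, 0, 2, 1), levels = 0:4)
)

# Keep only the rows with a positive claim cost
df <- data[data$claimcst0 > 0, ]

# Level "0" is still present, with a count of 0
print(table(df$numclaims))

# Optional: drop factor levels that no longer occur
df_dropped <- droplevels(df)
print(table(df_dropped$numclaims))
```

Whether to drop the empty levels is a design choice; keeping them makes the frequency tables of the subset line up with those of the full data.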
We will create a new data frame that contains only the needed numerical variables: veh_value, exposure and claimcst0
| Statistic | claimcst0 | exposure | veh_value |
|---|---|---|---|
| Mean | 2.014404e+03 | 0.6112714 | 1.8591955 |
| Std.Dev | 3.548907e+03 | 0.2616474 | 1.1595951 |
| Min | 2.000000e+02 | 0.0027379 | 0.0000000 |
| Q1 | 3.537700e+02 | 0.4106776 | 1.1000000 |
| Median | 7.615650e+02 | 0.6379192 | 1.5700000 |
| Q3 | 2.091900e+03 | 0.8323066 | 2.3100000 |
| Max | 5.592213e+04 | 0.9993155 | 13.9000000 |
| MAD | 8.325763e+02 | 0.3125536 | 0.8302560 |
| IQR | 1.737655e+03 | 0.4216290 | 1.2100000 |
| CV | 1.761765e+00 | 0.4280381 | 0.6237080 |
| Skewness | 5.038509e+00 | -0.3468537 | 1.8535754 |
| SE.Skewness | 3.601020e-02 | 0.0360102 | 0.0360102 |
| Kurtosis | 4.019591e+01 | -0.8719807 | 6.9007387 |
| N.Valid | 4.624000e+03 | 4624.0000000 | 4624.0000000 |
| Pct.Valid | 1.000000e+02 | 100.0000000 | 100.0000000 |
Observations:
A. for claims: the mean is 2014, the median is 761.6, and the interquartile range (the spread of the middle 50% of the data) is 1737.7 ==> mean > median, so the positive claims are still right skewed (skewness 5.04)
-noting that for the whole data set the mean was 137, the median was 0, and the interquartile range was 0 as well-
B. for exposure: the mean is 0.611, the median is 0.638, and the interquartile range is 0.422 ==> mean < median, so the data is slightly left skewed (skewness -0.35)
-noting that for the whole data set the mean was 0.469, the median was 0.446, and the interquartile range was 0.490-
C. for veh_value: the mean is 1.859, the median is 1.570, and the interquartile range is 1.210 ==> mean > median, so the data is right skewed (skewness 1.85)
-noting that for the whole data set the mean was 1.777, the median was 1.500, and the interquartile range was 1.140-
pairs plot
As the data is not normally distributed, we will compute the correlations using the ‘spearman’ method
| | veh_value | exposure | claimcst0 |
|---|---|---|---|
| veh_value | 1.0000000 | 0.0624592 | -0.0219103 |
| exposure | 0.0624592 | 1.0000000 | -0.0418187 |
| claimcst0 | -0.0219103 | -0.0418187 | 1.0000000 |
We will create a new data frame that contains only the needed categorical variables: numclaims, veh_body, veh_age, gender, area and agecat
| numclaims | veh_body | veh_age | gender | area | agecat |
|---|---|---|---|---|---|
| 1 | SEDAN | 3 | M | B | 6 |
| 1 | SEDAN | 3 | F | F | 4 |
| 1 | HBACK | 3 | M | C | 4 |
| 2 | STNWG | 3 | M | F | 2 |
| 1 | STNWG | 2 | M | F | 3 |
| 1 | HBACK | 3 | F | A | 4 |
## Frequencies
## Categorical_variables$numclaims
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 0 0 0.000 0.000 0.000 0.000
## 1 4333 93.707 93.707 93.707 93.707
## 2 271 5.861 99.567 5.861 99.567
## 3 18 0.389 99.957 0.389 99.957
## 4 2 0.043 100.000 0.043 100.000
## <NA> 0 0.000 100.000
## Total 4624 100.000 100.000 100.000 100.000
##
## Categorical_variables$veh_body
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## BUS 9 0.195 0.195 0.195 0.195
## CONVT 3 0.065 0.260 0.065 0.260
## COUPE 68 1.471 1.730 1.471 1.730
## HBACK 1264 27.336 29.066 27.336 29.066
## HDTOP 130 2.811 31.877 2.811 31.877
## MCARA 14 0.303 32.180 0.303 32.180
## MIBUS 43 0.930 33.110 0.930 33.110
## PANVN 62 1.341 34.451 1.341 34.451
## RDSTR 2 0.043 34.494 0.043 34.494
## SEDAN 1476 31.920 66.414 31.920 66.414
## STNWG 1173 25.368 91.782 25.368 91.782
## TRUCK 120 2.595 94.377 2.595 94.377
## UTE 260 5.623 100.000 5.623 100.000
## <NA> 0 0.000 100.000
## Total 4624 100.000 100.000 100.000 100.000
##
## Categorical_variables$veh_age
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 825 17.84 17.84 17.84 17.84
## 2 1259 27.23 45.07 27.23 45.07
## 3 1362 29.46 74.52 29.46 74.52
## 4 1178 25.48 100.00 25.48 100.00
## <NA> 0 0.00 100.00
## Total 4624 100.00 100.00 100.00 100.00
##
## Categorical_variables$gender
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## F 2648 57.27 57.27 57.27 57.27
## M 1976 42.73 100.00 42.73 100.00
## <NA> 0 0.00 100.00
## Total 4624 100.00 100.00 100.00 100.00
##
## Categorical_variables$area
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## A 1085 23.46 23.46 23.46 23.46
## B 965 20.87 44.33 20.87 44.33
## C 1412 30.54 74.87 30.54 74.87
## D 496 10.73 85.60 10.73 85.60
## E 386 8.35 93.94 8.35 93.94
## F 280 6.06 100.00 6.06 100.00
## <NA> 0 0.00 100.00
## Total 4624 100.00 100.00 100.00 100.00
##
## Categorical_variables$agecat
## Type: Factor
##
## Freq % Valid % Valid Cum. % Total % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
## 1 496 10.73 10.73 10.73 10.73
## 2 932 20.16 30.88 20.16 30.88
## 3 1113 24.07 54.95 24.07 54.95
## 4 1104 23.88 78.83 23.88 78.83
## 5 614 13.28 92.11 13.28 92.11
## 6 365 7.89 100.00 7.89 100.00
## <NA> 0 0.00 100.00
## Total 4624 100.00 100.00 100.00 100.00
Observations:
A. for numclaims: ==>
A.1 The number of risks with one claim is 4333 (93.707%) out of the total of 4624 risks with positive claims
A.2 The number of risks with two claims is 271 (5.861%) out of the total of 4624 risks with positive claims
A.3 The number of risks with three claims is 18 (0.389%) out of the total of 4624 risks with positive claims
A.4 The number of risks with four claims is 2 (0.043%) out of the total of 4624 risks with positive claims
B. for veh_body: ==>
B.1 SEDAN, HBACK and STNWG have the highest claim frequencies, with 1476 (31.92%), 1264 (27.34%) and 1173 (25.37%) respectively, out of the total of 4624 risks with positive claims
B.2 RDSTR, CONVT and BUS have the lowest claim frequencies, with 2 (0.043%), 3 (0.065%) and 9 (0.195%) respectively, out of the total of 4624 risks with positive claims
C. for veh_age: ==>
C.1 The category with the highest claim frequency is veh_age == 3 with 1362 (29.46%) out of the total of 4624 risks with positive claims
C.2 The category with the lowest claim frequency is veh_age == 1 with 825 (17.84%) out of the total of 4624 risks with positive claims
D. for gender: ==>
D.1 Females account for 2648 (57.27%) and males for 1976 (42.73%) of the total of 4624 risks with positive claims
F. for area: ==>
F.1 The category with the highest claim frequency is area C with 1412 (30.54%) out of the total of 4624 risks with positive claims
F.2 The category with the lowest claim frequency is area F with 280 (6.06%) out of the total of 4624 risks with positive claims
G. for agecat: ==>
G.1 The categories with the highest claim frequency are agecat == 3 and agecat == 4 with 1113 (24.07%) and 1104 (23.88%) respectively, out of the total of 4624 risks with positive claims
G.2 The category with the lowest claim frequency is agecat == 6 with 365 (7.89%) out of the total of 4624 risks with positive claims
————————————————————————
##
## Anderson-Darling normality test
##
## data: df$claimcst0
## A = 654.01, p-value < 2.2e-16
Based on the above results:
==> The Anderson-Darling test rejects normality of claimcst0 (p-value < 2.2e-16), so we will use non-parametric statistical tests
Mann-Whitney U test, for features that have 2 categories: equivalent to wilcox.test(), the Wilcoxon rank sum test with continuity correction. It is preferred here over the Wilcoxon signed rank exact test, which is used when the observations are paired
##
## Wilcoxon rank sum test with continuity correction
##
## data: df$claimcst0 by df$gender
## W = 2520284, p-value = 0.03234
## alternative hypothesis: true location shift is not equal to 0
After applying the test we found that:
The p-value (0.03234) is less than 0.05 –> there is a statistically significant difference in the distribution of claim cost between genders
A.3 wilcox_effsize:
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
## # A tibble: 1 × 7
## .y. group1 group2 effsize n1 n2 magnitude
## * <chr> <chr> <chr> <dbl> <int> <int> <ord>
## 1 claimcst0 F M 0.0315 2648 1976 small
wilcox_effsize gave an effect size of 0.0315, which indicates a small effect
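The Wilcoxon effect size r is commonly defined as |Z| / sqrt(N), where Z is the standardised test statistic. As a rough cross-check, Z can be back-calculated from the two-sided p-value reported above. This is an approximation made for illustration: the effect size function works from the test statistic itself, not from the rounded p-value.

```r
p_value <- 0.03234        # two-sided p-value reported by the rank sum test above
n_total <- 2648 + 1976    # group sizes n1 (F) and n2 (M) from the effect size table

# |Z| that corresponds to this two-sided p-value under the normal approximation
z <- qnorm(p_value / 2, lower.tail = FALSE)

# Effect size r = |Z| / sqrt(N)
r <- z / sqrt(n_total)
cat("Z =", round(z, 3), " r =", round(r, 4), "\n")
```

The result agrees with the reported effect size to about three decimal places, which supports reading it as r = |Z| / sqrt(N).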
A.4 Assumption Checks:
Check the homogeneity of variances using Levene’s test.
library(car)  # provides leveneTest()
levene_test <- leveneTest(claimcst0 ~ gender, data = df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 12.543 0.0004016 ***
## 4622
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A.5 Post-hoc Power Analysis:
Conduct a post-hoc power analysis based on the effect size.
# Post-hoc power analysis
# t-test power calculation for two samples with unequal n
library(pwr)
effect_size_value <- effect_size$effsize
# sum() counts the rows in each group; length() returns all 4624 rows for both
# groups, which is what the printed output below reflects (n1 = n2 = 4624)
power_analysis <- pwr.t2n.test(n1 = sum(df$gender == "M"), n2 = sum(df$gender == "F"),
d = effect_size_value,
sig.level = 0.05, power = NULL)
print(power_analysis)
##
## t test power calculation
##
## n1 = 4624
## n2 = 4624
## d = 0.03147296
## sig.level = 0.05
## power = 0.3277674
## alternative = two.sided
##
## Kruskal-Wallis rank sum test
##
## data: df$claimcst0 and df$area
## Kruskal-Wallis chi-squared = 26.616, df = 5, p-value = 6.776e-05
After applying the test we found that:
The p-value (6.776e-05) is less than 0.05 –> there is a statistically significant difference in the distribution of claim cost between areas
B.1.3 kruskal_effsize:
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
| .y. | n | effsize | method | magnitude |
|---|---|---|---|---|
| claimcst0 | 4624 | 0.0046807 | eta2[H] | small |
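The eta2[H] value can be recovered from the Kruskal-Wallis statistic with the formula eta2[H] = (H - k + 1) / (n - k), where H is the test statistic, k the number of groups and n the number of observations. The helper `kruskal_eta2` below is a hypothetical name used for this check.

```r
# eta-squared based on the Kruskal-Wallis H statistic
kruskal_eta2 <- function(H, k, n) {
  (H - k + 1) / (n - k)
}

# Values for area, taken from the test output above
eta2_area <- kruskal_eta2(H = 26.616, k = 6, n = 4624)
cat("eta2[H] for area:", round(eta2_area, 7), "\n")

# Two ways to turn eta-squared into Cohen's f
f_simple <- sqrt(eta2_area)                    # plain square root
f_cohen  <- sqrt(eta2_area / (1 - eta2_area))  # textbook conversion
cat("f (sqrt):", round(f_simple, 5), " f (Cohen):", round(f_cohen, 5), "\n")
```

For an effect this small the two conversions are almost identical, so the choice between them has little influence on a power calculation.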
## Comparison Z P.unadj P.adj
## 1 A - B 0.3929272 6.943733e-01 1.0000000000
## 2 A - C -0.1226836 9.023577e-01 0.9023576683
## 3 B - C -0.5348567 5.927489e-01 1.0000000000
## 4 A - D -1.0455938 2.957486e-01 1.0000000000
## 5 B - D -1.3404712 1.800922e-01 1.0000000000
## 6 C - D -0.9908884 3.217401e-01 1.0000000000
## 7 A - E -2.8037166 5.051729e-03 0.0505172881
## 8 B - E -3.0477582 2.305554e-03 0.0276666430
## 9 C - E -2.8067635 5.004197e-03 0.0550461713
## 10 D - E -1.6131418 1.067137e-01 0.8537095799
## 11 A - F -3.9167774 8.974053e-05 0.0011666269
## 12 B - F -4.1238931 3.725219e-05 0.0005587828
## 13 C - F -3.9375509 8.231746e-05 0.0011524445
## 14 D - F -2.7541273 5.884887e-03 0.0529639842
## 15 E - F -1.2278040 2.195206e-01 1.0000000000
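From the above table we can see that, after adjustment for multiple comparisons, the significant pairwise differences (P.adj < 0.05) are B - E, A - F, B - F and C - F; the pairs A - E, C - E and D - F fall just above the 0.05 threshold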
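The P.adj column in the table above appears consistent with a Holm-style step-down adjustment in which the smallest of the m p-values is multiplied by m, the next by m - 1, and so on, capped at 1. The sketch below reproduces that rule from the unadjusted p-values; it is an assumption about how the column was computed, inferred from the numbers, and not a statement about the package internals.

```r
# Unadjusted p-values copied from the pairwise comparison table for area
p_unadj <- c(
  "A - B" = 6.943733e-01, "A - C" = 9.023577e-01, "B - C" = 5.927489e-01,
  "A - D" = 2.957486e-01, "B - D" = 1.800922e-01, "C - D" = 3.217401e-01,
  "A - E" = 5.051729e-03, "B - E" = 2.305554e-03, "C - E" = 5.004197e-03,
  "D - E" = 1.067137e-01, "A - F" = 8.974053e-05, "B - F" = 3.725219e-05,
  "C - F" = 8.231746e-05, "D - F" = 5.884887e-03, "E - F" = 2.195206e-01
)

m <- length(p_unadj)
step_rank <- rank(p_unadj, ties.method = "first")  # 1 = smallest p-value

# Step-down multipliers m, m - 1, ..., 1, capped at 1
p_adj <- pmin(1, p_unadj * (m - step_rank + 1))

significant <- names(p_adj)[p_adj < 0.05]
print(round(p_adj, 6))
print(significant)

# For comparison: base R's Holm adjustment also enforces monotonicity
p_holm <- p.adjust(p_unadj, method = "holm")
```

Base R's `p.adjust(method = "holm")` additionally forces the adjusted values to be non-decreasing in the order of the raw p-values, so a few entries (for example A - E) come out slightly larger than with the plain multipliers.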
B.1.5 Assumption Checks:
Check the homogeneity of variances using Levene’s test.
levene_test <- leveneTest(claimcst0 ~area,data=df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 3.9598 0.001384 **
## 4618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
B.1.6 Post-hoc Power Analysis:
Conduct a post-hoc power analysis based on the effect size.
effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$area)),
n = nrow(df) / length(unique(df$area)),
f = effect_size,
sig.level = 0.05,
power = NULL)
print(power_analysis)
##
## Balanced one-way analysis of variance power calculation
##
## k = 6
## n = 770.6667
## f = 0.06841581
## sig.level = 0.05
## power = 0.966717
##
## NOTE: n is number in each group
##
## Kruskal-Wallis rank sum test
##
## data: df$claimcst0 and df$veh_body
## Kruskal-Wallis chi-squared = 18.358, df = 12, p-value = 0.1052
After applying the test we found that:
The p-value (0.1052) is more than 0.05 –> there is no statistically significant difference in the distribution of claim cost between veh_body categories
##
## Kruskal-Wallis rank sum test
##
## data: df$claimcst0 and df$agecat
## Kruskal-Wallis chi-squared = 11.099, df = 5, p-value = 0.04946
After applying the test we found that:
The p-value (0.04946) is just below 0.05 –> there is a statistically significant, though borderline, difference in the distribution of claim cost between agecat categories
B.3.3 kruskal_effsize:
Now we want to assess and report the effect size, i.e. the magnitude of the differences between these groups
| .y. | n | effsize | method | magnitude |
|---|---|---|---|---|
| claimcst0 | 4624 | 0.0013206 | eta2[H] | small |
## Comparison Z P.unadj P.adj
## 1 1 - 2 2.25523606 0.024118516 0.28942219
## 2 1 - 3 2.50162552 0.012362461 0.16071199
## 3 2 - 3 0.21869627 0.826886659 1.00000000
## 4 1 - 4 2.74351582 0.006078512 0.08509917
## 5 2 - 4 0.51604229 0.605824873 1.00000000
## 6 3 - 4 0.31181652 0.755179965 1.00000000
## 7 1 - 5 3.16592394 0.001545912 0.02318868
## 8 2 - 5 1.26571867 0.205613823 1.00000000
## 9 3 - 5 1.11552540 0.264625342 1.00000000
## 10 4 - 5 0.85082071 0.394868957 1.00000000
## 11 1 - 6 1.94645859 0.051599677 0.56759645
## 12 2 - 6 0.14394179 0.885546439 1.00000000
## 13 3 - 6 -0.01363333 0.989122512 0.98912251
## 14 4 - 6 -0.23298594 0.815772326 1.00000000
## 15 5 - 6 -0.86090528 0.389290214 1.00000000
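From the above table we can see that:
After p-value adjustment, only age categories 1 and 5 differ significantly (P.adj = 0.023); every other pair has an adjusted p-value above 0.05.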
B.3.5 Assumption Checks:
Check the homogeneity of variances using Levene’s test.
levene_test <- leveneTest(claimcst0 ~ agecat, data = df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 5 4.5834 0.0003578 ***
## 4618
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value (0.0003578) is below 0.05, so the variance of claimcst0 is not homogeneous across the agecat groups.
B.3.6 Post-hoc Power Analysis:
Conduct a post-hoc power analysis based on the effect size.
effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$agecat)),
                                 n = nrow(df) / length(unique(df$agecat)),
                                 f = effect_size,
                                 sig.level = 0.05,
                                 power = NULL)
print(power_analysis)
##
## Balanced one-way analysis of variance power calculation
##
## k = 6
## n = 770.6667
## f = 0.03634062
## sig.level = 0.05
## power = 0.4397375
##
## NOTE: n is number in each group
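The power calculation above passes the square root of eta squared as `f`. Cohen's f is conventionally defined as sqrt(eta2 / (1 - eta2)), so the sketch below compares the two for the agecat effect size. It uses base R only, and the helper name `eta2_to_cohens_f` is hypothetical and not part of the original analysis.

```r
# Convert eta squared to Cohen's f: f = sqrt(eta2 / (1 - eta2))
eta2_to_cohens_f <- function(eta2) {
  stopifnot(all(eta2 >= 0), all(eta2 < 1))
  sqrt(eta2 / (1 - eta2))
}

eta2 <- 0.0013206                       # effsize reported for agecat above
f_conventional <- eta2_to_cohens_f(eta2)
f_shortcut <- sqrt(eta2)                # what sqrt(eff_size$effsize) computes

print(c(conventional = f_conventional, shortcut = f_shortcut))
```

The conventional value could be passed as `f` to `pwr.anova.test()` in place of `effect_size`. For effect sizes this small the two values differ only slightly.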
##
## Kruskal-Wallis rank sum test
##
## data: df$claimcst0 and df$numclaims
## Kruskal-Wallis chi-squared = 159.02, df = 3, p-value < 2.2e-16
After applying the test, we found that:
The p-value (< 2.2e-16) is less than 0.05, so there is a statistically significant difference in the distribution of claimcst0 across the numclaims groups.
B.4.3 kruskal_effsize:
Next, we assess and report the effect size, that is, the magnitude of the differences between these groups.
| .y. | n | effsize | method | magnitude |
|---|---|---|---|---|
| claimcst0 | 4624 | 0.0337713 | eta2[H] | small |
## Comparison Z P.unadj P.adj
## 1 1 - 2 -11.6335467 2.782897e-31 1.669738e-30
## 2 1 - 3 -4.7312845 2.231036e-06 1.115518e-05
## 3 2 - 3 -1.5983152 1.099729e-01 3.299186e-01
## 4 1 - 4 -1.7924333 7.306358e-02 2.922543e-01
## 5 2 - 4 -0.7598618 4.473372e-01 8.946744e-01
## 6 3 - 4 -0.2015760 8.402482e-01 8.402482e-01
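From the above table we can see that:
After p-value adjustment, only the pairs 1 - 2 and 1 - 3 differ significantly (P.adj < 0.05); no other pair of numclaims groups differs significantly.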
B.4.5 Assumption Checks:
Check the homogeneity of variances using Levene’s test.
levene_test <- leveneTest(claimcst0 ~ numclaims, data = df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.9987 0.3923
## 4620
The p-value (0.3923) is above 0.05, so there is no evidence that the variance of claimcst0 differs across the numclaims groups.
B.4.6 Post-hoc Power Analysis:
Conduct a post-hoc power analysis based on the effect size.
effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$numclaims)),
                                 n = nrow(df) / length(unique(df$numclaims)),
                                 f = effect_size,
                                 sig.level = 0.05,
                                 power = NULL)
print(power_analysis)
##
## Balanced one-way analysis of variance power calculation
##
## k = 4
## n = 1156
## f = 0.1837697
## sig.level = 0.05
## power = 1
##
## NOTE: n is number in each group
##
## Kruskal-Wallis rank sum test
##
## data: df$claimcst0 and df$veh_age
## Kruskal-Wallis chi-squared = 18.668, df = 3, p-value = 0.0003201
After applying the test, we found that:
The p-value (0.0003201) is less than 0.05, so there is a statistically significant difference in the distribution of claimcst0 across the veh_age groups.
B.5.3 kruskal_effsize:
Next, we assess and report the effect size, that is, the magnitude of the differences between these groups.
| .y. | n | effsize | method | magnitude |
|---|---|---|---|---|
| claimcst0 | 4624 | 0.0033914 | eta2[H] | small |
## Comparison Z P.unadj P.adj
## 1 1 - 2 -1.401384 1.610994e-01 0.3221987936
## 2 1 - 3 -2.860319 4.232150e-03 0.0169286002
## 3 2 - 3 -1.622098 1.047823e-01 0.3143468597
## 4 1 - 4 -3.988218 6.657135e-05 0.0003994281
## 5 2 - 4 -2.918060 3.522165e-03 0.0176108256
## 6 3 - 4 -1.379031 1.678852e-01 0.1678852290
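From the above table we can see that:
After p-value adjustment, the pairs 1 - 3, 1 - 4, and 2 - 4 differ significantly (P.adj = 0.017, 0.0004, and 0.018); the pairs 1 - 2, 2 - 3, and 3 - 4 do not.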
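The P.adj column in the veh_age table above is consistent with a Holm-style step-down adjustment in which the i-th smallest of m unadjusted p-values is multiplied by (m - i + 1) and capped at 1. The call that produced the table is not shown, so this is an inference from the printed values rather than a statement of the method used. The sketch below reproduces the column in base R; the helper name `holm_multipliers` is hypothetical.

```r
# Holm-style step-down multipliers without the monotonicity step:
# the i-th smallest p-value is multiplied by (m - i + 1) and capped at 1.
holm_multipliers <- function(p) {
  m <- length(p)
  rnk <- rank(p, ties.method = "first")
  pmin(1, p * (m - rnk + 1))
}

# P.unadj for the veh_age comparisons, in table order
p_unadj <- c(1.610994e-01, 4.232150e-03, 1.047823e-01,
             6.657135e-05, 3.522165e-03, 1.678852e-01)
p_adj <- holm_multipliers(p_unadj)
print(signif(p_adj, 7))
```

Base R's `p.adjust(p, method = "holm")` additionally enforces that adjusted values never decrease as the unadjusted p-values increase, so its output can differ from this table for some comparisons.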
B.5.5 Assumption Checks:
Check the homogeneity of variances using Levene’s test.
levene_test <- leveneTest(claimcst0 ~ veh_age, data = df)
print(levene_test)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.926 0.4272
## 4620
The p-value (0.4272) is above 0.05, so there is no evidence that the variance of claimcst0 differs across the veh_age groups.
B.5.6 Post-hoc Power Analysis:
Conduct a post-hoc power analysis based on the effect size.
effect_size <- sqrt(eff_size$effsize)
power_analysis <- pwr.anova.test(k = length(unique(df$veh_age)),
                                 n = nrow(df) / length(unique(df$veh_age)),
                                 f = effect_size,
                                 sig.level = 0.05,
                                 power = NULL)
print(power_analysis)
##
## Balanced one-way analysis of variance power calculation
##
## k = 4
## n = 1156
## f = 0.0582361
## sig.level = 0.05
## power = 0.9288195
##
## NOTE: n is number in each group
————————————————————————
| gender | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| F | 4908749 | 2832 | 17954.60 | 0.1577311 | 1733.315 | 273.3978 |
| M | 4405855 | 2105 | 13846.21 | 0.1520271 | 2093.043 | 318.1993 |
We can see that:
Females have more claims than males (2,832 vs. 2,105).
Females' total claims cost is higher than males', but their average severity is lower (1,733 vs. 2,093) because that cost is spread over a larger number of claims.
Females and males have almost the same claim frequency (0.158 vs. 0.152).
The female risk premium is lower than the male one (273 vs. 318) because females' exposure (17,955 vs. 13,846) is proportionally higher than their claims cost.
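The derived columns in the gender table follow from three ratios. The sketch below recomputes them in base R from the aggregated totals, assuming frequency = numclaims / exposure, average_severity = claimcst0 / numclaims, and risk_premium = claimcst0 / exposure. These definitions are inferred from the table values, and the data frame name `by_gender` is hypothetical.

```r
# Aggregated totals per gender, copied from the table above
by_gender <- data.frame(
  gender    = c("F", "M"),
  claimcst0 = c(4908749, 4405855),
  numclaims = c(2832, 2105),
  exposure  = c(17954.60, 13846.21)
)

# Derived measures (assumed definitions, consistent with the table values)
by_gender$frequency        <- by_gender$numclaims / by_gender$exposure
by_gender$average_severity <- by_gender$claimcst0 / by_gender$numclaims
by_gender$risk_premium     <- by_gender$claimcst0 / by_gender$exposure

print(by_gender)
```

Because risk_premium equals frequency times average_severity, and the two frequencies are nearly equal, the gap between the female and male risk premiums comes mostly from the difference in average severity.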
The chart below visualizes all of these measures together for a clearer understanding:
| veh_age | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| 3 | 2718237 | 1446 | 9542.111 | 0.1515388 | 1879.832 | 284.8675 |
| 2 | 2486217 | 1354 | 7923.677 | 0.1708803 | 1836.202 | 313.7706 |
| 4 | 2554895 | 1261 | 8996.079 | 0.1401722 | 2026.087 | 284.0010 |
| 1 | 1555255 | 876 | 5338.951 | 0.1640772 | 1775.405 | 291.3034 |
We can see that:
Vehicle age 1 has the lowest claims cost, while vehicle age 3 has the highest.
Vehicle age 1 has the lowest number of claims, while vehicle age 3 has the highest.
Vehicle age 1 has the lowest exposure, while vehicle age 3 has the highest.
Vehicle age 4 has the lowest frequency, while vehicle age 2 has the highest.
Vehicle age 1 has the lowest average severity, while vehicle age 4 has the highest.
Vehicle ages 3 and 4 have the lowest risk premiums, which are almost the same (about 285 and 284), while vehicle age 2 has the highest.
The chart below visualizes all of these measures together for a clearer understanding:
| veh_body | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| HBACK | 2589136.192 | 1330 | 8810.31348 | 0.1509594 | 1946.7189 | 293.8756 |
| UTE | 597208.965 | 276 | 2105.73032 | 0.1310709 | 2163.8006 | 283.6113 |
| STNWG | 2363091.211 | 1248 | 7638.39014 | 0.1633852 | 1893.5026 | 309.3703 |
| HDTOP | 294811.869 | 136 | 783.29911 | 0.1736246 | 2167.7343 | 376.3720 |
| PANVN | 133113.412 | 68 | 409.16085 | 0.1661938 | 1957.5502 | 325.3327 |
| SEDAN | 2681622.477 | 1598 | 10444.59959 | 0.1529977 | 1678.1117 | 256.7473 |
| TRUCK | 319496.849 | 130 | 843.96441 | 0.1540349 | 2457.6681 | 378.5667 |
| COUPE | 187723.251 | 75 | 319.12663 | 0.2350164 | 2502.9767 | 588.2406 |
| MIBUS | 116104.880 | 45 | 316.84052 | 0.1420273 | 2580.1084 | 366.4458 |
| MCARA | 10673.950 | 15 | 59.27995 | 0.2530367 | 711.5967 | 180.0601 |
| BUS | 13363.120 | 10 | 25.84805 | 0.3868764 | 1336.3120 | 516.9876 |
| CONVT | 6888.810 | 3 | 32.59685 | 0.0920334 | 2296.2700 | 211.3336 |
| RDSTR | 1369.458 | 3 | 11.66872 | 0.2570976 | 456.4861 | 117.3615 |
We can see that:
SEDAN, HBACK, and STNWG have the highest claims costs (about 2.68M, 2.59M, and 2.36M), while RDSTR has the lowest.
RDSTR, CONVT, BUS, MCARA, MIBUS, PANVN, and COUPE each have fewer than 100 claims; SEDAN, STNWG, and HBACK each have more than 1,000; the remaining groups (TRUCK, HDTOP, and UTE) have between 130 and 276.
BUS has the highest frequency at 38.7%, followed by RDSTR, MCARA, and COUPE (25.7%, 25.3%, and 23.5%); all other groups have frequencies below 18%.
RDSTR has the lowest average severity, while MIBUS, COUPE, and TRUCK have the highest.
The risk premium is below 200 for RDSTR and MCARA and above 500 for BUS and COUPE.
The chart below visualizes all of these measures together for a clearer understanding:
| area | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| C | 2865707.2 | 1493 | 9578.494 | 0.1558700 | 1919.429 | 299.1814 |
| A | 2071765.6 | 1181 | 7597.101 | 0.1554540 | 1754.247 | 272.7048 |
| E | 868822.9 | 413 | 2771.866 | 0.1489971 | 2103.687 | 313.4434 |
| D | 911058.2 | 524 | 3819.518 | 0.1371901 | 1738.661 | 238.5270 |
| B | 1795295.2 | 1021 | 6297.848 | 0.1621189 | 1758.369 | 285.0649 |
| F | 801955.4 | 305 | 1735.992 | 0.1756921 | 2629.362 | 461.9581 |
We can see that:
Area F has the lowest claims cost, while area C has the highest.
Area F has the lowest number of claims, while area C has the highest.
Area D has the lowest frequency, while area F has the highest.
Area D has the lowest average severity, while area F has the highest.
Area D has the lowest risk premium, while area F has the highest.
The chart below visualizes all of these measures together for a clearer understanding:
| agecat | claimcst0 | numclaims | exposure | frequency | average_severity | risk_premium |
|---|---|---|---|---|---|---|
| 2 | 1984840.8 | 1000 | 5891.871 | 0.1697254 | 1984.841 | 336.8778 |
| 4 | 2145303.0 | 1185 | 7616.542 | 0.1555824 | 1810.382 | 281.6636 |
| 6 | 683568.5 | 390 | 3099.666 | 0.1258200 | 1752.740 | 220.5297 |
| 3 | 2132107.1 | 1189 | 7409.457 | 0.1604706 | 1793.194 | 287.7549 |
| 5 | 1061412.2 | 648 | 5171.009 | 0.1253140 | 1637.982 | 205.2621 |
| 1 | 1307372.9 | 525 | 2612.274 | 0.2009743 | 2490.234 | 500.4732 |
We can see that:
Age category 6 has the lowest claims cost, while age category 4 has the highest.
Age category 6 has the lowest number of claims, while age categories 4 and 3 have the highest (1,185 and 1,189).
Age categories 6 and 5 have the lowest frequency (about 12.5% each), while age category 1 has the highest at about 20%.
Age category 5 has the lowest average severity, while age category 1 has the highest.
Age category 5 has the lowest risk premium, while age category 1 has the highest.
The chart below visualizes all of these measures together for a clearer understanding: