Asuransi Umum dan Kerugian - Project GLM Pricing

Group 6:

Ariel Haposan (22/503114/PA/21606)
Kanaya Azlia Sava (21/480251/PA/20855)
Priscilla Deviana Tachija (21/477681/PA/20684)
Victorius Chendryanto (22/496953/PA/21379)

The main goal of this project is to compute the relativities for the premium based on a set of rating factors. Statistical tests will be performed to measure the quality of the fit and determine whether the given rating factors are statistically significant to predict the frequency and/or severity component of the tariff model. All statistical tests should be performed at a confidence level of 95%.
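This follows the standard multiplicative tariff setup: the pure premium of a tariff cell is decomposed into a frequency and a severity component, each modelled with its own GLM, and the premium of cell \((i, j, \dots)\) is expressed relative to a base cell:

\[
E[\text{claim cost}] = E[N] \times E[X], \qquad \mu_{ij\dots} = \gamma_0 \,\gamma_{1i}\,\gamma_{2j}\cdots,
\]

where \(\gamma_0\) is the base-cell premium and the \(\gamma\)'s are the relativities to be estimated, with relativity 1 assigned to each base level.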

1. Briefly discuss the data found

Preprocessing data

Importing dataset

library(readxl)
## Warning: package 'readxl' was built under R version 4.3.2
G3_6 <- read_excel("G3-6.xlsx")
head(G3_6)

The imported dataset is a modified version of the original given dataset, where some variables in the original dataset were read as chr (character) instead of dbl (double). The data was modified by manually changing the values stored as text in Excel, although this could also be done using R.
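The same fix could also be scripted; a minimal sketch in R, assuming for illustration that veh_value and claimcst0 were the columns read as character:

# hypothetical illustration: coerce character columns back to numeric after import
G3_6$veh_value <- as.numeric(G3_6$veh_value)
G3_6$claimcst0 <- as.numeric(G3_6$claimcst0)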

To understand the dataset better, we need to define each variable. For reference, we found a similar dataset on the internet; before relying on it, we first analyze our own data by examining the characteristics of each variable.

Analyzing the characteristics of each variable

summary(G3_6)
##    veh_value          exposure             clm            numclaims      
##  Min.   :   0.00   Min.   :0.002738   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:   1.01   1st Qu.:0.219028   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :   1.50   Median :0.446270   Median :0.00000   Median :0.00000  
##  Mean   :  74.81   Mean   :0.468651   Mean   :0.06814   Mean   :0.07276  
##  3rd Qu.:   2.20   3rd Qu.:0.709103   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :7995.00   Max.   :0.999316   Max.   :1.00000   Max.   :4.00000  
##    claimcst0         veh_body            veh_age         gender         
##  Min.   :    0.0   Length:67856       Min.   :1.000   Length:67856      
##  1st Qu.:    0.0   Class :character   1st Qu.:2.000   Class :character  
##  Median :    0.0   Mode  :character   Median :3.000   Mode  :character  
##  Mean   :  137.3                      Mean   :2.674                     
##  3rd Qu.:    0.0                      3rd Qu.:4.000                     
##  Max.   :55922.1                      Max.   :4.000                     
##      area               agecat     
##  Length:67856       Min.   :1.000  
##  Class :character   1st Qu.:2.000  
##  Mode  :character   Median :3.000  
##                     Mean   :3.485  
##                     3rd Qu.:5.000  
##                     Max.   :6.000
cat("NAs data:", sum(is.na(G3_6)))
## NAs data: 0

As we can see, the variables veh_value and claimcst0 have significant gaps between the minimum, median, and maximum values. We may need to investigate whether there are anomalies in these values. However, for the variable claimcst0, we can likely rule out our suspicion, as it is defined as the claim severity, which is known to be highly variable, with many instances of 0 (as confirmed by the fact that the third quartile is 0). Therefore, we will focus on checking for anomalies in the variable veh_value.

We also acknowledge that the values of the variables clm, veh_body, veh_age, gender, area, and agecat are likely categorical, so we need to factor these variables. From a similar dataset we found online, veh_value is defined as the value of vehicles, which seems odd, as vehicle prices usually don’t exhibit such large variations. Considering this, we will further examine the distribution of veh_value and its correlation with other variables.

print(cor(G3_6[c(1:5,7,10)]))
##              veh_value   exposure          clm    numclaims     claimcst0
## veh_value  1.000000000 0.01839012  0.006417623  0.006179527  3.352714e-03
## exposure   0.018390122 1.00000000  0.132980367  0.134393258  3.652955e-02
## clm        0.006417623 0.13298037  1.000000000  0.967106972  4.805655e-01
## numclaims  0.006179527 0.13439326  0.967106972  1.000000000  4.817623e-01
## claimcst0  0.003352714 0.03652955  0.480565452  0.481762292  1.000000e+00
## veh_age   -0.216648123 0.03696233 -0.012128772 -0.011572424 -1.157106e-05
## agecat    -0.022104660 0.02503435 -0.030433931 -0.029571784 -2.743031e-02
##                 veh_age      agecat
## veh_value -2.166481e-01 -0.02210466
## exposure   3.696233e-02  0.02503435
## clm       -1.212877e-02 -0.03043393
## numclaims -1.157242e-02 -0.02957178
## claimcst0 -1.157106e-05 -0.02743031
## veh_age    1.000000e+00  0.01761360
## agecat     1.761360e-02  1.00000000
plot(G3_6[c(1:5,7,10)])

The plot of each variable strengthens our hypothesis that the variables clm, veh_body, veh_age, gender, area, and agecat are categorical variables. This is evident from the patterns that follow distinct lines, indicating a finite set of discrete values, consistent with the definitions found in a similar dataset.

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
# Create a basic ggplot2 boxplot
p <- ggplot(G3_6, aes(x = "", y = veh_value)) + 
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3) +
  labs(title = "Interactive Boxplot of Vehicle Value", y = "Vehicle Value") +
  theme_minimal()

# Convert ggplot2 plot to an interactive plotly plot
interactive_plot <- ggplotly(p)

# Show the interactive plot
interactive_plot

From the boxplot of the variable veh_value, it appears there are anomalies in the data, with some values significantly higher than expected. Upon further examination of the data samples, we noticed a substantial gap between certain values, indicating potential errors or outliers. To address this issue, we will apply scaling to the anomalous data.

We hypothesize that any value above 999 is incorrect, possibly due to a misplacement of decimal points or other data entry errors. Therefore, we will scale down these values by a factor of 1000 to bring them into a reasonable range. This adjustment will ensure the data is more consistent with expectations and facilitate more accurate analysis moving forward.

library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
G3_6 = G3_6 %>% mutate(veh_value = case_when(veh_value > 999 ~ veh_value / 1000,
                                             TRUE ~ veh_value)) # values above 999 are assumed to be data-entry errors and are scaled down; all other values are kept unchanged
G3_6 <- within(G3_6, {
    clm <- factor(clm)
    veh_body <- factor(veh_body)
    veh_age <- factor(veh_age)
    gender <- factor(gender)
    area <- factor(area)
    agecat <- factor(agecat)
})

summary(G3_6)
##    veh_value         exposure        clm         numclaims      
##  Min.   : 0.000   Min.   :0.002738   0:63232   Min.   :0.00000  
##  1st Qu.: 1.010   1st Qu.:0.219028   1: 4624   1st Qu.:0.00000  
##  Median : 1.500   Median :0.446270             Median :0.00000  
##  Mean   : 1.777   Mean   :0.468651             Mean   :0.07276  
##  3rd Qu.: 2.150   3rd Qu.:0.709103             3rd Qu.:0.00000  
##  Max.   :34.560   Max.   :0.999316             Max.   :4.00000  
##                                                                 
##    claimcst0          veh_body     veh_age   gender    area      agecat   
##  Min.   :    0.0   SEDAN  :22233   1:12257   F:38603   A:16312   1: 5742  
##  1st Qu.:    0.0   HBACK  :18915   2:16587   M:29253   B:13341   2:12875  
##  Median :    0.0   STNWG  :16261   3:20064             C:20540   3:15767  
##  Mean   :  137.3   UTE    : 4586   4:18948             D: 8173   4:16189  
##  3rd Qu.:    0.0   TRUCK  : 1750                       E: 5912   5:10736  
##  Max.   :55922.1   HDTOP  : 1579                       F: 3578   6: 6547  
##                    (Other): 2532

Definition of each variable

After examining the characteristics and cross-checking with the similar dataset, we are able to define each of the variables as follows:

  • veh_value: Vehicle value, ranges from 0 to 34.560. (This is likely times 1,000 in real-life data.)
  • exposure: Proportion of the year during which the policy was exposed. In practice, each policy is not exposed for the full year. Some policies come into force partway through the year, while others are canceled before the year’s end.
  • clm: Claim occurrence. 0 (no), 1 (yes).
  • numclaims: Number of claims.
  • claimcst0: Claim amount. 0 if no claim. Ranges up to 55,922.
  • veh_body: Vehicle body type. Can be one of bus, convertible, coupe, hatchback, hardtop, motorized caravan/combi, minibus, panel van, roadster, sedan, station wagon, truck, or utility.
  • veh_age: Vehicle age. 1 (new), 2, 3, or 4.
  • gender: Gender of the driver. M (Male) and F (Female).
  • area: Driver’s area of residence. Can be one of A, B, C, D, E, or F.
  • agecat: Driver’s age category. 1 (youngest), 2, 3, 4, 5, or 6.

For the purpose of creating tariff cells for premium calculation, all variables need to be in categorical form to simplify the analysis. Therefore, we need to convert the veh_value variable into a categorical variable. We have decided to divide veh_value into four categories based on its range. The four categories are chosen by taking the first quartile, median, and third quartile of the variable. The four categories are then defined as follows:

  • LOW: 0 to 1.01
  • MIDDLELOW: 1.01 to 1.5
  • MIDDLEHIGH: 1.5 to 2.15
  • HIGH: 2.15 to 100
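The three interior cut points are the quartiles reported by summary() above; a minimal sketch of deriving them programmatically (run while veh_value is still numeric):

# quartile-based cut points for veh_value
breaks <- unname(quantile(G3_6$veh_value, probs = c(0, 0.25, 0.5, 0.75, 1)))
breaks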

The following code was used to implement this categorization:

G3_6 = G3_6 %>% mutate(veh_value = cut(veh_value,
                                       breaks = c(0, 1.01, 1.5, 2.15, 100),
                                       labels = c("LOW", "MIDDLELOW", "MIDDLEHIGH", "HIGH"),
                                       include.lowest = TRUE))

We also need to define frequency and severity for further analysis:

G3_6$frequency = G3_6$numclaims / G3_6$exposure
G3_6$severity = G3_6$claimcst0 / G3_6$numclaims
G3_6$severity = G3_6$claimcst0 / G3_6$numclaims  # severity per claim; 0/0 gives NaN, fixed below
G3_6$severity[G3_6$numclaims == 0] = 0 # if the number of claims is 0, the severity should also be 0
summary(G3_6)
##       veh_value        exposure        clm         numclaims      
##  LOW       :17060   Min.   :0.002738   0:63232   Min.   :0.00000  
##  MIDDLELOW :17308   1st Qu.:0.219028   1: 4624   1st Qu.:0.00000  
##  MIDDLEHIGH:16550   Median :0.446270             Median :0.00000  
##  HIGH      :16938   Mean   :0.468651             Mean   :0.07276  
##                     3rd Qu.:0.709103             3rd Qu.:0.00000  
##                     Max.   :0.999316             Max.   :4.00000  
##                                                                   
##    claimcst0          veh_body     veh_age   gender    area      agecat   
##  Min.   :    0.0   SEDAN  :22233   1:12257   F:38603   A:16312   1: 5742  
##  1st Qu.:    0.0   HBACK  :18915   2:16587   M:29253   B:13341   2:12875  
##  Median :    0.0   STNWG  :16261   3:20064             C:20540   3:15767  
##  Mean   :  137.3   UTE    : 4586   4:18948             D: 8173   4:16189  
##  3rd Qu.:    0.0   TRUCK  : 1750                       E: 5912   5:10736  
##  Max.   :55922.1   HDTOP  : 1579                       F: 3578   6: 6547  
##                    (Other): 2532                                          
##    frequency           severity      
##  Min.   :  0.0000   Min.   :    0.0  
##  1st Qu.:  0.0000   1st Qu.:    0.0  
##  Median :  0.0000   Median :    0.0  
##  Mean   :  0.2142   Mean   :  130.6  
##  3rd Qu.:  0.0000   3rd Qu.:    0.0  
##  Max.   :365.2500   Max.   :55922.1  
## 

The function “levels” enumerates the different outcomes/categories of each predictor:

cat("Level of variable veh_value:", levels(G3_6$veh_value), "\n")
## Level of variable veh_value: LOW MIDDLELOW MIDDLEHIGH HIGH
cat("Level of variable veh_body:", levels(G3_6$veh_body), "\n")
## Level of variable veh_body: BUS CONVT COUPE HBACK HDTOP MCARA MIBUS PANVN RDSTR SEDAN STNWG TRUCK UTE
cat("Level of variable veh_age:", levels(G3_6$veh_age), "\n")
## Level of variable veh_age: 1 2 3 4
cat("Level of variable gender:", levels(G3_6$gender), "\n")
## Level of variable gender: F M
cat("Level of variable area:", levels(G3_6$area), "\n")
## Level of variable area: A B C D E F
cat("Level of variable agecat:", levels(G3_6$agecat))
## Level of variable agecat: 1 2 3 4 5 6

Observation: the variable area and agecat each have 6 categories, gender has 2 categories, veh_body has 13 categories, and both veh_age and veh_value have 4 categories.
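As a quick sanity check, the total number of tariff cells implied by these factors can be computed directly:

# number of distinct tariff cells = product of the numbers of levels
prod(sapply(G3_6[c("veh_value", "veh_body", "veh_age", "gender", "area", "agecat")], nlevels))
# 4 * 13 * 4 * 2 * 6 * 6 = 14976 cells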

To better understand the relationship between the predictors and claim frequency, we plot each rating factor against claim frequency:

par(mfrow = c(2, 3))
plot(G3_6$veh_value, G3_6$frequency, col="red",xlab="Vehicle Value",ylab="Claim Frequency")
plot(G3_6$veh_body, G3_6$frequency, col="red",xlab="Vehicle Body",ylab="Claim Frequency")
plot(G3_6$veh_age, G3_6$frequency, col="red",xlab="Vehicle Age",ylab="Claim Frequency")
plot(G3_6$gender, G3_6$frequency, col="red",xlab="Gender",ylab="Claim Frequency")
plot(G3_6$area, G3_6$frequency, col="red",xlab="Area",ylab="Claim Frequency")
plot(G3_6$agecat, G3_6$frequency, col="red",xlab="Driver's Age",ylab="Claim Frequency")

Observations:

  • Vehicle Value (LOW and MIDDLEHIGH)

    The claim frequency for both categories shows many outliers, particularly in the LOW category. The distribution is relatively compact, suggesting less variation within the core data but significant variation due to outliers.

  • Vehicle Body (BUS, HDTOP, SEDAN)

    The claim frequency varies notably across different vehicle body types. The HDTOP category appears to have a wider spread of claim frequencies, including more outliers compared to BUS and SEDAN, suggesting a higher variability in claims.

  • Vehicle Age (1, 2, 3, 4)

    There seems to be a consistent pattern of claim frequency across different vehicle ages, with outliers present in each category. The range of core claim frequencies does not change dramatically, indicating stable claim behavior across vehicle ages.

  • Gender (F, M)

    The claim frequencies between genders show considerable overlap, with both having outliers. However, there appears to be a slightly higher concentration of outliers in the male category.

  • Area (A, B, C, D, E, F)

    Claim frequencies across different areas show significant differences, with some areas like B and E having higher frequencies and more outliers. This indicates varying risk levels or claim propensities based on area.

  • Driver’s Age (1, 2, 3, 4, 5, 6)

    The claim frequency is similar across all age groups, with no significant differences. Outliers are present in each group, but the core data remains consistent, indicating stable claim behavior across different driver ages.

Summary: Numerous outliers are present across all variables, potentially affecting the perception of claim frequency variability. There is significant variability in claim frequencies among different categories within each variable, notably in vehicle body type and area. The prevalence of outliers complicates the identification of clear patterns in claim frequencies, indicating a need for further analysis. Utilizing median values or outlier-resistant measures could provide a clearer understanding of the central tendencies and dispersions within the data.
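As one such outlier-resistant summary, a sketch of the exposure-weighted claim frequency per level (shown here for area) is:

# exposure-weighted mean claim frequency by area, less sensitive to extreme per-policy rates
sapply(split(G3_6, G3_6$area), function(d) sum(d$numclaims) / sum(d$exposure))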

par(mfrow = c(2, 3))
plot(G3_6$veh_value, G3_6$severity, col="red",xlab="Vehicle Value",ylab="Claim Severity")
plot(G3_6$veh_body, G3_6$severity, col="red",xlab="Vehicle Body",ylab="Claim Severity")
plot(G3_6$veh_age, G3_6$severity, col="red",xlab="Vehicle Age",ylab="Claim Severity")
plot(G3_6$gender, G3_6$severity, col="red",xlab="Gender",ylab="Claim Severity")
plot(G3_6$area, G3_6$severity, col="red",xlab="Area",ylab="Claim Severity")
plot(G3_6$agecat, G3_6$severity, col="red",xlab="Driver's Age",ylab="Claim Severity")

Observations: Vehicle Value (LOW, MIDDLEHIGH): Claim severity is similar between low and middle-high vehicle values, with both having outliers, but no significant difference in the overall distribution.

Vehicle Body (BUS, HDTOP, SEDAN): The SEDAN category shows a wider spread of claim severities with more outliers compared to BUS and HDTOP. Overall, there are higher claim severities for sedans.

Vehicle Age (1, 2, 3, 4): Claim severity remains fairly consistent across vehicle age categories, with outliers present in all. There is no strong pattern suggesting that vehicle age affects claim severity.

Gender (F, M): Claim severity between males and females appears quite similar, with both categories containing a few large outliers. No major differences in the distributions.

Area (A, B, C, D, E, F): Areas exhibit similar distributions of claim severity, with outliers present in each. No specific area stands out as having significantly higher or lower severities.

Driver’s Age (1, 2, 3, 4, 5, 6): Claim severity is consistent across driver age groups, with outliers present in each. There is no significant variation in the core claim severities across different age groups.

Summary: The claim severity appears fairly consistent across most variables, with outliers present in all categories. There is no significant difference in severity between vehicle values (low and middle-high) or between genders. Vehicle body type shows some variation, with sedans exhibiting a wider spread of severities and more outliers compared to buses and hardtops. Similarly, claim severity remains stable across different vehicle ages and driver age groups, with no notable patterns or changes. The various areas also demonstrate similar severity distributions, with outliers in each region but no significant differences overall.

Finding a suitable distribution for claim frequency

library(fitdistrplus)
## Warning: package 'fitdistrplus' was built under R version 4.3.3
## Loading required package: MASS
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## The following object is masked from 'package:plotly':
## 
##     select
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.3.3
library(MASS)
descdist(G3_6$numclaims, discrete = TRUE)

## summary statistics
## ------
## min:  0   max:  4 
## median:  0 
## mean:  0.07275701 
## estimated sd:  0.2782038 
## estimated skewness:  4.067377 
## estimated kurtosis:  21.50576

Based on the Cullen and Frey graph, the candidate distributions are the normal, negative binomial (nbinom), and Poisson.

Fitting the frequency of claims.

freqnorm <- fitdist(G3_6$numclaims, "norm")
freqbinom <- fitdist(G3_6$numclaims, "nbinom")
freqpois <- fitdist(G3_6$numclaims, "pois")
freqfit <- list(freqnorm, freqbinom, freqpois)
summary(freqnorm)
## Fitting of the distribution ' norm ' by maximum likelihood 
## Parameters : 
##        estimate   Std. Error
## mean 0.07275701 0.0010679865
## sd   0.27820178 0.0007551366
## Loglikelihood:  -9467.943   AIC:  18939.89   BIC:  18958.14 
## Correlation matrix:
##               mean            sd
## mean  1.000000e+00 -5.867883e-12
## sd   -5.867883e-12  1.000000e+00
summary(freqbinom)
## Fitting of the distribution ' nbinom ' by maximum likelihood 
## Parameters : 
##        estimate  Std. Error
## size 1.15881641 0.143342061
## mu   0.07275935 0.001067318
## Loglikelihood:  -18049.68   AIC:  36103.36   BIC:  36121.61 
## Correlation matrix:
##               size            mu
## size  1.000000e+00 -1.230821e-05
## mu   -1.230821e-05  1.000000e+00
summary(freqpois)
## Fitting of the distribution ' pois ' by maximum likelihood 
## Parameters : 
##          estimate  Std. Error
## lambda 0.07275701 0.001035288
## Loglikelihood:  -18101.5   AIC:  36205   BIC:  36214.13
data.frame(Dist = c("Normal", "NB", "Poisson"),
           Loglik = c(freqnorm$loglik, freqbinom$loglik, freqpois$loglik),
           AIC = c(freqnorm$aic, freqbinom$aic, freqpois$aic),
           BIC = c(freqnorm$bic, freqbinom$bic, freqpois$bic))

From the fitting process, the normal distribution has the smallest AIC and BIC values, but it is not appropriate for modelling frequency because it is continuous. Among the discrete candidates, the negative binomial distribution is the best fit for claim frequency based on the AIC and BIC criteria, with parameters size = 1.15881641 and mu = 0.07275935.
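As an informal check of this choice, the observed distribution of claim counts can be compared with the fitted negative binomial probabilities, using the parameter estimates above:

# observed proportions vs fitted NB probabilities for counts 0..4
obs <- table(factor(G3_6$numclaims, levels = 0:4)) / nrow(G3_6)
fit <- dnbinom(0:4, size = 1.15881641, mu = 0.07275935)
round(cbind(observed = as.numeric(obs), fitted = fit), 5)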

Finding a suitable distribution for claim severity

library(fitdistrplus)
library(MASS)
descdist(G3_6$claimcst0, discrete = FALSE)

## summary statistics
## ------
## min:  0   max:  55922.13 
## median:  0 
## mean:  137.2702 
## estimated sd:  1056.298 
## estimated skewness:  17.5025 
## estimated kurtosis:  482.9257

Based on the Cullen and Frey graph, the possible fit distributions are Gamma, Beta, and Weibull. However, to fit a GLM model, the selected distribution must belong to the exponential family. Since Weibull is not part of the exponential family, it will be excluded as a potential fit.
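For reference, a distribution belongs to the exponential dispersion family used in GLMs when its density can be written in the form

\[
f(y;\theta,\phi) = \exp\!\left\{\frac{y\theta - b(\theta)}{\phi} + c(y,\phi)\right\},
\]

which holds for the Gamma distribution (and, on the frequency side, for the Poisson and for the negative binomial with fixed shape), but not for the Weibull when its shape parameter is unknown.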

Fitting the severity of claims

G3_6_sev <- G3_6[G3_6$claimcst0 > 0, ]
sevgamma <- fitdist(G3_6_sev$claimcst0, "gamma", "mme")
G3_6_beta <- G3_6_sev$claimcst0 / max(G3_6_sev$claimcst0)
sevbeta <- fitdist(G3_6_beta, "beta", "mme")

Compare AIC/BIC and summaries

data.frame(Dist = c("Gamma", "Beta"),
           Loglik = c(sevgamma$loglik, sevbeta$loglik),
           AIC = c(sevgamma$aic, sevbeta$aic),
           BIC = c(sevgamma$bic, sevbeta$bic))
summary(sevgamma)
## Fitting of the distribution ' gamma ' by matching moments 
## Parameters : 
##           estimate  Std. Error
## shape 0.3222537552 0.923148113
## rate  0.0001599747 0.000537987
## Loglikelihood:  -40576.4   AIC:  81156.79   BIC:  81169.67 
## Correlation matrix:
##           shape      rate
## shape 1.0000000 0.8518301
## rate  0.8518301 1.0000000
summary(sevbeta)
## Fitting of the distribution ' beta ' by matching moments 
## Parameters : 
##         estimate Std. Error
## shape1 0.2746241   1.929803
## shape2 7.3492500  71.819583
## Loglikelihood:  -Inf   AIC:  Inf   BIC:  Inf 
## Correlation matrix:
##          shape1   shape2
## shape1 1.000000 1.012924
## shape2 1.012924 1.000000

Visual comparison of the fits

par(mfrow = c(2, 2))  # Arrange plots in 2x2 grid
plot(sevgamma)

plot(sevbeta)

From the fitting process, based on the AIC and BIC criteria, the Gamma distribution is the best fit for claim severity. The Beta distribution fitting failed, as indicated by the -Inf log-likelihood and Inf AIC/BIC values, meaning the Beta distribution is not suitable for modelling claim severity.

Reordering for the base tariff cell: We typically choose the base tariff cell as the one with the largest exposure (e.g., longest duration). Therefore, we select the tariff cell (MIDDLELOW, HBACK, 2, F, A, 2) as the base tariff cell.

print(basecell<- G3_6[which.max(G3_6$exposure),c(1:12)])
## # A tibble: 1 × 12
##   veh_value exposure clm   numclaims claimcst0 veh_body veh_age gender area 
##   <fct>        <dbl> <fct>     <dbl>     <dbl> <fct>    <fct>   <fct>  <fct>
## 1 MIDDLELOW    0.999 0             0         0 HBACK    2       F      A    
## # ℹ 3 more variables: agecat <fct>, frequency <dbl>, severity <dbl>

We relevel each rating factor so that the base tariff cell becomes the reference level:

G3_6$veh_value<- relevel(G3_6$veh_value, as.character(basecell$veh_value))
G3_6$veh_body<- relevel(G3_6$veh_body, as.character(basecell$veh_body))
G3_6$veh_age<- relevel(G3_6$veh_age, as.character(basecell$veh_age))
G3_6$gender<- relevel(G3_6$gender, as.character(basecell$gender))
G3_6$area<- relevel(G3_6$area, as.character(basecell$area))
G3_6$agecat<- relevel(G3_6$agecat, as.character(basecell$agecat))

2. Model the frequency and severity separately

Data Preparation

Splitting the data: 75% for training and 25% for testing

train = G3_6[1:floor(0.75*nrow(G3_6)),]
test = G3_6[(floor(0.75*nrow(G3_6))+1):nrow(G3_6),]
train
test

Create a new training set without the base cell; it will be used when cross-validating the models.

train_wo_basecell = train[-which.max(G3_6$exposure),] # assumes the max-exposure (base cell) row of G3_6 falls within the training rows
train_wo_basecell

We will perform GLM modelling for frequency and severity using the candidate distributions obtained in part 1.

Frequency Model

Frequency model with Negative Binomial distribution

Note: we include log(exposure) as an offset so that the model targets claim counts while adjusting for each policy’s exposure.
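With exposure \(e_i\) entering as an offset, the log-link count model is

\[
\log E[N_i] = \log e_i + \mathbf{x}_i^{\top}\beta
\quad\Longleftrightarrow\quad
E[N_i] = e_i \exp(\mathbf{x}_i^{\top}\beta),
\]

so the coefficients act on the claim frequency (claims per unit of exposure) while the response stays a count.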

freq_nb = glm.nb(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)), 
             data = train[train$exposure>0,])
summary(freq_nb)
## 
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + veh_age + 
##     gender + area + agecat + offset(log(exposure)), data = train[train$exposure > 
##     0, ], init.theta = 2.724312268, link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.76565    0.06748 -26.164  < 2e-16 ***
## veh_valueLOW          -0.06731    0.05824  -1.156 0.247789    
## veh_valueMIDDLEHIGH    0.05421    0.05060   1.071 0.284072    
## veh_valueHIGH          0.16749    0.06325   2.648 0.008098 ** 
## veh_bodyBUS            0.74877    0.42770   1.751 0.080002 .  
## veh_bodyCONVT         -1.55225    1.00507  -1.544 0.122486    
## veh_bodyCOUPE          0.41529    0.14502   2.864 0.004188 ** 
## veh_bodyHDTOP          0.08466    0.11593   0.730 0.465212    
## veh_bodyMCARA          0.52128    0.30483   1.710 0.087259 .  
## veh_bodyMIBUS         -0.04588    0.18409  -0.249 0.803200    
## veh_bodyPANVN          0.02443    0.15363   0.159 0.873651    
## veh_bodyRDSTR        -16.32780 1879.40165  -0.009 0.993068    
## veh_bodySEDAN          0.02601    0.04684   0.555 0.578594    
## veh_bodySTNWG         -0.01212    0.06412  -0.189 0.850139    
## veh_bodyTRUCK          0.05657    0.11337   0.499 0.617779    
## veh_bodyUTE           -0.16794    0.08638  -1.944 0.051860 .  
## veh_age1              -0.04346    0.05260  -0.826 0.408706    
## veh_age3              -0.08878    0.04954  -1.792 0.073143 .  
## veh_age4              -0.10840    0.06670  -1.625 0.104110    
## genderM               -0.02158    0.03590  -0.601 0.547894    
## areaB                  0.05400    0.05118   1.055 0.291364    
## areaC                  0.03939    0.04638   0.849 0.395736    
## areaD                 -0.12636    0.06336  -1.994 0.046100 *  
## areaE                 -0.02791    0.06885  -0.405 0.685167    
## areaF                 -0.01541    0.08085  -0.191 0.848871    
## agecat1                0.19860    0.06441   3.083 0.002047 ** 
## agecat3               -0.05581    0.05166  -1.080 0.280073    
## agecat4               -0.05609    0.05141  -1.091 0.275304    
## agecat5               -0.28619    0.06065  -4.719 2.37e-06 ***
## agecat6               -0.26143    0.07202  -3.630 0.000283 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.7243) family taken to be 1)
## 
##     Null deviance: 17525  on 50891  degrees of freedom
## Residual deviance: 17385  on 50862  degrees of freedom
## AIC: 25499
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.724 
##           Std. Err.:  0.691 
## 
##  2 x log-likelihood:  -25437.443

In the frequency model with the Negative Binomial distribution, the rating factor levels with significant p-values are:

  • veh_value (HIGH)
  • veh_body (COUPE)
  • area (D)
  • agecat (1, 5, and 6)

On the other hand, vehicle age and gender do not have significant p-values, indicating that these two rating factors do not significantly affect the number of claims.

Deviance statistic

cbind(scaled_deviance = freq_nb$deviance, df = freq_nb$df.residual, p = 1-pchisq(freq_nb$deviance, freq_nb$df.residual))
##      scaled_deviance    df p
## [1,]        17384.99 50862 1
qchisq(0.95, freq_nb$df.residual)
## [1] 51387.75

The frequency model with the Negative Binomial distribution is adequate because the scaled deviance is smaller than the chi-squared critical value (17385 < 51388). Equivalently, the p-value gives no significant evidence against the model fit, so the model appears suitable.
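Formally, the adequacy test used here rejects the model at the 95% level when

\[
D^{*} = \frac{D}{\phi} > \chi^{2}_{0.95,\;n-p},
\]

where \(D\) is the residual deviance, \(\phi\) the dispersion (equal to 1 for the NB and Poisson fits, and estimated for the Gamma model later), and \(n-p\) the residual degrees of freedom.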

Frequency model with Poisson distribution

freq_pois = glm(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)), 
             data = train[train$exposure>0,], family=poisson("log"))
summary(freq_pois)
## 
## Call:
## glm(formula = numclaims ~ veh_value + veh_body + veh_age + gender + 
##     area + agecat + offset(log(exposure)), family = poisson("log"), 
##     data = train[train$exposure > 0, ])
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -1.76700    0.06626 -26.666  < 2e-16 ***
## veh_valueLOW         -0.06818    0.05735  -1.189 0.234480    
## veh_valueMIDDLEHIGH   0.05449    0.04972   1.096 0.273116    
## veh_valueHIGH         0.16690    0.06212   2.687 0.007216 ** 
## veh_bodyBUS           0.75288    0.41123   1.831 0.067134 .  
## veh_bodyCONVT        -1.55192    1.00160  -1.549 0.121277    
## veh_bodyCOUPE         0.41671    0.14175   2.940 0.003284 ** 
## veh_bodyHDTOP         0.08737    0.11360   0.769 0.441866    
## veh_bodyMCARA         0.52259    0.29580   1.767 0.077273 .  
## veh_bodyMIBUS        -0.04240    0.18104  -0.234 0.814812    
## veh_bodyPANVN         0.02892    0.15046   0.192 0.847592    
## veh_bodyRDSTR       -10.32610   93.56994  -0.110 0.912126    
## veh_bodySEDAN         0.02759    0.04606   0.599 0.549225    
## veh_bodySTNWG        -0.00987    0.06302  -0.157 0.875560    
## veh_bodyTRUCK         0.06048    0.11114   0.544 0.586335    
## veh_bodyUTE          -0.16457    0.08502  -1.936 0.052914 .  
## veh_age1             -0.04218    0.05161  -0.817 0.413733    
## veh_age3             -0.08943    0.04864  -1.839 0.065971 .  
## veh_age4             -0.10928    0.06556  -1.667 0.095537 .  
## genderM              -0.02126    0.03528  -0.603 0.546819    
## areaB                 0.05308    0.05026   1.056 0.290969    
## areaC                 0.03809    0.04556   0.836 0.403167    
## areaD                -0.12848    0.06237  -2.060 0.039394 *  
## areaE                -0.03029    0.06771  -0.447 0.654559    
## areaF                -0.01674    0.07934  -0.211 0.832877    
## agecat1               0.19678    0.06309   3.119 0.001816 ** 
## agecat3              -0.05641    0.05074  -1.112 0.266228    
## agecat4              -0.05634    0.05049  -1.116 0.264497    
## agecat5              -0.28613    0.05969  -4.794 1.64e-06 ***
## agecat6              -0.25989    0.07088  -3.667 0.000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 18715  on 50891  degrees of freedom
## Residual deviance: 18569  on 50862  degrees of freedom
## AIC: 25517
## 
## Number of Fisher Scoring iterations: 11

In the frequency model with the Poisson distribution, the rating factor levels with significant p-values are:

  • veh_value (HIGH)
  • veh_body (COUPE)
  • area (D)
  • agecat (1, 5, and 6)

On the other hand, vehicle age and gender do not have significant p-values, indicating that these two rating factors do not significantly affect the number of claims.

Deviance statistic

cbind(scaled_deviance = freq_pois$deviance, df = freq_pois$df.residual, p = 1-pchisq(freq_pois$deviance, freq_pois$df.residual))
##      scaled_deviance    df p
## [1,]         18569.3 50862 1
qchisq(0.95, freq_pois$df.residual)
## [1] 51387.75

The frequency model with the Poisson distribution is adequate because the scaled deviance is smaller than the chi-squared critical value (18569 < 51388). Equivalently, the p-value gives no significant evidence against the model fit, so this model also appears suitable.

Comparing the candidate distributions for the frequency model

Implementing a custom cross-validation to check model goodness of fit (NB distribution); the base cell is added back to every training fold so that the releveled factors keep their reference levels.

library(stringr)
## Warning: package 'stringr' was built under R version 4.3.3
num_folds = 3
folds = cut(seq(1,nrow(train)),breaks=num_folds,labels=FALSE)
metrics_nb = data.frame('fold' = c(), 'AIC' = c())
for(i in 1:num_folds){
    val_idx = which(folds==i,arr.ind=TRUE)
    tr = rbind(train_wo_basecell[-val_idx, ], basecell)
    val = train_wo_basecell[val_idx, ]
    model = glm.nb(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)), 
             data = tr[tr$exposure>0,])
    get_metric = data.frame('fold' = str_glue('fold{i}'), 'AIC' = model$aic)
    metrics_nb = rbind(metrics_nb,get_metric)
}
metrics_nb

Implementing the same cross-validation to check model goodness of fit (Poisson distribution)

metrics_pois = data.frame('fold' = c(), 'AIC' = c())
for(i in 1:num_folds){
    val_idx = which(folds==i,arr.ind=TRUE)
    tr = rbind(train_wo_basecell[-val_idx, ], basecell)
    val = train_wo_basecell[val_idx, ]
    model = glm(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)), 
             data = tr[tr$exposure>0,], family=poisson("log"))
    get_metric = data.frame('fold' = str_glue('fold{i}'), 'AIC' = model$aic)
    metrics_pois = rbind(metrics_pois,get_metric)
}
metrics_pois
aic_nbpoi =  data.frame(Distribution = c("NB", "Poisson"),
           AIC = c(mean(metrics_nb$AIC), mean(metrics_pois$AIC)))
aic_nbpoi

Although the glm() coefficient estimates differ only slightly between the NB and the Poisson fits, the comparison above shows that the Negative Binomial is the more suitable distribution for modelling frequency because of its smaller average AIC. A smaller AIC indicates that the model not only fits the data well but also avoids excessive complexity.
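For reference, the criterion being compared is

\[
\mathrm{AIC} = -2\ell(\hat\beta) + 2k,
\]

where \(\ell(\hat\beta)\) is the maximised log-likelihood and \(k\) the number of estimated parameters, so the \(2k\) term penalises over-parameterised models.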

Severity Model

Severity model with Gamma distribution

sev_gamma <- glm(severity ~ veh_value + veh_body + veh_age + gender + area + agecat, data = train[train$numclaims > 0, ], family = Gamma("log"), weights = numclaims)
summary(sev_gamma)
## 
## Call:
## glm(formula = severity ~ veh_value + veh_body + veh_age + gender + 
##     area + agecat, family = Gamma("log"), data = train[train$numclaims > 
##     0, ], weights = numclaims)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.514443   0.121827  61.681  < 2e-16 ***
## veh_valueLOW         0.085779   0.105519   0.813 0.416318    
## veh_valueMIDDLEHIGH  0.009238   0.091950   0.100 0.919979    
## veh_valueHIGH       -0.115149   0.115407  -0.998 0.318467    
## veh_bodyBUS         -0.276854   0.756794  -0.366 0.714519    
## veh_bodyCONVT       -1.030755   1.849179  -0.557 0.577283    
## veh_bodyCOUPE        0.374727   0.259369   1.445 0.148618    
## veh_bodyHDTOP        0.123624   0.208096   0.594 0.552504    
## veh_bodyMCARA       -1.022644   0.546427  -1.872 0.061361 .  
## veh_bodyMIBUS        0.272985   0.334817   0.815 0.414944    
## veh_bodyPANVN        0.145277   0.275929   0.527 0.598575    
## veh_bodySEDAN       -0.148193   0.085072  -1.742 0.081605 .  
## veh_bodySTNWG       -0.072168   0.117718  -0.613 0.539880    
## veh_bodyTRUCK        0.056175   0.203675   0.276 0.782713    
## veh_bodyUTE         -0.119374   0.158036  -0.755 0.450088    
## veh_age1            -0.043576   0.094762  -0.460 0.645654    
## veh_age3             0.059946   0.089599   0.669 0.503511    
## veh_age4             0.010553   0.118571   0.089 0.929085    
## genderM              0.178773   0.065060   2.748 0.006031 ** 
## areaB               -0.020476   0.092474  -0.221 0.824774    
## areaC                0.081724   0.083910   0.974 0.330157    
## areaD                0.044805   0.114927   0.390 0.696667    
## areaE                0.227978   0.124530   1.831 0.067234 .  
## areaF                0.499850   0.147097   3.398 0.000686 ***
## agecat1              0.068641   0.116321   0.590 0.555161    
## agecat3             -0.089848   0.093310  -0.963 0.335667    
## agecat4             -0.167505   0.093281  -1.796 0.072631 .  
## agecat5             -0.284988   0.109569  -2.601 0.009336 ** 
## agecat6             -0.055087   0.130009  -0.424 0.671800    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.375574)
## 
##     Null deviance: 5623.8  on 3378  degrees of freedom
## Residual deviance: 5427.7  on 3350  degrees of freedom
## AIC: 61175
## 
## Number of Fisher Scoring iterations: 8

In the severity model with the Gamma distribution, the vehicle value and vehicle age rating factors do not have significant p-values, indicating that these two rating factors do not significantly affect severity.

The vehicle body rating factor is at best marginally significant: the MCARA level has p ≈ 0.061 with a negative coefficient, weakly suggesting that MCARA vehicles have lower severity than the baseline.

The male gender level has a highly significant p-value with a positive coefficient, indicating that male drivers are associated with significantly higher severity.

For area, area F has a significant p-value and a positive coefficient, indicating that area F has higher severity than the baseline.

For the driver age categories, category 5 has a significant p-value with a negative coefficient, indicating that this age category has lower severity.

Deviance statistic

cbind(scaled_deviance = sev_gamma$deviance/summary(sev_gamma)$dispersion, df = sev_gamma$df.residual, p = 1-pchisq(sev_gamma$deviance/summary(sev_gamma)$dispersion, sev_gamma$df.residual))
##      scaled_deviance   df p
## [1,]        1607.921 3350 1
qchisq(0.95, sev_gamma$df.residual)
## [1] 3485.764

The severity model with the Gamma distribution is adequate because the scaled deviance is smaller than the chi-squared critical value (1608 < 3486). Equivalently, the p-value gives no significant evidence against the model fit, so the model appears suitable.

3. Incorporate only the significant predictors to model each component

Frequency Modelling

summary(freq_nb)
## 
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + veh_age + 
##     gender + area + agecat + offset(log(exposure)), data = train[train$exposure > 
##     0, ], init.theta = 2.724312268, link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.76565    0.06748 -26.164  < 2e-16 ***
## veh_valueLOW          -0.06731    0.05824  -1.156 0.247789    
## veh_valueMIDDLEHIGH    0.05421    0.05060   1.071 0.284072    
## veh_valueHIGH          0.16749    0.06325   2.648 0.008098 ** 
## veh_bodyBUS            0.74877    0.42770   1.751 0.080002 .  
## veh_bodyCONVT         -1.55225    1.00507  -1.544 0.122486    
## veh_bodyCOUPE          0.41529    0.14502   2.864 0.004188 ** 
## veh_bodyHDTOP          0.08466    0.11593   0.730 0.465212    
## veh_bodyMCARA          0.52128    0.30483   1.710 0.087259 .  
## veh_bodyMIBUS         -0.04588    0.18409  -0.249 0.803200    
## veh_bodyPANVN          0.02443    0.15363   0.159 0.873651    
## veh_bodyRDSTR        -16.32780 1879.40165  -0.009 0.993068    
## veh_bodySEDAN          0.02601    0.04684   0.555 0.578594    
## veh_bodySTNWG         -0.01212    0.06412  -0.189 0.850139    
## veh_bodyTRUCK          0.05657    0.11337   0.499 0.617779    
## veh_bodyUTE           -0.16794    0.08638  -1.944 0.051860 .  
## veh_age1              -0.04346    0.05260  -0.826 0.408706    
## veh_age3              -0.08878    0.04954  -1.792 0.073143 .  
## veh_age4              -0.10840    0.06670  -1.625 0.104110    
## genderM               -0.02158    0.03590  -0.601 0.547894    
## areaB                  0.05400    0.05118   1.055 0.291364    
## areaC                  0.03939    0.04638   0.849 0.395736    
## areaD                 -0.12636    0.06336  -1.994 0.046100 *  
## areaE                 -0.02791    0.06885  -0.405 0.685167    
## areaF                 -0.01541    0.08085  -0.191 0.848871    
## agecat1                0.19860    0.06441   3.083 0.002047 ** 
## agecat3               -0.05581    0.05166  -1.080 0.280073    
## agecat4               -0.05609    0.05141  -1.091 0.275304    
## agecat5               -0.28619    0.06065  -4.719 2.37e-06 ***
## agecat6               -0.26143    0.07202  -3.630 0.000283 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.7243) family taken to be 1)
## 
##     Null deviance: 17525  on 50891  degrees of freedom
## Residual deviance: 17385  on 50862  degrees of freedom
## AIC: 25499
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.724 
##           Std. Err.:  0.691 
## 
##  2 x log-likelihood:  -25437.443

Looking at the p-values of the regression coefficients, the variables veh_age and gender are not significant, since none of the dummy coefficients associated with those variables has a significant p-value. We therefore perform backward selection.

Backward selection step 1 = Eliminate gender

#eliminate gender
freq_nb1 = glm.nb(numclaims~veh_value+veh_body+veh_age+area+agecat+offset(log(exposure)), 
             data = train[train$exposure>0,])
summary(freq_nb1)
## 
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + veh_age + 
##     area + agecat + offset(log(exposure)), data = train[train$exposure > 
##     0, ], init.theta = 2.724031997, link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -1.77085    0.06693 -26.456  < 2e-16 ***
## veh_valueLOW          -0.06622    0.05821  -1.138 0.255306    
## veh_valueMIDDLEHIGH    0.05352    0.05059   1.058 0.290069    
## veh_valueHIGH          0.16557    0.06317   2.621 0.008766 ** 
## veh_bodyBUS            0.74324    0.42753   1.738 0.082131 .  
## veh_bodyCONVT         -1.55198    1.00511  -1.544 0.122565    
## veh_bodyCOUPE          0.41222    0.14495   2.844 0.004456 ** 
## veh_bodyHDTOP          0.07996    0.11568   0.691 0.489409    
## veh_bodyMCARA          0.51647    0.30481   1.694 0.090186 .  
## veh_bodyMIBUS         -0.04650    0.18408  -0.253 0.800578    
## veh_bodyPANVN          0.01531    0.15285   0.100 0.920229    
## veh_bodyRDSTR        -16.33392 1876.66356  -0.009 0.993056    
## veh_bodySEDAN          0.02470    0.04679   0.528 0.597625    
## veh_bodySTNWG         -0.01551    0.06389  -0.243 0.808162    
## veh_bodyTRUCK          0.04667    0.11216   0.416 0.677339    
## veh_bodyUTE           -0.17670    0.08513  -2.076 0.037929 *  
## veh_age1              -0.04341    0.05260  -0.825 0.409292    
## veh_age3              -0.08956    0.04953  -1.808 0.070550 .  
## veh_age4              -0.11097    0.06656  -1.667 0.095471 .  
## areaB                  0.05413    0.05118   1.058 0.290202    
## areaC                  0.03964    0.04638   0.855 0.392698    
## areaD                 -0.12485    0.06331  -1.972 0.048601 *  
## areaE                 -0.02672    0.06882  -0.388 0.697865    
## areaF                 -0.01390    0.08081  -0.172 0.863389    
## agecat1                0.19772    0.06439   3.071 0.002136 ** 
## agecat3               -0.05595    0.05166  -1.083 0.278827    
## agecat4               -0.05694    0.05139  -1.108 0.267844    
## agecat5               -0.28791    0.06058  -4.753 2.01e-06 ***
## agecat6               -0.26466    0.07182  -3.685 0.000228 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.724) family taken to be 1)
## 
##     Null deviance: 17525  on 50891  degrees of freedom
## Residual deviance: 17385  on 50863  degrees of freedom
## AIC: 25498
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.724 
##           Std. Err.:  0.691 
## 
##  2 x log-likelihood:  -25437.804

Backward selection step 2 = Eliminate veh_age

freq_nb2 = glm.nb(numclaims~veh_value+veh_body+area+agecat+offset(log(exposure)), 
             data = train[train$exposure>0,])
summary(freq_nb2)
## 
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + area + agecat + 
##     offset(log(exposure)), data = train[train$exposure > 0, ], 
##     init.theta = 2.707842716, link = log)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.821e+00  6.119e-02 -29.764  < 2e-16 ***
## veh_valueLOW        -1.050e-01  5.015e-02  -2.094 0.036252 *  
## veh_valueMIDDLEHIGH  7.991e-02  4.834e-02   1.653 0.098351 .  
## veh_valueHIGH        2.121e-01  5.262e-02   4.030 5.58e-05 ***
## veh_bodyBUS          7.014e-01  4.268e-01   1.643 0.100348    
## veh_bodyCONVT       -1.595e+00  1.005e+00  -1.587 0.112434    
## veh_bodyCOUPE        3.755e-01  1.424e-01   2.638 0.008345 ** 
## veh_bodyHDTOP        3.092e-02  1.102e-01   0.281 0.779032    
## veh_bodyMCARA        4.503e-01  3.012e-01   1.495 0.134930    
## veh_bodyMIBUS       -1.107e-01  1.784e-01  -0.620 0.535104    
## veh_bodyPANVN       -1.391e-02  1.513e-01  -0.092 0.926774    
## veh_bodyRDSTR       -1.636e+01  1.877e+03  -0.009 0.993046    
## veh_bodySEDAN        7.090e-03  4.518e-02   0.157 0.875290    
## veh_bodySTNWG       -5.984e-02  5.519e-02  -1.084 0.278218    
## veh_bodyTRUCK        4.255e-03  1.084e-01   0.039 0.968696    
## veh_bodyUTE         -2.145e-01  8.103e-02  -2.647 0.008120 ** 
## areaB                5.431e-02  5.118e-02   1.061 0.288584    
## areaC                4.032e-02  4.638e-02   0.869 0.384622    
## areaD               -1.267e-01  6.331e-02  -2.002 0.045299 *  
## areaE               -2.918e-02  6.882e-02  -0.424 0.671605    
## areaF               -1.998e-02  8.074e-02  -0.247 0.804522    
## agecat1              2.004e-01  6.434e-02   3.114 0.001846 ** 
## agecat3             -5.643e-02  5.167e-02  -1.092 0.274722    
## agecat4             -5.697e-02  5.139e-02  -1.109 0.267622    
## agecat5             -2.876e-01  6.058e-02  -4.748 2.06e-06 ***
## agecat6             -2.633e-01  7.180e-02  -3.666 0.000246 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.7078) family taken to be 1)
## 
##     Null deviance: 17519  on 50891  degrees of freedom
## Residual deviance: 17383  on 50866  degrees of freedom
## AIC: 25496
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.708 
##           Std. Err.:  0.684 
## 
##  2 x log-likelihood:  -25441.632

Severity Modelling

Recall the model summary for severity:

summary(sev_gamma)
## 
## Call:
## glm(formula = severity ~ veh_value + veh_body + veh_age + gender + 
##     area + agecat, family = Gamma("log"), data = train[train$numclaims > 
##     0, ], weights = numclaims)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.514443   0.121827  61.681  < 2e-16 ***
## veh_valueLOW         0.085779   0.105519   0.813 0.416318    
## veh_valueMIDDLEHIGH  0.009238   0.091950   0.100 0.919979    
## veh_valueHIGH       -0.115149   0.115407  -0.998 0.318467    
## veh_bodyBUS         -0.276854   0.756794  -0.366 0.714519    
## veh_bodyCONVT       -1.030755   1.849179  -0.557 0.577283    
## veh_bodyCOUPE        0.374727   0.259369   1.445 0.148618    
## veh_bodyHDTOP        0.123624   0.208096   0.594 0.552504    
## veh_bodyMCARA       -1.022644   0.546427  -1.872 0.061361 .  
## veh_bodyMIBUS        0.272985   0.334817   0.815 0.414944    
## veh_bodyPANVN        0.145277   0.275929   0.527 0.598575    
## veh_bodySEDAN       -0.148193   0.085072  -1.742 0.081605 .  
## veh_bodySTNWG       -0.072168   0.117718  -0.613 0.539880    
## veh_bodyTRUCK        0.056175   0.203675   0.276 0.782713    
## veh_bodyUTE         -0.119374   0.158036  -0.755 0.450088    
## veh_age1            -0.043576   0.094762  -0.460 0.645654    
## veh_age3             0.059946   0.089599   0.669 0.503511    
## veh_age4             0.010553   0.118571   0.089 0.929085    
## genderM              0.178773   0.065060   2.748 0.006031 ** 
## areaB               -0.020476   0.092474  -0.221 0.824774    
## areaC                0.081724   0.083910   0.974 0.330157    
## areaD                0.044805   0.114927   0.390 0.696667    
## areaE                0.227978   0.124530   1.831 0.067234 .  
## areaF                0.499850   0.147097   3.398 0.000686 ***
## agecat1              0.068641   0.116321   0.590 0.555161    
## agecat3             -0.089848   0.093310  -0.963 0.335667    
## agecat4             -0.167505   0.093281  -1.796 0.072631 .  
## agecat5             -0.284988   0.109569  -2.601 0.009336 ** 
## agecat6             -0.055087   0.130009  -0.424 0.671800    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.375574)
## 
##     Null deviance: 5623.8  on 3378  degrees of freedom
## Residual deviance: 5427.7  on 3350  degrees of freedom
## AIC: 61175
## 
## Number of Fisher Scoring iterations: 8

Backward selection step 1 = Eliminate veh_age

sev_gamma1 = glm(severity ~ veh_value + veh_body + gender + area + agecat, data = train[train$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma1)
## 
## Call:
## glm(formula = severity ~ veh_value + veh_body + gender + area + 
##     agecat, family = Gamma("log"), data = train[train$numclaims > 
##     0, ], weights = numclaims)
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.52721    0.11146  67.533  < 2e-16 ***
## veh_valueLOW         0.08709    0.09040   0.963 0.335466    
## veh_valueMIDDLEHIGH -0.01306    0.08761  -0.149 0.881475    
## veh_valueHIGH       -0.14665    0.09655  -1.519 0.128866    
## veh_bodyBUS         -0.25964    0.75097  -0.346 0.729561    
## veh_bodyCONVT       -1.00219    1.83471  -0.546 0.584937    
## veh_bodyCOUPE        0.39084    0.25522   1.531 0.125772    
## veh_bodyHDTOP        0.15327    0.19877   0.771 0.440715    
## veh_bodyMCARA       -0.97675    0.53732  -1.818 0.069181 .  
## veh_bodyMIBUS        0.31171    0.32160   0.969 0.332503    
## veh_bodyPANVN        0.14864    0.27277   0.545 0.585847    
## veh_bodySEDAN       -0.13742    0.08201  -1.676 0.093903 .  
## veh_bodySTNWG       -0.05366    0.10238  -0.524 0.600198    
## veh_bodyTRUCK        0.07087    0.19787   0.358 0.720234    
## veh_bodyUTE         -0.09363    0.15012  -0.624 0.532848    
## genderM              0.17886    0.06468   2.765 0.005719 ** 
## areaB               -0.01738    0.09201  -0.189 0.850148    
## areaC                0.08302    0.08352   0.994 0.320274    
## areaD                0.05172    0.11436   0.452 0.651091    
## areaE                0.23519    0.12388   1.898 0.057716 .  
## areaF                0.49873    0.14600   3.416 0.000643 ***
## agecat1              0.06698    0.11575   0.579 0.562837    
## agecat3             -0.08678    0.09282  -0.935 0.349847    
## agecat4             -0.16741    0.09282  -1.804 0.071386 .  
## agecat5             -0.28435    0.10907  -2.607 0.009173 ** 
## agecat6             -0.06001    0.12940  -0.464 0.642832    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.345349)
## 
##     Null deviance: 5623.8  on 3378  degrees of freedom
## Residual deviance: 5431.5  on 3353  degrees of freedom
## AIC: 61172
## 
## Number of Fisher Scoring iterations: 8

Backward selection step 2 = Eliminate veh_body

sev_gamma2 = glm(severity ~ veh_value + gender + area + agecat, data = train[train$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma2)
## 
## Call:
## glm(formula = severity ~ veh_value + gender + area + agecat, 
##     family = Gamma("log"), data = train[train$numclaims > 0, 
##         ], weights = numclaims)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.473460   0.105353  70.938  < 2e-16 ***
## veh_valueLOW         0.105667   0.091148   1.159 0.246418    
## veh_valueMIDDLEHIGH -0.019356   0.087255  -0.222 0.824459    
## veh_valueHIGH       -0.132638   0.086906  -1.526 0.127050    
## genderM              0.185258   0.063188   2.932 0.003392 ** 
## areaB               -0.001307   0.092922  -0.014 0.988778    
## areaC                0.093544   0.084258   1.110 0.266992    
## areaD                0.041342   0.114791   0.360 0.718759    
## areaE                0.255235   0.123895   2.060 0.039467 *  
## areaF                0.504482   0.145296   3.472 0.000523 ***
## agecat1              0.080970   0.116300   0.696 0.486344    
## agecat3             -0.083439   0.093680  -0.891 0.373166    
## agecat4             -0.183608   0.093246  -1.969 0.049028 *  
## agecat5             -0.302600   0.109815  -2.756 0.005891 ** 
## agecat6             -0.087696   0.129874  -0.675 0.499571    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.421348)
## 
##     Null deviance: 5623.8  on 3378  degrees of freedom
## Residual deviance: 5475.7  on 3364  degrees of freedom
## AIC: 61185
## 
## Number of Fisher Scoring iterations: 7

Backward selection step 3 = Eliminate veh_value

sev_gamma3 = glm(severity ~ gender + area + agecat, data = train[train$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma3)
## 
## Call:
## glm(formula = severity ~ gender + area + agecat, family = Gamma("log"), 
##     data = train[train$numclaims > 0, ], weights = numclaims)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.45743    0.09089  82.050  < 2e-16 ***
## genderM      0.15572    0.06168   2.525  0.01163 *  
## areaB       -0.00398    0.09169  -0.043  0.96538    
## areaC        0.09838    0.08313   1.183  0.23672    
## areaD        0.04234    0.11306   0.374  0.70807    
## areaE        0.23664    0.12196   1.940  0.05242 .  
## areaF        0.44822    0.14112   3.176  0.00151 ** 
## agecat1      0.09146    0.11464   0.798  0.42501    
## agecat3     -0.06787    0.09244  -0.734  0.46287    
## agecat4     -0.15869    0.09195  -1.726  0.08449 .  
## agecat5     -0.28132    0.10835  -2.596  0.00946 ** 
## agecat6     -0.05836    0.12792  -0.456  0.64824    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.331778)
## 
##     Null deviance: 5623.8  on 3378  degrees of freedom
## Residual deviance: 5499.2  on 3367  degrees of freedom
## AIC: 61198
## 
## Number of Fisher Scoring iterations: 7

After performing backward selection and evaluating the models’ AIC, we see that the AIC increases after step 2 (from 61172 to 61185), so we perform an analysis of deviance (ANOVA) to check whether the dropped variable is really dispensable.
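The statistic computed below compares the scaled change in deviance between the nested models with a chi-squared distribution:

\[
\frac{D_{\text{reduced}} - D_{\text{full}}}{\hat\phi} \sim \chi^{2}_{\Delta\mathrm{df}} \quad \text{under } H_0,
\]

where \(H_0\) states that the dropped variable has no effect and \(\hat\phi\) is the estimated Gamma dispersion of the larger model.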

sev_anova = anova(sev_gamma2,sev_gamma1)

1-pchisq(sev_anova[2,]$Deviance/summary(sev_gamma1)$dispersion,sev_anova[2,]$Df)
## [1] 0.280484

Because the p-value = 0.280484 > 0.05, \(H_0\) is accepted; by the ANOVA, the variable veh_body is not impactful. Our current best model with the original (non-recoded) levels is therefore sev_gamma2. Next, we test it against sev_gamma3 with another ANOVA.

sev_anova = anova(sev_gamma3,sev_gamma2)

1-pchisq(sev_anova[2,]$Deviance/summary(sev_gamma2)$dispersion,sev_anova[2,]$Df)
## [1] 0.07670181

Since the p-value > 0.05, \(H_0\) is accepted; the variable veh_value is also not impactful. Our best model is therefore sev_gamma3.

4. Modify the categories of some predictors to get better prediction power. Using the deviance statistic, comment on the overall fit

Since some variables have non-significant dummy levels, those levels must be recoded and merged into the base level.

Recode part 1 = veh_value

# save new data frames with recoded veh_value
train_rec1 = train
test_rec1 = test
levels(train_rec1$veh_value) = recode(levels(train_rec1$veh_value),
                                      'MIDDLELOW' = 'MIDDLELOW+', 'MIDDLEHIGH' = 'MIDDLELOW+')
levels(test_rec1$veh_value) = recode(levels(test_rec1$veh_value),
                                     'MIDDLELOW' = 'MIDDLELOW+', 'MIDDLEHIGH' = 'MIDDLELOW+')

levels(test_rec1$veh_value)
## [1] "MIDDLELOW+" "LOW"        "HIGH"
levels(train_rec1$veh_value)
## [1] "MIDDLELOW+" "LOW"        "HIGH"

Recode part 2 = veh_body

# save new data frames with recoded veh_body
train_rec2 = train_rec1
test_rec2 = test_rec1

# merge all non-significant body types into the base cell HBACK
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body),
                                     'HBACK' = 'HBACK+', 'BUS' = 'HBACK+', 'CONVT' = 'HBACK+',
                                     'HDTOP' = 'HBACK+', 'MCARA' = 'HBACK+', 'MIBUS' = 'HBACK+',
                                     'PANVN' = 'HBACK+', 'RDSTR' = 'HBACK+', 'SEDAN' = 'HBACK+',
                                     'STNWG' = 'HBACK+', 'TRUCK' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body),
                                    'HBACK' = 'HBACK+', 'BUS' = 'HBACK+', 'CONVT' = 'HBACK+',
                                    'HDTOP' = 'HBACK+', 'MCARA' = 'HBACK+', 'MIBUS' = 'HBACK+',
                                    'PANVN' = 'HBACK+', 'RDSTR' = 'HBACK+', 'SEDAN' = 'HBACK+',
                                    'STNWG' = 'HBACK+', 'TRUCK' = 'HBACK+')

levels(train_rec2$veh_body)
## [1] "HBACK+" "COUPE"  "UTE"
levels(test_rec2$veh_body)
## [1] "HBACK+" "COUPE"  "UTE"

Recode part 3 = area

# save new data frames with recoded area
train_rec3 = train_rec2
test_rec3 = test_rec2

# merge areas A, B, C, E, F into one base cell, keeping D separate
levels(train_rec3$area) = recode(levels(train_rec3$area),
                                 'A' = 'A+', 'B' = 'A+', 'C' = 'A+', 'E' = 'A+', 'F' = 'A+')
levels(test_rec3$area) = recode(levels(test_rec3$area),
                                'A' = 'A+', 'B' = 'A+', 'C' = 'A+', 'E' = 'A+', 'F' = 'A+')

levels(train_rec3$area)
## [1] "A+" "D"
levels(test_rec3$area)
## [1] "A+" "D"
levels(train_rec3$agecat)
## [1] "2" "1" "3" "4" "5" "6"

Recode part 4 = agecat

train_rec4 = train_rec3
test_rec4 = test_rec3

# merge age categories 2, 3, 4 into one base cell
levels(train_rec4$agecat) = recode(levels(train_rec4$agecat),
                                   '2' = '2+', '3' = '2+', '4' = '2+')
levels(test_rec4$agecat) = recode(levels(test_rec4$agecat),
                                  '2' = '2+', '3' = '2+', '4' = '2+')

levels(train_rec4$agecat)
## [1] "2+" "1"  "5"  "6"
levels(test_rec4$agecat)
## [1] "2+" "1"  "5"  "6"

Frequency modelling after recoding

freq_nb_recode = glm.nb(numclaims~veh_value+veh_body+area+agecat+offset(log(exposure)), 
             data = train_rec4[train_rec4$exposure>0,])
summary(freq_nb_recode)
## 
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + area + agecat + 
##     offset(log(exposure)), data = train_rec4[train_rec4$exposure > 
##     0, ], init.theta = 2.639873673, link = log)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.80714    0.02762 -65.425  < 2e-16 ***
## veh_valueLOW  -0.13882    0.04371  -3.176 0.001493 ** 
## veh_valueHIGH  0.13851    0.04013   3.452 0.000557 ***
## veh_bodyCOUPE  0.38839    0.13943   2.786 0.005342 ** 
## veh_bodyUTE   -0.20133    0.07375  -2.730 0.006340 ** 
## areaD         -0.14898    0.05540  -2.689 0.007169 ** 
## agecat1        0.24270    0.05570   4.357 1.32e-05 ***
## agecat5       -0.24039    0.05120  -4.695 2.67e-06 ***
## agecat6       -0.21338    0.06365  -3.353 0.000801 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.6399) family taken to be 1)
## 
##     Null deviance: 17492  on 50891  degrees of freedom
## Residual deviance: 17377  on 50883  degrees of freedom
## AIC: 25482
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.640 
##           Std. Err.:  0.654 
## 
##  2 x log-likelihood:  -25462.236
# in terms of the training data, the AIC has improved

Now let’s check whether the model is a statistically good fit, using the deviance statistic.
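Here \(H_0\) states that the model fits the data adequately; under \(H_0\) the residual deviance is approximately chi-square distributed with the residual degrees of freedom:

\[
D \overset{\text{approx.}}{\sim} \chi^2_{n-p}, \qquad p\text{-value} = P\left(\chi^2_{n-p} > D\right).
\]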

cbind(scaled.deviance=freq_nb_recode$deviance,df=freq_nb_recode$df.residual, p=1-pchisq(freq_nb_recode$deviance, freq_nb_recode$df.residual))
##      scaled.deviance    df p
## [1,]        17376.72 50883 1

Since the p-value = 1 > 0.05, \(H_0\) is not rejected; thus the model with recoded levels shows a good fit. Let’s see the performance of the non-recoded model.

cbind(scaled.deviance=freq_nb2$deviance,df=freq_nb2$df.residual, p=1-pchisq(freq_nb2$deviance, freq_nb2$df.residual))
##      scaled.deviance    df p
## [1,]        17382.87 50866 1

Since the p-value = 1 > 0.05, \(H_0\) is also not rejected; thus the model with non-recoded levels shows a good fit as well. Let’s compare the performance of both models outside the training data: since both deviance tests indicate a good fit, we use the test data to validate the predictive power once more.
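The helper below evaluates the negative binomial deviance on the test data, in the MASS parametrization where \(\theta\) is the size parameter (\(\mathrm{Var}(Y) = \mu + \mu^2/\theta\), with \(\theta\) taken from the fitted model):

\[
D = 2\sum_i \left[\, y_i \log\frac{y_i}{\hat{\mu}_i} - (y_i + \theta)\log\frac{y_i + \theta}{\hat{\mu}_i + \theta} \,\right],
\]

where the first term is set to 0 when \(y_i = 0\).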

# define function to find the residual deviance of a negative binomial model
# on new data; theta is the MASS size parameter (Var = mu + mu^2/theta)
nb_dev_resid = function(y,y_pred,theta){
  part1 = y*log(y/y_pred)
  part1[is.nan(part1)] = 0  # y = 0 gives 0*log(0) = NaN; its limit is 0
  
  part2 = (y+theta)*log((y+theta)/(y_pred+theta))
  
  2*sum(part1-part2)
}
#extracting residual deviance for test data on recoded model
y_recode_pred = exp(predict(freq_nb_recode,newdata = test_rec4[test_rec4$exposure>0,]))
y_recode = test_rec4[test_rec4$exposure>0,]$numclaims

y_no_recode_pred = exp(predict(freq_nb2, newdata = test[test$exposure>0,]))
y_no_recode = test[test$exposure>0,]$numclaims

data.frame('model' = c('freq_nb_recode', 'freq_nb2'),
           'test_dev_resid' = c(nb_dev_resid(y_recode, y_recode_pred, freq_nb_recode$theta),
                                nb_dev_resid(y_no_recode, y_no_recode_pred, freq_nb2$theta)))

Since the residual deviance equals \(2(\ell_{\text{saturated}} - \ell_{\text{model}})\) and a model is better when \(\ell_{\text{model}}\) is larger, we choose the model with the smallest test residual deviance. To conclude, the best frequency model is the one with recoded levels, freq_nb_recode.

Severity Modelling

head(train)
summary(sev_gamma1)
## 
## Call:
## glm(formula = severity ~ veh_value + veh_body + gender + area + 
##     agecat, family = Gamma("log"), data = train[train$numclaims > 
##     0, ], weights = numclaims)
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          7.52721    0.11146  67.533  < 2e-16 ***
## veh_valueLOW         0.08709    0.09040   0.963 0.335466    
## veh_valueMIDDLEHIGH -0.01306    0.08761  -0.149 0.881475    
## veh_valueHIGH       -0.14665    0.09655  -1.519 0.128866    
## veh_bodyBUS         -0.25964    0.75097  -0.346 0.729561    
## veh_bodyCONVT       -1.00219    1.83471  -0.546 0.584937    
## veh_bodyCOUPE        0.39084    0.25522   1.531 0.125772    
## veh_bodyHDTOP        0.15327    0.19877   0.771 0.440715    
## veh_bodyMCARA       -0.97675    0.53732  -1.818 0.069181 .  
## veh_bodyMIBUS        0.31171    0.32160   0.969 0.332503    
## veh_bodyPANVN        0.14864    0.27277   0.545 0.585847    
## veh_bodySEDAN       -0.13742    0.08201  -1.676 0.093903 .  
## veh_bodySTNWG       -0.05366    0.10238  -0.524 0.600198    
## veh_bodyTRUCK        0.07087    0.19787   0.358 0.720234    
## veh_bodyUTE         -0.09363    0.15012  -0.624 0.532848    
## genderM              0.17886    0.06468   2.765 0.005719 ** 
## areaB               -0.01738    0.09201  -0.189 0.850148    
## areaC                0.08302    0.08352   0.994 0.320274    
## areaD                0.05172    0.11436   0.452 0.651091    
## areaE                0.23519    0.12388   1.898 0.057716 .  
## areaF                0.49873    0.14600   3.416 0.000643 ***
## agecat1              0.06698    0.11575   0.579 0.562837    
## agecat3             -0.08678    0.09282  -0.935 0.349847    
## agecat4             -0.16741    0.09282  -1.804 0.071386 .  
## agecat5             -0.28435    0.10907  -2.607 0.009173 ** 
## agecat6             -0.06001    0.12940  -0.464 0.642832    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.345349)
## 
##     Null deviance: 5623.8  on 3378  degrees of freedom
## Residual deviance: 5431.5  on 3353  degrees of freedom
## AIC: 61172
## 
## Number of Fisher Scoring iterations: 8

Recode part 1 = area

# make new data frames to store the recoded levels
trainsev1 = train
testsev1 = test

# merge areas A, B, C, D, E into one base cell, keeping F separate
levels(trainsev1$area) = recode(levels(trainsev1$area),
                                'A' = 'A+', 'B' = 'A+', 'C' = 'A+', 'D' = 'A+', 'E' = 'A+')
levels(testsev1$area) = recode(levels(testsev1$area),
                               'A' = 'A+', 'B' = 'A+', 'C' = 'A+', 'D' = 'A+', 'E' = 'A+')

levels(trainsev1$area)
## [1] "A+" "F"
levels(testsev1$area)
## [1] "A+" "F"

Recode part 2 = agecat

trainsev2 = trainsev1
testsev2 = testsev1

# merge age categories 1, 2, 3, 4, 6 into one base cell, keeping 5 separate
levels(trainsev2$agecat) = recode(levels(trainsev2$agecat),
                                  '1' = '2+', '2' = '2+', '3' = '2+', '4' = '2+', '6' = '2+')
levels(testsev2$agecat) = recode(levels(testsev2$agecat),
                                 '1' = '2+', '2' = '2+', '3' = '2+', '4' = '2+', '6' = '2+')

levels(trainsev2$agecat)
## [1] "2+" "5"
levels(testsev2$agecat)
## [1] "2+" "5"
sev_gamma_recode = glm(severity ~ gender + area + agecat, data = trainsev2[trainsev2$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma_recode)
## 
## Call:
## glm(formula = severity ~ gender + area + agecat, family = Gamma("log"), 
##     data = trainsev2[trainsev2$numclaims > 0, ], weights = numclaims)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7.45669    0.04217 176.805  < 2e-16 ***
## genderM      0.16843    0.06120   2.752  0.00595 ** 
## areaF        0.41508    0.12913   3.214  0.00132 ** 
## agecat5     -0.22816    0.08995  -2.537  0.01124 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 3.295482)
## 
##     Null deviance: 5623.8  on 3378  degrees of freedom
## Residual deviance: 5536.0  on 3375  degrees of freedom
## AIC: 61211
## 
## Number of Fisher Scoring iterations: 6

Following the same steps as for the best frequency model, we compare the performance of the recoded and non-recoded severity models, first using the deviance statistic to assess goodness of fit.

cbind(scaled.deviance=sev_gamma_recode$deviance/summary(sev_gamma_recode)$dispersion,df=sev_gamma_recode$df.residual, p=1-pchisq(sev_gamma_recode$deviance/summary(sev_gamma_recode)$dispersion, sev_gamma_recode$df.residual))
##      scaled.deviance   df p
## [1,]        1679.862 3375 1

Since the p-value = 1 > 0.05, \(H_0\) is not rejected: the severity model with recoded levels passes the goodness-of-fit test. Now, let’s check the model with non-recoded levels.

cbind(scaled.deviance=sev_gamma3$deviance/summary(sev_gamma3)$dispersion,df=sev_gamma3$df.residual, p=1-pchisq(sev_gamma3$deviance/summary(sev_gamma3)$dispersion, sev_gamma3$df.residual))
##      scaled.deviance   df p
## [1,]        1650.517 3367 1

Since the p-value = 1 > 0.05, \(H_0\) is again not rejected: the severity model with non-recoded levels also passes the goodness-of-fit test. Now, let’s compare the models on their test-data deviance.

Gamma Residual Deviance
head(test)
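The helper defined below computes the scaled gamma deviance on the test data, dividing by the estimated dispersion \(\hat{\phi}\) from the fitted model:

\[
D^{*} = \frac{2}{\hat{\phi}} \sum_i \left[\, -\log\frac{y_i}{\hat{\mu}_i} + \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i} \,\right].
\]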
# scaled gamma residual deviance on new data;
# disp_params is the estimated dispersion from summary(model)$dispersion
gamma_dev_resid = function(y,y_pred,disp_params){
  part1 = -log(y/y_pred)
  part1[is.nan(part1)] = 0
  
  part2 = (y-y_pred)/y_pred
  part2[is.nan(part2)] = 0
  
  2*sum(part1+part2)/disp_params
}
sev_recode_pred = exp(predict(sev_gamma_recode, newdata = testsev2[testsev2$numclaims>0,]))
sev_recode_test = testsev2[testsev2$numclaims>0,]$severity

sev_no_recode_pred = exp(predict(sev_gamma3, newdata = test[test$numclaims>0,]))
sev_no_recode_test = test[test$numclaims>0,]$severity

data.frame('model' = c('sev_gamma_recode', 'sev_gamma3'),
           'test_dev_resid' = c(gamma_dev_resid(sev_recode_test, sev_recode_pred, summary(sev_gamma_recode)$dispersion),
                                gamma_dev_resid(sev_no_recode_test, sev_no_recode_pred, summary(sev_gamma3)$dispersion)))

Analogously to how the best frequency model was picked, the model with the smallest test scaled deviance is chosen. The best model for severity is therefore sev_gamma3.

5. Get the relativities for each frequency and severity with their corresponding confidence interval

In insurance pricing, relativities are adjustments made to the premium according to different tariff cells. They measure how the expected frequency or severity of claims differs from that of a baseline, which we refer to here as the base cell.

Relativities help ensure policyholders are charged according to the level of risk they transfer to the insurer. This way, contributions to the fund required for covering claims are more proportional.

The inverse of the log link, exp(\(\beta_i\)), is used to obtain the relativity of each tariff cell. Every base cell (and every level that has been merged into its base cell) has a relativity of 1, as these serve as the baseline or reference group in this calculation.
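The 95% confidence intervals computed below are Wald intervals built on the coefficient scale and then exponentiated:

\[
r_i = e^{\hat{\beta}_i}, \qquad \text{95\% CI: } \left( e^{\hat{\beta}_i - z_{0.975}\,\mathrm{SE}(\hat{\beta}_i)},\; e^{\hat{\beta}_i + z_{0.975}\,\mathrm{SE}(\hat{\beta}_i)} \right).
\]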

Relativities for Frequency Model

freq_lower_bound <- exp(freq_nb_recode$coefficients - qnorm(0.975) * sqrt(diag(vcov(freq_nb_recode))))
freq_upper_bound <- exp(freq_nb_recode$coefficients + qnorm(0.975) * sqrt(diag(vcov(freq_nb_recode))))

freq_relative = data.frame(Estimates = freq_nb_recode$coefficients,
                           Relativities = exp(freq_nb_recode$coefficients),
                           `Std. Error` = sqrt(diag(vcov(freq_nb_recode))),
                           `Lower Bound` = freq_lower_bound,
                           `Upper Bound` = freq_upper_bound,
                           `95% CI Relativities` = paste0("( ", freq_lower_bound, " , ", freq_upper_bound, " )"),
                           check.names = F)
freq_relative

The frequency relativity indicates how much more or less likely a policyholder in a specific level of a rating factor is to make a claim, compared to those in the baseline level (base cell). A relativity greater than 1 means the level scales the baseline frequency up by that amount; a relativity less than 1 means it scales the baseline down.

In other words, it tells us how much more or less likely this policyholder is to make a claim, i.e. how much higher or lower a risk this policyholder represents compared to our baseline.

Relativities for Severity Model

sev_lower_bound <- exp(sev_gamma3$coefficients - qnorm(0.975) * sqrt(diag(vcov(sev_gamma3))))
sev_upper_bound <- exp(sev_gamma3$coefficients + qnorm(0.975) * sqrt(diag(vcov(sev_gamma3))))

sev_relative = data.frame(Estimates = sev_gamma3$coefficients,
                          Relativities = exp(sev_gamma3$coefficients),
                          `Std. Error` = sqrt(diag(vcov(sev_gamma3))),
                          `Lower Bound` = sev_lower_bound,
                          `Upper Bound` = sev_upper_bound,
                          check.names = F)
sev_relative

Similarly, the severity relativity indicates how much more or less costly, on average, the claims of a policyholder in a specific level of a rating factor are, compared to those in the baseline level (base cell). A relativity greater than 1 means the level scales the baseline severity up by that amount; a relativity less than 1 means it scales the baseline down.

This can be read as how much more or less costly, on average, this policyholder’s claims are, which again expresses how much higher or lower a risk this policyholder represents compared to our baseline.

Usually, when the value 1 is included in a relativity’s confidence interval, the corresponding level is not significant enough for the model and is preferably merged with its base cell (or even removed altogether). That is because a relativity of 1, which we can write as exp(\(\beta_i\)) = 1, means \(\beta_i\) = 0, and a coefficient of 0 contributes nothing to the model. However, according to Section 4, the severity model that still includes insignificant variables performed better on the test data, and it is therefore selected as our final model.
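Such levels can be flagged directly in R; a quick sketch using the bounds computed above:

# levels whose 95% CI for the relativity contains 1 (i.e. beta = 0 is plausible)
names(which(freq_lower_bound <= 1 & freq_upper_bound >= 1))
names(which(sev_lower_bound <= 1 & sev_upper_bound >= 1))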

6. Combine your multiplier estimates to get a multiplier (and associated 95% confidence interval) for the premium overall

To estimate the overall risk premium, the frequency and severity models are combined. The risk premium is calculated by multiplying the expected frequency of claims by the expected severity of those claims, so that premiums reflect both the likelihood and the potential cost of claims.
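In symbols, with \(N\) the claim count and \(X\) the claim severity, the pure premium and the combined multiplier of a tariff cell are

\[
E[S] = E[N] \cdot E[X], \qquad r_i^{\text{premium}} = r_i^{\text{freq}} \cdot r_i^{\text{sev}},
\]

where a rating factor level that appears in only one of the two models contributes a relativity of 1 in the other.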

A multiplier for the overall premium is obtained by multiplying the multiplier estimates from the frequency and severity models (done using Excel).
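The same combination can also be sketched in R. For a level that appears in both models, such as agecat5, and treating the two fitted models as independent, the variance of the combined log-multiplier is approximately the sum of the two coefficient variances (an approximation shown for illustration only; the reported table comes from the Excel sheet):

# combined multiplier for agecat5 from the chosen frequency and severity models
b_f = coef(freq_nb_recode)["agecat5"]
se_f = sqrt(diag(vcov(freq_nb_recode)))["agecat5"]
b_s = coef(sev_gamma3)["agecat5"]
se_s = sqrt(diag(vcov(sev_gamma3)))["agecat5"]

exp(b_f + b_s)  # point estimate of the combined multiplier
exp((b_f + b_s) + c(-1, 1)*qnorm(0.975)*sqrt(se_f^2 + se_s^2))  # approximate 95% CI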

multiplier = read_excel("PURE PREMIUM RELATIVITIES.xlsx", sheet = "Pure Premium Multiplier", range = "A2:E18")
multiplier

This makes our final model for the premium overall:

Pure Premium = 284.374800 * (0.870382\(x_1\) + 1.148561\(x_2\)) * (1.474611\(x_3\) + 0.817646\(x_4\)) * 1.168499\(x_5\) * (0.996028\(x_6\) + 1.103385\(x_7\) + 0.898861\(x_8\) + 1.266991\(x_9\) + 1.565519\(x_{10}\)) * (1.396773\(x_{11}\) + 0.934385\(x_{12}\) + 0.853264\(x_{13}\) + 0.593508\(x_{14}\) + 0.762049\(x_{15}\))

in which each \(x_i\) is a 0/1 indicator variable, and each multiplicative term is read as 1 when all of its indicators are 0 (i.e. the policyholder is in that factor’s base cell):

\(x_1, x_2\): veh_value LOW, HIGH
\(x_3, x_4\): veh_body COUPE, UTE
\(x_5\): gender M
\(x_6, \ldots, x_{10}\): area B, C, D, E, F
\(x_{11}, \ldots, x_{15}\): agecat 1, 3, 4, 5, 6

Multipliers for the overall pure premium across all possible combinations are accessible here.