Asuransi Umum dan Kerugian - Project GLM Pricing
Group 6:
Ariel Haposan (22/503114/PA/21606)
Kanaya Azlia Sava (21/480251/PA/20855)
Priscilla Deviana Tachija (21/477681/PA/20684)
Victorius Chendryanto (22/496953/PA/21379)
The main goal of this project is to compute the relativities for the premium based on a set of rating factors. Statistical tests will be performed to measure the quality of the fit and determine whether the given rating factors are statistically significant to predict the frequency and/or severity component of the tariff model. All statistical tests should be performed at a confidence level of 95%.
Importing dataset
library(readxl)
## Warning: package 'readxl' was built under R version 4.3.2
G3_6 <- read_excel("G3-6.xlsx")
head(G3_6)
The imported dataset is a modified version of the original given dataset, where some variables in the original dataset were read as chr (character) instead of dbl (double). The data was modified by manually changing the values stored as text in Excel, although this could also be done using R.
To understand the dataset better, we need to define each variable in the dataset. To do this, we found a similar dataset on the internet for reference, and we must first analyze it by examining the characteristics of each variable.
summary(G3_6)
## veh_value exposure clm numclaims
## Min. : 0.00 Min. :0.002738 Min. :0.00000 Min. :0.00000
## 1st Qu.: 1.01 1st Qu.:0.219028 1st Qu.:0.00000 1st Qu.:0.00000
## Median : 1.50 Median :0.446270 Median :0.00000 Median :0.00000
## Mean : 74.81 Mean :0.468651 Mean :0.06814 Mean :0.07276
## 3rd Qu.: 2.20 3rd Qu.:0.709103 3rd Qu.:0.00000 3rd Qu.:0.00000
## Max. :7995.00 Max. :0.999316 Max. :1.00000 Max. :4.00000
## claimcst0 veh_body veh_age gender
## Min. : 0.0 Length:67856 Min. :1.000 Length:67856
## 1st Qu.: 0.0 Class :character 1st Qu.:2.000 Class :character
## Median : 0.0 Mode :character Median :3.000 Mode :character
## Mean : 137.3 Mean :2.674
## 3rd Qu.: 0.0 3rd Qu.:4.000
## Max. :55922.1 Max. :4.000
## area agecat
## Length:67856 Min. :1.000
## Class :character 1st Qu.:2.000
## Mode :character Median :3.000
## Mean :3.485
## 3rd Qu.:5.000
## Max. :6.000
cat("NAs data:", sum(is.na(G3_6)))
## NAs data: 0
As we can see, the variables veh_value and claimcst0 have significant gaps between the minimum, median, and maximum values. We may need to investigate whether there are anomalies in these values. However, for variable claimcst0, we can likely rule out our suspicion, as it is defined as the claim severity, which is known to be highly variable, with many instances of 0 (as confirmed by the fact that the third quartile is 0). Therefore, we will focus on checking for anomalies in the variable veh_value.
We also acknowledge that the values of the variables clm, veh_body, veh_age, gender, area, and agecat are likely categorical, so we need to factor these variables. From a similar dataset we found online, veh_value is defined as the value of vehicles, which seems odd, as vehicle prices usually don’t exhibit such large variations. Considering this, we will further examine the distribution of veh_value and its correlation with other variables.
print(cor(G3_6[c(1:5,7,10)]))
## veh_value exposure clm numclaims claimcst0
## veh_value 1.000000000 0.01839012 0.006417623 0.006179527 3.352714e-03
## exposure 0.018390122 1.00000000 0.132980367 0.134393258 3.652955e-02
## clm 0.006417623 0.13298037 1.000000000 0.967106972 4.805655e-01
## numclaims 0.006179527 0.13439326 0.967106972 1.000000000 4.817623e-01
## claimcst0 0.003352714 0.03652955 0.480565452 0.481762292 1.000000e+00
## veh_age -0.216648123 0.03696233 -0.012128772 -0.011572424 -1.157106e-05
## agecat -0.022104660 0.02503435 -0.030433931 -0.029571784 -2.743031e-02
## veh_age agecat
## veh_value -2.166481e-01 -0.02210466
## exposure 3.696233e-02 0.02503435
## clm -1.212877e-02 -0.03043393
## numclaims -1.157242e-02 -0.02957178
## claimcst0 -1.157106e-05 -0.02743031
## veh_age 1.000000e+00 0.01761360
## agecat 1.761360e-02 1.00000000
plot(G3_6[c(1:5,7,10)])
The plot of each variable strengthens our hypothesis that the variables clm, veh_body, veh_age, gender, area, and agecat are categorical variables. This is evident from the patterns that follow distinct lines, indicating a finite set of discrete values, consistent with the definitions found in a similar dataset.
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.3.3
library(plotly)
## Warning: package 'plotly' was built under R version 4.3.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# Create a basic ggplot2 boxplot
p <- ggplot(G3_6, aes(x = "", y = veh_value)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3) +
labs(title = "Interactive Boxplot of Vehicle Value", y = "Vehicle Value") +
theme_minimal()
# Convert ggplot2 plot to an interactive plotly plot
interactive_plot <- ggplotly(p)
# Show the interactive plot
interactive_plot
From the boxplot of the variable veh_value, it appears there are anomalies in the data, with some values significantly higher than expected. Upon further examination of the data samples, we noticed a substantial gap between certain values, indicating potential errors or outliers. To address this issue, we will apply scaling to the anomalous data.
We hypothesize that any value above 999 is incorrect, possibly due to a misplacement of decimal points or other data entry errors. Therefore, we will scale down these values by a factor of 1000 to bring them into a reasonable range. This adjustment will ensure the data is more consistent with expectations and facilitate more accurate analysis moving forward.
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
G3_6 = G3_6 %>% mutate(veh_value = case_when(veh_value > 999 ~ veh_value/1000, veh_value < 999 ~ veh_value))
G3_6 <- within(G3_6, {
clm <- factor(clm)
veh_body <- factor(veh_body)
veh_age <- factor(veh_age)
gender <- factor(gender)
area <- factor(area)
agecat <- factor(agecat)
})
summary(G3_6)
## veh_value exposure clm numclaims
## Min. : 0.000 Min. :0.002738 0:63232 Min. :0.00000
## 1st Qu.: 1.010 1st Qu.:0.219028 1: 4624 1st Qu.:0.00000
## Median : 1.500 Median :0.446270 Median :0.00000
## Mean : 1.777 Mean :0.468651 Mean :0.07276
## 3rd Qu.: 2.150 3rd Qu.:0.709103 3rd Qu.:0.00000
## Max. :34.560 Max. :0.999316 Max. :4.00000
##
## claimcst0 veh_body veh_age gender area agecat
## Min. : 0.0 SEDAN :22233 1:12257 F:38603 A:16312 1: 5742
## 1st Qu.: 0.0 HBACK :18915 2:16587 M:29253 B:13341 2:12875
## Median : 0.0 STNWG :16261 3:20064 C:20540 3:15767
## Mean : 137.3 UTE : 4586 4:18948 D: 8173 4:16189
## 3rd Qu.: 0.0 TRUCK : 1750 E: 5912 5:10736
## Max. :55922.1 HDTOP : 1579 F: 3578 6: 6547
## (Other): 2532
After examining the characteristics and cross-checking with the similar dataset, we are able to define each of the variables as follows:
For the purpose of creating tariff cells for premium calculation, all variables need to be in categorical form to simplify the analysis. Therefore, we need to convert the veh_value variable into a categorical variable. We have decided to divide veh_value into four categories based on its range. The four categories are chosen by taking the first quartile, median, and third quartile of the variable. The four categories are then defined as follows:
LOW: 0 to 1.01 MIDDLELOW: 1.01 to 1.5 MIDDLEHIGH: 1.5 to 2.15 HIGH: 2.15 to 100
The following code was used to implement this categorization:
library(ggplot2)
library(plotly)
# Create a basic ggplot2 boxplot
p <- ggplot(G3_6, aes(x = "", y = veh_value)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 3) +
labs(title = "Interactive Boxplot of Vehicle Value", y = "Vehicle Value") +
theme_minimal()
# Convert ggplot2 plot to an interactive plotly plot
interactive_plot <- ggplotly(p)
# Show the interactive plot
interactive_plot
G3_6 = G3_6 %>% mutate(veh_value = cut(veh_value, breaks = c(0, 1.01, 1.5, 2.15, 100), labels = c("LOW", "MIDDLELOW", "MIDDLEHIGH", "HIGH"), include.lowest = TRUE))
We also need to define frequency and severity for further analysis
G3_6$frequency = G3_6$numclaims / G3_6$exposure
G3_6$severity = G3_6$claimcst0 / G3_6$numclaims
G3_6$severity[G3_6$numclaims == 0] = 0 #If the number of claims = 0, thus the severity should be 0 too
summary(G3_6)
## veh_value exposure clm numclaims
## LOW :17060 Min. :0.002738 0:63232 Min. :0.00000
## MIDDLELOW :17308 1st Qu.:0.219028 1: 4624 1st Qu.:0.00000
## MIDDLEHIGH:16550 Median :0.446270 Median :0.00000
## HIGH :16938 Mean :0.468651 Mean :0.07276
## 3rd Qu.:0.709103 3rd Qu.:0.00000
## Max. :0.999316 Max. :4.00000
##
## claimcst0 veh_body veh_age gender area agecat
## Min. : 0.0 SEDAN :22233 1:12257 F:38603 A:16312 1: 5742
## 1st Qu.: 0.0 HBACK :18915 2:16587 M:29253 B:13341 2:12875
## Median : 0.0 STNWG :16261 3:20064 C:20540 3:15767
## Mean : 137.3 UTE : 4586 4:18948 D: 8173 4:16189
## 3rd Qu.: 0.0 TRUCK : 1750 E: 5912 5:10736
## Max. :55922.1 HDTOP : 1579 F: 3578 6: 6547
## (Other): 2532
## frequency severity
## Min. : 0.0000 Min. : 0.0
## 1st Qu.: 0.0000 1st Qu.: 0.0
## Median : 0.0000 Median : 0.0
## Mean : 0.2142 Mean : 130.6
## 3rd Qu.: 0.0000 3rd Qu.: 0.0
## Max. :365.2500 Max. :55922.1
##
The function “levels” enumerate the different outcomes/categories of each predictor
cat("Level of variable veh_value:", levels(G3_6$veh_value), "\n")
## Level of variable veh_value: LOW MIDDLELOW MIDDLEHIGH HIGH
cat("Level of variable veh_body:", levels(G3_6$veh_body), "\n")
## Level of variable veh_body: BUS CONVT COUPE HBACK HDTOP MCARA MIBUS PANVN RDSTR SEDAN STNWG TRUCK UTE
cat("Level of variable veh_age:", levels(G3_6$veh_age), "\n")
## Level of variable veh_age: 1 2 3 4
cat("Level of variable gender:", levels(G3_6$gender), "\n")
## Level of variable gender: F M
cat("Level of variable area:", levels(G3_6$area), "\n")
## Level of variable area: A B C D E F
cat("Level of variable agecat:", levels(G3_6$agecat))
## Level of variable agecat: 1 2 3 4 5 6
Observation: the variable area and agecat each have 6 categories, gender has 2 categories, veh_body has 13 categories, and both veh_age and veh_value have 4 categories.
To better understand the relationship between predictors and claim frequency
par(mfrow = c(2, 3))
plot(G3_6$veh_value, G3_6$frequency, col="red",xlab="Vehicle Value",ylab="Claim Frequency")
plot(G3_6$veh_body, G3_6$frequency, col="red",xlab="Vehicle Body",ylab="Claim Frequency")
plot(G3_6$veh_age, G3_6$frequency, col="red",xlab="Vehicle Age",ylab="Claim Frequency")
plot(G3_6$gender, G3_6$frequency, col="red",xlab="Gender",ylab="Claim Frequency")
plot(G3_6$area, G3_6$frequency, col="red",xlab="Area",ylab="Claim Frequency")
plot(G3_6$agecat, G3_6$frequency, col="red",xlab="Driver's Age",ylab="Claim Frequency")
Observations:
Vehicle Value (LOW and MIDDLEHIGH)
The claim frequency for both categories shows many outliers, particularly in the LOW category. The distribution is relatively compact, suggesting less variation within the core data but significant variation due to outliers.
Vehicle Body (BUS, HDTOP, SEDAN)
The claim frequency varies notably across different vehicle body types. The HDTOP category appears to have a wider spread of claim frequencies, including more outliers compared to BUS and SEDAN, suggesting a higher variability in claims.
Vehicle Age (1, 2, 3, 4)
There seems to be a consistent pattern of claim frequency across different vehicle ages, with outliers present in each category. The range of core claim frequencies does not change dramatically, indicating stable claim behavior across vehicle ages.
Gender (F, M)
The claim frequencies between genders show considerable overlap, with both having outliers. However, there appears to be a slightly higher concentration of outliers in the male category.
Area (A, B, C, D, E, F)
Claim frequencies across different areas show significant differences, with some areas like B and E having higher frequencies and more outliers. This indicates varying risk levels or claim propensities based on area.
Driver’s Age (1, 2, 3, 4, 5, 6)
The claim frequency is similar across all age groups, with no significant differences. Outliers are present in each group, but the core data remains consistent, indicating stable claim behavior across different driver ages.
Summary: Numerous outliers are present across all variables, potentially affecting the perception of claim frequency variability. There is significant variability in claim frequencies among different categories within each variable, notably in vehicle body type and area. The prevalence of outliers complicates the identification of clear patterns in claim frequencies, indicating a need for further analysis. Utilizing median values or outlier-resistant measures could provide a clearer understanding of the central tendencies and dispersions within the data.
par(mfrow = c(2, 3))
plot(G3_6$veh_value, G3_6$severity, col="red",xlab="Vehicle Body",ylab="Claim Severity")
plot(G3_6$veh_body, G3_6$severity, col="red",xlab="Vehicle Body",ylab="Claim Severity")
plot(G3_6$veh_age, G3_6$severity, col="red",xlab="Vehicle Age",ylab="Claim Severity")
plot(G3_6$gender, G3_6$severity, col="red",xlab="Gender",ylab="Claim Severity")
plot(G3_6$area, G3_6$severity, col="red",xlab="Area",ylab="Claim Severity")
plot(G3_6$agecat, G3_6$severity, col="red",xlab="Driver's Age",ylab="Claim Severity")
Observations: Vehicle Value (LOW, MIDDLEHIGH): Claim severity is similar between low and middle-high vehicle values, with both having outliers, but no significant difference in the overall distribution.
Vehicle Body (BUS, HDTOP, SEDAN): The SEDAN category shows a wider spread of claim severities with more outliers compared to BUS and HDTOP. Overall, there are higher claim severities for sedans.
Vehicle Age (1, 2, 3, 4): Claim severity remains fairly consistent across vehicle age categories, with outliers present in all. There is no strong pattern suggesting that vehicle age affects claim severity.
Gender (F, M): Claim severity between males and females appears quite similar, with both categories containing a few large outliers. No major differences in the distributions.
Area (A, B, C, D, E, F): Areas exhibit similar distributions of claim severity, with outliers present in each. No specific area stands out as having significantly higher or lower severities.
Driver’s Age (1, 2, 3, 4, 5, 6): Claim severity is consistent across driver age groups, with outliers present in each. There is no significant variation in the core claim severities across different age groups.
Summary: The claim severity appears fairly consistent across most variables, with outliers present in all categories. There is no significant difference in severity between vehicle values (low and middle-high) or between genders. Vehicle body type shows some variation, with sedans exhibiting a wider spread of severities and more outliers compared to buses and hardtops. Similarly, claim severity remains stable across different vehicle ages and driver age groups, with no notable patterns or changes. The various areas also demonstrate similar severity distributions, with outliers in each region but no significant differences overall.
library(fitdistrplus)
## Warning: package 'fitdistrplus' was built under R version 4.3.3
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## The following object is masked from 'package:plotly':
##
## select
## Loading required package: survival
## Warning: package 'survival' was built under R version 4.3.3
library(MASS)
descdist(G3_6$numclaims, discrete = TRUE)
## summary statistics
## ------
## min: 0 max: 4
## median: 0
## mean: 0.07275701
## estimated sd: 0.2782038
## estimated skewness: 4.067377
## estimated kurtosis: 21.50576
Possible fit distributions are normal, nbinom, and poisson based on Cullen and Frey graph.
freqnorm <- fitdist(G3_6$numclaims, "norm")
freqbinom <- fitdist(G3_6$numclaims, "nbinom")
freqpois <- fitdist(G3_6$numclaims, "pois")
freqfit <- list(freqnorm, freqbinom, freqpois)
summary(freqnorm)
## Fitting of the distribution ' norm ' by maximum likelihood
## Parameters :
## estimate Std. Error
## mean 0.07275701 0.0010679865
## sd 0.27820178 0.0007551366
## Loglikelihood: -9467.943 AIC: 18939.89 BIC: 18958.14
## Correlation matrix:
## mean sd
## mean 1.000000e+00 -5.867883e-12
## sd -5.867883e-12 1.000000e+00
summary(freqbinom)
## Fitting of the distribution ' nbinom ' by maximum likelihood
## Parameters :
## estimate Std. Error
## size 1.15881641 0.143342061
## mu 0.07275935 0.001067318
## Loglikelihood: -18049.68 AIC: 36103.36 BIC: 36121.61
## Correlation matrix:
## size mu
## size 1.000000e+00 -1.230821e-05
## mu -1.230821e-05 1.000000e+00
summary(freqpois)
## Fitting of the distribution ' pois ' by maximum likelihood
## Parameters :
## estimate Std. Error
## lambda 0.07275701 0.001035288
## Loglikelihood: -18101.5 AIC: 36205 BIC: 36214.13
data.frame(Dist = c("Normal", "NB", "Poisson"),
Loglik = c(freqnorm$loglik, freqbinom$loglik, freqpois$loglik),
AIC = c(freqnorm$aic, freqbinom$aic, freqpois$aic),
BIC = c(freqnorm$bic, freqbinom$bic, freqpois$bic))
From the fitting process, based on the AIC and BIC goodness-of-fit tests, we found that the negative binomial distribution is the best fit for claim frequency, with parameters size = 1.15881641 and mu = 0.07275935. Although the normal distribution has the smallest AIC and BIC values, it is not recommended for modeling frequency, as the normal distribution is continuous.
library(fitdistrplus)
library(MASS)
descdist(G3_6$claimcst0, discrete = FALSE)
## summary statistics
## ------
## min: 0 max: 55922.13
## median: 0
## mean: 137.2702
## estimated sd: 1056.298
## estimated skewness: 17.5025
## estimated kurtosis: 482.9257
Based on the Cullen and Frey graph, the possible fit distributions are Gamma, Beta, and Weibull. However, to fit a GLM model, the selected distribution must belong to the exponential family. Since Weibull is not part of the exponential family, it will be excluded as a potential fit.
G3_6_sev <- G3_6[G3_6$claimcst0 > 0, ]
sevgamma <- fitdist(G3_6_sev$claimcst0, "gamma", "mme")
G3_6_beta <- G3_6_sev$claimcst0 / max(G3_6_sev$claimcst0)
sevbeta <- fitdist(G3_6_beta, "beta", "mme")
Compare AIC/BIC and summaries
data.frame(Dist = c("Gamma", "Beta"),
Loglik = c(sevgamma$loglik, sevbeta$loglik),
AIC = c(sevgamma$aic, sevbeta$aic),
BIC = c(sevgamma$bic, sevbeta$bic))
summary(sevgamma)
## Fitting of the distribution ' gamma ' by matching moments
## Parameters :
## estimate Std. Error
## shape 0.3222537552 0.923148113
## rate 0.0001599747 0.000537987
## Loglikelihood: -40576.4 AIC: 81156.79 BIC: 81169.67
## Correlation matrix:
## shape rate
## shape 1.0000000 0.8518301
## rate 0.8518301 1.0000000
summary(sevbeta)
## Fitting of the distribution ' beta ' by matching moments
## Parameters :
## estimate Std. Error
## shape1 0.2746241 1.929803
## shape2 7.3492500 71.819583
## Loglikelihood: -Inf AIC: Inf BIC: Inf
## Correlation matrix:
## shape1 shape2
## shape1 1.000000 1.012924
## shape2 1.012924 1.000000
Visual comparison of the fits
par(mfrow = c(2, 2)) # Arrange plots in 2x2 grid
plot(sevgamma)
plot(sevbeta)
From the fitting process, based on the AIC and BIC goodness-of-fit tests, we found that the Weibull distribution is the best fit for claim severity. The Beta distribution fitting failed, as indicated by the -Inf log-likelihood and Inf AIC/BIC values, meaning the Beta distribution is not suitable for modeling claim severity.
Reordering for the base tariff cell: We typically choose the base tariff cell as the one with the largest exposure (e.g., longest duration). Therefore, we select the tariff cell (MIDDLELOW, HBACK, 2, F, A, 2) as the base tariff cell.
print(basecell<- G3_6[which.max(G3_6$exposure),c(1:12)])
## # A tibble: 1 × 12
## veh_value exposure clm numclaims claimcst0 veh_body veh_age gender area
## <fct> <dbl> <fct> <dbl> <dbl> <fct> <fct> <fct> <fct>
## 1 MIDDLELOW 0.999 0 0 0 HBACK 2 F A
## # ℹ 3 more variables: agecat <fct>, frequency <dbl>, severity <dbl>
We will do a relevel for each rating factors
G3_6$veh_value<- relevel(G3_6$veh_value, as.character(basecell$veh_value))
G3_6$veh_body<- relevel(G3_6$veh_body, as.character(basecell$veh_body))
G3_6$veh_age<- relevel(G3_6$veh_age, as.character(basecell$veh_age))
G3_6$gender<- relevel(G3_6$gender, as.character(basecell$gender))
G3_6$area<- relevel(G3_6$area, as.character(basecell$area))
G3_6$agecat<- relevel(G3_6$agecat, as.character(basecell$agecat))
Splitting 75% of data to be trained and 25% to be tested
train = G3_6[1:floor(0.75*nrow(G3_6)),]
test = G3_6[(floor(0.75*nrow(G3_6))+1):nrow(G3_6),]
train
test
Make new train data but without basecell, this data is going to be used while crossvalidating the models
train_wo_basecell = train[-which.max(G3_6$exposure),]
train_wo_basecell
We will perform GLM modelling for frequency and severity using the possible distributions obtained in number 1.
Note: we use offset to transform the
exposure data from rate to count.
freq_nb = glm.nb(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)),
data = train[train$exposure>0,])
summary(freq_nb)
##
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + veh_age +
## gender + area + agecat + offset(log(exposure)), data = train[train$exposure >
## 0, ], init.theta = 2.724312268, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.76565 0.06748 -26.164 < 2e-16 ***
## veh_valueLOW -0.06731 0.05824 -1.156 0.247789
## veh_valueMIDDLEHIGH 0.05421 0.05060 1.071 0.284072
## veh_valueHIGH 0.16749 0.06325 2.648 0.008098 **
## veh_bodyBUS 0.74877 0.42770 1.751 0.080002 .
## veh_bodyCONVT -1.55225 1.00507 -1.544 0.122486
## veh_bodyCOUPE 0.41529 0.14502 2.864 0.004188 **
## veh_bodyHDTOP 0.08466 0.11593 0.730 0.465212
## veh_bodyMCARA 0.52128 0.30483 1.710 0.087259 .
## veh_bodyMIBUS -0.04588 0.18409 -0.249 0.803200
## veh_bodyPANVN 0.02443 0.15363 0.159 0.873651
## veh_bodyRDSTR -16.32780 1879.40165 -0.009 0.993068
## veh_bodySEDAN 0.02601 0.04684 0.555 0.578594
## veh_bodySTNWG -0.01212 0.06412 -0.189 0.850139
## veh_bodyTRUCK 0.05657 0.11337 0.499 0.617779
## veh_bodyUTE -0.16794 0.08638 -1.944 0.051860 .
## veh_age1 -0.04346 0.05260 -0.826 0.408706
## veh_age3 -0.08878 0.04954 -1.792 0.073143 .
## veh_age4 -0.10840 0.06670 -1.625 0.104110
## genderM -0.02158 0.03590 -0.601 0.547894
## areaB 0.05400 0.05118 1.055 0.291364
## areaC 0.03939 0.04638 0.849 0.395736
## areaD -0.12636 0.06336 -1.994 0.046100 *
## areaE -0.02791 0.06885 -0.405 0.685167
## areaF -0.01541 0.08085 -0.191 0.848871
## agecat1 0.19860 0.06441 3.083 0.002047 **
## agecat3 -0.05581 0.05166 -1.080 0.280073
## agecat4 -0.05609 0.05141 -1.091 0.275304
## agecat5 -0.28619 0.06065 -4.719 2.37e-06 ***
## agecat6 -0.26143 0.07202 -3.630 0.000283 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.7243) family taken to be 1)
##
## Null deviance: 17525 on 50891 degrees of freedom
## Residual deviance: 17385 on 50862 degrees of freedom
## AIC: 25499
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.724
## Std. Err.: 0.691
##
## 2 x log-likelihood: -25437.443
The frequency model with Negative Binomial distribution results on that rating factors that have a significant p-value are: - veh_value (HIGH) - veh_body (COUPE) - area (D) - agecat (1, 5, and 6)
On the other hand, vehicle age and gender do not have significant p-values, which explains that these two ranking factors do not significantly affect the number of claims.
Deviance statistic
cbind(scaled_deviance = freq_nb$deviance, df = freq_nb$df.residual, p = 1-pchisq(freq_nb$deviance, freq_nb$df.residual))
## scaled_deviance df p
## [1,] 17384.99 50862 1
qchisq(0.95, freq_nb$df.residual)
## [1] 51387.75
The frequency model used Negative Binomial distribution is suitable for use because the chisq value > scaled deviance. Thus, based on this analysis, as the deviance in the model is relatively small compared to the chi-squared value, and the p-value suggests that there is no significant evidence against the model fit, the model appears to be suitable.
freq_pois = glm(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)),
data = train[train$exposure>0,], family=poisson("log"))
summary(freq_pois)
##
## Call:
## glm(formula = numclaims ~ veh_value + veh_body + veh_age + gender +
## area + agecat + offset(log(exposure)), family = poisson("log"),
## data = train[train$exposure > 0, ])
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.76700 0.06626 -26.666 < 2e-16 ***
## veh_valueLOW -0.06818 0.05735 -1.189 0.234480
## veh_valueMIDDLEHIGH 0.05449 0.04972 1.096 0.273116
## veh_valueHIGH 0.16690 0.06212 2.687 0.007216 **
## veh_bodyBUS 0.75288 0.41123 1.831 0.067134 .
## veh_bodyCONVT -1.55192 1.00160 -1.549 0.121277
## veh_bodyCOUPE 0.41671 0.14175 2.940 0.003284 **
## veh_bodyHDTOP 0.08737 0.11360 0.769 0.441866
## veh_bodyMCARA 0.52259 0.29580 1.767 0.077273 .
## veh_bodyMIBUS -0.04240 0.18104 -0.234 0.814812
## veh_bodyPANVN 0.02892 0.15046 0.192 0.847592
## veh_bodyRDSTR -10.32610 93.56994 -0.110 0.912126
## veh_bodySEDAN 0.02759 0.04606 0.599 0.549225
## veh_bodySTNWG -0.00987 0.06302 -0.157 0.875560
## veh_bodyTRUCK 0.06048 0.11114 0.544 0.586335
## veh_bodyUTE -0.16457 0.08502 -1.936 0.052914 .
## veh_age1 -0.04218 0.05161 -0.817 0.413733
## veh_age3 -0.08943 0.04864 -1.839 0.065971 .
## veh_age4 -0.10928 0.06556 -1.667 0.095537 .
## genderM -0.02126 0.03528 -0.603 0.546819
## areaB 0.05308 0.05026 1.056 0.290969
## areaC 0.03809 0.04556 0.836 0.403167
## areaD -0.12848 0.06237 -2.060 0.039394 *
## areaE -0.03029 0.06771 -0.447 0.654559
## areaF -0.01674 0.07934 -0.211 0.832877
## agecat1 0.19678 0.06309 3.119 0.001816 **
## agecat3 -0.05641 0.05074 -1.112 0.266228
## agecat4 -0.05634 0.05049 -1.116 0.264497
## agecat5 -0.28613 0.05969 -4.794 1.64e-06 ***
## agecat6 -0.25989 0.07088 -3.667 0.000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 18715 on 50891 degrees of freedom
## Residual deviance: 18569 on 50862 degrees of freedom
## AIC: 25517
##
## Number of Fisher Scoring iterations: 11
The frequency model with Poisson distribution results on that rating factors that have a significant p-value are: - veh_value (HIGH) - veh_body (COUPE) - area (D) - agecat (1, 5, and 6)
On the other hand, vehicle age and gender do not have significant p-values, which explains that these two ranking factors do not significantly affect the number of claims.
Deviance statistic
cbind(scaled_deviance = freq_pois$deviance, df = freq_pois$df.residual, p = 1-pchisq(freq_pois$deviance, freq_pois$df.residual))
## scaled_deviance df p
## [1,] 18569.3 50862 1
qchisq(0.95, freq_pois$df.residual)
## [1] 51387.75
The frequency model used Poisson distribution is suitable for use because the chisq value > scaled deviance. Thus, based on this analysis, as the deviance in the model is relatively small compared to the chi-squared value, and the p-value suggests that there is no significant evidence against the model fit, the model appears to be suitable.
Implementing special cross validation to check model goodness of fit (NB Distribution)
library(stringr)
## Warning: package 'stringr' was built under R version 4.3.3
num_folds = 3
folds = cut(seq(1,nrow(train)),breaks=num_folds,labels=FALSE)
metrics_nb = data.frame('fold' = c(), 'AIC' = c())
for(i in 1:num_folds){
val_idx = which(folds==i,arr.ind=TRUE)
tr = rbind(train_wo_basecell[-val_idx, ], basecell)
val = train_wo_basecell[val_idx, ]
model = glm.nb(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)),
data = tr[tr$exposure>0,])
get_metric = data.frame('fold' = str_glue('fold{i}'), 'AIC' = model$aic)
metrics_nb = rbind(metrics_nb,get_metric)
}
metrics_nb
Implementing special cross validation to check model goodness of fit (Poisson Distribution)
metrics_pois = data.frame('fold' = c(), 'AIC' = c())
for(i in 1:num_folds){
val_idx = which(folds==i,arr.ind=TRUE)
tr = rbind(train_wo_basecell[-val_idx, ], basecell)
val = train_wo_basecell[val_idx, ]
model = glm(numclaims~veh_value+veh_body+veh_age+gender+area+agecat+offset(log(exposure)),
data = tr[tr$exposure>0,], family=poisson("log"))
get_metric = data.frame('fold' = str_glue('fold{i}'), 'AIC' = model$aic)
metrics_pois = rbind(metrics_pois,get_metric)
}
metrics_pois
aic_nbpoi = data.frame(Distribution = c("NB", "Poisson"),
AIC = c(mean(metrics_nb$AIC), mean(metrics_pois$AIC)))
aic_nbpoi
Although from the modelling results with the glm() function there is only a slight difference in the coefficient estimates between NB distribution and Poisson distribution, and the distribution comparison results above, it can be concluded that the Negative Binomial distribution is a more suitable distribution for modelling frequency than the Poisson distribution because of the smaller AIC values. The smaller the AIC indicates that the model not only has a good fit with the data, but also avoids excessive complexity.
sev_gamma <- glm(severity ~ veh_value + veh_body + veh_age + gender + area + agecat, data = train[train$numclaims > 0, ], family = Gamma("log"), weights = numclaims)
summary(sev_gamma)
##
## Call:
## glm(formula = severity ~ veh_value + veh_body + veh_age + gender +
## area + agecat, family = Gamma("log"), data = train[train$numclaims >
## 0, ], weights = numclaims)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.514443 0.121827 61.681 < 2e-16 ***
## veh_valueLOW 0.085779 0.105519 0.813 0.416318
## veh_valueMIDDLEHIGH 0.009238 0.091950 0.100 0.919979
## veh_valueHIGH -0.115149 0.115407 -0.998 0.318467
## veh_bodyBUS -0.276854 0.756794 -0.366 0.714519
## veh_bodyCONVT -1.030755 1.849179 -0.557 0.577283
## veh_bodyCOUPE 0.374727 0.259369 1.445 0.148618
## veh_bodyHDTOP 0.123624 0.208096 0.594 0.552504
## veh_bodyMCARA -1.022644 0.546427 -1.872 0.061361 .
## veh_bodyMIBUS 0.272985 0.334817 0.815 0.414944
## veh_bodyPANVN 0.145277 0.275929 0.527 0.598575
## veh_bodySEDAN -0.148193 0.085072 -1.742 0.081605 .
## veh_bodySTNWG -0.072168 0.117718 -0.613 0.539880
## veh_bodyTRUCK 0.056175 0.203675 0.276 0.782713
## veh_bodyUTE -0.119374 0.158036 -0.755 0.450088
## veh_age1 -0.043576 0.094762 -0.460 0.645654
## veh_age3 0.059946 0.089599 0.669 0.503511
## veh_age4 0.010553 0.118571 0.089 0.929085
## genderM 0.178773 0.065060 2.748 0.006031 **
## areaB -0.020476 0.092474 -0.221 0.824774
## areaC 0.081724 0.083910 0.974 0.330157
## areaD 0.044805 0.114927 0.390 0.696667
## areaE 0.227978 0.124530 1.831 0.067234 .
## areaF 0.499850 0.147097 3.398 0.000686 ***
## agecat1 0.068641 0.116321 0.590 0.555161
## agecat3 -0.089848 0.093310 -0.963 0.335667
## agecat4 -0.167505 0.093281 -1.796 0.072631 .
## agecat5 -0.284988 0.109569 -2.601 0.009336 **
## agecat6 -0.055087 0.130009 -0.424 0.671800
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.375574)
##
## Null deviance: 5623.8 on 3378 degrees of freedom
## Residual deviance: 5427.7 on 3350 degrees of freedom
## AIC: 61175
##
## Number of Fisher Scoring iterations: 8
The severity model with Gamma distribution results in the vehicle value and vehicle age ranking factors not having significant p-values, which explains that these two ranking factors do not significantly affect severity.
On the other hand, the vehicle body rating factor has a significant p-value for the MCARA vehicle type but with a negative coefficient, which means that the MCARA type has a lower severity compared to the baseline.
The male gender rating factor has a highly significant p-value, indicating that male gender drivers have a significant effect on severity and have a higher severity.
For area, area F has a significant p-value and a positive coefficient, indicating that area F has a higher severity compared to the baseline.
For driver age categories, specifically driver age category 5 has a significant p-value but a negative coefficient, indicating that this age category has less severity.
Deviance statistic
cbind(scaled_deviance = sev_gamma$deviance/summary(sev_gamma)$dispersion, df = sev_gamma$df.residual, p = 1-pchisq(sev_gamma$deviance/summary(sev_gamma)$dispersion, sev_gamma$df.residual))
## scaled_deviance df p
## [1,] 1607.921 3350 1
qchisq(0.95, sev_gamma$df.residual)
## [1] 3485.764
The severity model used Gamma distribution is suitable for use because the chisq value > scaled deviance. Thus, based on this analysis, as the deviance in the model is relatively small compared to the chi-squared value, and the p-value suggests that there is no significant evidence against the model fit, the model appears to be suitable.
summary(freq_nb)
##
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + veh_age +
## gender + area + agecat + offset(log(exposure)), data = train[train$exposure >
## 0, ], init.theta = 2.724312268, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.76565 0.06748 -26.164 < 2e-16 ***
## veh_valueLOW -0.06731 0.05824 -1.156 0.247789
## veh_valueMIDDLEHIGH 0.05421 0.05060 1.071 0.284072
## veh_valueHIGH 0.16749 0.06325 2.648 0.008098 **
## veh_bodyBUS 0.74877 0.42770 1.751 0.080002 .
## veh_bodyCONVT -1.55225 1.00507 -1.544 0.122486
## veh_bodyCOUPE 0.41529 0.14502 2.864 0.004188 **
## veh_bodyHDTOP 0.08466 0.11593 0.730 0.465212
## veh_bodyMCARA 0.52128 0.30483 1.710 0.087259 .
## veh_bodyMIBUS -0.04588 0.18409 -0.249 0.803200
## veh_bodyPANVN 0.02443 0.15363 0.159 0.873651
## veh_bodyRDSTR -16.32780 1879.40165 -0.009 0.993068
## veh_bodySEDAN 0.02601 0.04684 0.555 0.578594
## veh_bodySTNWG -0.01212 0.06412 -0.189 0.850139
## veh_bodyTRUCK 0.05657 0.11337 0.499 0.617779
## veh_bodyUTE -0.16794 0.08638 -1.944 0.051860 .
## veh_age1 -0.04346 0.05260 -0.826 0.408706
## veh_age3 -0.08878 0.04954 -1.792 0.073143 .
## veh_age4 -0.10840 0.06670 -1.625 0.104110
## genderM -0.02158 0.03590 -0.601 0.547894
## areaB 0.05400 0.05118 1.055 0.291364
## areaC 0.03939 0.04638 0.849 0.395736
## areaD -0.12636 0.06336 -1.994 0.046100 *
## areaE -0.02791 0.06885 -0.405 0.685167
## areaF -0.01541 0.08085 -0.191 0.848871
## agecat1 0.19860 0.06441 3.083 0.002047 **
## agecat3 -0.05581 0.05166 -1.080 0.280073
## agecat4 -0.05609 0.05141 -1.091 0.275304
## agecat5 -0.28619 0.06065 -4.719 2.37e-06 ***
## agecat6 -0.26143 0.07202 -3.630 0.000283 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.7243) family taken to be 1)
##
## Null deviance: 17525 on 50891 degrees of freedom
## Residual deviance: 17385 on 50862 degrees of freedom
## AIC: 25499
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.724
## Std. Err.: 0.691
##
## 2 x log-likelihood: -25437.443
By looking at the p-value of the regression coefficient we can see that variable veh_age and gender are not significant since the p-value of all dummy related with those variable are all significant. So we doing backward selection.
#eliminate gender
freq_nb1 = glm.nb(numclaims~veh_value+veh_body+veh_age+area+agecat+offset(log(exposure)),
data = train[train$exposure>0,])
summary(freq_nb1)
##
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + veh_age +
## area + agecat + offset(log(exposure)), data = train[train$exposure >
## 0, ], init.theta = 2.724031997, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.77085 0.06693 -26.456 < 2e-16 ***
## veh_valueLOW -0.06622 0.05821 -1.138 0.255306
## veh_valueMIDDLEHIGH 0.05352 0.05059 1.058 0.290069
## veh_valueHIGH 0.16557 0.06317 2.621 0.008766 **
## veh_bodyBUS 0.74324 0.42753 1.738 0.082131 .
## veh_bodyCONVT -1.55198 1.00511 -1.544 0.122565
## veh_bodyCOUPE 0.41222 0.14495 2.844 0.004456 **
## veh_bodyHDTOP 0.07996 0.11568 0.691 0.489409
## veh_bodyMCARA 0.51647 0.30481 1.694 0.090186 .
## veh_bodyMIBUS -0.04650 0.18408 -0.253 0.800578
## veh_bodyPANVN 0.01531 0.15285 0.100 0.920229
## veh_bodyRDSTR -16.33392 1876.66356 -0.009 0.993056
## veh_bodySEDAN 0.02470 0.04679 0.528 0.597625
## veh_bodySTNWG -0.01551 0.06389 -0.243 0.808162
## veh_bodyTRUCK 0.04667 0.11216 0.416 0.677339
## veh_bodyUTE -0.17670 0.08513 -2.076 0.037929 *
## veh_age1 -0.04341 0.05260 -0.825 0.409292
## veh_age3 -0.08956 0.04953 -1.808 0.070550 .
## veh_age4 -0.11097 0.06656 -1.667 0.095471 .
## areaB 0.05413 0.05118 1.058 0.290202
## areaC 0.03964 0.04638 0.855 0.392698
## areaD -0.12485 0.06331 -1.972 0.048601 *
## areaE -0.02672 0.06882 -0.388 0.697865
## areaF -0.01390 0.08081 -0.172 0.863389
## agecat1 0.19772 0.06439 3.071 0.002136 **
## agecat3 -0.05595 0.05166 -1.083 0.278827
## agecat4 -0.05694 0.05139 -1.108 0.267844
## agecat5 -0.28791 0.06058 -4.753 2.01e-06 ***
## agecat6 -0.26466 0.07182 -3.685 0.000228 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.724) family taken to be 1)
##
## Null deviance: 17525 on 50891 degrees of freedom
## Residual deviance: 17385 on 50863 degrees of freedom
## AIC: 25498
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.724
## Std. Err.: 0.691
##
## 2 x log-likelihood: -25437.804
freq_nb2 = glm.nb(numclaims~veh_value+veh_body+area+agecat+offset(log(exposure)),
data = train[train$exposure>0,])
summary(freq_nb2)
##
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + area + agecat +
## offset(log(exposure)), data = train[train$exposure > 0, ],
## init.theta = 2.707842716, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.821e+00 6.119e-02 -29.764 < 2e-16 ***
## veh_valueLOW -1.050e-01 5.015e-02 -2.094 0.036252 *
## veh_valueMIDDLEHIGH 7.991e-02 4.834e-02 1.653 0.098351 .
## veh_valueHIGH 2.121e-01 5.262e-02 4.030 5.58e-05 ***
## veh_bodyBUS 7.014e-01 4.268e-01 1.643 0.100348
## veh_bodyCONVT -1.595e+00 1.005e+00 -1.587 0.112434
## veh_bodyCOUPE 3.755e-01 1.424e-01 2.638 0.008345 **
## veh_bodyHDTOP 3.092e-02 1.102e-01 0.281 0.779032
## veh_bodyMCARA 4.503e-01 3.012e-01 1.495 0.134930
## veh_bodyMIBUS -1.107e-01 1.784e-01 -0.620 0.535104
## veh_bodyPANVN -1.391e-02 1.513e-01 -0.092 0.926774
## veh_bodyRDSTR -1.636e+01 1.877e+03 -0.009 0.993046
## veh_bodySEDAN 7.090e-03 4.518e-02 0.157 0.875290
## veh_bodySTNWG -5.984e-02 5.519e-02 -1.084 0.278218
## veh_bodyTRUCK 4.255e-03 1.084e-01 0.039 0.968696
## veh_bodyUTE -2.145e-01 8.103e-02 -2.647 0.008120 **
## areaB 5.431e-02 5.118e-02 1.061 0.288584
## areaC 4.032e-02 4.638e-02 0.869 0.384622
## areaD -1.267e-01 6.331e-02 -2.002 0.045299 *
## areaE -2.918e-02 6.882e-02 -0.424 0.671605
## areaF -1.998e-02 8.074e-02 -0.247 0.804522
## agecat1 2.004e-01 6.434e-02 3.114 0.001846 **
## agecat3 -5.643e-02 5.167e-02 -1.092 0.274722
## agecat4 -5.697e-02 5.139e-02 -1.109 0.267622
## agecat5 -2.876e-01 6.058e-02 -4.748 2.06e-06 ***
## agecat6 -2.633e-01 7.180e-02 -3.666 0.000246 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.7078) family taken to be 1)
##
## Null deviance: 17519 on 50891 degrees of freedom
## Residual deviance: 17383 on 50866 degrees of freedom
## AIC: 25496
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.708
## Std. Err.: 0.684
##
## 2 x log-likelihood: -25441.632
recall our model summary for severity
summary(sev_gamma)
##
## Call:
## glm(formula = severity ~ veh_value + veh_body + veh_age + gender +
## area + agecat, family = Gamma("log"), data = train[train$numclaims >
## 0, ], weights = numclaims)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.514443 0.121827 61.681 < 2e-16 ***
## veh_valueLOW 0.085779 0.105519 0.813 0.416318
## veh_valueMIDDLEHIGH 0.009238 0.091950 0.100 0.919979
## veh_valueHIGH -0.115149 0.115407 -0.998 0.318467
## veh_bodyBUS -0.276854 0.756794 -0.366 0.714519
## veh_bodyCONVT -1.030755 1.849179 -0.557 0.577283
## veh_bodyCOUPE 0.374727 0.259369 1.445 0.148618
## veh_bodyHDTOP 0.123624 0.208096 0.594 0.552504
## veh_bodyMCARA -1.022644 0.546427 -1.872 0.061361 .
## veh_bodyMIBUS 0.272985 0.334817 0.815 0.414944
## veh_bodyPANVN 0.145277 0.275929 0.527 0.598575
## veh_bodySEDAN -0.148193 0.085072 -1.742 0.081605 .
## veh_bodySTNWG -0.072168 0.117718 -0.613 0.539880
## veh_bodyTRUCK 0.056175 0.203675 0.276 0.782713
## veh_bodyUTE -0.119374 0.158036 -0.755 0.450088
## veh_age1 -0.043576 0.094762 -0.460 0.645654
## veh_age3 0.059946 0.089599 0.669 0.503511
## veh_age4 0.010553 0.118571 0.089 0.929085
## genderM 0.178773 0.065060 2.748 0.006031 **
## areaB -0.020476 0.092474 -0.221 0.824774
## areaC 0.081724 0.083910 0.974 0.330157
## areaD 0.044805 0.114927 0.390 0.696667
## areaE 0.227978 0.124530 1.831 0.067234 .
## areaF 0.499850 0.147097 3.398 0.000686 ***
## agecat1 0.068641 0.116321 0.590 0.555161
## agecat3 -0.089848 0.093310 -0.963 0.335667
## agecat4 -0.167505 0.093281 -1.796 0.072631 .
## agecat5 -0.284988 0.109569 -2.601 0.009336 **
## agecat6 -0.055087 0.130009 -0.424 0.671800
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.375574)
##
## Null deviance: 5623.8 on 3378 degrees of freedom
## Residual deviance: 5427.7 on 3350 degrees of freedom
## AIC: 61175
##
## Number of Fisher Scoring iterations: 8
sev_gamma1 = glm(severity ~ veh_value + veh_body + gender + area + agecat, data = train[train$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma1)
##
## Call:
## glm(formula = severity ~ veh_value + veh_body + gender + area +
## agecat, family = Gamma("log"), data = train[train$numclaims >
## 0, ], weights = numclaims)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.52721 0.11146 67.533 < 2e-16 ***
## veh_valueLOW 0.08709 0.09040 0.963 0.335466
## veh_valueMIDDLEHIGH -0.01306 0.08761 -0.149 0.881475
## veh_valueHIGH -0.14665 0.09655 -1.519 0.128866
## veh_bodyBUS -0.25964 0.75097 -0.346 0.729561
## veh_bodyCONVT -1.00219 1.83471 -0.546 0.584937
## veh_bodyCOUPE 0.39084 0.25522 1.531 0.125772
## veh_bodyHDTOP 0.15327 0.19877 0.771 0.440715
## veh_bodyMCARA -0.97675 0.53732 -1.818 0.069181 .
## veh_bodyMIBUS 0.31171 0.32160 0.969 0.332503
## veh_bodyPANVN 0.14864 0.27277 0.545 0.585847
## veh_bodySEDAN -0.13742 0.08201 -1.676 0.093903 .
## veh_bodySTNWG -0.05366 0.10238 -0.524 0.600198
## veh_bodyTRUCK 0.07087 0.19787 0.358 0.720234
## veh_bodyUTE -0.09363 0.15012 -0.624 0.532848
## genderM 0.17886 0.06468 2.765 0.005719 **
## areaB -0.01738 0.09201 -0.189 0.850148
## areaC 0.08302 0.08352 0.994 0.320274
## areaD 0.05172 0.11436 0.452 0.651091
## areaE 0.23519 0.12388 1.898 0.057716 .
## areaF 0.49873 0.14600 3.416 0.000643 ***
## agecat1 0.06698 0.11575 0.579 0.562837
## agecat3 -0.08678 0.09282 -0.935 0.349847
## agecat4 -0.16741 0.09282 -1.804 0.071386 .
## agecat5 -0.28435 0.10907 -2.607 0.009173 **
## agecat6 -0.06001 0.12940 -0.464 0.642832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.345349)
##
## Null deviance: 5623.8 on 3378 degrees of freedom
## Residual deviance: 5431.5 on 3353 degrees of freedom
## AIC: 61172
##
## Number of Fisher Scoring iterations: 8
sev_gamma2 = glm(severity ~ veh_value + gender + area + agecat, data = train[train$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma2)
##
## Call:
## glm(formula = severity ~ veh_value + gender + area + agecat,
## family = Gamma("log"), data = train[train$numclaims > 0,
## ], weights = numclaims)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.473460 0.105353 70.938 < 2e-16 ***
## veh_valueLOW 0.105667 0.091148 1.159 0.246418
## veh_valueMIDDLEHIGH -0.019356 0.087255 -0.222 0.824459
## veh_valueHIGH -0.132638 0.086906 -1.526 0.127050
## genderM 0.185258 0.063188 2.932 0.003392 **
## areaB -0.001307 0.092922 -0.014 0.988778
## areaC 0.093544 0.084258 1.110 0.266992
## areaD 0.041342 0.114791 0.360 0.718759
## areaE 0.255235 0.123895 2.060 0.039467 *
## areaF 0.504482 0.145296 3.472 0.000523 ***
## agecat1 0.080970 0.116300 0.696 0.486344
## agecat3 -0.083439 0.093680 -0.891 0.373166
## agecat4 -0.183608 0.093246 -1.969 0.049028 *
## agecat5 -0.302600 0.109815 -2.756 0.005891 **
## agecat6 -0.087696 0.129874 -0.675 0.499571
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.421348)
##
## Null deviance: 5623.8 on 3378 degrees of freedom
## Residual deviance: 5475.7 on 3364 degrees of freedom
## AIC: 61185
##
## Number of Fisher Scoring iterations: 7
sev_gamma3 = glm(severity ~ gender + area + agecat, data = train[train$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma3)
##
## Call:
## glm(formula = severity ~ gender + area + agecat, family = Gamma("log"),
## data = train[train$numclaims > 0, ], weights = numclaims)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.45743 0.09089 82.050 < 2e-16 ***
## genderM 0.15572 0.06168 2.525 0.01163 *
## areaB -0.00398 0.09169 -0.043 0.96538
## areaC 0.09838 0.08313 1.183 0.23672
## areaD 0.04234 0.11306 0.374 0.70807
## areaE 0.23664 0.12196 1.940 0.05242 .
## areaF 0.44822 0.14112 3.176 0.00151 **
## agecat1 0.09146 0.11464 0.798 0.42501
## agecat3 -0.06787 0.09244 -0.734 0.46287
## agecat4 -0.15869 0.09195 -1.726 0.08449 .
## agecat5 -0.28132 0.10835 -2.596 0.00946 **
## agecat6 -0.05836 0.12792 -0.456 0.64824
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.331778)
##
## Null deviance: 5623.8 on 3378 degrees of freedom
## Residual deviance: 5499.2 on 3367 degrees of freedom
## AIC: 61198
##
## Number of Fisher Scoring iterations: 7
After doing backward selection and evaluating AIC of the model, it can be inferred that the AIC is getting bigger after backward selection part 2 so we shall do anova.
sev_anova = anova(sev_gamma2,sev_gamma1)
1-pchisq(sev_anova[2,]$Deviance/summary(sev_gamma1)$dispersion,sev_anova[2,]$Df)
## [1] 0.280484
Because p-value = 0.280484 > 0.05 \(H_0\) accepted, thus by anova having variable veh_body is not really impactful. so we chose our current best non-recoded levels model is sev_gamma2. Then we want test it on anova with sev_gamma3
sev_anova = anova(sev_gamma3,sev_gamma2)
1-pchisq(sev_anova[2,]$Deviance/summary(sev_gamma2)$dispersion,sev_anova[2,]$Df)
## [1] 0.07670181
p-value >0.05 \(H_0\) accepted. Thus the variable veh_value is also not impactful. then our best model is sev_gamma3.
Since there are variable with non-significant dummy, it must be recoded and combined to the base cell
# save newdata frame with recoded veh_value
train_rec1 = train
test_rec1 = test
levels(train_rec1$veh_value) = recode(levels(train_rec1$veh_value), 'MIDDLELOW' = 'MIDDLELOW+' )
levels(train_rec1$veh_value) = recode(levels(train_rec1$veh_value), 'MIDDLEHIGH' = 'MIDDLELOW+')
levels(test_rec1$veh_value) = recode(levels(test_rec1$veh_value), 'MIDDLELOW' = 'MIDDLELOW+' )
levels(test_rec1$veh_value) = recode(levels(test_rec1$veh_value), 'MIDDLEHIGH' = 'MIDDLELOW+')
levels(test_rec1$veh_value)
## [1] "MIDDLELOW+" "LOW" "HIGH"
levels(train_rec1$veh_value)
## [1] "MIDDLELOW+" "LOW" "HIGH"
#save new dataframe with recoded veh_value
train_rec2 = train_rec1
test_rec2 = test_rec1
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'HBACK' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'BUS' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'CONVT' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'HDTOP' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'MCARA' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'MIBUS' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'PANVN' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'RDSTR' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'SEDAN' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'STNWG' = 'HBACK+')
levels(train_rec2$veh_body) = recode(levels(train_rec2$veh_body) , 'TRUCK' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'HBACK' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'BUS' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'CONVT' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'HDTOP' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'MCARA' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'MIBUS' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'PANVN' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'RDSTR' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'SEDAN' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'STNWG' = 'HBACK+')
levels(test_rec2$veh_body) = recode(levels(test_rec2$veh_body) , 'TRUCK' = 'HBACK+')
levels(train_rec2$veh_body)
## [1] "HBACK+" "COUPE" "UTE"
levels(test_rec2$veh_body)
## [1] "HBACK+" "COUPE" "UTE"
# save new dataframe
train_rec3 = train_rec2
test_rec3 = test_rec2
levels(train_rec3$area) = recode(levels(train_rec3$area),'A' = 'A+')
levels(train_rec3$area) = recode(levels(train_rec3$area),'B' = 'A+')
levels(train_rec3$area) = recode(levels(train_rec3$area),'C' = 'A+')
levels(train_rec3$area) = recode(levels(train_rec3$area),'E' = 'A+')
levels(train_rec3$area) = recode(levels(train_rec3$area),'F' = 'A+')
levels(test_rec3$area) = recode(levels(test_rec3$area),'A' = 'A+')
levels(test_rec3$area) = recode(levels(test_rec3$area),'B' = 'A+')
levels(test_rec3$area) = recode(levels(test_rec3$area),'C' = 'A+')
levels(test_rec3$area) = recode(levels(test_rec3$area),'E' = 'A+')
levels(test_rec3$area) = recode(levels(test_rec3$area),'F' = 'A+')
levels(train_rec3$area)
## [1] "A+" "D"
levels(test_rec3$area)
## [1] "A+" "D"
levels(train_rec3$agecat)
## [1] "2" "1" "3" "4" "5" "6"
train_rec4 = train_rec3
test_rec4 = test_rec3
levels(train_rec4$agecat) = recode(levels(train_rec4$agecat),'2' = '2+')
levels(train_rec4$agecat) = recode(levels(train_rec4$agecat),'3' = '2+')
levels(train_rec4$agecat) = recode(levels(train_rec4$agecat),'4' = '2+')
levels(test_rec4$agecat) = recode(levels(test_rec4$agecat),'2' = '2+' )
levels(test_rec4$agecat) = recode(levels(test_rec4$agecat),'3' = '2+' )
levels(test_rec4$agecat) = recode(levels(test_rec4$agecat),'4' = '2+' )
levels(train_rec4$agecat)
## [1] "2+" "1" "5" "6"
levels(test_rec4$agecat)
## [1] "2+" "1" "5" "6"
freq_nb_recode = glm.nb(numclaims~veh_value+veh_body+area+agecat+offset(log(exposure)),
data = train_rec4[train_rec4$exposure>0,])
summary(freq_nb_recode)
##
## Call:
## glm.nb(formula = numclaims ~ veh_value + veh_body + area + agecat +
## offset(log(exposure)), data = train_rec4[train_rec4$exposure >
## 0, ], init.theta = 2.639873673, link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.80714 0.02762 -65.425 < 2e-16 ***
## veh_valueLOW -0.13882 0.04371 -3.176 0.001493 **
## veh_valueHIGH 0.13851 0.04013 3.452 0.000557 ***
## veh_bodyCOUPE 0.38839 0.13943 2.786 0.005342 **
## veh_bodyUTE -0.20133 0.07375 -2.730 0.006340 **
## areaD -0.14898 0.05540 -2.689 0.007169 **
## agecat1 0.24270 0.05570 4.357 1.32e-05 ***
## agecat5 -0.24039 0.05120 -4.695 2.67e-06 ***
## agecat6 -0.21338 0.06365 -3.353 0.000801 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.6399) family taken to be 1)
##
## Null deviance: 17492 on 50891 degrees of freedom
## Residual deviance: 17377 on 50883 degrees of freedom
## AIC: 25482
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.640
## Std. Err.: 0.654
##
## 2 x log-likelihood: -25462.236
#in terms of train model. the AIC has improved
Now let’s check whether the model is statistically good fit using deviance
cbind(scaled.deviance=freq_nb_recode$deviance,df=freq_nb_recode$df.residual, p=1-pchisq(freq_nb_recode$deviance, freq_nb_recode$df.residual))
## scaled.deviance df p
## [1,] 17376.72 50883 1
Since p-value = 1, \(H_0\) is accepted thus, model with recoded value have a good fit. Lets see the performance of non recoded model
cbind(scaled.deviance=freq_nb2$deviance,df=freq_nb2$df.residual, p=1-pchisq(freq_nb2$deviance, freq_nb2$df.residual))
## scaled.deviance df p
## [1,] 17382.87 50866 1
Since p-value = 1, \(H_0\) is accepted thus, model with non-recoded levels also have a good fit. Lets see the performance of both model outside training data. This is when we use our test data to validate once more about the predictive power since the statistic deviance test both show a good fit.
#define function to find residual deviance of model
nb_dev_resid = function(y,y_pred,theta){
part1 = y*log(y/y_pred)
part1[is.nan(part1)] = 0
part2 = (y+1/theta)*log((1+theta*y)/(1+theta*y_pred))
part2[is.nan(part2)] = 0
2*sum(part1-part2)
}
#extracting residual deviance for test data on recoded model
y_recode_pred = exp(predict(freq_nb_recode,newdata = test_rec4[test_rec4$exposure>0,]))
y_recode = test_rec4[test_rec4$exposure>0,]$numclaims
y_no_recode_pred = exp(predict(freq_nb2, newdata = test[test$exposure>0,]))
y_no_recode = test[test$exposure>0,]$numclaims
data.frame('model' = c('freq_nb_recode', 'freq_nb2'), 'test_dev_resid' = c(nb_dev_resid(y_recode,y_recode_pred,freq_nb_recode$theta),nb_dev_resid(y_no_recode,y_no_recode_pred,freq_nb2$theta)))
Since residual deviance = 2*(L(saturated)-L(model) and a model is good when L(model) is maximal, then we chose the model with the smallest test residual deviance. To conclude, the best model is model with recoded parameters which is freq_nb_recode.
head(train)
summary(sev_gamma1)
##
## Call:
## glm(formula = severity ~ veh_value + veh_body + gender + area +
## agecat, family = Gamma("log"), data = train[train$numclaims >
## 0, ], weights = numclaims)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.52721 0.11146 67.533 < 2e-16 ***
## veh_valueLOW 0.08709 0.09040 0.963 0.335466
## veh_valueMIDDLEHIGH -0.01306 0.08761 -0.149 0.881475
## veh_valueHIGH -0.14665 0.09655 -1.519 0.128866
## veh_bodyBUS -0.25964 0.75097 -0.346 0.729561
## veh_bodyCONVT -1.00219 1.83471 -0.546 0.584937
## veh_bodyCOUPE 0.39084 0.25522 1.531 0.125772
## veh_bodyHDTOP 0.15327 0.19877 0.771 0.440715
## veh_bodyMCARA -0.97675 0.53732 -1.818 0.069181 .
## veh_bodyMIBUS 0.31171 0.32160 0.969 0.332503
## veh_bodyPANVN 0.14864 0.27277 0.545 0.585847
## veh_bodySEDAN -0.13742 0.08201 -1.676 0.093903 .
## veh_bodySTNWG -0.05366 0.10238 -0.524 0.600198
## veh_bodyTRUCK 0.07087 0.19787 0.358 0.720234
## veh_bodyUTE -0.09363 0.15012 -0.624 0.532848
## genderM 0.17886 0.06468 2.765 0.005719 **
## areaB -0.01738 0.09201 -0.189 0.850148
## areaC 0.08302 0.08352 0.994 0.320274
## areaD 0.05172 0.11436 0.452 0.651091
## areaE 0.23519 0.12388 1.898 0.057716 .
## areaF 0.49873 0.14600 3.416 0.000643 ***
## agecat1 0.06698 0.11575 0.579 0.562837
## agecat3 -0.08678 0.09282 -0.935 0.349847
## agecat4 -0.16741 0.09282 -1.804 0.071386 .
## agecat5 -0.28435 0.10907 -2.607 0.009173 **
## agecat6 -0.06001 0.12940 -0.464 0.642832
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.345349)
##
## Null deviance: 5623.8 on 3378 degrees of freedom
## Residual deviance: 5431.5 on 3353 degrees of freedom
## AIC: 61172
##
## Number of Fisher Scoring iterations: 8
#make new data to store
trainsev1 = train
testsev1 = test
levels(trainsev1$area) = recode(levels(trainsev1$area),'A' = 'A+')
levels(trainsev1$area) = recode(levels(trainsev1$area),'B' = 'A+')
levels(trainsev1$area) = recode(levels(trainsev1$area),'C' = 'A+')
levels(trainsev1$area) = recode(levels(trainsev1$area),'D' = 'A+')
levels(trainsev1$area) = recode(levels(trainsev1$area),'E' = 'A+')
levels(testsev1$area) = recode(levels(testsev1$area),'A' = 'A+')
levels(testsev1$area) = recode(levels(testsev1$area),'B' = 'A+')
levels(testsev1$area) = recode(levels(testsev1$area),'C' = 'A+')
levels(testsev1$area) = recode(levels(testsev1$area),'D' = 'A+')
levels(testsev1$area) = recode(levels(testsev1$area),'E' = 'A+')
levels(trainsev1$area)
## [1] "A+" "F"
levels(testsev1$area)
## [1] "A+" "F"
trainsev2 = trainsev1
testsev2 = testsev1
levels(trainsev2$agecat) = recode(levels(trainsev2$agecat),'2' = '2+')
levels(trainsev2$agecat) = recode(levels(trainsev2$agecat),'1' = '2+')
levels(trainsev2$agecat) = recode(levels(trainsev2$agecat),'3' = '2+')
levels(trainsev2$agecat) = recode(levels(trainsev2$agecat),'4' = '2+')
levels(trainsev2$agecat) = recode(levels(trainsev2$agecat),'6' = '2+')
levels(testsev2$agecat) = recode(levels(testsev2$agecat),'2' = '2+')
levels(testsev2$agecat) = recode(levels(testsev2$agecat),'1' = '2+')
levels(testsev2$agecat) = recode(levels(testsev2$agecat),'3' = '2+')
levels(testsev2$agecat) = recode(levels(testsev2$agecat),'4' = '2+')
levels(testsev2$agecat) = recode(levels(testsev2$agecat),'6' = '2+')
levels(trainsev2$agecat)
## [1] "2+" "5"
levels(testsev2$agecat)
## [1] "2+" "5"
sev_gamma_recode = glm(severity ~ gender + area + agecat, data = trainsev2[trainsev2$numclaims > 0,], family = Gamma("log"), weights = numclaims)
summary(sev_gamma_recode)
##
## Call:
## glm(formula = severity ~ gender + area + agecat, family = Gamma("log"),
## data = trainsev2[trainsev2$numclaims > 0, ], weights = numclaims)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.45669 0.04217 176.805 < 2e-16 ***
## genderM 0.16843 0.06120 2.752 0.00595 **
## areaF 0.41508 0.12913 3.214 0.00132 **
## agecat5 -0.22816 0.08995 -2.537 0.01124 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 3.295482)
##
## Null deviance: 5623.8 on 3378 degrees of freedom
## Residual deviance: 5536.0 on 3375 degrees of freedom
## AIC: 61211
##
## Number of Fisher Scoring iterations: 6
Doing the same step like finding best model for frequency. For severity we try to compare the performance of recoded and non-recoded models, firstly using deviance statistics to show goodness of fit
cbind(scaled.deviance=sev_gamma_recode$deviance/summary(sev_gamma_recode)$dispersion,df=sev_gamma_recode$df.residual, p=1-pchisq(sev_gamma_recode$deviance/summary(sev_gamma_recode)$dispersion, sev_gamma_recode$df.residual))
## scaled.deviance df p
## [1,] 1679.862 3375 1
Since p-value = 1 >0.05, \(H_0\) accepted this shows severity model with recoded levels have a surpasses goodness of fit test. Now, let’s check model with non-recoded levels
cbind(scaled.deviance=sev_gamma3$deviance/summary(sev_gamma3)$dispersion,df=sev_gamma3$df.residual, p=1-pchisq(sev_gamma3$deviance/summary(sev_gamma3)$dispersion, sev_gamma3$df.residual))
## scaled.deviance df p
## [1,] 1650.517 3367 1
Since p-value = 1 >0.05, \(H _0\) accepted this shows severity model with non-recoded levels also have a surpasses goodness of fit test. Now, let’s check model with test data deviance
head(test)
gamma_dev_resid = function(y,y_pred,disp_params){
part1 = -log(y/y_pred)
part1[is.nan(part1)] = 0
part2 = (y-y_pred)/y_pred
part2[is.nan(part2)] = 0
2*sum(part1+part2)/disp_params
}
sev_recode_pred = exp(predict(sev_gamma_recode, newdata = testsev2[testsev2$numclaims>0,]))
sev_recode_test = testsev2[testsev2$numclaims>0,]$severity
sev_no_recode_pred = exp(predict(sev_gamma3, newdata = test[test$numclaims>0,]))
sev_no_recode_test = test[test$numclaims>0,]$severity
data.frame('model' = c('sev_gamma_recode', 'sev_gamma3'), 'test_dev_resid' = c(gamma_dev_resid(sev_recode_test,sev_recode_pred,summary(sev_gamma_recode)$dispersion),nb_dev_resid(sev_no_recode_test,sev_no_recode_pred,summary(sev_gamma3)$dispersion)))
Analogously on how to pick the best frequency model, model with the smallest value of test scaled deviance are chosen. Then best model for severity is sev_gamma3
In insurance pricing, relativities are adjustments made to premium according to different tariff cells. They measure how the expected frequency or severity of claims differs in comparison to a baseline or in this case we call as base cell.
Relativities help ensure policyholders are charged accordingly based on the level of risk that they are transferring to insurer. This way, contributions to the fund required for covering claims are more proportional.
Link function exp(\(\beta_i\)) is used to obtain the value of relativities from each tariff cells. All base cell (and levels that has been merged into their respective base cell) from each rating factors would have a relativity value of 1 as they are used as a baseline or reference group in this calculation.
freq_lower_bound <- exp(freq_nb_recode$coefficients - qnorm(0.975) * sqrt(diag(vcov(freq_nb_recode))))
freq_upper_bound <- exp(freq_nb_recode$coefficients + qnorm(0.975) * sqrt(diag(vcov(freq_nb_recode))))
freq_relative = data.frame(Estimates = freq_nb_recode$coefficients,
Relativities = exp(freq_nb_recode$coefficients),
`Std. Error` = sqrt(diag(vcov(freq_nb_recode))),
`Lower Bound` = freq_lower_bound,
`Upper Bound` = freq_upper_bound,
`95% CI Relativities` = paste0("( ", freq_lower_bound, " , ", freq_upper_bound, " )"),
check.names = F)
freq_relative
The frequency relativity value indicates how much more or less likely a policyholder in a specific level of a rating factor is to make a claim compared to those in the baseline level (base cell). A relativity value of >1 indicates that the variable would increase the value of its baseline, in the amount of their relativity. Contrarily, A relativity value of <1 indicates that the variable would decrease the value of its baseline, in the amount of their relativity.
This could be taken as how much more or less likely that this policyholder would make a claim, or in other word, how much higher or lower of a risk is this policyholder compared to our baseline.
sev_lower_bound <- exp(sev_gamma3$coefficients - qnorm(0.975) * sqrt(diag(vcov(sev_gamma3))))
sev_upper_bound <- exp(sev_gamma3$coefficients + qnorm(0.975) * sqrt(diag(vcov(sev_gamma3))))
sev_relative = data.frame(Estimates = sev_gamma3$coefficients,
Relativities = exp(sev_gamma3$coefficients),
`Std. Error` = sqrt(diag(vcov(sev_gamma3))),
`Lower Bound` = sev_lower_bound,
`Upper Bound` = sev_upper_bound,
check.names = F)
sev_relative
Similarly, the severity relativity value indicates how much more or less severe is a policyholder in a specific level of a rating factor averagely in making claims, compared to those in the baseline level (base cell). A relativity value of >1 indicates that the variable would increase the value of its baseline, in the amount of their relativity. Contrarily, A relativity value of <1 indicates that the variable would decrease the value of its baseline, in the amount of their relativity.
This could be taken as how much more or less costly (in average) that this policyholder in making a claim. Again, in other word, explains how much higher or lower of a risk is this policyholder compared to our baseline.
Usually, when the value of 1 is included in the relativity confidence intervals, it would mean that the variable is not significant enough for the model and is preferably to be merged with their base cell (or even removed altogether). That is because a relativity of 1, that we can write as exp(\(\beta_i\)) = 1, would mean that \(\beta_i\) = 0. An estimate of 0 has no meaning in a model as it does nothing. However, according to Section 4, the severity model that still includes insignificant variables is considered a better model, and therefore, it is selected as our final model.