Modeling claim costs with Bayesian Hierarchical Generalized Linear Models

In this project, we will model claim severity (the expected claim amount per claim) using a hierarchical Gamma generalized linear model (GLM) with a log link, as well as a hierarchical Normal GLM with a log link.

Load packages

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.2     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(ggplot2)
library(caret)
Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift
library(rjags)
Loading required package: coda
Linked to JAGS 4.3.2
Loaded modules: basemod,bugs

Data description

This data set is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67,856 policies, of which 4,624 (6.8%) filed claims.

library(insuranceData)
# library(CASdatasets)
# library(raw)
data(dataCar)
?dataCar
dat <- as_tibble(dataCar)
head(dat)
# A tibble: 6 × 11
  veh_value exposure   clm numclaims claimcst0 veh_body veh_age gender area 
      <dbl>    <dbl> <int>     <int>     <dbl> <fct>      <int> <fct>  <fct>
1      1.06    0.304     0         0         0 HBACK          3 F      C    
2      1.03    0.649     0         0         0 HBACK          2 F      A    
3      3.26    0.569     0         0         0 UTE            2 F      E    
4      4.14    0.318     0         0         0 STNWG          2 F      D    
5      0.72    0.649     0         0         0 HBACK          4 F      C    
6      2.01    0.854     0         0         0 HDTOP          3 M      C    
# ℹ 2 more variables: agecat <int>, X_OBSTAT_ <fct>
str(dat)
tibble [67,856 × 11] (S3: tbl_df/tbl/data.frame)
 $ veh_value: num [1:67856] 1.06 1.03 3.26 4.14 0.72 2.01 1.6 1.47 0.52 0.38 ...
 $ exposure : num [1:67856] 0.304 0.649 0.569 0.318 0.649 ...
 $ clm      : int [1:67856] 0 0 0 0 0 0 0 0 0 0 ...
 $ numclaims: int [1:67856] 0 0 0 0 0 0 0 0 0 0 ...
 $ claimcst0: num [1:67856] 0 0 0 0 0 0 0 0 0 0 ...
 $ veh_body : Factor w/ 13 levels "BUS","CONVT",..: 4 4 13 11 4 5 8 4 4 4 ...
 $ veh_age  : int [1:67856] 3 2 2 2 4 3 3 2 4 4 ...
 $ gender   : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 2 1 1 ...
 $ area     : Factor w/ 6 levels "A","B","C","D",..: 3 1 5 4 3 3 1 2 1 2 ...
 $ agecat   : int [1:67856] 2 4 2 2 2 4 4 6 3 4 ...
 $ X_OBSTAT_: Factor w/ 1 level "01101    0    0    0": 1 1 1 1 1 1 1 1 1 1 ...
summary(dat)
   veh_value         exposure             clm            numclaims      
 Min.   : 0.000   Min.   :0.002738   Min.   :0.00000   Min.   :0.00000  
 1st Qu.: 1.010   1st Qu.:0.219028   1st Qu.:0.00000   1st Qu.:0.00000  
 Median : 1.500   Median :0.446270   Median :0.00000   Median :0.00000  
 Mean   : 1.777   Mean   :0.468651   Mean   :0.06814   Mean   :0.07276  
 3rd Qu.: 2.150   3rd Qu.:0.709103   3rd Qu.:0.00000   3rd Qu.:0.00000  
 Max.   :34.560   Max.   :0.999316   Max.   :1.00000   Max.   :4.00000  
                                                                        
   claimcst0          veh_body        veh_age      gender    area     
 Min.   :    0.0   SEDAN  :22233   Min.   :1.000   F:38603   A:16312  
 1st Qu.:    0.0   HBACK  :18915   1st Qu.:2.000   M:29253   B:13341  
 Median :    0.0   STNWG  :16261   Median :3.000             C:20540  
 Mean   :  137.3   UTE    : 4586   Mean   :2.674             D: 8173  
 3rd Qu.:    0.0   TRUCK  : 1750   3rd Qu.:4.000             E: 5912  
 Max.   :55922.1   HDTOP  : 1579   Max.   :4.000             F: 3578  
                   (Other): 2532                                      
     agecat                     X_OBSTAT_    
 Min.   :1.000   01101    0    0    0:67856  
 1st Qu.:2.000                               
 Median :3.000                               
 Mean   :3.485                               
 3rd Qu.:5.000                               
 Max.   :6.000                               
                                             

Explore and prepare data

We first analyze all the variables independently and also with respect to the dependent variable claimcst0.

ggplot(dat, aes(clm)) +
  geom_bar() +
  ggtitle("Claim occurrence")

ggplot(dat, aes(claimcst0)) +
  geom_histogram() +
  ggtitle("Claim cost")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

We can see an imbalance, as most policyholders did not file a claim. If a policyholder did not file a claim, the claim cost is $0.

Looking at the above graph, and given one key property of the gamma distribution, namely that all outcomes must be strictly positive, we cannot model the data in its current form, where about 93% of observations have a zero outcome.

We will split this data set into two parts: observations without any loss and observations that generated a loss.

without_loss_dat <- dat |> 
  filter(claimcst0 == 0)

with_loss_dat <- dat |> 
  filter(claimcst0 > 0)

We will focus on modeling only positive losses.

# claim cost for the positive subset of the data
ggplot(with_loss_dat, aes(claimcst0)) +
  geom_histogram() +
  ggtitle("Claim cost including outliers")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(with_loss_dat$claimcst0)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  200.0   353.8   761.6  2014.4  2091.4 55922.1 
# 99th percentile
quantile(with_loss_dat$claimcst0, probs = 0.99)
     99% 
17937.13 
dim(with_loss_dat)
[1] 4624   11
sum(with_loss_dat$claimcst0 > 17937.13) # records with loss above 99th percentile
[1] 47
ggplot(with_loss_dat[with_loss_dat$claimcst0 < 17937.13,], aes(claimcst0)) +
  geom_histogram() +
  ggtitle("Claim cost, excluding outliers above 99th percentile")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

# vehicle value
summary(with_loss_dat$veh_value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   1.100   1.570   1.859   2.310  13.900 

Some records have $0 as vehicle value. This does not make sense. We will remove those.

# remove records with veh_value = 0
with_loss_dat <- with_loss_dat |> filter(veh_value != 0)
summary(with_loss_dat$veh_value)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.220   1.100   1.570   1.862   2.317  13.900 
boxplot(dat$veh_value, horizontal = TRUE, main = "overall data, vehicle value in $10,000s")

boxplot(with_loss_dat$veh_value, horizontal = TRUE, 
        main = "claim occurrence data, vehicle value in $10,000s")

We can convert the continuous vehicle value into a categorical variable to minimize the effect of outliers. We create five categories for vehicle value.

hist(with_loss_dat$veh_value)

# define breaks for the categories
breaks <- c(0, 0.5, 1.5, 2, 3, Inf)

# create categorical variable based on the breaks
with_loss_dat$veh_value_cat <- 
  cut(with_loss_dat$veh_value, breaks = breaks, 
      labels = c("Very Low", "Low", "Medium", "High", "Very High"))

summary(with_loss_dat$veh_value_cat)
 Very Low       Low    Medium      High Very High 
      228      1925       987       829       649 
# veh_body
veh_body_counts <- table(with_loss_dat$veh_body)
veh_body_counts <- veh_body_counts[order(-veh_body_counts)]
veh_body_counts

SEDAN HBACK STNWG   UTE HDTOP TRUCK COUPE PANVN MIBUS MCARA   BUS CONVT RDSTR 
 1476  1264  1170   260   130   119    68    62    43    14     7     3     2 
# Create the bar plot
barplot(veh_body_counts, las = 2)  # las = 2 to rotate x-axis labels vertically

We see several categories with very low frequency, which could hurt training for this feature due to a lack of data points. Let's merge the bottom four categories into "OTHERS".

with_loss_dat$veh_body <- with_loss_dat$veh_body |> 
  recode("RDSTR" = "OTHERS",
         "CONVT" = "OTHERS",
         "BUS" = "OTHERS",
         "MCARA" = "OTHERS")

The "exposure" variable indicates the exposure period for each policy. In the context of car insurance, it represents the covered time in years. Recall that this data set only contains policies taken out in 2004 or 2005.

# exposure
summary(with_loss_dat$exposure)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.002738 0.410678 0.636550 0.610834 0.832307 0.999316 
# numclaims
hist(with_loss_dat$numclaims)

table(with_loss_dat$numclaims)

   1    2    3    4 
4329  269   18    2 

We can see that, among policies with a claim, the majority have only one claim. Nothing unusual here. Let's group this variable into either one claim or more than one claim.

with_loss_dat$nclaim_cat <- ifelse(with_loss_dat$numclaims == 1, 
                                    "One claim", "More than one claim")
with_loss_dat$nclaim_cat <- as.factor(with_loss_dat$nclaim_cat)
# vehicle age
with_loss_dat$veh_age <- as.factor(with_loss_dat$veh_age)
plot(with_loss_dat$veh_age, 
     xlab = "vehicle age",
     ylab = "count")

boxplot(log(claimcst0) ~ veh_age, data=with_loss_dat, horizontal = T)

# gender
plot(with_loss_dat$gender)

boxplot(log(claimcst0) ~ gender, data=with_loss_dat, horizontal = T)

The distribution across vehicle age categories is not too imbalanced, and the same holds for gender.

We see a general increase in the median of claim cost as the vehicle age increases.

Male policyholders also seem to have slightly higher claim costs.

boxplot(log(claimcst0) ~ area, data=with_loss_dat, horizontal = T)

ggplot(with_loss_dat,aes(log(claimcst0),area)) +
  geom_violin()

Each area appears to have slightly different median log claim costs.

A, B, C appear to have similar distributions.

D, E, F appear to also have similar distributions.

This might mean there are two main areas split into six sub-areas.

hist(log(with_loss_dat$claimcst0),breaks=25)

with_loss_dat |> 
  group_by(area) |> 
  summarise(sd_area = sqrt(var(claimcst0)))
# A tibble: 6 × 2
  area  sd_area
  <fct>   <dbl>
1 A       3606.
2 B       2982.
3 C       3403.
4 D       3025.
5 E       3840.
6 F       5582.

The log of claimcst0 roughly follows a normal distribution, except for the two peaks at the lower end of the distribution. The variance of claimcst0 also differs between areas.

Based on these two observations, we can conduct a Welch’s ANOVA test to see if there are statistically significant differences in claim costs among different areas.

oneway.test(log(claimcst0) ~ area, data=with_loss_dat, var.equal=F)

    One-way analysis of means (not assuming equal variances)

data:  log(claimcst0) and area
F = 4.0998, num df = 5.0, denom df = 1377.1, p-value = 0.001065

The p-value is below 0.05, indicating that the log claim costs differ significantly between at least some pairs of areas.
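As a follow-up (a sketch, not part of the original workflow), pairwise Welch-type t-tests on the log claim costs can indicate which pairs of areas differ, with an adjustment for multiple comparisons.

# pairwise comparisons without pooling standard deviations, Holm-adjusted
pairwise.t.test(log(with_loss_dat$claimcst0), with_loss_dat$area,
                pool.sd = FALSE, p.adjust.method = "holm")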

with_loss_dat$agecat <- as.factor(with_loss_dat$agecat)
boxplot(log(claimcst0) ~ agecat, data=with_loss_dat, horizontal = T)

Now, we are done with the necessary data check and transformation. We can move to the modeling steps.

Modeling

Split into train and test sets

Let’s first split the preprocessed data set into train and test set.

set.seed(98)
dat_partition <- createDataPartition(with_loss_dat$claimcst0, 
                                      times = 1,p = 0.8,list = FALSE)
str(dat_partition)
 int [1:3696, 1] 1 2 3 4 5 8 10 11 12 13 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : chr "Resample1"
train <- with_loss_dat[dat_partition,]
test  <- with_loss_dat[-dat_partition,]

Understand model choices

Based on industry experience, a GLM with a gamma distribution and an OLS model with a log-transformed response are likely good fits for this scenario.

Here is a comparison of the original distribution versus its log-transformed value.

hist(with_loss_dat$claimcst0, breaks = 25)

hist(log(with_loss_dat$claimcst0), breaks = 25)

The first histogram suggests that the data is very close to a gamma distribution, and the second suggests we could try OLS with the log-transformed response variable, though the log-transformed distribution is far from perfectly normal and has two peaks at the lower end.

Recall that claim severity is the expected claim amount per claim. Note that each row in the data set represents a policy, which can have more than one claim. Thus, the claim amount column is the total cost for each policy.

We need a technique that mathematically converts the total claim amount into claim severity. We want the number of claims to influence the model results without actually estimating a coefficient for it. This is done with an offset:

\[
\log(\mathrm{E}[\text{claimcst0}_i]) = f(X_i) + \log(\text{numclaims}_i)
\]

where \(f(X_i)\) is the linear combination of predictors and the second term is the offset.

The offset effectively scales the expected total claim cost by the number of claims for each policy, allowing us to focus on modeling the conditional relationship between the predictors and the claim severity (total claim cost per claim).

We avoid directly estimating a coefficient for numclaims because we have a specific theoretical understanding that the number of claims has a fixed, multiplicative effect on the response variable, i.e., the total claim cost, irrespective of the predictor values.

Note that we set aside the option of a normal linear model on the log-transformed response variable because it cannot handle the offset term in the same way.
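As a quick sanity check of this offset formulation (a sketch only, not part of the Bayesian analysis below), a non-hierarchical frequentist gamma GLM with the same log link and offset could be fit with glm(), treating area as an ordinary factor:

# non-hierarchical frequentist analogue: gamma GLM, log link, log(numclaims) offset
freq_check <- glm(claimcst0 ~ veh_value_cat + veh_body + veh_age +
                    gender + agecat + area,
                  family = Gamma(link = "log"),
                  offset = log(numclaims),
                  data   = train)
summary(freq_check)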

We will also assume that there is a grouping effect in the claim costs for policyholders based on their location, indicated by the area variable. We will reflect this in our model by giving each area its own intercept, with each intercept drawn from a common normal distribution.

Specify and fit models

We will specify a hierarchical gamma GLM with a log link, as well as one with a normal likelihood and a log link. We will fit both models and, after assessment and comparison, finalize the better one.

  • Target variable: the claim cost (claimcst0).

  • Independent variables: vehicle body, vehicle age, gender, area, age category, and vehicle value category, all treated as categorical. The claim indicator and exposure are dropped as they are not required.

  • Offset: the number of claims.
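In symbols, the gamma model specified below is (using the JAGS parameterization dgamma(shape, rate), so that the mean of \(y_i\) is \(\mu_i\)); the Normal variant simply replaces the gamma likelihood with a normal one:

\[
\begin{aligned}
y_i &\sim \mathrm{Gamma}\!\left(\text{shape},\ \text{shape}/\mu_i\right)\\
\log(\mu_i) &= a_{\mathrm{area}[i]} + \sum_{k=1}^{5} b_k\, x_{ik} + \log(\text{numclaims}_i)\\
a_j &\sim \mathrm{N}\!\left(a_0,\ \tau^2\right), \qquad j = 1, \dots, 6
\end{aligned}
\]

where \(x_{i1}, \dots, x_{i5}\) are the numerically coded predictors listed above.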

## GAMMA GLM
mod_string <- " model {
  for (i in 1:length(y)) {
    y[i] ~ dgamma(shape, shape/mu[i])
    log(mu[i]) = a[area[i]] + b[1]*veh_value_cat[i] + b[2]*veh_body[i] + b[3]*veh_age[i] + b[4]*gender[i] + b[5]*agecat[i] + log(numclaims[i])
  }
  
  ## weakly informative prior for the degree of skewness of likelihood
  shape ~ dgamma(0.01,1.0/0.01) 
  
  for (j in 1:max(area)) {
    a[j] ~ dnorm(a0, prec_a)
  }
  a0 ~ dnorm(0.0, 1.0/1.0e6)
  prec_a ~ dgamma(0.01, 1.0/0.01)
  tau = sqrt(1.0/prec_a) ## standard deviation
  
  for (j in 1:5) {
    b[j] ~ dnorm(0.0, 1.0/1.0e6)
  }

} "

set.seed(101)

data_jags <- list(y=train$claimcst0,
                  area=as.numeric(train$area),
                  veh_value_cat=as.numeric(train$veh_value_cat),
                  veh_body=as.numeric(train$veh_body),
                  veh_age=as.numeric(train$veh_age),
                  gender=as.numeric(train$gender),
                  agecat=as.numeric(train$agecat),
                  numclaims=train$numclaims)

params <- c("shape", "a0", "tau", "a", "b")

mod = jags.model(textConnection(mod_string), data=data_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 3696
   Unobserved stochastic nodes: 14
   Total graph size: 35204

Initializing model
update(mod, 1e3) # burn-in

mod_sim = coda.samples(model=mod,
                       variable.names=params,
                       n.iter=5e3)

mod_csim = as.mcmc(do.call(rbind, mod_sim)) # combine multiple chains

dic_gamma <- dic.samples(mod, n.iter = 1e3)
## convergence diagnostics
par(mar=c(1,1,1,1))
plot(mod_sim)

gelman.diag(mod_sim)
Potential scale reduction factors:

      Point est. Upper C.I.
a[1]           1       1.01
a[2]           1       1.01
a[3]           1       1.01
a[4]           1       1.01
a[5]           1       1.01
a[6]           1       1.01
a0             1       1.00
b[1]           1       1.01
b[2]           1       1.01
b[3]           1       1.01
b[4]           1       1.00
b[5]           1       1.00
shape          1       1.00
tau            1       1.00

Multivariate psrf

1.01
# gelman.plot(mod_sim)

autocorr.diag(mod_sim)
            a[1]      a[2]      a[3]      a[4]      a[5]      a[6]           a0
Lag 0  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000  1.000000000
Lag 1  0.9167821 0.9092965 0.9366472 0.8455384 0.8325978 0.7894979 -0.008550014
Lag 5  0.8229910 0.8136244 0.8465045 0.7344129 0.7182343 0.6699913  0.010729602
Lag 10 0.7364302 0.7237622 0.7519686 0.6573433 0.6400562 0.5964011  0.007137061
Lag 50 0.1714422 0.1660728 0.1657687 0.1512886 0.1436664 0.1335162  0.022064126
            b[1]       b[2]       b[3]        b[4]       b[5]        shape
Lag 0  1.0000000 1.00000000 1.00000000 1.000000000 1.00000000  1.000000000
Lag 1  0.9471786 0.93135742 0.92989719 0.922259279 0.88362619  0.224427158
Lag 5  0.7735444 0.69189856 0.72186457 0.665727111 0.53558191  0.014595557
Lag 10 0.6156303 0.45651578 0.55818515 0.434706983 0.29925771 -0.004993518
Lag 50 0.1102934 0.00837784 0.09375806 0.004248468 0.02916688  0.007947822
                tau
Lag 0   1.000000000
Lag 1   0.231270803
Lag 5  -0.002025309
Lag 10 -0.007617170
Lag 50 -0.013137700
autocorr.plot(mod_sim)

effectiveSize(mod_sim)
      a[1]       a[2]       a[3]       a[4]       a[5]       a[6]         a0 
  202.9424   198.8500   189.8654   224.3960   233.0960   243.0859 15232.5111 
      b[1]       b[2]       b[3]       b[4]       b[5]      shape        tau 
  366.6049   550.5071   414.9538   623.6640   905.9078  9708.1701  9748.4696 
raftery.diag(mod_sim) # if reliable posterior intervals are of interest, we would want to run more iterations
[[1]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 
                                             
       Burn-in  Total Lower bound  Dependence
       (M)      (N)   (Nmin)       factor (I)
 a[1]  30       31290 3746          8.35     
 a[2]  44       48492 3746         12.90     
 a[3]  30       33840 3746          9.03     
 a[4]  24       30560 3746          8.16     
 a[5]  20       21980 3746          5.87     
 a[6]  32       35052 3746          9.36     
 a0    2        3930  3746          1.05     
 b[1]  28       30620 3746          8.17     
 b[2]  22       23208 3746          6.20     
 b[3]  24       24968 3746          6.67     
 b[4]  24       26516 3746          7.08     
 b[5]  15       17328 3746          4.63     
 shape 4        5299  3746          1.41     
 tau   2        3866  3746          1.03     


[[2]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 
                                             
       Burn-in  Total Lower bound  Dependence
       (M)      (N)   (Nmin)       factor (I)
 a[1]  57       60084 3746         16.000    
 a[2]  42       40641 3746         10.800    
 a[3]  72       80416 3746         21.500    
 a[4]  33       33624 3746          8.980    
 a[5]  42       43023 3746         11.500    
 a[6]  49       50204 3746         13.400    
 a0    3        4129  3746          1.100    
 b[1]  20       20953 3746          5.590    
 b[2]  39       41037 3746         11.000    
 b[3]  16       17506 3746          4.670    
 b[4]  18       17668 3746          4.720    
 b[5]  21       21543 3746          5.750    
 shape 5        6078  3746          1.620    
 tau   2        3741  3746          0.999    


[[3]]

Quantile (q) = 0.025
Accuracy (r) = +/- 0.005
Probability (s) = 0.95 
                                             
       Burn-in  Total Lower bound  Dependence
       (M)      (N)   (Nmin)       factor (I)
 a[1]  66       83598 3746         22.30     
 a[2]  75       86025 3746         23.00     
 a[3]  64       66956 3746         17.90     
 a[4]  48       60468 3746         16.10     
 a[5]  33       32631 3746          8.71     
 a[6]  60       54720 3746         14.60     
 a0    3        4410  3746          1.18     
 b[1]  24       24700 3746          6.59     
 b[2]  19       20303 3746          5.42     
 b[3]  22       24236 3746          6.47     
 b[4]  30       31166 3746          8.32     
 b[5]  20       23540 3746          6.28     
 shape 5        5483  3746          1.46     
 tau   2        3803  3746          1.02     

For most parameters, the chains are not mixing well due to high autocorrelation, which means they are not exploring the posterior efficiently. Poor mixing can lead to slow convergence and unreliable estimates of posterior intervals.

However, the chains do appear to have converged, as indicated by the roughly flat running averages in the trace plots and the Gelman-Rubin factors near 1, and the effective sample size for each parameter is sufficient for reasonable posterior mean estimates.
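If more reliable posterior intervals were needed, as the Raftery diagnostic suggests, one option (sketched below, not run for this report) would be to draw more samples from the already-compiled model and thin them:

# continue sampling with thinning to reduce autocorrelation in the stored draws
mod_sim_long <- coda.samples(model = mod,
                             variable.names = params,
                             n.iter = 5e4,
                             thin = 10)
effectiveSize(mod_sim_long)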

summary(mod_sim)

Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

           Mean       SD  Naive SE Time-series SE
a[1]   7.470616 0.125341 1.023e-03      0.0088851
a[2]   7.445147 0.124819 1.019e-03      0.0088530
a[3]   7.554159 0.122375 9.992e-04      0.0089110
a[4]   7.442163 0.133226 1.088e-03      0.0090331
a[5]   7.697399 0.139561 1.140e-03      0.0091507
a[6]   7.791280 0.146354 1.195e-03      0.0094402
a0     7.572800 3.355018 2.739e-02      0.0273219
b[1]   0.002074 0.023289 1.902e-04      0.0012284
b[2]  -0.026678 0.009536 7.786e-05      0.0004065
b[3]   0.045765 0.022783 1.860e-04      0.0011138
b[4]   0.183872 0.040259 3.287e-04      0.0016154
b[5]  -0.061864 0.013679 1.117e-04      0.0004559
shape  0.739416 0.014715 1.202e-04      0.0001497
tau    7.534628 3.191976 2.606e-02      0.0324564

2. Quantiles for each variable:

           2.5%      25%       50%      75%     97.5%
a[1]   7.226368  7.38902  7.472316  7.55243  7.721892
a[2]   7.205547  7.36341  7.444146  7.52594  7.694160
a[3]   7.316480  7.47540  7.554598  7.63181  7.799058
a[4]   7.180923  7.35576  7.440387  7.52788  7.706972
a[5]   7.420490  7.60676  7.697841  7.78937  7.969732
a[6]   7.505063  7.69542  7.790632  7.88864  8.079555
a0     0.784013  5.68749  7.563198  9.47436 14.145780
b[1]  -0.043896 -0.01374  0.002488  0.01825  0.046236
b[2]  -0.045373 -0.03301 -0.026541 -0.02030 -0.007826
b[3]   0.002553  0.03025  0.045252  0.06077  0.092552
b[4]   0.104790  0.15700  0.183330  0.21133  0.260753
b[5]  -0.088713 -0.07080 -0.061986 -0.05308 -0.034190
shape  0.710982  0.72938  0.739431  0.74921  0.768773
tau    3.942550  5.48962  6.806328  8.70290 15.332972

We can interpret \(a0\) as the overall mean intercept and \(tau\) as the standard deviation of intercepts across areas.

The area-specific intercept \(a[j]\) is the intercept for the j-th area; its deviation from the overall intercept \(a0\) captures the effect of that specific area on claim costs, after accounting for differences due to the other predictor variables.
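For example, the area intercepts can be compared directly from the combined posterior samples in mod_csim (a sketch; area levels A through F map to indices 1 through 6):

# posterior probability that area F's intercept exceeds area A's
mean(mod_csim[, "a[6]"] > mod_csim[, "a[1]"])

# posterior mean deviation of each area intercept from the overall intercept a0
colMeans(as.matrix(mod_csim)[, paste0("a[", 1:6, "]")]) - mean(mod_csim[, "a0"])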

The regression coefficients of this model are hard to interpret because the factor predictors were converted from factor type to numeric codes rather than one-hot (dummy) encoded. This could be improved in the next iteration by using one-hot encoding, as sketched below.
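A possible sketch of that next iteration (assuming the same train data, priors, and hierarchical area intercepts; not fit in this report) would build a dummy-coded design matrix with model.matrix() and estimate one coefficient per column:

# one-hot (dummy) encode the factor predictors and pass the design matrix to JAGS
X_train <- model.matrix(~ veh_value_cat + veh_body + veh_age + gender + agecat,
                        data = train)[, -1]   # drop the intercept column

mod_ohe_string <- " model {
  for (i in 1:n) {
    y[i] ~ dgamma(shape, shape/mu[i])
    log(mu[i]) = a[area[i]] + inprod(X[i,], b[]) + log(numclaims[i])
  }
  for (j in 1:max(area)) { a[j] ~ dnorm(a0, prec_a) }
  a0 ~ dnorm(0.0, 1.0/1.0e6)
  prec_a ~ dgamma(0.01, 1.0/0.01)
  shape ~ dgamma(0.01, 1.0/0.01)
  for (k in 1:p) { b[k] ~ dnorm(0.0, 1.0/1.0e6) }
} "

data_ohe <- list(y = train$claimcst0, X = X_train, n = nrow(X_train),
                 p = ncol(X_train), area = as.numeric(train$area),
                 numclaims = train$numclaims)
# mod_ohe <- jags.model(textConnection(mod_ohe_string), data = data_ohe, n.chains = 3)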

Let’s now fit the Normal hierarchical glm with log link.

## GAUSSIAN GLM WITH LOG LINK
mod2_string <- " model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu[i], prec)
    log(mu[i]) = a[area[i]] + b[1]*veh_value_cat[i] + b[2]*veh_body[i] + b[3]*veh_age[i] + b[4]*gender[i] + b[5]*agecat[i] + log(numclaims[i])
  }
  
  for (j in 1:max(area)) {
    a[j] ~ dnorm(a0, prec_a)
  }
  a0 ~ dnorm(0.0, 1.0/1.0e6)
  prec_a ~ dgamma(0.01,1.0/0.01)
  tau = sqrt(1.0 /prec_a)
  
  for (j in 1:5) {
    b[j] ~ dnorm(0.0, 1.0/1.0e6)
  }
  
  prec ~ dgamma(0.01, 1.0/0.01)
  sig = sqrt(1.0/prec)
  
} "

set.seed(101)

data2_jags <- list(y=train$claimcst0,
                  area=as.numeric(train$area),
                  veh_value_cat=as.numeric(train$veh_value_cat),
                  veh_body=as.numeric(train$veh_body),
                  veh_age=as.numeric(train$veh_age),
                  gender=as.numeric(train$gender),
                  agecat=as.numeric(train$agecat),
                  numclaims=train$numclaims)

params2 <- c("sig", "a0", "tau", "a", "b")

mod2 = jags.model(textConnection(mod2_string), data=data2_jags, n.chains=3)
Compiling model graph
   Resolving undeclared variables
   Allocating nodes
Graph information:
   Observed stochastic nodes: 3696
   Unobserved stochastic nodes: 14
   Total graph size: 33345

Initializing model
update(mod2, 1e3) # burn-in

mod2_sim = coda.samples(model=mod2,
                       variable.names=params2,
                       n.iter=5e3)

mod2_csim = as.mcmc(do.call(rbind, mod2_sim)) # combine multiple chains

dic_Norm_logLink <- dic.samples(mod2, n.iter = 1e3)
## convergence diagnostics
par(mar=c(1,1,1,1))
plot(mod2_sim)

gelman.diag(mod2_sim)
Potential scale reduction factors:

     Point est. Upper C.I.
a[1]       1.02       1.08
a[2]       1.02       1.08
a[3]       1.02       1.08
a[4]       1.02       1.07
a[5]       1.02       1.08
a[6]       1.02       1.08
a0         1.00       1.00
b[1]       1.00       1.01
b[2]       1.01       1.02
b[3]       1.01       1.02
b[4]       1.00       1.01
b[5]       1.01       1.05
sig        1.00       1.00
tau        1.00       1.00

Multivariate psrf

1.03
# gelman.plot(mod_sim)

autocorr.diag(mod2_sim)
            a[1]      a[2]      a[3]      a[4]      a[5]      a[6]          a0
Lag 0  1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.000000000
Lag 1  0.9039102 0.8952467 0.9290081 0.8122482 0.8593210 0.8519131 0.006812066
Lag 5  0.8030852 0.7940320 0.8358333 0.6914850 0.7510186 0.7406355 0.012472898
Lag 10 0.7146814 0.7073708 0.7449205 0.6222933 0.6673708 0.6602780 0.006049749
Lag 50 0.2321473 0.2277453 0.2451839 0.2097478 0.2142120 0.2240939 0.009020322
            b[1]       b[2]      b[3]         b[4]        b[5]          sig
Lag 0  1.0000000 1.00000000 1.0000000  1.000000000  1.00000000  1.000000000
Lag 1  0.9437673 0.91433409 0.9297732  0.927726156  0.87974536  0.010613891
Lag 5  0.7507380 0.63634248 0.7205831  0.680474406  0.52300992  0.006551702
Lag 10 0.5846717 0.38440274 0.5544085  0.452968196  0.27241551 -0.003299073
Lag 50 0.1516437 0.01397482 0.1480127 -0.002236417 -0.01721304 -0.007052097
                tau
Lag 0   1.000000000
Lag 1   0.216145054
Lag 5   0.006686331
Lag 10  0.004905119
Lag 50 -0.007365991
autocorr.plot(mod2_sim)

effectiveSize(mod2_sim)
      a[1]       a[2]       a[3]       a[4]       a[5]       a[6]         a0 
  200.5175   204.1044   185.6379   228.4084   219.0606   224.0909 13946.7356 
      b[1]       b[2]       b[3]       b[4]       b[5]        sig        tau 
  424.6970   697.7683   486.5429   576.4584   959.4208 14759.6889  9464.7385 
summary(mod2_sim)

Iterations = 2001:7000
Thinning interval = 1 
Number of chains = 3 
Sample size per chain = 5000 

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

           Mean       SD  Naive SE Time-series SE
a[1]    7.37701  0.18210 0.0014869      0.0129422
a[2]    7.40155  0.18385 0.0015011      0.0128858
a[3]    7.43943  0.17726 0.0014473      0.0130740
a[4]    7.36714  0.20562 0.0016789      0.0137141
a[5]    7.60240  0.19906 0.0016253      0.0134420
a[6]    7.64691  0.21138 0.0017259      0.0141758
a0      7.47601  3.26689 0.0266741      0.0279727
b[1]    0.02447  0.03333 0.0002721      0.0016119
b[2]   -0.02761  0.01350 0.0001102      0.0005110
b[3]    0.03892  0.03336 0.0002724      0.0015323
b[4]    0.20676  0.05922 0.0004835      0.0024690
b[5]   -0.07689  0.02238 0.0001828      0.0007191
sig  3577.40197 41.60241 0.3396823      0.3425466
tau     7.45582  3.02462 0.0246959      0.0310877

2. Quantiles for each variable:

           2.5%        25%        50%        75%      97.5%
a[1]    7.02181  7.254e+00    7.37890    7.50251  7.728e+00
a[2]    7.04147  7.277e+00    7.40315    7.53062  7.755e+00
a[3]    7.09654  7.317e+00    7.44327    7.56504  7.778e+00
a[4]    6.95631  7.225e+00    7.37207    7.51070  7.763e+00
a[5]    7.21376  7.465e+00    7.60464    7.74226  7.983e+00
a[6]    7.23130  7.501e+00    7.65162    7.79539  8.049e+00
a0      0.80866  5.605e+00    7.46077    9.31465  1.407e+01
b[1]   -0.04147  1.617e-03    0.02490    0.04749  8.757e-02
b[2]   -0.05417 -3.681e-02   -0.02754   -0.01857 -1.862e-03
b[3]   -0.02436  1.558e-02    0.03853    0.06213  1.040e-01
b[4]    0.09267  1.670e-01    0.20678    0.24664  3.227e-01
b[5]   -0.12097 -9.186e-02   -0.07667   -0.06155 -3.419e-02
sig  3496.21204  3.549e+03 3577.19623 3605.15829  3.660e+03
tau     3.96171  5.468e+00    6.73305    8.61797  1.524e+01
# model comparison based on train data
dic_gamma
Mean deviance:  63317 
penalty 11.95 
Penalized deviance: 63329 
dic_Norm_logLink
Mean deviance:  70973 
penalty 12.01 
Penalized deviance: 70985 

Based on the fit to the train set, the preferred model is the gamma GLM due to its lower DIC.

Predict test data

# predictions from gamma GLM
pm_params_gamma <- colMeans(mod_csim)

test_preped <- test |> 
  select(claimcst0,area,veh_value_cat,veh_body,
         veh_age,gender,agecat,numclaims) |> 
  mutate(area=as.numeric(area),
         veh_value_cat=as.numeric(veh_value_cat),
         veh_body=as.numeric(veh_body),
         veh_age=as.numeric(veh_age),
         gender=as.numeric(gender),
         agecat=as.numeric(agecat),
         numclaims=log(numclaims))

# Extract relevant parameters from the posterior mean estimates
intercept_area <- pm_params_gamma[grep("^a", names(pm_params_gamma))]
coefficients_b <- pm_params_gamma[grep("^b", names(pm_params_gamma))]

# Calculate the linear predictor for each observation in the test set
linear_predictor <- as.matrix(test_preped[, -(1:2)]) %*%  c(coefficients_b,1)

# Add the area-specific intercepts to the linear predictor based on the area of each observation
linear_predictor <- linear_predictor + intercept_area[test_preped$area]

# Apply the inverse of the log link to obtain the predicted mean for each observation
predicted_mean <- drop(exp(linear_predictor))
test$pred_gamma <- predicted_mean
pm_params_gauss <- colMeans(mod2_csim)

test_preped <- test |>
  select(claimcst0,area,veh_value_cat,veh_body,
         veh_age,gender,agecat,numclaims) |>
  mutate(area=as.numeric(area),
         veh_value_cat=as.numeric(veh_value_cat),
         veh_body=as.numeric(veh_body),
         veh_age=as.numeric(veh_age),
         gender=as.numeric(gender),
         agecat=as.numeric(agecat),
         numclaims=log(numclaims))

# Extract relevant parameters from the posterior mean estimates
intercept_area_gauss <- pm_params_gauss[grep("^a", names(pm_params_gauss))]
coefficients_b_gauss <- pm_params_gauss[grep("^b", names(pm_params_gauss))]

# Calculate the linear predictor for each observation in the test set
linear_predictor_gauss <- as.matrix(test_preped[, -(1:2)]) %*%  c(coefficients_b_gauss,1)

# Add the area-specific intercepts to the linear predictor based on the area of each observation
linear_predictor_gauss <- linear_predictor_gauss + intercept_area_gauss[test_preped$area]

# Apply the inverse of the log link to obtain the predicted mean for each observation
predicted_mean_gauss <- drop(exp(linear_predictor_gauss))
test$pred_gauss <- predicted_mean_gauss

Model Assessment and Comparison

RMSE (for test set)

We compute the root mean squared error (RMSE) on the test set for both models.

# obtain residuals for gamma glm
resid_gamma <- test$claimcst0 - test$pred_gamma
sqrt(mean((predicted_mean - test$claimcst0)^2)) # RMSE
[1] 3327.891
# obtain residuals for gaussian glm
resid_gauss <- test$claimcst0 - test$pred_gauss
sqrt(mean((predicted_mean_gauss - test$claimcst0)^2)) # RMSE
[1] 3332.602

Residual checks (for test set)

For Gamma hglm

First, we examine the observation-level residuals, computed from the posterior mean parameter estimates.

plot(resid_gamma)

plot(test$pred_gamma,resid_gamma)

hist(resid_gamma)

qqnorm(resid_gamma)
qqline(resid_gamma)

rownames(test_preped)[order(resid_gamma,decreasing = T)][1:5]
[1] "362" "772" "725" "824" "904"
test_preped[362,] # record with largest residual
# A tibble: 1 × 8
  claimcst0  area veh_value_cat veh_body veh_age gender agecat numclaims
      <dbl> <dbl>         <dbl>    <dbl>   <dbl>  <dbl>  <dbl>     <dbl>
1    32815.     6             5        4       2      2      1         0

When examining the residuals against the index plot to assess whether a model fits the data well, we are looking for random and symmetric scatter around zero. This indicates that the model’s predictions are unbiased and not systematically overestimating or underestimating the target.

In this case, the residual points appear to randomly scatter but not be symmetrically distributed around zero.

This suggests a systematic bias: the model does not capture all the relevant predictors or interactions, leading to biased predictions. It is possible that some key variables affecting the response are missing from our model.

In the predicted vs residuals plot, the residuals don’t appear to be randomly distributed around zero. There is a downwards trend as the predicted values get larger. This again suggests we don’t have enough information in our model to explain the target variable claimcst0.

The histogram of residuals is right-skewed, consistent with gamma-distributed errors, which is expected.

How the area-level intercepts differ from the overall intercept

area_resid_gamma = pm_params_gamma[1:6] - pm_params_gamma["a0"]
plot(area_resid_gamma)
abline(h=0, lty=2)

There are only six areas so not a whole lot of information to see here. But there is no obvious violation of our assumptions.

For Gaussian with log link glm

plot(resid_gauss)

plot(test$pred_gauss,resid_gauss)

hist(resid_gauss)

qqnorm(resid_gauss)
qqline(resid_gauss)

rownames(test_preped)[order(resid_gauss,decreasing = T)][1:5]
[1] "362" "772" "725" "824" "904"
test_preped[362,] # record with largest residual
# A tibble: 1 × 8
  claimcst0  area veh_value_cat veh_body veh_age gender agecat numclaims
      <dbl> <dbl>         <dbl>    <dbl>   <dbl>  <dbl>  <dbl>     <dbl>
1    32815.     6             5        4       2      2      1         0
area_resid_gauss = pm_params_gauss[1:6] - pm_params_gauss["a0"]
plot(area_resid_gauss)
abline(h=0, lty=2)

The plots for the Gaussian log-link GLM appear to be very similar to those of the gamma GLM.

Posterior predictive checks

For the gamma hglm, we will simulate new data from the fitted model using the posterior samples of the parameters. We then compare the simulated data to the observed data in the train set.

# Get the number of iterations from your MCMC samples
# mod_csim holds the posterior samples of the parameters
n_iterations <- nrow(mod_csim) # 15000

# Initialize an empty matrix to store posterior predictive samples
posterior_predictive_samples <- matrix(NA, n_iterations, length(data_jags$y))

# Loop through each iteration and generate posterior predictive samples
for (i in 1:n_iterations) {
  # Get parameters from the posterior distribution
  shape <- mod_csim[i, "shape"]
  a <- mod_csim[i, grep("^a", colnames(mod_csim))]
  b <- mod_csim[i, grep("^b", colnames(mod_csim))]
  numclaims <- data_jags$numclaims
  
  # Generate posterior predictive samples for each data point
  for (j in 1:length(data_jags$y)) {
    # Simulate a gamma-distributed value based on the parameters
    posterior_predictive_samples[i, j] <- 
      rgamma(1, shape = shape, rate = shape / exp(a[data_jags$area[j]] + 
                                                    b[1] * data_jags$veh_value_cat[j] + 
                                                    b[2] * data_jags$veh_body[j] +
                                                    b[3] * data_jags$veh_age[j] +
                                                    b[4] * data_jags$gender[j] +
                                                    b[5] * data_jags$agecat[j] +
                                                    log(numclaims[j])))
  }
}

Now ‘posterior_predictive_samples’ contains the generated posterior predictive samples. Each row represents a different set of posterior predictive values for the train set. Each column corresponds to a data point from the train set.
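A vectorized alternative to the nested loop above (a sketch using the same mod_csim and data_jags objects) builds the draws-by-observations matrix of means in one step; note that it holds several large matrices (roughly 15,000 by 3,696) in memory.

post   <- as.matrix(mod_csim)
a_cols <- grep("^a\\[", colnames(post))   # area intercepts a[1]..a[6]
b_cols <- grep("^b\\[", colnames(post))   # coefficients b[1]..b[5]

X <- cbind(data_jags$veh_value_cat, data_jags$veh_body, data_jags$veh_age,
           data_jags$gender, data_jags$agecat)             # n x 5 predictors

eta <- post[, a_cols][, data_jags$area] +                  # draws x n
  post[, b_cols] %*% t(X) +
  matrix(log(data_jags$numclaims), nrow = nrow(post),
         ncol = length(data_jags$y), byrow = TRUE)
mu  <- exp(eta)

shape_draws <- post[, "shape"]                             # recycled down each column
ppc_vectorized <- matrix(rgamma(length(mu), shape = shape_draws,
                                rate = shape_draws / mu),
                         nrow = nrow(post))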

The question is whether, on the whole, the simulated data is similar to the observed data.

library(bayesplot)
This is bayesplot version 1.10.0
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
   * Does _not_ affect other ggplot2 plots
   * See ?bayesplot_theme_set for details on theme setting
y <- train$claimcst0
ppc_dens_overlay(y, posterior_predictive_samples[1:50,]) + 
  xlim(0,40000) +
  ylim(0,8e-4)
Warning: Removed 5 rows containing non-finite values (`stat_density()`).
Warning: Removed 3 rows containing non-finite values (`stat_density()`).

# Choose 5 random indices to select 5 sets of parameters
num_samples <- 15000
num_sets_to_compare <- 5
random_indices <- sample(1:num_samples, num_sets_to_compare, replace = F)

# Extract the selected sets of parameters for posterior predictive check
selected_posterior_predictive_samples <- posterior_predictive_samples[random_indices, ]

# Perform the posterior predictive check for the selected parameter sets
ppc_hist(y, selected_posterior_predictive_samples) +
  coord_cartesian(ylim = c(0, 3000))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The density plot overlay shows that the simulated data captures the general center and spread of the observed data.

However, the fact that the peak of the observed data is much higher implies that the most typical claim costs are not being adequately captured by our model. In other words, the model’s predictions are consistently underestimating the frequency of the most common claim costs in comparison to what is observed in the train data.

# Calculate summary statistics from observed data in train set
obs_mean <- mean(train$claimcst0)
obs_variance <- var(train$claimcst0)
obs_quantiles <- quantile(train$claimcst0, c(0.25, 0.5, 0.75))

# Calculate summary statistics from posterior predictive samples
pp_mean <- apply(posterior_predictive_samples, 1, mean)
pp_variance <- apply(posterior_predictive_samples, 1, var)
pp_quantiles <- t(apply(posterior_predictive_samples, 1, quantile, probs = c(0.25, 0.5, 0.75)))

# Compare observed data and posterior predictive samples
# You can use graphical methods like histograms, density plots, or Q-Q plots
# Or you can calculate how often the observed data's statistics fall within the range of posterior predictive samples' statistics


# Compare quantiles using box plots
boxplot(pp_quantiles, col = "skyblue", names = c("25%", "50%", "75%"), 
        main = "Posterior Predictive Quantiles vs. Observed Quantiles")
points(obs_quantiles, col = "red", pch = 19)
legend("bottomright", legend = c("Posterior Predictive Quantiles", "Observed Quantiles"), 
       col = c("skyblue", "red"), pch = c(15, 19))

Only the observed data’s 25% quantile falls within the posterior predictive samples’ 25% quantile range.
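Following the comment in the code above, we can also quantify the comparison with posterior predictive p-values, i.e., the proportion of simulated data sets whose statistic exceeds the observed one (values near 0 or 1 indicate misfit). A sketch using the objects computed above:

mean(pp_mean > obs_mean)                              # p-value for the mean
mean(pp_variance > obs_variance)                      # p-value for the variance
colMeans(sweep(pp_quantiles, 2, obs_quantiles, ">"))  # p-values for the quantiles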

Show the predicted means vs actual claim costs in the test set.

par(mfrow = c(1, 2))

hist(predicted_mean_gauss, col = "skyblue", 
     xlim = c(0, 60000),
     ylim = c(0,700),
     main = "Predicted means for gaussian glm", xlab = "Claim Amount")

hist(test_preped$claimcst0, breaks = 50, col = "salmon", add = FALSE,
     xlim = c(0, 60000),
     ylim = c(0,700),
     main = "Actual claim costs for test set", xlab = "Claim Amount")

hist(predicted_mean, col = "skyblue", 
     xlim = c(0, 40000),
     ylim = c(0,700),
     main = "Predicted means for gamma glm", xlab = "Claim Amount")

hist(test_preped$claimcst0, breaks = 50, col = "salmon", add = FALSE,
     xlim = c(0, 40000),
     ylim = c(0,700),
     main = "Actual claim costs for test set", xlab = "Claim Amount")

We notice that the extreme right skew of the claim costs is not captured by our models' predictions.

ggplot(as.data.frame(predicted_mean),aes(predicted_mean)) +
  geom_density() +
  geom_density(data=test_preped, aes(claimcst0), color="red") +
  labs(title = "Predicted vs. Observed Claim Costs for Gamma HGLM with Log Link",
       x = "predicted claim cost")

ggplot(as.data.frame(predicted_mean),aes(predicted_mean)) +
  geom_density() +
  geom_density(data=test_preped, aes(claimcst0), color="red") +
  labs(title = "Predicted vs. Observed Claim Costs for Gamma HGLM with Log Link",
       x = "predicted claim cost") +
  coord_cartesian(xlim = c(0, 5000))

ggplot(as.data.frame(predicted_mean_gauss),aes(predicted_mean_gauss)) +
  geom_density() +
  geom_density(data=test_preped, aes(claimcst0), color="red") +
  labs(title = "Predicted vs. Observed Claim Costs for Gaussian HGLM with Log Link",
       x = "predicted claim cost")

For the test data, we under-predicted the lowest range of claim costs ($0 to $1,000) and over-predicted the $1,000 to $3,000 range.

Thus, these two models do not generalize adequately to the actual data.

Conclusions

Results

  • A hold-out test set was used to evaluate how well the two models generalize to new data. Checking the predicted claim costs against the test set's actual costs, we under-predicted the lowest range of claim costs ($0 to $1,000) and over-predicted the $1,000 to $3,000 range. Thus, these two models do not generalize adequately to the actual data.
  • For the train set, many values were overestimated by both models, and some underestimated values had large error magnitudes.
  • Changing the likelihood between gamma and normal does not seem to affect the test set's predictive performance much, although the gamma GLM is preferred on the train set based on its lower DIC (penalized deviance 63,329 vs 70,985).
  • The test set RMSE also favors the gamma hierarchical GLM over the Gaussian log-link hierarchical GLM (3,327.9 vs 3,332.6).
  • We do not have enough information in our model to explain the target variable claimcst0, as indicated by both models' predicted vs. residual plots on the test set.

Limitations and improvements

  • It is difficult to verify the data collection process and its authenticity as the source website provided by the R documentation is inaccessible.

  • The parameters of this model are not interpretable because the factor predictors were converted from factor type to numeric codes rather than one-hot encoded. This could be improved by using one-hot encoding in the next iteration.

  • The analysis does not contain prior sensitivity analysis.

  • Weakly informative priors were used. We could improve our prior selection by incorporating expert knowledge.

  • Interaction terms can be explored as additional predictors.

  • The six areas can be aggregated into only two main areas, (A,B,C) and (D,E,F).

  • Write vectorized code for conducting posterior predictive checks to reduce the simulation time.

  • The train set was used twice: once for building the models and again for checking how well the models fit the data. This is generally a bad idea; it would be better to validate the models on external data that was not used for building them.

  • When a severity model is finalized, we can move on to build a frequency model and then a pure premium model.

References