Modeling claim costs with Bayesian Hierarchical Generalized Linear Models
In this project, we will attempt to model claim severity (expected claim amount per claim) using a gamma generalized linear model (GLM) with a log link as well as a normal GLM with a log link.
Load packages
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.2 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.0
✔ ggplot2 3.4.2 ✔ tibble 3.2.1
✔ lubridate 1.9.2 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(skimr)
library(ggplot2)
library(caret)
Loading required package: lattice
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
This data set is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67,856 policies, of which 4,624 (6.8%) filed claims.
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can see an imbalance, as most policyholders did not file a claim. If a policyholder did not file a claim, the claim cost is $0.
Looking at the above graph, and given one important property of the gamma distribution, namely that all outcomes must be positive, we cannot model our data in its current form, where roughly 93% of observations have a zero outcome.
We will split this dataset into two parts: observations without any loss and observations that generated a positive loss.
without_loss_dat <- dat |> filter(claimcst0 == 0)
with_loss_dat <- dat |> filter(claimcst0 > 0)
We will focus on modeling only positive losses.
# claim cost for the positive subset of the data
ggplot(with_loss_dat, aes(claimcst0)) +
  geom_histogram() +
  ggtitle("Claim cost including outliers")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
summary(with_loss_dat$claimcst0)
Min. 1st Qu. Median Mean 3rd Qu. Max.
200.0 353.8 761.6 2014.4 2091.4 55922.1
SEDAN HBACK STNWG UTE HDTOP TRUCK COUPE PANVN MIBUS MCARA BUS CONVT RDSTR
1476 1264 1170 260 130 119 68 62 43 14 7 3 2
# Create the bar plot of veh_body_counts (the body-type counts tabulated above)
barplot(veh_body_counts, las = 2) # las = 2 to rotate x-axis labels vertically
We see many categories with very low frequency, which could hurt the usefulness of this feature in training for lack of data points, so let’s merge the bottom 4 categories into “OTHERS”.
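One way to do this merge, sketched with forcats (loaded with the tidyverse); the author’s exact recoding is not shown in this excerpt:

# Collapse the four least frequent body types (per the counts above) into "OTHERS"
with_loss_dat <- with_loss_dat |>
  mutate(veh_body = fct_other(veh_body,
                              drop = c("MCARA", "BUS", "CONVT", "RDSTR"),
                              other_level = "OTHERS"))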
The “exposure” variable indicates the exposure period for each policy; in the context of car insurance, it represents the covered time in years. Recall that this data set only contains policies taken out in 2004 or 2005.
# exposure
summary(with_loss_dat$exposure)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.002738 0.410678 0.636550 0.610834 0.832307 0.999316
# numclaims
hist(with_loss_dat$numclaims)
table(with_loss_dat$numclaims)
1 2 3 4
4329 269 18 2
We can see that, among policies with at least one claim, the majority have only one claim. Nothing is unusual here. Let’s group this variable into either one claim or more than one claim.
with_loss_dat$nclaim_cat <- ifelse(with_loss_dat$numclaims == 1, "One claim", "More than one claim")
with_loss_dat$nclaim_cat <- as.factor(with_loss_dat$nclaim_cat)
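The per-area standard deviations shown below were presumably computed with a grouped summary along these lines (the exact call is not shown in the excerpt):

# Standard deviation of claim cost within each area
with_loss_dat |>
  group_by(area) |>
  summarise(sd_area = sd(claimcst0))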
# A tibble: 6 × 2
area sd_area
<fct> <dbl>
1 A 3606.
2 B 2982.
3 C 3403.
4 D 3025.
5 E 3840.
6 F 5582.
The log of claimcst0 roughly follows a normal distribution, except for the two peaks at the beginning of the distribution. The variances of claimcst0 between areas vary as well.
Based on these two observations, we can conduct a Welch’s ANOVA test to see if there are statistically significant differences in claim costs among different areas.
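A sketch of such a test using base R’s oneway.test(), which does not assume equal variances; whether the author used the raw or log-scale cost is not shown:

# Welch's ANOVA of (log) claim cost across the six areas
oneway.test(log(claimcst0) ~ area, data = with_loss_dat, var.equal = FALSE)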
Based on industry experience, a GLM with a gamma distribution or OLS with a log-transformed response might be a better fit for this scenario.
Here is a comparison of the original distribution versus its log-transformed value.
hist(with_loss_dat$claimcst0, breaks =25)
hist(log(with_loss_dat$claimcst0), breaks =25)
The first histogram gives an intuition that the data are very close to a gamma distribution, and the latter suggests we can try OLS with the log-transformed response variable, though it is far from perfectly normal and has two peaks at the beginning.
Recall that claim severity is the expected claim amount per claim. Note that each row in the data set represents a policy, which can have more than one claim. Thus, the claim amount column is the total cost for each policy.
We need a technique that mathematically converts the claim amount into a claim severity: we want the number of claims to inform the model results without actually including it as a predictor.
log(E[claimcst0]) = f(X) + log(numclaims)
where f(X) is the linear combination of predictors and the second term is the offset.
The offset effectively scales the expected total claim cost by the number of claims for each policy: exponentiating both sides gives E[claimcst0] = numclaims × exp(f(X)), so exp(f(X)) is exactly the expected cost per claim. This allows us to focus on modeling the conditional relationship between the predictors and the claim severity (total claim cost per claim).
We avoid directly estimating a coefficient for numclaims in the model because we have a specific theoretical understanding that the effect of the number of claims is fixed and multiplicative on the response variable, i.e., the total claim cost, irrespective of the predictor values.
Note that we rule out a normal linear model on the log-transformed response variable because it cannot handle the offset term.
We will also assume that there is a grouping effect in the claim costs for policyholders based on their location, indicated by the area variable. We will reflect this in our model by giving each area its own intercept. Each of these intercepts will come from the same normal distribution.
Specify and fit models
We will specify a hierarchical gamma glm with a log link as well as one with a normal likelihood and a log link. We can fit both models and after assessment and comparison, finalize the better one.
Target Variable — We will use the claim cost as the target variable
Independent Variables — Vehicle body, vehicle age, gender, age category, and vehicle value, all used as categorical variables; area enters through the varying intercepts described above. The claim indicator and exposure are dropped as they are not required.
Offset — Number of claims will be used as an offset term
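The modeling code below refers to train and test data frames and calls jags.model(), neither of which are set up in the excerpt shown. A minimal sketch under assumed choices (an 80/20 split of the positive-loss data using caret, which was loaded above):

library(rjags) # provides jags.model(), coda.samples(), dic.samples()

set.seed(101)
train_idx <- createDataPartition(with_loss_dat$claimcst0, p = 0.8, list = FALSE) # split ratio assumed
train <- with_loss_dat[train_idx, ]
test <- with_loss_dat[-train_idx, ]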
## GAMMA GLM
mod_string <- " model {
  for (i in 1:length(y)) {
    y[i] ~ dgamma(shape, shape/mu[i])
    log(mu[i]) = a[area[i]] + b[1]*veh_value_cat[i] + b[2]*veh_body[i] + b[3]*veh_age[i] + b[4]*gender[i] + b[5]*agecat[i] + log(numclaims[i])
  }

  ## weakly informative prior for the degree of skewness of likelihood
  shape ~ dgamma(0.01, 1.0/0.01)

  for (j in 1:max(area)) {
    a[j] ~ dnorm(a0, prec_a)
  }
  a0 ~ dnorm(0.0, 1.0/1.0e6)
  prec_a ~ dgamma(0.01, 1.0/0.01)
  tau = sqrt(1.0/prec_a) ## standard deviation

  for (j in 1:5) {
    b[j] ~ dnorm(0.0, 1.0/1.0e6)
  }
} "

set.seed(101)
data_jags <- list(y = train$claimcst0,
                  area = as.numeric(train$area),
                  veh_value_cat = as.numeric(train$veh_value_cat),
                  veh_body = as.numeric(train$veh_body),
                  veh_age = as.numeric(train$veh_age),
                  gender = as.numeric(train$gender),
                  agecat = as.numeric(train$agecat),
                  numclaims = train$numclaims)
params <- c("shape", "a0", "tau", "a", "b")

mod = jags.model(textConnection(mod_string), data = data_jags, n.chains = 3)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 3696
Unobserved stochastic nodes: 14
Total graph size: 35204
Initializing model
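The excerpt does not show the sampling step that produces the combined posterior draws (mod_csim) and the DIC objects used later. A minimal sketch of the usual rjags workflow, with burn-in and iteration counts assumed (they are consistent with the summary output shown further below):

update(mod, 1e3) # burn-in (length assumed)
mod_sim <- coda.samples(model = mod, variable.names = params, n.iter = 5e3)
mod_csim <- as.mcmc(do.call(rbind, mod_sim)) # stack the 3 chains into one matrix of draws
dic_gamma <- dic.samples(mod, n.iter = 1e3) # DIC for the gamma model

# The analogous calls on mod2/params2 (fitted below) would produce mod2_sim, mod2_csim
# and dic_Norm_logLink for the Gaussian log-link model.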
With most parameters, the chains are not mixing well due to high autocorrelation. This means the chains are not effectively exploring the posterior. Poor mixing can lead to slow convergence and biased estimates for posterior intervals.
However, the chains appear to have converged, indicated by roughly flat average values of the chains in each trace plot, and the effective sample size for each parameter is enough to get good posterior mean estimates.
We can interpret \(a0\) as the overall mean intercept and \(tau\) as the standard deviation of intercepts across areas.
The area-specific intercept \(a[j]\) is the intercept for the j-th area, drawn from a normal distribution centered at the overall intercept \(a0\); its offset from \(a0\) is the deviation attributable to that area.
It captures the effect of the specific area on claim costs while accounting for differences due to other predictor variables.
The parameters for this model are not directly interpretable due to the way the factor predictors are handled (converted from factor type to numeric type). This could be improved by using one-hot encoding in the next iteration.
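One possible direction for that next iteration, sketched with model.matrix(); this is not part of the models fitted here:

# Build dummy (one-hot) columns for the factor predictors; drop the intercept column
X_onehot <- model.matrix(~ veh_body + veh_age + gender + agecat + veh_value_cat, data = train)[, -1]
# In the JAGS model, b would then have length ncol(X_onehot) and the fixed-effect part of the
# linear predictor could be written as inprod(X[i, ], b), giving one coefficient per factor level.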
Let’s now fit the Normal hierarchical glm with log link.
## GAUSSIAN GLM WITH LOG LINK
mod2_string <- " model {
  for (i in 1:length(y)) {
    y[i] ~ dnorm(mu[i], prec)
    log(mu[i]) = a[area[i]] + b[1]*veh_value_cat[i] + b[2]*veh_body[i] + b[3]*veh_age[i] + b[4]*gender[i] + b[5]*agecat[i] + log(numclaims[i])
  }

  for (j in 1:max(area)) {
    a[j] ~ dnorm(a0, prec_a)
  }
  a0 ~ dnorm(0.0, 1.0/1.0e6)
  prec_a ~ dgamma(0.01, 1.0/0.01)
  tau = sqrt(1.0/prec_a)

  for (j in 1:5) {
    b[j] ~ dnorm(0.0, 1.0/1.0e6)
  }

  prec ~ dgamma(0.01, 1.0/0.01)
  sig = sqrt(1.0/prec)
} "

set.seed(101)
data2_jags <- list(y = train$claimcst0,
                   area = as.numeric(train$area),
                   veh_value_cat = as.numeric(train$veh_value_cat),
                   veh_body = as.numeric(train$veh_body),
                   veh_age = as.numeric(train$veh_age),
                   gender = as.numeric(train$gender),
                   agecat = as.numeric(train$agecat),
                   numclaims = train$numclaims)
params2 <- c("sig", "a0", "tau", "a", "b")

mod2 = jags.model(textConnection(mod2_string), data = data2_jags, n.chains = 3)
Compiling model graph
Resolving undeclared variables
Allocating nodes
Graph information:
Observed stochastic nodes: 3696
Unobserved stochastic nodes: 14
Total graph size: 33345
Initializing model
Iterations = 2001:7000
Thinning interval = 1
Number of chains = 3
Sample size per chain = 5000
1. Empirical mean and standard deviation for each variable,
plus standard error of the mean:
Mean SD Naive SE Time-series SE
a[1] 7.37701 0.18210 0.0014869 0.0129422
a[2] 7.40155 0.18385 0.0015011 0.0128858
a[3] 7.43943 0.17726 0.0014473 0.0130740
a[4] 7.36714 0.20562 0.0016789 0.0137141
a[5] 7.60240 0.19906 0.0016253 0.0134420
a[6] 7.64691 0.21138 0.0017259 0.0141758
a0 7.47601 3.26689 0.0266741 0.0279727
b[1] 0.02447 0.03333 0.0002721 0.0016119
b[2] -0.02761 0.01350 0.0001102 0.0005110
b[3] 0.03892 0.03336 0.0002724 0.0015323
b[4] 0.20676 0.05922 0.0004835 0.0024690
b[5] -0.07689 0.02238 0.0001828 0.0007191
sig 3577.40197 41.60241 0.3396823 0.3425466
tau 7.45582 3.02462 0.0246959 0.0310877
2. Quantiles for each variable:
2.5% 25% 50% 75% 97.5%
a[1] 7.02181 7.254e+00 7.37890 7.50251 7.728e+00
a[2] 7.04147 7.277e+00 7.40315 7.53062 7.755e+00
a[3] 7.09654 7.317e+00 7.44327 7.56504 7.778e+00
a[4] 6.95631 7.225e+00 7.37207 7.51070 7.763e+00
a[5] 7.21376 7.465e+00 7.60464 7.74226 7.983e+00
a[6] 7.23130 7.501e+00 7.65162 7.79539 8.049e+00
a0 0.80866 5.605e+00 7.46077 9.31465 1.407e+01
b[1] -0.04147 1.617e-03 0.02490 0.04749 8.757e-02
b[2] -0.05417 -3.681e-02 -0.02754 -0.01857 -1.862e-03
b[3] -0.02436 1.558e-02 0.03853 0.06213 1.040e-01
b[4] 0.09267 1.670e-01 0.20678 0.24664 3.227e-01
b[5] -0.12097 -9.186e-02 -0.07667 -0.06155 -3.419e-02
sig 3496.21204 3.549e+03 3577.19623 3605.15829 3.660e+03
tau 3.96171 5.468e+00 6.73305 8.61797 1.524e+01
# model comparison based on train data
dic_gamma
Mean deviance: 63317
penalty 11.95
Penalized deviance: 63329
dic_Norm_logLink
Mean deviance: 70973
penalty 12.01
Penalized deviance: 70985
Based on the fit to the train set, the preferred model is the gamma GLM due to its lower DIC score.
Predict test data
# predictions from gamma GLM
pm_params_gamma <- colMeans(mod_csim)

test_preped <- test |>
  select(claimcst0, area, veh_value_cat, veh_body, veh_age, gender, agecat, numclaims) |>
  mutate(area = as.numeric(area),
         veh_value_cat = as.numeric(veh_value_cat),
         veh_body = as.numeric(veh_body),
         veh_age = as.numeric(veh_age),
         gender = as.numeric(gender),
         agecat = as.numeric(agecat),
         numclaims = log(numclaims))

# Extract relevant parameters from the posterior mean estimates
# (match "a[" rather than "a" so that the overall intercept a0 is excluded)
intercept_area <- pm_params_gamma[grep("^a\\[", names(pm_params_gamma))]
coefficients_b <- pm_params_gamma[grep("^b", names(pm_params_gamma))]

# Calculate the linear predictor for each observation in the test set
# (the trailing 1 keeps the coefficient of the log(numclaims) offset fixed at 1)
linear_predictor <- as.matrix(test_preped[, -(1:2)]) %*% c(coefficients_b, 1)

# Add the area-specific intercepts to the linear predictor based on the area of each observation
linear_predictor <- linear_predictor + intercept_area[test_preped$area]

# Apply the inverse of the log link to obtain the predicted mean for each observation
predicted_mean <- drop(exp(linear_predictor))
test$pred_gamma <- predicted_mean
# predictions from Gaussian GLM with log link
pm_params_gauss <- colMeans(mod2_csim)

test_preped <- test |>
  select(claimcst0, area, veh_value_cat, veh_body, veh_age, gender, agecat, numclaims) |>
  mutate(area = as.numeric(area),
         veh_value_cat = as.numeric(veh_value_cat),
         veh_body = as.numeric(veh_body),
         veh_age = as.numeric(veh_age),
         gender = as.numeric(gender),
         agecat = as.numeric(agecat),
         numclaims = log(numclaims))

# Extract relevant parameters from the posterior mean estimates
# (again excluding the overall intercept a0)
intercept_area_gauss <- pm_params_gauss[grep("^a\\[", names(pm_params_gauss))]
coefficients_b_gauss <- pm_params_gauss[grep("^b", names(pm_params_gauss))]

# Calculate the linear predictor for each observation in the test set
linear_predictor_gauss <- as.matrix(test_preped[, -(1:2)]) %*% c(coefficients_b_gauss, 1)

# Add the area-specific intercepts to the linear predictor based on the area of each observation
linear_predictor_gauss <- linear_predictor_gauss + intercept_area_gauss[test_preped$area]

# Apply the inverse of the log link to obtain the predicted mean for each observation
predicted_mean_gauss <- drop(exp(linear_predictor_gauss))
test$pred_gauss <- predicted_mean_gauss
Model Assessment and Comparison
RMSE (for test set)
We validate root mean squared error on the test set.
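A minimal sketch of that computation, using the pred_gamma and pred_gauss columns created above (the reported values are 3327.484 and 3332.711):

# Root mean squared error of each model's predictions on the test set
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$claimcst0, test$pred_gamma) # gamma hierarchical GLM
rmse(test$claimcst0, test$pred_gauss) # Gaussian hierarchical GLM with log link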
When examining the residuals-versus-index plot to assess whether a model fits the data well, we are looking for random, symmetric scatter around zero. This indicates that the model’s predictions are unbiased and not systematically overestimating or underestimating the target.
In this case, the residual points appear to scatter randomly but are not symmetrically distributed around zero.
This suggests the presence of a systematic bias in our model: it does not capture all the relevant predictors or interactions, leading to biased predictions. It is possible that some key variables affecting the response are missing from our model.
In the predicted-versus-residuals plot, the residuals do not appear to be randomly distributed around zero; there is a downward trend as the predicted values get larger. This again suggests we do not have enough information in our model to explain the target variable claimcst0.
The histogram shows a gamma error distribution, which is expected.
The plots for Gaussian with log link glm appear to be very similar to those of Gamma glm.
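The residual plots themselves are not reproduced in this excerpt; a sketch of how such diagnostics could be drawn for the gamma model (illustrated here on the test predictions, whereas the plots discussed in the conclusions are based on the train set):

# Raw residuals for the gamma model
resid_gamma <- test$claimcst0 - test$pred_gamma
plot(resid_gamma, ylab = "residual", main = "Residuals vs. index")
abline(h = 0, lty = 2)
plot(test$pred_gamma, resid_gamma, xlab = "predicted", ylab = "residual", main = "Residuals vs. predicted")
abline(h = 0, lty = 2)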
Posterior predictive checks
For the gamma hglm, we will simulate new data from the fitted model using the posterior samples of the parameters. We then compare the simulated data to the observed data in the train set.
# Get the number of iterations from the MCMC samples
# mod_csim holds the posterior samples of the parameters
n_iterations <- nrow(mod_csim) # 15000

# Initialize an empty matrix to store posterior predictive samples
posterior_predictive_samples <- matrix(NA, n_iterations, length(data_jags$y))

# Loop through each iteration and generate posterior predictive samples
for (i in 1:n_iterations) {
  # Get parameters from the posterior distribution
  # (match "a[" rather than "a" so that the overall intercept a0 is excluded)
  shape <- mod_csim[i, "shape"]
  a <- mod_csim[i, grep("^a\\[", colnames(mod_csim))]
  b <- mod_csim[i, grep("^b", colnames(mod_csim))]
  numclaims <- data_jags$numclaims

  # Generate posterior predictive samples for each data point
  for (j in 1:length(data_jags$y)) {
    # Simulate a gamma-distributed value based on the parameters
    posterior_predictive_samples[i, j] <- rgamma(
      1, shape = shape,
      rate = shape / exp(a[data_jags$area[j]] +
                         b[1] * data_jags$veh_value_cat[j] +
                         b[2] * data_jags$veh_body[j] +
                         b[3] * data_jags$veh_age[j] +
                         b[4] * data_jags$gender[j] +
                         b[5] * data_jags$agecat[j] +
                         log(numclaims[j])))
  }
}
Now ‘posterior_predictive_samples’ contains the generated posterior predictive samples. Each row represents a different set of posterior predictive values for the train set. Each column corresponds to a data point from the train set.
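The nested loop above is slow; a vectorized alternative that produces the same n_iterations x n_obs matrix in one pass is sketched here (it assumes mod_csim and data_jags as defined earlier):

# Design matrix of the numerically coded predictors, one row per training observation
X <- cbind(data_jags$veh_value_cat, data_jags$veh_body, data_jags$veh_age,
           data_jags$gender, data_jags$agecat)
a_draws <- mod_csim[, grep("^a\\[", colnames(mod_csim))] # area intercepts, n_iter x 6
b_draws <- mod_csim[, grep("^b", colnames(mod_csim))]    # coefficients,    n_iter x 5
shape <- mod_csim[, "shape"]

# n_iter x n_obs matrix of linear predictors: fixed effects + area intercept + offset
eta <- b_draws %*% t(X) +
  a_draws[, data_jags$area] +
  matrix(log(data_jags$numclaims), nrow(mod_csim), length(data_jags$y), byrow = TRUE)

# One gamma draw per (iteration, observation); shape is recycled down each column
posterior_predictive_samples <- matrix(
  rgamma(length(eta), shape = shape, rate = shape / exp(eta)),
  nrow = nrow(mod_csim))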
The question is whether, on the whole, the simulated data is similar to the observed data.
library(bayesplot)
This is bayesplot version 1.10.0
- Online documentation and vignettes at mc-stan.org/bayesplot
- bayesplot theme set to bayesplot::theme_default()
* Does _not_ affect other ggplot2 plots
* See ?bayesplot_theme_set for details on theme setting
y <- train$claimcst0
ppc_dens_overlay(y, posterior_predictive_samples[1:50, ]) + xlim(0, 40000) + ylim(0, 8e-4)
# Choose 5 random indices to select 5 sets of parameters
num_samples <- 15000
num_sets_to_compare <- 5
random_indices <- sample(1:num_samples, num_sets_to_compare, replace = F)

# Extract the selected sets of parameters for posterior predictive check
selected_posterior_predictive_samples <- posterior_predictive_samples[random_indices, ]

# Perform the posterior predictive check for the selected parameter sets
ppc_hist(y, selected_posterior_predictive_samples) + coord_cartesian(ylim = c(0, 3000))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
With density plots, the simulated data does capture the general center and spread.
However, the fact that the peak of the observed data is much higher implies that the most typical claim costs are not being adequately captured by our model. In other words, the model’s predictions are consistently underestimating the frequency of the most common claim costs in comparison to what is observed in the train data.
# Calculate summary statistics from observed data in train set
obs_mean <- mean(train$claimcst0)
obs_variance <- var(train$claimcst0)
obs_quantiles <- quantile(train$claimcst0, c(0.25, 0.5, 0.75))

# Calculate summary statistics from posterior predictive samples
pp_mean <- apply(posterior_predictive_samples, 1, mean)
pp_variance <- apply(posterior_predictive_samples, 1, var)
pp_quantiles <- t(apply(posterior_predictive_samples, 1, quantile, probs = c(0.25, 0.5, 0.75)))

# Compare observed data and posterior predictive samples
# You can use graphical methods like histograms, density plots, or Q-Q plots
# Or you can calculate how often the observed data's statistics fall within the range of
# the posterior predictive samples' statistics

# Compare quantiles using box plots
boxplot(pp_quantiles, col = "skyblue", names = c("25%", "50%", "75%"),
        main = "Posterior Predictive Quantiles vs. Observed Quantiles")
points(obs_quantiles, col = "red", pch = 19)
legend("bottomright",
       legend = c("Posterior Predictive Quantiles", "Observed Quantiles"),
       col = c("skyblue", "red"), pch = c(15, 19))
Only the observed data’s 25% quantile falls within the posterior predictive samples’ 25% quantile range.
Show the predicted means vs actual claim costs in the test set.
par(mfrow = c(1, 2))
hist(predicted_mean_gauss, col = "skyblue", xlim = c(0, 60000), ylim = c(0, 700),
     main = "Predicted means for Gaussian GLM", xlab = "Claim Amount")
hist(test_preped$claimcst0, breaks = 50, col = "salmon", add = FALSE,
     xlim = c(0, 60000), ylim = c(0, 700),
     main = "Actual claim costs for test set", xlab = "Claim Amount")
hist(predicted_mean, col = "skyblue", xlim = c(0, 40000), ylim = c(0, 700),
     main = "Predicted means for gamma GLM", xlab = "Claim Amount")
hist(test_preped$claimcst0, breaks = 50, col = "salmon", add = FALSE,
     xlim = c(0, 40000), ylim = c(0, 700),
     main = "Actual claim costs for test set", xlab = "Claim Amount")
We notice that the extreme right skew of the claim costs is not captured by our models’ predictions.
ggplot(as.data.frame(predicted_mean), aes(predicted_mean)) +
  geom_density() +
  geom_density(data = test_preped, aes(claimcst0), color = "red") +
  labs(title = "Predicted vs. Observed Claim Costs for Gamma HGLM with Log Link",
       x = "predicted claim cost")
ggplot(as.data.frame(predicted_mean), aes(predicted_mean)) +
  geom_density() +
  geom_density(data = test_preped, aes(claimcst0), color = "red") +
  labs(title = "Predicted vs. Observed Claim Costs for Gamma HGLM with Log Link",
       x = "predicted claim cost") +
  coord_cartesian(xlim = c(0, 5000))
ggplot(as.data.frame(predicted_mean_gauss), aes(predicted_mean_gauss)) +
  geom_density() +
  geom_density(data = test_preped, aes(claimcst0), color = "red") +
  labs(title = "Predicted vs. Observed Claim Costs for Gaussian HGLM with Log Link",
       x = "predicted claim cost")
For the test data, we underpredicted the lowest range of claim costs ($0, $1,000) and overpredicted the ($1,000, $3,000) range.
Thus, these two models are not adequate to generalize to the actual data.
Conclusions
Results
A hold-out test set was used to evaluate how well the two models generalize to new data. Comparing the predicted claim costs against the test set’s actual costs, we underpredicted the lowest range of claim costs ($0, $1,000) and overpredicted the ($1,000, $3,000) range. Thus, these two models do not generalize adequately to the actual data.
For the train set, many values were overestimated by the two models. Some underestimated values with large error magnitudes were also observed.
Changing the likelihood between gamma and normal does not seem to affect the test set’s predictive performance much, although the gamma GLM is preferred based on its lower DIC score (63330 vs 70986) on the train set.
The test set RMSE also prefers the gamma hierarchical GLM over the Gaussian log-link hierarchical GLM (3327.484 vs 3332.711).
We don’t have enough information in our model to explain the target variable claimcst0, as indicated by both models’ fitted vs residuals plots (based on train set).
Limitations and improvements
It is difficult to verify the data collection process and its authenticity as the source website provided by the R documentation is inaccessible.
The parameters for this model are not directly interpretable due to the way the factor predictors are handled (converted from factor type to numeric type). This could be improved by using one-hot encoding in the next iteration.
The analysis does not include a prior sensitivity analysis.
Weakly informative priors were used. We could improve our prior selection by incorporating expert knowledge.
Interaction terms can be explored as additional predictors.
The six areas can be aggregated into only two main areas, (A,B,C) and (D,E,F).
Vectorized code could be written for the posterior predictive checks; the nested sampling loop used above is slow (a possible vectorized version is sketched earlier).
We used the train set twice: once for building the models and again for checking whether the models fit the data. This is generally a bad idea; it would be better to validate the models on data that was not used for building them.
When a severity model is finalized, we can move on to build a frequency model and then a pure premium model.