Claim Costs Bayesian GLM Final Report
For the full analysis with R code, please check out this Quarto Markdown.
Abstract
This report models the highly skewed claim costs in a publicly available insurance claims data set using a Bayesian generalized linear modeling approach. The objective is to estimate the claim cost for each policy while accounting for variability between areas. Using Bayesian statistics and MCMC techniques, the report explores whether a Gamma or a Gaussian likelihood in a hierarchical generalized linear model (hglm) is more suitable for estimating claim costs. With the chosen data set, the Gamma hglm is preferred over the Gaussian hglm with log link.
Introduction
The research problem concerns how to model highly skewed data, e.g., insurance claim costs. The analysis is conducted using Bayesian Gamma and Gaussian hglms, both with a log link, allowing for the incorporation of prior knowledge and the quantification of uncertainty. The report proceeds through several key steps to address the problem systematically. At the end, the two models are assessed and compared.
Data
The data set is available in the insuranceData package in R. It is based on one-year vehicle insurance policies taken out in 2004 or 2005. There are 67856 policies, of which 4624 (6.8%) filed claims. Each policy contains a variety of information, e.g., vehicle value, exposure, number of claims, etc. This allows exploration of appropriate predictor variables for estimating the target variable claimcst0. Data preparation steps, such as variable transformations and the handling of data errors, are also addressed.
To understand what each column means, please go to Appendix A: Data dictionary.
One important property of the Gamma distribution is that all outcomes must be positive. We cannot model the data in its current form, where roughly 93% of policies have a zero outcome, as shown in the bar plot below.
Thus, the data set was split into two parts: policies with no loss and policies with a positive loss. This report focuses only on modeling positive losses. Shown below is the distribution of claim costs for policies with filed claims.
Some records have a vehicle value of $0. This does not make sense, so these records were removed.
There are also extreme values that could affect the stability of the model. The outliers are clearly visible in the box plot below.
The continuous vehicle value column was converted into a categorical variable to reduce the effect of these outliers. As a result, five ordinal categories were created for vehicle value: “Very Low”, “Low”, “Medium”, “High”, and “Very High”.
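A minimal sketch of this binning, assuming the prepared positive-loss data frame is named `claims` and assuming quantile-based cut points (the exact breakpoints used in the analysis are not stated in the report):

```r
library(dplyr)

# Quantile-based breaks are an assumption; the report does not give the exact cut points.
claims <- claims %>%
  mutate(
    veh_value_cat = cut(
      veh_value,
      breaks = quantile(veh_value, probs = seq(0, 1, 0.2), na.rm = TRUE),
      labels = c("Very Low", "Low", "Medium", "High", "Very High"),
      include.lowest = TRUE,
      ordered_result = TRUE
    )
  )
```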
In the vehicle body column, there are many categories with very low frequency, which could be a disadvantage when training on this feature. Thus, the bottom four categories were merged into “OTHERS”.
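One way this merge could be done, assuming the same `claims` data frame and using the forcats package:

```r
library(forcats)

# Keep all but the 4 least frequent veh_body levels; lump the rest into "OTHERS".
# fct_lump_n() retains the n most frequent levels, so n is the number of levels kept.
n_keep <- nlevels(factor(claims$veh_body)) - 4
claims$veh_body <- fct_lump_n(factor(claims$veh_body), n = n_keep, other_level = "OTHERS")
```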
Each area appears to have slightly different median log claim costs.
A Welch’s ANOVA test (alpha = 0.05) was conducted, and it also suggests that claim costs differ significantly between at least one pair of areas.
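In base R, Welch’s ANOVA corresponds to `oneway.test()` with unequal variances; a sketch, assuming the `claims` data frame and column names used above:

```r
# Welch's ANOVA on log claim costs across areas (data frame and column names are assumptions).
oneway.test(log(claimcst0) ~ area, data = claims, var.equal = FALSE)
```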
Models
Understanding model choices
A GLM with a Gamma distribution or an OLS regression with a log-transformed response might be a good fit for modeling claim costs.
Here is a comparison of the original distribution versus its log-transformed version.
The first histogram suggests that the data are close to a Gamma distribution, and the second suggests we could try OLS on the log-transformed response variable, though it is far from perfectly normal, with two peaks near the lower end.
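A sketch of the two histograms, assuming `claims` holds the positive-loss policies (the patchwork package is used here only to arrange the panels):

```r
library(ggplot2)
library(patchwork)

# Histogram of raw claim costs.
p_raw <- ggplot(claims, aes(x = claimcst0)) +
  geom_histogram(bins = 50) +
  labs(title = "Claim cost", x = "claimcst0")

# Histogram of log-transformed claim costs.
p_log <- ggplot(claims, aes(x = log(claimcst0))) +
  geom_histogram(bins = 50) +
  labs(title = "Log claim cost", x = "log(claimcst0)")

p_raw + p_log
```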
Note that claim severity is the expected claim amount per claim. Recall that each row in the data set represents a policy, which can have more than one claim. Thus, the claim amount column is the total cost for each policy.
We wanted the number of claims to influence the model results without estimating a coefficient for it. The first line below sets up a GLM with a log link for claim severity, and the last line gives the offset form we use:
\[ \begin{align*} & \text{claimcst0}/\text{numclaims} = \exp(f(X)) \\ & \log(\text{claimcst0}/\text{numclaims}) = f(X) \\ & \log(\text{claimcst0}) - \log(\text{numclaims}) = f(X) \\ & \log(\text{claimcst0}) = f(X) + \log(\text{numclaims}) \end{align*} \]
where f(X) is the linear combination of predictors and the second term is the offset.
The offset effectively scales the expected total claim cost by the number of claims for each policy, allowing us to focus on modeling the conditional relationship between the predictors and the claim severity (total claim cost per claim).
We avoid directly estimating a coefficient for numclaims in the model, because we have a specific theoretical understanding that the effect of the number of claims is fixed and has a multiplicative effect on the response variable claimcst0, irrespective of the predictor values.
Note that we rule out a normal linear model on the log-transformed response variable because it cannot accommodate the offset term. As the number of claims (numclaims) increases for a policy, the claim cost (claimcst0) is expected to increase proportionally. The log link function is better suited to capture this multiplicative relationship.
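To illustrate how the offset enters the linear predictor, here is a sketch of the equivalent frequentist Gamma GLM in base R; this is not the Bayesian model used in the report, and the data frame and column names (e.g., `train`, `veh_value_cat`) are assumptions:

```r
# Gamma GLM with log link; log(numclaims) enters as an offset, so the linear predictor
# models claim severity (cost per claim) while the response stays the total claim cost.
fit_glm <- glm(
  claimcst0 ~ veh_value_cat + veh_body + veh_age + gender + area + agecat +
    offset(log(numclaims)),
  family = Gamma(link = "log"),
  data = train
)
summary(fit_glm)
```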
We will also assume that there is a grouping effect in the claim costs for policyholders based on their area, indicated by the area variable. We will reflect this in the model by giving each area its own intercept, with all intercepts drawn from the same normal distribution.
We will specify a Gamma hierarchical glm with a log link as well as one with a Gaussian likelihood and a log link. We can fit both models and after assessment and comparison, finalize the better one.
Target Variable — We will use the claim cost as the target variable.
Predictor Variables — Vehicle body, vehicle age, gender, area, age category, and vehicle value, all treated as categorical. The claim indicator and exposure are dropped as they are not relevant.
Offset — Number of claims will be used as an offset term.
Gamma model
Model specification in JAGS
\[ \begin{align*} & y_i \sim \text{Gamma}(\text{shape}, \text{rate}_i) \\ & \text{shape} \sim \text{Gamma}(0.01, \frac{1}{0.01}) \\ & \text{rate}_i = \frac{\text{shape}}{\mu_i} \\ & \log(\mu_i) = a_{\text{area}_i} + \mathbf{b} \cdot \mathbf{X} + \log(\text{numclaims}_i) \\ & a_{\text{area}_i} \sim \text{Normal}(a_0, \text{precision}_a) \quad \text{for } \text{area}_i = 1, 2, \ldots, \text{number of areas} \\ & a_0 \sim \text{Normal}(0, \frac{1}{10^6}) \\ & \text{precision}_a \sim \text{Gamma}(0.01, \frac{1}{0.01}) \\ & b_j \sim \text{Normal}(0, \frac{1}{10^6}) \quad \text{for } j = 1, 2, \ldots, \text{number of predictors} \end{align*} \]
Note X is a matrix of predictor variables and b is a vector of corresponding coefficients.
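A minimal sketch of how this specification might be written as a JAGS model string with rjags; the names in the data list (`y`, `X`, `area`, `numclaims`, `n`, `p`, `n_area`) are assumptions:

```r
library(rjags)

gamma_model_string <- "model {
  for (i in 1:n) {
    y[i] ~ dgamma(shape, rate[i])
    rate[i] <- shape / mu[i]
    log(mu[i]) <- a[area[i]] + inprod(X[i, ], b[]) + log(numclaims[i])
  }
  for (k in 1:n_area) {
    a[k] ~ dnorm(a0, prec_a)
  }
  a0     ~ dnorm(0.0, 1.0E-6)
  prec_a ~ dgamma(0.01, 1 / 0.01)   # Gamma(0.01, 1/0.01) as in the specification above
  shape  ~ dgamma(0.01, 1 / 0.01)
  for (j in 1:p) {
    b[j] ~ dnorm(0.0, 1.0E-6)
  }
}"

# Compilation and sampling would then follow the usual rjags workflow, e.g.:
# gamma_fit <- jags.model(textConnection(gamma_model_string), data = jags_data, n.chains = 3)
# samples   <- coda.samples(gamma_fit, variable.names = c("a", "b", "shape"), n.iter = 10000)
```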
Model checking
Check that modeling assumptions are met (e.g., residual analyses, predictive performance, etc.).
Posterior predictive checks (for train set)
If the combined model assumptions are reasonable, then our posterior model should be able to simulate claim cost data similar to the original 3,696 policy observations in the train set.
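A sketch of how replicated claim costs could be drawn from the posterior and compared with the observed training data; all object names (`post`, `X_train`, `area_train`, `numclaims_train`, `y_train`, `coef_names`) are assumptions about how the MCMC output and design matrix are stored:

```r
set.seed(42)

n_rep    <- 200                         # number of posterior draws to simulate from
draw_ids <- sample(nrow(post), n_rep)
n_obs    <- length(y_train)

y_rep <- matrix(NA_real_, nrow = n_rep, ncol = n_obs)
for (s in seq_len(n_rep)) {
  d     <- draw_ids[s]
  shape <- post[d, "shape"]
  eta   <- post[d, paste0("a[", area_train, "]")] +
           as.vector(X_train %*% post[d, coef_names]) +
           log(numclaims_train)
  mu    <- exp(eta)
  y_rep[s, ] <- rgamma(n_obs, shape = shape, rate = shape / mu)
}

# Overlay replicated densities on the observed claim cost density.
plot(density(y_train), lwd = 2, main = "Posterior predictive check")
for (s in seq_len(n_rep)) lines(density(y_rep[s, ]), col = rgb(0, 0, 1, 0.05))
```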
The simulated data appear to capture the general center and spread.
However, the fact that the peak of the observed data is much higher implies that the most typical claim costs are not being adequately captured by our model. In other words, the model’s predictions are consistently underestimating the frequency of the most common claim costs in comparison to what is observed in the train data.
Residual checks (for test set)
First, we have observation residuals, based on the estimates of area means.
When examining the residuals against the index plot to assess whether a model fits the data well, we are looking for random and symmetric scatter around zero. This would suggest that the model’s predictions are unbiased and not systematically overestimating or underestimating the target.
In this case, the residual points appear to scatter randomly but are not symmetrically distributed around zero.
This suggests a systematic bias in our model: it does not capture all the relevant predictors or interactions, leading to biased predictions. It is possible that some key variables affecting the target are missing from the model.
In the predicted vs residuals plot, the residuals don’t appear to be randomly distributed around zero. There is a downwards trend as the predicted values get larger. This again suggests we don’t have enough information in our model to explain the target variable claimcst0.
The histogram shows a gamma error distribution, which is expected.
Next, let’s see how area means differ from the overall area mean.
There are only six areas, so there is not much information to see here, but there is no obvious violation of our assumptions.
Gaussian model
Model specification in JAGS
\[ \begin{align*} & y_i \sim \text{Normal}(\mu_i, \text{precision}) \\ & \log(\mu_i) = a_{\text{area}_i} + \mathbf{b} \cdot \mathbf{X} + \log(\text{numclaims}_i) \\ & a_{\text{area}_i} \sim \text{Normal}(a_0, \text{precision}_a) \quad \text{for } \text{area}_i = 1, 2, \ldots, \text{number of areas} \\ & a_0 \sim \text{Normal}(0, \frac{1}{10^6}) \\ & \text{precision}_a \sim \text{Gamma}(0.01, \frac{1}{0.01}) \\ & b_j \sim \text{Normal}(0, \frac{1}{10^6}) \quad \text{for } j = 1, 2, \ldots, \text{number of predictors} \\ & \text{precision} \sim \text{Gamma}(0.01, \frac{1}{0.01}) \end{align*} \]
Note X is a matrix of predictor variables and b is a vector of corresponding coefficients.
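Relative to the Gamma model string sketched earlier, only the likelihood and the observation-level precision prior change; a sketch under the same assumed data list names:

```r
gaussian_model_string <- "model {
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], prec)
    log(mu[i]) <- a[area[i]] + inprod(X[i, ], b[]) + log(numclaims[i])
  }
  for (k in 1:n_area) {
    a[k] ~ dnorm(a0, prec_a)
  }
  a0     ~ dnorm(0.0, 1.0E-6)
  prec_a ~ dgamma(0.01, 1 / 0.01)
  prec   ~ dgamma(0.01, 1 / 0.01)
  for (j in 1:p) {
    b[j] ~ dnorm(0.0, 1.0E-6)
  }
}"
```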
Model checking
All the plots are very similar to those of the Gamma hglm, so similar conclusions were reached.
Predicted claim costs vs test set’s actual claim costs
For the test data, both models underpredict the proportion of claim costs in the lowest range ($0, $1000) and overpredict the proportion in the ($1000, $3000) range.
Thus, these two models are not adequate to generalize to actual data.
Conclusions
Results
- A hold-out test set was used to evaluate how well the two models generalize to new data. Comparing the predicted claim costs against the test set’s actual values, both models underpredict the proportion of claim costs in the lowest range ($0, $1000) and overpredict the proportion in the ($1000, $3000) range. Thus, the two models do not generalize adequately to actual data.
- For the train set, many values were overestimated by the two models. Some underestimated values with large error magnitudes were also observed.
- Changing the likelihood between Gamma and Gaussian does not seem to affect the test set’s predictive performance much, although the Gamma hglm is preferred based on its lower DIC (63330 vs 70986) on the train set.
- The test set RMSE also favors the Gamma hierarchical glm over the Gaussian hierarchical glm with log link (3327.484 vs 3332.711); a sketch of how these DIC and RMSE comparisons could be computed follows this list.
- We don’t have enough information in our model to explain the target variable claimcst0, as indicated by both models’ fitted vs residuals plots (based on the train set).
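A sketch of the DIC and test-set RMSE comparisons referenced above, assuming compiled rjags model objects (`gamma_fit`, `gaussian_fit`) and posterior-mean test predictions (`pred_gamma`, `pred_gaussian`); all of these object names are assumptions:

```r
library(rjags)

# DIC for each fitted model (lower is better).
dic_gamma    <- dic.samples(gamma_fit,    n.iter = 5000, type = "pD")
dic_gaussian <- dic.samples(gaussian_fit, n.iter = 5000, type = "pD")

# Test-set RMSE from posterior-mean predictions.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
rmse(test$claimcst0, pred_gamma)
rmse(test$claimcst0, pred_gaussian)
```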
Limitations and future work
It is difficult to verify the data collection process and its authenticity as the source website provided by the R documentation is inaccessible.
The parameters of this model are not interpretable because of the way the factor predictors are handled (converted from factor type to numeric type). This could be improved by using one-hot encoding in the next iteration.
The analysis does not include a prior sensitivity analysis.
Weakly informative priors were used. We could improve our prior selection by incorporating expert knowledge.
Interaction terms can be explored as additional predictors.
The six areas can be aggregated into only two main areas, (A,B,C) and (D,E,F).
Write vectorized code for conducting posterior predictive checks and use the DHARMa package for residual diagnostics.
The train set was used twice: once for building the models and again for checking whether the models fit the data. This is generally a bad idea; it would be better to validate the models on data that was not used for building them.
When a severity model is finalized, we can move on to build a frequency model and then a pure premium model.
References
Appendix A: Data dictionary
dataCar is a data frame with 67856 observations on the following 10 variables.
- veh_value: vehicle value, in $10,000s
- exposure: 0-1
- clm: occurrence of claim (0 = no, 1 = yes)
- numclaims: number of claims
- claimcst0: claim amount (0 if no claim)
- veh_body: vehicle body, coded as BUS, CONVT, COUPE, HBACK, HDTOP, MCARA, MIBUS, PANVN, RDSTR, SEDAN, STNWG, TRUCK, UTE
- veh_age: 1 (youngest), 2, 3, 4
- gender: a factor with levels F, M
- area: a factor with levels A, B, C, D, E, F
- agecat: 1 (youngest), 2, 3, 4, 5, 6