I created this notebook in efforts to learn about generalized linear models, and to get a taste of how GLMs are used in the P&C insurance industry. The original analysis was conducted by Ernesto Schirmacher, and he describes his work as a guided tutorial in “Predictive Modeling Applications in Actuarial Science.” The data (sim-modeling-data.csv) contains features about car insurance policyholders, as well as their risk exposure, claim counts, and claim severity. The table below summarizes the types of variables in the data.

data= data.frame(col1=c("year", "exposure", "", "", "", "", "", "", "", ""), 
                 col2=c("age", "driver.gender", "marital.status", "yrs.licensed", 
                        "ncd.level", "nb.rb", "prior.claims", "", "", ""),
                 col3=c("body.code", "driver.age", "vehicle.value", "seats", "ccm (engine                            size)", "hp (horsepower)", "length", "width", "height", "fuel type"),
                 col4=c("region", "", "", "", "", "", "", "", "", ""), 
                 col5=c("clm.count", "clm.incurred", "", "", "", "", "", "", "", ""))
colnames(data) = c("Control", "Driver", "Vehicle", "Geographic", "Response")
data

This notebook is divided into 3 parts.

Frequency Modeling: I used a log-link, Poisson regression model to model the accident frequency (claim counts per unit exposure)
Severity Modeling: I used a log-link, Gamma regression model to model the accident severity (claim cost per claim count, condition on a positive number of claim counts)
Pure Premium Modeling: I multiplied the outputs of the frequency model and the severity model, to compute the predicted pure premium (claim cost per unit exposure) for each policyholder. Finally, I divided the policyholders into 20 equally sized bins, ordered from smallest to largest predicted pure premium, and compared the avg. predicted pure premium to the avg. actual claim cost per unit exposure for each bin.

library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(forcats)
library(set)
library(data.table)

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(cluster)
library(statmod)
library(ggpubr)

source("utils/data_preprocessing.R")
source("utils/categorical_eda.R")
source("utils/region_eda.R")

df_full = read.csv("utils/sim-modeling-dataset.csv", row.names='row.id')

# remove outlier from yrs.lic (there's one sample with yrs.lic = 9+)
ix = which(df_full$yrs.lic == "9+")
df_full = df_full[-ix, ]

# make categorical features
df_full = make_categorical(df_full)

# drop duplicate features
df_full = subset(df_full, select=-c(driver.age, yrs.licensed, vehicle.age))

# frequency: lambda parameter of Poisson distribution
freq = df_full$clm.count/df_full$exposure
df_full$freq = freq

#Use 70% of dataset as training set and remaining 30% as testing set
sample = sample(c(TRUE, FALSE), nrow(df_full), replace=TRUE, prob=c(0.6,0.4))
df_train = df_full[sample, ]
df_test = df_full[!sample, ]

Frequency Modeling

1.) Region EDA

The region variable has too many categories. These variables need to be grouped to create a more efficient model. Schirmacher fits a regression coefficient for each region and groups the regions that have similar coefficient values. If we had an understanding of which regions were close together, this would have been the ideal way to group these variables, but this information is not available.

df_region = df_train["region"]
df_region$freq = df_train$freq
df_region_freq = df_region %>% group_by(region) %>% summarise(avg_freq=mean(freq))
df_region_freq = arrange(df_region_freq, avg_freq)

I will describe my method, which is just slightly different from Schirmacher’s method of grouping regions.

Use K means to group the regions into K clusters, based on average accident frequency (i.e. regions with similar levels of accident frequencies get grouped together into the same cluster.)
Perform a Poisson regression model with claim counts as the response variable and a re-coded region variable as the feature. The region variable is re-coded based on the clusters generated by the K-means algorithm.
Compute the model’s Akaike Information Criterion (AIC).
Loop through K = (1 to 15), and for each K, conduct the 3 steps mentioned above. Determine “best_K” the K that produces the lowest AIC.
Finally, group the regions into “best_K” clusters to construct the transformed region variable.

The plot below shows how the AIC scores varied by the number of clusters. The red point shows the “best_K,” number of clusters that produced the lowest AIC score.

# cluster the regions based on average accident frequency
N = 1 # number of trials 
clusters = factor(seq(1, 15))
aic_matrix = matrix(nrow=length(clusters), ncol=N)

for (i in 1:N){
  aic = determine_num_clusters(df_train, df_region_freq, clusters)
  aic_matrix[, i] = aic
}

aic_df = data.frame(rowMeans(data.frame(aic_matrix)))
colnames(aic_df) = "AIC"
rownames(aic_df) = clusters

# extract the number of clusters that produces the minimum average AIC score
min_aic = min(aic_df$AIC)
min_cluster = which(aic_df == min(aic_df$AIC))

# plot the average AIC scores and isolate the mean
ggplot(data=aic_df, aes(x=clusters, y=AIC)) +
  geom_point() + 
  geom_point(data=aic_df,
             aes(x=min_cluster, y=min_aic),
             color="red")

I created two different one way models: one using the original region variable with 38 different categories and another using the re-coded region variable with “best_K” different categories. The model using the re-coded region variable produced a much lower AIC, and as evident from the summaries printed below, the coefficients of the model using re-coded region are more statistically significant. I will proceed with my analysis using the re-coded region variable.

cluster_dict = cluster_regions(df_region_freq, min_cluster)
df_train_ = recode_region(df_train, cluster_dict)

region_m = glm(formula=clm.count ~ region, 
                  family=poisson(link='log'),
                  data=df_train_,
                  offset=log(exposure))

region_recode_m = glm(formula=clm.count ~ region.recode, 
                  family=poisson(link='log'),
                  data=df_train_,
                  offset=log(exposure))

print(summary(region_m))

## 
## Call:
## glm(formula = clm.count ~ region, family = poisson(link = "log"), 
##     data = df_train_, offset = log(exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6781  -0.4717  -0.3811  -0.2543   4.4226  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.62019    0.04730 -34.255  < 2e-16 ***
## region1     -0.28791    0.12339  -2.333  0.01963 *  
## region2     -0.83194    0.27141  -3.065  0.00218 ** 
## region3     -0.24326    0.09109  -2.670  0.00757 ** 
## region4     -0.30860    0.09659  -3.195  0.00140 ** 
## region5     -0.30907    0.18573  -1.664  0.09610 .  
## region6     -0.12804    0.18860  -0.679  0.49720    
## region7     -0.87232    0.31974  -2.728  0.00637 ** 
## region8      0.14243    0.09344   1.524  0.12742    
## region9     -0.14275    0.21838  -0.654  0.51334    
## region10    -0.25720    0.15189  -1.693  0.09039 .  
## region11    -0.17889    0.24710  -0.724  0.46909    
## region12    -0.24207    0.19818  -1.221  0.22191    
## region13    -0.65676    0.19818  -3.314  0.00092 ***
## region14    -0.45002    0.17790  -2.530  0.01142 *  
## region15    -0.43908    0.11249  -3.903 9.49e-05 ***
## region16    -0.35557    0.21838  -1.628  0.10349    
## region18     0.04574    0.10605   0.431  0.66621    
## region19    -0.30162    0.21381  -1.411  0.15834    
## region20    -0.38976    0.24040  -1.621  0.10496    
## region21    -0.03292    0.19481  -0.169  0.86581    
## region22    -0.26688    0.26250  -1.017  0.30930    
## region23    -0.04349    0.19162  -0.227  0.82044    
## region24    -0.08984    0.22855  -0.393  0.69425    
## region25    -0.08681    0.12550  -0.692  0.48915    
## region26     0.08486    0.19481   0.436  0.66312    
## region27    -0.52176    0.20174  -2.586  0.00970 ** 
## region28    -0.36984    0.23424  -1.579  0.11436    
## region29    -0.71294    0.24040  -2.966  0.00302 ** 
## region30    -0.39818    0.25443  -1.565  0.11760    
## region31    -0.47894    0.30520  -1.569  0.11658    
## region32    -0.07220    0.11298  -0.639  0.52278    
## region33     0.14826    0.20953   0.708  0.47920    
## region34    -0.35054    0.24040  -1.458  0.14480    
## region35    -0.45897    0.16898  -2.716  0.00660 ** 
## region36    -0.04045    0.15484  -0.261  0.79392    
## region37    -0.24932    0.19481  -1.280  0.20062    
## region38     0.15020    0.22328   0.673  0.50115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10003.4  on 24555  degrees of freedom
## Residual deviance:  9906.2  on 24518  degrees of freedom
## AIC: 13898
## 
## Number of Fisher Scoring iterations: 6

print(summary(region_recode_m))

## 
## Call:
## glm(formula = clm.count ~ region.recode, family = poisson(link = "log"), 
##     data = df_train_, offset = log(exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6322  -0.4687  -0.3827  -0.2581   4.4138  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    -1.64049    0.09091 -18.045  < 2e-16 ***
## region.recode2 -0.28045    0.09877  -2.839  0.00452 ** 
## region.recode3 -0.37548    0.11012  -3.410  0.00065 ***
## region.recode4 -0.64397    0.15971  -4.032 5.53e-05 ***
## region.recode5  0.03031    0.09647   0.314  0.75340    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 10003.4  on 24555  degrees of freedom
## Residual deviance:  9928.7  on 24551  degrees of freedom
## AIC: 13854
## 
## Number of Fisher Scoring iterations: 6

print(paste("AIC of model using region variable: ", round(region_m$aic, 4)))

## [1] "AIC of model using region variable:  13897.8791"

print(paste("AIC of model using region re-coded variable: ", round(region_recode_m$aic, 4)))

## [1] "AIC of model using region re-coded variable:  13854.3553"

2.) Driver Age EDA

Plotting the avg. frequency against the driver age shows an interesting trend. Average frequency of accidents is relatively large for very young drivers and very old drivers.

# construct a bar plot of driver age
df_freq_age = df_train %>% group_by(drv.age) %>% summarise(avg.freq=mean(freq))
age_count = df_train %>% group_by(drv.age) %>% tally() %>% as.data.frame()

# plot the counts of the driver ages
ggplot(data=age_count, aes(x=drv.age, y=n)) + 
  geom_bar(stat='identity') + 
  scale_x_continuous(name="Driver Age", breaks=seq(20, 90, by=10)) + 
  labs(title="Count of Driver Ages")

# plot the average accident frequency by age
plt1 = ggplot(df_freq_age, aes(x=drv.age, y=avg.freq)) +
        geom_point() +
        geom_line() +
        scale_x_continuous(breaks=seq(15, 90, by=5))

plt2 = ggplot(df_freq_age, aes(x=drv.age, y=log(avg.freq))) +
        geom_point() +
        geom_line() +
        scale_x_continuous(breaks=seq(15, 90, by=5))

ggarrange(plt1, plt2, 
          nrow=2, ncol=1)

The lack of drivers beyond the age of 80 is the reason for the trailing zeros in the avg.freq vs. drv.age plot, and we shouldn’t draw any conclusions from this information. Thus, I’ll remove the outliers with drv.age >= 80.

When performing a Poisson regression with a log link, we assume that the log of the mean response is linearly related to the predictors. This is clearly not the case, as evident from the second plot above. Therefore, I think a transformation to the drv.age variable is necessary. Just like the region variable, I’ll try binning the drivers into different groups, to create a categorical variable.

# filter out drivers of age < 80
df_train = filter(df_train, drv.age < 80)

# recode the driver ages
age_bins = seq(15, 80, by=5)
df_train = recode_drv_age(df_train, age_bins)

I tried binning the drivers into the following age groups (15-19, 20-24, …, 75-79). Binning the driver ages provides significant improvement as compared to the model that uses the original drv.age variable. The AIC score drops significantly and there are several statistically significant age groups. I will continue with my analysis using the re-coded drv.age variable.

model1 = glm(formula=clm.count ~ drv.age,
                  family=poisson(link='log'),
                  data=df_train,
                  offset=log(exposure))


model2 = glm(formula=clm.count ~ drv.age.recode,
                  family=poisson(link='log'),
                  data=df_train,
                  offset=log(exposure))

print(summary(model1))

## 
## Call:
## glm(formula = clm.count ~ drv.age, family = poisson(link = "log"), 
##     data = df_train, offset = log(exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6495  -0.4791  -0.3845  -0.2533   4.4022  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.353006   0.093172 -14.522  < 2e-16 ***
## drv.age     -0.010161   0.002085  -4.874  1.1e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9991.9  on 24514  degrees of freedom
## Residual deviance: 9967.8  on 24513  degrees of freedom
## AIC: 13883
## 
## Number of Fisher Scoring iterations: 6

print(summary(model2))

## 
## Call:
## glm(formula = clm.count ~ drv.age.recode, family = poisson(link = "log"), 
##     data = df_train, offset = log(exposure))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3887  -0.4753  -0.3710  -0.2567   4.2364  
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -1.35213    0.07143 -18.930  < 2e-16 ***
## drv.age.recode1   1.60344    0.58175   2.756  0.00585 ** 
## drv.age.recode2   0.14124    0.18133   0.779  0.43604    
## drv.age.recode4  -0.28603    0.09221  -3.102  0.00192 ** 
## drv.age.recode5  -0.64617    0.09313  -6.938 3.97e-12 ***
## drv.age.recode6  -0.51193    0.09120  -5.613 1.98e-08 ***
## drv.age.recode7  -0.44861    0.08998  -4.986 6.17e-07 ***
## drv.age.recode8  -0.50101    0.09459  -5.296 1.18e-07 ***
## drv.age.recode9  -0.47834    0.10279  -4.653 3.27e-06 ***
## drv.age.recode10 -0.67461    0.12372  -5.453 4.96e-08 ***
## drv.age.recode11 -0.73020    0.18133  -4.027 5.65e-05 ***
## drv.age.recode12 -0.53905    0.24028  -2.243  0.02487 *  
## drv.age.recode13  0.13882    0.26000   0.534  0.59339    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 9991.9  on 24514  degrees of freedom
## Residual deviance: 9905.8  on 24502  degrees of freedom
## AIC: 13843
## 
## Number of Fisher Scoring iterations: 6

print(paste("AIC of model using drv.age variable: ", round(model1$aic, 4)))

## [1] "AIC of model using drv.age variable:  13883.4845"

print(paste("AIC of model using drv.age re-coded variable: ", round(model2$aic, 4)))

## [1] "AIC of model using drv.age re-coded variable:  13843.4306"

3.) EDA for other Categorical Variables

I generated the histogram below by sampling policyholders from each category within the marital status feature (i.e. divorced, married, single, widow), and computed the average accident frequency (sample mean). I repeated this “num_trials = 250” times, to generate a bell-shaped curve for each category within marital status (Because the plot below is a histogram of sample means, we get a bell-shaped curve as a result of the central limit theorem)

It’s really interesting that widows have a higher accident frequency than other demographics. My suspicion is that many of these widows are senior citizens, and as we saw from the analysis on drv.age, accident frequencies are relatively large for the elderly.

category = "marital.status"
num_trials = 250
res = category_sample_means(df_train, category, num_trials)
plt_year = res[[1]] # plt_year is a ggplot and should be plotted from the terminal
df_result = res[[2]]
show(plt_year)

I created a widow dummy variable and compared the feature’s performance against the original marital status variable. As shown from the output below, the two features do not show a significant difference in AIC. Given that the widow dummy uses 1 degree of freedom as opposed to the marital status feature’s 3, we can conclude that the widow dummy is simpler and provides just as much predictive power.

SIDE NOTE: I conducted this kind of analysis for all the categorical features. For the sake of space and concision, I have not included histograms for all the categorical features. That being said, this exercise gave me a quick understanding of which features were likely to useful when frequency modeling. Variables such as year, nb.rb (new business vs. renewal business), and ncd.level (no claim discount level) showed promise.

df_train_ = recode_marital_status(df_train)
marital_stat_m = glm(formula=clm.count ~ marital.status, 
                      family=poisson(link='log'),
                      data=df_train_,
                      offset=log(exposure))

marital_stat_recode_m = glm(formula=clm.count ~ marital.status.recode, 
                            family=poisson(link='log'),
                            data=df_train_,
                            offset=log(exposure))

print(paste("AIC of model using marital status variable: ", round(marital_stat_m$aic, 4)))

## [1] "AIC of model using marital status variable:  13890.0904"

print(paste("AIC of model using widow dummy variable: ", round(marital_stat_recode_m$aic, 4)))

## [1] "AIC of model using widow dummy variable:  13890.324"

4.) Construction of One-Way Models and Ranking Predictors by Deviance and AIC

I generated the plot below by constructing one-way frequency models (Poisson regression models that use claim counts as the response, and a single feature as the explanatory variable) and computing each model’s deviance and AIC. The deviance measures how much the model’s log likelihood differs from the log likelihood of the saturated model. The AIC measures the fit of the model but also has a penalty term for a larger number of parameters. Thus, a model producing a low deviance and low AIC score is desired => Explanatory variables on the lower left corner have more predictive power. A quick glance at this plot shows that ncd.level, year, yrs.lic, region.recode, drv.age.recode, and nb.rb contain meaningful information.

I measured the Euclidean distance of each one way model from the null model within the deviance vs AIC plot. The table contains the one way models, ranked by their distance from the null model. Although this plot changes every time I run this script, the features mentioned above (ncd.level, year, yrs.lic, region.recode, drv.age.recode, and nb.rb) consistently have the greatest Euclidean distance from the null model. Thus, I constructed my frequency model using these 6 variables.

# create a Deviance vs. AIC plot of the continuous variables - variables that produce a lower deviance and AIC score are more promising as explanatory variables
df_train_ = recode_marital_status(df_train)
df_train_ = recode_region(df_train_, cluster_dict)
df_train_ = recode_drv_age(df_train_, seq(15, 80, by=5))
df_train_ = subset(df_train_, select=-c(region, marital.status, drv.age, freq, clm.incurred))

# extract the variables
vars = as.list(colnames(df_train_))
vars[[which(vars=="exposure")]] = NULL
vars[[which(vars=="clm.count")]] = NULL

# compute the AIC and Deviance for each one way model 
df_result = compute_aic_dev(df_train_, vars, model_type="frequency")

plt = ggplot(df_result, aes(x=aic, y=dev, label=rownames(df_result))) + 
           geom_text(check_overlap = TRUE, size=2, nudge_x=2) + 
           geom_point(size=3, shape=23) +
           ggtitle("Deviance AIC Plot of Variables")

show(plt)

# # append the dataframe with distance from null
distance = aic_dev_dist(df_result, dist_null=TRUE)
df_result$dist_to_null = distance
df_result = df_result[order(-df_result$dist_to_null), ]
print(df_result)

##                            aic      dev dist_to_null
## ncd.level             13682.39 9758.736   322.751195
## year                  13747.23 9827.577   228.185987
## yrs.lic               13783.06 9853.405   184.894585
## nb.rb                 13827.13 9911.470   112.340129
## drv.age.recode        13843.43 9905.774   106.195238
## region.recode         13838.47 9916.811   100.695170
## prior.claims          13853.19 9937.535    75.483967
## veh.age               13872.78 9925.125    74.385846
## body.code             13892.54 9964.885    29.988102
## height                13889.13 9973.475    24.684392
## marital.status.recode 13890.32 9974.668    23.000487
## ccm                   13896.15 9980.498    14.779381
## seats                 13909.73 9982.071    10.677864
## weight                13900.83 9985.174     8.221077
## width                 13901.44 9985.788     7.367146
## driver.gender         13902.55 9986.890     5.845823
## length                13903.60 9987.948     4.409024
## fuel.type             13909.41 9991.752     3.854321
## vehicle.value         13906.10 9990.441     1.556500
## hp                    13906.36 9990.706     1.440767
## null                  13905.56 9991.901     0.000000

5.) Evaluation of Frequency Model

In this cell I did a simple evaluation of the 6 variable frequency model by using it to perform predictions on the training and test data. The total number of predicted claims on the training data is exactly equal to the total number of actual claims. There’s a slight disparity between these values on the test data but that’s expected, and it’s small enough for me to feel good about this model.

cc_m = glm(formula=clm.count ~ year+ncd.level+yrs.lic+drv.age.recode+nb.rb+region.recode, 
           family=poisson(link='log'),
           data=df_train_,
           offset=log(exposure))

# recode the variables in df_test
df_test_ = recode_marital_status(df_test)
df_test_ = recode_region(df_test_, cluster_dict)
df_test_ = recode_drv_age(df_test_, seq(15, 80, by=5))
df_test_ = subset(df_test_, select=-c(region, marital.status, drv.age))

eval_df = evaluate_model(cc_m, df_train_, df_test_)

eval_df

Here I divided the policyholders into 20 equally sized bins, ordered from smallest to largest predicted frequency, and compared the avg. predicted frequency to the avg. actual frequency for each bin. The red points are the avg. predicted frequencies for each bin, and the black dots are the avg. actual frequency for each bin. Given that the predicted avg. frequency is tracking the actual avg. frequency quite closely across the bins, I’m happy with this frequency model.

#################################
### Frequency Prediction ########
#################################

# predict the claim counts, assuming we have access to the exposure
cc_pred = predict(cc_m, df_test_, type="response")

# divide the predicted claim counts with exposure to get the frequency
df_test$p.freq = cc_pred/df_test_$exposure

# split the predictions into 20 qunatiles
q_pred = quantile(df_test$p.freq, probs=seq(0, 1, by=0.05))
N = length(q_pred)
predictions = numeric(N-1)
ground_truth = numeric(N-1)

for (i in 1:(N-1)){
  # get the predicted frequencies of the policyholders in the i_th quantile
  qi_pred = filter(df_test, (p.freq < q_pred[i+1]) & (p.freq > q_pred[i]))
  qi_pred$id = rownames(qi_pred)
  predictions[i] = mean(qi_pred$p.freq)
  ground_truth[i] = mean(qi_pred$freq)
}

# plot the predictions and ground truth
df_freq = data.frame(predictions, ground_truth)
x = seq(1, N-1, by=1)
ggplot(df_freq) + 
  geom_point(mapping = aes(x=x, y=predictions), color="red") + 
  geom_point(mapping = aes(x=x, y=ground_truth), color="black") + 
  xlab("quantile") + 
  ylab("average frequency") +
  scale_x_continuous(breaks=x)

Severity Modeling

The Gamma distribution is used to model data that takes on positive values with a heavy right-skew, making it the convention for severity modeling. The figure below demonstrates that the severity of our data has these attributes. Because the response variable (severity) has a heavy right skew, I used the median severity rather than the mean severity when analysing the relation between severity and different explanatory variables.

source("utils/sev_data_preprocessing.R")

df_train_sev = df_full[sample, ]
df_test_sev = df_full[!sample, ]

# filter data with positive claim counts
df_train_sev = filter(df_train_sev, clm.count > 0)
df_test_sev = filter(df_test_sev, clm.count > 0)

# severity plot
severity = df_train_sev$clm.incurred/df_train_sev$clm.count
plt = ggplot() + geom_histogram(mapping=aes(x=severity))
show(plt)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

1.) Exploration of vehicle-related, continuous features

I originally assumed that the vehicle-related features like weight, length, width, vehicle value, horsepower, and engine size (ccm) would be more important for severity modeling than frequency modeling. However, as evident from the plots depicting the relationship between the log(severity) and each predictor, there doesn’t seem to be any relationship.

The third row shows the relation between log(severity) and vehicle.value. The negative relation seems contradictory because one would assume larger vehicle values generate larger accident severities. Looking closer at the plot shows that the outliers (vehicle.values >= 50) may be creating this misleading result. The final plot is the same plot with the outliers excluded, and doesn’t show a significant relation between severity and vehicle value.

# filter the data by attributes related to the vehicle
tmp = df_train_sev
tmp$severity = tmp$clm.incurred/tmp$clm.count
df_vehicle = subset(tmp, select=c(vehicle.value, ccm, hp, weight, length, width, height, severity))

# generate plots 
plt_vv = plot_log_severity(df_vehicle, "vehicle.value")
plt_ccm = plot_log_severity(df_vehicle, "ccm")
plt_hp = plot_log_severity(df_vehicle, "hp")
plt_weight = plot_log_severity(df_vehicle, "weight")
plt_length = plot_log_severity(df_vehicle, "length")
plt_width = plot_log_severity(df_vehicle, "width")
plt_height = plot_log_severity(df_vehicle, "height")

# cut out outliers from vehicle.value
df_vehicle_ = filter(df_vehicle, vehicle.value <= 50)
plt_vv_ = plot_log_severity(df_vehicle_, "vehicle.value")

ggarrange(plt_ccm, plt_hp, plt_weight, plt_length, plt_width, plt_height, plt_vv, plt_vv_,
          ncol = 3, nrow = 3)

## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

# Let's cut these outliers from the training data so that the model doesn't think there is 
# significant relation between vehicle.value and severity
df_train_sev = filter(df_train_sev, vehicle.value <= 50)

2.) Region EDA

I used the same method described in the “Region EDA” section of frequency modeling, to group the 38 regions into more meaningful clusters. The one and only difference is that I am now using a Gamma regression model with claims incurred (a.k.a. claim cost) as the response variable. Again, the red point in the plot below indicates the number of clusters that produced the lowest AIC score.

df_region = df_train_sev["region"]
df_region$sev = df_train_sev$clm.incurred/df_train_sev$clm.count
df_region_sev = df_region %>% group_by(region) %>% summarise(med.sev=median(sev))
df_region_sev = arrange(df_region_sev, med.sev)

# region by count
df_region_count = df_region %>% group_by(region) %>% tally()

N = nrow(df_train_sev)
clusters = seq(1, 15, 1)
aic_dict = c()

for (K in clusters){
  # perform k means on the average severities
  df_region_sev = make_clusters(df_region_sev, K)
  
  # recode the regions based on the clusters
  df_train_sev = recode_region(df_train_sev, df_region_sev)
  
  if(K == 1){
    fmla = as.formula(clm.incurred ~ region)}
  else{
    fmla = as.formula(clm.incurred ~ region.recode)}
  
  # create the GLM and compute the AIC
  model = glm(formula=fmla,
               data=df_train_sev,
               family=Gamma(link="log"),
               weights=clm.count,
               offset=log(clm.count))
  aic_dict[as.character(K)] = model$aic
}

# compute the average of the AIC
aic = data.frame(aic_dict)
colnames(aic) = "aic"
aic$cluster = rownames(aic)

# generate plot of AIC vs. number of clusters
best_aic = min(aic$aic)
best_cluster = aic[aic$aic == best_aic, ]$cluster
ggplot(data=aic, aes(x=cluster, y=aic)) + 
  geom_point() + 
  geom_point(aes(x=best_cluster, y=best_aic), color="red") + 
  scale_x_discrete(limits=aic$cluster)

The summaries printed below show similar results to the frequency model. The model using the re-coded region variable produced a much lower AIC as compared to the model using the original region variable, and as evident from the model summaries, the coefficients of the model using re-coded region are more statistically significant. I will proceed with my analysis using the re-coded region variable, but keep in mind that this variable is different from the re-coded region variable from the frequency modeling section.

# best cluster was the number of clusters that generated the best AIC
# Recode the region variable into "best_cluster" clusters
df_region_sev = make_clusters(df_region_sev, best_cluster)
df_train_sev = recode_region(df_train_sev, df_region_sev)

df_region_sev$cluster = as.character(df_region_sev$cluster)
plt = ggplot(data=df_region_sev, aes(x=region, y=med.sev, color=cluster)) +
      geom_point() +
      scale_x_discrete(limits = c(df_region_sev$region)) +
      ggtitle("Median Accident Severity by Region (Clustered), Given > 0 Claim Counts") +
      xlab("Region (Clustered)") +
      ylab("Median Accident Severity")

show(plt)

model = glm(formula=clm.incurred ~ region.recode,
            data=df_train_sev,
            family=Gamma(link="log"),
            weights=clm.count,
            offset=log(clm.count))

model1 = glm(formula=clm.incurred ~ region,
             data=df_train_sev,
             family=Gamma(link="log"),
             weights=clm.count,
             offset=log(clm.count))

print(summary(model1))

## 
## Call:
## glm(formula = clm.incurred ~ region, family = Gamma(link = "log"), 
##     data = df_train_sev, weights = clm.count, offset = log(clm.count))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1551  -1.6574  -0.6637   0.2994   3.7611  
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.6871127  0.0647966 103.202  < 2e-16 ***
## region1     -0.2600992  0.1688720  -1.540  0.12368    
## region2      0.5806993  0.3714215   1.563  0.11812    
## region3     -0.1970908  0.1255297  -1.570  0.11657    
## region4      0.2582641  0.1329299   1.943  0.05218 .  
## region5     -0.1059994  0.2541737  -0.417  0.67670    
## region6      0.1353294  0.2581042   0.524  0.60012    
## region7      0.3730865  0.4375570   0.853  0.39396    
## region8      0.0223680  0.1278991   0.175  0.86119    
## region9      0.1788675  0.2988572   0.599  0.54958    
## region10    -0.2801058  0.2078715  -1.347  0.17799    
## region11    -0.3229219  0.3381569  -0.955  0.33973    
## region12    -0.2599754  0.2712070  -0.959  0.33789    
## region13     0.0177614  0.2712070   0.065  0.94779    
## region14    -0.6810043  0.2468670  -2.759  0.00586 ** 
## region15     0.1047801  0.1546283   0.678  0.49809    
## region16     0.3090058  0.3055632   1.011  0.31202    
## region18     0.1105214  0.1456778   0.759  0.44815    
## region19     0.0439537  0.2926002   0.150  0.88061    
## region20    -0.0763531  0.3289841  -0.232  0.81650    
## region21    -0.4758559  0.2666013  -1.785  0.07444 .  
## region22    -0.0936600  0.3592171  -0.261  0.79433    
## region23    -0.0890879  0.2666013  -0.334  0.73830    
## region24     0.0383657  0.3127736   0.123  0.90239    
## region25    -0.1478387  0.1781473  -0.830  0.40672    
## region26     0.4373783  0.2666013   1.641  0.10106    
## region27     0.0878237  0.2812501   0.312  0.75488    
## region28     0.4444020  0.3205545   1.386  0.16581    
## region29    -0.0531197  0.3289841  -0.161  0.87174    
## region30    -0.3032735  0.3481875  -0.871  0.38386    
## region31    -0.7890751  0.4176514  -1.889  0.05901 .  
## region32     0.1059696  0.1546283   0.685  0.49323    
## region33    -0.2514078  0.2867448  -0.877  0.38073    
## region34    -0.4771018  0.3289841  -1.450  0.14716    
## region35    -0.1528550  0.2312507  -0.661  0.50870    
## region36     0.0575542  0.2119121   0.272  0.78596    
## region37     0.0004425  0.2666013   0.002  0.99868    
## region38     0.0904434  0.3055632   0.296  0.76727    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.872575)
## 
##     Null deviance: 4042.2  on 1893  degrees of freedom
## Residual deviance: 3956.4  on 1856  degrees of freedom
## AIC: 31679
## 
## Number of Fisher Scoring iterations: 9

print(summary(model))

## 
## Call:
## glm(formula = clm.incurred ~ region.recode, family = Gamma(link = "log"), 
##     data = df_train_sev, weights = clm.count, offset = log(clm.count))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.1208  -1.6683  -0.6837   0.2585   3.8094  
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     6.66791    0.04041 165.015  < 2e-16 ***
## region.recode2  0.37398    0.23658   1.581  0.11409    
## region.recode3 -0.24463    0.08415  -2.907  0.00369 ** 
## region.recode4  0.20614    0.07690   2.681  0.00741 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.956099)
## 
##     Null deviance: 4042.2  on 1893  degrees of freedom
## Residual deviance: 3996.7  on 1890  degrees of freedom
## AIC: 31637
## 
## Number of Fisher Scoring iterations: 7

print(paste("AIC of model using region variable: ", round(model1$aic, 4)))

## [1] "AIC of model using region variable:  31679.191"

print(paste("AIC of model using region re-coded variable: ", round(model$aic, 4)))

## [1] "AIC of model using region re-coded variable:  31637.4615"

2.) Driver Age EDA

The severity has heavy right tails, with large values that can inflate the mean, so I checked the median severity for each age group. The plot below shows a similar trend to the accident frequency. It looks like severity tends to be larger for very young drivers and very old drivers.

df_age = df_train_sev["drv.age"]
df_age$sev = df_train_sev$clm.incurred/df_train_sev$clm.count
df_age_sev = df_age %>% group_by(drv.age) %>% summarise(med.sev=median(sev))
df_age_count = df_age %>% group_by(drv.age) %>% tally()

plt = ggplot(df_age_sev, aes(x=drv.age, y=med.sev)) + 
        geom_point() + 
        geom_line() + 
        scale_x_discrete(limits=seq(15, 90, 5))+
        ylab("Median Severity") + 
        xlab("Driver Age")

## Warning: Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?

plt1 = ggplot(df_age_count, aes(x=drv.age, y=n)) +
        geom_bar(stat="identity") +
        scale_x_discrete(limits=seq(15, 90, 5))

## Warning: Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?

show(plt)

show(plt1)

Just like the frequency model, I think it will be necessary to bin driver ages into groups. Given that there is a small number of drivers under the age of 20 and over the age of 80, I will remove these samples. As evident from the plot above, the median severity seems relatively consistent for drivers of ages [25, 65], so I will create 3 age groups: < 25, [25, 65], >65.

Splitting the drivers into 3 age bins seems to provide a little bit of improvement. The model summaries show that the model using the re-coded driver age variable has a lower deviance and AIC as compared to the model using the original driver age variable. Also, the parameter estimates have lower p-values.

# remove outliers
df_train_sev = filter(df_train_sev, (drv.age>19) & (drv.age<80))
bins = c(25, 65)
df_train_sev = recode_drv_age(df_train_sev, bins)

model1 = glm(formula=clm.incurred ~ drv.age,
             data=df_train_sev,
             family=Gamma(link="log"),
             weights=clm.count,
             offset=log(clm.count))

model2 = glm(formula=clm.incurred ~ drv.age.recode,
             data=df_train_sev,
             family=Gamma(link="log"),
             weights=clm.count,
             offset=log(clm.count))

print(summary(model1))

## 
## Call:
## glm(formula = clm.incurred ~ drv.age, family = Gamma(link = "log"), 
##     data = df_train_sev, weights = clm.count, offset = log(clm.count))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.9638  -1.6648  -0.6951   0.2720   3.9065  
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.902754   0.122820  56.202   <2e-16 ***
## drv.age     -0.004971   0.002735  -1.817   0.0693 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.932435)
## 
##     Null deviance: 4041.2  on 1890  degrees of freedom
## Residual deviance: 4034.4  on 1889  degrees of freedom
## AIC: 31571
## 
## Number of Fisher Scoring iterations: 7

print(summary(model2))

## 
## Call:
## glm(formula = clm.incurred ~ drv.age.recode, family = Gamma(link = "log"), 
##     data = df_train_sev, weights = clm.count, offset = log(clm.count))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.0085  -1.6665  -0.6917   0.2727   3.8064  
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.3124     0.2322  31.486  < 2e-16 ***
## drv.age.recode1  -0.6388     0.2344  -2.725  0.00648 ** 
## drv.age.recode2  -0.6680     0.2858  -2.337  0.01952 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.941721)
## 
##     Null deviance: 4041.2  on 1890  degrees of freedom
## Residual deviance: 4023.2  on 1888  degrees of freedom
## AIC: 31566
## 
## Number of Fisher Scoring iterations: 7

3.) Categorical EDA

Again, I decided to plot the sample means of various categorical features to get a sense of which features could be useful for predicting severity. A feature has the potential to be useful when the sample means are significantly different from one another. Again, marital status seems to show some explanatory power, as the single community is generating a much larger average accident severity than other demographics. Driver gender also seems to be a promising feature, as the data shows that females generate a larger accident severity.

category = "marital.status"
num_trials = 250
r = category_sample_means(df_train_sev, category, num_trials, model_type="severity")
plt = r[[1]]
df_result = r[[2]]
show(plt)

category = "driver.gender"
num_trials = 250
r = category_sample_means(df_train_sev, category, num_trials, model_type="severity")
plt = r[[1]]
df_result = r[[2]]
show(plt)

4.) Construction of One-Way Models and Ranking Predictors by Deviance and AIC

Unlike the deviance vs. AIC plot from the frequency model, this plot shows a lot of variance every time the train-test split changes, and we do not have as many features that have significant separation from the null model. In fact, many one-way models land to the right of the null model. The year and region re-code variables are the only two variables that consistently produce significant separation from the null model, so these two features will be included in the severity model.

vars = as.list(c("year", "drv.age", "vehicle.value", "ccm", "hp", "weight", "length", 
                 "width", "height", "prior.claims", "marital.status", "yrs.lic",
                 "ncd.level", "region", "body.code", "veh.age", "seats", "fuel.type",
                 "driver.gender", "nb.rb", "region.recode", "drv.age.recode"))

# compute the AIC and Deviance for each one way model 
df_result = compute_aic_dev(df_train_sev, vars, model_type="severity")

# # append the datafrmae with distance from null
distance = aic_dev_dist(df_result, dist_null=TRUE)
df_result$dist_to_null = distance
df_result = df_result[order(-df_result$dist_to_null), ]

plt = ggplot(df_result, aes(x=aic, y=dev, label=rownames(df_result))) + 
           geom_text(check_overlap = TRUE, size=2, nudge_x=2) + 
           geom_point(size=3, shape=23) +
           ggtitle("Deviance AIC Plot of Variables")

show(plt)

5.) Construction of Multiway Models by adding features sequentially

Rather than considering each feature independently, we can try adding features sequentially. Starting with a base model including year and region.recode, I added features sequentially by running a partial F-test in which a “reduced model” is constructed using all prior features that have already been added, and a “full model” with the new candidate feature. The candidate feature was added to the model if it produced the lowest p-value among candidate features, and if it had a magnitude < 0.05.

vars = as.list(c("vehicle.value", "ccm", "hp", "weight", "length", 
                 "width", "height", "prior.claims", "marital.status", "yrs.lic",
                 "ncd.level", "body.code", "veh.age", "seats", "fuel.type",
                 "driver.gender", "nb.rb", "drv.age.recode"))

predictor_vars = as.list(c("year", 'region.recode'))
threshold = 0.05
p_val_best = Inf
fmla_best = ""
hit_condition = TRUE

while(hit_condition){
  p_val_best = Inf
  hit_condition = FALSE
  base_fmla = as.formula(paste("clm.incurred~", paste(predictor_vars, collapse="+")))
  base_model = glm(formula=base_fmla,
                   data=df_train_sev,
                   family=Gamma(link="log"),
                   offset=log(clm.count))
  
  for (var in vars){
    
      # add var to preds
      preds = paste(predictor_vars, sep = "", collapse=" + ")
      preds = paste(preds, var, sep = " + ") 
      fmla = as.formula(paste("clm.incurred ~ ", preds, sep = ""))
      
      # construct the glm model
      model = glm(formula=fmla,
                   data=df_train_sev,
                   family=Gamma(link="log"),
                   offset=log(clm.count))
      
      # get p-alue from F test
      anova_tbl = anova(model, test='F')
      p_values = anova_tbl$`Pr(>F)`
      p_val = p_values[length(p_values)]

      # if we find a variable that generates a p-value less than 0.05
      if ((p_val < threshold) & (p_val < p_val_best)){
        hit_condition = TRUE
        p_val_best = p_val
        fmla_best = fmla
        var_best = var
      }
  }
  # if the hit condition is TRUE, append predictor_vars with var
  if(hit_condition){
    print(paste("best p-value from candidate features: ", round(p_val_best, 3)))
    print(paste("current model formula:", format(fmla_best)))
    predictor_vars = append(predictor_vars, var_best)
    vars[[which(vars==var_best)]] = NULL
  }
}

## [1] "best p-value from candidate features:  0.023"
## [1] "current model formula: clm.incurred ~ year + region.recode + driver.gender"
## [1] "best p-value from candidate features:  0.03"
## [1] "current model formula: clm.incurred ~ year + region.recode + driver.gender + drv.age.recode"
## [1] "best p-value from candidate features:  0.036"
## [1] "current model formula: clm.incurred ~ year + region.recode + driver.gender + drv.age.recode + "
## [2] "current model formula:     vehicle.value"

# coming out of the while loop, we have the best formula
# create the claims incurred model with the best formula
ci_m = glm(formula=fmla_best,
             data=df_train_sev,
             family=Gamma(link="log"),
             weights=clm.count,
             offset=log(clm.count))

6.) Evaluation of Severity Model

# perform the necessary transformation on df_test
df_test_sev = recode_region(df_test_sev, df_region_sev)
df_test_sev = recode_drv_age(df_test_sev, bins)

eval_df = evaluate_model_sev(ci_m, df_train_sev, df_test_sev)
eval_df

Pure Premium Modeling

After completing the construction of my frequency and severity models, I generated predicted pure premiums (a prediction of claims incurred per unit exposure) for each policyholder by multiplying the predicted frequency with the predicted severity. I tested my results by dividing the policyholders into 20 equally sized bins (quantiles), ordered from smallest to largest predicted pure premium, and compared the avg. predicted pure premium to the avg. actual claim cost per unit exposure, for each bin.

The red points are the “avg. predicted pure premium” and the black points are the “avg. actual claim cost per unit exposure” for the policyholders in each bin or quantile. Although there is a difference for some of the bins, especially for the latter bins containing the higher risk policyholders, the predictions are tracking the ground truth quite well.

################################
### Severity Prediction ########
################################
# extract the model coefficients
sv_coef = ci_m$coefficients

# configure the full test data for model
df_test = recode_region(df_test, df_region_sev)
df_test = recode_drv_age(df_test, bins)
df_test_tr = model.matrix(object=ci_m$formula, data=df_test)

# take the product of df_test_ and sv_coef to compute the linear component of the model
lm_out = df_test_tr %*% sv_coef

# severity prediction are simply the exponential of lm_out
sev_pred = exp(lm_out)
df_test$p.sev = sev_pred

# compute the pure premium by multiplying frequency and severity
df_test$p.prem = df_test$p.freq * df_test$p.sev

# split the predictions into 20 qunatiles
q_pred = quantile(df_test$p.prem, probs=seq(0, 1, by=0.05))
N = length(q_pred)
predictions = numeric(N-1)
ground_truth = numeric(N-1)

for (i in 1:(N-1)){
  # get the predicted frequencies of the policyholders in the i_th quantile
  qi_pred = filter(df_test, (p.prem < q_pred[i+1]) & (p.prem > q_pred[i]))
  predictions[i] = mean(qi_pred$p.prem)
  ground_truth[i] = sum(qi_pred$clm.incurred)/sum(qi_pred$exposure)
}

df_pp = data.frame(predictions, ground_truth)
x = seq(1, N-1, by=1)
ggplot(df_pp) + 
  geom_point(mapping = aes(x=x, y=predictions), color="red") + 
  geom_point(mapping = aes(x=x, y=ground_truth), color="black") + 
  xlab("quantile") + 
  ylab("pure premium") +
  scale_x_continuous(breaks=x)