I created this notebook in efforts to learn about generalized linear models, and to get a taste of how GLMs are used in the P&C insurance industry. The original analysis was conducted by Ernesto Schirmacher, and he describes his work as a guided tutorial in “Predictive Modeling Applications in Actuarial Science.” The data (sim-modeling-data.csv) contains features about car insurance policyholders, as well as their risk exposure, claim counts, and claim severity. The table below summarizes the types of variables in the data.
data= data.frame(col1=c("year", "exposure", "", "", "", "", "", "", "", ""),
col2=c("age", "driver.gender", "marital.status", "yrs.licensed",
"ncd.level", "nb.rb", "prior.claims", "", "", ""),
col3=c("body.code", "driver.age", "vehicle.value", "seats", "ccm (engine size)", "hp (horsepower)", "length", "width", "height", "fuel type"),
col4=c("region", "", "", "", "", "", "", "", "", ""),
col5=c("clm.count", "clm.incurred", "", "", "", "", "", "", "", ""))
colnames(data) = c("Control", "Driver", "Vehicle", "Geographic", "Response")
data
This notebook is divided into 3 parts.
Frequency Modeling: I used a log-link, Poisson regression model to model the accident frequency (claim counts per unit exposure)
Severity Modeling: I used a log-link, Gamma regression model to model the accident severity (claim cost per claim count, condition on a positive number of claim counts)
Pure Premium Modeling: I multiplied the outputs of the frequency model and the severity model, to compute the predicted pure premium (claim cost per unit exposure) for each policyholder. Finally, I divided the policyholders into 20 equally sized bins, ordered from smallest to largest predicted pure premium, and compared the avg. predicted pure premium to the avg. actual claim cost per unit exposure for each bin.
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forcats)
library(set)
library(data.table)
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
library(cluster)
library(statmod)
library(ggpubr)
source("utils/data_preprocessing.R")
source("utils/categorical_eda.R")
source("utils/region_eda.R")
df_full = read.csv("utils/sim-modeling-dataset.csv", row.names='row.id')
# remove outlier from yrs.lic (there's one sample with yrs.lic = 9+)
ix = which(df_full$yrs.lic == "9+")
df_full = df_full[-ix, ]
# make categorical features
df_full = make_categorical(df_full)
# drop duplicate features
df_full = subset(df_full, select=-c(driver.age, yrs.licensed, vehicle.age))
# frequency: lambda parameter of Poisson distribution
freq = df_full$clm.count/df_full$exposure
df_full$freq = freq
#Use 70% of dataset as training set and remaining 30% as testing set
sample = sample(c(TRUE, FALSE), nrow(df_full), replace=TRUE, prob=c(0.6,0.4))
df_train = df_full[sample, ]
df_test = df_full[!sample, ]
The region variable has too many categories. These variables need to be grouped to create a more efficient model. Schirmacher fits a regression coefficient for each region and groups the regions that have similar coefficient values. If we had an understanding of which regions were close together, this would have been the ideal way to group these variables, but this information is not available.
df_region = df_train["region"]
df_region$freq = df_train$freq
df_region_freq = df_region %>% group_by(region) %>% summarise(avg_freq=mean(freq))
df_region_freq = arrange(df_region_freq, avg_freq)
I will describe my method, which is just slightly different from Schirmacher’s method of grouping regions.
Use K means to group the regions into K clusters, based on average accident frequency (i.e. regions with similar levels of accident frequencies get grouped together into the same cluster.)
Perform a Poisson regression model with claim counts as the response variable and a re-coded region variable as the feature. The region variable is re-coded based on the clusters generated by the K-means algorithm.
Compute the model’s Akaike Information Criterion (AIC).
Loop through K = (1 to 15), and for each K, conduct the 3 steps mentioned above. Determine “best_K” the K that produces the lowest AIC.
Finally, group the regions into “best_K” clusters to construct the transformed region variable.
The plot below shows how the AIC scores varied by the number of clusters. The red point shows the “best_K,” number of clusters that produced the lowest AIC score.
# cluster the regions based on average accident frequency
N = 1 # number of trials
clusters = factor(seq(1, 15))
aic_matrix = matrix(nrow=length(clusters), ncol=N)
for (i in 1:N){
aic = determine_num_clusters(df_train, df_region_freq, clusters)
aic_matrix[, i] = aic
}
aic_df = data.frame(rowMeans(data.frame(aic_matrix)))
colnames(aic_df) = "AIC"
rownames(aic_df) = clusters
# extract the number of clusters that produces the minimum average AIC score
min_aic = min(aic_df$AIC)
min_cluster = which(aic_df == min(aic_df$AIC))
# plot the average AIC scores and isolate the mean
ggplot(data=aic_df, aes(x=clusters, y=AIC)) +
geom_point() +
geom_point(data=aic_df,
aes(x=min_cluster, y=min_aic),
color="red")
I created two different one way models: one using the original region variable with 38 different categories and another using the re-coded region variable with “best_K” different categories. The model using the re-coded region variable produced a much lower AIC, and as evident from the summaries printed below, the coefficients of the model using re-coded region are more statistically significant. I will proceed with my analysis using the re-coded region variable.
cluster_dict = cluster_regions(df_region_freq, min_cluster)
df_train_ = recode_region(df_train, cluster_dict)
region_m = glm(formula=clm.count ~ region,
family=poisson(link='log'),
data=df_train_,
offset=log(exposure))
region_recode_m = glm(formula=clm.count ~ region.recode,
family=poisson(link='log'),
data=df_train_,
offset=log(exposure))
print(summary(region_m))
##
## Call:
## glm(formula = clm.count ~ region, family = poisson(link = "log"),
## data = df_train_, offset = log(exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6781 -0.4717 -0.3811 -0.2543 4.4226
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.62019 0.04730 -34.255 < 2e-16 ***
## region1 -0.28791 0.12339 -2.333 0.01963 *
## region2 -0.83194 0.27141 -3.065 0.00218 **
## region3 -0.24326 0.09109 -2.670 0.00757 **
## region4 -0.30860 0.09659 -3.195 0.00140 **
## region5 -0.30907 0.18573 -1.664 0.09610 .
## region6 -0.12804 0.18860 -0.679 0.49720
## region7 -0.87232 0.31974 -2.728 0.00637 **
## region8 0.14243 0.09344 1.524 0.12742
## region9 -0.14275 0.21838 -0.654 0.51334
## region10 -0.25720 0.15189 -1.693 0.09039 .
## region11 -0.17889 0.24710 -0.724 0.46909
## region12 -0.24207 0.19818 -1.221 0.22191
## region13 -0.65676 0.19818 -3.314 0.00092 ***
## region14 -0.45002 0.17790 -2.530 0.01142 *
## region15 -0.43908 0.11249 -3.903 9.49e-05 ***
## region16 -0.35557 0.21838 -1.628 0.10349
## region18 0.04574 0.10605 0.431 0.66621
## region19 -0.30162 0.21381 -1.411 0.15834
## region20 -0.38976 0.24040 -1.621 0.10496
## region21 -0.03292 0.19481 -0.169 0.86581
## region22 -0.26688 0.26250 -1.017 0.30930
## region23 -0.04349 0.19162 -0.227 0.82044
## region24 -0.08984 0.22855 -0.393 0.69425
## region25 -0.08681 0.12550 -0.692 0.48915
## region26 0.08486 0.19481 0.436 0.66312
## region27 -0.52176 0.20174 -2.586 0.00970 **
## region28 -0.36984 0.23424 -1.579 0.11436
## region29 -0.71294 0.24040 -2.966 0.00302 **
## region30 -0.39818 0.25443 -1.565 0.11760
## region31 -0.47894 0.30520 -1.569 0.11658
## region32 -0.07220 0.11298 -0.639 0.52278
## region33 0.14826 0.20953 0.708 0.47920
## region34 -0.35054 0.24040 -1.458 0.14480
## region35 -0.45897 0.16898 -2.716 0.00660 **
## region36 -0.04045 0.15484 -0.261 0.79392
## region37 -0.24932 0.19481 -1.280 0.20062
## region38 0.15020 0.22328 0.673 0.50115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10003.4 on 24555 degrees of freedom
## Residual deviance: 9906.2 on 24518 degrees of freedom
## AIC: 13898
##
## Number of Fisher Scoring iterations: 6
print(summary(region_recode_m))
##
## Call:
## glm(formula = clm.count ~ region.recode, family = poisson(link = "log"),
## data = df_train_, offset = log(exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6322 -0.4687 -0.3827 -0.2581 4.4138
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.64049 0.09091 -18.045 < 2e-16 ***
## region.recode2 -0.28045 0.09877 -2.839 0.00452 **
## region.recode3 -0.37548 0.11012 -3.410 0.00065 ***
## region.recode4 -0.64397 0.15971 -4.032 5.53e-05 ***
## region.recode5 0.03031 0.09647 0.314 0.75340
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 10003.4 on 24555 degrees of freedom
## Residual deviance: 9928.7 on 24551 degrees of freedom
## AIC: 13854
##
## Number of Fisher Scoring iterations: 6
print(paste("AIC of model using region variable: ", round(region_m$aic, 4)))
## [1] "AIC of model using region variable: 13897.8791"
print(paste("AIC of model using region re-coded variable: ", round(region_recode_m$aic, 4)))
## [1] "AIC of model using region re-coded variable: 13854.3553"
Plotting the avg. frequency against the driver age shows an interesting trend. Average frequency of accidents is relatively large for very young drivers and very old drivers.
# construct a bar plot of driver age
df_freq_age = df_train %>% group_by(drv.age) %>% summarise(avg.freq=mean(freq))
age_count = df_train %>% group_by(drv.age) %>% tally() %>% as.data.frame()
# plot the counts of the driver ages
ggplot(data=age_count, aes(x=drv.age, y=n)) +
geom_bar(stat='identity') +
scale_x_continuous(name="Driver Age", breaks=seq(20, 90, by=10)) +
labs(title="Count of Driver Ages")
# plot the average accident frequency by age
plt1 = ggplot(df_freq_age, aes(x=drv.age, y=avg.freq)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks=seq(15, 90, by=5))
plt2 = ggplot(df_freq_age, aes(x=drv.age, y=log(avg.freq))) +
geom_point() +
geom_line() +
scale_x_continuous(breaks=seq(15, 90, by=5))
ggarrange(plt1, plt2,
nrow=2, ncol=1)
The lack of drivers beyond the age of 80 is the reason for the trailing zeros in the avg.freq vs. drv.age plot, and we shouldn’t draw any conclusions from this information. Thus, I’ll remove the outliers with drv.age >= 80.
When performing a Poisson regression with a log link, we assume that the log of the mean response is linearly related to the predictors. This is clearly not the case, as evident from the second plot above. Therefore, I think a transformation to the drv.age variable is necessary. Just like the region variable, I’ll try binning the drivers into different groups, to create a categorical variable.
# filter out drivers of age < 80
df_train = filter(df_train, drv.age < 80)
# recode the driver ages
age_bins = seq(15, 80, by=5)
df_train = recode_drv_age(df_train, age_bins)
I tried binning the drivers into the following age groups (15-19, 20-24, …, 75-79). Binning the driver ages provides significant improvement as compared to the model that uses the original drv.age variable. The AIC score drops significantly and there are several statistically significant age groups. I will continue with my analysis using the re-coded drv.age variable.
model1 = glm(formula=clm.count ~ drv.age,
family=poisson(link='log'),
data=df_train,
offset=log(exposure))
model2 = glm(formula=clm.count ~ drv.age.recode,
family=poisson(link='log'),
data=df_train,
offset=log(exposure))
print(summary(model1))
##
## Call:
## glm(formula = clm.count ~ drv.age, family = poisson(link = "log"),
## data = df_train, offset = log(exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6495 -0.4791 -0.3845 -0.2533 4.4022
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.353006 0.093172 -14.522 < 2e-16 ***
## drv.age -0.010161 0.002085 -4.874 1.1e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 9991.9 on 24514 degrees of freedom
## Residual deviance: 9967.8 on 24513 degrees of freedom
## AIC: 13883
##
## Number of Fisher Scoring iterations: 6
print(summary(model2))
##
## Call:
## glm(formula = clm.count ~ drv.age.recode, family = poisson(link = "log"),
## data = df_train, offset = log(exposure))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3887 -0.4753 -0.3710 -0.2567 4.2364
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.35213 0.07143 -18.930 < 2e-16 ***
## drv.age.recode1 1.60344 0.58175 2.756 0.00585 **
## drv.age.recode2 0.14124 0.18133 0.779 0.43604
## drv.age.recode4 -0.28603 0.09221 -3.102 0.00192 **
## drv.age.recode5 -0.64617 0.09313 -6.938 3.97e-12 ***
## drv.age.recode6 -0.51193 0.09120 -5.613 1.98e-08 ***
## drv.age.recode7 -0.44861 0.08998 -4.986 6.17e-07 ***
## drv.age.recode8 -0.50101 0.09459 -5.296 1.18e-07 ***
## drv.age.recode9 -0.47834 0.10279 -4.653 3.27e-06 ***
## drv.age.recode10 -0.67461 0.12372 -5.453 4.96e-08 ***
## drv.age.recode11 -0.73020 0.18133 -4.027 5.65e-05 ***
## drv.age.recode12 -0.53905 0.24028 -2.243 0.02487 *
## drv.age.recode13 0.13882 0.26000 0.534 0.59339
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 9991.9 on 24514 degrees of freedom
## Residual deviance: 9905.8 on 24502 degrees of freedom
## AIC: 13843
##
## Number of Fisher Scoring iterations: 6
print(paste("AIC of model using drv.age variable: ", round(model1$aic, 4)))
## [1] "AIC of model using drv.age variable: 13883.4845"
print(paste("AIC of model using drv.age re-coded variable: ", round(model2$aic, 4)))
## [1] "AIC of model using drv.age re-coded variable: 13843.4306"
I generated the histogram below by sampling policyholders from each category within the marital status feature (i.e. divorced, married, single, widow), and computed the average accident frequency (sample mean). I repeated this “num_trials = 250” times, to generate a bell-shaped curve for each category within marital status (Because the plot below is a histogram of sample means, we get a bell-shaped curve as a result of the central limit theorem)
It’s really interesting that widows have a higher accident frequency than other demographics. My suspicion is that many of these widows are senior citizens, and as we saw from the analysis on drv.age, accident frequencies are relatively large for the elderly.
category = "marital.status"
num_trials = 250
res = category_sample_means(df_train, category, num_trials)
plt_year = res[[1]] # plt_year is a ggplot and should be plotted from the terminal
df_result = res[[2]]
show(plt_year)
I created a widow dummy variable and compared the feature’s performance against the original marital status variable. As shown from the output below, the two features do not show a significant difference in AIC. Given that the widow dummy uses 1 degree of freedom as opposed to the marital status feature’s 3, we can conclude that the widow dummy is simpler and provides just as much predictive power.
SIDE NOTE: I conducted this kind of analysis for all the categorical features. For the sake of space and concision, I have not included histograms for all the categorical features. That being said, this exercise gave me a quick understanding of which features were likely to useful when frequency modeling. Variables such as year, nb.rb (new business vs. renewal business), and ncd.level (no claim discount level) showed promise.
df_train_ = recode_marital_status(df_train)
marital_stat_m = glm(formula=clm.count ~ marital.status,
family=poisson(link='log'),
data=df_train_,
offset=log(exposure))
marital_stat_recode_m = glm(formula=clm.count ~ marital.status.recode,
family=poisson(link='log'),
data=df_train_,
offset=log(exposure))
print(paste("AIC of model using marital status variable: ", round(marital_stat_m$aic, 4)))
## [1] "AIC of model using marital status variable: 13890.0904"
print(paste("AIC of model using widow dummy variable: ", round(marital_stat_recode_m$aic, 4)))
## [1] "AIC of model using widow dummy variable: 13890.324"
I generated the plot below by constructing one-way frequency models (Poisson regression models that use claim counts as the response, and a single feature as the explanatory variable) and computing each model’s deviance and AIC. The deviance measures how much the model’s log likelihood differs from the log likelihood of the saturated model. The AIC measures the fit of the model but also has a penalty term for a larger number of parameters. Thus, a model producing a low deviance and low AIC score is desired => Explanatory variables on the lower left corner have more predictive power. A quick glance at this plot shows that ncd.level, year, yrs.lic, region.recode, drv.age.recode, and nb.rb contain meaningful information.
I measured the Euclidean distance of each one way model from the null model within the deviance vs AIC plot. The table contains the one way models, ranked by their distance from the null model. Although this plot changes every time I run this script, the features mentioned above (ncd.level, year, yrs.lic, region.recode, drv.age.recode, and nb.rb) consistently have the greatest Euclidean distance from the null model. Thus, I constructed my frequency model using these 6 variables.
# create a Deviance vs. AIC plot of the continuous variables - variables that produce a lower deviance and AIC score are more promising as explanatory variables
df_train_ = recode_marital_status(df_train)
df_train_ = recode_region(df_train_, cluster_dict)
df_train_ = recode_drv_age(df_train_, seq(15, 80, by=5))
df_train_ = subset(df_train_, select=-c(region, marital.status, drv.age, freq, clm.incurred))
# extract the variables
vars = as.list(colnames(df_train_))
vars[[which(vars=="exposure")]] = NULL
vars[[which(vars=="clm.count")]] = NULL
# compute the AIC and Deviance for each one way model
df_result = compute_aic_dev(df_train_, vars, model_type="frequency")
plt = ggplot(df_result, aes(x=aic, y=dev, label=rownames(df_result))) +
geom_text(check_overlap = TRUE, size=2, nudge_x=2) +
geom_point(size=3, shape=23) +
ggtitle("Deviance AIC Plot of Variables")
show(plt)
# # append the dataframe with distance from null
distance = aic_dev_dist(df_result, dist_null=TRUE)
df_result$dist_to_null = distance
df_result = df_result[order(-df_result$dist_to_null), ]
print(df_result)
## aic dev dist_to_null
## ncd.level 13682.39 9758.736 322.751195
## year 13747.23 9827.577 228.185987
## yrs.lic 13783.06 9853.405 184.894585
## nb.rb 13827.13 9911.470 112.340129
## drv.age.recode 13843.43 9905.774 106.195238
## region.recode 13838.47 9916.811 100.695170
## prior.claims 13853.19 9937.535 75.483967
## veh.age 13872.78 9925.125 74.385846
## body.code 13892.54 9964.885 29.988102
## height 13889.13 9973.475 24.684392
## marital.status.recode 13890.32 9974.668 23.000487
## ccm 13896.15 9980.498 14.779381
## seats 13909.73 9982.071 10.677864
## weight 13900.83 9985.174 8.221077
## width 13901.44 9985.788 7.367146
## driver.gender 13902.55 9986.890 5.845823
## length 13903.60 9987.948 4.409024
## fuel.type 13909.41 9991.752 3.854321
## vehicle.value 13906.10 9990.441 1.556500
## hp 13906.36 9990.706 1.440767
## null 13905.56 9991.901 0.000000
In this cell I did a simple evaluation of the 6 variable frequency model by using it to perform predictions on the training and test data. The total number of predicted claims on the training data is exactly equal to the total number of actual claims. There’s a slight disparity between these values on the test data but that’s expected, and it’s small enough for me to feel good about this model.
cc_m = glm(formula=clm.count ~ year+ncd.level+yrs.lic+drv.age.recode+nb.rb+region.recode,
family=poisson(link='log'),
data=df_train_,
offset=log(exposure))
# recode the variables in df_test
df_test_ = recode_marital_status(df_test)
df_test_ = recode_region(df_test_, cluster_dict)
df_test_ = recode_drv_age(df_test_, seq(15, 80, by=5))
df_test_ = subset(df_test_, select=-c(region, marital.status, drv.age))
eval_df = evaluate_model(cc_m, df_train_, df_test_)
eval_df
Here I divided the policyholders into 20 equally sized bins, ordered from smallest to largest predicted frequency, and compared the avg. predicted frequency to the avg. actual frequency for each bin. The red points are the avg. predicted frequencies for each bin, and the black dots are the avg. actual frequency for each bin. Given that the predicted avg. frequency is tracking the actual avg. frequency quite closely across the bins, I’m happy with this frequency model.
#################################
### Frequency Prediction ########
#################################
# predict the claim counts, assuming we have access to the exposure
cc_pred = predict(cc_m, df_test_, type="response")
# divide the predicted claim counts with exposure to get the frequency
df_test$p.freq = cc_pred/df_test_$exposure
# split the predictions into 20 qunatiles
q_pred = quantile(df_test$p.freq, probs=seq(0, 1, by=0.05))
N = length(q_pred)
predictions = numeric(N-1)
ground_truth = numeric(N-1)
for (i in 1:(N-1)){
# get the predicted frequencies of the policyholders in the i_th quantile
qi_pred = filter(df_test, (p.freq < q_pred[i+1]) & (p.freq > q_pred[i]))
qi_pred$id = rownames(qi_pred)
predictions[i] = mean(qi_pred$p.freq)
ground_truth[i] = mean(qi_pred$freq)
}
# plot the predictions and ground truth
df_freq = data.frame(predictions, ground_truth)
x = seq(1, N-1, by=1)
ggplot(df_freq) +
geom_point(mapping = aes(x=x, y=predictions), color="red") +
geom_point(mapping = aes(x=x, y=ground_truth), color="black") +
xlab("quantile") +
ylab("average frequency") +
scale_x_continuous(breaks=x)
The Gamma distribution is used to model data that takes on positive values with a heavy right-skew, making it the convention for severity modeling. The figure below demonstrates that the severity of our data has these attributes. Because the response variable (severity) has a heavy right skew, I used the median severity rather than the mean severity when analysing the relation between severity and different explanatory variables.
source("utils/sev_data_preprocessing.R")
df_train_sev = df_full[sample, ]
df_test_sev = df_full[!sample, ]
# filter data with positive claim counts
df_train_sev = filter(df_train_sev, clm.count > 0)
df_test_sev = filter(df_test_sev, clm.count > 0)
# severity plot
severity = df_train_sev$clm.incurred/df_train_sev$clm.count
plt = ggplot() + geom_histogram(mapping=aes(x=severity))
show(plt)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
I used the same method described in the “Region EDA” section of frequency modeling, to group the 38 regions into more meaningful clusters. The one and only difference is that I am now using a Gamma regression model with claims incurred (a.k.a. claim cost) as the response variable. Again, the red point in the plot below indicates the number of clusters that produced the lowest AIC score.
df_region = df_train_sev["region"]
df_region$sev = df_train_sev$clm.incurred/df_train_sev$clm.count
df_region_sev = df_region %>% group_by(region) %>% summarise(med.sev=median(sev))
df_region_sev = arrange(df_region_sev, med.sev)
# region by count
df_region_count = df_region %>% group_by(region) %>% tally()
N = nrow(df_train_sev)
clusters = seq(1, 15, 1)
aic_dict = c()
for (K in clusters){
# perform k means on the average severities
df_region_sev = make_clusters(df_region_sev, K)
# recode the regions based on the clusters
df_train_sev = recode_region(df_train_sev, df_region_sev)
if(K == 1){
fmla = as.formula(clm.incurred ~ region)}
else{
fmla = as.formula(clm.incurred ~ region.recode)}
# create the GLM and compute the AIC
model = glm(formula=fmla,
data=df_train_sev,
family=Gamma(link="log"),
weights=clm.count,
offset=log(clm.count))
aic_dict[as.character(K)] = model$aic
}
# compute the average of the AIC
aic = data.frame(aic_dict)
colnames(aic) = "aic"
aic$cluster = rownames(aic)
# generate plot of AIC vs. number of clusters
best_aic = min(aic$aic)
best_cluster = aic[aic$aic == best_aic, ]$cluster
ggplot(data=aic, aes(x=cluster, y=aic)) +
geom_point() +
geom_point(aes(x=best_cluster, y=best_aic), color="red") +
scale_x_discrete(limits=aic$cluster)
The summaries printed below show similar results to the frequency model. The model using the re-coded region variable produced a much lower AIC as compared to the model using the original region variable, and as evident from the model summaries, the coefficients of the model using re-coded region are more statistically significant. I will proceed with my analysis using the re-coded region variable, but keep in mind that this variable is different from the re-coded region variable from the frequency modeling section.
# best cluster was the number of clusters that generated the best AIC
# Recode the region variable into "best_cluster" clusters
df_region_sev = make_clusters(df_region_sev, best_cluster)
df_train_sev = recode_region(df_train_sev, df_region_sev)
df_region_sev$cluster = as.character(df_region_sev$cluster)
plt = ggplot(data=df_region_sev, aes(x=region, y=med.sev, color=cluster)) +
geom_point() +
scale_x_discrete(limits = c(df_region_sev$region)) +
ggtitle("Median Accident Severity by Region (Clustered), Given > 0 Claim Counts") +
xlab("Region (Clustered)") +
ylab("Median Accident Severity")
show(plt)
model = glm(formula=clm.incurred ~ region.recode,
data=df_train_sev,
family=Gamma(link="log"),
weights=clm.count,
offset=log(clm.count))
model1 = glm(formula=clm.incurred ~ region,
data=df_train_sev,
family=Gamma(link="log"),
weights=clm.count,
offset=log(clm.count))
print(summary(model1))
##
## Call:
## glm(formula = clm.incurred ~ region, family = Gamma(link = "log"),
## data = df_train_sev, weights = clm.count, offset = log(clm.count))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1551 -1.6574 -0.6637 0.2994 3.7611
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6871127 0.0647966 103.202 < 2e-16 ***
## region1 -0.2600992 0.1688720 -1.540 0.12368
## region2 0.5806993 0.3714215 1.563 0.11812
## region3 -0.1970908 0.1255297 -1.570 0.11657
## region4 0.2582641 0.1329299 1.943 0.05218 .
## region5 -0.1059994 0.2541737 -0.417 0.67670
## region6 0.1353294 0.2581042 0.524 0.60012
## region7 0.3730865 0.4375570 0.853 0.39396
## region8 0.0223680 0.1278991 0.175 0.86119
## region9 0.1788675 0.2988572 0.599 0.54958
## region10 -0.2801058 0.2078715 -1.347 0.17799
## region11 -0.3229219 0.3381569 -0.955 0.33973
## region12 -0.2599754 0.2712070 -0.959 0.33789
## region13 0.0177614 0.2712070 0.065 0.94779
## region14 -0.6810043 0.2468670 -2.759 0.00586 **
## region15 0.1047801 0.1546283 0.678 0.49809
## region16 0.3090058 0.3055632 1.011 0.31202
## region18 0.1105214 0.1456778 0.759 0.44815
## region19 0.0439537 0.2926002 0.150 0.88061
## region20 -0.0763531 0.3289841 -0.232 0.81650
## region21 -0.4758559 0.2666013 -1.785 0.07444 .
## region22 -0.0936600 0.3592171 -0.261 0.79433
## region23 -0.0890879 0.2666013 -0.334 0.73830
## region24 0.0383657 0.3127736 0.123 0.90239
## region25 -0.1478387 0.1781473 -0.830 0.40672
## region26 0.4373783 0.2666013 1.641 0.10106
## region27 0.0878237 0.2812501 0.312 0.75488
## region28 0.4444020 0.3205545 1.386 0.16581
## region29 -0.0531197 0.3289841 -0.161 0.87174
## region30 -0.3032735 0.3481875 -0.871 0.38386
## region31 -0.7890751 0.4176514 -1.889 0.05901 .
## region32 0.1059696 0.1546283 0.685 0.49323
## region33 -0.2514078 0.2867448 -0.877 0.38073
## region34 -0.4771018 0.3289841 -1.450 0.14716
## region35 -0.1528550 0.2312507 -0.661 0.50870
## region36 0.0575542 0.2119121 0.272 0.78596
## region37 0.0004425 0.2666013 0.002 0.99868
## region38 0.0904434 0.3055632 0.296 0.76727
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.872575)
##
## Null deviance: 4042.2 on 1893 degrees of freedom
## Residual deviance: 3956.4 on 1856 degrees of freedom
## AIC: 31679
##
## Number of Fisher Scoring iterations: 9
print(summary(model))
##
## Call:
## glm(formula = clm.incurred ~ region.recode, family = Gamma(link = "log"),
## data = df_train_sev, weights = clm.count, offset = log(clm.count))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.1208 -1.6683 -0.6837 0.2585 3.8094
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.66791 0.04041 165.015 < 2e-16 ***
## region.recode2 0.37398 0.23658 1.581 0.11409
## region.recode3 -0.24463 0.08415 -2.907 0.00369 **
## region.recode4 0.20614 0.07690 2.681 0.00741 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.956099)
##
## Null deviance: 4042.2 on 1893 degrees of freedom
## Residual deviance: 3996.7 on 1890 degrees of freedom
## AIC: 31637
##
## Number of Fisher Scoring iterations: 7
print(paste("AIC of model using region variable: ", round(model1$aic, 4)))
## [1] "AIC of model using region variable: 31679.191"
print(paste("AIC of model using region re-coded variable: ", round(model$aic, 4)))
## [1] "AIC of model using region re-coded variable: 31637.4615"
The severity has heavy right tails, with large values that can inflate the mean, so I checked the median severity for each age group. The plot below shows a similar trend to the accident frequency. It looks like severity tends to be larger for very young drivers and very old drivers.
df_age = df_train_sev["drv.age"]
df_age$sev = df_train_sev$clm.incurred/df_train_sev$clm.count
df_age_sev = df_age %>% group_by(drv.age) %>% summarise(med.sev=median(sev))
df_age_count = df_age %>% group_by(drv.age) %>% tally()
plt = ggplot(df_age_sev, aes(x=drv.age, y=med.sev)) +
geom_point() +
geom_line() +
scale_x_discrete(limits=seq(15, 90, 5))+
ylab("Median Severity") +
xlab("Driver Age")
## Warning: Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
plt1 = ggplot(df_age_count, aes(x=drv.age, y=n)) +
geom_bar(stat="identity") +
scale_x_discrete(limits=seq(15, 90, 5))
## Warning: Continuous limits supplied to discrete scale.
## ℹ Did you mean `limits = factor(...)` or `scale_*_continuous()`?
show(plt)
show(plt1)
Just like the frequency model, I think it will be necessary to bin driver ages into groups. Given that there is a small number of drivers under the age of 20 and over the age of 80, I will remove these samples. As evident from the plot above, the median severity seems relatively consistent for drivers of ages [25, 65], so I will create 3 age groups: < 25, [25, 65], >65.
Splitting the drivers into 3 age bins seems to provide a little bit of improvement. The model summaries show that the model using the re-coded driver age variable has a lower deviance and AIC as compared to the model using the original driver age variable. Also, the parameter estimates have lower p-values.
# remove outliers
df_train_sev = filter(df_train_sev, (drv.age>19) & (drv.age<80))
bins = c(25, 65)
df_train_sev = recode_drv_age(df_train_sev, bins)
model1 = glm(formula=clm.incurred ~ drv.age,
data=df_train_sev,
family=Gamma(link="log"),
weights=clm.count,
offset=log(clm.count))
model2 = glm(formula=clm.incurred ~ drv.age.recode,
data=df_train_sev,
family=Gamma(link="log"),
weights=clm.count,
offset=log(clm.count))
print(summary(model1))
##
## Call:
## glm(formula = clm.incurred ~ drv.age, family = Gamma(link = "log"),
## data = df_train_sev, weights = clm.count, offset = log(clm.count))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.9638 -1.6648 -0.6951 0.2720 3.9065
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.902754 0.122820 56.202 <2e-16 ***
## drv.age -0.004971 0.002735 -1.817 0.0693 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.932435)
##
## Null deviance: 4041.2 on 1890 degrees of freedom
## Residual deviance: 4034.4 on 1889 degrees of freedom
## AIC: 31571
##
## Number of Fisher Scoring iterations: 7
print(summary(model2))
##
## Call:
## glm(formula = clm.incurred ~ drv.age.recode, family = Gamma(link = "log"),
## data = df_train_sev, weights = clm.count, offset = log(clm.count))
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0085 -1.6665 -0.6917 0.2727 3.8064
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3124 0.2322 31.486 < 2e-16 ***
## drv.age.recode1 -0.6388 0.2344 -2.725 0.00648 **
## drv.age.recode2 -0.6680 0.2858 -2.337 0.01952 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1.941721)
##
## Null deviance: 4041.2 on 1890 degrees of freedom
## Residual deviance: 4023.2 on 1888 degrees of freedom
## AIC: 31566
##
## Number of Fisher Scoring iterations: 7
Again, I decided to plot the sample means of various categorical features to get a sense of which features could be useful for predicting severity. A feature has the potential to be useful when the sample means are significantly different from one another. Again, marital status seems to show some explanatory power, as the single community is generating a much larger average accident severity than other demographics. Driver gender also seems to be a promising feature, as the data shows that females generate a larger accident severity.
category = "marital.status"
num_trials = 250
r = category_sample_means(df_train_sev, category, num_trials, model_type="severity")
plt = r[[1]]
df_result = r[[2]]
show(plt)
category = "driver.gender"
num_trials = 250
r = category_sample_means(df_train_sev, category, num_trials, model_type="severity")
plt = r[[1]]
df_result = r[[2]]
show(plt)
Unlike the deviance vs. AIC plot from the frequency model, this plot shows a lot of variance every time the train-test split changes, and we do not have as many features that have significant separation from the null model. In fact, many one-way models land to the right of the null model. The year and region re-code variables are the only two variables that consistently produce significant separation from the null model, so these two features will be included in the severity model.
vars = as.list(c("year", "drv.age", "vehicle.value", "ccm", "hp", "weight", "length",
"width", "height", "prior.claims", "marital.status", "yrs.lic",
"ncd.level", "region", "body.code", "veh.age", "seats", "fuel.type",
"driver.gender", "nb.rb", "region.recode", "drv.age.recode"))
# compute the AIC and Deviance for each one way model
df_result = compute_aic_dev(df_train_sev, vars, model_type="severity")
# # append the datafrmae with distance from null
distance = aic_dev_dist(df_result, dist_null=TRUE)
df_result$dist_to_null = distance
df_result = df_result[order(-df_result$dist_to_null), ]
plt = ggplot(df_result, aes(x=aic, y=dev, label=rownames(df_result))) +
geom_text(check_overlap = TRUE, size=2, nudge_x=2) +
geom_point(size=3, shape=23) +
ggtitle("Deviance AIC Plot of Variables")
show(plt)
Rather than considering each feature independently, we can try adding features sequentially. Starting with a base model including year and region.recode, I added features sequentially by running a partial F-test in which a “reduced model” is constructed using all prior features that have already been added, and a “full model” with the new candidate feature. The candidate feature was added to the model if it produced the lowest p-value among candidate features, and if it had a magnitude < 0.05.
vars = as.list(c("vehicle.value", "ccm", "hp", "weight", "length",
"width", "height", "prior.claims", "marital.status", "yrs.lic",
"ncd.level", "body.code", "veh.age", "seats", "fuel.type",
"driver.gender", "nb.rb", "drv.age.recode"))
predictor_vars = as.list(c("year", 'region.recode'))
threshold = 0.05
p_val_best = Inf
fmla_best = ""
hit_condition = TRUE
while(hit_condition){
p_val_best = Inf
hit_condition = FALSE
base_fmla = as.formula(paste("clm.incurred~", paste(predictor_vars, collapse="+")))
base_model = glm(formula=base_fmla,
data=df_train_sev,
family=Gamma(link="log"),
offset=log(clm.count))
for (var in vars){
# add var to preds
preds = paste(predictor_vars, sep = "", collapse=" + ")
preds = paste(preds, var, sep = " + ")
fmla = as.formula(paste("clm.incurred ~ ", preds, sep = ""))
# construct the glm model
model = glm(formula=fmla,
data=df_train_sev,
family=Gamma(link="log"),
offset=log(clm.count))
# get p-alue from F test
anova_tbl = anova(model, test='F')
p_values = anova_tbl$`Pr(>F)`
p_val = p_values[length(p_values)]
# if we find a variable that generates a p-value less than 0.05
if ((p_val < threshold) & (p_val < p_val_best)){
hit_condition = TRUE
p_val_best = p_val
fmla_best = fmla
var_best = var
}
}
# if the hit condition is TRUE, append predictor_vars with var
if(hit_condition){
print(paste("best p-value from candidate features: ", round(p_val_best, 3)))
print(paste("current model formula:", format(fmla_best)))
predictor_vars = append(predictor_vars, var_best)
vars[[which(vars==var_best)]] = NULL
}
}
## [1] "best p-value from candidate features: 0.023"
## [1] "current model formula: clm.incurred ~ year + region.recode + driver.gender"
## [1] "best p-value from candidate features: 0.03"
## [1] "current model formula: clm.incurred ~ year + region.recode + driver.gender + drv.age.recode"
## [1] "best p-value from candidate features: 0.036"
## [1] "current model formula: clm.incurred ~ year + region.recode + driver.gender + drv.age.recode + "
## [2] "current model formula: vehicle.value"
# coming out of the while loop, we have the best formula
# create the claims incurred model with the best formula
ci_m = glm(formula=fmla_best,
data=df_train_sev,
family=Gamma(link="log"),
weights=clm.count,
offset=log(clm.count))
# perform the necessary transformation on df_test
df_test_sev = recode_region(df_test_sev, df_region_sev)
df_test_sev = recode_drv_age(df_test_sev, bins)
eval_df = evaluate_model_sev(ci_m, df_train_sev, df_test_sev)
eval_df