1 Introduction

1.1 Purpose and Scope

This report describes a model for predicting incremental paid losses on an individual claim basis (“The Model”). The model uses a mixture of predictive modeling and simulation techniques to mimic real-world claim development.

Note that this report presents a much simpler version of the full ensemble of models used in the full analysis; it is meant only as a starting point for teaching and learning.

1.2 Background

Models, by definition, use simplified assumptions of reality to reveal information or predict events based on the underlying data. Models which closely mimic the fundamental forces driving the data have the best chance of providing valuable insights.

Due to data and computing limitations, actuaries have traditionally aggregated loss information by policy, accident, or calendar period to project future losses. By aggregating losses, the actuary loses valuable claim level information.

The model assumes that individual claims and their claim-level characteristics are the fundamental drivers of future payments. Therefore, in accordance with the philosophy that the best models are those which mimic reality most closely, the model uses information at the individual claim level and applies statistically rigorous techniques to fit and simulate individual claim development.

1.3 Overview

The model is meant to be a starting point for anyone looking to discover new and advanced methods for performing micro-claims analysis and machine-learning modelling techniques that provide insights beyond the typical aggregated actuarial practices in P&C. Additionally, the model is a showcase for the statistical power that the R programming language can provide, specifically for those with prior statistical and mathematical knowledge of applied predictive analytics and probability theory.

I decided to use only a few very common predictor variables so the model could easily be applied to other data sets. For transparency and to aid interested individuals, I provide this report with access to the R code used to fit and run predictions. The code can be viewed by clicking the code boxes on the right side of the report. The R savvy reader can run the R code to reproduce the output, apply the model to other data sets, and expand and improve upon the model.

The model is only applicable to reported claims and their corresponding incremental payments. IBNR claim predictions are beyond the scope of this model.

1.4 Vocabulary

For consistency and clarity I use the following terms:

Response Variable The value being predicted by the model (claim status and claim incremental payment in this report)
Predictor Variable A value used to fit the model or to predict the response variable (e.g. I use certain claim characteristics as predictor variables to model and predict the response variable)

1.5 Data

In the spirit of mimicking the real world, this report communicates the model through a working example using real auto-liability data supplied mostly from the insuranceData R Package (GitHub Repo) as well as publicly available data supplied by the CAS.

Note that this specific model has been tuned to form predictions related specifically to Bodily Injury claims only, as these claims drive the foundation of the risk behind Auto Liability reserving and rate-making.

That being said, although the results of the model are specific to Auto Liability in this instance, the modeling techniques and machine-learned tuning procedures can easily be generalized to other lines of coverage, areas of business, and risk portfolios.

R Code for Data Load and Ingestion:

library(dplyr, warn.conflicts = FALSE)
library(tidyr)
library(caret, warn.conflicts = FALSE)
library(lubridate)
library(ggplot2, warn.conflicts = FALSE)
library(knitr)
library(DiagrammeR)
library(scales)
require(e1071) 
require(bindrcpp)
library(qs)
library(webshot)

# turn off scientific notation in printing
options(scipen = 999)

# load data
claims <- qs::qread("data/model-claims.qs")

# development age to project from
# (e.g. if I set this to 18, I use information available as of the
# 18-month evaluation to predict values at age 30 months)
devt_period <- 30

# year to predict with model
# evaluations at or greater than this time will not be included in
# the model fit
predict_eval <- as.Date("2010-11-30")

# remove unneeded data
claims <- dplyr::filter(claims, 
                        eval <= predict_eval,
                        eval >= predict_eval - years(5)) %>%
            dplyr::select(eval, devt, claim_number, 
                          status, tot_rx, tot_pd_incr,
                          status_act, tot_pd_incr_act)

devt_periods_needed <- seq(6, devt_period + 12, by = 12)

# for showing in triangle
claims_display <- dplyr::filter(claims, devt %in% devt_periods_needed)

# only need the claims at the selected `devt_period`
claims <- dplyr::filter(claims, devt %in% devt_period)

I am using data from fiscal years 2003 to 2011.

Fiscal years begin on 6/1 of the year prior to the fiscal year and end on 5/31 of the fiscal year (e.g. fiscal year 2003 includes claims which occurred between 6/1/2002 and 5/31/2003).
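The fiscal year convention can be expressed as a small helper. This is only a sketch: the function name `fiscal_year` and its argument `loss_date` are hypothetical and do not appear in the report's data preparation script.

```r
# Hypothetical helper: map a loss date to its fiscal year, assuming
# fiscal years run from 6/1 of the prior calendar year through 5/31.
fiscal_year <- function(loss_date) {
  loss_date <- as.Date(loss_date)
  yr <- as.integer(format(loss_date, "%Y"))
  mo <- as.integer(format(loss_date, "%m"))
  # June through December belong to the next calendar year's fiscal year
  ifelse(mo >= 6L, yr + 1L, yr)
}

fy_examples <- fiscal_year(c("2002-06-01", "2003-05-31", "2003-06-01"))
```

The first two dates fall in fiscal year 2003 and the third in fiscal year 2004, matching the definition above.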

The claims are evaluated at 11/30 of each year from 2002 to 2010.

I created a large data set containing all the data used in fitting the model and making predictions from the model.

The original data and the script used to prepare the large data set used in this report are located in the data/ directory; details can be viewed in this report’s Appendix.

2 The Model

The model uses several advanced statistical techniques. For compactness and because I lack the expertise to explain everything in detail, a comprehensive explanation of these techniques is beyond the scope of this analysis.

Wherever possible I have included links to additional resources for diving into the statistics behind the model. The statistics are touched upon only briefly as each technique is used in the model fit.

2.0.1 Train and Test Data

The first step in fitting the model is to feed training data into the model.

I am using all data from fiscal years prior to 2011 from development time 30 months to 42 months to fit the model.

Later I will pass the test data (i.e. claims from fiscal year 2011 at 30 months) to predict the status of each of these claims at 42 months and the incremental payment per claim from 30 to 42 months.

2.1 Model Overview Diagram

The following diagram illustrates how the model is fit:

mermaid("
  graph TD
  A(Claim Train Data)-->B{Fit Closure Model}
  A(Claim Train Data)-->C[Remove Closed-Closed]
  A(Claim Train Data)-->D[Remove Zero Paid]
  C-->E{Fit Zero Model}
  D-->F{Fit Payment Model}
")

After fitting the models (pictured as a rhombus in the above diagram) with the claims training data I can use the three models to predict a probability for status and zero payment or a dollar value for incremental payments on the test data.

The claims test data flows through the following diagram to arrive at the final output:

At each model (pictured as a rhombus in the above diagram) the claim in the test data is given a predicted value based on the model. I then run a simulation based on this predicted value to model real world variability.

At each step, the simulated claims are passed to the next model based on the simulated results from the previous model.
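The flow described above can be sketched for a single claim. Everything in this block is illustrative: the probabilities and fitted payments are hypothetical stand-ins for the outputs of the three fitted models, and a Poisson draw stands in for the payment simulation used later in the report.

```r
set.seed(123)
n_sims <- 1000

# Hypothetical model outputs for one claim that is open at 30 months
status_30 <- "O"
p_open    <- 0.40                     # closure model: P(open at 42 months)
p_nonzero <- c(O = 0.85, C = 0.65)    # zero model: P(payment > 0 | simulated status)
fit_paid  <- c(O = 20000, C = 30000)  # payment model: fitted payment | simulated status

# Step 1: simulate the status at 42 months
status_sim <- ifelse(rbinom(n_sims, size = 1, prob = p_open) == 1, "O", "C")

# Step 2: simulate zero / non-zero payment given the simulated status;
# closed-closed claims are forced to zero and skip the zero model
closed_closed <- status_30 == "C" & status_sim == "C"
nonzero_sim <- rbinom(n_sims, size = 1, prob = p_nonzero[status_sim]) == 1
nonzero_sim[closed_closed] <- FALSE

# Step 3: simulate a payment only where a non-zero payment was simulated
payment_sim <- numeric(n_sims)
payment_sim[nonzero_sim] <- rpois(sum(nonzero_sim),
                                  lambda = fit_paid[status_sim[nonzero_sim]])
```

Each element of `payment_sim` is one simulated scenario for the claim, so the vector as a whole approximates the distribution of its incremental payment.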

2.2 Claim Closure Model

2.2.1 Assumptions

To predict whether a claim will close within a given period of time, I use a logistic regression with center, scale, and Yeo-Johnson transformations applied to all continuous predictor variables.

I am modeling the following variables:

Response Variable

status_act Actual claim status at 42 months.

Predictor Variables

status Claim status at 30 months (“C” for Closed and “O” for open).
tot_rx Total case reserve dollar value at 30 months.
tot_pd_incr Total incremental paid loss dollar value between 18 months and 30 months.
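To make the preprocessing concrete, here is a minimal base R sketch of a Yeo-Johnson transformation followed by centering and scaling. The function `yeo_johnson` and the fixed `lambda = 0.5` are assumptions for illustration only; in the model fit, caret estimates the transformation parameter from the training data, and as far as I know it applies the power transformation before centering and scaling.

```r
# Hypothetical helper: Yeo-Johnson power transformation for a fixed lambda
yeo_johnson <- function(y, lambda) {
  out <- numeric(length(y))
  pos <- y >= 0
  # non-negative values
  if (abs(lambda) > 1e-8) {
    out[pos] <- ((y[pos] + 1)^lambda - 1) / lambda
  } else {
    out[pos] <- log1p(y[pos])
  }
  # negative values
  if (abs(lambda - 2) > 1e-8) {
    out[!pos] <- -(((1 - y[!pos])^(2 - lambda) - 1) / (2 - lambda))
  } else {
    out[!pos] <- -log1p(-y[!pos])
  }
  out
}

# Toy, right-skewed reserve amounts (made up for this example)
reserves <- c(0, 500, 2500, 10000, 75000, 250000)

# Transform, then center and scale to mean 0 and standard deviation 1
reserves_yj <- yeo_johnson(reserves, lambda = 0.5)
reserves_z  <- as.numeric(scale(reserves_yj))
```

The transformation compresses the long right tail that is typical of reserve and payment amounts, which tends to make the logistic regression less sensitive to a handful of very large claims.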

2.2.2 Data Preparation

# remove data the same valuation or newer than the prediction eval
# Only claims from valuations before the valuation I am predicting will be used
# to fit the model
model_data <- dplyr::filter(claims, eval < predict_eval)

2.2.3 Model Fit

The model fit uses 10-fold cross-validation, repeated twice, to estimate out-of-sample performance, and a stepwise Akaike Information Criterion (AIC) algorithm for feature selection:

cm_model <- caret::train(status_act ~ status + tot_rx + tot_pd_incr, 
                         data = model_data,
                         method = "glmStepAIC",
                         trace = FALSE,
                         preProcess = c("center", "scale", "YeoJohnson"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  repeats = 2))

cm_summary <- cm_model$results[, -1]

kable(cm_summary,
      digits = 5,
      row.names = FALSE)

Accuracy	Kappa	AccuracySD	KappaSD
0.8824	0.45185	0.015	0.10189

For a more detailed statistical summary of the claim closure model fit, see Appendix 5.2 (Closure Model Summary Statistics).
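The Kappa column in the table above is Cohen's kappa, which compares the observed accuracy with the accuracy expected by chance given the class frequencies. The sketch below computes it from scratch on toy vectors; `cohen_kappa` is a hypothetical helper, not a caret function, and the toy statuses are made up.

```r
# Hypothetical helper: Cohen's kappa from actual and predicted class labels
cohen_kappa <- function(actual, predicted) {
  lev <- union(unique(as.character(actual)), unique(as.character(predicted)))
  tab <- table(factor(actual, levels = lev), factor(predicted, levels = lev))
  n <- sum(tab)
  p_observed <- sum(diag(tab)) / n                       # plain accuracy
  p_expected <- sum(rowSums(tab) * colSums(tab)) / n^2   # chance agreement
  (p_observed - p_expected) / (1 - p_expected)
}

actual    <- c("C", "C", "C", "O")
predicted <- c("C", "C", "C", "C")   # always predicts "closed"

accuracy_toy <- mean(actual == predicted)        # 0.75
kappa_toy    <- cohen_kappa(actual, predicted)   # 0
```

In the toy example, always predicting "closed" is 75% accurate but has a kappa of zero. This is why Kappa is worth reading alongside Accuracy when most claims share one status.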

cm_probs <- cbind(model_data, predict(cm_model, 
                                      newdata = model_data, 
                                      type = "prob"))
# find the logit value
cm_probs$logits <- log(cm_probs$O / cm_probs$C)

In the plots below the blue line indicates the fitted probability of the claim at age 30 months being open at age 42 months.

The red dots at the top and bottom are the actual status for the training data at 42 months (i.e. model fits the blue line to the red dots).
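The horizontal axis of the plot is the log odds of the fitted probability. As a quick check of that relationship, the sketch below uses made-up probabilities to show that `log(p / (1 - p))` matches base R's `qlogis()` and that `plogis()` maps the log odds back to probabilities.

```r
# Made-up fitted probabilities of a claim being open at 42 months
p_open <- c(0.0075, 0.25, 0.5, 0.9)

# Log odds computed the same way as in the code above: log(P(open) / P(closed))
logit_manual <- log(p_open / (1 - p_open))

# Base R equivalents: qlogis() is the logit, plogis() is its inverse
logit_base <- qlogis(p_open)
p_back     <- plogis(logit_manual)
```

Because the fitted probability is the inverse logit of the log odds, plotting one against the other traces the familiar S-shaped curve.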

cm_probs$status_act <- ifelse(cm_probs$status_act == "C", 0, 1)
  
ggplot(cm_probs, aes(x = logits, y = status_act)) +
       geom_point(colour = "red", 
                  position = position_jitter(height = 0.1, width = 0.1),
                  size = 0.5,
                  alpha = 0.2) + 
       geom_smooth(method = "glm", method.args = list(family = "binomial"), 
                   size = 1) + 
       ylab("Probability Open") +
       xlab("Logit Odds") +
       ggtitle(paste0("Age ", devt_period, " to ", devt_period + 12, " Months Claim Open Probabilities"))

2.3 Zero Payment Model

2.3.1 Assumptions

The zero payment model is similar to the claim closure model in that I am looking at a binomial response variable. I am modeling whether the claim has zero or nonzero incremental payments.

I remove all claims that have a status at 30 months of closed and a status of closed at 42 (I refer to these claims as closed-closed claims).

Additionally, I assume that all of these claims will ultimately have zero incremental payments in the final payment model.

Response Variable

zero Factor indicating whether the claim had zero or nonzero incremental payments between age 30 months and 42 months.

Predictor Variables

status_act Actual claim status at 42 months (“C” for Closed and “O” for open).
status Claim status at 30 months
tot_rx Total case reserve dollar value at 30 months.
tot_pd_incr Total incremental paid loss dollar value between 18 months and 30 months.

Note: I can use status_act as a predictor variable here because, for the test data, I first simulate the status at 42 months and then use that simulated status as a predictor variable in the zero payment model.

2.3.2 Data Prep

# remove all claims that have a closed closed status from the data
# these will be set to incremental payments of 0
zm_model_data <- filter(model_data, status == "O" |  status_act == "O")

# Add in response variable for zero payment:
zm_model_data$zero <- factor(ifelse(zm_model_data$tot_pd_incr_act == 0, 
                                    "Zero", "NonZero"))

2.3.3 Fit

I start from the same data prepared for the claim closure model, with the closed-closed claims removed as described above, to fit the zero payment model.

zm_model <- caret::train(zero ~ status + status_act + tot_rx + tot_pd_incr, 
                         data = zm_model_data,
                         method = "glmStepAIC",
                         trace = FALSE,
                         preProcess = c("center", "scale", "YeoJohnson"),
                         trControl = trainControl(method = "repeatedcv", 
                                                  repeats = 2))

zm_summary <- zm_model$results[, -1]

kable(zm_summary,
      row.names = FALSE)

Accuracy	Kappa	AccuracySD	KappaSD
0.7735232	0.032139	0.0063228	0.0348476

zm_probs <- cbind(zm_model_data, predict(zm_model, 
                                         newdata = zm_model_data, 
                                         type = "prob"))

zm_probs$logits <- log(zm_probs$NonZero / zm_probs$Zero)

In the plots below, the blue line indicates the fitted probability of the claim having a payment between age 30 and 42 months. The red dots at the top are the actual claims with payments between age 30 and 42 months, and the dots at the bottom are the claims with zero payments during this time period (i.e. the zero payment model fits the blue line to the red dots).

zm_probs$zero <- ifelse(zm_probs$zero == "Zero", 0, 1)
  
ggplot(zm_probs, aes(x = logits, y = zero)) +
       geom_point(colour = "red", 
                  position = position_jitter(height = 0.1, width = 0.1),
                  size = 0.5,
                  alpha = 0.2) + 
       geom_smooth(method = "glm", method.args = list(family = "binomial"), 
                   size = 1) + 
       ylab("Payment Probability") +
       xlab("Logit Odds") +
       ggtitle(paste0("Age ", devt_period, " to ", devt_period + 12, 
                      " Non-Zero Incremental Payment"))

2.4 Incremental Payment Model

2.4.1 Assumptions

The incremental payment model models incremental payments between 30 and 42 months. It uses a generalized additive model (GAM) with integrated smoothness estimation and a quasi-Poisson family with a log link function.

Response Variable

tot_pd_incr_act Total incremental payment between 30 and 42 months.

Predictor Variables

status_act The actual status at 42 months.
tot_rx Total case reserve dollar value at 30 months.
tot_pd_incr Incremental payments between 18 and 30 months.
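As a small illustration of the log link, the sketch below fits a quasi-Poisson GAM to simulated toy data with mgcv. The data frame `toy` and its columns `reserve` and `payment` are made up for this example. It shows that exponentiating the default link-scale prediction gives the same values as asking `predict()` for the response scale directly.

```r
library(mgcv)

set.seed(42)
n <- 300

# Toy data: payment counts whose log-mean is a smooth function of the reserve
toy <- data.frame(reserve = runif(n, min = 0, max = 10))
toy$payment <- rpois(n, lambda = exp(1 + sin(toy$reserve)))

# Quasi-Poisson GAM with a log link and one smooth term
toy_fit <- gam(payment ~ s(reserve),
               data = toy,
               family = quasipoisson(link = "log"))

# predict() returns values on the link (log) scale by default
link_pred <- predict(toy_fit, newdata = toy)
resp_pred <- predict(toy_fit, newdata = toy, type = "response")
```

With a log link the fitted values are always positive, which suits payment amounts once the zero payments have been removed.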

2.4.2 Data Prep

# take out claims with zero incremental payments
nzm_model_data <- zm_model_data[zm_model_data$tot_pd_incr_act > 0, ]

2.4.3 Fit

# fit incremental payment model
nzm_model <- mgcv::gam(tot_pd_incr_act ~ status_act + s(tot_rx) + s(tot_pd_incr),
                       data = nzm_model_data,
                       family = quasipoisson(link = "log"))

nzm_fit <- cbind(nzm_model_data, 
                 tot_pd_incr_sim = exp(predict(nzm_model, newdata = nzm_model_data)))

# plots to be determined

3 Simulation

3.1 Closure Status

set.seed(1234)
n_sims <- 2000

cm_pred_data <- dplyr::filter(claims, eval == predict_eval)

cm_probs <- cbind(cm_pred_data, 
                  predict(cm_model, newdata = cm_pred_data, type = "prob"))
  
cm_pred <- lapply(cm_probs$O, rbinom, n = n_sims, size = 1)
cm_pred <- matrix(unlist(cm_pred), ncol = n_sims, byrow = TRUE)
cm_pred <- ifelse(cm_pred == 1, "O", "C")
cm_pred <- as.data.frame(cm_pred)

I use the probabilities returned from the closure model to simulate the status of all of the claims.

I simulate each claim 2000 times.
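The lapply call in the code above draws one Bernoulli vector per claim and then reshapes the result into a claims-by-simulations matrix. A vectorized alternative, shown here as a sketch with made-up probabilities, produces the same layout in a single `rbinom()` call. The names `p_open_toy` and `sim_matrix` are hypothetical.

```r
set.seed(1234)
n_sims_toy <- 2000

# Made-up open probabilities for three claims
p_open_toy <- c(0, 0.5, 1)
n_claims <- length(p_open_toy)

# matrix() fills column by column, so repeating the probability vector once
# per simulation lines up row i with claim i in every simulation column
sim_matrix <- matrix(
  rbinom(n_claims * n_sims_toy, size = 1, prob = rep(p_open_toy, times = n_sims_toy)),
  nrow = n_claims
)

# Convert 1/0 draws to the status codes used in the report
sim_status <- ifelse(sim_matrix == 1, "O", "C")

# Share of simulations in which each claim is open
open_share <- rowMeans(sim_matrix)
```

The row means should land close to the input probabilities, which is a quick sanity check that rows and columns were not transposed.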

The table below shows selected age 30 claims after they had their closure probability predicted by the closure model and their status simulated using a simulated binomial random variable.

cm_out <- cm_probs

cm_out <- dplyr::select(cm_out, claim_number, status, tot_rx, tot_pd_incr, O)

cm_out$status_sim <- cm_pred[, 1]
cm_out <- cm_out[c(1, 6, 24, 2, 10), ]
names(cm_out) <- c("Claim Number", "Status", "Case", "Paid Incre", 
                   "Prob Open", "Sim Status")
kable(cm_out,
      row.names = FALSE)

Claim Number	Status	Paid Incre	Prob Open	Sim Status
2008137571	C	0.00	0.0075471	C
2008137835	C	17754.26	0.0072112	C
2008138427	C	0.00	0.0075471	C
2008137654	C	0.00	0.0075471	C
2008138095	C	0.00	0.0075471	C

The Prob Open column is the probability that the age 30 claim will be open at age 42 as modeled in the closure model. The Sim Status column is the result of a Bernoulli simulation on each of those probabilities.

I am running this simulation 2000 times to simulate 2000 closure scenarios.

The resulting distribution of simulated outcomes allows me to estimate confidence intervals empirically.
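One simple way to read an interval off a set of simulations is to take empirical percentiles. The sketch below does this for made-up simulated totals; the gamma draws are only a stand-in for per-simulation aggregates such as the open claim counts or incremental payments produced later in the report.

```r
set.seed(2024)

# Stand-in for 2000 simulated aggregate outcomes (hypothetical values)
sim_totals <- rgamma(2000, shape = 50, rate = 1 / 10000)

# Empirical 95% interval: the 2.5th and 97.5th percentiles of the simulations
interval_95 <- quantile(sim_totals, probs = c(0.025, 0.975))

# Share of simulations falling inside the interval (about 95% by construction)
inside_share <- mean(sim_totals >= interval_95[[1]] & sim_totals <= interval_95[[2]])
```

Comparing the actual outcome with such an interval gives a rough check on whether the simulated distribution is plausible.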

3.2 Zero Payment Model

Next, the zero payment model predicts, for each simulated claim and its simulated status, the probability of a non-zero incremental payment. The zero or non-zero outcome is then simulated from this probability using the same random binomial approach used for closure status.

# put closure model predictions together
cm_pred <- cbind(cm_probs[, c("claim_number"), drop = FALSE], cm_pred)

# gather `cm_pred` into a long data frame
cm_pred <- tidyr::gather(cm_pred, key = "sim_num", 
                         value = "status_sim", 
                         -claim_number)

# join the closure model simulations to the claim data to build `zm_pred_data`
# remove status_act and rename the simulated status as status_act
zm_pred_data <- left_join(cm_pred, cm_probs, by = "claim_number") %>%
                  dplyr::select(-status_act) %>%
                  dplyr::rename(status_act = status_sim)

# remove all claims that have a closed closed status from the data
# these will be set to incremental payments of 0 
closed_closed_data <- dplyr::filter(zm_pred_data, status == "C" &  status_act == "C")

zm_pred_data <- filter(zm_pred_data, status == "O" |  status_act == "O")

zm_pred <- cbind(zm_pred_data, 
                  predict(zm_model, newdata = zm_pred_data, type = "prob"))
  
zm_pred$zero_sim <- sapply(zm_pred$NonZero, rbinom, n = 1, size = 1)
zm_pred$zero_sim <- ifelse(zm_pred$zero_sim == 1, "NonZero", "Zero")

zm_out <- zm_pred

zm_out <- dplyr::select(zm_out, claim_number, status, tot_rx, tot_pd_incr, 
                        status_act, NonZero, zero_sim)

zm_out <- head(zm_out, 8)
names(zm_out) <- c("Claim Number", "Status", "Case", "Paid Incre", 
                   "Sim Status", "Prob Non Zero", "Zero Sim")
kable(zm_out,
      row.names = FALSE)

Claim Number	Status	Case	Paid Incre	Sim Status	Prob Non Zero	Zero Sim
2008137674	O	86018.83	3741.17	O	0.8621934	NonZero
2008137801	O	76990.10	4588.70	O	0.8541665	Zero
2008138088	O	123715.66	4861.24	C	0.8075995	NonZero
2008138093	O	20475.05	83510.25	C	0.6467032	Zero
2008138329	O	124989.76	3316019.94	C	0.1010250	Zero
2008138368	O	63180.31	16819.69	C	0.7281607	NonZero
2008138496	O	204459.93	15540.07	O	0.9354681	NonZero
2008138605	O	3541.86	322.00	C	0.6396368	NonZero

3.3 Incremental Payment Simulation

Since I am only interested in predicting incremental payments for claims that were simulated to have a non-zero incremental payment, all claims that were closed at age 30 and were simulated to be closed at 42 will be given an incremental payment of zero.

Additionally, all claims that were simulated by the Zero Payment Model to have a Zero payment will be given an incremental payment of zero.

# separate zeros from non zeros
zero_claims <- filter(zm_pred, zero_sim == "Zero")

nzm_pred <- filter(zm_pred, zero_sim == "NonZero")

Now for the final simulations I simulate all the claims that were predicted to have a non-zero incremental payment.

### Quasi Poisson Simulation
nzm_pred$tot_pd_incr_fit <- exp(predict(nzm_model, newdata = nzm_pred))

# use negative binomial to randomly disperse claims from predicted fit
nzm_pred$tot_pd_incr_sim <- sapply(nzm_pred$tot_pd_incr_fit,
                                    function(x) {
                                      rnbinom(n = 1, size = x ^ (1/5), prob = 1 / (1 + x ^ (4/5))) 
                                    })

closed_closed_data$tot_pd_incr_sim <- 0
zero_claims$tot_pd_incr_sim <- 0

closed_closed_data$sim_type <- "Close_Close"
zero_claims$sim_type <- "Zero"
nzm_pred$sim_type <- "Non_Zero"


cols <- c("sim_num", "claim_number", "status_act", "tot_pd_incr_sim", "sim_type")

sim_1 <- closed_closed_data[, cols]
sim_2 <- zero_claims[, cols]
sim_3 <- nzm_pred[, cols]


full_sim <- rbind(sim_1, sim_2, sim_3)

kable(
  full_sim[sample(1:nrow(full_sim), 20), ], 
  row.names = FALSE,
  col.names = c("Sim Num", "Claim Num", "Sim Status", "Sim Payment", "Sim Type"))

Sim Num	Claim Num	Sim Status	Sim Payment	Sim Type
V1436	2008143184	C	0	Close_Close
V1907	2008141841	C	0	Close_Close
V1753	2009155522	O	282069	Non_Zero
V1109	2008146393	C	0	Close_Close
V1587	2009152667	C	0	Close_Close
V1530	2008138722	C	0	Close_Close
V404	2009155055	C	0	Close_Close
V226	2009156783	O	13473	Non_Zero
V550	2008144828	C	0	Close_Close
V306	2009149406	O	145973	Non_Zero
V1921	2009156607	C	0	Close_Close
V798	2008142796	C	0	Close_Close
V1642	2008141299	C	0	Close_Close
V298	2008145434	O	185065	Non_Zero
V853	2008146293	C	0	Close_Close
V1249	2009153881	C	0	Close_Close
V1383	2008149110	C	0	Close_Close
V1691	2008143511	C	0	Close_Close
V683	2008142919	C	0	Close_Close
V1186	2008140829	C	0	Close_Close
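The negative binomial parameters used in the simulation code above (`size = x ^ (1/5)` and `prob = 1 / (1 + x ^ (4/5))`) keep the simulated mean equal to the fitted value x while letting the variance grow faster than the mean. The sketch below checks this using the standard negative binomial moments, mean = size * (1 - prob) / prob and variance = mean / prob. The helper `nb_moments` is hypothetical.

```r
# Hypothetical helper: theoretical moments of the negative binomial draw
# used in the payment simulation, as a function of the fitted payment x
nb_moments <- function(x) {
  size <- x^(1 / 5)
  prob <- 1 / (1 + x^(4 / 5))
  nb_mean <- size * (1 - prob) / prob
  c(mean = nb_mean, variance = nb_mean / prob)
}

fitted_payment <- 10000
moments <- nb_moments(fitted_payment)

# Variance-to-mean ratio implied by this parameterization: 1 + x^(4/5)
dispersion_ratio <- moments[["variance"]] / moments[["mean"]]
```

Since (1 - prob) / prob equals x^(4/5), the mean is x^(1/5) * x^(4/5) = x and the variance is x + x^(9/5). The variance-to-mean ratio therefore increases with the fitted payment, unlike the constant dispersion of a quasi-Poisson family.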

3.4 Results

3.4.1 All Claims Aggregated by Simulation

# find actual number of open claims and incremental payment dollars
pred_data_actuals <- mutate(cm_pred_data, status_act = ifelse(status_act == "C", 0, 1))

open_actual <- sum(pred_data_actuals$status_act)
payments_actual <- sum(pred_data_actuals$tot_pd_incr_act)

The blue dashed vertical line marks the actual number of open claims in the test data at 42 months development. The white histogram shows the simulated distribution of open claims at 42 as determined from the simulation based on the claim closure model.

full_sim_agg <- mutate(full_sim, open = ifelse(status_act == "C", 0, 1)) %>%
                  group_by(sim_num) %>%
                  summarise(n = n(),
                            open_claims = sum(open),
                            incremental_paid = sum(tot_pd_incr_sim))


ggplot(full_sim_agg, aes(x = open_claims)) +
  geom_histogram(fill = "white", colour = "black") +
  ggtitle("Histogram of Simulated Open Claim Counts") +
  ylab("Number of Observations") +
  xlab("Open Claim Counts") +
  geom_vline(xintercept = open_actual, size = 1, 
             colour = "blue", linetype = "longdash")

The blue dashed vertical line marks the actual incremental payments in the test data between age 30 and 42. The white histogram shows the simulated distribution of incremental payments between age 30 and 42 months for all claims in the test data. The simulation is based on the incremental payment model.

ggplot(full_sim_agg, aes(x = incremental_paid)) +
  geom_histogram(fill = "white", colour = "black") +
  ggtitle("Histogram of Simulated Incremental Payments") +
  ylab("Number of Observations") +
  xlab("Incremental Payments") +
  geom_vline(xintercept = payments_actual, size = 1, 
             colour = "blue", linetype = "longdash") +
  scale_x_continuous(labels = dollar)

3.4.2 Individual Claim

The blue dashed vertical line marks the actual incremental payments between age 30 and 42 months in the test data for the selected claim (set through input$sel_claim in the code below).

# selectInput(
#   "sel_claim",
#   "Select Claim number",
#   choices = unique(claims$claim_number)[1:50],
#   selected = 2008146184
# )
input <- list()
input$sel_claim <- "2008146184"

indiv <- full_sim[full_sim$claim_number == input$sel_claim, ]
indiv_act <- claims[claims$claim_number == input$sel_claim, "tot_pd_incr_act"]
plot_data <- list(
    indiv,
    indiv_act
  )

ggplot(plot_data[[1]], aes(x = tot_pd_incr_sim)) +
    geom_histogram(fill = "white", colour = "black") +
    ggtitle(paste0("Histogram of Simulated Incremental Payments for claim ", input$sel_claim)) +
    ylab("Number of Observations") +
    xlab("Incremental Payments") +
    geom_vline(xintercept = plot_data[[2]], size = 1, 
               colour = "blue", linetype = "longdash") +
    scale_x_continuous(labels = dollar)

out <- claims[claims$claim_number == input$sel_claim, 3:8]
names(out) <- c("Claim Num", "Status", "Case", "Incremental Payment", "Actual Status", "Actual Payment")
claim_stats <- out

kable(claim_stats)

	Claim Num	Status	Case	Incremental Payment	Actual Status	Actual Payment
416	2008146184	O	73000	479.7	C	61897.94

4 Conclusion

WIP

5 Appendices

5.1 A. Software

I used R, the free and open source statistical programming environment, for all the data analysis, model fitting, simulations, graphics, and data output.

The caret package was used extensively for the heavy lifting in predictive modeling.

Details of the R environment at the time this report was rendered are shown below:

sessionInfo()

## R version 4.1.1 (2021-08-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 18363)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] webshot_0.5.2      qs_0.25.1          bindrcpp_0.2.2     e1071_1.7-9       
##  [5] shiny_1.7.0        scales_1.1.1       DiagrammeR_1.0.6.1 knitr_1.36        
##  [9] lubridate_1.7.10   caret_6.0-89       lattice_0.20-44    ggplot2_3.3.5     
## [13] tidyr_1.1.4        dplyr_1.0.7       
## 
## loaded via a namespace (and not attached):
##  [1] sass_0.4.0           jsonlite_1.7.2       splines_4.1.1       
##  [4] foreach_1.5.1        prodlim_2019.11.13   RcppParallel_5.1.4  
##  [7] bslib_0.3.0          highr_0.9            stats4_4.1.1        
## [10] yaml_2.2.1           globals_0.14.0       ipred_0.9-12        
## [13] pillar_1.6.3         glue_1.4.2           pROC_1.18.0         
## [16] digest_0.6.28        RColorBrewer_1.1-2   promises_1.2.0.1    
## [19] stringfish_0.15.2    colorspace_2.0-2     recipes_0.1.17      
## [22] htmltools_0.5.2      httpuv_1.6.3         Matrix_1.3-4        
## [25] plyr_1.8.6           timeDate_3043.102    pkgconfig_2.0.3     
## [28] listenv_0.8.0        purrr_0.3.4          xtable_1.8-4        
## [31] later_1.3.0          gower_0.2.2          RApiSerialize_0.1.0 
## [34] lava_1.6.10          tibble_3.1.4         proxy_0.4-26        
## [37] mgcv_1.8-36          farver_2.1.0         generics_0.1.0      
## [40] ellipsis_0.3.2       withr_2.4.2          nnet_7.3-16         
## [43] survival_3.2-11      magrittr_2.0.1       crayon_1.4.1        
## [46] mime_0.12            evaluate_0.14        future_1.22.1       
## [49] fansi_0.5.0          parallelly_1.28.1    nlme_3.1-152        
## [52] MASS_7.3-54          class_7.3-19         tools_4.1.1         
## [55] data.table_1.14.2    lifecycle_1.0.1      stringr_1.4.0       
## [58] munsell_0.5.0        compiler_4.1.1       jquerylib_0.1.4     
## [61] rlang_0.4.11         grid_4.1.1           iterators_1.0.13    
## [64] htmlwidgets_1.5.4    visNetwork_2.1.0     labeling_0.4.2      
## [67] rmarkdown_2.11       gtable_0.3.0         ModelMetrics_1.2.2.2
## [70] codetools_0.2-18     reshape2_1.4.4       R6_2.5.1            
## [73] fastmap_1.1.0        future.apply_1.8.1   utf8_1.2.2          
## [76] bindr_0.1.1          stringi_1.7.4        parallel_4.1.1      
## [79] Rcpp_1.0.7           vctrs_0.3.8          rpart_4.1-15        
## [82] tidyselect_1.1.1     xfun_0.26

5.2 Closure Model Summary Statistics

summary(cm_model)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9227  -0.1231  -0.1231  -0.1222   3.1558  
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept) -3.70922    0.15916 -23.304 < 0.0000000000000002 ***
## statusO      2.08476    0.09473  22.008 < 0.0000000000000002 ***
## tot_rx       0.12898    0.04326   2.982              0.00286 ** 
## tot_pd_incr -0.29871    0.11372  -2.627              0.00862 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3127.4  on 4089  degrees of freedom
## Residual deviance: 1628.8  on 4086  degrees of freedom
## AIC: 1636.8
## 
## Number of Fisher Scoring iterations: 7

5.3 Zero Payment Model Summary Statistics

summary(zm_model)

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8772  -0.7730  -0.6295  -0.0603   5.5902  
## 
## Coefficients:
##             Estimate Std. Error z value             Pr(>|z|)    
## (Intercept) -1.55421    0.11849 -13.116 < 0.0000000000000002 ***
## status_actO -0.33479    0.07890  -4.243           0.00002202 ***
## tot_rx      -1.82081    0.40290  -4.519           0.00000621 ***
## tot_pd_incr  0.22879    0.08503   2.691              0.00713 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1098.7  on 1019  degrees of freedom
## Residual deviance: 1021.1  on 1016  degrees of freedom
## AIC: 1029.1
## 
## Number of Fisher Scoring iterations: 7

5.4 Incremental Payment Model Summary Statistics

summary(nzm_model)

## 
## Family: quasipoisson 
## Link function: log 
## 
## Formula:
## tot_pd_incr_act ~ status_act + s(tot_rx) + s(tot_pd_incr)
## 
## Parametric coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)  10.7247     0.1018  105.34 < 0.0000000000000002 ***
## status_actO  -0.3492     0.1145   -3.05              0.00237 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df      F              p-value    
## s(tot_rx)      8.734  8.967 47.905 < 0.0000000000000002 ***
## s(tot_pd_incr) 7.327  8.112  3.323             0.000893 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.687   Deviance explained = 66.3%
## GCV =  96448  Scale est. = 1.6565e+05  n = 786

Micro-Claims Analysis and Modelling: Claim Status and Payment Model

Jimmy Briggs

May 9, 2016
