Modeling Car Insurance Claims Using Bayesian Logistic Regression: Part 1

Data Science nuggets

Introduction

  • No lie, classical statistics (or so-called frequentist statistics) rules the way people do data science. This project is meant to look at ways in which Bayesian statistics can be used to model data.
  • The main focus is on the most common model: logistic regression.
  • This is the first article, so no descriptive statistics are done here.
  • A complete analysis of this dataset will also be done using machine learning algorithms.

Background

Probability theory and odds ratios

  • Probability is the chance that something will happen, or \(\pi\).
  • The odds of an event are the ratio of the probability that it will happen to the probability that it won’t happen, or

\[ \text{odds} = \frac{\pi}{1 - \pi} \]

Probabilities are limited to 0–1. Odds can range from 0 to infinity. This makes interpretation a little weird.

If the probability of rain tomorrow is \(\pi = 2/3\), the probability that it doesn’t rain is \(1 - \pi\), or \(1/3\). The odds are thus

\[ \frac{2/3}{1/3} = 2 \]

…which means that it’s twice as likely to rain as not to rain.

In general:

  • If odds are < 1, \(\pi < 0.5\)
  • If odds are > 1, \(\pi > 0.5\)
  • If odds are 1, \(\pi = 0.5\)
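As a quick sanity check, here is a minimal sketch of these conversions in base R (nothing model-specific):

p <- 2/3              # probability of rain
odds <- p / (1 - p)   # odds = 2: rain is twice as likely as no rain
odds / (1 + odds)     # back from odds to the probability, 0.667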

Building the logistic regression model

Prelude

In general, GLMs let us model non-linear data in linear ways by transforming the outcome to a more linear scale. With logistic regression, raw probabilities follow a curvy, sigmoidal shape and are bounded between 0 and 1, and linear regression can’t handle bounds like that.

Our GLM link is thus a logit function:

\[ \log \left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots \]

Or more simply:

\[ \operatorname{logit}(\pi_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots \]
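Base R already provides this link and its inverse: qlogis() is the logit and plogis() is the inverse logit, so a quick sketch of the transformation looks like this:

p <- 0.75
qlogis(p)           # logit: log(0.75 / 0.25) = log(3), about 1.099
plogis(qlogis(p))   # inverse logit recovers the probability, 0.75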

Setup

library(tidyverse)
library(rstanarm)
library(broom)
library(broom.mixed)
library(patchwork)
library(latex2exp)
library(scales)
library(VIM)
library(naniar)


# colors and theme for the plots
clrs <- MetBrewer::met.brewer("Lakota", 6)
theme_set(theme_bw())

# toy data: a grid of logits with their corresponding odds and probabilities
logit_df <- tibble(mileage = seq(0, 100, length.out = 101),
                   logits = seq(-4, 6, length.out = 101)) |> 
  mutate(odds = exp(logits)) |> 
  mutate(probs = plogis(logits))

p1 <- ggplot(logit_df, aes(x = mileage, y = probs)) +
  geom_line(size = 1, color = clrs[3]) +
  labs(title = "Probabilities", 
       subtitle = "Can't be modeled linearly",
       x = "mileage",
       y = TeX("$\\pi$")) +
  theme(plot.title = element_text(face = "bold"),
        axis.title.y = element_text(angle = 0, hjust = 0))

p2 <- ggplot(logit_df, aes(x = mileage, y = odds)) +
  geom_line(size = 1, color = clrs[2]) +
  labs(title = "Odds", 
       subtitle = "No longer bound between 0–1, but still too curvy for linear models",
       x = "mileage",
       y = TeX("$\\frac{\\pi}{1 - \\pi}$")) +
  theme(plot.title = element_text(face = "bold"),
        axis.title.y = element_text(angle = 0, hjust = 0))

p3 <- ggplot(logit_df, aes(x = mileage, y = logits)) +
  geom_line(size = 1, color = clrs[1]) +
  labs(title = "Log odds (logits)", 
       subtitle = "Linear models are comfortable",
       x = "mileage",
       y = TeX("$log \\left( \\frac{\\pi}{1 - \\pi} \\right)$")) +
  theme(plot.title = element_text(face = "bold"),
        axis.title.y = element_text(angle = 0, hjust = 0))

p1 / p2 / p3

Refer to this post for more on logistic regression: https://rpubs.com/BayesianN/logistic

Bayesian logistic regression

The four steps of a Bayesian analysis are

  1. Specify a joint distribution for the outcome(s) and all the unknowns. This joint distribution is proportional to the posterior distribution of the unknowns conditional on the observed data.
  2. Draw from the posterior distribution using Markov chain Monte Carlo (MCMC).
  3. Evaluate how well the model fits the data and possibly revise the model.
  4. Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).

Likelihood

For a binomial GLM the likelihood for one observation \(y\) can be written as a conditionally binomial PMF \[\binom{n}{y} \pi^{y} (1 - \pi)^{n - y},\] where \(n\) is the known number of trials, \(\pi = g^{-1}(\eta)\) is the probability of success and \(\eta = \alpha + \mathbf{x}^\top \boldsymbol{\beta}\) is a linear predictor. With the logit (or log-odds) link function \(g(x) = \ln{\left(\frac{x}{1-x}\right)}\), the likelihood for a single observation becomes

\[\binom{n}{y}\left(\text{logit}^{-1}(\eta)\right)^y \left(1 - \text{logit}^{-1}(\eta)\right)^{n-y} = \binom{n}{y} \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{y} \left(\frac{1}{1 + e^{\eta}}\right)^{n - y}\]
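To make the formula concrete, here is a small illustrative sketch that evaluates this log-likelihood for a single observation with n = 1 (the Bernoulli case used in this article); log_lik_one is just a hypothetical helper, not part of any package:

# illustrative helper: Bernoulli log-likelihood at linear predictor eta
log_lik_one <- function(y, eta, n = 1) {
  p <- plogis(eta)                        # pi = logit^{-1}(eta)
  dbinom(y, size = n, prob = p, log = TRUE)
}

log_lik_one(y = 1, eta = 0.5)   # log(pi) when eta = 0.5
log_lik_one(y = 0, eta = 0.5)   # log(1 - pi)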

Prior distributions

A full Bayesian analysis requires specifying prior distributions \(f(\alpha)\) and \(f(\boldsymbol{\beta})\) for the intercept and vector of regression coefficients. When using stan_glm, these distributions can be set using the prior_intercept and prior arguments.
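For example, a sketch of how these arguments might be filled in (the same Student t prior is used for the actual fit later on; normal() is shown only as a tighter alternative):

# rstanarm prior constructors
weak_prior   <- student_t(df = 7, location = 0, scale = 2.5)
narrow_prior <- normal(location = 0, scale = 1)   # a tighter alternative

# and then, schematically:
# stan_glm(..., prior = weak_prior, prior_intercept = weak_prior)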

Posterior

With independent prior distributions, the joint posterior distribution for \(\alpha\) and \(\boldsymbol{\beta}\) is proportional to the product of the priors and the \(N\) likelihood contributions:

\[f\left(\alpha,\boldsymbol{\beta} | \mathbf{y},\mathbf{X}\right) \propto f\left(\alpha\right) \times \prod_{k=1}^K f\left(\beta_k\right) \times \prod_{i=1}^N { g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i-y_i}}.\]

This is the posterior distribution that stan_glm draws from when using MCMC.
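To see what “proportional to” means in practice, here is a purely illustrative sketch of the unnormalized log-posterior (up to additive constants) that MCMC explores for the Bernoulli case; log_post is a hypothetical helper, not something stan_glm exposes:

# unnormalized log-posterior: log prior + log likelihood (n_i = 1 case);
# the scaled-t normalizing constants are dropped as they don't depend
# on alpha or beta
log_post <- function(alpha, beta, y, X, df = 7, scale = 2.5) {
  eta <- alpha + as.matrix(X) %*% beta
  log_prior <- dt(alpha / scale, df = df, log = TRUE) +
    sum(dt(beta / scale, df = df, log = TRUE))
  log_lik <- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
  log_prior + log_lik
}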

Bayesian Logistic Regression in R

As an example, here we will show how to carry out an analysis modeling car insurance claims.

Car Insurance data

The dataset

Column                Description
------                -----------
id                    Unique client identifier
age                   Client’s age
gender                Client’s gender
driving_experience    Years the client has been driving
education             Client’s level of education
income                Client’s income level
credit_score          Client’s credit score (between zero and one)
vehicle_ownership     Client’s vehicle ownership status
vehicle_year          Year of vehicle registration
married               Client’s marital status
children              Client’s number of children
postal_code           Client’s postal code
annual_mileage        Number of miles driven by the client each year
vehicle_type          Type of car
speeding_violations   Total number of speeding violations received by the client
duis                  Number of times the client has been caught driving under the influence of alcohol
past_accidents        Total number of previous accidents the client has been involved in
outcome               Whether the client made a claim on their car insurance (response variable)
# file preview shows a header row
car_insurance <- read.csv("car_insurance.csv", header = TRUE)

# first look at the dataset using summary() and str() to understand
# what type of data you are working with
summary(car_insurance)
#>        id              age           gender           race       
#>  Min.   :   101   Min.   :0.00   Min.   :0.000   Min.   :0.0000  
#>  1st Qu.:249639   1st Qu.:1.00   1st Qu.:0.000   1st Qu.:1.0000  
#>  Median :501777   Median :1.00   Median :0.000   Median :1.0000  
#>  Mean   :500522   Mean   :1.49   Mean   :0.499   Mean   :0.9012  
#>  3rd Qu.:753975   3rd Qu.:2.00   3rd Qu.:1.000   3rd Qu.:1.0000  
#>  Max.   :999976   Max.   :3.00   Max.   :1.000   Max.   :1.0000  
#>                                                                  
#>  driving_experience   education        income     credit_score   
#>  Min.   :0.000      Min.   :0.00   Min.   :0.0   Min.   :0.0534  
#>  1st Qu.:0.000      1st Qu.:2.00   1st Qu.:1.0   1st Qu.:0.4172  
#>  Median :1.000      Median :2.00   Median :2.0   Median :0.5250  
#>  Mean   :1.069      Mean   :2.01   Mean   :1.9   Mean   :0.5158  
#>  3rd Qu.:2.000      3rd Qu.:3.00   3rd Qu.:3.0   3rd Qu.:0.6183  
#>  Max.   :3.000      Max.   :3.00   Max.   :3.0   Max.   :0.9608  
#>                                                  NA's   :982     
#>  vehicle_ownership  vehicle_year       married          children     
#>  Min.   :0.000     Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
#>  1st Qu.:0.000     1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
#>  Median :1.000     Median :0.0000   Median :0.0000   Median :1.0000  
#>  Mean   :0.697     Mean   :0.3033   Mean   :0.4982   Mean   :0.6888  
#>  3rd Qu.:1.000     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
#>  Max.   :1.000     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
#>                                                                      
#>   postal_code    annual_mileage   vehicle_type    speeding_violations
#>  Min.   :10238   Min.   : 2000   Min.   :0.0000   Min.   : 0.000     
#>  1st Qu.:10238   1st Qu.:10000   1st Qu.:0.0000   1st Qu.: 0.000     
#>  Median :10238   Median :12000   Median :0.0000   Median : 0.000     
#>  Mean   :19865   Mean   :11697   Mean   :0.0477   Mean   : 1.483     
#>  3rd Qu.:32765   3rd Qu.:14000   3rd Qu.:0.0000   3rd Qu.: 2.000     
#>  Max.   :92101   Max.   :22000   Max.   :1.0000   Max.   :22.000     
#>                  NA's   :957                                         
#>       duis        past_accidents      outcome      
#>  Min.   :0.0000   Min.   : 0.000   Min.   :0.0000  
#>  1st Qu.:0.0000   1st Qu.: 0.000   1st Qu.:0.0000  
#>  Median :0.0000   Median : 0.000   Median :0.0000  
#>  Mean   :0.2392   Mean   : 1.056   Mean   :0.3133  
#>  3rd Qu.:0.0000   3rd Qu.: 2.000   3rd Qu.:1.0000  
#>  Max.   :6.0000   Max.   :15.000   Max.   :1.0000  
#> 
str(car_insurance)
#> 'data.frame':    10000 obs. of  19 variables:
#>  $ id                 : int  569520 750365 199901 478866 731664 877557 930134 461006 68366 445911 ...
#>  $ age                : int  3 0 0 0 1 2 3 1 2 2 ...
#>  $ gender             : int  0 1 0 1 1 0 1 0 0 0 ...
#>  $ race               : int  1 1 1 1 1 1 1 1 1 1 ...
#>  $ driving_experience : int  0 0 0 0 1 2 3 0 2 0 ...
#>  $ education          : int  2 0 2 3 0 2 2 3 3 2 ...
#>  $ income             : int  3 0 1 1 1 3 3 1 1 3 ...
#>  $ credit_score       : num  0.629 0.358 0.493 0.206 0.388 ...
#>  $ vehicle_ownership  : int  1 0 1 1 1 1 0 0 0 1 ...
#>  $ vehicle_year       : int  1 0 0 0 0 1 1 1 0 0 ...
#>  $ married            : int  0 0 0 0 0 0 1 0 1 0 ...
#>  $ children           : int  1 0 0 1 0 1 1 1 0 1 ...
#>  $ postal_code        : int  10238 10238 10238 32765 32765 10238 10238 10238 10238 32765 ...
#>  $ annual_mileage     : int  12000 16000 11000 11000 12000 13000 13000 14000 13000 11000 ...
#>  $ vehicle_type       : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ speeding_violations: int  0 0 0 0 2 3 7 0 0 0 ...
#>  $ duis               : int  0 0 0 0 0 0 0 0 0 0 ...
#>  $ past_accidents     : int  0 0 0 0 1 3 3 0 0 0 ...
#>  $ outcome            : int  0 1 0 0 1 0 0 1 0 1 ...

Glimpse of the dataset

head(car_insurance)
#>       id age gender race driving_experience education income credit_score
#> 1 569520   3      0    1                  0         2      3    0.6290273
#> 2 750365   0      1    1                  0         0      0    0.3577571
#> 3 199901   0      0    1                  0         2      1    0.4931458
#> 4 478866   0      1    1                  0         3      1    0.2060129
#> 5 731664   1      1    1                  1         0      1    0.3883659
#> 6 877557   2      0    1                  2         2      3    0.6191274
#>   vehicle_ownership vehicle_year married children postal_code annual_mileage
#> 1                 1            1       0        1       10238          12000
#> 2                 0            0       0        0       10238          16000
#> 3                 1            0       0        0       10238          11000
#> 4                 1            0       0        1       32765          11000
#> 5                 1            0       0        0       32765          12000
#> 6                 1            1       0        1       10238          13000
#>   vehicle_type speeding_violations duis past_accidents outcome
#> 1            0                   0    0              0       0
#> 2            0                   0    0              0       1
#> 3            0                   0    0              0       0
#> 4            0                   0    0              0       0
#> 5            0                   2    0              1       1
#> 6            0                   3    0              3       0

Look at missing data

colSums(is.na(car_insurance))
#>                  id                 age              gender                race 
#>                   0                   0                   0                   0 
#>  driving_experience           education              income        credit_score 
#>                   0                   0                   0                 982 
#>   vehicle_ownership        vehicle_year             married            children 
#>                   0                   0                   0                   0 
#>         postal_code      annual_mileage        vehicle_type speeding_violations 
#>                   0                 957                   0                   0 
#>                duis      past_accidents             outcome 
#>                   0                   0                   0
aggr(car_insurance, numbers = TRUE, prop = FALSE)
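The naniar package loaded above offers a complementary view of the same missingness pattern; for instance:

miss_var_summary(car_insurance)   # NA counts and percentages per column
gg_miss_var(car_insurance)        # the same information as a plot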

Data Pre-processing

  • For this data I will replace missing values with the mean, for illustration.
  • I will also select just a few variables, again for illustration.
car_insurance_new <- car_insurance %>% 
  # mean-impute the two columns that contain missing values
  mutate(credit_score = replace_na(credit_score, mean(credit_score, na.rm = TRUE)),
         annual_mileage = as.numeric(annual_mileage),
         annual_mileage = replace_na(annual_mileage, mean(annual_mileage, na.rm = TRUE))) %>% 
  # keep a handful of predictors plus the response
  select(married, vehicle_type, vehicle_year, outcome, speeding_violations, duis, gender)

## check for missingness
anyNA(car_insurance_new)
#> [1] FALSE


## look at the data
head(car_insurance) %>% 
  relocate(annual_mileage,credit_score)
#>   annual_mileage credit_score     id age gender race driving_experience
#> 1          12000    0.6290273 569520   3      0    1                  0
#> 2          16000    0.3577571 750365   0      1    1                  0
#> 3          11000    0.4931458 199901   0      0    1                  0
#> 4          11000    0.2060129 478866   0      1    1                  0
#> 5          12000    0.3883659 731664   1      1    1                  1
#> 6          13000    0.6191274 877557   2      0    1                  2
#>   education income vehicle_ownership vehicle_year married children postal_code
#> 1         2      3                 1            1       0        1       10238
#> 2         0      0                 0            0       0        0       10238
#> 3         2      1                 1            0       0        0       10238
#> 4         3      1                 1            0       0        1       32765
#> 5         0      1                 1            0       0        0       32765
#> 6         2      3                 1            1       0        1       10238
#>   vehicle_type speeding_violations duis past_accidents outcome
#> 1            0                   0    0              0       0
#> 2            0                   0    0              0       1
#> 3            0                   0    0              0       0
#> 4            0                   0    0              0       0
#> 5            0                   2    0              1       1
#> 6            0                   3    0              3       0

Build the model

car_insurance_new$outcome <- factor(car_insurance_new$outcome)

A Bayesian logistic regression model

A Bayesian logistic regression model can be estimated using the stan_glm function. Here we’ll use a Student t prior with 7 degrees of freedom and a scale of 2.5, which is a reasonable default prior when coefficients should be close to zero but have some chance of being large.

The formula, data and family arguments to stan_glm are specified in exactly the same way as for glm.

A frequentist model would look like this:

# note: the response (outcome) appears only on the left-hand side
frequentist_model <- glm(outcome ~ married + vehicle_type + vehicle_year + speeding_violations + duis + gender,
                 data = car_insurance_new,
                 family = binomial(link = "logit"))
summary(frequentist_model)
#> 
#> Call:
#> glm(formula = outcome ~ married + vehicle_type + vehicle_year + 
#>     speeding_violations + duis + gender, family = binomial(link = "logit"), 
#>     data = car_insurance_new)
#> 
#> Coefficients:
#>                     Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)          0.11140    0.04255   2.618  0.00883 ** 
#> married             -0.85085    0.05142 -16.547  < 2e-16 ***
#> vehicle_type        -0.01553    0.11570  -0.134  0.89320    
#> vehicle_year        -1.68338    0.06787 -24.802  < 2e-16 ***
#> speeding_violations -0.48407    0.02138 -22.644  < 2e-16 ***
#> duis                -0.64878    0.06979  -9.296  < 2e-16 ***
#> gender               0.90562    0.05113  17.713  < 2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 12434.3  on 9999  degrees of freedom
#> Residual deviance:  9613.6  on 9993  degrees of freedom
#> AIC: 9627.6
#> 
#> Number of Fisher Scoring iterations: 5

A Bayesian model uses stan_glm:

# seed for reproducibility
SEED <- 14124869
## set priors
t_prior <- student_t(df = 7, location = 0, scale = 2.5)

## build the model
post1 <- stan_glm(outcome ~ married + vehicle_type + vehicle_year + speeding_violations + duis + gender,
                 data = car_insurance_new,
                 family = binomial(link = "logit"), 
                 prior = t_prior, prior_intercept = t_prior, QR = TRUE,
                 seed = SEED, refresh = 0)

stan_glm returns the posterior distribution for the parameters describing the uncertainty related to unknown parameter values:

summary(post1)
#> 
#> Model Info:
#>  function:     stan_glm
#>  family:       binomial [logit]
#>  formula:      outcome ~ married + vehicle_type + vehicle_year + speeding_violations + 
#>     duis + gender
#>  algorithm:    sampling
#>  sample:       4000 (posterior sample size)
#>  priors:       see help('prior_summary')
#>  observations: 10000
#>  predictors:   7
#> 
#> Estimates:
#>                       mean   sd   10%   50%   90%
#> (Intercept)          0.1    0.0  0.1   0.1   0.2 
#> married             -0.9    0.1 -0.9  -0.8  -0.8 
#> vehicle_type         0.0    0.1 -0.2   0.0   0.1 
#> vehicle_year        -1.7    0.1 -1.8  -1.7  -1.6 
#> speeding_violations -0.5    0.0 -0.5  -0.5  -0.5 
#> duis                -0.6    0.1 -0.7  -0.6  -0.6 
#> gender               0.9    0.1  0.8   0.9   1.0 
#> 
#> Fit Diagnostics:
#>            mean   sd   10%   50%   90%
#> mean_PPD 0.3    0.0  0.3   0.3   0.3  
#> 
#> The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
#> 
#> MCMC diagnostics
#>                     mcse Rhat n_eff
#> (Intercept)         0.0  1.0  5779 
#> married             0.0  1.0  5232 
#> vehicle_type        0.0  1.0  5505 
#> vehicle_year        0.0  1.0  4511 
#> speeding_violations 0.0  1.0  3416 
#> duis                0.0  1.0  5062 
#> gender              0.0  1.0  4092 
#> mean_PPD            0.0  1.0  4885 
#> log-posterior       0.0  1.0  1715 
#> 
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).

All Rhat values are below the usual 1.1 threshold, so the MCMC chains converged.
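To inspect these values directly, the rhat() generic from bayesplot (which has a method for stanreg objects) can be used, for example:

rhats <- bayesplot::rhat(post1)
round(rhats, 2)
bayesplot::mcmc_rhat(rhats)   # plot of the Rhat values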

prior_summary(post1)
#> Priors for model 'post1' 
#> ------
#> Intercept (after predictors centered)
#>  ~ student_t(df = 7, location = 0, scale = 2.5)
#> 
#> Coefficients (in Q-space)
#>  ~ student_t(df = [7,7,7,...], location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
#> ------
#> See help('prior_summary.stanreg') for more details
out_new <- as.array(post1)
bayesplot::mcmc_trace(out_new, pars = c("married", "vehicle_type", "vehicle_year", "speeding_violations", "duis", "gender"))

plot(post1,"neff")

The plot above shows, for each parameter, the effective sample size relative to the total number of posterior draws.

plot(post1,"acf_bar")

pplot <- plot(post1, "areas", prob = 0.95, prob_outer = 1)
pplot + geom_vline(xintercept = 0)

We can extract the corresponding posterior median estimates using the coef function. To get a sense of the uncertainty in our estimates, we can use the posterior_interval function to get Bayesian uncertainty intervals, which are computed by finding the relevant quantiles of the draws from the posterior distribution. For example, to compute the medians and 90% intervals we use:

round(coef(post1), 2)
#>         (Intercept)             married        vehicle_type        vehicle_year 
#>                0.11               -0.85               -0.01               -1.68 
#> speeding_violations                duis              gender 
#>               -0.48               -0.65                0.90
round(posterior_interval(post1, prob = 0.9), 2)
#>                        5%   95%
#> (Intercept)          0.04  0.18
#> married             -0.93 -0.77
#> vehicle_type        -0.21  0.18
#> vehicle_year        -1.80 -1.57
#> speeding_violations -0.52 -0.45
#> duis                -0.76 -0.53
#> gender               0.82  0.99
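Because these coefficients are on the log-odds scale, exponentiating them converts them to odds ratios, which are often easier to communicate. For example, exp(-0.85) is roughly 0.43, so married clients have about 0.43 times the odds of making a claim, holding the other predictors fixed:

round(exp(coef(post1)), 2)                           # posterior median odds ratios
round(exp(posterior_interval(post1, prob = 0.9)), 2) # 90% intervals on the odds-ratio scale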

Leave-one-out cross-validation

rstanarm supports the loo package, which implements fast Pareto-smoothed importance-sampling leave-one-out cross-validation (PSIS-LOO).

(loo1 <- loo(post1, save_psis = TRUE))
#> 
#> Computed from 4000 by 10000 log-likelihood matrix
#> 
#>          Estimate    SE
#> elpd_loo  -4814.2  54.9
#> p_loo         7.9   0.3
#> looic      9628.5 109.9
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#> 
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.

Above we see that the PSIS-LOO result is reliable, as all Pareto \(k\) estimates are small (k < 0.5).

Comparison to a baseline model

Compute baseline result without covariates.

post0 <- update(post1, formula = outcome ~ 1, QR = FALSE, refresh=0)

Compare to baseline

(loo0 <- loo(post0))
#> 
#> Computed from 4000 by 10000 log-likelihood matrix
#> 
#>          Estimate   SE
#> elpd_loo  -6218.2 36.4
#> p_loo         1.1  0.0
#> looic     12436.5 72.8
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#> 
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.
loo_compare(loo0, loo1)
#>       elpd_diff se_diff
#> post1     0.0       0.0
#> post0 -1404.0      49.5

The covariates clearly contain useful information for prediction.

Distribution of claim probability

pp_check(post1,"stat_2d")
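Beyond the summary check above, each client’s posterior claim probability can be extracted directly; in recent rstanarm versions posterior_epred() returns draws of \(\pi_i\) on the probability scale (a sketch):

claim_probs <- posterior_epred(post1)   # matrix: posterior draws x clients
dim(claim_probs)                        # 4000 x 10000 here
head(colMeans(claim_probs))             # posterior mean claim probability per client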

To be continued; references will follow in the coming article.