Modeling Car Insurance Claims Using Bayesian Logistic Regression: Part 1
Data Science nuggets
Introduction
- No lie, classical statistics, or so-called frequentist statistics, rules the way people do data science. This project is meant to look at ways in which Bayesian statistics can be used to model data.
- The main focus is on logistic regression, one of the most common models.
- This is the first article in the series; no descriptive statistics are done here.
- A complete analysis of this dataset will also be done using machine learning algorithms.
Background
Probability theory and Odds Ratio
- Probability is the chance that something will happen, denoted \(\pi\).
- The odds of an event is the ratio of the probability that it will happen vs. the probability that it won’t happen, or
\[ \text{odds} = \frac{\pi}{1 - \pi} \]
Probabilities are limited to 0–1. Odds can range from 0 to infinity. This makes interpretation a little weird.
If the probability of rain tomorrow is \(\pi = 2/3\), the probability that it doesn’t rain is \(1 - \pi\), or \(1/3\). The odds are thus
\[ \frac{2/3}{1/3} = 2 \]
…which means that it’s twice as likely to rain as not to rain; the short R snippet after the list below checks this arithmetic.
In general:
- If odds are < 1, \(\pi < 0.5\)
- If odds are > 1, \(\pi > 0.5\)
- If odds are 1, \(\pi = 0.5\)
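A quick check of the rain example in R (a throwaway snippet, not part of the analysis):

# odds from a probability
pi_rain <- 2/3
pi_rain / (1 - pi_rain)
#> [1] 2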
Building the logistic regression model
Prelude
In general, GLMs let us model non-linear data in linear ways by transforming the outcome to a more linear scale. With logistic regression, raw probabilities follow a curvy, sigmoidal shape and are bound between 0 and 1, and linear regression can’t handle bounds like that.
Our GLM link is thus a logit function:
\[ \log \left( \frac{\pi_i}{1 - \pi_i} \right) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots \]
Or more simply:
\[ \operatorname{logit}(\pi_i) = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \dots \]
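For intuition, base R exposes this transformation pair directly: qlogis() is the logit and plogis() is its inverse (plogis() is also what the plotting code below uses).

# logit and inverse logit in base R
qlogis(0.5) # log(0.5 / (1 - 0.5))
#> [1] 0
plogis(0)   # inverse logit, back to a probability
#> [1] 0.5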
Setup
library(tidyverse)
library(rstanarm)
library(broom)
library(broom.mixed)
library(patchwork)
library(latex2exp)
library(scales)
library(VIM)
library(naniar)
clrs <- MetBrewer::met.brewer("Lakota", 6)
theme_set(theme_bw())
logit_df <- tibble(mileage = seq(0, 100, length.out = 101),
                   logits = seq(-4, 6, length.out = 101)) |>
  mutate(odds = exp(logits)) |>
  mutate(probs = plogis(logits))
p1 <- ggplot(logit_df, aes(x = mileage, y = probs)) +
  geom_line(size = 1, color = clrs[3]) +
  labs(title = "Probabilities",
       subtitle = "Can't be modeled linearly",
       x = "mileage",
       y = TeX("$\\pi$")) +
  theme(plot.title = element_text(face = "bold"),
        axis.title.y = element_text(angle = 0, hjust = 0))
p2 <- ggplot(logit_df, aes(x = mileage, y = odds)) +
  geom_line(size = 1, color = clrs[2]) +
  labs(title = "Odds",
       subtitle = "No longer bound between 0–1, but still too curvy for linear models",
       x = "mileage",
       y = TeX("$\\frac{\\pi}{1 - \\pi}$")) +
  theme(plot.title = element_text(face = "bold"),
        axis.title.y = element_text(angle = 0, hjust = 0))
p3 <- ggplot(logit_df, aes(x = mileage, y = logits)) +
  geom_line(size = 1, color = clrs[1]) +
  labs(title = "Log odds (logits)",
       subtitle = "Linear models are comfortable",
       x = "mileage",
       y = TeX("$\\log \\left( \\frac{\\pi}{1 - \\pi} \\right)$")) +
  theme(plot.title = element_text(face = "bold"),
        axis.title.y = element_text(angle = 0, hjust = 0))
p1 / p2 / p3
Refer to this post for more on logistic regression: https://rpubs.com/BayesianN/logistic
Bayesian Logistic regression
The four steps of a Bayesian analysis are:
- Specify a joint distribution for the outcome(s) and all the unknowns. This joint distribution is proportional to the posterior distribution of the unknowns conditional on the observed data.
- Draw from posterior distribution using Markov Chain Monte Carlo (MCMC).
- Evaluate how well the model fits the data and possibly revise the model.
- Draw from the posterior predictive distribution of the outcome(s) given interesting values of the predictors in order to visualize how a manipulation of a predictor affects (a function of) the outcome(s).
Likelihood
For a binomial GLM the likelihood for one observation \(y\) can be written as a conditionally binomial PMF \[\binom{n}{y} \pi^{y} (1 - \pi)^{n - y},\] where \(n\) is the known number of trials, \(\pi = g^{-1}(\eta)\) is the probability of success and \(\eta = \alpha + \mathbf{x}^\top \boldsymbol{\beta}\) is a linear predictor. With the logit (or log-odds) link function \(g(x) = \ln{\left(\frac{x}{1-x}\right)}\), the likelihood for a single observation becomes
\[\binom{n}{y}\left(\text{logit}^{-1}(\eta)\right)^y \left(1 - \text{logit}^{-1}(\eta)\right)^{n-y} = \binom{n}{y} \left(\frac{e^{\eta}}{1 + e^{\eta}}\right)^{y} \left(\frac{1}{1 + e^{\eta}}\right)^{n - y}\]
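To make this concrete, here is a tiny sketch in R evaluating that PMF at a made-up linear predictor value (the eta, n, and y below are illustrative assumptions, not values from the model):

# likelihood contribution of a single observation
eta <- 0.4                          # hypothetical linear predictor
pi  <- plogis(eta)                  # inverse logit gives the success probability
dbinom(x = 3, size = 10, prob = pi) # binomial PMF with n = 10 trials, y = 3 successes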
Prior distributions
A full Bayesian analysis requires specifying prior distributions \(f(\alpha)\) and \(f(\boldsymbol{\beta})\) for the intercept and vector of regression coefficients. When using stan_glm, these distributions can be set using the prior_intercept and prior arguments.
Posterior
With independent prior distributions, the joint posterior distribution for \(\alpha\) and \(\boldsymbol{\beta}\) is proportional to the product of the priors and the \(N\) likelihood contributions:
\[f\left(\alpha,\boldsymbol{\beta} | \mathbf{y},\mathbf{X}\right) \propto f\left(\alpha\right) \times \prod_{k=1}^K f\left(\beta_k\right) \times \prod_{i=1}^N { g^{-1}\left(\eta_i\right)^{y_i} \left(1 - g^{-1}\left(\eta_i\right)\right)^{n_i-y_i}}.\]
This is the posterior distribution that stan_glm will draw from when using MCMC.
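As a rough sketch of what that means computationally, the unnormalized log posterior for a hypothetical single-predictor model could be written directly in R, combining Student t log priors (matching the student_t(7, 0, 2.5) used later) with the Bernoulli log likelihood; stan_glm builds this, and does the sampling, for us:

# a minimal sketch, assuming Bernoulli data y (0/1) and a single predictor x
log_post <- function(alpha, beta, y, x) {
  eta <- alpha + beta * x # linear predictor
  log_lik <- sum(dbinom(y, size = 1, prob = plogis(eta), log = TRUE))
  log_prior <- dt(alpha / 2.5, df = 7, log = TRUE) + # student_t(df = 7, scale = 2.5),
    dt(beta / 2.5, df = 7, log = TRUE)               # up to an additive constant
  log_lik + log_prior
}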
Bayesian Logistic Regression in R
As an example, here we will show how to carry out an analysis of car insurance claims.
Car Insurance data
The dataset
Column | Description
---|---
id | Unique client identifier
age | Client’s age
gender | Client’s gender
driving_experience | Years the client has been driving
education | Client’s level of education
income | Client’s income level
credit_score | Client’s credit score (between zero and one)
vehicle_ownership | Client’s vehicle ownership status
vehicle_year | Year of vehicle registration
married | Client’s marital status
children | Client’s number of children
postal_code | Client’s postal code
annual_mileage | Number of miles driven by the client each year
vehicle_type | Type of car
speeding_violations | Total number of speeding violations received by the client
duis | Number of times the client has been caught driving under the influence of alcohol
past_accidents | Total number of previous accidents the client has been involved in
outcome | Whether the client made a claim on their car insurance (response variable)
# file preview shows a header row
car_insurance <- read.csv("car_insurance.csv", header = TRUE)
# first look at the data set using summary() and str() to understand
# what type of data you are working with
summary(car_insurance)
#> id age gender race
#> Min. : 101 Min. :0.00 Min. :0.000 Min. :0.0000
#> 1st Qu.:249639 1st Qu.:1.00 1st Qu.:0.000 1st Qu.:1.0000
#> Median :501777 Median :1.00 Median :0.000 Median :1.0000
#> Mean :500522 Mean :1.49 Mean :0.499 Mean :0.9012
#> 3rd Qu.:753975 3rd Qu.:2.00 3rd Qu.:1.000 3rd Qu.:1.0000
#> Max. :999976 Max. :3.00 Max. :1.000 Max. :1.0000
#>
#> driving_experience education income credit_score
#> Min. :0.000 Min. :0.00 Min. :0.0 Min. :0.0534
#> 1st Qu.:0.000 1st Qu.:2.00 1st Qu.:1.0 1st Qu.:0.4172
#> Median :1.000 Median :2.00 Median :2.0 Median :0.5250
#> Mean :1.069 Mean :2.01 Mean :1.9 Mean :0.5158
#> 3rd Qu.:2.000 3rd Qu.:3.00 3rd Qu.:3.0 3rd Qu.:0.6183
#> Max. :3.000 Max. :3.00 Max. :3.0 Max. :0.9608
#> NA's :982
#> vehicle_ownership vehicle_year married children
#> Min. :0.000 Min. :0.0000 Min. :0.0000 Min. :0.0000
#> 1st Qu.:0.000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
#> Median :1.000 Median :0.0000 Median :0.0000 Median :1.0000
#> Mean :0.697 Mean :0.3033 Mean :0.4982 Mean :0.6888
#> 3rd Qu.:1.000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
#> Max. :1.000 Max. :1.0000 Max. :1.0000 Max. :1.0000
#>
#> postal_code annual_mileage vehicle_type speeding_violations
#> Min. :10238 Min. : 2000 Min. :0.0000 Min. : 0.000
#> 1st Qu.:10238 1st Qu.:10000 1st Qu.:0.0000 1st Qu.: 0.000
#> Median :10238 Median :12000 Median :0.0000 Median : 0.000
#> Mean :19865 Mean :11697 Mean :0.0477 Mean : 1.483
#> 3rd Qu.:32765 3rd Qu.:14000 3rd Qu.:0.0000 3rd Qu.: 2.000
#> Max. :92101 Max. :22000 Max. :1.0000 Max. :22.000
#> NA's :957
#> duis past_accidents outcome
#> Min. :0.0000 Min. : 0.000 Min. :0.0000
#> 1st Qu.:0.0000 1st Qu.: 0.000 1st Qu.:0.0000
#> Median :0.0000 Median : 0.000 Median :0.0000
#> Mean :0.2392 Mean : 1.056 Mean :0.3133
#> 3rd Qu.:0.0000 3rd Qu.: 2.000 3rd Qu.:1.0000
#> Max. :6.0000 Max. :15.000 Max. :1.0000
#>
str(car_insurance)
#> 'data.frame': 10000 obs. of 19 variables:
#> $ id : int 569520 750365 199901 478866 731664 877557 930134 461006 68366 445911 ...
#> $ age : int 3 0 0 0 1 2 3 1 2 2 ...
#> $ gender : int 0 1 0 1 1 0 1 0 0 0 ...
#> $ race : int 1 1 1 1 1 1 1 1 1 1 ...
#> $ driving_experience : int 0 0 0 0 1 2 3 0 2 0 ...
#> $ education : int 2 0 2 3 0 2 2 3 3 2 ...
#> $ income : int 3 0 1 1 1 3 3 1 1 3 ...
#> $ credit_score : num 0.629 0.358 0.493 0.206 0.388 ...
#> $ vehicle_ownership : int 1 0 1 1 1 1 0 0 0 1 ...
#> $ vehicle_year : int 1 0 0 0 0 1 1 1 0 0 ...
#> $ married : int 0 0 0 0 0 0 1 0 1 0 ...
#> $ children : int 1 0 0 1 0 1 1 1 0 1 ...
#> $ postal_code : int 10238 10238 10238 32765 32765 10238 10238 10238 10238 32765 ...
#> $ annual_mileage : int 12000 16000 11000 11000 12000 13000 13000 14000 13000 11000 ...
#> $ vehicle_type : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ speeding_violations: int 0 0 0 0 2 3 7 0 0 0 ...
#> $ duis : int 0 0 0 0 0 0 0 0 0 0 ...
#> $ past_accidents : int 0 0 0 0 1 3 3 0 0 0 ...
#> $ outcome : int 0 1 0 0 1 0 0 1 0 1 ...
Glimpse of the dataset
head(car_insurance)
#> id age gender race driving_experience education income credit_score
#> 1 569520 3 0 1 0 2 3 0.6290273
#> 2 750365 0 1 1 0 0 0 0.3577571
#> 3 199901 0 0 1 0 2 1 0.4931458
#> 4 478866 0 1 1 0 3 1 0.2060129
#> 5 731664 1 1 1 1 0 1 0.3883659
#> 6 877557 2 0 1 2 2 3 0.6191274
#> vehicle_ownership vehicle_year married children postal_code annual_mileage
#> 1 1 1 0 1 10238 12000
#> 2 0 0 0 0 10238 16000
#> 3 1 0 0 0 10238 11000
#> 4 1 0 0 1 32765 11000
#> 5 1 0 0 0 32765 12000
#> 6 1 1 0 1 10238 13000
#> vehicle_type speeding_violations duis past_accidents outcome
#> 1 0 0 0 0 0
#> 2 0 0 0 0 1
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 2 0 1 1
#> 6 0 3 0 3 0
Look at missing data
colSums(is.na(car_insurance))
#> id age gender race
#> 0 0 0 0
#> driving_experience education income credit_score
#> 0 0 0 982
#> vehicle_ownership vehicle_year married children
#> 0 0 0 0
#> postal_code annual_mileage vehicle_type speeding_violations
#> 0 957 0 0
#> duis past_accidents outcome
#> 0 0 0
aggr(car_insurance, numbers = TRUE, prop = FALSE)
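The naniar package loaded in the setup gives another quick view of the same missingness (an optional alternative to VIM's aggr):

# missing-value counts per variable, plotted
gg_miss_var(car_insurance)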
Data Pre-processing
- For this data I will replace missing values with the mean, for illustration
- I will also select a few variables for illustration
car_insurance_new <- car_insurance %>%
  mutate(credit_score = replace_na(credit_score, mean(credit_score, na.rm = TRUE)),
         annual_mileage = as.numeric(annual_mileage),
         annual_mileage = replace_na(annual_mileage, mean(annual_mileage, na.rm = TRUE))) %>%
  select(married, vehicle_type, vehicle_year, outcome, speeding_violations, duis, gender)
## check for missingness
anyNA(car_insurance_new)
#> [1] FALSE
## look at the data
head(car_insurance) %>%
relocate(annual_mileage,credit_score)
#> annual_mileage credit_score id age gender race driving_experience
#> 1 12000 0.6290273 569520 3 0 1 0
#> 2 16000 0.3577571 750365 0 1 1 0
#> 3 11000 0.4931458 199901 0 0 1 0
#> 4 11000 0.2060129 478866 0 1 1 0
#> 5 12000 0.3883659 731664 1 1 1 1
#> 6 13000 0.6191274 877557 2 0 1 2
#> education income vehicle_ownership vehicle_year married children postal_code
#> 1 2 3 1 1 0 1 10238
#> 2 0 0 0 0 0 0 10238
#> 3 2 1 1 0 0 0 10238
#> 4 3 1 1 0 0 1 32765
#> 5 0 1 1 0 0 0 32765
#> 6 2 3 1 1 0 1 10238
#> vehicle_type speeding_violations duis past_accidents outcome
#> 1 0 0 0 0 0
#> 2 0 0 0 0 1
#> 3 0 0 0 0 0
#> 4 0 0 0 0 0
#> 5 0 2 0 1 1
#> 6 0 3 0 3 0
Build the model
car_insurance_new$outcome <- factor(car_insurance_new$outcome)
A Bayesian logistic regression model
A Bayesian logistic regression model can be estimated using the stan_glm function. Here we’ll use a Student t prior with 7 degrees of freedom and a scale of 2.5, which is a reasonable default prior when coefficients should be close to zero but have some chance of being large.
The formula, data and family arguments to stan_glm are specified in exactly the same way as for glm.
A frequentist model would look like this:
frequentist_model <- glm(outcome ~ married + vehicle_type + vehicle_year + speeding_violations + duis + gender,
                         data = car_insurance_new,
                         family = binomial(link = "logit"))
summary(frequentist_model)
#>
#> Call:
#> glm(formula = outcome ~ married + vehicle_type + vehicle_year +
#>     speeding_violations + duis + gender, family = binomial(link = "logit"),
#> data = car_insurance_new)
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.11140 0.04255 2.618 0.00883 **
#> married -0.85085 0.05142 -16.547 < 2e-16 ***
#> vehicle_type -0.01553 0.11570 -0.134 0.89320
#> vehicle_year -1.68338 0.06787 -24.802 < 2e-16 ***
#> speeding_violations -0.48407 0.02138 -22.644 < 2e-16 ***
#> duis -0.64878 0.06979 -9.296 < 2e-16 ***
#> gender 0.90562 0.05113 17.713 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 12434.3 on 9999 degrees of freedom
#> Residual deviance: 9613.6 on 9993 degrees of freedom
#> AIC: 9627.6
#>
#> Number of Fisher Scoring iterations: 5
A Bayesian model uses stan_glm:
# set the seed for reproducibility
SEED <- 14124869

## set priors
t_prior <- student_t(df = 7, location = 0, scale = 2.5)

## build the model
post1 <- stan_glm(outcome ~ married + vehicle_type + vehicle_year + speeding_violations + duis + gender,
                  data = car_insurance_new,
                  family = binomial(link = "logit"),
                  prior = t_prior, prior_intercept = t_prior, QR = TRUE,
                  seed = SEED, refresh = 0)
stan_glm returns the posterior distribution of the parameters, which describes the uncertainty about the unknown parameter values:
summary(post1)
#>
#> Model Info:
#> function: stan_glm
#> family: binomial [logit]
#> formula: outcome ~ married + vehicle_type + vehicle_year + speeding_violations +
#> duis + gender
#> algorithm: sampling
#> sample: 4000 (posterior sample size)
#> priors: see help('prior_summary')
#> observations: 10000
#> predictors: 7
#>
#> Estimates:
#> mean sd 10% 50% 90%
#> (Intercept) 0.1 0.0 0.1 0.1 0.2
#> married -0.9 0.1 -0.9 -0.8 -0.8
#> vehicle_type 0.0 0.1 -0.2 0.0 0.1
#> vehicle_year -1.7 0.1 -1.8 -1.7 -1.6
#> speeding_violations -0.5 0.0 -0.5 -0.5 -0.5
#> duis -0.6 0.1 -0.7 -0.6 -0.6
#> gender 0.9 0.1 0.8 0.9 1.0
#>
#> Fit Diagnostics:
#> mean sd 10% 50% 90%
#> mean_PPD 0.3 0.0 0.3 0.3 0.3
#>
#> The mean_ppd is the sample average posterior predictive distribution of the outcome variable (for details see help('summary.stanreg')).
#>
#> MCMC diagnostics
#> mcse Rhat n_eff
#> (Intercept) 0.0 1.0 5779
#> married 0.0 1.0 5232
#> vehicle_type 0.0 1.0 5505
#> vehicle_year 0.0 1.0 4511
#> speeding_violations 0.0 1.0 3416
#> duis 0.0 1.0 5062
#> gender 0.0 1.0 4092
#> mean_PPD 0.0 1.0 4885
#> log-posterior 0.0 1.0 1715
#>
#> For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
All Rhat values are below 1.1, so the MCMC chains converged.
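If you prefer to inspect this visually rather than read it off the table, the same plot method used below for effective sample sizes also accepts "rhat" (a quick optional check):

plot(post1, "rhat")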
prior_summary(post1)
#> Priors for model 'post1'
#> ------
#> Intercept (after predictors centered)
#> ~ student_t(df = 7, location = 0, scale = 2.5)
#>
#> Coefficients (in Q-space)
#> ~ student_t(df = [7,7,7,...], location = [0,0,0,...], scale = [2.5,2.5,2.5,...])
#> ------
#> See help('prior_summary.stanreg') for more details
out_new <- as.array(post1)
bayesplot::mcmc_trace(out_new, pars = c("married", "vehicle_type", "vehicle_year",
                                        "speeding_violations", "duis", "gender"))
plot(post1,"neff")
- The plot above shows the effective sample size relative to the total number of posterior draws for each parameter.
plot(post1,"acf_bar")
pplot <- plot(post1, "areas", prob = 0.95, prob_outer = 1)
pplot + geom_vline(xintercept = 0)
We can extract the corresponding posterior median estimates using the coef function, and to get a sense of the uncertainty in our estimates we can use the posterior_interval function to get Bayesian uncertainty intervals. The uncertainty intervals are computed by finding the relevant quantiles of the draws from the posterior distribution. For example, to compute the median and 90% intervals we use:
round(coef(post1), 2)
#> (Intercept) married vehicle_type vehicle_year
#> 0.11 -0.85 -0.01 -1.68
#> speeding_violations duis gender
#> -0.48 -0.65 0.90
round(posterior_interval(post1, prob = 0.9), 2)
#> 5% 95%
#> (Intercept) 0.04 0.18
#> married -0.93 -0.77
#> vehicle_type -0.21 0.18
#> vehicle_year -1.80 -1.57
#> speeding_violations -0.52 -0.45
#> duis -0.76 -0.53
#> gender 0.82 0.99
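Because these estimates are on the log-odds scale, exponentiating them yields odds ratios, linking back to the odds interpretation in the background section; for example, exp(0.90) ≈ 2.46 for gender, roughly two-and-a-half times the odds of a claim, holding the other predictors fixed:

# posterior medians and 90% intervals as odds ratios
round(exp(coef(post1)), 2)
round(exp(posterior_interval(post1, prob = 0.9)), 2)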
Leave-one-out cross-validation
rstanarm supports the loo package, which implements fast Pareto-smoothed importance sampling leave-one-out cross-validation (PSIS-LOO).
(loo1 <- loo(post1, save_psis = TRUE))
#>
#> Computed from 4000 by 10000 log-likelihood matrix
#>
#> Estimate SE
#> elpd_loo -4814.2 54.9
#> p_loo 7.9 0.3
#> looic 9628.5 109.9
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#>
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.
Above we see that the PSIS-LOO result is reliable, as all Pareto \(k\) estimates are small (\(k < 0.5\)).
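Had any \(k\) values been large, the loo object could also be inspected graphically; the loo package's plot method shows the Pareto k estimate for every observation:

# Pareto k diagnostics; points above 0.5 (or worse, 0.7) would flag unreliable observations
plot(loo1, label_points = TRUE)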
Comparison to a baseline model
Compute baseline result without covariates.
post0 <- update(post1, formula = outcome ~ 1, QR = FALSE, refresh = 0)
Compare to baseline
(loo0 <- loo(post0))
#>
#> Computed from 4000 by 10000 log-likelihood matrix
#>
#> Estimate SE
#> elpd_loo -6218.2 36.4
#> p_loo 1.1 0.0
#> looic 12436.5 72.8
#> ------
#> Monte Carlo SE of elpd_loo is 0.0.
#>
#> All Pareto k estimates are good (k < 0.5).
#> See help('pareto-k-diagnostic') for details.
loo_compare(loo0, loo1)
#> elpd_diff se_diff
#> post1 0.0 0.0
#> post0 -1404.0 49.5
The elpd difference of about 1404 is many times larger than its standard error, so the covariates clearly contain useful information for prediction.
Distribution of claim probability
pp_check(post1,"stat_2d")
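Beyond this overall posterior predictive check, we can also draw the posterior distribution of the claim probability for a specific client profile using posterior_epred (the client values below are made up purely for illustration):

# posterior draws of the claim probability for one hypothetical client
new_client <- data.frame(married = 1, vehicle_type = 0, vehicle_year = 1,
                         speeding_violations = 0, duis = 0, gender = 0)
claim_prob <- posterior_epred(post1, newdata = new_client)
quantile(claim_prob, probs = c(0.05, 0.5, 0.95))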
To be continued; references will follow in the coming article.