Now, we will learn how to use GLMs to fit a model for claim frequency. We will be working with the same dataset from last session, so I will start by loading the file:
setwd("~/Dropbox/UDLAP/Cursos/2022 Primavera/Tema Selecto/Presentaciones")
data<-read.csv("freMTPLfreq.csv")
As a reminder, these are the variables we had:
names(data)
## [1] "X" "PolicyID" "ClaimNb" "Exposure" "Power" "CarAge"
## [7] "DriverAge" "Brand" "Gas" "Region" "Density"
head(data)
## X PolicyID ClaimNb Exposure Power CarAge DriverAge
## 1 1 1 0 0.09 g 0 46
## 2 2 2 0 0.84 g 0 46
## 3 3 3 0 0.52 f 2 38
## 4 4 4 0 0.45 f 2 38
## 5 5 5 0 0.15 g 0 41
## 6 6 6 0 0.75 g 0 41
## Brand Gas Region Density
## 1 Japanese (except Nissan) or Korean Diesel Aquitaine 76
## 2 Japanese (except Nissan) or Korean Diesel Aquitaine 76
## 3 Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais 3003
## 4 Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais 3003
## 5 Japanese (except Nissan) or Korean Diesel Pays-de-la-Loire 60
## 6 Japanese (except Nissan) or Korean Diesel Pays-de-la-Loire 60
Last session we fit a logistic model to analyse the likelihood of a policy presenting at least one claim. Now, we are interested in the number of claims.
For this, we will use the variable ClaimNb. Let's start by exploring this variable:
table(data$ClaimNb)
##
## 0 1 2 3 4
## 397779 14633 726 28 3
prop.table(table(data$ClaimNb))
##
## 0 1 2 3 4
## 9.627513e-01 3.541650e-02 1.757150e-03 6.776888e-05 7.260951e-06
This is a count variable: the vast majority of policies present zero claims, and very few present more than one.
We can start with an annualized claims frequency.
Assume that claim occurrence, for an insured, is driven by a homogeneous Poisson process with intensity \(\lambda\), so that the claims process has independent increments and the number of claims observed during the time interval \([t, t+h]\) follows a Poisson distribution \(P(\lambda \cdot h)\).
This model will allow us to link the annual claim frequency with the frequency observed over a given period of exposure: for policyholder \(i\) with exposure \(E_i\), the observed number of claims is \(Y_i \sim P(\lambda \cdot E_i)\).
We are dealing with censored data, since we are not able to observe every contract over a full year. Without any explanatory variables, the average annualized frequency and its empirical variance are:
\[ m_N=\frac{\sum_{i=1}^{n}Y_i}{\sum_{i=1}^{n}E_i} \]
\[ S_N^2=\frac{\sum_{i=1}^{n}[Y_i-m_N\cdot E_i]^2}{\sum_{i=1}^{n}E_i} \]
vY<- data$ClaimNb   # number of claims per policy
vE<- data$Exposure  # exposure, in years
m<- sum(vY)/sum(vE)            # annualized mean frequency
v<- sum((vY-m*vE)^2)/sum(vE)   # empirical variance
We can display the information as:
cat("Average=",m,"variance=",v,"phi=",v/m,"\n")
## Average= 0.06979859 variance= 0.07396742 phi= 1.059727
Where the value of \(\phi\) is such that:
\[ S_N^2= \phi \cdot m_N \]
For a Poisson distribution, the value of \(\phi\) should be 1.
Here, we should decide whether this value of \(\phi\) is close enough to 1 to keep working with a Poisson distribution, or whether it is too far from 1 to do so.
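To get a feel for how far \(\phi\) can drift from 1 by pure chance, here is a minimal sketch: assuming the data really were Poisson with the same exposures and intensity m, we simulate claim counts and recompute \(\phi\) a few times.
# Sketch: simulate Poisson counts with the observed exposures and intensity m,
# then recompute phi, to see its variability when the Poisson model is true.
set.seed(1)  # arbitrary seed, for reproducibility
phi_sim <- replicate(20, {
  y_sim <- rpois(length(vE), lambda = m*vE)  # simulated claim counts
  m_sim <- sum(y_sim)/sum(vE)                # simulated mean frequency
  v_sim <- sum((y_sim-m_sim*vE)^2)/sum(vE)   # simulated variance
  v_sim/m_sim
})
summary(phi_sim)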
We can include the effect of some covariates and see whether they have an impact on the plausibility of the Poisson assumption.
Let's use as covariate \(X\) the region of the driver:
vX<-as.factor(data$Region)
# Compute the mean, variance and phi for each region, taken in the
# order given by levels(vX):
for (i in 1:length(levels(vX))) {
  vEi <- vE[vX==levels(vX)[i]]  # exposures in region i
  vYi <- vY[vX==levels(vX)[i]]  # claim counts in region i
  mi <- sum(vYi)/sum(vEi)
  vi <- sum((vYi-mi*vEi)^2)/sum(vEi)
  cat("Average=",mi,"Variance=",vi,"phi=",vi/mi,"\n")
}
## Average= 0.07366204 Variance= 0.08064593 phi= 1.09481
## Average= 0.06788926 Variance= 0.07152207 phi= 1.053511
## Average= 0.06741501 Variance= 0.07016521 phi= 1.040795
## Average= 0.06303979 Variance= 0.06483763 phi= 1.028519
## Average= 0.06923285 Variance= 0.07388641 phi= 1.067216
## Average= 0.08577016 Variance= 0.09526731 phi= 1.110728
## Average= 0.08222055 Variance= 0.08952784 phi= 1.088874
## Average= 0.08210142 Variance= 0.09201134 phi= 1.120703
## Average= 0.07185182 Variance= 0.07590899 phi= 1.056466
## Average= 0.0716627 Variance= 0.07559456 phi= 1.054866
Again, for every value of \(X\) we have to find \(\phi\) such that:
\[ S_{N,x}^2= \phi \cdot m_{N,x} \]
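As a side note, the same per-region statistics can be computed without the explicit loop; here is a sketch using tapply() (the names m_r and v_r are mine):
m_r<- tapply(vY,vX,sum)/tapply(vE,vX,sum)                 # mean per region
v_r<- tapply((vY-m_r[vX]*vE)^2,vX,sum)/tapply(vE,vX,sum)  # variance per region
round(cbind(Average=m_r,Variance=v_r,phi=v_r/m_r),5)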
\(\textbf{Poisson Regression}\)
Assume that \(N_i\) is the annual claims frequency for insured \(i\), and assume that \(N_i \sim P(\lambda)\). At this point, all insureds have the same \(\lambda\). If insured \(i\) was observed during a period of time \(E_i\) (the exposure), then the number of claims is: \[ Y_i \sim P(\lambda \cdot E_i) \]
In order to estimate the value of \(\lambda\) we can use maximum likelihood techniques so that:
\[ \mathcal{L}(\lambda, Y, E)= \prod_{i=1}^{n} \frac{e^{-\lambda E_i}[\lambda E_i]^{Y_i}}{Y_i!} \] Taking logarithms, the log-likelihood is:
\[ \log \mathcal{L}(\lambda, Y, E)= -\lambda \sum_{i=1}^n E_i + \sum_{i=1}^n Y_i \log [\lambda E_i] - \log \bigg(\prod_{i=1}^n Y_i! \bigg) \]
If we solve the first-order condition we have:
\[ \frac{\partial }{\partial \lambda } \log \mathcal{L}( \lambda,Y,E) \bigg\rvert_{\lambda = \hat{\lambda}} = - \sum_{i=1}^n E_i + \frac{1}{\hat{\lambda}} \sum_{i=1}^n Y_i=0 \] Resulting in:
\[ \hat{\lambda} = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i} = \sum_{i=1}^n w_i \frac{Y_i}{E_i} \] That is, a weighted average, where the weight \(w_i\) is: \[ w_i= \frac{E_i}{\sum_{i=1}^n E_i} \]
Y<- data$ClaimNb
E<- data$Exposure
(lambda<-sum(Y)/sum(E))
## [1] 0.06979859
weighted.mean(Y/E,E)
## [1] 0.06979859
dpois(0:3,lambda)*100 # fitted probabilities (in %) of 0, 1, 2 and 3 claims in a year
## [1] 93.258163240 6.509288286 0.227169572 0.005285372
If we compare with the observed relative frequencies:
prop.table(table(data$ClaimNb))
##
## 0 1 2 3 4
## 9.627513e-01 3.541650e-02 1.757150e-03 6.776888e-05 7.260951e-06
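Note that dpois(0:3,lambda)*100 gives probabilities for a full year of exposure, while most policies are observed for less than a year; this is why the observed proportion of zeros is higher than the fitted 93.26%. A quick sketch that accounts for each policy's exposure, using \(Y_i \sim P(\lambda \cdot E_i)\):
mean(dpois(0,lambda*E))  # expected proportion of policies with zero claims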
We mentioned before that a valid assumption is that the value of \(\lambda\) depends on each policyholder, and that each \(\lambda_i\) is a function of some covariates.
Then, we can assume that \(N_i \sim P(\lambda_i)\), where \(N_i\) is the annualized claim frequency. However, since we cannot observe the real value of \(N_i\), we have to use the number of claims \(Y_i\) observed during the exposure period \(E_i\).
The assumption \(N_i \sim P(\lambda_i)\) can then be written as \(Y_i \sim P(E_i \cdot \lambda_i)\), and with a log link function:
\[ Y_i \sim P \bigg( e^{X_i' \beta + \log E_i}\bigg) \]
All of this was to justify the use of a GLM to estimate the values of the parameters \(\beta\) in the model.
In our GLM, we have to indicate two new things: the Poisson family and the exposure as an offset:
model1<- glm(ClaimNb~Gas+DriverAge+Density+offset(log(Exposure)),
family=poisson,data=data)
summary(model1)
##
## Call:
## glm(formula = ClaimNb ~ Gas + DriverAge + Density + offset(log(Exposure)),
## family = poisson, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.6868 -0.3469 -0.2687 -0.1511 6.4723
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.144e+00 2.706e-02 -79.244 <2e-16 ***
## GasRegular -1.350e-01 1.592e-02 -8.479 <2e-16 ***
## DriverAge -1.069e-02 5.623e-04 -19.014 <2e-16 ***
## Density 2.179e-05 1.478e-06 14.740 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 105613 on 413168 degrees of freedom
## Residual deviance: 104969 on 413165 degrees of freedom
## AIC: 136234
##
## Number of Fisher Scoring iterations: 6
exp(coef(model1))
## (Intercept) GasRegular DriverAge Density
## 0.1171671 0.8737493 0.9893658 1.0000218
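We can also use the fitted model to predict the expected annual frequency for a given profile; a sketch (the profile below is illustrative, not taken from the text):
newpol<- data.frame(Gas="Diesel", DriverAge=40, Density=100, Exposure=1)
predict(model1, newdata=newpol, type="response")  # expected claims per year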
We can change our model: let's create groups for age and density with cut(). This is the option I told you makes the creation of groups easier:
data$ageg<- cut(data$DriverAge, c(17,22,26,42,74,Inf))
table(data$DriverAge, data$ageg)
##
## (17,22] (22,26] (26,42] (42,74] (74,Inf]
## 18 499 0 0 0 0
## 19 1614 0 0 0 0
## 20 2535 0 0 0 0
## 21 3046 0 0 0 0
## 22 3556 0 0 0 0
## 23 0 4168 0 0 0
## 24 0 4870 0 0 0
## 25 0 5575 0 0 0
## 26 0 6496 0 0 0
## 27 0 0 7434 0 0
## 28 0 0 8345 0 0
## 29 0 0 9127 0 0
## 30 0 0 9675 0 0
## 31 0 0 9846 0 0
## 32 0 0 10303 0 0
## 33 0 0 10320 0 0
## 34 0 0 10496 0 0
## 35 0 0 10272 0 0
## 36 0 0 10415 0 0
## 37 0 0 10308 0 0
## 38 0 0 10292 0 0
## 39 0 0 10457 0 0
## 40 0 0 10344 0 0
## 41 0 0 10307 0 0
## 42 0 0 10301 0 0
## 43 0 0 0 10100 0
## 44 0 0 0 9803 0
## 45 0 0 0 10114 0
## 46 0 0 0 9789 0
## 47 0 0 0 9695 0
## 48 0 0 0 9799 0
## 49 0 0 0 10055 0
## 50 0 0 0 10363 0
## 51 0 0 0 10476 0
## 52 0 0 0 10417 0
## 53 0 0 0 9878 0
## 54 0 0 0 9304 0
## 55 0 0 0 8776 0
## 56 0 0 0 7939 0
## 57 0 0 0 7146 0
## 58 0 0 0 6277 0
## 59 0 0 0 5386 0
## 60 0 0 0 4983 0
## 61 0 0 0 4630 0
## 62 0 0 0 4381 0
## 63 0 0 0 3940 0
## 64 0 0 0 3847 0
## 65 0 0 0 3646 0
## 66 0 0 0 3560 0
## 67 0 0 0 3477 0
## 68 0 0 0 3378 0
## 69 0 0 0 3118 0
## 70 0 0 0 3031 0
## 71 0 0 0 2949 0
## 72 0 0 0 2774 0
## 73 0 0 0 2560 0
## 74 0 0 0 2291 0
## 75 0 0 0 0 2139
## 76 0 0 0 0 2030
## 77 0 0 0 0 1756
## 78 0 0 0 0 1634
## 79 0 0 0 0 1446
## 80 0 0 0 0 1290
## 81 0 0 0 0 1074
## 82 0 0 0 0 825
## 83 0 0 0 0 620
## 84 0 0 0 0 431
## 85 0 0 0 0 274
## 86 0 0 0 0 236
## 87 0 0 0 0 203
## 88 0 0 0 0 159
## 89 0 0 0 0 140
## 90 0 0 0 0 122
## 91 0 0 0 0 89
## 92 0 0 0 0 47
## 93 0 0 0 0 40
## 94 0 0 0 0 27
## 95 0 0 0 0 21
## 96 0 0 0 0 12
## 97 0 0 0 0 9
## 98 0 0 0 0 3
## 99 0 0 0 0 59
And now for density:
data$pdens<- cut(data$Density,c(0,40,200,500,4500,Inf))
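As with age, we can quickly check the size of each density group:
table(data$pdens)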
Now, we can fit our model:
model2<- glm(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),
family=poisson,data=data)
summary(model2)
##
## Call:
## glm(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)),
## family = poisson, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7655 -0.3385 -0.2669 -0.1488 6.5202
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.86471 0.04047 -46.079 < 2e-16 ***
## GasRegular -0.20598 0.01603 -12.846 < 2e-16 ***
## ageg(22,26] -0.61606 0.04608 -13.370 < 2e-16 ***
## ageg(26,42] -1.07967 0.03640 -29.657 < 2e-16 ***
## ageg(42,74] -1.07765 0.03549 -30.362 < 2e-16 ***
## ageg(74,Inf] -1.10706 0.05188 -21.338 < 2e-16 ***
## pdens(40,200] 0.18473 0.02675 6.905 5.02e-12 ***
## pdens(200,500] 0.31822 0.02966 10.730 < 2e-16 ***
## pdens(500,4.5e+03] 0.52694 0.02593 20.320 < 2e-16 ***
## pdens(4.5e+03,Inf] 0.63717 0.03482 18.300 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 105613 on 413168 degrees of freedom
## Residual deviance: 103986 on 413159 degrees of freedom
## AIC: 135263
##
## Number of Fisher Scoring iterations: 6
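As before, the exponentiated coefficients act multiplicatively on the expected frequency, now relative to the baseline levels (Diesel, ageg (17,22] and pdens (0,40]):
exp(coef(model2))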
We have some other models for count data, right? Let's try a negative binomial model; for this, we will use the MASS library:
library(MASS)
model3<- glm.nb(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),data=data)
summary(model3)
##
## Call:
## glm.nb(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)),
## data = data, init.theta = 0.8385090906, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.7169 -0.3336 -0.2650 -0.1491 5.7938
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.83838 0.04242 -43.337 < 2e-16 ***
## GasRegular -0.20619 0.01653 -12.473 < 2e-16 ***
## ageg(22,26] -0.63135 0.04836 -13.055 < 2e-16 ***
## ageg(26,42] -1.10114 0.03840 -28.672 < 2e-16 ***
## ageg(42,74] -1.09898 0.03749 -29.311 < 2e-16 ***
## ageg(74,Inf] -1.13169 0.05410 -20.920 < 2e-16 ***
## pdens(40,200] 0.18541 0.02742 6.762 1.36e-11 ***
## pdens(200,500] 0.31951 0.03045 10.492 < 2e-16 ***
## pdens(500,4.5e+03] 0.52892 0.02663 19.864 < 2e-16 ***
## pdens(4.5e+03,Inf] 0.63686 0.03585 17.764 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.8385) family taken to be 1)
##
## Null deviance: 91137 on 413168 degrees of freedom
## Residual deviance: 89592 on 413159 degrees of freedom
## AIC: 134881
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.8385
## Std. Err.: 0.0559
##
## 2 x log-likelihood: -134859.4560
Which model is better?
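From the summaries above, the negative binomial model has the lower AIC (134881 against 135263 for the Poisson model), which suggests it handles the overdispersion better. We can confirm this directly:
AIC(model2, model3)  # lower is better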
Now, before we finish (for today), we can fit zero-inflated models:
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
model4<-zeroinfl(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),
data=data, dist="poisson")
summary(model4)
##
## Call:
## zeroinfl(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)),
## data = data, dist = "poisson")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.4038 -0.2219 -0.1941 -0.1383 67.4705
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.09995 0.09753 -11.279 < 2e-16 ***
## GasRegular 0.02114 0.04240 0.499 0.618086
## ageg(22,26] -0.44937 0.10448 -4.301 1.70e-05 ***
## ageg(26,42] -1.05816 0.08046 -13.151 < 2e-16 ***
## ageg(42,74] -1.00859 0.07751 -13.013 < 2e-16 ***
## ageg(74,Inf] -0.91876 0.13848 -6.635 3.26e-11 ***
## pdens(40,200] 0.16363 0.07699 2.125 0.033553 *
## pdens(200,500] 0.17926 0.08366 2.143 0.032144 *
## pdens(500,4.5e+03] 0.26492 0.07314 3.622 0.000292 ***
## pdens(4.5e+03,Inf] 0.26437 0.09547 2.769 0.005622 **
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.60691 0.17750 3.419 0.000628 ***
## GasRegular 0.40995 0.07850 5.223 1.76e-07 ***
## ageg(22,26] 0.25650 0.19020 1.349 0.177469
## ageg(26,42] -0.15600 0.15310 -1.019 0.308221
## ageg(42,74] -0.15037 0.14570 -1.032 0.302049
## ageg(74,Inf] -0.05926 0.23098 -0.257 0.797530
## pdens(40,200] -0.01134 0.12909 -0.088 0.930017
## pdens(200,500] -0.21401 0.14476 -1.478 0.139306
## pdens(500,4.5e+03] -0.37515 0.12780 -2.936 0.003330 **
## pdens(4.5e+03,Inf] -0.46035 0.19109 -2.409 0.015993 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 39
## Log-likelihood: -6.684e+04 on 20 Df
model5<-zeroinfl(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),
data=data, dist="negbin")
## Warning in sqrt(diag(vc)[np]): NaNs produced
summary(model5)
## Warning in sqrt(diag(object$vcov)): NaNs produced
##
## Call:
## zeroinfl(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)),
## data = data, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -0.3907 -0.2217 -0.1948 -0.1366 67.3276
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.03359 0.09049 -11.422 < 2e-16 ***
## GasRegular -0.07266 0.03791 -1.917 0.055298 .
## ageg(22,26] -0.46980 0.10493 -4.477 7.56e-06 ***
## ageg(26,42] -1.09593 0.08036 -13.637 < 2e-16 ***
## ageg(42,74] -1.03733 0.07801 -13.298 < 2e-16 ***
## ageg(74,Inf] -1.91380 0.08132 -23.535 < 2e-16 ***
## pdens(40,200] 0.14175 0.06451 2.197 0.027996 *
## pdens(200,500] 0.18748 0.07182 2.610 0.009043 **
## pdens(500,4.5e+03] 0.27487 0.06311 4.356 1.33e-05 ***
## pdens(4.5e+03,Inf] 0.32496 0.08663 3.751 0.000176 ***
## Log(theta) 24.97945 NA NA NA
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.75141 0.16581 4.532 5.85e-06 ***
## GasRegular 0.23607 0.07223 3.268 0.00108 **
## ageg(22,26] 0.21518 0.18933 1.137 0.25573
## ageg(26,42] -0.23054 0.15177 -1.519 0.12876
## ageg(42,74] -0.20767 0.14523 -1.430 0.15272
## ageg(74,Inf] -13.89530 NA NA NA
## pdens(40,200] -0.05579 0.11228 -0.497 0.61924
## pdens(200,500] -0.20946 0.12844 -1.631 0.10292
## pdens(500,4.5e+03] -0.37633 0.11406 -3.299 0.00097 ***
## pdens(4.5e+03,Inf] -0.35202 0.17557 -2.005 0.04496 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 70540368654.6891
## Number of iterations in BFGS optimization: 82
## Log-likelihood: -6.688e+04 on 21 Df
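Before closing, note the NaN warnings and the huge estimate of \(\theta\) in model5: the negative binomial count component has essentially degenerated to a Poisson, and the ageg(74,Inf] coefficient in the zero-inflation part could not be estimated reliably, so this last fit should be treated with caution. To put the four models side by side, here is a sketch (AIC() works because logLik() methods exist for these objects, and pscl also provides vuong() for comparing, for example, the Poisson GLM against the zero-inflated Poisson):
AIC(model2, model3, model4, model5)  # lower is better
vuong(model2, model4)                # Vuong test: Poisson GLM vs ZIP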