Now, we will learn how to use GLMs to fit a model for claim frequency. We will be working with the same dataset as last session, so I will start by loading the file:

setwd("~/Dropbox/UDLAP/Cursos/2022 Primavera/Tema Selecto/Presentaciones")
data<-read.csv("freMTPLfreq.csv")

Just to recall the variables we have:

names(data)
##  [1] "X"         "PolicyID"  "ClaimNb"   "Exposure"  "Power"     "CarAge"   
##  [7] "DriverAge" "Brand"     "Gas"       "Region"    "Density"
head(data)
##   X PolicyID ClaimNb Exposure Power CarAge DriverAge
## 1 1        1       0     0.09     g      0        46
## 2 2        2       0     0.84     g      0        46
## 3 3        3       0     0.52     f      2        38
## 4 4        4       0     0.45     f      2        38
## 5 5        5       0     0.15     g      0        41
## 6 6        6       0     0.75     g      0        41
##                                Brand     Gas             Region Density
## 1 Japanese (except Nissan) or Korean  Diesel          Aquitaine      76
## 2 Japanese (except Nissan) or Korean  Diesel          Aquitaine      76
## 3 Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais    3003
## 4 Japanese (except Nissan) or Korean Regular Nord-Pas-de-Calais    3003
## 5 Japanese (except Nissan) or Korean  Diesel   Pays-de-la-Loire      60
## 6 Japanese (except Nissan) or Korean  Diesel   Pays-de-la-Loire      60

Last session we fitted a logistic model to analyse the probability that a policy presents at least one claim. Now, we are interested in the number of claims.

For this, we will use the variable ClaimNb. Let us first explore it:

table(data$ClaimNb)
## 
##      0      1      2      3      4 
## 397779  14633    726     28      3
prop.table(table(data$ClaimNb))
## 
##            0            1            2            3            4 
## 9.627513e-01 3.541650e-02 1.757150e-03 6.776888e-05 7.260951e-06

This is a count variable…

We can start with an annualized claims frequency.

Assume that claim occurrence, for an insured, is driven by a homogeneous Poisson process with intensity \(\lambda\), so that the claim-occurrence process has independent increments and the number of claims observed during the time interval \([t, t+h]\) has a Poisson distribution of the form \(P(\lambda \cdot h)\).

This model will allow us to link the annual frequency with the observed number of claims over a given period of exposure. For policyholder \(i\), observed during a fraction \(E_i\) of the year, we know that the observed count \(Y_i\) follows \(P(\lambda \cdot E_i)\).

We are dealing with censored data, as we are not able to observe every contract over a full year. Without any explanatory variables, the average annualized frequency and its empirical variance are:

\[ m_N=\frac{\sum_{i=1}^{n}Y_i}{\sum_{i=1}^{n}E_i} \]

\[ S_N^2=\frac{\sum_{i=1}^{n}[Y_i-m_N\cdot E_i]^2}{\sum_{i=1}^{n}E_i} \]

vY<- data$ClaimNb            # observed number of claims
vE<- data$Exposure           # exposure (fraction of the year observed)
m<- sum(vY)/sum(vE)          # average annualized frequency
v<- sum((vY-m*vE)^2)/sum(vE) # empirical variance

We can display the information as:

cat("Average=",m,"variance=",v,"phi=",v/m,"\n")
## Average= 0.06979859 variance= 0.07396742 phi= 1.059727

Where the value of \(\phi\) is such that:

\[ S_N^2= \phi \cdot m_N \]

For a Poisson distribution, the value of \(\phi\) should be 1.

Here, we should decide whether this value of \(\phi\) is close enough to 1 to keep the Poisson assumption, or too far from 1 to continue with a Poisson distribution.
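One way to check this a bit more formally (a sketch of my own, not in the original notes) is to fit an intercept-only quasi-Poisson model with the exposure offset; the estimated dispersion parameter plays the role of \(\phi\):

# Rough overdispersion check: intercept-only quasi-Poisson fit with exposure offset
m0 <- glm(ClaimNb ~ 1 + offset(log(Exposure)),
          family = quasipoisson, data = data)
summary(m0)$dispersion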

We can include the effect of some covariates and see whether they have an impact on the plausibility of the Poisson assumption:

Let's use the region of the driver as the covariate \(X\):

vX<-as.factor(data$Region)           # covariate X: region of the driver
for (i in 1:length(levels(vX))) {
  vEi <- vE[vX==levels(vX)[i]]       # exposures for region i
  vYi <- vY[vX==levels(vX)[i]]       # claim counts for region i
  mi <- sum(vYi)/sum(vEi)            # average annualized frequency
  vi <- sum((vYi-mi*vEi)^2)/sum(vEi) # empirical variance
  cat("Average=",mi,"Variance=",vi,"phi=",vi/mi,"\n")
}
## Average= 0.07366204 Variance= 0.08064593 phi= 1.09481 
## Average= 0.06788926 Variance= 0.07152207 phi= 1.053511 
## Average= 0.06741501 Variance= 0.07016521 phi= 1.040795 
## Average= 0.06303979 Variance= 0.06483763 phi= 1.028519 
## Average= 0.06923285 Variance= 0.07388641 phi= 1.067216 
## Average= 0.08577016 Variance= 0.09526731 phi= 1.110728 
## Average= 0.08222055 Variance= 0.08952784 phi= 1.088874 
## Average= 0.08210142 Variance= 0.09201134 phi= 1.120703 
## Average= 0.07185182 Variance= 0.07590899 phi= 1.056466 
## Average= 0.0716627 Variance= 0.07559456 phi= 1.054866
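If we want to see which region each row corresponds to, a small variation of the loop (my own sketch, not in the original code) collects the same statistics in a data frame together with the region labels:

# Same per-region statistics, kept in a data frame with the region label
stats_by_region <- do.call(rbind, lapply(levels(vX), function(r) {
  Ei <- vE[vX == r]; Yi <- vY[vX == r]
  mi <- sum(Yi) / sum(Ei)
  vi <- sum((Yi - mi * Ei)^2) / sum(Ei)
  data.frame(Region = r, Average = mi, Variance = vi, phi = vi / mi)
}))
stats_by_region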

Again, for every level \(x\) of \(X\) we have to find \(\phi\) such that:

\[ S_{N,x}^2= \phi \cdot m_{N,x} \]

\(\textbf{Poisson Regression}\)

Assume that \(N_i\) is the annual claims frequency for insured \(i\), and assume that \(N_i \sim P(\lambda)\). At this point all insureds have the same \(\lambda\). If insured \(i\) was observed during a period of time \(E_i\) (the exposure), then the number of claims is: \[ Y_i \sim P(\lambda \cdot E_i) \]

In order to estimate the value of \(\lambda\) we can use maximum likelihood techniques so that:

\[ \mathcal{L}(\lambda, Y, E)= \prod_{i=1}^{n} \frac{e^{-\lambda E_i}[\lambda E_i]^{Y_i}}{Y_i!} \] Resulting in the log-likelihood:

\[ \log \mathcal{L}(\lambda, Y, E)= -\lambda \sum_{i=1}^n E_i + \sum_{i=1}^n Y_i \log [\lambda E_i] - \log \bigg(\prod_{i=1}^n Y_i! \bigg) \]

If we solve the first-order condition we have:

\[ \frac{\partial }{\partial \lambda } \log \mathcal{L}( \lambda,Y,E) \bigg\rvert_{\lambda = \hat{\lambda}} = - \sum_{i=1}^n E_i + \frac{1}{\hat{\lambda}} \sum_{i=1}^n Y_i=0 \] Resulting in:

\[ \hat{\lambda} = \frac{\sum_{i=1}^n Y_i}{\sum_{i=1}^n E_i} = \sum_{i=1}^n w_i \frac{Y_i}{E_i} \] This is a weighted average, where the weight \(w_i\) is: \[ w_i= \frac{E_i}{\sum_{i=1}^n E_i} \]
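As a quick sanity check of this result (a simulation sketch of my own, not part of the data analysis), we can generate counts from \(P(\lambda \cdot E_i)\) with a known \(\lambda\) and verify that the exposure-weighted average recovers it:

# Simulation sketch: known lambda, random exposures, check the estimator
set.seed(1)                          # arbitrary seed, for reproducibility
lambda_true <- 0.07
E_sim <- runif(1e5, 0.05, 1)         # simulated exposures between 0.05 and 1 year
Y_sim <- rpois(length(E_sim), lambda_true * E_sim)
sum(Y_sim) / sum(E_sim)              # should be close to lambda_true

Applying the estimator to the actual portfolio: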

Y<- data$ClaimNb
E<- data$Exposure
(lambda<-sum(Y)/sum(E))
## [1] 0.06979859
weighted.mean(Y/E,E)
## [1] 0.06979859
dpois(0:3,lambda)*100
## [1] 93.258163240  6.509288286  0.227169572  0.005285372

If we remember from the observed data:

prop.table(table(data$ClaimNb))
## 
##            0            1            2            3            4 
## 9.627513e-01 3.541650e-02 1.757150e-03 6.776888e-05 7.260951e-06
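To compare the two directly (a small sketch, not in the original notes), we can bind the observed proportions and the homogeneous-Poisson probabilities for 0 to 4 claims, both in percent:

# Observed proportions vs homogeneous-Poisson probabilities, in percent
obs <- prop.table(table(data$ClaimNb)) * 100
fit <- dpois(0:4, lambda) * 100
round(rbind(observed = obs, poisson = fit), 4)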

We mentioned before that a valid assumption is that the value of \(\lambda\) depends on each policyholder, and that each \(\lambda_i\) is a function of some covariates.

Then, we can assume that \(N_i \sim P(\lambda_i)\) where \(N\) is an annualized claim frequency. However, since we cannot observe the real value for \(N_i\), we have to use the number of claims \(Y_i\) during the exposure period, \(E_i\).

We can then write \(N_i \sim P(\lambda_i)\) in terms of the observed counts as \(Y_i \sim P(E_i \cdot \lambda_i)\) and, with a log link function:

\[ Y_i \sim P \bigg( e^{X_i' \beta + \log E_i}\bigg) \]

All of this was to justify the use of a GLM to estimate the values of the parameters \(\beta\) in the model.

In our GLM call, we have to indicate some new things: the Poisson family and the exposure offset:

model1<- glm(ClaimNb~Gas+DriverAge+Density+offset(log(Exposure)),
             family=poisson,data=data)
summary(model1)
## 
## Call:
## glm(formula = ClaimNb ~ Gas + DriverAge + Density + offset(log(Exposure)), 
##     family = poisson, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6868  -0.3469  -0.2687  -0.1511   6.4723  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.144e+00  2.706e-02 -79.244   <2e-16 ***
## GasRegular  -1.350e-01  1.592e-02  -8.479   <2e-16 ***
## DriverAge   -1.069e-02  5.623e-04 -19.014   <2e-16 ***
## Density      2.179e-05  1.478e-06  14.740   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 105613  on 413168  degrees of freedom
## Residual deviance: 104969  on 413165  degrees of freedom
## AIC: 136234
## 
## Number of Fisher Scoring iterations: 6
exp(coef(model1))
## (Intercept)  GasRegular   DriverAge     Density 
##   0.1171671   0.8737493   0.9893658   1.0000218
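With model1 fitted, we can also predict the expected number of claims for a hypothetical policyholder; the profile below (regular gas, age 40, density 100, one full year of exposure) is made up just for illustration:

# Expected claim count for a hypothetical profile with one full year of exposure
new_policy <- data.frame(Gas = "Regular", DriverAge = 40,
                         Density = 100, Exposure = 1)
predict(model1, newdata = new_policy, type = "response")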

We can change our model: let's create groups for age and density. The cut() function is the option I told you makes the creation of groups easier:

data$ageg<- cut(data$DriverAge, c(17,22,26,42,74,Inf))
table(data$DriverAge, data$ageg)
##     
##      (17,22] (22,26] (26,42] (42,74] (74,Inf]
##   18     499       0       0       0        0
##   19    1614       0       0       0        0
##   20    2535       0       0       0        0
##   21    3046       0       0       0        0
##   22    3556       0       0       0        0
##   23       0    4168       0       0        0
##   24       0    4870       0       0        0
##   25       0    5575       0       0        0
##   26       0    6496       0       0        0
##   27       0       0    7434       0        0
##   28       0       0    8345       0        0
##   29       0       0    9127       0        0
##   30       0       0    9675       0        0
##   31       0       0    9846       0        0
##   32       0       0   10303       0        0
##   33       0       0   10320       0        0
##   34       0       0   10496       0        0
##   35       0       0   10272       0        0
##   36       0       0   10415       0        0
##   37       0       0   10308       0        0
##   38       0       0   10292       0        0
##   39       0       0   10457       0        0
##   40       0       0   10344       0        0
##   41       0       0   10307       0        0
##   42       0       0   10301       0        0
##   43       0       0       0   10100        0
##   44       0       0       0    9803        0
##   45       0       0       0   10114        0
##   46       0       0       0    9789        0
##   47       0       0       0    9695        0
##   48       0       0       0    9799        0
##   49       0       0       0   10055        0
##   50       0       0       0   10363        0
##   51       0       0       0   10476        0
##   52       0       0       0   10417        0
##   53       0       0       0    9878        0
##   54       0       0       0    9304        0
##   55       0       0       0    8776        0
##   56       0       0       0    7939        0
##   57       0       0       0    7146        0
##   58       0       0       0    6277        0
##   59       0       0       0    5386        0
##   60       0       0       0    4983        0
##   61       0       0       0    4630        0
##   62       0       0       0    4381        0
##   63       0       0       0    3940        0
##   64       0       0       0    3847        0
##   65       0       0       0    3646        0
##   66       0       0       0    3560        0
##   67       0       0       0    3477        0
##   68       0       0       0    3378        0
##   69       0       0       0    3118        0
##   70       0       0       0    3031        0
##   71       0       0       0    2949        0
##   72       0       0       0    2774        0
##   73       0       0       0    2560        0
##   74       0       0       0    2291        0
##   75       0       0       0       0     2139
##   76       0       0       0       0     2030
##   77       0       0       0       0     1756
##   78       0       0       0       0     1634
##   79       0       0       0       0     1446
##   80       0       0       0       0     1290
##   81       0       0       0       0     1074
##   82       0       0       0       0      825
##   83       0       0       0       0      620
##   84       0       0       0       0      431
##   85       0       0       0       0      274
##   86       0       0       0       0      236
##   87       0       0       0       0      203
##   88       0       0       0       0      159
##   89       0       0       0       0      140
##   90       0       0       0       0      122
##   91       0       0       0       0       89
##   92       0       0       0       0       47
##   93       0       0       0       0       40
##   94       0       0       0       0       27
##   95       0       0       0       0       21
##   96       0       0       0       0       12
##   97       0       0       0       0        9
##   98       0       0       0       0        3
##   99       0       0       0       0       59

And now for density:

data$pdens<- cut(data$Density,c(0,40,200,500,4500,Inf))
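Before fitting, we can quickly check how many policies fall in each density band:

# Number of policies per density band
table(data$pdens)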

Now, we can fit our model:

model2<- glm(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),
             family=poisson,data=data)
summary(model2)
## 
## Call:
## glm(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)), 
##     family = poisson, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7655  -0.3385  -0.2669  -0.1488   6.5202  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.86471    0.04047 -46.079  < 2e-16 ***
## GasRegular         -0.20598    0.01603 -12.846  < 2e-16 ***
## ageg(22,26]        -0.61606    0.04608 -13.370  < 2e-16 ***
## ageg(26,42]        -1.07967    0.03640 -29.657  < 2e-16 ***
## ageg(42,74]        -1.07765    0.03549 -30.362  < 2e-16 ***
## ageg(74,Inf]       -1.10706    0.05188 -21.338  < 2e-16 ***
## pdens(40,200]       0.18473    0.02675   6.905 5.02e-12 ***
## pdens(200,500]      0.31822    0.02966  10.730  < 2e-16 ***
## pdens(500,4.5e+03]  0.52694    0.02593  20.320  < 2e-16 ***
## pdens(4.5e+03,Inf]  0.63717    0.03482  18.300  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 105613  on 413168  degrees of freedom
## Residual deviance: 103986  on 413159  degrees of freedom
## AIC: 135263
## 
## Number of Fisher Scoring iterations: 6
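As with model1, we could exponentiate the coefficients to read them as multiplicative relativities with respect to the reference levels (diesel, ages (17,22], density (0,40]):

# Multiplicative effects relative to the reference levels
exp(coef(model2))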

We have some other models for count data, right? Let's try a negative binomial model; for this, we will use the MASS library:

library(MASS)
model3<- glm.nb(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),data=data)
summary(model3)
## 
## Call:
## glm.nb(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)), 
##     data = data, init.theta = 0.8385090906, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.7169  -0.3336  -0.2650  -0.1491   5.7938  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.83838    0.04242 -43.337  < 2e-16 ***
## GasRegular         -0.20619    0.01653 -12.473  < 2e-16 ***
## ageg(22,26]        -0.63135    0.04836 -13.055  < 2e-16 ***
## ageg(26,42]        -1.10114    0.03840 -28.672  < 2e-16 ***
## ageg(42,74]        -1.09898    0.03749 -29.311  < 2e-16 ***
## ageg(74,Inf]       -1.13169    0.05410 -20.920  < 2e-16 ***
## pdens(40,200]       0.18541    0.02742   6.762 1.36e-11 ***
## pdens(200,500]      0.31951    0.03045  10.492  < 2e-16 ***
## pdens(500,4.5e+03]  0.52892    0.02663  19.864  < 2e-16 ***
## pdens(4.5e+03,Inf]  0.63686    0.03585  17.764  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.8385) family taken to be 1)
## 
##     Null deviance: 91137  on 413168  degrees of freedom
## Residual deviance: 89592  on 413159  degrees of freedom
## AIC: 134881
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.8385 
##           Std. Err.:  0.0559 
## 
##  2 x log-likelihood:  -134859.4560

Which model is better?
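One way to answer this (a quick sketch, not in the original notes) is to compare the two fits by AIC and log-likelihood, where a lower AIC is better:

# Compare the Poisson and negative binomial fits
AIC(model2, model3)
logLik(model2); logLik(model3)

From the summaries above, the negative binomial fit has the lower AIC (134881 vs 135263).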

Now, before we finish for today, we can fit zero-inflated models:

library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
model4<-zeroinfl(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),
                 data=data, dist="poisson")
summary(model4)
## 
## Call:
## zeroinfl(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)), 
##     data = data, dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4038 -0.2219 -0.1941 -0.1383 67.4705 
## 
## Count model coefficients (poisson with log link):
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.09995    0.09753 -11.279  < 2e-16 ***
## GasRegular          0.02114    0.04240   0.499 0.618086    
## ageg(22,26]        -0.44937    0.10448  -4.301 1.70e-05 ***
## ageg(26,42]        -1.05816    0.08046 -13.151  < 2e-16 ***
## ageg(42,74]        -1.00859    0.07751 -13.013  < 2e-16 ***
## ageg(74,Inf]       -0.91876    0.13848  -6.635 3.26e-11 ***
## pdens(40,200]       0.16363    0.07699   2.125 0.033553 *  
## pdens(200,500]      0.17926    0.08366   2.143 0.032144 *  
## pdens(500,4.5e+03]  0.26492    0.07314   3.622 0.000292 ***
## pdens(4.5e+03,Inf]  0.26437    0.09547   2.769 0.005622 ** 
## 
## Zero-inflation model coefficients (binomial with logit link):
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.60691    0.17750   3.419 0.000628 ***
## GasRegular          0.40995    0.07850   5.223 1.76e-07 ***
## ageg(22,26]         0.25650    0.19020   1.349 0.177469    
## ageg(26,42]        -0.15600    0.15310  -1.019 0.308221    
## ageg(42,74]        -0.15037    0.14570  -1.032 0.302049    
## ageg(74,Inf]       -0.05926    0.23098  -0.257 0.797530    
## pdens(40,200]      -0.01134    0.12909  -0.088 0.930017    
## pdens(200,500]     -0.21401    0.14476  -1.478 0.139306    
## pdens(500,4.5e+03] -0.37515    0.12780  -2.936 0.003330 ** 
## pdens(4.5e+03,Inf] -0.46035    0.19109  -2.409 0.015993 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 39 
## Log-likelihood: -6.684e+04 on 20 Df
model5<-zeroinfl(ClaimNb~Gas+ageg+pdens+offset(log(Exposure)),
                 data=data, dist="negbin")
## Warning in sqrt(diag(vc)[np]): NaNs produced
summary(model5)
## Warning in sqrt(diag(object$vcov)): NaNs produced
## 
## Call:
## zeroinfl(formula = ClaimNb ~ Gas + ageg + pdens + offset(log(Exposure)), 
##     data = data, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3907 -0.2217 -0.1948 -0.1366 67.3276 
## 
## Count model coefficients (negbin with log link):
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.03359    0.09049 -11.422  < 2e-16 ***
## GasRegular         -0.07266    0.03791  -1.917 0.055298 .  
## ageg(22,26]        -0.46980    0.10493  -4.477 7.56e-06 ***
## ageg(26,42]        -1.09593    0.08036 -13.637  < 2e-16 ***
## ageg(42,74]        -1.03733    0.07801 -13.298  < 2e-16 ***
## ageg(74,Inf]       -1.91380    0.08132 -23.535  < 2e-16 ***
## pdens(40,200]       0.14175    0.06451   2.197 0.027996 *  
## pdens(200,500]      0.18748    0.07182   2.610 0.009043 ** 
## pdens(500,4.5e+03]  0.27487    0.06311   4.356 1.33e-05 ***
## pdens(4.5e+03,Inf]  0.32496    0.08663   3.751 0.000176 ***
## Log(theta)         24.97945         NA      NA       NA    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.75141    0.16581   4.532 5.85e-06 ***
## GasRegular           0.23607    0.07223   3.268  0.00108 ** 
## ageg(22,26]          0.21518    0.18933   1.137  0.25573    
## ageg(26,42]         -0.23054    0.15177  -1.519  0.12876    
## ageg(42,74]         -0.20767    0.14523  -1.430  0.15272    
## ageg(74,Inf]       -13.89530         NA      NA       NA    
## pdens(40,200]       -0.05579    0.11228  -0.497  0.61924    
## pdens(200,500]      -0.20946    0.12844  -1.631  0.10292    
## pdens(500,4.5e+03]  -0.37633    0.11406  -3.299  0.00097 ***
## pdens(4.5e+03,Inf]  -0.35202    0.17557  -2.005  0.04496 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 70540368654.6891 
## Number of iterations in BFGS optimization: 82 
## Log-likelihood: -6.688e+04 on 21 Df
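To close, a comparison sketch across the four models with the grouped covariates (my addition): AIC can be computed for the zero-inflated fits as well, since pscl provides a log-likelihood method for them. Keep in mind the warnings for model5, whose zero-inflation part produced NaN standard errors:

# Compare the four models with grouped covariates (lower AIC is better)
AIC(model2, model3, model4, model5)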