1. Présentation du projet

Ce projet modélise la fréquence de sinistres sur un portefeuille auto français à l’aide d’un GLM Poisson avec offset d’exposition.

Les données proviennent du package CASdatasets (freMTPL2freq).

2. Chargement des données

library(CASdatasets)
library(ggplot2)

data(freMTPL2freq)
freMTPL2freq$ClaimNb <- as.numeric(freMTPL2freq$ClaimNb)
freq <- freMTPL2freq[freMTPL2freq$VehAge <= 30 & freMTPL2freq$DrivAge <= 90, ]

Le dataset contient 676498 contrats après nettoyage.

3. Statistiques descriptives

summary(freq[, c("ClaimNb", "Exposure", "BonusMalus", "DrivAge", "VehAge")])
##     ClaimNb            Exposure          BonusMalus        DrivAge     
##  Min.   : 0.00000   Min.   :0.002732   Min.   : 50.00   Min.   :18.00  
##  1st Qu.: 0.00000   1st Qu.:0.170000   1st Qu.: 50.00   1st Qu.:34.00  
##  Median : 0.00000   Median :0.490000   Median : 50.00   Median :44.00  
##  Mean   : 0.05329   Mean   :0.528399   Mean   : 59.77   Mean   :45.47  
##  3rd Qu.: 0.00000   3rd Qu.:0.990000   3rd Qu.: 64.00   3rd Qu.:55.00  
##  Max.   :16.00000   Max.   :2.010000   Max.   :230.00   Max.   :90.00  
##      VehAge      
##  Min.   : 0.000  
##  1st Qu.: 2.000  
##  Median : 6.000  
##  Mean   : 6.989  
##  3rd Qu.:11.000  
##  Max.   :30.000

Fréquence brute du portefeuille :

round(sum(freq$ClaimNb) / sum(freq$Exposure), 4)
## [1] 0.1008

4. Modélisation GLM Poisson

modele1 <- glm(ClaimNb ~ VehPower + VehAge + DrivAge + BonusMalus + VehGas + Area,
               offset = log(Exposure),
               family = poisson(link = "log"),
               data = freq)

summary(modele1)$coefficients
##                   Estimate   Std. Error    z value     Pr(>|z|)
## (Intercept)   -3.930470223 0.0395861235 -99.289091 0.000000e+00
## VehPower       0.018549613 0.0026061415   7.117654 1.097795e-12
## VehAge        -0.043247642 0.0010635270 -40.664356 0.000000e+00
## DrivAge        0.006737379 0.0003920559  17.184740 3.454917e-66
## BonusMalus     0.022624649 0.0003062542  73.875395 0.000000e+00
## VehGasRegular  0.064019089 0.0108296505   5.911464 3.390804e-09
## AreaB          0.043914814 0.0215003586   2.042515 4.110043e-02
## AreaC          0.075914123 0.0173688115   4.370715 1.238404e-05
## AreaD          0.159504151 0.0179520750   8.884998 6.392189e-19
## AreaE          0.206060644 0.0183812695  11.210360 3.627161e-29
## AreaF          0.196575169 0.0334508461   5.876538 4.189352e-09

5. Prédictions

freq$prediction <- predict(modele1, type = "response")
freq_filtre <- freq[freq$Exposure >= 0.5, ]
freq_filtre$freq_obs <- freq_filtre$ClaimNb / freq_filtre$Exposure
freq_filtre$freq_pred <- freq_filtre$prediction / freq_filtre$Exposure

plot(freq_filtre$freq_pred, freq_filtre$freq_obs,
     xlab = "Fréquence prédite",
     ylab = "Fréquence observée",
     main = "Prédictions vs Observations (Exposure >= 0.5)",
     pch = 16, col = rgb(0, 0, 1, 0.2))
abline(0, 1, col = "red")

## 6. Modèle de sévérité - GLM Gamma

La sévérité représente le coût moyen par sinistre. On utilise un GLM Gamma adapté aux distributions positives et asymétriques.

data(freMTPL2sev)

# Jointure avec le dataset fréquence
sev <- merge(freMTPL2sev, freq, by = "IDpol")

# Filtrage des sinistres extrêmes (99ème percentile)
seuil <- quantile(sev$ClaimAmount, 0.99)
sev_filtre <- sev[sev$ClaimAmount <= seuil, ]

# GLM Gamma
modele_sev <- glm(ClaimAmount ~ VehPower + VehAge + DrivAge + BonusMalus + VehGas + Area,
                  family = Gamma(link = "log"),
                  data = sev_filtre)

summary(modele_sev)$coefficients
##                   Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept)    7.031371096 0.0550270723 127.7802144 0.000000e+00
## VehPower       0.010516019 0.0035628626   2.9515647 3.164491e-03
## VehAge        -0.011031082 0.0014514647  -7.5999659 3.061365e-14
## DrivAge        0.001175863 0.0005563333   2.1135949 3.455933e-02
## BonusMalus     0.002254950 0.0004085475   5.5194327 3.433240e-08
## VehGasRegular -0.007762375 0.0146405985  -0.5301952 5.959811e-01
## AreaB          0.029236349 0.0303260167   0.9640682 3.350206e-01
## AreaC          0.030308918 0.0244375444   1.2402604 2.148902e-01
## AreaD          0.066943877 0.0248892855   2.6896665 7.156888e-03
## AreaE          0.070564733 0.0252615096   2.7933696 5.220000e-03
## AreaF         -0.121386405 0.0468898277  -2.5887577 9.637619e-03

7. Prime pure

La prime pure est le produit de la fréquence prédite par la sévérité prédite. Elle représente le coût espéré du risque par véhicule-année.

freq$sev_pred <- predict(modele_sev, newdata = freq, type = "response")
freq$prime_pure <- (freq$prediction / freq$Exposure) * freq$sev_pred

summary(freq$prime_pure)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    24.28   101.59   137.56   156.06   179.15 18016.19
hist(freq$prime_pure[freq$prime_pure < 500],
     breaks = 50,
     main = "Distribution de la prime pure (< 500€)",
     xlab = "Prime pure (€)",
     col = "steelblue")

8. Prime pure par tranche de Bonus-Malus

bm_tranches <- cut(freq$BonusMalus, breaks = seq(50, 230, by = 20))
prime_bm <- aggregate(prime_pure ~ bm_tranches, data = freq, FUN = mean)

barplot(prime_bm$prime_pure,
        names.arg = prime_bm$bm_tranches,
        main = "Prime pure moyenne par tranche de Bonus-Malus",
        xlab = "Bonus-Malus",
        ylab = "Prime pure moyenne (€)",
        col = "steelblue",
        las = 2)

La prime pure croît fortement avec le Bonus-Malus, confirmant que le modèle capte bien la segmentation du risque individuel.