Ce projet modélise la fréquence de sinistres sur un portefeuille auto français à l’aide d’un GLM Poisson avec offset d’exposition.
Les données proviennent du package CASdatasets
(freMTPL2freq).
library(CASdatasets)
library(ggplot2)
data(freMTPL2freq)
freMTPL2freq$ClaimNb <- as.numeric(freMTPL2freq$ClaimNb)
freq <- freMTPL2freq[freMTPL2freq$VehAge <= 30 & freMTPL2freq$DrivAge <= 90, ]
Le dataset contient 676498 contrats après nettoyage.
summary(freq[, c("ClaimNb", "Exposure", "BonusMalus", "DrivAge", "VehAge")])
## ClaimNb Exposure BonusMalus DrivAge
## Min. : 0.00000 Min. :0.002732 Min. : 50.00 Min. :18.00
## 1st Qu.: 0.00000 1st Qu.:0.170000 1st Qu.: 50.00 1st Qu.:34.00
## Median : 0.00000 Median :0.490000 Median : 50.00 Median :44.00
## Mean : 0.05329 Mean :0.528399 Mean : 59.77 Mean :45.47
## 3rd Qu.: 0.00000 3rd Qu.:0.990000 3rd Qu.: 64.00 3rd Qu.:55.00
## Max. :16.00000 Max. :2.010000 Max. :230.00 Max. :90.00
## VehAge
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 6.000
## Mean : 6.989
## 3rd Qu.:11.000
## Max. :30.000
Fréquence brute du portefeuille :
round(sum(freq$ClaimNb) / sum(freq$Exposure), 4)
## [1] 0.1008
modele1 <- glm(ClaimNb ~ VehPower + VehAge + DrivAge + BonusMalus + VehGas + Area,
offset = log(Exposure),
family = poisson(link = "log"),
data = freq)
summary(modele1)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.930470223 0.0395861235 -99.289091 0.000000e+00
## VehPower 0.018549613 0.0026061415 7.117654 1.097795e-12
## VehAge -0.043247642 0.0010635270 -40.664356 0.000000e+00
## DrivAge 0.006737379 0.0003920559 17.184740 3.454917e-66
## BonusMalus 0.022624649 0.0003062542 73.875395 0.000000e+00
## VehGasRegular 0.064019089 0.0108296505 5.911464 3.390804e-09
## AreaB 0.043914814 0.0215003586 2.042515 4.110043e-02
## AreaC 0.075914123 0.0173688115 4.370715 1.238404e-05
## AreaD 0.159504151 0.0179520750 8.884998 6.392189e-19
## AreaE 0.206060644 0.0183812695 11.210360 3.627161e-29
## AreaF 0.196575169 0.0334508461 5.876538 4.189352e-09
freq$prediction <- predict(modele1, type = "response")
freq_filtre <- freq[freq$Exposure >= 0.5, ]
freq_filtre$freq_obs <- freq_filtre$ClaimNb / freq_filtre$Exposure
freq_filtre$freq_pred <- freq_filtre$prediction / freq_filtre$Exposure
plot(freq_filtre$freq_pred, freq_filtre$freq_obs,
xlab = "Fréquence prédite",
ylab = "Fréquence observée",
main = "Prédictions vs Observations (Exposure >= 0.5)",
pch = 16, col = rgb(0, 0, 1, 0.2))
abline(0, 1, col = "red")
## 6. Modèle de sévérité - GLM Gamma
La sévérité représente le coût moyen par sinistre. On utilise un GLM Gamma adapté aux distributions positives et asymétriques.
data(freMTPL2sev)
# Jointure avec le dataset fréquence
sev <- merge(freMTPL2sev, freq, by = "IDpol")
# Filtrage des sinistres extrêmes (99ème percentile)
seuil <- quantile(sev$ClaimAmount, 0.99)
sev_filtre <- sev[sev$ClaimAmount <= seuil, ]
# GLM Gamma
modele_sev <- glm(ClaimAmount ~ VehPower + VehAge + DrivAge + BonusMalus + VehGas + Area,
family = Gamma(link = "log"),
data = sev_filtre)
summary(modele_sev)$coefficients
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.031371096 0.0550270723 127.7802144 0.000000e+00
## VehPower 0.010516019 0.0035628626 2.9515647 3.164491e-03
## VehAge -0.011031082 0.0014514647 -7.5999659 3.061365e-14
## DrivAge 0.001175863 0.0005563333 2.1135949 3.455933e-02
## BonusMalus 0.002254950 0.0004085475 5.5194327 3.433240e-08
## VehGasRegular -0.007762375 0.0146405985 -0.5301952 5.959811e-01
## AreaB 0.029236349 0.0303260167 0.9640682 3.350206e-01
## AreaC 0.030308918 0.0244375444 1.2402604 2.148902e-01
## AreaD 0.066943877 0.0248892855 2.6896665 7.156888e-03
## AreaE 0.070564733 0.0252615096 2.7933696 5.220000e-03
## AreaF -0.121386405 0.0468898277 -2.5887577 9.637619e-03
La prime pure est le produit de la fréquence prédite par la sévérité prédite. Elle représente le coût espéré du risque par véhicule-année.
freq$sev_pred <- predict(modele_sev, newdata = freq, type = "response")
freq$prime_pure <- (freq$prediction / freq$Exposure) * freq$sev_pred
summary(freq$prime_pure)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24.28 101.59 137.56 156.06 179.15 18016.19
hist(freq$prime_pure[freq$prime_pure < 500],
breaks = 50,
main = "Distribution de la prime pure (< 500€)",
xlab = "Prime pure (€)",
col = "steelblue")
bm_tranches <- cut(freq$BonusMalus, breaks = seq(50, 230, by = 20))
prime_bm <- aggregate(prime_pure ~ bm_tranches, data = freq, FUN = mean)
barplot(prime_bm$prime_pure,
names.arg = prime_bm$bm_tranches,
main = "Prime pure moyenne par tranche de Bonus-Malus",
xlab = "Bonus-Malus",
ylab = "Prime pure moyenne (€)",
col = "steelblue",
las = 2)
La prime pure croît fortement avec le Bonus-Malus, confirmant que le modèle capte bien la segmentation du risque individuel.