Modeling pure premium using generalized linear models (GLM) is common in non-life insurance. GLMs are statistical models that allow the modeling of relationships between independent variables (rating factors) and a dependent variable (claim frequency or claim cost).
How can we calculate the pure premium using Generalized Linear Models (GLM) in the context of non-life insurance?
The project aims to achieve the following:
Data Processing: Cleanse the data by addressing irregularities such as missing values and outliers.
Selection of Independent Variables: Identify relevant variables for the pricing process.
Choice of Probability Distributions: Select appropriate probability distributions to model claim frequency and claim cost.
Selection of Link Functions: Choose suitable link functions to connect independent variables to pure premium.
Construction of GLM Models: Utilize a statistical software of your choice to build the models.
Model Training and Parameter Estimation: Train the models to estimate coefficients.
Model Evaluation: Assess the quality of the models using performance indicators like the coefficient of determination (\(R^2\)), Akaike Information Criterion (AIC), or other performance measures.
Selection of Relevant Models: Choose the most suitable models.
Pricing of a New Insurance Policy: Apply the models to price a new insurance policy.
Final Report: Prepare a report summarizing the process.
The following R packages are required for this
study:
This study uses two datasets, Contracts and Claims, described as follows:
In this section, we explore the critical process of variable selection, a pivotal step in refining our dataset for modeling purposes. The careful consideration and identification of relevant variables lay the foundation for constructing robust models in our insurance analysis.
Variables of the Contracts dataset:
- `nocontrat` (Number of Contract): This variable represents a unique identifier assigned to each insurance contract.
- `zone`: The zone variable indicates the geographical area or zone associated with the insurance contract.
- `puissance` (Power): Refers to the power or engine capacity of the insured vehicle.
- `agevehicule` (Vehicle Age): Represents the age of the insured vehicle, indicating how long it has been in use (1=youngest, 4=oldest).
- `ageconducteur` (Driver Age): This variable represents the age of the driver associated with the insurance contract.
- `marque` (Brand): Marque refers to the brand or manufacturer of the insured vehicle.
- `carburant` (Fuel Type): Indicates the type of fuel used by the insured vehicle, such as gasoline or diesel.
- `densite` (Population Density): Represents the population density of the geographical area associated with the insurance contract.
- `region`: The region variable denotes the specific region or location associated with the insurance contract.
Variables of the Claims dataset:
- `no` (Number): The variable “no” is a unique identifier assigned to each claim, providing a distinct reference number.
- `nocontrat` (Number of Contract): This variable represents the unique identifier assigned to the insurance contract associated with the claim.
- `garantie` (Coverage): Garantie refers to the type or level of coverage associated with the claim, indicating the specific insurance coverage provided.
- `cout` (Cost): The variable “cout” represents the cost associated with the claim, indicating the financial amount attributed to the insurance claim.
After selecting the variables, we imported the following data into R:
## # A tibble: 6 × 9
## nocontrat zone puissance agevehicule ageconducteur marque carburant densite
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 4075803 E 6 3 48 12 E 11
## 2 4024277 D 4 2 32 12 E 31
## 3 4083980 B 4 1 74 12 E 26
## 4 4032035 B 5 1 33 12 D 53
## 5 4020011 E 4 1 50 12 D 54
## 6 4001738 E 5 9 37 5 E 31
## # ℹ 1 more variable: region <dbl>
## # A tibble: 6 × 9
## nocontrat zone puissance agevehicule ageconducteur marque carburant densite
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 4072339 D 8 2 39 12 D 82
## 2 4032828 D 7 1 38 12 E 22
## 3 4033173 A 11 1 49 12 D 82
## 4 4056926 C 8 0 58 12 D 26
## 5 4029746 E 4 1 30 12 E 11
## 6 4062808 D 4 0 42 12 E 93
## # ℹ 1 more variable: region <dbl>
## # A tibble: 6 × 4
## no nocontrat garantie cout
## <dbl> <dbl> <chr> <dbl>
## 1 1 4075803 2DO 513.
## 2 2 4075803 1RC 0
## 3 3 4024277 2DO 89.7
## 4 4 4024277 1RC 1851.
## 5 5 4083980 2DO 567.
## 6 6 4032035 4BG 512.
## # A tibble: 6 × 4
## no nocontrat garantie cout
## <dbl> <dbl> <chr> <dbl>
## 1 174 4072339 4BG 491.
## 2 175 4032828 4BG 750.
## 3 176 4033173 2DO 1357.
## 4 177 4056926 2DO 97.6
## 5 178 4029746 4BG 58.6
## 6 179 4062808 2DO 1330.
We delve into the distinctive data structures of both the Contracts and Claims datasets, providing a comprehensive overview of the variables and their significance in our analytical journey.
## [1] 138 9
## tibble [138 × 9] (S3: tbl_df/tbl/data.frame)
## $ nocontrat : num [1:138] 4075803 4024277 4083980 4032035 4020011 ...
## $ zone : chr [1:138] "E" "D" "B" "B" ...
## $ puissance : num [1:138] 6 4 4 5 4 5 4 4 5 6 ...
## $ agevehicule : num [1:138] 3 2 1 1 1 9 4 15 0 4 ...
## $ ageconducteur: num [1:138] 48 32 74 33 50 37 51 56 26 35 ...
## $ marque : num [1:138] 12 12 12 12 12 5 12 1 12 2 ...
## $ carburant : chr [1:138] "E" "E" "E" "D" ...
## $ densite : num [1:138] 11 31 26 53 54 31 11 11 22 31 ...
## $ region : num [1:138] 13 -1 13 0 13 11 12 -1 2 13 ...
The dataset “Contracts” has 138 rows and 9 columns, indicating that there are 138 observations (entries) and 9 variables in the dataset.
The variables in the dataset have different data types, including numerical (num) and character (chr) types. The variable `nocontrat` is of numerical type and seems to represent a unique identifier for each contract. The variable `zone` is of character type, indicating a categorical variable. Variables like `puissance`, `agevehicule`, `ageconducteur`, `marque`, `densite`, and `region` are of numerical type. The variable `carburant` is of character type, suggesting it may represent a categorical variable indicating the type of fuel used.
## [1] 179 4
## tibble [179 × 4] (S3: tbl_df/tbl/data.frame)
## $ no : num [1:179] 1 2 3 4 5 6 7 8 9 10 ...
## $ nocontrat: num [1:179] 4075803 4075803 4024277 4024277 4083980 ...
## $ garantie : chr [1:179] "2DO" "1RC" "2DO" "1RC" ...
## $ cout : num [1:179] 512.7 0 89.7 1851.1 566.8 ...
The dataset “Claims” has 179 rows and 4 columns, indicating that there are 179 observations (entries) and 4 variables in the dataset.
The variables have different data types. The variable `no` is of numerical type and seems to represent a unique identifier for each claim. The variable `nocontrat` is of numerical type, suggesting a linkage to the contract number in the “Contracts” dataset. The variable `garantie` is of character type, indicating a categorical variable representing the type of coverage for each claim. The variable `cout` is of numerical type, representing the cost associated with each claim.
We analyzed and processed missing values for both the Contracts and Claims datasets:
## named integer(0)
## named integer(0)
No missing values were found in either the “Contracts” or “Claims” datasets, indicating complete data integrity for the observed variables.
## nocontrat zone puissance agevehicule ageconducteur
## 0 0 0 0 0
## marque carburant densite region
## 0 0 0 0
## no nocontrat garantie cout
## 0 0 0 0
For the “Contracts” dataset, there are no missing values in any of the columns (`nocontrat`, `zone`, `puissance`, `agevehicule`, `ageconducteur`, `marque`, `carburant`, `densite`, `region`). Similarly, for the “Claims” dataset, there are no missing values in any of the columns (`no`, `nocontrat`, `garantie`, `cout`).
## character(0)
## character(0)
For the “Contracts” dataset, there are no column names with missing values, as indicated by the empty character vector (`character(0)`). Similarly, for the “Claims” dataset, there are no column names with missing values, as indicated by the empty character vector (`character(0)`).
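The missing-value checks described above can be sketched with base R. The data frame `toy` below is made up for illustration (it deliberately contains one `NA`), and this is not necessarily the code that produced the output shown in this report.

```r
# Toy data frame standing in for the Claims data (illustrative values only)
toy <- data.frame(
  no        = 1:4,
  nocontrat = c(4075803, 4075803, 4024277, 4024277),
  cout      = c(512.74, 0, NA, 1851.11)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Names of the columns that contain at least one missing value
cols_with_na <- names(na_counts[na_counts > 0])
print(cols_with_na)
```

On a complete dataset, `cols_with_na` is an empty character vector, which corresponds to the `character(0)` output reported above.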
Box plots represent statistical data visually. They show variations in the data distribution and general trends. We exclude the contract number, zone, brand, and fuel type in the Contracts data, and the variables `no`, `nocontrat`, and `garantie` in the Claims data, because they are identifiers or categorical codes and are not subject to outlier analysis, along with other variables that carry no information for the analysis. We can now check each remaining variable for outliers using boxplots:
According to the boxplots, the variables `power`, `agevehicle`, and `cost` exhibit values beyond the whiskers, indicating the presence of outliers. The outliers can also be identified in code as follows:
We will identify outliers for each variable in each data set:
## [1] 14
## [1] 9 15 15 10 10 10 10 9 10 10 9
## [1] 2735.14 4474.51 16129.91 2756.51 6539.97 19042.19 4158.45 3444.60
## [9] 35193.25 3733.68 14289.32 5434.82
- Variable `power`: One outlier observed with a value of 14.
- Variable `agevehicle`: Multiple outliers observed with values 9, 15, 15, 10, 10, 10, 10, 9, 10, 10, and 9.
- Variable `cost`: Several outliers observed with values 2735.14, 4474.51, 16129.91, 2756.51, 6539.97, 19042.19, 4158.45, 3444.60, 35193.25, 3733.68, 14289.32, and 5434.82.
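As a minimal sketch of how values beyond the boxplot whiskers can be listed, the example below applies Tukey's 1.5 × IQR rule by hand and compares it with `boxplot.stats()` from base R. The vector is invented for illustration, and the code is an assumption about the approach, not the original chunk.

```r
# Illustrative vehicle power values with one extreme observation
puissance <- c(4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 14)

# Tukey's rule: flag values more than 1.5 * IQR outside the quartiles
q     <- quantile(puissance, probs = c(0.25, 0.75), names = FALSE)
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers_manual <- puissance[puissance < lower | puissance > upper]

# boxplot.stats() returns the points that a boxplot draws beyond the whiskers
outliers_boxplot <- boxplot.stats(puissance)$out

print(outliers_manual)
print(outliers_boxplot)
```

`boxplot.stats()` uses hinges rather than `quantile()`'s default quartiles, so the two approaches can disagree slightly on small samples; here they flag the same value.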
In this subsection, we delve into the critical process of outlier processing through linear interpolation, a method aimed at mitigating the impact of outliers on our data analysis. This approach involves filling in missing values by estimating intermediate points using linear interpolation, ensuring a more robust and representative dataset for further exploration.
We will repeat the treatment of outliers using linear interpolation
for the puissance and agevehicule variables.
We’ll check to be sure whether the outliers still exist.
When performing linear interpolation on the
cout variable
to address outliers, we identified missing values. Consequently, we
opted to replace these missing values with the median. Following this,
we conducted linear interpolation once again to address any remaining
outliers.
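The treatment described here can be sketched in base R as follows: outliers are first set to `NA`, then filled by linear interpolation with `approx()`. The data are invented, and the choice of `approx()` with `rule = 2` is an assumption; the original analysis may have used a dedicated package instead.

```r
# Illustrative vehicle ages; 15 lies beyond the boxplot whiskers
agevehicule <- c(1, 2, 1, 3, 15, 2, 4, 1)

# Step 1: mark the outliers as missing
x <- agevehicule
x[x %in% boxplot.stats(x)$out] <- NA

# Step 2: fill the gaps by linear interpolation between known neighbours.
# rule = 2 carries the nearest known value into leading or trailing gaps.
idx      <- seq_along(x)
known    <- !is.na(x)
x_filled <- approx(idx[known], x[known], xout = idx, rule = 2)$y

print(x_filled)
```

Without `rule = 2`, an outlier at the first or last position has no neighbour on one side and stays `NA` after interpolation, which is one way missing values like those mentioned above can arise.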
This code defines a winsorize function, which replaces values below a
lower limit or above an upper limit with those limits. It then applies
Winsorizing to the cout variable based on the 5th and 95th
percentiles.
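A minimal sketch of such a function, following the description above. The helper signature and the sample costs are assumptions made for illustration.

```r
# Hypothetical helper: clamp values to the interval [lower, upper]
winsorize <- function(x, lower, upper) {
  pmin(pmax(x, lower), upper)
}

# Illustrative claim costs with one extreme value
cout <- c(0, 89.7, 512.5, 512.7, 566.8, 1851.1, 35193.25)

# Limits taken from the 5th and 95th percentiles of the data
limits <- quantile(cout, probs = c(0.05, 0.95), names = FALSE)
cout_w <- winsorize(cout, limits[1], limits[2])

print(range(cout_w))
```

Unlike trimming, Winsorizing keeps every observation and only caps the extreme values, so the number of rows is unchanged.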
Check again for outliers:
## numeric(0)
## numeric(0)
## numeric(0)
Check again for missing values:
## character(0)
## named integer(0)
## no nocontrat garantie cout
## 0 0 0 0
Observations:
- For the `Contracts` dataset, the variable `puissance` does not have any identified outliers or missing values, as indicated by the empty numeric vector (`numeric(0)`).
- Similarly, for the variable `agevehicule` in the `Contracts` dataset, no outliers or missing values are detected, as shown by the empty numeric vector (`numeric(0)`).
- In the `Claims` dataset, the variable `cout` also exhibits no identified outliers or missing values, reflected by the empty numeric vector (`numeric(0)`).
After cleaning the data, we will select a subset of all the variables to continue the analysis. Nevertheless, the data cleaning performed in the previous section can be useful for further analysis of the whole data set.
## [1] 179 3
The first variable (no) in the Claims dataset
carries no information; it is just a counter, so we can remove it.
These lines of code convert specific columns in the “Claims” and “Contracts” datasets to factors, facilitating categorical data representation.
Now we merge the two datasets, Claims and Contracts, on their common column `nocontrat`:
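A sketch of this join with base R's `merge()`, using two small made-up data frames whose column names follow the report. The object name `merged` is hypothetical.

```r
# Small stand-ins for the two datasets (illustrative rows only)
claims <- data.frame(
  nocontrat = c(4075803, 4075803, 4024277),
  garantie  = c("2DO", "1RC", "2DO"),
  cout      = c(512.74, 0, 89.70)
)
contracts <- data.frame(
  nocontrat = c(4075803, 4024277, 4083980),
  zone      = c("E", "D", "B"),
  puissance = c(6, 4, 4)
)

# Inner join on the shared key: one row per claim, enriched with contract fields
merged <- merge(claims, contracts, by = "nocontrat")
print(merged)
```

With the default inner join, contracts without any claim are dropped; passing `all.y = TRUE` would keep them with `NA` in the claim columns.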
We then create a binary variable claimnb indicating
whether a contract has a cost (claimed) or not (not claimed),
as follows:
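One way to build this indicator, sketched on a made-up data frame. The rule "cost greater than zero" is an assumption that is consistent with the `claimnb` values shown later in this report.

```r
# Illustrative merged data: three claims, one of them with zero cost
merged <- data.frame(
  nocontrat = c(4075803, 4075803, 4024277),
  cout      = c(512.74, 0, 89.70)
)

# 1 if the claim carries a positive cost, 0 otherwise
merged$claimnb <- as.integer(merged$cout > 0)
print(merged)
```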
The first step in the exploratory analysis is to compute descriptive statistics that structure and summarize the information contained in the data.
## nocontrat garantie cout zone puissance
## Min. : 1545 1RC:40 Min. : 0.0 A:20 Min. : 4.000
## 1st Qu.:4014166 2DO:79 1st Qu.: 240.0 B:10 1st Qu.: 5.000
## Median :4036392 3VI: 1 Median : 598.0 C:36 Median : 6.000
## Mean :3747634 4BG:57 Mean : 683.8 D:47 Mean : 6.207
## 3rd Qu.:4056100 5CO: 2 3rd Qu.:1128.1 E:50 3rd Qu.: 7.000
## Max. :4086670 Max. :1901.0 F:16 Max. :10.000
## agevehicule ageconducteur marque carburant densite
## Min. :0.000 Min. :22.00 Min. : 1.0 D:90 Min. :11.00
## 1st Qu.:1.000 1st Qu.:34.00 1st Qu.:12.0 E:89 1st Qu.:11.00
## Median :1.000 Median :44.00 Median :12.0 Median :31.00
## Mean :1.877 Mean :44.76 Mean :10.7 Mean :47.25
## 3rd Qu.:3.000 3rd Qu.:55.00 3rd Qu.:12.0 3rd Qu.:82.00
## Max. :6.000 Max. :84.00 Max. :14.0 Max. :94.00
## region claimnb
## Min. :-1.000 Min. :0.000
## 1st Qu.: 6.000 1st Qu.:1.000
## Median :13.000 Median :1.000
## Mean : 9.637 Mean :0.905
## 3rd Qu.:13.000 3rd Qu.:1.000
## Max. :13.000 Max. :1.000
- `nocontrat`: The `nocontrat` variable represents unique identifiers for contracts. The values range from 1545 to 4086670, with a mean of 3747634. The distribution shows that contracts span a wide range of identifier values.
- `zone`: `zone` is a categorical variable indicating different geographical zones. The summary provides counts and unique values for each zone.
- `puissance`: `puissance` represents the power or engine capacity of insured vehicles. Values range from 4 to 10, with a mean of 6.207. The distribution indicates a variation in vehicle power across contracts.
- `agevehicule`: `agevehicule` denotes the age of insured vehicles. The minimum age is 0, likely indicating new vehicles, and the maximum is 6. The mean age is 1.877, suggesting a relatively young vehicle portfolio.
- `ageconducteur`: `ageconducteur` represents the age of the drivers associated with contracts. Driver ages vary from 22 to 84, with a mean of 44.76. The distribution suggests a diverse age range among drivers.
- `marque`: `marque` is a categorical variable indicating the brand or manufacturer of insured vehicles, stored here as a numeric code. The summary therefore reports codes ranging from 1 to 14 rather than counts for each brand.
- `carburant`: `carburant` represents the type of fuel used by insured vehicles. The summary provides counts and unique values for each fuel type.
- `densite`: `densite` indicates the population density of the geographical area associated with contracts. Population density values range from 11 to 94, with a mean of 47.25. The distribution reflects variability in the density of the regions.
- `region`: `region` represents specific regions associated with contracts. Region values range from -1 to 13, with a mean of 9.637. The distribution indicates contracts are spread across various regions.
- `garantie`: `garantie` is a categorical variable indicating different types of coverage for claims. The summary provides counts and unique values for each type of coverage.
- `cout`: `cout` represents the cost associated with each claim. Claim costs range from 0.0 to 1901.0, with a mean of 683.8. The distribution indicates variability in claim costs, with a significant range.
To get a first impression of the claims costs of the contracts, the important features and their interactions, we start our analysis with a small binary decision tree of depth 3.
The tree is designed to analyze interactions between various variables and their influence on claim costs. This binary decision tree, with a depth of 3, is split based on the following key variables: “garantie,” “puissance,” “agevehicule,” “zone,” and “densite.”
The root node, representing the entire dataset, displays a claim frequency of 10%, associated costs, and a proportion of 100%. The initial split, determined by the Gini impurity metric, separates the root node into two more homogeneous groups defined by `garantie = 4BG,5CO`. Policies meeting this condition are displayed on the left, while those for cars with a vehicle age greater than or equal to one year are shown on the right. Notably, 67% of the cars are at least one year old (`agevehicule >= 1`), and those with `densite < 17` account for 32%. Further refinement, based on `puissance < 7` and `zone = A,E`, narrows this down to 9%.
Conversely, the claim rate on the left side of the decision tree is only 7%, contingent on `puissance >= 6` and the age of the vehicle. These specific variables will be scrutinized more closely in the subsequent section.
In this subsection, we conduct a univariate analysis of individual variables. Examining each variable in isolation, we aim to reveal its distribution, characteristics, and statistical properties. This analysis provides essential insights into patterns, trends, and outliers, laying the groundwork for a comprehensive understanding of our data and guiding further exploration.
The “Overall” plot displays total claims costs across zones, with each bar color-coded by zone. The second plot, which is not produced by the code shown, likely illustrates claims costs for zones where costs exceed zero, which helps identify zones with higher claims costs. The two figures show that zone E has the highest claims cost, followed by D, C, A, F, and B. By fuel type, diesel accounts for the highest claims cost, followed by gasoline.
The histograms above show that car make 12 has the highest claims costs in contracts, while the other makes are close to zero. In addition, 1-year-old vehicles have the highest claims costs, followed by 0-, 2-, and 4-year-old cars.
In this analysis, the insurance type most advantageous for insurers is Personal Damage Cover (2DO), ranking first, followed closely by Glass Breakage Cover (4BG) in second place and Third Party Liability (1RC) in third. Fire and Collision cover (3VI) and Collision (5CO) show minimal prevalence, approaching zero. In terms of horsepower, most contracts are for cars with a horsepower of 4, followed by 7, with 6, 5, and 8 at almost equal levels.
The age distribution shows that paid drivers are most frequent in the (40, 50] age range, followed by the (30, 40] and (50, 60] age ranges. The distribution of claims costs, on the other hand, resembles a Poisson distribution.
Bivariate analysis explores relationships between pairs of variables, revealing patterns and correlations. It helps uncover how changes in one variable relate to changes in another, providing insights through statistical techniques and visualizations for a more nuanced understanding of the data.
Other intriguing interactions appear to be present:
The code conducts a bivariate analysis of “Vehicle Power” and “Claim Frequency,” considering different “Fuel Types.” It generates a scatter plot with a smooth curve, illustrating the relationship between power, claim frequency, and fuel type. The larger font sizes in the title and axes enhance readability. The analysis aims to reveal patterns and correlations in the data, offering insights into the interplay between vehicle characteristics and insurance claim frequencies. In addition, the figure on the right clearly shows that the type of cover that interacts most with the cost and number of claims is damage cover.
In this analysis, we explore the claim frequency in automobile insurance, focusing on the impact of car brand, vehicle age, and fuel type. We categorize vehicles based on age, considering those aged seven or more as a single group. After summarizing the claim frequency and cost, we filter out groups with less than 100 costs. The resulting scatter plot reveals the relationship between claim frequency and car brand, color-coded by vehicle age group, with separate facets for different fuel types. Larger points represent higher claim frequencies, and the analysis provides insights into the interplay of these factors in insurance claims.
From the two graphs above, we examine the interactions between car age and driver age, on the one hand, and vehicle make, on the other. To avoid the noisy effects of tiny groups, all cars over twenty years old belong to the same group. It is clear from the two graphs that there is an interaction between claims frequency, driver age, and car brand. Outliers are observed in the graph depicting the interaction between the age of the driver and the car brand. This is explained by the presence of individuals in certain age groups owning luxury cars, leading to extreme data points.
Correlation analysis is a crucial step in understanding relationships between variables within a dataset. In this section, we perform correlation analysis on the insurance dataset, focusing on both numerical and categorical variables.
We delve into the relationships between numerical variables such as Contract number, Cost, Power, Vehicle age, Driver age, Brand, Density, Region, and claim number. Utilizing correlation coefficients, scatter plots, and other visualization techniques, we aim to uncover patterns and dependencies among these quantitative features.
## The dataset has 9 numeric Variables and 3 factor variables
The output indicates that there are 9 numeric variables and 3 factor variables in the dataset.
From the correlation graph of numerical variables, we can see that driver age and region are strongly positively correlated, followed by cost and number of claims. On the other hand, there is a negative correlation between the make and age of the vehicle and density. This analysis of the correlations among numerical variables provides information for the model we build later.
When the p-value is below 0.05, we can accept that there is a significant relationship between the highlighted variables. As we can see, this is the case among the variables cost, power, vehicle age, driver age, density, number of claims, etc.
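The significance rule above can be illustrated with `cor.test()` from base R, which tests the null hypothesis of zero correlation between two numeric variables. The data below are simulated for the example and are not the study data.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated driver ages and costs with a built-in positive relationship
n <- 100
ageconducteur <- round(runif(n, 22, 84))
cout <- 300 + 5 * ageconducteur + rnorm(n, sd = 50)

# Pearson correlation with its p-value
res <- cor.test(ageconducteur, cout, method = "pearson")
print(res$estimate)
print(res$p.value)

# Decision at the 5% level
significant <- res$p.value < 0.05
```

A p-value below 0.05 only indicates that the correlation is unlikely to be zero; the size of the relationship is given by the estimate itself.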
This subsubsection explores the interplay between categorical and numerical variables like car brand, fuel type, and warranty coverage. Through appropriate statistical measures and visualizations, we seek to discern any associations or dependencies between these discrete attributes.
## tibble [179 × 12] (S3: tbl_df/tbl/data.frame)
## $ nocontrat : num [1:179] 4075803 4075803 4024277 4024277 4083980 ...
## $ garantie : num [1:179] 4 5 4 5 4 2 4 5 2 4 ...
## $ cout : num [1:179] 512.7 0 89.7 1851.1 566.8 ...
## $ zone : num [1:179] 2 2 3 3 5 5 2 2 2 1 ...
## $ puissance : num [1:179] 6 6 4 4 4 5 4 4 5 4 ...
## $ agevehicule : num [1:179] 3 3 2 2 1 1 1 1 0 4 ...
## $ ageconducteur: num [1:179] 48 48 32 32 74 33 50 50 37 51 ...
## $ marque : num [1:179] 12 12 12 12 12 12 12 12 5 12 ...
## $ carburant : num [1:179] 1 1 1 1 1 0 0 0 1 1 ...
## $ densite : num [1:179] 11 11 31 31 26 53 54 54 31 11 ...
## $ region : num [1:179] 13 13 -1 -1 13 0 13 13 11 12 ...
## $ claimnb : int [1:179] 1 0 1 1 1 1 1 1 1 1 ...
The correlation graph shows a strong correlation between the following pairs of variables: driver age and region, cost and number of claims, and area and density. On the other hand, there is a weak negative correlation between zone and fuel, as well as between the make and age of the vehicle.
For a p-value below 0.05, we can accept that there is a significant relationship between categorical and numerical variables. As we can see, there is a significant relationship among the variables cost, fuel, warranty, power, vehicle age, driver age, density, number of claims, etc.
We focus on exploring the distribution of variables and density plots to gain insights into the frequency and cost of insurance claims. The key objective is to select appropriate probability distributions that effectively model the statistical patterns inherent in the data. This step is crucial for building accurate and robust models to understand and predict claim behavior.
The analysis reveals that the numerical variables, such as driver age, claims cost, vehicle age, and horsepower, exhibit a distribution pattern resembling the Poisson distribution. In contrast, the density of other variables, including the number of claims, regions, and car makes, aligns more closely with a negative binomial probability distribution.
The bulk of those making claims is dense on the insurance cost scale. At the top of this scale, claims start to converge and earn higher and higher fees; the graph is simple and clearly delineates this zone. Damage cover (2DO) accounts for most of the mass on the insurance cost scale, followed by cover types RC and BG; here too the graph is simple and delineates this zone.
Modeling pure premium in insurance is crucial for estimating future claim costs. Generalized Linear Models (GLM) offer a powerful approach by incorporating predictors, using a link function, and specifying a distribution. Key steps include data exploration, model specification, parameter estimation, validation, interpretation, and predictive analytics. GLMs provide flexibility for understanding risk factors and optimizing insurance portfolios.
\[ g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k \]
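Where \(\mu = \mathbb{E}[Y]\) is the expected value of the response, \(X_1, \ldots, X_k\) are the independent variables (rating factors), \(\beta_0\) is the intercept, and \(\beta_1, \ldots, \beta_k\) are the coefficients to be estimated.
The link function \(g\) relates the mean \(\mu\) of the response variable to the linear combination of predictors; its inverse maps the linear combination back to the scale of the response. Common link functions include the identity link \(g(\mu) = \mu\), the log link \(g(\mu) = \log(\mu)\), and the logit link \(g(\mu) = \log\big(\mu / (1 - \mu)\big)\).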
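As an illustration, base R exposes standard links such as the log and logit through `stats::make.link()`, which returns both the link function and its inverse. The value of `mu` below is arbitrary.

```r
# make.link() returns the link function and its inverse for a named link
log_link   <- make.link("log")
logit_link <- make.link("logit")

mu <- 0.25  # arbitrary mean, chosen for illustration

eta_log   <- log_link$linkfun(mu)     # log(mu)
eta_logit <- logit_link$linkfun(mu)   # log(mu / (1 - mu))

# The inverse link recovers the mean from the linear predictor
mu_back <- log_link$linkinv(eta_log)

print(c(eta_log = eta_log, eta_logit = eta_logit, mu_back = mu_back))
```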
This code is part of the data preparation process for modeling
insurance costs. The goal is to predict insurance costs
(cout) based on various independent variables such as
contract number, insurance coverage type, geographical zone, car power,
vehicle age, driver age, car brand, and fuel type. The code uses the
tidyverse and caret libraries for data manipulation and modeling.
The target variable (cost) is assigned to the dataframe
y, while the independent variables are assigned to the
dataframe X, excluding the cost variable to prevent data
leakage during prediction. The dataset is then split into training and
validation sets with \(90\%\) for
training and \(10\%\) for
validation.
This step is essential to ensure a clean and suitable dataset for building an accurate predictive model for insurance costs.
Train-Test Split:
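A minimal sketch of the split described above, written with base R's `sample()` on simulated data. The text mentions caret, whose `createDataPartition()` is a stratified alternative; the data frame `data_all` and the seed are assumptions made for illustration.

```r
set.seed(123)  # arbitrary seed, for reproducibility only

# Simulated stand-in for the merged data (179 rows, as in the study)
n <- 179
data_all <- data.frame(
  cout      = round(runif(n, 0, 1900), 2),
  puissance = sample(4:10, n, replace = TRUE),
  zone      = factor(sample(LETTERS[1:6], n, replace = TRUE))
)

# Separate the target from the predictors to avoid leakage
y <- data_all$cout
X <- data_all[, setdiff(names(data_all), "cout")]

# Draw 90% of the row indices for training; the rest is kept for validation
train_idx <- sample(seq_len(n), size = floor(0.9 * n))
X_train <- X[train_idx, ]
y_train <- y[train_idx]
X_valid <- X[-train_idx, ]
y_valid <- y[-train_idx]

print(c(train = nrow(X_train), valid = nrow(X_valid)))
```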
We’re going to make two specifications for modeling the relationships
between independent variables (pricing factors) and a dependent variable
(claims frequency claimnb and claims cost
cout).
\[X=\{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}\}\] where \(x_1=\text{garantie}\), \(x_2=\text{zone}\), \(x_3=\text{puissance}\), \(x_4=\text{agevehicule}\), \(x_5=\text{ageconducteur}\), \(x_6=\text{marque}\), \(x_7=\text{carburant}\), \(x_8=\text{densite}\), \(x_9=\text{region}\), \(x_{10}=\text{claimnb}\)
\[Y=\text{cout}\]
## [1] "nocontrat" "garantie" "cout" "zone"
## [5] "puissance" "agevehicule" "ageconducteur" "marque"
## [9] "carburant" "densite" "region" "claimnb"
## [1] 512.74 0.00 89.70 1851.11 566.84 512.48
## # A tibble: 6 × 11
## nocontrat garantie zone puissance agevehicule ageconducteur marque carburant
## <dbl> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <fct>
## 1 4075803 2DO E 6 3 48 12 E
## 2 4075803 1RC E 6 3 48 12 E
## 3 4024277 2DO D 4 2 32 12 E
## 4 4024277 1RC D 4 2 32 12 E
## 5 4083980 2DO B 4 1 74 12 E
## 6 4032035 4BG B 5 1 33 12 D
## # ℹ 3 more variables: densite <dbl>, region <dbl>, claimnb <int>
Poisson Regression operates similarly to ordinary linear
regression, with the key distinction that it assumes the response
variable (cout) follows a Poisson distribution. It is
one of the generalized linear model (GLM) families applied
in this project:
\[ Y_i \stackrel{iid}{\sim} \text{Pois}(\lambda) \]
This model is selected primarily as an initial demonstration of a GLM for predicting claim severity. However, it’s worth noting that more sophisticated alternatives might be more suitable for this purpose.
Binomial Regression functions analogously to ordinary linear
regression, with a crucial difference: it assumes a binomial
distribution for the response variable (cout). It is
one of the GLM families employed in this project:
\[ Y_i \stackrel{iid}{\sim} \text{Binomial}(n, p) \]
The choice of this model is made as an initial exploration into GLMs for predicting claim severity. Nevertheless, it’s important to recognize that more advanced alternatives might offer better suitability for this particular task.
Gaussian Regression functions similarly to ordinary linear
regression, assuming that the response variable (cout)
follows a Gaussian distribution. This model is part of the broader
family of GLMs used in this project:
\[ Y_i \stackrel{iid}{\sim} \mathcal{N}(\mu, \sigma^2) \]
In this context, \(\mu\) represents the mean of the Gaussian distribution, and \(\sigma^2\) is the variance. Gaussian Regression is chosen as it provides a framework for predicting continuous outcomes. However, the appropriateness of this choice should be critically assessed, as alternative models might better capture the underlying distribution of the data.
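The three families described above can all be fitted with `stats::glm()`. The sketch below uses simulated data: the variable names mirror the study but the values do not, and the binary response `claimed` is an assumption introduced only so that the binomial family has a 0/1 outcome to model.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated predictors
n <- 200
sim <- data.frame(
  puissance = sample(4:10, n, replace = TRUE),
  carburant = factor(sample(c("D", "E"), n, replace = TRUE))
)

# Count-like response generated from a log-linear model
sim$y <- rpois(n, lambda = exp(1 + 0.1 * sim$puissance - 0.2 * (sim$carburant == "E")))

# Binary indicator, used only to illustrate the binomial family
sim$claimed <- as.integer(sim$y > 5)

fit_pois  <- glm(y ~ puissance + carburant, family = poisson(link = "log"), data = sim)
fit_gauss <- glm(y ~ puissance + carburant, family = gaussian(link = "identity"), data = sim)
fit_binom <- glm(claimed ~ puissance + carburant, family = binomial(link = "logit"), data = sim)

# AIC is comparable only between models fitted to the same response
print(AIC(fit_pois, fit_gauss))
```

Each family changes both the assumed distribution of the response and the default link, so the coefficients of the three fits are not on the same scale.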
We’re going to build GLM models for each family to make sure we don’t miss any distribution families that might be suitable for our model. Here are the mathematical regression models for each family:
Poisson family
\[\text{1st Model: }\; Y = \text{Log}(y_{\text{train}}) \sim \hat{\beta}_0 + \hat{\beta}_1 \times \text{claimnb} + \hat{\beta}_2 \times \text{puissance} + \hat{\beta}_3 \times \text{garantie}_{\scriptsize{2DO}} + \hat{\beta}_4 \times \text{garantie}_{\scriptsize{3VI}} + \hat{\beta}_5 \times \text{garantie}_{\scriptsize{4BG}} \] \[+ \hat{\beta}_6 \times \text{garantie}_{\scriptsize{5CO}}+ \hat{\beta}_7 \times \text{carburant}_{\scriptsize{E}} + \hat{\beta}_8 \times \text{zone}_{\scriptsize{B}} + \hat{\beta}_9 \times \text{zone}_{\scriptsize{C}}\] \[+ \hat{\beta}_{10} \times \text{zone}_{\scriptsize{D}} + \hat{\beta}_{11} \times \text{zone}_{\scriptsize{E}} + \hat{\beta}_{12} \times \text{zone}_{\scriptsize{F}}+ \hat{\beta}_{13} \times \text{agevehicule}\] \[+ \hat{\beta}_{14} \times \text{marque} + \hat{\beta}_{15} \times \text{region} + \hat{\beta}_{16} \times \text{ageconducteur}\]
where,
##
## Call:
## glm(formula = round(y_train) ~ claimnb + puissance + factor(garantie) +
## factor(carburant) + factor(zone) + agevehicule + marque +
## region + ageconducteur, family = poisson(link = "log"), data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.234e+01 7.283e+01 -0.169 0.865
## claimnb 1.918e+01 7.283e+01 0.263 0.792
## puissance 1.077e-02 2.023e-03 5.324 1.01e-07 ***
## factor(garantie)2DO -2.030e-01 7.758e-03 -26.163 < 2e-16 ***
## factor(garantie)3VI 4.800e-01 2.530e-02 18.968 < 2e-16 ***
## factor(garantie)4BG -9.123e-01 1.058e-02 -86.222 < 2e-16 ***
## factor(garantie)5CO -6.867e-01 3.113e-02 -22.062 < 2e-16 ***
## factor(carburant)E -1.116e-01 7.225e-03 -15.443 < 2e-16 ***
## factor(zone)B -1.732e-01 1.891e-02 -9.161 < 2e-16 ***
## factor(zone)C -1.224e-01 1.265e-02 -9.679 < 2e-16 ***
## factor(zone)D -7.067e-02 1.127e-02 -6.269 3.64e-10 ***
## factor(zone)E -1.671e-01 1.091e-02 -15.323 < 2e-16 ***
## factor(zone)F -5.060e-01 1.462e-02 -34.611 < 2e-16 ***
## agevehicule 2.190e-02 2.514e-03 8.710 < 2e-16 ***
## marque 1.831e-02 1.168e-03 15.668 < 2e-16 ***
## region -1.088e-02 1.017e-03 -10.697 < 2e-16 ***
## ageconducteur 3.092e-03 3.488e-04 8.865 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 70022 on 143 degrees of freedom
## Residual deviance: 33889 on 127 degrees of freedom
## AIC: 34971
##
## Number of Fisher Scoring iterations: 10
Poisson Regression Model for Cost of Insurance (Predictors and Estimated Effects)
- `claimnb`: The estimated coefficient is positive but not statistically significant (p = 0.792), so the model gives no evidence that an additional claim changes the expected cost of insurance.
- `puissance`: A higher power is associated with a significant increase in the expected cost of insurance.
- `factor(garantie)`: Different levels of insurance guarantee significantly impact the cost of insurance compared to the baseline.
- `factor(carburant)E`: Having fuel type E is associated with a significant decrease in the expected cost of insurance.
- `factor(zone)`: Different zones significantly affect the cost of insurance compared to Zone A.
- `agevehicule`: Each additional year of vehicle age significantly increases the expected cost of insurance.
- `marque`: Each unit increase in the vehicle brand code significantly raises the expected cost of insurance.
- `region`: Each unit increase in the geographic region code significantly decreases the expected cost of insurance.
- `ageconducteur`: Each additional year of driver age significantly raises the expected cost of insurance.
P-values
Model Fit Statistics
Mathematically, its equation is:
\[\text{1st Model: }\; Y = \text{Log}(y_{\text{train}}) \sim -12.34 + 19.18 \times \text{claimnb} + 0.011 \times \text{puissance} - 0.203 \times \text{garantie}_{\scriptsize{2DO}}\] \[+ 0.48 \times \text{garantie}_{\scriptsize{3VI}} - 0.912 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.687 \times \text{garantie}_{\scriptsize{5CO}} - 0.112 \times \text{carburant}_{\scriptsize{E}} - 0.173 \times \text{zone}_{\scriptsize{B}} - 0.122 \times \text{zone}_{\scriptsize{C}}\] \[- 0.071 \times \text{zone}_{\scriptsize{D}} - 0.167 \times \text{zone}_{\scriptsize{E}} - 0.506 \times \text{zone}_{\scriptsize{F}}\]
\[+ 0.022 \times \text{agevehicule} + 0.018 \times \text{marque} - 0.011 \times \text{region} + 0.003 \times \text{ageconducteur}\]
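Because the model uses a log link, each coefficient acts multiplicatively on the expected cost once it is exponentiated. The short sketch below illustrates this for the numeric predictors of the first model; the values are copied from the summary output above, and the object names (`log_coefs`, `rate_ratio`, `pct_change`) are ours, chosen only for illustration.

```r
# Log-scale coefficients copied from the first model's summary output
log_coefs <- c(
  puissance     = 0.01077,
  agevehicule   = 0.02190,
  marque        = 0.01831,
  region        = -0.01088,
  ageconducteur = 0.003092
)

# exp(beta) is the multiplicative change in expected cost
# for a one-unit increase in the predictor
rate_ratio <- exp(log_coefs)

# The same effect expressed as a percentage change
pct_change <- 100 * (rate_ratio - 1)

print(round(cbind(log_coefs, rate_ratio, pct_change), 4))
```

For small coefficients the percentage change is close to 100 times the coefficient, which is why the two are often quoted interchangeably; for larger coefficients, such as the guarantee and zone effects, the exponentiated value should be used.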
##
## Call:
## glm(formula = round(y_train) ~ puissance + factor(garantie) +
## factor(carburant) + factor(zone) + agevehicule + marque +
## region + ageconducteur, family = poisson(link = "log"), data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.1200639 0.0224721 316.840 <2e-16 ***
## puissance -0.0290553 0.0020425 -14.226 <2e-16 ***
## factor(garantie)2DO -0.1098963 0.0076887 -14.293 <2e-16 ***
## factor(garantie)3VI 0.7220859 0.0251379 28.725 <2e-16 ***
## factor(garantie)4BG -0.6905203 0.0105927 -65.188 <2e-16 ***
## factor(garantie)5CO -0.3651179 0.0310821 -11.747 <2e-16 ***
## factor(carburant)E -0.1947441 0.0072050 -27.029 <2e-16 ***
## factor(zone)B -0.3337035 0.0188215 -17.730 <2e-16 ***
## factor(zone)C -0.3288956 0.0126607 -25.978 <2e-16 ***
## factor(zone)D -0.3150702 0.0110601 -28.487 <2e-16 ***
## factor(zone)E -0.4267584 0.0109482 -38.980 <2e-16 ***
## factor(zone)F -0.4919245 0.0146491 -33.581 <2e-16 ***
## agevehicule 0.0249593 0.0024006 10.397 <2e-16 ***
## marque 0.0161861 0.0011069 14.623 <2e-16 ***
## region 0.0018519 0.0010169 1.821 0.0686 .
## ageconducteur -0.0003913 0.0003439 -1.138 0.2553
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 70022 on 143 degrees of freedom
## Residual deviance: 59191 on 128 degrees of freedom
## AIC: 60272
##
## Number of Fisher Scoring iterations: 5
Coefficients Interpretation
Intercept (7.12) - The baseline log of the expected cost when every numeric predictor is zero and every factor is at its reference level; exp(7.12) ≈ 1236. - Highly significant (large z-value, p-value < 2e-16).
Puissance (-0.0291) - A one-unit increase in puissance lowers the log of the expected cost by 0.0291, which corresponds to a decrease of about 2.9% in the expected cost. - Highly significant.
Factor levels of Garantie (2DO, 3VI, 4BG, 5CO) - Each factor level represents a category of the garantie variable. - The coefficients represent the change in the log of the expected cost compared to the reference category. - All levels are highly significant.
Factor levels of Carburant (E) - Represents the change in the log of the expected cost for the “E” category compared to the reference. - Highly significant.
Factor levels of Zone (B, C, D, E, F) - Similar interpretation as above. Each level is compared to the reference “A”. - All are highly significant.
Agevehicule (0.0250) - A one-unit increase in agevehicule raises the log of the expected cost by 0.0250, an increase of about 2.5% in the expected cost. - Highly significant.
Marque (0.0162) - A one-unit increase in marque raises the log of the expected cost by 0.0162, an increase of about 1.6% in the expected cost. - Highly significant.
Region (0.00185) - A one-unit increase in region raises the log of the expected cost by 0.00185, an increase of about 0.19% in the expected cost. - Not significant at the 5% level (p-value = 0.0686), although significant at the 10% level.
Ageconducteur (-0.000391) - A one-unit increase in ageconducteur lowers the log of the expected cost by 0.000391, a decrease of about 0.04% in the expected cost. - Not statistically significant (p-value = 0.2553).
Mathematically, its equation is:
\[\text{2nd Model: }\; Y = \text{Log}(y_{\text{train}}) \sim 7.12 - 0.029 \times \text{puissance} - 0.11 \times \text{garantie}_{\scriptsize{2DO}} + 0.722 \times \text{garantie}_{\scriptsize{3VI}} - 0.691 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.365 \times \text{garantie}_{\scriptsize{5CO}} - 0.195 \times \text{carburant}_{\scriptsize{E}} - 0.334 \times \text{zone}_{\scriptsize{B}} - 0.329 \times \text{zone}_{\scriptsize{C}} - 0.315 \times \text{zone}_{\scriptsize{D}} - 0.427 \times \text{zone}_{\scriptsize{E}} - 0.492 \times \text{zone}_{\scriptsize{F}}\]
\[+ 0.025 \times \text{agevehicule} + 0.016 \times \text{marque} + 0.002 \times \text{region} - 0.0004 \times \text{ageconducteur}\]
We will eliminate the two variables ageconducteur and
region since they are not significant at the 5% level.
##
## Call:
## glm(formula = round(y_train) ~ puissance + factor(garantie) +
## factor(carburant) + factor(zone) + agevehicule + marque,
## family = poisson(link = "log"), data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.110841 0.020814 341.63 <2e-16 ***
## puissance -0.028415 0.001991 -14.27 <2e-16 ***
## factor(garantie)2DO -0.107747 0.007583 -14.21 <2e-16 ***
## factor(garantie)3VI 0.729851 0.024773 29.46 <2e-16 ***
## factor(garantie)4BG -0.688195 0.010312 -66.74 <2e-16 ***
## factor(garantie)5CO -0.365425 0.031061 -11.77 <2e-16 ***
## factor(carburant)E -0.197264 0.006872 -28.70 <2e-16 ***
## factor(zone)B -0.336573 0.018552 -18.14 <2e-16 ***
## factor(zone)C -0.329223 0.012554 -26.22 <2e-16 ***
## factor(zone)D -0.316826 0.011018 -28.75 <2e-16 ***
## factor(zone)E -0.426848 0.010948 -38.99 <2e-16 ***
## factor(zone)F -0.494625 0.014382 -34.39 <2e-16 ***
## agevehicule 0.025691 0.002347 10.95 <2e-16 ***
## marque 0.016623 0.001052 15.80 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 70022 on 143 degrees of freedom
## Residual deviance: 59195 on 130 degrees of freedom
## AIC: 60271
##
## Number of Fisher Scoring iterations: 5
Model Summary:
Intercept (7.1108): - The baseline log of the expected cost when every numeric predictor is zero and every factor is at its reference level. - Highly significant (p-value < 0.05).
Puissance (-0.0284): - A one-unit increase in puissance lowers the log of the expected cost by 0.0284, a decrease of about 2.8% in the expected cost. - Highly significant (p-value < 0.05).
Factor levels of Garantie (2DO, 3VI, 4BG, 5CO): - Each factor level represents a category of the garantie variable. - The coefficients represent the change in the log of the expected cost compared to the reference category. - All levels are highly significant (p-value < 0.05).
Factor levels of Carburant (E): - Represents the change in the log of the expected cost for the “E” category compared to the reference. - Highly significant (p-value < 0.05).
Factor levels of Zone (B, C, D, E, F): - Similar interpretation as above. Each level is compared to the reference “A”. - All are highly significant (p-value < 0.05).
Agevehicule (0.0257): - A one-unit increase in agevehicule raises the log of the expected cost by 0.0257, an increase of about 2.6% in the expected cost. - Highly significant (p-value < 0.05).
Marque (0.0166): - A one-unit increase in marque raises the log of the expected cost by 0.0166, an increase of about 1.7% in the expected cost. - Highly significant (p-value < 0.05).
Model Fit: - Null Deviance: 70022 (on 143 df) - Residual Deviance: 59195 (on 130 df) - AIC: 60271
Conclusions: - All retained predictors are statistically significant, and dropping region and ageconducteur leaves the AIC essentially unchanged (60271 against 60272 for the second model). - The residual deviance (59195 on 130 df) is nevertheless far larger than its degrees of freedom, which points to strong overdispersion; the Poisson standard errors and p-values are therefore likely to be too optimistic, and the fit should not be judged on significance alone.
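The removal of region and ageconducteur can also be checked with a likelihood-ratio test between the second and third models. The sketch below works directly from the deviances printed in the two summaries, so it inherits their rounding; the variable names are ours. With the fitted objects at hand, `anova(reduced_model, full_model, test = "Chisq")` would give the same comparison without rounding.

```r
# Residual deviances and degrees of freedom as printed in the summaries
dev_full    <- 59191; df_full    <- 128  # 2nd model (with region, ageconducteur)
dev_reduced <- 59195; df_reduced <- 130  # 3rd model (both variables dropped)

# Likelihood-ratio statistic: increase in deviance caused by dropping the terms
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

cat("LR statistic:", lr_stat, "on", lr_df, "df, p-value:", round(p_value, 4), "\n")
```

A p-value well above 0.05 is consistent with the decision to drop both variables, under the usual Poisson assumptions.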
Mathematically, its equation is:
\[\text{3rd Model: }\; Y = \text{Log}(y_{\text{train}}) \sim 7.111 - 0.028 \times \text{puissance} - 0.108 \times \text{garantie}_{\scriptsize{2DO}} + 0.730 \times \text{garantie}_{\scriptsize{3VI}} - 0.688 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.365 \times \text{garantie}_{\scriptsize{5CO}} - 0.197 \times \text{carburant}_{\scriptsize{E}} - 0.337 \times \text{zone}_{\scriptsize{B}} - 0.329 \times \text{zone}_{\scriptsize{C}} - 0.317 \times \text{zone}_{\scriptsize{D}} - 0.427 \times \text{zone}_{\scriptsize{E}} - 0.495 \times \text{zone}_{\scriptsize{F}}\]
\[+ 0.026 \times \text{agevehicule} + 0.017 \times \text{marque}\]
We can also reintroduce region and add transformations of ageconducteur (its logarithm and its second to fourth powers) to check whether driver age has a non-linear effect on the cost.
##
## Call:
## glm(formula = round(y_train) ~ puissance + factor(garantie) +
## factor(carburant) + factor(zone) + agevehicule + marque +
## region + ageconducteur + log(ageconducteur) + I(ageconducteur^2) +
## I(ageconducteur^3) + I(ageconducteur^4), family = poisson(link = "log"),
## data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.139e+02 8.578e+00 36.592 < 2e-16 ***
## puissance -4.564e-02 2.160e-03 -21.127 < 2e-16 ***
## factor(garantie)2DO -1.497e-01 7.967e-03 -18.787 < 2e-16 ***
## factor(garantie)3VI 6.561e-01 2.532e-02 25.914 < 2e-16 ***
## factor(garantie)4BG -7.290e-01 1.093e-02 -66.700 < 2e-16 ***
## factor(garantie)5CO -3.482e-01 3.128e-02 -11.131 < 2e-16 ***
## factor(carburant)E -1.946e-01 7.437e-03 -26.169 < 2e-16 ***
## factor(zone)B -3.455e-01 1.893e-02 -18.251 < 2e-16 ***
## factor(zone)C -2.588e-01 1.322e-02 -19.574 < 2e-16 ***
## factor(zone)D -2.900e-01 1.132e-02 -25.610 < 2e-16 ***
## factor(zone)E -3.892e-01 1.109e-02 -35.097 < 2e-16 ***
## factor(zone)F -5.541e-01 1.491e-02 -37.164 < 2e-16 ***
## agevehicule 4.052e-02 2.471e-03 16.401 < 2e-16 ***
## marque 1.663e-02 1.197e-03 13.892 < 2e-16 ***
## region 3.686e-03 1.144e-03 3.222 0.00127 **
## ageconducteur 1.808e+01 4.786e-01 37.770 < 2e-16 ***
## log(ageconducteur) -1.887e+02 5.160e+00 -36.562 < 2e-16 ***
## I(ageconducteur^2) -3.102e-01 8.000e-03 -38.778 < 2e-16 ***
## I(ageconducteur^3) 3.017e-03 7.647e-05 39.448 < 2e-16 ***
## I(ageconducteur^4) -1.185e-05 2.984e-07 -39.726 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 70022 on 143 degrees of freedom
## Residual deviance: 56989 on 124 degrees of freedom
## AIC: 58077
##
## Number of Fisher Scoring iterations: 5
Deviance Residuals: - The deviance residuals measure how far each observation lies from its fitted value. They range from -47.656 to 41.680, which is wide for a Poisson model and shows that many observations are predicted poorly.
Coefficients:
Intercept (313.9): - The value of the linear predictor when all numeric predictors are zero. It has no practical interpretation here, because log(ageconducteur) is undefined at zero. - Highly significant (p-value < 0.05).
Puissance (-0.0456): - A one-unit increase in puissance lowers the log of the expected cost by 0.0456, a decrease of about 4.5% in the expected cost. - Highly significant (p-value < 0.05).
Factor levels of Garantie (2DO, 3VI, 4BG, 5CO): - The coefficients represent the change in the log of the expected cost compared to the reference category. - All levels are highly significant (p-value < 0.05).
Factor levels of Carburant (E): - Represents the change in the log of the expected cost for the “E” category compared to the reference. - Highly significant (p-value < 0.05).
Factor levels of Zone (B, C, D, E, F): - Similar interpretation as above. Each level is compared to the reference “A”. - All are highly significant (p-value < 0.05).
Agevehicule (0.0405): - A one-unit increase in agevehicule raises the log of the expected cost by 0.0405, an increase of about 4.1% in the expected cost. - Highly significant (p-value < 0.05).
Marque (0.0166): - A one-unit increase in marque raises the log of the expected cost by 0.0166, an increase of about 1.7% in the expected cost. - Highly significant (p-value < 0.05).
Region (0.00369): - A one-unit increase in region raises the log of the expected cost by 0.00369, an increase of about 0.37% in the expected cost. - Significant at the 1% level (p-value = 0.00127).
Ageconducteur (18.08): - The linear term of the driver-age effect. Because ageconducteur also enters through its logarithm and its second to fourth powers, this coefficient cannot be read on its own; the five age terms jointly describe a non-linear age effect. - Highly significant (p-value < 0.05).
Log(Ageconducteur) (-188.7): - The logarithmic term of the driver-age effect, with a negative coefficient. - Highly significant (p-value < 0.05).
I(Ageconducteur^2) (-0.3102): - The second-power term of the driver-age effect, with a negative coefficient. - Highly significant (p-value < 0.05).
I(Ageconducteur^3) (0.00302): - The third-power term of the driver-age effect, with a positive coefficient. - Highly significant (p-value < 0.05).
I(Ageconducteur^4) (-1.185e-05): - The fourth-power term of the driver-age effect, with a negative coefficient. - Highly significant (p-value < 0.05).
Conclusions: - Every coefficient is statistically significant, and the AIC falls from 60271 (third model) to 58077, so the additional age terms improve the fit. - The residual deviance (56989 on 124 df) nevertheless remains far above its degrees of freedom, which points to strong overdispersion; the Poisson standard errors and p-values are therefore likely to be too optimistic.
\[\text{4th Model: }\; Y = \text{Log}(y_{\text{train}}) \sim 313.9 - 0.046 \times \text{puissance} - 0.15 \times \text{garantie}_{\scriptsize{2DO}} + 0.656 \times \text{garantie}_{\scriptsize{3VI}} - 0.729 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.348 \times \text{garantie}_{\scriptsize{5CO}} - 0.195 \times \text{carburant}_{\scriptsize{E}} - 0.346 \times \text{zone}_{\scriptsize{B}} - 0.259 \times \text{zone}_{\scriptsize{C}}\] \[- 0.29 \times \text{zone}_{\scriptsize{D}} - 0.389 \times \text{zone}_{\scriptsize{E}} - 0.554 \times \text{zone}_{\scriptsize{F}} + 0.041 \times \text{agevehicule} + 0.017 \times \text{marque}\]
\[+ 0.004 \times \text{region} + 18.08 \times \text{ageconducteur} - 188.7 \times \log(\text{ageconducteur}) - 0.31 \times \text{I}(\text{ageconducteur}^2) + 0.003 \times \text{I}(\text{ageconducteur}^3)\] \[- 1.185 \times 10^{-5} \times \text{I}(\text{ageconducteur}^4)\]
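A quick way to judge the fit of the four Poisson models is to divide each residual deviance by its degrees of freedom. As a rough rule of thumb, a well-specified Poisson model gives a ratio near 1. The sketch below uses only the figures printed in the summaries above; the data frame and column names are ours.

```r
# Residual deviance and degrees of freedom as printed for each Poisson model
fit_stats <- data.frame(
  model             = c("1st", "2nd", "3rd", "4th"),
  residual_deviance = c(33889, 59191, 59195, 56989),
  residual_df       = c(127, 128, 130, 124)
)

# Rough dispersion indicator: values far above 1 suggest overdispersion
fit_stats$deviance_per_df <- fit_stats$residual_deviance / fit_stats$residual_df

print(fit_stats)
```

All four ratios are in the hundreds, which suggests that the variance of the cost is much larger than a Poisson distribution allows. One common option in that situation is to refit with `family = quasipoisson(link = "log")`, which keeps the same coefficient estimates but rescales the standard errors; whether that is appropriate for these data is a judgment we leave open.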
Gaussian family
\[\text{GLM}_{\scriptsize{gaussian}}: \; Y = y_{\text{train}} \sim \hat{\beta}_0 + \hat{\beta}_1 \times \text{puissance} + \hat{\beta}_2 \times \text{garantie} + \hat{\beta}_3 \times \text{carburant} + \hat{\beta}_4 \times \text{zone}+ \hat{\beta}_5 \times \text{agevehicule} + \] \[\hat{\beta}_6 \times \text{marque} + \hat{\beta}_7 \times \text{region}+ \hat{\beta}_8 \times \text{ageconducteur}\]
## Non-significant variables to be removed: puissance factor(garantie)2DO factor(garantie)5CO factor(carburant)E factor(zone)B factor(zone)C factor(zone)D agevehicule marque region ageconducteur
##
## Call:
## glm(formula = y_train ~ puissance + factor(garantie) + factor(carburant) +
## factor(zone) + agevehicule + marque + region + ageconducteur,
## family = gaussian(), data = X_train)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1196.4008 323.1513 3.702 0.000316 ***
## puissance -18.8310 27.8195 -0.677 0.499692
## factor(garantie)2DO -87.2521 115.9998 -0.752 0.453327
## factor(garantie)3VI 997.9488 550.2456 1.814 0.072075 .
## factor(garantie)4BG -422.7739 137.1038 -3.084 0.002506 **
## factor(garantie)5CO -252.8871 392.6840 -0.644 0.520730
## factor(carburant)E -125.9489 97.7367 -1.289 0.199842
## factor(zone)B -291.2110 254.9280 -1.142 0.255451
## factor(zone)C -283.8471 186.0466 -1.526 0.129557
## factor(zone)D -272.7943 173.4099 -1.573 0.118160
## factor(zone)E -343.5463 170.8412 -2.011 0.046436 *
## factor(zone)F -381.4183 211.6159 -1.802 0.073834 .
## agevehicule 16.0345 33.0085 0.486 0.627961
## marque 10.1259 14.6213 0.693 0.489851
## region 0.8994 13.6124 0.066 0.947421
## ageconducteur -0.8821 4.6643 -0.189 0.850298
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 281024)
##
## Null deviance: 43622995 on 143 degrees of freedom
## Residual deviance: 35971072 on 128 degrees of freedom
## AIC: 2232.3
##
## Number of Fisher Scoring iterations: 2
At the 5% level, only the intercept, factor(garantie)4BG and factor(zone)E are significant. factor(garantie)3VI and factor(zone)F are significant only at the 10% level, and the remaining variables are not significant.
Binomial family
\[\text{GLM}_{\scriptsize{binomial}}: \; Y = \text{claimnb} \sim \hat{\beta}_0 + \hat{\beta}_1 \times \text{puissance} + \hat{\beta}_2 \times \text{garantie} + \hat{\beta}_3 \times \text{carburant} + \hat{\beta}_4 \times \text{zone}+ \hat{\beta}_5 \times \text{agevehicule} + \hat{\beta}_6 \times \text{marque}\] \[+ \hat{\beta}_7 \times \text{region}+ \hat{\beta}_8 \times \text{ageconducteur}\]
##
## Call:
## glm(formula = claimnb ~ puissance + factor(garantie) + factor(carburant) +
## factor(zone) + agevehicule + marque + region + ageconducteur,
## family = binomial(link = "logit"), data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.289e+01 4.384e+03 0.005 0.9958
## puissance -3.500e-01 2.041e-01 -1.715 0.0864 .
## factor(garantie)2DO 3.977e-01 6.491e-01 0.613 0.5401
## factor(garantie)3VI 1.852e+01 1.773e+04 0.001 0.9992
## factor(garantie)4BG 1.908e+01 2.456e+03 0.008 0.9938
## factor(garantie)5CO 1.972e+01 1.253e+04 0.002 0.9987
## factor(carburant)E -1.043e+00 7.939e-01 -1.313 0.1891
## factor(zone)B -2.657e+00 6.792e+03 0.000 0.9997
## factor(zone)C -1.923e+01 4.384e+03 -0.004 0.9965
## factor(zone)D -1.904e+01 4.384e+03 -0.004 0.9965
## factor(zone)E -1.954e+01 4.384e+03 -0.004 0.9964
## factor(zone)F 3.525e-01 6.198e+03 0.000 1.0000
## agevehicule 4.675e-02 2.140e-01 0.218 0.8271
## marque -1.101e-02 1.021e-01 -0.108 0.9142
## region 1.088e-01 9.532e-02 1.142 0.2535
## ageconducteur -1.570e-02 3.328e-02 -0.472 0.6372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 96.233 on 143 degrees of freedom
## Residual deviance: 66.955 on 128 degrees of freedom
## AIC: 98.955
##
## Number of Fisher Scoring iterations: 19
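In the binomial output above, several estimates of about ±19 have standard errors in the thousands, and the fit needed 19 Fisher scoring iterations. This pattern is consistent with (quasi-)complete separation, where some combination of factor levels predicts the outcome perfectly. The sketch below reproduces the symptom on a small toy data set; the data and object names are hypothetical and unrelated to the contracts database.

```r
# Toy data (hypothetical): y is 0 for x <= 4 and 1 for x >= 5,
# so x separates the two classes perfectly
toy <- data.frame(x = 1:8, y = c(0, 0, 0, 0, 1, 1, 1, 1))

# glm() warns that fitted probabilities are numerically 0 or 1;
# the warnings are suppressed here to keep the output short
fit_sep <- suppressWarnings(
  glm(y ~ x, family = binomial(link = "logit"), data = toy)
)

# Estimates drift towards +/- infinity and standard errors become huge
coef_table <- summary(fit_sep)$coefficients
se <- coef_table[, "Std. Error"]

print(coef_table)
cat("Fisher scoring iterations:", fit_sep$iter, "\n")
```

When this happens, the p-values near 1 say little about whether a variable matters; they mainly reflect that the maximum-likelihood estimate does not exist in finite form.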
In this section, we delve into predicting claim costs for damage insurance contracts, with a focus on journey-related factors. We have access to a new database containing the pricing factors for 35 contracts. The primary objective is to develop a predictive model that takes into account specific journey characteristics to accurately estimate the potential cost of claims.
Insurance companies face the complex challenge of assessing risks associated with policyholders’ journeys. Journey characteristics such as distance traveled, frequency of trips, and other driving behavior-related factors can significantly influence the expected cost of claims. Thus, understanding and modeling these factors become crucial for accurate pricing of damage insurance contracts.
Our analysis is based on a freshly compiled database, gathering pricing data for 35 contracts. This database is carefully curated to include relevant information about coverages, vehicle characteristics, geographical areas, and most importantly, specifics about policyholders’ journeys.
We aim to develop a robust predictive model leveraging the information provided by this new database. By focusing on journeys, our model seeks to capture the intricate relationships between journey features and the likely cost of claims. In doing so, we intend to provide the insurance company with a powerful tool to adjust pricing based on the nuances of policyholders’ driving behavior.
Follow the upcoming sections to discover the details of our modeling approach, the achieved results, and the implications for more precise pricing of damage insurance contracts.
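As a minimal sketch of the scoring step, the example below fits a Poisson GLM on simulated data and predicts the expected cost for a few new contracts. The simulated data, the reduced set of predictors, and all object names (`train`, `new_contracts`, `pred_cost`) are assumptions made for illustration; they are not the study's data or final model.

```r
set.seed(42)

# Simulated training data (hypothetical), mimicking two of the rating factors
n <- 200
train <- data.frame(
  puissance = sample(4:12, n, replace = TRUE),
  zone      = factor(sample(c("A", "B", "C"), n, replace = TRUE),
                     levels = c("A", "B", "C"))
)
mu <- exp(6.5 - 0.03 * train$puissance -
          0.3 * (train$zone == "B") - 0.4 * (train$zone == "C"))
train$cost <- rpois(n, mu)

# Poisson GLM with log link, as in the models above
fit <- glm(cost ~ puissance + zone, family = poisson(link = "log"), data = train)

# New contracts must use the same column names and factor levels as the training data
new_contracts <- data.frame(
  puissance = c(5, 7, 10),
  zone      = factor(c("A", "C", "B"), levels = levels(train$zone))
)

# type = "link" returns the linear predictor; type = "response" returns the expected cost
pred_link <- predict(fit, newdata = new_contracts, type = "link")
pred_cost <- predict(fit, newdata = new_contracts, type = "response")

print(pred_cost)
```

With a log link, the response-scale prediction is simply the exponential of the linear predictor, so `pred_cost` equals `exp(pred_link)`.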
##
## Call:
## glm(formula = round(y_train) ~ puissance + factor(carburant) +
## factor(zone) + agevehicule + marque + region + ageconducteur +
## log(ageconducteur) + I(ageconducteur^2) + I(ageconducteur^3) +
## I(ageconducteur^4), family = poisson(link = "log"), data = X_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.784e+02 8.362e+00 45.254 < 2e-16 ***
## puissance -4.062e-02 2.143e-03 -18.958 < 2e-16 ***
## factor(carburant)E -2.306e-01 7.293e-03 -31.620 < 2e-16 ***
## factor(zone)B -7.347e-01 1.788e-02 -41.078 < 2e-16 ***
## factor(zone)C -5.933e-01 1.236e-02 -48.021 < 2e-16 ***
## factor(zone)D -3.069e-01 1.099e-02 -27.927 < 2e-16 ***
## factor(zone)E -4.369e-01 1.099e-02 -39.766 < 2e-16 ***
## factor(zone)F -6.115e-01 1.470e-02 -41.598 < 2e-16 ***
## agevehicule 3.712e-02 2.448e-03 15.163 < 2e-16 ***
## marque 1.382e-02 1.170e-03 11.806 < 2e-16 ***
## region -6.561e-03 1.098e-03 -5.978 2.26e-09 ***
## ageconducteur 2.096e+01 4.693e-01 44.669 < 2e-16 ***
## log(ageconducteur) -2.248e+02 5.042e+00 -44.597 < 2e-16 ***
## I(ageconducteur^2) -3.499e-01 7.873e-03 -44.443 < 2e-16 ***
## I(ageconducteur^3) 3.314e-03 7.548e-05 43.905 < 2e-16 ***
## I(ageconducteur^4) -1.272e-05 2.953e-07 -43.063 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 70022 on 143 degrees of freedom
## Residual deviance: 63158 on 128 degrees of freedom
## AIC: 64238
##
## Number of Fisher Scoring iterations: 5
## 1 2 3 4 5 6 7 8
## 514.8470 423.4333 807.8592 653.0120 688.1745 560.9412 623.7400 535.4638
## 9 10 11 12 13 14 15 16
## 577.6712 665.7525 665.7525 665.7525 1177.9360 693.6464 868.9974 761.3268
## 17 18 19 20 21 22 23 24
## 684.1723 709.8887 505.7045 518.4938 998.4163 783.2622 653.1340 666.7692
## 25 26 27 28 29 30 31 32
## 901.3655 635.9259 777.2931 1262.6729 1237.4653 450.4284 512.8170 879.3586
## 33 34 35
## 123.6644 628.7146 911.5023
## # A tibble: 35 × 8
## zone puissance agevehicule ageconducteur marque carburant densite region
## <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl>
## 1 C 7 0 56 12 D 93 13
## 2 C 10 10 42 12 D 93 13
## 3 A 5 4 31 3 D 21 8
## 4 E 5 1 58 12 E 52 0
## 5 B 7 8 22 2 E 26 0
## 6 F 7 5 39 12 E 11 0
## 7 E 6 3 39 12 D 22 11
## 8 E 6 6 36 1 D 24 12
## 9 A 12 9 69 12 E 73 13
## 10 E 7 7 26 12 E 53 1
## # ℹ 25 more rows
## 'data.frame': 35 obs. of 9 variables:
## $ garantie : chr "" "" "" "" ...
## $ zone : chr "" "" "" "" ...
## $ puissance : num 0 0 0 0 0 0 0 0 0 0 ...
## $ agevehicule : num 0 0 0 0 0 0 0 0 0 0 ...
## $ ageconducteur: num 0 0 0 0 0 0 0 0 0 0 ...
## $ marque : num 0 0 0 0 0 0 0 0 0 0 ...
## $ carburant : chr "" "" "" "" ...
## $ densite : num 0 0 0 0 0 0 0 0 0 0 ...
## $ region : num 0 0 0 0 0 0 0 0 0 0 ...
## [1] 4.994074e+02 8.317345e+02 9.613246e+02 5.842336e+02 7.824767e+02
## [6] 6.627201e+02 8.941041e+02 7.910521e+02 1.389241e+03 5.440081e+02
## [11] 7.444608e+02 8.388383e+02 1.471932e+03 6.202726e+02 8.887114e+02
## [16] 9.591204e+02 2.112077e+03 1.202928e+03 2.238176e+03 3.110020e+03
## [21] 9.114961e+02 6.744388e+02 2.223375e+03 9.519048e+02 5.077412e+03
## [26] 8.434790e+02 7.078886e+02 1.458576e+03 9.547458e+02 9.040387e+02
## [31] 2.226276e+05 2.496808e+01 3.023024e+03 1.216185e-01 2.757478e+03
## [1] "Claim Frequency GLM2, Test-Sample, Actual/Predicted: 9.32% "
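An Actual/Predicted figure of this kind is commonly obtained by dividing the total observed amount on the test sample by the total predicted amount. The report does not show how its ratio was computed, so the sketch below is only one plausible reading, using made-up numbers and hypothetical object names.

```r
# Hypothetical observed and predicted values for a small test sample
actual    <- c(500, 0, 1200, 300)
predicted <- c(600, 150, 1200, 550)

# Aggregate ratio: total observed divided by total predicted
ap_ratio <- sum(actual) / sum(predicted)

# Format as a percentage; 100% would mean the totals match exactly
ap_label <- sprintf("Actual/Predicted: %.2f%%", 100 * ap_ratio)
print(ap_label)
```

A ratio far below 100% means that the model predicts much more in total than was observed, which can be driven by a handful of extreme predictions, so it is worth inspecting the largest predicted values before drawing conclusions.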
The study covered data selection and cleaning, an exploratory analysis of the relationships between variables, and the modeling and prediction of claim costs for insurance contracts. The discussion stresses the role of journey-related factors in predicting claim costs for damage insurance contracts, and the need to understand and model these factors for accurate pricing. It also highlights the relevance of the selected independent variables and probability distributions in the modeling process. In conclusion, the study demonstrates how Generalized Linear Models (GLM) can be applied to calculate the pure premium in non-life insurance, giving insurance companies a basis for adjusting pricing to policyholders’ driving behavior and journey characteristics.