About me

Hassan OUKHOUYA
E-mail: hassan.oukhouya@um5r.ac.ma
ResearchGate: https://www.researchgate.net/profile/Hassan-Oukhouya
LinkedIn: https://www.linkedin.com/in/hassan-oukhouya-3901b816b/
ORCID iD: https://orcid.org/0000-0002-5058-2008
Upwork: https://www.upwork.com/services/product/time-series-analysis-with-python-or-r-studio-1449669530698514432?ref=project_share

Overview

Modeling pure premium using generalized linear models (GLM) is common in non-life insurance. GLMs are statistical models that allow the modeling of relationships between independent variables (rating factors) and a dependent variable (claim frequency or claim cost).
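For reference, the frequency and severity split mentioned above can be written as a short formula. This is the standard actuarial decomposition, stated under the usual assumption that the number of claims is independent of the individual claim amounts; the symbols $N$ (number of claims), $C$ (cost of a single claim), and $S$ (total claim cost) are notation introduced here for illustration only.

```latex
\[ S = \sum_{j=1}^{N} C_j, \qquad \text{Pure premium} = \mathbb{E}[S] = \mathbb{E}[N] \times \mathbb{E}[C] \]
```

Under this reading, one GLM can model the frequency term $\mathbb{E}[N]$ and another the cost term $\mathbb{E}[C]$, and their product gives the pure premium.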

Research Question

How can we calculate the pure premium using Generalized Linear Models (GLM) in the context of non-life insurance?

Objectives

The project aims to achieve the following:

Data Processing: Cleanse the data by addressing irregularities such as missing values and outliers.
Selection of Independent Variables: Identify relevant variables for the pricing process.
Choice of Probability Distributions: Select appropriate probability distributions to model claim frequency and claim cost.
Selection of Link Functions: Choose suitable link functions to connect independent variables to pure premium.
Construction of GLM Models: Utilize a statistical software of your choice to build the models.
Model Training and Parameter Estimation: Train the models to estimate coefficients.
Model Evaluation: Assess the quality of the models using performance indicators like the coefficient of determination ($R²$), Akaike Information Criterion (AIC), or other performance measures.
Selection of Relevant Models: Choose the most suitable models.
Pricing of a New Insurance Policy: Apply the models to price a new insurance policy.
Final Report: Prepare a report summarizing the process.

Importing libraries

The following R packages are required for this study:

Data selection and cleaning

This study uses two datasets, Contracts and Claims, described below:

Selection of variables

In this section, we explore the critical process of variable selection, a pivotal step in refining our dataset for modeling purposes. The careful consideration and identification of relevant variables lay the foundation for constructing robust models in our insurance analysis.

nocontrat (Number of Contract): This variable represents a unique identifier assigned to each insurance contract.
zone: The zone variable indicates the geographical area or zone associated with the insurance contract.
puissance (Power): Refers to the power or engine capacity of the insured vehicle.
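agevehicule (Vehicle Age): Represents the age of the insured vehicle, indicating how long it has been in use. Values between 0 and 15 appear in the raw data shown below, so the variable appears to be recorded in years rather than on a 1-to-4 scale.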
ageconducteur (Driver Age): This variable represents the age of the driver associated with the insurance contract.
marque (Brand): Marque refers to the brand or manufacturer of the insured vehicle.
carburant (Fuel Type): Indicates the type of fuel used by the insured vehicle, such as gasoline or diesel.
densite (Population Density): Represents the population density of the geographical area associated with the insurance contract.
region: The region variable denotes the specific region or location associated with the insurance contract.

no (Number): The variable “no” is a unique identifier assigned to each claim, providing a distinct reference number.
nocontrat (Number of Contract): This variable represents the unique identifier assigned to the insurance contract associated with the claim.
garantie (Coverage): Garantie refers to the type or level of coverage associated with the claim, indicating the specific insurance coverage provided.
cout (Cost): The variable “cout” represents the cost associated with the claim, indicating the financial amount attributed to the insurance claim.

After selecting the variables, we imported the data into R. The first and last six rows of each dataset are shown below:

## # A tibble: 6 × 9
##   nocontrat zone  puissance agevehicule ageconducteur marque carburant densite
##       <dbl> <chr>     <dbl>       <dbl>         <dbl>  <dbl> <chr>       <dbl>
## 1   4075803 E             6           3            48     12 E              11
## 2   4024277 D             4           2            32     12 E              31
## 3   4083980 B             4           1            74     12 E              26
## 4   4032035 B             5           1            33     12 D              53
## 5   4020011 E             4           1            50     12 D              54
## 6   4001738 E             5           9            37      5 E              31
## # ℹ 1 more variable: region <dbl>

## # A tibble: 6 × 9
##   nocontrat zone  puissance agevehicule ageconducteur marque carburant densite
##       <dbl> <chr>     <dbl>       <dbl>         <dbl>  <dbl> <chr>       <dbl>
## 1   4072339 D             8           2            39     12 D              82
## 2   4032828 D             7           1            38     12 E              22
## 3   4033173 A            11           1            49     12 D              82
## 4   4056926 C             8           0            58     12 D              26
## 5   4029746 E             4           1            30     12 E              11
## 6   4062808 D             4           0            42     12 E              93
## # ℹ 1 more variable: region <dbl>

## # A tibble: 6 × 4
##      no nocontrat garantie   cout
##   <dbl>     <dbl> <chr>     <dbl>
## 1     1   4075803 2DO       513. 
## 2     2   4075803 1RC         0  
## 3     3   4024277 2DO        89.7
## 4     4   4024277 1RC      1851. 
## 5     5   4083980 2DO       567. 
## 6     6   4032035 4BG       512.

## # A tibble: 6 × 4
##      no nocontrat garantie   cout
##   <dbl>     <dbl> <chr>     <dbl>
## 1   174   4072339 4BG       491. 
## 2   175   4032828 4BG       750. 
## 3   176   4033173 2DO      1357. 
## 4   177   4056926 2DO        97.6
## 5   178   4029746 4BG        58.6
## 6   179   4062808 2DO      1330.

Data structure

We now examine the structure of the Contracts and Claims datasets: their dimensions, their variables, and the data type of each variable.

## [1] 138   9

## tibble [138 × 9] (S3: tbl_df/tbl/data.frame)
##  $ nocontrat    : num [1:138] 4075803 4024277 4083980 4032035 4020011 ...
##  $ zone         : chr [1:138] "E" "D" "B" "B" ...
##  $ puissance    : num [1:138] 6 4 4 5 4 5 4 4 5 6 ...
##  $ agevehicule  : num [1:138] 3 2 1 1 1 9 4 15 0 4 ...
##  $ ageconducteur: num [1:138] 48 32 74 33 50 37 51 56 26 35 ...
##  $ marque       : num [1:138] 12 12 12 12 12 5 12 1 12 2 ...
##  $ carburant    : chr [1:138] "E" "E" "E" "D" ...
##  $ densite      : num [1:138] 11 31 26 53 54 31 11 11 22 31 ...
##  $ region       : num [1:138] 13 -1 13 0 13 11 12 -1 2 13 ...

The dataset “Contracts” has 138 rows and 9 columns, indicating that there are 138 observations (entries) and 9 variables in the dataset.

The variables of the dataset have different data types, including numerical (num) and character (chr) types. The variable nocontrat is of numerical type and seems to represent a unique identifier for each contract. The variable zone is of character type, indicating a categorical variable. The variables puissance, agevehicule, ageconducteur, marque, densite, and region are of numerical type. The variable carburant is of character type, suggesting a categorical variable indicating the type of fuel used.

## [1] 179   4

## tibble [179 × 4] (S3: tbl_df/tbl/data.frame)
##  $ no       : num [1:179] 1 2 3 4 5 6 7 8 9 10 ...
##  $ nocontrat: num [1:179] 4075803 4075803 4024277 4024277 4083980 ...
##  $ garantie : chr [1:179] "2DO" "1RC" "2DO" "1RC" ...
##  $ cout     : num [1:179] 512.7 0 89.7 1851.1 566.8 ...

The dataset “Claims” has 179 rows and 4 columns, indicating that there are 179 observations (entries) and 4 variables in the dataset.

The variables have different data types. The variable no is of numerical type and seems to represent a unique identifier for each claim. The variable nocontrat is of numerical type, suggesting a linkage to the contract number in the “Contracts” dataset. The variable garantie is of character type, indicating a categorical variable representing the type of coverage for each claim. The variable cout is of numerical type, representing the cost associated with each claim.

Missing values and processing

We analyzed and processed missing values for both the Contracts and Claims datasets:

## named integer(0)

## named integer(0)

No missing values were found in either the “Contracts” or “Claims” datasets, indicating complete data integrity for the observed variables.

##     nocontrat          zone     puissance   agevehicule ageconducteur 
##             0             0             0             0             0 
##        marque     carburant       densite        region 
##             0             0             0             0

##        no nocontrat  garantie      cout 
##         0         0         0         0

For the “Contracts” dataset, there are no missing values in any of the columns (nocontrat, zone, puissance, agevehicule, ageconducteur, marque, carburant, densite, region). Similarly, for the “Claims” dataset, there are no missing values in any of the columns (no, nocontrat, garantie, cout).

## character(0)

## character(0)

For the “Contracts” dataset, there are no column names with missing values, as indicated by the empty character vector (character(0)). Similarly, for the “Claims” dataset, there are no column names with missing values, as indicated by the empty character vector (character(0)).
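One way to produce checks of this shape is sketched below in base R. The data frame `contracts_demo` and its values are made up for illustration, and the exact calls used in the original analysis are not shown in this report, so this should be read as an assumption rather than the author's code.

```r
# Toy data frame standing in for the Contracts data (values are made up).
contracts_demo <- data.frame(
  nocontrat = c(1, 2, 3),
  zone      = c("A", "B", "C"),
  puissance = c(4, 6, 5)
)

# Number of missing values per column.
na_counts <- colSums(is.na(contracts_demo))
print(na_counts)

# Positions of the columns that contain missing values;
# this is an empty named vector when the data are complete.
print(which(na_counts > 0))

# Names of the columns with at least one missing value.
cols_with_na <- names(na_counts)[na_counts > 0]
print(cols_with_na)
```

With complete data, the per-column counts are all zero and the last two results are empty, which matches the shape of the outputs reported above.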

Outlier analysis and processing

Box-and-whisker plots (boxplots) represent statistical data visually: they show how the data are distributed and make extreme values easy to spot. We exclude the contract number, zone, brand, and fuel type from the Contracts data, and the variables no, nocontrat, and garantie from the Claims data, because they are identifiers or categorical variables for which outlier analysis is not meaningful, along with other variables that carry no information for this analysis. We can now check each remaining variable for outliers using boxplots:

According to the boxplots, the variables puissance, agevehicule, and cout have values beyond the whiskers, indicating the presence of outliers. The outliers can also be identified in code, as follows:

We will identify outliers for each variable in each data set:

## [1] 14

##  [1]  9 15 15 10 10 10 10  9 10 10  9

##  [1]  2735.14  4474.51 16129.91  2756.51  6539.97 19042.19  4158.45  3444.60
##  [9] 35193.25  3733.68 14289.32  5434.82

Variable puissance (power): One outlier observed with a value of 14.

Variable agevehicule (vehicle age): Multiple outliers observed with values 9, 15, 15, 10, 10, 10, 10, 9, 10, 10, and 9.

Variable cout (cost): Several outliers observed with values 2735.14, 4474.51, 16129.91, 2756.51, 6539.97, 19042.19, 4158.45, 3444.60, 35193.25, 3733.68, 14289.32, and 5434.82.
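Outlier lists like these can be obtained with the same rule a boxplot uses. The sketch below relies on base R's `boxplot.stats()`; the vector `puissance_demo` is made up for illustration and is not the study data.

```r
# Toy vector standing in for a numeric column (values are made up).
puissance_demo <- c(4, 5, 5, 6, 6, 6, 7, 7, 8, 14)

# boxplot.stats() flags points lying more than 1.5 * IQR beyond the hinges,
# i.e. the same points a default boxplot draws outside the whiskers.
outliers <- boxplot.stats(puissance_demo)$out
print(outliers)
```

Here the hinges are 5 and 7, so any value below 2 or above 10 is flagged, and the only flagged value is 14.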

Outlier processing

In this subsection, we treat outliers by linear interpolation to limit their impact on the analysis. Each flagged outlier is treated as a missing value and replaced by a point estimated linearly between neighbouring observations, which gives a more representative dataset for further exploration.
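A minimal sketch of this treatment in base R is given below. It assumes that interpolation runs over the row order and uses `approx()`; the original analysis may have used a different helper, and the vector `x` is made up for illustration.

```r
# Toy series standing in for a numeric column (values are made up).
x <- c(4, 5, 6, 14, 6, 7, 5)

# Step 1: mark the boxplot outliers as missing.
x_clean <- x
x_clean[x %in% boxplot.stats(x)$out] <- NA

# Step 2: fill the gaps by linear interpolation over the row index.
# rule = 2 carries the nearest observed value into leading/trailing gaps.
idx <- seq_along(x_clean)
observed <- !is.na(x_clean)
x_filled <- approx(idx[observed], x_clean[observed], xout = idx, rule = 2)$y
print(x_filled)
```

In this toy series the value 14 is flagged and replaced by 6, the value interpolated between its two neighbours. Interpolated values can themselves fall outside the new whiskers, which is why the check is repeated after the treatment.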

We apply the same linear-interpolation treatment to the puissance and agevehicule variables.

We then check whether any outliers remain.

When performing linear interpolation on the cout variable to address outliers, we identified missing values. Consequently, we opted to replace these missing values with the median. Following this, we conducted linear interpolation once again to address any remaining outliers.

Finally, we define a winsorize function, which replaces values below a lower limit or above an upper limit with those limits, and apply it to the cout variable using the 5th and 95th percentiles as limits.
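The function itself is not reproduced in this report, so the version below is only a sketch of what such a function might look like. The signature `winsorize(x, probs)` and the vector `cout_demo` are assumptions made for illustration.

```r
# Clamp values to the range spanned by two quantiles.
# The name and signature are assumptions for illustration.
winsorize <- function(x, probs = c(0.05, 0.95)) {
  limits <- quantile(x, probs = probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, limits[1]), limits[2])
}

# Toy claim costs with one extreme value (values are made up).
cout_demo <- c(0, 90, 240, 513, 567, 598, 750, 1128, 1851, 35193)
cout_w <- winsorize(cout_demo)
print(range(cout_w))
```

Unlike interpolation, winsorizing never creates missing values: every observation is kept, and only the values in the two tails are pulled back to the chosen percentiles.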

Check again for outliers:

## numeric(0)

## numeric(0)

## numeric(0)

Check again for missing values:

## character(0)

## named integer(0)

##        no nocontrat  garantie      cout 
##         0         0         0         0

Observations:

For the Contracts dataset, the variable puissance no longer has any identified outliers or missing values, as indicated by the empty numeric vector (numeric(0)).

Similarly, for the variable agevehicule in the Contracts dataset, no outliers or missing values are detected, as shown by the empty numeric vector (numeric(0)).

In the Claims dataset, the variable cout also exhibits no identified outliers or missing values, reflected by the empty numeric vector (numeric(0)).

Variables selection

After cleaning the data, we will select a subset of all the variables to continue the analysis. Nevertheless, the data cleaning performed in the previous section can be useful for further analysis of the whole data set.

## [1] 179   3

The first variable (no) in the Claims dataset carries no information; it is just a counter, so we remove it.

Exploratory Data Analysis (EDA)

We first convert specific columns in the “Claims” and “Contracts” datasets to factors, so that they are treated as categorical variables.

Next, we merge the two datasets, Claims and Contracts, on their common column nocontrat:

We then create a binary variable claimnb indicating whether a contract has a cost (claimed) or not (not claimed):
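The three steps above can be sketched in base R as follows. The rows are taken from the first records displayed earlier in this report, but the data frame names and the use of `merge()` are assumptions; the original code is not shown and may rely on other functions.

```r
# A few rows copied from the heads of the two datasets shown earlier.
claims_demo <- data.frame(
  nocontrat = c(4075803, 4075803, 4024277),
  garantie  = c("2DO", "1RC", "2DO"),
  cout      = c(512.74, 0, 89.70)
)
contracts_demo <- data.frame(
  nocontrat = c(4075803, 4024277),
  zone      = c("E", "D"),
  carburant = c("E", "E")
)

# Categorical columns as factors.
claims_demo$garantie     <- factor(claims_demo$garantie)
contracts_demo$zone      <- factor(contracts_demo$zone)
contracts_demo$carburant <- factor(contracts_demo$carburant)

# One row per claim, with the contract attributes attached.
merged_demo <- merge(claims_demo, contracts_demo, by = "nocontrat")

# claimnb = 1 when the cost is strictly positive, 0 otherwise.
merged_demo$claimnb <- as.integer(merged_demo$cout > 0)
print(merged_demo)
```

The merged table keeps one row per claim, which is consistent with the 179 rows reported for the merged data, and the claim with a cost of 0 receives claimnb = 0.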

Descriptive statistics

The first step in exploratory analysis is to compute descriptive statistics that summarize the information contained in the data.

##    nocontrat       garantie      cout        zone     puissance     
##  Min.   :   1545   1RC:40   Min.   :   0.0   A:20   Min.   : 4.000  
##  1st Qu.:4014166   2DO:79   1st Qu.: 240.0   B:10   1st Qu.: 5.000  
##  Median :4036392   3VI: 1   Median : 598.0   C:36   Median : 6.000  
##  Mean   :3747634   4BG:57   Mean   : 683.8   D:47   Mean   : 6.207  
##  3rd Qu.:4056100   5CO: 2   3rd Qu.:1128.1   E:50   3rd Qu.: 7.000  
##  Max.   :4086670            Max.   :1901.0   F:16   Max.   :10.000  
##   agevehicule    ageconducteur       marque     carburant    densite     
##  Min.   :0.000   Min.   :22.00   Min.   : 1.0   D:90      Min.   :11.00  
##  1st Qu.:1.000   1st Qu.:34.00   1st Qu.:12.0   E:89      1st Qu.:11.00  
##  Median :1.000   Median :44.00   Median :12.0             Median :31.00  
##  Mean   :1.877   Mean   :44.76   Mean   :10.7             Mean   :47.25  
##  3rd Qu.:3.000   3rd Qu.:55.00   3rd Qu.:12.0             3rd Qu.:82.00  
##  Max.   :6.000   Max.   :84.00   Max.   :14.0             Max.   :94.00  
##      region          claimnb     
##  Min.   :-1.000   Min.   :0.000  
##  1st Qu.: 6.000   1st Qu.:1.000  
##  Median :13.000   Median :1.000  
##  Mean   : 9.637   Mean   :0.905  
##  3rd Qu.:13.000   3rd Qu.:1.000  
##  Max.   :13.000   Max.   :1.000

nocontrat: The nocontrat variable represents unique identifiers for contracts. The values range from 1545 to 4086670, with a mean of 3747634. The distribution shows that contracts span a wide range of identifier values.

zone: zone is a categorical variable indicating different geographical zones. The summary provides counts and unique values for each zone.

puissance: puissance represents the power or engine capacity of insured vehicles. Values range from 4 to 10, with a mean of 6.207. The distribution indicates a variation in vehicle power across contracts.

agevehicule: agevehicule denotes the age of insured vehicles. The minimum age is 0, likely indicating new vehicles, and the maximum is 6. The mean age is 1.877, suggesting a relatively young vehicle portfolio.

ageconducteur: ageconducteur represents the age of the drivers associated with contracts. Driver ages vary from 22 to 84, with a mean of 44.76. The distribution suggests a diverse age range among drivers.

marque: marque is a categorical variable indicating the brand or manufacturer of insured vehicles. The summary provides counts and unique values for each brand.

carburant: carburant represents the type of fuel used by insured vehicles. The summary provides counts and unique values for each fuel type.

densite: densite indicates the population density of the geographical area associated with contracts. Population density values range from 11 to 94, with a mean of 47.25. The distribution reflects variability in the density of the regions.

region: region represents specific regions associated with contracts. Region values range from -1 to 13, with a mean of 9.637. The distribution indicates contracts are spread across various regions.

garantie: garantie is a categorical variable indicating different types of coverage for claims. The summary provides counts and unique values for each type of coverage.

cout: cout represents the cost associated with each claim. Claim costs range from 0.0 to 1901.0, with a mean of 683.8. The distribution indicates variability in claim costs, with a significant range.

An overview of claims costs with other variables

To get a first impression of the claims costs of the contracts, the important features and their interactions, we start our analysis with a small binary decision tree of depth 3.

The tree is designed to analyze interactions between various variables and their influence on claim costs. This binary decision tree, with a depth of 3, is split based on the following key variables: “garantie,” “puissance,” “agevehicule,” “zone,” and “densite.”

The root node, representing the entire dataset, displays a claim frequency of 10%, associated costs, and a proportion of 100%. The initial split, determined by the metric of gini impurity, separates the root node into two more homogeneous groups defined by garantie = 4BG,5CO. Policies meeting this condition are displayed on the left, while those for cars with a vehicle age greater than or equal to one year are shown on the right. Notably, 67% of the cars are at least one year old (agevehicule >= 1), and those with densite < 17 account for 32%. Further refinement, based on puissance < 7 and zone = A,E, narrows this down to 9%.

Conversely, the claim rate on the left side of the decision tree is only 7%, contingent on puissance >= 6 and the age of the vehicle. These specific variables will be scrutinized more closely in the subsequent section.

Univariate analysis of variables

In this subsection, we conduct a univariate analysis of individual variables. Examining each variable in isolation, we aim to reveal its distribution, characteristics, and statistical properties. This analysis provides essential insights into patterns, trends, and outliers, laying the groundwork for a comprehensive understanding of our data and guiding further exploration.

The “Overall” plot displays total claims costs across zones, with each bar color-coded by zone. The second plot appears to illustrate claims costs for zones where costs exceed zero, which helps identify zones with higher claims costs. The two figures show that zone E has the highest claims cost, followed by D, C, A, F, and B. Among fuel types, diesel has the highest claims cost, followed by gasoline.

The histograms above show that car make 12 has the highest claims costs among contracts, while the costs for the other makes are close to zero. By vehicle age, 1-year-old vehicles account for the highest claims costs, followed by vehicles aged 0, 2, and 4 years.

In this analysis, it is evident that the insurance type most advantageous for insurers is Personal Damage Cover (2DO), ranking first. Glass Breakage Cover (4BG) follows in second place and Third Party Liability (1RC) in third. Fire and Collision cover (3VI) and Collision (5CO) show minimal prevalence, close to zero. In terms of horsepower, most contracts involve cars with a horsepower of 4, followed by 7, with 6, 5, and 8 at almost equal levels.

We can see that the age distribution of drivers peaks in the (40, 50] age range, followed by the (30, 40] and (50, 60] age ranges. Claims costs, on the other hand, have a distribution whose shape resembles a Poisson distribution.

Bivariate analysis of variables

Bivariate analysis explores relationships between pairs of variables, revealing patterns and correlations. It helps uncover how changes in one variable relate to changes in another, providing insights through statistical techniques and visualizations for a more nuanced understanding of the data.

Plot claim frequencies with other variables

Other intriguing interactions appear to be present:

The code conducts a bivariate analysis of “Vehicle Power” and “Claim Frequency,” considering different “Fuel Types.” It generates a scatter plot with a smooth curve, illustrating the relationship between power, claim frequency, and fuel type. The larger font sizes in the title and axes enhance readability. The analysis aims to reveal patterns and correlations in the data, offering insights into the interplay between vehicle characteristics and insurance claim frequencies. In addition, the figure on the right shows that the coverage type that interacts most strongly with the cost and number of claims is damage cover.

In this analysis, we explore the claim frequency in automobile insurance, focusing on the impact of car brand, vehicle age, and fuel type. We categorize vehicles based on age, considering those aged seven or more as a single group. After summarizing the claim frequency and cost, we filter out groups with less than 100 costs. The resulting scatter plot reveals the relationship between claim frequency and car brand, color-coded by vehicle age group, with separate facets for different fuel types. Larger points represent higher claim frequencies, and the analysis provides insights into the interplay of these factors in insurance claims.

From the two graphs above, we examine the interactions between car age and driver age, on the one hand, and vehicle make, on the other. To avoid the noisy effects of tiny groups, all cars over twenty years old belong to the same group. It is clear from the two graphs that there is an interaction between claims frequency, driver age, and car brand. Outliers are observed in the graph depicting the interaction between the age of the driver and the car brand. This is explained by the presence of individuals in certain age groups owning luxury cars, leading to extreme data points.

Correlation analysis

Correlation analysis is a crucial step in understanding relationships between variables within a dataset. In this section, we perform correlation analysis on the insurance dataset, focusing on both numerical and categorical variables.

Correlation analysis between numerical variables

We delve into the relationships between numerical variables such as Contract number, Cost, Power, Vehicle age, Driver age, Brand, Density, Region, and claim number. Utilizing correlation coefficients, scatter plots, and other visualization techniques, we aim to uncover patterns and dependencies among these quantitative features.

## The dataset has 9 numeric Variables and  3 factor variables

The output indicates that there are 9 numeric variables and 3 factor variables in the dataset.

From the correlation plot of the numerical variables, we can see that driver age and region are strongly positively correlated, followed by cost and number of claims. On the other hand, there is a negative correlation between the make and age of the vehicle and density. This correlation analysis of the numerical variables provides information for the model we will build later.

When the p-value is below 0.05, we reject the null hypothesis of no correlation and conclude that the relationship between the two variables is statistically significant at the 5% level. As we can see, this is the case among the variables cost, power, vehicle age, driver age, density, number of claims, etc.
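The decision rule above can be illustrated with base R's `cor.test()`. The data below are simulated and are not the study data; the variable names and the seed are arbitrary choices made for this sketch.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# Simulated stand-ins for two numeric columns (not the study data).
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

# Pearson correlation test; H0: the true correlation is zero.
res <- cor.test(x, y)
print(res$estimate)
print(res$p.value)

# Reject H0 at the 5% level when the p-value is below 0.05.
significant <- res$p.value < 0.05
print(significant)
```

A small p-value says only that a zero correlation is implausible; the strength of the relationship is read from the estimated coefficient, not from the p-value.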

Correlation analysis between categorical variables

This subsubsection explores the interplay between categorical and numerical variables like car brand, fuel type, and warranty coverage. Through appropriate statistical measures and visualizations, we seek to discern any associations or dependencies between these discrete attributes.

## tibble [179 × 12] (S3: tbl_df/tbl/data.frame)
##  $ nocontrat    : num [1:179] 4075803 4075803 4024277 4024277 4083980 ...
##  $ garantie     : num [1:179] 4 5 4 5 4 2 4 5 2 4 ...
##  $ cout         : num [1:179] 512.7 0 89.7 1851.1 566.8 ...
##  $ zone         : num [1:179] 2 2 3 3 5 5 2 2 2 1 ...
##  $ puissance    : num [1:179] 6 6 4 4 4 5 4 4 5 4 ...
##  $ agevehicule  : num [1:179] 3 3 2 2 1 1 1 1 0 4 ...
##  $ ageconducteur: num [1:179] 48 48 32 32 74 33 50 50 37 51 ...
##  $ marque       : num [1:179] 12 12 12 12 12 12 12 12 5 12 ...
##  $ carburant    : num [1:179] 1 1 1 1 1 0 0 0 1 1 ...
##  $ densite      : num [1:179] 11 11 31 31 26 53 54 54 31 11 ...
##  $ region       : num [1:179] 13 13 -1 -1 13 0 13 13 11 12 ...
##  $ claimnb      : int [1:179] 1 0 1 1 1 1 1 1 1 1 ...

The correlation plot shows a strong correlation between the following pairs of variables: driver age and region, cost and number of claims, and zone and density. On the other hand, there is a weak negative correlation between zone and fuel, as well as between the make and age of the vehicle.

When the p-value is below 0.05, we conclude that the relationship between the categorical and numerical variables concerned is statistically significant at the 5% level. As we can see, this is the case among the variables cost, fuel, warranty, power, vehicle age, driver age, density, number of claims, etc.

Distribution of Variables and Density Plot

We focus on exploring the distribution of variables and density plots to gain insights into the frequency and cost of insurance claims. The key objective is to select appropriate probability distributions that effectively model the statistical patterns inherent in the data. This step is crucial for building accurate and robust models to understand and predict claim behavior.

The analysis reveals that the numerical variables, such as driver age, claims cost, vehicle age, and horsepower, exhibit a distribution pattern resembling the Poisson distribution. In contrast, the density of other variables, including the number of claims, regions, and car makes, aligns more closely with a negative binomial probability distribution.

Most claims are concentrated in a dense region of the insurance cost scale. Toward the top of this scale, claims become sparser while their costs grow higher and higher; the density plot delineates this zone clearly. By coverage type, damage cover (2DO) accounts for the largest mass on the insurance cost scale, followed by the 1RC and 4BG cover types.

Modeling

Modeling pure premium in insurance is crucial for estimating future claim costs. Generalized Linear Models (GLM) offer a powerful approach by incorporating predictors, using a link function, and specifying a distribution. Key steps include data exploration, model specification, parameter estimation, validation, interpretation, and predictive analytics. GLMs provide flexibility for understanding risk factors and optimizing insurance portfolios.

\[ g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_k X_k \]

Where:

$g(\mu)$ is the link function, connecting the linear combination of predictors to the expected value of the response variable.
$\mu$ is the mean of the response variable.
$\beta_0$ is the intercept.
$\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients associated with the predictor variables $X_1, X_2, \ldots, X_k$.

The link function $g(\mu)$ maps the mean of the response onto the scale of the linear predictor; its inverse $g^{-1}$ maps the linear combination back to the scale of the response variable. Common link functions include:

Identity link ($g(\mu) = \mu$) for Gaussian distribution.
Logit link ($g(\mu) = \log\left(\frac{\mu}{1-\mu}\right)$) for binary data.
Log link ($g(\mu) = \log(\mu)$) for Poisson distribution.
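These three links are available in base R through family objects, which expose both the link and its inverse. The sketch below simply evaluates them at an arbitrary mean value; the value of `mu` and the object names are chosen for illustration only.

```r
# Family objects in base R expose the link and its inverse.
fam_gauss <- gaussian(link = "identity")
fam_binom <- binomial(link = "logit")
fam_pois  <- poisson(link = "log")

mu <- 0.25
eta_logit <- fam_binom$linkfun(mu)   # log(0.25 / 0.75)
eta_log   <- fam_pois$linkfun(mu)    # log(0.25)

# linkinv() maps the linear predictor back to the mean scale.
print(fam_binom$linkinv(eta_logit))
print(fam_pois$linkinv(eta_log))
print(fam_gauss$linkfun(mu))
```

Each inverse link returns the original mean of 0.25, and the identity link leaves the mean unchanged.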

Data preparation

This step is part of the data preparation process for modeling insurance costs. The goal is to predict insurance costs (cout) based on various independent variables such as contract number, insurance coverage type, geographical zone, car power, vehicle age, driver age, car brand, and fuel type. The code uses the tidyverse and caret libraries for data manipulation and modeling.

The target variable (cost) is assigned to the dataframe y, while the independent variables are assigned to the dataframe X, excluding the cost variable to prevent data leakage during prediction. The dataset is then split into training and validation (test) sets, with $80\%$ for training and $20\%$ for validation, consistent with the split reported below.

This step is essential to ensure a clean and suitable dataset for building an accurate predictive model for insurance costs.

Split training and validation

Train-Test Split:

Training Set: $80\%$ ($\approx 144\; \text{obs.}$)
Test Set: $20\%$ ($\approx 35\; \text{obs.}$)
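A split of this kind can be sketched in base R as follows. The seed and the use of `sample()` are assumptions made for illustration; the report mentions the caret package, whose partitioning helper may return slightly different set sizes.

```r
set.seed(123)  # arbitrary seed, for reproducibility only
n <- 179       # number of rows in the merged data

# Randomly draw 80% of the row indices for training.
train_idx <- sample(seq_len(n), size = round(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

print(length(train_idx))
print(length(test_idx))
```

With 179 rows, this rounding rule gives 143 training rows and 36 test rows, close to the approximate counts listed above.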

We consider two specifications for modeling the relationship between the independent variables (pricing factors) and a dependent variable: claims frequency (claimnb) and claims cost (cout).

\[X=\{x_1, x_2, x_3, x_4, x_5, x_6, x_7, x_8, x_9, x_{10}\}\] where $x_1=\text{garantie}$, $x_2=\text{zone}$, $x_3=\text{puissance}$, $x_4=\text{agevehicule}$, $x_5=\text{ageconducteur}$, $x_6=\text{marque}$, $x_7=\text{carburant}$, $x_8=\text{densite}$, $x_9=\text{region}$, $x_{10}=\text{claimnb}$.

\[Y=\text{cout}\]

##  [1] "nocontrat"     "garantie"      "cout"          "zone"         
##  [5] "puissance"     "agevehicule"   "ageconducteur" "marque"       
##  [9] "carburant"     "densite"       "region"        "claimnb"

## [1]  512.74    0.00   89.70 1851.11  566.84  512.48

## # A tibble: 6 × 11
##   nocontrat garantie zone  puissance agevehicule ageconducteur marque carburant
##       <dbl> <fct>    <fct>     <dbl>       <dbl>         <dbl>  <dbl> <fct>    
## 1   4075803 2DO      E             6           3            48     12 E        
## 2   4075803 1RC      E             6           3            48     12 E        
## 3   4024277 2DO      D             4           2            32     12 E        
## 4   4024277 1RC      D             4           2            32     12 E        
## 5   4083980 2DO      B             4           1            74     12 E        
## 6   4032035 4BG      B             5           1            33     12 D        
## # ℹ 3 more variables: densite <dbl>, region <dbl>, claimnb <int>

Build models

Poisson Regression Modeling

Poisson Regression operates similarly to ordinary linear regression, with the key distinction that it assumes the response variable (cout) follows a Poisson distribution whose mean depends on the predictors through a log link. Because the Poisson distribution is defined on non-negative integers, the cost is rounded before fitting (see the round(y_train) term in the model call below). It is one of the GLM families applied in this project:

\[ Y_i \stackrel{ind}{\sim} \text{Pois}(\lambda_i), \qquad \log(\lambda_i) = \beta_0 + \beta_1 X_{i1} + \ldots + \beta_k X_{ik} \]

This model is selected primarily as an initial demonstration of a GLM for predicting claim severity. However, it’s worth noting that more sophisticated alternatives might be more suitable for this purpose.
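As a minimal sketch of how such a model is fitted in R, the example below simulates count data with a single rating factor and fits a Poisson GLM with `glm()`. The simulated data, the coefficients 0.5 and 0.2, and the seed are assumptions made for illustration; they are unrelated to the study data and to the model reported later.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Simulated data (not the study data): one numeric rating factor.
n <- 200
puissance <- sample(4:10, n, replace = TRUE)
lambda <- exp(0.5 + 0.2 * puissance)   # log link: log(lambda) is linear
y <- rpois(n, lambda)

# Poisson GLM with the log link.
fit <- glm(y ~ puissance, family = poisson(link = "log"))
print(coef(fit))
print(AIC(fit))

# Predicted mean for a new observation, on the response scale.
new_obs <- data.frame(puissance = 6)
pred <- predict(fit, newdata = new_obs, type = "response")
print(pred)
```

Coefficients are on the log scale, so exponentiating a coefficient gives the multiplicative effect of a one-unit increase in the corresponding factor on the expected response.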

Binomial Regression Modeling

Binomial Regression functions analogously to ordinary linear regression, with a crucial difference: it assumes a binomial distribution for the response variable (cout), with a success probability that depends on the predictors. It is one of the GLM families employed in this project:

\[ Y_i \stackrel{ind}{\sim} \text{Binomial}(n, p_i) \]

The choice of this model is made as an initial exploration into GLMs for predicting claim severity. Nevertheless, it’s important to recognize that more advanced alternatives might offer better suitability for this particular task.

Gaussian Regression Modeling

Gaussian Regression functions similarly to ordinary linear regression, assuming that the response variable (cout) follows a Gaussian distribution whose mean depends on the predictors. This model is part of the broader family of GLMs used in this project:

\[ Y_i \stackrel{ind}{\sim} \mathcal{N}(\mu_i, \sigma^2) \]

In this context, $\mu_i$ represents the mean of the Gaussian distribution for observation $i$, and $\sigma^2$ is the common variance. Gaussian Regression is chosen as it provides a framework for predicting continuous outcomes. However, the appropriateness of this choice should be critically assessed, as alternative models might better capture the underlying distribution of the data.

Fit a model with intercept and all factors

We fit a GLM for each family so that no suitable distribution family is overlooked. The regression models for each family are given below:

Poisson family

\[\text{1st Model: }\; Y = \text{Log}(y_{\text{train}}) \sim \hat{\beta}_0 + \hat{\beta}_1 \times \text{claimnb} + \hat{\beta}_2 \times \text{puissance} + \hat{\beta}_3 \times \text{garantie}_{\scriptsize{2DO}} + \hat{\beta}_4 \times \text{garantie}_{\scriptsize{3VI}} + \hat{\beta}_5 \times \text{garantie}_{\scriptsize{4BG}} \] \[+ \hat{\beta}_6 \times \text{garantie}_{\scriptsize{5CO}}+ \hat{\beta}_7 \times \text{carburant}_{\scriptsize{E}} + \hat{\beta}_8 \times \text{zone}_{\scriptsize{B}} + \hat{\beta}_9 \times \text{zone}_{\scriptsize{C}}\] \[+ \hat{\beta}_{10} \times \text{zone}_{\scriptsize{D}} + \hat{\beta}_{11} \times \text{zone}_{\scriptsize{E}} + \hat{\beta}_{12} \times \text{zone}_{\scriptsize{F}}+ \hat{\beta}_{13} \times \text{agevehicule}\] \[+ \hat{\beta}_{14} \times \text{marque} + \hat{\beta}_{15} \times \text{region} + \hat{\beta}_{16} \times \text{ageconducteur}\]

where,

$\hat{\beta}_0$, $\hat{\beta}_1$, $\hat{\beta}_2$, $\hat{\beta}_3$, $\ldots$, $\hat{\beta}_{16}$ are the coefficients estimated by the model.
The model uses a Poisson family with a logarithmic link function. The coefficients are estimated using the given data $X_{\text{train}}$ with the family specified as poisson and the link function specified as $\log$.
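As a rough illustration of this kind of fit, the sketch below simulates a small data set and fits a Poisson GLM with a log link using `glm()`. The data frame `toy` and the columns `power` and `fuel` are hypothetical stand-ins for X_train, puissance and carburant; the true coefficients are chosen arbitrarily and are not taken from the study.

```r
# Minimal sketch on simulated data (not the study data).
# `toy`, `power` and `fuel` are hypothetical names.
set.seed(1)
n <- 5000
toy <- data.frame(
  power = sample(4:12, n, replace = TRUE),
  fuel  = factor(sample(c("D", "E"), n, replace = TRUE))
)

# True log-mean: 1.0 + 0.05 * power - 0.2 * (fuel == "E")
mu <- exp(1.0 + 0.05 * toy$power - 0.2 * (toy$fuel == "E"))
toy$y <- rpois(n, lambda = mu)

# Poisson family with a log link, as in the models of this section
fit <- glm(y ~ power + fuel, family = poisson(link = "log"), data = toy)

# Estimates should land close to the true values 1.0, 0.05 and -0.2
round(coef(fit), 3)
```

Because the data are generated from the assumed model, the estimates recover the true coefficients up to sampling noise; real claim data need not behave this way.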

## 
## Call:
## glm(formula = round(y_train) ~ claimnb + puissance + factor(garantie) + 
##     factor(carburant) + factor(zone) + agevehicule + marque + 
##     region + ageconducteur, family = poisson(link = "log"), data = X_train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -1.234e+01  7.283e+01  -0.169    0.865    
## claimnb              1.918e+01  7.283e+01   0.263    0.792    
## puissance            1.077e-02  2.023e-03   5.324 1.01e-07 ***
## factor(garantie)2DO -2.030e-01  7.758e-03 -26.163  < 2e-16 ***
## factor(garantie)3VI  4.800e-01  2.530e-02  18.968  < 2e-16 ***
## factor(garantie)4BG -9.123e-01  1.058e-02 -86.222  < 2e-16 ***
## factor(garantie)5CO -6.867e-01  3.113e-02 -22.062  < 2e-16 ***
## factor(carburant)E  -1.116e-01  7.225e-03 -15.443  < 2e-16 ***
## factor(zone)B       -1.732e-01  1.891e-02  -9.161  < 2e-16 ***
## factor(zone)C       -1.224e-01  1.265e-02  -9.679  < 2e-16 ***
## factor(zone)D       -7.067e-02  1.127e-02  -6.269 3.64e-10 ***
## factor(zone)E       -1.671e-01  1.091e-02 -15.323  < 2e-16 ***
## factor(zone)F       -5.060e-01  1.462e-02 -34.611  < 2e-16 ***
## agevehicule          2.190e-02  2.514e-03   8.710  < 2e-16 ***
## marque               1.831e-02  1.168e-03  15.668  < 2e-16 ***
## region              -1.088e-02  1.017e-03 -10.697  < 2e-16 ***
## ageconducteur        3.092e-03  3.488e-04   8.865  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70022  on 143  degrees of freedom
## Residual deviance: 33889  on 127  degrees of freedom
## AIC: 34971
## 
## Number of Fisher Scoring iterations: 10

Poisson Regression Model for Cost of Insurance (Predictors and Estimated Effects)

Number of Claims (claimnb): The estimate (19.18) has a standard error of 72.83 and a p-value of 0.792, so its effect on the expected cost is not statistically significant in this fit.
Power of the Vehicle (puissance): A higher power is associated with a significant increase in the expected cost (coefficient 0.0108).
Insurance Guarantee Levels (factor(garantie)): The guarantee levels differ significantly from the reference level (1RC): 3VI raises the expected cost, while 2DO, 4BG and 5CO lower it.
Fuel Type E (factor(carburant)E): Fuel type E is associated with a significant decrease in the expected cost relative to the reference fuel type (coefficient -0.112).
Geographic Zones (factor(zone)): Zones B to F all have a significantly lower expected cost than the reference Zone A.
Age of the Vehicle (agevehicule): Each additional year of vehicle age significantly increases the expected cost (coefficient 0.0219).
Vehicle Brand (marque): Each unit increase in the numeric brand code significantly raises the expected cost (coefficient 0.0183).
Geographic Region (region): Each unit increase in the numeric region code significantly decreases the expected cost (coefficient -0.0109).
Age of the Driver (ageconducteur): Each additional year of driver age significantly raises the expected cost (coefficient 0.0031).

P-values

Except for the intercept (p = 0.865) and claimnb (p = 0.792), the p-values associated with the coefficients are below the commonly used significance level of 0.05, indicating that those predictors are statistically significant.
For these predictors, the estimated effects on the cost of insurance are unlikely to be due to random chance, provided the Poisson assumption holds.

Model Fit Statistics

Null Deviance: 70022 (Deviance with no predictors)
Residual Deviance: 33889 (Deviance after fitting the model)
AIC: 34971 (Model fit measure penalizing complexity)
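The objectives mention the coefficient of determination as a performance indicator. For a GLM, one common analogue is the share of the null deviance removed by the predictors. The sketch below computes it from the two deviances reported above. The variable names are illustrative, and the measure is only one of several pseudo R-squared definitions.

```r
# Values copied from the summary output of the first model.
null_dev <- 70022
res_dev  <- 33889

# Share of the null deviance removed by the predictors
# (a deviance-based pseudo R-squared).
dev_explained <- 1 - res_dev / null_dev
round(dev_explained, 3)
```

With a fitted object, the same quantity could be obtained as `1 - fit$deviance / fit$null.deviance`, where `fit` is a hypothetical name for the result of `glm()`.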

Mathematically, the fitted equation is:

\[\text{1st Model: }\; Y = \text{Log}(y_{\text{train}}) \sim -12.34 + 19.18 \times \text{claimnb} + 0.011 \times \text{puissance} - 0.203 \times \text{garantie}_{\scriptsize{2DO}}\] \[+ 0.48 \times \text{garantie}_{\scriptsize{3VI}} - 0.912 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.687 \times \text{garantie}_{\scriptsize{5CO}} - 0.112 \times \text{carburant}_{\scriptsize{E}} - 0.173 \times \text{zone}_{\scriptsize{B}} - 0.122 \times \text{zone}_{\scriptsize{C}}\] \[- 0.071 \times \text{zone}_{\scriptsize{D}} - 0.167 \times \text{zone}_{\scriptsize{E}} - 0.506 \times \text{zone}_{\scriptsize{F}}\]

\[+ 0.022 \times \text{agevehicule} + 0.018 \times \text{marque} - 0.011 \times \text{region} + 0.003 \times \text{ageconducteur}\]

## 
## Call:
## glm(formula = round(y_train) ~ puissance + factor(garantie) + 
##     factor(carburant) + factor(zone) + agevehicule + marque + 
##     region + ageconducteur, family = poisson(link = "log"), data = X_train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          7.1200639  0.0224721 316.840   <2e-16 ***
## puissance           -0.0290553  0.0020425 -14.226   <2e-16 ***
## factor(garantie)2DO -0.1098963  0.0076887 -14.293   <2e-16 ***
## factor(garantie)3VI  0.7220859  0.0251379  28.725   <2e-16 ***
## factor(garantie)4BG -0.6905203  0.0105927 -65.188   <2e-16 ***
## factor(garantie)5CO -0.3651179  0.0310821 -11.747   <2e-16 ***
## factor(carburant)E  -0.1947441  0.0072050 -27.029   <2e-16 ***
## factor(zone)B       -0.3337035  0.0188215 -17.730   <2e-16 ***
## factor(zone)C       -0.3288956  0.0126607 -25.978   <2e-16 ***
## factor(zone)D       -0.3150702  0.0110601 -28.487   <2e-16 ***
## factor(zone)E       -0.4267584  0.0109482 -38.980   <2e-16 ***
## factor(zone)F       -0.4919245  0.0146491 -33.581   <2e-16 ***
## agevehicule          0.0249593  0.0024006  10.397   <2e-16 ***
## marque               0.0161861  0.0011069  14.623   <2e-16 ***
## region               0.0018519  0.0010169   1.821   0.0686 .  
## ageconducteur       -0.0003913  0.0003439  -1.138   0.2553    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70022  on 143  degrees of freedom
## Residual deviance: 59191  on 128  degrees of freedom
## AIC: 60272
## 
## Number of Fisher Scoring iterations: 5

Coefficients Interpretation

Intercept (7.12) - The log of the expected cost when all numeric predictors are zero and every factor is at its reference level; exp(7.12) is about 1,237. - The large z-value and very small p-value indicate that it is highly significant.

Puissance (-0.0291) - A one-unit increase in puissance lowers the log of the expected cost by 0.0291, which multiplies the expected cost by exp(-0.0291) ≈ 0.971 (a decrease of about 2.9%). - Highly significant with a low p-value.

Factor levels of Garantie (2DO, 3VI, 4BG, 5CO) - Each factor level represents a category of the garantie variable. - The coefficients are changes in the log of the expected cost relative to the reference category (1RC). - All levels are highly significant.

Factor levels of Carburant (E) - The change in the log of the expected cost for the “E” category relative to the reference category. - Highly significant with a low p-value.

Factor levels of Zone (B, C, D, E, F) - Same interpretation as above, with each level compared to the reference “A”. - All are highly significant.

Agevehicule (0.0250) - A one-unit increase in agevehicule multiplies the expected cost by exp(0.0250) ≈ 1.025 (an increase of about 2.5%). - Highly significant.

Marque (0.0162) - A one-unit increase in marque multiplies the expected cost by exp(0.0162) ≈ 1.016 (an increase of about 1.6%). - Highly significant.

Region (0.00185) - A one-unit increase in region is associated with an increase of about 0.19% in the expected cost. - Not significant at the 5% level (p-value = 0.0686), only at the 10% level.

Ageconducteur (-0.000391) - A one-unit increase in ageconducteur is associated with a decrease of about 0.04% in the expected cost. - Not statistically significant (p-value = 0.2553).
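With a log link, a coefficient acts multiplicatively on the expected response, so exponentiating it gives the relative change per unit of the predictor. The sketch below applies this to the numeric predictors of the second model. The estimates are copied from the summary above, and the vector names are illustrative.

```r
# Estimates copied from the summary of the second model.
est <- c(puissance     = -0.0290553,
         agevehicule   =  0.0249593,
         marque        =  0.0161861,
         region        =  0.0018519,
         ageconducteur = -0.0003913)

# exp(beta) is the multiplicative change in the expected response
# for a one-unit increase in the predictor, other terms held fixed.
pct_change <- 100 * (exp(est) - 1)
round(pct_change, 2)
```

For small coefficients, the percentage change is close to 100 times the coefficient itself, which is why the two are easily confused.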

Mathematically, the fitted equation is:

\[\text{2nd Model: }\; Y = \text{Log}(y_{\text{train}}) \sim 7.12 - 0.029 \times \text{puissance} - 0.11 \times \text{garantie}_{\scriptsize{2DO}} + 0.722 \times \text{garantie}_{\scriptsize{3VI}} - 0.691 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.365 \times \text{garantie}_{\scriptsize{5CO}} - 0.195 \times \text{carburant}_{\scriptsize{E}} - 0.334 \times \text{zone}_{\scriptsize{B}} - 0.329 \times \text{zone}_{\scriptsize{C}} - 0.315 \times \text{zone}_{\scriptsize{D}} - 0.427 \times \text{zone}_{\scriptsize{E}} - 0.492 \times \text{zone}_{\scriptsize{F}}\]

\[+ 0.025 \times \text{agevehicule} + 0.016 \times \text{marque} + 0.002 \times \text{region} - 0.0004 \times \text{ageconducteur}\]

We will eliminate the two variables ageconducteur and region since they are not significant at the 5% level.

## 
## Call:
## glm(formula = round(y_train) ~ puissance + factor(garantie) + 
##     factor(carburant) + factor(zone) + agevehicule + marque, 
##     family = poisson(link = "log"), data = X_train)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          7.110841   0.020814  341.63   <2e-16 ***
## puissance           -0.028415   0.001991  -14.27   <2e-16 ***
## factor(garantie)2DO -0.107747   0.007583  -14.21   <2e-16 ***
## factor(garantie)3VI  0.729851   0.024773   29.46   <2e-16 ***
## factor(garantie)4BG -0.688195   0.010312  -66.74   <2e-16 ***
## factor(garantie)5CO -0.365425   0.031061  -11.77   <2e-16 ***
## factor(carburant)E  -0.197264   0.006872  -28.70   <2e-16 ***
## factor(zone)B       -0.336573   0.018552  -18.14   <2e-16 ***
## factor(zone)C       -0.329223   0.012554  -26.22   <2e-16 ***
## factor(zone)D       -0.316826   0.011018  -28.75   <2e-16 ***
## factor(zone)E       -0.426848   0.010948  -38.99   <2e-16 ***
## factor(zone)F       -0.494625   0.014382  -34.39   <2e-16 ***
## agevehicule          0.025691   0.002347   10.95   <2e-16 ***
## marque               0.016623   0.001052   15.80   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70022  on 143  degrees of freedom
## Residual deviance: 59195  on 130  degrees of freedom
## AIC: 60271
## 
## Number of Fisher Scoring iterations: 5
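Dropping region and ageconducteur can also be checked with a likelihood-ratio (deviance) test between the two nested Poisson fits. The sketch below uses the residual deviances printed in the two summaries. They are rounded to integers in the output, so the result is approximate, and the test presumes that the Poisson assumption is reasonable.

```r
# Residual deviances and degrees of freedom copied from the summaries.
dev_full    <- 59191; df_full    <- 128  # with region and ageconducteur
dev_reduced <- 59195; df_reduced <- 130  # without them

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- dev_reduced - dev_full
lr_df   <- df_reduced - df_full

# Upper-tail chi-square probability
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
round(p_value, 3)
```

A large p-value here is consistent with removing the two variables. With the fitted objects at hand, `anova(m3, m2, test = "Chisq")` would perform the same comparison, where `m2` and `m3` are hypothetical names for the second and third models.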

Model Summary:

Intercept (7.1108): - The log of the expected cost when all numeric predictors are zero and every factor is at its reference level; exp(7.1108) is about 1,225. - Highly significant (p-value < 0.05).

Puissance (-0.0284): - A one-unit increase in puissance multiplies the expected cost by exp(-0.0284) ≈ 0.972 (a decrease of about 2.8%). - Highly significant (p-value < 0.05).

Factor levels of Garantie (2DO, 3VI, 4BG, 5CO): - Each factor level represents a category of the garantie variable. - The coefficients are changes in the log of the expected cost relative to the reference category (1RC). - All levels are highly significant (p-value < 0.05).

Factor levels of Carburant (E): - The change in the log of the expected cost for the “E” category relative to the reference category. - Highly significant (p-value < 0.05).

Factor levels of Zone (B, C, D, E, F): - Same interpretation as above, with each level compared to the reference “A”. - All are highly significant (p-value < 0.05).

Agevehicule (0.0257): - A one-unit increase in agevehicule multiplies the expected cost by exp(0.0257) ≈ 1.026 (an increase of about 2.6%). - Highly significant (p-value < 0.05).

Marque (0.0166): - A one-unit increase in marque multiplies the expected cost by exp(0.0166) ≈ 1.017 (an increase of about 1.7%). - Highly significant (p-value < 0.05).

Model Fit: - Null Deviance: 70022 (on 143 df) - Residual Deviance: 59195 (on 130 df) - AIC: 60271

Conclusions: - All retained predictors are statistically significant at the 5% level, and the AIC (60271) is marginally lower than that of the second model (60272). - Significant coefficients do not by themselves establish a good fit: the residual deviance (59195 on 130 df) is far larger than its degrees of freedom, which points to strong overdispersion relative to the Poisson assumption, so the standard errors and p-values should be read with caution.

Mathematically, the fitted equation is:

\[\text{3rd Model: }\; Y = \text{Log}(y_{\text{train}}) \sim 7.111 - 0.028 \times \text{puissance} - 0.108 \times \text{garantie}_{\scriptsize{2DO}} + 0.730 \times \text{garantie}_{\scriptsize{3VI}} - 0.688 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.365 \times \text{garantie}_{\scriptsize{5CO}} - 0.197 \times \text{carburant}_{\scriptsize{E}} - 0.337 \times \text{zone}_{\scriptsize{B}} - 0.329 \times \text{zone}_{\scriptsize{C}} - 0.317 \times \text{zone}_{\scriptsize{D}} - 0.427 \times \text{zone}_{\scriptsize{E}} - 0.495 \times \text{zone}_{\scriptsize{F}}\]

\[+ 0.026 \times \text{agevehicule} + 0.017 \times \text{marque}\]

We can also add transformations of the variable ageconducteur (its logarithm and its powers up to degree four), together with region, to analyze whether the effect of driver age is nonlinear.
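One practical caveat when adding raw powers of the same variable is that they are strongly correlated with each other, which inflates the magnitude of the individual coefficients and makes them hard to read one by one. The sketch below illustrates this on a hypothetical grid of driver ages (not the study data) and shows `poly()` as one possible alternative that spans the same polynomial space with uncorrelated columns.

```r
# Hypothetical grid of driver ages; not the study data.
age <- 18:90

# Correlation of age with its raw powers
raw <- cbind(age = age, age2 = age^2, age3 = age^3, age4 = age^4)
round(cor(raw)[1, ], 3)

# Orthogonal polynomials of the same degree have uncorrelated columns
orth <- poly(age, degree = 4)
round(cor(orth), 3)
```

Whether orthogonal polynomials, splines or age bands are preferable depends on the pricing use case; the sketch only shows why the raw-power coefficients should not be interpreted individually.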

## 
## Call:
## glm(formula = round(y_train) ~ puissance + factor(garantie) + 
##     factor(carburant) + factor(zone) + agevehicule + marque + 
##     region + ageconducteur + log(ageconducteur) + I(ageconducteur^2) + 
##     I(ageconducteur^3) + I(ageconducteur^4), family = poisson(link = "log"), 
##     data = X_train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          3.139e+02  8.578e+00  36.592  < 2e-16 ***
## puissance           -4.564e-02  2.160e-03 -21.127  < 2e-16 ***
## factor(garantie)2DO -1.497e-01  7.967e-03 -18.787  < 2e-16 ***
## factor(garantie)3VI  6.561e-01  2.532e-02  25.914  < 2e-16 ***
## factor(garantie)4BG -7.290e-01  1.093e-02 -66.700  < 2e-16 ***
## factor(garantie)5CO -3.482e-01  3.128e-02 -11.131  < 2e-16 ***
## factor(carburant)E  -1.946e-01  7.437e-03 -26.169  < 2e-16 ***
## factor(zone)B       -3.455e-01  1.893e-02 -18.251  < 2e-16 ***
## factor(zone)C       -2.588e-01  1.322e-02 -19.574  < 2e-16 ***
## factor(zone)D       -2.900e-01  1.132e-02 -25.610  < 2e-16 ***
## factor(zone)E       -3.892e-01  1.109e-02 -35.097  < 2e-16 ***
## factor(zone)F       -5.541e-01  1.491e-02 -37.164  < 2e-16 ***
## agevehicule          4.052e-02  2.471e-03  16.401  < 2e-16 ***
## marque               1.663e-02  1.197e-03  13.892  < 2e-16 ***
## region               3.686e-03  1.144e-03   3.222  0.00127 ** 
## ageconducteur        1.808e+01  4.786e-01  37.770  < 2e-16 ***
## log(ageconducteur)  -1.887e+02  5.160e+00 -36.562  < 2e-16 ***
## I(ageconducteur^2)  -3.102e-01  8.000e-03 -38.778  < 2e-16 ***
## I(ageconducteur^3)   3.017e-03  7.647e-05  39.448  < 2e-16 ***
## I(ageconducteur^4)  -1.185e-05  2.984e-07 -39.726  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70022  on 143  degrees of freedom
## Residual deviance: 56989  on 124  degrees of freedom
## AIC: 58077
## 
## Number of Fisher Scoring iterations: 5

Deviance Residuals: - The deviance residuals measure each observation's contribution to the lack of fit. The values range from -47.656 to 41.680, indicating that the model predicts a number of observations poorly.

Coefficients:

Intercept (313.9): - The intercept has no direct interpretation here, because log(ageconducteur) is undefined when ageconducteur is zero. - Highly significant (p-value < 0.05).

Puissance (-0.0456): - A one-unit increase in puissance multiplies the expected cost by exp(-0.0456) ≈ 0.955 (a decrease of about 4.5%). - Highly significant (p-value < 0.05).

Factor levels of Garantie (2DO, 3VI, 4BG, 5CO): - The coefficients are changes in the log of the expected cost relative to the reference category (1RC). - All levels are highly significant (p-value < 0.05).

Factor levels of Carburant (E): - The change in the log of the expected cost for the “E” category relative to the reference category. - Highly significant (p-value < 0.05).

Factor levels of Zone (B, C, D, E, F): - Same interpretation as above, with each level compared to the reference “A”. - All are highly significant (p-value < 0.05).

Agevehicule (0.0405): - A one-unit increase in agevehicule multiplies the expected cost by exp(0.0405) ≈ 1.041 (an increase of about 4.1%). - Highly significant (p-value < 0.05).

Marque (0.0166): - A one-unit increase in marque multiplies the expected cost by exp(0.0166) ≈ 1.017 (an increase of about 1.7%). - Highly significant (p-value < 0.05).

Region (0.00369): - A one-unit increase in region is associated with an increase of about 0.37% in the expected cost. - Significant at the 1% level (p-value = 0.00127).

Ageconducteur (18.08), Log(Ageconducteur) (-188.7), I(Ageconducteur^2) (-0.3102), I(Ageconducteur^3) (0.00302), I(Ageconducteur^4) (-1.185e-05): - These five terms jointly describe a nonlinear effect of driver age on the log of the expected cost. - They cannot be interpreted one at a time, since ageconducteur cannot change while its logarithm and powers stay fixed; only their combined contribution is meaningful. - Each term is highly significant (p-value < 0.05).

Conclusions: - Adding the age terms lowers the AIC from 60271 (third model) to 58077, and every coefficient is statistically significant. - The residual deviance (56989 on 124 df) is still far larger than its degrees of freedom, so the Poisson assumption remains questionable and the model should not be described as a good fit on this evidence alone.
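To compare the four Poisson fits side by side, the sketch below collects the residual deviance, residual degrees of freedom and AIC printed in the summaries above, then derives the deviance per degree of freedom and the AIC difference from the best model. The data frame `fits` and its column names are illustrative.

```r
# Figures copied from the four Poisson summaries above.
fits <- data.frame(
  model   = c("1st", "2nd", "3rd", "4th"),
  res_dev = c(33889, 59191, 59195, 56989),
  res_df  = c(127, 128, 130, 124),
  aic     = c(34971, 60272, 60271, 58077)
)

# Residual deviance per degree of freedom: values far above 1
# suggest overdispersion relative to the Poisson assumption.
fits$dev_per_df <- round(fits$res_dev / fits$res_df, 1)

# AIC difference from the lowest-AIC model
fits$delta_aic <- fits$aic - min(fits$aic)
fits
```

The first model has by far the lowest AIC, but it includes claimnb, whose coefficient has a very large standard error, so that advantage should be treated with caution. Among the models without claimnb, the fourth has the lowest AIC.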

\[\text{4th Model: }\; Y = \text{Log}(y_{\text{train}}) \sim 313.9 - 0.046 \times \text{puissance} - 0.15 \times \text{garantie}_{\scriptsize{2DO}} + 0.656 \times \text{garantie}_{\scriptsize{3VI}} - 0.729 \times \text{garantie}_{\scriptsize{4BG}}\] \[- 0.348 \times \text{garantie}_{\scriptsize{5CO}} - 0.195 \times \text{carburant}_{\scriptsize{E}} - 0.346 \times \text{zone}_{\scriptsize{B}} - 0.259 \times \text{zone}_{\scriptsize{C}} \] \[- 0.29 \times \text{zone}_{\scriptsize{D}} - 0.389 \times \text{zone}_{\scriptsize{E}} - 0.554 \times \text{zone}_{\scriptsize{F}} + 0.041 \times \text{agevehicule} + 0.017 \times \text{marque}\]

\[ + 0.004 \times \text{region} + 18.08 \times \text{ageconducteur} - 188.7 \times \log(\text{ageconducteur}) - 0.31 \times \text{I}(\text{ageconducteur}^2) + 0.003 \times \text{I}(\text{ageconducteur}^3)\] \[- 1.185 \times 10^{-5} \times \text{I}(\text{ageconducteur}^4)\]

Gaussian family

\[\text{GLM}_{\scriptsize{gaussian}}: \; Y = y_{\text{train}} \sim \hat{\beta}_0 + \hat{\beta}_1 \times \text{puissance} + \hat{\beta}_2 \times \text{garantie} + \hat{\beta}_3 \times \text{carburant} + \hat{\beta}_4 \times \text{zone}+ \hat{\beta}_5 \times \text{agevehicule} + \] \[\hat{\beta}_6 \times \text{marque} + \hat{\beta}_7 \times \text{region}+ \hat{\beta}_8 \times \text{ageconducteur}\]

## Non-significant variables to be removed: puissance factor(garantie)2DO factor(garantie)5CO factor(carburant)E factor(zone)B factor(zone)C factor(zone)D agevehicule marque region ageconducteur

## 
## Call:
## glm(formula = y_train ~ puissance + factor(garantie) + factor(carburant) + 
##     factor(zone) + agevehicule + marque + region + ageconducteur, 
##     family = gaussian(), data = X_train)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         1196.4008   323.1513   3.702 0.000316 ***
## puissance            -18.8310    27.8195  -0.677 0.499692    
## factor(garantie)2DO  -87.2521   115.9998  -0.752 0.453327    
## factor(garantie)3VI  997.9488   550.2456   1.814 0.072075 .  
## factor(garantie)4BG -422.7739   137.1038  -3.084 0.002506 ** 
## factor(garantie)5CO -252.8871   392.6840  -0.644 0.520730    
## factor(carburant)E  -125.9489    97.7367  -1.289 0.199842    
## factor(zone)B       -291.2110   254.9280  -1.142 0.255451    
## factor(zone)C       -283.8471   186.0466  -1.526 0.129557    
## factor(zone)D       -272.7943   173.4099  -1.573 0.118160    
## factor(zone)E       -343.5463   170.8412  -2.011 0.046436 *  
## factor(zone)F       -381.4183   211.6159  -1.802 0.073834 .  
## agevehicule           16.0345    33.0085   0.486 0.627961    
## marque                10.1259    14.6213   0.693 0.489851    
## region                 0.8994    13.6124   0.066 0.947421    
## ageconducteur         -0.8821     4.6643  -0.189 0.850298    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 281024)
## 
##     Null deviance: 43622995  on 143  degrees of freedom
## Residual deviance: 35971072  on 128  degrees of freedom
## AIC: 2232.3
## 
## Number of Fisher Scoring iterations: 2

At the 5% level, only the intercept, factor(garantie)4BG and factor(zone)E are significant. factor(garantie)3VI (p = 0.072) and factor(zone)F (p = 0.074) are significant only at the 10% level, and the remaining variables are not significant.
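A Gaussian GLM with the default identity link gives the same estimates as ordinary least squares, and the dispersion parameter reported by `summary()` (281024 in the output above, a residual standard deviation of about 530) is the squared residual standard error of the equivalent linear model. The sketch below checks this equivalence on simulated data; `toy`, `x` and `zone` are hypothetical names and the numbers are arbitrary.

```r
# Simulated data (not the study data); all names are hypothetical.
set.seed(42)
toy <- data.frame(
  x    = rnorm(200),
  zone = factor(sample(c("A", "B", "C"), 200, replace = TRUE))
)
toy$y <- 1000 - 50 * toy$x + 100 * (toy$zone == "B") + rnorm(200, sd = 300)

fit_glm <- glm(y ~ x + zone, family = gaussian(), data = toy)
fit_lm  <- lm(y ~ x + zone, data = toy)

# Same coefficients in both fits
all.equal(coef(fit_glm), coef(fit_lm))

# GLM dispersion equals the squared residual standard error of lm()
all.equal(summary(fit_glm)$dispersion, summary(fit_lm)$sigma^2)
```

This is why the t-tests in the Gaussian summary can be read exactly as in a linear regression.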

Binomial family

\[\text{GLM}_{\scriptsize{binomial}}: \; Y = \text{claimnb} \sim \hat{\beta}_0 + \hat{\beta}_1 \times \text{puissance} + \hat{\beta}_2 \times \text{garantie} + \hat{\beta}_3 \times \text{carburant} + \hat{\beta}_4 \times \text{zone}+ \hat{\beta}_5 \times \text{agevehicule} + \hat{\beta}_6 \times \text{marque}\] \[+ \hat{\beta}_7 \times \text{region}+ \hat{\beta}_8 \times \text{ageconducteur}\]

## 
## Call:
## glm(formula = claimnb ~ puissance + factor(garantie) + factor(carburant) + 
##     factor(zone) + agevehicule + marque + region + ageconducteur, 
##     family = binomial(link = "logit"), data = X_train)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)  
## (Intercept)          2.289e+01  4.384e+03   0.005   0.9958  
## puissance           -3.500e-01  2.041e-01  -1.715   0.0864 .
## factor(garantie)2DO  3.977e-01  6.491e-01   0.613   0.5401  
## factor(garantie)3VI  1.852e+01  1.773e+04   0.001   0.9992  
## factor(garantie)4BG  1.908e+01  2.456e+03   0.008   0.9938  
## factor(garantie)5CO  1.972e+01  1.253e+04   0.002   0.9987  
## factor(carburant)E  -1.043e+00  7.939e-01  -1.313   0.1891  
## factor(zone)B       -2.657e+00  6.792e+03   0.000   0.9997  
## factor(zone)C       -1.923e+01  4.384e+03  -0.004   0.9965  
## factor(zone)D       -1.904e+01  4.384e+03  -0.004   0.9965  
## factor(zone)E       -1.954e+01  4.384e+03  -0.004   0.9964  
## factor(zone)F        3.525e-01  6.198e+03   0.000   1.0000  
## agevehicule          4.675e-02  2.140e-01   0.218   0.8271  
## marque              -1.101e-02  1.021e-01  -0.108   0.9142  
## region               1.088e-01  9.532e-02   1.142   0.2535  
## ageconducteur       -1.570e-02  3.328e-02  -0.472   0.6372  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 96.233  on 143  degrees of freedom
## Residual deviance: 66.955  on 128  degrees of freedom
## AIC: 98.955
## 
## Number of Fisher Scoring iterations: 19
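In the binomial output, several estimates are around ±20 on the logit scale with standard errors in the thousands, and the fit needed 19 Fisher scoring iterations. One possible reading is that some categories contain almost only one outcome (near separation), in which case the fitted probabilities are pushed toward 0 or 1. The sketch below uses `plogis()`, the inverse logit, to show how extreme such values are. It combines only the intercept and the zone C estimate and ignores all other predictors, so it is an illustration rather than a prediction.

```r
# Estimates copied from the binomial summary above.
b0     <- 22.89    # intercept
b_zone <- -19.23   # factor(zone)C

# plogis() maps a linear predictor to a probability.
p_reference <- plogis(b0)            # numerically almost 1
p_zone_c    <- plogis(b0 + b_zone)   # a less extreme probability

c(reference = p_reference, zone_C = p_zone_c)
```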

Prediction

Prediction of Claim Costs

In this section, we delve into predicting claim costs for damage insurance contracts, with a focus on journey-related factors. We have access to a new database containing the pricing factors for 35 contracts. The primary objective is to develop a predictive model that takes into account specific journey characteristics to accurately estimate the potential cost of claims.

Context

Insurance companies face the complex challenge of assessing risks associated with policyholders’ journeys. Journey characteristics such as distance traveled, frequency of trips, and other driving behavior-related factors can significantly influence the expected cost of claims. Thus, understanding and modeling these factors become crucial for accurate pricing of damage insurance contracts.

New Database

Our analysis is based on a freshly compiled database, gathering pricing data for 35 contracts. This database is carefully curated to include relevant information about coverages, vehicle characteristics, geographical areas, and most importantly, specifics about policyholders’ journeys.

Modeling Objective

We aim to develop a robust predictive model leveraging the information provided by this new database. By focusing on journeys, our model seeks to capture the intricate relationships between journey features and the likely cost of claims. In doing so, we intend to provide the insurance company with a powerful tool to adjust pricing based on the nuances of policyholders’ driving behavior.

Follow the upcoming sections to discover the details of our modeling approach, the achieved results, and the implications for more precise pricing of damage insurance contracts.

## 
## Call:
## glm(formula = round(y_train) ~ puissance + factor(carburant) + 
##     factor(zone) + agevehicule + marque + region + ageconducteur + 
##     log(ageconducteur) + I(ageconducteur^2) + I(ageconducteur^3) + 
##     I(ageconducteur^4), family = poisson(link = "log"), data = X_train)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         3.784e+02  8.362e+00  45.254  < 2e-16 ***
## puissance          -4.062e-02  2.143e-03 -18.958  < 2e-16 ***
## factor(carburant)E -2.306e-01  7.293e-03 -31.620  < 2e-16 ***
## factor(zone)B      -7.347e-01  1.788e-02 -41.078  < 2e-16 ***
## factor(zone)C      -5.933e-01  1.236e-02 -48.021  < 2e-16 ***
## factor(zone)D      -3.069e-01  1.099e-02 -27.927  < 2e-16 ***
## factor(zone)E      -4.369e-01  1.099e-02 -39.766  < 2e-16 ***
## factor(zone)F      -6.115e-01  1.470e-02 -41.598  < 2e-16 ***
## agevehicule         3.712e-02  2.448e-03  15.163  < 2e-16 ***
## marque              1.382e-02  1.170e-03  11.806  < 2e-16 ***
## region             -6.561e-03  1.098e-03  -5.978 2.26e-09 ***
## ageconducteur       2.096e+01  4.693e-01  44.669  < 2e-16 ***
## log(ageconducteur) -2.248e+02  5.042e+00 -44.597  < 2e-16 ***
## I(ageconducteur^2) -3.499e-01  7.873e-03 -44.443  < 2e-16 ***
## I(ageconducteur^3)  3.314e-03  7.548e-05  43.905  < 2e-16 ***
## I(ageconducteur^4) -1.272e-05  2.953e-07 -43.063  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 70022  on 143  degrees of freedom
## Residual deviance: 63158  on 128  degrees of freedom
## AIC: 64238
## 
## Number of Fisher Scoring iterations: 5

##         1         2         3         4         5         6         7         8 
##  514.8470  423.4333  807.8592  653.0120  688.1745  560.9412  623.7400  535.4638 
##         9        10        11        12        13        14        15        16 
##  577.6712  665.7525  665.7525  665.7525 1177.9360  693.6464  868.9974  761.3268 
##        17        18        19        20        21        22        23        24 
##  684.1723  709.8887  505.7045  518.4938  998.4163  783.2622  653.1340  666.7692 
##        25        26        27        28        29        30        31        32 
##  901.3655  635.9259  777.2931 1262.6729 1237.4653  450.4284  512.8170  879.3586 
##        33        34        35 
##  123.6644  628.7146  911.5023
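Predicted values like those above are on the scale of the response. The sketch below shows, on simulated data, how `predict()` with `type = "response"` relates to the linear predictor returned by `type = "link"` for a Poisson model with a log link. The objects `train`, `new_contracts` and `power` are hypothetical and do not reproduce the 35 contracts of the study.

```r
# Simulated training data and a small table of new contracts
# (all names and values are hypothetical).
set.seed(7)
train <- data.frame(power = sample(4:12, 500, replace = TRUE))
train$y <- rpois(500, lambda = exp(6 + 0.03 * train$power))

fit <- glm(y ~ power, family = poisson(link = "log"), data = train)

new_contracts <- data.frame(power = c(5, 7, 10))

# type = "response": predictions on the scale of the response
# type = "link": the linear predictor, i.e. the log of the mean
pred_response <- predict(fit, newdata = new_contracts, type = "response")
pred_link     <- predict(fit, newdata = new_contracts, type = "link")

all.equal(pred_response, exp(pred_link))
```

When the model contains factors, the new data must use the same column names and factor levels as the training data; otherwise `predict()` stops with an error.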

## # A tibble: 35 × 8
##    zone  puissance agevehicule ageconducteur marque carburant densite region
##    <chr>     <dbl>       <dbl>         <dbl>  <dbl> <chr>       <dbl>  <dbl>
##  1 C             7           0            56     12 D              93     13
##  2 C            10          10            42     12 D              93     13
##  3 A             5           4            31      3 D              21      8
##  4 E             5           1            58     12 E              52      0
##  5 B             7           8            22      2 E              26      0
##  6 F             7           5            39     12 E              11      0
##  7 E             6           3            39     12 D              22     11
##  8 E             6           6            36      1 D              24     12
##  9 A            12           9            69     12 E              73     13
## 10 E             7           7            26     12 E              53      1
## # ℹ 25 more rows

## 'data.frame':    35 obs. of  9 variables:
##  $ garantie     : chr  "" "" "" "" ...
##  $ zone         : chr  "" "" "" "" ...
##  $ puissance    : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ agevehicule  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ ageconducteur: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ marque       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ carburant    : chr  "" "" "" "" ...
##  $ densite      : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ region       : num  0 0 0 0 0 0 0 0 0 0 ...


##  [1] 4.994074e+02 8.317345e+02 9.613246e+02 5.842336e+02 7.824767e+02
##  [6] 6.627201e+02 8.941041e+02 7.910521e+02 1.389241e+03 5.440081e+02
## [11] 7.444608e+02 8.388383e+02 1.471932e+03 6.202726e+02 8.887114e+02
## [16] 9.591204e+02 2.112077e+03 1.202928e+03 2.238176e+03 3.110020e+03
## [21] 9.114961e+02 6.744388e+02 2.223375e+03 9.519048e+02 5.077412e+03
## [26] 8.434790e+02 7.078886e+02 1.458576e+03 9.547458e+02 9.040387e+02
## [31] 2.226276e+05 2.496808e+01 3.023024e+03 1.216185e-01 2.757478e+03

## [1] "Claim Frequency GLM2, Test-Sample, Actual/Predicted: 9.32% "

Findings, discussion and Conclusion

The study covered the data selection and cleaning process, an exploratory data analysis of the relationships between variables, and the modeling and prediction of claim costs for insurance contracts. The discussion centers on the role of journey-related factors in predicting claim costs for damage insurance contracts and on why understanding and modeling these factors matters for accurate pricing. It also highlights the relevance of the selected independent variables and probability distributions in the modeling process. In conclusion, the study demonstrates how Generalized Linear Models (GLMs) can be applied to calculate the pure premium in non-life insurance, giving insurance companies a basis for adjusting pricing to policyholders’ driving behavior and journey characteristics.

Modeling Pure Premium with Generalized Linear Models (GLM)

Hassan OUKHOUYA

November 26th, 2023
