Customer Segmentation Based on Insurance Claims

1 Introduction

1.1 Sample Description

The data analyzed in this article relates to insurance claims from the first quarter of 2023. The data set contains 1000 observations and 29 variables, including age, gender, coverage start date, insurance claim amount, car brand, etc.

With a relatively large number of variables, we can later apply feature engineering techniques to reduce the number of columns and to derive features that potentially represent the data better.

1.2 Research Purpose

The goal is to identify how risk differs between customers in the insurance business. Using customer segmentation techniques such as k-means clustering, we can group customers into clusters, where customers in the same cluster behave similarly and the clusters are clearly distinguishable from each other.

This can help the business understand its customers and choose decisions and actions suited to specific customer behavior.

2 Exploratory Data Analysis

2.1 Descriptive Statistics

First, we look at a general overview of the features, such as the most frequent values, the average value, etc.

##     cust_age     coverage_start_date  cust_region sum_assured_group
##  Min.   :19.00   Min.   :1997-03-23   east :352   high:304         
##  1st Qu.:32.00   1st Qu.:2002-12-15   north:310   low :347         
##  Median :38.00   Median :2009-06-25   west :338   mid :349         
##  Mean   :38.95   Mean   :2009-05-20                                
##  3rd Qu.:44.00   3rd Qu.:2015-08-25                                
##  Max.   :64.00   Max.   :2022-12-28                                
##                                                                    
##  ins_deductible  annual_prem        zip_code   insured_sex        edu_lvl   
##  Min.   : 500   Min.   : 431.4   331170 :  2   F:535       associate  :145  
##  1st Qu.: 500   1st Qu.:1087.7   346863 :  2   M:465       college    :122  
##  Median :1000   Median :1255.3   356570 :  2               high school:160  
##  Mean   :1136   Mean   :1254.5   369397 :  2               jd         :161  
##  3rd Qu.:2000   3rd Qu.:1413.8   377663 :  2               masters    :143  
##  Max.   :2000   Max.   :2045.7   330072 :  1               md         :144  
##                                  (Other):989               phd        :125  
##    marital_status claim_incurred_date                     claim_type 
##  married  :325    Min.   :2023-01-01   multi-vehicle collision :419  
##  other    :534    1st Qu.:2023-01-15   parked car              : 84  
##  unmarried:141    Median :2023-01-31   single vehicle collision:403  
##                   Mean   :2023-01-30   theft                   : 94  
##                   3rd Qu.:2023-02-15                                 
##                   Max.   :2023-03-01                                 
##                                                                      
##             acc_type   emg_services_notified     incident_city incident_hour
##                 :178   Ambulance:196         Arlington  :152   17     : 54  
##  front collision:254   Fire     :223         Columbus   :149   3      : 53  
##  rear collision :292   None     : 91         Hillsdale  :141   0      : 52  
##  side collision :276   Other    :198         Northbend  :145   23     : 51  
##                        Police   :292         Northbrook :122   16     : 49  
##                                              Riverwood  :134   4      : 46  
##                                              Springfield:157   (Other):695  
##  num_vehicles_involved property_damage bodily_injuries   witnesses    
##  Min.   :1.000         ?  :360         Min.   :0.000   Min.   :0.000  
##  1st Qu.:1.000         NO :338         1st Qu.:0.000   1st Qu.:1.000  
##  Median :1.000         YES:302         Median :1.000   Median :1.000  
##  Mean   :1.839                         Mean   :0.988   Mean   :1.487  
##  3rd Qu.:3.000                         3rd Qu.:2.000   3rd Qu.:2.000  
##  Max.   :4.000                         Max.   :2.000   Max.   :3.000  
##                                                                       
##  police_report_avlbl total_claim_amount  injury_claim   property_claim 
##     :343             Min.   :   100     Min.   :    0   Min.   :    0  
##  NO :343             1st Qu.: 41813     1st Qu.: 4295   1st Qu.: 4445  
##  YES:314             Median : 58055     Median : 6775   Median : 6750  
##                      Mean   : 52944     Mean   : 7458   Mean   : 7430  
##                      3rd Qu.: 70593     3rd Qu.:11330   3rd Qu.:10885  
##                      Max.   :151632     Max.   :24726   Max.   :28054  
##                                                                        
##  vehicle_claim        car_brand      car_model   production_year
##  Min.   :    70   Saab     : 80   RAM     : 43   2002   : 56    
##  1st Qu.: 30293   Subaru   : 80   Wrangler: 42   2006   : 55    
##  Median : 42100   Dodge    : 79   A3      : 37   2012   : 54    
##  Mean   : 38057   Nissan   : 78   MDX     : 36   2013   : 53    
##  3rd Qu.: 50823   Chevrolet: 76   Neon    : 36   2018   : 53    
##  Max.   :106960   BMW      : 73   Jetta   : 35   2014   : 52    
##                   (Other)  :534   (Other) :771   (Other):677

Some insights we could derive from the statistical summary above:

Most of the numeric variables appear to be roughly normally distributed, with no large gaps between the median, 3rd quartile, and maximum
For the categorical variables, we can see the most frequent values and how they compare to the others
We can also see the range of coverage start dates in the claims data, which spans from 03/1997 to 12/2022
The column zip_code might be better stored as character data, as it contains many levels and even the most frequent values appear only twice
Claim incurred dates fall within the first quarter of 2023, as specified in the description (2023-01-01 to 2023-03-01)
Columns acc_type and police_report_avlbl have empty values, while column property_damage has question mark (“?”) values. These might represent unknown or missing values

2.2 Data Visualization

We can also analyze visual attributes of the data, such as distribution and density, by observing plots.

Numeric Variables

From the graph above, some patterns appear to exist in the columns:

Customer age (cust_age) and annual premium (annual_prem) are roughly normally distributed, centered in the middle of their ranges. This means that most of the customers are middle-aged (35-45), and most premiums are around the same cost, most likely the standard premium
Most of the other variables don’t follow a specific distribution, but most of their observations pile up at extreme values. For example, in most cases the property claim is either 0 or at the lower end of its range

Besides looking at a single variable at a time (univariate analysis), for numeric variables we can also look for relationships between variables (multivariate analysis) using the correlation score. The following chart shows how closely each column is related to the others, with higher values signifying a higher correlation.

We can see that most variables aren’t correlated, with correlation values very close to 0. On the other hand, total_claim_amount is highly correlated with each of vehicle_claim, property_claim, and injury_claim, with correlation scores very close to 1. This is reasonable because total_claim_amount, as the name suggests, is the sum of the other three claim amounts. So for further analysis, we will only use the total_claim_amount column, as it already represents the other claim amounts.

The other variable with a noticeable correlation score is num_vehicles_involved. It has a correlation score of around 0.2 with the four claim variables mentioned above. This suggests a tendency for the claim amount to increase with the number of vehicles involved. Again, this is plausible: more vehicles involved means more damage inflicted, and thus a higher claim cost filed by the customer.
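
As a minimal sketch of the correlation score discussed above, the Pearson coefficient can be computed as follows. This is a Python illustration (the article's outputs come from R), and the small claim table is made-up data, not the article's data set. It only shows why a total that is the sum of its components correlates strongly with them.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical claim components for five claims (illustrative values only).
injury = [500, 1200, 0, 800, 2000]
prop = [400, 1000, 100, 900, 1800]
vehicle = [3000, 9000, 700, 6500, 15000]
# The total is defined as the sum of the three components.
total = [i + p + v for i, p, v in zip(injury, prop, vehicle)]

print(round(pearson(total, vehicle), 3))
print(round(pearson(total, injury), 3))
```

Because the total carries the same information as its parts, keeping all four columns would count that information several times in a distance-based method.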

Categorical Variables

In the chart above, we can see the top 5 values for each feature. Most of the categorical variables are distributed roughly equally across their levels.

2.3 Missing Values

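The data has no missing (NA) values, although, as noted in section 2.1, some columns use empty strings or “?” as placeholders for unknown values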

## [1] 0
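
A count of zero only covers values that are formally missing. As a hedged sketch in Python (the article's outputs come from R), the check below also counts the placeholder strings seen in the summary of section 2.1. The three rows and the `PLACEHOLDERS` set are assumptions made for illustration, not the article's data or code.

```python
# Hypothetical rows mimicking the placeholders seen in the summary:
# "" in acc_type / police_report_avlbl and "?" in property_damage.
rows = [
    {"acc_type": "rear collision", "property_damage": "YES", "police_report_avlbl": "NO"},
    {"acc_type": "", "property_damage": "?", "police_report_avlbl": "YES"},
    {"acc_type": "side collision", "property_damage": "NO", "police_report_avlbl": ""},
]

PLACEHOLDERS = {"", "?"}  # assumption: these strings stand for "unknown"

def count_missing(rows, placeholders=PLACEHOLDERS):
    """Count None values and placeholder strings per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            is_missing = value is None or (
                isinstance(value, str) and value.strip() in placeholders
            )
            counts[column] = counts.get(column, 0) + int(is_missing)
    return counts

missing = count_missing(rows)
print(missing)  # {'acc_type': 1, 'property_damage': 1, 'police_report_avlbl': 1}
```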

3 Feature Engineering

Next, given the number of columns in the data set, we can make better use of some variables by either modifying them or merging them with other variables.

3.1 Coverage to Claim Duration

We have a column called coverage_start_date, the date the insurance coverage started, and claim_incurred_date, the date the claim was incurred. Being dates, these columns have many unique values, so using them directly would likely add noise to our analysis. Instead, we can convert them into the gap between the coverage start date and the claim incurred date. Let’s call this new variable coverage_to_claim_duration.

This is a numeric column (in days) that tells us how long the insurance had been active before a claim was made.

Here are some values of the new feature

## [1] 5641 2583 1118 3651 2838 5048
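
The gap described above is a plain date difference. A minimal Python sketch follows (the article's outputs come from R). The function name is hypothetical and the first pair of dates is made up. The second pair combines the earliest coverage start and the latest claim date from the summary in section 2.1, which gives an upper bound on the gap rather than an actual row of the data.

```python
from datetime import date

def coverage_to_claim_duration(coverage_start, claim_incurred):
    """Number of days between the coverage start date and the claim date."""
    return (claim_incurred - coverage_start).days

# Hypothetical customer: coverage started 2020-01-01, claim on 2023-01-15.
print(coverage_to_claim_duration(date(2020, 1, 1), date(2023, 1, 15)))  # 1110

# Earliest coverage start and latest claim date in the summary:
# the largest gap any row could have, in days.
upper_bound = coverage_to_claim_duration(date(1997, 3, 23), date(2023, 3, 1))
print(upper_bound)
```

The reported maximum of 9463 sits just below this bound, which is consistent with the feature being measured in days.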

We can then also see how the values are distributed

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      36    2733    4953    5002    7359    9463

We can notice that the minimum value is far from the other summary statistics. This may indicate outliers, i.e. extreme values that don’t fit the rest of the distribution. The remaining values appear to be spread fairly evenly, with the mean (5002) close to the median (4953).

Since this is a numeric variable, we can also see how it correlates with the other numeric features

It’s apparent that this new feature isn’t correlated with the other numeric columns.

3.2 Incident Time

The column incident_hour also has many levels, one for each hour of the day. We can group it into four periods of the day, i.e. morning (6-12), afternoon (13-18), evening (19-23), and night (0-5). We call this new column incident_time

Here are some values from the new feature

## [1] Evening   Afternoon Night     Afternoon Night     Night    
## Levels: Night Morning Afternoon Evening
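
The grouping can be sketched as a simple lookup on the hour. This Python illustration (the article's outputs come from R) assumes hours run from 0 to 23 and uses the boundaries from the definition in this section: night 0-5, morning 6-12, afternoon 13-18, and evening for the remaining hours. The function name is hypothetical.

```python
def incident_time(hour):
    """Map an hour of the day (0-23) to one of four periods."""
    if not 0 <= hour <= 23:
        raise ValueError(f"hour must be between 0 and 23, got {hour}")
    if hour <= 5:
        return "Night"
    if hour <= 12:
        return "Morning"
    if hour <= 18:
        return "Afternoon"
    return "Evening"

# The six most frequent hours in the summary of section 2.1.
for hour in (17, 3, 0, 23, 16, 4):
    print(hour, incident_time(hour))
```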

As with previous variables, we could observe the distribution of this new feature

##     Night   Morning Afternoon   Evening 
##       244       239       271       246

We can see that incidents happened most often during the afternoon (13-18), but in general the differences between the times of day are small.

3.3 Collision

In the data set, we have a column for accident type, with the values front, rear, or side collision. We also have a column for claim type, with the values single vehicle collision, multi-vehicle collision, parked car, and theft. We can simplify these columns into a boolean variable called isCollision, which specifies whether the claim is for a collision or not.

Here’s a preview of the new column

## [1] Yes Yes Yes Yes Yes No 
## Levels: No Yes
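
The article does not show how isCollision was derived. One plausible rule, sketched here in Python as an assumption (the article's outputs come from R), is to flag the two collision claim types. The counts per claim type are taken from the summary in section 2.1.

```python
# Assumption: a claim is a collision if its claim type is one of these.
COLLISION_CLAIM_TYPES = {"single vehicle collision", "multi-vehicle collision"}

def is_collision(claim_type):
    """Return "Yes" for collision claim types and "No" otherwise."""
    return "Yes" if claim_type.strip().lower() in COLLISION_CLAIM_TYPES else "No"

# Claim-type counts as reported in the summary of section 2.1.
claim_type_counts = {
    "multi-vehicle collision": 419,
    "parked car": 84,
    "single vehicle collision": 403,
    "theft": 94,
}

tally = {"No": 0, "Yes": 0}
for claim_type, n in claim_type_counts.items():
    tally[is_collision(claim_type)] += n

print(tally)  # {'No': 178, 'Yes': 822}
```

Under this rule the totals agree with the distribution reported for the new column, and the 178 non-collision claims also match the number of empty acc_type values.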

As usual, we could observe the distribution of this new feature

##  No Yes 
## 178 822

It appears that most of the insurance claims were due to collision incidents.

3.4 Final Features

After the various preparation and analysis steps above, we can decide on the features to use for further analysis.

We’ll keep most of the original variables, except those that have been modified or would affect the analysis negatively

Columns zip_code, incident_city, car_brand, and car_model will be removed because they have many unique values
Column total_claim_amount will be the only claim amount column used, as it already represents the other columns. Thus, injury_claim, property_claim, and vehicle_claim can be removed
Columns that were used to create new features will be removed, as their information is now represented by the new columns

So finally, we have these features to use for the customer segmentation process

cust_age, cust_region, sum_assured_group, ins_deductible, annual_prem, insured_sex, edu_lvl, marital_status, emg_services_notified, num_vehicles_involved, property_damage, bodily_injuries, witnesses, police_report_avlbl, total_claim_amount, production_year, coverage_to_claim_duration, incident_time, isCollision

4 Customer Segmentation

Finally, we can get into the customer segmentation step. Here’s a snippet of the data we’ll use for the customer segmentation process

##        cust_age cust_region sum_assured_group ins_deductible annual_prem
## 225016       44       north               mid           2000     1007.48
## 316183       56       north               mid            500     1080.60
## 174430       28        east               mid           2000     1078.03
## 67527        53       north               low           1000     1026.55
## 259758       47        east               mid           2000     1484.15
## 90381        31       north               mid           2000     1110.15
##        insured_sex
## 225016           F
## 316183           F
## 174430           M
## 67527            F
## 259758           F
## 90381            F

## [1] 1000   19

We have 1000 observations and 19 features to use. We’ll be using the K-Means method, which is one of the most common clustering algorithms.

4.1 Factorization

The K-Means algorithm can only deal with numeric features, because it works by calculating the distance between data points. Thus, we first need to convert our categorical columns into numbers. Each level of a factor is represented by an integer, assigned in the order of the levels. For example, in the insured_sex column, F is represented by 1 and M by 2.

##        cust_age cust_region sum_assured_group ins_deductible annual_prem
## 225016       44           2                 3           2000     1007.48
## 316183       56           2                 3            500     1080.60
## 174430       28           1                 3           2000     1078.03
## 67527        53           2                 2           1000     1026.55
## 259758       47           1                 3           2000     1484.15
## 90381        31           2                 3           2000     1110.15
##        insured_sex
## 225016           1
## 316183           1
## 174430           2
## 67527            1
## 259758           1
## 90381            1
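
The encoding can be sketched as below. This is a Python illustration (the article's outputs come from R) and the helper name `factorize` is hypothetical. Levels default to alphabetical order, which is consistent with the codes in the table above (high = 1, low = 2, mid = 3).

```python
def factorize(values, levels=None):
    """Encode categories as 1-based integer codes following the level order."""
    if levels is None:
        levels = sorted(set(values))  # default: alphabetical level order
    codes = {level: i for i, level in enumerate(levels, start=1)}
    return [codes[v] for v in values]

# The six preview rows from the tables above.
sum_assured_group = ["mid", "mid", "mid", "low", "mid", "mid"]
insured_sex = ["F", "F", "M", "F", "F", "F"]

# "high" does not occur in the preview, so the full level set is passed in.
print(factorize(sum_assured_group, levels=["high", "low", "mid"]))  # [3, 3, 3, 2, 3, 3]
print(factorize(insured_sex))  # [1, 1, 2, 1, 1, 1]
```

Integer codes give nominal categories an order and a spacing that the distance calculation will use, so the codes should be read as labels rather than magnitudes.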

4.2 Normalization

To avoid biased calculations when variables are on different scales, a normalization step is required. For example, while the annual premium is commonly in the thousands, the number of vehicles involved is typically a single digit. This difference in scale could bias the analysis, because variables with a larger range would be treated as more significant simply for having larger values

A common normalization method is standard scaling, where the column average is subtracted from each value, and the result is divided by the column’s standard deviation. In R, we can use the scale() function to achieve this.

Here’s how our data looks after the normalization step

##          cust_age cust_region sum_assured_group ins_deductible annual_prem
## 225016  0.5527179  0.01684798        1.18305234      1.4120769  -1.0117491
## 316183  1.8655870  0.01684798        1.18305234     -1.0394455  -0.7122824
## 174430 -1.1977742 -1.18657940        1.18305234      1.4120769  -0.7228080
## 67527   1.5373697  0.01684798       -0.05574592     -0.2222714  -0.9336470
## 259758  0.8809352 -1.18657940        1.18305234      1.4120769   0.9404771
## 90381  -0.8695570  0.01684798        1.18305234      1.4120769  -0.5912589
##        insured_sex
## 225016  -0.9318206
## 316183  -0.9318206
## 174430   1.0720947
## 67527   -0.9318206
## 259758  -0.9318206
## 90381   -0.9318206
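
Standard scaling can be checked by hand on the insured_sex column. The Python sketch below (the article's outputs come from R) rebuilds that column from the counts in section 2.1, 535 customers coded 1 (F) and 465 coded 2 (M), and uses the sample standard deviation, which is the convention of R's scale(). The function name `standardize` is hypothetical.

```python
import statistics

def standardize(values):
    """Subtract the mean from each value, then divide by the sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1)
    return [(v - mean) / sd for v in values]

# insured_sex after factorization: 535 rows coded 1 (F), 465 rows coded 2 (M).
insured_sex = [1] * 535 + [2] * 465
scaled = standardize(insured_sex)

print(round(scaled[0], 4), round(scaled[-1], 4))  # -0.9318 1.0721
```

These two values match the insured_sex entries in the normalized table above.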

4.3 Optimal Number of Cluster

In the K-Means method, we need to specify the number of clusters the model should separate the data into. It’s common to use a preset value when one is known (such as 10 for a digit recognition data set), but otherwise, a more systematic way is the Silhouette Method.

This method works by fitting the K-Means model with various numbers of clusters. Then, for each number of clusters, it calculates for each data point the average distance to the other points in the same cluster (cohesion) and the average distance to the points in the nearest other cluster (separation). These two values are combined into the silhouette coefficient, and the average silhouette score is then plotted for each number of clusters.

The factoextra package offers a convenient way to do this with the fviz_nbclust function

We can see that, based on the silhouette method, the optimal value for k is 2.
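
To make the silhouette coefficient concrete, here is a minimal Python sketch of the calculation. It is not the factoextra implementation, and the four points and both labelings are made-up data. For each point, `a` is the cohesion term and `b` is the separation term, combined as (b - a) / max(a, b).

```python
import math

def silhouette_score(points, labels):
    """Average silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # convention: a singleton cluster scores 0
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(point, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(point, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical points forming two tight, well-separated groups.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups kept together
bad = silhouette_score(points, [0, 1, 0, 1])   # groups split apart

print(round(good, 3), round(bad, 3))
```

A score near 1 means points sit much closer to their own cluster than to the nearest other one. Choosing k then amounts to computing this average for each candidate k and keeping the largest.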

4.4 K-Means Clustering

Next, we can build our clustering model
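
The article does not show its model-fitting code. As a minimal sketch of what K-Means does, here is Lloyd's algorithm in plain Python, run on six made-up points. The function name, the seed, and the data are assumptions for illustration only.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k of the data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                new_centers.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members))
                )
            else:
                new_centers.append(centers[c])  # keep the center of an empty cluster
        if new_labels == labels and new_centers == centers:
            break  # nothing changed, so the algorithm has converged
        labels, centers = new_labels, new_centers
    return centers, labels

# Hypothetical scaled data: two small, well-separated groups.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
centers, labels = kmeans(points, k=2)

print(labels)
print(sorted(centers))
```

Each returned center is the per-column mean of the points assigned to that cluster.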

By looking at the centers, we can see the average value of each column in each cluster. From these values, we can derive some insights into the general attributes and behaviors of the customers within a cluster/segment.

##      cust_age  cust_region sum_assured_group ins_deductible  annual_prem
## 1 -0.09879955 -0.003434501     -6.959541e-05   -0.011091532 -0.034923653
## 2  0.02139455  0.000743724      1.507054e-05    0.002401816  0.007562543
##    insured_sex     edu_lvl marital_status emg_services_notified
## 1  0.036363183 -0.13005635     0.00642342             0.5290164
## 2 -0.007874266  0.02816305    -0.00139096            -0.1145559
##   num_vehicles_involved property_damage bodily_injuries    witnesses
## 1            -0.8234529    -0.018515102    -0.012788450  0.016752800
## 2             0.1783146     0.004009353     0.002769275 -0.003627735
##   police_report_avlbl total_claim_amount production_year
## 1        -0.005809011         -1.7680133      0.05843700
## 2         0.001257912          0.3828545     -0.01265424
##   coverage_to_claim_duration incident_time isCollision
## 1                 0.08442484    -0.5792041  -2.1478733
## 2                -0.01828178     0.1254238   0.4651112

Some general information we could take from the summary above

Most of the columns, such as cust_age, ins_deductible, and insured_sex, don’t help separate the customers (shown by the small difference between the cluster values), implying that there is no specific pattern between these features and the car insurance claims
On the other hand, some columns, such as total_claim_amount, show a clear distinction between the clusters: cluster 1 contains customers with a low total claim amount and cluster 2 contains customers with a high total claim amount

4.5 Cluster Identification

A further step in customer segmentation is to analyze the characteristics of each cluster, and interpret them in simplified terms. This could be achieved by observing the summary values above.

Here are some interpretations we can make about each cluster, based on the values of the columns with a clear separation between the two clusters

##                      Subject      Cluster.1     Cluster.2
## 1                  Collision             No           Yes
## 2               Claim Amount            Low          High
## 3       Vehicle Involvements Single-Vehicle Multi-Vehicle
## 4 Emergency Service Notified            Yes            No

And in conclusion,

Cluster 1 contains customers who were involved in non-collision incidents, such as theft. These customers consequently have lower claim amounts due to the smaller amount of damage inflicted.

Cluster 2 contains customers who were involved in collision incidents, either single-vehicle or multi-vehicle. These customers filed higher claim amounts due to the more severe damage inflicted, including property damage and injuries.

4.6 Cluster Visualization

Finally, we can observe how the clusters actually look in the data set. This way, we can see more clearly how well the model separates the segments, whether the clusters are distinguishable from each other, and how the segment sizes compare.

Since we have multiple features in our data set, we cannot visualize the whole data in a single plot. Two methods are commonly used to visualize clusters with multiple features. One is principal component analysis (PCA), plotting the two dimensions that capture the most variance. The other is plotting the most distinguishing columns within the data set (in this case, total_claim_amount).

Principal Component Analysis

We can see that the model perfectly separates the data (based on the PCA visualization) into the two segments.
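
The article does not show how the PCA coordinates were computed. As a hedged Python sketch, the two leading principal directions can be found with power iteration and deflation on the covariance matrix, and each row is then projected onto them to get 2-D plotting coordinates. The function name, the seed, and the five rows are assumptions for illustration, not the article's method or data.

```python
import math
import random

def principal_components(rows, n_components=2, iterations=500, seed=0):
    """Return (components, scores) using power iteration with deflation."""
    rng = random.Random(seed)
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data (d x d).
    cov = [
        [sum(row[i] * row[j] for row in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    components = []
    for _component in range(n_components):
        # Random unit start vector, so it is not orthogonal to the target.
        v = [rng.gauss(0, 1) for _ in range(d)]
        length = math.sqrt(sum(x * x for x in v))
        v = [x / length for x in v]
        for _step in range(iterations):
            w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break  # no variance left in any direction
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
        )
        components.append(v)
        # Deflation: remove the direction just found from the covariance.
        cov = [
            [cov[i][j] - eigenvalue * v[i] * v[j] for j in range(d)]
            for i in range(d)
        ]
    # Project each centered row onto the components.
    scores = [
        [sum(row[j] * comp[j] for j in range(d)) for comp in components]
        for row in centered
    ]
    return components, scores

# Hypothetical rows with three features; most variance lies along one direction.
rows = [(1, 1, 0), (2, 2.1, 0.1), (3, 2.9, 0), (4, 4.2, -0.1), (5, 4.8, 0.1)]
components, scores = principal_components(rows)

for pc1, pc2 in scores:
    print(round(pc1, 3), round(pc2, 3))
```

Plotting the two score columns against each other and coloring each point by its cluster gives the kind of 2-D view described above.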

Variable Selection

By using the specific column with the highest influence on the clusters (total_claim_amount), we can also see a perfect separation between the two clusters, showing that the main difference between the clusters is the total claim amount, which is influenced by the type of claim (collision or not).

5 Conclusion

In conclusion, we have successfully built a customer segmentation model using the K-Means clustering method, resulting in a total of 2 clusters. The main separator between the clusters is the total claim amount, which is related to other variables, i.e. whether the incident was a collision or not.

The first cluster contains customers with non-collision incidents, which involve fewer vehicles and inflict less damage, resulting in a lower total claim amount.

The second cluster, on the other hand, contains customers with collision incidents, which involve more vehicles and inflict more damage, resulting in a higher total claim amount.

This segmentation would be useful for preparing for future insurance claims, where collision-related incidents would incur higher claim costs than non-collision incidents.