Customer Segmentation Based on Insurance Claims
1 Introduction
1.1 Sample Description
The data we’ll be analyzing in this article covers insurance claims made in the first quarter of 2023. The data set contains 1000 observations and 29 variables, including age, gender, coverage start date, insurance claim amount, car brand, etc.
With this many variables, we can later apply some feature engineering techniques to reduce the number of columns and to derive features that represent the data better.
1.2 Research Purpose
The goal is to identify how risk differs between customers in the insurance business. Using a customer segmentation technique such as k-means clustering, we can group customers into clusters so that customers in the same cluster behave similarly, while customers in different clusters show clearly different values.
This can help the business understand its customers and choose decisions and actions that fit the behavior of each customer segment.
2 Exploratory Data Analysis
2.1 Descriptive Statistics
First, we will look at a general overview of the features, such as the most frequent values, the average value, etc.
## cust_age coverage_start_date cust_region sum_assured_group
## Min. :19.00 Min. :1997-03-23 east :352 high:304
## 1st Qu.:32.00 1st Qu.:2002-12-15 north:310 low :347
## Median :38.00 Median :2009-06-25 west :338 mid :349
## Mean :38.95 Mean :2009-05-20
## 3rd Qu.:44.00 3rd Qu.:2015-08-25
## Max. :64.00 Max. :2022-12-28
##
## ins_deductible annual_prem zip_code insured_sex edu_lvl
## Min. : 500 Min. : 431.4 331170 : 2 F:535 associate :145
## 1st Qu.: 500 1st Qu.:1087.7 346863 : 2 M:465 college :122
## Median :1000 Median :1255.3 356570 : 2 high school:160
## Mean :1136 Mean :1254.5 369397 : 2 jd :161
## 3rd Qu.:2000 3rd Qu.:1413.8 377663 : 2 masters :143
## Max. :2000 Max. :2045.7 330072 : 1 md :144
## (Other):989 phd :125
## marital_status claim_incurred_date claim_type
## married :325 Min. :2023-01-01 multi-vehicle collision :419
## other :534 1st Qu.:2023-01-15 parked car : 84
## unmarried:141 Median :2023-01-31 single vehicle collision:403
## Mean :2023-01-30 theft : 94
## 3rd Qu.:2023-02-15
## Max. :2023-03-01
##
## acc_type emg_services_notified incident_city incident_hour
## :178 Ambulance:196 Arlington :152 17 : 54
## front collision:254 Fire :223 Columbus :149 3 : 53
## rear collision :292 None : 91 Hillsdale :141 0 : 52
## side collision :276 Other :198 Northbend :145 23 : 51
## Police :292 Northbrook :122 16 : 49
## Riverwood :134 4 : 46
## Springfield:157 (Other):695
## num_vehicles_involved property_damage bodily_injuries witnesses
## Min. :1.000 ? :360 Min. :0.000 Min. :0.000
## 1st Qu.:1.000 NO :338 1st Qu.:0.000 1st Qu.:1.000
## Median :1.000 YES:302 Median :1.000 Median :1.000
## Mean :1.839 Mean :0.988 Mean :1.487
## 3rd Qu.:3.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :4.000 Max. :2.000 Max. :3.000
##
## police_report_avlbl total_claim_amount injury_claim property_claim
## :343 Min. : 100 Min. : 0 Min. : 0
## NO :343 1st Qu.: 41813 1st Qu.: 4295 1st Qu.: 4445
## YES:314 Median : 58055 Median : 6775 Median : 6750
## Mean : 52944 Mean : 7458 Mean : 7430
## 3rd Qu.: 70593 3rd Qu.:11330 3rd Qu.:10885
## Max. :151632 Max. :24726 Max. :28054
##
## vehicle_claim car_brand car_model production_year
## Min. : 70 Saab : 80 RAM : 43 2002 : 56
## 1st Qu.: 30293 Subaru : 80 Wrangler: 42 2006 : 55
## Median : 42100 Dodge : 79 A3 : 37 2012 : 54
## Mean : 38057 Nissan : 78 MDX : 36 2013 : 53
## 3rd Qu.: 50823 Chevrolet: 76 Neon : 36 2018 : 53
## Max. :106960 BMW : 73 Jetta : 35 2014 : 52
## (Other) :534 (Other) :771 (Other):677
Some insights we could derive from the statistical summary above:
Several numeric variables, such as cust_age and annual_prem, look roughly symmetric: their mean and median are close, and the quartiles are spaced about evenly around the median
For the categorical variables, we can see the most frequently occurring values and how they compare to the others
We can also see the range of coverage start dates within the claims data, which spans from 03/1997 to 12/2022
The column zip_code would be better stored as a character column, as it has many levels and even the most frequent values appear only twice
Claim incurred dates fall within the first quarter of 2023, as stated in the description; the observed range is 2023-01-01 to 2023-03-01
Columns acc_type and police_report_avlbl contain empty values, while column property_damage contains question mark (“?”) values. These might represent unknown or missing values
2.2 Data Visualization
We can also analyze the data visually, for example by looking at the distribution and density of each variable in a plot
Numeric Variables
From the graph above, some patterns appear to exist in the columns:
Customer age (cust_age) and annual premium (annual_prem) look roughly normally distributed, with most values near the center. This means that most customers are middle-aged (35-45) and that most premiums are around the same cost, most likely the standard premium
Most of the other variables don’t follow a specific distribution, but their observations concentrate at particular values. For example, in most cases the property claim is either 0 or at the lower end of the range
Besides looking at a single variable at a time (univariate analysis), for numeric variables we can also look for relationships between variables (multivariate analysis) using the correlation coefficient. The following chart shows how closely each column is related to the others, with higher values signifying a stronger correlation.
We can see that most variables aren’t correlated, with correlation values very close to 0. On the other hand, total_claim_amount is highly correlated with each of vehicle_claim, property_claim, and injury_claim, with correlation values very close to 1. This is expected because total_claim_amount, as the name suggests, is the sum of those three claim amounts. For further analysis, we will therefore use only the total_claim_amount column, as it already represents the other claim amounts.
The other variable with a noticeable, though weak, correlation is num_vehicles_involved. It has a correlation of around 0.2 with each of the four variables mentioned above. This suggests a mild tendency for the claim amount to increase with the number of vehicles involved. This is plausible: more vehicles involved usually means more damage, and thus a higher claim amount.
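The effect of a total column on the correlation matrix can be sketched with synthetic data. The example below is only an illustration: the numbers are generated, not taken from the claims data, and the assumption that every claim component scales with a shared `severity` value is ours.

```r
set.seed(42)
n <- 200

# Synthetic data (not the real claims): each component scales with a shared,
# hypothetical "severity" value plus a little noise
severity       <- runif(n, 1000, 100000)
injury_claim   <- 0.14 * severity + rnorm(n, sd = 500)
property_claim <- 0.14 * severity + rnorm(n, sd = 500)
vehicle_claim  <- 0.72 * severity + rnorm(n, sd = 500)

# The total is, by construction, the sum of the three components
total_claim_amount <- injury_claim + property_claim + vehicle_claim

toy <- data.frame(injury_claim, property_claim, vehicle_claim, total_claim_amount)

# Pearson correlation between every pair of columns
corr <- cor(toy)
print(round(corr, 2))
```

When the components move together like this, the total carries nearly the same information as each component, which is the reasoning behind keeping only total_claim_amount.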
Categorical Variables
On the chart above, we can see the top 5 values for each feature. Most of the categorical variables are distributed roughly equally across their levels.
2.3 Missing Values
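The data has no missing (NA) values. Note that the empty and “?” values seen in the descriptive statistics are stored as ordinary values, so they are not counted here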
## [1] 0
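A check of this kind might look like the sketch below. The small data frame is made up for illustration, and recoding the placeholders to NA is an optional step that the analysis in this article does not necessarily perform.

```r
# Toy data frame standing in for two columns of the claims data (made-up values)
claims <- data.frame(
  property_damage     = c("YES", "?", "NO", "?"),
  police_report_avlbl = c("", "YES", "NO", ""),
  stringsAsFactors    = FALSE
)

# Count true NA values: none here, because the placeholders are ordinary strings
n_na <- sum(is.na(claims))

# Count placeholder values ("?" or empty string) per column
n_placeholder <- sapply(claims, function(col) sum(col %in% c("?", "")))

# Optional: recode the placeholders as NA so that is.na() can detect them
claims_na <- claims
claims_na[claims_na == "?" | claims_na == ""] <- NA

print(n_na)
print(n_placeholder)
print(colSums(is.na(claims_na)))
```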
3 Feature Engineering
Next, given the number of columns in the data set, we can make better use of some variables by either modifying them or merging them with other variables.
3.1 Coverage to Claim Duration
We have a column called coverage_start_date, which is the date the coverage started, and claim_incurred_date, which is the date the claim was made. Being dates, these columns have many unique values, so using them directly would likely add noise to our analysis. Instead, we can convert them into the gap (in days) between the coverage start date and the claim incurred date. Let’s call this new variable coverage_to_claim_duration.
This numeric column tells us how long the insurance had been in place before a claim was made.
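A minimal sketch of this conversion is shown below, assuming both columns are stored as Date values and that the duration is measured in days. The dates are invented for the example.

```r
# Made-up dates for illustration; the real columns hold 1000 values each
coverage_start_date <- as.Date(c("2007-07-25", "2016-01-10", "2022-12-28"))
claim_incurred_date <- as.Date(c("2023-01-03", "2023-02-06", "2023-02-02"))

# Subtracting two Date vectors gives a difftime in days;
# as.numeric() turns it into a plain numeric column
coverage_to_claim_duration <- as.numeric(claim_incurred_date - coverage_start_date)

print(coverage_to_claim_duration)
summary(coverage_to_claim_duration)
```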
Here are some values we got from the new feature
## [1] 5641 2583 1118 3651 2838 5048
We can also look at how the values are distributed
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 36 2733 4953 5002 7359 9463
The minimum value (36) lies far below the first quartile (2733). This suggests that a few claims were made unusually soon after the coverage started and may be outliers. The rest of the values look fairly evenly spread: the median (4953) is close to the mean (5002), and the quartiles sit at similar distances from the median.
And since this is a numeric value, we can also see how it correlates with the other numeric features
The new feature shows no notable correlation with the other numeric columns.
3.2 Incident Time
The column incident_hour also has many levels, one for each hour of the day. We can group these hours into four periods of the day, i.e. morning (6-12), afternoon (13-18), evening (19-23), and night (0-5). We’ll call this new column incident_time
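One way to sketch this binning is with base R's `cut()`. The example assumes hours are integers from 0 to 23; if incident_hour is stored as a factor, it would first need converting with `as.integer(as.character(...))`.

```r
# Example hours covering the edges of every period (illustrative values)
incident_hour <- c(0, 5, 6, 12, 13, 18, 19, 23)

# cut() uses right-closed intervals by default:
# (-1,5] = Night, (5,12] = Morning, (12,18] = Afternoon, (18,23] = Evening
incident_time <- cut(
  incident_hour,
  breaks = c(-1, 5, 12, 18, 23),
  labels = c("Night", "Morning", "Afternoon", "Evening")
)

print(incident_time)
print(table(incident_time))
```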
Here are some values from the new feature
## [1] Evening Afternoon Night Afternoon Night Night
## Levels: Night Morning Afternoon Evening
As with previous variables, we could observe the distribution of this new feature
## Night Morning Afternoon Evening
## 244 239 271 246
Incidents happened most often during the afternoon (13-18), but overall the differences between the times of day are small.
3.3 Collision
In the data set, we have a column for accident type, with the values front, rear, or side collision. We also have a column for claim type, with the values single vehicle collision, multi-vehicle collision, parked car, and theft. We can simplify these columns into a single boolean variable called isCollision, which specifies whether the claim is for a collision or not.
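A possible way to derive this flag is sketched below. It assumes the flag is taken from the claim type text alone (any value containing the word "collision"), which is our assumption and may differ from the rule actually used.

```r
# The four claim types listed in the data set
claim_type <- c("single vehicle collision", "multi-vehicle collision",
                "parked car", "theft")

# Flag a claim as a collision when its type mentions "collision"
isCollision <- factor(
  ifelse(grepl("collision", claim_type, fixed = TRUE), "Yes", "No"),
  levels = c("No", "Yes")
)

print(data.frame(claim_type, isCollision))
```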
Here’s a preview of the new column
## [1] Yes Yes Yes Yes Yes No
## Levels: No Yes
As usual, we could observe the distribution of this new feature
## No Yes
## 178 822
It appears that most of the insurance claims (822 of 1000) were due to collision incidents.
3.4 Final Features
After the preparation steps and analysis above, we can decide which features to use for further analysis.
We’ll keep most of the original variables, except those that have been replaced by new features or would affect the analysis negatively
Columns zip_code, incident_city, car_brand, and car_model will be removed because they have many unique values
Column total_claim_amount will be the only claim amount column used, as it already represents the other columns. Thus, injury_claim, property_claim, and vehicle_claim can be removed
Columns that were used to create new features will be removed, as their information is now represented by the new columns
This leaves the following features for the customer segmentation process
cust_age, cust_region, sum_assured_group, ins_deductible, annual_prem, insured_sex, edu_lvl, marital_status, emg_services_notified, num_vehicles_involved, property_damage, bodily_injuries, witnesses, police_report_avlbl, total_claim_amount, production_year, coverage_to_claim_duration, incident_time, isCollision
4 Customer Segmentation
Finally, we can get into the customer segmentation step. Here’s a snippet of the data we’ll use for the customer segmentation process
## cust_age cust_region sum_assured_group ins_deductible annual_prem
## 225016 44 north mid 2000 1007.48
## 316183 56 north mid 500 1080.60
## 174430 28 east mid 2000 1078.03
## 67527 53 north low 1000 1026.55
## 259758 47 east mid 2000 1484.15
## 90381 31 north mid 2000 1110.15
## insured_sex
## 225016 F
## 316183 F
## 174430 M
## 67527 F
## 259758 F
## 90381 F
## [1] 1000 19
We have 1000 observations and 19 features to use. We’ll be using the K-Means method, which is one of the most common clustering algorithms.
4.1 Factorization
The K-Means algorithm can only deal with numeric features, because it works by calculating distances between data points. Thus, we first need to convert our categorical columns into numeric ones. Each level of a factor is replaced by an integer that follows the order of the factor’s levels (alphabetical by default in R). For example, in the insured_sex column, F is represented by 1 and M is represented by 2.
## cust_age cust_region sum_assured_group ins_deductible annual_prem
## 225016 44 2 3 2000 1007.48
## 316183 56 2 3 500 1080.60
## 174430 28 1 3 2000 1078.03
## 67527 53 2 2 1000 1026.55
## 259758 47 1 3 2000 1484.15
## 90381 31 2 3 2000 1110.15
## insured_sex
## 225016 1
## 316183 1
## 174430 2
## 67527 1
## 259758 1
## 90381 1
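The factor-to-integer conversion described in this section can be sketched as follows. The three-row data frame is made up, and the `lapply()` approach is just one possible way to do it.

```r
# Toy data frame with one numeric and two factor columns (made-up rows)
toy <- data.frame(
  cust_age          = c(44, 56, 28),
  insured_sex       = factor(c("F", "F", "M")),
  sum_assured_group = factor(c("mid", "low", "high"))
)

# Replace each factor by its integer level code; leave other columns unchanged
toy_num <- as.data.frame(lapply(toy, function(col) {
  if (is.factor(col)) as.integer(col) else col
}))

# Levels are ordered alphabetically by default: high = 1, low = 2, mid = 3
print(levels(toy$sum_assured_group))
print(toy_num)
```

Note that the codes only reflect level order: with the default alphabetical levels, "mid" is coded above "low" but also above "high".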
4.2 Normalization
A normalization step is required so that variables measured on different scales don’t bias the distance calculations. For example, while the annual premium is commonly in the thousands, the number of vehicles involved is typically a single digit. Without normalization, variables with a wider range would be treated as more significant simply because their values are larger
A common normalization method is standard scaling, where the column average is subtracted from each value, and the result is then divided by the standard deviation of the column. In R, we can use the scale() function to achieve this.
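As a small check of what `scale()` does, the sketch below standardizes two toy columns and compares one of them with the manual formula z = (x - mean) / sd. The values are illustrative only.

```r
# Two toy columns on very different scales (illustrative values)
toy <- data.frame(
  annual_prem           = c(1007.48, 1080.60, 1078.03, 1026.55),
  num_vehicles_involved = c(1, 3, 1, 4)
)

# scale() centers each column on its mean and divides by its standard deviation
scaled <- scale(toy)

# The same calculation done by hand for one column
manual <- (toy$annual_prem - mean(toy$annual_prem)) / sd(toy$annual_prem)

print(scaled)
print(all.equal(as.numeric(scaled[, "annual_prem"]), manual))
```

After scaling, every column has mean 0 and standard deviation 1, so no column dominates the distance calculation because of its units.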
Here’s what our data looks like after the normalization step
## cust_age cust_region sum_assured_group ins_deductible annual_prem
## 225016 0.5527179 0.01684798 1.18305234 1.4120769 -1.0117491
## 316183 1.8655870 0.01684798 1.18305234 -1.0394455 -0.7122824
## 174430 -1.1977742 -1.18657940 1.18305234 1.4120769 -0.7228080
## 67527 1.5373697 0.01684798 -0.05574592 -0.2222714 -0.9336470
## 259758 0.8809352 -1.18657940 1.18305234 1.4120769 0.9404771
## 90381 -0.8695570 0.01684798 1.18305234 1.4120769 -0.5912589
## insured_sex
## 225016 -0.9318206
## 316183 -0.9318206
## 174430 1.0720947
## 67527 -0.9318206
## 259758 -0.9318206
## 90381 -0.9318206
4.3 Optimal Number of Cluster
In the K-Means method, we need to specify the number of clusters the model should separate the data into. When the number of groups is known in advance (such as 10 for a digit recognition data set), that value can be used directly. Otherwise, a data-driven way to choose it is the silhouette method.
This method works by fitting the K-Means model with various numbers of clusters. For each fit and each data point, it calculates the average distance to the other points in the same cluster (cohesion) and the average distance to the points in the nearest other cluster (separation). These two values are combined into the point’s silhouette coefficient, which ranges from -1 to 1, and the average silhouette score is then plotted for each number of clusters.
The factoextra package offers a convenient way to do this using the fviz_nbclust function
Based on the silhouette method, the optimal value for k is 2.
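To make the calculation concrete, the sketch below computes the average silhouette score by hand in base R instead of calling fviz_nbclust. It runs on synthetic two-group data, not on the claims data, and the helper name `avg_silhouette` is hypothetical.

```r
set.seed(123)

# Synthetic data with two well-separated groups (not the claims data)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 4, sd = 0.5), ncol = 2))

# Hypothetical helper: average silhouette score of a k-means fit with k clusters
avg_silhouette <- function(x, k) {
  cl <- kmeans(x, centers = k, nstart = 25)$cluster
  d  <- as.matrix(dist(x))                      # pairwise Euclidean distances
  s  <- sapply(seq_len(nrow(x)), function(i) {
    own <- cl == cl[i]
    if (sum(own) == 1) return(0)                # convention for singleton clusters
    # Cohesion: mean distance to the other points of the same cluster
    # (d[i, i] is 0, so dividing by n - 1 excludes the point itself)
    a <- sum(d[i, own]) / (sum(own) - 1)
    # Separation: smallest mean distance to the points of any other cluster
    b <- min(sapply(setdiff(unique(cl), cl[i]),
                    function(j) mean(d[i, cl == j])))
    (b - a) / max(a, b)
  })
  mean(s)
}

ks     <- 2:5
scores <- sapply(ks, function(k) avg_silhouette(x, k))
best_k <- ks[which.max(scores)]

print(data.frame(k = ks, avg_silhouette = round(scores, 3)))
print(best_k)
```

The number of clusters with the highest average score is the one the silhouette method suggests.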
4.4 K-Means Clustering
Next, we can build our clustering model
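A model of this kind can be fitted with base R's `kmeans()`. The sketch below uses synthetic data and a hypothetical `is_collision` column coded 1/2; the seed and `nstart` value are our choices, not settings taken from the original analysis.

```r
set.seed(2023)

# Synthetic stand-in for the prepared data: 20 low-claim, non-collision rows
# and 80 high-claim, collision rows (values are made up)
toy <- data.frame(
  total_claim_amount = c(rnorm(20, 5000, 1000), rnorm(80, 60000, 10000)),
  is_collision       = c(rep(1, 20), rep(2, 80))
)

# Standardize, then fit k-means with 2 clusters and several random starts
toy_scaled <- scale(toy)
model <- kmeans(toy_scaled, centers = 2, nstart = 25)

print(model$centers)  # average (scaled) value of each column per cluster
print(model$size)     # number of observations per cluster
```

Because the centers are on the standardized scale, a negative value means the cluster sits below the overall average of that column and a positive value means it sits above.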
By looking at the centers, we can see the average (scaled) value of each column within each cluster. From these values, we can derive some insights into the general attributes and behaviors of the customers within a cluster/segment.
## cust_age cust_region sum_assured_group ins_deductible annual_prem
## 1 -0.09879955 -0.003434501 -6.959541e-05 -0.011091532 -0.034923653
## 2 0.02139455 0.000743724 1.507054e-05 0.002401816 0.007562543
## insured_sex edu_lvl marital_status emg_services_notified
## 1 0.036363183 -0.13005635 0.00642342 0.5290164
## 2 -0.007874266 0.02816305 -0.00139096 -0.1145559
## num_vehicles_involved property_damage bodily_injuries witnesses
## 1 -0.8234529 -0.018515102 -0.012788450 0.016752800
## 2 0.1783146 0.004009353 0.002769275 -0.003627735
## police_report_avlbl total_claim_amount production_year
## 1 -0.005809011 -1.7680133 0.05843700
## 2 0.001257912 0.3828545 -0.01265424
## coverage_to_claim_duration incident_time isCollision
## 1 0.08442484 -0.5792041 -2.1478733
## 2 -0.01828178 0.1254238 0.4651112
Some general observations we can take from the summary above
Most of the columns, such as cust_age, ins_deductible, and insured_sex, don’t help much in separating the customers (shown by the small difference between the cluster values), implying that there is no specific pattern between these features and car insurance claims
On the other hand, some columns, such as total_claim_amount, show a clear distinction between the clusters: cluster 1 contains customers with a low total claim amount and cluster 2 contains customers with a high total claim amount
4.5 Cluster Identification
A further step in customer segmentation is to analyze the characteristics of each cluster and describe them in simple terms. This can be done by examining the cluster centers above.
Here are some interpretations we can make for each cluster, based on the columns that show a clear separation between the two clusters
## Subject Cluster.1 Cluster.2
## 1 Collision No Yes
## 2 Claim Amount Low High
## 3 Vehicle Involvements Single-Vehicle Multi-Vehicle
## 4 Emergency Service Notified Yes No
In summary,
Cluster 1 contains customers who were involved in non-collision incidents, such as theft. These customers have lower claim amounts because less damage was inflicted.
Cluster 2 contains customers who were involved in collision incidents, either single-vehicle or multi-vehicle. These customers made higher claims because of the more severe damage inflicted, including property damage and injuries.
4.6 Cluster Visualization
Finally, we can look at how the clusters actually appear in the data set. This shows more clearly how well the model separates the segments, whether the clusters are far enough apart to be distinguishable, and how the segments compare in size.
Since our data set has many features, we cannot show all of them in a single plot. Two methods are commonly used to visualize clusters with many features. The first is principal component analysis (PCA), plotting the two dimensions that explain the most variance. The second is plotting the columns that best distinguish the clusters (in this case, total_claim_amount)
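The PCA approach can be sketched in base R with `prcomp()`. The data below is synthetic, the `is_collision` column is a hypothetical stand-in for the collision flag, and the base `plot()` call replaces whatever plotting function produced the original figure.

```r
set.seed(7)

# Synthetic stand-in for the prepared data (made-up values, 3 features)
toy <- cbind(
  total_claim_amount = c(rnorm(20, 5000, 1000), rnorm(80, 60000, 10000)),
  is_collision       = c(rep(1, 20), rep(2, 80)),
  cust_age           = rnorm(100, 39, 9)
)

toy_scaled <- scale(toy)
model <- kmeans(toy_scaled, centers = 2, nstart = 25)

# Project the scaled data onto its principal components and keep the first two
pca    <- prcomp(toy_scaled)
scores <- pca$x[, 1:2]

# Scatter plot of the two components, colored by cluster membership
plot(scores, col = model$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Clusters on the first two principal components")

# Share of variance explained by each component
print(summary(pca)$importance["Proportion of Variance", ])
```

The proportion of variance printed at the end indicates how much of the data the two plotted dimensions actually capture, which is worth checking before reading too much into the picture.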
Principal Component Analysis
Based on the PCA visualization, the model separates the data cleanly into the two segments.
Variable Selection
Using the column with the highest influence on the clusters (total_claim_amount), we can also see a clean separation between the two clusters. This shows that the main difference between the clusters is the total claim amount, which in turn is influenced by the type of claim (collision or not).
5 Conclusion
In conclusion, we have successfully built a customer segmentation model using the K-Means clustering method, resulting in a total of 2 clusters. The main separator between the clusters is the total claim amount, which is related to other variables, in particular whether the incident was a collision or not.
The first cluster contains customers with non-collision incidents, which involve fewer vehicles and less damage, and therefore result in a lower total claim amount.
The second cluster, on the other hand, contains customers with collision incidents, which involve more vehicles and more damage, and therefore result in a higher total claim amount.
This segmentation can help the business prepare for future insurance claims, since collision-related incidents tend to incur a higher claim cost than non-collision incidents.