Introduction

Customer segmentation involves dividing customers into groups based on shared characteristics. In the insurance industry, segmentation helps to:

- Tailor products to customer needs.
- Improve risk assessment.
- Enhance customer retention through targeted marketing.

In this workshop, we will:

1. Load and explore a dataset.
2. Perform segmentation using K-Means clustering.
3. Interpret and visualize the results.

Dataset

We will use a simulated dataset with the following features:

- CustomerID: Unique identifier for each customer.
- Age: Age of the customer.
- Income: Annual income of the customer (in USD).
- PolicyType: Type of insurance policy (e.g., Auto, Home, Life).
- ClaimFrequency: Number of claims made in the last year.
- RiskProfile: A score indicating the customer’s risk level.

In the code below, RiskProfile is drawn at random. Alternatively, it can be derived from claim frequency and income with this formula:

data$RiskProfile <- (data$ClaimFrequency / max(data$ClaimFrequency)) * 0.6 + (1 - (data$Income / max(data$Income))) * 0.4
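As a minimal sketch of the formula above, the following computes the score on a small hypothetical data frame. The name `toy` and its three rows are made up for illustration and are not part of the workshop dataset. The formula puts a weight of 0.6 on relative claim frequency and 0.4 on inverse relative income.

```r
# Hypothetical three-customer data frame, used only to illustrate the formula
toy <- data.frame(
  ClaimFrequency = c(0, 5, 10),
  Income = c(30000, 60000, 120000)
)

# 60% weight on claim frequency relative to the maximum,
# 40% weight on how far income falls below the maximum
toy$RiskProfile <- (toy$ClaimFrequency / max(toy$ClaimFrequency)) * 0.6 +
  (1 - (toy$Income / max(toy$Income))) * 0.4

print(toy)
```

For these three rows the scores work out to 0.3, 0.5, and 0.6: the customer with the most claims and the highest income still ends up with the highest score, because claim frequency carries the larger weight.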

Let’s create the dataset and save it to a CSV file.

set.seed(123)

data <- data.frame(
  CustomerID = 1:200,
  Age = sample(18:70, 200, replace = TRUE),
  Income = sample(20000:120000, 200, replace = TRUE),
  PolicyType = sample(c("Auto", "Home", "Life"), 200, replace = TRUE),
  ClaimFrequency = sample(0:10, 200, replace = TRUE),
  RiskProfile = runif(200, 0, 1)
)

# Save the dataset for reference
write.csv(data, "insurance_customers.csv", row.names = FALSE)

head(data)
##   CustomerID Age Income PolicyType ClaimFrequency RiskProfile
## 1          1  48  48077       Auto              0  0.35710642
## 2          2  32  89368       Home              0  0.97536501
## 3          3  68 111563       Home              1  0.38437852
## 4          4  31  63405       Auto              1  0.48867354
## 5          5  20  61487       Home              3  0.49077623
## 6          6  59  67484       Auto             10  0.01225462

Exploratory Data Analysis

Let’s summarize and visualize the dataset.

library(ggplot2)

# Summary statistics
summary(data)
##    CustomerID          Age            Income        PolicyType       
##  Min.   :  1.00   Min.   :18.00   Min.   : 20030   Length:200        
##  1st Qu.: 50.75   1st Qu.:32.00   1st Qu.: 45821   Class :character  
##  Median :100.50   Median :44.00   Median : 73394   Mode  :character  
##  Mean   :100.50   Mean   :45.09   Mean   : 70877                     
##  3rd Qu.:150.25   3rd Qu.:58.25   3rd Qu.: 96259                     
##  Max.   :200.00   Max.   :70.00   Max.   :119700                     
##  ClaimFrequency    RiskProfile      
##  Min.   : 0.000   Min.   :0.004842  
##  1st Qu.: 2.000   1st Qu.:0.244109  
##  Median : 5.000   Median :0.488741  
##  Mean   : 4.845   Mean   :0.505674  
##  3rd Qu.: 7.250   3rd Qu.:0.747032  
##  Max.   :10.000   Max.   :0.999524

# Income distribution
ggplot(data, aes(x = Income)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Income Distribution", x = "Income (USD)", y = "Count")

# ClaimFrequency vs Age
ggplot(data, aes(x = Age, y = ClaimFrequency)) +
  geom_point(color = "darkorange", alpha = 0.6) +
  theme_minimal() +
  labs(title = "Claims vs Age", x = "Age", y = "Claim Frequency")

K-Means clustering

We will segment customers based on Age, Income, ClaimFrequency, and RiskProfile.

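K-Means clustering is an unsupervised machine learning algorithm used to partition a dataset into a predefined number of groups (clusters). It aims to group data points so that those in the same cluster are more similar to each other than to those in other clusters. Similarity is measured by distance to each cluster's center, so features on large scales would dominate the result. This is why we standardize the features first: otherwise Income, measured in tens of thousands, would outweigh Age and ClaimFrequency.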
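To make the idea concrete, here is a minimal sketch of the assign-and-update loop commonly used to fit K-Means (often called Lloyd's algorithm). It runs on six made-up points and is an illustration only: it is not the routine that R's `kmeans()` uses by default (that is the Hartigan-Wong algorithm), and the names `x`, `centers`, and `cluster` are chosen here for the example.

```r
# Six illustrative 2-D points forming two obvious groups
x <- rbind(c(1, 1), c(1.5, 2), c(2, 1),
           c(8, 8), c(8.5, 9), c(9, 8))

k <- 2
# Start from the first and last points as initial centers
centers <- x[c(1, nrow(x)), , drop = FALSE]

for (iter in 1:10) {
  # Assignment step: squared distance from every point to every center
  d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  cluster <- max.col(-d, ties.method = "first")  # index of the nearest center

  # Update step: move each center to the mean of its assigned points
  new_centers <- t(sapply(seq_len(k), function(j) {
    colMeans(x[cluster == j, , drop = FALSE])
  }))

  # Stop once the centers no longer move
  if (all(abs(new_centers - centers) < 1e-8)) break
  centers <- new_centers
}

cluster
centers
```

On this toy data the loop settles after two passes, with the first three points in one cluster and the last three in the other. This sketch does not handle empty clusters or multiple random starts, which is part of what `nstart` in `kmeans()` addresses.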

# Select relevant features
features <- data[, c("Age", "Income", "ClaimFrequency", "RiskProfile")]

# Standardize the data
features_scaled <- scale(features)

# Perform K-Means clustering
set.seed(123)
kmeans_result <- kmeans(features_scaled, centers = 3, nstart = 25)

data$Cluster <- as.factor(kmeans_result$cluster)

# Cluster centers
kmeans_result$centers
##           Age     Income ClaimFrequency RiskProfile
## 1 -0.35926774  0.7855373     0.21330437  -0.8939650
## 2 -0.02521289  0.3292201     0.05893589   1.0075631
## 3  0.37006035 -1.1039920    -0.26700944  -0.2305634
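The centers above are in standardized units, which makes them hard to read as ages or incomes. As a sketch, they can be converted back to the original units using the means and standard deviations that `scale()` stores as attributes. The block recreates the workshop objects so that it runs on its own; `col_means`, `col_sds`, and `centers_original` are names introduced here for illustration.

```r
# Recreate the workshop objects so this block runs on its own
set.seed(123)
data <- data.frame(
  CustomerID = 1:200,
  Age = sample(18:70, 200, replace = TRUE),
  Income = sample(20000:120000, 200, replace = TRUE),
  PolicyType = sample(c("Auto", "Home", "Life"), 200, replace = TRUE),
  ClaimFrequency = sample(0:10, 200, replace = TRUE),
  RiskProfile = runif(200, 0, 1)
)
features <- data[, c("Age", "Income", "ClaimFrequency", "RiskProfile")]
features_scaled <- scale(features)
set.seed(123)
kmeans_result <- kmeans(features_scaled, centers = 3, nstart = 25)

# scale() keeps the column means and standard deviations as attributes
col_means <- attr(features_scaled, "scaled:center")
col_sds <- attr(features_scaled, "scaled:scale")

# Undo the standardization: original = scaled * sd + mean
centers_original <- sweep(kmeans_result$centers, 2, col_sds, "*")
centers_original <- sweep(centers_original, 2, col_means, "+")
round(centers_original, 2)

# Cross-check: per-cluster means of the unscaled features
aggregate(features, by = list(Cluster = kmeans_result$cluster), FUN = mean)
```

Because standardization is a linear transformation, the back-transformed centers should agree with the per-cluster means computed directly from the raw features.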

Visualization of clusters

# Cluster visualization (Age vs Income)
ggplot(data, aes(x = Age, y = Income, color = Cluster)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Customer Segments", x = "Age", y = "Income")

# Count of customers in each cluster
table(data$Cluster)
## 
##  1  2  3 
## 63 71 66
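The cluster counts above can be broken down further, for instance by PolicyType, which was not used as a clustering feature. The sketch below rebuilds the workshop objects so that it runs on its own, then cross-tabulates cluster membership against policy type. The name `policy_by_cluster` is introduced here for illustration.

```r
# Recreate the workshop objects so this block runs on its own
set.seed(123)
data <- data.frame(
  CustomerID = 1:200,
  Age = sample(18:70, 200, replace = TRUE),
  Income = sample(20000:120000, 200, replace = TRUE),
  PolicyType = sample(c("Auto", "Home", "Life"), 200, replace = TRUE),
  ClaimFrequency = sample(0:10, 200, replace = TRUE),
  RiskProfile = runif(200, 0, 1)
)
features_scaled <- scale(data[, c("Age", "Income", "ClaimFrequency", "RiskProfile")])
set.seed(123)
data$Cluster <- as.factor(kmeans(features_scaled, centers = 3, nstart = 25)$cluster)

# Number of customers per cluster and policy type
policy_by_cluster <- table(Cluster = data$Cluster, PolicyType = data$PolicyType)
policy_by_cluster

# Row-wise shares: the policy mix inside each cluster
round(prop.table(policy_by_cluster, margin = 1), 2)

# Mean claim frequency for each cluster and policy type combination
aggregate(ClaimFrequency ~ Cluster + PolicyType, data = data, FUN = mean)
```

Since PolicyType is sampled independently of the other features in this simulated dataset, large differences in policy mix between clusters are not expected here; on real data the same table may show which products dominate each segment.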

Insights and Recommendations

The descriptions below are read from the standardized cluster centers shown earlier, where 0 is the overall average for each feature.

- Cluster 1: Younger customers with high income and low risk scores; their claim frequency is slightly above average. These customers are likely to be financially stable. Recommendation: Offer premium policies with added benefits such as loyalty rewards or discounts on bundled policies.
- Cluster 2: Customers of about average age with moderately above-average income, average claim frequency, and the highest risk scores. Because this segment carries the highest risk scores, it should be monitored closely. Recommendation: Provide standard insurance plans with flexible payment options. Upsell additional coverage by highlighting value-for-money benefits.
- Cluster 3: Older customers with low income and slightly below-average claim frequency and risk scores. Their lower income suggests they may be cost-conscious. Recommendation: Introduce cost-effective policies. Offer risk management programs such as safety consultations or discounts for claim-free periods.

General Insights:

- Customers with high claim frequency often require detailed risk assessment and potential policy adjustments to maintain profitability.
- Segmenting by PolicyType can further refine targeting strategies. For example, Auto insurance customers with frequent claims might benefit from telematics-based monitoring, and Life insurance customers in younger age groups can be targeted for long-term plans.
- Income levels are critical for identifying opportunities for upselling or cross-selling.

Actionable Steps:

- Implement automated alerts for high-risk clusters to prevent potential losses.
- Use the segmentation insights to design marketing campaigns tailored to each cluster.
- Continuously monitor cluster characteristics to adapt to changing customer profiles and market trends.

Customer segmentation enables data-driven decision-making in insurance and helps tailor offerings to customer needs.