Customer segmentation involves dividing customers into groups based on shared characteristics. In the insurance industry, segmentation helps to:
- Tailor products to customer needs.
- Improve risk assessment.
- Enhance customer retention through targeted marketing.

In this workshop, we will:

1. Load and explore a dataset.
2. Perform segmentation using K-Means clustering.
3. Interpret and visualize the results.
We will use a simulated dataset with the following features:

- CustomerID: Unique identifier for each customer.
- Age: Age of the customer.
- Income: Annual income of the customer (in USD).
- PolicyType: Type of insurance policy (e.g., Auto, Home, Life).
- ClaimFrequency: Number of claims made in the last year.
- RiskProfile: A score indicating the customer’s risk level.

Instead of a random score, RiskProfile can also be computed from ClaimFrequency and Income, weighting relative claim frequency at 0.6 and inverse relative income at 0.4:

data$RiskProfile <- (data$ClaimFrequency / max(data$ClaimFrequency)) * 0.6 + (1 - (data$Income / max(data$Income))) * 0.4
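To make the RiskProfile formula concrete, here is a minimal sketch that applies it to three hypothetical customers. The `toy` data frame and its values are assumptions made for illustration and are not part of the workshop dataset. The sketch also assumes at least one customer has a nonzero claim count, since the formula divides by the maximum claim frequency.

```r
# Toy data frame with hypothetical values (not the workshop dataset).
toy <- data.frame(
  ClaimFrequency = c(0, 5, 10),
  Income = c(120000, 60000, 30000)
)

# Weighted score: 60% relative claim frequency, 40% inverse relative income.
# Assumes max(ClaimFrequency) > 0; otherwise the division yields NaN.
toy$RiskProfile <- (toy$ClaimFrequency / max(toy$ClaimFrequency)) * 0.6 +
  (1 - (toy$Income / max(toy$Income))) * 0.4

print(toy)
```

The customer with no claims and the highest income scores 0, while the customer with the most claims and the lowest income scores 0.9, so higher values correspond to higher risk.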
Let’s create the dataset and save a copy as a CSV file.
set.seed(123)
data <- data.frame(
  CustomerID = 1:200,
  Age = sample(18:70, 200, replace = TRUE),
  Income = sample(20000:120000, 200, replace = TRUE),
  PolicyType = sample(c("Auto", "Home", "Life"), 200, replace = TRUE),
  ClaimFrequency = sample(0:10, 200, replace = TRUE),
  RiskProfile = runif(200, 0, 1)
)
# Save the dataset for reference
write.csv(data, "insurance_customers.csv", row.names = FALSE)
head(data)
## CustomerID Age Income PolicyType ClaimFrequency RiskProfile
## 1 1 48 48077 Auto 0 0.35710642
## 2 2 32 89368 Home 0 0.97536501
## 3 3 68 111563 Home 1 0.38437852
## 4 4 31 63405 Auto 1 0.48867354
## 5 5 20 61487 Home 3 0.49077623
## 6 6 59 67484 Auto 10 0.01225462
Let’s summarize and visualize the dataset.
library(ggplot2)
# Summary statistics
summary(data)
## CustomerID Age Income PolicyType
## Min. : 1.00 Min. :18.00 Min. : 20030 Length:200
## 1st Qu.: 50.75 1st Qu.:32.00 1st Qu.: 45821 Class :character
## Median :100.50 Median :44.00 Median : 73394 Mode :character
## Mean :100.50 Mean :45.09 Mean : 70877
## 3rd Qu.:150.25 3rd Qu.:58.25 3rd Qu.: 96259
## Max. :200.00 Max. :70.00 Max. :119700
## ClaimFrequency RiskProfile
## Min. : 0.000 Min. :0.004842
## 1st Qu.: 2.000 1st Qu.:0.244109
## Median : 5.000 Median :0.488741
## Mean : 4.845 Mean :0.505674
## 3rd Qu.: 7.250 3rd Qu.:0.747032
## Max. :10.000 Max. :0.999524
# Income distribution
ggplot(data, aes(x = Income)) +
  geom_histogram(bins = 20, fill = "steelblue", color = "black") +
  theme_minimal() +
  labs(title = "Income Distribution", x = "Income (USD)", y = "Count")
# ClaimFrequency vs Age
ggplot(data, aes(x = Age, y = ClaimFrequency)) +
  geom_point(color = "darkorange", alpha = 0.6) +
  theme_minimal() +
  labs(title = "Claims vs Age", x = "Age", y = "Claim Frequency")
# K-Means clustering
We will segment customers based on Age, Income, ClaimFrequency, and RiskProfile.
K-Means clustering is an unsupervised machine learning algorithm used to partition a dataset into a predefined number of groups (clusters). It aims to group data points so that those in the same cluster are more similar to each other than to those in other clusters.
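As a small illustration of this idea, the sketch below runs `kmeans()` on six hypothetical customers that form two clearly separated groups. The `toy` data and the choice of two clusters are assumptions made for this example only.

```r
# Six hypothetical customers forming two well-separated groups.
toy <- data.frame(
  Age    = c(22, 25, 24, 61, 64, 66),
  Income = c(30000, 32000, 31000, 90000, 95000, 92000)
)

# Standardize so Age and Income contribute on the same scale.
toy_scaled <- scale(toy)

# The number of clusters (centers) must be chosen in advance.
set.seed(1)
fit <- kmeans(toy_scaled, centers = 2, nstart = 10)

# Cluster labels (1 or 2) are arbitrary; only the grouping matters.
fit$cluster
```

The first three customers should share one label and the last three the other. Which group receives label 1 depends on the random starting points.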
# Select relevant features
features <- data[, c("Age", "Income", "ClaimFrequency", "RiskProfile")]
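# Standardize the features (mean 0, standard deviation 1) so that Income,
# which has the largest numeric range, does not dominate the distance calculation
features_scaled <- scale(features)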
# Perform K-Means clustering
set.seed(123)
kmeans_result <- kmeans(features_scaled, centers = 3, nstart = 25)
data$Cluster <- as.factor(kmeans_result$cluster)
# Cluster centers
kmeans_result$centers
## Age Income ClaimFrequency RiskProfile
## 1 -0.35926774 0.7855373 0.21330437 -0.8939650
## 2 -0.02521289 0.3292201 0.05893589 1.0075631
## 3 0.37006035 -1.1039920 -0.26700944 -0.2305634
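The cluster centers are reported in standardized units, which can be hard to read as ages or dollar amounts. One possible way to convert them back is to reverse the scaling using the attributes that `scale()` stores on its result. The sketch below uses a small simulated feature set with hypothetical values (two columns only) so that it runs on its own.

```r
# Simulated features (hypothetical values, same column names as the workshop data).
set.seed(42)
features <- data.frame(
  Age = sample(18:70, 100, replace = TRUE),
  Income = sample(20000:120000, 100, replace = TRUE)
)

features_scaled <- scale(features)
fit <- kmeans(features_scaled, centers = 3, nstart = 25)

# scale() stores the column means and standard deviations as attributes.
col_means <- attr(features_scaled, "scaled:center")
col_sds   <- attr(features_scaled, "scaled:scale")

# Undo the standardization column by column: original = z * sd + mean.
centers_original <- sweep(fit$centers, 2, col_sds, "*")
centers_original <- sweep(centers_original, 2, col_means, "+")

round(centers_original)
```

The resulting rows are the average Age and Income of each cluster in their original units, which is the same as averaging the unscaled features by cluster.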
# Cluster visualization (Age vs Income)
ggplot(data, aes(x = Age, y = Income, color = Cluster)) +
  geom_point(alpha = 0.7) +
  theme_minimal() +
  labs(title = "Customer Segments", x = "Age", y = "Income")
# Count of customers in each cluster
table(data$Cluster)
##
## 1 2 3
## 63 71 66
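The workshop fixes the number of clusters at three. A common heuristic for checking that choice, not used in the workshop itself, is the elbow method: run K-Means for several values of k and compare the total within-cluster sum of squares. The sketch below is a minimal version on simulated data with hypothetical values; the range 1 to 6 is an arbitrary assumption.

```r
# Simulated features (hypothetical values, not the workshop dataset).
set.seed(42)
features <- data.frame(
  Age = sample(18:70, 100, replace = TRUE),
  Income = sample(20000:120000, 100, replace = TRUE),
  ClaimFrequency = sample(0:10, 100, replace = TRUE)
)
features_scaled <- scale(features)

# Fit K-Means for each candidate k and record the total within-cluster
# sum of squares reported by kmeans().
k_values <- 1:6
wss <- sapply(k_values, function(k) {
  kmeans(features_scaled, centers = k, nstart = 25)$tot.withinss
})

data.frame(k = k_values, wss = round(wss, 1))
```

If the values drop sharply up to some k and flatten afterwards, that k is a reasonable candidate. On purely random data such as this simulation there may be no clear elbow.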
The cluster centers are in standardized units, so positive values are above the dataset average and negative values are below it.

Cluster 1: Younger-than-average customers with high income and the lowest risk scores; their claim frequency is slightly above average. These customers are likely to be financially stable and low-risk. Recommendation: Offer premium policies with added benefits such as loyalty rewards or discounts on bundled policies.

Cluster 2: Customers of about average age with moderately above-average income, average claim frequency, and the highest risk scores. Recommendation: Provide standard insurance plans with flexible payment options. Upsell additional coverage by highlighting value-for-money benefits, subject to a detailed risk assessment.

Cluster 3: Older customers with low income and below-average claim frequency and risk scores. Their low income suggests price sensitivity rather than elevated claims risk. Recommendation: Introduce cost-effective policies and keep monitoring claims. Offer risk management programs such as safety consultations or discounts on claim-free periods.

General Insights:

- Customers with high claim frequency often require detailed risk assessment and potential policy adjustments to maintain profitability.
- Segmenting by PolicyType can further refine targeting strategies. For example, Auto insurance customers with frequent claims might benefit from telematics-based monitoring, and Life insurance customers in younger age groups can be targeted for long-term plans.
- Income levels are critical for identifying opportunities for upselling or cross-selling.

Actionable Steps:

- Implement automated alerts for high-risk clusters to prevent potential losses.
- Use the segmentation insights to design marketing campaigns tailored to each cluster.
- Continuously monitor cluster characteristics to adapt to changing customer profiles and market trends.

Customer segmentation enables data-driven decision-making in insurance and helps tailor offerings to customer needs.
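The suggestion to segment further by PolicyType can be explored with a simple cross-tabulation of clusters against policy types. The sketch below is one possible approach on simulated data; the `customers` data frame, its values, and the two clustering features are assumptions made so the example runs on its own.

```r
# Simulated customers (hypothetical values, not the workshop dataset).
set.seed(42)
customers <- data.frame(
  Age = sample(18:70, 100, replace = TRUE),
  Income = sample(20000:120000, 100, replace = TRUE),
  PolicyType = sample(c("Auto", "Home", "Life"), 100, replace = TRUE)
)

# Cluster on the standardized numeric features only.
fit <- kmeans(scale(customers[, c("Age", "Income")]), centers = 3, nstart = 25)
customers$Cluster <- as.factor(fit$cluster)

# Counts of each policy type within each cluster.
policy_by_cluster <- table(Cluster = customers$Cluster,
                           PolicyType = customers$PolicyType)
policy_by_cluster

# Row-wise proportions: share of each policy type inside a cluster.
policy_share <- prop.table(policy_by_cluster, margin = 1)
round(policy_share, 2)
```

A cluster whose policy mix differs clearly from the others would be a candidate for policy-specific campaigns. Because PolicyType here is assigned at random, large differences are not expected in this simulation.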